Evaluating Generative AI as an HMI between domain expert and DSL code
AI & Data
Semester programme: Master of Applied IT
Research group: High Tech Embedded Software
Project group members: Panagiotis Kalogeropoulos
Project description
Generative AI allows domain experts (machinists, process engineers, industrial engineers, etc.) to translate the behavior they envision from natural language into the structured syntax of a Domain-Specific Language (DSL). But how is the generated DSL code evaluated, and how can we establish that it can be relied upon? Because DSLs are frequently deployed in high-risk environments, it is imperative to verify that changes made by LLMs respect context-specific limitations (temperature cutoffs, harmonic frequencies of motors, etc.), to reduce the risk of damage to equipment and danger to human life.
Context
The research is situated at the forefront of LLMs, formal methods, and the evaluation of Generative AI outputs. It centers on using LLMs to generate MermaidJS Sequence Diagram DSL code, through which the LLM under test designs the software architecture of features that have been requested in the GitHub repository of the OpenRemote open-source IoT platform. The LLM “domain expert judges” assess the database, backend, frontend, legal, compliance, and cybersecurity domains. The LLMs tested were the current OpenAI lineup (gpt-5.2, gpt-5-mini, gpt-5-nano), and the judge LLM was Claude Sonnet 4.5.
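To give a sense of the DSL involved, the sketch below holds a small Mermaid sequence diagram in a Python string and runs one simple structural check on it: every participant that sends or receives a message should have been declared. The diagram content, the regular expressions, and the function name `undeclared_participants` are illustrative assumptions, not part of the project's framework, and the check covers only a small subset of the Mermaid syntax.

```python
import re

# Hypothetical LLM output: a small Mermaid sequence diagram (illustrative only).
DIAGRAM = """\
sequenceDiagram
    participant UI as Frontend
    participant API as Backend
    participant DB as Database
    UI->>API: POST /asset
    API->>DB: INSERT asset
    DB-->>API: ok
    API-->>UI: 201 Created
"""

# Matches "participant X" or "actor X" declarations (aliases after "as" are ignored).
PARTICIPANT = re.compile(r"^\s*(?:participant|actor)\s+(\w+)")
# Matches messages using the arrows ->, -->, ->> and -->> only.
MESSAGE = re.compile(r"^\s*(\w+)\s*(-?->>?)\s*(\w+)\s*:\s*(.+)$")


def undeclared_participants(diagram: str) -> set[str]:
    """Return the names used in messages that were never declared."""
    declared: set[str] = set()
    used: set[str] = set()
    for line in diagram.splitlines():
        if m := PARTICIPANT.match(line):
            declared.add(m.group(1))
        elif m := MESSAGE.match(line):
            used.update((m.group(1), m.group(3)))
    return used - declared


print(undeclared_participants(DIAGRAM))
```

A syntactic check like this says nothing about whether the design is sound; that judgment is what the domain expert judges are for.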
Results
We introduce a framework for evaluating AI-generated DSL code with a panel of LLM “domain expert” judges, each focused on its own domain. The framework provides a monetary cost projection of the damage or profit relative to a baseline implementation, which allows the real domain expert to estimate the potential gains or losses of adopting the LLM-aided changes. The dataset was produced through Synthetic Data Generation, by mining the GitHub issues of the OpenRemote repository. The results show that the judges respond more positively to models with larger parameter counts, which is consistent with the hypothesis.
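The idea of combining per-domain judge verdicts into one cost projection can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the `Verdict` type, the `project_cost` function, the summation rule, and all figures are hypothetical and do not represent results from the study.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One judge's assessment (hypothetical structure)."""
    domain: str        # e.g. "database", "cybersecurity"
    cost_delta: float  # projected cost vs. baseline; negative means a saving


def project_cost(verdicts: list[Verdict]) -> tuple[float, str]:
    """Sum the per-domain deltas and name the costliest domain.

    Summation is an assumption made for this sketch; a real panel
    might weight domains differently.
    """
    total = sum(v.cost_delta for v in verdicts)
    worst = max(verdicts, key=lambda v: v.cost_delta)
    return total, worst.domain


# Made-up figures for illustration only.
panel = [
    Verdict("database", -300.0),
    Verdict("backend", 0.0),
    Verdict("cybersecurity", 1200.0),
]

total, worst_domain = project_cost(panel)
print(f"Projected delta vs. baseline: {total:+.2f} (largest cost: {worst_domain})")
```

A positive total would suggest the LLM-aided change is projected to cost more than the baseline, and the costliest domain tells the human expert where to look first.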
About the project group
This research was conducted under the HTES lectorate, with the help of Herman Jurjus, professor at Fontys ICT, who originally suggested the idea in the context of the first semester of the Master of Applied IT degree.