LLM-as-a-Judge: Agentic Flow Evaluation

Transformative Technology:

AI & Data

Semester programme:

Enhanced AI Techniques

Partner

Project group members:

Spasiana Karadzhova
Pavela Karadzhova
Preslava Dimkova
Matthew Marinchev
Mihail Kenarov

Project description

The BDO Enhanced AI project focuses on designing and evaluating an agentic AI workflow for financial question answering. The main design challenge is to build a RAG-based system that retrieves relevant company data and generates accurate answers, while using an LLM-as-a-Judge to assess the quality of those answers.

The central research question is: How can an LLM-based evaluator reliably assess the correctness, completeness, groundedness, and numerical accuracy of answers produced by an agentic financial QA system? The project explores how automated evaluation can support human review through clear rubrics, citation-based grounding, and different judge prompt versions.

Context

The project is situated in the domain of Finance and AI workflow evaluation, where accuracy, transparency, and evidence-based responses are especially important. In financial contexts, users often ask questions about company performance, filings, risks, revenue, costs, or numerical results, and incorrect or unsupported answers can lead to misleading conclusions.

The BDO Enhanced AI project addresses this by working with a financial Q&A dataset containing company-related questions, answers, and supporting context from financial documents.

The system uses a Retrieval-Augmented Generation approach, meaning it first retrieves relevant financial information and then generates an answer based on that context. Because generative AI systems can still produce incomplete, incorrect, or weakly grounded answers, the project also includes an LLM-as-a-Judge component that evaluates the generated responses. This places the project within the broader context of trustworthy AI, explainable AI, and agentic workflows, where the goal is not only to generate useful outputs but also to assess whether those outputs are reliable.
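The retrieve, generate, and judge flow described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the names `run_pipeline` and `JudgedAnswer`, the three callables, and the stub data are all hypothetical, and real retrieval and model calls would replace the stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JudgedAnswer:
    """One question with its retrieved context, generated answer, and judge scores."""
    question: str
    context: List[str]
    answer: str
    scores: Dict[str, int]


def run_pipeline(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    judge: Callable[[str, List[str], str], Dict[str, int]],
) -> JudgedAnswer:
    """Retrieve context, generate an answer from it, then score that answer."""
    context = retrieve(question)
    answer = generate(question, context)
    # The judge sees the same context as the generator, so it can check grounding.
    scores = judge(question, context, answer)
    return JudgedAnswer(question, context, answer, scores)


# Stand-in callables (assumptions for illustration; no real model is called).
result = run_pipeline(
    "What was revenue in 2023?",
    retrieve=lambda q: ["Revenue was 10 million in 2023."],
    generate=lambda q, ctx: "Revenue was 10 million.",
    judge=lambda q, ctx, ans: {
        "groundedness": 1,
        "completeness": 1,
        "correctness": 1,
        "numerical_accuracy": 1,
    },
)
```

Passing the three steps in as callables keeps the judge separate from the answering agent, which matches the idea of evaluating answers with an independent component.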

The project therefore combines practical AI application development with research into evaluation methods, focusing on how automated judging can support human review in a high-stakes information domain.

Results

The most important outcomes of this project are both tangible products and validated insights about how an LLM-as-a-Judge can be improved in a RAG setting.

The first major product is the working prototype itself: a RAG-based application that answers financial-report questions using retrieved context from the dataset and then evaluates those answers with a separate judge model. This is valuable because it turns an abstract research idea into a usable system. The prototype includes a local SQLite/vector retrieval layer, an answering agent, configurable judge prompts, logging of judge outputs, and a small evaluation workflow. In practical terms, this means the project moved beyond theory and produced something that can actually be demonstrated, tested, and extended.
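As an illustration of what a local SQLite/vector retrieval layer can look like, the sketch below stores text chunks with their embeddings in SQLite and ranks them by cosine similarity. It is an assumption-laden example: the table layout, the function names, the brute-force scan, and the hand-made three-dimensional embeddings are hypothetical and are not taken from the prototype.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_store(chunks):
    """Create an in-memory SQLite table of (text, embedding stored as JSON) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)"
    )
    conn.executemany(
        "INSERT INTO chunks (text, embedding) VALUES (?, ?)",
        [(text, json.dumps(vec)) for text, vec in chunks],
    )
    conn.commit()
    return conn


def retrieve(conn, query_vec, k=2):
    """Return the k chunk texts most similar to query_vec (brute-force scan)."""
    rows = conn.execute("SELECT text, embedding FROM chunks").fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy embeddings, hand-made for illustration only (not real model output).
store = build_store([
    ("Revenue grew in fiscal 2023.", [0.9, 0.1, 0.0]),
    ("The company faces currency risk.", [0.1, 0.9, 0.0]),
    ("Headcount was stable.", [0.0, 0.1, 0.9]),
])
top = retrieve(store, [1.0, 0.2, 0.0], k=1)
```

A full scan like this is only reasonable for small datasets; a larger corpus would normally call for an approximate nearest-neighbour index.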

The second major product is the validation framework that was built around the prototype. This includes the 10-question labeled test set, the four evaluation metrics (groundedness, completeness, correctness, and numerical accuracy), the comparison workbook, and the structured prompt iterations from V1 to V4. This is important because the value of an evaluation system depends on whether its performance can itself be evaluated. The project therefore did not only build a judge, but also created a way to measure whether that judge behaves as intended.
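To make the validation measures concrete, the sketch below compares judge scores against human labels across the four metrics. The exact definitions are assumptions made for this example: cell-level agreement is taken as the share of (question, metric) cells with identical scores, full-question accuracy as the share of questions where all four metrics match, and absolute score error as the summed absolute difference. Integer scores and the function name `compare` are also hypothetical.

```python
METRICS = ("groundedness", "completeness", "correctness", "numerical_accuracy")


def compare(labels, judged):
    """Compare judge scores with human labels, question by question."""
    if not labels or len(labels) != len(judged):
        raise ValueError("labels and judged must be non-empty and the same length")

    cells = matches = full_matches = abs_error = 0
    for gold, pred in zip(labels, judged):
        question_ok = True
        for metric in METRICS:
            cells += 1
            diff = abs(gold[metric] - pred[metric])
            abs_error += diff
            if diff == 0:
                matches += 1
            else:
                question_ok = False
        full_matches += question_ok  # True counts as 1

    return {
        "cell_agreement": matches / cells,
        "full_question_accuracy": full_matches / len(labels),
        "total_abs_error": abs_error,
    }


# Two made-up questions; the judge disagrees on one cell of the second question.
labels = [
    {"groundedness": 1, "completeness": 1, "correctness": 1, "numerical_accuracy": 1},
    {"groundedness": 0, "completeness": 1, "correctness": 0, "numerical_accuracy": 1},
]
judged = [
    {"groundedness": 1, "completeness": 1, "correctness": 1, "numerical_accuracy": 1},
    {"groundedness": 1, "completeness": 1, "correctness": 0, "numerical_accuracy": 1},
]
report = compare(labels, judged)
```

In this toy case seven of eight cells match, so cell-level agreement is 0.875, full-question accuracy is 0.5, and the total absolute error is 1.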

A third important outcome is the final refined judge prompt. The validation showed that prompt engineering had a measurable effect on performance. The baseline prompt (V1) already performed reasonably well, but it showed weaknesses, especially on unsupported subjective answers and on the handling of numerical information. Through structured iteration, V2 introduced a checklist, V3 added few-shot examples, and V4 refined the distinction between core claims and incidental extra claims. This process led to a clear improvement in results. In the final V4 test, the judge achieved 100% cell-level agreement, 100% full-question accuracy, and zero absolute score error on the 10-question labeled set. Even though the set is small, this is still a strong result because it shows that targeted prompt-engineering changes can materially improve judge reliability.
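The V1 to V4 iteration can be pictured as a prompt assembled from layered sections. The sketch below is illustrative only: every piece of prompt wording is placeholder text rather than the project's actual prompts, the function name `build_judge_prompt` is hypothetical, and it assumes that each version keeps the additions of the previous one.

```python
# Placeholder wording throughout; these are not the project's real prompts.
BASE_INSTRUCTIONS = (
    "You are evaluating an answer to a financial question. "
    "Score groundedness, completeness, correctness, and numerical accuracy "
    "using only the provided context."
)
CHECKLIST = (
    "Checklist:\n"
    "- Is every claim supported by the context?\n"
    "- Do all figures match the context exactly?"
)
FEW_SHOT = "Examples:\n[worked scoring examples would go here]"
CORE_VS_INCIDENTAL = (
    "Separate core claims that answer the question from incidental extra claims, "
    "and score the core claims first."
)


def build_judge_prompt(version, question, context, answer):
    """Assemble a judge prompt for version 1 to 4 (assumed to be cumulative)."""
    if version not in (1, 2, 3, 4):
        raise ValueError("version must be 1, 2, 3, or 4")
    parts = [BASE_INSTRUCTIONS]
    if version >= 2:
        parts.append(CHECKLIST)           # V2: checklist
    if version >= 3:
        parts.append(FEW_SHOT)            # V3: few-shot examples
    if version >= 4:
        parts.append(CORE_VS_INCIDENTAL)  # V4: core vs incidental claims
    parts.append(f"Question: {question}\nContext: {context}\nAnswer: {answer}")
    return "\n\n".join(parts)


v1 = build_judge_prompt(1, "What was revenue?", "Revenue was 10 million.", "10 million")
v4 = build_judge_prompt(4, "What was revenue?", "Revenue was 10 million.", "10 million")
```

Keeping the versions configurable in this way makes it straightforward to rerun the same labeled questions against each prompt version and compare the results.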

The project also produced several important insights. One key insight is that judge quality is highly sensitive to prompt design. A strong model alone is not enough; the wording and structure of the prompt significantly affect groundedness and numerical reasoning. Another key insight is that unsupported extra details in answers create ambiguity in scoring. This led to the useful refinement of separating core answer claims from incidental extra claims, which improved fairness and consistency. A third insight is that consistency must be tested explicitly, not assumed. Repeated-run testing showed that some difficult cases were initially unstable, while later prompt versions became much more reliable.

Based on this validation and the prototype's maturity, the project sits at around Technology Readiness Level (TRL) 5, moving toward TRL 6: it is an integrated prototype tested with real data and measurable validation results, but it still needs larger-scale testing, frozen-context judge-only evaluation, and broader BDO use cases before it can be considered robust in an operational environment.
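The repeated-run consistency testing mentioned among the insights can be sketched as follows. This is a hypothetical example rather than the project's test harness: it assumes the judge is called several times on the same question, context, and answer, and that each call returns one score per metric. The function name `stability` and the sample scores are made up.

```python
from collections import Counter


def stability(runs):
    """Summarise repeated judge runs on one item.

    runs: list of dicts mapping metric name to score, one dict per run.
    Returns, per metric, the most common score and the share of runs that gave it.
    """
    if not runs:
        raise ValueError("at least one run is required")
    report = {}
    for metric in runs[0]:
        counts = Counter(run[metric] for run in runs)
        mode_score, mode_count = counts.most_common(1)[0]
        report[metric] = {
            "mode": mode_score,
            "agreement": mode_count / len(runs),
        }
    return report


# Three made-up runs: groundedness flips once, correctness is stable.
runs = [
    {"groundedness": 1, "correctness": 1},
    {"groundedness": 0, "correctness": 1},
    {"groundedness": 1, "correctness": 1},
]
summary = stability(runs)
unstable = [m for m, r in summary.items() if r["agreement"] < 1.0]
```

Metrics with an agreement below 1.0 flag cases where the judge changes its score between runs, which is the kind of instability that later prompt versions are meant to reduce.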

About the project group

Our team members come from different backgrounds: software, infrastructure, and business. We worked on the project for the duration of the semester.