src/eval/

Provider-agnostic evaluation harness — golden QA cases driven by a caller-supplied answer_fn, three tolerance modes (exact / numeric / semantic), and a Protocol-based LLM-judge seam so no SDK leaks into this layer. Designed to be runnable on a fresh clone without LLM credentials (the example case uses exact_match); the nightly workflow opt-in is documented in docs/EVAL_HARNESS.md.
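The answer_fn seam can be pictured with a minimal, self-contained sketch. It does not import the real package: `Case`, `Result`, `run_cases`, and the dataset entry below are hypothetical stand-ins (the actual models are Pydantic classes, and the shipped eval/golden_qa.json may look different). It only shows that an echo-style answer_fn is enough to run an exact_match case without any LLM credentials.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """Hypothetical stand-in for models.EvalCase (required fields only)."""
    id: str
    question: str
    expected_answer: str
    tolerance: str = "exact_match"


@dataclass
class Result:
    """Hypothetical stand-in for models.EvalResult."""
    case_id: str
    actual: str
    latency_s: float
    passed: bool


# Illustrative dataset entry; the content of the real golden file is assumed.
GOLDEN_JSON = """
[{"id": "echo-1", "question": "ping",
  "expected_answer": "ping", "tolerance": "exact_match"}]
"""


def run_cases(cases: List[Case], answer_fn: Callable[[str], str]) -> List[Result]:
    """Call answer_fn once per case, time it, and compare exactly."""
    results = []
    for case in cases:
        start = time.perf_counter()
        actual = answer_fn(case.question)
        latency = time.perf_counter() - start
        passed = actual.strip() == case.expected_answer.strip()
        results.append(Result(case.id, actual, latency, passed))
    return results


def echo(question: str) -> str:
    """Identity answer_fn: returns the question unchanged."""
    return question


cases = [Case(**entry) for entry in json.loads(GOLDEN_JSON)]
results = run_cases(cases, echo)
print(results[0].passed)
```

Swapping `echo` for a function that calls an agent loop or an LLM client is the only change a real integration would need at this seam.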

Key interfaces

  • models.EvalCase — Pydantic shape of an entry in eval/golden_qa.json: id, question, expected_answer, tolerance, plus optional category / difficulty / expected_tools / notes.
  • models.EvalResult — per-case outcome: actual answer, latency, pass/fail, tolerance score, failure reason.
  • runner.load_golden_dataset(path=None) — pure JSON loader. Standalone so call sites can introspect the dataset without paying for an EvalRunner.
  • runner.EvalRunner(answer_fn, judge_client=None, judge_model="") — the orchestrator. answer_fn: Callable[[str], str] is the single seam — wire your agent loop, direct LLM client, or stub. judge_client implements the judge.LLMClient Protocol and is only consulted for semantic_similar cases.
    • evaluate(case) — run one case; returns EvalResult.
    • evaluate_all() — load the golden dataset, evaluate every case, return the list.
  • judge.LLMClient: a Protocol with one method, complete_json(*, model, prompt) -> str. Concrete adapters live in your agent code; the harness never imports OpenAI/Anthropic/Azure SDKs.
  • judge.evaluate_semantic_similarity(question, expected, actual, client, model) — calls the judge, returns (score, explanation). Returns (None, "no LLM client configured") when client is None (inconclusive — runner treats it as pass with an explanatory note rather than a hard fail).
  • report.generate_report(results) — markdown summary: overall accuracy, per-category, per-difficulty, failure analysis with reason text.
  • __main__: python -m src.eval runs the dataset with an identity answer_fn (echoes the question) and prints the markdown report.
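The judge seam described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `StubJudge`, `judge_similarity_sketch`, the prompt wording, and the `score` / `explanation` JSON keys are all hypothetical. Only the `complete_json(*, model, prompt) -> str` signature and the `(None, "no LLM client configured")` return value come from the interface list.

```python
import json
from typing import Optional, Protocol, Tuple


class LLMClient(Protocol):
    """Structural type: any object with this method qualifies as a judge client."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class StubJudge:
    """Hypothetical adapter. A real one would call a provider SDK here."""

    def complete_json(self, *, model: str, prompt: str) -> str:
        # Assumed response shape: a JSON object with "score" and "explanation".
        return json.dumps({"score": 0.9, "explanation": "same meaning"})


def judge_similarity_sketch(
    question: str,
    expected: str,
    actual: str,
    client: Optional[LLMClient],
    model: str,
) -> Tuple[Optional[float], str]:
    """Return (score, explanation), or (None, reason) when no judge is wired."""
    if client is None:
        return None, "no LLM client configured"
    # Prompt wording is an assumption made for this sketch.
    prompt = (
        "Rate from 0 to 1 how well the actual answer matches the expected one.\n"
        f"Question: {question}\nExpected: {expected}\nActual: {actual}\n"
        'Reply as JSON: {"score": <float>, "explanation": <string>}'
    )
    data = json.loads(client.complete_json(model=model, prompt=prompt))
    return float(data["score"]), str(data["explanation"])


print(judge_similarity_sketch("q", "a", "a", StubJudge(), "stub-model"))
print(judge_similarity_sketch("q", "a", "a", None, ""))
```

Because `LLMClient` is a Protocol, `StubJudge` never has to inherit from it, which is what keeps provider SDKs out of the harness layer.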

Conventions

  • coverage omits this package — the eval suite is exercised by eval/test_golden_qa.py, not by tests/. Counting it would inflate misses on every PR that touches behaviour without re-running the eval workflow.
  • No domain coupling. The harness ships one trivial echo case; real users replace eval/golden_qa.json with their domain dataset and wire answer_fn to their agent.
  • Tolerance picks happen per case, not per runner. A dataset can mix exact_match, numeric_close (within 1%), and semantic_similar (LLM judge ≥ 0.8) cases freely.
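Per-case tolerance dispatch might look like the sketch below. The function names are hypothetical, and a few details are assumptions the text above does not settle: whitespace is stripped before exact comparison, "within 1%" is read as relative to the expected value, and a missing judge score is treated as a pass with a note (matching the inconclusive behaviour described under Key interfaces).

```python
from typing import Optional, Tuple


def exact_match(expected: str, actual: str) -> bool:
    # Assumption: surrounding whitespace is ignored.
    return expected.strip() == actual.strip()


def numeric_close(expected: str, actual: str, rel_tol: float = 0.01) -> bool:
    # Assumption: "within 1%" means relative to the expected value.
    try:
        e, a = float(expected), float(actual)
    except ValueError:
        return False  # a non-numeric answer cannot be numerically close
    return abs(a - e) <= rel_tol * abs(e)


def check(
    tolerance: str,
    expected: str,
    actual: str,
    judge_score: Optional[float] = None,
) -> Tuple[bool, str]:
    """Pick the comparison from the case's own tolerance field."""
    if tolerance == "exact_match":
        return exact_match(expected, actual), ""
    if tolerance == "numeric_close":
        return numeric_close(expected, actual), ""
    if tolerance == "semantic_similar":
        if judge_score is None:
            return True, "inconclusive: no judge score"
        return judge_score >= 0.8, ""
    raise ValueError(f"unknown tolerance: {tolerance}")


# A mixed dataset: each row carries its own tolerance mode.
rows = [
    ("exact_match", "Paris", " Paris ", None),
    ("numeric_close", "100", "100.5", None),
    ("semantic_similar", "It rains a lot", "Rainfall is frequent", 0.85),
]
for tolerance, expected, actual, score in rows:
    print(tolerance, check(tolerance, expected, actual, score))
```

Keeping the mode on the case rather than on the runner means one run can cover lookup-style, numeric, and free-text questions without separate dataset files.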