`src/eval/`

Provider-agnostic evaluation harness: golden QA cases driven by a caller-supplied `answer_fn`, three tolerance modes (exact / numeric / semantic), and a Protocol-based LLM-judge seam so that no SDK leaks into this layer. It is designed to run on a fresh clone without LLM credentials (the example case uses `exact_match`); the nightly workflow opt-in is documented in `docs/EVAL_HARNESS.md`.

Key interfaces

- `models.EvalCase`: Pydantic shape of an entry in `eval/golden_qa.json`: `id`, `question`, `expected_answer`, `tolerance`, plus optional `category` / `difficulty` / `expected_tools` / `notes`.
- `models.EvalResult`: per-case outcome: actual answer, latency, pass/fail, tolerance score, failure reason.
- `runner.load_golden_dataset(path=None)`: pure JSON loader. Standalone so call sites can introspect the dataset without paying for an `EvalRunner`.
- `runner.EvalRunner(answer_fn, judge_client=None, judge_model="")`: the orchestrator. `answer_fn: Callable[[str], str]` is the single seam: wire in your agent loop, a direct LLM client, or a stub. `judge_client` implements the `judge.LLMClient` Protocol and is only consulted for `semantic_similar` cases.
  - `evaluate(case)`: run one case; returns `EvalResult`.
  - `evaluate_all()`: load the golden dataset, evaluate every case, return the list.
- `judge.LLMClient`: Protocol with one method, `complete_json(*, model, prompt) -> str`. Concrete adapters live in your agent code; the harness never imports OpenAI/Anthropic/Azure SDKs.
- `judge.evaluate_semantic_similarity(question, expected, actual, client, model)`: calls the judge and returns `(score, explanation)`. Returns `(None, "no LLM client configured")` when `client` is `None`; the runner treats this inconclusive result as a pass with an explanatory note rather than a hard fail.
- `report.generate_report(results)`: markdown summary with overall accuracy, per-category and per-difficulty breakdowns, and failure analysis with reason text.
- `__main__`: `python -m src.eval` runs the dataset with an identity `answer_fn` (echoes the question) and prints the markdown report.
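To make the two seams above concrete, here is a minimal, self-contained sketch of a stub `answer_fn` and a stub judge client. It does not import the harness: the `LLMClient` Protocol is re-declared locally to mirror the signature described above, and `StubJudge`, `echo_answer_fn`, and the `score` / `explanation` payload keys are hypothetical names chosen for illustration, not confirmed parts of the package.

```python
import json
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class LLMClient(Protocol):
    """Local mirror of the judge.LLMClient shape described above (assumption)."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class StubJudge:
    """Hypothetical adapter: returns a canned verdict instead of calling an SDK."""

    def complete_json(self, *, model: str, prompt: str) -> str:
        # The payload keys ("score", "explanation") are assumed, not confirmed.
        return json.dumps({"score": 1.0, "explanation": "stubbed verdict"})


def echo_answer_fn(question: str) -> str:
    """Simplest possible answer_fn: echo the question back."""
    return question


answer_fn: Callable[[str], str] = echo_answer_fn
judge = StubJudge()

# Structural typing: StubJudge satisfies the Protocol without inheriting from it.
assert isinstance(judge, LLMClient)

# With the real package, the wiring would presumably look like:
#   runner = EvalRunner(answer_fn, judge_client=judge, judge_model="stub-model")
verdict = json.loads(judge.complete_json(model="stub-model", prompt="Are these equivalent?"))
print(answer_fn("ping"), verdict["score"])
```

Because the Protocol is structural, a real adapter for any provider only needs a matching `complete_json` method, which is how the SDK import stays in agent code rather than in this layer.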

Conventions

- `coverage` omits this package: the eval suite is exercised by `eval/test_golden_qa.py`, not by `tests/`. Counting it would inflate misses on every PR that touches behaviour without re-running the eval workflow.
- No domain coupling. The harness ships one trivial echo case; real users replace `eval/golden_qa.json` with their domain dataset and wire `answer_fn` to their agent.
- Tolerance is chosen per case, not per runner. A dataset can freely mix `exact_match`, `numeric_close` (within 1%), and `semantic_similar` (LLM judge ≥ 0.8) cases.
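The per-case tolerance rule can be sketched as a small dispatch function. This is an illustration under stated assumptions, not the harness's actual code: `passes` is a hypothetical helper, "within 1%" is read as relative error against the expected value, exact matching is assumed to ignore surrounding whitespace, and a missing judge score is treated as a pass, following the inconclusive behaviour described under Key interfaces.

```python
from typing import Optional


def passes(
    tolerance: str,
    expected: str,
    actual: str,
    judge_score: Optional[float] = None,
) -> bool:
    """Hypothetical per-case check mirroring the three tolerance modes."""
    if tolerance == "exact_match":
        # Assumption: surrounding whitespace is not significant.
        return actual.strip() == expected.strip()
    if tolerance == "numeric_close":
        # Assumption: "within 1%" means relative error against the expected value.
        exp, act = float(expected), float(actual)
        return abs(act - exp) <= 0.01 * abs(exp)
    if tolerance == "semantic_similar":
        if judge_score is None:
            # No judge configured: inconclusive, treated as a pass.
            return True
        return judge_score >= 0.8
    raise ValueError(f"unknown tolerance mode: {tolerance!r}")


# A single dataset can mix all three modes, one per case.
print(passes("exact_match", "Paris", "Paris"))
print(passes("numeric_close", "100", "100.5"))
print(passes("semantic_similar", "a", "b", judge_score=0.5))
```

Keeping the mode on the case rather than on the runner means a single run can cover string, numeric, and free-text answers without reconfiguring anything.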

