Provider-agnostic evaluation harness: golden QA cases driven by a caller-supplied `answer_fn`, three tolerance modes (exact / numeric / semantic), and a Protocol-based LLM-judge seam so no SDK leaks into this layer. Designed to be runnable on a fresh clone without LLM credentials (the example case uses `exact_match`); the nightly workflow opt-in is documented in `docs/EVAL_HARNESS.md`.
- `models.EvalCase`: Pydantic shape of an entry in `eval/golden_qa.json`: `id`, `question`, `expected_answer`, `tolerance`, plus optional `category` / `difficulty` / `expected_tools` / `notes`.
- `models.EvalResult`: per-case outcome: actual answer, latency, pass/fail, tolerance score, failure reason.
- `runner.load_golden_dataset(path=None)`: pure JSON loader. Standalone so call sites can introspect the dataset without paying for an `EvalRunner`.
- `runner.EvalRunner(answer_fn, judge_client=None, judge_model="")`: the orchestrator. `answer_fn: Callable[[str], str]` is the single seam: wire your agent loop, direct LLM client, or stub. `judge_client` implements the `judge.LLMClient` Protocol and is only consulted for `semantic_similar` cases.
  - `evaluate(case)`: run one case; returns `EvalResult`.
  - `evaluate_all()`: load the golden dataset, evaluate every case, return the list.
- `judge.LLMClient`: `Protocol` with one method: `complete_json(*, model, prompt) -> str`. Concrete adapters live in your agent code; the harness never imports OpenAI/Anthropic/Azure SDKs.
- `judge.evaluate_semantic_similarity(question, expected, actual, client, model)`: calls the judge, returns `(score, explanation)`. Returns `(None, "no LLM client configured")` when `client is None` (inconclusive; runner treats it as pass with an explanatory note rather than a hard fail).
- `report.generate_report(results)`: markdown summary: overall accuracy, per-category, per-difficulty, failure analysis with reason text.
- `__main__`: `python -m src.eval` runs the dataset with an identity `answer_fn` (echoes the question) and prints the markdown report.
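
The judge seam described above can be sketched without importing the harness. This is a minimal illustration, not the package's own code: the `LLMClient` class is a local mirror of the documented one-method protocol, `StubJudge` and `judge_similarity` are hypothetical names, and the `score` / `explanation` JSON keys are an assumed response shape.

```python
import json
from typing import Optional, Protocol, Tuple


class LLMClient(Protocol):
    """Local mirror of the one-method protocol described above."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class StubJudge:
    """Hypothetical adapter: returns a canned verdict instead of calling an SDK."""

    def complete_json(self, *, model: str, prompt: str) -> str:
        # Assumed response shape; the real judge's JSON keys are not documented here.
        return json.dumps({"score": 0.9, "explanation": "stubbed verdict"})


def judge_similarity(
    question: str,
    expected: str,
    actual: str,
    client: Optional[LLMClient],
    model: str,
) -> Tuple[Optional[float], str]:
    """Stand-in for judge.evaluate_semantic_similarity, not the harness's own code."""
    if client is None:
        # Inconclusive result, mirroring the documented return value.
        return None, "no LLM client configured"
    prompt = f"Question: {question}\nExpected: {expected}\nActual: {actual}"
    payload = json.loads(client.complete_json(model=model, prompt=prompt))
    return float(payload["score"]), str(payload["explanation"])


def echo_answer_fn(question: str) -> str:
    """Identity answer_fn: any Callable[[str], str] fits the seam."""
    return question


score, explanation = judge_similarity("2+2?", "4", "four", StubJudge(), "stub-model")
no_client = judge_similarity("2+2?", "4", "four", None, "")
```

Because `StubJudge` satisfies the protocol structurally, it needs no base class. A real adapter would presumably wrap a provider SDK behind the same keyword-only `complete_json` call, keeping that import in agent code rather than in the harness.
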
- `coverage` omits this package: the eval suite is exercised by `eval/test_golden_qa.py`, not by `tests/`. Counting it would inflate misses on every PR that touches behaviour without re-running the eval workflow.
- No domain coupling. The harness ships one trivial echo case; real users replace `eval/golden_qa.json` with their domain dataset and wire `answer_fn` to their agent.
- Tolerance picks happen per case, not per runner. A dataset can mix `exact_match`, `numeric_close` (within 1%), and `semantic_similar` (LLM judge ≥ 0.8) cases freely.
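
Per-case tolerance selection can be pictured as a small dispatcher. The sketch below is an assumption-laden illustration rather than the harness's actual logic: `passes` is a hypothetical helper, "within 1%" is taken as relative to the expected value, exact matching is assumed to ignore surrounding whitespace, and the sample cases are invented.

```python
from typing import Optional


def passes(tolerance: str, expected: str, actual: str,
           judge_score: Optional[float] = None) -> bool:
    """Hypothetical per-case tolerance check; the harness's real rules may differ."""
    if tolerance == "exact_match":
        # Assumption: compare after trimming surrounding whitespace.
        return actual.strip() == expected.strip()
    if tolerance == "numeric_close":
        try:
            want, got = float(expected), float(actual)
        except ValueError:
            return False  # non-numeric answers cannot be numerically close
        # Assumption: "within 1%" means relative to the expected value.
        return abs(got - want) <= 0.01 * abs(want)
    if tolerance == "semantic_similar":
        # A missing judge score is inconclusive, which the runner counts as a pass.
        return judge_score is None or judge_score >= 0.8
    raise ValueError(f"unknown tolerance mode: {tolerance}")


# Invented cases mixing all three modes in one dataset.
cases = [
    {"id": "c1", "question": "Capital of France?",
     "expected_answer": "Paris", "tolerance": "exact_match"},
    {"id": "c2", "question": "Boiling point of water in C?",
     "expected_answer": "100", "tolerance": "numeric_close"},
    {"id": "c3", "question": "Describe the climate.",
     "expected_answer": "It rains often.", "tolerance": "semantic_similar"},
]
answers = {"c1": "Paris", "c2": "100.5", "c3": "Rain is frequent."}
outcomes = {
    c["id"]: passes(c["tolerance"], c["expected_answer"], answers[c["id"]], judge_score=0.85)
    for c in cases
}
```

Keeping the mode on the case rather than on the runner means one pass over the dataset can apply strict checks to factual lookups and looser ones to free-text answers.
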