harness-python-react/eval/golden_patterns.json at develop · constk/harness-python-react

38 lines (38 loc) · 2.45 KB

    "id": "factual-http-200",
    "question": "What HTTP status code means OK? Respond with only the number, no prose.",
    "category": "factual-recall",
    "expected_answer": "200",
    "tolerance": "exact_match",
    "difficulty": "easy",
    "notes": "Pattern: factual recall with format-constrained output. exact_match works because the prompt forces a single canonical token. If the model adds prose (\"The status code is 200.\") this fails loudly — which is the point: format adherence is part of the assertion."
    "id": "numeric-seconds-per-day",
    "question": "How many seconds are in 24 hours? Respond with the integer only.",
    "category": "numeric-reasoning",
    "expected_answer": "86400",
    "tolerance": "numeric_close",
    "difficulty": "easy",
    "notes": "Pattern: numeric extraction with 1% tolerance. The runner pulls the first number from each side and compares ratios, so '86,400', '86400 seconds', and '86400.0' all match. Use this tolerance for math, conversions, and any case where formatting around the number is uninteresting."
    "id": "definitional-fastapi-depends",
    "question": "In one sentence: what does FastAPI's Depends() do?",
    "category": "definitional",
    "expected_answer": "Depends declares a callable that FastAPI resolves at request time and injects the result into the parameter, enabling dependency injection for things like authentication, database sessions, or settings.",
    "tolerance": "semantic_similar",
    "difficulty": "medium",
    "notes": "Pattern: free-form prose scored by LLM judge. semantic_similar passes at score >= 0.8 via the judge in src/eval/judge.py. Use this for definitions, explanations, and any case where wording can legitimately vary but the underlying claim is checkable."
    "id": "structured-json-status",
    "question": "Return exactly this JSON object and nothing else (no markdown fence, no prose, no trailing newline): {\"ok\": true, \"version\": 1}",
    "category": "structured-output",
    "expected_answer": "{\"ok\": true, \"version\": 1}",
    "tolerance": "exact_match",
    "difficulty": "medium",
    "notes": "Pattern: format adherence on structured output. Models commonly wrap JSON in ```json``` fences or add a preamble; exact_match after normalisation (lowercase + whitespace-collapse) accepts a clean response but rejects the fenced or prose-wrapped version. This is the failure mode you want to catch — downstream parsers break the same way."

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

golden_patterns.json

Latest commit

History

golden_patterns.json

File metadata and controls