Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

A2UI Evaluation Framework

This folder contains evaluation tests (aka evals) for the A2UI project. An evaluation test verifies that a prompt produces expected results conforming to the A2UI schema and semantic rules.

Design

For a detailed overview of the design, secrets management, and contamination prevention, see the DESIGN.md file.

Running Evaluations

To run the evaluations, you need to use the Inspect AI CLI via uv. Make sure you are in this directory (evals/eval).

Prerequisites

  1. Set your Gemini API key:

    export GEMINI_API_KEY="your_api_key"
  2. Decrypt Datasets (First Time Setup): The evaluation datasets are encrypted at rest in the repository to prevent base model contamination. To decrypt them in your repo for evaluation, you need to initialize Transcrypt with the shared password. From the evals/eval directory, run:

    bin/transcrypt -p <PASSWORD>

You can request the password from any member of the A2UI team (it's not really a secret, but it's also not going on Github in plaintext).

After this one time setup, you will have local plaintext access to the decrypted datasets in the datasets/ directory, and they will be encrypted and decrypted transparently by git.

Run Evals

To run the evaluations with a specific model (e.g., Gemini 2.0 Flash):

uv run inspect eval tasks.py --model google/gemini-3-flash-preview --display plain

Viewing Evaluation Results

Inspect AI provides a web-based log viewer to explore the results of your evaluations.

To start the log viewer:

uv run inspect view start

This will start a local web server (usually at http://localhost:7575) and open the viewer in your browser. It will automatically load logs from the logs/ directory.

Listing Available Models

To list the available Gemini models supported by your API key:

uv run inspect eval tasks.py -T list_models=True --model google/gemini-3-flash-preview

(the --model flag is required even though it is ignored)

Running Unit Tests

To run the unit tests for the evaluation framework (dataset loader, solvers, scorers):

uv run python -m pytest