This folder contains evaluation tests (aka evals) for the A2UI project. An evaluation test verifies that a prompt produces expected results conforming to the A2UI schema and semantic rules.
For a detailed overview of the design, secrets management, and contamination prevention, see the DESIGN.md file.
To run the evaluations, you need to use the Inspect AI CLI via uv. Make sure you are in this directory (evals/eval).
- Set your Gemini API key:
export GEMINI_API_KEY="your_api_key"
- Decrypt Datasets (First Time Setup): The evaluation datasets are encrypted at rest in the repository to prevent base model contamination. To decrypt them in your repo for evaluation, you need to initialize Transcrypt with the shared password. From the evals/eval directory, run:
bin/transcrypt -p <PASSWORD>
You can request the password from any member of the A2UI team (it's not really a secret, but it's also not going on GitHub in plaintext).
After this one-time setup, you will have local plaintext access to the decrypted datasets in the datasets/ directory, and git will encrypt and decrypt them transparently.
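The two setup steps above can be sanity-checked before a long run. The following is a minimal sketch, not part of the eval framework: the helper names `is_still_encrypted` and `preflight` are hypothetical, and the check assumes that Transcrypt stores ciphertext as base64-encoded OpenSSL output, which begins with `U2FsdGVk` (the base64 encoding of `Salted__`).

```python
import os
from pathlib import Path

# Assumption: transcrypt ciphertext is base64 OpenSSL output, which starts
# with "U2FsdGVk" (base64 for the "Salted__" header).
ENCRYPTED_PREFIX = b"U2FsdGVk"


def is_still_encrypted(path: Path) -> bool:
    """Return True if the file looks like transcrypt ciphertext."""
    with path.open("rb") as handle:
        return handle.read(len(ENCRYPTED_PREFIX)) == ENCRYPTED_PREFIX


def preflight(dataset_dir: Path = Path("datasets")) -> list[str]:
    """Collect human-readable problems; an empty list means ready to run."""
    problems = []
    # Step 1: the API key must be exported in the current environment.
    if not os.environ.get("GEMINI_API_KEY", "").strip():
        problems.append("GEMINI_API_KEY is not set")
    # Step 2: no file under datasets/ should still look like ciphertext.
    if not dataset_dir.is_dir():
        problems.append(f"{dataset_dir} directory not found")
    else:
        for path in sorted(dataset_dir.rglob("*")):
            if path.is_file() and is_still_encrypted(path):
                problems.append(f"{path} still looks encrypted")
    return problems
```

Calling `preflight()` from the evals/eval directory returns an empty list when both steps have been completed. Files that were never encrypted are simply reported as fine, so the check is only a heuristic.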
To run the evaluations with a specific model (e.g., Gemini 3 Flash Preview):
uv run inspect eval tasks.py --model google/gemini-3-flash-preview --display plain
Inspect AI provides a web-based log viewer to explore the results of your evaluations.
To start the log viewer:
uv run inspect view start
This will start a local web server (usually at http://localhost:7575) and open the viewer in your browser. It will automatically load logs from the logs/ directory.
To list the available Gemini models supported by your API key:
uv run inspect eval tasks.py -T list_models=True --model google/gemini-3-flash-preview
(The --model flag is required even though it is ignored.)
To run the unit tests for the evaluation framework (dataset loader, solvers, scorers):
uv run python -m pytest
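
To compare several models, the eval command shown earlier can be wrapped in a short script. This is a sketch under assumptions: `build_eval_command` and `run_models` are hypothetical helpers, and the only flags used are the `--model` and `--display` options already shown in this README.

```python
import subprocess


def build_eval_command(model: str, display: str = "plain") -> list[str]:
    """Build the argv for one eval run, mirroring the command in this README."""
    return [
        "uv", "run", "inspect", "eval", "tasks.py",
        "--model", model,
        "--display", display,
    ]


def run_models(models: list[str]) -> dict[str, int]:
    """Run the evals once per model and return each process exit code."""
    results = {}
    for model in models:
        # check=False: keep going so one failing model does not stop the rest.
        completed = subprocess.run(build_eval_command(model), check=False)
        results[model] = completed.returncode
    return results
```

For example, `run_models(["google/gemini-3-flash-preview"])` runs a single evaluation from the evals/eval directory; the list_models command above shows which other model identifiers your API key supports.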