feat: add semantic memory evaluation harness #1242

Draft
o-love wants to merge 9 commits into cluster-integration from eval-harness

o-love (Contributor) commented Mar 18, 2026

Purpose of the change

Add a LoCoMo-based evaluation harness for measuring semantic memory retrieval quality, enabling systematic benchmarking of ingestion and search.

Description

  • Evaluation runner (runner.py) orchestrating ingestion → search → scoring pipeline
  • ingest.py and search.py for harness-specific ingestion and retrieval
  • locomo_evaluate.py for computing retrieval metrics against LoCoMo ground truth
  • generate_scores.py for aggregating and reporting results
  • semantic_harness.py as the top-level entry point
  • Test coverage for all harness components
  • YAML config fixture for evaluation scenarios
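
As an illustration of the ingestion → search → scoring flow listed above, here is a minimal sketch. Every name in it (`Question`, `recall_at_k`, `run_eval`, the `search` callable) is hypothetical and is not taken from runner.py or locomo_evaluate.py. The metric is plain recall@k, picked only as an example of a retrieval metric. The toy data is not in LoCoMo format.

```python
from collections.abc import Callable, Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical evaluation item: a query plus the ids of its evidence."""

    query: str
    evidence_ids: frozenset[str]


def recall_at_k(retrieved: Sequence[str], evidence: frozenset[str], k: int) -> float:
    """Fraction of ground-truth evidence ids found in the top-k results."""
    if not evidence:
        return 0.0
    hits = evidence.intersection(retrieved[:k])
    return len(hits) / len(evidence)


def run_eval(
    questions: Sequence[Question],
    search: Callable[[str], Sequence[str]],
    k: int = 5,
) -> float:
    """Run the search step for every question and average recall@k."""
    if not questions:
        return 0.0
    scores = [recall_at_k(search(q.query), q.evidence_ids, k) for q in questions]
    return sum(scores) / len(scores)


# Ingestion is stubbed out: a dict from query to ranked message ids (assumption).
index = {
    "where did Ann move?": ["m3", "m7", "m1"],
    "who owns the dog?": ["m2", "m9"],
}
questions = [
    Question("where did Ann move?", frozenset({"m7"})),
    Question("who owns the dog?", frozenset({"m4", "m9"})),
]
mean_recall = run_eval(questions, lambda q: index.get(q, []), k=2)
```

Passing `search` as a callable keeps the scoring step independent of how ingestion and retrieval are implemented, which is one way a harness like this could stay self-contained.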

Stack: PR 4/4 (main → storage-interface-refactor → cluster-engine → cluster-integration → eval-harness)

Depends on #1241

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • Unit Test
      • test_runner.py: evaluation runner orchestration
      • test_ingest.py: harness ingestion pipeline
      • test_search.py: harness search and scoring
      • test_semantic_harness.py / test_semantic_harness_build.py: end-to-end harness

Test Results: All evaluation tests pass locally.

Checklist

  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

Screenshots/Gifs

N/A

Further comments

This PR is entirely additive (16 new files, 714 lines) under evaluation/ and tests/evaluation/. The harness is self-contained and does not affect the main application.

o-love added 2 commits March 18, 2026 11:41
…ing)

Widen parameter types to Sequence/Mapping for covariance, narrow return
types where mutation is needed (MutableMapping). Convert async def methods
returning iterators to def returning AsyncIterator. Add delete_history_set
to the storage interface.
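
A minimal sketch of the signature change described in this commit message, under stated assumptions: `FeatureStore` and `InMemoryStore` are hypothetical names, not the project's storage interface. The point shown is that a protocol method declared as a plain `def` returning `AsyncIterator` can be implemented by an async generator and consumed with `async for` directly, with no intermediate `await`.

```python
import asyncio
from collections.abc import AsyncIterator, Mapping, Sequence
from typing import Protocol


class FeatureStore(Protocol):
    """Hypothetical storage interface (not the project's actual one)."""

    # Plain `def` returning AsyncIterator: callers iterate the result directly.
    # Parameters use the read-only Sequence/Mapping types mentioned above.
    def search(
        self, set_ids: Sequence[str], filters: Mapping[str, str]
    ) -> AsyncIterator[str]: ...


class InMemoryStore:
    def __init__(self, rows: Mapping[str, Sequence[str]]) -> None:
        self._rows = rows

    # An async generator function: calling it returns an async iterator
    # immediately, which matches the `def ... -> AsyncIterator` declaration.
    async def search(
        self, set_ids: Sequence[str], filters: Mapping[str, str]
    ) -> AsyncIterator[str]:
        # `filters` is ignored in this toy store.
        for set_id in set_ids:
            for feature in self._rows.get(set_id, ()):
                yield feature


async def collect(store: FeatureStore) -> list[str]:
    # Collect into a concrete list, as a caller at an API boundary might.
    return [feature async for feature in store.search(("a", "b"), {})]


result = asyncio.run(collect(InMemoryStore({"a": ["x", "y"], "b": ["z"]})))
```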
Adapt all callers of storage methods that now return AsyncIterator:
- SemanticService propagates AsyncIterator for search, get_set_features,
  list_set_id_starts_with
- SemanticSessionManager propagates AsyncIterator to the boundary
- MemMachine collects AsyncIterator into lists at the API boundary
- IngestionService collects internally where lists are needed
- Add merge_async_iterators utility for parallel iterator merging
- Update test files to collect from AsyncIterator
- Fix ruff import sorting in semantic_memory.py and test_background
- Fix ty invalid-assignment: use Sequence[SemanticFeature] for
  consolidation sections, convert to list at llm boundary
- Fix ty invalid-argument-type: revert Protocol widening in session
  manager where config_store hasn't been updated yet, convert at
  call sites instead
- Fix ruff formatting in test_semantic_ingestion.py
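
The commit above mentions a merge_async_iterators utility but does not show it. The following is only a sketch of one common way to merge async iterators in parallel, using a queue and one task per source. The name `merge_sketch` and its signature are assumptions, not the PR's implementation.

```python
import asyncio
from collections.abc import AsyncIterator, Iterable
from typing import TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking that one source is exhausted


async def merge_sketch(sources: Iterable[AsyncIterator[T]]) -> AsyncIterator[T]:
    """Yield items from all sources as they arrive (order is not guaranteed)."""
    queue: asyncio.Queue = asyncio.Queue()

    async def drain(source: AsyncIterator[T]) -> None:
        try:
            async for item in source:
                await queue.put(item)
        finally:
            # Always signal completion, even if the source raised.
            queue.put_nowait(_DONE)

    tasks = [asyncio.create_task(drain(source)) for source in sources]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield item
        # Re-raise the first exception from any source, if there was one.
        await asyncio.gather(*tasks)
    finally:
        # Stop still-running sources if the consumer stopped early.
        for task in tasks:
            task.cancel()


async def numbers(start: int, count: int, delay: float) -> AsyncIterator[int]:
    for offset in range(count):
        await asyncio.sleep(delay)
        yield start + offset


async def main() -> list[int]:
    merged = merge_sketch([numbers(0, 3, 0.01), numbers(100, 2, 0.015)])
    # Collect into a list, as the API boundary described above does.
    return [value async for value in merged]


merged_values = asyncio.run(main())
```

Because items are yielded in arrival order, callers that need a stable order would have to sort after collecting.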
The router constructs response models with concrete list/dict fields
but the widened model types now expose Sequence/Mapping. Convert at
the serialization boundary.
o-love added 3 commits March 18, 2026 13:42
Introduce ClusterManager, ClusterSplitter, and ClusterStore abstraction
with SQLAlchemy and in-memory implementations. Clusters group incoming
messages by semantic similarity before ingestion.
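
The commit message does not say how messages are grouped, so the following is only an illustrative sketch of one simple approach: greedy assignment to the nearest centroid by cosine similarity, with a new cluster opened when nothing is similar enough. `assign_clusters`, the threshold value, and the toy embeddings are all assumptions and do not describe ClusterManager or ClusterSplitter.

```python
import math
from collections.abc import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> list[int]:
    """Greedy grouping: join the most similar cluster, else open a new one."""
    centroids: list[list[float]] = []
    sizes: list[int] = []
    labels: list[int] = []
    for vec in embeddings:
        best, best_sim = -1, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best < 0:
            # Nothing is similar enough: this message starts a new cluster.
            centroids.append(list(vec))
            sizes.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the chosen cluster's centroid.
            n = sizes[best]
            centroids[best] = [
                (c * n + v) / (n + 1) for c, v in zip(centroids[best], vec)
            ]
            sizes[best] = n + 1
        labels.append(best)
    return labels


labels = assign_clusters([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
```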
o-love added 2 commits March 18, 2026 14:18
The ResourceManager protocol requires get_reranker but the test's
local _ResourceManager was missing it, causing isinstance check to
fail at runtime.
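
A small sketch of the failure mode this fix addresses, assuming the protocol is decorated with `@runtime_checkable`. `ResourceManagerLike`, `get_embedder`, and both stub classes are hypothetical stand-ins. Only the `get_reranker` requirement comes from the commit message.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ResourceManagerLike(Protocol):
    """Hypothetical stand-in for the ResourceManager protocol."""

    def get_embedder(self, name: str) -> object: ...

    def get_reranker(self, name: str) -> object: ...


class IncompleteStub:
    """Test double missing get_reranker, like the stub before the fix."""

    def get_embedder(self, name: str) -> object:
        return object()


class CompleteStub(IncompleteStub):
    """Test double after adding the missing method."""

    def get_reranker(self, name: str) -> object:
        return object()


# isinstance() against a runtime-checkable protocol checks only that every
# protocol member exists on the object; it does not check signatures.
incomplete_ok = isinstance(IncompleteStub(), ResourceManagerLike)
complete_ok = isinstance(CompleteStub(), ResourceManagerLike)
```

Here `incomplete_ok` is `False` and `complete_ok` is `True`, which mirrors why adding the missing method to the test's local stub makes the check pass.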
sscargal added this to the v0.3.4 milestone Apr 6, 2026
sscargal requested review from edwinyyyu and sscargal April 6, 2026 21:38
o-love force-pushed the cluster-integration branch 3 times, most recently from bae9bb0 to f7ae24e April 8, 2026 22:03
o-love marked this pull request as draft April 10, 2026 22:25
sscargal modified the milestones: v0.3.4, v0.3.5 Apr 13, 2026
edwinyyyu requested a review from malatewang April 14, 2026 17:27
sscargal modified the milestones: v0.3.5, v0.3.6 Apr 17, 2026