feat: add semantic memory evaluation harness by o-love · Pull Request #1242 · MemMachine/MemMachine

o-love · 2026-03-18T18:44:33Z

Purpose of the change

Add a LoCoMo-based evaluation harness for measuring semantic memory retrieval quality, enabling systematic benchmarking of ingestion and search.

Description

Evaluation runner (runner.py) orchestrating ingestion → search → scoring pipeline
ingest.py and search.py for harness-specific ingestion and retrieval
locomo_evaluate.py for computing retrieval metrics against LoCoMo ground truth
generate_scores.py for aggregating and reporting results
semantic_harness.py as the top-level entry point
Test coverage for all harness components
YAML config fixture for evaluation scenarios
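The ingestion → search → scoring flow listed above can be sketched roughly as follows. This is a minimal illustration, not the code in runner.py: the names `Question`, `recall_at_k`, and `run_evaluation` are hypothetical, and recall@k is assumed as the retrieval metric only for the sake of the example.

```python
from collections.abc import Callable, Mapping, Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical benchmark item: a query plus the IDs of its evidence messages."""

    text: str
    evidence_ids: frozenset[str]


def recall_at_k(retrieved: Sequence[str], relevant: frozenset[str], k: int) -> float:
    """Fraction of relevant IDs that appear among the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def run_evaluation(
    messages: Mapping[str, str],
    questions: Sequence[Question],
    ingest: Callable[[Mapping[str, str]], None],
    search: Callable[[str], Sequence[str]],
    k: int = 5,
) -> float:
    """Ingest once, search per question, and return the mean recall@k."""
    ingest(messages)
    scores = [recall_at_k(search(q.text), q.evidence_ids, k) for q in questions]
    return sum(scores) / len(scores) if scores else 0.0
```

Passing `ingest` and `search` as callables keeps the scoring loop independent of any particular memory backend, which is one plausible way to make a harness like this testable in isolation.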

Stack: PR 4/4 — main ← storage-interface-refactor ← cluster-engine ← cluster-integration ← eval-harness

Depends on #1241

Type of change

New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Unit Test
test_runner.py — evaluation runner orchestration
test_ingest.py — harness ingestion pipeline
test_search.py — harness search and scoring
test_semantic_harness.py / test_semantic_harness_build.py — end-to-end harness

Test Results: All evaluation tests pass locally.

Checklist

My code follows the style guidelines of this project (See STYLE_GUIDE.md)
I have performed a self-review of my own code
I have commented my code
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
I have checked my code and corrected any misspellings

Maintainer Checklist

Confirmed all checks passed
Contributor has signed the commit(s)
Reviewed the code
Run, Tested, and Verified the change(s) work as expected

Screenshots/Gifs

N/A

Further comments

This PR is entirely additive (16 new files, 714 lines) under evaluation/ and tests/evaluation/. The harness is self-contained and does not affect the main application.

…ing) Widen parameter types to Sequence/Mapping for covariance, narrow return types where mutation is needed (MutableMapping). Convert async def methods returning iterators to def returning AsyncIterator. Add delete_history_set to the storage interface.
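A small sketch of the typing pattern this commit describes, under assumptions: `FeatureStorage` and `InMemoryStorage` are hypothetical stand-ins, not the project's storage interface. The protocol method is declared with plain `def` returning `AsyncIterator`, and an async generator satisfies it; the parameter is typed as `Sequence` so that callers can pass either a list or a tuple.

```python
import asyncio
from collections.abc import AsyncIterator, Mapping, Sequence
from typing import Protocol


class FeatureStorage(Protocol):
    """Hypothetical interface: note `def`, not `async def`, for the iterator method."""

    def search(self, set_ids: Sequence[str]) -> AsyncIterator[str]: ...


class InMemoryStorage:
    """Hypothetical implementation backed by a read-only mapping."""

    def __init__(self, rows: Mapping[str, Sequence[str]]) -> None:
        self._rows = rows

    async def search(self, set_ids: Sequence[str]) -> AsyncIterator[str]:
        # An async generator: calling it returns an async iterator immediately,
        # so it matches the `def ... -> AsyncIterator[str]` declaration above.
        for set_id in set_ids:
            for feature in self._rows.get(set_id, ()):
                yield feature


async def collect(items: AsyncIterator[str]) -> list[str]:
    """Materialize an async iterator where a concrete list is needed."""
    return [item async for item in items]


def demo() -> list[str]:
    storage: FeatureStorage = InMemoryStorage({"a": ["x", "y"], "b": ["z"]})
    # A tuple is accepted because the parameter is a Sequence, not a list.
    return asyncio.run(collect(storage.search(("b", "a"))))
```

Had the protocol method been declared `async def ... -> AsyncIterator[str]`, callers would need to `await` the call before iterating, which is the mismatch this kind of conversion avoids.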

Adapt all callers of storage methods that now return AsyncIterator:
- SemanticService propagates AsyncIterator for search, get_set_features, list_set_id_starts_with
- SemanticSessionManager propagates AsyncIterator to the boundary
- MemMachine collects AsyncIterator into lists at the API boundary
- IngestionService collects internally where lists are needed
- Add merge_async_iterators utility for parallel iterator merging
- Update test files to collect from AsyncIterator
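One way a parallel iterator merge could work is sketched below. The PR's merge_async_iterators is not shown on this page, so the function here is deliberately named `merge_async_iterators_sketch`; its signature and queue-based design are assumptions, not the project's implementation.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking that one source iterator is exhausted


async def merge_async_iterators_sketch(*iterators: AsyncIterator[T]) -> AsyncIterator[T]:
    """Yield items from all sources as they arrive; order across sources is not fixed."""
    queue: asyncio.Queue = asyncio.Queue()

    async def drain(source: AsyncIterator[T]) -> None:
        try:
            async for item in source:
                await queue.put(item)
        finally:
            # Always signal completion, even if the source raised.
            await queue.put(_DONE)

    tasks = [asyncio.create_task(drain(source)) for source in iterators]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield item
        # Re-raise the first error from any source, if there was one.
        await asyncio.gather(*tasks)
    finally:
        # Stop any still-running producers if the consumer exits early.
        for task in tasks:
            task.cancel()


async def collect(items: AsyncIterator[T]) -> list[T]:
    """Collect an async iterator into a list, as done at an API boundary."""
    return [item async for item in items]
```

Each source is drained by its own task, so a slow source does not block items from a fast one.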

- Fix ruff import sorting in semantic_memory.py and test_background
- Fix ty invalid-assignment: use Sequence[SemanticFeature] for consolidation sections, convert to list at llm boundary
- Fix ty invalid-argument-type: revert Protocol widening in session manager where config_store hasn't been updated yet, convert at call sites instead
- Fix ruff formatting in test_semantic_ingestion.py

o-love added 2 commits March 18, 2026 11:41

o-love force-pushed the cluster-integration branch from 5c3ccd6 to 685c9ee Compare March 18, 2026 19:28

o-love force-pushed the eval-harness branch from 89b5a0d to 4039fa7 Compare March 18, 2026 19:28

o-love force-pushed the cluster-integration branch from 685c9ee to 1131247 Compare March 18, 2026 19:42

o-love force-pushed the eval-harness branch from 4039fa7 to 0e895c4 Compare March 18, 2026 19:42

fix: convert Sequence/Mapping to list/dict at API response boundary

1e58170

The router constructs response models with concrete list/dict fields but the widened model types now expose Sequence/Mapping. Convert at the serialization boundary.
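The conversion described here might look like the following sketch. `FeatureView` and `FeatureResponse` are hypothetical names, and plain dataclasses stand in for whatever model classes the router actually uses.

```python
from collections.abc import Mapping, Sequence
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class FeatureView:
    """Hypothetical internal model exposing widened, read-only types."""

    tags: Sequence[str]
    metadata: Mapping[str, str]


@dataclass
class FeatureResponse:
    """Hypothetical response model that requires concrete list/dict fields."""

    tags: list[str]
    metadata: dict[str, str]


def to_response(view: FeatureView) -> FeatureResponse:
    # Copy into concrete containers only at the serialization boundary,
    # so internal code can keep passing tuples or read-only mappings.
    return FeatureResponse(tags=list(view.tags), metadata=dict(view.metadata))


def demo() -> FeatureResponse:
    view = FeatureView(tags=("a", "b"), metadata=MappingProxyType({"k": "v"}))
    return to_response(view)
```

Converting once at the boundary keeps the copies in a single place instead of scattering `list(...)` and `dict(...)` calls through the service layer.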

o-love force-pushed the cluster-integration branch from 1131247 to c60cb13 Compare March 18, 2026 20:01

o-love force-pushed the eval-harness branch from 0e895c4 to ad97a84 Compare March 18, 2026 20:01

o-love added 3 commits March 18, 2026 13:42

fix: ruff format router.py ternary expressions

7a90993

feat: add cluster engine for semantic message grouping

1a5e28b

Introduce ClusterManager, ClusterSplitter, and ClusterStore abstraction with SQLAlchemy and in-memory implementations. Clusters group incoming messages by semantic similarity before ingestion.
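To illustrate what grouping messages by semantic similarity can mean, here is a minimal greedy sketch. It is not the algorithm used by ClusterManager or ClusterSplitter, which this page does not describe; the function names, the cosine-similarity measure, the running-mean centroid update, and the 0.8 threshold are all assumptions chosen for the example.

```python
import math
from collections.abc import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> list[int]:
    """Assign each embedding to the most similar centroid, or open a new cluster."""
    centroids: list[list[float]] = []
    counts: list[int] = []
    labels: list[int] = []
    for vec in embeddings:
        best, best_sim = -1, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best < 0:
            # No centroid is similar enough: start a new cluster.
            centroids.append(list(vec))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Update the centroid as a running mean of its members.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], vec)]
        labels.append(best)
    return labels
```

In this sketch, messages whose embeddings point in nearly the same direction share a cluster label, and a dissimilar message opens a new cluster.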

feat: wire clustering into ingestion pipeline and API

ae1f158

o-love force-pushed the cluster-integration branch from c60cb13 to ae1f158 Compare March 18, 2026 20:43

o-love force-pushed the eval-harness branch from ad97a84 to 5340cf6 Compare March 18, 2026 20:43

o-love added 2 commits March 18, 2026 14:18

fix: add get_reranker to _ResourceManager in integration test

bae9bb0

The ResourceManager protocol requires get_reranker but the test's local _ResourceManager was missing it, causing isinstance check to fail at runtime.
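The failure mode described in this commit message can be reproduced with a reduced example. The protocol below is a hypothetical two-method stand-in (only get_reranker is named in the source; `get_embedder` and the method signatures are assumptions), and it presumes the real protocol is decorated with `typing.runtime_checkable`, since `isinstance` against a protocol requires that.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ResourceManager(Protocol):
    """Hypothetical reduced protocol; the real one likely has more methods."""

    def get_embedder(self, name: str) -> object: ...

    def get_reranker(self, name: str) -> object: ...


class IncompleteManager:
    """Test double that is missing get_reranker."""

    def get_embedder(self, name: str) -> object:
        return object()


class CompleteManager(IncompleteManager):
    """Test double after the fix: all protocol methods are present."""

    def get_reranker(self, name: str) -> object:
        return object()


def satisfies_protocol(candidate: object) -> bool:
    # For runtime-checkable protocols, isinstance only checks that every
    # protocol member exists on the object, not that signatures match.
    return isinstance(candidate, ResourceManager)
```

Because the runtime check looks only at member presence, a stub missing a single method fails it even when that method is never called by the test.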

feat: add semantic memory evaluation harness

b907ea0

o-love force-pushed the eval-harness branch from 5340cf6 to b907ea0 Compare March 18, 2026 21:18

sscargal added this to the v0.3.4 milestone Apr 6, 2026

sscargal requested review from edwinyyyu and sscargal April 6, 2026 21:38

o-love force-pushed the cluster-integration branch 3 times, most recently from bae9bb0 to f7ae24e Compare April 8, 2026 22:03

edwinyyyu approved these changes Apr 10, 2026

View reviewed changes

o-love marked this pull request as draft April 10, 2026 22:25

sscargal modified the milestones: v0.3.4, v0.3.5 Apr 13, 2026

edwinyyyu requested review from malatewang April 14, 2026 17:27

sscargal modified the milestones: v0.3.5, v0.3.6 Apr 17, 2026

o-love wants to merge 9 commits into cluster-integration from eval-harness
