Add client side MemMachine retrieval skill, and unify/skill-ify server side retrieval agent #1277

Tianyang-Zhang wants to merge 17 commits into main from
Conversation
Pull request overview
This PR migrates retrieval-agent execution to provider-native multi-turn “skill” sessions (with strict contract validation and fallback policy), adds client-side skill/CLI helpers, and refactors evaluation to run via REST + installed skills while emitting a retrieval_trace in search results.
Changes:
- Introduce strict v1 retrieval-agent request/result contract helpers, fallback policy, session-state tracing, and provider-native skill session wrappers (OpenAI + Anthropic).
- Add shared provider-backed skill installation primitives (`memmachine_common`) and a new client CLI (`memmachine`), plus API surface updates (`retrieval_trace`, `agent_mode` serialization).
- Refactor the evaluation harness from legacy `retrieval_agent` to `retrieval_skill`, using REST entry points and configurable answer/judge models.
Reviewed changes
Copilot reviewed 83 out of 101 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| packages/server/src/memmachine_server/retrieval_agent/agents/runtime.py | New helpers to build/validate strict retrieval-agent v1 request/result contracts. |
| packages/server/src/memmachine_server/retrieval_agent/agents/memmachine_retriever.py | Removes legacy server-side MemMachineAgent retriever implementation. |
| packages/server/src/memmachine_server/retrieval_agent/agents/fallback_policy.py | Adds explicit retry/fallback decision policy for orchestration guardrails. |
| packages/server/src/memmachine_server/retrieval_agent/agents/coq_agent.py | Removes legacy server-side ChainOfQueryAgent implementation. |
| packages/server/src/memmachine_server/retrieval_agent/agents/__init__.py | Re-exports new contract/runtime/session-state APIs as the public surface. |
| packages/server/src/memmachine_server/common/vector_graph_store/neo4j_vector_graph_store.py | Changes Neo4j index cache population to avoid awaitIndexes by reading state. |
| packages/server/src/memmachine_server/common/language_model/provider_skill_bundle.py | Adds provider-neutral skill bundle materialization helper (SKILL.md bundles). |
| packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py | Exposes OpenAI client/model fields for live installed-skill sessions. |
| packages/server/src/memmachine_server/common/language_model/openai_chat_completions_language_model.py | Exposes OpenAI client/model fields for installed-skill execution. |
| packages/server/src/memmachine_server/common/language_model/__init__.py | Exports new skill-session model types and provider skill bundle helpers. |
| packages/server/src/memmachine_server/common/configuration/retrieval_config.py | Extends retrieval-agent config for provider-native sessions + Anthropic settings. |
| packages/server/server_tests/memmachine_server/server/api_v2/test_router.py | Adds coverage for retrieval_trace in REST search responses. |
| packages/server/server_tests/memmachine_server/server/api_v2/test_mcp.py | Asserts MCP search variants default retrieval_trace to None. |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_session_state.py | New tests for session-state merging and emitted orchestration metrics. |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_route_selection.py | New tests covering selected route/agent metrics and multi-tool behavior. |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_orchestration_loop.py | New tests for orchestration loop behaviors and trace payloads. |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_fallback_policy.py | New tests for fallback triggers (budgets, invalid tools, max steps). |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_bootstrap.py | New tests for bootstrap path, error mapping, and skill bundle attachment. |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_runtime.py | New tests for contract enforcement, normalization, and tool schema parsing. |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_markdown_specs.py | New tests validating agent markdown specs structure and content anchors. |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_markdown_prompt_parity.py | New tests ensuring spec/prompt parity and disallowing legacy tools. |
| packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_interface_contract.py | New protocol conformance test for retrieval-agent interface. |
| packages/server/server_tests/memmachine_server/retrieval_agent/provider_runner_stub.py | Adds fake OpenAI installed-agent client + REST memory stubs for tests. |
| packages/server/server_tests/memmachine_server/main/test_memmachine_mock.py | Updates MemMachine mock tests to use RetrievalAgentProtocol and retrieval_trace. |
| packages/server/server_tests/memmachine_server/common/vector_graph_store/test_neo4j_vector_graph_store_unit.py | Adds unit test for new Neo4j index state cache behavior. |
| packages/server/server_tests/memmachine_server/common/language_model/test_skill_language_model.py | Adds tests for OpenAI skill-session wrapper (retries, tool call parsing). |
| packages/server/server_tests/memmachine_server/common/language_model/test_skill_anthropic_session_language_model.py | Adds tests for Anthropic skill sessions, tool chaining, skill attachment. |
| packages/server/server_tests/memmachine_server/common/language_model/test_provider_skill_bundle.py | Adds tests for provider skill bundle materialization stability. |
| packages/server/server_tests/memmachine_server/common/configuration/test_retrieval_config.py | Adds tests for new retrieval-agent config defaults and env resolution. |
| packages/server/pyproject.toml | Adds Anthropic dependency; adjusts ty extra-paths. |
| packages/common/src/memmachine_common/skill_installer.py | Adds shared skill installer that uploads markdown to provider Files APIs with caching. |
| packages/common/src/memmachine_common/skill.py | Adds immutable Skill metadata model for installed skills. |
| packages/common/src/memmachine_common/api/spec.py | Adds serialization alias for agent_mode; adds retrieval_trace to search response content. |
| packages/common/src/memmachine_common/api/doc.py | Documents retrieval_trace field. |
| packages/common/src/memmachine_common/__init__.py | Exposes Skill primitives (Skill, install_skill, SkillRunner). |
| packages/common/common_tests/test_skill_installer.py | Adds tests for provider uploads, caching, and missing SDK import errors. |
| packages/common/common_tests/test_skill.py | Adds tests for Skill immutability and validation. |
| packages/common/common_tests/__init__.py | Adds package marker for common tests. |
| packages/client/src/memmachine_client/cli.py | Adds memmachine CLI with search and ingest commands. |
| packages/client/src/memmachine_client/__init__.py | Reformats exports list. |
| packages/client/pyproject.toml | Adds optional deps (anthropic/openai) and CLI script entry point. |
| packages/client/client_tests/test_cli_integration.py | Adds integration tests for CLI (skipped by default). |
| packages/client/README.md | Updates client docs to include retrieval-agent mode, retrieval_trace, and CLI usage. |
| evaluation/tests/__init__.py | Adds evaluation tests package marker. |
| evaluation/retrieval_skill/wikimultihop_search.py | New retrieval_skill benchmark runner using REST + skill runner. |
| evaluation/retrieval_skill/wikimultihop_ingest.py | Refactors ingestion to REST message ingestion + warmup search. |
| evaluation/retrieval_skill/run_test.sh | New unified benchmark runner script with concurrency/config flags. |
| evaluation/retrieval_skill/longmemeval_test.py | Refactors LongMemEval ingest/search to REST + skill runner + configurable models. |
| evaluation/retrieval_skill/locomo_search.py | Refactors LoCoMo search to skill runner with configurable models. |
| evaluation/retrieval_skill/locomo_ingest.py | Refactors LoCoMo ingestion to REST episode ingestion + warmup searches. |
| evaluation/retrieval_skill/llm_judge.py | Refactors judge to use benchmark config or OpenAI env fallback without server config. |
| evaluation/retrieval_skill/hotpotQA_test.py | Refactors HotpotQA ingest/search to REST messages + skill runner + warmup. |
| evaluation/retrieval_skill/generate_scores.py | Updates scoring aggregation terminology from tools to skills. |
| evaluation/retrieval_skill/evaluate.py | Updates evaluator to use new judge config flow and store extra fields. |
| evaluation/retrieval_skill/benchmark_config.yml | Adds benchmark config for answer + evaluation models. |
| evaluation/retrieval_skill/benchmark_config.py | Adds benchmark config loader + unified LLM client wrapper. |
| evaluation/retrieval_skill/README.md | Documents retrieval_skill harness, config format, and usage. |
| evaluation/retrieval_agent/wikimultihop_search.py | Removes legacy retrieval_agent benchmark runner. |
| evaluation/retrieval_agent/test_run_test.py | Removes legacy run_test.sh behavior tests. |
| evaluation/retrieval_agent/test_preflight.py | Removes legacy preflight tests. |
| evaluation/retrieval_agent/test_locomo_search.py | Removes legacy locomo search parser tests. |
| evaluation/retrieval_agent/test_benchmark_concurrency.py | Removes legacy benchmark concurrency parser tests. |
| evaluation/retrieval_agent/requirements.txt | Removes legacy retrieval_agent requirements file. |
| evaluation/retrieval_agent/preflight.py | Removes legacy preflight script. |
| evaluation/retrieval_agent/cli_utils.py | Removes legacy argparse utilities. |
| evaluation/README.md | Updates top-level evaluation docs to recommend retrieval_skill suite. |
| docs/open_source/retrieval_agent_architecture.mdx | Updates architecture doc to reflect provider-native session + single-tool contract. |
| docs/open_source/configuration.mdx | Documents new retrieval_agent session provider settings (OpenAI/Anthropic). |
```python
    """Validate output against the v1 agent result contract."""
    try:
        return AgentResultV1.model_validate(raw_result)
    except ValidationError as first_err:
        if normalizer is None:
            raise AgentContractError(
                code=AgentContractErrorCode.INVALID_OUTPUT,
                payload=AgentContractErrorPayload(
                    what_failed="Agent result validation failed",
                    why=str(first_err),
                    how_to_fix="Return a strict v1 agent result object.",
                    where="agents.runtime.validate_agent_result",
                    fallback_trigger_reason="invalid_agent_output",
                ),
                validation_error=first_err,
            ) from first_err
```
The function has built-in tuple normalization support ((result, perf_metrics)), but that normalization is unreachable when normalizer=None because the function raises immediately on the first ValidationError. If raw_result is a legacy (payload, metrics) tuple, it will always fail unless a normalizer is provided. Consider applying tuple normalization before raising (e.g., attempt _normalize_result_from_tuple(raw_result, route_name=...) and re-validate) or treating tuple normalization as the default normalizer when normalizer is not provided.
```python
record["name"]: (
    Neo4jVectorGraphStore.CacheIndexState.ONLINE
    if record["state"] == "ONLINE"
    else Neo4jVectorGraphStore.CacheIndexState.CREATING
```
Mapping every non-ONLINE index state to CREATING can misclassify terminal or error states (e.g., FAILED) as “still creating.” If downstream logic relies on this cache to decide whether to wait/retry/use an index, this can cause incorrect behavior or endless “creating” semantics. Suggest explicitly mapping known states (e.g., POPULATING/ONLINE) and either preserving unknown states in the cache (as a string) or introducing a distinct cache state for failure/unknown so callers can handle it deterministically.
```diff
-    else Neo4jVectorGraphStore.CacheIndexState.CREATING
+    else (
+        Neo4jVectorGraphStore.CacheIndexState.CREATING
+        if record["state"] == "POPULATING"
+        else record["state"]
+    )
```
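An explicit state map along the lines the reviewer suggests might look like the sketch below. The enum members and the `FAILED`/`UNKNOWN` states are assumptions for illustration, not the PR's actual `CacheIndexState` definition.

```python
from enum import Enum


class CacheIndexState(str, Enum):
    ONLINE = "ONLINE"
    CREATING = "CREATING"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


# Explicit mapping of known Neo4j index states; anything unrecognized is
# surfaced as UNKNOWN rather than being treated as "still creating".
_STATE_MAP = {
    "ONLINE": CacheIndexState.ONLINE,
    "POPULATING": CacheIndexState.CREATING,
    "FAILED": CacheIndexState.FAILED,
}


def classify_index_state(raw_state: str) -> CacheIndexState:
    return _STATE_MAP.get(raw_state, CacheIndexState.UNKNOWN)


print(classify_index_state("POPULATING").name)
```

Callers can then make a deterministic decision per state (wait on `CREATING`, recreate on `FAILED`, log and re-query on `UNKNOWN`) instead of looping on a catch-all "creating" bucket.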
```python
if not skill_file.exists():
    skill_file.write_text(normalized_markdown, encoding="utf-8")
```
The exists() check + write_text() sequence is not atomic and can race across processes/threads materializing the same bundle directory (e.g., concurrent sessions starting). Consider using an atomic write strategy (write to a temp file then replace) or opening the file in exclusive-create mode so concurrent writers don’t interleave/partially write SKILL.md.
```diff
-if not skill_file.exists():
-    skill_file.write_text(normalized_markdown, encoding="utf-8")
+try:
+    with skill_file.open("x", encoding="utf-8") as f:
+        f.write(normalized_markdown)
+except FileExistsError:
+    # Another process/thread has already materialized this skill file.
+    # The contents are deterministic, so we can safely reuse it.
+    pass
```
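An alternative to exclusive-create mode is a write-to-temp-then-replace strategy, which also lets later writers refresh the file safely. This standalone sketch (the function name is illustrative) relies on `os.replace` being atomic when the temp file lives on the same filesystem as the target:

```python
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, text: str) -> None:
    """Write text to path atomically.

    Concurrent writers never leave a partially written file visible at
    `path`; the last completed replace wins.
    """
    # Create the temp file in the destination directory so os.replace
    # stays on one filesystem (a cross-device rename is not atomic).
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp_name, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_name)  # clean up the temp file on failure
        raise


skill_dir = Path(tempfile.mkdtemp())
write_atomic(skill_dir / "SKILL.md", "# demo skill\n")
print((skill_dir / "SKILL.md").read_text(encoding="utf-8"))
```

Exclusive-create is simpler when the content is immutable; the temp-file approach is preferable if the bundle may ever need to be rewritten in place.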
```python
if not cache_path.exists():
    return {}

raw = json.loads(cache_path.read_text(encoding="utf-8"))
```
If the cache file exists but is corrupted/truncated, json.loads(...) will raise and break install_skill() entirely. Consider catching json.JSONDecodeError and either treating it as a cache miss (return {}) or raising a clearer error that instructs users how to delete/repair the cache file.
```diff
-raw = json.loads(cache_path.read_text(encoding="utf-8"))
+try:
+    text = cache_path.read_text(encoding="utf-8")
+    raw = json.loads(text)
+except (OSError, json.JSONDecodeError):
+    # Treat unreadable or corrupted cache as a cache miss.
+    return {}
```
```python
    return cache


def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:
```
When cache_path is set to a location whose parent directory doesn’t exist, write_text() will fail. Consider ensuring cache_path.parent.mkdir(parents=True, exist_ok=True) before writing so custom cache paths work reliably in CI and container environments.
```diff
 def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:
+    cache_path.parent.mkdir(parents=True, exist_ok=True)
```
```python
captured = capsys.readouterr()
# The content should appear somewhere in stdout
assert unique_content in captured.out or len(captured.out) >= 0  # results may vary
```
len(captured.out) >= 0 is always true, so this assertion can never fail (and the test won’t detect regressions). If results can legitimately vary, consider asserting something meaningful (e.g., unique_content in captured.out with retry/backoff; or at least captured.out.strip() != ""; or validating the JSON/line format and presence of any episodes).
```diff
-assert unique_content in captured.out or len(captured.out) >= 0  # results may vary
+assert unique_content in captured.out
```
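If search results genuinely can lag behind ingestion, a small polling helper keeps the assertion meaningful without making the test flaky. This is a generic sketch; `run_search_and_capture` in the usage comment is hypothetical and would wrap the CLI invocation from the test:

```python
import time
from typing import Callable


def wait_for(
    predicate: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.25,
) -> bool:
    """Poll predicate until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline


# Hypothetical usage in the CLI integration test: re-run the search
# until the ingested content becomes visible, then assert hard.
#   assert wait_for(lambda: run_search_and_capture(unique_content))
print(wait_for(lambda: True))
```

The test then fails only after a bounded wait, instead of either passing vacuously or failing on the first slow response.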
```python
def _get_openai_client() -> openai.OpenAI:
    global _openai_client
    if _openai_client is None:
        _openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
```
When --config is not provided and OPENAI_API_KEY is unset, openai.OpenAI(api_key=None) will likely fail later with a less actionable error. Consider validating the env var here and raising a clear exception (e.g., “OPENAI_API_KEY must be set when --config is not provided”) so benchmark runs fail fast with an actionable message.
```diff
-        _openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+        api_key = os.getenv("OPENAI_API_KEY")
+        if not api_key:
+            raise RuntimeError(
+                "OPENAI_API_KEY must be set when no eval_llm client is provided."
+            )
+        _openai_client = openai.OpenAI(api_key=api_key)
```
```python
import json
import os
import re
from dataclasses import dataclass, field
```
json and field are imported but not used in this module. Removing unused imports will keep lint/type-check output clean and avoid confusion about intended functionality.
```diff
-import json
 import os
 import re
-from dataclasses import dataclass, field
+from dataclasses import dataclass
```
```python
from typing import Any

from dotenv import load_dotenv
from memmachine_server.common.utils import async_with
```
async_with is imported but not used in the updated script. Removing it will avoid unused-import lint failures and keep the benchmark runner minimal.
```diff
-from memmachine_server.common.utils import async_with
```
Purpose of the change

Use multi-turn APIs and provider-native skill upload/attach methods to execute retrieval agent "prompts". The API calls are now more skill-like.

Adds a client-side helper that lets users use MemMachine as a retrieval skill.

Note that this change increases recall and accuracy for multi-hop questions by ~1-3%, but also increases token usage roughly 10x, from ~2k to ~20k tokens.

Refactored evaluation benchmarks. All retrieval_skill benchmarks now use the client-side installed skill and the REST API as the entry point to MemMachine, instead of the previous Python SDK + `agent_mode` API call.

Type of change
[Please delete options that are not relevant.]
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
[Please delete options that are not relevant.]
Test Results: [Attach logs, screenshots, or relevant output]
Checklist
[Please delete options that are not relevant.]
Maintainer Checklist