Skip to content

Add client side MemMachine retrieval skill, and unify/skill-ify server side retrieval agent#1277

Open
Tianyang-Zhang wants to merge 17 commits intomainfrom
skill
Open

Add client side MemMachine retrieval skill, and unify/skill-ify server side retrieval agent#1277
Tianyang-Zhang wants to merge 17 commits intomainfrom
skill

Conversation

@Tianyang-Zhang
Copy link
Copy Markdown
Contributor

@Tianyang-Zhang Tianyang-Zhang commented Mar 26, 2026

Purpose of the change

Use multi-turn APIs and provider native skill upload/attach methods to execute retrieval agent "prompts". Now its more skill-like on the API calls.

Adding a client-side helper to allow user to use MemMachine as a retrieval skill.

Notice that this change increasing the recall and accuracy for multi-hop questions by ~1-3%, but also increasing the token usage 10x more, from ~2k to ~20k.

Refactored evaluation benchmarks. Now all retrieval_skill benchmarks use the client-side installed skill and REST API as the entry point of MemMachine, instead of the previous python SDK + agent_mode API call.

Type of change

[Please delete options that are not relevant.]

  • New feature (non-breaking change which adds functionality)
  • Refactor (does not change functionality, e.g., code style improvements, linting)
  • Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

[Please delete options that are not relevant.]

  • Unit Test
  • Integration Test
  • End-to-end Test
  • Test Script (please provide)
  • Manual verification (list step-by-step instructions)

Test Results: [Attach logs, screenshots, or relevant output]

Checklist

[Please delete options that are not relevant.]

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates retrieval-agent execution to provider-native multi-turn “skill” sessions (with strict contract validation and fallback policy), adds client-side skill/CLI helpers, and refactors evaluation to run via REST + installed skills while emitting a retrieval_trace in search results.

Changes:

  • Introduce strict v1 retrieval-agent request/result contract helpers, fallback policy, session-state tracing, and provider-native skill session wrappers (OpenAI + Anthropic).
  • Add shared provider-backed skill installation primitives (memmachine_common) and a new client CLI (memmachine) plus API surface updates (retrieval_trace, agent_mode serialization).
  • Refactor evaluation harness from legacy retrieval_agent to retrieval_skill, using REST entry points and configurable answer/judge models.

Reviewed changes

Copilot reviewed 83 out of 101 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
packages/server/src/memmachine_server/retrieval_agent/agents/runtime.py New helpers to build/validate strict retrieval-agent v1 request/result contracts.
packages/server/src/memmachine_server/retrieval_agent/agents/memmachine_retriever.py Removes legacy server-side MemMachineAgent retriever implementation.
packages/server/src/memmachine_server/retrieval_agent/agents/fallback_policy.py Adds explicit retry/fallback decision policy for orchestration guardrails.
packages/server/src/memmachine_server/retrieval_agent/agents/coq_agent.py Removes legacy server-side ChainOfQueryAgent implementation.
packages/server/src/memmachine_server/retrieval_agent/agents/init.py Re-exports new contract/runtime/session-state APIs as the public surface.
packages/server/src/memmachine_server/common/vector_graph_store/neo4j_vector_graph_store.py Changes Neo4j index cache population to avoid awaitIndexes by reading state.
packages/server/src/memmachine_server/common/language_model/provider_skill_bundle.py Adds provider-neutral skill bundle materialization helper (SKILL.md bundles).
packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py Exposes OpenAI client/model fields for live installed-skill sessions.
packages/server/src/memmachine_server/common/language_model/openai_chat_completions_language_model.py Exposes OpenAI client/model fields for installed-skill execution.
packages/server/src/memmachine_server/common/language_model/init.py Exports new skill-session model types and provider skill bundle helpers.
packages/server/src/memmachine_server/common/configuration/retrieval_config.py Extends retrieval-agent config for provider-native sessions + Anthropic settings.
packages/server/server_tests/memmachine_server/server/api_v2/test_router.py Adds coverage for retrieval_trace in REST search responses.
packages/server/server_tests/memmachine_server/server/api_v2/test_mcp.py Asserts MCP search variants default retrieval_trace to None.
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_session_state.py New tests for session-state merging and emitted orchestration metrics.
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_route_selection.py New tests covering selected route/agent metrics and multi-tool behavior.
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_orchestration_loop.py New tests for orchestration loop behaviors and trace payloads.
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_fallback_policy.py New tests for fallback triggers (budgets, invalid tools, max steps).
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_bootstrap.py New tests for bootstrap path, error mapping, and skill bundle attachment.
packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_runtime.py New tests for contract enforcement, normalization, and tool schema parsing.
packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_markdown_specs.py New tests validating agent markdown specs structure and content anchors.
packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_markdown_prompt_parity.py New tests ensuring spec/prompt parity and disallowing legacy tools.
packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_interface_contract.py New protocol conformance test for retrieval-agent interface.
packages/server/server_tests/memmachine_server/retrieval_agent/provider_runner_stub.py Adds fake OpenAI installed-agent client + REST memory stubs for tests.
packages/server/server_tests/memmachine_server/main/test_memmachine_mock.py Updates MemMachine mock tests to use RetrievalAgentProtocol and retrieval_trace.
packages/server/server_tests/memmachine_server/common/vector_graph_store/test_neo4j_vector_graph_store_unit.py Adds unit test for new Neo4j index state cache behavior.
packages/server/server_tests/memmachine_server/common/language_model/test_skill_language_model.py Adds tests for OpenAI skill-session wrapper (retries, tool call parsing).
packages/server/server_tests/memmachine_server/common/language_model/test_skill_anthropic_session_language_model.py Adds tests for Anthropic skill sessions, tool chaining, skill attachment.
packages/server/server_tests/memmachine_server/common/language_model/test_provider_skill_bundle.py Adds tests for provider skill bundle materialization stability.
packages/server/server_tests/memmachine_server/common/configuration/test_retrieval_config.py Adds tests for new retrieval-agent config defaults and env resolution.
packages/server/pyproject.toml Adds Anthropic dependency; adjusts ty extra-paths.
packages/common/src/memmachine_common/skill_installer.py Adds shared skill installer that uploads markdown to provider Files APIs with caching.
packages/common/src/memmachine_common/skill.py Adds immutable Skill metadata model for installed skills.
packages/common/src/memmachine_common/api/spec.py Adds serialization alias for agent_mode; adds retrieval_trace to search response content.
packages/common/src/memmachine_common/api/doc.py Documents retrieval_trace field.
packages/common/src/memmachine_common/init.py Exposes Skill primitives (Skill, install_skill, SkillRunner).
packages/common/common_tests/test_skill_installer.py Adds tests for provider uploads, caching, and missing SDK import errors.
packages/common/common_tests/test_skill.py Adds tests for Skill immutability and validation.
packages/common/common_tests/init.py Adds package marker for common tests.
packages/client/src/memmachine_client/cli.py Adds memmachine CLI with search and ingest commands.
packages/client/src/memmachine_client/init.py Reformats exports list.
packages/client/pyproject.toml Adds optional deps (anthropic/openai) and CLI script entry point.
packages/client/client_tests/test_cli_integration.py Adds integration tests for CLI (skipped by default).
packages/client/README.md Updates client docs to include retrieval-agent mode, retrieval_trace, and CLI usage.
evaluation/tests/init.py Adds evaluation tests package marker.
evaluation/retrieval_skill/wikimultihop_search.py New retrieval_skill benchmark runner using REST + skill runner.
evaluation/retrieval_skill/wikimultihop_ingest.py Refactors ingestion to REST message ingestion + warmup search.
evaluation/retrieval_skill/run_test.sh New unified benchmark runner script with concurrency/config flags.
evaluation/retrieval_skill/longmemeval_test.py Refactors LongMemEval ingest/search to REST + skill runner + configurable models.
evaluation/retrieval_skill/locomo_search.py Refactors LoCoMo search to skill runner with configurable models.
evaluation/retrieval_skill/locomo_ingest.py Refactors LoCoMo ingestion to REST episode ingestion + warmup searches.
evaluation/retrieval_skill/llm_judge.py Refactors judge to use benchmark config or OpenAI env fallback without server config.
evaluation/retrieval_skill/hotpotQA_test.py Refactors HotpotQA ingest/search to REST messages + skill runner + warmup.
evaluation/retrieval_skill/generate_scores.py Updates scoring aggregation terminology from tools to skills.
evaluation/retrieval_skill/evaluate.py Updates evaluator to use new judge config flow and store extra fields.
evaluation/retrieval_skill/benchmark_config.yml Adds benchmark config for answer + evaluation models.
evaluation/retrieval_skill/benchmark_config.py Adds benchmark config loader + unified LLM client wrapper.
evaluation/retrieval_skill/README.md Documents retrieval_skill harness, config format, and usage.
evaluation/retrieval_agent/wikimultihop_search.py Removes legacy retrieval_agent benchmark runner.
evaluation/retrieval_agent/test_run_test.py Removes legacy run_test.sh behavior tests.
evaluation/retrieval_agent/test_preflight.py Removes legacy preflight tests.
evaluation/retrieval_agent/test_locomo_search.py Removes legacy locomo search parser tests.
evaluation/retrieval_agent/test_benchmark_concurrency.py Removes legacy benchmark concurrency parser tests.
evaluation/retrieval_agent/requirements.txt Removes legacy retrieval_agent requirements file.
evaluation/retrieval_agent/preflight.py Removes legacy preflight script.
evaluation/retrieval_agent/cli_utils.py Removes legacy argparse utilities.
evaluation/README.md Updates top-level evaluation docs to recommend retrieval_skill suite.
docs/open_source/retrieval_agent_architecture.mdx Updates architecture doc to reflect provider-native session + single-tool contract.
docs/open_source/configuration.mdx Documents new retrieval_agent session provider settings (OpenAI/Anthropic).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +74 to +90
"""Validate output against the v1 agent result contract."""
try:
return AgentResultV1.model_validate(raw_result)
except ValidationError as first_err:
if normalizer is None:
raise AgentContractError(
code=AgentContractErrorCode.INVALID_OUTPUT,
payload=AgentContractErrorPayload(
what_failed="Agent result validation failed",
why=str(first_err),
how_to_fix="Return a strict v1 agent result object.",
where="agents.runtime.validate_agent_result",
fallback_trigger_reason="invalid_agent_output",
),
validation_error=first_err,
) from first_err

Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function has built-in tuple normalization support ((result, perf_metrics)), but that normalization is unreachable when normalizer=None because the function raises immediately on the first ValidationError. If raw_result is a legacy (payload, metrics) tuple, it will always fail unless a normalizer is provided. Consider applying tuple normalization before raising (e.g., attempt _normalize_result_from_tuple(raw_result, route_name=...) and re-validate) or treating tuple normalization as the default normalizer when normalizer is not provided.

Copilot uses AI. Check for mistakes.
record["name"]: (
Neo4jVectorGraphStore.CacheIndexState.ONLINE
if record["state"] == "ONLINE"
else Neo4jVectorGraphStore.CacheIndexState.CREATING
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping every non-ONLINE index state to CREATING can misclassify terminal or error states (e.g., FAILED) as “still creating.” If downstream logic relies on this cache to decide whether to wait/retry/use an index, this can cause incorrect behavior or endless “creating” semantics. Suggest explicitly mapping known states (e.g., POPULATING/ONLINE) and either preserving unknown states in the cache (as a string) or introducing a distinct cache state for failure/unknown so callers can handle it deterministically.

Suggested change
else Neo4jVectorGraphStore.CacheIndexState.CREATING
else (
Neo4jVectorGraphStore.CacheIndexState.CREATING
if record["state"] == "POPULATING"
else record["state"]
)

Copilot uses AI. Check for mistakes.
Comment on lines +48 to +49
if not skill_file.exists():
skill_file.write_text(normalized_markdown, encoding="utf-8")
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exists() check + write_text() sequence is not atomic and can race across processes/threads materializing the same bundle directory (e.g., concurrent sessions starting). Consider using an atomic write strategy (write to a temp file then replace) or opening the file in exclusive-create mode so concurrent writers don’t interleave/partially write SKILL.md.

Suggested change
if not skill_file.exists():
skill_file.write_text(normalized_markdown, encoding="utf-8")
try:
with skill_file.open("x", encoding="utf-8") as f:
f.write(normalized_markdown)
except FileExistsError:
# Another process/thread has already materialized this skill file.
# The contents are deterministic, so we can safely reuse it.
pass

Copilot uses AI. Check for mistakes.
if not cache_path.exists():
return {}

raw = json.loads(cache_path.read_text(encoding="utf-8"))
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cache file exists but is corrupted/truncated, json.loads(...) will raise and break install_skill() entirely. Consider catching json.JSONDecodeError and either treating it as a cache miss (return {}) or raising a clearer error that instructs users how to delete/repair the cache file.

Suggested change
raw = json.loads(cache_path.read_text(encoding="utf-8"))
try:
text = cache_path.read_text(encoding="utf-8")
raw = json.loads(text)
except (OSError, json.JSONDecodeError):
# Treat unreadable or corrupted cache as a cache miss.
return {}

Copilot uses AI. Check for mistakes.
return cache


def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When cache_path is set to a location whose parent directory doesn’t exist, write_text() will fail. Consider ensuring cache_path.parent.mkdir(parents=True, exist_ok=True) before writing so custom cache paths work reliably in CI and container environments.

Suggested change
def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:
def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:
cache_path.parent.mkdir(parents=True, exist_ok=True)

Copilot uses AI. Check for mistakes.

captured = capsys.readouterr()
# The content should appear somewhere in stdout
assert unique_content in captured.out or len(captured.out) >= 0 # results may vary
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(captured.out) >= 0 is always true, so this assertion can never fail (and the test won’t detect regressions). If results can legitimately vary, consider asserting something meaningful (e.g., unique_content in captured.out with retry/backoff; or at least captured.out.strip() != ""; or validating the JSON/line format and presence of any episodes).

Suggested change
assert unique_content in captured.out or len(captured.out) >= 0 # results may vary
assert unique_content in captured.out

Copilot uses AI. Check for mistakes.
def _get_openai_client() -> openai.OpenAI:
global _openai_client
if _openai_client is None:
_openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When --config is not provided and OPENAI_API_KEY is unset, openai.OpenAI(api_key=None) will likely fail later with a less actionable error. Consider validating the env var here and raising a clear exception (e.g., “OPENAI_API_KEY must be set when --config is not provided”) so benchmark runs fail fast with an actionable message.

Suggested change
_openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise RuntimeError(
"OPENAI_API_KEY must be set when no eval_llm client is provided."
)
_openai_client = openai.OpenAI(api_key=api_key)

Copilot uses AI. Check for mistakes.
Comment on lines +19 to +22
import json
import os
import re
from dataclasses import dataclass, field
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json and field are imported but not used in this module. Removing unused imports will keep lint/type-check output clean and avoid confusion about intended functionality.

Suggested change
import json
import os
import re
from dataclasses import dataclass, field
import os
import re
from dataclasses import dataclass

Copilot uses AI. Check for mistakes.
from typing import Any

from dotenv import load_dotenv
from memmachine_server.common.utils import async_with
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

async_with is imported but not used in the updated script. Removing it will avoid unused-import lint failures and keep the benchmark runner minimal.

Suggested change
from memmachine_server.common.utils import async_with

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants