Add client side MemMachine retrieval skill, and unify/skill-ify server side retrieval agent by Tianyang-Zhang · Pull Request #1277 · MemMachine/MemMachine

Tianyang-Zhang · 2026-03-26T23:14:42Z

Purpose of the change

Use multi-turn APIs and provider native skill upload/attach methods to execute retrieval agent "prompts". Now its more skill-like on the API calls.

Adding a client-side helper to allow user to use MemMachine as a retrieval skill.

Notice that this change increasing the recall and accuracy for multi-hop questions by ~1-3%, but also increasing the token usage 10x more, from ~2k to ~20k.

Refactored evaluation benchmarks. Now all retrieval_skill benchmarks use the client-side installed skill and REST API as the entry point of MemMachine, instead of the previous python SDK + agent_mode API call.

Type of change

[Please delete options that are not relevant.]

New feature (non-breaking change which adds functionality)
Refactor (does not change functionality, e.g., code style improvements, linting)
Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

[Please delete options that are not relevant.]

Test Results: [Attach logs, screenshots, or relevant output]

Checklist

[Please delete options that are not relevant.]

Maintainer Checklist

Confirmed all checks passed
Contributor has signed the commit(s)
Reviewed the code
Run, Tested, and Verified the change(s) work as expected

packages/server/src/memmachine_server/retrieval_agent/agents/retrieve_agent.py

Copilot

Pull request overview

This PR migrates retrieval-agent execution to provider-native multi-turn “skill” sessions (with strict contract validation and fallback policy), adds client-side skill/CLI helpers, and refactors evaluation to run via REST + installed skills while emitting a retrieval_trace in search results.

Changes:

Introduce strict v1 retrieval-agent request/result contract helpers, fallback policy, session-state tracing, and provider-native skill session wrappers (OpenAI + Anthropic).
Add shared provider-backed skill installation primitives (memmachine_common) and a new client CLI (memmachine) plus API surface updates (retrieval_trace, agent_mode serialization).
Refactor evaluation harness from legacy retrieval_agent to retrieval_skill, using REST entry points and configurable answer/judge models.

Reviewed changes

Copilot reviewed 83 out of 101 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
packages/server/src/memmachine_server/retrieval_agent/agents/runtime.py	New helpers to build/validate strict retrieval-agent v1 request/result contracts.
packages/server/src/memmachine_server/retrieval_agent/agents/memmachine_retriever.py	Removes legacy server-side MemMachineAgent retriever implementation.
packages/server/src/memmachine_server/retrieval_agent/agents/fallback_policy.py	Adds explicit retry/fallback decision policy for orchestration guardrails.
packages/server/src/memmachine_server/retrieval_agent/agents/coq_agent.py	Removes legacy server-side ChainOfQueryAgent implementation.
packages/server/src/memmachine_server/retrieval_agent/agents/init.py	Re-exports new contract/runtime/session-state APIs as the public surface.
packages/server/src/memmachine_server/common/vector_graph_store/neo4j_vector_graph_store.py	Changes Neo4j index cache population to avoid awaitIndexes by reading state.
packages/server/src/memmachine_server/common/language_model/provider_skill_bundle.py	Adds provider-neutral skill bundle materialization helper (SKILL.md bundles).
packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py	Exposes OpenAI client/model fields for live installed-skill sessions.
packages/server/src/memmachine_server/common/language_model/openai_chat_completions_language_model.py	Exposes OpenAI client/model fields for installed-skill execution.
packages/server/src/memmachine_server/common/language_model/init.py	Exports new skill-session model types and provider skill bundle helpers.
packages/server/src/memmachine_server/common/configuration/retrieval_config.py	Extends retrieval-agent config for provider-native sessions + Anthropic settings.
packages/server/server_tests/memmachine_server/server/api_v2/test_router.py	Adds coverage for `retrieval_trace` in REST search responses.
packages/server/server_tests/memmachine_server/server/api_v2/test_mcp.py	Asserts MCP search variants default `retrieval_trace` to None.
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_session_state.py	New tests for session-state merging and emitted orchestration metrics.
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_route_selection.py	New tests covering selected route/agent metrics and multi-tool behavior.
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_orchestration_loop.py	New tests for orchestration loop behaviors and trace payloads.
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_fallback_policy.py	New tests for fallback triggers (budgets, invalid tools, max steps).
packages/server/server_tests/memmachine_server/retrieval_agent/test_retrieve_agent_bootstrap.py	New tests for bootstrap path, error mapping, and skill bundle attachment.
packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_runtime.py	New tests for contract enforcement, normalization, and tool schema parsing.
packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_markdown_specs.py	New tests validating agent markdown specs structure and content anchors.
packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_markdown_prompt_parity.py	New tests ensuring spec/prompt parity and disallowing legacy tools.
packages/server/server_tests/memmachine_server/retrieval_agent/test_agent_interface_contract.py	New protocol conformance test for retrieval-agent interface.
packages/server/server_tests/memmachine_server/retrieval_agent/provider_runner_stub.py	Adds fake OpenAI installed-agent client + REST memory stubs for tests.
packages/server/server_tests/memmachine_server/main/test_memmachine_mock.py	Updates MemMachine mock tests to use RetrievalAgentProtocol and `retrieval_trace`.
packages/server/server_tests/memmachine_server/common/vector_graph_store/test_neo4j_vector_graph_store_unit.py	Adds unit test for new Neo4j index state cache behavior.
packages/server/server_tests/memmachine_server/common/language_model/test_skill_language_model.py	Adds tests for OpenAI skill-session wrapper (retries, tool call parsing).
packages/server/server_tests/memmachine_server/common/language_model/test_skill_anthropic_session_language_model.py	Adds tests for Anthropic skill sessions, tool chaining, skill attachment.
packages/server/server_tests/memmachine_server/common/language_model/test_provider_skill_bundle.py	Adds tests for provider skill bundle materialization stability.
packages/server/server_tests/memmachine_server/common/configuration/test_retrieval_config.py	Adds tests for new retrieval-agent config defaults and env resolution.
packages/server/pyproject.toml	Adds Anthropic dependency; adjusts ty extra-paths.
packages/common/src/memmachine_common/skill_installer.py	Adds shared skill installer that uploads markdown to provider Files APIs with caching.
packages/common/src/memmachine_common/skill.py	Adds immutable Skill metadata model for installed skills.
packages/common/src/memmachine_common/api/spec.py	Adds serialization alias for `agent_mode`; adds `retrieval_trace` to search response content.
packages/common/src/memmachine_common/api/doc.py	Documents `retrieval_trace` field.
packages/common/src/memmachine_common/init.py	Exposes Skill primitives (Skill, install_skill, SkillRunner).
packages/common/common_tests/test_skill_installer.py	Adds tests for provider uploads, caching, and missing SDK import errors.
packages/common/common_tests/test_skill.py	Adds tests for Skill immutability and validation.
packages/common/common_tests/init.py	Adds package marker for common tests.
packages/client/src/memmachine_client/cli.py	Adds `memmachine` CLI with `search` and `ingest` commands.
packages/client/src/memmachine_client/init.py	Reformats exports list.
packages/client/pyproject.toml	Adds optional deps (anthropic/openai) and CLI script entry point.
packages/client/client_tests/test_cli_integration.py	Adds integration tests for CLI (skipped by default).
packages/client/README.md	Updates client docs to include retrieval-agent mode, retrieval_trace, and CLI usage.
evaluation/tests/init.py	Adds evaluation tests package marker.
evaluation/retrieval_skill/wikimultihop_search.py	New retrieval_skill benchmark runner using REST + skill runner.
evaluation/retrieval_skill/wikimultihop_ingest.py	Refactors ingestion to REST message ingestion + warmup search.
evaluation/retrieval_skill/run_test.sh	New unified benchmark runner script with concurrency/config flags.
evaluation/retrieval_skill/longmemeval_test.py	Refactors LongMemEval ingest/search to REST + skill runner + configurable models.
evaluation/retrieval_skill/locomo_search.py	Refactors LoCoMo search to skill runner with configurable models.
evaluation/retrieval_skill/locomo_ingest.py	Refactors LoCoMo ingestion to REST episode ingestion + warmup searches.
evaluation/retrieval_skill/llm_judge.py	Refactors judge to use benchmark config or OpenAI env fallback without server config.
evaluation/retrieval_skill/hotpotQA_test.py	Refactors HotpotQA ingest/search to REST messages + skill runner + warmup.
evaluation/retrieval_skill/generate_scores.py	Updates scoring aggregation terminology from tools to skills.
evaluation/retrieval_skill/evaluate.py	Updates evaluator to use new judge config flow and store extra fields.
evaluation/retrieval_skill/benchmark_config.yml	Adds benchmark config for answer + evaluation models.
evaluation/retrieval_skill/benchmark_config.py	Adds benchmark config loader + unified LLM client wrapper.
evaluation/retrieval_skill/README.md	Documents retrieval_skill harness, config format, and usage.
evaluation/retrieval_agent/wikimultihop_search.py	Removes legacy retrieval_agent benchmark runner.
evaluation/retrieval_agent/test_run_test.py	Removes legacy run_test.sh behavior tests.
evaluation/retrieval_agent/test_preflight.py	Removes legacy preflight tests.
evaluation/retrieval_agent/test_locomo_search.py	Removes legacy locomo search parser tests.
evaluation/retrieval_agent/test_benchmark_concurrency.py	Removes legacy benchmark concurrency parser tests.
evaluation/retrieval_agent/requirements.txt	Removes legacy retrieval_agent requirements file.
evaluation/retrieval_agent/preflight.py	Removes legacy preflight script.
evaluation/retrieval_agent/cli_utils.py	Removes legacy argparse utilities.
evaluation/README.md	Updates top-level evaluation docs to recommend retrieval_skill suite.
docs/open_source/retrieval_agent_architecture.mdx	Updates architecture doc to reflect provider-native session + single-tool contract.
docs/open_source/configuration.mdx	Documents new retrieval_agent session provider settings (OpenAI/Anthropic).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-26T23:17:05Z

packages/server/src/memmachine_server/retrieval_agent/agents/runtime.py

+    """Validate output against the v1 agent result contract."""
+    try:
+        return AgentResultV1.model_validate(raw_result)
+    except ValidationError as first_err:
+        if normalizer is None:
+            raise AgentContractError(
+                code=AgentContractErrorCode.INVALID_OUTPUT,
+                payload=AgentContractErrorPayload(
+                    what_failed="Agent result validation failed",
+                    why=str(first_err),
+                    how_to_fix="Return a strict v1 agent result object.",
+                    where="agents.runtime.validate_agent_result",
+                    fallback_trigger_reason="invalid_agent_output",
+                ),
+                validation_error=first_err,
+            ) from first_err
+


The function has built-in tuple normalization support ((result, perf_metrics)), but that normalization is unreachable when normalizer=None because the function raises immediately on the first ValidationError. If raw_result is a legacy (payload, metrics) tuple, it will always fail unless a normalizer is provided. Consider applying tuple normalization before raising (e.g., attempt _normalize_result_from_tuple(raw_result, route_name=...) and re-validate) or treating tuple normalization as the default normalizer when normalizer is not provided.

Copilot · 2026-03-26T23:17:05Z

packages/server/src/memmachine_server/common/vector_graph_store/neo4j_vector_graph_store.py

+                            record["name"]: (
+                                Neo4jVectorGraphStore.CacheIndexState.ONLINE
+                                if record["state"] == "ONLINE"
+                                else Neo4jVectorGraphStore.CacheIndexState.CREATING


Mapping every non-ONLINE index state to CREATING can misclassify terminal or error states (e.g., FAILED) as “still creating.” If downstream logic relies on this cache to decide whether to wait/retry/use an index, this can cause incorrect behavior or endless “creating” semantics. Suggest explicitly mapping known states (e.g., POPULATING/ONLINE) and either preserving unknown states in the cache (as a string) or introducing a distinct cache state for failure/unknown so callers can handle it deterministically.

Suggested change

else Neo4jVectorGraphStore.CacheIndexState.CREATING

else (

Neo4jVectorGraphStore.CacheIndexState.CREATING

if record["state"] == "POPULATING"

else record["state"]

)

Copilot · 2026-03-26T23:17:06Z

packages/server/src/memmachine_server/common/language_model/provider_skill_bundle.py

+    if not skill_file.exists():
+        skill_file.write_text(normalized_markdown, encoding="utf-8")


The exists() check + write_text() sequence is not atomic and can race across processes/threads materializing the same bundle directory (e.g., concurrent sessions starting). Consider using an atomic write strategy (write to a temp file then replace) or opening the file in exclusive-create mode so concurrent writers don’t interleave/partially write SKILL.md.

Suggested change

if not skill_file.exists():

skill_file.write_text(normalized_markdown, encoding="utf-8")

try:

with skill_file.open("x", encoding="utf-8") as f:

f.write(normalized_markdown)

except FileExistsError:

# Another process/thread has already materialized this skill file.

# The contents are deterministic, so we can safely reuse it.

pass

Copilot · 2026-03-26T23:17:06Z

packages/common/src/memmachine_common/skill_installer.py

+    if not cache_path.exists():
+        return {}
+
+    raw = json.loads(cache_path.read_text(encoding="utf-8"))


If the cache file exists but is corrupted/truncated, json.loads(...) will raise and break install_skill() entirely. Consider catching json.JSONDecodeError and either treating it as a cache miss (return {}) or raising a clearer error that instructs users how to delete/repair the cache file.

Suggested change

raw = json.loads(cache_path.read_text(encoding="utf-8"))

try:

text = cache_path.read_text(encoding="utf-8")

raw = json.loads(text)

except (OSError, json.JSONDecodeError):

# Treat unreadable or corrupted cache as a cache miss.

return {}

Copilot · 2026-03-26T23:17:06Z

packages/common/src/memmachine_common/skill_installer.py

+    return cache
+
+
+def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:


When cache_path is set to a location whose parent directory doesn’t exist, write_text() will fail. Consider ensuring cache_path.parent.mkdir(parents=True, exist_ok=True) before writing so custom cache paths work reliably in CI and container environments.

Suggested change

def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:

def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:

cache_path.parent.mkdir(parents=True, exist_ok=True)

Copilot · 2026-03-26T23:17:06Z

packages/client/client_tests/test_cli_integration.py

+
+        captured = capsys.readouterr()
+        # The content should appear somewhere in stdout
+        assert unique_content in captured.out or len(captured.out) >= 0  # results may vary


len(captured.out) >= 0 is always true, so this assertion can never fail (and the test won’t detect regressions). If results can legitimately vary, consider asserting something meaningful (e.g., unique_content in captured.out with retry/backoff; or at least captured.out.strip() != ""; or validating the JSON/line format and presence of any episodes).

Suggested change

assert unique_content in captured.out or len(captured.out) >= 0 # results may vary

assert unique_content in captured.out

Copilot · 2026-03-26T23:17:07Z

evaluation/retrieval_skill/llm_judge.py

+def _get_openai_client() -> openai.OpenAI:
+    global _openai_client
+    if _openai_client is None:
+        _openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


When --config is not provided and OPENAI_API_KEY is unset, openai.OpenAI(api_key=None) will likely fail later with a less actionable error. Consider validating the env var here and raising a clear exception (e.g., “OPENAI_API_KEY must be set when --config is not provided”) so benchmark runs fail fast with an actionable message.

Suggested change

_openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

api_key = os.getenv("OPENAI_API_KEY")

if not api_key:

raise RuntimeError(

"OPENAI_API_KEY must be set when no eval_llm client is provided."

)

_openai_client = openai.OpenAI(api_key=api_key)

Copilot · 2026-03-26T23:17:07Z

evaluation/retrieval_skill/benchmark_config.py

+import json
+import os
+import re
+from dataclasses import dataclass, field


json and field are imported but not used in this module. Removing unused imports will keep lint/type-check output clean and avoid confusion about intended functionality.

Suggested change

import json

import os

import re

from dataclasses import dataclass, field

import os

import re

from dataclasses import dataclass

Copilot · 2026-03-26T23:17:07Z

evaluation/retrieval_skill/locomo_search.py

 from typing import Any

 from dotenv import load_dotenv
+from memmachine_server.common.utils import async_with


async_with is imported but not used in the updated script. Removing it will avoid unused-import lint failures and keep the benchmark runner minimal.

Suggested change

from memmachine_server.common.utils import async_with

Tianyang-Zhang added 15 commits March 25, 2026 22:31

Remove agent and implement retrieval skill

d0b97d8

Change retrieval skill to use OpenAI and Anthropic native skill workflow

103b703

Accuracy optimization

d5aead8

Improve retrieval answer hint reliability for wikimultihop

f8111ce

Add verification pass for risky retrieval answers

477c27f

Harden fallback and verification for low-sufficiency answers

c3204e0

Support client-side skill install and use

afd87aa

Unify client and server side skill logic

300a3f0

Move SkillRunner from client to common code section

900acd8

Include semantic memory search

324dc2f

Bug fixes and optimization

eecddf8

Fix answer and evaluation LLM configurable

d9d09c0

Fix COQ skill branch

b9af61a

Fix cold start searching timeout bug

254c54e

Remove stale docs and update docs

8c66bf8

Tianyang-Zhang requested review from SarahScargall, Copilot, malatewang, o-love and sscargal March 26, 2026 23:14

github-advanced-security bot found potential problems Mar 26, 2026

View reviewed changes

packages/server/src/memmachine_server/retrieval_agent/agents/retrieve_agent.py Fixed Show fixed Hide fixed

packages/server/src/memmachine_server/retrieval_agent/agents/retrieve_agent.py Fixed Show fixed Hide fixed

Copilot AI reviewed Mar 26, 2026

View reviewed changes

Tianyang-Zhang added 2 commits March 27, 2026 00:14

Issue fixes

bf4ecc6

Fix lint

48d7fd7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add client side MemMachine retrieval skill, and unify/skill-ify server side retrieval agent#1277

Add client side MemMachine retrieval skill, and unify/skill-ify server side retrieval agent#1277
Tianyang-Zhang wants to merge 17 commits intomainfrom
skill

Tianyang-Zhang commented Mar 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		if not skill_file.exists():
		skill_file.write_text(normalized_markdown, encoding="utf-8")

-    if not skill_file.exists():
-        skill_file.write_text(normalized_markdown, encoding="utf-8")
+    try:
+        with skill_file.open("x", encoding="utf-8") as f:
+            f.write(normalized_markdown)
+    except FileExistsError:
+        # Another process/thread has already materialized this skill file.
+        # The contents are deterministic, so we can safely reuse it.
+        pass

-    raw = json.loads(cache_path.read_text(encoding="utf-8"))
+    try:
+        text = cache_path.read_text(encoding="utf-8")
+        raw = json.loads(text)
+    except (OSError, json.JSONDecodeError):
+        # Treat unreadable or corrupted cache as a cache miss.
+        return {}

		return cache


		def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:

	def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:
	def _save_cache(cache_path: Path, cache: dict[str, dict[str, str]]) -> None:
	cache_path.parent.mkdir(parents=True, exist_ok=True)

	assert unique_content in captured.out or len(captured.out) >= 0 # results may vary
	assert unique_content in captured.out

-        _openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+        api_key = os.getenv("OPENAI_API_KEY")
+        if not api_key:
+            raise RuntimeError(
+                "OPENAI_API_KEY must be set when no eval_llm client is provided."
+            )
+        _openai_client = openai.OpenAI(api_key=api_key)

Conversation

Tianyang-Zhang commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose of the change

Type of change

How Has This Been Tested?

Checklist

Maintainer Checklist

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Tianyang-Zhang commented Mar 26, 2026 •

edited

Loading