Skip to content

fix(semantic-memory): break ingestion retry loop on <think>-prefixed LLM responses#1327

Open
sscargal wants to merge 3 commits into
MemMachine:mainfrom
sscargal:bugfix/1270-semantic-ingestion-infinite-retry-loop
Open

fix(semantic-memory): break ingestion retry loop on <think>-prefixed LLM responses#1327
sscargal wants to merge 3 commits into
MemMachine:mainfrom
sscargal:bugfix/1270-semantic-ingestion-infinite-retry-loop

Conversation

@sscargal
Copy link
Copy Markdown
Contributor

Purpose of the change

Qwen3.5 (and other reasoning-mode models) prefix responses with ... blocks. The OpenAI Responses client returned this text verbatim, pydantic rejected it in llm_feature_update, and the generic except branch in _process_single_set re-queued the message on every 2-second ingestion tick — burning LLM tokens indefinitely for any user on a reasoning-mode model.

Description

Fix A (packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py): strip leading ... blocks at the LLM client boundary. Applied at both output_text return sites in _generate_response; added a fallback in generate_parsed_response that strips and parses via json_repair + TypeAdapter when the SDK's output_parsed is None.

Fix B (packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py): catch pydantic.ValidationError in process_semantic_type and mark the message uid as ingested, mirroring the existing context_length_exceeded skip path. Narrow to ValidationError only — architect ruled out a broad (ValueError, TypeError) catch because those leak from the embedder and storage call sites. Generic except Exception and the _debug_fail_loudly re-raise are preserved so unknown errors stay loud.

Fix A closes the immediate trigger; Fix B is defense-in-depth against the next permanent-failure shape (provider schema drift, truncated responses) that will surface the same way.

Tests:

  • test_openai_responses_language_model.py: 3 new cases (leading , multi-line , output_parsed=None fallback).
  • test_semantic_ingestion.py: ValidationError marks message as ingested, and a guard test confirming generic Exception does NOT mark it — defends against future over-broad widening of the catch.

Fixes/Closes

Fixes #1270

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Unit Test

Checklist

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

Screenshots/Gifs

N/A

Further comments

None

…LLM responses

Qwen3.5 (and other reasoning-mode models) prefix responses with <think>...</think>
blocks. The OpenAI Responses client returned this text verbatim, pydantic rejected
it in llm_feature_update, and the generic except branch in _process_single_set
re-queued the message on every 2-second ingestion tick — burning LLM tokens
indefinitely for any user on a reasoning-mode model.

Fix A (packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py):
strip leading <think>...</think> blocks at the LLM client boundary. Applied at
both output_text return sites in _generate_response; added a fallback in
generate_parsed_response that strips and parses via json_repair + TypeAdapter
when the SDK's output_parsed is None.

Fix B (packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py):
catch pydantic.ValidationError in process_semantic_type and mark the message
uid as ingested, mirroring the existing context_length_exceeded skip path.
Narrow to ValidationError only — architect ruled out a broad (ValueError,
TypeError) catch because those leak from the embedder and storage call sites.
Generic except Exception and the _debug_fail_loudly re-raise are preserved so
unknown errors stay loud.

Fix A closes the immediate trigger; Fix B is defense-in-depth against the next
permanent-failure shape (provider schema drift, truncated responses) that will
surface the same way.

Tests:
- test_openai_responses_language_model.py: 3 new cases (leading <think>,
  multi-line <think>, output_parsed=None fallback).
- test_semantic_ingestion.py: ValidationError marks message as ingested, and
  a guard test confirming generic Exception does NOT mark it — defends against
  future over-broad widening of the catch.

Refs: MemMachine#1270
Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an ingestion retry-loop in semantic memory ingestion caused by reasoning-mode LLM outputs that include <think>...</think> blocks, which previously led to repeated parsing failures and re-queueing.

Changes:

  • Strip <think>...</think> blocks from OpenAI Responses output_text, and add a fallback parse path when output_parsed is None.
  • Treat pydantic.ValidationError during semantic ingestion as non-retryable by marking the message as ingested (preventing infinite re-queues).
  • Add unit tests covering <think> stripping, parsed-response fallback parsing, and the new ValidationError ingestion behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py Marks messages as ingested on ValidationError to avoid permanent-failure retry loops.
packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py Strips <think> blocks from responses and adds fallback structured parsing when SDK parsing doesn’t produce output_parsed.
packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py Adds tests ensuring ValidationError marks ingested and generic exceptions do not.
packages/server/server_tests/memmachine_server/common/language_model/test_openai_responses_language_model.py Adds tests for <think> stripping and output_parsed=None fallback parsing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>
Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@edwinyyyu edwinyyyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I want the output to have a <think></think> block on purpose? What about another model that uses e.g. <reason></reason> blocks instead?

And would this apply to that model no matter which API it's going through? If so, I think some generalized optional sanitization utility would be preferable, that has many checkbox-like options for which sanitization operations to apply.


logger = logging.getLogger(__name__)

_THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the response API is only supported by openAI, which does not return the "think" block.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is using third-party models with the OpenAI API. OpenAI does a good job of avoiding this problem.

By baking the solution into the API wrapper instead of a sanitization layer, it can ruin intended outputs from a model that follows instructions well, and we have to play whack-a-mole on every single API which a model that does not follow instructions well may use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Dead loop to trigger memmachine_server.semantic_memory.semantic_ingestion when LLM reponse is invalid

4 participants