fix(semantic-memory): break ingestion retry loop on <think>-prefixed LLM responses#1327
Conversation
…LLM responses Qwen3.5 (and other reasoning-mode models) prefix responses with <think>...</think> blocks. The OpenAI Responses client returned this text verbatim, pydantic rejected it in llm_feature_update, and the generic except branch in _process_single_set re-queued the message on every 2-second ingestion tick — burning LLM tokens indefinitely for any user on a reasoning-mode model. Fix A (packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py): strip leading <think>...</think> blocks at the LLM client boundary. Applied at both output_text return sites in _generate_response; added a fallback in generate_parsed_response that strips and parses via json_repair + TypeAdapter when the SDK's output_parsed is None. Fix B (packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py): catch pydantic.ValidationError in process_semantic_type and mark the message uid as ingested, mirroring the existing context_length_exceeded skip path. Narrow to ValidationError only — architect ruled out a broad (ValueError, TypeError) catch because those leak from the embedder and storage call sites. Generic except Exception and the _debug_fail_loudly re-raise are preserved so unknown errors stay loud. Fix A closes the immediate trigger; Fix B is defense-in-depth against the next permanent-failure shape (provider schema drift, truncated responses) that will surface the same way. Tests: - test_openai_responses_language_model.py: 3 new cases (leading <think>, multi-line <think>, output_parsed=None fallback). - test_semantic_ingestion.py: ValidationError marks message as ingested, and a guard test confirming generic Exception does NOT mark it — defends against future over-broad widening of the catch. Refs: MemMachine#1270 Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Fixes an ingestion retry-loop in semantic memory ingestion caused by reasoning-mode LLM outputs that include <think>...</think> blocks, which previously led to repeated parsing failures and re-queueing.
Changes:
- Strip
<think>...</think>blocks from OpenAI Responsesoutput_text, and add a fallback parse path whenoutput_parsedisNone. - Treat
pydantic.ValidationErrorduring semantic ingestion as non-retryable by marking the message as ingested (preventing infinite re-queues). - Add unit tests covering
<think>stripping, parsed-response fallback parsing, and the new ValidationError ingestion behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py | Marks messages as ingested on ValidationError to avoid permanent-failure retry loops. |
| packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py | Strips <think> blocks from responses and adds fallback structured parsing when SDK parsing doesn’t produce output_parsed. |
| packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py | Adds tests ensuring ValidationError marks ingested and generic exceptions do not. |
| packages/server/server_tests/memmachine_server/common/language_model/test_openai_responses_language_model.py | Adds tests for <think> stripping and output_parsed=None fallback parsing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>
Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>
There was a problem hiding this comment.
What if I want the output to have a <think></think> block on purpose? What about another model that uses e.g. <reason></reason> blocks instead?
And would this apply to that model no matter which API it's going through? If so, I think some generalized optional sanitization utility would be preferable, that has many checkbox-like options for which sanitization operations to apply.
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| _THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL) |
There was a problem hiding this comment.
I think the response API is only supported by openAI, which does not return the "think" block.
There was a problem hiding this comment.
I think the issue is using third-party models with the OpenAI API. OpenAI does a good job of avoiding this problem.
By baking the solution into the API wrapper instead of a sanitization layer, it can ruin intended outputs from a model that follows instructions well, and we have to play whack-a-mole on every single API which a model that does not follow instructions well may use.
Purpose of the change
Qwen3.5 (and other reasoning-mode models) prefix responses with ... blocks. The OpenAI Responses client returned this text verbatim, pydantic rejected it in llm_feature_update, and the generic except branch in _process_single_set re-queued the message on every 2-second ingestion tick — burning LLM tokens indefinitely for any user on a reasoning-mode model.
Description
Fix A (packages/server/src/memmachine_server/common/language_model/openai_responses_language_model.py): strip leading ... blocks at the LLM client boundary. Applied at both output_text return sites in _generate_response; added a fallback in generate_parsed_response that strips and parses via json_repair + TypeAdapter when the SDK's output_parsed is None.
Fix B (packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py): catch pydantic.ValidationError in process_semantic_type and mark the message uid as ingested, mirroring the existing context_length_exceeded skip path. Narrow to ValidationError only — architect ruled out a broad (ValueError, TypeError) catch because those leak from the embedder and storage call sites. Generic except Exception and the _debug_fail_loudly re-raise are preserved so unknown errors stay loud.
Fix A closes the immediate trigger; Fix B is defense-in-depth against the next permanent-failure shape (provider schema drift, truncated responses) that will surface the same way.
Tests:
Fixes/Closes
Fixes #1270
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Checklist
Maintainer Checklist
Screenshots/Gifs
N/A
Further comments
None