fix(mcp): pass ensure_ascii=False to json.dumps in mcp_bridge.py #1990
Open
devteamaegis wants to merge 2 commits into
added 2 commits
May 28, 2026 12:29
`NlpSentenceChunking.chunk()` was returning `list(set(sens))`. Python's `set` is unordered, so the returned chunks were in arbitrary order — not document order — and duplicate sentences were silently discarded. Fix: return `sens` directly, which preserves the order produced by `nltk.sent_tokenize` and keeps any intentional duplicates. Adds two regression tests that verify ordering and duplicate preservation without requiring a full crawl4ai install. Fixes unclecode#1909
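The ordering problem described in this commit can be reproduced without nltk. The sketch below is only an illustration: the hard-coded `sens` list is a hypothetical stand-in for the output of `nltk.sent_tokenize`, and `buggy` and `fixed` are illustrative names, not identifiers from the codebase.

```python
# Hypothetical stand-in for the sentences produced by nltk.sent_tokenize.
sens = ["First.", "Second.", "First.", "Third."]

# Old behavior: a set collapses duplicates and has no guaranteed order.
buggy = list(set(sens))

# New behavior: return the list as produced, in document order.
fixed = sens

print(len(buggy), len(fixed))  # the set-based result has lost one sentence
print(fixed)
```

The duplicate "First." survives only in `fixed`; the order of `buggy` may differ between runs because string hashing is randomized per process.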
All three `json.dumps()` calls in `deploy/docker/mcp_bridge.py` were using the default `ensure_ascii=True`, which escapes every non-ASCII codepoint to `\uXXXX` sequences. For CJK content this inflates token counts by 2.5-3x, raising costs and eating context budget for nothing. The HTTP REST API already returns UTF-8 natively; MCP tool results should behave the same.

Changes:
- `json.dumps(err)` → `json.dumps(err, ensure_ascii=False)`
- `json.dumps(res, default=str)` ×2 → `json.dumps(res, default=str, ensure_ascii=False)`

Adds two unit tests:
- `test_cjk_not_escaped_in_json_dumps`: baseline sanity check
- `test_mcp_bridge_serialize_uses_ensure_ascii_false`: AST-level verification that every `json.dumps` call in `mcp_bridge.py` passes `ensure_ascii=False`

Fixes unclecode#1962
Summary
All three `json.dumps()` calls in `deploy/docker/mcp_bridge.py` were using the default `ensure_ascii=True`, which escapes every non-ASCII codepoint (CJK, emoji, accented characters) to `\uXXXX` sequences.

For CJK content this inflates token counts by 2.5–3x, increasing LLM costs and eating context budget unnecessarily.

The HTTP REST API already returns native UTF-8, so MCP tool results should behave the same way.
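To make the escaping behavior concrete, here is a minimal sketch using only the standard library. The `result` payload is a made-up example, not data from the bridge, and character length is used only as a rough proxy for size; actual token counts depend on the tokenizer.

```python
import json

# Hypothetical tool result containing CJK text (not taken from the PR).
result = {"title": "你好，世界", "status": "ok"}

escaped = json.dumps(result)                     # default ensure_ascii=True
native = json.dumps(result, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(native)
print(len(escaped), len(native))

# Both forms decode to the same object; only the serialized text differs.
assert json.loads(escaped) == json.loads(native) == result
```

Because both strings are valid JSON for the same value, switching to `ensure_ascii=False` should not change what a conforming JSON parser on the client side sees.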
Fix
Tests
Added `tests/unit/test_mcp_bridge_ensure_ascii.py`:
- baseline sanity check of `ensure_ascii=False` behavior
- AST-level check that every `json.dumps` call in `mcp_bridge.py` includes `ensure_ascii=False`

Fixes #1962
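For readers unfamiliar with AST-level checks, the following is one way such a verification could be written. It is a sketch, not the test added in this PR: the helper name `calls_missing_ensure_ascii` and the inline `sample` source are hypothetical, and it only recognizes calls spelled literally as `json.dumps(...)`.

```python
import ast


def calls_missing_ensure_ascii(source: str) -> list:
    """Return line numbers of json.dumps(...) calls lacking ensure_ascii=False."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match only the literal attribute access `json.dumps`.
        is_json_dumps = (
            isinstance(func, ast.Attribute)
            and func.attr == "dumps"
            and isinstance(func.value, ast.Name)
            and func.value.id == "json"
        )
        if not is_json_dumps:
            continue
        # Require an explicit keyword ensure_ascii=False.
        has_flag = any(
            kw.arg == "ensure_ascii"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is False
            for kw in node.keywords
        )
        if not has_flag:
            missing.append(node.lineno)
    return sorted(missing)


# Hypothetical source text standing in for the file under test.
sample = '''import json
a = json.dumps(err)
b = json.dumps(res, default=str, ensure_ascii=False)
'''

print(calls_missing_ensure_ascii(sample))
```

A real test would presumably read the file contents from disk and assert that the returned list is empty. Aliased imports such as `from json import dumps` would slip past this simple matcher.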