fix(mcp): pass ensure_ascii=False to json.dumps in mcp_bridge.py #1990

Open
devteamaegis wants to merge 2 commits into unclecode:main from devteamaegis:fix/mcp-bridge-ensure-ascii-1962

@devteamaegis

Summary

All three json.dumps() calls in deploy/docker/mcp_bridge.py were using the
default ensure_ascii=True, which escapes every non-ASCII codepoint (CJK, emoji,
accented characters) to \uXXXX sequences.

For CJK content this inflates token counts by 2.5–3x (the overhead reported in
#1962), increasing LLM costs and consuming context budget unnecessarily.
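
The escaping behavior described above can be reproduced with the standard library alone. The snippet below is a minimal sketch, not code from this PR: the `payload` value is made up for illustration, and it compares character lengths only, which is a rough proxy for token counts, since those depend on the tokenizer in use.

```python
import json

# Hypothetical payload for illustration; not taken from mcp_bridge.py.
payload = {"title": "你好，世界", "emoji": "🚀"}

escaped = json.dumps(payload)                     # default: ensure_ascii=True
native = json.dumps(payload, ensure_ascii=False)  # keeps non-ASCII characters as-is

print(escaped)  # non-ASCII characters appear as \uXXXX escapes
print(native)   # non-ASCII characters appear unchanged
print(len(escaped), len(native))

# Both strings are valid JSON and decode to the same object.
assert json.loads(escaped) == json.loads(native) == payload
```

Because both forms decode to the same object, switching the flag should change only the size of the serialized text, not its meaning for any JSON-compliant consumer.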

The HTTP REST API already returns native UTF-8 — MCP tool results should behave
the same way.

Fix

# Before
return [t.TextContent(type="text", text=json.dumps(err))]
return [t.TextContent(type="text", text=json.dumps(res, default=str))]  # appears at two call sites
# After
return [t.TextContent(type="text", text=json.dumps(err, ensure_ascii=False))]
return [t.TextContent(type="text", text=json.dumps(res, default=str, ensure_ascii=False))]  # both call sites

Tests

Added tests/unit/test_mcp_bridge_ensure_ascii.py:

  • Baseline sanity check for ensure_ascii=False behavior
  • AST-level check that every json.dumps call in mcp_bridge.py includes ensure_ascii=False
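
For readers unfamiliar with AST-level checks, the idea in the second bullet can be sketched roughly as follows. This is an illustrative approximation, not the contents of tests/unit/test_mcp_bridge_ensure_ascii.py: the helper name `dumps_calls_missing_flag` and the `sample` source are hypothetical, and the sketch only recognizes calls written literally as `json.dumps(...)`.

```python
import ast


def dumps_calls_missing_flag(source: str) -> list[int]:
    """Return line numbers of json.dumps(...) calls lacking ensure_ascii=False."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match only the literal attribute access `json.dumps`.
        is_json_dumps = (
            isinstance(func, ast.Attribute)
            and func.attr == "dumps"
            and isinstance(func.value, ast.Name)
            and func.value.id == "json"
        )
        if not is_json_dumps:
            continue
        # The call passes only if ensure_ascii is given as the constant False.
        has_flag = any(
            kw.arg == "ensure_ascii"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is False
            for kw in node.keywords
        )
        if not has_flag:
            missing.append(node.lineno)
    return sorted(missing)


# Hypothetical source mirroring the before/after shapes in this PR.
sample = '''
import json
a = json.dumps(err)
b = json.dumps(res, default=str, ensure_ascii=False)
'''

print(dumps_calls_missing_flag(sample))  # → [3]
```

A real test would read the source of deploy/docker/mcp_bridge.py instead of `sample` and assert that the returned list is empty. A check of this kind also flags any json.dumps call added later without the flag, which a behavior-only test would not catch.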

Fixes #1962

Ishaan Samantray added 2 commits May 28, 2026 12:29
`NlpSentenceChunking.chunk()` was returning `list(set(sens))`.
Python's `set` is unordered, so the returned chunks were in
arbitrary order — not document order — and duplicate sentences
were silently discarded.

Fix: return `sens` directly, which preserves the order produced by
`nltk.sent_tokenize` and keeps any intentional duplicates.

Adds two regression tests that verify ordering and duplicate
preservation without requiring a full crawl4ai install.

Fixes unclecode#1909

All three json.dumps() calls in deploy/docker/mcp_bridge.py were using
the default ensure_ascii=True, which escapes every non-ASCII codepoint to
\uXXXX sequences.  For CJK content this inflates token counts by 2.5-3x,
raising costs and eating context budget for nothing.

The HTTP REST API already returns UTF-8 natively; MCP tool results should
behave the same.

Changes:
- json.dumps(err)             → json.dumps(err, ensure_ascii=False)
- json.dumps(res, default=str) ×2 → json.dumps(res, default=str, ensure_ascii=False)

Adds two unit tests:
- test_cjk_not_escaped_in_json_dumps — baseline sanity check
- test_mcp_bridge_serialize_uses_ensure_ascii_false — AST-level verification
  that every json.dumps call in mcp_bridge.py passes ensure_ascii=False

Fixes unclecode#1962

Development

Successfully merging this pull request may close these issues.

[Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content
