Skip to content

fix(api): strip NUL bytes from LLM output fields before API response#3221

Open
firecrawl-spring[bot] wants to merge 2 commits intomainfrom
fix/sanitize-llm-output-nul-bytes
Open

fix(api): strip NUL bytes from LLM output fields before API response#3221
firecrawl-spring[bot] wants to merge 2 commits intomainfrom
fix/sanitize-llm-output-nul-bytes

Conversation

@firecrawl-spring
Copy link
Copy Markdown
Contributor

@firecrawl-spring firecrawl-spring Bot commented Mar 24, 2026

Summary

LLM-generated output fields (summary, extract/json, answer) could contain NUL bytes (\u0000) when the model mishandles special characters like ² (U+00B2). These NUL bytes pass through to the API response unchanged, causing downstream failures for users storing results in PostgreSQL text columns (which reject NUL bytes).

Changes

  • Added removeNulBytes() helper in llmExtract.ts that strips NUL bytes from strings, and uses JSON.stringify with a replacer to handle nested arrays/objects without manual recursion
  • Applied sanitization to all three LLM output assignment paths:
    • document.extract / document.json (extract data)
    • document.summary (summary generation)
    • document.answer (query generation)

This is consistent with existing sanitization patterns in the codebase:

  • sanitizeMetadataValue() in build-document.ts strips control chars via .replace(/[\x00-\x1f\x7f]/g, '')
  • sanitizeString() in log_job.ts strips NUL via .replace(/\u0000/g, '')

Updates since last revision

  • Simplified removeNulBytes() implementation: replaced manual recursion through arrays and objects with JSON.stringify's built-in replacer function. This eliminates the separate array/object branches while maintaining identical behavior. The replacer operates on actual string values (not serialized form), so there are no edge cases with escaped characters.

Test plan

  • TypeScript compilation passes (no new errors introduced)
  • Unit-tested removeNulBytes() logic with strings containing \u0000, nested objects, arrays, null/undefined, numbers, and literal \u0000 text — all pass correctly

Review & Testing Checklist for Human

  • Verify the JSON.stringify replacer approach doesn't silently drop or alter non-string values in deeply nested extract output (e.g. numbers, booleans, nulls inside arrays of objects)
  • Confirm the JSON round-trip doesn't lose data for any real-world extract schemas — LLM output should always be JSON-serializable, but worth a sanity check against a complex extraction
  • Test with a page known to trigger NUL bytes in LLM output (e.g. content with ², ³, or other superscript characters) to verify the fix end-to-end

Notes

  • The JSON.stringify/JSON.parse round-trip is safe for LLM extract output since it's already JSON-parsed data (no Dates, undefined values, or class instances to lose)
  • The formatting change to modelChain in performQuery is from prettier (pre-commit hook), not a functional change

Related Pylon Ticket

https://app.usepylon.com/issues?issueNumber=25035

Link to Devin session: https://app.devin.ai/sessions/dac229a885894c6d8d87db047fe8d440
Requested by: @Chadha93

LLM-generated output (summary, extract/json, answer) could contain NUL
bytes (\u0000) when the model mishandles special characters like ²
(U+00B2). These NUL bytes pass through to the API response unchanged,
causing downstream failures for users storing results in PostgreSQL
text columns (which reject NUL bytes).

Add removeNulBytes() helper that recursively strips NUL bytes from
strings, arrays, and objects, and apply it to all three LLM output
assignment paths in llmExtract.ts.

Co-Authored-By: micahstairs <micah@sideguide.dev>
@firecrawl-spring firecrawl-spring Bot requested a review from micahstairs March 24, 2026 12:46
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 1 file

@blacksmith-sh
Copy link
Copy Markdown
Contributor

blacksmith-sh Bot commented Mar 24, 2026

Found 6 test failures on Blacksmith runners:

Failures

Test View Logs
src/tests/snips/v2/scrape-browser.test.ts/
Scrape browser interact replay returns 404 when scrape job does not exist
View Logs
src/tests/snips/v2/scrape-browser.test.ts/
Scrape browser interact replay returns 404 when scrape job does not exist
View Logs
src/tests/snips/v2/scrape-browser.test.ts/
Scrape browser interact replay returns 404 when scrape job does not exist
View Logs
src/tests/snips/v2/scrape-browser.test.ts/
Scrape browser interact replay returns 404 when scrape job does not exist
View Logs
src/tests/snips/v2/scrape-browser.test.ts/
Scrape browser interact replay returns replay-context error when scrape data is not ret
ained
View Logs
src/tests/snips/v2/scrape-browser.test.ts/
Scrape browser interact replay returns replay-context error when scrape data is not ret
ained
View Logs

Fix in Cursor

Replace manual recursion through arrays and objects with JSON.stringify's
built-in replacer function, which handles all nesting automatically.
This eliminates the separate array/object branches while maintaining
identical behavior.

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 1 file (changes from recent commits).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant