
Commit 89a9a4c

Fix null-separated ASCII misdetected as UTF-16-BE (#347)
Squashed commits:

* docs: add design spec for null separator tolerance

  Addresses #346 — ASCII text with null byte separators (common in Unix CLI output) being misdetected as utf-16-be.

* docs: address spec review feedback

  - Clarify UTF-16 guard applies in both single/dual candidate paths
  - Note mypyc compilation constraint for utf1632.py
  - Detail ASCII implementation using existing _ALLOWED_ASCII table
  - Clarify pipeline reorder: computation order vs return order
  - Note UniversalDetector propagation
  - Fix language=None vs "" discrepancy in test expectations

* docs: add implementation plan for null separator tolerance

  6-task TDD plan covering UTF-16 guard, null-tolerant ASCII detection, and pipeline reorder. Addresses #346.

* test: add failing tests for null-separator UTF-16 false positive (#346)

* fix: reject null-separator false positives in UTF-16 detector (#346)

* test: add failing tests for null-tolerant ASCII detection (#346)

* feat: tolerate sparse null separators in ASCII detection (#346)

* fix: reorder pipeline so ASCII precheck prevents false binary classification (#346)

* chore: remove planning docs from branch

  Spec and plan are preserved in git history but don't need to be in the final merge diff.

* fix: tighten ASCII null-fraction threshold from 10% to 5%

  Real-world null-separator data (find -print0, git ls-tree -z) is 1-3.5% nulls. 5% covers all realistic cases while staying well below the UTF-16 guard threshold (15%).

* refactor: use bytes.translate in null-separator guard for consistency

  Replace Python-level all() loop with C-level bytes.translate(), matching the pattern used in binary.py and ascii.py. Cross-references the shared ASCII byte set with ascii.py's _ALLOWED_ASCII.

* refactor: share ASCII byte-set constant across pipeline modules

  Extract ASCII_TEXT_BYTES to pipeline/__init__.py and use it in both ascii.py and utf1632.py to prevent drift between the two definitions.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent a98f097 commit 89a9a4c

7 files changed

Lines changed: 226 additions & 22 deletions

File tree

src/chardet/pipeline/__init__.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -15,6 +15,11 @@
 #: Deleting all bytes >= 0x80 and comparing lengths gives the non-ASCII count.
 HIGH_BYTES: bytes = bytes(range(0x80, 0x100))
 
+#: Bytes considered valid in ASCII text: tab (0x09), newline (0x0A),
+#: carriage return (0x0D), and printable ASCII (0x20-0x7E).
+#: Used by ``ascii.py`` directly and by ``utf1632.py`` (with null added).
+ASCII_TEXT_BYTES: bytes = bytes([0x09, 0x0A, 0x0D, *range(0x20, 0x7F)])
+
 
 class DetectionDict(TypedDict):
     """Dictionary representation of a detection result.
```
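The delete-and-compare idiom behind `ASCII_TEXT_BYTES` can be sketched standalone (the constant is re-declared here so the snippet runs without the package; `non_ascii_remainder` is an illustrative name, not part of the module):

```python
# Re-declaration of the shared constant for a self-contained sketch.
ASCII_TEXT_BYTES: bytes = bytes([0x09, 0x0A, 0x0D, *range(0x20, 0x7F)])


def non_ascii_remainder(data: bytes) -> bytes:
    """Delete every allowed ASCII byte; whatever survives is suspect."""
    # bytes.translate(None, delete) runs in C: one pass, no Python loop.
    return data.translate(None, ASCII_TEXT_BYTES)


print(non_ascii_remainder(b"plain text\n"))         # b''
print(non_ascii_remainder(b"a\x00b\x00c"))          # b'\x00\x00'
print(non_ascii_remainder("café".encode("utf-8")))  # b'\xc3\xa9'
```

An empty remainder means pure ASCII; a remainder of only nulls is the separator case the rest of this commit handles.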

src/chardet/pipeline/ascii.py

Lines changed: 23 additions & 10 deletions
```diff
@@ -1,23 +1,36 @@
-"""Stage 1c: Pure ASCII detection."""
+"""Stage 1c: Pure ASCII detection (with null-separator tolerance)."""
 
 from __future__ import annotations
 
-from chardet.pipeline import DetectionResult
+from chardet.pipeline import ASCII_TEXT_BYTES, DetectionResult
 
-# Allowed ASCII bytes: tab (0x09), newline (0x0A), carriage return (0x0D),
-# and printable ASCII (0x20-0x7E). bytes.translate deletes these from the
-# input; if anything remains, the data is not pure ASCII.
-_ALLOWED_ASCII: bytes = bytes([0x09, 0x0A, 0x0D, *range(0x20, 0x7F)])
+# Maximum fraction of null bytes to still classify data as ASCII.
+# Null-separated CLI output (find -print0, git ls-tree -z) typically has
+# 1-3.5% nulls. 5% covers all realistic cases while staying well below
+# the UTF-16 guard threshold (15%).
+_MAX_NULL_FRACTION = 0.05
 
 
 def detect_ascii(data: bytes) -> DetectionResult | None:
-    """Return an ASCII result if all bytes are printable ASCII plus common whitespace.
+    r"""Return an ASCII result if all bytes are printable ASCII plus common whitespace.
+
+    Tolerates sparse null bytes (``\x00``) up to ``_MAX_NULL_FRACTION`` of
+    the data, returning confidence 0.99 instead of 1.0 to distinguish from
+    pure ASCII.
 
     :param data: The raw byte data to examine.
     :returns: A :class:`DetectionResult` for ASCII, or ``None``.
     """
     if not data:
         return None
-    if data.translate(None, _ALLOWED_ASCII):
-        return None  # Non-allowed bytes remain
-    return DetectionResult(encoding="ascii", confidence=1.0, language=None)
+    remainder = data.translate(None, ASCII_TEXT_BYTES)
+    if not remainder:
+        return DetectionResult(encoding="ascii", confidence=1.0, language=None)
+    # Check if the only non-allowed bytes are null separators
+    if remainder.replace(b"\x00", b""):
+        return None  # Non-null, non-ASCII bytes present
+    # All non-allowed bytes are nulls — accept if sparse enough
+    null_fraction = len(remainder) / len(data)
+    if null_fraction <= _MAX_NULL_FRACTION:
+        return DetectionResult(encoding="ascii", confidence=0.99, language=None)
+    return None
```
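The effect of the 5% threshold on realistic inputs can be checked with plain arithmetic, no chardet import needed (a sketch; the 0.05 cutoff mirrors `_MAX_NULL_FRACTION` above):

```python
samples = {
    "find -print0": b"/home/user/documents/report.txt\x00"
                    b"/home/user/documents/notes.txt\x00",
    "dense nulls":  b"ab\x00cd\x00ef\x00gh\x00ij\x00",
}
for name, data in samples.items():
    # bytes.count(0) counts null bytes; compare against the 5% cutoff.
    frac = data.count(0) / len(data)
    verdict = "ascii @ 0.99" if frac <= 0.05 else "rejected"
    print(f"{name}: {frac:.1%} nulls -> {verdict}")
```

The `find -print0` sample is about 3% nulls and stays ASCII; the dense sample is 33% nulls and falls through to later pipeline stages.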

src/chardet/pipeline/orchestrator.py

Lines changed: 16 additions & 6 deletions
```diff
@@ -535,9 +535,20 @@ def _run_pipeline_core(  # noqa: PLR0913
     # markup) so that explicit charset declarations still take precedence.
     utf8_precheck = detect_utf8(data)
 
-    # Stage 0: Binary detection (skip when data is valid multi-byte UTF-8)
+    # Pre-check ASCII to prevent false binary classification. ASCII text
+    # with null byte separators (e.g. find -print0 output) would exceed the
+    # binary threshold due to the null bytes. Like the UTF-8 precheck, we
+    # compute the result now but return it at the normal position (after
+    # markup) so explicit charset declarations still take precedence.
+    ascii_precheck = detect_ascii(data)
+
+    # Stage 0: Binary detection (skip when data is valid UTF-8 or ASCII)
     # Binary detection (encoding=None) is NOT gated by filters.
-    if utf8_precheck is None and is_binary(data, max_bytes=max_bytes):
+    if (
+        utf8_precheck is None
+        and ascii_precheck is None
+        and is_binary(data, max_bytes=max_bytes)
+    ):
         return [_BINARY_RESULT]
 
     # Stage 1b: Markup charset extraction (before ASCII/UTF-8 so explicit
@@ -547,10 +558,9 @@ def _run_pipeline_core(  # noqa: PLR0913
     if markup_result is not None and markup_result.encoding in allowed:
         return [markup_result]
 
-    # Stage 1c: ASCII
-    ascii_result = detect_ascii(data)
-    if ascii_result is not None and ascii_result.encoding in allowed:
-        return [ascii_result]
+    # Stage 1c: ASCII (use pre-computed result)
+    if ascii_precheck is not None and ascii_precheck.encoding in allowed:
+        return [ascii_precheck]
 
     # Stage 1d: UTF-8 structural validation (use pre-computed result)
     if utf8_precheck is not None and utf8_precheck.encoding in allowed:
```
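The reorder is about computation order, not return order: both prechecks run before Stage 0 so the binary gate can consult them, while their results are still returned at their usual positions after markup. A simplified sketch of that control flow, with the stage functions passed in as hypothetical stand-ins for the real detectors:

```python
def run_pipeline(data, detect_utf8, detect_ascii, is_binary, detect_markup):
    # Prechecks: computed early so Stage 0 can consult them, but their
    # results are not returned yet -- markup must still take precedence.
    utf8_precheck = detect_utf8(data)
    ascii_precheck = detect_ascii(data)

    # Stage 0: binary, skipped when either precheck succeeded.
    if utf8_precheck is None and ascii_precheck is None and is_binary(data):
        return "binary"

    # Stage 1b: explicit markup charset declarations win.
    if (markup := detect_markup(data)) is not None:
        return markup

    # Stages 1c/1d: return the pre-computed results in their usual order.
    if ascii_precheck is not None:
        return ascii_precheck
    return utf8_precheck
```

With null-separated ASCII, `is_binary` would fire on the nulls; because `ascii_precheck` already succeeded, Stage 0 is skipped and the ASCII result is returned at Stage 1c.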

src/chardet/pipeline/utf1632.py

Lines changed: 39 additions & 3 deletions
```diff
@@ -11,7 +11,7 @@
 
 import unicodedata
 
-from chardet.pipeline import DETERMINISTIC_CONFIDENCE, DetectionResult
+from chardet.pipeline import ASCII_TEXT_BYTES, DETERMINISTIC_CONFIDENCE, DetectionResult
 
 # How many bytes to sample for pattern analysis
 _SAMPLE_SIZE = 4096
@@ -38,6 +38,38 @@
 # considered text rather than binary data.
 _MIN_PRINTABLE_FRACTION = 0.7
 
+# Maximum null fraction (in the candidate null-byte position) below which
+# the data is checked for a null-separator pattern. If the null fraction
+# is below this AND all non-null bytes are printable ASCII, the candidate
+# is rejected as a null-separator false positive rather than real UTF-16.
+# Real Latin UTF-16 has ~50% nulls; CJK UTF-16 has fewer but non-ASCII
+# non-null bytes. 15% is generous — separator data is typically 1-5%.
+_NULL_SEPARATOR_MAX_FRACTION = 0.15
+
+# ASCII_TEXT_BYTES plus the null byte — used by the null-separator guard
+# to check whether non-null bytes are all printable ASCII.
+_NULL_SEPARATOR_ALLOWED: bytes = b"\x00" + ASCII_TEXT_BYTES
+
+
+def _is_null_separator_pattern(data: bytes, null_frac: float) -> bool:
+    """Return True if the data looks like ASCII with null byte separators.
+
+    :param data: The raw byte sample to examine.
+    :param null_frac: The positional null fraction for this UTF-16 candidate
+        (i.e. fraction of null bytes in even positions for BE, or odd
+        positions for LE) — not the total null fraction across all bytes.
+
+    Checks two conditions:
+    1. The positional null fraction is below ``_NULL_SEPARATOR_MAX_FRACTION``
+    2. Every non-null byte is printable ASCII or common whitespace
+
+    When both conditions are met, the nulls are likely field separators
+    (e.g. ``find -print0``), not UTF-16 encoding artifacts.
+    """
+    if null_frac >= _NULL_SEPARATOR_MAX_FRACTION:
+        return False
+    return not data.translate(None, _NULL_SEPARATOR_ALLOWED)
+
 
 def detect_utf1632_patterns(data: bytes) -> DetectionResult | None:
     """Detect UTF-32 or UTF-16 encoding from null-byte patterns.
@@ -149,9 +181,13 @@ def _check_utf16(data: bytes) -> DetectionResult | None:
     le_frac = le_null_count / num_units
 
     candidates: list[tuple[str, float]] = []
-    if le_frac >= _UTF16_MIN_NULL_FRACTION:
+    if le_frac >= _UTF16_MIN_NULL_FRACTION and not _is_null_separator_pattern(
+        data[:sample_len], le_frac
+    ):
         candidates.append(("utf-16-le", le_frac))
-    if be_frac >= _UTF16_MIN_NULL_FRACTION:
+    if be_frac >= _UTF16_MIN_NULL_FRACTION and not _is_null_separator_pattern(
+        data[:sample_len], be_frac
+    ):
         candidates.append(("utf-16-be", be_frac))
 
     if not candidates:
```

tests/test_ascii.py

Lines changed: 63 additions & 3 deletions
```diff
@@ -41,7 +41,67 @@ def test_all_printable_ascii():
 
 
 def test_null_byte_not_ascii():
-    # Null bytes should have been caught by binary detection (Stage 0),
-    # but ASCII check should still reject them
-    result = detect_ascii(b"Hello\x00world")
+    # 2 nulls in 10 bytes = 20% → above threshold, not ASCII
+    result = detect_ascii(b"Hello\x00\x00rld")
     assert result is None
+
+
+def test_ascii_with_sparse_null_separators():
+    """ASCII with null separators below 5% threshold → confidence 0.99."""
+    data = (
+        b"master:README.md\x002\x00For support slack to #kodiak-support\n"
+        b"master:support.txt\x001\x00For support slack to #kodiak-support\n"
+    )
+    result = detect_ascii(data)
+    assert result is not None
+    assert result.encoding == "ascii"
+    assert result.confidence == 0.99
+
+
+def test_ascii_with_null_separated_paths():
+    """Find -print0 style output → ASCII at 0.99."""
+    data = (
+        b"/home/user/documents/report.txt\x00"
+        b"/home/user/documents/notes.txt\x00"
+        b"/home/user/downloads/image.png\x00"
+        b"/home/user/music/song.mp3\x00"
+    )
+    result = detect_ascii(data)
+    assert result is not None
+    assert result.encoding == "ascii"
+    assert result.confidence == 0.99
+
+
+def test_ascii_with_null_at_boundary():
+    """Exactly 5% nulls (1 in 20 bytes) is at the threshold — still ASCII."""
+    result = detect_ascii(b"abcdefghij\x00klmnopqrs")  # 1/20 = 5%
+    assert result is not None
+    assert result.encoding == "ascii"
+    assert result.confidence == 0.99
+
+
+def test_ascii_with_null_just_above_boundary():
+    """Just above 5% nulls → not ASCII."""
+    result = detect_ascii(b"abcdefghij\x00klmnopqr")  # 1/19 = 5.26%
+    assert result is None
+
+
+def test_ascii_with_high_null_fraction():
+    """More than 5% null bytes → not ASCII."""
+    # 5 nulls in 15 bytes = 33%
+    data = b"ab\x00cd\x00ef\x00gh\x00ij\x00"
+    result = detect_ascii(data)
+    assert result is None
+
+
+def test_ascii_with_nulls_and_high_bytes():
+    """Nulls mixed with non-ASCII bytes → not ASCII."""
+    data = b"Hello\x00\x80World"
+    result = detect_ascii(data)
+    assert result is None
+
+
+def test_pure_ascii_still_confidence_1():
+    """Pure ASCII without nulls still returns confidence 1.0."""
+    result = detect_ascii(b"Hello, world!")
+    assert result == DetectionResult("ascii", 1.0, None)
```

tests/test_github_issues.py

Lines changed: 31 additions & 0 deletions
```diff
@@ -438,3 +438,34 @@ def test_issue_67_no_crash(self) -> None:
         # Just verify it doesn't crash; any result is acceptable
         assert isinstance(result, dict)
         assert "encoding" in result
+
+
+# =========================================================================
+# NULL SEPARATOR ISSUES
+# =========================================================================
+
+
+class TestNullSeparators:
+    """ASCII text with null byte separators."""
+
+    def test_issue_346_null_separated_ascii(self) -> None:
+        """Issue #346: Null-separated ASCII detected as utf-16-be."""
+        data = (
+            b"master:README.md\x002\x00For support slack to #kodiak-support\n"
+            b"master:support.txt\x001\x00For support slack to #kodiak-support\n"
+        )
+        result = chardet.detect(data)
+        assert result["encoding"] == "ascii"
+        assert result["confidence"] == 0.99
+
+    def test_find_print0_output(self) -> None:
+        """Find -print0 style output should be detected as ASCII."""
+        data = (
+            b"/home/user/documents/report.txt\x00"
+            b"/home/user/documents/notes.txt\x00"
+            b"/home/user/downloads/image.png\x00"
+            b"/home/user/music/song.mp3\x00"
+        )
+        result = chardet.detect(data)
+        assert result["encoding"] == "ascii"
+        assert result["confidence"] == 0.99
```

tests/test_utf1632.py

Lines changed: 49 additions & 0 deletions
```diff
@@ -569,3 +569,52 @@ def test_text_quality_no_letters() -> None:
     quality = _text_quality(text)
     # No letters, so letter ratio is 0, ascii bonus is 0
     assert quality < 0.5
+
+
+# ---------------------------------------------------------------------------
+# Null-separator guard: sparse nulls in ASCII should NOT trigger UTF-16
+# ---------------------------------------------------------------------------
+
+
+def test_null_separated_ascii_not_utf16() -> None:
+    """ASCII with null byte separators should not be detected as UTF-16.
+
+    Regression test for chardet/chardet#346.
+    """
+    data = (
+        b"master:README.md\x002\x00For support slack to #kodiak-support\n"
+        b"master:support.txt\x001\x00For support slack to #kodiak-support\n"
+    )
+    result = detect_utf1632_patterns(data)
+    assert result is None
+
+
+def test_null_separated_paths_not_utf16() -> None:
+    """Find -print0 style output should not be detected as UTF-16."""
+    data = (
+        b"/home/user/documents/report.txt\x00"
+        b"/home/user/documents/notes.txt\x00"
+        b"/home/user/downloads/image.png\x00"
+        b"/home/user/music/song.mp3\x00"
+    )
+    result = detect_utf1632_patterns(data)
+    assert result is None
+
+
+def test_real_utf16_be_still_detected() -> None:
+    """Real UTF-16-BE text must still be detected after the guard is added."""
+    text = "The quick brown fox jumps over the lazy dog."
+    data = text.encode("utf-16-be")
+    result = detect_utf1632_patterns(data)
+    assert result is not None
+    assert result.encoding == "utf-16-be"
+    assert result.confidence == DETERMINISTIC_CONFIDENCE
+
+
+def test_real_utf16_le_cjk_still_detected() -> None:
+    """CJK UTF-16-LE must still be detected (low null fraction but
+    non-ASCII non-null bytes)."""
+    text = "This document: \u4f60\u597d\u4e16\u754c\uff0c\u6b22\u8fce\u6765\u5230\u8fd9\u91cc\u3002"
+    data = text.encode("utf-16-le")
+    result = detect_utf1632_patterns(data)
+    assert result is not None
+    assert result.encoding == "utf-16-le"
```
