Commit 89a9a4c
Fix null-separated ASCII misdetected as UTF-16-BE (#347)
* docs: add design spec for null separator tolerance
Addresses #346 — ASCII text with null byte separators
(common in Unix CLI output) being misdetected as utf-16-be.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: address spec review feedback
- Clarify UTF-16 guard applies in both single/dual candidate paths
- Note mypyc compilation constraint for utf1632.py
- Detail ASCII implementation using existing _ALLOWED_ASCII table
- Clarify pipeline reorder: computation order vs return order
- Note UniversalDetector propagation
- Fix language=None vs "" discrepancy in test expectations
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add implementation plan for null separator tolerance
6-task TDD plan covering UTF-16 guard, null-tolerant ASCII detection,
and pipeline reorder. Addresses #346.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add failing tests for null-separator UTF-16 false positive (#346)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: reject null-separator false positives in UTF-16 detector (#346)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test: add failing tests for null-tolerant ASCII detection (#346)
* feat: tolerate sparse null separators in ASCII detection (#346)
* fix: reorder pipeline so ASCII precheck prevents false binary classification (#346)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: remove planning docs from branch
Spec and plan are preserved in git history but don't need to be in the
final merge diff.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: tighten ASCII null-fraction threshold from 10% to 5%
Real-world null-separator data (find -print0, git ls-tree -z) is
1-3.5% nulls. 5% covers all realistic cases while staying well below
the UTF-16 guard threshold (15%).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: use bytes.translate in null-separator guard for consistency
Replace Python-level all() loop with C-level bytes.translate(), matching
the pattern used in binary.py and ascii.py. Cross-references the shared
ASCII byte set with ascii.py's _ALLOWED_ASCII.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: share ASCII byte-set constant across pipeline modules
Extract ASCII_TEXT_BYTES to pipeline/__init__.py and use it in both
ascii.py and utf1632.py to prevent drift between the two definitions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>1 parent a98f097 commit 89a9a4c
7 files changed
Lines changed: 226 additions & 22 deletions
File tree
- src/chardet/pipeline
- tests
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
15 | 15 | | |
16 | 16 | | |
17 | 17 | | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
18 | 23 | | |
19 | 24 | | |
20 | 25 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
2 | 2 | | |
3 | 3 | | |
4 | 4 | | |
5 | | - | |
| 5 | + | |
6 | 6 | | |
7 | | - | |
8 | | - | |
9 | | - | |
10 | | - | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
11 | 12 | | |
12 | 13 | | |
13 | 14 | | |
14 | | - | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
15 | 20 | | |
16 | 21 | | |
17 | 22 | | |
18 | 23 | | |
19 | 24 | | |
20 | 25 | | |
21 | | - | |
22 | | - | |
23 | | - | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
535 | 535 | | |
536 | 536 | | |
537 | 537 | | |
538 | | - | |
| 538 | + | |
| 539 | + | |
| 540 | + | |
| 541 | + | |
| 542 | + | |
| 543 | + | |
| 544 | + | |
| 545 | + | |
539 | 546 | | |
540 | | - | |
| 547 | + | |
| 548 | + | |
| 549 | + | |
| 550 | + | |
| 551 | + | |
541 | 552 | | |
542 | 553 | | |
543 | 554 | | |
| |||
547 | 558 | | |
548 | 559 | | |
549 | 560 | | |
550 | | - | |
551 | | - | |
552 | | - | |
553 | | - | |
| 561 | + | |
| 562 | + | |
| 563 | + | |
554 | 564 | | |
555 | 565 | | |
556 | 566 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
11 | 11 | | |
12 | 12 | | |
13 | 13 | | |
14 | | - | |
| 14 | + | |
15 | 15 | | |
16 | 16 | | |
17 | 17 | | |
| |||
38 | 38 | | |
39 | 39 | | |
40 | 40 | | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
41 | 73 | | |
42 | 74 | | |
43 | 75 | | |
| |||
149 | 181 | | |
150 | 182 | | |
151 | 183 | | |
152 | | - | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
153 | 187 | | |
154 | | - | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
155 | 191 | | |
156 | 192 | | |
157 | 193 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
41 | 41 | | |
42 | 42 | | |
43 | 43 | | |
44 | | - | |
45 | | - | |
46 | | - | |
| 44 | + | |
| 45 | + | |
47 | 46 | | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
438 | 438 | | |
439 | 439 | | |
440 | 440 | | |
| 441 | + | |
| 442 | + | |
| 443 | + | |
| 444 | + | |
| 445 | + | |
| 446 | + | |
| 447 | + | |
| 448 | + | |
| 449 | + | |
| 450 | + | |
| 451 | + | |
| 452 | + | |
| 453 | + | |
| 454 | + | |
| 455 | + | |
| 456 | + | |
| 457 | + | |
| 458 | + | |
| 459 | + | |
| 460 | + | |
| 461 | + | |
| 462 | + | |
| 463 | + | |
| 464 | + | |
| 465 | + | |
| 466 | + | |
| 467 | + | |
| 468 | + | |
| 469 | + | |
| 470 | + | |
| 471 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
569 | 569 | | |
570 | 570 | | |
571 | 571 | | |
| 572 | + | |
| 573 | + | |
| 574 | + | |
| 575 | + | |
| 576 | + | |
| 577 | + | |
| 578 | + | |
| 579 | + | |
| 580 | + | |
| 581 | + | |
| 582 | + | |
| 583 | + | |
| 584 | + | |
| 585 | + | |
| 586 | + | |
| 587 | + | |
| 588 | + | |
| 589 | + | |
| 590 | + | |
| 591 | + | |
| 592 | + | |
| 593 | + | |
| 594 | + | |
| 595 | + | |
| 596 | + | |
| 597 | + | |
| 598 | + | |
| 599 | + | |
| 600 | + | |
| 601 | + | |
| 602 | + | |
| 603 | + | |
| 604 | + | |
| 605 | + | |
| 606 | + | |
| 607 | + | |
| 608 | + | |
| 609 | + | |
| 610 | + | |
| 611 | + | |
| 612 | + | |
| 613 | + | |
| 614 | + | |
| 615 | + | |
| 616 | + | |
| 617 | + | |
| 618 | + | |
| 619 | + | |
| 620 | + | |
0 commit comments