## How to reproduce the behaviour
```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I can't believe you have done this")
```
`"can't"` is a tokenizer exception because of the funky contraction (yay, English). It's defined in `spacy/lang/en/tokenizer_exceptions.py` (line 233 at 0069cf9):

```python
{ORTH: "ca", NORM: "can"},
```
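For illustration, here's a minimal pure-Python sketch of how such a special-case table drives tokenization. The table layout mirrors the `{ORTH, NORM}` dicts above, but the function and variable names are made up, not spaCy's internals:

```python
# Toy model of a tokenizer special-case table -- not spaCy's actual code.
ORTH, NORM = "orth", "norm"

SPECIAL_CASES = {
    "can't": [
        {ORTH: "ca", NORM: "can"},
        {ORTH: "n't", NORM: "not"},
    ],
}

def tokenize(text):
    """Whitespace-split, expanding any chunk that has a special case."""
    tokens = []
    for chunk in text.split():
        if chunk in SPECIAL_CASES:
            tokens.extend(t[ORTH] for t in SPECIAL_CASES[chunk])
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("I can't believe you have done this"))
# ['I', 'ca', "n't", 'believe', 'you', 'have', 'done', 'this']
```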
Spans that contain these exceptions are marked as `has_special`:

- declared here: `spacy/tokenizer.pyx`, line 179 (at 0069cf9)
- set here: `spacy/tokenizer.pyx`, line 375 (at 0069cf9)

And `has_special` spans are not cached: `spacy/tokenizer.pyx`, line 523 (at 0069cf9)
The problem is that `has_special`, once set to a nonzero value, is never reset. This means that once the tokenizer encounters a special case, every subsequent span is also marked as special, and none of them are cached, even when they should be.
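The failure mode can be sketched with a toy cache loop (illustrative only; the real logic lives in Cython in `tokenizer.pyx`, and `reset_per_span` stands in for the hypothetical fix):

```python
# Toy model of the bug: a has_special flag that is set when a special
# case appears and, in the buggy variant, never reset afterwards.
def count_cache_hits(chunks, reset_per_span):
    cache = set()
    has_special = False
    hits = 0
    for chunk in chunks:
        if reset_per_span:
            has_special = False       # hypothetical fix: reset for each span
        if chunk == "can't":
            has_special = True        # special case encountered
        if chunk in cache:
            hits += 1
        elif not has_special:
            cache.add(chunk)          # only non-special spans are cached
    return hits

chunks = ["can't"] + ["hello", "world"] * 10

print(count_cache_hits(chunks, reset_per_span=False))  # 0: nothing after "can't" is ever cached
print(count_cache_hits(chunks, reset_per_span=True))   # 18: repeated spans hit the cache
```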
This has some fairly significant tokenizer performance implications. It should be much faster than it is. I'll put some benchmarking in my PR.
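A rough way to observe the effect from Python (timings are machine-dependent and this is just a sketch, not the PR's benchmark; the two inputs differ only in whether they contain a tokenizer exception):

```python
import time

from spacy.lang.en import English

nlp = English()
texts = {
    "no special case": ["I can not believe you have done this"] * 10_000,
    "with special case": ["I can't believe you have done this"] * 10_000,
}

# Time the tokenizer alone, bypassing the rest of the pipeline.
for label, batch in texts.items():
    start = time.perf_counter()
    for text in batch:
        nlp.tokenizer(text)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
```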
## Your Environment
Info about spaCy
- spaCy version: 3.8.14
- Platform: macOS-26.3.1-arm64-arm-64bit
- Python version: 3.12.4
- Pipelines: en_core_web_md (3.8.0), en_core_web_sm (3.8.0)