Fix models for which we don't have a dedicated tokenizer class, and the listed one is incorrect #45936
Open
itazap wants to merge 3 commits into
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, camembert, mpnet, rembert, xglm, xlnet
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45936&sha=b24a05
Currently, we don't check the tokenization of all the model paths we have in the transformers repo. Related to #44255.
We have models that don't have their own dedicated tokenizer class and instead use another model's tokenizer (e.g. Granite, which uses `GPT2Tokenizer`; related issue: #45813). The other model's tokenizer class would be mapped in the `tokenization_auto.py` mapping, or in the `tokenizer_config.json`. Sometimes the mapped tokenizer isn't actually the one being used, and v5 surfaced these incorrect mappings: in v5 we actually try to load the mapped tokenizer class and force the same tokenizer type. To stay true to the pre-v5 behavior of these models, we can map them to `TokenizersBackend` (equivalent to `PreTrainedTokenizerFast` in v4), which loads the `tokenizer.json` as is. In any case, we currently only test tokenization of models that have their own tokenizer class, but we should test tokenization for every checkpoint we have in the repo!
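
A minimal sketch of the mismatch this surfaces, assuming a v5 install where `TokenizersBackend` is importable from `transformers` (the import path and the example checkpoint are assumptions, not taken from this PR):

```python
from transformers import AutoTokenizer, TokenizersBackend  # v5 import path is an assumption

# Example checkpoint only; any model whose config lists a borrowed tokenizer class works.
path = "ibm-granite/granite-3.0-2b-base"

# AutoTokenizer resolves the class listed in tokenization_auto.py / tokenizer_config.json;
# TokenizersBackend loads the checkpoint's tokenizer.json as is.
auto_tok = AutoTokenizer.from_pretrained(path)
raw_tok = TokenizersBackend.from_pretrained(path)

text = "Hello, world!"
# If the listed class is wrong, the two encodings can diverge.
print(auto_tok.encode(text) == raw_tok.encode(text))
```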
This PR
- `compare_tokenizers.py` script (based on the one in #44255): scans `tests/models/test_modeling_*.py` for `.from_pretrained(...)`, extracts all the checkpoint paths we list, and compares `_tokenizer`s loaded via `AutoTokenizer.from_pretrained` vs `TokenizersBackend.from_pretrained`. A sketch of the scanning step follows this list.
- Report on all the checkpoints we list: report
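
A minimal sketch of what the scanning-and-comparison step could look like (this is not the actual script from the PR; the regex, sample string, and report format are illustrative assumptions):

```python
import re
from pathlib import Path

from transformers import AutoTokenizer, TokenizersBackend  # v5 import path is an assumption

# Match string literals passed to .from_pretrained(...) in the modeling tests.
CHECKPOINT_RE = re.compile(r"\.from_pretrained\(\s*[\"']([^\"']+)[\"']")

def collect_checkpoints(tests_dir: str = "tests/models") -> set[str]:
    paths = set()
    for test_file in Path(tests_dir).glob("**/test_modeling_*.py"):
        paths.update(CHECKPOINT_RE.findall(test_file.read_text()))
    return paths

def tokenizers_match(path: str, sample: str = "Hello, world!") -> bool:
    auto_tok = AutoTokenizer.from_pretrained(path)
    raw_tok = TokenizersBackend.from_pretrained(path)
    # Comparing encodings on a sample string is a cheap first pass; the real
    # comparison would inspect the underlying `_tokenizer` objects.
    return auto_tok.encode(sample) == raw_tok.encode(sample)

if __name__ == "__main__":
    for path in sorted(collect_checkpoints()):
        try:
            status = "OK" if tokenizers_match(path) else "MISMATCH"
        except Exception as exc:  # some checkpoints may not ship a tokenizer.json
            status = f"ERROR: {exc}"
        print(f"{path}: {status}")
```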
TODO
- Add a test that checks that `AutoTokenizer` and `TokenizersBackend` load equivalent `_tokenizer` objects for each checkpoint path we mention in the repo.
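
One possible shape for that test, a sketch assuming pytest parametrization over the collected paths (`collect_checkpoints` is the hypothetical helper from the sketch above, and comparing via `to_str()` on the underlying `tokenizers` object is an assumption):

```python
import pytest
from transformers import AutoTokenizer, TokenizersBackend

# Hypothetical helper (see the sketch above) that collects every checkpoint
# path referenced in tests/models/test_modeling_*.py.
from compare_tokenizers import collect_checkpoints

@pytest.mark.parametrize("checkpoint", sorted(collect_checkpoints()))
def test_auto_matches_tokenizers_backend(checkpoint):
    auto_tok = AutoTokenizer.from_pretrained(checkpoint)
    raw_tok = TokenizersBackend.from_pretrained(checkpoint)
    # Compare the serialized underlying `tokenizers` objects rather than
    # object identity; to_str() dumps the tokenizer.json contents.
    assert auto_tok._tokenizer.to_str() == raw_tok._tokenizer.to_str()
```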