Fix models for which we don't have a dedicated tokenizer class, and the listed one is incorrect #45936

Open
itazap wants to merge 3 commits into main from check_repo_model_tokenizers

@itazap commented May 13, 2026

We don't check the tokenization of all the model paths we have in the transformers repo. Related to #44255.

Some models don't have a dedicated tokenizer class and use another model's tokenizer instead (e.g. Granite, which uses GPT2Tokenizer; related issue: #45813). The tokenizer class to use is mapped either in the tokenization_auto.py mapping or in the checkpoint's tokenizer_config.json. Sometimes the mapped tokenizer isn't the one that is actually being used. v5 surfaced these incorrect mappings because it actually tries to load the mapped tokenizer class and forces the same tokenizer type. To stay true to the pre-v5 behavior of these models, we can map them to TokenizersBackend (equivalent to PreTrainedTokenizerFast in v4), which loads the tokenizer.json as is.

Currently we only test tokenization for models that have their own tokenizer class, but we should test it for every checkpoint we list in the repo!

This PR

compare_tokenizers.py script

(based on the script in #44255)

Scans tests/models/test_modeling_*.py for .from_pretrained(...) calls, extracts every checkpoint path we list, and compares the _tokenizer objects loaded via AutoTokenizer.from_pretrained and TokenizersBackend.from_pretrained.
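For illustration, the scanning step could look roughly like the sketch below. This is not the code of compare_tokenizers.py; the function names `extract_checkpoints` and `scan_modeling_tests` and the regular expression are assumptions made for this example, and it only picks up checkpoints passed as plain string literals.

```python
import re
from pathlib import Path

# Matches `.from_pretrained("org/name")` with a string-literal first argument.
# Calls that pass a variable instead of a literal are intentionally skipped.
_FROM_PRETRAINED = re.compile(r"""\.from_pretrained\(\s*(['"])([^'"]+)\1""")


def extract_checkpoints(source: str) -> set[str]:
    """Return every string-literal checkpoint passed to `.from_pretrained(`."""
    return {match.group(2) for match in _FROM_PRETRAINED.finditer(source)}


def scan_modeling_tests(root: str = "tests/models") -> dict[str, set[str]]:
    """Map each test_modeling_*.py file under `root` to the checkpoints it lists.

    Hypothetical helper: the directory layout follows the PR description,
    the function itself is only an illustration.
    """
    found: dict[str, set[str]] = {}
    for path in sorted(Path(root).rglob("test_modeling_*.py")):
        checkpoints = extract_checkpoints(path.read_text(encoding="utf-8"))
        if checkpoints:
            found[str(path)] = checkpoints
    return found
```

Running `scan_modeling_tests()` from the repository root would give one set of checkpoint paths per test file, which can then be fed into the tokenizer comparison.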

Report

on all the checkpoints we list: report

TODO

Add a test that checks that AutoTokenizer and TokenizersBackend load equivalent _tokenizer objects for each checkpoint path we mention in the repo.
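A minimal sketch of what such an equivalence check might look like, under stated assumptions: both loaders return an object that exposes the underlying `tokenizers.Tokenizer` as `_tokenizer`, and two tokenizers are treated as equivalent when their `to_str()` serializations match. The helper names `serialized_backend` and `find_mismatches` are hypothetical, and the loaders are passed in as callables so the logic can be exercised without downloading any checkpoint.

```python
from typing import Any, Callable, Dict, Iterable


def serialized_backend(tokenizer: Any) -> str:
    """Serialize the wrapped tokenizer so two instances can be compared."""
    # Assumption: the wrapper exposes the Rust tokenizer as `_tokenizer`,
    # and that object provides `to_str()` (as `tokenizers.Tokenizer` does).
    return tokenizer._tokenizer.to_str()


def find_mismatches(
    checkpoints: Iterable[str],
    load_auto: Callable[[str], Any],
    load_backend: Callable[[str], Any],
) -> Dict[str, str]:
    """Return {checkpoint: reason} for every checkpoint whose tokenizers differ."""
    mismatches: Dict[str, str] = {}
    for checkpoint in checkpoints:
        try:
            auto_json = serialized_backend(load_auto(checkpoint))
            backend_json = serialized_backend(load_backend(checkpoint))
        except Exception as error:
            # A checkpoint that fails to load is reported rather than skipped.
            mismatches[checkpoint] = f"load error: {error!r}"
            continue
        if auto_json != backend_json:
            mismatches[checkpoint] = "tokenizer differs"
    return mismatches
```

In the repo, the two loaders would presumably be `AutoTokenizer.from_pretrained` and `TokenizersBackend.from_pretrained`, and the test would assert that the returned dictionary is empty. Comparing serialized JSON is a strict criterion; a real test may want to allow differences that don't affect tokenization.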

@itazap itazap requested a review from ArthurZucker May 15, 2026 05:56
@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, camembert, mpnet, rembert, xglm, xlnet

@github-actions

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45936&sha=b24a05
