Fix models for which we don't have a dedicated tokenizer class, and the listed one is incorrect #45936
Open
itazap wants to merge 3 commits into
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, camembert, mpnet, rembert, xglm, xlnet
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45936&sha=b24a05
Currently, we don't check the tokenization of all the model paths we have in the transformers repo. Related to #44255.
We have models that don't have their own dedicated tokenizer class and instead use another model's tokenizer (e.g. Granite, which uses `GPT2Tokenizer`; related issue: #45813). The other model's tokenizer class would be mapped in the `tokenization_auto.py` mapping, or in the `tokenizer_config.json`. Sometimes the mapped tokenizer isn't actually the one being used, and v5 surfaced these incorrect mappings: in v5 we actually try to load the mapped tokenizer class and force the same tokenizer type. To stay true to the pre-v5 behavior of these models, we can map them to `TokenizersBackend` (equivalent to `PreTrainedTokenizerFast` in v4), which loads the `tokenizer.json` as is. In any case, we currently only test tokenization of models that have their own tokenizer class, but we should test tokenization for every checkpoint we have in the repo!
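
A minimal sketch of the mismatch this surfaces, assuming a v5 install where `TokenizersBackend` is importable from `transformers` (the import path and the example checkpoint are assumptions, not taken from this PR):

```python
from transformers import AutoTokenizer, TokenizersBackend  # v5 import path is an assumption

# Example checkpoint only; any model whose config lists a borrowed tokenizer class works.
path = "ibm-granite/granite-3.0-2b-base"

# AutoTokenizer resolves the class listed in tokenization_auto.py / tokenizer_config.json;
# TokenizersBackend loads the checkpoint's tokenizer.json as is.
auto_tok = AutoTokenizer.from_pretrained(path)
raw_tok = TokenizersBackend.from_pretrained(path)

text = "Hello, world!"
# If the listed class is wrong, the two encodings can diverge.
print(auto_tok.encode(text) == raw_tok.encode(text))
```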
This PR
- `compare_tokenizers.py` script (based on the one in #44255): scans `tests/models/test_modeling_*.py` for `.from_pretrained(...)`, extracts all the checkpoint paths we list, and compares `_tokenizer`s loaded via `AutoTokenizer.from_pretrained` vs `TokenizersBackend.from_pretrained`. A sketch of the scanning step follows this list.
- Report on all the checkpoints we list: report
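
A minimal sketch of what the scanning-and-comparison step could look like (this is not the actual script from the PR; the regex, sample string, and report format are illustrative assumptions):

```python
import re
from pathlib import Path

from transformers import AutoTokenizer, TokenizersBackend  # v5 import path is an assumption

# Match string literals passed to .from_pretrained(...) in the modeling tests.
CHECKPOINT_RE = re.compile(r"\.from_pretrained\(\s*[\"']([^\"']+)[\"']")

def collect_checkpoints(tests_dir: str = "tests/models") -> set[str]:
    paths = set()
    for test_file in Path(tests_dir).glob("**/test_modeling_*.py"):
        paths.update(CHECKPOINT_RE.findall(test_file.read_text()))
    return paths

def tokenizers_match(path: str, sample: str = "Hello, world!") -> bool:
    auto_tok = AutoTokenizer.from_pretrained(path)
    raw_tok = TokenizersBackend.from_pretrained(path)
    # Comparing encodings on a sample string is a cheap first pass; the real
    # comparison would inspect the underlying `_tokenizer` objects.
    return auto_tok.encode(sample) == raw_tok.encode(sample)

if __name__ == "__main__":
    for path in sorted(collect_checkpoints()):
        try:
            status = "OK" if tokenizers_match(path) else "MISMATCH"
        except Exception as exc:  # some checkpoints may not ship a tokenizer.json
            status = f"ERROR: {exc}"
        print(f"{path}: {status}")
```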
TODO
- Add a test that checks that `AutoTokenizer` and `TokenizersBackend` load equivalent `_tokenizer` objects for each checkpoint path we mention in the repo.
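
One possible shape for that test, a sketch assuming pytest parametrization over the collected paths (`collect_checkpoints` is the hypothetical helper from the sketch above, and comparing via `to_str()` on the underlying `tokenizers` object is an assumption):

```python
import pytest
from transformers import AutoTokenizer, TokenizersBackend

# Hypothetical helper (see the sketch above) that collects every checkpoint
# path referenced in tests/models/test_modeling_*.py.
from compare_tokenizers import collect_checkpoints

@pytest.mark.parametrize("checkpoint", sorted(collect_checkpoints()))
def test_auto_matches_tokenizers_backend(checkpoint):
    auto_tok = AutoTokenizer.from_pretrained(checkpoint)
    raw_tok = TokenizersBackend.from_pretrained(checkpoint)
    # Compare the serialized underlying `tokenizers` objects rather than
    # object identity; to_str() dumps the tokenizer.json contents.
    assert auto_tok._tokenizer.to_str() == raw_tok._tokenizer.to_str()
```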