Stop align_special_tokens from rewriting eos_token_id when no alignment is needed #45917
Open
1fanwang wants to merge 1 commit into
Conversation
…ecial_tokens

When the tokenizer's EOS already matches what's in the model config and generation config, align_special_tokens still converted the scalar eos_token_id into a list as a side effect of the membership check it used to detect new EOS tokens.

That broke Whisper long-form generation: after Trainer.train() called align_special_tokens, generate() compared seek_sequence[-1] == generation_config.eos_token_id, which silently became scalar == list (always False), and on silence the sequence ended up empty because the pad-stripping branch no longer recognised pad == eos. The result was an IndexError out of generation_whisper.py.

Move the int -> list conversion inside the branch that actually adds a new EOS, and handle both shapes when merging. No behaviour change when the tokenizer brings a new EOS token; the only difference is that the scalar form is preserved when nothing needs aligning.

Fixes huggingface#45584
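At its core the breakage is plain Python equality: an int never compares equal to a list. A minimal, self-contained illustration (50257 is the id this PR's description uses for Whisper's shared pad/EOS token):

```python
pad_token_id = 50257
eos_token_id = 50257                   # scalar, as the generation config normally stores it
print(pad_token_id == eos_token_id)    # True  -> the "keep one trailing EOS" guard fires

eos_token_id = [50257]                 # after the unconditional int -> list conversion
print(pad_token_id == eos_token_id)    # False -> an int never equals a list in Python
```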
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45917&sha=da7192
tedo001 reviewed on May 12, 2026
if model.generation_config.eos_token_id is None:
tokenizer_has_new_eos |= tokenizer.eos_token_id != model.generation_config.eos_token_id
gen_eos = model.generation_config.eos_token_id
if gen_eos is None or isinstance(gen_eos, int):
What does this PR do?
`align_special_tokens` had a side effect: it converted `generation_config.eos_token_id` from a scalar `int` into a single-element list even when the tokenizer's EOS already matched the model's EOS and no alignment was needed. The conversion was buried inside the membership check used to detect new EOS tokens (`tokenizer.eos_token_id not in [...]`), so it ran on every call to `Trainer.train()` regardless of whether anything had to change.

Whisper's long-form generation expects `generation_config.eos_token_id` to be a scalar in the common case. In `generate_with_fallback` it compares `seek_sequence[-1] == generation_config.eos_token_id` directly and, when pad and EOS share an id, decrements `num_paddings` so that one trailing EOS survives padding removal. Once the side effect turned `eos_token_id` into `[50257]` (sketched below):

- the `pad == eos` branch became `50257 == [50257]` -> False, so the `num_paddings -= 1` that preserved a trailing EOS no longer ran and the entire sequence was stripped on a silent input;
- `seek_sequence` was empty and the trailing-EOS check raised `IndexError: index -1 is out of bounds for dimension 0 with size 0`.
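To make the failure concrete, here is a deliberately simplified sketch of that pad-stripping path; the helper name `strip_trailing_padding` is invented for the example and the real logic in `generation_whisper.py` differs in detail:

```python
import torch

def strip_trailing_padding(seek_sequence, pad_token_id, eos_token_id):
    # On a silent segment the model emits only pad/EOS tokens, so this counts them all.
    num_paddings = int((seek_sequence == pad_token_id).sum())
    if pad_token_id == eos_token_id:   # 50257 == [50257] is False after the side effect
        num_paddings -= 1              # keep one trailing EOS
    if num_paddings > 0:
        seek_sequence = seek_sequence[:-num_paddings]
    return seek_sequence

silence = torch.full((10,), 50257)                     # all pad/EOS, as produced on silence
ok = strip_trailing_padding(silence, 50257, 50257)     # keeps one token, ok[-1] is fine
bad = strip_trailing_padding(silence, 50257, [50257])  # empty tensor, bad[-1] would raise IndexError
```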
int -> listconversion inside the branch that actually adds a new EOS, and handles both shapes when merging the old and new EOS tokens. When the tokenizer brings a new EOS, behaviour is unchanged (the merged list still wins). When nothing needs to be aligned,eos_token_idis now left as-is.Fixes #45584.
Relationship to #45570
#45570 fixes the symptom in `generation_whisper.py` by normalising `generation_config.eos_token_id` to a list and using membership checks. That is also a reasonable change in its own right (Whisper should tolerate either shape). This PR is complementary: it stops `align_special_tokens` from producing the unnecessary list in the first place, which also avoids the same trap for any other generation path that compares `seek_sequence[-1]` against a scalar `eos_token_id`. I'm happy to defer to whichever the maintainers prefer; both can also coexist.
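For comparison, the #45570-style mitigation boils down to tolerating either shape at the point of comparison, roughly as follows (the helper name `ends_with_eos` is hypothetical, not the actual patch):

```python
def ends_with_eos(seek_sequence, eos_token_id):
    # Normalise scalar or list to a list before the membership check.
    eos_ids = [eos_token_id] if isinstance(eos_token_id, int) else list(eos_token_id or [])
    return len(seek_sequence) > 0 and int(seek_sequence[-1]) in eos_ids

assert ends_with_eos([50256, 50257], 50257)      # works with the scalar form
assert ends_with_eos([50256, 50257], [50257])    # and with the list form
assert not ends_with_eos([], 50257)              # an empty sequence no longer indexes [-1]
```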
Reproducer

The script from #45584 (Whisper + a no-op call to `align_special_tokens`) crashed with `IndexError` on `main`. After this change it returns `''` for silence, matching the pre-`align_special_tokens` behaviour.
Tests

Added `TrainerAlignSpecialTokensTest` in `tests/trainer/test_trainer.py` covering:

- `eos_token_id` is left as a scalar when the tokenizer's EOS already matches,
- an `eos_token_id` that already contains the tokenizer's EOS is left untouched.

The existing `TrainerIntegrationTest::test_special_token_alignment` still passes.

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. (Fixes Whisper generation fails on empty transcription after align_special_tokens #45584.)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@SunMarc @eustlb