
Stop align_special_tokens from rewriting eos_token_id when no alignment is needed #45917

Open

1fanwang wants to merge 1 commit into huggingface:main from 1fanwang:fix/align-eos-side-effect

Conversation

@1fanwang

What does this PR do?

align_special_tokens had a side effect: it converted generation_config.eos_token_id from a scalar int into a single-element list even when the tokenizer's EOS already matched the model's EOS and no alignment was needed. The conversion was buried inside the membership check used to detect new EOS tokens (tokenizer.eos_token_id not in [...]), so it ran on every call to Trainer.train() regardless of whether anything had to change.
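
In sketch form, the old control flow looked roughly like this (simplified names, not the verbatim source):

# Old behaviour (sketch): the scalar -> list conversion happens while
# building the container for the membership check, so it runs and is
# written back on every call, aligned or not.
gen_eos = model.generation_config.eos_token_id
if isinstance(gen_eos, int):
    gen_eos = [gen_eos]
    model.generation_config.eos_token_id = gen_eos  # side effect: int -> [int]
if gen_eos is None or tokenizer.eos_token_id not in gen_eos:
    model.generation_config.eos_token_id = (gen_eos or []) + [tokenizer.eos_token_id]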

Whisper's long-form generation expects generation_config.eos_token_id to be a scalar in the common case. In generate_with_fallback it does

if generation_config.pad_token_id == generation_config.eos_token_id:
    num_paddings -= 1
...
if seek_sequence[-1] == generation_config.eos_token_id:
    seek_sequence = seek_sequence[:-1]

Once the side effect turned eos_token_id into [50257] (a toy reproduction follows this list):

  1. The pad == eos branch became 50257 == [50257] -> False, so the num_paddings -= 1 that preserved a trailing EOS no longer ran and the entire sequence was stripped on a silent input.
  2. After stripping, seek_sequence was empty and the trailing-EOS check raised IndexError: index -1 is out of bounds for dimension 0 with size 0.
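
Both failure points can be reproduced in isolation with a toy snippet (illustration only, not Whisper code):

import torch

eos = [50257]  # what align_special_tokens left behind
pad = 50257

# Failure 1: the pad/eos identity check no longer fires, so the
# num_paddings -= 1 that would keep one trailing EOS is skipped.
print(pad == eos)  # False

# Failure 2: with every token stripped as padding, the sequence is empty
# and indexing its last element raises the reported error.
seek_sequence = torch.tensor([], dtype=torch.long)
seek_sequence[-1]  # IndexError: index -1 is out of bounds for dimension 0 with size 0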

The fix moves the int -> list conversion inside the branch that actually adds a new EOS, and handles both shapes when merging the old and new EOS tokens. When the tokenizer brings a new EOS, behaviour is unchanged (the merged list still wins). When nothing needs to be aligned, eos_token_id is now left as-is.
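
In sketch form, the new flow only materialises a list when there is actually a new EOS to merge (again simplified names, not the exact diff):

# New behaviour (sketch): build the comparison list locally and only write
# a merged list back when the tokenizer really brings a new EOS.
gen_eos = model.generation_config.eos_token_id
current = gen_eos if isinstance(gen_eos, list) else ([] if gen_eos is None else [gen_eos])
if tokenizer.eos_token_id is not None and tokenizer.eos_token_id not in current:
    model.generation_config.eos_token_id = current + [tokenizer.eos_token_id]
# Otherwise eos_token_id is left untouched: a scalar stays a scalar.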

Fixes #45584.

Relationship to #45570

#45570 fixes the symptom in generation_whisper.py by normalising generation_config.eos_token_id to a list and using membership checks. That is also a reasonable change in its own right (Whisper should tolerate either shape). This PR is complementary: it stops align_special_tokens from producing the unnecessary list in the first place, which also avoids the same trap for any other generation path that compares seek_sequence[-1] against a scalar eos_token_id. I'm happy to defer to whichever the maintainers prefer; both can also coexist.

Reproducer

The script from #45584 (Whisper + a no-op call to align_special_tokens) crashed with IndexError on main. After this change it returns '' for silence, matching the pre-align_special_tokens behaviour.
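
For reference, a reproducer along these lines (the checkpoint, the 40 s of silence, and the direct emulation of the side effect are assumptions, not the exact script from the issue):

import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Emulate the side effect of the no-op align_special_tokens call on main:
# the scalar EOS becomes a one-element list.
model.generation_config.eos_token_id = [model.generation_config.eos_token_id]

# More than 30 s of audio forces the long-form generate_with_fallback path.
silence = np.zeros(16000 * 40, dtype=np.float32)
inputs = processor(silence, sampling_rate=16000, return_tensors="pt",
                   truncation=False, padding="longest", return_attention_mask=True)

# On main this raises IndexError from generation_whisper.py; with a scalar
# eos_token_id (i.e. with this fix) it decodes to an empty transcription.
pred_ids = model.generate(inputs.input_features, attention_mask=inputs.attention_mask)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))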

Tests

Added TrainerAlignSpecialTokensTest in tests/trainer/test_trainer.py covering (a condensed sketch of the first case follows the list):

  • a scalar eos_token_id is left as a scalar when the tokenizer's EOS already matches,
  • a new tokenizer EOS is merged into a list alongside the existing one,
  • a pre-existing list eos_token_id that already contains the tokenizer's EOS is left untouched.
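
As a condensed sketch, the first case boils down to the following; the tiny checkpoint and the direct call are assumptions for illustration, the real test goes through the Trainer plumbing:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")

# Precondition: the tokenizer's EOS already matches the model's scalar EOS.
model.generation_config.eos_token_id = tokenizer.eos_token_id
assert isinstance(model.generation_config.eos_token_id, int)

align_special_tokens(model, tokenizer)  # stand-in for the Trainer entry point

# The scalar must survive: no silent int -> [int] conversion.
assert isinstance(model.generation_config.eos_token_id, int)
assert model.generation_config.eos_token_id == tokenizer.eos_token_id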

The existing TrainerIntegrationTest::test_special_token_alignment still passes.

Before submitting

Who can review?

@SunMarc @eustlb

…ecial_tokens

When the tokenizer's EOS already matches what's in the model config and
generation config, align_special_tokens still converted the scalar
eos_token_id into a list as a side effect of the membership check it used
to detect new EOS tokens. That broke Whisper long-form generation: after
Trainer.train() called align_special_tokens, generate() compared
seek_sequence[-1] == generation_config.eos_token_id, which silently became
scalar == list (always False), and on silence the sequence ended up empty
because the pad-stripping branch no longer recognised pad == eos. The
result was an IndexError out of generation_whisper.py.

Move the int -> list conversion inside the branch that actually adds a new
EOS, and handle both shapes when merging. No behaviour change when the
tokenizer brings a new EOS token; the only difference is that the scalar
form is preserved when nothing needs aligning.

Fixes huggingface#45584
@github-actions

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45917&sha=da7192

if model.generation_config.eos_token_id is None:
    tokenizer_has_new_eos |= tokenizer.eos_token_id != model.generation_config.eos_token_id
gen_eos = model.generation_config.eos_token_id
if gen_eos is None or isinstance(gen_eos, int):
sir pls add a command !!!!



Development

Successfully merging this pull request may close these issues.

Whisper generation fails on empty transcription after align_special_tokens
