
Stop align_special_tokens from rewriting eos_token_id when no alignment is needed #45917

Open

1fanwang wants to merge 1 commit into huggingface:main from 1fanwang:fix/align-eos-side-effect

Conversation

@1fanwang

What does this PR do?

align_special_tokens had a side effect: it converted generation_config.eos_token_id from a scalar int into a single-element list even when the tokenizer's EOS already matched the model's EOS and no alignment was needed. The conversion was buried inside the membership check used to detect new EOS tokens (tokenizer.eos_token_id not in [...]), so it ran on every call to Trainer.train() regardless of whether anything had to change.
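
In sketch form, the old control flow looked roughly like this (simplified names, not the verbatim source):

# Old behaviour (sketch): the scalar -> list conversion happens while
# building the container for the membership check, so it runs and is
# written back on every call, aligned or not.
gen_eos = model.generation_config.eos_token_id
if isinstance(gen_eos, int):
    gen_eos = [gen_eos]
    model.generation_config.eos_token_id = gen_eos  # side effect: int -> [int]
if gen_eos is None or tokenizer.eos_token_id not in gen_eos:
    model.generation_config.eos_token_id = (gen_eos or []) + [tokenizer.eos_token_id]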

Whisper's long-form generation expects generation_config.eos_token_id to be a scalar in the common case. In generate_with_fallback it does

if generation_config.pad_token_id == generation_config.eos_token_id:
    num_paddings -= 1
...
if seek_sequence[-1] == generation_config.eos_token_id:
    seek_sequence = seek_sequence[:-1]

Once the side effect turned eos_token_id into [50257] (a toy reproduction follows this list):

  1. The pad == eos branch became 50257 == [50257] -> False, so the num_paddings -= 1 that preserved a trailing EOS no longer ran and the entire sequence was stripped on a silent input.
  2. After stripping, seek_sequence was empty and the trailing-EOS check raised IndexError: index -1 is out of bounds for dimension 0 with size 0.
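
Both failure points can be reproduced in isolation with a toy snippet (illustration only, not Whisper code):

import torch

eos = [50257]  # what align_special_tokens left behind
pad = 50257

# Failure 1: the pad/eos identity check no longer fires, so the
# num_paddings -= 1 that would keep one trailing EOS is skipped.
print(pad == eos)  # False

# Failure 2: with every token stripped as padding, the sequence is empty
# and indexing its last element raises the reported error.
seek_sequence = torch.tensor([], dtype=torch.long)
seek_sequence[-1]  # IndexError: index -1 is out of bounds for dimension 0 with size 0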

The fix moves the int -> list conversion inside the branch that actually adds a new EOS, and handles both shapes when merging the old and new EOS tokens. When the tokenizer brings a new EOS, behaviour is unchanged (the merged list still wins). When nothing needs to be aligned, eos_token_id is now left as-is.
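
In sketch form, the new flow only materialises a list when there is actually a new EOS to merge (again simplified names, not the exact diff):

# New behaviour (sketch): build the comparison list locally and only write
# a merged list back when the tokenizer really brings a new EOS.
gen_eos = model.generation_config.eos_token_id
current = gen_eos if isinstance(gen_eos, list) else ([] if gen_eos is None else [gen_eos])
if tokenizer.eos_token_id is not None and tokenizer.eos_token_id not in current:
    model.generation_config.eos_token_id = current + [tokenizer.eos_token_id]
# Otherwise eos_token_id is left untouched: a scalar stays a scalar.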

Fixes #45584.

Relationship to #45570

#45570 fixes the symptom in generation_whisper.py by normalising generation_config.eos_token_id to a list and using membership checks. That is also a reasonable change in its own right (Whisper should tolerate either shape). This PR is complementary: it stops align_special_tokens from producing the unnecessary list in the first place, which also avoids the same trap for any other generation path that compares seek_sequence[-1] against a scalar eos_token_id. I'm happy to defer to whichever the maintainers prefer; both can also coexist.

Reproducer

The script from #45584 (Whisper + a no-op call to align_special_tokens) crashed with IndexError on main. After this change it returns '' for silence, matching the pre-align_special_tokens behaviour.
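
For reference, a reproducer along these lines (the checkpoint, the 40 s of silence, and the direct emulation of the side effect are assumptions, not the exact script from the issue):

import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Emulate the side effect of the no-op align_special_tokens call on main:
# the scalar EOS becomes a one-element list.
model.generation_config.eos_token_id = [model.generation_config.eos_token_id]

# More than 30 s of audio forces the long-form generate_with_fallback path.
silence = np.zeros(16000 * 40, dtype=np.float32)
inputs = processor(silence, sampling_rate=16000, return_tensors="pt",
                   truncation=False, padding="longest", return_attention_mask=True)

# On main this raises IndexError from generation_whisper.py; with a scalar
# eos_token_id (i.e. with this fix) it decodes to an empty transcription.
pred_ids = model.generate(inputs.input_features, attention_mask=inputs.attention_mask)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))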

Tests

Added TrainerAlignSpecialTokensTest in tests/trainer/test_trainer.py covering (a condensed sketch of the first case follows the list):

  • a scalar eos_token_id is left as a scalar when the tokenizer's EOS already matches,
  • a new tokenizer EOS is merged into a list alongside the existing one,
  • a pre-existing list eos_token_id that already contains the tokenizer's EOS is left untouched.
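
As a condensed sketch, the first case boils down to the following; the tiny checkpoint and the direct call are assumptions for illustration, the real test goes through the Trainer plumbing:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2")

# Precondition: the tokenizer's EOS already matches the model's scalar EOS.
model.generation_config.eos_token_id = tokenizer.eos_token_id
assert isinstance(model.generation_config.eos_token_id, int)

align_special_tokens(model, tokenizer)  # stand-in for the Trainer entry point

# The scalar must survive: no silent int -> [int] conversion.
assert isinstance(model.generation_config.eos_token_id, int)
assert model.generation_config.eos_token_id == tokenizer.eos_token_id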

The existing TrainerIntegrationTest::test_special_token_alignment still passes.

Before submitting

Who can review?

@SunMarc @eustlb

…ecial_tokens

When the tokenizer's EOS already matches what's in the model config and
generation config, align_special_tokens still converted the scalar
eos_token_id into a list as a side effect of the membership check it used
to detect new EOS tokens. That broke Whisper long-form generation: after
Trainer.train() called align_special_tokens, generate() compared
seek_sequence[-1] == generation_config.eos_token_id, which silently became
scalar == list (always False), and on silence the sequence ended up empty
because the pad-stripping branch no longer recognised pad == eos. The
result was an IndexError out of generation_whisper.py.

Move the int -> list conversion inside the branch that actually adds a new
EOS, and handle both shapes when merging. No behaviour change when the
tokenizer brings a new EOS token; the only difference is that the scalar
form is preserved when nothing needs aligning.

Fixes huggingface#45584
@github-actions

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45917&sha=da7192

if model.generation_config.eos_token_id is None:
    tokenizer_has_new_eos |= tokenizer.eos_token_id != model.generation_config.eos_token_id
gen_eos = model.generation_config.eos_token_id
if gen_eos is None or isinstance(gen_eos, int):
sir pls add a command !!!!



Development

Successfully merging this pull request may close these issues.

Whisper generation fails on empty transcription after align_special_tokens
