Fix deepseek v4 by ArthurZucker · Pull Request #45892 · huggingface/transformers

ArthurZucker · 2026-05-11T10:45:10Z

What does this PR do?

Attention-mask layout per layer type

Tiny-config visualization (sliding_window=8, CSA m=4, HCA m'=8, index_topk=2, S=16) of the actual cat([sliding_mask, block_bias]) that each DeepseekV4Attention layer feeds to eager_attention_forward. Green cells are attended to, dim slate cells are masked, and red cells are causally available but were not picked by the indexer's top-k. Wide green blocks in the compressor section each bundle m source positions into one KV slot; dashed separators inside a block show which tokens were compressed together.

Sliding-only — plain sliding-window-causal [S, S], no compressor section:
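
As a rough illustration of the sliding-only layout, here is a minimal pure-Python sketch that builds a boolean [S, S] mask for the tiny config (S=16, sliding_window=8). The function names are hypothetical, and the window convention (a query sees itself plus the 7 previous positions) is an assumption rather than something read from the PR's code.

```python
def sliding_window_causal_mask(seq_len, window):
    """Return a [seq_len][seq_len] boolean mask; True means query i may attend to key j.

    Assumed convention: key j is visible when j <= i (causal) and i - j < window.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw the mask as text: '#' for attended cells, '.' for masked cells."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


mask = sliding_window_causal_mask(seq_len=16, window=8)
print(render(mask))
```

Each printed row is one query position; the band of `#` cells grows to 8 wide and then slides along the diagonal.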

CSA — sliding KV + entry-view of the Lightning Indexer's top-k picks; red cells = available but not picked:
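
The three cell states in the CSA block section (picked, available but not picked, masked) can be sketched as follows. This is only an illustration under stated assumptions: compressed entry w is taken to cover source positions w*m through (w+1)*m-1 and to become visible once its last position is at or before the query, and the indexer scores are random placeholders rather than real Lightning Indexer outputs. The function name is hypothetical.

```python
import random


def csa_block_states(seq_len, m, topk, scores):
    """Classify each (query, compressed entry) cell of the block section.

    scores[i][w] is a placeholder indexer score of entry w for query i.
    """
    n_entries = seq_len // m
    states = []
    for i in range(seq_len):
        # Assumption: an entry is causally available once its whole block is in the past.
        available = [w for w in range(n_entries) if (w + 1) * m - 1 <= i]
        # Keep the topk best-scoring available entries for this query.
        ranked = sorted(available, key=lambda w: scores[i][w], reverse=True)
        picked = set(ranked[:topk])
        row = []
        for w in range(n_entries):
            if w in picked:
                row.append("picked")    # green in the figure
            elif w in available:
                row.append("skipped")   # red: available but not in the top-k
            else:
                row.append("masked")    # dim slate: not yet causally available
        states.append(row)
    return states


random.seed(0)
S, M, TOPK = 16, 4, 2
scores = [[random.random() for _ in range(S // M)] for _ in range(S)]
states = csa_block_states(S, M, TOPK, scores)
```

With these settings the last query has four available entries, so exactly two are picked and two are skipped; early queries have fewer than index_topk entries available and pick all of them.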

HCA — sliding KV + every compressed entry (no indexer); each C_w summarises m' source positions:
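
To make the concatenated layout concrete, the sketch below builds a boolean stand-in for cat([sliding_mask, block_bias]) in the HCA case with S=16, sliding_window=8 and m'=8. It uses booleans instead of an additive bias, and it assumes that compressed entry w covers positions w*m' through (w+1)*m'-1 and is visible to every query at or after the block's last position. Both points are assumptions for illustration; the helper names are hypothetical.

```python
def sliding_mask(seq_len, window):
    """Sliding-window-causal part: query i sees keys j with j <= i and i - j < window."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def hca_block_mask(seq_len, m_prime):
    """Compressor part: every completed compressed entry is visible (no indexer)."""
    n_entries = seq_len // m_prime
    return [
        [(w + 1) * m_prime - 1 <= i for w in range(n_entries)]
        for i in range(seq_len)
    ]


def concat_masks(left, right):
    """Concatenate two masks along the key axis, row by row."""
    return [l_row + r_row for l_row, r_row in zip(left, right)]


S = 16
full = concat_masks(sliding_mask(S, 8), hca_block_mask(S, 8))
print(len(full), len(full[0]))  # 16 query rows, 16 sliding keys + 2 compressed entries
```

The result has shape [S, S + S/m'], so each query row carries its sliding-window keys followed by one column per compressed entry.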

Reproduces with python docs/source/en/imgs/deepseek_v4/visualize_attention_masks.py --svg docs/source/en/imgs/deepseek_v4 from the repo root.


HuggingFaceDocBuilderDev · 2026-05-11T10:57:05Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker · 2026-05-12T00:28:28Z

Hey! I don't get how you want to run sdpa when it does not have attention sinks baked into it. You won't ever get good results.
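
To illustrate the concern, here is a minimal sketch of a softmax with an attention sink, using the common formulation in which a per-head sink logit joins the softmax denominator but contributes no value vector. Whether DeepSeek V4 uses exactly this formulation is an assumption; the function names are hypothetical. A fused kernel that only takes queries, keys, values and a mask has no slot for the extra term, so it would compute the plain softmax instead.

```python
import math


def plain_softmax(logits):
    """Standard softmax over the key logits (what a sink-free kernel computes)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_with_sink(logits, sink_logit):
    """Softmax whose denominator also includes exp(sink_logit).

    The sink absorbs probability mass but has no value vector, so the
    returned weights over real keys sum to less than 1.
    """
    peak = max(max(logits), sink_logit)
    exps = [math.exp(x - peak) for x in logits]
    denom = sum(exps) + math.exp(sink_logit - peak)
    return [e / denom for e in exps]


logits = [2.0, 1.0, 0.5]
with_sink = softmax_with_sink(logits, sink_logit=3.0)
without_sink = plain_softmax(logits)
print(sum(with_sink), sum(without_sink))
```

Every weight shrinks when the sink is present, so a model trained with sinks sees systematically different attention outputs if the sink term is dropped at inference time.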

godofgithub · 2026-05-12T07:12:53Z

ArthurZucker Thanks for your hard work.

ArthurZucker · 2026-05-12T07:16:01Z

Sorry everyone I trusted Claude too much

godofgithub · 2026-05-12T07:26:44Z

> Sorry everyone I trusted Claude too much

It's only human. Keep going!

Cyrilvallez

Alright, I don't know the specifics of deepseek v4, so I just reviewed the general parts, not the exact mathematics.
I'm mostly a bit worried about all the dtype upcasting everywhere; it's quite expensive in general, both for speed and memory, and I believe some of the upcasts are actually unnecessary. When the upcast is performed on weights directly, let's use keep_in_fp32_strict to always keep those weights in fp32 instead of upcasting them on every forward pass!

github-actions · 2026-05-12T07:59:32Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v4

github-actions · 2026-05-12T08:04:42Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45892&sha=acce5a

ArthurZucker added 2 commits May 11, 2026 19:43

up

7d47042

Merge branch 'main' of github.com:huggingface/transformers into fix-d…

8bdbfbb

…eepseek-v4-csa-per-query-mask

Sawyer117 added a commit to Sawyer117/transformers that referenced this pull request May 11, 2026

bench: relabel arthur -> huggingface#45892; rename file

e6e4e87

ArthurZucker added 5 commits May 12, 2026 09:47

small updates

b9688d8

a mistery how this got through

0260527

updates

b455143

update

24ba03a

final fixes?

8b3dbd7

ArthurZucker marked this pull request as ready for review May 12, 2026 04:58

ArthurZucker added 5 commits May 11, 2026 22:52

update

f3af33c

up

11ae644

yup

c889726

last nit!

6f8a77b

up

d936ffa

update

61c69b5

Cyrilvallez reviewed May 12, 2026

ArthurZucker added 2 commits May 12, 2026 00:54

revert shitty AI work

72e00bb

up

acce5a1

yyups

679d215

ArthurZucker wants to merge 16 commits into main from fix-deepseek-v4-csa-per-query-mask


5 participants
