Fix deepseek v4 #45892

Open

ArthurZucker wants to merge 16 commits into main from fix-deepseek-v4-csa-per-query-mask

Fix deepseek v4#45892
ArthurZucker wants to merge 16 commits into
mainfrom
fix-deepseek-v4-csa-per-query-mask

Conversation

@ArthurZucker (Collaborator) commented May 11, 2026

What does this PR do?

Attention-mask layout per layer type

Tiny-config visualization (sliding_window=8, CSA m=4, HCA m'=8, index_topk=2, S=16) of the actual cat([sliding_mask, block_bias]) each DeepseekV4Attention layer feeds to eager_attention_forward. Green cells = attended-to, dim slate = masked, red = causally available but the indexer's top-k didn't pick. Wide green blocks in the compressor section bundle m source positions into one KV slot; dashed separators inside the block show which tokens were compressed together.

Sliding-only — plain sliding-window-causal [S, S], no compressor section:

sliding mask

CSA — sliding KV + entry-view of the Lightning Indexer's top-k picks; red cells = available but not picked:

CSA mask

HCA — sliding KV + every compressed entry (no indexer); each C_w summarises m' source positions:

HCA mask

Reproduce with `python docs/source/en/imgs/deepseek_v4/visualize_attention_masks.py --svg docs/source/en/imgs/deepseek_v4` from the repo root.
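For reviewers who want a feel for the layout without running the script, here is a minimal, self-contained sketch of the `cat([sliding_mask, block_bias])` shape described above. This is illustrative only (the function name and the random stand-in for the Lightning Indexer's scores are my own, not the PR's code): a sliding-window-causal part over the `S` raw positions, concatenated with a bias over the `S // m` compressed entries; with `index_topk` set, only the top-k visible blocks per query stay open (the CSA case).

```python
# Illustrative sketch, NOT the PR's implementation: build the additive mask
# cat([sliding_mask, block_bias]) of shape [S, S + S // m].
import torch

def build_layer_mask(S=16, sliding_window=8, m=4, index_topk=None):
    pos = torch.arange(S)
    q, k = pos[:, None], pos[None, :]

    # Sliding-window-causal over raw positions: query q attends to keys
    # in the half-open window (q - sliding_window, q].
    sliding = (k <= q) & (k > q - sliding_window)
    sliding_mask = torch.zeros(S, S).masked_fill(~sliding, float("-inf"))

    # Compressed entries: block b bundles source positions [b*m, (b+1)*m).
    # A query may see block b only once the whole block is causally past.
    n_blocks = S // m
    blocks = torch.arange(n_blocks)
    block_visible = (blocks[None, :] + 1) * m - 1 <= q
    block_bias = torch.zeros(S, n_blocks).masked_fill(~block_visible, float("-inf"))

    if index_topk is not None:
        # CSA: keep only the indexer's top-k blocks per query. Random scores
        # stand in for the Lightning Indexer's logits here.
        scores = torch.rand(S, n_blocks).masked_fill(~block_visible, float("-inf"))
        picked = scores.topk(min(index_topk, n_blocks), dim=-1).indices
        keep = torch.zeros_like(block_visible)
        keep.scatter_(1, picked, True)
        block_bias = torch.zeros(S, n_blocks).masked_fill(
            ~(block_visible & keep), float("-inf")
        )

    return torch.cat([sliding_mask, block_bias], dim=-1)

mask = build_layer_mask()
print(mask.shape)  # torch.Size([16, 20])
```

With the tiny config above (`S=16`, `m=4`), the first 16 key columns are the sliding section and the last 4 are the compressor section; passing `index_topk=2` reproduces the CSA picture where at most two blocks per query row stay green.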

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sawyer117 added a commit to Sawyer117/transformers that referenced this pull request May 11, 2026
@Sawyer117

This comment was marked as off-topic.

@Sawyer117

This comment was marked as off-topic.

@ArthurZucker
Collaborator Author

Hey! I don't get how you want to run sdpa when it doesn't have attention sinks baked into it. You won't ever get good results.

@ArthurZucker ArthurZucker marked this pull request as ready for review May 12, 2026 04:58
@godofgithub

@ArthurZucker Thanks for your hard work.

@ArthurZucker
Collaborator Author

Sorry everyone I trusted Claude too much

@godofgithub

> Sorry everyone I trusted Claude too much

It's only human. Keep going!

@Cyrilvallez (Member) left a comment


Alright, I don't know the specifics of deepseek v4, so I just reviewed the general parts, not the exact mathematics.
I'm mostly a bit worried about all the dtype upcasting everywhere; it's quite expensive in general, both for speed and memory, and I believe some of it is actually useless. When it's performed on weights directly, let's use `keep_in_fp32_strict` to always keep the weights in fp32 instead of upcasting them every forward!
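To illustrate the reviewer's point (module names here are illustrative; `keep_in_fp32_strict` itself is the transformers mechanism the reviewer mentions, and its exact usage is not shown): upcasting a weight inside `forward` re-materializes an fp32 copy of it on every call, whereas storing the weight in fp32 once pays that cost at load time and only the activations are cast per step.

```python
# Illustrative contrast, not the PR's code: per-forward weight upcast vs.
# keeping the weight in fp32 permanently.
import torch
import torch.nn as nn

class UpcastEveryForward(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim, dtype=torch.bfloat16))

    def forward(self, x):
        # Allocates a fresh fp32 copy of the weight on every call.
        return x.float() @ self.weight.float()

class KeepInFp32(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Weight lives in fp32 from the start; only activations are cast.
        self.weight = nn.Parameter(torch.randn(dim, dim, dtype=torch.float32))

    def forward(self, x):
        return x.float() @ self.weight
```

Both produce fp32 outputs, but the first burns an extra weight-sized allocation and cast on every forward pass, which is what the review is flagging.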

Review comment threads on src/transformers/models/deepseek_v4/modular_deepseek_v4.py (5 outdated, 1 open).
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v4

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45892&sha=acce5a



5 participants