Fix deepseek v4 #45892
…eepseek-v4-csa-per-query-mask
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
This comment was marked as off-topic.
|
Hey! I don't get how you want to run SDPA when it does not have attention sinks baked into it. You won't ever get good results.
|
ArthurZucker Thanks for your hard work. |
|
Sorry everyone, I trusted Claude too much.
It's only human. Keep going!
Cyrilvallez left a comment
Alright, I don't know the specifics of DeepSeek V4, so I only reviewed the general parts, not the exact mathematics.
I'm mostly worried about the dtype upcasting everywhere: it's expensive in general, for both speed and memory, and I believe some of the upcasts are actually useless. Where it's performed on weights directly, let's use keep_in_fp32_strict to always keep those weights in fp32 instead of upcasting them on every forward!
|
[For maintainers] Suggested jobs to run (before merge) run-slow: deepseek_v4 |
|
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45892&sha=acce5a |
What does this PR do?
Attention-mask layout per layer type
Tiny-config visualization (`sliding_window=8`, CSA `m=4`, HCA `m'=8`, `index_topk=2`, `S=16`) of the actual `cat([sliding_mask, block_bias])` each `DeepseekV4Attention` layer feeds to `eager_attention_forward`. Green cells = attended-to, dim slate = masked, red = causally available but not picked by the indexer's top-k. Wide green blocks in the compressor section bundle `m` source positions into one KV slot; dashed separators inside the block show which tokens were compressed together.

The three panels show:

Sliding-only: plain sliding-window-causal `[S, S]` mask, no compressor section.

CSA: sliding KV + entry-view of the Lightning Indexer's top-`k` picks; red cells = available but not picked.

HCA: sliding KV + every compressed entry (no indexer); each `C_w` summarises `m'` source positions.

Reproduce with `python docs/source/en/imgs/deepseek_v4/visualize_attention_masks.py --svg docs/source/en/imgs/deepseek_v4` from the repo root.
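For readers who want to reason about the three layouts without running the visualization script, here is a minimal pure-Python sketch that rebuilds them for the same tiny config. It is not the code from this PR: the function names are hypothetical, the masks are booleans rather than additive biases, the rule that a compressed entry becomes visible only once its last source position is at or before the query is an assumption, and the indexer scores are placeholder values (later blocks score higher) standing in for the Lightning Indexer.

```python
# Illustrative sketch only. Function names, the visibility rule, and the
# scores are assumptions, not taken from the DeepseekV4 implementation.

S, SLIDING_WINDOW, CSA_M, HCA_M, INDEX_TOPK = 16, 8, 4, 8, 2


def sliding_causal_mask(seq_len, window):
    """True where query i may attend key j: j <= i and i - j < window."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def compressed_block_mask(seq_len, block):
    """True where query i may attend compressed entry w.

    Assumption: entry w summarises positions [w*block, (w+1)*block) and is
    visible only once its last source position is <= i.
    """
    n_entries = seq_len // block
    return [[(w + 1) * block - 1 <= i for w in range(n_entries)]
            for i in range(seq_len)]


def topk_block_mask(available, scores, k):
    """Keep at most k available entries per query, ranked by score."""
    picked = []
    for avail_row, score_row in zip(available, scores):
        candidates = [w for w, ok in enumerate(avail_row) if ok]
        ranked = sorted(candidates, key=lambda w: score_row[w], reverse=True)
        best = set(ranked[:k])
        picked.append([w in best for w in range(len(avail_row))])
    return picked


def concat_masks(sliding, block):
    """Row-wise concatenation, mirroring cat([sliding_mask, block_bias])."""
    return [s_row + b_row for s_row, b_row in zip(sliding, block)]


# Sliding-only layer: plain [S, S] mask.
sliding = sliding_causal_mask(S, SLIDING_WINDOW)

# HCA layer: every causally available compressed entry is attended.
hca_block = compressed_block_mask(S, HCA_M)
hca_full = concat_masks(sliding, hca_block)

# CSA layer: only the top-k available entries are attended.
csa_available = compressed_block_mask(S, CSA_M)
# Placeholder scores: entry w gets score w, so later blocks rank higher.
placeholder_scores = [[float(w) for w in range(S // CSA_M)] for _ in range(S)]
csa_block = topk_block_mask(csa_available, placeholder_scores, INDEX_TOPK)
csa_full = concat_masks(sliding, csa_block)

if __name__ == "__main__":
    for row in csa_full:
        print("".join("#" if cell else "." for cell in row))
```

Entries that are `True` in `csa_available` but `False` in `csa_block` correspond to the red cells described above: causally available, but not selected by the top-k step.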