
Add Optional Attention Residuals (AttnRes) Integration for Unsloth Fast Paths#4863

Closed
kleeedolinux wants to merge 11 commits into unslothai:main from kleeedolinux:main

Conversation

@kleeedolinux

1. Why this PR exists

Unsloth fast paths currently use standard residual updates around attention blocks, e.g.:

$$x_{l+1} = x_l + f_l(x_l)$$

This is stable and efficient, but all prior-layer information is mixed with fixed weights (the direct residual stream carries an implicit weight of 1 at each step). The AttnRes idea is to make residual mixing content-adaptive while preserving the same forward structure and default behavior.

Reference paper:

This PR introduces an opt-in, low-risk integration that keeps Unsloth architecture and coding style intact.

2. High-level design

The integration goals are:

  • Do not break existing kernels / fast paths.
  • Keep default behavior identical when AttnRes is disabled.
  • Touch only attention-branch outputs before existing residual merges.

Implemented approach:

  1. Build lightweight AttnRes state per forward pass.
  2. Track block-local states and block summaries.
  3. Before residual add, compute an AttnRes contribution from prior states.
  4. Add that contribution to the attention output, then continue existing residual code.

So if the current attention output is $a_l$, the transformed output is:

$$\tilde{a}_l = a_l + \alpha \cdot \sum_i w_{l,i} v_i$$

with weights normalized over the candidate index $i$:

$$w_{l,i} = \mathrm{softmax}_i\left(\frac{q_l \cdot k_i}{\sqrt{d}}\right)$$

where, in this practical implementation:

  • query $q_l$: derived from the current residual/query tensor,
  • keys/values $(k_i, v_i)$: prior block summaries plus in-block previous states,
  • $\alpha$: configurable residual-mixing scale.
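The weighted mix above can be sketched as a standalone function. This is an illustrative sketch, not the PR's actual API; the function name and the assumption of a single $(d,)$-shaped query vector are mine:

```python
import torch

def attnres_mix(q: torch.Tensor, keys: torch.Tensor, values: torch.Tensor,
                alpha: float) -> torch.Tensor:
    """Content-adaptive residual contribution: alpha * sum_i w_{l,i} v_i.

    q:      (d,)   query derived from the current residual stream
    keys:   (n, d) candidate keys (block summaries + in-block states)
    values: (n, d) matching values
    """
    d = q.shape[-1]
    scores = keys @ q / d ** 0.5        # (n,) scaled dot products q_l . k_i / sqrt(d)
    w = torch.softmax(scores, dim=0)    # normalize over the candidate index i
    return alpha * (w.unsqueeze(-1) * values).sum(dim=0)
```

With a zero query the scores are uniform, so the contribution reduces to `alpha` times the mean of the values, which makes the scaling behavior easy to sanity-check.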

3. Mathematical mapping to implementation

The paper discusses full and block residual attention. This PR implements a practical block-style variant:

  • Maintain two sets:
    • completed block summaries $S_1, \dots, S_{m-1}$
    • in-progress states in the current block $h_{m,1}, \dots, h_{m,t-1}$
  • Candidate set for layer $l$:
$$\mathcal{C}_l = \{S_1, \dots, S_{m-1}\} \cup \{h_{m,1}, \dots, h_{m,t-1}\}$$
  • Residual attention mix:
$$r_l = \sum_{c \in \mathcal{C}_l} \mathrm{softmax}\left(\frac{q_l \cdot c}{\sqrt{d}}\right) c$$
  • Output transform:
$$\tilde{a}_l = a_l + \alpha r_l$$
  • Block summary update at boundary:
$$S_m = \sum_{j \in B_m} h_j$$
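The candidate-set bookkeeping and the boundary update $S_m = \sum_{j \in B_m} h_j$ can be sketched as a small state object (class and method names are hypothetical, not the PR's actual code):

```python
import torch

class AttnResBlockState:
    """Hypothetical sketch of the block-style candidate bookkeeping."""

    def __init__(self):
        self.summaries = []   # S_1 .. S_{m-1}: completed block summaries
        self.in_block = []    # h_{m,1} .. h_{m,t-1}: current-block states

    def candidates(self):
        # C_l = {S_1, ..., S_{m-1}} ∪ {h_{m,1}, ..., h_{m,t-1}}
        return self.summaries + self.in_block

    def push_state(self, h: torch.Tensor):
        # Clone before storing so later in-place ops on h cannot
        # silently corrupt the stored history.
        self.in_block.append(h.detach().clone())

    def close_block(self):
        # S_m = sum_{j in B_m} h_j, then start a fresh block.
        if self.in_block:
            self.summaries.append(torch.stack(self.in_block).sum(dim=0))
            self.in_block = []
```

The `detach().clone()` in `push_state` reflects the kind of in-place-mutation hazard flagged in the automated review further down.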

4. What was changed

New files

  • unsloth/models/attnres.py
    • forward-state init
    • block summary accumulation
    • attention-output transform
  • unsloth/utils/attnres.py
    • utility-level AttnRes state/config/hook helpers

Updated files

  • unsloth/utils/attention_dispatch.py
    • optional AttnRes-aware post-attention finalization hook path
  • unsloth/utils/__init__.py
  • unsloth/kernels/__init__.py
    • export AttnRes helpers

Model integrations (attention branch only)

  • unsloth/models/llama.py
  • unsloth/models/gemma2.py
  • unsloth/models/cohere.py
  • unsloth/models/granite.py
  • unsloth/models/falcon_h1.py

These sites now call the AttnRes transform immediately after attention output and before existing residual merges.
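The call-site pattern described above might look roughly like the following. Everything here is illustrative: the helper name, the dict-based state, and the vector shapes are assumptions, not the PR's actual interface.

```python
import torch

def maybe_apply_attnres(attn_output, query, state=None, alpha=0.1):
    """Transform attn_output before the existing residual add.

    With state=None (the default), the output passes through unchanged,
    preserving the default-off guarantee.
    """
    if state is None:                         # default-off: behavior unchanged
        return attn_output
    cands = state.get("candidates", [])
    if not cands:                             # first layer: nothing to mix in
        return attn_output
    k = torch.stack(cands)                    # (n, d)
    w = torch.softmax(k @ query / k.shape[-1] ** 0.5, dim=0)
    return attn_output + alpha * (w.unsqueeze(-1) * k).sum(dim=0)

# Pattern inside a model's attention branch (comments only, not real code):
#   attn_output = self_attn(hidden_states, ...)
#   attn_output = maybe_apply_attnres(attn_output, query, attnres_state)
#   hidden_states = residual + attn_output    # existing merge, untouched
```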

5. Why this is maintainer-friendly

  • Backward-compatible by default: if AttnRes is off, behavior remains unchanged.
  • Minimal surface area: logic is centralized in models/attnres.py and reused.
  • No kernel rewrites required: existing fast attention paths remain intact.
  • Per-architecture safe integration: only attention branch is modified; non-attention math is unchanged.
  • Reviewable diff: most edits are one-step hooks at known residual merge points.

6. Safety and risk controls

Implemented controls:

  • Opt-in behavior only.
  • Guard against risky stateful contexts (e.g., gradient-checkpointing replay desync).
  • Preserve dtype/device behavior from existing paths.
  • Keep all legacy residual-add code paths (e.g., +=, torch.add) untouched except attention pre-transform.
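The opt-in and checkpointing guards could be expressed as a simple gate. This is a minimal sketch under the assumption that the caller supplies the replay flag; how the PR actually detects gradient-checkpointing replay is not shown here:

```python
class AttnResConfig:
    """Hypothetical config: disabled unless explicitly opted in."""

    def __init__(self, enabled=False, alpha=0.1, allow_under_checkpointing=False):
        self.enabled = enabled
        self.alpha = alpha
        self.allow_under_checkpointing = allow_under_checkpointing

def should_apply(config: AttnResConfig, in_checkpoint_replay: bool) -> bool:
    """Gate the stateful AttnRes path.

    Off by default; also skipped during gradient-checkpointing replay,
    since re-running the forward would push duplicate states and desync
    the accumulated history.
    """
    if not config.enabled:
        return False
    if in_checkpoint_replay and not config.allow_under_checkpointing:
        return False
    return True
```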

Known scope limits in this PR:

  • This is not a full reproduction of paper-level distributed systems features (e.g., cross-stage caching infrastructure, full two-phase inference scheduler stack).
  • This PR focuses on architecture-level integration inside current Unsloth fast-path constraints.

7. How to review quickly

  1. Start with unsloth/models/attnres.py for core logic.
  2. Check one model integration (llama.py) for pattern.
  3. Confirm same pattern replicated in gemma2/cohere/granite/falcon_h1.
  4. Verify default-off behavior and unchanged residual merge code.
  5. Validate utility exports and dispatch hooks.

8. Validation performed

  • Static compile check on changed Python files:
    • python -m py_compile ... (pass)

Suggested follow-up validation (maintainer side):

  • A/B forward parity when AttnRes disabled.
  • Small training smoke run with AttnRes enabled.
  • Throughput/memory check on at least one Llama-family and one non-Llama-family model.
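The A/B forward-parity check could be harnessed along these lines. The harness and the `attnres_state` keyword are assumptions for illustration; the real models take full transformer inputs rather than a single tensor:

```python
import torch

def check_parity(baseline_fn, attnres_fn, x: torch.Tensor, atol: float = 0.0) -> bool:
    """Disabled-path parity: the AttnRes-aware forward with state=None
    must reproduce the original forward exactly (atol=0.0 means
    bit-for-bit on equal dtypes)."""
    ref = baseline_fn(x)
    out = attnres_fn(x, attnres_state=None)
    return torch.allclose(ref, out, atol=atol)
```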

9. Expected impact

When enabled, this should allow attention outputs to incorporate dynamically weighted prior layer information (via block summaries + in-block history) while preserving Unsloth’s fast-path structure and compatibility defaults.


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 973cd54e69



@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces an 'Attention Residual' (AttnRes) mechanism across several models, including Llama, Cohere, Falcon, Gemma2, and Granite, as well as within the attention dispatch utility. The changes implement state management for accumulating and mixing attention residuals, providing hooks for both training and inference. The review feedback highlights a critical issue where tensors stored in the state could be modified in-place, potentially corrupting the residual calculations; it is recommended to clone these tensors before storage. Additionally, the feedback suggests refining exception handling during configuration parsing to catch specific errors and refactoring redundant logic in the state initialization process to improve code maintainability.

@Datta0 (Collaborator) commented Apr 6, 2026

Hey @kleeedolinux, can you explain to me why we are trying to add attention residuals to existing model architectures? The point of the files that you see in the models directory is to re-enact the exact modeling behavior that we find in transformers, with Unsloth optimisations, for the said models. This change seems beyond the necessary scope.

@danielhanchen (Contributor) commented

Thanks for the PR, but we can't add this for now in Unsloth since it's partially out of scope for Unsloth - thanks for everything though!

