
Add Optional Attention Residuals (AttnRes) Integration for Unsloth Fast Paths#4863

Closed
kleeedolinux wants to merge 11 commits into unslothai:main from kleeedolinux:main

Conversation

@kleeedolinux

1. Why this PR exists

Unsloth fast paths currently use standard residual updates around attention blocks, e.g.:

$$x_{l+1} = x_l + f_l(x_l)$$

This is stable and efficient, but all prior-layer information is mixed with fixed weights (the direct residual stream carries an implicit weight of 1 at each step). The AttnRes idea is to make residual mixing content-adaptive while preserving the same forward structure and default behavior.

Reference paper:

This PR introduces an opt-in, low-risk integration that keeps Unsloth architecture and coding style intact.

2. High-level design

The integration goals are:

  • Do not break existing kernels / fast paths.
  • Keep default behavior identical when AttnRes is disabled.
  • Touch only attention-branch outputs before existing residual merges.

Implemented approach:

  1. Build lightweight AttnRes state per forward pass.
  2. Track block-local states and block summaries.
  3. Before residual add, compute an AttnRes contribution from prior states.
  4. Add that contribution to the attention output, then continue existing residual code.

So if the current attention output is $a_l$, the transformed output is:

$$\tilde{a}_l = a_l + \alpha \cdot \sum_i w_{l,i} v_i$$

with weights normalized over the candidate index $i$:

$$w_{l,i} = \mathrm{softmax}_i\left(\frac{q_l \cdot k_i}{\sqrt{d}}\right)$$

where, in this practical implementation:

  • query $q_l$: derived from the current residual/query tensor,
  • keys/values $(k_i, v_i)$: prior block summaries plus in-block previous states,
  • $\alpha$: configurable residual-mixing scale.
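The weighted mix above can be sketched as a standalone function. This is an illustrative sketch, not the PR's actual API; the function name and the assumption of a single $(d,)$-shaped query vector are mine:

```python
import torch

def attnres_mix(q: torch.Tensor, keys: torch.Tensor, values: torch.Tensor,
                alpha: float) -> torch.Tensor:
    """Content-adaptive residual contribution: alpha * sum_i w_{l,i} v_i.

    q:      (d,)   query derived from the current residual stream
    keys:   (n, d) candidate keys (block summaries + in-block states)
    values: (n, d) matching values
    """
    d = q.shape[-1]
    scores = keys @ q / d ** 0.5        # (n,) scaled dot products q_l . k_i / sqrt(d)
    w = torch.softmax(scores, dim=0)    # normalize over the candidate index i
    return alpha * (w.unsqueeze(-1) * values).sum(dim=0)
```

With a zero query the scores are uniform, so the contribution reduces to `alpha` times the mean of the values, which makes the scaling behavior easy to sanity-check.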

3. Mathematical mapping to implementation

The paper discusses full and block residual attention. This PR implements a practical block-style variant:

  • Maintain two sets:
    • completed block summaries $S_1, \dots, S_{m-1}$
    • in-progress states in the current block $h_{m,1}, \dots, h_{m,t-1}$
  • Candidate set for layer $l$:
$$\mathcal{C}_l = \{S_1, \dots, S_{m-1}\} \cup \{h_{m,1}, \dots, h_{m,t-1}\}$$
  • Residual attention mix:
$$r_l = \sum_{c \in \mathcal{C}_l} \mathrm{softmax}\left(\frac{q_l \cdot c}{\sqrt{d}}\right) c$$
  • Output transform:
$$\tilde{a}_l = a_l + \alpha r_l$$
  • Block summary update at boundary:
$$S_m = \sum_{j \in B_m} h_j$$
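The candidate-set bookkeeping and the boundary update $S_m = \sum_{j \in B_m} h_j$ can be sketched as a small state object (class and method names are hypothetical, not the PR's actual code):

```python
import torch

class AttnResBlockState:
    """Hypothetical sketch of the block-style candidate bookkeeping."""

    def __init__(self):
        self.summaries = []   # S_1 .. S_{m-1}: completed block summaries
        self.in_block = []    # h_{m,1} .. h_{m,t-1}: current-block states

    def candidates(self):
        # C_l = {S_1, ..., S_{m-1}} ∪ {h_{m,1}, ..., h_{m,t-1}}
        return self.summaries + self.in_block

    def push_state(self, h: torch.Tensor):
        # Clone before storing so later in-place ops on h cannot
        # silently corrupt the stored history.
        self.in_block.append(h.detach().clone())

    def close_block(self):
        # S_m = sum_{j in B_m} h_j, then start a fresh block.
        if self.in_block:
            self.summaries.append(torch.stack(self.in_block).sum(dim=0))
            self.in_block = []
```

The `detach().clone()` in `push_state` reflects the kind of in-place-mutation hazard flagged in the automated review further down.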

4. What was changed

New files

  • unsloth/models/attnres.py
    • forward-state init
    • block summary accumulation
    • attention-output transform
  • unsloth/utils/attnres.py
    • utility-level AttnRes state/config/hook helpers

Updated files

  • unsloth/utils/attention_dispatch.py
    • optional AttnRes-aware post-attention finalization hook path
  • unsloth/utils/__init__.py
  • unsloth/kernels/__init__.py
    • export AttnRes helpers

Model integrations (attention branch only)

  • unsloth/models/llama.py
  • unsloth/models/gemma2.py
  • unsloth/models/cohere.py
  • unsloth/models/granite.py
  • unsloth/models/falcon_h1.py

These sites now call the AttnRes transform immediately after attention output and before existing residual merges.
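The call-site pattern described above might look roughly like the following. Everything here is illustrative: the helper name, the dict-based state, and the vector shapes are assumptions, not the PR's actual interface.

```python
import torch

def maybe_apply_attnres(attn_output, query, state=None, alpha=0.1):
    """Transform attn_output before the existing residual add.

    With state=None (the default), the output passes through unchanged,
    preserving the default-off guarantee.
    """
    if state is None:                         # default-off: behavior unchanged
        return attn_output
    cands = state.get("candidates", [])
    if not cands:                             # first layer: nothing to mix in
        return attn_output
    k = torch.stack(cands)                    # (n, d)
    w = torch.softmax(k @ query / k.shape[-1] ** 0.5, dim=0)
    return attn_output + alpha * (w.unsqueeze(-1) * k).sum(dim=0)

# Pattern inside a model's attention branch (comments only, not real code):
#   attn_output = self_attn(hidden_states, ...)
#   attn_output = maybe_apply_attnres(attn_output, query, attnres_state)
#   hidden_states = residual + attn_output    # existing merge, untouched
```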

5. Why this is maintainer-friendly

  • Backward-compatible by default: if AttnRes is off, behavior remains unchanged.
  • Minimal surface area: logic is centralized in models/attnres.py and reused.
  • No kernel rewrites required: existing fast attention paths remain intact.
  • Per-architecture safe integration: only attention branch is modified; non-attention math is unchanged.
  • Reviewable diff: most edits are one-step hooks at known residual merge points.

6. Safety and risk controls

Implemented controls:

  • Opt-in behavior only.
  • Guard against risky stateful contexts (e.g., gradient-checkpointing replay desync).
  • Preserve dtype/device behavior from existing paths.
  • Keep all legacy residual-add code paths (e.g., +=, torch.add) untouched except attention pre-transform.
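The opt-in and checkpointing guards could be expressed as a simple gate. This is a minimal sketch under the assumption that the caller supplies the replay flag; how the PR actually detects gradient-checkpointing replay is not shown here:

```python
class AttnResConfig:
    """Hypothetical config: disabled unless explicitly opted in."""

    def __init__(self, enabled=False, alpha=0.1, allow_under_checkpointing=False):
        self.enabled = enabled
        self.alpha = alpha
        self.allow_under_checkpointing = allow_under_checkpointing

def should_apply(config: AttnResConfig, in_checkpoint_replay: bool) -> bool:
    """Gate the stateful AttnRes path.

    Off by default; also skipped during gradient-checkpointing replay,
    since re-running the forward would push duplicate states and desync
    the accumulated history.
    """
    if not config.enabled:
        return False
    if in_checkpoint_replay and not config.allow_under_checkpointing:
        return False
    return True
```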

Known scope limits in this PR:

  • This is not a full reproduction of paper-level distributed systems features (e.g., cross-stage caching infrastructure, full two-phase inference scheduler stack).
  • This PR focuses on architecture-level integration inside current Unsloth fast-path constraints.

7. How to review quickly

  1. Start with unsloth/models/attnres.py for core logic.
  2. Check one model integration (llama.py) for pattern.
  3. Confirm same pattern replicated in gemma2/cohere/granite/falcon_h1.
  4. Verify default-off behavior and unchanged residual merge code.
  5. Validate utility exports and dispatch hooks.

8. Validation performed

  • Static compile check on changed Python files:
    • python -m py_compile ... (pass)

Suggested follow-up validation (maintainer side):

  • A/B forward parity when AttnRes disabled.
  • Small training smoke run with AttnRes enabled.
  • Throughput/memory check on at least one Llama-family and one non-Llama-family model.
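The A/B forward-parity check could be harnessed along these lines. The harness and the `attnres_state` keyword are assumptions for illustration; the real models take full transformer inputs rather than a single tensor:

```python
import torch

def check_parity(baseline_fn, attnres_fn, x: torch.Tensor, atol: float = 0.0) -> bool:
    """Disabled-path parity: the AttnRes-aware forward with state=None
    must reproduce the original forward exactly (atol=0.0 means
    bit-for-bit on equal dtypes)."""
    ref = baseline_fn(x)
    out = attnres_fn(x, attnres_state=None)
    return torch.allclose(ref, out, atol=atol)
```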

9. Expected impact

When enabled, this should allow attention outputs to incorporate dynamically weighted prior layer information (via block summaries + in-block history) while preserving Unsloth’s fast-path structure and compatibility defaults.


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 973cd54e69



@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces an 'Attention Residual' (AttnRes) mechanism across several models, including Llama, Cohere, Falcon, Gemma2, and Granite, as well as within the attention dispatch utility. The changes implement state management for accumulating and mixing attention residuals, providing hooks for both training and inference. The review feedback highlights a critical issue where tensors stored in the state could be modified in-place, potentially corrupting the residual calculations; it is recommended to clone these tensors before storage. Additionally, the feedback suggests refining exception handling during configuration parsing to catch specific errors and refactoring redundant logic in the state initialization process to improve code maintainability.

@Datta0 (Collaborator) commented Apr 6, 2026

Hey @kleeedolinux, can you explain to me why we are trying to add attention residuals to existing model architectures? The point of the files that you see in the models directory is to re-enact the exact modeling behavior that we find in transformers, with Unsloth optimisations, for the said models. This change seems beyond the necessary scope.

@danielhanchen (Contributor) commented

Thanks for the PR, but we can't add this for now in Unsloth since it's partially out of scope for Unsloth - thanks for everything though!

