Add Optional Attention Residuals (AttnRes) Integration for Unsloth Fast Paths #4863
kleeedolinux wants to merge 11 commits into unslothai:main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 973cd54e69
Code Review
This pull request introduces an 'Attention Residual' (AttnRes) mechanism across several models, including Llama, Cohere, Falcon, Gemma2, and Granite, as well as within the attention dispatch utility. The changes implement state management for accumulating and mixing attention residuals, providing hooks for both training and inference. The review feedback highlights a critical issue where tensors stored in the state could be modified in-place, potentially corrupting the residual calculations; it is recommended to clone these tensors before storage. Additionally, the feedback suggests refining exception handling during configuration parsing to catch specific errors and refactoring redundant logic in the state initialization process to improve code maintainability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Hey @kleeedolinux, can you explain why we are trying to add attention residuals to existing model architectures? The point of the model files is to replicate the exact modeling behavior found in transformers, with unsloth optimisations, for the said models. This change seems beyond the necessary scope.
Thanks for the PR, but we can't add this for now in Unsloth since it's partially out of scope for Unsloth - thanks for everything though!
1. Why this PR exists
Unsloth fast paths currently use standard residual updates around attention blocks, e.g.:
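The code example referenced here did not survive extraction. A minimal sketch of the standard pattern being described — the fixed-weight residual add around an attention block — assuming illustrative stand-in names (`decoder_layer_step`, `attn_forward`, `norm` are not Unsloth's actual identifiers):

```python
import torch

def decoder_layer_step(hidden_states, attn_forward, norm):
    residual = hidden_states                # save the residual stream
    hidden_states = norm(hidden_states)     # pre-attention normalization
    attn_out = attn_forward(hidden_states)  # attention block output a_l
    # Standard residual update: the residual stream carries an
    # implicit fixed weight of 1 at every step.
    return residual + attn_out

# Tiny usage example with identity stand-ins for norm and attention
h = torch.randn(2, 4, 8)
out = decoder_layer_step(h, lambda x: x * 0.0, lambda x: x)
assert torch.equal(out, h)  # zero attention output -> pure residual pass-through
```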
This is stable and efficient, but all prior layer information is mixed with fixed weights (implicit weight = 1 for the direct residual stream at each step). The AttnRes idea is to make residual mixing content-adaptive while preserving the same forward structure and default behavior.
Reference paper:
This PR introduces an opt-in, low-risk integration that keeps Unsloth architecture and coding style intact.
2. High-level design
The integration goal is:
Implemented approach:
So if current attention output is (a_l), the transformed output is:
with:
where in this practical implementation:
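The equations at this point were lost in extraction. A plausible, hedged reconstruction of a gated attention-residual mixing rule, consistent with the surrounding description (content-adaptive weights over stored prior attention outputs, reducing to the default behavior when the weights are zero) — the symbols $w_i$, $\bar{a}_i$, and $g$ are illustrative, not the paper's exact notation:

```latex
% Hedged reconstruction -- see the referenced paper and the PR diff
% for the exact formulation.
a_l' = a_l + \sum_{i < l} w_i(a_l)\, \bar{a}_i,
\qquad
w_i(a_l) = g\left(a_l, \bar{a}_i\right),
```

where $\bar{a}_i$ is a stored (block-level) summary of layer $i$'s attention output and $g$ is a lightweight learned gate. With $w_i \equiv 0$ this collapses to $a_l' = a_l$, i.e. the unchanged default forward pass.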
3. Mathematical mapping to implementation
The paper discusses full and block residual attention. This PR implements a practical block-style variant:
4. What was changed
New files
- unsloth/models/attnres.py
- unsloth/utils/attnres.py

Updated files
- unsloth/utils/attention_dispatch.py
- unsloth/utils/__init__.py
- unsloth/kernels/__init__.py

Model integrations (attention branch only)
- unsloth/models/llama.py
- unsloth/models/gemma2.py
- unsloth/models/cohere.py
- unsloth/models/granite.py
- unsloth/models/falcon_h1.py

These sites now call the AttnRes transform immediately after attention output and before the existing residual merges.
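The integration point described above can be sketched as follows. This is a hedged illustration, not the PR's actual code: `apply_attnres`, `state`, and `alpha` are hypothetical names, and the default `alpha=0.0` models the opt-in, no-op-by-default behavior the PR claims:

```python
import torch

def apply_attnres(attn_out, state, alpha=0.0):
    """Hypothetical AttnRes transform: mix stored prior attention
    outputs into the current one, then record the current output.
    With alpha=0.0 (the default) the mix is an exact no-op."""
    if state:  # history of prior attention outputs exists
        mixed = attn_out + alpha * torch.stack(state).mean(dim=0)
    else:
        mixed = attn_out
    # Clone before storing so later in-place residual math on
    # attn_out cannot corrupt the recorded state.
    state.append(attn_out.detach().clone())
    return mixed

# Schematic call site inside a decoder layer:
#   attn_out = self_attn(hidden_states, ...)
#   attn_out = apply_attnres(attn_out, attnres_state)  # new: pre-residual transform
#   hidden_states = residual + attn_out                # existing merge, unchanged

state = []
x = torch.ones(1, 2, 4)
y = apply_attnres(x, state)      # first layer: nothing stored yet
assert torch.equal(y, x)
z = apply_attnres(x * 2, state)  # alpha=0.0 -> identity even with history
assert torch.equal(z, x * 2)
```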
5. Why this is maintainer-friendly
- All AttnRes logic is centralized in models/attnres.py and reused.

6. Safety and risk controls
Implemented controls:
- Existing residual math (+=, torch.add) is untouched except for the attention pre-transform.

Known scope limits in this PR:
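The in-place aliasing hazard flagged in the automated review (tensors stored in state being mutated by later in-place residual math) can be illustrated with a minimal sketch; the `clone()` call shows the suggested fix, not a quote of the PR's code:

```python
import torch

stored_bad, stored_good = [], []

attn_out = torch.ones(3)
stored_bad.append(attn_out)           # stores an alias of the live tensor
stored_good.append(attn_out.clone())  # stores an independent copy

attn_out += 1  # later in-place residual math mutates the tensor

assert stored_bad[0][0].item() == 2.0   # aliased state was silently corrupted
assert stored_good[0][0].item() == 1.0  # cloned state preserved the residual
```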
7. How to review quickly
- Read unsloth/models/attnres.py for the core logic.
- Check one model integration (llama.py) for the pattern.
- The same pattern repeats in gemma2/cohere/granite/falcon_h1.

8. Validation performed
- python -m py_compile ... (pass)

Suggested follow-up validation (maintainer side):
9. Expected impact
When enabled, this should allow attention outputs to incorporate dynamically weighted prior layer information (via block summaries + in-block history) while preserving Unsloth’s fast-path structure and compatibility defaults.