fix: dtype mismatch in matmul_lora and LoRA backward with bnb-4bit + GRPO#4918
anandn1 wants to merge 6 commits into unslothai:main
Conversation
…GRPO
Two root causes fixed:
1. matmul_lora (utils.py): fast_dequantize reads from a global buffer
whose dtype is controlled by quant_state.dtype embedded in bnb-4bit
checkpoints (typically float16), not by the dtype= arg passed at load
time. When activations are bfloat16, the subsequent matmul crashes with
"got Half and BFloat16". Fix: cast W to activation dtype after
fast_dequantize. Same cast applied to `out` in the LoRA branch.
2. LoRA_MLP_SwiGLU.backward and LoRA_QKV.backward (fast_lora.py):
@torch_amp_custom_bwd inherits the float16 autocast context established
by TRL's compiled GRPO trainer. This silently downcasts float32 gradient
tensors (dY, dQ/dK/dV) to float16 mid-computation, causing addmm_
dtype mismatches. Fix: wrap entire backward body in
torch.amp.autocast("cuda", enabled=False) and explicitly cast all
incoming gradient tensors and dequantized base weights to X.dtype.
Reproducer: Llama-3.2-3B-Instruct-bnb-4bit + GRPO, bf16=True, fp16=False,
unsloth 2026.4.4, TRL 0.22.x, CUDA 12.8.
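The forward-path fix described in the commit message can be sketched in isolation. This is a minimal sketch, not the actual unsloth code: `matmul_lora_base` is a hypothetical stand-in for the base matmul inside `matmul_lora`, and a plain float16 tensor simulates the `fast_dequantize` global-buffer output.

```python
import torch

def matmul_lora_base(X, W):
    # Hypothetical stand-in for the base matmul inside matmul_lora.
    # W plays the role of the fast_dequantize output, whose dtype follows
    # quant_state.dtype (float16 in typical bnb-4bit checkpoints) rather
    # than the dtype= argument passed at load time.
    if W.dtype != X.dtype:
        W = W.to(X.dtype)  # the fix: align W with the activation dtype
    return torch.matmul(X, W.t())

X = torch.randn(2, 4, dtype=torch.bfloat16)  # bf16 activations
W = torch.randn(8, 4, dtype=torch.float16)   # fp16 "dequantized" weight
out = matmul_lora_base(X, W)
print(out.dtype)  # torch.bfloat16
```

Without the cast, `torch.matmul` on mixed Half/BFloat16 operands raises the "got Half and BFloat16" error quoted above.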
for more information, see https://pre-commit.ci
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b6b5d83d9
Code Review
This pull request implements explicit dtype management and disables autocast during the backward pass of the LoRA kernels to resolve `addmm_` mismatches caused by silent downcasting in certain training environments. The feedback focuses on improving device compatibility by replacing hardcoded "cuda" strings with dynamic device types, and on removing several redundant casts of tensors that already match the target activation dtype.
I was unable to create individual review comments, so the feedback is listed below.
unsloth/kernels/fast_lora.py (142)
Hardcoding "cuda" in torch.amp.autocast limits compatibility with other devices. Using X.device.type makes the context manager device-agnostic.
with torch.amp.autocast(X.device.type, enabled=False):
unsloth/kernels/fast_lora.py (144-146)
The casts for e and g are redundant because they are produced in the forward pass using matmul_lora, which already returns tensors in the activation dtype. Only the cast for dY is necessary.
if dY.dtype != dtype: dY = dY.to(dtype)
unsloth/kernels/fast_lora.py (168-171)
These casts are redundant as DW is already in the correct dtype.
h, df, de = DW, e, g
unsloth/kernels/fast_lora.py (462)
Hardcoding "cuda" in torch.amp.autocast limits compatibility with other devices. Using X.device.type makes the context manager device-agnostic.
with torch.amp.autocast(X.device.type, enabled=False):
unsloth/kernels/utils.py (1051)
X.to(dtype) is redundant here because dtype is defined as X.dtype at the start of the function.
XA = torch_matmul(X, A.to(dtype))
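On the redundancy point: `Tensor.to` returns the tensor itself (no copy) when the dtype and device already match, so a cast like `X.to(dtype)` with `dtype = X.dtype` is a no-op. A quick check:

```python
import torch

x = torch.randn(2, 3, dtype=torch.bfloat16)
dtype = x.dtype  # mirrors `dtype = X.dtype` at the top of matmul_lora

# .to() returns self when dtype and device already match,
# which is why the flagged casts are redundant.
same = x.to(dtype)
print(same is x)  # True
```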
Pull request overview
Fixes mixed-dtype failures when using bitsandbytes 4-bit checkpoints with bfloat16 activations (notably during GRPO training), by enforcing dtype alignment in the LoRA matmul path and preventing inherited autocast contexts from downcasting gradients during custom backward passes.
Changes:
- Cast dequantized 4-bit base weights in `matmul_lora` to match the activation dtype before matmul/addmm.
- In LoRA custom backward functions, disable autocast and explicitly cast incoming gradients and dequantized weights to `X.dtype` to avoid `addmm_` dtype mismatches.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| unsloth/kernels/utils.py | Adds dtype-normalization around dequantized weights and LoRA accumulation in matmul_lora. |
| unsloth/kernels/fast_lora.py | Disables autocast in LoRA custom backward passes and aligns gradient/weight dtypes to the activation dtype. |
```python
W = fast_dequantize(W, W_quant, use_global_buffer = True)
# fast_dequantize may return float16 from the global buffer even when
# activations are bfloat16 (quant_state.dtype from bnb-4bit checkpoints).
# Cast W to match the activation dtype to avoid mixed-dtype matmul errors.
if W.dtype != dtype:
    W = W.to(dtype)
out = torch_matmul(X, W.t(), out = out)
if W_quant is not None:
```
This fixes dtype mismatches in matmul_lora, but the same root cause (4-bit quant_state.dtype driving dequant/output dtype) still exists in other hot paths in this file. For example, fast_linear_forward dequantizes with use_global_buffer=True and then calls torch_matmul(X, W, out=out) (and fast_gemv selects fp16/bf16 kernels based on quant_state.dtype), so dtype=torch.bfloat16 activations with fp16 quant_state.dtype can still trigger dtype-mismatch errors or incorrect kernel selection when q_len == 1 (especially for bsz > 1). Consider applying the same dtype-alignment strategy in those q_len==1 paths as well (cast the dequantized weight/output and/or ensure X is in the expected dtype before calling the bnb kernels).
```python
# Cast W to match the activation dtype to avoid mixed-dtype matmul errors.
if W.dtype != dtype:
    W = W.to(dtype)
out = torch_matmul(X, W.t(), out = out)
```
Casting the full dequantized weight matrix with W = W.to(dtype) can allocate a fresh copy every call when quant_state.dtype != X.dtype, which may negate much of the benefit of use_global_buffer=True and increase memory bandwidth/peak memory. If this mismatch is expected to be common (e.g., bf16 activations + fp16 bnb checkpoints), consider a longer-term approach where the global dequant buffer is keyed by the compute/activation dtype (or otherwise dequantizes directly into the desired dtype) to avoid repeated full-matrix casts.
Suggested change:

```diff
-# Cast W to match the activation dtype to avoid mixed-dtype matmul errors.
-if W.dtype != dtype:
-    W = W.to(dtype)
-out = torch_matmul(X, W.t(), out = out)
+# Avoid casting the full dequantized weight matrix every call; instead,
+# run the base matmul in W.dtype and cast the result back if needed.
+base_X = X if W.dtype == dtype else X.to(W.dtype)
+base_out = out if (out is None or out.dtype == base_X.dtype) else None
+out = torch_matmul(base_X, W.t(), out = base_out)
+if out.dtype != dtype:
+    out = out.to(dtype)
```
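The suggested change is a diff fragment; a runnable paraphrase follows. `lora_base_matmul` is a hypothetical name, and float32 stands in for the checkpoint's float16 dequant dtype so the sketch runs on CPU; the point is that only the smaller activation/output tensors get copied, never the full weight matrix.

```python
import torch

def lora_base_matmul(X, W, out=None):
    # Sketch of the reviewer's alternative: run the base matmul in W.dtype
    # (avoiding a full copy of the dequantized weight) and cast only the
    # activation and output tensors when dtypes disagree.
    dtype = X.dtype
    base_X = X if W.dtype == dtype else X.to(W.dtype)
    base_out = out if (out is None or out.dtype == base_X.dtype) else None
    result = torch.matmul(base_X, W.t(), out=base_out)
    return result if result.dtype == dtype else result.to(dtype)

X = torch.randn(2, 4, dtype=torch.bfloat16)  # activations
W = torch.randn(8, 4, dtype=torch.float32)   # stand-in dequant buffer dtype
y = lora_base_matmul(X, W)
print(y.dtype)  # torch.bfloat16
```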
Replace hardcoded torch.amp.autocast("cuda", enabled=False) with
torch.amp.autocast(X.device.type, enabled=False) in all three LoRA
backward methods (LoRA_MLP_SwiGLU, LoRA_QKV, LoRA_W).
The file already switches torch_amp_custom_bwd to device_type="xpu" on
Intel XPU (utils.py:53-55). Hardcoding "cuda" in the autocast guard
would target the wrong context on XPU and may error on systems without
CUDA. Deriving the device type from the input tensor makes the fix
backend-agnostic.
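The mechanism behind the fix can be demonstrated on CPU, where autocast uses bfloat16 instead of the float16 that TRL's CUDA context uses; the downcast-and-guard behavior is the same. This is an illustrative sketch, not unsloth code.

```python
import torch

x = torch.randn(2, 2)  # float32 "gradient" tensor
w = torch.randn(2, 2)  # float32 weight

# An ambient autocast context (like the one @torch_amp_custom_bwd re-enters)
# silently downcasts eligible ops such as matmul:
with torch.amp.autocast("cpu"):
    inside = torch.matmul(x, w)
    # The fix: derive the device type from the tensor and disable autocast,
    # so the backward math runs in the tensors' own dtypes.
    with torch.amp.autocast(x.device.type, enabled=False):
        guarded = torch.matmul(x, w)

print(inside.dtype, guarded.dtype)  # torch.bfloat16 torch.float32
```

Using `x.device.type` rather than a literal `"cuda"` keeps the guard correct on CUDA, XPU, and CPU alike.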
Hey @anandn1, while the root cause analysis seems plausible, the fix doesn't seem optimal. It is much easier and cleaner to fix the argument to show bfloat16 instead of typecasting activations in the kernels. This is definitely not the ideal way, as it would cost us performance.
Add W_quant.dtype = dtype assignment before each fast_dequantize call in matmul_lora (forward) and all three LoRA backward methods (MLP, QKV, W). This ensures fast_dequantize selects the correct NF4 kernel (cdequantize_blockwise_bf16_nf4 vs fp16_nf4) and allocates the output buffer in the activation dtype directly, eliminating unnecessary kernel path divergence. Safety post-casts are retained as a fallback. Addresses review feedback from @Datta0 on PR unslothai#4918.
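A minimal sketch of this commit's approach. `FakeQuantState` and `fake_fast_dequantize` are stand-ins for bitsandbytes' quant state and unsloth's dequant helper, not the real APIs; they only model the fact that the dequant kernel and output buffer follow `quant_state.dtype`.

```python
import torch

class FakeQuantState:
    # stand-in for the quant_state carried by a bnb-4bit weight
    def __init__(self, dtype):
        self.dtype = dtype

def fake_fast_dequantize(W_packed, quant_state):
    # stand-in: the real fast_dequantize picks its NF4 kernel and allocates
    # its output buffer according to quant_state.dtype
    return W_packed.to(quant_state.dtype)

def dequantize_in_activation_dtype(W_packed, quant_state, dtype):
    quant_state.dtype = dtype  # steer kernel/buffer selection up front
    W = fake_fast_dequantize(W_packed, quant_state)
    if W.dtype != dtype:       # safety post-cast retained as a fallback
        W = W.to(dtype)
    return W

qs = FakeQuantState(torch.float16)  # fp16 bnb-4bit checkpoint
W = dequantize_in_activation_dtype(torch.randn(4, 4), qs, torch.bfloat16)
print(W.dtype)  # torch.bfloat16
```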
Thanks for the review @Datta0! Updated the approach in the latest commit. Instead of casting after dequantization, we now set `W_quant.dtype = dtype` before each `fast_dequantize` call, so the correct NF4 kernel is selected and the output buffer is allocated in the activation dtype directly. The previous post-cast (`W.to(dtype)`) is retained as a safety fallback. Applied at 4 sites: `matmul_lora` (forward) and the three LoRA backward methods (MLP, QKV, W).
I am not very keen on such a safety net that fails silently, tbh. If an error is not supposed to happen, we should report it rather than fail silently, especially when it hurts performance.
@Datta0 Ok, I'll implement the suggested approach. Feedback appreciated!
Problem
Fixes #4891.
GRPO training with `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` (bf16 activations) crashes with two distinct dtype mismatch errors, even after setting `dtype=torch.bfloat16`, `bf16=True`, and `fp16=False`:

Error 1 — forward pass (`utils.py:matmul_lora`, LoRA branch)

Error 2 — backward pass (`fast_lora.py`, `LoRA_MLP_SwiGLU`/`LoRA_QKV`)

Root Causes
1. `fast_dequantize` returns float16 regardless of load-time `dtype=`

`fast_dequantize` writes into a global `WEIGHT_BUFFERS` cache whose dtype is controlled by `quant_state.dtype` embedded in the bnb-4bit checkpoint (float16 by default). Setting `dtype=torch.bfloat16` at load time does not update `quant_state.dtype`, so the dequantized base weight `W` comes out float16 even when activations are bfloat16. The `out` tensor from the base-weight matmul is therefore float16, and the subsequent `out.addmm_(XA, B.to(dtype))` in the LoRA branch crashes because `out` is float16 but `B.to(dtype)` is bfloat16.

2. `@torch_amp_custom_bwd` inherits TRL's float16 autocast context

TRL's compiled GRPO trainer establishes a float16 `autocast` context for parts of training. `@torch_amp_custom_bwd` re-enters that same context during the custom backward pass, silently downcasting float32 gradient tensors (`dY`, `dQ/dK/dV`) to float16 mid-computation. The subsequent `addmm_` calls see mixed float32/float16 operands and crash.

Fixes
`unsloth/kernels/utils.py` — `matmul_lora`
- After `fast_dequantize`, cast `W` to the activation dtype if they differ.
- Cast `out` and `X` to the activation dtype before `addmm_`, ensuring the base-weight matmul output dtype never bleeds into the LoRA accumulation.

`unsloth/kernels/fast_lora.py` — `LoRA_MLP_SwiGLU.backward` and `LoRA_QKV.backward`
- Wrap each backward body in `torch.amp.autocast("cuda", enabled=False)` to prevent the inherited float16 context from downcasting gradients.
- Explicitly cast incoming gradient tensors (`dY`, `dQ/dK/dV`) and all dequantized base weights (`upW`, `gateW`, `QW`, `KW`, `VW`) to `X.dtype`.

Relation to PR #4005
PR #4005 threads `correct_dtype` through `patch_model_and_tokenizer`. This PR addresses the remaining kernel-level gaps: the global dequant buffer dtype mismatch in the `matmul_lora` forward, and the autocast context inheritance in the LoRA custom backward functions.

Reproducer
Environment: NVIDIA RTX 5050 Laptop GPU, CUDA 12.8, unsloth 2026.4.4, TRL 0.22.x, PyTorch 2.10