⚡️ Speed up method `BnB8BitDiffusersQuantizer.update_device_map` by 7% by codeflash-ai[bot] · Pull Request #125 · aseembits93/diffusers

codeflash-ai · 2025-06-01T08:35:30Z

📄 7% (0.07x) speedup for `BnB8BitDiffusersQuantizer.update_device_map` in `src/diffusers/quantizers/bitsandbytes/bnb_quantizer.py`

⏱️ Runtime : 66.7 microseconds → 62.0 microseconds (best of 196 runs)
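
The 7% (0.07x) figure is consistent with defining speedup as the time saved relative to the optimized runtime. That definition is an assumption here, since the report does not state its formula. A minimal sketch of the arithmetic:

```python
# Runtimes reported above, in microseconds.
original_us = 66.7
optimized_us = 62.0

# Assumed definition: time saved, relative to the optimized runtime.
speedup = (original_us - optimized_us) / optimized_us

print(f"speedup: {speedup:.2%}")  # speedup: 7.58%
```

Under this definition the result is about 0.076, which matches the reported 7% once truncated.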

📝 Explanation and details

Here is the optimized version of your program, focused on the `update_device_map` method, which your line profiler results identify as the performance-critical code. The main bottleneck is the logging call, which formats its message string on every call even when the message is never emitted, and the method may be called repeatedly, for example in inference loops. Additional speedups can be achieved by:

1. Avoiding repeated expensive device queries if `update_device_map` is called many times with `device_map=None`.
2. Moving logging and string formatting out of the hot path.
3. Minimizing (or, where possible, eliminating) repeated global attribute lookups inside the function.
4. Binding module-level functions to local variables for a further slight improvement.

Here's your optimized program.

Summary of optimizations:

- **Brought in logger**: The logger is used exactly as in the reference BnB4Bit version, but the message is only formatted if logging is enabled, which avoids the formatting cost when INFO logging is off.
- **Binding to local variables**: `torch.xpu` and `torch.cuda` are bound to local variables to slightly speed up repeated attribute access.
- **Eliminated repeated string formatting**: The log message is formatted only if the logger will actually emit it, by checking `logger.isEnabledFor(...)`. This leaves much less computation in the hot call path.
- **Other minor streamlining**: Removed unnecessary string interpolation and passed arguments to the logger as recommended.
- **Behavior is unchanged**: The return value and side effects are identical to the original code.
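
The optimizations summarized above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the function takes `xpu` and `cuda` as parameters (hypothetical stand-ins for `torch.xpu` and `torch.cuda`) so that it runs without PyTorch, and the logger name, the message text, and the `{"": device}` shape of the returned map are assumptions.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("bnb_quantizer_sketch")  # hypothetical logger name


def update_device_map(device_map, xpu, cuda):
    """Return device_map unchanged, or default to the current accelerator.

    `xpu` and `cuda` stand in for torch.xpu and torch.cuda; passing them in
    also means they are plain local names inside the function.
    """
    if device_map is not None:
        return device_map

    if xpu.is_available():
        current_device = f"xpu:{xpu.current_device()}"
    else:
        current_device = f"cuda:{cuda.current_device()}"
    device_map = {"": current_device}  # assumed shape of the default map

    # Guarded, lazily formatted logging: the message is only built when
    # INFO is enabled for this logger.
    if logger.isEnabledFor(logging.INFO):
        logger.info("device_map was not set; defaulting to %s", device_map)
    return device_map


# Hypothetical stand-in backends mimicking a CUDA-only machine.
fake_xpu = SimpleNamespace(is_available=lambda: False)
fake_cuda = SimpleNamespace(current_device=lambda: 1)

result = update_device_map(None, fake_xpu, fake_cuda)
user_map = {"layer1": "cuda:0"}
passthrough = update_device_map(user_map, fake_xpu, fake_cuda)
```

With these stand-ins, `result` is `{"": "cuda:1"}` and `passthrough` is the same object as `user_map`, mirroring the pass-through behavior the generated tests exercise.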

In the benchmark above this yielded a 7% speedup (66.7 microseconds → 62.0 microseconds). The gain matters most in hot code paths where `update_device_map(None)` is invoked frequently.
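
To illustrate why the formatting cost adds up over many calls, the sketch below counts how often a value is converted to a string when INFO logging is disabled. It uses only the standard `logging` module; the logger name and the `CountingRepr` helper are hypothetical.

```python
import logging


class CountingRepr:
    """Counts how often it is converted to a string."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "cuda:0"


logger = logging.getLogger("lazy_logging_sketch")  # hypothetical name
logger.setLevel(logging.WARNING)  # INFO is disabled for this logger

eager = CountingRepr()
lazy = CountingRepr()

for _ in range(1000):
    # Eager: the f-string is built before logger.info is even called.
    logger.info(f"Setting device_map to {eager}")
    # Lazy: the logger drops the call before formatting its arguments.
    logger.info("Setting device_map to %s", lazy)

print(eager.calls, lazy.calls)  # 1000 0
```

Passing arguments to the logger defers formatting until a record is actually emitted, so the lazy variant does no string work while INFO is off.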

✅ Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 135 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests Details

import sys
from abc import ABC

# imports
import pytest  # used for our unit tests
import torch
from src.diffusers.quantizers.bitsandbytes.bnb_quantizer import \
    BnB8BitDiffusersQuantizer

# function to test
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.



# Dummy logger for testing, since the real logger is not needed for function correctness.
class DummyLogger:
    def info(self, msg):
        pass

logger = DummyLogger()

# Dummy QuantizationConfigMixin for initialization
class QuantizationConfigMixin:
    llm_int8_skip_modules = None
    quant_method = "dummy"

class DiffusersQuantizer(ABC):
    requires_calibration = False
    required_packages = None

    def __init__(self, quantization_config: QuantizationConfigMixin, **kwargs):
        self.quantization_config = quantization_config
        self.modules_to_not_convert = kwargs.pop("modules_to_not_convert", [])
        self.pre_quantized = kwargs.pop("pre_quantized", True)
        if not self.pre_quantized and self.requires_calibration:
            raise ValueError(
                f"The quantization method {quantization_config.quant_method} does require the model to be pre-quantized."
                f" You explicitly passed `pre_quantized=False` meaning your model weights are not quantized. Make sure to "
                f"pass `pre_quantized=True` while knowing what you are doing."
            )

# unit tests

@pytest.fixture
def quantizer():
    # Fixture for a quantizer instance with dummy config
    config = QuantizationConfigMixin()
    return BnB8BitDiffusersQuantizer(config)

# ------------------- Basic Test Cases -------------------

def test_update_device_map_returns_input_when_not_none(quantizer):
    # Should return the same device_map if not None
    input_map = {"layer1": "cuda:0", "layer2": "cpu"}
    codeflash_output = quantizer.update_device_map(input_map); result = codeflash_output

def test_update_device_map_none_cuda(monkeypatch, quantizer):
    # Should return a dict with current CUDA device if device_map is None and CUDA is available
    # Patch torch.xpu.is_available to False and torch.cuda.is_available to True
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 1)
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_none_xpu(monkeypatch, quantizer):
    # Should return a dict with current XPU device if device_map is None and XPU is available
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {
        "is_available": staticmethod(lambda: True),
        "current_device": staticmethod(lambda: 2)
    }))
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

# ------------------- Edge Test Cases -------------------

def test_update_device_map_empty_dict(quantizer):
    # Should return the empty dict as is
    input_map = {}
    codeflash_output = quantizer.update_device_map(input_map); result = codeflash_output

def test_update_device_map_with_non_string_keys_and_values(quantizer):
    # Should return the input as is, even if keys/values are not strings
    input_map = {1: 2, 3: 4}
    codeflash_output = quantizer.update_device_map(input_map); result = codeflash_output

def test_update_device_map_with_nested_dict(quantizer):
    # Should return the input as is, even if nested
    input_map = {"layer1": {"sub1": "cuda:0"}, "layer2": "cpu"}
    codeflash_output = quantizer.update_device_map(input_map); result = codeflash_output

def test_update_device_map_none_no_cuda_no_xpu(monkeypatch, quantizer):
    # Should return cuda:0 if both CUDA and XPU are not available (simulate fallback)
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 0)
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_none_xpu_module_missing(monkeypatch, quantizer):
    # Should fallback to CUDA if torch.xpu is missing (simulate by deleting torch.xpu)
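    # Use monkeypatch so torch.xpu is restored after this test instead of staying deleted for the whole session
    monkeypatch.delattr(torch, "xpu", raising=False)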
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 3)
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output


def test_update_device_map_large_dict(quantizer):
    # Should return the same large dict as is
    large_map = {f"layer{i}": f"cuda:{i%4}" for i in range(1000)}
    codeflash_output = quantizer.update_device_map(large_map); result = codeflash_output
    for i in range(1000):
        pass

def test_update_device_map_large_nested_dict(quantizer):
    # Should return the same large nested dict as is
    large_nested = {f"layer{i}": {f"sub{j}": f"cuda:{(i+j)%4}" for j in range(5)} for i in range(200)}
    codeflash_output = quantizer.update_device_map(large_nested); result = codeflash_output
    for i in range(200):
        for j in range(5):
            pass

def test_update_device_map_none_many_times(monkeypatch, quantizer):
    # Call update_device_map(None) repeatedly to check for statelessness
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 5)
    for _ in range(100):
        codeflash_output = quantizer.update_device_map(None); result = codeflash_output

# ------------------- Mutational Robustness Test -------------------

def test_update_device_map_none_returns_new_dict(quantizer, monkeypatch):
    # Should always return a new dict object when device_map is None
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 0)
    codeflash_output = quantizer.update_device_map(None); result1 = codeflash_output
    codeflash_output = quantizer.update_device_map(None); result2 = codeflash_output

def test_update_device_map_none_does_not_mutate_input(quantizer, monkeypatch):
    # Should not mutate any input dict
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 0)
    input_map = {"foo": "bar"}
    quantizer.update_device_map(input_map)

# ------------------- Determinism Test -------------------

def test_update_device_map_none_deterministic(monkeypatch, quantizer):
    # Should always return the same device string for the same current device
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 2)
    results = [quantizer.update_device_map(None) for _ in range(10)]
    for res in results:
        pass

# ------------------- Pytest Parametrization for Diverse Inputs -------------------

@pytest.mark.parametrize("input_map", [
    None,
    {},
    {"": "cuda:0"},
    {"layer": "xpu:0"},
    {"layer1": "cuda:0", "layer2": "cpu"},
    {0: 1, 2: 3},
    {None: None},
    {"layer": None},
])
def test_update_device_map_varied_inputs(monkeypatch, quantizer, input_map):
    # General test for varied input maps
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 0)
    codeflash_output = quantizer.update_device_map(input_map); result = codeflash_output
    if input_map is None:
        pass
    else:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import sys
from abc import ABC

# imports
import pytest  # used for our unit tests
import torch
from src.diffusers.quantizers.bitsandbytes.bnb_quantizer import \
    BnB8BitDiffusersQuantizer


class DummyQuantizationConfig:
    llm_int8_skip_modules = None

class DiffusersQuantizer(ABC):
    requires_calibration = False
    required_packages = None

    def __init__(self, quantization_config, **kwargs):
        self.quantization_config = quantization_config
        self.modules_to_not_convert = kwargs.pop("modules_to_not_convert", [])
        self.pre_quantized = kwargs.pop("pre_quantized", True)
        if not self.pre_quantized and self.requires_calibration:
            raise ValueError(
                f"The quantization method {quantization_config.quant_method} does require the model to be pre-quantized."
                f" You explicitly passed `pre_quantized=False` meaning your model weights are not quantized. Make sure to "
                f"pass `pre_quantized=True` while knowing what you are doing."
            )

# -------------------- UNIT TESTS --------------------

@pytest.fixture
def quantizer():
    # Returns a quantizer instance for use in tests
    return BnB8BitDiffusersQuantizer(DummyQuantizationConfig())

# 1. BASIC TEST CASES

def test_update_device_map_returns_same_when_not_none(quantizer):
    # Should return the same device_map if not None
    input_map = {"layer1": "cuda:0", "layer2": "cuda:1"}
    codeflash_output = quantizer.update_device_map(input_map); result = codeflash_output

def test_update_device_map_none_cuda(monkeypatch, quantizer):
    # Should set device_map to current CUDA device if device_map is None and no XPU
    # Patch torch.xpu to not be available
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    # Patch torch.cuda.current_device to return 2
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 2)
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_none_xpu(monkeypatch, quantizer):
    # Should set device_map to current XPU device if torch.xpu.is_available() returns True
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {
        "is_available": staticmethod(lambda: True),
        "current_device": staticmethod(lambda: 5)
    }))
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_empty_dict(quantizer):
    # If given an empty dict, should just return it (not replace)
    input_map = {}
    codeflash_output = quantizer.update_device_map(input_map); result = codeflash_output

def test_update_device_map_non_string_keys(quantizer):
    # Should allow non-string keys (function does not check key types)
    input_map = {1: "cuda:0", 2: "cuda:1"}
    codeflash_output = quantizer.update_device_map(input_map); result = codeflash_output

# 2. EDGE TEST CASES

def test_update_device_map_none_cuda_device_zero(monkeypatch, quantizer):
    # Should work if current_device is 0 (common default)
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 0)
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_none_cuda_device_large(monkeypatch, quantizer):
    # Should work with large device ids
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 999)
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_none_xpu_device_zero(monkeypatch, quantizer):
    # Should work if XPU current_device is 0
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {
        "is_available": staticmethod(lambda: True),
        "current_device": staticmethod(lambda: 0)
    }))
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_none_xpu_device_large(monkeypatch, quantizer):
    # Should work with large XPU device ids
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {
        "is_available": staticmethod(lambda: True),
        "current_device": staticmethod(lambda: 999)
    }))
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output



def test_update_device_map_none_cuda_current_device_raises(monkeypatch, quantizer):
    # If torch.cuda.current_device raises, should propagate the error
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: (_ for _ in ()).throw(RuntimeError("cuda error")))
    with pytest.raises(RuntimeError):
        quantizer.update_device_map(None)

def test_update_device_map_none_xpu_current_device_raises(monkeypatch, quantizer):
    # If torch.xpu.is_available True but torch.xpu.current_device raises, should propagate the error
    class BrokenXPU:
        @staticmethod
        def is_available():
            return True
        @staticmethod
        def current_device():
            raise RuntimeError("xpu error")
    monkeypatch.setattr(torch, "xpu", BrokenXPU)
    with pytest.raises(RuntimeError):
        quantizer.update_device_map(None)

def test_update_device_map_none_xpu_is_available_false_no_cuda(monkeypatch, quantizer):
    # If torch.xpu.is_available is False and torch.cuda missing, should raise AttributeError
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    if hasattr(torch, "cuda"):
        monkeypatch.delattr(torch, "cuda", raising=False)
    with pytest.raises(AttributeError):
        quantizer.update_device_map(None)

def test_update_device_map_none_cuda_current_device_returns_str(monkeypatch, quantizer):
    # If torch.cuda.current_device returns a string, should still work (str concatenation)
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: "abc")
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_none_xpu_current_device_returns_str(monkeypatch, quantizer):
    # If torch.xpu.current_device returns a string, should still work (str concatenation)
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {
        "is_available": staticmethod(lambda: True),
        "current_device": staticmethod(lambda: "xyz")
    }))
    codeflash_output = quantizer.update_device_map(None); result = codeflash_output

# 3. LARGE SCALE TEST CASES

def test_update_device_map_large_map(quantizer):
    # Should handle large device maps efficiently (should not copy or modify)
    large_map = {f"layer{i}": f"cuda:{i%8}" for i in range(1000)}
    codeflash_output = quantizer.update_device_map(large_map); result = codeflash_output
    for i in range(1000):
        pass

def test_update_device_map_large_map_with_empty_keys(quantizer):
    # Should handle large map with some empty string keys
    large_map = {f"layer{i}": f"cuda:{i%8}" for i in range(999)}
    large_map[""] = "cuda:0"
    codeflash_output = quantizer.update_device_map(large_map); result = codeflash_output
    for i in range(999):
        pass

def test_update_device_map_none_many_calls(monkeypatch, quantizer):
    # Should work for many sequential calls with None (simulate repeated inference)
    monkeypatch.setattr(torch, "xpu", type("xpu", (), {"is_available": staticmethod(lambda: False)}))
    monkeypatch.setattr(torch.cuda, "current_device", lambda: 1)
    for _ in range(100):
        codeflash_output = quantizer.update_device_map(None); result = codeflash_output

def test_update_device_map_large_map_with_non_str_keys(quantizer):
    # Should allow large maps with non-string keys
    large_map = {i: f"cuda:{i%4}" for i in range(1000)}
    codeflash_output = quantizer.update_device_map(large_map); result = codeflash_output
    for i in range(1000):
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
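
The generated tests as shown here contain no explicit assertions. Per the comment above, the captured return values are compared between the original and the optimized code. A minimal sketch of that kind of equivalence check follows, using two hypothetical implementations and a stand-in for `torch.cuda`. It illustrates the idea only and is not Codeflash's actual harness.

```python
from types import SimpleNamespace


def original(device_map, cuda):
    # Hypothetical reference implementation.
    if device_map is None:
        device_map = {"": f"cuda:{cuda.current_device()}"}
    return device_map


def optimized(device_map, cuda):
    # Hypothetical candidate: same behavior, early return for the common case.
    if device_map is not None:
        return device_map
    return {"": "cuda:" + str(cuda.current_device())}


fake_cuda = SimpleNamespace(current_device=lambda: 2)  # stand-in for torch.cuda
inputs = [None, {}, {"layer": "xpu:0"}, {0: 1, 2: 3}, {"layer": None}]

# Collect every input on which the two implementations disagree.
mismatches = [
    m for m in inputs if original(m, fake_cuda) != optimized(m, fake_cuda)
]
print(mismatches)  # []
```

An empty `mismatches` list means the two versions returned equal values for every input tried. It says nothing about inputs outside that list.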

To edit these changes, run `git checkout codeflash/optimize-BnB8BitDiffusersQuantizer.update_device_map-mbdeoax3`, commit your edits, and push.


codeflash-ai Bot added the `⚡️ codeflash` label (Optimization PR opened by Codeflash AI) on Jun 1, 2025

codeflash-ai Bot requested a review from aseembits93 June 1, 2025 08:35

codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-BnB8BitDiffusersQuantizer.update_device_map-mbdeoax3
