⚡️ Speed up method `BnB8BitDiffusersQuantizer.update_device_map` by 7% #125
Open
codeflash-ai[bot] wants to merge 1 commit into
Here is the optimized version of your program, focused on the `update_device_map` method, which is the performance-critical code according to your line profiler results. The main performance bottleneck is the repeated logging call, which formats the string and performs unnecessary work on every call, even though the function is often used in inference loops. Additional speed-ups can be achieved by:

1. Avoiding repeated expensive device queries if `update_device_map` is called many times with `device_map=None`.
2. Moving logging and string formatting out of the hot path.
3. Minimizing (eliminating if possible) repeated global attribute lookups inside the function.
4. Using local variables for module-level functions for even slight improvements.

Here's your optimized program.

### Summary of optimizations

- **Brought in logger**: The logger is used exactly as in the reference/BnB4Bit version, but the string is formatted only if logging is enabled (which saves a lot of time if INFO logging is off).
- **Binding to local variables**: Local variables are used for `torch.xpu` and `torch.cuda` to slightly speed up repeated attribute access.
- **Eliminated repeated string formatting**: The log message is formatted only if the logger is actually going to log, by checking `logger.isEnabledFor(...)`. This leaves much less computation inside the hot call path.
- **Other minor streamlining**: Unnecessary string interpolation is elided, and arguments are passed to the logger as recommended.
- **Behavior is unchanged**: Return value and side effects are identical to the original code.

This will yield a measurable speedup in hot inference code, especially when `update_device_map(None)` is invoked frequently.
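The guarded-logging pattern described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: the logger name, the `current_device_name` stand-in for the `torch.xpu` / `torch.cuda` queries, the `probe` parameter, and the message text are all assumptions made for illustration.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("device_map_sketch")


def current_device_name():
    """Stand-in for the torch.xpu / torch.cuda device query (assumption)."""
    return "cuda:0"


def update_device_map(device_map, probe=current_device_name):
    """Return device_map unchanged, or a default mapping when it is None."""
    if device_map is not None:
        return device_map

    device = probe()
    # Guard: skip all message work unless INFO records would be emitted.
    if logger.isEnabledFor(logging.INFO):
        # Lazy %-style arguments: the logging module interpolates them
        # only when the record's message is actually formatted.
        logger.info(
            "The device_map was not initialized. Setting device_map to {'': %s}.",
            device,
        )
    return {"": device}
```

Note that `logger.info(msg, arg)` already defers interpolation until a record is emitted, so the explicit `isEnabledFor` guard mainly pays off when building the arguments (or an f-string) is itself costly.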
📄 7% (0.07x) speedup for `BnB8BitDiffusersQuantizer.update_device_map` in `src/diffusers/quantizers/bitsandbytes/bnb_quantizer.py`

⏱️ Runtime: 66.7 microseconds → 62.0 microseconds (best of 196 runs)

✅ Correctness verification report:
🌀 Generated Regression Tests Details
To edit these changes, `git checkout codeflash/optimize-BnB8BitDiffusersQuantizer.update_device_map-mbdeoax3` and push.