PR #39926: Improvements to the HBM OOM Error page (Error E1000) #115476
Merged
copybara-service[bot] merged 1 commit into master on Apr 13, 2026
Imported from GitHub PR openxla/xla#39926
📝 Summary of Changes
This PR adds the following updates to the error E1000 page:
- Adds a table summarizing potential interventions for OOM errors and how to rank them in terms of safety, impact, and caveats;
- Adds notes about the drawbacks of specific interventions, such as host offloading, microbatching, manual checkpointing, and advanced sharding techniques;
- Adds more context on donating input buffers.
🎯 Justification
These improvements were suggested by an internal UXR study.
🚀 Kind of Contribution
📚 Documentation
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
N/A
🧪 Execution Tests:
N/A
Copybara import of the project:
--
c4302c3d89bf97e2ffed828ff66c70598c95f242 by Melissa Weber Mendonça <melissawm@gmail.com>:
Add table with summary of interventions
--
ed754ac276a24ed81650a50d57cca02b18c33d9b by Melissa Weber Mendonça <melissawm@gmail.com>:
Add note about host offloading performance impact
--
760d2410bcf8f2f4963bdb36b0bc109cafdf4e99 by Melissa Weber Mendonça <melissawm@gmail.com>:
Add caveat for manual checkpointing
--
f8fb7c3b9d91d90da2ccc09f821c9fc9850d9419 by Melissa Weber Mendonça <melissawm@gmail.com>:
Add caveat about using advanced sharding techniques
--
dd1d1d143ed2da7f481f1f0c24961951127e577a by Melissa Weber Mendonça <melissawm@gmail.com>:
Fix formatting
--
d71b8c5cd39e18b8a972f6d2997a4fa8c25edb80 by Melissa Weber Mendonça <melissawm@gmail.com>:
Add note about microbatching
--
c836156aa0ca64f97531cb3571abd2a934ca9e76 by Melissa Weber Mendonça <melissawm@gmail.com>:
Add JAX docs on gradient checkpointing
--
19807ce99098f33731d97bb47ec62e56be009a46 by Melissa Weber Mendonça <melissawm@gmail.com>:
Add note on donating input buffers
Merging this change closes #39926
PiperOrigin-RevId: 899155259
FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#39926 from melissawm:hbm-oom-updates 19807ce99098f33731d97bb47ec62e56be009a46