PR #39926: Improvements to the HBM OOM Error page (Error E1000) #115476

Merged: copybara-service[bot] merged 1 commit into master from exported_pr_896628489 on Apr 13, 2026
@copybara-service
PR #39926: Improvements to the HBM OOM Error page (Error E1000)

Imported from GitHub PR openxla/xla#39926

📝 Summary of Changes
This PR adds the following updates to the error E1000 page:

  • Adds a table with a summary of potential interventions for OOM errors and how to rank them in terms of safety, impact and caveats;
  • Adds a few notes about drawbacks of specific interventions, such as host offloading, microbatching, manual checkpointing and advanced sharding techniques;
  • Adds more context on donating input buffers.

🎯 Justification
These improvements were suggested by an internal UXR study.

🚀 Kind of Contribution
📚 Documentation

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
N/A

🧪 Execution Tests:
N/A

Copybara import of the project:

--
c4302c3d89bf97e2ffed828ff66c70598c95f242 by Melissa Weber Mendonça <melissawm@gmail.com>:

Add table with summary of interventions

--
ed754ac276a24ed81650a50d57cca02b18c33d9b by Melissa Weber Mendonça <melissawm@gmail.com>:

Add note about host offloading performance impact

--
760d2410bcf8f2f4963bdb36b0bc109cafdf4e99 by Melissa Weber Mendonça <melissawm@gmail.com>:

Add caveat for manual checkpointing

--
f8fb7c3b9d91d90da2ccc09f821c9fc9850d9419 by Melissa Weber Mendonça <melissawm@gmail.com>:

Add caveat about using advanced sharding techniques

--
dd1d1d143ed2da7f481f1f0c24961951127e577a by Melissa Weber Mendonça <melissawm@gmail.com>:

Fix formatting

--
d71b8c5cd39e18b8a972f6d2997a4fa8c25edb80 by Melissa Weber Mendonça <melissawm@gmail.com>:

Add note about microbatching

--
c836156aa0ca64f97531cb3571abd2a934ca9e76 by Melissa Weber Mendonça <melissawm@gmail.com>:

Add JAX docs on gradient checkpointing

--
19807ce99098f33731d97bb47ec62e56be009a46 by Melissa Weber Mendonça <melissawm@gmail.com>:

Add note on donating input buffers

Merging this change closes #39926

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#39926 from melissawm:hbm-oom-updates 19807ce99098f33731d97bb47ec62e56be009a46

@copybara-service copybara-service bot force-pushed the exported_pr_896628489 branch from 79f746d to 9b3c603 Compare April 13, 2026 19:59

PiperOrigin-RevId: 899155259
@copybara-service copybara-service bot force-pushed the exported_pr_896628489 branch from 9b3c603 to 52b323c Compare April 13, 2026 20:29
@copybara-service copybara-service bot merged commit 52b323c into master Apr 13, 2026
@copybara-service copybara-service bot deleted the exported_pr_896628489 branch April 13, 2026 20:29