From fa5ac52a0d00fd5a24ea74b582a7bcb1fe341373 Mon Sep 17 00:00:00 2001
From: Badr Badawi
Date: Mon, 8 Jun 2026 11:09:33 -0700
Subject: [PATCH] Add "Interpreting TPU States" section to megascale debugging doc.

PiperOrigin-RevId: 928666227
---
 .../xla/docs/megascale/debugging_workflow.md | 22 ++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/third_party/xla/docs/megascale/debugging_workflow.md b/third_party/xla/docs/megascale/debugging_workflow.md
index c66d52b7c9f5f6..55b1adef7b3814 100644
--- a/third_party/xla/docs/megascale/debugging_workflow.md
+++ b/third_party/xla/docs/megascale/debugging_workflow.md
@@ -44,6 +44,26 @@ Megascale hang detected: Timed out waiting for 4 graphs to complete at launch_id
 
 ### Diagnosis
 
+#### Interpreting TPU States
+
+Before diagnosing MXLA hangs, it is important to understand the TPU states
+report format. Below is a sample report:
+
+```text
+Full error digest:
+Potential cause:
+Potential culprit workers:
+First error timestamp:
+First error type:
+TPU states:
+Launch ID:
+Module: jit.step_fn Fingerprint:
+Sample worker: @::
+Tag:PC breakdown:
+@(HLO): [:@:, ...]
+...
+```
+
 #### Bad TPU Chip (tensor core or sparse core)
 
 ```text
@@ -313,4 +333,4 @@ in order to share them with the XLA or Megascale team.
 
 **Note on Future Tooling:** Google is actively working on open-sourcing versions
 of diagnostic dashboards to provide a more streamlined experience for Cloud TPU
-customers to identify and diagnose stragglers. These will be available soon.
\ No newline at end of file
+customers to identify and diagnose stragglers. These will be available soon.