Merged
22 changes: 21 additions & 1 deletion third_party/xla/docs/megascale/debugging_workflow.md
@@ -44,6 +44,26 @@ Megascale hang detected: Timed out waiting for 4 graphs to complete at launch_id

### Diagnosis

#### Interpreting TPU States

Before diagnosing MXLA hangs, it is important to understand the TPU states
report format. Below is a sample report:

```text
Full error digest:
Potential cause: <determined_cause>
Potential culprit workers: <task_name>
First error timestamp: <timestamp>
First error type: <error_type>
TPU states:
Launch ID: <launch_id>
Module: jit.step_fn Fingerprint: <fingerprint>
Sample worker: <task_name>@<host_name>:<tpu_chip>:<tpu_core>
Tag:PC breakdown:
<num_cores>@<location>(HLO): [<task_name>:<host_name>@<tpu_chip>:<tpu_core>, ...]
...
```
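As a rough illustration of how the `PC breakdown` entries in such a report could be read programmatically, the sketch below parses each entry into a core count, a location, and a worker list, then sorts the groups so the smallest ones come first. It assumes every entry sits on one line and follows the template above exactly. The function name `parse_pc_breakdown`, the `PcGroup` type, and all sample values (`loc_a`, `task0`, `host0`, and so on) are hypothetical and do not come from real Megascale output.

```python
import re
from typing import NamedTuple

# Assumed line shape, copied from the report template:
#   <num_cores>@<location>(HLO): [<worker>, <worker>, ...]
# The parenthesised part is captured as free text because the template does
# not say whether "HLO" is a literal or a placeholder.
BREAKDOWN_RE = re.compile(
    r"^\s*(?P<num_cores>\d+)@(?P<location>.+?)"
    r"\((?P<hlo>[^()]*)\):\s*\[(?P<workers>.*)\]\s*$"
)


class PcGroup(NamedTuple):
    num_cores: int
    location: str
    hlo: str
    workers: list[str]


def parse_pc_breakdown(report: str) -> list[PcGroup]:
    """Return the PC breakdown groups in a report, smallest group first."""
    groups = []
    for line in report.splitlines():
        match = BREAKDOWN_RE.match(line)
        if match is None:
            continue  # Not a breakdown entry (header line, blank line, etc.).
        workers = [
            w.strip()
            for w in match.group("workers").split(",")
            # Skip empty items and a literal "..." truncation marker.
            if w.strip() and w.strip() != "..."
        ]
        groups.append(
            PcGroup(
                num_cores=int(match.group("num_cores")),
                location=match.group("location"),
                hlo=match.group("hlo"),
                workers=workers,
            )
        )
    return sorted(groups, key=lambda g: g.num_cores)


# Made-up sample that only mimics the template's shape.
sample = """\
TPU states:
Launch ID: 7
Tag:PC breakdown:
6@loc_a(HLO): [task0:host0@0:0, task0:host0@1:0, ...]
2@loc_b(HLO): [task1:host1@0:0, task1:host1@1:0]
"""

for group in parse_pc_breakdown(sample):
    print(group.num_cores, group.location, group.workers)
```

Sorting by core count is only a convenience for scanning a long breakdown; which group matters for a given hang depends on the diagnosis steps in the rest of this document.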

#### Bad TPU Chip (tensor core or sparse core)

```text
@@ -313,4 +333,4 @@ in order to share them with the XLA or Megascale team.

**Note on Future Tooling:** Google is actively working on open-sourcing versions
of diagnostic dashboards to provide a more streamlined experience for Cloud TPU
customers to identify and diagnose stragglers. These will be available soon.