[PJRT] Experimental support for torch.distributed and DDP on TPU v2/v3 #4520

will-cromar merged 10 commits into master

Conversation
@will-cromar is this one ready for review?
(force-pushed 8699230 to 81b14d1)
I'll take another pass tomorrow to polish and add some comments, but this should be ready for review.
```diff
 for step in range(steps):
-  # To make torch.randn produce same results across devices.
-  torch.manual_seed(2022 + step)
+  rng = torch.Generator().manual_seed(2022 + step)
```
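The reason a per-device generator with a fixed seed works here can be checked without any TPU involved: two `torch.Generator` instances seeded identically produce identical `torch.randn` draws, and neither touches the global RNG state. A minimal sketch:

```python
import torch

# Two independent generators with the same seed, as in the test above.
rng1 = torch.Generator().manual_seed(2022)
rng2 = torch.Generator().manual_seed(2022)

# Draws from identically-seeded generators match exactly...
a = torch.randn(4, generator=rng1)
b = torch.randn(4, generator=rng2)
assert torch.equal(a, b)

# ...without mutating the global RNG, unlike torch.manual_seed().
state_before = torch.random.get_rng_state()
torch.randn(4, generator=rng1)
assert torch.equal(state_before, torch.random.get_rng_state())
```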
Curious why we use torch.Generator() instead?
The other option would be to wrap these randn calls in a lock and give them a common global seed, but explicitly creating a new generator with the same seed seems clearer to me. I would have done the same for module initialization, but that case doesn't support a custom RNG.
One last comment: maybe add a comment to the test case explaining the reasoning behind this?
```diff
 pjrt._run_multiprocess(
-    util.ddp_correctness, ddp=ddp, use_large_net=True, debug=FLAGS.debug)
+    util.ddp_correctness,
+    init_method='pjrt://',
```
Wonder if we want to parameterize the init_method with env:// as well?
Good idea. Added another test that skips for TPU <= v3, since env:// doesn't work nicely with multithreading.
…/v3 (#4520)

* Implement multithreaded XLA process group
* Fix tests
* Merge PJRT MNIST test
* formatting
* Clarify random generation in test_ddp.py
* Mark some variables private
* Remove some extra comments
* Add test that uses env:// method
* Explain local RNG
* Explain --pjrt_distributed flag
* `ThreadLocalWorld` to enable multithreading
* `torch.distributed` `init_method` that uses PJRT runtime parameters and supports multithreading
* The `torch.distributed` "rank" will become the same as our "ordinal", meaning we have one fewer set of indices to track
* `pjrt.DistributedDataParallel`, now that the upstream version works on v3

Performance comparison using ResNet50 with fake data on TPU v3:
Example usage:
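The usage snippet did not survive scraping; a minimal sketch of what DDP training with the new `pjrt://` init method looks like, based on the API this PR describes (the model, shapes, and `_mp_fn` name here are illustrative, not from the PR):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.experimental.pjrt_backend  # registers the pjrt:// init method


def _mp_fn(index):
  # pjrt:// derives rank and world size from the PJRT runtime itself,
  # so no MASTER_ADDR/MASTER_PORT environment setup is required, and it
  # supports the multithreaded replicas used on TPU v2/v3.
  dist.init_process_group('xla', init_method='pjrt://')

  device = xm.xla_device()
  model = DDP(nn.Linear(128, 10).to(device))
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

  inputs = torch.randn(8, 128, device=device)
  loss = model(inputs).sum()
  loss.backward()  # DDP all-reduces gradients across replicas
  optimizer.step()
  xm.mark_step()


if __name__ == '__main__':
  xmp.spawn(_mp_fn)
```

Note that after `init_process_group`, the `torch.distributed` rank equals the PJRT ordinal, so the same index works for both APIs.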
Needs rebasing after #4504 merges
Follow-up: