Modernize Tensor Parallelism using the DTensor API for Llama and Granite by aw471 · Pull Request #371 · foundation-model-stack/foundation-model-stack

aw471 · 2024-12-20T20:33:49Z

This PR modernizes the IBM FMS tensor parallel (TP) code for the Llama and Granite models by using the PyTorch Tensor Parallel API, which is built on DTensor.

Requires Torch 2.6 (nightly build: https://download.pytorch.org/whl/nightly/cu124), which fixes a DTensor incompatibility with torch.compile (pytorch/pytorch#108840).

Evaluated performance on the IBM benchmark script.

Other Details:

Llama and Granite have inference speed comparable to the original IBM TP implementation, except in the uncompiled, uncached benchmarks. The difference arises because this PR uses the non-distributed MultiHeadAttention layer and lets the Tensor Parallel API wrap its layers, so there is no reduce_from_tensor_model_parallel_region call for the cache as there is in TPMultiHeadAttention.
Llama and Granite use slightly more allocated memory in all benchmarks.
Llama shows significant reserved-memory improvements in the uncompiled, uncached end-to-end benchmark and in all compiled benchmarks as sequence length increases. Granite's reserved memory is less consistent: compared with the original IBM FMS implementation, it is better at sequence length 256, worse at 512, and similar at 1024. The benchmarks were run more than once to validate this behavior.
Compatibility with the original IBM FMS TP implementations is maintained for other models.
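
For readers less familiar with the pattern, the column-wise / row-wise sharding used in tensor parallelism, and the reduce step mentioned in the first item above, can be illustrated without any distributed runtime. The sketch below is not FMS or PyTorch code: it simulates two ranks with plain Python lists, and every name in it (`tensor_parallel_mlp`, `all_reduce_sum`, and so on) is hypothetical. The activation between the two projections is omitted for brevity.

```python
# Illustrative simulation only: plain Python lists stand in for tensors,
# and a list of per-rank results stands in for a process group.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Shard a matrix column-wise into `parts` equal pieces."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Shard a matrix row-wise into `parts` equal pieces."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Element-wise sum of the per-rank partial outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tensor_parallel_mlp(x, w1, w2, world_size):
    """Column-shard w1, row-shard w2, then reduce the partial outputs."""
    w1_shards = split_columns(w1, world_size)
    w2_shards = split_rows(w2, world_size)
    # Each simulated rank holds one shard of each weight and computes a partial sum.
    partials = [matmul(matmul(x, a), b) for a, b in zip(w1_shards, w2_shards)]
    return all_reduce_sum(partials)

x = [[1, 2]]
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]
w2 = [[1, 0], [0, 1], [2, 1], [1, 1]]

reference = matmul(matmul(x, w1), w2)
sharded = tensor_parallel_mlp(x, w1, w2, world_size=2)
print(reference, sharded)
```

Each rank's partial output is incomplete on its own; only after the sum do the sharded and unsharded results agree, which is why a reduce has to happen somewhere, whether it is written explicitly or inserted by a wrapping API.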

Benchmark tables can be seen in the README of our forked repo.

Add TP DTensor support for llama and granite

9cd6e30

suranimaria added a commit to HPML-Team9/foundation-model-stack that referenced this pull request Mar 30, 2025

Merge PR foundation-model-stack#371: Modernize tensor parallelism with DTensor API for Llama and Granite

c051ff2

aw471 wants to merge 1 commit into foundation-model-stack:main from HPML-IBM-FMS-STACK:add-dtensor-support
