Modernize Tensor Parallelism using the DTensor API for Llama and Granite #371

Open: aw471 wants to merge 1 commit into foundation-model-stack:main from HPML-IBM-FMS-STACK:add-dtensor-support
aw471 commented on Dec 20, 2024

This PR modernizes the IBM FMS tensor parallel (TP) code for the Llama and Granite models by using the Tensor Parallel API, which is built on DTensors.

Requires Torch 2.6 (https://download.pytorch.org/whl/nightly/cu124), which fixes the DTensor incompatibility with torch.compile (pytorch/pytorch#108840).
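Because of this version requirement, callers may want to guard the DTensor code path. The following is a minimal sketch of such a guard, not code from this PR: `meets_minimum` is a hypothetical helper, and the nightly-style version string in the usage note is only an illustrative value.

```python
import re


def meets_minimum(version, minimum=(2, 6)):
    """Return True if a version string is at least `minimum` (major, minor).

    Hypothetical helper: only the leading "major.minor" is compared, so
    nightly or local suffixes (".dev...", "+cu124") are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    found = (int(match.group(1)), int(match.group(2)))
    return found >= tuple(minimum)
```

In practice the argument would be `torch.__version__`; for example, `meets_minimum("2.6.0.dev20241220+cu124")` is true while `meets_minimum("2.5.1")` is false. Comparing integer tuples rather than raw strings avoids ordering "2.10" before "2.6".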

Performance was evaluated with the IBM benchmark script.

Other Details:

  • Llama and Granite have inference speed comparable to the original IBM TP implementation, except in the uncompiled, uncached benchmarks. The difference arises because the non-distributed MultiHeadAttention layer is used and the Tensor Parallel API wraps its layers, so there is no reduce_from_tensor_model_parallel_region call for the cache as there is in TPMultiHeadAttention.
  • Llama and Granite use slightly more allocated memory in all benchmarks.
  • Llama shows significant reserved memory improvements in the uncompiled, uncached end-to-end benchmark and in all compiled benchmarks as sequence length increases. Granite's reserved memory, on the other hand, varies: relative to the original IBM FMS implementation, it is better at sequence length 256, worse at sequence length 512, and similar at sequence length 1024. The benchmarks were run more than once to validate this behavior.
  • Compatibility with the original IBM FMS TP implementations is maintained for other models.
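For readers unfamiliar with why a reduce call appears in tensor parallel layers at all, here is a small pure-Python sketch of the arithmetic. It is not FMS or PyTorch code, and every function name in it is hypothetical. It assumes the usual pairing of a column-wise sharded projection followed by a row-wise sharded one, with an element-wise sum standing in for the all-reduce across ranks.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` equal groups of columns."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal groups of rows."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def tensor_parallel_forward(x, w1, w2, world_size):
    """Simulate x @ w1 @ w2 with w1 sharded by column and w2 sharded by row."""
    w1_shards = split_columns(w1, world_size)
    w2_shards = split_rows(w2, world_size)
    # Each simulated rank computes a partial output from its own shards.
    partials = [matmul(matmul(x, a), b) for a, b in zip(w1_shards, w2_shards)]
    # The element-wise sum below stands in for the all-reduce across ranks.
    out = partials[0]
    for part in partials[1:]:
        out = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(out, part)]
    return out


x = [[1, 2], [3, 4]]
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]

reference = matmul(matmul(x, w1), w2)
sharded = tensor_parallel_forward(x, w1, w2, world_size=2)
```

Here `sharded` equals `reference`, which shows that each rank holds only a partial result until the final sum is applied.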

Benchmark tables can be seen in the README of our forked repo.

suranimaria added a commit to HPML-Team9/foundation-model-stack that referenced this pull request Mar 30, 2025