
A4x Max BM slurm support.#5222

Merged
arpit974 merged 4 commits into GoogleCloudPlatform:develop from arpit974:A4xMaxSlurm
Feb 12, 2026

Conversation

@arpit974
Contributor

This pull request adds support for the a4x-maxgpu-4g-metal bare-metal machine type. It introduces a new
blueprint and deployment configuration to enable users to deploy Slurm clusters on this new hardware. The
changes are aligned with the existing structure for other machine types, such as a4x-highgpu-4g, while
accommodating the specific requirements of a bare-metal environment.

Key Changes

  • New Blueprint (a4xmax-bm-slurm-blueprint.yaml):

    • Introduces a new blueprint specifically for the a4x-maxgpu-4g-metal machine type.
    • Includes installation and configuration for MOFED/DOCA to support the advanced networking
      capabilities of the bare-metal hardware.
    • Adds the asapd-lite service to the startup sequence.
    • Configures a multi-NIC network environment, including two VPCs for IDPF and a separate RDMA network.
    • Includes prolog and epilog scripts for managing the nvidia-imex service.
  • New Deployment File (a4xmax-bm-slurm-deployment.yaml):

    • Provides a clean and simple deployment file that externalizes all the necessary variables for the
      new blueprint.
  • System Benchmarks:

    • The run-nccl-tests-via-ramble.sh script has been updated to correctly configure the environment
      variables for the multiple network interfaces on the bare-metal machine.
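The prolog/epilog wiring described above can be sketched roughly as follows. Only the service name (nvidia-imex) comes from this PR's description; the hook structure and the print-instead-of-execute wrapper are illustrative assumptions, not the blueprint's actual scripts.

```shell
#!/bin/bash
# Hypothetical sketch of Slurm prolog/epilog logic for nvidia-imex.
# Commands are printed rather than executed so the sketch is safe to run.
imex_action() {
  # Map the Slurm hook name to the systemctl command it would issue.
  case "$1" in
    prolog) echo "systemctl start nvidia-imex" ;;
    epilog) echo "systemctl stop nvidia-imex" ;;
    *) echo "unknown hook: $1" >&2; return 1 ;;
  esac
}

imex_action prolog   # before the job: ensure IMEX is running
imex_action epilog   # after the job: tear IMEX back down
```

In a real deployment, Slurm invokes the prolog and epilog scripts configured on the controller; the sketch only shows the start/stop pairing.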

Testing

  • The new blueprint was deployed successfully, and NCCL tests were run via the
    Ramble script, verifying the functionality.
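As a rough sketch, the deploy-and-benchmark flow looks like this. The paths follow the file names listed in this PR, but the gcluster flags mirror other cluster-toolkit ML examples and are assumptions to check against the new README; the wrapper only prints the commands.

```shell
#!/bin/bash
# Dry-run sketch of the testing flow described above. gcluster flags are
# assumptions based on similar examples, not verified against the README.
run() { echo "+ $*"; }   # print instead of executing

EXAMPLE=examples/machine-learning/a4x-maxgpu-4g-metal
run ./gcluster deploy -d "$EXAMPLE/a4xmax-bm-slurm-deployment.yaml" \
    "$EXAMPLE/a4xmax-bm-slurm-blueprint.yaml"
run bash "$EXAMPLE/system_benchmarks/run-nccl-tests-via-ramble.sh"
```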

@gemini-code-assist
Contributor

Summary of Changes

Hello @arpit974, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's capabilities by adding full support for the a4x-maxgpu-4g-metal bare-metal machine type within Slurm clusters. It provides all necessary infrastructure as code, including a new blueprint and deployment configuration, to seamlessly provision and manage these high-performance computing environments. The changes ensure that users can leverage the advanced networking and GPU features of the A4X Max hardware, complete with integrated benchmarking tools to validate performance.

Highlights

  • New Blueprint for A4X Max Bare Metal: Introduced a new blueprint (a4xmax-bm-slurm-blueprint.yaml) specifically for the a4x-maxgpu-4g-metal machine type, enabling Slurm cluster deployments on this new hardware.
  • Advanced Networking and Driver Configuration: The new blueprint includes comprehensive installation and configuration for MOFED/DOCA drivers, asapd-lite service, and a multi-NIC network environment with two VPCs for IDPF and a dedicated RDMA network.
  • Slurm Integration with NVIDIA IMEX: Configured Slurm controller prolog and epilog scripts to manage the nvidia-imex service, crucial for optimizing performance on A4X Max bare metal nodes.
  • Deployment Configuration and Benchmarking: Added a clean deployment file (a4xmax-bm-slurm-deployment.yaml) for easy configuration and updated the run-nccl-tests-via-ramble.sh script to correctly set environment variables for bare metal multi-NIC setups, facilitating system benchmarks.
  • GPU Definition Update: Updated the internal GPU definition module to recognize the new a4x-maxgpu-4g-metal and a4x-maxgpu-4g-metal-nolssd machine types with nvidia-gb300 GPUs.
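A deployment file in this style typically just pins the externalized variables. The sketch below is illustrative only; the variable names and values are assumptions, not the contents of a4xmax-bm-slurm-deployment.yaml.

```yaml
# Hypothetical sketch of a deployment file for the new blueprint.
# All names and values below are placeholders, not the shipped file.
---
vars:
  project_id: my-gcp-project       # assumption: your own project
  deployment_name: a4xmax-bm-slurm
  region: us-central1              # assumption
  zone: us-central1-a              # assumption
```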
Changelog
  • examples/machine-learning/a4x-maxgpu-4g-metal/README.md
    • Added a new README file detailing the deployment steps and blueprint overview for A4X Max bare metal Slurm clusters.
  • examples/machine-learning/a4x-maxgpu-4g-metal/a4xmax-bm-slurm-blueprint.yaml
    • Introduced a new blueprint for the a4x-maxgpu-4g-metal bare metal machine type.
    • Configured the image build process to include MOFED/DOCA drivers, asapd-lite service setup, and a networkd workaround.
    • Defined a multi-NIC network setup with two IDPF VPCs and an RDMA network.
    • Integrated NCCL plugin installation and DCGM enablement.
    • Added Slurm controller prolog and epilog scripts for nvidia-imex service management.
  • examples/machine-learning/a4x-maxgpu-4g-metal/a4xmax-bm-slurm-deployment.yaml
    • Provided a new deployment configuration file for the A4X Max Slurm blueprint, externalizing necessary variables.
  • examples/machine-learning/a4x-maxgpu-4g-metal/system_benchmarks/README.md
    • Added documentation on how to run system benchmarks, specifically NCCL tests, using the Ramble framework.
  • examples/machine-learning/a4x-maxgpu-4g-metal/system_benchmarks/run-nccl-tests-via-ramble.sh
    • Implemented a script to automate the setup and execution of NCCL tests via Ramble.
    • Configured environment variables within the script to correctly support multi-NIC bare metal setups.
  • modules/internal/gpu-definition/main.tf
    • Added a4x-maxgpu-4g-metal and a4x-maxgpu-4g-metal-nolssd machine types with nvidia-gb300 GPUs to the accelerator machine definitions.
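The multi-NIC environment setup mentioned for run-nccl-tests-via-ramble.sh usually boils down to steering NCCL at the right interfaces. The variables below are standard NCCL environment knobs, but the interface and HCA values are placeholders, not taken from the updated script.

```shell
#!/bin/bash
# Illustrative NCCL settings for a multi-NIC bare-metal node. Variable
# names are standard NCCL knobs; the values are assumed, not the script's.
export NCCL_SOCKET_IFNAME=enp0s19,enp64s19   # IDPF NICs (assumed names)
export NCCL_IB_HCA=mlx5                      # RDMA HCAs exposed via MOFED/DOCA
echo "socket ifaces: $NCCL_SOCKET_IFNAME"
echo "ib hcas: $NCCL_IB_HCA"
```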
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the a4x-maxgpu-4g-metal bare-metal machine type, which is a significant addition. The changes include a new blueprint, deployment configuration, documentation, and benchmark scripts. The implementation is largely well-structured and follows existing patterns.

However, there are a few areas that need attention:

  • Reproducibility: Several scripts fetch dependencies (like slurm-gcp tools and ramble) from the master branch of their respective GitHub repositories. This should be changed to use specific tags or commit hashes to ensure deployments are reproducible.
  • Staging Artifacts: The blueprint and documentation reference a staging Docker image for asapd-lite. This should be replaced with a production-ready image for stability.

I've added specific comments with suggestions for these points. Addressing them will improve the robustness and maintainability of this new blueprint.

Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/a4xmax-bm-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
@arpit974 arpit974 added the release-key-new-features Added to release notes under the "Key New Features" heading. label Feb 10, 2026
@arpit974
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the a4x-maxgpu-4g-metal bare-metal machine type by adding a new Slurm blueprint, deployment configuration, and associated documentation. The changes are comprehensive and well-structured. My review focuses on improving reproducibility by pinning external dependencies to specific versions, ensuring consistency in documentation, and hardening configurations by removing references to staging resources. All original comments were found to be valid and align with the provided rules, or are not covered by rules that would necessitate modification or removal. Therefore, all comments have been retained as originally suggested.

Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/a4xmax-bm-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md
@arpit974 arpit974 marked this pull request as ready for review February 11, 2026 07:32
@arpit974 arpit974 requested review from a team and samskillman as code owners February 11, 2026 07:32
LAVEEN previously approved these changes Feb 11, 2026
@arpit974 arpit974 merged commit 095f1dc into GoogleCloudPlatform:develop Feb 12, 2026
12 of 80 checks passed
kadupoornima pushed a commit to kadupoornima/cluster-toolkit that referenced this pull request Feb 17, 2026
adding support for the a4x-maxgpu-4g-metal bare-metal machine type. It introduces a new
blueprint and deployment configuration to enable users to deploy Slurm clusters on this new hardware. The
changes are aligned with the existing structure for other machine types, such as a4x-highgpu-4g, while
accommodating the specific requirements of a bare-metal environment.
