
A4x Max BM slurm support.#5222

Merged
arpit974 merged 4 commits into GoogleCloudPlatform:develop from arpit974:A4xMaxSlurm
Feb 12, 2026

Conversation

@arpit974
Contributor

This pull request adds support for the a4x-maxgpu-4g-metal bare-metal machine type. It introduces a new
blueprint and deployment configuration to enable users to deploy Slurm clusters on this new hardware. The
changes are aligned with the existing structure for other machine types, such as a4x-highgpu-4g, while
accommodating the specific requirements of a bare-metal environment.

Key Changes

  • New Blueprint (a4xmax-bm-slurm-blueprint.yaml):

    • Introduces a new blueprint specifically for the a4x-maxgpu-4g-metal machine type.
    • Includes installation and configuration for MOFED/DOCA to support the advanced networking
      capabilities of the bare-metal hardware.
    • Adds the asapd-lite service to the startup sequence.
    • Configures a multi-NIC network environment, including two VPCs for IDPF and a separate RDMA network.
    • Includes prolog and epilog scripts for managing the nvidia-imex service.
  • New Deployment File (a4xmax-bm-slurm-deployment.yaml):

    • Provides a clean and simple deployment file that externalizes all the necessary variables for the
      new blueprint.
  • System Benchmarks:

    • The run-nccl-tests-via-ramble.sh script has been updated to correctly configure the environment
      variables for the multiple network interfaces on the bare-metal machine.
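The prolog/epilog wiring described above can be sketched roughly as follows. Only the service name (nvidia-imex) comes from this PR's description; the hook structure and the print-instead-of-execute wrapper are illustrative assumptions, not the blueprint's actual scripts.

```shell
#!/bin/bash
# Hypothetical sketch of Slurm prolog/epilog logic for nvidia-imex.
# Commands are printed rather than executed so the sketch is safe to run.
imex_action() {
  # Map the Slurm hook name to the systemctl command it would issue.
  case "$1" in
    prolog) echo "systemctl start nvidia-imex" ;;
    epilog) echo "systemctl stop nvidia-imex" ;;
    *) echo "unknown hook: $1" >&2; return 1 ;;
  esac
}

imex_action prolog   # before the job: ensure IMEX is running
imex_action epilog   # after the job: tear IMEX back down
```

In a real deployment, Slurm invokes the prolog and epilog scripts configured on the controller; the sketch only shows the start/stop pairing.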

Testing

  • The new blueprint was deployed successfully, and NCCL tests were run via the
    Ramble script, verifying the functionality.
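As a rough sketch, the deploy-and-benchmark flow looks like this. The paths follow the file names listed in this PR, but the gcluster flags mirror other cluster-toolkit ML examples and are assumptions to check against the new README; the wrapper only prints the commands.

```shell
#!/bin/bash
# Dry-run sketch of the testing flow described above. gcluster flags are
# assumptions based on similar examples, not verified against the README.
run() { echo "+ $*"; }   # print instead of executing

EXAMPLE=examples/machine-learning/a4x-maxgpu-4g-metal
run ./gcluster deploy -d "$EXAMPLE/a4xmax-bm-slurm-deployment.yaml" \
    "$EXAMPLE/a4xmax-bm-slurm-blueprint.yaml"
run bash "$EXAMPLE/system_benchmarks/run-nccl-tests-via-ramble.sh"
```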

@gemini-code-assist
Contributor

Summary of Changes

Hello @arpit974, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's capabilities by adding full support for the a4x-maxgpu-4g-metal bare-metal machine type within Slurm clusters. It provides all necessary infrastructure as code, including a new blueprint and deployment configuration, to seamlessly provision and manage these high-performance computing environments. The changes ensure that users can leverage the advanced networking and GPU features of the A4X Max hardware, complete with integrated benchmarking tools to validate performance.

Highlights

  • New Blueprint for A4X Max Bare Metal: Introduced a new blueprint (a4xmax-bm-slurm-blueprint.yaml) specifically for the a4x-maxgpu-4g-metal machine type, enabling Slurm cluster deployments on this new hardware.
  • Advanced Networking and Driver Configuration: The new blueprint includes comprehensive installation and configuration for MOFED/DOCA drivers, asapd-lite service, and a multi-NIC network environment with two VPCs for IDPF and a dedicated RDMA network.
  • Slurm Integration with NVIDIA IMEX: Configured Slurm controller prolog and epilog scripts to manage the nvidia-imex service, crucial for optimizing performance on A4X Max bare metal nodes.
  • Deployment Configuration and Benchmarking: Added a clean deployment file (a4xmax-bm-slurm-deployment.yaml) for easy configuration and updated the run-nccl-tests-via-ramble.sh script to correctly set environment variables for bare metal multi-NIC setups, facilitating system benchmarks.
  • GPU Definition Update: Updated the internal GPU definition module to recognize the new a4x-maxgpu-4g-metal and a4x-maxgpu-4g-metal-nolssd machine types with nvidia-gb300 GPUs.
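A deployment file in this style typically just pins the externalized variables. The sketch below is illustrative only; the variable names and values are assumptions, not the contents of a4xmax-bm-slurm-deployment.yaml.

```yaml
# Hypothetical sketch of a deployment file for the new blueprint.
# All names and values below are placeholders, not the shipped file.
---
vars:
  project_id: my-gcp-project       # assumption: your own project
  deployment_name: a4xmax-bm-slurm
  region: us-central1              # assumption
  zone: us-central1-a              # assumption
```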
Changelog
  • examples/machine-learning/a4x-maxgpu-4g-metal/README.md
    • Added a new README file detailing the deployment steps and blueprint overview for A4X Max bare metal Slurm clusters.
  • examples/machine-learning/a4x-maxgpu-4g-metal/a4xmax-bm-slurm-blueprint.yaml
    • Introduced a new blueprint for the a4x-maxgpu-4g-metal bare metal machine type.
    • Configured the image build process to include MOFED/DOCA drivers, asapd-lite service setup, and a networkd workaround.
    • Defined a multi-NIC network setup with two IDPF VPCs and an RDMA network.
    • Integrated NCCL plugin installation and DCGM enablement.
    • Added Slurm controller prolog and epilog scripts for nvidia-imex service management.
  • examples/machine-learning/a4x-maxgpu-4g-metal/a4xmax-bm-slurm-deployment.yaml
    • Provided a new deployment configuration file for the A4X Max Slurm blueprint, externalizing necessary variables.
  • examples/machine-learning/a4x-maxgpu-4g-metal/system_benchmarks/README.md
    • Added documentation on how to run system benchmarks, specifically NCCL tests, using the Ramble framework.
  • examples/machine-learning/a4x-maxgpu-4g-metal/system_benchmarks/run-nccl-tests-via-ramble.sh
    • Implemented a script to automate the setup and execution of NCCL tests via Ramble.
    • Configured environment variables within the script to correctly support multi-NIC bare metal setups.
  • modules/internal/gpu-definition/main.tf
    • Added a4x-maxgpu-4g-metal and a4x-maxgpu-4g-metal-nolssd machine types with nvidia-gb300 GPUs to the accelerator machine definitions.
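The multi-NIC environment setup mentioned for run-nccl-tests-via-ramble.sh usually boils down to steering NCCL at the right interfaces. The variables below are standard NCCL environment knobs, but the interface and HCA values are placeholders, not taken from the updated script.

```shell
#!/bin/bash
# Illustrative NCCL settings for a multi-NIC bare-metal node. Variable
# names are standard NCCL knobs; the values are assumed, not the script's.
export NCCL_SOCKET_IFNAME=enp0s19,enp64s19   # IDPF NICs (assumed names)
export NCCL_IB_HCA=mlx5                      # RDMA HCAs exposed via MOFED/DOCA
echo "socket ifaces: $NCCL_SOCKET_IFNAME"
echo "ib hcas: $NCCL_IB_HCA"
```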
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the a4x-maxgpu-4g-metal bare-metal machine type, which is a significant addition. The changes include a new blueprint, deployment configuration, documentation, and benchmark scripts. The implementation is largely well-structured and follows existing patterns.

However, there are a few areas that need attention:

  • Reproducibility: Several scripts fetch dependencies (like slurm-gcp tools and ramble) from the master branch of their respective GitHub repositories. This should be changed to use specific tags or commit hashes to ensure deployments are reproducible.
  • Staging Artifacts: The blueprint and documentation reference a staging Docker image for asapd-lite. This should be replaced with a production-ready image for stability.

I've added specific comments with suggestions for these points. Addressing them will improve the robustness and maintainability of this new blueprint.

Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/a4xmax-bm-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
@arpit974 arpit974 added the release-key-new-features Added to release notes under the "Key New Features" heading. label Feb 10, 2026
@arpit974
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the a4x-maxgpu-4g-metal bare-metal machine type by adding a new Slurm blueprint, deployment configuration, and associated documentation. The changes are comprehensive and well-structured. My review focuses on improving reproducibility by pinning external dependencies to specific versions, ensuring consistency in documentation, and hardening configurations by removing references to staging resources. All original comments were found to be valid and align with the provided rules, or are not covered by rules that would necessitate modification or removal. Therefore, all comments have been retained as originally suggested.

Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/a4xmax-bm-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md Outdated
Comment thread examples/machine-learning/a4x-maxgpu-4g-metal/README.md
@arpit974 arpit974 marked this pull request as ready for review February 11, 2026 07:32
@arpit974 arpit974 requested review from a team and samskillman as code owners February 11, 2026 07:32
LAVEEN previously approved these changes Feb 11, 2026
@arpit974 arpit974 merged commit 095f1dc into GoogleCloudPlatform:develop Feb 12, 2026
12 of 80 checks passed
kadupoornima pushed a commit to kadupoornima/cluster-toolkit that referenced this pull request Feb 17, 2026
adding support for the a4x-maxgpu-4g-metal bare-metal machine type. It introduces a new
blueprint and deployment configuration to enable users to deploy Slurm clusters on this new hardware. The
changes are aligned with the existing structure for other machine types, such as a4x-highgpu-4g, while
accommodating the specific requirements of a bare-metal environment.
