# A4X-Max Bare Metal GKE toolkit blueprint #5211
**Merged**: vikramvs-gg merged 4 commits into GoogleCloudPlatform:develop from vikramvs-gg:a4x-max-gke on Feb 23, 2026.
### Requirements

The following requirements apply to an AI-optimized A4X-Max Bare Metal GKE cluster:

1. Your project must be allowlisted to use the A4X-Max machine type. Work with your account team to get your project allowlisted.
2. The recommended GKE version for A4X-Max support is 1.34.1-gke.3849001, which is also the required minimum. The GB300 GPUs in A4X-Max require GPU driver version 580.95.05 or later; by default, GKE automatically installs this driver on all A4X-Max nodes that run at least the minimum GKE version.
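To check whether a given cluster version meets this minimum, a simple version comparison helps. This is an illustrative sketch only: the `current` value below is a hypothetical example (substitute the version reported by `gcloud container clusters describe` or `kubectl version`), and `sort -V` is assumed to order GKE-style version strings correctly.

```shell
# A4X-Max minimum GKE version, from the requirements above.
required="1.34.1-gke.3849001"
# Hypothetical current version for the example.
current="1.34.2-gke.1001000"

# sort -V orders version strings; if the required version sorts first
# (or equal), the current version satisfies the minimum.
lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)
if [ "$lowest" = "$required" ]; then
  echo "OK: $current meets the minimum $required"
else
  echo "Upgrade needed: $current is older than $required"
fi
```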

### Create the cluster

1. [Launch Cloud Shell](https://docs.cloud.google.com/shell/docs/launching-cloud-shell). You can use a different environment; however, we recommend Cloud Shell because the Cluster Toolkit dependencies are already pre-installed there. To prepare a different environment instead, follow the instructions to [install dependencies](https://docs.cloud.google.com/cluster-toolkit/docs/setup/install-dependencies).
2. Clone the Cluster Toolkit git repository:

   ```bash
   cd ~
   git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
   ```

3. Install the Cluster Toolkit:

   ```bash
   cd cluster-toolkit && git checkout main && make
   ```

4. Create a Cloud Storage bucket to store the state of the Terraform deployment:

   ```bash
   gcloud storage buckets create gs://BUCKET_NAME \
       --default-storage-class=STANDARD \
       --project=PROJECT_ID \
       --location=COMPUTE_REGION_TERRAFORM_STATE \
       --uniform-bucket-level-access
   gcloud storage buckets update gs://BUCKET_NAME --versioning
   ```

5. Replace the following variables:
   * BUCKET_NAME: the name of the new Cloud Storage bucket.
   * PROJECT_ID: your Google Cloud project ID.
   * COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
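For illustration, the substitution can be previewed before running anything. All three values below are hypothetical; the snippet only prints the fully expanded command so it can be reviewed first, rather than executing it.

```shell
# Hypothetical example values; substitute your own.
BUCKET_NAME="example-tf-state"
PROJECT_ID="example-project"
COMPUTE_REGION_TERRAFORM_STATE="us-central1"

# Print the expanded bucket-creation command for review.
echo "gcloud storage buckets create gs://${BUCKET_NAME}" \
     "--default-storage-class=STANDARD" \
     "--project=${PROJECT_ID}" \
     "--location=${COMPUTE_REGION_TERRAFORM_STATE}" \
     "--uniform-bucket-level-access"
```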

6. In the [examples/gke-a4x-max-bm/gke-a4x-max-bm-deployment.yaml blueprint from the GitHub repo](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a4x-max-bm/gke-a4x-max-bm-deployment.yaml), fill in the following settings in the `terraform_backend_defaults` and `vars` sections to match the specific values for your deployment:
   * DEPLOYMENT_NAME: a unique name for the deployment, between 6 and 30 characters. If the deployment name isn't unique within a project, cluster creation fails. The default value is `gke-a4x-max-bm`.
   * BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
   * PROJECT_ID: your Google Cloud project ID.
   * COMPUTE_REGION: the compute region for the cluster.
   * COMPUTE_ZONE: the compute zone for the node pool of A4X Max machines. This zone must match the zone where machines are available in your reservation.
   * NODE_COUNT: the number of A4X Max nodes in your cluster's node pool, which must be 18 or fewer. We recommend 18 nodes to obtain the 1x72 GPU topology in one subblock using an NVLink domain.
   * IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect to the cluster. This CIDR block must include the IP address of the machine that you use to call Terraform. For more information, see [How authorized networks work](https://docs.cloud.google.com/kubernetes-engine/docs/concepts/network-isolation#how_authorized_networks_work).
   * For the reservation field, use one of the following, depending on whether you want to target specific [blocks](https://docs.cloud.google.com/ai-hypercomputer/docs/terminology#block) in a reservation when provisioning the node pool:
     * To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
     * To target a specific block within your reservation, use the reservation and block names in the following format:

       ```text
       RESERVATION_NAME/reservationBlocks/BLOCK_NAME
       ```

     * If you don't know which blocks are available in your reservation, see [View a reservation topology](https://docs.cloud.google.com/ai-hypercomputer/docs/view-reserved-capacity#view-capacity-topology).
   * Set the boot disk sizes for each node of the system and A4X Max node pools. The disk size you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of repeatedly pulling an image, set a larger disk size to accommodate your framework, model, or container image:
     * SYSTEM_NODE_POOL_DISK_SIZE_GB: the boot disk size for each node of the system node pool. The smallest allowed size is 10. The default value is 200.
     * A4X_MAX_NODE_POOL_DISK_SIZE_GB: the boot disk size for each node of the A4X Max node pool. The smallest allowed size is 10. The default value is 100.
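For illustration, a deployment file with every placeholder filled in might look like the following. All values are hypothetical examples; the field names follow the `gke-a4x-max-bm-deployment.yaml` template.

```yaml
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: example-tf-state        # BUCKET_NAME (hypothetical)

vars:
  project_id: example-project       # PROJECT_ID (hypothetical)
  deployment_name: gke-a4x-max-bm   # default DEPLOYMENT_NAME
  region: us-central1               # COMPUTE_REGION (hypothetical)
  zone: us-central1-a               # COMPUTE_ZONE (hypothetical)
  static_node_count: 18             # recommended for the 1x72 topology
  authorized_cidr: 203.0.113.10/32  # the machine calling Terraform
  reservation: example-reservation  # or example-reservation/reservationBlocks/example-block
  system_node_pool_disk_size_gb: 200
  a4x_max_node_pool_disk_size_gb: 100
```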

7. To modify advanced settings, edit the [examples/gke-a4x-max-bm/gke-a4x-max-bm.yaml](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a4x-max-bm/gke-a4x-max-bm.yaml) file.

8. [Generate Application Default Credentials (ADC)](https://docs.cloud.google.com/docs/authentication/provide-credentials-adc#google-idp) to provide access to Terraform. If you're using Cloud Shell, run:

   ```bash
   gcloud auth application-default login
   ```

9. Deploy the blueprint to provision the GKE infrastructure using A4X Max machine types:

   ```bash
   cd ~/cluster-toolkit
   ./gcluster deploy -d \
       examples/gke-a4x-max-bm/gke-a4x-max-bm-deployment.yaml \
       examples/gke-a4x-max-bm/gke-a4x-max-bm.yaml
   ```

10. When prompted, select **(A)pply** to deploy the blueprint.

    * The blueprint creates VPC networks, a GPU RDMA VPC network, service accounts, a cluster, and a node pool.
    * To support the fio-bench-job-template job template in the blueprint, Google Cloud buckets, network storage, and persistent volume resources are also created.

### Run NCCL on GKE clusters

This section describes how to run [NCCL/gIB](https://docs.cloud.google.com/ai-hypercomputer/docs/nccl/overview) tests on GKE clusters:

1. Connect to your cluster:

   ```bash
   gcloud container clusters get-credentials CLUSTER_NAME \
       --location=COMPUTE_REGION
   ```

   Replace the following variables:

   * CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, the name is based on the DEPLOYMENT_NAME.
   * COMPUTE_REGION: the name of the compute region.

2. Deploy an all-gather NCCL performance test by using the [gke-a4x-max-bm/nccl-jobset-example.yaml](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a4x-max-bm/nccl-jobset-example.yaml) file:
   1. The test uses 2 nodes by default. To change the number of nodes, modify the following values in the YAML file:
      1. numNodes
      2. parallelism
      3. completions
      4. N_NODES
   2. Create the resources to run the test:

      ```bash
      kubectl create -f ~/cluster-toolkit/examples/gke-a4x-max-bm/nccl-jobset-example.yaml
      ```

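The node-count edit described in step 2 can be sketched with `sed`. This is a hedged illustration only: the fixture below mimics just the four relevant lines of nccl-jobset-example.yaml (its exact layout is an assumption), so check the real manifest before editing it in place.

```shell
# Target node count for the NCCL test (the example's default is 2).
N=4

# Fixture mimicking the fields to change; the real file is
# ~/cluster-toolkit/examples/gke-a4x-max-bm/nccl-jobset-example.yaml.
cat > /tmp/nccl-snippet.yaml <<'EOF'
numNodes: 2
parallelism: 2
completions: 2
- name: N_NODES
  value: "2"
EOF

# Rewrite the scalar fields and the quoted env value in one pass.
sed -i -E \
  -e "s/^(numNodes|parallelism|completions): 2$/\1: ${N}/" \
  -e "s/value: \"2\"/value: \"${N}\"/" \
  /tmp/nccl-snippet.yaml

cat /tmp/nccl-snippet.yaml
```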
3. Confirm that all nccl-test Pods have reached the Completed state:

   ```bash
   kubectl get pods
   ```

4. Find a Pod name matching the pattern `nccl-all-worker-0-0-*`. The logs of this Pod contain the results of the NCCL test. To fetch the logs for this Pod, run the following command:

   ```bash
   kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-all-worker-0-0)
   ```

### Clean up resources created by Cluster Toolkit

To avoid recurring charges, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and the GKE cluster:

```bash
cd ~/cluster-toolkit
./gcluster destroy CLUSTER_NAME
```

Replace CLUSTER_NAME with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on the DEPLOYMENT_NAME.

### asapd-lite DaemonSet manifest
```yaml
# Copyright 2026 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: asapd-lite
  namespace: kube-system
  labels:
    k8s-app: asapd-lite
spec:
  selector:
    matchLabels:
      k8s-app: asapd-lite
  template:
    metadata:
      labels:
        k8s-app: asapd-lite
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values:
                      - a4x-maxgpu-4g-metal
                      - a4x-maxgpu-4g-metal-nolssd
                  - key: cloud.google.com/gke-os-distribution
                    operator: In
                    values:
                      - cos
      tolerations:
        - operator: Exists

      # Use the host's network namespace. This is essential for managing the
      # host's CX PFs.
      hostNetwork: true

      # Use the host's PID namespace. This can be useful for
      # debugging and interacting with host processes.
      hostPID: true
      containers:
        - name: asapd-lite
          image: us-docker.pkg.dev/gce-ai-infra/asapd-lite/asapd-lite:v0.0.3
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -x
              /usr/local/bin/run_asapd_lite.sh

              # Wait for a bit to let interfaces come up.
              sleep 15

              DADFAILED_FOUND=0
              INTERFACES=$(ip -o link show | awk -F': ' '{print $2}' | grep 'gpu')

              for iface in $INTERFACES; do
                if ip -6 addr show dev "$iface" | grep -q 'dadfailed'; then
                  echo "Found dadfailed on interface $iface. Bringing it down and up."
                  ip link set dev "$iface" down
                  ip link set dev "$iface" up
                  DADFAILED_FOUND=1
                fi
              done

              if [ "$DADFAILED_FOUND" -eq 1 ]; then
                echo "Found and attempted to fix dadfailed on one or more interfaces. Exiting to allow container restart."
                exit 1
              fi

              # Keep the container running
              echo "No dadfailed detected on interfaces."
              wait
          env:
            - name: LD_LIBRARY_PATH
              value: /asapd-lite/controller
          livenessProbe:
            httpGet:
              path: /healthz
              port: 19540
            initialDelaySeconds: 60
          readinessProbe:
            httpGet:
              path: /healthz
              port: 19540
          securityContext:
            # 'privileged: true' is required for modifying most host-level
            # resources, including network devices and udev rules.
            privileged: true
          volumeMounts:
            - name: host-sys
              mountPath: /sys
            - name: host-proc
              mountPath: /proc
            - name: host-udev-rules
              mountPath: /etc/udev/rules.d
            - name: host-run-udev
              mountPath: /run/udev
            - name: host-run-asapd-lite
              mountPath: /run/asapd-lite
            - name: hugepage-2mi
              mountPath: /hugepages-2Mi
            - name: var-log-google-asapd-lite
              mountPath: /var/log/google/asapd-lite
            - name: var-lib-google-asapd-lite
              mountPath: /var/lib/google/asapd-lite
            - name: host-systemd-system
              mountPath: /etc/systemd/system
            - name: host-run-systemd
              mountPath: /run/systemd
          resources:
            limits:
              hugepages-2Mi: 8192Mi
              memory: 1024Mi
      volumes:
        - name: host-sys
          hostPath:
            path: /sys
        - name: host-proc
          hostPath:
            path: /proc
        - name: host-udev-rules
          hostPath:
            path: /etc/udev/rules.d
        - name: host-run-udev
          hostPath:
            path: /run/udev
        - name: host-run-asapd-lite
          hostPath:
            path: /run/asapd-lite
            type: DirectoryOrCreate
        - name: hugepage-2mi
          emptyDir:
            medium: HugePages-2Mi
        - name: var-log-google-asapd-lite
          hostPath:
            path: /var/log/google/asapd-lite
            type: DirectoryOrCreate
        # Mount /var/lib/google/asapd-lite to provide a persistent location for
        # the ipam binary on the host, allowing it to be executed from the host.
        # This location was chosen based on this documentation
        # https://docs.cloud.google.com/container-optimized-os/docs/concepts/disks-and-filesystem#working_with_the_file_system
        # Hence, this may not necessarily be a suitable location for non-COS based
        # systems.
        - name: var-lib-google-asapd-lite
          hostPath:
            path: /var/lib/google/asapd-lite
            type: DirectoryOrCreate
        - name: host-systemd-system
          hostPath:
            path: /etc/systemd/system
        - name: host-run-systemd
          hostPath:
            path: /run/systemd
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1%
```

### Deployment file: gke-a4x-max-bm-deployment.yaml
```yaml
# Copyright 2026 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

terraform_backend_defaults:
  type: gcs
  configuration:
    # The GCS bucket used for storing terraform state
    bucket:

vars:
  # Your GCP Project ID
  project_id:

  # This should be unique across all of your Cluster
  # Toolkit Deployments.
  deployment_name:

  # The GCP Region used for this deployment.
  region:

  # The GCP Zone used for this deployment.
  zone:

  # The number of nodes to be created.
  static_node_count:

  # Cidr block containing the IP of the machine calling terraform.
  # To allow all (IAM restrictions still enforced), use 0.0.0.0/0
  # To allow only your IP address, use <YOUR-IP-ADDRESS>/32
  authorized_cidr:

  # The name of the compute engine reservation in the form of
  # <reservation-name>
  # To target a BLOCK_NAME, the name of the extended reservation
  # can be inputted as <reservation-name>/reservationBlocks/<reservation-block-name>
  reservation:

  # The disk size of system node pool for this deployment.
  system_node_pool_disk_size_gb:

  # The disk size of a4x-max node pool for this deployment.
  a4x_max_node_pool_disk_size_gb:
```