## examples/gke-a4x-max-bm/README.md
### Requirements

The following requirements apply to an AI-optimized A4X-Max Bare Metal GKE cluster:

1. Your project must be allowlisted to use the A4X-Max machine type. Work with your account team to get your project allowlisted.
2. The recommended GKE version for A4X-Max support is 1.34.1-gke.3849001, which is also the required minimum. The GB300 GPUs in A4X-Max require GPU driver version 580.95.05 or later; by default, GKE automatically installs this driver on all A4X-Max nodes that run the required minimum GKE version.

### Create the cluster

1. [Launch Cloud Shell](https://docs.cloud.google.com/shell/docs/launching-cloud-shell). You can use a different environment; however, we recommend Cloud Shell because the dependencies are already pre-installed for Cluster Toolkit. If you don't want to use Cloud Shell, follow the instructions to [install dependencies](https://docs.cloud.google.com/cluster-toolkit/docs/setup/install-dependencies) to prepare a different environment.
2. Clone the Cluster Toolkit from the git repository:

```bash
cd ~
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
```

3. Install the Cluster Toolkit:

```bash
cd cluster-toolkit && git checkout main && make
```

4. Create a Cloud Storage bucket to store the state of the Terraform deployment:

```bash
gcloud storage buckets create gs://BUCKET_NAME \
--default-storage-class=STANDARD \
--project=PROJECT_ID \
--location=COMPUTE_REGION_TERRAFORM_STATE \
--uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
```

5. Replace the following variables:
* BUCKET_NAME: the name of the new Cloud Storage bucket.
* PROJECT_ID: your Google Cloud project ID.
* COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
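Bucket names must be globally unique. As a quick local sanity check before running the command, you can match the candidate name against a simplified subset of the Cloud Storage naming rules (the regex below is an approximation of those rules, not the full rule set, and the bucket name is a hypothetical placeholder):

```shell
# Simplified check: 3-63 characters, starting and ending with a lowercase
# letter or digit, with lowercase letters, digits, dots, hyphens, and
# underscores in between. The full Cloud Storage rules are stricter.
BUCKET_NAME="my-tf-state-bucket"   # hypothetical name
if printf '%s' "$BUCKET_NAME" | grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'; then
  echo "bucket name looks valid"
else
  echo "bucket name looks invalid"
fi
```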

6. In the [examples/gke-a4x-max-bm/gke-a4x-max-bm-deployment.yaml blueprint from the GitHub repo](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a4x-max-bm/gke-a4x-max-bm-deployment.yaml), fill in the following settings in the terraform\_backend\_defaults and vars sections to match the specific values for your deployment:
* DEPLOYMENT_NAME: a unique name for the deployment, which must be between 6 and 30 characters in length. If the deployment name isn't unique within a project, cluster creation fails. The default value is gke-a4x-max-bm.
* BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
* PROJECT_ID: your Google Cloud project ID.
* COMPUTE_REGION: the compute region for the cluster.
* COMPUTE_ZONE: the compute zone for the node pool of A4X-Max machines. This zone must match the zone where machines are available in your reservation.
* NODE_COUNT: the number of A4X-Max nodes in your cluster's node pool, which must be 18 or fewer. We recommend 18 nodes to obtain a 1x72 GPU topology in one subblock using an NVLink domain.
* IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine that you want to use to call Terraform. For more information, see [How authorized networks work](https://docs.cloud.google.com/kubernetes-engine/docs/concepts/network-isolation#how_authorized_networks_work).
* For the reservation field, use one of the following, depending on whether you want to target specific [blocks](https://docs.cloud.google.com/ai-hypercomputer/docs/terminology#block) in a reservation when provisioning the node pool:
* To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
* To target a specific block within your reservation, use the reservation and block names in the following format:

```text
RESERVATION_NAME/reservationBlocks/BLOCK_NAME
```

* If you don't know which blocks are available in your reservation, see [View a reservation topology](https://docs.cloud.google.com/ai-hypercomputer/docs/view-reserved-capacity#view-capacity-topology).
* Set the boot disk sizes for each node of the system and A4X-Max node pools. The disk size that you need depends on your use case. For example, if you use the disk as a cache to reduce the latency of repeatedly pulling an image, you can set a larger disk size to accommodate your framework, model, or container image:
* SYSTEM_NODE_POOL_DISK_SIZE_GB: the size, in GB, of the boot disk for each node of the system node pool. The smallest allowed disk size is 10. The default value is 200.
* A4X_MAX_NODE_POOL_DISK_SIZE_GB: the size, in GB, of the boot disk for each node of the A4X-Max node pool. The smallest allowed disk size is 10. The default value is 100.
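As a concrete illustration, a filled-in deployment file might look like the following sketch. All values are hypothetical placeholders (the project, region, bucket, and reservation names are invented), not recommendations:

```yaml
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: my-tf-state-bucket          # hypothetical bucket name

vars:
  project_id: my-project-id             # hypothetical project ID
  deployment_name: gke-a4x-max-bm       # the blueprint default
  region: us-central1                   # hypothetical region
  zone: us-central1-a                   # hypothetical zone; must match your reservation
  static_node_count: 18                 # recommended for a full 1x72 NVLink domain
  authorized_cidr: 203.0.113.5/32       # example IP from the RFC 5737 documentation range
  reservation: my-reservation/reservationBlocks/my-block   # hypothetical reservation/block
  system_node_pool_disk_size_gb: 200
  a4x_max_node_pool_disk_size_gb: 100
```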

7. To modify advanced settings, edit the [examples/gke-a4x-max-bm/gke-a4x-max-bm.yaml](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a4x-max-bm/gke-a4x-max-bm.yaml) file.

8. [Generate Application Default Credentials (ADC)](https://docs.cloud.google.com/docs/authentication/provide-credentials-adc#google-idp) to provide access to Terraform. If you're using Cloud Shell, you can run the following command:

```bash
gcloud auth application-default login
```

9. Deploy the blueprint to provision the GKE infrastructure using A4X-Max machine types:

```bash
cd ~/cluster-toolkit
./gcluster deploy -d \
examples/gke-a4x-max-bm/gke-a4x-max-bm-deployment.yaml \
examples/gke-a4x-max-bm/gke-a4x-max-bm.yaml
```

10. When prompted, select **(A)pply** to deploy the blueprint.

* The blueprint creates VPC networks, a GPU RDMA VPC network, service accounts, a cluster, and a node pool.
* To support the fio-bench-job-template job template in the blueprint, Cloud Storage buckets, network storage, and persistent volume resources are also created.

### Run NCCL on GKE clusters

This section describes how to run [NCCL/gIB](https://docs.cloud.google.com/ai-hypercomputer/docs/nccl/overview) tests on GKE clusters:

1. Connect to your cluster:

```bash
gcloud container clusters get-credentials CLUSTER_NAME \
--location=COMPUTE_REGION
```

Replace the following variables:

* CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on the DEPLOYMENT_NAME.
* COMPUTE_REGION: the name of the compute region.
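For example, with the blueprint's default deployment name and a hypothetical region, the command can be assembled as follows. The cluster name below assumes the blueprint names the cluster after DEPLOYMENT_NAME; verify the actual name with `gcloud container clusters list`:

```shell
# Hypothetical values; "gke-a4x-max-bm" is the blueprint's default
# deployment name, and the region is an invented example.
CLUSTER_NAME="gke-a4x-max-bm"
COMPUTE_REGION="us-central1"
echo "gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${COMPUTE_REGION}"
```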

2. Deploy an all-gather NCCL performance test by using the [gke-a4x-max-bm/nccl-jobset-example.yaml](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-a4x-max-bm/nccl-jobset-example.yaml) file:
1. By default, the test uses two nodes. If you want to change the number of nodes, modify the YAML file to set the following values to your required number of nodes:
1. numNodes
2. parallelism
3. completions
4. N_NODES
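For example, to run the test on four nodes, each of those values would be set to 4. The fragment below is schematic only (the real manifest nests these fields at different levels; edit them in place in nccl-jobset-example.yaml):

```yaml
# Schematic fragment, not a complete manifest.
numNodes: 4        # total nodes used by the test
parallelism: 4     # pods to run in parallel
completions: 4     # pods that must complete
env:
  - name: N_NODES
    value: "4"     # mirrored into the test container environment
```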
2. Create the resources to run the test:

```bash
kubectl create -f ~/cluster-toolkit/examples/gke-a4x-max-bm/nccl-jobset-example.yaml
```

3. Confirm that all nccl-test Pods have reached the Completed state:

```bash
kubectl get pods
```

4. Find a Pod name matching the pattern nccl-all-worker-0-0-\*. The logs of this Pod contain the results of the NCCL test. To fetch the logs for this Pod, run the following command:

```bash
kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-all-worker-0-0)
```
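The command above selects one worker Pod by name. The filtering step can be sketched offline with hypothetical Pod names to show what the grep matches:

```shell
# Hypothetical pod listing, one pod name per line, in the format produced
# by the go-template above. Only the worker-0-0 pod is selected.
PODS='nccl-all-worker-0-0-abcde
nccl-all-worker-0-1-fghij
nccl-all-worker-1-0-klmno'
printf '%s\n' "$PODS" | grep 'nccl-all-worker-0-0'
```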

### Clean up resources created by Cluster Toolkit

To avoid recurring charges for the resources used, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and GKE cluster:

```bash
cd ~/cluster-toolkit
./gcluster destroy CLUSTER_NAME
```

Replace CLUSTER_NAME with the name of your cluster. For the clusters created with Cluster Toolkit, the cluster name is based on the DEPLOYMENT_NAME.
## examples/gke-a4x-max-bm/asapd-lite-installer.yaml
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: asapd-lite
namespace: kube-system
labels:
k8s-app: asapd-lite
spec:
selector:
matchLabels:
k8s-app: asapd-lite
template:
metadata:
labels:
k8s-app: asapd-lite
spec:
priorityClassName: system-node-critical
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values:
- a4x-maxgpu-4g-metal
- a4x-maxgpu-4g-metal-nolssd
- key: cloud.google.com/gke-os-distribution
operator: In
values:
- cos
tolerations:
- operator: Exists

# Use the host's network namespace. This is essential for managing the
# host's ConnectX physical functions (PFs).
hostNetwork: true

# Use the host's PID namespace. This can be useful for
# debugging and interacting with host processes.
hostPID: true
containers:
- name: asapd-lite
image: us-docker.pkg.dev/gce-ai-infra/asapd-lite/asapd-lite:v0.0.3

command: ["/bin/bash", "-c"]
args:
- |
set -x
/usr/local/bin/run_asapd_lite.sh

# Wait for a bit to let interfaces come up.
sleep 15

DADFAILED_FOUND=0
INTERFACES=$(ip -o link show | awk -F': ' '{print $2}' | grep 'gpu')

for iface in $INTERFACES; do
if ip -6 addr show dev "$iface" | grep -q 'dadfailed'; then
echo "Found dadfailed on interface $iface. Bringing it down and up."
ip link set dev "$iface" down
ip link set dev "$iface" up
DADFAILED_FOUND=1
fi
done

if [ "$DADFAILED_FOUND" -eq 1 ]; then
echo "Found and attempted to fix dadfailed on one or more interfaces. Exiting to allow container restart."
exit 1
fi

# Keep the container running
echo "No dadfailed detected on interfaces."
wait

env:
- name: LD_LIBRARY_PATH
value: /asapd-lite/controller
livenessProbe:
httpGet:
path: /healthz
port: 19540
initialDelaySeconds: 60
readinessProbe:
httpGet:
path: /healthz
port: 19540

securityContext:
# 'privileged: true' is required for modifying most host-level
# resources, including network devices and udev rules.
privileged: true

volumeMounts:
- name: host-sys
mountPath: /sys
- name: host-proc
mountPath: /proc
- name: host-udev-rules
mountPath: /etc/udev/rules.d
- name: host-run-udev
mountPath: /run/udev
- name: host-run-asapd-lite
mountPath: /run/asapd-lite
- mountPath: /hugepages-2Mi
name: hugepage-2mi
- name: var-log-google-asapd-lite
mountPath: /var/log/google/asapd-lite
- name: var-lib-google-asapd-lite
mountPath: /var/lib/google/asapd-lite
- name: host-systemd-system
mountPath: /etc/systemd/system
- name: host-run-systemd
mountPath: /run/systemd
resources:
limits:
hugepages-2Mi: 8192Mi
memory: 1024Mi

volumes:
- name: host-sys
hostPath:
path: /sys
- name: host-proc
hostPath:
path: /proc
- name: host-udev-rules
hostPath:
path: /etc/udev/rules.d
- name: host-run-udev
hostPath:
path: /run/udev
- name: host-run-asapd-lite
hostPath:
path: /run/asapd-lite
type: DirectoryOrCreate
- name: hugepage-2mi
emptyDir:
medium: HugePages-2Mi
- name: var-log-google-asapd-lite
hostPath:
path: /var/log/google/asapd-lite
type: DirectoryOrCreate
# Mount /var/lib/google/asapd-lite to provide a persistent location for
# the ipam binary on the host, allowing it to be executed from the host.
# This location was chosen based on this documentation
# https://docs.cloud.google.com/container-optimized-os/docs/concepts/disks-and-filesystem#working_with_the_file_system
# Hence, this may not necessarily be a suitable location for non-COS based
# systems.
- name: var-lib-google-asapd-lite
hostPath:
path: /var/lib/google/asapd-lite
type: DirectoryOrCreate
- name: host-systemd-system
hostPath:
path: /etc/systemd/system
- name: host-run-systemd
hostPath:
path: /run/systemd
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1%
## examples/gke-a4x-max-bm/gke-a4x-max-bm-deployment.yaml
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

terraform_backend_defaults:
type: gcs
configuration:
# The GCS bucket used for storing terraform state
bucket:

vars:
# Your GCP Project ID
project_id:

# This should be unique across all of your Cluster
# Toolkit Deployments.
deployment_name:

# The GCP Region used for this deployment.
region:

# The GCP Zone used for this deployment.
zone:

# The number of nodes to be created.
static_node_count:

# CIDR block containing the IP of the machine calling Terraform.
# To allow all (IAM restrictions still enforced), use 0.0.0.0/0.
# To allow only your IP address, use <YOUR-IP-ADDRESS>/32.
authorized_cidr:

# The name of the Compute Engine reservation, in the form
# <reservation-name>.
# To target a specific block, specify the extended reservation as
# <reservation-name>/reservationBlocks/<reservation-block-name>.
reservation:

# The boot disk size, in GB, for each node of the system node pool.
system_node_pool_disk_size_gb:

# The boot disk size, in GB, for each node of the A4X-Max node pool.
a4x_max_node_pool_disk_size_gb: