Releases: GoogleCloudPlatform/cluster-toolkit
v1.88.0
What's Changed
Key New Features 🎉
- feat: plumb optional auto monitoring scope by @jessicaochen in #5331
Breaking Changes 🚨
- Reapply "Modify the kubectl-apply manifest helm_release_naming (#5438)" by @agrawalkhushi18 in #5473
Module Improvements 🔨
- feat: Implement dynamic machine configurations via Compute Engine API by @SwarnaBharathiMantena in #5426
Improvements 🛠
- Integrate CHS to GKE and Slurm A3U and A4 Daily Tests by @simrankaurb in #5335
- Modify the kubectl-apply manifest helm_release_naming by @agrawalkhushi18 in #5438
- feat: Automatically add ghpc_creator label to expanded blueprint by @cboneti in #5468
- [Telemetry] Set up a base skeleton framework - Resubmit by @kadupoornima in #5475
- [Telemetry] Add Viper-based User config backed by Firestore DB by @kadupoornima in #5478
- [Telemetry] Add metric implementation to collect flags by @kadupoornima in #5486
- [Telemetry] Add region and zone metrics implementation by @kadupoornima in #5489
New Contributors
- @jessicaochen made their first contribution in #5331
Full Changelog: v1.87.0...v1.88.0
v1.87.0
What's Changed
Key New Features 🎉
- Adding kueue support for GKE A4X-Max by @vikramvs-gg in #5389
- Add Customer Managed Encryption Keys (CMEK) support in Managed Lustre by @parulbajaj01 in #5449
Breaking Changes 🚨
- Migrating install_asapd_lite module to helm by @agrawalkhushi18 in #5410
Module Improvements 🔨
- refactor: Fix pre-commit error in kubectl-apply by @jamOne- in #5427
- feat: Add resource-policy accelerator_topology_mode by @jamOne- in #5393
Improvements 🛠
- Update Filestore tier default by @saara-tyagi27 in #5379
- Revamp GKE A3 High blueprint and align integration tests by @shubpal07 in #5246
Deprecations 💤
- Marking parallelstore deprecated for gcluster warnings by @vikramvs-gg in #5325
Full Changelog: v1.86.0...v1.87.0
v1.86.0
What's Changed
Key New Features 🎉
- feat: Implement and configure GKE Image Streaming (GCFS) at the cluster level. by @raushan2016 in #5387
- Support vGPU (fractional GPU) for G4 GKE by @kadupoornima in #5399
- Support Customer-Managed Encryption Keys (CMEK) in Slurm GCP deployments by @saara-tyagi27 in #5407
Breaking Changes 🚨
- Enable JobSet and Nvidia Data Center monitoring by default by @SikaGrr in #5384
- Migrate kubectl_apply_manifest module to helm by @agrawalkhushi18 in #5282
Module Improvements 🔨
- feat: Add Enable GKE Slice Controller by @jamOne- in #5375
- Pathways cluster config by @FIoannides in #5370
Improvements 🛠
- Upgrade the DCGMI Version to 4.5.2 by @LAVEEN in #5408
- Feat: Automatically derive TPU node counts based on topology and machine type by @SwarnaBharathiMantena in #5386
- Upgrade Debian version in test runner image by @parulbajaj01 in #5400
Version Updates ⏫
- Update Slurm images to 6-12 by @AdarshK15 in #5273
- Bump slurm-gcp tag to 6.12.1 (Slurm 25.11.4) by @AdarshK15 in #5269
Bug fixes 🐞
- Update the TFLint Google ruleset version to 0.30.0 by @SwarnaBharathiMantena in #5401
- enable execution of external prolog/epilog for A4X by @Neelabh94 in #5403
- fix: Null iteration in kubectl-apply module by @sudheer-quad in #5430
New Contributors
- @SikaGrr made their first contribution in #5384
- @FIoannides made their first contribution in #5370
Full Changelog: v1.85.0...v1.86.0
v1.85.0
What's Changed
Key New Features 🎉
- feat(storage): Enable GCS zonal bucket capability with RAPID storage. by @Neelabh94 in #5353
- Support future reservation in name check validator by @saara-tyagi27 in #5252
Breaking Changes 🚨
- Update cloud_dns_config to default to KUBE_DNS (CoreDNS) by @SwarnaBharathiMantena in #5336
Improvements 🛠
- Add Managed Lustre integration in gke a4x-max by @parulbajaj01 in #5337
- Binary dependencies downloading script by @scaliby in #5354
- fix: use cleaned relative path instead of absolute path for local module hash by @mtibben in #5280
- feat: multi-arch build support and README updates by @kvenkatachala333 in #5388
Version Updates ⏫
- Pin shfmt and goimports version to resolve Go version conflict by @kadupoornima in #5365
Full Changelog: v1.84.0...v1.85.0
v1.84.0
What's Changed
Key New Features 🎉
- Validate disk type in zone by @saara-tyagi27 in #5232
Version Updates ⏫
- Update gke-versioning in gpu_direct.tf by @agrawalkhushi18 in #5284
Bug fixes 🐞
- Update nccl test script to fix enroot directory issue in A3H by @agrawalkhushi18 in #5324
Full Changelog: v1.83.0...v1.84.0
v1.83.0
What's Changed
Key New Features 🎉
- feat(validations): Add early conditional validation by @AdarshK15 in #5160
- A4x Max BM slurm support. by @arpit974 in #5222
- Adding GKE TPU DWS Queued Provisioning support for v6e and 7x by @shubpal07 in #5218
- feat(validations): Add early required validation by @AdarshK15 in #5166
- Module deprecation warning system by @vikramvs-gg in #5229
- A4X-Max Bare Metal GKE toolkit blueprint by @vikramvs-gg in #5211
Breaking Changes 🚨
- Update and pin terraform version to 1.12.2 by @parulbajaj01 in #5216
- Update wait flag and resolving helm_release deadlock destruction error by @agrawalkhushi18 in #5147
Module Improvements 🔨
- Migrate configure_kueue from gavinbunney to helm by @agrawalkhushi18 in #5129
- Migrate install_gib from kubectl to helm by @agrawalkhushi18 in #5256
Improvements 🛠
- Add reservation name check validator by @saara-tyagi27 in #5185
- Update go files to add timestamps to gcluster logs by @agrawalkhushi18 in #5198
- Pin Dcgm version 4.5.1-1 by @saara-tyagi27 in #5197
- Add support for DualStack (IPv4/IPv6) networks by @DomiKoPL in #5206
Bug fixes 🐞
- Update slurm_cluster_name regex by @saara-tyagi27 in #5261
- Fix SELinux issue in hpc-build-slurm-image blueprint by @AdarshK15 in #5266
- Hotfix: update G4 NVIDIA drivers for kernel 6.17 compatibility by @SwarnaBharathiMantena in #5289
- Hardcode zone in a2high PR test to fix test failures by @kadupoornima in #5305
- Modifying prefix_length for PSA to accommodate sufficient IPs for peering by @vikramvs-gg in #5306
- fix: Update a3m and a3u script to resolve slurm nccl test failure by @agrawalkhushi18 in #5308
Full Changelog: v1.82.0...v1.83.0
v1.82.0
What's Changed
Key New Features 🎉
- A4X JBVM by @LAVEEN in #4950
- Introduced a binary ZIP archive to the release assets by @kvenkatachala333 in #5208
Improvements 🛠
- Fix the babysit files limitation with pagination logic by @SwarnaBharathiMantena in #5191
- Adding A4X Base Support to JBVM by @LAVEEN in #4834
Version Updates ⏫
- Update SLURM blueprints to point to the latest slurm-gcp release by @Neelabh94 in #5215
New Contributors
- @spaturi13 made their first contribution in #5184
Full Changelog: v1.81.0...v1.82.0
v1.81.0
What's Changed
Key New Features 🎉
- Switch to using gcsfuse profile feature in aiml gcs-bucket mounts in slurm cluster blueprints by @gargnitingoogle in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5047
- DWS Flex start support in TPU 7x and v6e by @shubpal07 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5111
Improvements 🛠
- Improved validations enabling early enforcement of numeric boundaries and length constraints within metadata.yaml files across several core and community modules by @AdarshK15 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5115
- Update Dockerfile and README.md instructions for a3mega nemo framework by @mufaqam-gcl in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5164
- TPU v6e DWS flex integration tests by @shubpal07 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5135
- chore/allow hyphens in partition_name and slurm_cluster_name, increase max length to 20 for slurm_cluster_name by @rbekhtaoui in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/4316
New Contributors
- @gargnitingoogle made their first contribution in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5047
- @gokamesh made their first contribution in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5169
Full Changelog: https://github.com/GoogleCloudPlatform/cluster-toolkit/compare/v1.80.0...v1.81.0
v1.80.0
What's Changed
Module Improvements 🔨
- Compress the H4D blueprint with multivpc and vpc module update by @SwarnaBharathiMantena in #5133
Improvements 🛠
- Adding IPV6 & IDPF support by @LAVEEN in #5066
- R&R Slurm integration by @sarthakag in #5003
Full Changelog: v1.79.0...v1.80.0