
Enable kubernetes_node_scale benchmark (up to 5k nodes) on GCP GKE with CCC#6536

Merged
copybara-service[bot] merged 8 commits into GoogleCloudPlatform:master from kiryl-filatau:gcp-5k-ccc
Apr 21, 2026
Conversation

@vofish (Collaborator) commented Mar 12, 2026

Summary

Enables running the kubernetes_node_scale benchmark (scale up → scale down → scale up again, toward large node counts) on GCP GKE with Custom ComputeClass and node pool auto-creation (NAP) where configured. The benchmark applies a Deployment with pod anti-affinity (one pod per node), records scale-up, scale-down, and second scale-up timings, then tears down.

Main changes

  • kubernetes_node_scale.yaml.j2 - Deployment only (pause pods, anti-affinity). No ComputeClass in the manifest; the cluster path creates the ComputeClass (aligned with upstream PKB / custom compute class support).
  • GKE - GetNodeSelectors sets cloud.google.com/compute-class to the default node pool name when _UsesCustomComputeClass(default_nodepool) is true, so ModifyPodSpecPlacementYaml adds the selector without cloud-specific YAML in the benchmark template.
  • GKE flags - gke_autoscaling_profile (optimize-utilization/balanced) and gke_cluster_ipv4_cidr_size for large scale-outs; gke_autoscaling_profile is also stored in GetResourceMetadata for run-to-run comparison.
  • Machine families - Use existing --k8s_machine_families (and ContainerClusterSpec.machine_families) instead of a dedicated NAP machine-type flag; container_spec._ApplyFlags reads flag_values.k8s_machine_families so the flag applies correctly.
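
The GetNodeSelectors behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual PKB implementation; the function name and arguments are assumptions:

```python
def get_node_selectors(
    nodepool_name: str, uses_custom_compute_class: bool
) -> dict[str, str]:
    """Returns nodeSelector labels steering pods onto the custom compute class."""
    if uses_custom_compute_class:
        # GKE then schedules the pods onto node pools auto-created (NAP)
        # for this compute class, named after the default node pool.
        return {'cloud.google.com/compute-class': nodepool_name}
    return {}
```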

Usage notes

  • For the Custom ComputeClass + NAP path on GCP, pass something like --k8s_machine_families=e2.
  • Raise container_cluster.max_vm_count (or equivalent) so the cluster autoscaler can reach your target --kubernetes_scale_num_nodes.

)

NAP_MACHINE_TYPE = flags.DEFINE_string(
'kubernetes_node_scale_nap_machine_type',

Rather than this, use the flag K8S_MACHINE_FAMILIES created in:
#6559

(or set via config_overrides)

kind: ComputeClass
metadata:
name: app-ccc
spec:

Most of this should be removed, as it duplicates https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/6559/changes

If, e.g., storage and spot=false are quite important, we can make some revisions within that framework.

cmd.flags['min-nodes'] = self.min_nodes
cmd.flags['cluster-ipv4-cidr'] = f'/{_CalculateCidrSize(self.max_nodes)}'
if gcp_flags.GKE_AUTOSCALING_PROFILE.value:
cmd.flags['autoscaling-profile'] = gcp_flags.GKE_AUTOSCALING_PROFILE.value
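
The helper _CalculateCidrSize is not shown in this hunk. A plausible sketch, assuming GKE's default of one /24 pod range per node, so the cluster CIDR must hold at least max_nodes such blocks (the real helper may differ):

```python
import math

def calculate_cidr_size(max_nodes: int) -> int:
    """Hypothetical stand-in for _CalculateCidrSize.

    Assumes each node consumes one /24 pod range (GKE's default of up to
    110 pods per node), so the cluster CIDR mask must leave room for at
    least max_nodes /24 blocks.
    """
    blocks_bits = math.ceil(math.log2(max(max_nodes, 1)))
    return 24 - blocks_bits

# e.g. 5000 nodes need 2**13 = 8192 blocks of /24, hence a /11 cluster CIDR
```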

Let's also get this value into ResourceMetadata, since it sounds like something we might change over multiple runs (e.g., the only difference between two runs could be the autoscaling-profile, and we'd like to compare them; ResourceMetadata lets us distinguish the two from just the results).

# Timeout for "kubectl delete all --all" during teardown. Increase for
# large-scale runs (e.g. 5000+ pods) to avoid benchmark failure.
flags.DEFINE_integer(
'kubernetes_teardown_delete_timeout',

Just looking at this PR, is this flag actually used? Seems reasonable enough.

- app
topologyKey: "kubernetes.io/hostname"
{% if cloud == 'GCP' %}
nodeSelector:

Is this always the case? Prefer to pass in whether we're using custom compute classes or not.

@vofish vofish requested a review from hubatish April 8, 2026 16:29
cluster = bm_spec.container_cluster
manifest_kwargs: dict[str, Any] = {'cloud': FLAGS.cloud}
if cluster.default_nodepool.machine_families:
manifest_kwargs['gcp_compute_class_name'] = cluster.default_nodepool.name

Can this go in ModifyPodSpecPlacementYaml instead? I believe that code already deals with nodeSelector, and it would keep any cloud-specific reference (e.g., cloud.google.com/compute-class) out of the YAML file.
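
The nodeSelector merge the reviewer suggests could look roughly like this, operating on a parsed pod spec. This is a sketch under assumed names, not PKB's actual ModifyPodSpecPlacementYaml:

```python
def modify_pod_spec_placement(
    pod_spec: dict, node_selectors: dict[str, str]
) -> dict:
    """Merges cloud-specific node selectors into a parsed pod spec dict.

    Keeps the benchmark YAML template cloud-agnostic: labels such as
    cloud.google.com/compute-class are injected here rather than
    hard-coded in the Jinja template.
    """
    if node_selectors:
        pod_spec.setdefault('nodeSelector', {}).update(node_selectors)
    return pod_spec
```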

@vofish vofish requested a review from hubatish April 8, 2026 17:14
@hubatish (Collaborator) left a comment

Cool, I think that's all my comments addressed! Can you resolve them as well? Approving.

@hubatish (Collaborator) commented Apr 8, 2026

Summary

Enables running the kubernetes_node_scale benchmark (0→5k→0→5k nodes) on GCP GKE with Custom ComputeClass. The benchmark scales a deployment with pod anti-affinity, measures scale-up, scale-down, and a second scale-up, then tears down the cluster.

Main changes

  • kubernetes_node_scale.yaml.j2 - added ComputeClass and nodeSelector sections
  • GKE - added flags for autoscaling profile and ipv4 cidr size
  • Add --kubernetes_node_scale_nap_machine_type (default e2-medium) and render ComputeClass machineType from the Jinja template via nap_machine_type, with a template default for callers that omit the variable.

This description should also be updated

copybara-service Bot pushed a commit that referenced this pull request Apr 8, 2026
Specifically some changes in this flags.py file ran into internal vs external merge conflicts, so manually splitting this into a separate PR. Flags are therefore not used in this PR but will be used in a follow up.

PiperOrigin-RevId: 896710171
copybara-service Bot pushed a commit that referenced this pull request Apr 8, 2026
Specifically some changes in this flags.py file ran into internal vs external merge conflicts, so manually splitting this into a separate PR. Flags are therefore not used in this PR but will be used in a follow up.

PiperOrigin-RevId: 896733417
@copybara-service copybara-service Bot merged commit 0ffe5f5 into GoogleCloudPlatform:master Apr 21, 2026
4 checks passed