Enable kubernetes_node_scale benchmark (up to 5k nodes) on GCP GKE with CCC#6536
Conversation
```python
NAP_MACHINE_TYPE = flags.DEFINE_string(
    'kubernetes_node_scale_nap_machine_type',
```
Rather than this, use the flag K8S_MACHINE_FAMILIES created in #6559 (or set via config_overrides).
```yaml
kind: ComputeClass
metadata:
  name: app-ccc
spec:
```
Most of this should be removed; it duplicates https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/6559/changes.
If, e.g., storage and spot=false are quite important, we can make some revisions within that framework.
```python
cmd.flags['min-nodes'] = self.min_nodes
cmd.flags['cluster-ipv4-cidr'] = f'/{_CalculateCidrSize(self.max_nodes)}'
if gcp_flags.GKE_AUTOSCALING_PROFILE.value:
  cmd.flags['autoscaling-profile'] = gcp_flags.GKE_AUTOSCALING_PROFILE.value
```
Let's also get this value into ResourceMetadata, since it sounds like something we might change over multiple runs (e.g. the only difference between two runs could be the autoscaling-profile, and we'd like to compare them; ResourceMetadata lets us distinguish the two from just the results).
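The suggestion above could be sketched as follows. This is a minimal standalone illustration, not the PR's actual code: the helper name `AddAutoscalingProfileMetadata` and the metadata key are hypothetical, mirroring what a `GetResourceMetadata()` override would record.

```python
def AddAutoscalingProfileMetadata(metadata, autoscaling_profile):
    """Record the GKE autoscaling profile in resource metadata.

    Hypothetical helper: stores the profile so two runs that differ only
    in --gke_autoscaling_profile can be told apart from results alone.
    """
    if autoscaling_profile:
        metadata['gke_autoscaling_profile'] = autoscaling_profile
    return metadata
```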
```python
# Timeout for "kubectl delete all --all" during teardown. Increase for
# large-scale runs (e.g. 5000+ pods) to avoid benchmark failure.
flags.DEFINE_integer(
    'kubernetes_teardown_delete_timeout',
```
Just looking at this PR, is this flag actually used? Seems reasonable enough.
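For reference, the flag's value would presumably feed into the teardown command along these lines. A sketch only: the helper name and exact kubectl invocation are assumptions, though `kubectl delete --timeout` is a real flag.

```python
def BuildTeardownDeleteCmd(timeout_seconds):
    """Build the "kubectl delete all --all" teardown command.

    Hypothetical helper: --timeout bounds how long kubectl waits for
    deletion to complete, which matters for 5000+ pod runs.
    """
    return [
        'kubectl', 'delete', 'all', '--all',
        f'--timeout={timeout_seconds}s',
    ]
```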
```yaml
- app
topologyKey: "kubernetes.io/hostname"
{% if cloud == 'GCP' %}
nodeSelector:
```
Is this always the case? Prefer to pass in whether we're using custom compute classes or not.
```python
cluster = bm_spec.container_cluster
manifest_kwargs: dict[str, Any] = {'cloud': FLAGS.cloud}
if cluster.default_nodepool.machine_families:
  manifest_kwargs['gcp_compute_class_name'] = cluster.default_nodepool.name
```
Can this go in ModifyPodSpecPlacementYaml instead? I believe that code already deals with nodeSelector, and it would keep any cloud-specific reference (e.g. cloud.google.com/compute-class) out of the YAML file.
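A minimal sketch of what moving this into ModifyPodSpecPlacementYaml might look like. The helper name is hypothetical; `cloud.google.com/compute-class` is the nodeSelector key GKE uses for custom compute classes, as referenced in the comment above.

```python
def AddComputeClassNodeSelector(pod_spec, compute_class_name):
    """Inject the GKE custom compute class nodeSelector into a pod spec.

    Doing this in Python (rather than the Jinja template) keeps
    cloud-specific keys like cloud.google.com/compute-class out of the
    shared benchmark YAML.
    """
    selector = pod_spec.setdefault('nodeSelector', {})
    selector['cloud.google.com/compute-class'] = compute_class_name
    return pod_spec
```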
hubatish left a comment:
Cool, I think that's all my comments addressed! Can you resolve them as well? Approving.
This description should also be updated.
Specifically some changes in this flags.py file ran into internal vs external merge conflicts, so manually splitting this into a separate PR. Flags are therefore not used in this PR but will be used in a follow up. PiperOrigin-RevId: 896710171
Specifically some changes in this flags.py file ran into internal vs external merge conflicts, so manually splitting this into a separate PR. Flags are therefore not used in this PR but will be used in a follow up. PiperOrigin-RevId: 896733417
Merged commit 0ffe5f5 into GoogleCloudPlatform:master
Summary
Enables running the kubernetes_node_scale benchmark (scale up → scale down → scale up again, e.g., toward large node counts) on GCP GKE with Custom ComputeClass and node pool auto-creation (NAP) where configured. The benchmark applies a Deployment with pod anti-affinity (one pod per node), records scale-up / scale-down / second scale-up timing, then tears down.
Main changes
- When `_UsesCustomComputeClass(default_nodepool)` is true, `ModifyPodSpecPlacementYaml` adds the selector, keeping cloud-specific YAML out of the benchmark template.
- Adds `gke_autoscaling_profile` (optimize-utilization/balanced) and `gke_cluster_ipv4_cidr_size` for large scale-outs; `gke_autoscaling_profile` is also stored in `GetResourceMetadata` for run-to-run comparison.
- Uses `--k8s_machine_families` (and `ContainerClusterSpec.machine_families`) instead of a dedicated NAP machine-type flag; `container_spec._ApplyFlags` reads `flag_values.k8s_machine_families` so the flag applies correctly.

Usage notes
- Set `--k8s_machine_families=e2`.
- Set `container_cluster.max_vm_count` (or equivalent) so the cluster autoscaler can reach your target `--kubernetes_scale_num_nodes`.