[None][test] Update K2.5 andGLM-5 into CI Perf Test #14960

Open
chenfeiz0326 wants to merge 5 commits into NVIDIA:main from chenfeiz0326:chenfeiz/add-glm-5-agg-disagg

@chenfeiz0326
Collaborator

@chenfeiz0326 chenfeiz0326 commented Jun 4, 2026

Summary by CodeRabbit

  • Tests

    • Added performance sanity test coverage for GLM-5 FP4 model across GB200 and GB300 hardware configurations.
    • Expanded multi-node and multi-GPU test scenarios with new test cases for various parallelism and batch size combinations.
    • Added new benchmark configurations supporting disaggregated and aggregated testing modes.
  • Chores

    • Updated test pipeline triggering conditions and test count adjustments for multi-GPU and multi-node performance validation stages.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented Jun 4, 2026

📝 Walkthrough

This PR extends TensorRT-LLM's multi-GPU performance sanity testing infrastructure by introducing GLM-5-fp4 model support across B200, GB200, and GB300 hardware platforms. It updates Jenkins pipeline orchestration logic, integration test definitions, and provides comprehensive benchmark reference configurations for both aggregated and disaggregated testing modes.

Changes

GLM-5-fp4 Multi-GPU Performance Testing

| Layer / File(s) | Summary |
|---|---|
| **Jenkins pipeline test orchestration and stage matrix updates**<br>`jenkins/L0_MergeRequest.groovy`, `jenkins/L0_Test.groovy` | File-change detection patterns updated to include GB300 gpu2-variant test configs instead of gpu4 variants; testCount parameters increased across GB200 multi-node disaggregated stages (7→9, 5→7, 8→14, 11→15), decreased for GB300 2-node (4→2), and new GLM-5 DEP2 configurations added for GB300 2/3/9-node cases. |
| **GB200 and B200 integration test lists with GLM-5-fp4 entries**<br>`tests/integration/test_lists/test-db/l0_b200_multi_gpus_perf_sanity.yml`, `tests/integration/test_lists/test-db/l0_gb200_multi_gpus_perf_sanity.yml`, `tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_*` | Adds GLM-5-fp4 test cases to B200 and GB200 multi-GPU/multi-node test matrices, including post-merge aggregated upload variants and context-only test conditions, each with appropriate timeout settings (90–120 seconds). |
| **GB300 integration test lists with GPU2-based configurations**<br>`tests/integration/test_lists/test-db/l0_gb300_multi_gpus_perf_sanity.yml`, `tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu2_gen1_*`, `tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu4_gen1_node8_gpu32.yml` | Replaces deepseek-v32-fp4 with GLM-5-fp4 entries in multi-GPU tests; introduces three new GPU2-based 2-node, 3-node, and 9-node disaggregated configurations for GB300; removes obsolete GPU4 9-node variant. |
| **Aggregated benchmark configurations for GLM-5-fp4**<br>`tests/scripts/perf-sanity/aggregated/glm5_fp4_*.yaml` | Defines reference aggregated (non-disaggregated serving, where the same workers handle both context and generation) benchmark configurations for GLM-5-fp4 on B200 (8 GPUs/node) and GB200 (4 GPUs/node), each with TEP/DEP parallelism variants, TRTLLM or CUTLASS attention backends, fp8 KV cache, MTP speculative decoding, and openai client configuration templates. |
| **GB200 disaggregated benchmark configuration tuning**<br>`tests/scripts/perf-sanity/disaggregated/gb200_glm-5-fp4_*.yaml` | Adjusts token generation limits (128→64 tokens), reduces parallelism degrees (tensor parallel 8→4, moe expert parallel 8→4), lowers GPU memory headroom (0.9→0.85 fraction), and adds load_balancer configuration (num_slots 256, layer updates per iteration) to existing GB200 disaggregated benchmark configs. |
| **GB300 disaggregated benchmark configurations (new)**<br>`tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_*.yaml` | Introduces ten new disaggregated benchmark configurations covering 1k1k and 8k1k input/output sequences with varying tensor/moe parallelism, context-generation server splits, and concurrency profiles; each config defines SLURM job templates, worker batch/token limits, KV cache settings (fp4/fp8 dtypes), attention DP toggles, NIXL cache transceiver, and shared speculative decoding configuration via YAML anchors. |
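The disaggregated config file names in this PR appear to encode the benchmark shape. The following is a minimal sketch of how such a name could be split into fields. The regular expression, the function `parse_config_name`, and the field names (`concurrency`, `ctx_parallel`, `eplb_slots`, and so on) are assumptions inferred from the file names listed here, not part of the TensorRT-LLM test harness.

```python
import re
from typing import Dict

# Hypothetical pattern inferred from names such as
# gb300_glm-5-fp4_1k1k_con1_ctx1_dep2_gen1_tep4_eplb0_mtp3_ccb-NIXL.yaml.
# The meaning assigned to each field is an assumption, not a documented spec.
CONFIG_NAME = re.compile(
    r"^(?P<gpu>gb\d+)_(?P<model>.+?)_(?P<seq>\d+k\d+k)"
    r"_con(?P<concurrency>\d+)"
    r"_ctx(?P<ctx_servers>\d+)_(?P<ctx_parallel>[a-z]+\d+)"
    r"_gen(?P<gen_servers>\d+)_(?P<gen_parallel>[a-z]+\d+)"
    r"_eplb(?P<eplb_slots>\d+)_mtp(?P<mtp>\d+)"
    r"_ccb-(?P<cache_backend>\w+)\.yaml$"
)


def parse_config_name(name: str) -> Dict[str, str]:
    """Split a disaggregated config file name into its named parts."""
    match = CONFIG_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized config name: {name}")
    return match.groupdict()


if __name__ == "__main__":
    fields = parse_config_name(
        "gb300_glm-5-fp4_1k1k_con1_ctx1_dep2_gen1_tep4_eplb0_mtp3_ccb-NIXL.yaml"
    )
    for key, value in fields.items():
        print(f"{key}: {value}")
```

A parser like this could help cross-check that a file name agrees with the values inside the YAML, for example that a name ending in `eplb256` matches a load balancer configured with 256 slots.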

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#10912: Introduces the buildStageConfigs(...) function used to define stage-matrix entries in L0_Test.groovy, directly related to the stage orchestration updates in this PR.

Suggested reviewers

  • yufeiwu-nv
  • ruodil
  • LarryXFly
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description contains only template placeholders with no substantive content filled in. | Provide a clear description of the changes, explain why the GLM-5 model tests are being added, list the relevant test coverage, and complete the PR checklist items as applicable. |
| Title check | ❓ Inconclusive | The title partially relates to the changeset by mentioning GLM-5, which is a model added throughout the PR, but it is unclear and contains inconsistencies (e.g., 'K2.5' is mentioned but not substantive in the changes; 'andGLM-5' appears to be a typo; '[None][test]' is vague). The title does not clearly convey the main focus of updating test configurations and performance sanity benchmarks. | Clarify the title to explicitly describe the primary changes, such as 'Add GLM-5 model to multi-GPU and multi-node perf sanity tests' or 'Update perf sanity test configs for GLM-5 on GB200/GB300'. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (4)
tests/scripts/perf-sanity/aggregated/glm5_fp4_grace_blackwell.yaml (1)

1-75: Coverage status: sufficient in-PR; one follow-up is outside this layer.

For tests/scripts/perf-sanity/aggregated/glm5_fp4_grace_blackwell.yaml, coverage is sufficient for GB200 aggregated 1k1k TEP/DEP variants.
Follow-up outside this PR layer (if not already handled in other stacked files): confirm CI selection references include all three new aggregated configs:

  • tests/scripts/perf-sanity/aggregated/glm5_fp4_2_nodes_grace_blackwell.yaml
  • tests/scripts/perf-sanity/aggregated/glm5_fp4_blackwell.yaml
  • tests/scripts/perf-sanity/aggregated/glm5_fp4_grace_blackwell.yaml
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/scripts/perf-sanity/aggregated/glm5_fp4_grace_blackwell.yaml` around
lines 1 - 75, The CI selection references need to include all three new
aggregated config files so tests run for each variant; update whatever
CI/selection list references (e.g., in the perf-sanity CI matrix or selection
files) to add
"tests/scripts/perf-sanity/aggregated/glm5_fp4_grace_blackwell.yaml",
"tests/scripts/perf-sanity/aggregated/glm5_fp4_blackwell.yaml", and
"tests/scripts/perf-sanity/aggregated/glm5_fp4_2_nodes_grace_blackwell.yaml" so
the new GB200 aggregated 1k1k TEP/DEP variants are selected by CI. Ensure any
selection logic that filters by the directory
tests/scripts/perf-sanity/aggregated or by model_name "glm_5_nvfp4" also
accounts for these three files.
tests/scripts/perf-sanity/aggregated/glm5_fp4_2_nodes_grace_blackwell.yaml (1)

1-75: Coverage status: sufficient for this file’s aggregated scope.

For tests/scripts/perf-sanity/aggregated/glm5_fp4_2_nodes_grace_blackwell.yaml, coverage is sufficient in this PR for GB200 aggregated perf sanity because both TEP (glm5_fp4_tep8_mtp3_8k1k) and DEP (glm5_fp4_dep8_mtp1_8k1k) variants are included with matching client workloads.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/scripts/perf-sanity/aggregated/glm5_fp4_2_nodes_grace_blackwell.yaml`
around lines 1 - 75, The YAML already includes both TEP and DEP aggregated
configs, but update/verify that metadata.model_name ("glm_5_nvfp4") matches each
server_config.model_name and that the two server_configs named
"glm5_fp4_tep8_mtp3_8k1k" and "glm5_fp4_dep8_mtp1_8k1k" remain present; also
replace the placeholder dataset_file in each client_configs entry with the
actual dataset path (or a CI-provided variable) so the perf-sanity jobs can run
end-to-end.
tests/scripts/perf-sanity/aggregated/glm5_fp4_blackwell.yaml (1)

1-75: Coverage status: sufficient for this file’s aggregated scope.

For tests/scripts/perf-sanity/aggregated/glm5_fp4_blackwell.yaml, coverage is sufficient in this PR for B200 aggregated perf sanity via both TEP and DEP 8k1k configurations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/scripts/perf-sanity/aggregated/glm5_fp4_blackwell.yaml` around lines 1
- 75, Coverage for glm5_fp4_blackwell.yaml is already sufficient so no
structural changes are required; however ensure the client_configs dataset_file
placeholder is wired to the test runner by replacing the literal
"<dataset_file>" with the CI/test variable your harness expects (e.g.,
${DATASET_FILE}) so the two server configs named "glm5_fp4_tep8_mtp3_8k1k" and
"glm5_fp4_dep8_mtp1_8k1k" (and keys metadata.model_name and supported_gpus) run
with a real dataset path during execution.
tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con1_ctx1_dep2_gen1_tep4_eplb0_mtp3_ccb-NIXL.yaml (1)

1-94: QA coverage status: sufficient for config-definition scope; execution evidence needs follow-up outside this PR.

Coverage is sufficient across the new GB300 disaggregated config set for this cohort:

  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con1_ctx1_dep2_gen1_tep4_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con4096_ctx1_dep2_gen1_dep8_eplb256_mtp1_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con512_ctx1_dep2_gen1_dep32_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_8k1k_con1024_ctx1_dep2_gen1_dep8_eplb256_mtp1_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_8k1k_con1_ctx1_dep2_gen1_tep8_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_8k1k_con512_ctx1_dep2_gen1_dep32_eplb0_mtp3_ccb-NIXL.yaml

Actionable follow-up outside this PR: capture CI artifact evidence that placeholder fields (<partition>, <account>, <dataset_file>, <model_path>) are fully resolved at runtime for each listed file.
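A preflight check along the lines of this follow-up could be sketched as below. This is only an illustration under stated assumptions: the function names `find_unresolved_placeholders` and `preflight`, and the sample YAML layout in `SAMPLE`, are hypothetical and do not come from the perf-sanity harness. Only the four placeholder names are taken from the comment above.

```python
import re
from typing import List, Tuple

# The four placeholder tokens named in the review comment.
PLACEHOLDER = re.compile(r"<(partition|account|dataset_file|model_path)>")


def find_unresolved_placeholders(text: str) -> List[Tuple[int, str]]:
    """Return (line number, placeholder) pairs still present in the text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits


def preflight(text: str) -> None:
    """Fail early with a clear error if any placeholder is unresolved."""
    hits = find_unresolved_placeholders(text)
    if hits:
        details = ", ".join(f"line {n}: {p}" for n, p in hits)
        raise ValueError(f"unresolved placeholders ({details})")


# Hypothetical config fragment; the real YAML layout may differ.
SAMPLE = """\
slurm:
  partition: <partition>
  account: my_account
benchmark:
  dataset_file: <dataset_file>
"""

if __name__ == "__main__":
    print(find_unresolved_placeholders(SAMPLE))
```

Scanning the raw text rather than parsing YAML keeps the sketch free of third-party dependencies; a real harness would more likely validate the loaded config after substitution.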

As per coding guidelines, "Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con1_ctx1_dep2_gen1_tep4_eplb0_mtp3_ccb-NIXL.yaml`
around lines 1 - 94, The YAML contains unresolved placeholders (<partition>,
<account>, <dataset_file>, <model_path>) that must be validated before job
submission; update the test harness or the config generation step to replace
those placeholders for the files (e.g.,
tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con1_ctx1_dep2_gen1_tep4_eplb0_mtp3_ccb-NIXL.yaml
and the other listed YAMLs) and add a preflight check that reads keys partition,
account, dataset_file, model_path and fails early with a clear error if any
still match the placeholder pattern; alternatively wire them to concrete CI
variables or templating logic so create_job()/load_config() (or whatever config
loader function you use) performs substitution and validation.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8f8c1c3a-d00e-4a02-983b-8504f53ec70c

📥 Commits

Reviewing files that changed from the base of the PR and between c17611c and eabbbc1.

📒 Files selected for processing (28)
  • jenkins/L0_MergeRequest.groovy
  • jenkins/L0_Test.groovy
  • tests/integration/test_lists/test-db/l0_b200_multi_gpus_perf_sanity.yml
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus_perf_sanity.yml
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu4_gen1_node1_gpu4.yml
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu4_gen1_node2_gpu8.yml
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu4_gen1_node8_gpu32.yml
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_node2_gpu8.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_gpus_perf_sanity.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu2_gen1_node1_gpu4.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu2_gen1_node2_gpu8.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu2_gen1_node8_gpu32.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu4_gen1_node1_gpu4.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu4_gen1_node8_gpu32.yml
  • tests/scripts/perf-sanity/aggregated/glm5_fp4_2_nodes_grace_blackwell.yaml
  • tests/scripts/perf-sanity/aggregated/glm5_fp4_blackwell.yaml
  • tests/scripts/perf-sanity/aggregated/glm5_fp4_grace_blackwell.yaml
  • tests/scripts/perf-sanity/disaggregated/gb200_glm-5-fp4_1k1k_con1_ctx1_dep4_gen1_tep4_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb200_glm-5-fp4_1k1k_con4096_ctx1_dep4_gen1_dep8_eplb256_mtp1_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb200_glm-5-fp4_1k1k_con512_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb200_glm-5-fp4_8k1k_con1024_ctx1_dep4_gen1_dep8_eplb256_mtp1_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb200_glm-5-fp4_8k1k_con512_ctx1_dep4_gen1_dep32_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con1_ctx1_dep2_gen1_tep4_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con4096_ctx1_dep2_gen1_dep8_eplb256_mtp1_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_1k1k_con512_ctx1_dep2_gen1_dep32_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_8k1k_con1024_ctx1_dep2_gen1_dep8_eplb256_mtp1_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_8k1k_con1_ctx1_dep2_gen1_tep8_eplb0_mtp3_ccb-NIXL.yaml
  • tests/scripts/perf-sanity/disaggregated/gb300_glm-5-fp4_8k1k_con512_ctx1_dep2_gen1_dep32_eplb0_mtp3_ccb-NIXL.yaml
💤 Files with no reviewable changes (2)
  • tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu4_gen1_node1_gpu4.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_nodes_perf_sanity_ctx1_node1_gpu4_gen1_node8_gpu32.yml

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326 chenfeiz0326 requested a review from a team as a code owner June 5, 2026 02:49
@chenfeiz0326
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-5,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-6,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-5,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-6,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-9,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-10,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-11,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-12,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-13,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-1,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB20
0-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-3,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-4,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-5,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-6,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-7,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-9,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-10,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-11,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-12,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-13,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE1-GPU4-Post-Merge-1,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-1,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-2,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-3,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-4,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-5,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE8-GPU32-Post-Merge-1,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE8-GPU32-Post-Merge-2,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-2,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-3,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-4,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merg
e-5,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-6,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-7,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-5,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-6,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-7,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-8,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-9"
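
The stage list in the command above follows a regular pattern: a stage-name prefix followed by a shard index starting at 1. As a rough sketch, a list like this could be generated instead of typed by hand. The helper names `build_stage_list` and `build_bot_command` are hypothetical; the prefixes and shard counts are copied from the command above, and the flags are only the ones that appear in it.

```python
from typing import List, Tuple

# (stage-name prefix, number of shards), as they appear in the comment above.
STAGE_GROUPS: List[Tuple[str, int]] = [
    ("GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge", 6),
    ("GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge", 13),
    ("GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge", 13),
    ("GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE1-GPU4-Post-Merge", 1),
    ("GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge", 5),
    ("GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE8-GPU32-Post-Merge", 2),
    ("GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge", 7),
    ("GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge", 9),
]


def build_stage_list(groups: List[Tuple[str, int]]) -> str:
    """Expand each (prefix, count) pair into prefix-1 .. prefix-count."""
    return ",".join(
        f"{prefix}-{index}"
        for prefix, count in groups
        for index in range(1, count + 1)
    )


def build_bot_command(groups: List[Tuple[str, int]]) -> str:
    """Assemble the bot comment using the flags seen in this thread."""
    return f'/bot run --disable-fail-fast --stage-list "{build_stage_list(groups)}"'


if __name__ == "__main__":
    print(build_bot_command(STAGE_GROUPS))
```

Keeping the shard counts in one table makes it easier to spot when they drift from the testCount values configured in the Jenkins stage matrix.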

@tensorrt-cicd
Collaborator

PR_Github #52219 [ run ] triggered by Bot. Commit: d2afbdb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52292 [ run ] triggered by Bot. Commit: d2afbdb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52219 [ run ] completed with state ABORTED. Commit: d2afbdb

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52292 [ run ] completed with state FAILURE. Commit: d2afbdb
/LLM/main/L0_MergeRequest_PR pipeline #41601 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chenfeiz0326
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-5,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-6,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-5,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-6,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-9,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-10,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-11,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-12,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-13,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-1,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB20
0-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-3,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-4,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-5,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-6,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-7,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-9,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-10,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-11,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-12,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-13,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE1-GPU4-Post-Merge-1,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-1,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-2,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-3,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-4,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE2-GPU8-Post-Merge-5,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE8-GPU32-Post-Merge-1,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU2-GEN1-NODE8-GPU32-Post-Merge-2,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-2,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-3,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-4,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merg
e-5,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-6,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-7,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-5,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-6,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-7,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-8,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-9"

@tensorrt-cicd
Collaborator

PR_Github #52308 [ run ] triggered by Bot. Commit: d2afbdb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52308 [ run ] completed with state FAILURE. Commit: d2afbdb
/LLM/main/L0_MergeRequest_PR pipeline #41614 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326 chenfeiz0326 changed the title from "[None][test] Add GLM-5 into CI Perf Test" to "[None][test] Update K2.5 andGLM-5 into CI Perf Test" Jun 5, 2026
