chore: add Dynamo recipe bring-up skills#9782
Conversation
95556a6 to
4121e92
Compare
7b6e86b to
89932ce
Compare
There was a problem hiding this comment.
Review by Claude, an agent acting on Ben Hamm's behalf — focused purely on the skill/code substance below.
Nice work, @athreesh — the three skills are well-shaped (operational, progressive-disclosure, read-only-first, and careful never to print HF tokens). Substantive feedback, roughly in priority order:
Correctness bugs
-
recipe_tool.py—repo_root()hard-codes the skill layout as a root marker. It only recognizes the repo root when bothrecipes/and.agents/skills/exist (line 34). That couples the recipe tool to where the skills happen to live today — move/rename/symlink that directory and recipe discovery silently breaks with "Could not find Dynamo repo root." Detect on something stable instead, e.g.recipes/+.git/. -
recipe_tool.py—gpu_count_hint()double-counts GPUs. It sums everynvidia.com/gpu:match in the manifest (lines 46–50, 219–224), but a container normally carries the same count under bothresources.requestsandresources.limits. So a 1-GPU worker reports 2, an 8-GPU node reports 16. Either de-dupe per container or only readlimits. -
collect_dynamo_debug_bundle.py— missesinitContainers.container_names()reads onlyspec.containers(line 86). For Dynamo, model-download / cache-warm steps frequently run as init containers, and those failures are one of the most common things this skill exists to debug — their logs won't be in the bundle. -
collect_dynamo_debug_bundle.py— no--previouslogs. A CrashLoopBackOff pod usually has empty current logs; the actual stack trace is inkubectl logs --previous. For a troubleshooting bundle, grab both.
Portability
-
Hard-coded
.agents/skills/...script paths in the SKILL.md bodies. Each skill invokes its scripts by full repo path (e.g.python3 .agents/skills/dynamo-recipe-runner/scripts/recipe_tool.py). Reference them relative to the skill directory (scripts/recipe_tool.py) so they survive being installed/relocated independently of the repo tree. -
Cross-skill calls assume co-location.
dynamo-recipe-runner's smoke test shells out todynamo-router-starter'scheck_router_health.pyby path, and both point users atdynamo-troubleshoot. Since these can be installed individually, one skill can't assume another's files are on disk. Make it a soft instruction ("ifdynamo-router-starteris installed, use its health check") or inline a tiny check per skill.
Triggering / smaller items
-
Description overlap across the three skills. All three mention "smoke test / endpoint health," which invites the agent to pick the wrong one. Tighten each
descriptionso the boundaries are unambiguous (recipe selection+deploy vs. router mode vs. failure diagnosis). -
Minor:
dynamo-troubleshoot's default--outdiris timestamped (/tmp/dynamo-debug-<ts>) but the SKILL.md example passes a fixed/tmp/dynamo-debug— align them so copy-pasted commands and follow-up references match. Andrecipe_tool.py list --frameworkonly accepts the known set, so recipes discovered withframework="unknown"(unrecognized dir) can't be listed by filter — worth a fallback.
Keep as-is: the disciplined "don't claim throughput from a single chat request" caution in dynamo-router-starter, the read-only debug flow, and the <500-line SKILL.md + references/ split.
Happy to push fixes for 1–6 as a follow-up commit if that's useful.
cd4a023 to
d35f331
Compare
WalkthroughThis PR introduces a comprehensive Dynamo operational skills suite comprising three new skills (recipe runner, router starter, troubleshooting) with supporting documentation, CLI tools, and infrastructure. It updates skill discovery references and CI configuration while adding licensing headers across the skill ecosystem. ChangesDynamo Operational Skills Suite
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/digest/agentic-inference/agentic-inference.md`:
- Line 54: Update the markdown reference to the local skill path
"tool-call-parser-generator/SKILL.md" to also include its public GitHub URL so
external readers can follow the same link; edit the text that currently points
to "../../../skills/tool-parser-generator/SKILL.md" and append or replace with
the corresponding GitHub repository file URL for the tool-call-parser-generator
SKILL (the public GitHub counterpart of SKILL.md) so both local and external
links are available.
In `@docs/README.md`:
- Line 57: Update the markdown table row that currently links to the local skill
file (the entry containing "../skills/dynamo-docs/SKILL.md") to also include the
corresponding GitHub URL for that file so external readers can follow the
reference; keep the existing relative link and add the GitHub link (pointing to
the repo path for skills/dynamo-docs/SKILL.md) next to it in the same row, e.g.,
provide both links separated clearly (parentheses or "—") while preserving the
description text "Add, update, move, or remove a docs page".
In `@skills/dynamo-router-starter/references/router-modes.md`:
- Line 32: The table row for "Set queue policy" currently contains unescaped
pipe characters in the value `DYN_ROUTER_QUEUE_POLICY=fcfs|wspt|lcfs` which
breaks the Markdown table; update the cell so pipes are escaped or the entire
value is wrapped as inline code/HTML code element (e.g., escape each '|' or wrap
the value in a code tag) for the `DYN_ROUTER_QUEUE_POLICY` entry in
router-modes.md so the row remains a two-column table.
In `@skills/dynamo-router-starter/scripts/check_router_health.py`:
- Around line 65-72: Validate and restrict args.base_url to only http or https
schemes before using it to build URLs for request_json; parse args.base_url (the
base_url variable) with urllib.parse.urlparse, ensure parsed.scheme is exactly
"http" or "https", and if not, exit or raise a clear error before any call to
request_json (which ultimately uses urllib.request.urlopen); update the argument
handling where base_url = args.base_url.rstrip("/") to perform this check and
reject or normalize invalid schemes.
In `@skills/dynamo-troubleshoot/scripts/collect_dynamo_debug_bundle.py`:
- Around line 107-114: The default outdir creation using a predictable path
should be replaced so that when the CLI flag for outdir is not provided we
generate a secure temporary directory; update the argument parsing logic around
parser.add_argument("--outdir", ...) and the subsequent outdir assignment (where
outdir = Path(args.outdir)...) to: if args.outdir is None or empty, call
tempfile.mkdtemp(prefix="dynamo-debug-") to create a secure temp dir, then set
outdir = Path(that_tempdir).expanduser().resolve() and mkdir as currently done;
ensure imports include tempfile and that behavior remains unchanged when a user
passes --outdir.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 75af0db0-1a50-4b42-8497-0c014a193205
📒 Files selected for processing (25)
.agents/skills.github/filters.yamldocs/README.mddocs/digest/agentic-inference/agentic-inference.mdskills/debug-session/SKILL.mdskills/dep-create/SKILL.mdskills/dep-status/SKILL.mdskills/dep-update/SKILL.mdskills/dynamo-docs/SKILL.mdskills/dynamo-recipe-runner/SKILL.mdskills/dynamo-recipe-runner/references/k8s-recipe-workflow.mdskills/dynamo-recipe-runner/scripts/recipe_tool.pyskills/dynamo-router-starter/SKILL.mdskills/dynamo-router-starter/references/router-modes.mdskills/dynamo-router-starter/scripts/check_router_health.pyskills/dynamo-troubleshoot/SKILL.mdskills/dynamo-troubleshoot/references/failure-decision-tree.mdskills/dynamo-troubleshoot/scripts/collect_dynamo_debug_bundle.pyskills/gh-issue-bug/SKILL.mdskills/graham-code-review/SKILL.mdskills/pr-monitor/SKILL.mdskills/tool-parser-generator/README.mdskills/tool-parser-generator/SKILL.mdskills/tool-parser-generator/references/integration-guide.mdskills/tool-parser-generator/references/parser-patterns.md
|
I think it'd also be good to consider other things that they may need: mainly NIXL/NCCL env variables, IB capabilities, etc. It'd be good to have a tester which validates that nixl can talk over rdma/nvlink as well, as you may be able to get the deployment up with these skills, but disagg won't be correct. |
|
@Aphoh good call — a deployment can come up while disagg is silently wrong if NIXL can't reach a peer over RDMA/NVLink. Added a new
The env catalog + IB checklist live in |
@athreesh two quick things before this merges: 1. Heads-up — I pushed a commit to this branch ( 2. You (as author) need to post the OSRB IP-review attestation before merge. This is the one step that isn't auto-enforced — it has to come from you. SPDX headers are already present (Q2 ✓); for Q3 just confirm the listed terms are public. Please reply affirming:
Once you've affirmed, this is clear to merge — the remaining steps (NVCARPS signing + |
BenHamm
left a comment
There was a problem hiding this comment.
Approving. CI is green on the rebased HEAD (snapshot/operator/pre-merge all pass — the earlier snapshot red was an unrelated flaky Go test cleared by updating onto main). Customer skill set curated to the four published skills, evals/evals.json added and trigger-validated (24/24), CodeRabbit items addressed, and IP/OSRB clear. Good to merge for the Computex signing path.
(Reviewed with Claude acting on Ben Hamm's behalf.)
Signed-off-by: Anish Maddipoti <amaddipoti@nvidia.com>
- recipe_tool: match the DynamoGraphDeployment 'gpu: N' shorthand (not just nvidia.com/gpu) so GPU hints populate for real recipes - recipe_tool: include the sibling model-level model-cache manifests when validating a recipe directory - check_router_health: return structured output for a 200 with a non-JSON body instead of crashing; restrict requests to http/https schemes - collect_dynamo_debug_bundle: redact token/secret/password values and HF tokens before writing; default output to a private mkdtemp directory - router-modes: escape pipe chars in the queue-policy table cell - docs: restore absolute GitHub URLs for the moved skill links Signed-off-by: Anish Maddipoti <amaddipoti@nvidia.com>
- revert docs skill links to relative paths; absolute blob/main URLs 404 in the pre-merge lychee check since the files only exist on this branch - reflow model-cache file concatenation to satisfy black Signed-off-by: Anish Maddipoti <amaddipoti@nvidia.com>
- New dynamo-interconnect-check skill validates the NIXL/UCX/NCCL transport that disaggregated serving depends on. A deployment can pass an endpoint smoke test while disagg is silently wrong if NIXL cannot reach a peer over RDMA/NVLink; the skill audits transport env vars, probes RDMA/GPUDirect/ NVLink capabilities read-only, and runs a best-effort NIXL reachability check. Addresses review feedback from @Aphoh. - Move all skills back to .agents/skills as the real directory (matching the upstream agent-skills convention) and make root skills/ a symlink so the Computex packaging path still resolves. Signed-off-by: Anish Maddipoti <amaddipoti@nvidia.com>
The new root skills -> .agents/skills symlink is a tracked file that matched no CI filter, failing the changed-files coverage guard. Add it to the docs filter. Signed-off-by: Anish Maddipoti <amaddipoti@nvidia.com>
… path The skills/ symlink (-> .agents/skills) is what the nvidia/skills catalog sparse-checks out. Keep only the four customer-facing Dynamo skills there (dynamo-recipe-runner, dynamo-router-starter, dynamo-troubleshoot, dynamo-interconnect-check); move the nine contributor/internal skills to .agents/contributor-skills/ so they remain available to local agents but are NOT mirrored to the public catalog (per the Skills Publishing Onboarding Guide). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ben Hamm <bhamm@nvidia.com>
Required-in-MR test datasets (per NVCARPS skills standard) for dynamo-recipe-runner, dynamo-router-starter, dynamo-troubleshoot, and dynamo-interconnect-check. Each has positive cases plus cross-skill negative cases (a sibling skill's prompt, where this skill must stay silent) to exercise trigger/discoverability boundaries. Validated with a blind trigger-routing pass over all 24 cases (descriptions only): 24/24 routed to the expected skill, including all 12 cross-skill negatives. Functional/output-quality (Tier-2) evals require a GPU cluster and are deferred post-Computex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ben Hamm <bhamm@nvidia.com>
63ed0a6 to
c3101f8
Compare
|
@alec-flowers @tanmayv25 — could one of you stamp this when you get a sec? CODEOWNERS |
Signed-off-by: Anish Maddipoti <amaddipoti@nvidia.com> Signed-off-by: Ben Hamm <bhamm@nvidia.com> Co-authored-by: Ben Hamm <bhamm@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Ben Hamm <ben.hamm@gmail.com> Signed-off-by: Chi Xing <cxing@nvidia.com>
Summary
Adds four customer-facing Dynamo agent skills for the Computex bring-up path, and reorganizes the in-repo skill catalog so end-user skills are cleanly separated from contributor-only ones.
New customer skills under
.agents/skills/:dynamo-recipe-runner— select, validate, patch, deploy, and smoke-test existing recipes.dynamo-router-starter— round-robin / KV router setup plus endpoint smoke tests.dynamo-troubleshoot— read-only Kubernetes debug-bundle collection and common-failure classification.dynamo-interconnect-check— read-only NIXL/UCX/NCCL transport validation after a disagg or multi-node recipe comes up, so disagg numbers reflect the real fabric.Each skill ships a
SKILL.md, areferences/doc, a script underscripts/, and anevals/evals.jsoncase set.Catalog reorganization:
.agents/skills/is the canonical location for customer skills.debug-session,dep-create,dep-status,dep-update,dynamo-docs,gh-issue-bug,graham-code-review,pr-monitor,tool-parser-generator) moved to.agents/contributor-skills/so they no longer appear in the public catalog.skillssymlink →.agents/skillskeeps existing references and tooling working..github/filters.yamlupdated so the rootskillssymlink is covered by the docs filter.Also addresses earlier review comments on recipe discovery, GPU counting, debug-bundle coverage, and portable script references.
Validation
python3 -m py_compile .agents/skills/dynamo-recipe-runner/scripts/recipe_tool.py .agents/skills/dynamo-router-starter/scripts/check_router_health.py .agents/skills/dynamo-troubleshoot/scripts/collect_dynamo_debug_bundle.py .agents/skills/dynamo-interconnect-check/scripts/check_interconnect.pypython3 -m isort --check-only .agents/skills/dynamo-recipe-runner/scripts/recipe_tool.py .agents/skills/dynamo-router-starter/scripts/check_router_health.py .agents/skills/dynamo-troubleshoot/scripts/collect_dynamo_debug_bundle.py .agents/skills/dynamo-interconnect-check/scripts/check_interconnect.pypython3 -m black --check .agents/skills/dynamo-recipe-runner/scripts/recipe_tool.py .agents/skills/dynamo-router-starter/scripts/check_router_health.py .agents/skills/dynamo-troubleshoot/scripts/collect_dynamo_debug_bundle.py .agents/skills/dynamo-interconnect-check/scripts/check_interconnect.pypython3 .agents/skills/dynamo-recipe-runner/scripts/recipe_tool.py list --query nemotron --framework sglang --format tablepython3 .agents/skills/dynamo-recipe-runner/scripts/recipe_tool.py validate recipes/nemotron-3-super-fp8/sglang/aggpython3 .agents/skills/dynamo-router-starter/scripts/check_router_health.py --base-url http://127.0.0.1:9 --retries 1python3 .agents/skills/dynamo-interconnect-check/scripts/check_interconnect.py env recipes/nemotron-3-super-fp8/sglang/agg<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
Summary by CodeRabbit
New Features
dynamo-recipe-runnerskill for selecting, validating, and deploying Kubernetes recipes.dynamo-router-starterskill for configuring router modes and performing health checks.dynamo-troubleshootskill with troubleshooting workflows and debug bundle collection.dynamo-interconnect-checkskill for post-deploy NIXL/UCX/NCCL transport validation.Documentation
Chores
.agents/contributor-skills/; rootskillssymlink points to.agents/skills.evals/evals.jsoncase sets for the four customer skills.<!-- review_stack_entry_start -->
[!Review Change Stack](https://app.coderabbit.ai/change-stack/ai-dynamo/dynamo/pull/9782?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)
<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->