fix(onboard): three NVIDIA Endpoints model probe bugs (#1601) #1602

Merged
cv merged 2 commits into NVIDIA:main from latenighthackathon:fix/onboard-nvidia-build-probes
Apr 9, 2026

Conversation

@latenighthackathon
Contributor

@latenighthackathon latenighthackathon commented Apr 8, 2026

Summary

Problem

Three quirks of NVIDIA Build (https://integrate.api.nvidia.com/v1) combine to make custom-model selection in nemoclaw onboard look broken even when the model would otherwise work. The default Nemotron path is unaffected because it picks models that happen to mask all three behaviors.

The reporter (@kaviin27 in the NVIDIA Developer Discord) saw this exact failure pasted in their bug report:

Responses API with tool calling: HTTP 404: 404 page not found
Chat Completions API: HTTP 404: Function 'ZZZZ': Not found for account 'XXXX'

I reproduced both halves of that against a live free-tier NVIDIA Build key and traced them to three independent code paths.

1. Responses probe always 404s on NVIDIA Build

bin/lib/onboard.js:probeOpenAiLikeEndpoint() probes /v1/responses first when requireResponsesToolCalling is true (which is gated on provider === 'nvidia-prod'). NVIDIA Build does not expose /v1/responses for any model — even nemotron-49b returns 404 page not found. The probe loop silently falls through to chat completions on success, so working models still onboard, but the noisy 404 leaks into error messages whenever chat completions also fails. That's why @kaviin27's error string had both halves.

Fix: skip the Responses probe entirely for nvidia-prod via a new shouldSkipResponsesProbe() helper.
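
A minimal sketch of what such a gate could look like. The real helper lives in src/lib/validation.ts; the set-based lookup here is an illustration, not the shipped implementation:

```typescript
// Providers whose endpoints are known not to expose /v1/responses.
// Illustrative only — the shipped helper may use a simple equality check.
const providersWithoutResponsesApi = new Set(["nvidia-prod"]);

function shouldSkipResponsesProbe(provider: string): boolean {
  // NVIDIA Build returns "404 page not found" for /v1/responses on every
  // model, so probing it only pollutes the combined failure message.
  return providersWithoutResponsesApi.has(provider);
}
```

The probe loop in probeOpenAiLikeEndpoint can then simply omit the /v1/responses entry from its probe list when this returns true.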

2. NVCF Function not found for account is opaque to the user

Many catalog models return:

{"status":404,"title":"Not Found","detail":"Function 'a1b2c3': Not found for account 'xyz'"}

This means the model is in the public catalog but is not deployed for the user's account/org. The current code joins all probe failures into one raw string and surfaces the NVCF body verbatim, leaving the user with no idea what to do.

Fix: detect the pattern with isNvcfFunctionNotFoundForAccount() and replace the message with nvcfFunctionNotFoundMessage(model), which says:

Model '<id>' is in the NVIDIA Build catalog but is not deployed for your account. Pick a different model, or check the model card on https://build.nvidia.com to see if it requires org-level access.
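
A hedged sketch of the two helpers, assuming a regex shaped after the error body quoted above (the shipped pattern and exact wording may differ):

```typescript
// Matches NVCF's "Function '<uuid>': Not found for account '<id>'" detail.
// The pattern here is an assumption based on the quoted error body.
const nvcfNotFoundForAccount = /Function '[^']+': Not found for account '[^']+'/;

function isNvcfFunctionNotFoundForAccount(message: string): boolean {
  return nvcfNotFoundForAccount.test(message);
}

function nvcfFunctionNotFoundMessage(model: string): string {
  return (
    `Model '${model}' is in the NVIDIA Build catalog but is not deployed ` +
    `for your account. Pick a different model, or check the model card on ` +
    `https://build.nvidia.com to see if it requires org-level access.`
  );
}
```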

3. Validation probes inherit a 60s curl timeout

getCurlTimingArgs() returns --max-time 60. That is fine for inference calls, but a misbehaving model (confirmed: mistralai/mistral-medium-3-instruct hangs 30+ seconds with zero bytes returned, 3 attempts in a row) should not block the wizard for a full minute per failure mode. The user-visible result is a wizard that just sits there for ages on a model the user can't fix.

Fix: add getValidationProbeCurlArgs() that returns --max-time 15 and use it in probeOpenAiLikeEndpoint and probeResponsesToolCalling.
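
A sketch of the resulting split between inference and validation timing args. The function names and the --max-time values come from the PR description; the --connect-timeout value follows the walkthrough and should be treated as illustrative:

```typescript
// Timing args for real inference calls — a generous ceiling is fine here.
function getCurlTimingArgs(): string[] {
  return ["--max-time", "60"];
}

// Tighter args for onboarding validation probes, so a hung backend fails
// fast instead of stalling the wizard for a minute per probe.
function getValidationProbeCurlArgs(): string[] {
  return ["--connect-timeout", "10", "--max-time", "15"];
}
```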

Test plan

  • 6 new unit tests in src/lib/validation.test.ts covering isNvcfFunctionNotFoundForAccount, nvcfFunctionNotFoundMessage, and shouldSkipResponsesProbe — including the literal NVCF error string captured from a live API run
  • All 30 validation tests pass: npx vitest run src/lib/validation.test.ts → 30/30
  • Existing onboard test suites unaffected — verified test/onboard-selection.test.js and test/onboard-readiness.test.js pass
  • Live reproduction against https://integrate.api.nvidia.com/v1:
    • nvidia/llama-3.3-nemotron-super-49b-v1 chat → 200 in 0.67s (control)
    • mistralai/mistral-large chat → 404 Function 'XXX': Not found for account 'YYY' (NVCF account-not-deployed error)
    • mistralai/mistral-medium-3-instruct chat → 30+ second hang, no bytes
    • /v1/responses for any model → 404 page not found
  • npx prettier --check clean
  • npx eslint clean
  • Signed commit + DCO sign-off

Closes #1601

Signed-off-by: latenighthackathon latenighthackathon@users.noreply.github.com

Summary by CodeRabbit

  • Bug Fixes

    • Improved detection of NVIDIA “function not found” failures and return of a clearer, model-specific message and guidance.
    • Validation now preserves raw probe response bodies for better failure reporting.
  • Performance

    • Tighter per-probe timeouts and optional skipping of unnecessary Responses probes for specific providers to reduce noise and speed up validation.
  • Tests

    • Added unit tests covering the new validation helpers and probe-skip behavior.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5576b1b9-cfd5-44cd-bd65-d061afebf335

📥 Commits

Reviewing files that changed from the base of the PR and between aff5bf8 and 8ace970.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • src/lib/validation.test.ts
  • src/lib/validation.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/validation.test.ts

📝 Walkthrough

Walkthrough

Added three validation helpers and tests; tightened per-probe curl timeouts; optionally skipped the Responses API probe for nvidia-prod; preserved raw probe bodies; and detected NVIDIA Cloud Functions "Function ... Not found for account" failures, returning a reframed message during provider/model validation.

Changes

Cohort / File(s) Summary
Validation Helpers
src/lib/validation.ts
Added isNvcfFunctionNotFoundForAccount(message: string), nvcfFunctionNotFoundMessage(model: string), and shouldSkipResponsesProbe(provider: string) to detect NVCF 404s, produce a user-facing reframed message, and gate probing of the Responses API for nvidia-prod.
Validation Tests
src/lib/validation.test.ts
Added tests for the three new helpers: regex matching for NVCF errors (including wrapped/variation cases), message contents/classification, and provider-based shouldSkipResponsesProbe() behavior.
Onboarding / Probe Logic
bin/lib/onboard.js
Imported new validators; added getValidationProbeCurlArgs() that sets --connect-timeout 10 and --max-time 15; refactored probeOpenAiLikeEndpoint to build separate chat-completions probe, optionally skip /responses when requested, preserve raw probe body in failures, detect NVCF "Function not found for account" using either summarized message or raw body, and return the reframed nvcfFunctionNotFoundMessage(model) when detected.

Sequence Diagram(s)

sequenceDiagram
    participant Wizard as Onboard Wizard
    participant Onboard as onboard.js validator
    participant Curl as curl probe
    participant Build as NVIDIA Build API

    Wizard->>Onboard: validate provider & model (includes shouldSkipResponsesProbe)
    alt Responses probed
        Onboard->>Curl: POST /v1/responses (with curl args)
        Curl->>Build: /v1/responses
        Build-->>Curl: 404 or response
        Curl-->>Onboard: probe result (raw body preserved)
    end
    Onboard->>Curl: POST /v1/chat/completions (with --connect-timeout 10 --max-time 15)
    Curl->>Build: /v1/chat/completions
    Build-->>Curl: success, timeout, or "Function 'X': Not found for account 'Y'"
    Curl-->>Onboard: probe result (raw body preserved)
    Onboard->>Onboard: if NVCF pattern detected -> nvcfFunctionNotFoundMessage(model)
    Onboard-->>Wizard: validation outcome (success or reframed NVCF message)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through probes and timeouts,
Skips the /responses that always drowns,
Sniffs the NVCF trace with keen delight,
Reframes the error, makes the path right—
Hop on, model chooser, choose what's bright! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 55.56%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'fix(onboard): three NVIDIA Endpoints model probe bugs (#1601)' directly and specifically summarizes the main purpose of the PR: fixing three distinct probe-related bugs in NVIDIA Endpoints onboarding.
  • Linked Issues check (✅ Passed): The PR fully addresses all coding objectives from issue #1601: skip the /responses probe for nvidia-prod via shouldSkipResponsesProbe(), detect NVCF 'Function not found' errors via isNvcfFunctionNotFoundForAccount() with an actionable nvcfFunctionNotFoundMessage(), and tighten validation-probe timeouts via getValidationProbeCurlArgs().
  • Out of Scope Changes check (✅ Passed): All code changes are tightly scoped to the three stated probe bugs: responses probe skipping, NVCF error detection/reframing, and timeout tightening. No unrelated refactoring, dependencies, or unscoped modifications are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
src/lib/validation.ts (1)

107-112: Make the rewritten message model-classifiable.

If this string is fed back through classifyValidationFailure(), it won't hit the model regex on Lines 38-39, so the recovery path degrades to unknown/selection even though the guidance says to pick a different model. Either extend the classifier for not deployed for your account, or phrase this in a way the existing classifier already recognizes.
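
For illustration, assuming the classifier's model regex has roughly the shape described (this regex and both message strings are assumptions, not the project's actual code), the rewording fix looks like this:

```typescript
// Assumed shape of classifyValidationFailure's model regex — not the real one.
const modelNotFound = /model '[^']+'.*not found/i;

// Reframed wording as originally proposed — never matches the regex above,
// so recovery degrades to the generic unknown/selection branch.
const opaque =
  "Model 'x' is in the NVIDIA Build catalog but is not deployed for your account.";

// Reworded to lead with "Model '<id>' not found" so the classifier can
// route the user into the model-selection recovery path.
const routable =
  "Model 'x' not found: it is in the NVIDIA Build catalog but is not " +
  "deployed for your account. Pick a different model on https://build.nvidia.com.";
```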

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/validation.ts` around lines 107 - 112, The message returned by
nvcfFunctionNotFoundMessage should be rewritten so it matches the existing
classifyValidationFailure model regex (so the classifier will route to the
model-specific recovery path) — update the string to include the
classifier-recognized phrasing (e.g., include "Model '<model>' not found" or
"Model '<model>' not available" and also mention "not deployed for your
account") so classifyValidationFailure can detect the model name and the "not
found/not available" condition; modify nvcfFunctionNotFoundMessage accordingly
(or alternatively add the "not deployed for your account" pattern to
classifyValidationFailure) and ensure the function name
nvcfFunctionNotFoundMessage and classifier classifyValidationFailure remain
consistent.
bin/lib/onboard.js (1)

1133-1138: Match the NVCF signal before collapsing failures to message.

This branch only checks failure.message. If runCurlProbe() ever stops carrying the raw NVCF detail through that field, the special-case silently stops working even though result.body still contains the marker. Matching against the raw body before pushing into failures would make this path much less brittle.
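
A sketch of the more robust detection this comment suggests, with hypothetical field names (message, body) on the failure records:

```typescript
// Hypothetical shape of a probe failure record — field names are assumptions.
interface ProbeFailure {
  message: string; // summarized error shown to the user
  body?: string;   // raw probe response body, preserved verbatim
}

const nvcfPattern = /Function '[^']+': Not found for account '[^']+'/;

// Check both the summarized message and the raw body, so the detector keeps
// working even if summarization stops carrying the NVCF detail through message.
function findNvcfAccountFailure(
  failures: ProbeFailure[],
): ProbeFailure | undefined {
  return failures.find(
    (f) => nvcfPattern.test(f.message) || nvcfPattern.test(f.body ?? ""),
  );
}
```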

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1133 - 1138, The current detection logic
uses failures.find(...) and only checks failure.message with
isNvcfFunctionNotFoundForAccount, which is brittle if runCurlProbe moves the raw
NVCF `detail` out of message; update the detection to inspect the raw probe
response before failures are collapsed — either (A) when building the failures
array in the runCurlProbe result construction, preserve the raw response
body/detail (e.g., attach result.body or result.detail to the failure object),
or (B) change the accountFailure search to call isNvcfFunctionNotFoundForAccount
against failure.message || failure.body || failure.detail (or directly against
the original result.body if available); touch the code paths that create
failures in runCurlProbe and the lookup using accountFailure and
isNvcfFunctionNotFoundForAccount to ensure the raw NVCF marker is matched
reliably.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e782ea14-5884-45f0-bb4f-af81d0fa3b0c

📥 Commits

Reviewing files that changed from the base of the PR and between adbea05 and ca671a0.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • src/lib/validation.test.ts
  • src/lib/validation.ts

@cv cv added the v0.0.10 Release target label Apr 8, 2026
@cv cv self-assigned this Apr 8, 2026
@cv cv left a comment

LGTM — security review PASS.

  • Credential handling unchanged, keys still routed through normalizeCredentialValue()
  • NVCF error detection regex is restrictive and well-tested
  • Model IDs validated upstream by isSafeModelId(), no injection risk
  • Timeout reduction from 60s to 15s is a security positive
  • Raw body in failures array is internal-only, never surfaced to user
  • 6 new unit tests cover positive/negative matches and classification routing

No concerns.

@cv
Contributor

cv commented Apr 8, 2026

Several v0.0.10 PRs just merged, including changes to onboard.js. Could you rebase on main to retrigger CI? Thanks!

NVIDIA Build (https://integrate.api.nvidia.com/v1) has three quirks
that combine to make custom-model selection in NemoClaw onboarding
look broken even when the model would otherwise work. The default
Nemotron path is unaffected because it picks models that happen to
mask all three behaviors.

1. Skip the Responses API probe entirely for nvidia-prod.
   NVIDIA Build does not expose /v1/responses for any model — every
   probe to that path returns '404 page not found'. The probe loop
   silently falls through to chat completions on success, so working
   models still onboard, but the noisy 404 leaks into error messages
   when chat completions also fails. Add shouldSkipResponsesProbe()
   in src/lib/validation.ts and gate the probe list in
   probeOpenAiLikeEndpoint on it.

2. Detect NVCF 'Function not found for account' and reframe.
   Many catalog models (e.g. mistralai/mistral-large) return:
     {"detail":"Function '<uuid>': Not found for account '<id>'"}
   This means the model is in the public catalog but is not deployed
   for the user's account/org. The raw NVCF body is opaque; surface
   a clear, actionable message instead. Add isNvcfFunctionNotFoundForAccount
   and nvcfFunctionNotFoundMessage helpers in validation.ts and use
   them when collecting probe failures.

3. Tighten validation probe timeout from 60s to 15s.
   getCurlTimingArgs() defaults to --max-time 60. That is fine for
   inference calls, but a hung NVIDIA backend (confirmed live on
   mistralai/mistral-medium-3-instruct, 30+ seconds with zero bytes)
   should not block the wizard for a full minute per failure mode.
   Add getValidationProbeCurlArgs() with --max-time 15 and use it
   in probeOpenAiLikeEndpoint and probeResponsesToolCalling.

The catalog-only validation false-positive (Bug 4 in the issue) is
intentionally out of scope for this PR — that is a behavior change
that should be discussed with maintainers first.

Adds 6 new unit tests in src/lib/validation.test.ts covering all
three new helpers, with the literal NVCF error string reproduced
from a live API run.

Closes NVIDIA#1601

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
1. Reword nvcfFunctionNotFoundMessage to start with 'Model <id> not found'
   so classifyValidationFailure() matches its model.+not found regex and
   routes the user into the model-selection recovery path. Without this,
   the recovery flow degrades to the generic unknown/selection branch
   even though the message tells the user to pick a different model.

2. Preserve the raw probe response body alongside the summarized message
   in the failures array, and have the NVCF detector check both message
   and body. summarizeProbeError currently surfaces the NVCF detail
   string through message, but if that ever changes the detector keeps
   working.

Adds a regression test asserting that classifyValidationFailure() routes
the reframed NVCF message to {kind: 'model', retry: 'model'}.

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
@latenighthackathon latenighthackathon force-pushed the fix/onboard-nvidia-build-probes branch from aff5bf8 to 8ace970 on April 8, 2026 at 22:06
@latenighthackathon
Contributor Author

latenighthackathon commented Apr 8, 2026

@cv Thanks for the thorough security review! Rebased onto current main — clean linear history, CI should pick up a fresh run. Cheers!

@cv cv merged commit 303e3f7 into NVIDIA:main Apr 9, 2026
5 of 8 checks passed
@ericksoa
Contributor

ericksoa commented Apr 9, 2026

Thanks for the fix — the code changes look good and the review is approved. CI is failing on two pre-existing ESLint errors that your branch picked up from an older main:

  1. test/onboard.test.js:597: '__dirname' is not defined (ESM file needs import.meta.dirname)
  2. bin/lib/onboard.js:424: @typescript-eslint/no-require-imports rule not found

Both have since been fixed on main. A git merge main into your branch should resolve them and get CI green. Once that's done we'll merge.

@latenighthackathon latenighthackathon deleted the fix/onboard-nvidia-build-probes branch April 9, 2026 03:01
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
…VIDIA#1602)


Labels

v0.0.10 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(onboard): NVIDIA Endpoints model selection has 4 broken probe behaviors causing misleading errors and hung wizard

3 participants