feat(sandbox): runtime model override via env vars #1633
brandonpelfrey merged 21 commits into NVIDIA:main
Allow model and provider changes without rebuilding the sandbox image. The entrypoint patches openclaw.json at startup when env vars are set, then recomputes the config hash so integrity checks still pass.

New env vars:
- NEMOCLAW_MODEL_OVERRIDE: patches agents.defaults.model.primary and the provider model name. Must match the model configured on the gateway via openshell inference set.
- NEMOCLAW_INFERENCE_API_OVERRIDE: patches the inference API type (e.g., "anthropic-messages" or "openai-completions"). Only needed for cross-provider switches.

Security: env vars come from the host (Docker/OpenShell), not from inside the sandbox. Landlock locks the file after patching. Same trust model as NEMOCLAW_LOCAL_INFERENCE_TIMEOUT.

Closes NVIDIA#759

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
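The patch-then-rehash step described in this commit message can be sketched in Python as follows. This is a minimal illustration, not the code in scripts/nemoclaw-start.sh: the `models.providers` layout of openclaw.json and the `sha256sum`-style hash file format are assumptions, and the sketch patches both `id` and `name`, as later review feedback in this thread requests.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def apply_model_override(config_path: Path, hash_path: Path,
                         model: str, api: Optional[str] = None) -> None:
    """Patch the model (and optionally the API type), then refresh the hash."""
    config = json.loads(config_path.read_text())

    # agents.defaults.model.primary is the key named in the commit message.
    config["agents"]["defaults"]["model"]["primary"] = model

    # ASSUMPTION: providers live under models.providers, each with a "models" list.
    for provider in config.get("models", {}).get("providers", {}).values():
        for entry in provider.get("models", []):
            entry["id"] = model
            entry["name"] = model
        if api is not None:
            provider["api"] = api

    config_path.write_text(json.dumps(config, indent=2) + "\n")

    # Recompute the digest from the bytes on disk so a later integrity check
    # sees exactly what was written.
    digest = hashlib.sha256(config_path.read_bytes()).hexdigest()
    # ASSUMPTION: the hash file uses sha256sum's "<hex>  <filename>" layout.
    hash_path.write_text(f"{digest}  {config_path.name}\n")
```

Hashing the file after writing it, rather than hashing the in-memory string, keeps the stored digest tied to the on-disk bytes that a verifier would read.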
📝 Walkthrough
Adds a runtime model override: a new apply_model_override() in the container startup script that, when allowed, rewrites /sandbox/.openclaw/openclaw.json (and recomputes .config-hash) based on NEMOCLAW_MODEL_OVERRIDE and optional NEMOCLAW_INFERENCE_API_OVERRIDE; tests and docs updated accordingly.
Sequence Diagram(s)
sequenceDiagram
participant Entrypoint as Entrypoint Script
participant Verifier as verify_config_integrity
participant Override as apply_model_override()
participant FS as Sandbox FS (/sandbox/.openclaw)
participant Hasher as sha256sum
Entrypoint->>Verifier: call verify_config_integrity
Verifier-->>Entrypoint: integrity OK
Entrypoint->>Override: call apply_model_override()
Override->>FS: check ownership & symlink status
alt running as root and NEMOCLAW_MODEL_OVERRIDE set
Override->>FS: read openclaw.json
Override->>FS: modify agents.defaults.model.primary
Override->>FS: set providers' models[].id and .name
alt NEMOCLAW_INFERENCE_API_OVERRIDE set (allowed)
Override->>FS: replace providers' api fields
end
Override->>FS: write openclaw.json
Override->>Hasher: compute sha256sum openclaw.json
Hasher-->>FS: write .config-hash
else no-op (non-root or unset)
Override-->>Entrypoint: no changes
end
Override-->>Entrypoint: return
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Line 623: The script calls apply_model_override which attempts to open
/sandbox/.openclaw/openclaw.json with "w" even when running as non-root while
install_configure_guard()/configure_messaging_channels have made
/sandbox/.openclaw Landlock read-only; this causes NEMOCLAW_MODEL_OVERRIDE to
fail under set -e. Fix by updating apply_model_override to first detect uid (id
-u) and if non-root either (A) require/gate the write path to root (skip or sudo
the write) or (B) write the override into a per-user writable config mirror
(e.g., create a writable copy directory for the non-root fallback and write into
that mirror) and ensure subsequent reads prefer the mirror; also add a pre-check
in apply_model_override that verifies the target file is writable and falls back
cleanly rather than causing an immediate set -e exit.
- Around line 187-223: The script writes config_file and hash_file before
symlink validation; update apply_model_override() (or the section using
variables config_file and hash_file) to reject or handle symlinks: before
opening/writing config_file and before writing hash_file, check if each path is
a symlink (e.g., test -L or use readlink) and abort with an error if so, or
write to a temp file in the same directory and atomically mv into place only
after verifying the target is not a symlink; ensure the same symlink guard is
applied to both config_file and hash_file so they cannot be pre-created as
symlinks and followed/overwritten.
In `@test/nemoclaw-start.test.js`:
- Around line 207-213: The test currently uses a regex that starts at the first
verify_config_integrity and matches helper definitions; instead extract the
non-root shell block 'if [ "$(id -u)" -ne 0 ]; then ... fi' from src first and
then assert that within that extracted block the sequence
verify_config_integrity → apply_model_override → export_gateway_token occurs;
update the matcher in test/nemoclaw-start.test.js to locate the non-root
if-block and run the ordering regex against that block (referencing
verify_config_integrity, apply_model_override, and export_gateway_token to
identify the calls).
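The symlink guard and the write-to-temp-then-rename suggestion from the review comment above can be illustrated with a short Python sketch. The helper name `write_guarded` is hypothetical and the entrypoint itself is a shell script, so treat this as an illustration of the technique rather than the fix that was merged.

```python
import os
import tempfile
from pathlib import Path


def write_guarded(target: Path, data: str) -> None:
    """Refuse symlinked targets, then replace the file atomically."""
    if target.is_symlink():
        raise PermissionError(f"refusing to write through symlink: {target}")

    # Create the temp file in the same directory so os.replace stays on one
    # filesystem, which is what makes the rename atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
        # os.replace swaps the directory entry itself; it does not follow a
        # symlink that might have appeared at the target path.
        os.replace(tmp_name, target)
    except BaseException:
        os.unlink(tmp_name)
        raise
```

Applying the same helper to both the config file and the hash file matches the review's point that neither path should be followed if it was pre-created as a symlink.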
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cd47e5eb-bd44-45cb-81d8-0271a4f2e336
📒 Files selected for processing (2)
- scripts/nemoclaw-start.sh
- test/nemoclaw-start.test.js
cv left a comment
Security Review — WARNING
The shell/Python mechanics are solid — proper quoting, single-quoted heredoc, argv passing, json.load/dump. No injection vectors. Call ordering is correct.
Requested changes
1. Input validation (MEDIUM):
`NEMOCLAW_INFERENCE_API_OVERRIDE` should be checked against an allowlist (e.g., `anthropic-messages`, `openai-completions`). An arbitrary value here could cause unexpected provider routing behavior. `NEMOCLAW_MODEL_OVERRIDE` should reject control characters, newlines, and enforce a reasonable length limit. This converts the documentation-only contract ("must match the gateway config") into a code-enforced one.
2. Document as security-sensitive (MEDIUM):
`NEMOCLAW_MODEL_OVERRIDE` controls where the agent sends all prompts. Anyone who can set this env var controls inference routing. Please:
- Document this explicitly in the PR description / docs
- Consider logging at `[SECURITY]` level (not just `[config]`) when an override is active
Minor (non-blocking)
- Use `printf '%s\n' "$value"` instead of `echo "$value"` in log lines to prevent terminal escape sequence injection (low risk since values come from host, but more defensive)
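The two validation requests above (an allowlist for the API type, and control-character and length checks for the model name) could look roughly like this in Python. The function names are hypothetical; the 256-character limit is the value cited in the follow-up commit in this thread, and the exact set of rejected characters is an assumption.

```python
import re

# The two API types named in the review and the PR description.
ALLOWED_API_TYPES = {"openai-completions", "anthropic-messages"}
MAX_MODEL_LENGTH = 256  # limit cited in the follow-up commit
# ASSUMPTION: "control characters" means ASCII C0 controls plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def validate_model_override(value: str) -> str:
    """Return the value unchanged if it is acceptable, else raise ValueError."""
    if not value:
        raise ValueError("model override is empty")
    if len(value) > MAX_MODEL_LENGTH:
        raise ValueError("model override exceeds length limit")
    if _CONTROL_CHARS.search(value):
        raise ValueError("model override contains control characters")
    return value


def validate_api_override(value: str) -> str:
    """Accept only API types on the allowlist."""
    if value not in ALLOWED_API_TYPES:
        raise ValueError(f"unsupported inference API type: {value!r}")
    return value
```

Rejecting newlines up front also covers the log-injection concern raised in the minor comments, since the value is never echoed before it has been checked.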
- Patch model.id in addition to model.name: id is what OpenClaw sends in API requests, name is the display reference
- Add e2e tests 26-27 to e2e-gateway-isolation.sh verifying the override patches primary/id/name, and is a no-op when unset
- Document cross-provider switching in switch-inference-providers.md with env var usage and recreate-sandbox workflow
- Update Notes section to clarify same-provider vs cross-provider

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/inference/switch-inference-providers.md`:
- Around line 96-100: Clarify that same-provider model switches (e.g., between
two NVIDIA models) only require updating the gateway route and do not need
NEMOCLAW_INFERENCE_API_OVERRIDE, but do require setting NEMOCLAW_MODEL_OVERRIDE;
make the revert path explicit: removing NEMOCLAW_INFERENCE_API_OVERRIDE and/or
NEMOCLAW_MODEL_OVERRIDE in the environment will not take effect until the
sandbox is restarted/recreated because overrides are applied at startup, so
document that users must recreate the sandbox to revert to the original model;
update the conflicting lines to reflect this consistent behavior and mention
both env vars by name (NEMOCLAW_INFERENCE_API_OVERRIDE, NEMOCLAW_MODEL_OVERRIDE)
and the sandbox recreation step.
In `@test/e2e-gateway-isolation.sh`:
- Around line 391-399: The current loop overwrites model_id/model_name on each
iteration so only the last model is validated; update the logic that iterates
providers -> pval.get("models") to validate every model entry (e.g., perform the
equality check against primary inside the inner loop or accumulate a boolean
that requires all models to match) using the variables pval, models, model_id,
model_name and primary, and only print "OVERRIDE_OK" when every model's id and
name equal "test/override-model" (otherwise print the failure with the first
mismatched model's details).
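The "validate every model entry" fix requested for the e2e test can be sketched as below. The `OVERRIDE_OK` marker comes from the review comment; the `OVERRIDE_FAIL` message format and the `models.providers` config layout are assumptions made for the example.

```python
def check_override(config: dict, expected: str) -> str:
    """Return OVERRIDE_OK only if primary and every model id/name match."""
    primary = config["agents"]["defaults"]["model"]["primary"]
    if primary != expected:
        return f"OVERRIDE_FAIL primary={primary}"

    # ASSUMPTION: providers live under models.providers, each with a "models" list.
    providers = config.get("models", {}).get("providers", {})
    for pname, pval in providers.items():
        for model in pval.get("models", []):
            model_id = model.get("id")
            model_name = model.get("name")
            # Check inside the inner loop so an early mismatch is not
            # overwritten by a later, matching entry.
            if model_id != expected or model_name != expected:
                return (f"OVERRIDE_FAIL provider={pname} "
                        f"id={model_id} name={model_name}")
    return "OVERRIDE_OK"
```

Returning on the first mismatch reports the first offending model, which is what the review comment asks the failure output to contain.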
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 541f98df-72e2-4124-ba37-07dbe88c93bd
📒 Files selected for processing (4)
- .agents/skills/nemoclaw-user-configure-inference/SKILL.md
- docs/inference/switch-inference-providers.md
- scripts/nemoclaw-start.sh
- test/e2e-gateway-isolation.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/nemoclaw-start.sh
Address CodeRabbit and security review (Carlos):
- Gate override to root mode only (sandbox user can't write 444 files)
- Add symlink guard on config and hash files before writing
- Validate model override: reject control chars, enforce 256 char limit
- Allowlist inference API types (openai-completions, anthropic-messages)
- Use [SECURITY] log level for override messages (printf, not echo)
- Fix test regex to target non-root block specifically
- Add tests for symlink guard, root-only gate, input validation
- Fix e2e test to validate all model entries, not just last
- Fix docs: revert requires sandbox recreate, remove contradictions
- Fix 2 pre-existing test failures (non-root block extraction used ^fi$ which matched nested fi, now uses # ── Root path boundary)

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/nemoclaw-start.test.js (1)
223-225: Make root-path ordering assertion less brittle to harmless comments.
Current matcher requires adjacent lines; a comment between calls would fail despite correct order. Prefer the same flexible sequencing style used in the non-root assertion.
Suggested refactor
- const rootBlock = src.match(
-   /# ── Root path[\s\S]*?verify_config_integrity\n\s*apply_model_override\n\s*export_gateway_token/,
- );
+ const rootBlock = src.match(
+   /# ── Root path[\s\S]*?verify_config_integrity[\s\S]*?apply_model_override[\s\S]*?export_gateway_token/,
+ );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/nemoclaw-start.test.js` around lines 223 - 225, The current rootBlock regex in the test (src.match) is brittle because it requires verify_config_integrity, apply_model_override, and export_gateway_token to be on adjacent lines; update the pattern to allow arbitrary content (including comments and blank lines) between those calls using the same flexible sequencing approach used in the non-root assertion so that the order is enforced but not adjacency (i.e., replace the /\n\s*/ separators with a non-greedy [\s\S]*? between each symbol: verify_config_integrity, apply_model_override, export_gateway_token in the rootBlock match).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/nemoclaw-start.test.js`:
- Around line 50-53: The brace-group stripping regex used when building
braceStripped fails to match redirects that include spaces after the '>' (it
currently expects no space); update the regex used on block (the one within the
assignment to braceStripped) to allow optional whitespace between '>' and the
quoted filename (e.g. change the pattern so it matches
/\{[\s\S]*?\}\s*>\s*"[^"]*"/g), ensuring brace-group redirects like > "file" are
stripped correctly.
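The effect of the optional-whitespace fix can be checked with a small Python experiment. The `FIXED` pattern is the one suggested in the comment above, transcribed for Python's `re`; the `STRICT` pattern is only a guess at the earlier form (no space allowed after `>`), since the original regex is not shown in this thread.

```python
import re

# ASSUMPTION: a reconstruction of the earlier pattern, which required the
# quoted filename to follow '>' immediately.
STRICT = re.compile(r'\{[\s\S]*?\}\s*>"[^"]*"')
# Pattern suggested in the review: optional whitespace on both sides of '>'.
FIXED = re.compile(r'\{[\s\S]*?\}\s*>\s*"[^"]*"')

block = '{\n  echo "a"\n  echo "b"\n} > "out.log"\necho done\n'

# The strict form leaves the brace group in place; the fixed form strips it.
print(repr(STRICT.sub("", block)))
print(repr(FIXED.sub("", block)))
```

Both patterns still strip the compact form `} >"file"`, so the change only widens what is matched.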
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 22fff212-fbd9-4cf4-94e3-9ecb063de068
📒 Files selected for processing (5)
- .agents/skills/nemoclaw-user-configure-inference/SKILL.md
- docs/inference/switch-inference-providers.md
- scripts/nemoclaw-start.sh
- test/e2e-gateway-isolation.sh
- test/nemoclaw-start.test.js
✅ Files skipped from review due to trivial changes (1)
- .agents/skills/nemoclaw-user-configure-inference/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (3)
- test/e2e-gateway-isolation.sh
- scripts/nemoclaw-start.sh
- docs/inference/switch-inference-providers.md
CodeRabbit review: the regex stripping brace-group redirects missed the common form with whitespace after >, e.g. } > "file".

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ride (NVIDIA#719)

Adds apply_cors_override() following the same pattern as the model override: host-set env var, applied after integrity check and before chattr +i, config hash recomputed.

Users deploying on custom domains/ports can now add their browser origin without rebuilding the sandbox:

export NEMOCLAW_CORS_ORIGIN="https://my-server.example.com:8443"

Security guards: symlink check, control character rejection, length limit, http/https URL validation. Only applies in root mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
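The origin checks listed in this commit (control character rejection, length limit, http/https URL validation) might be sketched like this in Python. The function name and the 2048-character limit are hypothetical; the commit does not state the actual limit.

```python
import re
from urllib.parse import urlsplit

MAX_ORIGIN_LENGTH = 2048  # ASSUMPTION: the commit does not give the real limit
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def validate_cors_origin(origin: str) -> str:
    """Return the origin if it looks like a plain http(s) origin, else raise."""
    if not origin or len(origin) > MAX_ORIGIN_LENGTH:
        raise ValueError("CORS origin is empty or too long")
    # Check control characters before parsing, because URL parsers may
    # silently strip some of them.
    if _CONTROL_CHARS.search(origin):
        raise ValueError("CORS origin contains control characters")
    parts = urlsplit(origin)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("CORS origin must be an http or https URL")
    return origin
```

For instance, the origin from the commit message, `https://my-server.example.com:8443`, passes, while a `file://` URL or a bare hostname is rejected.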
…overrides

Extend apply_model_override() to also accept:
- NEMOCLAW_CONTEXT_WINDOW: override model context window size
- NEMOCLAW_MAX_TOKENS: override model max output tokens
- NEMOCLAW_REASONING: enable/disable reasoning mode (true/false)

These can be set independently of NEMOCLAW_MODEL_OVERRIDE, which is useful for tuning model parameters without switching models.

Validation: context window and max tokens must be positive integers, reasoning must be "true" or "false".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
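The validation rules stated in this commit (positive integers for the two size settings, a literal "true"/"false" for reasoning) can be illustrated with a short Python sketch. The env var names are from the commit; the helper names and the `contextWindow`, `maxTokens`, and `reasoning` result keys are hypothetical.

```python
def parse_positive_int(name: str, raw: str) -> int:
    """Accept only plain ASCII digit strings that denote a value above zero."""
    if not (raw.isascii() and raw.isdigit()) or int(raw) <= 0:
        raise ValueError(f"{name} must be a positive integer, got {raw!r}")
    return int(raw)


def parse_reasoning(raw: str) -> bool:
    """Accept only the literal strings "true" and "false"."""
    if raw not in ("true", "false"):
        raise ValueError(f"NEMOCLAW_REASONING must be 'true' or 'false', got {raw!r}")
    return raw == "true"


def collect_overrides(env: dict) -> dict:
    """Gather whichever tuning overrides are present; each one is optional."""
    overrides = {}
    # ASSUMPTION: the result keys are illustrative, not the real config keys.
    if "NEMOCLAW_CONTEXT_WINDOW" in env:
        overrides["contextWindow"] = parse_positive_int(
            "NEMOCLAW_CONTEXT_WINDOW", env["NEMOCLAW_CONTEXT_WINDOW"])
    if "NEMOCLAW_MAX_TOKENS" in env:
        overrides["maxTokens"] = parse_positive_int(
            "NEMOCLAW_MAX_TOKENS", env["NEMOCLAW_MAX_TOKENS"])
    if "NEMOCLAW_REASONING" in env:
        overrides["reasoning"] = parse_reasoning(env["NEMOCLAW_REASONING"])
    return overrides
```

Because each variable is looked up independently, the sketch mirrors the commit's point that these settings can be applied without also setting NEMOCLAW_MODEL_OVERRIDE.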
Make the root-path call-order regex use [\s\S]*? between function calls instead of \n\s*, so comments or blank lines between calls don't break the test. Matches the style used in the non-root assertion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
14 tests covering all runtime env var overrides:
- No-op baseline (config hash valid without overrides)
- Individual overrides: model, context window, max tokens, reasoning, CORS
- Combined overrides (all 5 at once)
- Validation rejections: control chars, non-integer, invalid reasoning, non-http CORS, invalid API type (6 cases)
- Config unchanged after rejected override

Each test runs as an independent short-lived container, so the suite is safe for parallel CI execution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keep the timeout configuration section added in main. Regenerate skill files after docs-to-skills.py ran during pre-commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full Onboard + Inference Test: PR #1633
Date: 2026-04-09 22:35–22:51 UTC
Phase 1: Install from PR source ✅
Phase 2: Onboard ✅
Phase 3: Sandbox verification ✅
Live config from sandbox:
Phase 4: Live inference ✅
Test 1:
Test 2:
Both inference calls went through NVIDIA Endpoints → OpenShell gateway proxy → sandbox → model → response.
Phase 5: Runtime entrypoint tests (in DinD container, not sandbox) ✅ 10/10
Ran
Phase 6: Model override in live sandbox
| Area | Status |
|---|---|
| Install from PR source | ✅ Clean build, no errors |
| Non-interactive onboard | ✅ Full onboard in 310s |
| Sandbox creation | ✅ Ready, Landlock+seccomp+netns |
| Live inference (NVIDIA API) | ✅ Two successful completions |
| Entrypoint security tests | ✅ 10/10 pass |
| Config hash integrity | ✅ Verified |
| Model override logic | ✅ Verified via isolated tests |
Verdict: PR #1633's runtime override feature works correctly. The security validation is thorough (input sanitization, symlink protection, root-only enforcement, allowlisted values). Inference through NVIDIA Endpoints is functional.
…ig injection

Upstream PR NVIDIA#1633 added env var support for runtime model overrides, making the openclaw.json injection hack unnecessary for model switching.
- auto-start now passes --skip-gemini to apply-custom-policies.sh
- apply-custom-policies.sh skips Steps 1+2 (Gemini inject + hash) when --skip-gemini is set; fetch-guard patch, device pairing, and skills still run (NemoClaw NVIDIA#1252 is still open)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Allow model and provider changes without rebuilding the sandbox image
- The entrypoint patches `openclaw.json` at startup when `NEMOCLAW_MODEL_OVERRIDE` is set, then recomputes the config hash so integrity checks still pass
- Same pattern as `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` (PR NVIDIA#1620)

### New env vars
| Env var | Purpose | When needed |
|---------|---------|-------------|
| `NEMOCLAW_MODEL_OVERRIDE` | Override `agents.defaults.model.primary` and provider model name | Any model switch |
| `NEMOCLAW_INFERENCE_API_OVERRIDE` | Override inference API type (`openai-completions` or `anthropic-messages`) | Cross-provider switches only |

### Usage example (NVIDIA → Anthropic)
```bash
# On host: configure gateway route
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6 --no-verify

# Set env vars for the sandbox (via openshell or Docker)
export NEMOCLAW_MODEL_OVERRIDE="anthropic/claude-sonnet-4-6"
export NEMOCLAW_INFERENCE_API_OVERRIDE="anthropic-messages"

# Restart sandbox; no image rebuild needed
```

### Security
- Env vars come from the host (Docker/OpenShell), not from inside the sandbox
- Config integrity is verified first (detects build-time tampering), then override is applied
- Config hash is recomputed after patching
- Landlock locks the file after this function runs
- Agent cannot set these env vars

## Related Issue
Closes NVIDIA#759

## Test plan
- [ ] `npm test` passes (39 tests in nemoclaw-start.test.js)
- [ ] Set `NEMOCLAW_MODEL_OVERRIDE` → sandbox starts with overridden model
- [ ] Unset env var → sandbox starts with original baked model (no regression)
- [ ] Set invalid model → sandbox starts but inference fails (expected)
- [ ] Config hash passes integrity check on restart after override

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit
* **New Features**
  * Env vars to override the active model and (optionally) the inference API at container startup; when used in privileged startup, the runtime config is updated and its integrity hash recomputed so startup verification aligns.
* **Runtime safeguards**
  * Input validation, API allowlist, symlink protections, applies only in privileged mode, and no-op behavior when unset.
* **Tests**
  * New unit and end-to-end tests covering override behavior, timing, hash recomputation, validation, and noop cases.
* **Documentation**
  * Guidance added for cross-provider switching and the runtime override workflow.

---------
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>