Cosmos3 — smoke-test runner

The canonical reference for Cosmos3OmniPipeline lives in the diffusers docs: docs/source/en/api/pipelines/cosmos3.md. Use the examples there as the source of truth for application code — they cover text-to-image, text-to-video, image-to-video, and text+sound modes.

This directory provides a small CLI wrapper (inference_cosmos3.py) that exercises the full load → encode → denoise → decode path against either the Hub release or a local checkpoint during development.

Setup

pip install -r examples/cosmos3/requirements.txt

Usage

Text-to-image:

python examples/cosmos3/inference_cosmos3.py \
    --prompt "A medium shot of a modern robotics research laboratory…" \
    --num-frames 1

Text-to-video:

python examples/cosmos3/inference_cosmos3.py \
    --prompt "A waterfall cascading down a rocky cliff in a lush forest."

Image-to-video:

python examples/cosmos3/inference_cosmos3.py \
    --prompt "The right robotic hand picks up the red sphere…" \
    --vision-path https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/assets/robot_153.jpg

Text-to-video with sound (sound-capable checkpoint only):

python examples/cosmos3/inference_cosmos3.py \
    --prompt "A waterfall in a lush forest." \
    --enable-sound

Action forward dynamics, robot domain (predict video from an observation video and a provided action chunk):

python examples/cosmos3/inference_cosmos3.py \
    --model nano \
    --prompt "Put the pot to the left of the purple item." \
    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
    --action-mode forward_dynamics \
    --action-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.json" \
    --action-chunk-size 16 \
    --domain-name bridge_orig_lerobot \
    --resolution-tier 480 --fps 5 \
    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
    --output results/cosmos3_forward_dynamics_robot

Action forward dynamics, autonomous-vehicle domain:

python examples/cosmos3/inference_cosmos3.py \
    --model nano \
    --prompt "You are an autonomous vehicle planning system." \
    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
    --action-mode forward_dynamics \
    --action-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_action_25.json" \
    --action-chunk-size 60 \
    --domain-name av \
    --resolution-tier 480 --fps 10 \
    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
    --output results/cosmos3_forward_dynamics_av

Action inverse dynamics, robot domain (predict actions from an observed video):

python examples/cosmos3/inference_cosmos3.py \
    --model nano \
    --prompt "Put the pot to the left of the purple item." \
    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
    --action-mode inverse_dynamics \
    --action-chunk-size 16 \
    --domain-name bridge_orig_lerobot \
    --resolution-tier 480 --fps 5 \
    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
    --output results/cosmos3_inverse_dynamics_robot

Action inverse dynamics, autonomous-vehicle domain:

python examples/cosmos3/inference_cosmos3.py \
    --model nano \
    --prompt "You are an autonomous vehicle planning system." \
    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
    --action-mode inverse_dynamics \
    --action-chunk-size 60 \
    --domain-name av \
    --resolution-tier 480 --fps 10 \
    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
    --output results/cosmos3_inverse_dynamics_av

Action policy, robot domain (predict both future video and actions from the first observation frame):

python examples/cosmos3/inference_cosmos3.py \
    --model nano \
    --prompt "Put the pot to the left of the purple item." \
    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
    --action-mode policy \
    --action-chunk-size 16 \
    --domain-name bridge_orig_lerobot \
    --resolution-tier 480 --fps 5 \
    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
    --output results/cosmos3_policy_robot

Action policy, autonomous-vehicle domain:

python examples/cosmos3/inference_cosmos3.py \
    --model nano \
    --prompt "You are an autonomous vehicle planning system. Please go backward." \
    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
    --action-mode policy \
    --action-chunk-size 60 \
    --domain-name av \
    --resolution-tier 480 --fps 10 \
    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
    --output results/cosmos3_policy_av
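
The six action commands above share most of their flags and differ only in mode, domain, prompt, inputs, chunk size, and frame rate. As a minimal sketch, the argument list for one such run could be assembled in Python as shown here. It uses only flags that appear in the commands above; the helper `build_action_command` and the `DOMAIN_PRESETS` grouping are hypothetical conveniences, not part of `inference_cosmos3.py`.

```python
import shlex

SCRIPT = "examples/cosmos3/inference_cosmos3.py"

# Chunk size and FPS per domain, copied from the example commands above.
# The dictionary itself is a hypothetical grouping, not something the script defines.
DOMAIN_PRESETS = {
    "bridge_orig_lerobot": {"chunk_size": 16, "fps": 5},
    "av": {"chunk_size": 60, "fps": 10},
}

ACTION_MODES = ("forward_dynamics", "inverse_dynamics", "policy")


def build_action_command(mode, domain, prompt, vision_path, output, action_path=None):
    """Return the argv list for one action-mode run of the smoke-test script."""
    if mode not in ACTION_MODES:
        raise ValueError(f"unknown action mode: {mode}")
    if mode == "forward_dynamics" and action_path is None:
        raise ValueError("forward_dynamics needs an action path")
    preset = DOMAIN_PRESETS[domain]
    argv = [
        "python", SCRIPT,
        "--model", "nano",
        "--prompt", prompt,
        "--vision-path", vision_path,
        "--action-mode", mode,
        "--action-chunk-size", str(preset["chunk_size"]),
        "--domain-name", domain,
        "--resolution-tier", "480",
        "--fps", str(preset["fps"]),
        "--num-inference-steps", "30",
        "--guidance-scale", "1.0",
        "--flow-shift", "10.0",
        "--seed", "0",
        "--output", output,
    ]
    # Only forward_dynamics consumes a provided action chunk.
    if mode == "forward_dynamics":
        argv += ["--action-path", action_path]
    return argv


if __name__ == "__main__":
    cmd = build_action_command(
        "policy",
        "av",
        "You are an autonomous vehicle planning system.",
        "path/to/observation.mp4",  # placeholder input path
        "results/cosmos3_policy_av",
    )
    # Print a shell-quoted command line; `cmd` can also go to subprocess.run.
    print(shlex.join(cmd))
```

Building the list rather than a single string keeps prompts with spaces or quotes intact when the command is passed to `subprocess.run`.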

Action modes use action_chunk_size + 1 conditioning frames. forward_dynamics consumes --action-path; inverse_dynamics and policy write predicted actions to sample_action.json in model-normalized action space. This script loads --vision-path as a video for all action modes; policy and forward_dynamics condition only on the first frame, while inverse_dynamics uses the whole video.
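
The frame bookkeeping in the paragraph above can be written down as a short Python sketch. It reflects one reading of that paragraph: every action clip spans `action_chunk_size + 1` frames, and the mode decides how many of them are observed. The function names are hypothetical and do not come from the script.

```python
# Modes grouped by how much of the input video they condition on,
# following the description above.
FIRST_FRAME_MODES = {"forward_dynamics", "policy"}
WHOLE_VIDEO_MODES = {"inverse_dynamics"}

# Modes that write predicted actions to sample_action.json.
ACTION_WRITING_MODES = {"inverse_dynamics", "policy"}


def clip_frames(action_chunk_size: int) -> int:
    """Number of video frames an action run spans."""
    return action_chunk_size + 1


def observed_frames(mode: str, action_chunk_size: int) -> int:
    """Number of input frames the model conditions on for a given mode."""
    if mode in FIRST_FRAME_MODES:
        return 1
    if mode in WHOLE_VIDEO_MODES:
        return clip_frames(action_chunk_size)
    raise ValueError(f"unknown action mode: {mode}")


def writes_actions(mode: str) -> bool:
    """True when the run is expected to produce sample_action.json."""
    return mode in ACTION_WRITING_MODES


if __name__ == "__main__":
    for mode in ("forward_dynamics", "inverse_dynamics", "policy"):
        print(mode, clip_frames(16), observed_frames(mode, 16), writes_actions(mode))
```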

Pass --prompt as a plain task description and select the camera perspective with --view-point (default ego_view); the pipeline builds the structured action caption (task, viewpoint, duration, FPS, resolution) the model was trained on. Do not hand-write the viewpoint sentence into --prompt.

--resolution-tier selects one of four resolution tiers (256, 480, 704, or 720). Each tier keys a table of predefined aspect-ratio canvases, and the canvas closest to the input aspect ratio becomes the padded conditioning canvas. The tier is not the output frame size: the input is downscaled (never upscaled) and padded to fill the canvas, then the padding is cropped from the latents so that the decoded output follows the downscaled input content. --height, --width, and --num-frames are ignored for action modes.

Pick the tier that matches the native resolution of your conditioning input (480 for ~480p, 720 for ~720p). A tier below your input downscales it and discards detail. A tier above your input gains no resolution (content is never upscaled), wastes compute on padding, and introduces a train/inference distribution mismatch that can degrade quality.
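
To make the tier behavior concrete, here is a rough Python sketch of canvas selection and downscale-only fitting. The canvas sizes in `CANVASES` are invented placeholders, and measuring "closest" by absolute difference in aspect ratio is an assumption; the real table and distance measure live in the pipeline and may differ.

```python
# HYPOTHETICAL (width, height) canvases per tier, for illustration only.
CANVASES = {
    480: [(640, 480), (832, 480), (480, 640), (480, 832), (480, 480)],
    720: [(1280, 720), (960, 720), (720, 1280), (720, 960), (720, 720)],
}


def pick_canvas(tier, width, height):
    """Choose the canvas whose aspect ratio is closest to the input's."""
    aspect = width / height
    return min(CANVASES[tier], key=lambda wh: abs(wh[0] / wh[1] - aspect))


def fit_to_canvas(tier, width, height):
    """Downscale (never upscale) the input to fit the canvas and report padding."""
    canvas_w, canvas_h = pick_canvas(tier, width, height)
    # Capping the scale at 1.0 is what prevents upscaling.
    scale = min(1.0, canvas_w / width, canvas_h / height)
    content_w, content_h = round(width * scale), round(height * scale)
    return {
        "canvas": (canvas_w, canvas_h),
        "content": (content_w, content_h),
        "padding": (canvas_w - content_w, canvas_h - content_h),
    }


if __name__ == "__main__":
    print(fit_to_canvas(480, 640, 480))   # input matches the tier
    print(fit_to_canvas(480, 1280, 720))  # tier below the input: downscaled
    print(fit_to_canvas(480, 320, 240))   # tier above the input: padded, not upscaled
```

In the third case the content keeps its original size and the rest of the canvas is padding, which is the wasted compute described above.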

Useful flags

| Flag | Default | Description |
| --- | --- | --- |
| `--prompt` | (required) | Text prompt. |
| `--vision-path` | `None` | URL or local path for an image-conditioning frame (image-to-video), or the image/video conditioning for action modes. |
| `--num-frames` | `189` | `1` = image, otherwise number of video frames (`189` ≈ 7.9 s @ 24 FPS). Ignored for action modes (derived from `--action-chunk-size`). |
| `--height` / `--width` | `720` / `1280` | Output resolution (must be a multiple of the VAE spatial scale factor). Ignored for action modes; use `--resolution-tier`. |
| `--resolution-tier` | `480` | Action resolution tier (`256`/`480`/`704`/`720`): selects the aspect bin / padded conditioning canvas, not the output size. |
| `--fps` | `24.0` | Frame rate of the generated video. |
| `--flow-shift` | `None` | Override `UniPCMultistepScheduler.flow_shift` (and force `use_karras_sigmas=False`); left at the checkpoint default when unset. Cosmos3 runs use `10.0`. |
| `--enable-sound` | off | Generate a synchronized audio track. |
| `--action-mode` | `None` | Enable action conditioning/generation. One of `forward_dynamics`, `inverse_dynamics`, or `policy`. |
| `--action-path` | `None` | URL or local JSON action path for `forward_dynamics`. |
| `--action-chunk-size` | `None` | Number of action tokens. Action runs generate/use `action_chunk_size + 1` video frames. |
| `--domain-name` | `None` | Action embodiment domain, for example `bridge_orig_lerobot` or `av`. |
| `--view-point` | `ego_view` | Camera perspective for the action caption's framing (`ego_view`, `third_person_view`, `wrist_view`, `concat_view`). Action only. |
| `--no-duration-template` | off | Skip the duration metadata sentence appended to the prompt and negative prompt. Ignored for `--num-frames 1` and for action modes (which build a structured caption instead). |
| `--no-resolution-template` | off | Skip the resolution metadata sentence appended to the prompt and negative prompt. Ignored for action modes. |
| `--output` | `.` | Directory to write `sample.jpg` or `sample.mp4`. |
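
Two of the constraints in the table are simple arithmetic: clip duration is frames divided by frame rate, and `--height` / `--width` must be multiples of the VAE spatial scale factor. The Python sketch below shows both. The helper names are hypothetical, and the scale factor of `16` in the example is a placeholder assumption; check the loaded checkpoint for the real value.

```python
def duration_seconds(num_frames: int, fps: float = 24.0) -> float:
    """Clip length in seconds for a given frame count and frame rate."""
    return num_frames / fps


def is_valid_size(height: int, width: int, spatial_scale: int) -> bool:
    """True when both dimensions are multiples of the spatial scale factor."""
    return height % spatial_scale == 0 and width % spatial_scale == 0


def snap_down(value: int, spatial_scale: int) -> int:
    """Largest multiple of the scale factor that does not exceed `value`."""
    return (value // spatial_scale) * spatial_scale


if __name__ == "__main__":
    ASSUMED_SCALE = 16  # placeholder, not read from any checkpoint
    print(round(duration_seconds(189), 1))         # default frame count at 24 FPS
    print(is_valid_size(720, 1280, ASSUMED_SCALE))
    print(snap_down(725, ASSUMED_SCALE))
```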
