
Get Started With NVIDIA Cosmos 3: The Open Physical AI World Model (2026)

What you'll learn
  • Understand the Mixture-of-Transformers architecture behind Cosmos 3 and why it is not a language model
  • Choose between Nano and Super based on your hardware and use case
  • Run Cosmos 3 Nano using Diffusers, vLLM-Omni, or SGLang
  • Evaluate Cosmos 3's synthetic robot training capabilities against its documented limitations

NVIDIA Cosmos 3 is an open-weight physical AI world model — not a language model — released May 31, 2026. Weights for Nano (16B) and Super (64B) are live on HuggingFace under the OpenMDW 1.1 license. The fastest run path is the Diffusers Cosmos3OmniPipeline. Nano requires at least 96GB VRAM (RTX PRO 6000 class or equivalent multi-GPU); Super requires H100/H200/B200 datacenter hardware.

The most important thing to understand before you install anything: Cosmos 3 is not a chatbot. If you are expecting a smarter GPT, you are looking at the wrong model. The right mental model is Stable Diffusion meets a robotics simulator — a system that generates physically plausible video frames and robot action trajectories, not answers to questions. Treat it like that and it is genuinely powerful. Treat it like Claude and you will be disappointed.

What Cosmos 3 Actually Is

Cosmos 3 uses a Mixture-of-Transformers (MoT) architecture with two specialized towers operating in tandem, per NVIDIA's technical blog:

  • Reasoner Tower — an autoregressive vision-language model that interprets multimodal inputs and builds a physical-world understanding
  • Generator Tower — a diffusion-based system that produces future video frames and robot action sequences conditioned on the Reasoner's output

Together they create a model that can natively handle five modalities: text, images, video, ambient sound, and robot action trajectories. The NVIDIA press release describes it as "a vision language model, world model, and world action model backbone" — three jobs in one.

This was released alongside Nemotron 3 Ultra as part of NVIDIA's "open-source week," including weights, code, datasets, and fine-tuning recipes — per HPC Wire.

Nano vs Super: Which Can You Actually Run?

| | Cosmos 3 Nano | Cosmos 3 Super |
|---|---|---|
| Parameters | 16B | 64B |
| HuggingFace | nvidia/Cosmos3-Nano | nvidia/Cosmos3-Super |
| Precision | BF16 only | BF16 only |
| GPU Architecture | Ampere, Hopper, Blackwell | Hopper, Blackwell only |
| Practical hardware floor | RTX PRO 6000 (96GB VRAM) | H100 / H200 / B200 |
| Use case | Fast inference, real-time robotics | Highest-quality synthetic data |

The Nano's "Ampere support" framing in the official docs is technically true but practically optimistic for consumer hardware. 96GB VRAM means an RTX PRO 6000 workstation GPU or a multi-GPU setup — not a gaming card. Super is a datacenter-only model.
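A rough back-of-the-envelope check shows why consumer cards are out of reach. The sketch below counts weight memory only, assuming 2 bytes per parameter for BF16 and decimal gigabytes; the helper name `weight_memory_gb` is illustrative. Activations, video latents, and framework overhead come on top of this, which is presumably part of why the documented floor for Nano sits well above the raw weight size.

```python
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each parameter in 16 bits


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


nano_gb = weight_memory_gb(16e9)    # Nano: 16B parameters
super_gb = weight_memory_gb(64e9)   # Super: 64B parameters

print(f"Nano weights:  {nano_gb:.0f} GB")
print(f"Super weights: {super_gb:.0f} GB")
print("Nano weights fit in 24 GB:", nano_gb <= 24)
```

Even before any working memory is counted, Nano's weights alone are larger than a 24GB gaming card, and Super's weights alone are larger than Nano's entire 96GB floor.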

The HuggingFace collection also includes specialty variants: Cosmos3-Super-Text2Image, Cosmos3-Super-Image2Video, and Cosmos3-Nano-Policy-DROID (a pre-finetuned robot manipulation policy on the DROID dataset). A Cosmos 3 Edge variant for real-time inference is listed as coming soon.

Running Cosmos 3: The Three Official Paths

The NVIDIA Cosmos GitHub repo documents three supported inference paths. There is no official Ollama integration as of June 2026 — skip any third-party Ollama forks for a model this new.

Path 1 — Diffusers

The easiest on-ramp. Install with uv for clean Python 3.13 isolation:

```bash
uv venv --python 3.13 --seed --managed-python
uv pip install --torch-backend=auto diffusers accelerate torch torchvision transformers
uvx hf@latest auth login
```

Then run Nano:

```python
import torch
from diffusers import Cosmos3OmniPipeline

# Load the Nano checkpoint in BF16, the only supported precision
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    enable_safety_checker=True,
)

# Generate a 189-frame clip at 1280x720
result = pipe(
    prompt="Robot arm picks up red cube from table",
    num_frames=189,
    height=720,
    width=1280,
    num_inference_steps=35,
    guidance_scale=6.0,
)

# Save the first generated clip
result.frames[0].save("cosmos_output.mp4")
```

Expected output: An MP4 video showing a physically plausible simulation of the described scene, 189 frames at 24fps (~7.9 seconds). Generation time on an H100 is approximately 2-4 minutes for this configuration.
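The clip length quoted above follows directly from the frame count and frame rate. Here is a small sketch of that arithmetic, useful when you change `num_frames`; `clip_seconds` is an illustrative helper name, and 24 fps is taken from the expected output described above.

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Duration of a clip in seconds for a given frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


duration = clip_seconds(num_frames=189, fps=24)
print(f"{duration:.3f} s")  # 7.875 s
```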

Path 2 — vLLM-Omni (Production API Server)

Exposes an OpenAI-compatible endpoint — useful for integrating Cosmos 3 into existing pipelines:

```bash
docker pull vllm/vllm-omni:cosmos3

vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --host 0.0.0.0 \
  --port 8000 \
  --init-timeout 1800
```

The --init-timeout 1800 flag is required — Cosmos 3 checkpoints exceed the default server init timeout. The API is available at localhost:8000/v1/videos/sync.
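A client call against that endpoint might look like the following sketch. Only the URL path comes from the text above; the JSON field names (`model`, `prompt`, `num_frames`, `height`, `width`) are assumptions that mirror the Diffusers arguments, so check the real request schema against the vLLM-Omni documentation before relying on it. The helper `build_video_request` is hypothetical and only builds the request; sending it requires a running server.

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/videos/sync"  # path taken from the text above


def build_video_request(prompt: str, num_frames: int = 189,
                        height: int = 720, width: int = 1280) -> urllib.request.Request:
    """Build (but do not send) a POST request for the video endpoint.

    The payload field names are ASSUMED, not taken from official docs.
    """
    payload = {
        "model": "nvidia/Cosmos3-Nano",
        "prompt": prompt,
        "num_frames": num_frames,
        "height": height,
        "width": width,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_video_request("Robot arm picks up red cube from table")

# To actually send it (needs the server started with the command above):
# with urllib.request.urlopen(request, timeout=600) as response:
#     body = response.read()
```

Keeping request construction separate from sending makes the payload easy to inspect and adjust once the actual schema is confirmed.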

Path 3 — SGLang

Minimal setup for inference serving:

```bash
sglang serve --model-path nvidia/Cosmos3-Nano
```

SGLang is a good option if you are already using it for other models and want consistent tooling across your stack.

Synthetic Robot Training: The Real Use Case

The "months to days" claim in Axios's Cosmos 3 coverage is NVIDIA-sourced, not independently benchmarked. Treat it as directional, not a guaranteed reduction.

That said, the mechanism is legitimate. Cosmos 3 generates synthetic robot training data — video frames plus action trajectories — that can substitute for expensive real-world data collection. It supports four distinct task types:

  • Forward dynamics: Given video context + action, predict the next state
  • Inverse dynamics: Given before/after video, infer what action was taken
  • Policy generation: Given video + goal, output robot action trajectories as JSON
  • Synthetic dataset creation: NVIDIA released six datasets covering embodied robots, physical interactions, warehouse operations, and autonomous driving — all on HuggingFace
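To make the first two task types concrete, here is a toy sketch on a one-dimensional point mass. It says nothing about Cosmos 3's internals, which operate on video rather than hand-written equations; the state layout, time step, and function names are all illustrative assumptions. Forward dynamics maps a state and an action to the next state, and inverse dynamics recovers the action from a before/after pair.

```python
from typing import NamedTuple

DT = 0.1  # assumed simulation time step in seconds


class State(NamedTuple):
    position: float
    velocity: float


def forward_dynamics(state: State, acceleration: float) -> State:
    """Predict the next state from the current state and an action (acceleration)."""
    velocity = state.velocity + acceleration * DT
    position = state.position + velocity * DT
    return State(position, velocity)


def inverse_dynamics(before: State, after: State) -> float:
    """Infer the action (acceleration) that explains the change between two states."""
    return (after.velocity - before.velocity) / DT


start = State(position=0.0, velocity=1.0)
next_state = forward_dynamics(start, acceleration=2.0)
recovered = inverse_dynamics(start, next_state)
print(next_state, recovered)
```

In this toy the two functions are exact inverses up to floating-point error. A learned world model has to approximate both mappings from pixels, which is where the documented limitations come into play.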

The training corpus for Cosmos 3 itself comprises 1.3B data points across 393 datasets from 2024–2026, including public sources (Coyo700M, OpenImage, YouTube) and private robotics and AV data.

Artificial Analysis independently confirmed Cosmos 3 achieved #1 among open-weight models on text-to-image and image-to-video leaderboards — per the Latent Space AI News roundup. On physical AI benchmarks (Physics-IQ, PAI-Bench, RoboArena, VANTAGE-Bench), the claims come from NVIDIA's own technical blog. Independent replication of those physical AI benchmarks is still early.

The Cosmos Coalition

Six companies are founding members of the Cosmos Coalition — an open physical AI ecosystem built around the Cosmos platform:

  • Agile Robots — humanoid robotics
  • Black Forest Labs — image generation (FLUX models)
  • Generalist — embodied AI
  • LTX — video generation
  • Runway — video AI (notable given they also compete in this space)
  • Skild AI — robot foundation models

Broader launch partners include Doosan Robotics, LG Electronics, Samsung Electronics, Li Auto (autonomous vehicles), and several vision AI companies. The Coalition framing mirrors what HuggingFace did for language models — a shared infrastructure layer owned by no single vendor.

Honest Caveats

The Cosmos 3 model card documents these limitations directly:

  • Temporal inconsistencies: motion can be unstable; object and camera jitter is documented
  • Physics gaps: no explicit physics simulation — objects may disappear, morph, or collide unrealistically despite the "physical AI" branding
  • Long-horizon degradation: quality degrades with longer video outputs
  • Hallucinations on spatial geometry: can misinterpret causal relationships and depth
  • Not certified for safety-critical use: autonomous systems and robotics control require additional validation beyond Cosmos 3 outputs

The bottom line: Cosmos 3 is a world model for training data generation, not for direct deployment in a production robot. Use it to create diverse synthetic scenarios that you then validate with real-world data before any safety-critical application.


Knowledge Check: You want to fine-tune a manipulation policy using synthetic data from Cosmos 3 Nano, but you only have a consumer RTX 4090 (24GB VRAM). What should you do?

A) Download Nano and run with --quantize int4 to fit in 24GB
B) Use the hosted API at build.nvidia.com instead of local inference
C) Switch to Cosmos 3 Super, which has lower VRAM requirements
D) Use the --cpu-offload flag to spill to system RAM

Answer: B. Cosmos 3 Nano requires ~96GB VRAM for BF16 inference, and BF16 is the only officially supported precision, so a 24GB consumer GPU has no documented way to run it. The hosted API at build.nvidia.com gives you access without local hardware. Option A is tempting, but quantization is not supported by the official documentation; Option C is wrong (Super needs more VRAM, not less); Option D is not documented in official sources.



References

  1. nvidianews.nvidia.com
  2. developer.nvidia.com
  3. huggingface.co
  4. huggingface.co
  5. github.com
  6. huggingface.co
  7. www.axios.com
  8. www.latent.space
  9. www.hpcwire.com