NVIDIA SANA-WM: Free Open-Source AI That Turns One Photo Into a Full Minute of 720p Video

On May 16, 2026, NVIDIA released SANA-WM, a 2.6-billion-parameter open-source AI model that takes a single image and a camera path as input and generates a full 60 seconds of 720p video on a single GPU.

It's free. It's open-source. And it puts serious video world modeling in the hands of anyone with a capable GPU.

This is one of the most significant open-source releases in AI video so far this year, and it reached #1 on Hacker News this morning.

Here's everything you need to know.

What Is SANA-WM?

SANA-WM stands for SANA World Model. It's an AI that creates video by modeling a 3D world — rather than just hallucinating pixels frame-by-frame.

You give it:

A single starting image (a photo, a render, a still frame)
A camera trajectory — instructions for how the virtual camera should move through the scene

SANA-WM then synthesizes a realistic, continuous 60-second video of that world, with the camera following the path you defined.

The output: 720p, 60 seconds, controllable, photorealistic video.

It was built by researchers at NVIDIA and trained on approximately 213,000 public video clips with precise camera pose data. The entire training run took 15 days on 64 H100 GPUs. Inference, however, runs on a single GPU.
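
To put the training figures in perspective, the short calculation below converts them to GPU-hours. The day and GPU counts come from this article; the dollar range reuses the $2–4/hour rental price quoted later in the article and is only a rough illustration, not NVIDIA's actual cost.

```python
# Figures from the article: 15 days of training on 64 H100 GPUs.
TRAINING_DAYS = 15
NUM_GPUS = 64
HOURS_PER_DAY = 24

gpu_hours = TRAINING_DAYS * HOURS_PER_DAY * NUM_GPUS
print(f"Training budget: {gpu_hours:,} H100 GPU-hours")

# Assumption: public rental prices of $2-4 per H100-hour, used only for scale.
low_usd, high_usd = gpu_hours * 2, gpu_hours * 4
print(f"Rough rental-price equivalent: ${low_usd:,} to ${high_usd:,}")
```

That works out to 23,040 GPU-hours, which is small by the standards of large video models and helps explain why the release is notable.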

How Is This Different From Regular AI Video Generators?

Most AI video tools you've seen — Runway, Kling, Pika, Sora — are text-to-video or image-to-video generators. They're good at creating short, cinematic clips, but they don't give you precise control over camera movement, and they struggle with consistency over longer durations.

SANA-WM is a world model. The difference:

Feature	Standard AI Video (Runway, Kling)	SANA-WM World Model
Duration	3–10 seconds typically	Up to 60 seconds
Camera control	Limited or none	Full 6-DoF trajectory
Input	Text prompt or image	Image + camera path
Output quality	High (polished)	720p, high-fidelity
Cost to try	Subscription required	Free, open-source
GPU requirement	Cloud-only	Single H100 or RTX 5090

6-DoF means six degrees of freedom: the camera can move forward/backward, left/right, up/down, and rotate on three axes — pitch, yaw, and roll. This is the kind of precise camera control you'd expect from a game engine or a VFX suite, now available in an open-source AI model.
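
As a rough illustration of what a 6-DoF pose looks like in code, the sketch below packs three translations and three rotations into a 4x4 camera matrix, the representation most 3D tools use. The class name, axis conventions, and rotation order (yaw, then pitch, then roll) are assumptions chosen for the example; this article does not specify SANA-WM's actual input format.

```python
import math
from dataclasses import dataclass

@dataclass
class CameraPose:
    """Hypothetical 6-DoF pose: three translations plus three rotations."""
    x: float = 0.0      # left/right
    y: float = 0.0      # up/down
    z: float = 0.0      # forward/backward
    pitch: float = 0.0  # degrees, rotation about the X axis
    yaw: float = 0.0    # degrees, rotation about the Y axis
    roll: float = 0.0   # degrees, rotation about the Z axis

def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def pose_to_matrix(pose):
    """Build a 4x4 matrix as R = Ry(yaw) * Rx(pitch) * Rz(roll), plus translation."""
    p, y, r = (math.radians(a) for a in (pose.pitch, pose.yaw, pose.roll))
    rx = [[1, 0, 0], [0, math.cos(p), -math.sin(p)], [0, math.sin(p), math.cos(p)]]
    ry = [[math.cos(y), 0, math.sin(y)], [0, 1, 0], [-math.sin(y), 0, math.cos(y)]]
    rz = [[math.cos(r), -math.sin(r), 0], [math.sin(r), math.cos(r), 0], [0, 0, 1]]
    rot = matmul3(ry, matmul3(rx, rz))
    return [rot[0] + [pose.x], rot[1] + [pose.y], rot[2] + [pose.z],
            [0.0, 0.0, 0.0, 1.0]]

# Example: step one unit to the right while turning 90 degrees (yaw).
for row in pose_to_matrix(CameraPose(x=1.0, yaw=90.0)):
    print([round(v, 3) for v in row])
```

A camera path is then just a sequence of such poses, one per frame or keyframe.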

What Does "World Model" Mean for Beginners?

A world model is an AI that builds an internal representation of 3D space — it understands depth, physics, and spatial relationships — and then generates what the camera would see as it moves through that space.

Think of it this way: instead of painting one picture after another, SANA-WM imagines the room and then shows you what you'd see if you walked through it.

This is what makes it useful for:

3D scene exploration — fly through an environment from a single photo
Virtual cinematic shots — pan, dolly, orbit, push-in, pull-out
Game asset previews — visualize how a concept art piece looks from different angles
AI film production — create establishing shots or environment fly-throughs without a camera crew

The Hardware Reality

This is the part beginners need to know upfront.

What you need to run SANA-WM:

Minimum for base model: A single H100 GPU (80GB VRAM)
Distilled variant: Runs on an RTX 5090 with NVFP4 quantization — generates a 60-second 720p video in about 34 seconds

The H100 requirement puts this out of reach for most consumer setups today. The RTX 5090 distilled variant is more accessible, but still not cheap hardware.
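
The 34-second figure is easier to appreciate as a ratio. The quick calculation below uses only the two numbers quoted above; the clips-per-hour estimate ignores model loading and file saving, so treat it as an upper bound.

```python
# Figures from the article: a 60-second clip generated in about 34 seconds
# (distilled variant, RTX 5090, NVFP4 quantization).
VIDEO_SECONDS = 60
GENERATION_SECONDS = 34

realtime_factor = VIDEO_SECONDS / GENERATION_SECONDS
clips_per_hour = 3600 // GENERATION_SECONDS  # ignores load and save time

print(f"Generation runs about {realtime_factor:.2f}x faster than real time")
print(f"Upper bound: {clips_per_hour} one-minute clips per hour")
```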

What this means in practice:

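If you have an RTX 5090, the quantized, distilled variant is the one to try. Other high-end consumer cards (RTX 4090, RTX 5080) are less certain: NVFP4 is a 4-bit format introduced with NVIDIA's Blackwell generation, so an RTX 4090 would need a different precision, and both cards have less VRAM than the RTX 5090. If you're on a mid-range GPU, cloud inference via Hugging Face Spaces or Google Colab (A100/H100 tier) is the practical way to experiment.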

If you're not ready to run this locally, the guide to checking your VRAM for AI will help you understand what your GPU can handle.
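
For a quick local check, the sketch below asks nvidia-smi (the command-line tool that ships with NVIDIA drivers) for each GPU's name and total memory. The query flags are standard nvidia-smi options; the helper function names are made up for this example, and the script simply reports that no GPU was found if the tool is missing.

```python
import shutil
import subprocess

# Standard nvidia-smi query: GPU name and total memory as bare CSV values (MiB).
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_csv(text):
    """Turn lines like 'NVIDIA H100 80GB HBM3, 81559' into (name, GiB) pairs."""
    gpus = []
    for line in text.splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        name, mem_mib = line.rsplit(",", 1)
        try:
            gpus.append((name.strip(), int(mem_mib.strip()) / 1024))
        except ValueError:
            continue  # e.g. "[N/A]" when a device does not report memory
    return gpus

def list_gpus():
    """Return [] when nvidia-smi is missing or fails, else the parsed GPUs."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []

gpus = list_gpus()
if not gpus:
    print("No NVIDIA GPU detected; a cloud GPU is the practical route.")
for name, vram_gib in gpus:
    print(f"{name}: {vram_gib:.1f} GiB VRAM")
```

Note that the reported total is usually slightly below the marketing number (an "80GB" card shows a bit under 80 GiB), so compare with some tolerance.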

What Makes SANA-WM Technically Impressive

For readers who want the deeper technical picture (non-devs can skip to the next section):

SANA-WM solves two core problems that limited previous open-source world models:

1. Long video generation without running out of memory

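Most transformer-based video models hit memory limits past ~10 seconds, because the cost of standard softmax attention grows quadratically with the number of tokens and the keys and values it must keep around grow with every frame. SANA-WM uses Hybrid Linear Attention, a combination of frame-wise Gated DeltaNet and periodic softmax attention. This lets the model hold a coherent world state for a full minute without attention memory growing out of control.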
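
To make the memory argument concrete, here is a toy, token-level sketch of the gated delta rule as described in the Gated DeltaNet literature. It is not SANA-WM's frame-wise implementation, and the function names and tiny dimensions are assumptions for illustration. The point it shows is that the memory is a fixed-size matrix that gets updated in place, instead of a list of past keys and values that grows with the video.

```python
def gated_delta_step(state, k, v, alpha, beta):
    """One gated delta rule update on a fixed-size memory.

    state: d_v x d_k matrix (nested lists); k: key; v: value.
    alpha in [0, 1] decays old memory; beta in [0, 1] is the write strength.
    Equivalent to S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T.
    """
    decayed = [[alpha * s for s in row] for row in state]
    # What the decayed memory currently returns for this key.
    predicted = [sum(s * kj for s, kj in zip(row, k)) for row in decayed]
    # Delta rule: write only the error between the target value and the prediction.
    return [[s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
            for row, vi, pi in zip(decayed, v, predicted)]

def read(state, q):
    """Query the memory: o = S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in state]

state = [[0.0, 0.0], [0.0, 0.0]]  # 2 x 2 memory, never grows
key = [1.0, 0.0]

state = gated_delta_step(state, key, [2.0, 3.0], alpha=1.0, beta=1.0)
print(read(state, key))  # the value just written

# Writing a new value under the same key replaces the old one.
state = gated_delta_step(state, key, [5.0, 7.0], alpha=1.0, beta=1.0)
print(read(state, key))

# After many more steps the memory is still the same size.
for _ in range(1000):
    state = gated_delta_step(state, [0.0, 1.0], [1.0, 1.0], alpha=0.9, beta=0.5)
print(len(state), len(state[0]))
```

Softmax attention, by contrast, keeps every past key and value, which is presumably why the hybrid design uses it only periodically.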

2. Precise camera control without expensive annotation

Getting precise 6-DoF camera data from public video is hard — most videos don't come with metadata about how the camera moved. NVIDIA built a custom annotation pipeline that extracts metric-scale camera poses from raw public video clips, giving SANA-WM high-quality training data without requiring a proprietary dataset.

There's also a two-stage pipeline: the base 2.6B model generates the long rollout, and a separate 17B long-video refiner sharpens texture, motion, and consistency in the final output.

How to Try SANA-WM

The model and code are open-source. Here's how to get started:

Option 1: Official GitHub + Hugging Face

The project page is at nvlabs.github.io/Sana/WM/. The code and model weights will be available on GitHub (NVIDIA's NVlabs organization) and on Hugging Face. Follow the repository for the inference script — NVIDIA typically provides a straightforward gradio demo for their open-source models.

Option 2: Hugging Face Spaces Demo

NVIDIA usually deploys a hosted Space for their open-source releases. Check Hugging Face Spaces for "SANA-WM" — a zero-setup browser demo typically appears within days of the paper release, letting you test the model without any local hardware.

Option 3: RunPod or Vast.ai (Cloud GPU Rental)

If you want to run the full model without owning an H100, cloud GPU rentals are practical. RunPod and Vast.ai bill by usage; expect roughly $2–4/hour for H100 access. A few test generations can cost under $1 in GPU time, though time spent on setup and model downloads is billed as well.
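
The "under $1" estimate is easy to sanity-check. The sketch below uses only the $2–4/hour range quoted above and works out how many billed minutes a $1 budget buys; the function name is made up for the example, and real invoices may add storage or data transfer fees.

```python
def rental_cost(minutes_billed, hourly_rate_usd):
    """Cost in USD of renting a GPU for the given number of billed minutes."""
    return minutes_billed / 60 * hourly_rate_usd

BUDGET_USD = 1.0
for rate in (2.0, 4.0):
    minutes = BUDGET_USD / rate * 60
    print(f"At ${rate:.0f}/hour, ${BUDGET_USD:.0f} buys {minutes:.0f} billed minutes")

# Example: a 45-minute session at the top of the quoted range.
print(f"45 minutes at $4/hour costs ${rental_cost(45, 4.0):.2f}")
```

In other words, staying under $1 means keeping the whole session, including setup, to roughly 15 to 30 minutes.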

Option 4: Use a Cloud Video AI Instead

If you want AI video without GPU setup, tools like TryHolo offer text-to-video and image-to-video generation in a browser interface with no hardware needed. It won't have SANA-WM's camera control precision, but it's the easiest path to AI video for complete beginners.

How SANA-WM Compares to Other AI Video in 2026

NVIDIA benchmarked SANA-WM against two industrial-scale baselines, LingBot-World and HY-WorldPlay. The results:

Visual quality: Comparable to both industrial baselines
Action-following accuracy: Stronger than prior open-source baselines
Throughput: 36× higher than competing methods

This matters because most world models that match industrial quality require massive proprietary infrastructure. SANA-WM reaches comparable output quality at a fraction of the compute cost, and it's free.

What Beginners Can Actually Do With This

You don't need to understand world modeling theory to find creative uses:

Real estate walkthroughs — turn a photo of a room into a camera flythrough
Nature videos — generate 60-second ambient scenes from landscape photos
Concept visualization — show how a design would look from multiple angles
Social media content — create unique background loops for YouTube or Instagram

The key skill is defining the camera trajectory — where you want the virtual camera to go. Think of it as choreographing a camera move without a real camera.
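
To show what choreographing a move can look like in practice, the sketch below generates keyframes for a simple orbit shot: the camera circles the subject at a fixed distance and always looks at it. The function name and the keyframe layout (a position plus a unit look direction, with Y as the up axis) are hypothetical; SANA-WM's real trajectory format is not described in this article, so these values would need converting to whatever the inference script expects.

```python
import math

def orbit_keyframes(radius, height, num_keyframes, sweep_degrees=360.0):
    """Keyframes for a camera circling the origin and looking at it.

    Hypothetical layout: each keyframe has a 'position' (x, y, z) and a
    unit-length 'look_dir' pointing from the camera toward the origin.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    keyframes = []
    for i in range(num_keyframes):
        fraction = i / (num_keyframes - 1) if num_keyframes > 1 else 0.0
        angle = math.radians(sweep_degrees * fraction)
        position = (radius * math.sin(angle), height, radius * math.cos(angle))
        length = math.sqrt(sum(c * c for c in position))
        look_dir = tuple(-c / length for c in position)
        keyframes.append({"position": position, "look_dir": look_dir})
    return keyframes

# Example: a half orbit, 3 units from the subject, slightly above it.
for kf in orbit_keyframes(radius=3.0, height=1.5, num_keyframes=5, sweep_degrees=180.0):
    print([round(c, 2) for c in kf["position"]], [round(c, 2) for c in kf["look_dir"]])
```

A pan, dolly, or push-in follows the same pattern: pick a start pose, an end pose, and interpolate between them.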

FAQ

Q: Is SANA-WM completely free? A: Yes. NVIDIA released SANA-WM as open source, with code and model weights distributed through GitHub and Hugging Face (check the repository for what has been published so far). You only pay for the hardware: your own GPU, or cloud rental time.

Q: Do I need an NVIDIA GPU to run it? A: The base model was benchmarked on H100 GPUs. The distilled variant is designed for RTX 5090 with NVFP4 quantization. Other high-VRAM NVIDIA GPUs (RTX 4090 and above) may work for lighter configurations — the GitHub repo will document supported hardware once the inference code is fully released.

Q: How is this different from Sora? A: Sora (OpenAI) is a closed, subscription-based model. SANA-WM is open-source. Sora is text-to-video; SANA-WM is primarily image + camera trajectory to world video. SANA-WM specializes in camera-controlled world modeling, not general text-to-video generation.

Q: How long does it take to generate a 60-second video? A: The distilled variant generates a 60-second 720p clip in approximately 34 seconds on a single RTX 5090 with NVFP4 quantization. That figure applies to the distilled variant only; timing for the base model on an H100 depends on the configuration, so check the project page for current numbers. Cloud inference times will vary depending on the provider.

Q: Can I use videos generated by SANA-WM commercially? A: NVIDIA's open-source releases typically use a research license with restrictions on commercial use. Check the specific license in the GitHub repository before using generated content for commercial purposes.

Q: What's a "world model" vs a normal AI video generator? A: A standard AI video generator predicts what the next frames should look like based on your prompt or image. A world model builds an internal spatial understanding of the scene and renders what you'd see as a camera moves through it. The result is more geometrically consistent and controllable — especially over longer durations where standard models tend to drift.

Q: Is there a browser demo available? A: The paper and project page launched May 16, 2026. A Hugging Face Spaces demo typically follows within days. Check huggingface.co/spaces and search for SANA-WM. The official project page at NVIDIA Labs will link to it when available.