AI Video Generation with Wan 2.2 in ComfyUI
Learn how to generate stunning AI videos using Wan 2.2 in ComfyUI. Complete setup guide for 5B and 14B models, workflows, and best practices for consumer GPUs.
Table of Contents
- Why Wan 2.2?
- Model Variants
- Setup Guide
- Prerequisites
- Step 1: Install ComfyUI
- Step 2: Download Model Files
- Step 3: Download Workflows
- Generating Your First Video
- Text-to-Video Workflow
- Image-to-Video Workflow
- Hardware Optimization
- Memory-Saving Techniques
- Quality vs Speed
- Prompt Engineering
- Structure Your Prompts
- Example Prompts
- Negative Prompts
- Comparing Wan 2.2 to Other Models
- Wan 2.2 vs LTX-2
- Wan 2.2 vs Wan 2.1
- Troubleshooting
- Out of Memory Errors
- Slow Generation
- Poor Quality Output
- Resources
Wan 2.2 is the latest generation of video foundation models from Wan-AI, offering consumer-friendly video generation right in ComfyUI. Unlike LTX-2’s massive 19B parameter model requiring enterprise GPUs, Wan 2.2’s 5B variant runs on just 12GB VRAM—making it accessible to anyone with a mid-range graphics card.
Why Wan 2.2?
Wan 2.2 represents a significant leap forward in accessible AI video generation:
- Consumer-friendly: The 5B model runs on 12GB VRAM (RTX 3060/4070 Ti)
- Unified model: One 5B model handles both text-to-video AND image-to-video
- Quality options: 14B models available for higher quality output
- Native ComfyUI integration: Direct support without custom nodes
- Visual text: Generate Chinese and English text in videos
Model Variants
| Model | Parameters | Task | VRAM | Best For |
|---|---|---|---|---|
| ti2v_5B | 5B | T2V + I2V | 12GB | Consumers, fast iteration |
| t2v_14B | 14B | Text-to-Video | 24GB+ | High-quality output |
| i2v_14B | 14B | Image-to-Video | 24GB+ | Animation from images |
Setup Guide
Prerequisites
Before installing Wan 2.2, ensure your system meets these requirements:
Minimum (5B model):
- GPU: NVIDIA RTX 3060 12GB or equivalent
- VRAM: 12GB
- RAM: 16GB
- Storage: 30GB for models
- Python: 3.10+
Recommended (14B models):
- GPU: RTX 4090 / RTX 5090 / A100
- VRAM: 24GB+
- RAM: 32GB+
- Storage: 50GB+
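To make the hardware tiers above concrete, here is a minimal sketch (not part of ComfyUI; the helper name is hypothetical) that maps a card's VRAM to the Wan 2.2 variants it can run, using the minimum figures from this guide:

```python
# Hypothetical helper: which Wan 2.2 variants fit a given card,
# based on the minimum VRAM figures listed in this guide.
WAN22_VRAM_GB = {
    "ti2v_5B": 12,   # unified T2V + I2V
    "t2v_14B": 24,   # text-to-video only
    "i2v_14B": 24,   # image-to-video only
}

def variants_that_fit(vram_gb: float) -> list[str]:
    """Return the model variants whose minimum VRAM fits the given card."""
    return sorted(name for name, need in WAN22_VRAM_GB.items() if vram_gb >= need)
```

For example, `variants_that_fit(12)` returns only `["ti2v_5B"]`, matching the "Minimum (5B model)" tier above.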
Step 1: Install ComfyUI
If you haven’t installed ComfyUI yet:
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
Step 2: Download Model Files
Wan 2.2 requires several components. Here’s how to download them:
Text Encoder (Required)
# Create text_encoders directory
mkdir -p models/text_encoders
# Download UMT5 text encoder
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors \
--local-dir .
The text encoder goes in: ComfyUI/models/text_encoders/
For 5B Model (Recommended)
# Download 5B diffusion model
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors \
--local-dir .
# Download 5B VAE
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
split_files/vae/wan2.2_vae.safetensors \
--local-dir .
Files go in:
- Diffusion model: ComfyUI/models/diffusion_models/
- VAE: ComfyUI/models/vae/
For 14B Models (High Quality)
The 14B models use a two-stage approach with high and low noise schedulers:
# Download T2V 14B models
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
split_files/diffusion_models/wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors \
--local-dir .
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
split_files/diffusion_models/wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors \
--local-dir .
Note: 14B models use wan_2.1_vae.safetensors, not the 2.2 VAE.
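A quick way to confirm the 5B files landed in the right folders is a small path check like the sketch below (the helper is hypothetical; the filenames and directories are the ones described above for the 5B setup):

```python
from pathlib import Path

# Expected 5B file layout described above (filenames from the
# Comfy-Org/Wan_2.2_ComfyUI_repackaged repo).
EXPECTED_5B_FILES = {
    "models/text_encoders": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "models/diffusion_models": "wan2.2_ti2v_5B_fp16.safetensors",
    "models/vae": "wan2.2_vae.safetensors",
}

def missing_files(comfyui_root: str) -> list[str]:
    """Return relative paths of expected model files that are not present."""
    root = Path(comfyui_root)
    return [f"{d}/{name}" for d, name in EXPECTED_5B_FILES.items()
            if not (root / d / name).is_file()]
```

Run it against your ComfyUI folder; an empty list means everything is in place.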
Step 3: Download Workflows
Grab the official ComfyUI workflow JSON files:
Load these in ComfyUI via the Load button, or simply drag the workflow JSON file onto the canvas.
Generating Your First Video
Text-to-Video Workflow
- Load the T2V workflow JSON in ComfyUI
- Find the WanTextToVideoCheckpointLoader node
- Select your model: wan2.2_ti2v_5B_fp16
- Set your prompt in the Text Encode node:
  A cat playing piano in a sunlit room, soft afternoon light, cinematic, 4K
- Adjust settings:
  - Width: 832 (480P) or 1280 (720P)
  - Height: 480 or 720
  - Frames: 33 (about 1.4 seconds at 24fps)
  - CFG: 6.0
- Click Queue Prompt
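The settings above are easy to get subtly wrong, so here is a small sanity-check sketch (the function names are hypothetical; the resolutions and CFG range mirror this guide, and the frames-to-seconds arithmetic assumes the 24fps output noted above):

```python
# Hypothetical sanity check for the T2V settings listed above.
VALID_SIZES = {(832, 480), (1280, 720)}  # 480P and 720P

def clip_seconds(frames: int, fps: int = 24) -> float:
    """Length of the generated clip: 33 frames at 24fps is ~1.4s."""
    return frames / fps

def check_settings(width: int, height: int, frames: int, cfg: float) -> None:
    """Raise AssertionError if the settings fall outside this guide's values."""
    assert (width, height) in VALID_SIZES, "use 832x480 or 1280x720"
    assert frames >= 1, "need at least one frame"
    assert 1.0 <= cfg <= 20.0, "CFG far outside the usual range"
```

For instance, `clip_seconds(33)` gives 1.375 seconds, and `check_settings(832, 480, 33, 6.0)` passes silently for the defaults above.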
Image-to-Video Workflow
- Load the I2V workflow JSON
- Connect your input image to the Load Image node
- Write a motion prompt:
  The character walks forward, turning to look at the camera, natural movement
- Adjust frames and CFG as needed
- Generate
Hardware Optimization
Memory-Saving Techniques
If you’re running into OOM (Out of Memory) errors:
Use FP8 Variants
FP8 models use ~40% less VRAM with minimal quality loss:
# Instead of FP16
wan2.2_ti2v_5B_fp16.safetensors
# Use FP8
wan2.2_ti2v_5B_fp8_scaled.safetensors
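The rough arithmetic behind the savings: weight memory is parameter count times bytes per parameter, so halving the precision halves the weights. Total VRAM drops less than half (hence the ~40% figure above) because activations, the VAE, and the text encoder are unchanged. A quick back-of-envelope calculation:

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    return params_billions * 1e9 * bytes_per_param / 1e9

fp16_gb = weight_gb(5, 2)  # FP16: 2 bytes/param -> 10 GB of weights
fp8_gb = weight_gb(5, 1)   # FP8:  1 byte/param  ->  5 GB of weights
```

This is weights only; real VRAM use is higher on both counts.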
CPU Offloading
For systems with limited VRAM, let ComfyUI offload model weights to system RAM:
# Launch ComfyUI with aggressive offloading
python main.py --lowvram
The separate --cpu flag runs everything on the CPU and is far slower; treat it as a last resort, not a companion to --lowvram.
Reduce Frame Count
Lower the frame count for faster generation:
| Resolution | Frames | Duration | VRAM |
|---|---|---|---|
| 480P | 33 | ~1.4s | Lower |
| 480P | 81 | ~3.4s | Medium |
| 720P | 33 | ~1.4s | Medium |
| 720P | 81 | ~3.4s | Higher |
Quality vs Speed
| Format | Quality | Speed | File Size |
|---|---|---|---|
| FP16 | Best | Slowest | Largest |
| BF16 | High | Medium | Large |
| FP8 Scaled | Good | Fast | Medium |
| FP8 E4M3FN | Acceptable | Fastest | Smallest |
Prompt Engineering
Structure Your Prompts
Use this formula for best results:
[Subject] + [Action/Motion] + [Environment] + [Lighting] + [Quality/Style]
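The formula above can be applied mechanically; this sketch (the helper name is hypothetical) joins the five components into a prompt string, skipping any part you leave empty:

```python
# Hypothetical helper applying the prompt formula above:
# [Subject] + [Action/Motion] + [Environment] + [Lighting] + [Quality/Style]
def build_prompt(subject: str, action: str, environment: str,
                 lighting: str, style: str) -> str:
    """Join the five prompt components with commas, skipping empty parts."""
    parts = [subject, action, environment, lighting, style]
    return ", ".join(p.strip() for p in parts if p.strip())
```

For example, `build_prompt("A cat playing piano", "", "in a sunlit room", "soft afternoon light", "cinematic, 4K")` reproduces the earlier sample prompt.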
Example Prompts
Cinematic:
A vintage car driving through a neon-lit Tokyo street at night,
rain reflections on wet pavement, Blade Runner aesthetic,
cinematic lighting, 4K, photorealistic
Animation:
An animated character waving at the camera,
bright studio lighting, Pixar style,
vibrant colors, smooth motion
Nature:
Ocean waves crashing on rocky cliffs during golden hour,
spray catching the light, slow motion,
cinematic, 4K, nature documentary
Negative Prompts
Wan doesn’t use traditional negative prompts, but you can improve results by being specific about what you DO want, avoiding ambiguous descriptions.
Comparing Wan 2.2 to Other Models
Wan 2.2 vs LTX-2
| Feature | Wan 2.2 5B | Wan 2.2 14B | LTX-2 19B |
|---|---|---|---|
| Parameters | 5B | 14B | 19B |
| VRAM (min) | 12GB | 24GB | 32GB+ |
| Resolution | 480-720P | 480-720P | Native 4K |
| Audio Sync | ❌ No | ❌ No | ✅ Yes |
| Consumer GPU | ✅ Yes | ⚠️ High-end | ❌ No |
| Unified Model | ✅ Yes | ❌ Separate | ❌ Separate |
Winner for consumers: Wan 2.2 5B runs on a 12GB card, while LTX-2 requires enterprise hardware.
Winner for quality: LTX-2 at 4K with synchronized audio, if you have the hardware.
Wan 2.2 vs Wan 2.1
Wan 2.2 introduces:
- Unified 5B model (both T2V and I2V)
- Improved quality over 2.1’s 1.3B
- Better motion consistency
- New VAE architecture for 5B variant
Troubleshooting
Out of Memory Errors
Problem: CUDA out of memory during generation
Solutions:
- Switch to FP8 model variants
- Reduce frame count
- Enable CPU offload
- Close other GPU applications
Slow Generation
Problem: Video takes 10+ minutes to generate
Solutions:
- Use 5B model instead of 14B
- Reduce resolution to 480P
- Reduce frame count
- Check GPU utilization—is it actually being used?
Poor Quality Output
Problem: Generated videos look blurry or have artifacts
Solutions:
- Use FP16 instead of FP8
- Increase CFG value (try 6-8)
- Improve prompt clarity
- Use 14B model for better quality
Resources
- Official GitHub: Wan-Video/Wan2.2
- ComfyUI Examples: Wan 2.2 Workflows
- Model Downloads: Hugging Face
- Discord: Wan-AI Community
Wan 2.2 democratizes AI video generation by making it accessible to consumer hardware. Start with the 5B unified model for quick iterations, then scale up to the 14B variants when you need higher quality output. The native ComfyUI integration makes it easy to experiment with different prompts and settings without leaving your workflow.