AI Video Generation with Wan 2.2 in ComfyUI

Learn how to generate stunning AI videos using Wan 2.2 in ComfyUI. Complete setup guide for 5B and 14B models, workflows, and best practices for consumer GPUs.


Wan 2.2 is the latest generation of video foundation models from Wan-AI, offering consumer-friendly video generation right in ComfyUI. Unlike LTX-2’s massive 19B parameter model requiring enterprise GPUs, Wan 2.2’s 5B variant runs on just 12GB VRAM—making it accessible to anyone with a mid-range graphics card.

Why Wan 2.2?

Wan 2.2 represents a significant leap forward in accessible AI video generation:

  • Consumer-friendly: The 5B model runs on 12GB VRAM (RTX 3060/4070 Ti)
  • Unified model: One 5B model handles both text-to-video AND image-to-video
  • Quality options: 14B models available for higher quality output
  • Native ComfyUI integration: Direct support without custom nodes
  • Visual text: Generate Chinese and English text in videos

Model Variants

| Model | Parameters | Task | VRAM | Best For |
|-------|------------|------|------|----------|
| ti2v_5B | 5B | T2V + I2V | 12GB | Consumers, fast iteration |
| t2v_14B | 14B | Text-to-Video | 24GB+ | High-quality output |
| i2v_14B | 14B | Image-to-Video | 24GB+ | Animation from images |

Setup Guide

Prerequisites

Before installing Wan 2.2, ensure your system meets these requirements:

Minimum (5B model):

  • GPU: NVIDIA RTX 3060 12GB or equivalent
  • VRAM: 12GB
  • RAM: 16GB
  • Storage: 30GB for models
  • Python: 3.10+

Recommended (14B models):

  • GPU: RTX 4090 / RTX 5090 / A100
  • VRAM: 24GB+
  • RAM: 32GB+
  • Storage: 50GB+
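
A quick way to sanity-check your system against these requirements (a minimal sketch; assumes an NVIDIA GPU with the standard drivers installed):

# Check GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

# Check Python version (3.10+ required)
python --version

# Check free disk space in the current directory
df -h .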

Step 1: Install ComfyUI

If you haven’t installed ComfyUI yet:

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
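
Then start the server; ComfyUI serves its UI on http://127.0.0.1:8188 by default:

python main.py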

Step 2: Download Model Files

Wan 2.2 requires several components. Here’s how to download them:

Text Encoder (Required)

# Create text_encoders directory
mkdir -p models/text_encoders

# Download UMT5 text encoder
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
  split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors \
  --local-dir .

The text encoder goes in: ComfyUI/models/text_encoders/
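
With --local-dir ., huggingface-cli preserves the repository's folder layout, so the file actually lands in ./split_files/text_encoders/. Assuming you ran the command from the ComfyUI root, move it into place (a minimal sketch):

# Move the downloaded encoder into ComfyUI's model folder
mv split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors \
  models/text_encoders/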

For the 5B Model (Recommended)

# Download 5B diffusion model
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
  split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors \
  --local-dir .

# Download 5B VAE
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
  split_files/vae/wan2.2_vae.safetensors \
  --local-dir .

Files go in:

  • Diffusion model: ComfyUI/models/diffusion_models/
  • VAE: ComfyUI/models/vae/
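
As with the text encoder, the downloads land under ./split_files/; move them into the ComfyUI model folders (assuming you're still in the ComfyUI root):

# Move the diffusion model and VAE into place
mv split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors \
  models/diffusion_models/
mv split_files/vae/wan2.2_vae.safetensors \
  models/vae/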

For 14B Models (High Quality)

The 14B models use a two-stage approach: a high-noise model handles the early denoising steps and a low-noise model refines the later ones:

# Download T2V 14B models
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
  split_files/diffusion_models/wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors \
  --local-dir .

huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
  split_files/diffusion_models/wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors \
  --local-dir .

Note: 14B models use wan_2.1_vae.safetensors, not the 2.2 VAE.
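
If you don't already have the 2.1 VAE from an earlier Wan setup, it can be fetched the same way. The repository name and file path below follow the same naming pattern as the 2.2 repackage but are an assumption; verify them on Hugging Face before downloading:

# Assumed location of the Wan 2.1 VAE; check the repository listing
huggingface-cli download Comfy-Org/Wan_2.1_ComfyUI_repackaged \
  split_files/vae/wan_2.1_vae.safetensors \
  --local-dir .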

Step 3: Download Workflows

Grab the official workflow JSON files from the ComfyUI examples repository (comfyanonymous/ComfyUI_examples on GitHub).

Load these in ComfyUI via the Load button and select the JSON file, or simply drag and drop the JSON onto the canvas.

Generating Your First Video

Text-to-Video Workflow

  1. Load the T2V workflow JSON in ComfyUI
  2. Find the WanTextToVideoCheckpointLoader node
  3. Select your model: wan2.2_ti2v_5B_fp16
  4. Set your prompt in the Text Encode node:
    A cat playing piano in a sunlit room,
    soft afternoon light, cinematic, 4K
  5. Adjust settings:
    • Width: 832 (480P) or 1280 (720P)
    • Height: 480 or 720
    • Frames: 33 (about 1.3 seconds at 24fps)
    • CFG: 6.0
  6. Click Queue Prompt
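
Once a workflow runs from the UI, you can also queue it headlessly. ComfyUI exposes an HTTP API on its default port (8188); the sketch below assumes you exported the workflow in API format (enable dev mode in the settings to get the API-format save/export option) to a file named workflow_api.json:

# Queue a generation via ComfyUI's HTTP API
curl -X POST http://127.0.0.1:8188/prompt \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(cat workflow_api.json)}"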

Image-to-Video Workflow

  1. Load the I2V workflow JSON
  2. Connect your input image to the Load Image node
  3. Write a motion prompt:
    The character walks forward,
    turning to look at the camera,
    natural movement
  4. Adjust frames and CFG as needed
  5. Generate

Hardware Optimization

Memory-Saving Techniques

If you’re running into OOM (Out of Memory) errors:

Use FP8 Variants

FP8 models use ~40% less VRAM with minimal quality loss:

# Instead of FP16
wan2.2_ti2v_5B_fp16.safetensors

# Use FP8
wan2.2_ti2v_5B_fp8_scaled.safetensors
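
If an FP8 build of the 5B model is published in the repackaged repository, it downloads the same way. The exact filename here is an assumption based on the naming pattern above; check the repository's file listing first:

# Filename assumed; verify it exists in the repo before downloading
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
  split_files/diffusion_models/wan2.2_ti2v_5B_fp8_scaled.safetensors \
  --local-dir .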

CPU Offloading

For systems with limited VRAM, let ComfyUI offload model weights to system RAM:

# Launch ComfyUI with aggressive weight offloading
python main.py --lowvram

# --novram offloads even more aggressively; --cpu runs everything
# on the CPU and is a very slow last resort

Reduce Frame Count

Lower the frame count for faster generation:

| Resolution | Frames | Duration | VRAM |
|------------|--------|----------|------|
| 480P | 33 | 1.3s | Lower |
| 480P | 81 | 3.4s | Medium |
| 720P | 33 | 1.3s | Medium |
| 720P | 81 | 3.4s | Higher |

Quality vs Speed

| Format | Quality | Speed | File Size |
|--------|---------|-------|-----------|
| FP16 | Best | Slowest | Largest |
| BF16 | High | Medium | Large |
| FP8 Scaled | Good | Fast | Medium |
| FP8 E4M3FN | Acceptable | Fastest | Smallest |

Prompt Engineering

Structure Your Prompts

Use this formula for best results:

[Subject] + [Action/Motion] + [Environment] + [Lighting] + [Quality/Style]

Example Prompts

Cinematic:

A vintage car driving through a neon-lit Tokyo street at night,
rain reflections on wet pavement, Blade Runner aesthetic,
cinematic lighting, 4K, photorealistic

Animation:

An animated character waving at the camera,
bright studio lighting, Pixar style,
vibrant colors, smooth motion

Nature:

Ocean waves crashing on rocky cliffs during golden hour,
spray catching the light, slow motion,
cinematic, 4K, nature documentary

Negative Prompts

Wan doesn’t use traditional negative prompts, but you can improve results by being specific about what you DO want, avoiding ambiguous descriptions.

Comparing Wan 2.2 to Other Models

Wan 2.2 vs LTX-2

| Feature | Wan 2.2 5B | Wan 2.2 14B | LTX-2 19B |
|---------|------------|-------------|-----------|
| Parameters | 5B | 14B | 19B |
| VRAM (min) | 12GB | 24GB | 32GB+ |
| Resolution | 480-720P | 480-720P | Native 4K |
| Audio Sync | ❌ No | ❌ No | ✅ Yes |
| Consumer GPU | ✅ Yes | ⚠️ High-end | ❌ No |
| Unified Model | ✅ Yes | ❌ Separate | ❌ Separate |

Winner for consumers: Wan 2.2 5B runs on a 12GB card, while LTX-2 requires enterprise hardware.

Winner for quality: LTX-2 at 4K with synchronized audio, if you have the hardware.

Wan 2.2 vs Wan 2.1

Wan 2.2 introduces:

  • Unified 5B model (both T2V and I2V)
  • Improved quality over 2.1’s 1.3B
  • Better motion consistency
  • New VAE architecture for 5B variant

Troubleshooting

Out of Memory Errors

Problem: CUDA out of memory during generation

Solutions:

  1. Switch to FP8 model variants
  2. Reduce frame count
  3. Enable CPU offload
  4. Close other GPU applications
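
To see which other processes are holding VRAM, a quick check using standard nvidia-smi query options:

# List processes currently using GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv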

Slow Generation

Problem: Video takes 10+ minutes to generate

Solutions:

  1. Use 5B model instead of 14B
  2. Reduce resolution to 480P
  3. Reduce frame count
  4. Check GPU utilization to confirm the GPU is actually doing the work (see the check below)
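
A simple way to confirm the GPU is busy while a job runs:

# Refresh GPU stats every second during generation;
# GPU-Util near 0% suggests the model fell back to CPU
watch -n 1 nvidia-smi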

Poor Quality Output

Problem: Generated videos look blurry or have artifacts

Solutions:

  1. Use FP16 instead of FP8
  2. Increase CFG value (try 6-8)
  3. Improve prompt clarity
  4. Use 14B model for better quality

Resources

  • Repackaged models: https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_repackaged
  • ComfyUI: https://github.com/comfyanonymous/ComfyUI

Wan 2.2 democratizes AI video generation by making it accessible to consumer hardware. Start with the 5B unified model for quick iterations, then scale up to the 14B variants when you need higher quality output. The native ComfyUI integration makes it easy to experiment with different prompts and settings without leaving your workflow.

Anthony Lattanzio

Tech Enthusiast & Builder

I'm a tech enthusiast who loves building things with hardware and software. By night, I run a homelab that's grown way beyond what any reasonable person needs. Check out about me for more.
