Generating AI Video with LTX-2 in ComfyUI
Learn how to generate stunning AI videos with synchronized audio using LTX-2 in ComfyUI. Complete setup guide, workflows, and best practices.
Table of Contents
- What Makes LTX-2 Special
- Key Capabilities
- Setup Guide
- Prerequisites
- Installation
- Memory Optimization
- Workflow Types
- Text-to-Video (Full)
- Text-to-Video (Distilled)
- Image-to-Video
- Two-Stage Pipeline
- Hardware Requirements in Detail
- VRAM Breakdown
- Low-VRAM Workarounds
- Prompting Guide
- Prompt Structure (4-8 sentences)
- Example: Product Video Prompt
- What Works Well
- What to Avoid
- Camera Language Reference
- Audio Description
- Advanced Prompting Techniques
- Six-Part Structured Prompt (4K Quality)
- Lens Language Reference
- Shutter Descriptions
- Keywords for Smooth Motion
- Long Take Strategy (15-20 Second Clips)
- Audio-Visual Sync Techniques
- The 9 Prompting Rules (RunDiffusion)
- 1. Single Continuous Paragraph
- 2. Present-Tense Action Verbs
- 3. Explicit Camera Behavior
- 4. Precise Physical Details
- 5. Atmospheric Environment
- 6. Smooth Temporal Flow
- 7. Genre-Specific Language
- 8. Character Specificity
- 9. Show, Don’t Tell Emotion
- Match Prompt to Input Type
- Best Practices
- Negative Prompts
- Inference Settings
- Camera Movement LoRAs
- Common Issues
- OOM on CLIP Encoding
- Audio Sync Issues
- Artifacts in Output
- Resources
AI video generation has taken a massive leap forward with LTX-2, Lightricks’ second-generation audio-video foundation model. Unlike its predecessor, LTX-2 generates synchronized audio and video together in a single pass—the first DiT-based model to achieve this. This guide walks you through setting up and using LTX-2 in ComfyUI.
What Makes LTX-2 Special
LTX-2 is a 19-billion parameter model split into two streams:
- 14B video stream for visual generation
- 5B audio stream for synchronized sound
This dual-stream architecture enables native 4K resolution at up to 50 FPS with videos up to 10 seconds long. The model produces cinematic-quality output without requiring post-processing audio sync.
Key Capabilities
| Feature | Details |
|---|---|
| Resolution | Native 4K, up to 50 FPS |
| Duration | 10 seconds continuous |
| Audio | Synchronized generation |
| Architecture | Diffusion Transformer (DiT) |
| Integration | Built into ComfyUI core |
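For a sense of scale, the arithmetic behind these numbers is worth spelling out (illustrative only — LTX-2 generates in a compressed latent space, so actual memory traffic is far smaller than raw pixels):

```python
# Rough scale of a maximum-length LTX-2 clip (illustrative arithmetic).
# The model operates on compressed latents, so real memory use is far lower.
fps = 50
seconds = 10
width, height = 3840, 2160  # native 4K

frames = fps * seconds
raw_bytes = frames * width * height * 3  # 8-bit RGB

print(frames)                     # 500
print(round(raw_bytes / 1e9, 1))  # 12.4 (GB of raw RGB pixels)
```

That is roughly 12GB of uncompressed pixels for a single 10-second clip, which is why latent-space generation and tiled decoding matter so much here.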
Setup Guide
Prerequisites
Before installing LTX-2, ensure your system meets these requirements:
Minimum:
- GPU: NVIDIA RTX 4090 or equivalent
- VRAM: 24GB
- RAM: 32GB
- Storage: 100GB SSD
Recommended:
- GPU: NVIDIA A100 or H100
- VRAM: 40GB+
- RAM: 64GB
- Storage: 200GB+ SSD
Installation
- Install ComfyUI if you haven’t already:
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
- Download LTX-2 models from Hugging Face:
# Create models directory
mkdir -p models/LTX2
# Download required models
# UNET (19B, quantized for lower VRAM)
huggingface-cli download Lightricks/LTX-Video LTX2/ltx-2-19b-dev_Q4_K_M.gguf --local-dir models/LTX2
# VAE (video)
huggingface-cli download Lightricks/LTX-Video LTX2/LTX2_video_vae_bf16.safetensors --local-dir models/LTX2
# VAE (audio)
huggingface-cli download Lightricks/LTX-Video LTX2/LTX2_audio_vae_bf16.safetensors --local-dir models/LTX2
# Text encoder embeddings connector
huggingface-cli download Lightricks/LTX-Video ltx-2-19b-embeddings_connector_dev_bf16.safetensors --local-dir models/LTX2
- Install additional nodes (optional but recommended):
cd custom_nodes
git clone https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite
git clone https://github.com/Kosinkadink/ComfyUI-AnimateDiff-Evolved
Memory Optimization
If you’re running on limited VRAM, use these optimizations:
- Quantized models: Q4_K_M GGUF models reduce VRAM by 60%
- Tiled VAE decode: Process high-res frames in tiles
- Sage Attention: Memory-efficient attention mechanism
- CPU offload: Move the audio VAE to CPU with device: "cpu"
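The ~60% saving from quantization can be sanity-checked with back-of-the-envelope weight math, assuming Q4_K_M averages roughly 4.5 bits per weight (the exact ratio varies by tensor mix and GGUF version); counting weights alone gives a somewhat larger saving than the practical figure, since other parts of the pipeline stay in bf16:

```python
# Back-of-the-envelope weight-memory estimate for GGUF quantization.
# Assumption: Q4_K_M averages ~4.5 bits per weight (varies by tensor mix).
params = 19e9  # LTX-2 total parameter count

bf16_gb = params * 2 / 1e9      # bf16 = 2 bytes per weight
q4_gb = params * 4.5 / 8 / 1e9  # ~4.5 bits per weight

reduction = 1 - q4_gb / bf16_gb
print(round(bf16_gb), round(q4_gb, 1), f"{reduction:.0%}")  # 38 10.7 72%
```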
Workflow Types
LTX-2 supports several generation pipelines in ComfyUI:
Text-to-Video (Full)
The highest quality workflow. Provides fine control over every aspect of generation:
EmptyLatent → LTXVSampler → LTXVideoDecode → VHSVideoCombine
Use when: Quality matters more than speed.
Text-to-Video (Distilled)
A faster version using distilled models:
EmptyLatent → LTXVDistilledSampler → LTXVideoDecode → VHSVideoCombine
Use when: Quick iterations, lower VRAM.
Image-to-Video
Animate static images with motion:
LoadImage → LTXVImageEncode → LTXVSampler → LTXVideoDecode → VHSVideoCombine
Use when: Creating animations from artwork or photos.
I2V Workflow File
The ltx2_I2V_api.json workflow provides a complete Image-to-Video pipeline with:
| Node | Purpose |
|---|---|
| LoadImage | Input your source image |
| LTXVImgToVideoInplace | Convert image to video latent |
| EmptyLTXVLatentVideo | Initialize video frames |
| LTXVAudioVAEDecode | Generate synchronized audio |
| VHS_VideoCombine | Output combined video |
Using I2V
- Load the workflow in ComfyUI (drag and drop or Load Default)
- Select your input image in the LoadImage node
- Set your prompt in the text encoder (describe the motion you want)
- Adjust frame count in EmptyLTXVLatentVideo (24 frames = 1 second)
- Queue the prompt and wait for generation
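The frame count in EmptyLTXVLatentVideo is simple duration math; a small helper (hypothetical, not a ComfyUI API) makes the conversion explicit:

```python
# Duration-to-frame-count helper for the EmptyLTXVLatentVideo node.
# Assumption: frame count = seconds * fps; the I2V workflow above
# runs at 24 fps, so 24 frames = 1 second.
def frames_for(seconds: float, fps: int = 24) -> int:
    return int(seconds * fps)

print(frames_for(1))          # 24
print(frames_for(5))          # 120
print(frames_for(3, fps=50))  # 150
```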
I2V Prompt Tips
When using Image-to-Video, describe what you want to happen after the initial frame:
The camera slowly pushes in as the subject turns to face the lens.
Subtle movement in the hair suggests a gentle breeze. The background
remains static while the subject's expression softens.
Key difference from T2V: The first frame is already defined by your input image. Your prompt describes motion and camera behavior FROM that frame.
Two-Stage Pipeline
Generate low-res first, then upscale:
Stage 1: Low-res generation (640x360)
Stage 2: LTXVLatentUpscaler → Decode at full resolution
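The resolution math for the two stages is a straight multiply; for example, a 2x latent upscale from the stage-1 resolution lands exactly on 720p:

```python
# Two-stage pipeline resolution math (sketch): generate small,
# then upscale by an integer factor to the target resolution.
def upscaled(res: tuple[int, int], factor: int) -> tuple[int, int]:
    w, h = res
    return (w * factor, h * factor)

stage1 = (640, 360)
print(upscaled(stage1, 2))  # (1280, 720)
print(upscaled(stage1, 3))  # (1920, 1080)
```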
Hardware Requirements in Detail
VRAM Breakdown
| Component | VRAM Usage |
|---|---|
| UNET (Q4_K_M) | ~12GB |
| Video VAE | ~4GB |
| Audio VAE | ~2GB |
| CLIP Encoders | ~2GB |
| Latents | ~2-4GB |
| Total | 24-28GB |
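Summing the component figures gives roughly 22-24GB; the quoted total leaves headroom on top of that raw sum for activations and allocator fragmentation:

```python
# Sanity-check the VRAM component table above. The raw component sum
# is ~22-24 GB; real-world totals run somewhat higher once activations
# and memory fragmentation are counted.
components_gb = {
    "unet_q4_k_m": 12,
    "video_vae": 4,
    "audio_vae": 2,
    "clip_encoders": 2,
}
latents_range_gb = (2, 4)

low = sum(components_gb.values()) + latents_range_gb[0]
high = sum(components_gb.values()) + latents_range_gb[1]
print(low, high)  # 22 24
```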
Low-VRAM Workarounds
- Use CPU for audio VAE: Reduces VRAM by 2GB
- Reduce batch size: 1 frame at a time
- Lower resolution: Start at 640x360, then upscale
- GGUF quantization: Q4_K_M uses roughly 60% less memory than bf16
Prompting Guide
LTX-2 responds best to prompts that paint a complete picture. Following the official LTX-2 prompting guidelines will significantly improve your results.
Prompt Structure (4-8 sentences)
Write prompts as a single flowing paragraph using present tense. Include these elements:
- Establish the shot — Use cinematography terms (wide, medium, close-up, extreme close-up)
- Set the scene — Describe lighting, color palette, textures, atmosphere
- Describe the action — Natural sequence from beginning to end
- Define characters — Physical details with visual emotional cues (not abstract labels)
- Camera movement — Explicit instructions: “slow dolly in”, “handheld tracking”, “static shot”
- Describe audio — Ambient sound, dialogue in quotation marks, language/accent
Example: Product Video Prompt
Before (vague):
A blender lid levitates above a pitcher. Cinematic product shot.
After (LTX-2 optimized):
Extreme close-up product shot with shallow depth of field. The frame centers
on a sleek stainless steel blender pitcher resting on a matte black infinity
surface. Soft box lighting from above creates a subtle highlight along the
blender's curve. The lid begins to rise smoothly, weightlessly, as if lifted
by an invisible force. It ascends straight upward, rotating imperceptibly,
then descends with the same grace, settling back with surgical precision.
Static locked-off shot. Photorealistic, commercial-grade aesthetic.
What Works Well
| Strength | Description |
|---|---|
| Cinematic compositions | Wide, medium, close-up with thoughtful lighting and depth of field |
| Physical emotional cues | Facial expressions, gestures, body language instead of “sad” or “happy” |
| Atmosphere | Fog, rain, golden-hour light, reflections, ambient textures |
| Explicit camera language | “slow dolly in”, “handheld tracking”, “pushes in”, “circles around” |
| Stylized aesthetics | Noir, film grain, painterly, fashion editorial, animation styles |
What to Avoid
| Don’t Use | Why |
|---|---|
| Abstract emotional labels | “sad”, “confused”, “angry” → use visual cues instead |
| Text and logos | Not reliable in current version |
| Complex physics | Chaotic motion can cause artifacts |
| Overloaded scenes | Too many characters/actions reduce clarity |
| Conflicting lighting | Mixed light sources confuse the model |
Camera Language Reference
| Term | Effect |
|---|---|
| Dolly in/out | Camera moves toward/away from subject |
| Pan | Camera rotates horizontally |
| Tilt | Camera rotates vertically |
| Track | Camera follows subject laterally |
| Push in | Smooth forward motion |
| Pull back | Smooth backward motion |
| Circles around | Orbital camera movement |
| Static/locked-off | No camera movement |
| Handheld | Slight natural wobble |
| Over-the-shoulder | Camera behind another subject |
Audio Description
For videos with synchronized audio:
The reporter looks into the camera and speaks with an energetic announcer voice:
"Thank you, Sylvia. This morning, here in New Castle, Vermont... black gold has been found!"
Ambient sounds of construction equipment and distant chatter fill the background.
Tips:
- Put spoken dialogue in quotation marks
- Specify language and accent if needed
- Describe ambient sounds separately
- Match audio intensity to on-screen action
Advanced Prompting Techniques
Six-Part Structured Prompt (4K Quality)
For best results, structure your prompt in clear layers:
| Layer | Purpose | Example |
|---|---|---|
| Scene Anchor | Location, time, atmosphere | “Abandoned rocket launch site at dusk, orange-red sunset clouds” |
| Subject + Action | Who/what + strong verb | “A silver drone skims low over the ground, scanning debris” |
| Camera + Lens | Movement, focal length, aperture | “Fast forward tracking shot, 24mm lens, f1.8, ultra wide angle” |
| Visual Style | Color grading, film emulation | “High contrast, cool blue-green grading, Fujifilm Provia 100F texture” |
| Motion Cues | Speed, frame rate, shutter | “Subtle motion blur, 60fps feel, 180-degree shutter equivalent” |
| Guardrails | What to avoid | “No distortion, no blown highlights, no AI artifacts” |
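Since LTX-2 wants a single flowing paragraph, the six layers can simply be written as sentences and joined in order; a minimal sketch using the example text from the table:

```python
# Assemble the six layers into the single flowing paragraph LTX-2
# expects (guardrail phrases can alternatively go in the negative
# prompt). Layer text is illustrative, taken from the table above.
layers = [
    "Abandoned rocket launch site at dusk, orange-red sunset clouds.",
    "A silver drone skims low over the ground, scanning debris.",
    "Fast forward tracking shot, 24mm lens, f1.8, ultra wide angle.",
    "High contrast, cool blue-green grading, Fujifilm Provia 100F texture.",
    "Subtle motion blur, 60fps feel, 180-degree shutter equivalent.",
    "No distortion, no blown highlights, no AI artifacts.",
]

prompt = " ".join(layers)
assert "\n" not in prompt  # single continuous paragraph (rule 1 below)
print(prompt)
```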
Lens Language Reference
| Focal Length | Effect | Use When |
|---|---|---|
| 24mm wide | Environmental scale, sense of space | Establishing shots, landscapes |
| 50mm standard | Natural human eye perspective | Documentary, interviews |
| 85mm portrait | Compression, intimacy | Character close-ups |
| 200mm telephoto | Isolate subject from background | Sports, wildlife |
Shutter Descriptions
| Term | Effect |
|---|---|
| 180-degree shutter | Classic cinematic motion blur |
| Natural motion blur | Realism in moving subjects |
| Fast shutter | Sharp, high-energy action feel |
Keywords for Smooth Motion
For 50fps fluidity:
- Stable dolly push
- Smooth gimbal stabilization
- Tripod locked off
- Constant speed pan
- Natural motion blur
- Fluid movement, controlled motion
Avoid at high frame rates:
- Chaotic handheld movement (causes warping)
- Shaky camera
- Irregular motion
Long Take Strategy (15-20 Second Clips)
For maximum duration clips, treat the prompt like a mini-scene:
**Scene Heading:** Location and time of day
**Brief Description:** Overall vibe and atmosphere
**Blocking:** Sequence of actions + camera movements
**Dialogue/Cues:** Performance notes in parentheses
Example (15s long take):
Scene: A pilot's cockpit at sunset.
Blocking: Start macro shot of gloved hand on flight stick, metallic reflections
catching dying sunlight. Camera slowly pulls back to medium shot, revealing
clenched jaw and cold dashboard glow. Expression shifts from focus to grim
determination. Camera continues dollying back, revealing tarmac behind—rusted
fighter jets, scattered debris, orange-red sky.
Audio-Visual Sync Techniques
LTX-2 generates audio and video simultaneously. Tighten synchronization with:
Temporal Cueing:
- “On the heavy drum beat” — Align action with musical rhythm
- “At the 3-second mark” — Specify exact timing
- “On the third bass hit” — Precise event timing
Action Regularity:
- “Constant speed tracking shot” — Predictable camera for AI
- “Rhythmic robotic arm oscillation” — Regular movement intervals
- “Steady heartbeat pulse” — Consistent audio-visual pattern
Example:
A robotic arm precisely grabs a component on the bass hit, its metallic pincers
opening and closing in perfect rhythm. The camera remains steady in close-up,
while each grab produces a crisp metallic clank echoing through the sterile lab.
The 9 Prompting Rules (RunDiffusion)
These rules ensure your prompts translate into cinematic video:
1. Single Continuous Paragraph
No line breaks, lists, or fragmented thoughts. Write one flowing description.
A lone fisherman rows across a foggy lake before sunrise, the boat creaking
softly as water laps at its sides. The camera glides overhead, tracking his
slow progress. His lantern casts a warm circle of light, reflecting in ripples
while reeds sway gently on the shoreline.
2. Present-Tense Action Verbs
Use “walks”, “tilts”, “flickers” — not “walked” or “is walking”.
A young boy runs barefoot across a wet stone courtyard as the first raindrops
begin to fall. The camera tracks behind him at low angle, catching the splashes
beneath his feet. He turns sharply, arms outstretched for balance.
3. Explicit Camera Behavior
Define perspective, angle, movement, and speed.
The camera begins in a wide shot from across the street, then slowly pushes
forward at shoulder height as pedestrians blur in the foreground. A passing
bicycle crosses the frame just before the shot settles into a close-up.
4. Precise Physical Details
Use small, measurable movements. Describe what the camera sees.
Her eyebrows lift approximately two millimeters as she hears a creak behind her,
and the blade pauses mid-air. The camera holds in medium close-up with shallow
depth of field, capturing the tension in her wrist.
5. Atmospheric Environment
Include lighting, air, textures, sound, ambient elements.
Pale blue light from an overcast sky diffuses across the scene, softening the
edges of distant waves and casting no sharp shadows. A cool breeze ripples
through her hair while seagulls fly overhead.
6. Smooth Temporal Flow
Use connectors: “as”, “then”, “while”.
As the camera begins in a stationary wide shot, a tall alien figure steps forward
through the haze. Then the camera glides sideways, following its stride as it
moves across the deck toward a glowing console.
7. Genre-Specific Language
Match tone and vocabulary to your genre.
Sci-Fi:
A maintenance drone glides through a long tunnel inside a deep space cargo vessel,
its circular frame rotating gently as it shines beams of light on the walls.
A soft mechanical hum blends with distant low thrum of the ship's reactor core.
8. Character Specificity
Only include observable details: age, ethnicity, clothing, posture.
A middle-aged South Asian man wearing a long tan coat and dark scarf steps into
a narrow alley lit by neon signage. The camera tilts up from his shoes as rain
hits the cobblestones, revealing his profile in close-up.
9. Show, Don’t Tell Emotion
Never describe feelings. Describe body reactions.
| Don’t Write | Do Write |
|---|---|
| “He is nervous” | “His fingers tighten, his breathing slows as he steadies himself” |
| “She is sad” | “A single tear trails down her cheek. Her shoulders drop” |
| “He is confident” | “Back straight, gestures controlled and deliberate” |
Match Prompt to Input Type
| Input Type | Strategy |
|---|---|
| Image to Video | Describe the image exactly, then build motion on top |
| Text to Video | Define everything from scratch |
Best Practices
Negative Prompts
Always include negative prompts to prevent artifacts:
blurry, low quality, distorted, watermark, text,
deformed, ugly, bad anatomy, extra limbs, flickering
Inference Settings
| Setting | Quality | Speed | VRAM |
|---|---|---|---|
| Steps | 20-30 | 10-15 | Same |
| CFG | 2.5-3.5 | 1.5-2.0 | Same |
| Resolution | 1280x720 | 640x360 | Higher res = more VRAM |
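One convenient way to use this table is as named presets you flip between draft and final renders (the keys are illustrative, not ComfyUI node fields):

```python
# The settings table as two named presets for toggling a workflow
# between quick drafts and final renders. Key names are illustrative,
# not actual ComfyUI node parameters.
PRESETS = {
    "quality": {"steps": 25, "cfg": 3.0, "resolution": (1280, 720)},
    "speed":   {"steps": 12, "cfg": 1.8, "resolution": (640, 360)},
}

def pick(mode: str) -> dict:
    return PRESETS[mode]

print(pick("speed")["steps"])  # 12
```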
Camera Movement LoRAs
Add cinematic camera movements with specialized LoRAs:
- Dolly: Smooth forward/backward motion
- Jib: Vertical camera movement
- Tracking: Lateral tracking shots
Common Issues
OOM on CLIP Encoding
Problem: Out of memory during text encoding.
Solution: Use smaller CLIP model or reduce batch size:
# In workflow, change:
clip_name1: "gemma-3-12b-it-abliterated.q4_k_m.gguf"
# To smaller model if needed
Audio Sync Issues
Problem: Audio doesn’t match video timing.
Solution: Ensure audio VAE is loaded with correct device settings:
audio_vae:
device: "cpu" # Offload to CPU
weight_dtype: "bf16"
Artifacts in Output
Problem: Visual flickering or distortion.
Solutions:
- Increase steps (20 → 30)
- Lower CFG scale (3.5 → 2.5)
- Use distilled model for cleaner output
Resources
- Official Documentation: Lightricks/LTX-Video on GitHub
- Model Downloads: Hugging Face
- ComfyUI Examples: ComfyUI Examples Wiki
- Community: r/StableDiffusion
LTX-2 represents a significant advancement in AI video generation. With synchronized audio, native 4K resolution, and ComfyUI integration, it’s now accessible to anyone with a capable GPU. Start with the distilled workflows for quick results, then experiment with the full pipeline for production-quality output.