Generating AI Video with LTX-2 in ComfyUI

Learn how to generate stunning AI videos with synchronized audio using LTX-2 in ComfyUI. Complete setup guide, workflows, and best practices.

12 min read

Tags: ai, comfyui, video-generation, ltx-2, tutorial

AI video generation has taken a massive leap forward with LTX-2, Lightricks’ second-generation audio-video foundation model. Unlike its predecessor, LTX-2 generates synchronized audio and video together in a single pass—the first DiT-based model to achieve this. This guide walks you through setting up and using LTX-2 in ComfyUI.

What Makes LTX-2 Special

LTX-2 is a 19-billion-parameter model split into two streams:

  • 14B video stream for visual generation
  • 5B audio stream for synchronized sound

This dual-stream architecture enables native 4K resolution at up to 50 FPS with videos up to 10 seconds long. The model produces cinematic-quality output without requiring post-processing audio sync.

Key Capabilities

| Feature | Details |
|---|---|
| Resolution | Native 4K, up to 50 FPS |
| Duration | 10 seconds continuous |
| Audio | Synchronized generation |
| Architecture | Diffusion Transformer (DiT) |
| Integration | Built into ComfyUI core |

Setup Guide

Prerequisites

Before installing LTX-2, ensure your system meets these requirements:

Minimum:

  • GPU: NVIDIA RTX 4090 or equivalent
  • VRAM: 24GB
  • RAM: 32GB
  • Storage: 100GB SSD

Recommended:

  • GPU: NVIDIA A100 or H100
  • VRAM: 40GB+
  • RAM: 64GB
  • Storage: 200GB+ SSD
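
Before downloading anything, it can help to sanity-check your hardware against the numbers above. A minimal sketch; values are passed in by hand rather than probed from the system, so it runs anywhere:

```python
# Quick pre-install sanity check against the minimum specs above.
# Illustrative only: supply your own system's numbers.

MINIMUM = {"vram_gb": 24, "ram_gb": 32, "storage_gb": 100}

def meets_minimum(vram_gb: float, ram_gb: float, storage_gb: float) -> list[str]:
    """Return the list of specs that fall short of the minimum requirements."""
    have = {"vram_gb": vram_gb, "ram_gb": ram_gb, "storage_gb": storage_gb}
    return [spec for spec, needed in MINIMUM.items() if have[spec] < needed]

# An RTX 4090 system: 24GB VRAM, 64GB RAM, 500GB free SSD
print(meets_minimum(24, 64, 500))   # -> [] (nothing missing)
print(meets_minimum(16, 32, 100))   # -> ['vram_gb']
```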

Installation

  1. Install ComfyUI if you haven’t already:
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
  2. Download LTX-2 models from Hugging Face:
# Create models directory
mkdir -p models/LTX2

# Download required models
# UNET (19B, quantized for lower VRAM)
huggingface-cli download Lightricks/LTX-Video LTX2/ltx-2-19b-dev_Q4_K_M.gguf --local-dir models/LTX2

# VAE (video)
huggingface-cli download Lightricks/LTX-Video LTX2/LTX2_video_vae_bf16.safetensors --local-dir models/LTX2

# VAE (audio)
huggingface-cli download Lightricks/LTX-Video LTX2/LTX2_audio_vae_bf16.safetensors --local-dir models/LTX2

# CLIP text encoders
huggingface-cli download Lightricks/LTX-Video ltx-2-19b-embeddings_connector_dev_bf16.safetensors --local-dir models/LTX2
  3. Install additional nodes (optional but recommended):
cd custom_nodes
git clone https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite
git clone https://github.com/Kosinkadink/ComfyUI-AnimateDiff-Evolved
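
After the downloads finish, a quick check that every file actually landed in models/LTX2 can save a failed first run. A sketch using the filenames from the download commands above:

```python
from pathlib import Path

# Filenames from the download commands above; adjust this list if you
# pulled different quantizations.
EXPECTED = [
    "ltx-2-19b-dev_Q4_K_M.gguf",
    "LTX2_video_vae_bf16.safetensors",
    "LTX2_audio_vae_bf16.safetensors",
    "ltx-2-19b-embeddings_connector_dev_bf16.safetensors",
]

def missing_models(model_dir: str) -> list[str]:
    """Return the expected model files not present in model_dir."""
    present = {p.name for p in Path(model_dir).glob("*")}
    return [name for name in EXPECTED if name not in present]

if __name__ == "__main__":
    for name in missing_models("models/LTX2"):
        print(f"missing: {name}")
```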

Memory Optimization

If you’re running on limited VRAM, use these optimizations:

  • Quantized models: Q4_K_M GGUF models reduce VRAM by 60%
  • Tiled VAE decode: Process high-res frames in tiles
  • Sage Attention: Memory-efficient attention mechanism
  • CPU offload: Move audio VAE to CPU with device: "cpu"
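
To illustrate the tiled-decode idea, here is a conceptual sketch of how a high-res frame can be partitioned into overlapping tiles. The real ComfyUI implementation also blends the overlapping regions when merging, and the tile and overlap sizes here are illustrative:

```python
def tile_coords(width, height, tile=512, overlap=64):
    """Yield (x, y, w, h) tiles covering a frame, overlapping to hide seams.

    Conceptual sketch of how tiled VAE decoding partitions a frame so that
    only one tile needs to be resident in VRAM at a time.
    """
    step = tile - overlap
    xs = range(0, max(width - overlap, 1), step)
    ys = range(0, max(height - overlap, 1), step)
    for y in ys:
        for x in xs:
            # Clamp the last tile in each row/column to the frame edge.
            yield (x, y, min(tile, width - x), min(tile, height - y))

# A 1280x720 frame decodes as a 3x2 grid of 512px tiles:
print(len(list(tile_coords(1280, 720))))  # -> 6
```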

Workflow Types

LTX-2 supports several generation pipelines in ComfyUI:

Text-to-Video (Full)

The highest-quality workflow; it provides fine control over every aspect of generation:

EmptyLatent → LTXVSampler → LTXVideoDecode → VHSVideoCombine

Use when: Quality matters more than speed.

Text-to-Video (Distilled)

A faster version using distilled models:

EmptyLatent → LTXVDistilledSampler → LTXVideoDecode → VHSVideoCombine

Use when: Quick iterations, lower VRAM.

Image-to-Video

Animate static images with motion:

LoadImage → LTXVImageEncode → LTXVSampler → LTXVideoDecode → VHSVideoCombine

Use when: Creating animations from artwork or photos.

I2V Workflow File

The ltx2_I2V_api.json workflow provides a complete Image-to-Video pipeline with:

| Node | Purpose |
|---|---|
| LoadImage | Input your source image |
| LTXVImgToVideoInplace | Convert image to video latent |
| EmptyLTXVLatentVideo | Initialize video frames |
| LTXVAudioVAEDecode | Generate synchronized audio |
| VHS_VideoCombine | Output combined video |

Using I2V

  1. Load the workflow in ComfyUI (drag and drop or Load Default)
  2. Select your input image in the LoadImage node
  3. Set your prompt in the text encoder (describe the motion you want)
  4. Adjust frame count in EmptyLTXVLatentVideo (24 frames = 1 second)
  5. Queue prompt and wait for generation
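
For step 4, a tiny helper built on the article's rule of thumb (24 frames per second) converts a target duration into the frame count to enter in EmptyLTXVLatentVideo:

```python
# The I2V workflow's rule of thumb: 24 frames = 1 second.
FPS = 24

def seconds_to_frames(seconds: float, fps: int = FPS) -> int:
    """Frame count to enter in EmptyLTXVLatentVideo for a target duration."""
    return round(seconds * fps)

print(seconds_to_frames(5))    # -> 120 frames for a 5-second clip
print(seconds_to_frames(2.5))  # -> 60
```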

I2V Prompt Tips

When using Image-to-Video, describe what you want to happen after the initial frame:

The camera slowly pushes in as the subject turns to face the lens. 
Subtle movement in the hair suggests a gentle breeze. The background 
remains static while the subject's expression softens.

Key difference from T2V: The first frame is already defined by your input image. Your prompt describes motion and camera behavior FROM that frame.

Two-Stage Pipeline

Generate low-res first, then upscale:

Stage 1: Low-res generation (640x360)
Stage 2: LTXVLatentUpscaler → Decode at full resolution

Hardware Requirements in Detail

VRAM Breakdown

| Component | VRAM Usage |
|---|---|
| UNET (Q4_K_M) | ~12GB |
| Video VAE | ~4GB |
| Audio VAE | ~2GB |
| CLIP Encoders | ~2GB |
| Latents | ~2-4GB |
| **Total** | **24-28GB** |
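
Summing the component figures gives roughly 22-24GB; the stated 24-28GB total presumably also covers framework overhead and activations. As a quick sketch:

```python
# Lower/upper bounds (GB) for each component from the VRAM breakdown above.
components = {
    "unet_q4_k_m": (12, 12),
    "video_vae":   (4, 4),
    "audio_vae":   (2, 2),
    "clip":        (2, 2),
    "latents":     (2, 4),
}

low = sum(lo for lo, hi in components.values())
high = sum(hi for lo, hi in components.values())
print(f"{low}-{high} GB before framework overhead")  # -> 22-24 GB
```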

Low-VRAM Workarounds

  1. Use CPU for audio VAE: Reduces VRAM by 2GB
  2. Reduce batch size: 1 frame at a time
  3. Lower resolution: Start at 640x360, then upscale
  4. GGUF quantization: Q4_K_M uses 40% less memory

Prompting Guide

LTX-2 responds best to prompts that paint a complete picture. Following the official LTX-2 prompting guidelines will significantly improve your results.

Prompt Structure (4-8 sentences)

Write prompts as a single flowing paragraph using present tense. Include these elements:

  1. Establish the shot — Use cinematography terms (wide, medium, close-up, extreme close-up)
  2. Set the scene — Describe lighting, color palette, textures, atmosphere
  3. Describe the action — Natural sequence from beginning to end
  4. Define characters — Physical details with visual emotional cues (not abstract labels)
  5. Camera movement — Explicit instructions: “slow dolly in”, “handheld tracking”, “static shot”
  6. Describe audio — Ambient sound, dialogue in quotation marks, language/accent
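
The six elements above can be assembled mechanically into the single flowing paragraph LTX-2 expects. A sketch (the example sentences are my own, purely illustrative):

```python
# Assemble the six prompt elements into one flowing present-tense paragraph.

def build_prompt(shot, scene, action, characters, camera, audio) -> str:
    parts = [shot, scene, action, characters, camera, audio]
    # Normalize each element to end with exactly one period, then join.
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p)

prompt = build_prompt(
    shot="Medium close-up with shallow depth of field",
    scene="Warm tungsten light spills across a cluttered workbench",
    action="A watchmaker lowers a tiny gear into place with tweezers",
    characters="His weathered hands are steady, brow furrowed in concentration",
    camera="Slow dolly in toward the workbench",
    audio="Soft ticking and the faint rasp of metal on metal",
)
print(prompt)
```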

Example: Product Video Prompt

Before (vague):

A blender lid levitates above a pitcher. Cinematic product shot.

After (LTX-2 optimized):

Extreme close-up product shot with shallow depth of field. The frame centers 
on a sleek stainless steel blender pitcher resting on a matte black infinity 
surface. Soft box lighting from above creates a subtle highlight along the 
blender's curve. The lid begins to rise smoothly, weightlessly, as if lifted 
by an invisible force. It ascends straight upward, rotating imperceptibly, 
then descends with the same grace, settling back with surgical precision. 
Static locked-off shot. Photorealistic, commercial-grade aesthetic.

What Works Well

| Strength | Description |
|---|---|
| Cinematic compositions | Wide, medium, close-up with thoughtful lighting and depth of field |
| Physical emotional cues | Facial expressions, gestures, body language instead of “sad” or “happy” |
| Atmosphere | Fog, rain, golden-hour light, reflections, ambient textures |
| Explicit camera language | “slow dolly in”, “handheld tracking”, “pushes in”, “circles around” |
| Stylized aesthetics | Noir, film grain, painterly, fashion editorial, animation styles |

What to Avoid

| Don’t Use | Why |
|---|---|
| Abstract emotional labels | “sad”, “confused”, “angry” → use visual cues instead |
| Text and logos | Not reliable in current version |
| Complex physics | Chaotic motion can cause artifacts |
| Overloaded scenes | Too many characters/actions reduce clarity |
| Conflicting lighting | Mixed light sources confuse the model |

Camera Language Reference

| Term | Effect |
|---|---|
| Dolly in/out | Camera moves toward/away from subject |
| Pan | Camera rotates horizontally |
| Tilt | Camera rotates vertically |
| Track | Camera follows subject laterally |
| Push in | Smooth forward motion |
| Pull back | Smooth backward motion |
| Circles around | Orbital camera movement |
| Static/locked-off | No camera movement |
| Handheld | Slight natural wobble |
| Over-the-shoulder | Camera positioned behind another subject |

Audio Description

For videos with synchronized audio:

The reporter looks into the camera and speaks with an energetic announcer voice:
"Thank you, Sylvia. This morning, here in New Castle, Vermont... black gold has been found!"
Ambient sounds of construction equipment and distant chatter fill the background.

Tips:

  • Put spoken dialogue in quotation marks
  • Specify language and accent if needed
  • Describe ambient sounds separately
  • Match audio intensity to on-screen action
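
These tips can be captured in a tiny formatting helper (a sketch; the function name and structure are my own):

```python
# Format an audio description per the tips above: delivery notes first,
# spoken dialogue in quotation marks, ambient sound described separately.

def audio_line(speaker_desc: str, dialogue: str, ambience: str = "") -> str:
    line = f'{speaker_desc}: "{dialogue}"'
    if ambience:
        line += f" {ambience}"
    return line

print(audio_line(
    "The reporter speaks with an energetic announcer voice",
    "Black gold has been found!",
    "Ambient sounds of construction equipment fill the background.",
))
```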

Advanced Prompting Techniques

Six-Part Structured Prompt (4K Quality)

For best results, structure your prompt in clear layers:

| Layer | Purpose | Example |
|---|---|---|
| Scene Anchor | Location, time, atmosphere | “Abandoned rocket launch site at dusk, orange-red sunset clouds” |
| Subject + Action | Who/what + strong verb | “A silver drone skims low over the ground, scanning debris” |
| Camera + Lens | Movement, focal length, aperture | “Fast forward tracking shot, 24mm lens, f/1.8, ultra wide angle” |
| Visual Style | Color grading, film emulation | “High contrast, cool blue-green grading, Fujifilm Provia 100F texture” |
| Motion Cues | Speed, frame rate, shutter | “Subtle motion blur, 60fps feel, 180-degree shutter equivalent” |
| Guardrails | What to avoid | “No distortion, no blown highlights, no AI artifacts” |

Lens Language Reference

| Focal Length | Effect | Use When |
|---|---|---|
| 24mm wide | Environmental scale, sense of space | Establishing shots, landscapes |
| 50mm standard | Natural human-eye perspective | Documentary, interviews |
| 85mm portrait | Compression, intimacy | Character close-ups |
| 200mm telephoto | Isolates subject from background | Sports, wildlife |

Shutter Descriptions

| Term | Effect |
|---|---|
| 180-degree shutter | Classic cinematic motion blur |
| Natural motion blur | Realism in moving subjects |
| Fast shutter | Sharp, high-energy action feel |

Keywords for Smooth Motion

For 50fps fluidity:

  • Stable dolly push
  • Smooth gimbal stabilization
  • Tripod locked off
  • Constant speed pan
  • Natural motion blur
  • Fluid movement, controlled motion

Avoid at high frame rates:

  • Chaotic handheld movement (causes warping)
  • Shaky camera
  • Irregular motion
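
A simple lint pass over a prompt can catch these high-frame-rate pitfalls before you queue a generation. A sketch based on the keyword lists above:

```python
# Flag prompt phrases that tend to cause warping at high frame rates,
# per the "avoid" list, and check for stabilizing camera language.
UNSTABLE = ["chaotic handheld", "shaky camera", "irregular motion"]
STABLE_HINTS = ["gimbal", "dolly", "tripod", "locked off", "constant speed"]

def check_motion_keywords(prompt: str) -> list[str]:
    """Return warnings for a prompt intended for 50fps generation."""
    text = prompt.lower()
    warnings = [f"avoid: '{kw}'" for kw in UNSTABLE if kw in text]
    if not any(kw in text for kw in STABLE_HINTS):
        warnings.append("no stabilizing camera language found")
    return warnings

print(check_motion_keywords("Shaky camera follows the runner"))
print(check_motion_keywords("Stable dolly push toward the door"))  # -> []
```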

Long Take Strategy (15-20 Second Clips)

For maximum duration clips, treat the prompt like a mini-scene:

**Scene Heading:** Location and time of day
**Brief Description:** Overall vibe and atmosphere
**Blocking:** Sequence of actions + camera movements
**Dialogue/Cues:** Performance notes in parentheses

Example (15s long take):

Scene: A pilot's cockpit at sunset.
Blocking: Start macro shot of gloved hand on flight stick, metallic reflections 
catching dying sunlight. Camera slowly pulls back to medium shot, revealing 
clenched jaw and cold dashboard glow. Expression shifts from focus to grim 
determination. Camera continues dollying back, revealing tarmac behind—rusted 
fighter jets, scattered debris, orange-red sky.

Audio-Visual Sync Techniques

LTX-2 generates audio and video simultaneously. Tighten synchronization with:

Temporal Cueing:

  • “On the heavy drum beat” — Align action with musical rhythm
  • “At the 3-second mark” — Specify exact timing
  • “On the third bass hit” — Precise event timing

Action Regularity:

  • “Constant speed tracking shot” — Predictable camera for AI
  • “Rhythmic robotic arm oscillation” — Regular movement intervals
  • “Steady heartbeat pulse” — Consistent audio-visual pattern

Example:

A robotic arm precisely grabs a component on the bass hit, its metallic pincers 
opening and closing in perfect rhythm. The camera remains steady in close-up, 
while each grab produces a crisp metallic clank echoing through the sterile lab.

The 9 Prompting Rules (RunDiffusion)

These rules ensure your prompts translate into cinematic video:

1. Single Continuous Paragraph

No line breaks, lists, or fragmented thoughts. Write one flowing description.

A lone fisherman rows across a foggy lake before sunrise, the boat creaking 
softly as water laps at its sides. The camera glides overhead, tracking his 
slow progress. His lantern casts a warm circle of light, reflecting in ripples 
while reeds sway gently on the shoreline.

2. Present-Tense Action Verbs

Use “walks”, “tilts”, “flickers” — not “walked” or “is walking”.

A young boy runs barefoot across a wet stone courtyard as the first raindrops 
begin to fall. The camera tracks behind him at low angle, catching the splashes 
beneath his feet. He turns sharply, arms outstretched for balance.

3. Explicit Camera Behavior

Define perspective, angle, movement, and speed.

The camera begins in a wide shot from across the street, then slowly pushes 
forward at shoulder height as pedestrians blur in the foreground. A passing 
bicycle crosses the frame just before the shot settles into a close-up.

4. Precise Physical Details

Use small, measurable movements. Describe what the camera sees.

Her eyebrows lift approximately two millimeters as she hears a creak behind her, 
and the blade pauses mid-air. The camera holds in medium close-up with shallow 
depth of field, capturing the tension in her wrist.

5. Atmospheric Environment

Include lighting, air, textures, sound, ambient elements.

Pale blue light from an overcast sky diffuses across the scene, softening the 
edges of distant waves and casting no sharp shadows. A cool breeze ripples 
through her hair while seagulls fly overhead.

6. Smooth Temporal Flow

Use connectors: “as”, “then”, “while”.

As the camera begins in a stationary wide shot, a tall alien figure steps forward 
through the haze. Then the camera glides sideways, following its stride as it 
moves across the deck toward a glowing console.

7. Genre-Specific Language

Match tone and vocabulary to your genre.

Sci-Fi:

A maintenance drone glides through a long tunnel inside a deep space cargo vessel, 
its circular frame rotating gently as it shines beams of light on the walls. 
A soft mechanical hum blends with distant low thrum of the ship's reactor core.

8. Character Specificity

Only include observable details: age, ethnicity, clothing, posture.

A middle-aged South Asian man wearing a long tan coat and dark scarf steps into 
a narrow alley lit by neon signage. The camera tilts up from his shoes as rain 
hits the cobblestones, revealing his profile in close-up.

9. Show, Don’t Tell Emotion

Never describe feelings. Describe body reactions.

| Don’t Write | Do Write |
|---|---|
| “He is nervous” | “His fingers tighten, his breathing slows as he steadies himself” |
| “She is sad” | “A single tear trails down her cheek. Her shoulders drop” |
| “He is confident” | “Back straight, gestures controlled and deliberate” |
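
Rule 9 lends itself to a quick automated check: scan a prompt for abstract emotion labels before submitting it. A sketch (the word list is illustrative, not exhaustive):

```python
# Flag abstract emotion labels that should be replaced with physical cues.
EMOTION_LABELS = {"sad", "happy", "nervous", "angry", "confused", "confident"}

def flag_emotion_labels(prompt: str) -> list[str]:
    """Return any abstract emotion words found in the prompt."""
    words = {w.strip(".,!?\"'").lower() for w in prompt.split()}
    return sorted(words & EMOTION_LABELS)

print(flag_emotion_labels("He is nervous and confused."))          # -> ['confused', 'nervous']
print(flag_emotion_labels("His fingers tighten as he steadies."))  # -> []
```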

Match Prompt to Input Type

| Input Type | Strategy |
|---|---|
| Image to Video | Describe the image exactly, then build motion on top |
| Text to Video | Define everything from scratch |

Best Practices

Negative Prompts

Always include negative prompts to prevent artifacts:

blurry, low quality, distorted, watermark, text, 
deformed, ugly, bad anatomy, extra limbs, flickering

Inference Settings

| Setting | Quality | Speed | VRAM |
|---|---|---|---|
| Steps | 20-30 | 10-15 | Same |
| CFG | 2.5-3.5 | 1.5-2.0 | Same |
| Resolution | 1280x720 | 640x360 | Higher res = more VRAM |
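
The table's trade-off can be encoded as presets. The values below pick one point inside each range from the table and are otherwise my own choice:

```python
# Quality vs. speed presets derived from the settings table above.
# Resolution is (width, height).
PRESETS = {
    "quality": {"steps": 30, "cfg": 3.0, "resolution": (1280, 720)},
    "speed":   {"steps": 12, "cfg": 1.8, "resolution": (640, 360)},
}

def get_preset(name: str) -> dict:
    """Return a copy of the named preset so callers can tweak it safely."""
    return dict(PRESETS[name])

p = get_preset("speed")
print(p["steps"], p["resolution"])  # -> 12 (640, 360)
```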

Camera Movement LoRAs

Add cinematic camera movements with specialized LoRAs:

  • Dolly: Smooth forward/backward motion
  • Jib: Vertical camera movement
  • Tracking: Lateral tracking shots

Common Issues

OOM on CLIP Encoding

Problem: Out of memory during text encoding.

Solution: Use smaller CLIP model or reduce batch size:

# In workflow, change:
clip_name1: "gemma-3-12b-it-abliterated.q4_k_m.gguf"
# To smaller model if needed

Audio Sync Issues

Problem: Audio doesn’t match video timing.

Solution: Ensure audio VAE is loaded with correct device settings:

audio_vae:
  device: "cpu"  # Offload to CPU
  weight_dtype: "bf16"

Artifacts in Output

Problem: Visual flickering or distortion.

Solutions:

  • Increase steps (20 → 30)
  • Lower CFG scale (3.5 → 2.5)
  • Use distilled model for cleaner output

LTX-2 represents a significant advancement in AI video generation. With synchronized audio, native 4K resolution, and ComfyUI integration, it’s now accessible to anyone with a capable GPU. Start with the distilled workflows for quick results, then experiment with the full pipeline for production-quality output.

Anthony Lattanzio

Tech Enthusiast & Builder

I'm a tech enthusiast who loves building things with hardware and software. By night, I run a homelab that's grown way beyond what any reasonable person needs. Check out about me for more.
