Generating AI Video with LTX-2 in ComfyUI

Learn how to generate stunning AI videos with synchronized audio using LTX-2 in ComfyUI. Complete setup guide, workflows, and best practices.

12 min read

Tags: ai, comfyui, video-generation, ltx-2, tutorial

AI video generation has taken a massive leap forward with LTX-2, Lightricks’ second-generation audio-video foundation model. Unlike its predecessor, LTX-2 generates synchronized audio and video together in a single pass—the first DiT-based model to achieve this. This guide walks you through setting up and using LTX-2 in ComfyUI.

What Makes LTX-2 Special

LTX-2 is a 19-billion-parameter model split into two streams:

  • 14B video stream for visual generation
  • 5B audio stream for synchronized sound

This dual-stream architecture enables native 4K resolution at up to 50 FPS with videos up to 10 seconds long. The model produces cinematic-quality output without requiring post-processing audio sync.

Key Capabilities

| Feature | Details |
|---|---|
| Resolution | Native 4K, up to 50 FPS |
| Duration | 10 seconds continuous |
| Audio | Synchronized generation |
| Architecture | Diffusion Transformer (DiT) |
| Integration | Built into ComfyUI core |

Setup Guide

Prerequisites

Before installing LTX-2, ensure your system meets these requirements:

Minimum:

  • GPU: NVIDIA RTX 4090 or equivalent
  • VRAM: 24GB
  • RAM: 32GB
  • Storage: 100GB SSD

Recommended:

  • GPU: NVIDIA A100 or H100
  • VRAM: 40GB+
  • RAM: 64GB
  • Storage: 200GB+ SSD
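
Before downloading anything, it can help to sanity-check your hardware against the numbers above. A minimal sketch; values are passed in by hand rather than probed from the system, so it runs anywhere:

```python
# Quick pre-install sanity check against the minimum specs above.
# Illustrative only: supply your own system's numbers.

MINIMUM = {"vram_gb": 24, "ram_gb": 32, "storage_gb": 100}

def meets_minimum(vram_gb: float, ram_gb: float, storage_gb: float) -> list[str]:
    """Return the list of specs that fall short of the minimum requirements."""
    have = {"vram_gb": vram_gb, "ram_gb": ram_gb, "storage_gb": storage_gb}
    return [spec for spec, needed in MINIMUM.items() if have[spec] < needed]

# An RTX 4090 system: 24GB VRAM, 64GB RAM, 500GB free SSD
print(meets_minimum(24, 64, 500))   # -> [] (nothing missing)
print(meets_minimum(16, 32, 100))   # -> ['vram_gb']
```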

Installation

  1. Install ComfyUI if you haven’t already:
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
  2. Download LTX-2 models from Hugging Face:
# Create models directory
mkdir -p models/LTX2

# Download required models
# UNET (19B, quantized for lower VRAM)
huggingface-cli download Lightricks/LTX-Video LTX2/ltx-2-19b-dev_Q4_K_M.gguf --local-dir models/LTX2

# VAE (video)
huggingface-cli download Lightricks/LTX-Video LTX2/LTX2_video_vae_bf16.safetensors --local-dir models/LTX2

# VAE (audio)
huggingface-cli download Lightricks/LTX-Video LTX2/LTX2_audio_vae_bf16.safetensors --local-dir models/LTX2

# CLIP text encoders
huggingface-cli download Lightricks/LTX-Video ltx-2-19b-embeddings_connector_dev_bf16.safetensors --local-dir models/LTX2
  3. Install additional nodes (optional but recommended):
cd custom_nodes
git clone https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite
git clone https://github.com/Kosinkadink/ComfyUI-AnimateDiff-Evolved
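
After the downloads finish, a quick check that every file actually landed in models/LTX2 can save a failed first run. A sketch using the filenames from the download commands above:

```python
from pathlib import Path

# Filenames from the download commands above; adjust this list if you
# pulled different quantizations.
EXPECTED = [
    "ltx-2-19b-dev_Q4_K_M.gguf",
    "LTX2_video_vae_bf16.safetensors",
    "LTX2_audio_vae_bf16.safetensors",
    "ltx-2-19b-embeddings_connector_dev_bf16.safetensors",
]

def missing_models(model_dir: str) -> list[str]:
    """Return the expected model files not present in model_dir."""
    present = {p.name for p in Path(model_dir).glob("*")}
    return [name for name in EXPECTED if name not in present]

if __name__ == "__main__":
    for name in missing_models("models/LTX2"):
        print(f"missing: {name}")
```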

Memory Optimization

If you’re running on limited VRAM, use these optimizations:

  • Quantized models: Q4_K_M GGUF models reduce VRAM by 60%
  • Tiled VAE decode: Process high-res frames in tiles
  • Sage Attention: Memory-efficient attention mechanism
  • CPU offload: Move audio VAE to CPU with device: "cpu"
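
To illustrate the tiled-decode idea, here is a conceptual sketch of how a high-res frame can be partitioned into overlapping tiles. The real ComfyUI implementation also blends the overlapping regions when merging, and the tile and overlap sizes here are illustrative:

```python
def tile_coords(width, height, tile=512, overlap=64):
    """Yield (x, y, w, h) tiles covering a frame, overlapping to hide seams.

    Conceptual sketch of how tiled VAE decoding partitions a frame so that
    only one tile needs to be resident in VRAM at a time.
    """
    step = tile - overlap
    xs = range(0, max(width - overlap, 1), step)
    ys = range(0, max(height - overlap, 1), step)
    for y in ys:
        for x in xs:
            # Clamp the last tile in each row/column to the frame edge.
            yield (x, y, min(tile, width - x), min(tile, height - y))

# A 1280x720 frame decodes as a 3x2 grid of 512px tiles:
print(len(list(tile_coords(1280, 720))))  # -> 6
```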

Workflow Types

LTX-2 supports several generation pipelines in ComfyUI:

Text-to-Video (Full)

The highest-quality workflow; it provides fine control over every aspect of generation:

EmptyLatent → LTXVSampler → LTXVideoDecode → VHSVideoCombine

Use when: Quality matters more than speed.

Text-to-Video (Distilled)

A faster version using distilled models:

EmptyLatent → LTXVDistilledSampler → LTXVideoDecode → VHSVideoCombine

Use when: Quick iterations, lower VRAM.

Image-to-Video

Animate static images with motion:

LoadImage → LTXVImageEncode → LTXVSampler → LTXVideoDecode → VHSVideoCombine

Use when: Creating animations from artwork or photos.

I2V Workflow File

The ltx2_I2V_api.json workflow provides a complete Image-to-Video pipeline with:

| Node | Purpose |
|---|---|
| LoadImage | Input your source image |
| LTXVImgToVideoInplace | Convert image to video latent |
| EmptyLTXVLatentVideo | Initialize video frames |
| LTXVAudioVAEDecode | Generate synchronized audio |
| VHS_VideoCombine | Output combined video |

Using I2V

  1. Load the workflow in ComfyUI (drag and drop or Load Default)
  2. Select your input image in the LoadImage node
  3. Set your prompt in the text encoder (describe the motion you want)
  4. Adjust frame count in EmptyLTXVLatentVideo (24 frames = 1 second)
  5. Queue prompt and wait for generation
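
For step 4, a tiny helper built on the article's rule of thumb (24 frames per second) converts a target duration into the frame count to enter in EmptyLTXVLatentVideo:

```python
# The I2V workflow's rule of thumb: 24 frames = 1 second.
FPS = 24

def seconds_to_frames(seconds: float, fps: int = FPS) -> int:
    """Frame count to enter in EmptyLTXVLatentVideo for a target duration."""
    return round(seconds * fps)

print(seconds_to_frames(5))    # -> 120 frames for a 5-second clip
print(seconds_to_frames(2.5))  # -> 60
```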

I2V Prompt Tips

When using Image-to-Video, describe what you want to happen after the initial frame:

The camera slowly pushes in as the subject turns to face the lens. 
Subtle movement in the hair suggests a gentle breeze. The background 
remains static while the subject's expression softens.

Key difference from T2V: The first frame is already defined by your input image. Your prompt describes motion and camera behavior FROM that frame.

Two-Stage Pipeline

Generate low-res first, then upscale:

Stage 1: Low-res generation (640x360)
Stage 2: LTXVLatentUpscaler → Decode at full resolution

Hardware Requirements in Detail

VRAM Breakdown

| Component | VRAM Usage |
|---|---|
| UNET (Q4_K_M) | ~12GB |
| Video VAE | ~4GB |
| Audio VAE | ~2GB |
| CLIP Encoders | ~2GB |
| Latents | ~2-4GB |
| **Total** | **24-28GB** |
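
Summing the component figures gives roughly 22-24GB; the stated 24-28GB total presumably also covers framework overhead and activations. As a quick sketch:

```python
# Lower/upper bounds (GB) for each component from the VRAM breakdown above.
components = {
    "unet_q4_k_m": (12, 12),
    "video_vae":   (4, 4),
    "audio_vae":   (2, 2),
    "clip":        (2, 2),
    "latents":     (2, 4),
}

low = sum(lo for lo, hi in components.values())
high = sum(hi for lo, hi in components.values())
print(f"{low}-{high} GB before framework overhead")  # -> 22-24 GB
```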

Low-VRAM Workarounds

  1. Use CPU for audio VAE: Reduces VRAM by 2GB
  2. Reduce batch size: 1 frame at a time
  3. Lower resolution: Start at 640x360, then upscale
  4. GGUF quantization: Q4_K_M uses 40% less memory

Prompting Guide

LTX-2 responds best to prompts that paint a complete picture. Following the official LTX-2 prompting guidelines will significantly improve your results.

Prompt Structure (4-8 sentences)

Write prompts as a single flowing paragraph using present tense. Include these elements:

  1. Establish the shot — Use cinematography terms (wide, medium, close-up, extreme close-up)
  2. Set the scene — Describe lighting, color palette, textures, atmosphere
  3. Describe the action — Natural sequence from beginning to end
  4. Define characters — Physical details with visual emotional cues (not abstract labels)
  5. Camera movement — Explicit instructions: “slow dolly in”, “handheld tracking”, “static shot”
  6. Describe audio — Ambient sound, dialogue in quotation marks, language/accent
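
The six elements above can be assembled mechanically into the single flowing paragraph LTX-2 expects. A sketch (the example sentences are my own, purely illustrative):

```python
# Assemble the six prompt elements into one flowing present-tense paragraph.

def build_prompt(shot, scene, action, characters, camera, audio) -> str:
    parts = [shot, scene, action, characters, camera, audio]
    # Normalize each element to end with exactly one period, then join.
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p)

prompt = build_prompt(
    shot="Medium close-up with shallow depth of field",
    scene="Warm tungsten light spills across a cluttered workbench",
    action="A watchmaker lowers a tiny gear into place with tweezers",
    characters="His weathered hands are steady, brow furrowed in concentration",
    camera="Slow dolly in toward the workbench",
    audio="Soft ticking and the faint rasp of metal on metal",
)
print(prompt)
```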

Example: Product Video Prompt

Before (vague):

A blender lid levitates above a pitcher. Cinematic product shot.

After (LTX-2 optimized):

Extreme close-up product shot with shallow depth of field. The frame centers 
on a sleek stainless steel blender pitcher resting on a matte black infinity 
surface. Soft box lighting from above creates a subtle highlight along the 
blender's curve. The lid begins to rise smoothly, weightlessly, as if lifted 
by an invisible force. It ascends straight upward, rotating imperceptibly, 
then descends with the same grace, settling back with surgical precision. 
Static locked-off shot. Photorealistic, commercial-grade aesthetic.

What Works Well

| Strength | Description |
|---|---|
| Cinematic compositions | Wide, medium, close-up with thoughtful lighting and depth of field |
| Physical emotional cues | Facial expressions, gestures, body language instead of “sad” or “happy” |
| Atmosphere | Fog, rain, golden-hour light, reflections, ambient textures |
| Explicit camera language | “slow dolly in”, “handheld tracking”, “pushes in”, “circles around” |
| Stylized aesthetics | Noir, film grain, painterly, fashion editorial, animation styles |

What to Avoid

| Don’t Use | Why |
|---|---|
| Abstract emotional labels | “sad”, “confused”, “angry” → use visual cues instead |
| Text and logos | Not reliable in current version |
| Complex physics | Chaotic motion can cause artifacts |
| Overloaded scenes | Too many characters/actions reduce clarity |
| Conflicting lighting | Mixed light sources confuse the model |

Camera Language Reference

| Term | Effect |
|---|---|
| Dolly in/out | Camera moves toward/away from subject |
| Pan | Camera rotates horizontally |
| Tilt | Camera rotates vertically |
| Track | Camera follows subject laterally |
| Push in | Smooth forward motion |
| Pull back | Smooth backward motion |
| Circles around | Orbital camera movement |
| Static/locked-off | No camera movement |
| Handheld | Slight natural wobble |
| Over-the-shoulder | Camera positioned behind another subject |

Audio Description

For videos with synchronized audio:

The reporter looks into the camera and speaks with an energetic announcer voice:
"Thank you, Sylvia. This morning, here in New Castle, Vermont... black gold has been found!"
Ambient sounds of construction equipment and distant chatter fill the background.

Tips:

  • Put spoken dialogue in quotation marks
  • Specify language and accent if needed
  • Describe ambient sounds separately
  • Match audio intensity to on-screen action
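
These tips can be captured in a tiny formatting helper (a sketch; the function name and structure are my own):

```python
# Format an audio description per the tips above: delivery notes first,
# spoken dialogue in quotation marks, ambient sound described separately.

def audio_line(speaker_desc: str, dialogue: str, ambience: str = "") -> str:
    line = f'{speaker_desc}: "{dialogue}"'
    if ambience:
        line += f" {ambience}"
    return line

print(audio_line(
    "The reporter speaks with an energetic announcer voice",
    "Black gold has been found!",
    "Ambient sounds of construction equipment fill the background.",
))
```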

Advanced Prompting Techniques

Six-Part Structured Prompt (4K Quality)

For best results, structure your prompt in clear layers:

| Layer | Purpose | Example |
|---|---|---|
| Scene Anchor | Location, time, atmosphere | “Abandoned rocket launch site at dusk, orange-red sunset clouds” |
| Subject + Action | Who/what + strong verb | “A silver drone skims low over the ground, scanning debris” |
| Camera + Lens | Movement, focal length, aperture | “Fast forward tracking shot, 24mm lens, f/1.8, ultra wide angle” |
| Visual Style | Color grading, film emulation | “High contrast, cool blue-green grading, Fujifilm Provia 100F texture” |
| Motion Cues | Speed, frame rate, shutter | “Subtle motion blur, 60fps feel, 180-degree shutter equivalent” |
| Guardrails | What to avoid | “No distortion, no blown highlights, no AI artifacts” |

Lens Language Reference

| Focal Length | Effect | Use When |
|---|---|---|
| 24mm wide | Environmental scale, sense of space | Establishing shots, landscapes |
| 50mm standard | Natural human-eye perspective | Documentary, interviews |
| 85mm portrait | Compression, intimacy | Character close-ups |
| 200mm telephoto | Isolates subject from background | Sports, wildlife |

Shutter Descriptions

| Term | Effect |
|---|---|
| 180-degree shutter | Classic cinematic motion blur |
| Natural motion blur | Realism in moving subjects |
| Fast shutter | Sharp, high-energy action feel |

Keywords for Smooth Motion

For 50fps fluidity:

  • Stable dolly push
  • Smooth gimbal stabilization
  • Tripod locked off
  • Constant speed pan
  • Natural motion blur
  • Fluid movement, controlled motion

Avoid at high frame rates:

  • Chaotic handheld movement (causes warping)
  • Shaky camera
  • Irregular motion
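
A simple lint pass over a prompt can catch these high-frame-rate pitfalls before you queue a generation. A sketch based on the keyword lists above:

```python
# Flag prompt phrases that tend to cause warping at high frame rates,
# per the "avoid" list, and check for stabilizing camera language.
UNSTABLE = ["chaotic handheld", "shaky camera", "irregular motion"]
STABLE_HINTS = ["gimbal", "dolly", "tripod", "locked off", "constant speed"]

def check_motion_keywords(prompt: str) -> list[str]:
    """Return warnings for a prompt intended for 50fps generation."""
    text = prompt.lower()
    warnings = [f"avoid: '{kw}'" for kw in UNSTABLE if kw in text]
    if not any(kw in text for kw in STABLE_HINTS):
        warnings.append("no stabilizing camera language found")
    return warnings

print(check_motion_keywords("Shaky camera follows the runner"))
print(check_motion_keywords("Stable dolly push toward the door"))  # -> []
```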

Long Take Strategy (15-20 Second Clips)

For maximum duration clips, treat the prompt like a mini-scene:

**Scene Heading:** Location and time of day
**Brief Description:** Overall vibe and atmosphere
**Blocking:** Sequence of actions + camera movements
**Dialogue/Cues:** Performance notes in parentheses

Example (15s long take):

Scene: A pilot's cockpit at sunset.
Blocking: Start macro shot of gloved hand on flight stick, metallic reflections 
catching dying sunlight. Camera slowly pulls back to medium shot, revealing 
clenched jaw and cold dashboard glow. Expression shifts from focus to grim 
determination. Camera continues dollying back, revealing tarmac behind—rusted 
fighter jets, scattered debris, orange-red sky.

Audio-Visual Sync Techniques

LTX-2 generates audio and video simultaneously. Tighten synchronization with:

Temporal Cueing:

  • “On the heavy drum beat” — Align action with musical rhythm
  • “At the 3-second mark” — Specify exact timing
  • “On the third bass hit” — Precise event timing

Action Regularity:

  • “Constant speed tracking shot” — Predictable camera for AI
  • “Rhythmic robotic arm oscillation” — Regular movement intervals
  • “Steady heartbeat pulse” — Consistent audio-visual pattern

Example:

A robotic arm precisely grabs a component on the bass hit, its metallic pincers 
opening and closing in perfect rhythm. The camera remains steady in close-up, 
while each grab produces a crisp metallic clank echoing through the sterile lab.

The 9 Prompting Rules (RunDiffusion)

These rules ensure your prompts translate into cinematic video:

1. Single Continuous Paragraph

No line breaks, lists, or fragmented thoughts. Write one flowing description.

A lone fisherman rows across a foggy lake before sunrise, the boat creaking 
softly as water laps at its sides. The camera glides overhead, tracking his 
slow progress. His lantern casts a warm circle of light, reflecting in ripples 
while reeds sway gently on the shoreline.

2. Present-Tense Action Verbs

Use “walks”, “tilts”, “flickers” — not “walked” or “is walking”.

A young boy runs barefoot across a wet stone courtyard as the first raindrops 
begin to fall. The camera tracks behind him at low angle, catching the splashes 
beneath his feet. He turns sharply, arms outstretched for balance.

3. Explicit Camera Behavior

Define perspective, angle, movement, and speed.

The camera begins in a wide shot from across the street, then slowly pushes 
forward at shoulder height as pedestrians blur in the foreground. A passing 
bicycle crosses the frame just before the shot settles into a close-up.

4. Precise Physical Details

Use small, measurable movements. Describe what the camera sees.

Her eyebrows lift approximately two millimeters as she hears a creak behind her, 
and the blade pauses mid-air. The camera holds in medium close-up with shallow 
depth of field, capturing the tension in her wrist.

5. Atmospheric Environment

Include lighting, air, textures, sound, ambient elements.

Pale blue light from an overcast sky diffuses across the scene, softening the 
edges of distant waves and casting no sharp shadows. A cool breeze ripples 
through her hair while seagulls fly overhead.

6. Smooth Temporal Flow

Use connectors: “as”, “then”, “while”.

As the camera begins in a stationary wide shot, a tall alien figure steps forward 
through the haze. Then the camera glides sideways, following its stride as it 
moves across the deck toward a glowing console.

7. Genre-Specific Language

Match tone and vocabulary to your genre.

Sci-Fi:

A maintenance drone glides through a long tunnel inside a deep space cargo vessel, 
its circular frame rotating gently as it shines beams of light on the walls. 
A soft mechanical hum blends with distant low thrum of the ship's reactor core.

8. Character Specificity

Only include observable details: age, ethnicity, clothing, posture.

A middle-aged South Asian man wearing a long tan coat and dark scarf steps into 
a narrow alley lit by neon signage. The camera tilts up from his shoes as rain 
hits the cobblestones, revealing his profile in close-up.

9. Show, Don’t Tell Emotion

Never describe feelings. Describe body reactions.

| Don’t Write | Do Write |
|---|---|
| “He is nervous” | “His fingers tighten, his breathing slows as he steadies himself” |
| “She is sad” | “A single tear trails down her cheek. Her shoulders drop” |
| “He is confident” | “Back straight, gestures controlled and deliberate” |
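
Rule 9 lends itself to a quick automated check: scan a prompt for abstract emotion labels before submitting it. A sketch (the word list is illustrative, not exhaustive):

```python
# Flag abstract emotion labels that should be replaced with physical cues.
EMOTION_LABELS = {"sad", "happy", "nervous", "angry", "confused", "confident"}

def flag_emotion_labels(prompt: str) -> list[str]:
    """Return any abstract emotion words found in the prompt."""
    words = {w.strip(".,!?\"'").lower() for w in prompt.split()}
    return sorted(words & EMOTION_LABELS)

print(flag_emotion_labels("He is nervous and confused."))          # -> ['confused', 'nervous']
print(flag_emotion_labels("His fingers tighten as he steadies."))  # -> []
```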

Match Prompt to Input Type

| Input Type | Strategy |
|---|---|
| Image to Video | Describe the image exactly, then build motion on top |
| Text to Video | Define everything from scratch |

Best Practices

Negative Prompts

Always include negative prompts to prevent artifacts:

blurry, low quality, distorted, watermark, text, 
deformed, ugly, bad anatomy, extra limbs, flickering

Inference Settings

| Setting | Quality | Speed | VRAM |
|---|---|---|---|
| Steps | 20-30 | 10-15 | Same |
| CFG | 2.5-3.5 | 1.5-2.0 | Same |
| Resolution | 1280x720 | 640x360 | Higher res = more VRAM |
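
The table's trade-off can be encoded as presets. The values below pick one point inside each range from the table and are otherwise my own choice:

```python
# Quality vs. speed presets derived from the settings table above.
# Resolution is (width, height).
PRESETS = {
    "quality": {"steps": 30, "cfg": 3.0, "resolution": (1280, 720)},
    "speed":   {"steps": 12, "cfg": 1.8, "resolution": (640, 360)},
}

def get_preset(name: str) -> dict:
    """Return a copy of the named preset so callers can tweak it safely."""
    return dict(PRESETS[name])

p = get_preset("speed")
print(p["steps"], p["resolution"])  # -> 12 (640, 360)
```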

Camera Movement LoRAs

Add cinematic camera movements with specialized LoRAs:

  • Dolly: Smooth forward/backward motion
  • Jib: Vertical camera movement
  • Tracking: Lateral tracking shots

Common Issues

OOM on CLIP Encoding

Problem: Out of memory during text encoding.

Solution: Use smaller CLIP model or reduce batch size:

# In workflow, change:
clip_name1: "gemma-3-12b-it-abliterated.q4_k_m.gguf"
# To smaller model if needed

Audio Sync Issues

Problem: Audio doesn’t match video timing.

Solution: Ensure audio VAE is loaded with correct device settings:

audio_vae:
  device: "cpu"  # Offload to CPU
  weight_dtype: "bf16"

Artifacts in Output

Problem: Visual flickering or distortion.

Solutions:

  • Increase steps (20 → 30)
  • Lower CFG scale (3.5 → 2.5)
  • Use distilled model for cleaner output

LTX-2 represents a significant advancement in AI video generation. With synchronized audio, native 4K resolution, and ComfyUI integration, it’s now accessible to anyone with a capable GPU. Start with the distilled workflows for quick results, then experiment with the full pipeline for production-quality output.

Anthony Lattanzio

Tech Enthusiast & Builder

I'm a tech enthusiast who loves building things with hardware and software. By night, I run a homelab that's grown way beyond what any reasonable person needs. Check out about me for more.
