    Title: LTX-2 Lip Sync Workflow - v1.0

    Description:

    LTX-2 Lip Sync Workflow is a ComfyUI workflow for audio-driven lip sync video generation, talking character animation, and image-to-video portrait performance using LTX-2. Instead of producing only a silent motion clip from an image, it brings audio into the generation process so the video latent and audio latent are generated together. This makes it suitable for short speaking videos, AI presenters, dialogue clips, digital human previews, character voice performances, and social media talking-head content.

    The workflow is built around the LTX-2 19B Dev FP8 checkpoint, using both the main video model and the dedicated LTX audio VAE pipeline. The audio file is encoded into an audio latent, then combined with the video latent through an audio-video latent workflow. This design allows the model to use the input audio as part of the generation condition, instead of treating the audio as something added after the video is finished. The result is a more direct audio-to-mouth-motion relationship, which is important for lip sync, speech rhythm, facial timing, and natural talking performance.

    The core logic of the workflow is image + audio to lip-synced video. You provide a source image as the visual identity reference and an audio file as the speech or singing reference. The image is used to initialize the character appearance and video layout, while the audio latent guides the speaking rhythm. The workflow then generates a video where the character can appear to talk along with the provided audio.

    A key part of this workflow is the LTXVAudioVAEEncode stage. The input audio is processed by the LTX audio VAE and converted into an audio latent. This audio latent is then passed into the later video generation stage through LTXVConcatAVLatent, where it is combined with the video latent. After sampling, LTXVSeparateAVLatent is used to separate the final video latent from the audio-video latent structure. This gives the workflow a clear audio-video pipeline: load audio, encode audio, combine audio with video latent, sample, separate video latent, then decode or upscale the final result.
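
    For readers who find the node chain easier to follow as code, here is a minimal, hypothetical sketch of that ordering in plain Python. The comments name the actual ComfyUI nodes described above; the function names and bodies are stand-in stubs for illustration only, not a real ComfyUI API.

    ```python
    # Hypothetical stand-ins for the ComfyUI nodes named in the description.
    # This only illustrates the ordering of the audio-video latent pipeline;
    # it does not execute any ComfyUI code.

    def audio_vae_encode(audio):                        # LTXVAudioVAEEncode
        return {"audio_latent": audio}

    def concat_av_latent(video_latent, audio_latent):   # LTXVConcatAVLatent
        return {"video": video_latent, "audio": audio_latent}

    def sample(av_latent):                               # SamplerCustomAdvanced + CFGGuider
        return av_latent                                 # sampling runs on the joint latent

    def separate_av_latent(av_latent):                   # LTXVSeparateAVLatent
        return av_latent["video"]                        # keep only the video part

    def lip_sync_pipeline(video_latent, audio):
        audio_latent = audio_vae_encode(audio)
        av_latent = concat_av_latent(video_latent, audio_latent)
        sampled = sample(av_latent)
        return separate_av_latent(sampled)               # then decode or latent-upscale
    ```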

    The workflow also uses image-to-video logic through LTXVImgToVideoInplace. This helps preserve the source image identity and composition while allowing the generated frames to move. For portrait images, this is especially useful because the face, clothing, background, and general framing can remain close to the original image while the mouth, facial expression, and subtle head motion are animated according to the audio.

    The workflow includes an EmptyLTXVLatentVideo stage for setting the base video latent dimensions and frame length. In the included setup, the frame-rate logic is based on 24 fps. This matters because lip sync quality depends heavily on matching the audio duration to the correct number of frames. For example, a 10-second clip at 24 fps usually needs 24 × 10 + 1 = 241 frames. If the frame count is wrong, the audio and mouth movement may drift or feel off-sync.
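
    The frame-count rule above can be written as a small helper, independent of ComfyUI, to avoid arithmetic slips when changing the clip length:

    ```python
    def frame_count(duration_seconds: float, fps: int = 24) -> int:
        """Frames needed for an LTX-2 clip: fps * seconds + 1."""
        return int(round(fps * duration_seconds)) + 1

    print(frame_count(10))   # 241 frames for a 10-second clip at 24 fps
    print(frame_count(5))    # 121 frames
    ```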

    The workflow also includes sampler control through SamplerCustomAdvanced, ManualSigmas, KSamplerSelect, CFGGuider, and RandomNoise. These nodes control how the latent video is generated, how strongly the prompt affects the result, and how the noise schedule behaves. The workflow is not just a basic video generation template; it is structured to support audio-conditioned motion, image identity preservation, and controlled sampling.
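
    As a point of reference for the ManualSigmas stage: a manual sigma schedule is simply a decreasing sequence of noise levels that ends at 0. The values below are purely illustrative and are not the schedule shipped with this workflow:

    ```python
    # Illustrative only: a strictly decreasing noise schedule ending at 0.0.
    sigmas = [1.0, 0.75, 0.55, 0.38, 0.24, 0.13, 0.05, 0.0]

    assert all(a > b for a, b in zip(sigmas, sigmas[1:])), "sigmas must decrease"
    assert sigmas[-1] == 0.0, "final sigma should be 0 so the last step is fully denoised"
    ```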

    Another important part is the LTXVLatentUpsampler stage. This allows the workflow to upscale the latent video after the first generation pass. The purpose is to improve output quality while keeping the initial motion and lip sync result. For faster previews, the upscale stage can be bypassed so users can test image, seed, prompt, and audio timing more quickly. After finding a good seed and prompt combination, the upscale stage can be enabled again for a cleaner final output.

    This workflow is suitable for AI creators who want to turn a still portrait into a speaking or singing video. It can be used for digital human demos, AI character narration, short-form video avatars, dialogue previews, virtual host content, product explanation clips, tutorial presenters, anime-style talking characters, realistic portrait animation, and creative voice-driven character tests.

    Main features:

    - LTX-2 audio-driven lip sync workflow

    - Built around LTX-2 19B Dev FP8

    - Image + audio to talking video generation

    - Audio VAE encoding for speech-driven motion

    - Audio latent and video latent combination

    - LTXVConcatAVLatent and LTXVSeparateAVLatent workflow

    - Image-to-video identity preservation

    - 24 fps video conditioning logic

    - Manual frame control based on audio duration

    - SamplerCustomAdvanced generation pipeline

    - ManualSigmas and CFG guidance control

    - Optional latent upscaling for final quality improvement

    - Suitable for portrait animation, digital humans, and character dialogue

    - Good for testing speech, singing, narration, and AI presenter workflows

    Recommended use cases:

    AI talking head video, digital human demo, lip sync portrait animation, audio-driven character performance, virtual presenter video, product explanation avatar, short-form social media narration, anime character talking video, realistic portrait speech animation, singing character tests, voiceover-driven video creation, dialogue scene preview, ComfyUI audio-video workflow testing, and Civitai showcase examples.

    Suggested workflow:

    Start by preparing a clean source image. A front-facing or slightly angled portrait usually works best. The face should be visible, the mouth area should not be blocked, and the image should have enough resolution for facial detail. Avoid images with extreme face angles, heavy occlusion, tiny faces, very strong motion blur, or low-quality compression.

    Next, prepare the audio file. MP3 can work, but clean audio usually gives better results. Try to use a voice clip with clear speech, limited background noise, and stable volume. If the audio contains music, echo, overlapping voices, or heavy noise, the mouth movement may become less reliable. For a first test, use a short 5-second clip before trying longer videos.

    Set the audio duration and frame count carefully. Since the workflow uses 24 fps logic, the frame count should follow the 24 × seconds + 1 rule: a 5-second clip needs 121 frames, a 10-second clip needs 241 frames, and a 15-second clip needs 361 frames. If the frame length is too short or too long, the final video may not match the audio timing.
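
    To keep the audio and frame count from drifting apart, you can derive the frame count from the actual clip duration instead of typing it by hand. A minimal sketch, assuming a WAV input (for MP3 you would first need a decoder such as ffmpeg or pydub to obtain the duration); the file path is a placeholder:

    ```python
    import wave

    def frames_for_audio(path: str, fps: int = 24) -> int:
        """Read the clip duration from a WAV file and apply the fps * seconds + 1 rule."""
        with wave.open(path, "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
        return int(round(fps * duration)) + 1

    # e.g. a 5 s clip -> 121 frames, 10 s -> 241, 15 s -> 361
    print(frames_for_audio("speech_5s.wav"))
    ```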

    Write a prompt that describes the character and the intended performance. For lip sync videos, the prompt should not only describe the visual style, but also the behavior. You can describe a person speaking naturally, subtle head movement, realistic mouth movement, calm facial expression, stable camera, soft lighting, and shallow depth of field. If you want the video to stay close to the source image, keep the prompt simple and avoid changing identity details too aggressively.

    Use the negative prompt to suppress common lip sync problems. Useful negative terms include mismatched lip sync, distorted mouth, exaggerated expression, unnatural face movement, wrong gaze direction, robotic voice, audio delay, jittery motion, deformed face, flickering, duplicated mouth, missing microphone, incorrect expression, over-smiling, laughing, camera shake, and AI artifacts.

    For portrait videos, start with moderate resolution and short duration. A safe testing setup is 480 x 832 for portrait or 832 x 480 for widescreen, around 5 seconds. After confirming that the seed, prompt, and audio timing work well, you can increase duration and resolution. Longer videos require more VRAM and more generation time, so it is better to test in small steps.
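
    If you try resolutions other than the suggested 480 x 832 / 832 x 480, it can help to snap them to model-friendly values first. The divisor below is an assumption carried over from earlier LTX-Video releases (width and height divisible by 32); treat it as a convention to verify, not something stated by this workflow:

    ```python
    def snap_resolution(width: int, height: int, divisor: int = 32) -> tuple[int, int]:
        """Round width and height to the nearest multiple of `divisor` (assumed 32)."""
        def snap(v: int) -> int:
            return max(divisor, round(v / divisor) * divisor)
        return snap(width), snap(height)

    print(snap_resolution(480, 832))  # (480, 832) - already aligned
    print(snap_resolution(500, 850))  # (512, 864)
    ```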

    If the video has almost no motion, use a stronger motion prompt or camera-control guidance if available. If the face changes too much, simplify the prompt and reduce identity-changing descriptions. If the mouth does not match the audio, check the audio duration, frame count, and whether the audio has clear speech. If the video looks soft, enable the latent upscaler for the final pass. If you only want to preview quickly, bypass the upscale stage first.

    This workflow is designed for creators who need a practical LTX-2 lip sync pipeline inside ComfyUI. It combines source image preservation, audio latent conditioning, video latent generation, sampler control, and optional latent upscaling into one workflow. It is useful for testing LTX-2 audio-video generation, creating AI talking characters, preparing short digital human clips, and building publishable Civitai examples from image + voice input.

    🎥 YouTube Video Tutorial

    Want to know what this workflow actually does and how to start fast?

    This video explains what the tool is, how to launch the workflow instantly, and shares my core design logic — no local setup, no complicated environment.

    Everything starts directly on RunningHub, so you can experience it in action first.

    👉 YouTube Tutorial: https://youtu.be/LH1FquAz5O8

    Before you begin, I recommend watching the video thoroughly — getting the full context helps you understand the tool faster and avoid common detours.

    ⚙️ RunningHub Workflow

    Try the workflow online right now — no installation required.

    👉 Workflow: https://www.runninghub.ai/post/2011736436441092097/?inviteCode=rh-v1111

    If the results meet your expectations, you can later deploy it locally for customization.

    🎁 Fan Benefits: Register to get 1,000 points plus 100 points per daily login, and enjoy 4090 performance with 48 GB of GPU memory!

    📺 Bilibili Updates (Mainland China & Asia-Pacific)

    If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.

    📺 Bilibili Video: https://www.bilibili.com/video/BV1LLkFBhEgm/

    ☕ Support Me on Ko-fi

    If you find my content helpful and want to support future creations, you can buy me a coffee ☕.

    Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.

    👉 Ko-fi: https://ko-fi.com/aiksk

    💼 Business Contact

    For collaboration or inquiries, please contact aiksk95 on WeChat.

    🎥 YouTube Video Tutorial (Chinese version)

    Want to know what kind of tool this workflow is and how to launch it quickly?

    The video mainly covers the tool's positioning, how to launch it quickly, and my design approach.

    The demonstration runs directly on RunningHub, so you can see the actual results right away.

    👉 YouTube Tutorial: https://youtu.be/LH1FquAz5O8

    Before starting, it is best to watch the video in full; understanding the overall approach helps you get started faster and avoid common detours.

    ⚙️ Try the Workflow Online

    You can try it online right now, with no installation required.

    👉 Workflow: https://www.runninghub.ai/post/2011736436441092097/?inviteCode=rh-v1111

    Open the link above to run the workflow directly and view the generated results in real time.

    If the results meet your expectations, you can also deploy it locally for customization.

    🎁 Fan Benefits: Register to get 1,000 points plus 100 points per daily login, and enjoy 4090 performance with 48 GB of GPU memory!

    📺 Bilibili Updates (Mainland China & Asia-Pacific)

    If you are in mainland China or the Asia-Pacific region, you can watch the video below for a demonstration of the workflow and a breakdown of the design.

    📺 Bilibili Video: https://www.bilibili.com/video/BV1LLkFBhEgm/

    I also keep model resources updated on Quark Drive (夸克网盘):

    👉 https://pan.quark.cn/s/20c6f6f8d87b

    These resources are mainly intended for local users, to support creation and learning.

    Workflows
    ZImageTurbo

    Details

    Downloads: 9
    Platform: CivitAI
    Platform Status: Available
    Created: 5/9/2026
    Updated: 5/9/2026
    Deleted: -

    Files
    LTX2LipSync_v10.zip

    Mirrors
    CivitAI (1 mirror)