Base workflow for Audio+Image to Video (A+I2V) with the LTX-2.3 Dev model, set up to use as little VRAM as possible.
It can also generate text to video with an audio reference (switch the red boolean node to TRUE).
I suggest leaving the prompt alone unless you want to prompt for a specific motion or action.
prompt:
"Transform this static image into a high-quality video with realistic facial expressions and realistic motion.
Perfect lip-sync to the attached audio."
FILES:
OPTIONAL: Kijai's fp8 scaled model (requires a Load Diffusion Model node instead of the Unet Loader node, and replaces the GGUF entirely)
https://huggingface.co/Kijai/LTX2.3_comfy/tree/main/diffusion_models
Dev GGUF (distilled GGUFs are in the same repo)
https://huggingface.co/unsloth/LTX-2.3-GGUF/tree/main
Gemma 3_12B FP4 text encoder
Audio VAE
https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors
Video VAE
https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors
Text Projection text encoder
https://huggingface.co/Kijai/LTX2.3_comfy/tree/main/text_encoders
Distill Lora
https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-22b-distilled-lora-384.safetensors
Upscaler
https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-spatial-upscaler-x2-1.1.safetensors
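If you would rather script the downloads than click through each link, here is a minimal Python sketch for the four direct file links above (both VAEs, the distill LoRA and the upscaler). It assumes the huggingface_hub package is installed. The subfolder names (vae, loras, latent_upscale_models) are my assumption about a typical ComfyUI models layout, and parse_blob_url and download_all are hypothetical helper names, not part of the workflow. The diffusion model, GGUF and text encoder links point at folders, so pick those files by hand.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse

# Direct file links copied from the FILES list.
# The dictionary keys are ASSUMED ComfyUI model subfolders; adjust to your install.
FILE_URLS = {
    "vae": [
        "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors",
        "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors",
    ],
    "loras": [
        "https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-22b-distilled-lora-384.safetensors",
    ],
    "latent_upscale_models": [
        "https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
    ],
}


def parse_blob_url(url):
    """Split a huggingface.co /blob/ link into (repo_id, revision, path_in_repo)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<revision>/<path...>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a direct file link: {url}")
    return "/".join(parts[:2]), parts[3], "/".join(parts[4:])


def download_all(models_dir):
    """Download every file in FILE_URLS and copy it into models_dir/<subfolder>/."""
    # Imported here so the URL parsing above works without the package installed.
    from huggingface_hub import hf_hub_download

    for subfolder, urls in FILE_URLS.items():
        for url in urls:
            repo_id, revision, filename = parse_blob_url(url)
            cached = hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)
            target = Path(models_dir) / subfolder / Path(filename).name
            target.parent.mkdir(parents=True, exist_ok=True)
            # Copies out of the Hugging Face cache, so the file exists on disk twice.
            shutil.copy(cached, target)
            print(f"{filename} -> {target}")
```

To use it, call download_all with the path to your models folder, for example download_all("ComfyUI/models"). The copy step keeps the file names flat inside each subfolder; delete the Hugging Face cache afterwards if disk space is tight.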
Description
A+I2V