The workflow generates a video that starts from a given image as the first frame and makes the character speak the sentence in the prompt, using the voice from a 10-second audio sample.
The workflow was tested on a recent version of ComfyUI with recent versions of the nodes, on an RTX 5090 with PyTorch 2.9.0 and Python 3.13.11.
This is not for beginners. Making a female character speak with a male voice is pretty hard, but a low resolution and a good prompt help. So a good scenario for this workflow is to take only the audio output and inject it into a high-resolution video. Female to female can work easily.
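The audio-only scenario can be done outside ComfyUI. Here is a minimal sketch, assuming ffmpeg is installed and on your PATH; the file names and the helper functions are made up for illustration and are not part of the workflow. It copies the video stream from the high-resolution clip untouched and takes the audio track from the clip this workflow produced.

```python
import subprocess


def build_mux_command(hires_video, talking_video, output):
    """Build an ffmpeg command that keeps the picture from hires_video
    and the audio from talking_video. All three paths are hypothetical."""
    return [
        "ffmpeg", "-y",
        "-i", hires_video,      # input 0: high-resolution video
        "-i", talking_video,    # input 1: workflow output that carries the speech
        "-map", "0:v:0",        # first video stream of input 0
        "-map", "1:a:0",        # first audio stream of input 1
        "-c:v", "copy",         # do not re-encode the picture
        "-c:a", "aac",          # re-encode audio to AAC for an .mp4 container
        "-shortest",            # stop at the end of the shorter stream
        output,
    ]


def mux(hires_video, talking_video, output):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_command(hires_video, talking_video, output), check=True)
```

For example, `mux("hires.mp4", "workflow_out.mp4", "final.mp4")` would write `final.mp4`. Lip sync still depends on both clips having the same timing, so this only makes sense when the high-resolution video was generated for the same speech.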
I tried several voice cloning techniques with different models, but every time the speech came out more or less "out of context". Now I can apply speech that perfectly matches the action.
So there's no need to train a LoRA on a specific voice anymore.
Description
Basic parameters, as a starting point for your own further development.