Changelog
Version 1.0.3: Connected both steps so no more re-uploading is required. Just upload your video in Step 1 and hit Run.
Version 1.0.2: Changed VHS nodes to VHS ffmpeg nodes to avoid color drift (thank you LastAssignment). Also changed FPS flow from 24 to 25 to more closely align to MMAudio specs.
Version 1.0.1: RIFE Group output was set to 8fps by accident. Changed it to 24fps.
Version 1.0: Initial release
A TRIBUTE TO GOONERS EVERYWHERE
Your WAN 2.2 video is great. It looks awesome. But where's the sound? We moved from images to videos, and WAN 2.2 is incredible for video. The missing piece...AUDIO!
This is my first article ever, so I'm sorry if I made any mistakes. Please leave a comment if I've made an error or if you need any help. For your reference, I'm running:
ComfyUI 0.3.68
Torch 2.9
CUDA 13
Python 3.13.9
Sage Attention 2.2
NVIDIA RTX 5070 Ti (16GB VRAM)
And here are the custom nodes (3 in total):
ComfyUI-VideoHelperSuite 1.7.7 (https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite)
ComfyUI-MMAudio Nightly (https://github.com/kijai/ComfyUI-MMAudio)
I recommend manually git cloning this node pack into your /ComfyUI/custom_nodes folder and then installing the requirements.txt file using your embedded python. I'm on portable Comfy, so the command would look something like this:
"C:\ComfyUI\python_embeded\python.exe" -m pip install -r "C:\ComfyUI\ComfyUI\custom_nodes\ComfyUI-MMAudio\requirements.txt"
ComfyUI-VFI Unknown (https://github.com/GACLove/ComfyUI-VFI)
I think there's a more popular RIFE custom node that a lot of other people use, but I couldn't figure out how to get fractional multiples for interpolation with it (16 -> 25fps is a ~1.56x interpolation). This node allows it.
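To see why 16 -> 25fps needs a fractional multiple, here is a minimal Python sketch of the arithmetic. It is only a conceptual illustration, not how ComfyUI-VFI works internally; the function name source_positions is made up for this example. For each output frame it reports which source frame it falls after and how far it sits between that frame and the next.

```python
from fractions import Fraction

def source_positions(source_fps, target_fps, n_out):
    """Map each output frame to (left source frame index, blend fraction 0..1).

    Hypothetical helper for illustration only.
    """
    step = Fraction(source_fps, target_fps)  # source frames advanced per output frame
    positions = []
    for i in range(n_out):
        pos = i * step                           # exact position on the source timeline
        left = pos.numerator // pos.denominator  # nearest source frame at or before pos
        positions.append((left, float(pos - left)))
    return positions

multiplier = Fraction(25, 16)
print(float(multiplier))  # 1.5625, so not a whole-number multiple like 2x

for i, (left, t) in enumerate(source_positions(16, 25, 5)):
    print(f"output frame {i}: between source frames {left} and {left + 1}, t={t:.2f}")
```

Because the blend fraction changes on every output frame, a node that only supports fixed 2x or 4x steps can't produce this directly.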
Onto the workflow...
------------------------------------
This workflow handles two jobs:
Fix WAN 2.2’s native 16fps output by interpolating it to 25fps with RIFE.
Generate synced audio with MMAudio using the final 25fps video.
The setup is plug-and-play. Drop in your WAN video → interpolate → feed it into MMAudio → get synced output. The included notes explain the reasoning for FPS, step settings, and seed behavior.
What this workflow covers:
RIFE interpolation from 16 → 25 fps.
MMAudio sampler
Upon some further testing, 50-100 steps work well. The node runs pretty fast in general, and it's also worthwhile toying with CFG (4.5 - 8). 100 steps with CFG 8 work well for high-quality output and better prompt adherence.
Automatic audio + video combine at 25fps.
Optional re-interpolation afterward if you want 30fps+ output.
You can plug your finished 25fps video into the 'Step 1: Rife Interpolation' group and just change the 'source_fps' to 25 and the 'target_fps' to 30.
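As a rough sanity check on the two interpolation passes (16 -> 25, then the optional 25 -> 30), this Python sketch estimates frame counts and confirms the clip length stays the same. The 80-frame input is just a hypothetical example, and the rounding is an assumption; the actual node may differ by a frame.

```python
from fractions import Fraction

def interpolated_frames(n_frames, source_fps, target_fps):
    """Estimate the frame count after interpolation (assumes simple rounding)."""
    return round(n_frames * Fraction(target_fps, source_fps))

frames = 80  # hypothetical 5-second clip at 16fps
passes = [(16, 25), (25, 30)]  # (source_fps, target_fps) for each pass

for source_fps, target_fps in passes:
    out_frames = interpolated_frames(frames, source_fps, target_fps)
    print(
        f"{source_fps} -> {target_fps}fps: {frames} -> {out_frames} frames, "
        f"{out_frames / target_fps:.2f}s"
    )
    frames = out_frames
```

The duration should print the same for both passes. If it doesn't match your source clip, the 'source_fps' value is probably still set for the previous pass.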
Required MMAudio files
Download all of these into:
ComfyUI/models/mmaudio
MMAudio NSFW Model (fine-tuned off the base model)
MMAudio VAE (fp16)
MMAudio Synchformer (fp16)
https://huggingface.co/Kijai/MMAudio_safetensors/resolve/main/mmaudio_synchformer_fp16.safetensors
MMAudio CLIP Encoder (fp16)
Nvidia BigVGAN v2 44kHz 128band 512x
This seems to be required for MMAudio to work. You can manually download all the files, git clone the repo, or use the HuggingFace CLI tool (huggingface-cli download nvidia/bigvgan_v2_44khz_128band_512x --local-dir <target folder>). The repo should be placed in the ComfyUI/models/mmaudio folder.
https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x
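If the loader nodes complain about missing models, a quick Python check like the one below can tell you what isn't in place yet. It is a sketch under assumptions: only the two names that appear in the links above are listed, the BigVGAN folder name is what a plain git clone would create, and you'd need to add the other filenames yourself once downloaded. The helper name missing_entries is made up for this example.

```python
from pathlib import Path

# Names taken from the download links above. Add the NSFW model, VAE and
# CLIP encoder filenames yourself; they are not spelled out in this article.
REQUIRED = [
    "mmaudio_synchformer_fp16.safetensors",
    "bigvgan_v2_44khz_128band_512x",  # assumed folder name from a git clone
]

def missing_entries(model_dir, required=REQUIRED):
    """Return the required files/folders that don't exist inside model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).exists()]

# Adjust the path to wherever your ComfyUI install lives.
missing = missing_entries("ComfyUI/models/mmaudio")
print("Missing:", missing if missing else "nothing")
```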
Bonus
Once you've created a good MMAudio track, there are some further steps you can take depending on what you'd like to create.
1. Import your audio/video into some type of software (CapCut/Shotcut) and layer on some music in the background. I've done this with a few of my videos. I added a 'radio' filter to make it seem like the music was kinda tinny and playing in the background.
2. Layer other audio tracks alongside the NSFW audio track. You can see KaptainSisay very elegantly did something like that here (https://civarchive.com/images/110700679)
Comments (8)
yessss, this is the missing piece! I wonder how I would train this on my own dataset, is video used or just audio? are there tuts for this?
I hope this helps:
https://github.com/hkchengrex/MMAudio/blob/main/docs/TRAINING.md
@SeoulSeeker thank you will check it out
I just wanna say thank you for making this! Even though it is "simple" it really helped me up my game without having to do a lot of research. Some things I've learned that others might find useful:
- Steps:
The first versions of this had steps at 50, and I saw someone suggesting upping this to 100. That made a difference, and I see the latest version has 100 now.
- Prompting:
As the author has noted, this checkpoint really likes to moan! I personally prefer less moaning and more sounds from wet and sloppy sex. So prompting is a must for me. Do a couple of generations with no prompt and CFG at 4.5 to gauge what annoying sounds keep showing up and what sounds you are missing. Don't spam the prompt right away; gradually add just what is needed.
- Positive prompt:
For the positive prompt add more of what you like to hear. For my oral videos I add words like "gulp, facefuck". Try them one by one and see what effect they have. Not all words do what you think they do... As with image models, adding the right positive prompt usually beats overusing the negative prompt.
- Negative prompt:
Adding words like "moan, loud, breathing, sharp, smack, slap" etc. to the negative prompt seems to help reduce the overly anime and cheesy porn sounds.
- CFG:
I first leave this at 4.5. If the model keeps moaning and making random noises despite my prompting, I increase the CFG gradually. It seems that, similarly to image models producing "oversaturated" images at high CFG, this model produces rougher and more unnatural sounds the higher you go. So only increase CFG if you have to.
- Bonus tips:
I like to interpolate and upscale my videos in Topaz, so I don't really care that much about the interpolation in this workflow other than it being necessary for the sound generation. So I like to do a Topaz run on my "silent" video separately and then combine that video with the audio from this workflow. The upscaled video can be combined with the audio using ffmpeg like this:
ffmpeg -i "topaz_video.mp4" -i "mmaudio_video.mp4" -c copy -map 0:v:0 -map 1:a:0 final_video_with_audio.mp4
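If you do this for a lot of clips, the same command can be wrapped in a small Python script. This is just a sketch that rebuilds the exact ffmpeg arguments from the line above; the function names are made up, and it assumes ffmpeg is on your PATH.

```python
import subprocess

def build_mux_command(video_path, audio_source_path, output_path):
    """Build the ffmpeg call: video from the first input, audio from the second."""
    return [
        "ffmpeg",
        "-i", video_path,         # input 0: the upscaled, silent video
        "-i", audio_source_path,  # input 1: the workflow output that has the audio
        "-c", "copy",             # copy streams without re-encoding
        "-map", "0:v:0",          # first video stream of input 0
        "-map", "1:a:0",          # first audio stream of input 1
        output_path,
    ]

def mux(video_path, audio_source_path, output_path):
    """Run ffmpeg and raise an error if it fails."""
    subprocess.run(
        build_mux_command(video_path, audio_source_path, output_path),
        check=True,
    )

print(" ".join(build_mux_command("topaz_video.mp4", "mmaudio_video.mp4", "final_video_with_audio.mp4")))
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.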
Great comment, I agree with basically 100% of what you've written here as I've learned much the same things as you haha
Btw thank you very much for the buzz! <33
WTF man ?? I'm gonna try this out
Why can't I install the MMAudio node via ComfyUI Manager?((