Experimental workflow combining an Ultralytics detector with MMAudio.
You will need only 4 custom nodes, which you probably already have:
ComfyUI Impact Pack
rgthree-comfy
ComfyUI-VideoHelperSuite
ComfyUI Impact Subpack
If you want a good and simple MMAudio workflow, @SeoulSeeker made a very good one you can try:
https://civarchive.com/models/2137833/nsfw-dead-simple-mmaudio-rife-interpolation-setup-for-wan-22-i2v-14b
My workflow doesn't have an interpolation part; all my videos are already at 24 FPS.
So make sure yours are too.
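If you are not sure what frame rate a clip has, here is a minimal Python sketch for checking it. It assumes you read the rate with ffprobe, which reports it as a fraction string such as `24/1` or `24000/1001`. The helper names `parse_frame_rate` and `is_target_fps` and the 0.05 tolerance are my own placeholders, not part of the workflow.

```python
from fractions import Fraction

# The rate string can be read with ffprobe, for example:
#   ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate \
#           -of default=noprint_wrappers=1:nokey=1 input.mp4
# which prints something like "24/1" or "24000/1001".


def parse_frame_rate(rate: str) -> float:
    """Convert an ffprobe-style fraction string ("24/1") to frames per second."""
    return float(Fraction(rate))


def is_target_fps(rate: str, target: float = 24.0, tolerance: float = 0.05) -> bool:
    """Return True when the clip's frame rate is within `tolerance` of `target`.

    The tolerance is an assumption; it lets 23.976 FPS (24000/1001) pass as 24.
    """
    return abs(parse_frame_rate(rate) - target) <= tolerance


if __name__ == "__main__":
    for rate in ("24/1", "24000/1001", "16/1", "30/1"):
        status = "ok" if is_target_fps(rate) else "needs interpolation or re-encoding first"
        print(f"{rate}: {status}")
```

Anything that fails the check should go through an interpolation step (like the one in the workflow linked above) before being loaded here.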
A few months ago I was struggling with MMAudio. The model uses the motion and the faces across your whole video to generate the sounds, so if there is too much going on, it gets pretty confused most of the time. So I first started experimenting with DaVinci: I cropped and zoomed in on one part of my video, like the face, to get well-synced moaning, for example.
But the process was tedious.
So my idea was: how can I crop it directly in ComfyUI? I thought of masks first, but you would need to redraw the mask every time, depending on the composition.
My knowledge of automated mask drawing was limited, so I dropped this idea.
And then last week it clicked: "why can't you use Ultralytics to detect the face, the area around it, or the hand/ass etc., give this cropped part to MMAudio and generate sound for it?"
So after a few tries with the SEGS video detector, I found what I think is a very good way of doing it. It takes a bit longer, yeah, but the results are pretty good.
It is not perfect, of course. MMAudio will still do its random shit, but it is way better on some of my videos. Some videos don't need it, like a video where most of the subject and action is centered.
How to Use
Here you can choose your BBox detector. I've tried Anzhc face, yolov8m/s/n and yolov9c, and the results are slightly different for each of them, but you can use pretty much any face detector you want.
You can also play with the crop factor; I found 1.2 to be a good ratio.
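To give an idea of what the crop factor does, here is a minimal Python sketch. It assumes the factor simply scales the detected box around its center and clamps the result to the frame. This is my simplified reading, not the actual Impact Pack code, and `expand_bbox` is a hypothetical helper name.

```python
def expand_bbox(bbox, crop_factor, frame_w, frame_h):
    """Scale a detection box around its center and clamp it to the frame.

    bbox is (x1, y1, x2, y2) in pixels. A crop_factor of 1.2 makes the crop
    20% wider and taller than the detected box (assumed behavior).
    """
    x1, y1, x2, y2 = bbox
    center_x = (x1 + x2) / 2
    center_y = (y1 + y2) / 2
    half_w = (x2 - x1) * crop_factor / 2
    half_h = (y2 - y1) * crop_factor / 2

    # Clamp so the crop never leaves the frame.
    left = max(0, int(round(center_x - half_w)))
    top = max(0, int(round(center_y - half_h)))
    right = min(frame_w, int(round(center_x + half_w)))
    bottom = min(frame_h, int(round(center_y + half_h)))
    return left, top, right, bottom


if __name__ == "__main__":
    # A 100x100 face box in a 640x480 frame, expanded with crop factor 1.2.
    print(expand_bbox((100, 100, 200, 200), 1.2, 640, 480))
```

A bigger factor gives MMAudio more context around the face (shoulders, hands), a smaller one keeps it focused on the detected area only.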
With the Video Combine preview you can see whether the detection was good; this part will be sent to the MMAudio sampler.
Once the detection is done there is no need to touch the SEGS, so you can generate a bunch of tries with different prompts and seeds.
I added a 2nd detector if you want to detect something else, or even use the classic MMAudio model to add some other SFX. You can toggle it on/off with Fast Groups Bypasser.
If you want to generate a bunch of tries with only the 2nd detector, just fix the seed on the first crop.
Make sure to leave the detection enabled on the 2nd crop if you want to generate only with the first MMAudio; then you can fix the first and play with the 2nd MMAudio.
I've added an audio preview for the 2nd crop.
Here you can play with the volume of the 2nd MMAudio to get a better mix.
I've also added a Master Volume and a First Crop Volume.
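For anyone wondering how the three volume controls relate to each other, here is a rough Python sketch of the idea. It assumes each track is scaled by its own volume, the two are summed, and the master volume is applied on top, with samples as floats in the -1.0 to 1.0 range. The function `mix_tracks` and its parameters are illustrative names only, not the nodes used in the workflow.

```python
def mix_tracks(first, second, first_volume=1.0, second_volume=1.0, master_volume=1.0):
    """Mix two mono tracks (lists of float samples in [-1.0, 1.0]).

    Each track gets its own gain, the sum gets the master gain, and the
    result is hard-clipped so it stays in the valid sample range.
    """
    length = max(len(first), len(second))
    mixed = []
    for i in range(length):
        # Treat the shorter track as silence once it runs out.
        a = first[i] if i < len(first) else 0.0
        b = second[i] if i < len(second) else 0.0
        sample = master_volume * (first_volume * a + second_volume * b)
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


if __name__ == "__main__":
    face_audio = [0.5, 0.5, -0.25]
    sfx_audio = [0.5, -0.5]
    # Keep the first crop at full volume and halve the 2nd MMAudio track.
    print(mix_tracks(face_audio, sfx_audio, first_volume=1.0, second_volume=0.5))
```

If the mix clips a lot, lowering the Master Volume is usually safer than lowering only one of the two tracks.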
I hope I was clear enough. If you have any trouble operating it, have questions, or think I missed something, just let me know.
Description
V0.9 experimental