MOCHI VIDEO GENERATOR
(results are in the v1, v2, etc. galleries; click the tabs at the top)
True i2v workflow added from V8 onwards; details in the main article
video TBA
Showcase Special: (created mostly with one ACE-HOLO promptgen line)
pack update V7 + special Video promptgen guide with ACE-HoloFS.
V7 Demo Reel (made with Shuffle Video Studio)
Roundup of the research so far, with some more detailed instructions/info
Current leader: (V7 gallery) (V8 adds image encoding)
"\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-BatchedLatentSideload-v55"
The i2v version used LLM video prompt generation; the t2v version used my Zenkai-Prompt + DJZ-LoadLatent.
WIP project by Kijai
Info/Setup/Install guide: https://civarchive.com/articles/8313
Requires Torch 2.5.0 minimum, so update your Torch if you are behind.
As with the CogVideo workflows, these are provided for people who want to try the preview :)
Even with a 4090 it can push the limits a little; the workflows I used to research Tile Optimisation are provided in V1:
We're reducing tile sizes by roughly 20-40% from the defaults
We're increasing the frame batch size to compensate
We're maintaining the same overlap factors to prevent visible seams
Key principles (a small sketch of the tile maths follows this list):
Tile sizes should ideally be multiples of 32 for the most efficient processing
Keep the width:height ratio similar to the original tile sizes
Frame batch size increases should be modest to avoid frame skipping
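To make the tile maths concrete, here is a minimal sketch of how a reduced tile size can be derived from a default; the function name and the exact 30% reduction are illustrative only, not part of any node in the pack.

```python
# Minimal sketch (illustrative only): derive a reduced tile size from a default,
# snapping to multiples of 32 and roughly preserving the width:height ratio.
def scale_tile(default_w, default_h, reduction=0.3, multiple=32):
    w = max(multiple, round(default_w * (1 - reduction) / multiple) * multiple)
    h = max(multiple, round(w * default_h / default_w / multiple) * multiple)
    return w, h

# Example: shrinking the 256x128 tiles by ~30% gives the 192x96 used at batch 16.
print(scale_tile(256, 128))  # (192, 96)
```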
Researcher's Tip!
If you work with a fixed seed, the sampler results stay cached in memory: the first generation took ~1700 seconds, but the decoder settings can then be changed and the next video takes only ~23 seconds. All the heavy work is done by the sampler, so unless we pick a new seed it will reuse the same samples over and over, and the VAE decode speed is very good!
^ subsequent gens on the same seed are very fast, allowing tuning of the decoder settings ^
^ initial generation was taking ~1700 seconds with PyTorch 2.5.0 SDP ^
V1 Workflows:
outputs labelled and added to V1 gallery, test prompt used:
"In a bustling spaceport, a diverse crowd of humans and aliens board a massive interstellar cruise ship. Robotic porters effortlessly handle exotic luggage, while holographic signs display departure times in multiple languages. A family of translucent, floating beings drift through the security checkpoint, their tendrils wrapping around their travel documents. In the sky above, smaller ships zip between towering structures, their ion trails creating an ever-changing tapestry of light."
\Decoder-Research\Donut-Mochi-848x480-batch10-default-v5
= Author Default Settings
This version used the recommended config from the author.
\Decoder-Research\Donut-Mochi-640x480-batch10-autotile-v5
= Reduced size, Auto Tiling
- This is my first run, which created the video in the gallery, simply using Auto Tile on the decoder and reducing the overall dimensions to 640x480. This reduction makes generation take less memory, but it is heavy-handed and will reduce the quality of the outputs.
The remaining workflows all investigate the possible configs without Auto Tiling, so we know exactly what settings were used. Videos will be labelled with the batch count and added to the V1 gallery. Community research is required!
\Decoder-Research\Donut-Mochi-848x480-batch12-v5
frame_batch_size = 12
tile_sample_min_width = 256
tile_sample_min_height = 128
\Decoder-Research\Donut-Mochi-848x480-batch14-v5
frame_batch_size = 14
tile_sample_min_width = 224
tile_sample_min_height = 112
\Decoder-Research\Donut-Mochi-848x480-batch16-v5
frame_batch_size = 16
tile_sample_min_width = 192
tile_sample_min_height = 96
\Decoder-Research\Donut-Mochi-848x480-batch20-v5
frame_batch_size = 20
tile_sample_min_width = 160
tile_sample_min_height = 96
\Decoder-Research\Donut-Mochi-848x480-batch24-v5
frame_batch_size = 24
tile_sample_min_width = 128
tile_sample_min_height = 64
\Decoder-Research\Donut-Mochi-848x480-batch32-v5
frame_batch_size = 32
tile_sample_min_width = 96
tile_sample_min_height = 48
The last workflow is a hybrid approach: the increased overlap factors (0.3 instead of 0.25) might help reduce visible seams when using very small tiles (a quick illustration of the tile-count trade-off follows this config).
\Decoder-Research\Donut-Mochi-848x480-batch16-v6
frame_batch_size = 16
tile_sample_min_width = 144
tile_sample_min_height = 80
tile_overlap_factor_height = 0.3
tile_overlap_factor_width = 0.3
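As a rough illustration of the trade-off these configs explore: smaller tiles cut the VRAM needed per decode call but multiply the number of calls, while a larger frame batch amortises each call over more frames. The snippet below is purely illustrative (it treats the tile sizes as if they applied directly at the output resolution and ignores the decoder's latent scale factor); it is not the wrapper's actual code.

```python
import math

# Illustrative only: how many spatial tiles a tiled decoder would visit per
# frame batch for a given tile size and overlap factor.
def tiles_per_batch(height, width, tile_h, tile_w, overlap=0.25):
    stride_h = max(1, int(tile_h * (1 - overlap)))
    stride_w = max(1, int(tile_w * (1 - overlap)))
    rows = math.ceil(max(height - tile_h, 0) / stride_h) + 1
    cols = math.ceil(max(width - tile_w, 0) / stride_w) + 1
    return rows * cols

# Comparing three of the configs above at 848x480:
for batch, tw, th in [(12, 256, 128), (16, 192, 96), (32, 96, 48)]:
    print(f"batch {batch}: {tiles_per_batch(480, 848, th, tw)} tiles per pass")
```

Fewer, larger tiles leave fewer seams to blend; many small tiles keep peak memory down but multiply the decode calls, which is why the hybrid config bumps the overlap to 0.3 to hide seams from very small tiles.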
V2 Workflow
\CFG-Research\Donut-Mochi-848x480-batch16-CFG7-v7
This used the Donut-Mochi-848x480-batch16-v6 workflow with 7.0 CFG.
This seems to be a good setting; generation time is 24 minutes with this setup.
(PyTorch SDP used)
V3 Workflows
\FP8--T5-Scaled\Donut-Mochi-848x480-batch16-CFG7-T5scaled-v8
We decided to use the FP8_Scaled T5 CLIP model, which improved the outputs greatly across all prompts tested; check the V3 gallery. This is the best so far! (until we beat it)
\GGUF-Q8_0--T5-Scaled\Donut-Mochi-848x480-b16-CFG7-T5scaled-Q8_0-v9
This did not yield the best results, probably because the scaled T5 CLIP was still in FP8 while we were testing GGUF Q8_0 as the main model.
V4 Workflow
\T5-FP16-CPU\Donut-Mochi-848x480-b16-CFG7-CPU_T5-FP16-v11
This used T5XXL in FP16 by forcing it onto the CPU. It shows the same artifacts as V3, where we used GGUF Q8_0 with T5XXL FP8.
V5 Workflows
\GGUF-Q8_0--T5-FP16-CPU\Donut-Mochi-848x480-GGUF-Q8_0-CPU_T5-FP16-v14
These were the best settings with VAE Tiling enabled; increasing the steps will of course increase both the quality and the time taken.
Increasing steps to 100-200 improves quality at the expense of time; 200 steps takes 45 minutes. There will likely be no dedicated version for this, because anybody can add more steps to any of these workflows and simply wait a very long time for a 6-second video. This can be remedied with a cloud setup and a larger GPU/VRAM allocation.
V6 Workflows
\Fast-25-Frames\Donut-Mochi-848x480-Fast-v4
Used VAE Tiling with 25 frames to generate 1 second of video. With 50 steps this takes a few minutes, and 4-5 minutes for 100 steps.
\NoTiling-SaveLoadLatent\Donut-Mochi-848x480-i2v-LatentSideload-v21
Using my new DJZ-LoadLatent node, you can save the sampler results as .latent files on disk; this makes it possible to decode the latents as a separate stage, eliminating the need for the Tiling VAE. This is image to video: it uses OneVision to estimate a video prompt from any given image, and it automatically detects tall or wide aspect ratios and crops/fills to 16:9 or 9:16. NOTE: more testing is needed to prove that tall-aspect quality is good.
\NoTiling-SaveLoadLatent\Donut-Mochi-848x480-t2v-LatentSideload-v25
This is the text-to-video version of the previous workflow: we drop OneVision and ImageSizeAdjusterV3 and add Zenkai-Prompt-V2 back in to take advantage of our prompt lists. Full instructions are found in the workflow notes.
The Save/Load Latent approach allows us to drop the Tiling VAE, which introduced ghosting to all videos regardless of the settings; as overall quality improved, the ghosting became more apparent.
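For anyone curious how the latent sideload works mechanically, here is a minimal sketch of a ComfyUI-style loader node that lists .latent files from the output folder. It is a simplified stand-in for DJZ-LoadLatent, not the actual node code; the class name, folder handling, and tensor-key handling are assumptions based on the stock SaveLatent/LoadLatent nodes.

```python
import os
import safetensors.torch
import folder_paths  # ComfyUI helper module for locating the input/output dirs


class LoadLatentFromOutput:
    """Simplified stand-in for DJZ-LoadLatent: list .latent files that
    SaveLatent wrote to the output folder and load one back as LATENT."""

    @classmethod
    def INPUT_TYPES(cls):
        out_dir = folder_paths.get_output_directory()
        files = sorted(f for f in os.listdir(out_dir) if f.endswith(".latent"))
        return {"required": {"latent_file": (files,)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "load"
    CATEGORY = "latent"

    def load(self, latent_file):
        path = os.path.join(folder_paths.get_output_directory(), latent_file)
        data = safetensors.torch.load_file(path, device="cpu")
        # The stock SaveLatent stores the tensor under "latent_tensor";
        # legacy rescaling of older latent formats is omitted for brevity.
        return ({"samples": data["latent_tensor"].float()},)


NODE_CLASS_MAPPINGS = {"LoadLatentFromOutput": LoadLatentFromOutput}
```

In the actual workflows the sampler/save stage is toggled with a group bypass switch, so you can write the .latent once and then iterate on decoder settings within the same graph.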
V7 Workflows
Updated the V6 latent sideload workflows to use the newer VAE Spatial Tiling Decoder
This can run 100% on a local GPU, and all the demo videos in the gallery used only 50 steps
(100 steps were used in the V6 gallery). Another significant upgrade!
\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-LatentSideload-v50.json
text2video, VAE spatial tiling decoder, with my latent loader
\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-i2v-LatentSideload-v50.json
pseudo image2video, VAE spatial tiling decoder, with my latent loader
\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-BatchLatentSideload-v55.json
text2video, VAE spatial tiling decoder, with my V2 batched latent loader
\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-i2v-BatchLatentSideload-v55.json
pseudo image2video, VAE spatial tiling decoder, with my V2 batched latent loader
NOTE: V7 is available on GitHub in my DJZ-Workflows pack; however, it will not be published here until the new batch of videos is finished (cooking all night tonight).
V8 Workflows
\True-Image-To-Video\Donut-Mochi-848x480-i2v-LatentSideload-v90.json
image2video, VAE spatial tiling decoder, with my latent loader
\True-Image-To-Video\Donut-Mochi-848x480-i2v-BatchedLatentSideload-v90.json
image2video, VAE spatial tiling decoder, with my V2 batched latent loader
Added true i2v (image to video using the new VAE Encoder).
Tutorial video TBA; details in the main article.
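Conceptually, true i2v swaps the purely random starting latent for one derived from a real frame: the start image goes through the Mochi VAE encoder and the resulting latent seeds the sampler. The sketch below only illustrates that idea with placeholder names; whether the wrapper blends noise like this or injects the encoded frame differently is down to its implementation, so treat this as a conceptual aid rather than the actual node API.

```python
import torch

# Conceptual sketch only (placeholder names, not the MochiWrapper API):
# encode the start image and blend it with noise so the sampler denoises
# towards a video anchored on the input picture rather than pure noise.
def i2v_start_latents(vae_encode, image_bchw, num_frames, noise_strength=0.8):
    image_latent = vae_encode(image_bchw)                           # (B, C, h, w)
    video_latent = image_latent.unsqueeze(2).expand(-1, -1, num_frames, -1, -1)
    noise = torch.randn_like(video_latent)
    return noise_strength * noise + (1.0 - noise_strength) * video_latent
```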
Comments
Any chance this will run on 16GB VRAM? I don't think so, but it's not clear if it's even possible.
I believe you can: offloading the CLIP (T5XXL-FP16) to the CPU should still work if you have enough system RAM. Then maybe you can try GGUF Q4_0, which should squeeze the size down a bit more, maybe under 12GB; however, when quantizing models you do lose some accuracy. It's tricky and I plan to look into this a bit more in future updates.
@driftjohnson Thank you for the answer, I might try this (or maybe wait till img2vid is supported)-- I would appreciate anything you figure out in the future around this topic--
I have a GeForce RTX 4090 16 GB and I have gotten decent 480p videos of no more than 2 seconds. Then I put the videos through the CapCut video enhancer, and they come out beautiful. I have made videos of 12 seconds, but they are just a joke. A 2-second video takes 10 or more minutes; anything longer takes far longer, and 12 seconds means 1.5 to 2.5 hours of waiting. I have done so much trial and error trying to get more than that, and the only good thing is that you start to understand how things work. I don't know coding, but with the help of ChatGPT I was able to make the model use only 15GB of the 16, and with that 1GB free my PC performs better. Even the antivirus is a problem if you have Windows. I guess if you have Linux, you will have fewer issues than me.
@RubenTainoAI I think that using it for 2-second clips is very useful - this is how I use it on my 4090 also. If I wanted more length I would probably use interpolation (RIFE/FILM) to double the frames (works with slow-motion prompts) and, as you say, use an external video upscaler/detailer tool.
Even with 50 steps, 163 frames was taking 25 minutes for me.
@RubenTainoAI Thanks for the info. It's a lot of data to download just to get nowhere, so I'm still on the fence. But I feel much closer to experimenting with it myself after finding out it is possible, ha.
@RubenTainoAI hi can you explain a bit more about this "I don't know coding, but with the help of ChatGPT, I was able to make the model use only 15gb of the 16, and with that 1gb free my PC perform better." thanks
V6 works very well! Thanks a lot for sharing - this is actually exactly what I wanted to do but could not achieve. I have the impression that the solution to the challenges I was facing is to use your custom nodes to save and load latents instead of ComfyUI's own. What is the difference between them?
It's amazing how much information is packed into a .latent file, isn't it?
I used the default SaveLatent node, which saves to the /output folder. This was annoying because you have to copy the .latent file into the /input folder manually, which was disruptive and potentially needed two workflows; by using the group bypass switch, we can do it all in one workflow.
My LoadLatent node is functionally identical in its code; I just made it scan the /output folder for any .latent files. By pressing "R" you can easily refresh the node's list and simply carry on with no fuss :)
And yes! I hope to use this node in other applications, as it seems to work great!
@driftjohnson In fact the problem I had was probably related to the "R" - it was lacking a much-needed refresh!
I enjoy your videos; I've been following them since the MimicMotion stuff. I'm using your RunPod template but I'm getting an error: missing node type "mochi vae loader".
It is a ~WIP~ node, so it's possible something has been changed or removed; no worries, a new pack version will address any issues like this - thanks for alerting me :)
@driftjohnson I just removed the node and used the VAE from the MochiWrapper and it worked fine. I've been running the RunPod template on an H100; it can do the standard Mochi workflow at 163 frames in around 10 minutes, at a cost of $2.79/hr.
@nexusjuan342 Yes, at the time of posting my pack the "Spatial Tiling VAE decoder" did not exist; I have since updated the pack with the V7 release. All WIP wrappers are subject to change - this means the nodes can break in older workflows :)
@nexusjuan342 Thanks for posting this result - it will surely help others decide how they want to run it.
V7 can run 100% locally, but clearly it's faster with the RunPod method :D Thanks again for sharing!
You want to use the VAE Decode Spatial Tiling node.
Yes, I wanted to give it some time, cover a few other AI projects, and return to this after a few days :)
It's on my list!!
Confirmed: this was added in an update after I did all my research; I'm looking into this next.
Does this mean you don't get the ghosting, so there's no need to save latents? What are the recommended settings? Thanks.
@citydailyai Saving the latents can still help with fitting the whole process into memory (it depends how many frames/steps you generate).
Yes, the new Spatial Tiling eliminated the ghosting.
@driftjohnson Thanks. I tried decoding latents, but at 49 frames I kept running too high on VRAM, so it was going to take hours; the spatial tiling is a welcome addition.
@citydailyai Decoding the latents only takes 30 seconds, but the VAE decode needs much more VRAM. This is the nature of generative video: the encoding takes a long time, while the decoding (although faster) requires more memory, so by splitting the job into two parts we effectively halve the load. That's the same logic behind the "purge VRAM" nodes: free up more memory at each stage.
I also offer a RunPod template if people want the "full-fat" experience, which currently requires 48GB VRAM.
Certainly the Spatial Tiling can work great with the right combination of models etc.