MOCHI VIDEO GENERATOR
(results are in the V1, V2, etc. galleries; click the tabs at the top)
True i2v workflow added from V8 onwards; details in the main Article
video TBA
Showcase Special: (created with mostly one ACE-HOLO promptgen line)
pack update V7 + special Video promptgen guide with ACE-HoloFS.
V7 Demo Reel (made with Shuffle Video Studio)
Roundup of the research so far, with some more detailed instructions/info
Current leader: (V7 gallery) (V8 adds image encoding)
"\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-BatchedLatentSideload-v55"
The i2v version used an LLM video prompt generator; the t2v version used my Zenkai-Prompt + DJZ-LoadLatent.
WIP project by Kijai
Info/Setup/Install guide: https://civarchive.com/articles/8313
Requires Torch 2.5.0 minimum, so update your Torch if you are behind.
As with the CogVideo workflows, these are provided for people who want to try the Preview :)
Even with a 4090 it can push the limits a little, so I provide the workflows I used to research Tile Optimisation in V1:
Reducing tile sizes by roughly 20-40% from the defaults
Increasing the frame batch size to compensate
Maintaining the same overlap factors to prevent visible seams
Key principles:
Tile sizes should ideally be multiples of 32 for most efficient processing
Keep width:height ratio similar to the original tile sizes
Frame batch size increases should be modest to avoid frame skipping
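As a minimal sketch of these principles (my own helper, not part of the workflows; the 256x128 starting tile below is only an example), the idea is to shrink a tile by 20-40%, snap it to a multiple of 32, and derive the height from the original ratio:

```python
# Minimal sketch of the tile-sizing principles above (illustrative helper only).

def round_to_multiple(value: float, multiple: int = 32) -> int:
    """Round to the nearest multiple of `multiple`, never below one multiple."""
    return max(multiple, int(round(value / multiple)) * multiple)

def shrink_tile(width: int, height: int, reduction: float = 0.3) -> tuple[int, int]:
    """Reduce tile dims by `reduction` (20-40% works well) while keeping the ratio."""
    new_w = round_to_multiple(width * (1.0 - reduction))
    new_h = round_to_multiple(new_w * (height / width))  # preserve width:height
    return new_w, new_h

print(shrink_tile(256, 128, 0.25))  # -> (192, 96)
```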
Researcher's Tip!
If you work with a fixed seed, the sampler output remains in memory. The first generation took ~1700 seconds, but changes to the decoder can then be made and the next video takes only ~23 seconds. All the heavy work has already been done by the sampler, so unless we take a new seed it will reuse the same samples over and over, and the VAE decode itself is very fast!
^ subsequent gens on same seed are very fast, allowing tuning of the decoder settings ^
^ initial generation was taking ~1700 seconds with PyTorch 2.5.0 SDP ^
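To illustrate the tip, here is a purely illustrative Python sketch of that caching idea (this is not ComfyUI's actual code; `run_sampler` and `render` are made-up stand-ins): because the slow sampling stage is keyed on its inputs, re-running with the same seed only re-executes the cheap decode stage.

```python
# Illustrative only: caching the sampler result by its inputs means a re-run with
# the same seed skips the ~28-minute sampling pass and only the decode repeats.

_sample_cache: dict = {}

def run_sampler(seed: int, steps: int) -> str:
    key = (seed, steps)
    if key not in _sample_cache:
        _sample_cache[key] = f"latents_for_seed_{seed}"  # stand-in for slow sampling
    return _sample_cache[key]

def render(seed: int, steps: int, tile_width: int) -> str:
    latents = run_sampler(seed, steps)                    # cached after the first call
    return f"decode({latents}, tile_width={tile_width})"  # cheap stage, safe to re-tune

render(42, 50, 256)  # slow: the sampler actually runs
render(42, 50, 192)  # fast: only the decoder settings changed
```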
V1 Workflows:
Outputs are labelled and added to the V1 gallery; test prompt used:
"In a bustling spaceport, a diverse crowd of humans and aliens board a massive interstellar cruise ship. Robotic porters effortlessly handle exotic luggage, while holographic signs display departure times in multiple languages. A family of translucent, floating beings drift through the security checkpoint, their tendrils wrapping around their travel documents. In the sky above, smaller ships zip between towering structures, their ion trails creating an ever-changing tapestry of light."
\Decoder-Research\Donut-Mochi-848x480-batch10-default-v5
= Author Default Settings
This version used the recommended config from the author.
\Decoder-Research\Donut-Mochi-640x480-batch10-autotile-v5
= Reduced size, Auto Tiling
- This is my first run, which created the video in the gallery, simply using Auto Tile on the decoder and reducing the overall dimensions to 640x480. This reduction makes generation take less memory, but it is heavy-handed and will reduce the quality of outputs.
The remaining workflows all investigate the possible configs without Auto Tiling, so we know exactly what was used. Videos will be labelled with the batch count and added to the V1 gallery. Community research is required!
\Decoder-Research\Donut-Mochi-848x480-batch12-v5
frame_batch_size = 12
tile_sample_min_width = 256
tile_sample_min_height = 128
\Decoder-Research\Donut-Mochi-848x480-batch14-v5
frame_batch_size = 14
tile_sample_min_width = 224
tile_sample_min_height = 112
\Decoder-Research\Donut-Mochi-848x480-batch16-v5
frame_batch_size = 16
tile_sample_min_width = 192
tile_sample_min_height = 96
\Decoder-Research\Donut-Mochi-848x480-batch20-v5
frame_batch_size = 20
tile_sample_min_width = 160
tile_sample_min_height = 96
\Decoder-Research\Donut-Mochi-848x480-batch24-v5
frame_batch_size = 24
tile_sample_min_width = 128
tile_sample_min_height = 64
\Decoder-Research\Donut-Mochi-848x480-batch32-v5
frame_batch_size = 32
tile_sample_min_width = 96
tile_sample_min_height = 48
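For convenience, the sweep above can also be written down as data. The short sketch below is my own illustration (the parameter names mirror the workflow settings; the printed footprint is a rough proxy for decode memory, not a VRAM measurement) and shows how the shrinking tiles offset the growing frame batches:

```python
# The decoder-research sweep as data, with a rough per-chunk pixel footprint.

configs = [
    {"frame_batch_size": 12, "tile_w": 256, "tile_h": 128},
    {"frame_batch_size": 14, "tile_w": 224, "tile_h": 112},
    {"frame_batch_size": 16, "tile_w": 192, "tile_h": 96},
    {"frame_batch_size": 20, "tile_w": 160, "tile_h": 96},
    {"frame_batch_size": 24, "tile_w": 128, "tile_h": 64},
    {"frame_batch_size": 32, "tile_w": 96,  "tile_h": 48},
]

for c in configs:
    footprint = c["frame_batch_size"] * c["tile_w"] * c["tile_h"]
    print(f"batch {c['frame_batch_size']:>2}: {c['tile_w']}x{c['tile_h']} tiles "
          f"-> {footprint / 1e6:.2f} Mpx per decode chunk")
```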
The last workflow is a hybrid approach; the increased overlap factors (0.3 instead of 0.25) might help reduce visible seams when using very small tiles.
\Decoder-Research\Donut-Mochi-848x480-batch16-v6
frame_batch_size = 16
tile_sample_min_width = 144
tile_sample_min_height = 80
tile_overlap_factor_height = 0.3
tile_overlap_factor_width = 0.3
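As a rough illustration of what the overlap factor does (my own sketch, not the decoder's source), each tile shares about tile_size x overlap_factor pixels with its neighbour, and that shared band is where blending hides seams; with very small tiles, 0.25 leaves only a narrow band, which is why nudging it to 0.3 can help:

```python
# Illustrative layout maths: where tiles start and how wide the blend band is.

def tile_layout(length: int, tile: int, overlap_factor: float):
    """Return start offsets of tiles covering `length` pixels, plus the overlap in px."""
    overlap = int(tile * overlap_factor)
    stride = tile - overlap
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:        # make sure the final tile reaches the edge
        starts.append(length - tile)
    return starts, overlap

starts, ov = tile_layout(848, 144, 0.30)
print(f"tiles start at {starts}, blending over ~{ov}px bands")
# 144 * 0.25 = 36px of blend vs 144 * 0.30 = ~43px - a wider band for small tiles.
```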
V2 Workflow
\CFG-Research\Donut-Mochi-848x480-batch16-CFG7-v7
This used the Donut-Mochi-848x480-batch16-v6 workflow with 7.0 CFG
This seems to be a good setting; generation time is 24 minutes with this setup.
(PyTorch SDP used)
V3 Workflows
\FP8--T5-Scaled\Donut-Mochi-848x480-batch16-CFG7-T5scaled-v8
We decided to use the FP8_Scaled T5 CLIP model; this improved the outputs greatly across all prompts tested. Check the V3 gallery. This is the best so far! (until we beat it)
\GGUF-Q8_0--T5-Scaled\Donut-Mochi-848x480-b16-CFG7-T5scaled-Q8_0-v9
This did not yield the best results, probably because the T5 scaled CLIP was still in FP8 while we were testing GGUF Q8_0 as the main model.
V4 Workflow
\T5-FP16-CPU\Donut-Mochi-848x480-b16-CFG7-CPU_T5-FP16-v11
Used T5XXL in FP16 by forcing it onto the CPU. It shows the same artifacts as V3, where we used GGUF Q8_0 with T5XXL FP8.
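For readers who want to see the idea outside ComfyUI, here is a hedged sketch of keeping the T5-XXL text encoder on the CPU so only the small conditioning tensor touches VRAM. The model ID and the float32 CPU compute are my own simplifications; the workflow itself uses the FP16 T5XXL checkpoint and ComfyUI handles the equivalent device placement for you.

```python
# Sketch: run the large text encoder on CPU, keep VRAM for the video model.
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
text_encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl", torch_dtype=torch.float32  # float32 for CPU-friendly compute
).to("cpu")

tokens = tokenizer("a hot air balloon drifting over a valley", return_tensors="pt")
with torch.no_grad():
    prompt_embeds = text_encoder(**tokens).last_hidden_state

# Only the conditioning tensor (a few MB) needs to move to the GPU for sampling.
prompt_embeds = prompt_embeds.to("cuda")
```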
V5 Workflows
\GGUF-Q8_0--T5-FP16-CPU\Donut-Mochi-848x480-GGUF-Q8_0-CPU_T5-FP16-v14
These were the best settings with VAE Tiling enabled; increasing the steps will of course increase both the quality and the time taken.
Increasing the steps to 100-200 improves quality at the expense of time; 200 steps takes 45 minutes. There will likely be no dedicated version for this, because anybody can add more steps to any of these workflows and just wait a very long time for a 6-second video. This can be remedied with a cloud setup and a larger GPU/VRAM allocation.
V6 Workflows
\Fast-25-Frames\Donut-Mochi-848x480-Fast-v4
Uses VAE Tiling with 25 frames to generate 1 second of video. With 50 steps this takes a few minutes, and 4-5 minutes for 100 steps.
\NoTiling-SaveLoadLatent\Donut-Mochi-848x480-i2v-LatentSideload-v21
Using my new DJZ-LoadLatent node, you can save the sampler results as .latent files on disk. This makes it possible to decode the latents as a separate stage, eliminating the need for the Tiling VAE. This is image to video: it uses OneVision to estimate a video prompt from any given image, and it automatically detects tall or wide aspect ratio and crops/fills to 16:9 or 9:16. NOTE: more testing must be done to prove that tall-aspect quality is good.
\NoTiling-SaveLoadLatent\Donut-Mochi-848x480-t2v-LatentSideload-v25
This is the text-to-video version of the previous workflow; we drop OneVision and ImageSizeAdjusterV3 and add Zenkai-Prompt-V2 back in to take advantage of our prompt lists. Full instructions are found in the workflow notes.
The Save/Load Latent approach allows us to drop the Tiling VAE, which introduced ghosting to all videos regardless of the settings; as overall quality improves, that ghosting becomes more apparent.
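To show the shape of this two-stage approach, here is a minimal sketch using safetensors. The "latent_tensor" key matches what ComfyUI's .latent files used at the time of writing, but treat the exact on-disk format as an assumption; `sampler` and `vae` are placeholders.

```python
# Two-stage sketch: stage one samples and writes latents, stage two only decodes.
import torch
from safetensors.torch import save_file, load_file

def save_latent(samples: torch.Tensor, path: str) -> None:
    """Stage 1: persist sampler output so VRAM can be freed before decoding."""
    save_file({"latent_tensor": samples.contiguous().cpu()}, path)

def load_latent(path: str) -> torch.Tensor:
    """Stage 2: reload the latents for a separate, low-VRAM decode pass."""
    return load_file(path)["latent_tensor"]

# latents = sampler(...)                                  # slow, VRAM-heavy stage
# save_latent(latents, "output/scene_001.latent")
# ...free memory or even restart, then later:
# video = vae.decode(load_latent("output/scene_001.latent").to("cuda"))
```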
V7 Workflows
Updated the V6 latent sideload workflows to use the newer VAE Spatial Tiling Decoder
This can run 100% on a local GPU, and all the demo videos in the gallery used only 50 steps
(100 steps were used in the V6 gallery). Another significant upgrade!
\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-LatentSideload-v50.json
text2video, VAE spatial tiling decoder, with my latent loader
\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-i2v-LatentSideload-v50.json
pseudo image2video, VAE spatial tiling decoder, with my latent loader
\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-BatchLatentSideload-v55.json
text2video, VAE spatial tiling decoder, with my V2 batched latent loader
\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-i2v-BatchLatentSideload-v55.json
pseudo image2video, VAE spatial tiling decoder, with my V2 batched latent loader
NOTE: V7 is available on GitHub in my DJZ-Workflows pack; however, it will not be published here until the new batch of videos is finished (cooking all night tonight)
V8 Workflows
\True-Image-To-Video\Donut-Mochi-848x480-i2v-LatentSideload-v90.json
image2video, VAE spatial tiling decoder, with my latent loader
\True-Image-To-Video\Donut-Mochi-848x480-i2v-BatchedLatentSideload-v90.json
image2video, VAE spatial tiling decoder, with my V2 batched latent loader
Added true i2v (image to video using new VAE Encoder)
Tutorial video TBA; details in the main article.
Comments (13)
From deciding to try it after YouTube kept pushing the video on me, to up and running, to the new version (V8) update, to a 2-second clip of a hot air balloon above a valley, all inside of an hour or so. This is text to video on consumer gear. Great.
Thanks so much !
It's been getting kinda confusing now that the new checkpoint (combined method) was released, so I'm glad that you are finding it works great :)
Very cool videos.
But the archive is a mess. Like, what are you even supposed to open and run? Your video talks about pseudo i2v, but the files only have a true i2v. The version numbering doesn't match either. The notes don't match the video.
That being said, this workflow is interesting and I hope it'll be made easier to get into soon.
Well actually...
Each folder is documented and referenced in the workflow pack description on Civit. (links in description for each video & from Civit Article)
Research build notes (V1-V6) were all fully explained in the GitHub thread; this is explained in the first video also.
I also included the folder paths in the pack notes explaining what each folder is.
However, there are many options for how to run Mochi, and there is a new combined checkpoint now, which I have never covered - yet. This is normal in open research, as the author will make changes that often break or alter nodes I have not written myself - this is why I documented exactly what I did.
Any changes are beyond my control after the fact.
All that being said - in future the folders will correspond to my pack versions. This is the best I can do when many factors are beyond my control.
I have made every effort to indicate which versions are the most up to date.
Also - there is a GitHub version which has all 301 workflows from every video on my channel in one place for "git clone" users.
I hope you can see it's not so simple when things move very quickly.
Often the last version is actually made obsolete "overnight" when developers are working hard to improve the code, and so workflows must be rebuilt.
The only True Image-to-Video workflow is in the folder of the same name. I try to make it easy for everyone.
There will be more updates to the pack - the folder names are intended to give you a clue. It's a little harsh to say that nothing matches the video, when there are multiple videos.
BTW ~ the version number at the end is for my own organization, so I know which workflow it is; every time I make a change, that number goes up by one. It has no bearing on the pack version.
No dramas, this is just the full answer to all your points.
@driftjohnson I hope my comment didn't feel like an attack.
I just feel it's a shame for all your effort to go to waste if the last 10% of usability isn't there. There's a lot of good stuff here, and it's sooo close to being a great package.
I'll follow and maybe I can understand it all better soon.
@RRMO I totally agree; however, there were a few thousand people who were "catching the ball while it was up in the air", and the problem is that when it lands, the bounce can mean the older stuff becomes useless.
I'm used to this - we make nice, usable workflows and there is always a chance they get broken for a number of reasons. However, with WIP projects I'm thankful the maintainer/author released their code for us, to help find the best settings and give feedback.
I could have deleted the old stuff, but I prefer to include it all in the spirit of transparency. I'm an open source researcher and I hate it when people delete stuff retrospectively. Sometimes the evidence and experiments that were done are helpful moving forward.
Some of my other packs went on to V20 and in those cases, every version was still valuable, but it still caused huge confusion for people who don't have the time to watch 20 hours of tutorial just to understand one workflow collection.
I included all previous versions in one pack, as it is a complete research project. The problem with marking a specific folder "Current" is that it will be downloaded to a person's computer, and then all the folders end up marked Current if they download them as they are released. This is why I made the master pack on GitHub.
Ultimately - it's a challenge to make these indications, and I know that most don't watch the video guides they are often paired with. I'm always experimenting with better ways to highlight and mark workflows, but there is no single best solution for this.
I've been doing this for a long time, so it never feels like an attack - this is why I take the time to give a proper answer on this front. If I really thought it was an attack, there would likely be no response given :)
To directly answer the question (it's in the description on Civit): V7 and V8 are the current workflows; they are held in:
"Donut-Mochi-Video/V7-Spatial-Tiling-VAE" (has a guide video) &
"Donut-Mochi-Video/True-Image-To-Video" (guide video should release today)
That "decode samples" section is disabled and it ends up just producing a .latent file.
Largely useless as it is?
Hitting "R", like it says, doesn't do anything
Incorrect - the .latent file is loaded by my nodes; thousands have used them with success and no change has been made on my part. Once loaded, they are decoded by the VAE spatial tiling decoder. You are using an advanced 2-stage workflow, which is intentionally set up this way to halve the VRAM requirement of running it.
This allows for more frames and higher steps - resulting in higher quality outputs.
Please read descriptions and/or watch the video guide before jumping to conclusions
NOTE: this is set up with the Q8_0 model and the T5XXL-FP16, offloaded to CPU,
not the combined ComfyUI checkpoints, which released two weeks after this workflow.
Thanks ;)
If changes were made to the comfycore latent save node, this is also beyond my control.
My Load Latent node is looking inside the \output\ folder, and supports subdirectories
The comfycore latent load node requires manual copying of .latent files from \output\ to \input\ and does not support subdirectories, at the time of posting.
A new version of this pack (V9) will release for casual users of the new checkpoints.
TLDR - watch the video for V8 for full instructions, or use the "fast bypass group switch" to swap between encoding (top level) and decoding (bottom level). "R" refreshes the loader, which is disabled by default; if you have not encoded any .latent files, there will be nothing available to load - this is a limitation of ComfyUI.
Epic research as always. Surfs up!
Excellent job sir!
I am not good at any of this, but anyway:
Your workflow does 6.5s/it - I have an Nvidia 4090. Is this correct?
It gets as far as the djz load latent node and that has a red line around it... :/
Could you please make a workflow that is as simple as possible and is focused on making Mochi run fast af? People on Reddit are telling me of 2s/it workflows but I keep running into errors with missing somethings. I've created an entirely separate ComfyUI install purely for Mochi and am still suffering setbacks I can't figure out.
Hi there, you might be missing my Custom Nodes: https://github.com/MushroomFleet/DJZ-Nodes
You can find them in the Comfy Manager too for easy install; manual instructions are on the above link.
My custom load latent node includes batch operation in V2 and scans your output folder, so you don't need to stop and copy files manually to the \input\ path, and it supports subdirectories (a small sketch of this scanning behaviour follows below).
I use the project file path generator in a lot of my workflows (also in DJZ nodes) which helps to keep your projects organized instead of a big wall of outputs
And many other nodes too, such as safe dimensions and prompt gen / lists support.
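Here is the promised sketch of that scanning behaviour. It is a hypothetical illustration rather than the node's actual source; the folder name and newest-first sort order are assumptions.

```python
# Hypothetical illustration: recursively list .latent files under the output folder,
# so nothing has to be copied by hand into \input\ and subdirectories are included.
from pathlib import Path

def list_latents(output_dir: str = "output") -> list[str]:
    """Return every .latent file under the output folder, newest first."""
    root = Path(output_dir)
    files = sorted(root.rglob("*.latent"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    return [str(p.relative_to(root)) for p in files]

print(list_latents())  # e.g. ['projectA/scene_001.latent', 'scene_000.latent']
```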
I was focused on quality with the latest release, opting to split the encoding (takes forever) and decoding (super fast) into two stages; however, the next pack update will feature more options to keep everyone happy.
Speed is really down to two factors: frame count (fewer is faster) and step count (fewer is faster), but of course this makes for a quality trade-off decision.
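As a back-of-the-envelope illustration of those two factors (my own arithmetic; it uses the 6.5 s/it figure reported above, the decode-time constant is a placeholder, and the per-iteration cost itself rises with frame count):

```python
# Rough estimate only: sampling time ~= steps x seconds-per-iteration, plus decode.

def estimate_minutes(steps: int, sec_per_it: float, decode_sec: float = 60.0) -> float:
    return (steps * sec_per_it + decode_sec) / 60.0

print(f"{estimate_minutes(50, 6.5):.1f} min at 50 steps")    # ~6.4 min
print(f"{estimate_minutes(100, 6.5):.1f} min at 100 steps")  # ~11.8 min
```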
I also used a 4090, and it depends exactly how you have your setup running. I elected to allow people to use many different model types to get the best quality possible, and to queue my encoding overnight, then batch the decoding in the mornings. This seemed to make sense at the time. I was doing this about 1-2 weeks before the Comfy official support and checkpoint arrived.
The next update will have a video explaining the whole setup, as I wanted to let the dust settle since official support arrived. Be sure to check the links on my workflow pack page - it's all very important/useful if you are stuck - and drop by my Discord if you need further help!