FLUX Monte Carlo
Note: This model has undergone deterministic bias correction + stochastic regularization.
Even this took around 2GW of power.
Training a FLUX model at FP32 would take a supercomputer and a power plant. This model was run for 1 trillion iterations to restore FP32 precision.
Due to the complexity of the tensor shapes, this was done at 10,000 steps per element, with noise guidance based on the tensor.
Comments (46)
The hilarity to ensue if they let you run this on site.
My understanding is that the commercial license does not allow them to run unofficial FLUX models.
You're out of your mind...and I love you for it. I can't wait to try this.
I do not recommend forcing an FP32 UNET in most cases, although it does render a slightly different image, even seed to seed.
@Felldude I was just messing with you. I get what you're doing here. Even if downcast or quantized, having that extra precision will make a difference.
@AUsername111 Aesthetically, I think the FP8 fast was giving the best-looking images; the noise makes them look less AI and less processed.
@Felldude This is what I've noticed as well, literally adding "Film Grain" to images makes them look so much more 'real'. The super smooth textureless skin makes it look so fake!
If you're going for that effect it's fine, but in most cases it looks VERY fake and uncanny, much more so than with even the slightest noise added.
Downloading just to say I did, and to have this baby ready for the day I can use it LOL
But I'm thinking Q8 GGUFing just to see. Since it's FP32.
I did an NF4 test, but not a Q8 or Q4 GGUF.
Via comfyui I tried casting it to fp8 but it crashed (RTX 4090 and 64gb RAM 😥)
Q8 Conversion completed in 0 hour(s) 9 minute(s) 48.49 second(s).
I'll give it some tests tomorrow, headed to bed at the moment - Have a good one Felldude!
Awesome model, probably my favorite now! The only issue I found is the nipples: they seem to be somewhat deformed on most creations, though this is easily fixed with tweaking.
Thank you
Just wondering how some of you have gotten this to run? I have an RTX 4090 and 64gb of RAM and it always crashes on loading, even when data type is set to FP8. Any tips/tricks?
My recommendation of a 120GB virtual memory allotment was not a joke. Beyond that, make sure you do not have the --highvram or GPU-only flags set.
@Felldude Thank you.
@AUsername111 I have an RTX 4090 laptop, 64GB RAM, Linux. I can run with or without FP8, though I use FP8 FAST to render quickly. It uses 32.5 GB chip RAM, 31.1 GB cache RAM and 11.3 GB of VRAM. I use:
--listen --use-sage-attention --bf16-vae --cuda-malloc --bf16-text-enc --fast --normalvram
@Agimax Thanks for sharing. You may want to use --fp32-text-enc, as my understanding is that Comfy would be taking the FP32 TE and CLIP, downcasting them to BF16 per your argument, then casting to FP16 for use on your CPU, and, depending on how your PyTorch is set up, possibly upcasting from FP16 back to FP32.
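The cast chain described above can be illustrated with a minimal sketch. It assumes BF16 is modeled by keeping only the top 16 bits of an FP32 value (plain truncation, whereas real frameworks usually round to nearest), and it uses only the Python standard library, not PyTorch or ComfyUI code. The function names are hypothetical. The point is that upcasting back to FP32 after a BF16 downcast does not bring back the discarded mantissa bits.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bf16_round_trip(x: float) -> float:
    """Downcast FP32 -> BF16 by truncation, then upcast back to FP32."""
    # BF16 keeps the sign bit, 8 exponent bits and the top 7 mantissa bits.
    bits = fp32_bits(x) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

weight = to_fp32(0.1234567)          # an arbitrary example weight
restored = bf16_round_trip(weight)   # what survives the BF16 detour

print("original FP32 :", weight)
print("after BF16    :", restored)
print("absolute error:", abs(weight - restored))
```

Values that already fit in BF16 (such as 1.0) pass through unchanged, which is why the difference only shows up in the low-order bits of the weights.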
Thanks guys. Already did the virtual memory trick. Also created F16 Quant and a FP8 safetensor :) Even loaded at full strength and generated with it. The quality difference is definitely there. Thanks for making this.
@AUsername111 Thanks, glad it's working for you
I know - this is stupid, but maybe, just maybe ... GGUF?
If the quantization shows the same improvement as the full model across hundreds of images
@Felldude yes, but still - for all of us, simply to test and play around with it ;)
Is there a special FP32 clip for this? The highest I have is FP16.
Yeah I will link them
@Felldude Thank you. Running on a RTX 4090 with 24GB VRAM and 124GB system RAM and have about 30 second run times for a 1024x1024 image. Not bad at all. Still looking at my results. I will post and give more feedback tomorrow.
Ty for the gguf, works perfect! A++++
Thanks
Which is higher quality and closer to Flux Pro, this or the De-distilled Flux by Nyanko7? Also, does it make sense to run this as a daily model or is it more for experimentation?
The de-distilled model would be a completely different approach, and if your goal is NSFW it might work better with some TE-trained LoRAs.
As to daily use: if you have the system RAM to load all the models, about 70GB (virtual memory can be used), then the time to render is the same (unless you use the full FP32 command).
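The roughly 70GB figure is consistent with simple size arithmetic. The sketch below multiplies parameter counts by bytes per weight. The parameter counts (about 12B for the FLUX transformer, about 4.7B for the T5-XXL encoder, about 0.12B for the CLIP-L text encoder) are approximate figures assumed for illustration, not values stated in this thread, and the VAE and file overhead are ignored.

```python
BYTES_PER_WEIGHT = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}

# Approximate parameter counts in billions (assumptions, not from the thread).
ASSUMED_PARAMS_B = {
    "flux_transformer": 12.0,
    "t5_xxl_encoder": 4.7,
    "clip_l_text": 0.12,
}

def size_gb(params_billions: float, dtype: str) -> float:
    """Weight size in GB (1e9 bytes) for a given storage dtype."""
    return params_billions * BYTES_PER_WEIGHT[dtype]

def total_gb(dtype: str) -> float:
    """Combined size of all assumed components at one dtype."""
    return sum(size_gb(p, dtype) for p in ASSUMED_PARAMS_B.values())

for dtype in ("fp32", "bf16", "fp8"):
    print(f"{dtype}: ~{total_gb(dtype):.1f} GB")
```

Under these assumptions the FP32 total lands a little under 70GB, and halving the precision halves the footprint.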
I appreciate the quick reply. :) I don't need it for NSFW. I'm doing landscapes and architectural imagery with LoRAs and a good amount of img2img transformations with controlnets. I'm mostly concerned about which can bring out more detail, follows prompts better, and if it's feasible to use your Monte Carlo version for the purposes I outlined above. I do have 96GB DDR5 RAM and RTX 5090 graphics card, so I guess this should be possible? Again, I'm more concerned with quality than speed. I can wait a bit longer if there's an appreciable difference in image quality and prompt comprehension. :)
@mmdd2543 With that setup you could try --force-fp32. In my testing the images sometimes vary drastically with the FP32 UNET; however, they did take 3x-4x longer. If using the FP32 FlanT5xxl that is linked, the prompt adherence should be quite high. Personally I only use --fp32-text-enc and then either BF16 UNET or FP8 fast
@Felldude Got it! I will try it out when I find some time for experimentation. Thanks a bunch!! 👍
I am impressed with your work on this model. At first it didn't work, it always crashed. Strangely, without making any changes to the comfyui, this model started working. Unfortunately, for some reason, this model no longer works. I started getting the error "The paging file is too small for this operation to complete. (os error 1455)". I have RTX 4090 and 64 DDR 5 on W11. It worked for a while without making any changes to comfyui or windows and suddenly it stopped working.
The paging file error suggests that Windows is managing virtual memory and running out. I would suggest setting it to 64GB or larger
@Felldude Thanks for your answer. It is strange that it went for a while without making changes to either comfyui or virtual memory. I currently have 28 439 MB in Virtual Memory configured.
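The numbers in this exchange can be compared with a small sketch. It assumes the Windows commit limit is roughly physical RAM plus the paging file size, and it reads the 120GB allotment recommended earlier in the thread as a total of RAM plus paging file, which is only one possible interpretation. The helper names and the target constant are illustrative.

```python
def commit_limit_gb(ram_gb: float, pagefile_mb: float) -> float:
    """Approximate commit limit: physical RAM plus paging file, in GB (1024 MB)."""
    return ram_gb + pagefile_mb / 1024

def shortfall_gb(limit_gb: float, target_gb: float) -> float:
    """How many GB short of the target the commit limit is (0 if it is met)."""
    return max(0.0, target_gb - limit_gb)

# Assumption: the 120GB figure is read as total RAM + paging file.
TARGET_GB = 120

# 64GB RAM with the 28,439 MB paging file reported above.
current = commit_limit_gb(ram_gb=64, pagefile_mb=28439)
# 64GB RAM with the paging file raised to the suggested 64GB.
suggested = commit_limit_gb(ram_gb=64, pagefile_mb=64 * 1024)

print(f"current: {current:.1f} GB, short by {shortfall_gb(current, TARGET_GB):.1f} GB")
print(f"64GB paging file: {suggested:.1f} GB, short by {shortfall_gb(suggested, TARGET_GB):.1f} GB")
```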
@Felldude Cannot find flux_vae_fp32.safetensors anywhere on the internet or civitai. Tried original flux VAE ae.safetensors but images do not match the prompt.
Please upload
I renamed the ae file to what you see in the workflow. Did you launch with the FP32 text encoder?
@Felldude I used your workflow from the 1st image here:
https://civitai.com/images/71295200
I selected the 2 FP32 clip files you mention to download, and ae.safetensors for vae.
GGUF load checkpoint fluxMonteCarloFull_unetGGUFQ8.gguf.
Does not follow the prompt at all
@r600 Using --fp32-text-enc
@Felldude added to run_nvidia_gpu.bat:
--fp32-vae --fp32-text-enc --force-fp32
It works! Although the difference is very minor compared to fp16
fluxMonteCarloFull_unetGGUFQ8.gguf is FP8 8 bit?
What good is it to have 32bit text/clip/vae Encoders?
@r600 --force-fp32 will force an FP32 UNET, which slows down generation, and I would not recommend it for most users. Forcing the CLIP to FP32 is not an issue for most users, unless you have a 24GB video card, run the NF4 or Q4, and use the --highvram flag
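The recommendation above can be written as a small helper that assembles launch arguments. This is a sketch only: the function name and its parameters are hypothetical, the flags --fp32-text-enc, --force-fp32 and --highvram are taken from this thread, and --gpu-only is an assumed spelling of the "gpu only" flag mentioned earlier. None of them are checked here against a particular ComfyUI version.

```python
def build_launch_args(
    full_fp32_unet: bool = False,
    extra: tuple[str, ...] = (),
) -> tuple[list[str], list[str]]:
    """Return (args, warnings) following the advice in this thread."""
    # The FP32 text encoder is the setting recommended for general use.
    args = ["--fp32-text-enc"]
    warnings = []

    if full_fp32_unet:
        # Forces the UNET to FP32 as well; reported as 3x-4x slower.
        args.append("--force-fp32")
        warnings.append("--force-fp32 slows generation; not recommended for most users")

    for flag in extra:
        if flag in ("--highvram", "--gpu-only"):
            warnings.append(f"{flag} is advised against for this model")
        args.append(flag)

    return args, warnings

args, notes = build_launch_args(extra=("--normalvram",))
print(" ".join(args))
for note in notes:
    print("warning:", note)
```

Returning warnings instead of raising keeps the helper usable for experiments such as the full FP32 run, while still surfacing the caveats.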
@Felldude after much testing, this blows the original flux_dev_q8.gguf out of the water!
recommend this lora to fix flux chin:
https://civitai.com/models/775002/chin-fixer-2000
Brilliant model if you can get it to work.