CivArchive
    Belle Delphine - Flux.dev - v1.0

    This is a LoRA of the internet celebrity Belle Delphine for Flux.dev.

    Trigger word: “Belle Delphine”.

    Suggested LoRA weight: 0.6 – 1.1
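To illustrate what that weight does: at inference, the LoRA's low-rank delta is scaled by the user-chosen strength before being added to the base weights. A minimal sketch of the generic LoRA math (not ComfyUI's actual patching code, and omitting the usual alpha/rank scaling for brevity):

```python
import numpy as np

def apply_lora(W, A, B, weight):
    """Return base weight W with the low-rank LoRA delta B @ A scaled in.

    `weight` is the user-facing LoRA strength (e.g. 0.6 - 1.1 as suggested above).
    """
    return W + weight * (B @ A)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # base weight matrix (toy size)
A = rng.standard_normal((2, 8))   # LoRA down-projection, rank 2
B = rng.standard_normal((8, 2))   # LoRA up-projection

W_patched = apply_lora(W, A, B, weight=0.8)
# weight=0 leaves the base model unchanged
assert np.allclose(apply_lora(W, A, B, 0.0), W)
```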

    The model is trained at 512, 768 and 1024 resolutions.

    As with most Flux LoRAs, the model is quite flexible. However, this particular LoRA is also overtrained (slightly mitigated in retrospect; more on that in the training section), which results in worse prompt following, especially for text. With multiple tries it is still fine, as can be seen in the example images.

    Quite a few additional tags were captioned, but the training did not retain them. Two triggers that might still make a small difference are: braces and snapchat

    Images were exclusively generated in ComfyUI.

    Training

    As always, I will add a little bit about the training.

    I watched Flux with interest and originally planned to wait for the software tooling to mature a bit more, but by now I have seen enough results from other people that I thought it might be time.

    Flux has shown that it captures a person's likeness well even with a low image count, so it was very likely that people would produce a very good Flux LoRA of Belle Delphine before I did anything (which, as expected, was the case). However, I always use the Belle Delphine dataset to test new things for myself, as there is so much data of her available. So I decided: screw it - and trained this model on the same large dataset I used for the LoRA version for the pony checkpoint.

    This dataset already had both booru tags as captions and natural language generated by a VLM. However, I felt the quality of the natural-language captions was too low (the model was also configured to use shorter prompts), so I decided to recaption all the images in the dataset. I used InternVL2-8B for this and generated more diverse captions (both longer and shorter). For the training I exclusively used these new natural-language captions.

    Then I had to decide which trainer to use. I planned on using kohya, but reading through several GitHub issues it seemed it wasn't quite there yet, so I went with the ai-toolkit by ostris instead. Of course, this meant missing out on some nice features like masked training (I had masks for the entire dataset). I feared that this might make watermarks worse, so I took special care to label them explicitly in the captions, hoping the model would then not generate them unless prompted (which at least semi-worked).

    People also recommend low step counts, but they also use small datasets of around 15 images. I, on the other hand, have orders of magnitude more data than that, so I simply decided on 30,000 steps (without any scientific reason). And since I didn't know how fast it would overfit or degrade, I saved a checkpoint every 600 steps, resulting in 50 versions of the LoRA.

    And my fears turned out to be true:

    At 30,000 steps the model is mostly worse than at 10,000 steps, while at the same time not having learnt some of the concepts that were tagged. So the conclusion is to use a smaller, more refined dataset, and then perhaps additional LoRAs for further concepts? When I have the time again, I will maybe either add a large number of extra steps to see whether the result changes, or try training multiple smaller datasets and combining the resulting models.

    To at least save the overcooked model a little bit, I decided to merge different versions of it, specifically: 10200 Steps – 50%; 6000 Steps – 8%; 18000 Steps – 20%; 30000 Steps – 22%
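The merge above amounts to a per-tensor weighted average of the four checkpoints. A toy sketch of that arithmetic, with plain floats standing in for the tensors (the real merge operates on safetensors state dicts, but the math is the same):

```python
def merge_loras(state_dicts, weights):
    """Weighted average of LoRA state dicts. Plain float dicts here for
    brevity; real checkpoints hold tensors, but the arithmetic is identical."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    keys = state_dicts[0].keys()
    return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts)) for k in keys}

# Toy stand-ins for the four checkpoints merged above
ckpt_10200 = {"lora.down": 1.0}
ckpt_6000  = {"lora.down": 2.0}
ckpt_18000 = {"lora.down": 3.0}
ckpt_30000 = {"lora.down": 4.0}

merged = merge_loras(
    [ckpt_10200, ckpt_6000, ckpt_18000, ckpt_30000],
    [0.50, 0.08, 0.20, 0.22],
)
# 0.50*1 + 0.08*2 + 0.20*3 + 0.22*4 = 2.14
assert abs(merged["lora.down"] - 2.14) < 1e-9
```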

    To my dismay, the kohya scripts did not support the LoRA trained with ostris's trainer out of the box, so I had to manually adjust the script to make the merging work. I then also planned to at least reduce the file size slightly, as I had trained with rank 16 (the default), and once again had to modify some code for that. I resized from rank 16 to rank 16, which should keep the same quality while shaving off roughly 60 MB.

    Overall, I am only semi-happy with the results, but since I already spent the time on it, I thought I would share the model once again.

    Training was done on a 4090 in about 24 hours.

    Other relevant training settings:

    • Alpha + Dim = 16

    • Caption Dropout at 5%

    • Trained at 512, 768, 1024 resolutions

    • Batch size 1

    • Noise Scheduler flowmatch

    • Enabled linear timesteps

    • Optimizer adamw8bit

    • Learning rate 1.69e-4

    • Trained on the quantized model
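The "Caption Dropout at 5%" setting above means that on a small fraction of training steps the caption is replaced with an empty string, so the model also learns the unconditional case. A minimal sketch of the idea (assuming the 5% probability listed):

```python
import random

def maybe_drop_caption(caption, p=0.05, rng=random):
    """With probability p, train this step on an empty caption.
    This is the standard form of caption dropout in LoRA trainers."""
    return "" if rng.random() < p else caption

rng = random.Random(42)
captions = ["Belle Delphine, pink wig, selfie"] * 10_000
dropped = sum(1 for c in captions if maybe_drop_caption(c, 0.05, rng) == "")
# Roughly 5% of 10,000 captions get dropped
assert 350 < dropped < 650
```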

    Disclaimer

    I want to highlight again that this model is non-commercial, and you should only post images on CivitAI which follow the Content Rules.

    Users are solely responsible for the content they generate using this LoRA. It is the user’s responsibility to ensure that their usage of this model adheres to all applicable local, state, national and international laws. I do not endorse any user-generated content and expressly disclaim any and all liability in connection with user generations.

    Description

    FAQ

    Comments (13)

    barlogspam176 · Aug 24, 2024
    CivitAI

    How does this compare to the other Belle Flux model?

    DiffusedIdentity
    Author
    Aug 24, 2024

    I would say it is difficult to compare and it is not like one is better or worse than the other.

    I would say each is more rigid in some respects and more flexible in others. The other LoRAs follow prompts more closely and need fewer generations; text, for example, will already work on the first or second try.

    However, they are trained on a smaller variety of images. And I would argue Belle has looked different over the years. My LoRA can generate multiple looks (although you won't really be able to pick which one), while the other LoRAs only recreate one specific likeness.

    So if you just want a good, specific likeness with flexible prompting, the other LoRAs are better. If you want certain looks which may seem familiar and are willing to run multiple tries, you could use this LoRA.

    I will potentially come back to this with a dataset that is better suited for Flux. But that might happen very late or not at all.

    barlogspam176 · Aug 24, 2024

    @DiffusedIdentity thank you for the exhaustive answer

    barlogspam176 · Aug 24, 2024 · 2 reactions

    I love your choice of the pictures to showcase the flexibility of the model. Amazing job!!!

    barlogspam176 · Aug 24, 2024 · 1 reaction

    Loved the description.

    Very useful.

    My experiments also showed that a 40-picture dataset performed better than a 130-picture dataset.

    If you would train in the future which numbers would you go for?

    DiffusedIdentity
    Author
    Aug 24, 2024 · 1 reaction

    As so often, this of course depends on the LoRA you are trying to train. I did not do sufficient testing for Flux, but my next trial would probably be around 50 images for a narrower subject (for example, one type of clothing outfit). This, however, will not work for trainings that use larger datasets with more diverse themes. I do not have the exact count (and am too lazy to check right now), but my Belle dataset probably already has over 50 different tags with at least 10 images each. So there it will not really be possible to have only a small but efficient dataset.

    The next thing I can imagine trying myself, when I have the time (and the tool support is better, although kohya is getting there), is finetuning the entire model on such larger data instead and then extracting a LoRA. The problem with that could be retaining the prompt adherence of the current Flux model, so it might also be necessary to include additional filler training data just so that, for example, text accuracy is not forgotten.
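Extracting a LoRA from a full finetune conceptually means factoring the weight difference between the tuned and base models with a truncated SVD. A small numpy sketch of that idea (not kohya's actual extraction script):

```python
import numpy as np

def extract_lora(W_base, W_tuned, rank):
    """Approximate the finetune delta with a rank-`rank` factorization,
    i.e. find A, B such that B @ A ~= W_tuned - W_base."""
    U, S, Vt = np.linalg.svd(W_tuned - W_base, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    B = U[:, :rank] * sqrt_s
    A = sqrt_s[:, None] * Vt[:rank]
    return A, B

rng = np.random.default_rng(1)
W_base = rng.standard_normal((32, 32))
# Simulate a finetune whose change is genuinely low-rank (rank 4)
delta = rng.standard_normal((32, 4)) @ rng.standard_normal((4, 32))
W_tuned = W_base + delta

A, B = extract_lora(W_base, W_tuned, rank=4)
# When the delta is exactly low-rank, the extraction recovers it
assert np.allclose(B @ A, delta)
```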

    However, the dataset should also be refined for that (it can be narrowed down a bit), and I should probably not rely only on the masks to reduce the watermarks but get around to creating a script that will automatically inpaint them.

    Jumptown · Aug 25, 2024 · 1 reaction

    Thank you for detailed info about the training process, very helpful! I wish more people presented their work like you do.

    What tool did you use to caption images using InternVL2-8B? Did you write the script yourself or there is some existing script to do that?

    DiffusedIdentity
    Author
    Aug 26, 2024 · 1 reaction

    For speed reasons I am using vLLM. It is the fastest implementation I am aware of and also has good batching support. (You can also run an OpenAI-compatible API with it.)


    The code I use for the initial caption is almost unchanged from the example vLLM provides. However, I also post-process that caption (ensure that the key word is contained if a person is visible; reduce verbosity; add additional keywords; ...). But that is out of the scope of this answer.
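For illustration, such post-processing could look roughly like this (the function and its parameters are hypothetical stand-ins, not the author's actual script):

```python
def postprocess_caption(caption, trigger="Belle Delphine", extra_tags=(), max_words=120):
    """Hypothetical sketch of the post-processing described above:
    ensure the trigger word is present, trim verbosity, append extra tags."""
    words = caption.split()
    if len(words) > max_words:                  # reduce verbosity
        caption = " ".join(words[:max_words])
    if trigger.lower() not in caption.lower():  # ensure the key word is contained
        caption = f"{trigger}, {caption}"
    for tag in extra_tags:                      # add additional keywords
        if tag.lower() not in caption.lower():
            caption = f"{caption}, {tag}"
    return caption

out = postprocess_caption(
    "A young woman with pink hair takes a selfie.",
    extra_tags=("snapchat", "braces"),
)
assert out.startswith("Belle Delphine, ")
assert "snapchat" in out and "braces" in out
```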

    The script I run looks somewhat like this:

    https://pastebin.com/HUy5d2cH

    I do have even more modifications to the chat template, so you'll see custom code here assembling the template manually. If you do not need anything like that, it is easier to apply the chat template automatically by building a messages array and calling "tokenizer.apply_chat_template" (see here).

    Regarding performance: I think on a 4090 with batch size 8, using the AWQ model for even more speed, captioning one image took less than a second on average? I didn't really watch the progress bar, so sadly I can't give you more info on it.

    Jumptown · Aug 27, 2024

    @DiffusedIdentity Thank you very much for the script! I really appreciate your help and I wish you all the best! :)

    JaneB · Aug 29, 2024

    Have you tried using a higher learning rate? 30,000 seems excessive for Flux. I had good luck with 1,600-2000 steps with simpleTuner and even around 1k steps with the civit trainer which uses kohya

    DiffusedIdentity
    Author
    Sep 19, 2024

    Hey, sorry for the late answer (I was away) - the learning rate or similar is not the problem. As I already wrote in the description, I also observed that this high step count led to worse results, not better ones. However, if you have, for example, over 1,000 images containing diverse concepts, it would be impossible to learn them all with only one step per image. So I selected a higher step count when testing it out.

    BestBobbins · Dec 8, 2024 · 1 reaction

    Very interesting thought about merging some of the overtrained epoch back. I found at 16/16, 60 images (20 closeup crops), I was overfitting even by 2800 steps, with 2e-4. But that epoch is the one that is most reliable for likeness, with some excellent skin detail unique to that character. Thank you for detailing your process ❤️

    Only_Fuuka_OF · Apr 9, 2025

    Could you do a LoRA of a specific cosplay of hers? I already have a dataset.

    LORA
    Flux.1 D

    Details

    Downloads
    950
    Platform
    CivitAI
    Platform Status
    Deleted
    Created
    8/24/2024
    Updated
    5/14/2026
    Deleted
    5/23/2025
    Trigger Words:
    Belle Delphine

    Available On (2 platforms)

    Same model published on other platforms. May have additional downloads or version variants.