PixArt-Sigma-1024px_512px-animetune - CivArchive (CivitAI Archive)

PixArt-Sigma-1024px_512px-animetune - 1024px_v0.1

NSFW

4/7 1024px model update! 1024px_v0.4 Please check the details in the 1024px_v0.4 tab.

Compared to the 512px model, it's less stable and more prone to artifacts, but it can offer more compositional freedom. While the newer version has learned more concepts, v0.2 or earlier may be better for aesthetic results.

3/5 512px model update! 512px_v0.7 Please check the details in the 512px_v0.7 tab.

Personally, I recommend the 512px model. The 512px model has learned significantly more concepts. I like the workflow of using the 512px model for trial-and-error inference to generate good images, then either upscaling them via i2i with the 1024px model or SD1.5, or trying the same prompt with the 1024px model.

2/11 1024px & 512px workflow update! I have also added the TIPO workflow and SD1.5 i2i. TIPO: It reduces the effort of crafting prompts and makes it easy to generate images, so I highly recommend it. The SD1.5 i2i workflow is useful for improving details and changing styles. There's joy in choosing a model. It leverages the strengths of both PixArt and SD1.5. The "TinyBreaker" in Suggested Resources is a perfect example, further refined by exploring its potential. Be sure to check it out as well.

I also experimentally merged an SD1.5 model for i2i, so feel free to check it out if you're interested.

https://civarchive.com/models/1246353

A method to combine PixArt with SDXL has also been discovered.

https://github.com/kantsche/ComfyUI-MixMod

https://civarchive.com/models/1565538/a-pile-of-junk-mixmod-workflow

■This is an experimental fine tuning.

Attention: this fine-tuned model is very difficult to use!

The quality is not good!! Don't expect too much!

If you are interested in PixArt-Sigma for the first time, I recommend checking out a workflow that lets you run inference with the original model... Even if my model is not great, try other people's amazing fine-tuned models!

I think the "Comfy Sigma Portable" can be used even by those who have never used ComfyUI before. There's no need for a difficult installation. Just download and try it out!

Merging can be done with ComfyUI. The "Tool to easily merge models" is also simple and good.

●Forge also has the following extension available. Inference is also possible with SDNext.

It's not the smartest solution, but I've prepared a guide on using fine-tuned models in Forge. Feel free to use it as a reference. 2/16: With a recent update, my model can now be added and used for inference. I appreciate the developer for creating such a highly functional and user-friendly extension.

https://github.com/DenOfEquity/PixArt-Sigma-for-webUI

https://civarchive.com/articles/11612

The 'anime sigma1024px' in Suggested Resources is a flexible and aesthetically pleasing anime model. Give it a try.

I would be happy if you took even a little interest in PixArt. PixArt has potential.

My hope is for more people to discover base models with potential and to see their possibilities grow even further. I would be happy if I could help make that happen.

PixArt-Sigma is simple, highly lightweight, and capable of training with 300 tokens. Few models meet these conditions, making it a rare model with minimal training limitations. Since its hardware requirements are nearly the same as SD1.5, anyone can participate in training, and even individuals can conduct large-scale experiments with minimal burden. You can benefit from 300 tokens even during inference, and the small model size makes merge experiments easier. This is like an SD1.5 model with support for 1024px, DiT, T5, SDXL VAE, and improved contrast handling. I was looking for a model like this, and PixArt met that standard.

■I trained using OneTrainer.

Fine-tuning is performed on a dataset of 70,000 or 400,000 images (no use AI image) that mainly contains anime images, but also some realistic and AI images. All training uses booru tags. The training resolution is 512px or 1024px. PixArt is high quality but has low requirements, making it suitable for training. 12GB of VRAM is enough. Detailed information about the training is written at the bottom of the page, so please refer to it. I have also uploaded the OneTrainer configuration data.

■Please be careful as sexual images are also generated.

■Here are my recent favorite inference settings. This will be updated as needed.

This is not the optimal solution. Please try various things!

Both booru tags and natural language are available for use.

●Using SD1.5 i2i could be a good idea. This approach frees Pixart from its limitations.

Pixart has good compositional strength, but details like hands can often be challenging. Combining it with SD1.5 through i2i improves the details, allowing you to benefit from the strengths of both models.

Additionally, by switching the SD1.5 model, you can flexibly shift to any style—realistic, 2.5D, or anime. If you have the resources, combining it with SDXL is also an excellent option.

●The sample images have embedded workflows viewable in ComfyUI, but recently they’ve been converted to JPG to save space, so some may not load. Installing the extension below will allow you to check them.

https://github.com/Goktug/comfyui-saveimage-plus

●Sampler: "SDE" with cfg 2.5-6 and 12-20 steps, or "Euler cfg_pp" / "Euler A cfg_pp" with cfg 1.5-2.5 and 30-50 steps

Scheduler:"GITS" or "simple"

●Euler, Euler_CFG_PP, DEIS: Sharp with excellent composition, enjoying the aesthetics of collapse.

Euler_A: The most stable, ideal for poses and unique concepts, but less surprising.

DPM++_SDE: A middle ground—dynamic yet stable.

●GITS provides rich textures, Simple ensures stable generation quality, SDE stays true to the dataset, Euler is sharp, and Euler A offers stability.

I generally prefer GITS + "Euler," "Euler cfg_pp," or "SDE."

"GITS + Euler" or "Euler cfg_pp" is very sharp.

"GITS + SDE" is dynamic.

"simple + Euler A or SDE" feels stable and seems to improve fidelity, though it may have high contrast.

●GITS can produce amazing detail, but it sometimes seems prone to breakdowns or not following prompts. I prefer it when I want to focus on atmosphere using natural language. Simple, on the other hand, is stable and follows prompts well, making it more suited for character work.

●Resolutions slightly outside of 512x512 and 1024x1024 are acceptable. Resolutions like 512x768 or 1024x1536 may have minor issues but remain practical. For more stability, it’s best to stick to resolutions like 832x1216 that are closer to standard.

I prefer larger resolutions over stability, so I tend to choose non-standard resolutions.

●If you can't come up with a prompt, try using the prompt auto-generation below.

https://huggingface.co/spaces/KBlueLeaf/TIPO-DEMO

Command R+ does not censor or reject prompts, making it ideal for explicit natural language prompts. You can try it for free by creating an account on the official website.

●If a certain tag's effect is too strong, try lowering its weight or increasing the weight of other tags. It may not be non-functional but rather overly dominant, and this can help resolve the issue.

Be cautious with unique tags for characters, as they can be very dominant.

Character tags might even alter the style, so depending on the situation, placing character tags at the end and supplementing the character's traits with general tags like "1girl, green hair, School uniform" may provide more flexibility.
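As a rough illustration of the advice above, here is a minimal sketch that assembles a prompt with general tags first, character tags last, and a reduced weight on a dominant tag. It assumes the `(tag:weight)` emphasis syntax used by ComfyUI and similar UIs; the helper name `build_prompt` and the placeholder character tag are hypothetical, and the weight of 0.7 is only an example value.

```python
def build_prompt(general_tags, character_tags=(), weights=None):
    """Join tags into one prompt: general tags first, character tags last.

    `weights` maps a tag to a multiplier. Tags with weight 1.0 (or no entry)
    are emitted unchanged; others use the "(tag:weight)" emphasis syntax.
    """
    weights = weights or {}

    def fmt(tag):
        w = weights.get(tag, 1.0)
        return tag if w == 1.0 else f"({tag}:{w:g})"

    return ", ".join(fmt(t) for t in list(general_tags) + list(character_tags))


prompt = build_prompt(
    ["1girl", "green hair", "school uniform"],
    character_tags=["example character"],  # placeholder, not a real tag
    weights={"example character": 0.7},    # lower the dominant tag's weight
)
print(prompt)  # 1girl, green hair, school uniform, (example character:0.7)
```

Raising the weight of the other tags instead works the same way: pass, for example, `weights={"green hair": 1.2}`.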

●Negative prompts are not trained. Please try various prompts!

As described in the dataset contents on the page below, if you don't like realistic textures, you might want to include terms like "realistic, figure".

Adding 'anime screencap' to the negative prompt helps reduce flatness.

I don't like restrictions and prioritize diversity, so I keep the negative prompts to a minimum.

Lately, I've been favoring a workflow where I disable negative prompts in the early steps and only apply them starting from the later steps. This approach results in fewer compositional issues in the early stages, and since I can freely adjust the style in the later stages, the overall quality is improved.

However, my way of thinking is unconventional. You don't have to follow it! You might get better results with many negative prompts, so give it a try!
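The delayed negative prompt idea can be sketched as a per-step on/off schedule. This is only an illustration: the function name is hypothetical, and the starting point of 30% of the steps is an assumed value, not a setting taken from the shared workflow. In practice the split would be wired up inside the UI (for example by limiting the step or timestep range in which the negative conditioning is active).

```python
def negative_prompt_schedule(total_steps, start_fraction=0.3):
    """Return one bool per sampling step: True where the negative prompt applies.

    start_fraction is the share of early steps that run WITHOUT the negative
    prompt (0.3 is an assumed example value).
    """
    if not 0.0 <= start_fraction <= 1.0:
        raise ValueError("start_fraction must be between 0 and 1")
    first_active = round(total_steps * start_fraction)
    return [step >= first_active for step in range(total_steps)]


schedule = negative_prompt_schedule(total_steps=30, start_fraction=0.3)
print("negative prompt starts at step", schedule.index(True))
print("steps with negative prompt:", sum(schedule), "of", len(schedule))
```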

I feel that with fewer steps, the composition doesn't turn out as well.

●It might be better to have at least 20 steps. Recently, I've been sticking to 50 steps.

For previews, I stop around 15-25 steps to check the progress.

Once I find a good seed, I refine it with 50 or 100 steps, adjusting the CFG as needed.

Since there is little change in the later steps, I can predict the outcome. This way, I balance both efficiency and quality.

However, with a higher number of steps, breakdowns may decrease, but it might end up overcooked. A setting like 30 steps might provide a better balance in terms of contrast.

By the way, I haven't trained with tags for work titles, but sometimes character tags include the work title. This tendency is especially strong with mobile games. When I randomly added a work title, there was a change in the style, so it’s possible that it may have some effect.

Uni-pc may be faster as it achieves good results in about 20 steps. If i2i is the basis, I think it's also a good idea to finish in half the steps using methods like splitsigmas and then perform i2i.

If you find it troublesome to come up with prompts that produce stable quality, using prompts like the ones below might help stabilize the output. Ironically, tags like these end up becoming quality tags, lol.

"nikke, azur lane, blue archive, kancolle, virtual youtuber, arknights, girls' frontline"

●I'll also share the natural language prompt I use for quality improvement. Try adding it to the end of your prompt. It's already included in my workflow. I think adding the game title tags on the last line would be a good idea.

■Consistently high quality

A highly detailed character with smooth, glowing skin and vibrant, natural colors, A dynamic, expressive pose with natural proportions and accurate composition. Soft, balanced lighting enhances depth and warmth, while surrounding light subtly interacts with the character, blending tones and creating a harmonious connection with the environment. Rich facial expressions convey emotion and presence, and soft highlights accentuate the character’s curves and details, adding depth and a natural, luminous glow.

■Dynamic composition.

A highly detailed anime-style character with smooth, radiant skin and vibrant, balanced colors, depicted in a dynamic and expressive pose with flawless anatomy and natural proportions. The composition is visually compelling, with intricate textures and exquisite detailing in the character's design. Soft, nuanced lighting enhances depth and warmth, interacting harmoniously with the surroundings to create a cohesive, immersive atmosphere. The background is richly detailed and dynamic, filled with captivating elements that complement the scene without overwhelming the character. Subtle highlights and shadows accentuate the character's curves, clothing, and features, adding realism and a luminous glow. The overall image captures a perfect balance between artistic stylization and a convincingly grounded presence.

●This massive, chaotic negative prompt might actually be effective, though I just copied it from other models without any guarantees. Still, it seems to have some effect.

If you feel that the composition or anatomy looks strange, try removing the negative prompt. I've noticed several times that it can have a negative impact.

■amputated, bad anatomy, bad proportions, blurry, dated, deformed, extra limbs, fused fingers, low quality, malformed limbs, missing limbs, mutated, ugly, overexposed, underexposed, flat colors, low detail,

■512px model.

The standard size for this model is 512px.

An aspect ratio like 512x768, as with SD1.5, is suitable.

768px and 1024px are not trained, so the results will be disastrous.

The base model is very high quality even at 512px!

Usually, models in the middle of pre-training or lite versions lack sufficient learning or aesthetic appeal, but this model is different. It is the most aesthetically pleasing I have seen so far.

Due to its low requirements for training and inference specs and its fast speed, I feel that it has the potential to become the successor to SD1.5 that I've been looking for. I love this model.

Honestly, for creating images focused on 2D characters, there’s little difference between 512px and 1024px. Unless it’s a concept that clearly requires high resolution, 512px should be sufficient.

■ 1024px model.

If you don’t want to waste time, it might be a good idea to use the 512px model first to practice which prompts are effective.

Merging might also be interesting.

Merging with a realistic model can sometimes improve anatomy.

An example of an interesting merging experiment:

Simply merge the 1024px and 512px models at a 0.5 ratio. This allows you to generate at a 768px scale. Try resolutions like 768x768, 576x960, or even 640x1024. 768x1024 may sometimes break down, but it can succeed occasionally.

If the preview shows no block noise or line noise, then it’s fine. If these appear and strange artifacts start to show in the generated image, that’s the resolution limit.

This approach balances speed and detail, but I’m not entirely confident the merge is stable—it may have some issues. Still, it’s worth trying for an interesting experiment.
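For readers who want to see what a 0.5-ratio merge means numerically, here is a minimal sketch of a weighted merge. It assumes both checkpoints expose the same keys and shapes; plain Python lists stand in for the real weight tensors, and the key name is hypothetical. In practice the merge would be done with the ComfyUI merge nodes or the merge tool mentioned earlier rather than with this code.

```python
def weighted_merge(model_a, model_b, ratio=0.5):
    """Blend two checkpoints key by key: (1 - ratio) * A + ratio * B."""
    if model_a.keys() != model_b.keys():
        raise ValueError("both checkpoints must contain the same keys")
    merged = {}
    for key in model_a:
        a, b = model_a[key], model_b[key]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for {key!r}")
        merged[key] = [(1.0 - ratio) * x + ratio * y for x, y in zip(a, b)]
    return merged


# Toy stand-ins for real checkpoints (hypothetical key name and values).
model_1024 = {"blocks.0.weight": [0.0, 2.0]}
model_512 = {"blocks.0.weight": [1.0, 4.0]}
merged = weighted_merge(model_1024, model_512, ratio=0.5)
print(merged)  # {'blocks.0.weight': [0.5, 3.0]}
```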

※By the way, I don't think the older versions are inferior.

As the training progresses, the model learns more concepts but gradually deviates from PixArt's aesthetics.

Therefore, earlier versions might have a better balance in some cases.

It's a matter of personal preference, so I think you should use the version you like best.

Personally, there are sample images from older versions that I really like. I'm not confident I could replicate them with the latest version, lol.

■I am training with the danbooru tag.

I train only general tags such as 1girl; artist and anime work tags are not trained.

A small number of tags will produce a disastrous result.

Popular tags tend to be of higher quality.

Examples: looking at viewer, upper body, shiny skin, anime screencap, etc.

If the effect is too strong, it might be a good idea to lower the weight.

It would be interesting to generate various tags using something that can automatically generate tags.

This is an experiment to see how much the tags can learn.

My training quality is poor, but it's learning better than expected.

In some cases, it may be able to express things that are difficult to do with other models.

It seems possible to add some new concepts even without fine-tuning the T5.

The base model is not excessively censored; like Cascade, it can handle high-exposure outfits without issues and sometimes even generate nudity.

It's interesting because it feels different from other models.

Due to the small size of the dataset, we are not yet able to recognize all tags.

It seems that natural language still works as well. There might be an interesting aspect that is different from the base model.

It's quite fun. I give themes to ChatGPT to create natural language prompts.

■There are cases where the look of something realistic or AI comes out strongly.

It might be a good idea to add "realistic" to the negative prompt.

On the other hand, it might be fun to try something other than anime.

New discoveries are made in areas that were not originally intended.

It's okay not to expect perfection too much.

This model is still immature. The broken results are more interesting!

■There is no consistency in style. The quality is poor and there are no fixed settings or prompts.

●It has no advantage over existing models and has a narrower dataset.

●It's an incomplete and very difficult model, but if you're interested, please give it a try.

●If the human body breaks down, it's not due to censorship but rather because my fine-tuning is poor, so please bear with me! lol

I will continue to refine it to make it better in the future!

●Merging is no problem. If you have any interesting results, please share!

I think the 512px model can be merged into the 1024px model using differential merging. If the proportion is too large, it might break down, but it could be useful for enhancing concepts and styles.

■Dataset Notes:

●"realistic, figure, anime screencap"

These are the only three tags that I intentionally trained for style, and using them will enforce a particular style.

"anime screencap" will result in a TV anime style.

●Putting "realistic, figure" in the negative prompts will enforce an anime style.

However, other 2D styles lack consistency and the style will change based on the keywords...

●From what I understand, sexual content tends to adopt a visual novel game style, and natural language tends to lean towards AI or 2.5D.

Tags like "looking at viewer, upper body, shiny skin" are tagged in many images, so the quality might be higher. I feel they tend to be closer to the AI image style.

"blush" is also widely used and tends to be the flat style of visual novel games and Japanese 2D artists.

●The contents of my dataset include visual novel games, real people, figures, 2.5D, anime screencaps, and AI images.

Because I trained on such a wide range, styles are linked to tags, which might make control a bit difficult...

●If there are no background tags, the image may end up with a white background.

This happens because elements outside the given prompt are less likely to bleed into the image.

With a short prompt, the result may be vague and blurry. Try adding keywords that describe the type of image you want to generate.

●It's best to include tags for the type of scenery you have in mind, like the examples below.

Additionally, based on those tags, consider what elements should be present in the background and add them accordingly—such as plants in a room or cars in a city.

If the background becomes the main focus and the character appears small, using tags like "solo focus" can help emphasize the character as the main subject. The "landscape" tag tends to make the background the main focus. If the character is the main subject, it might be better not to use it.

"outdoors, scenery, landscape, indoors, bedroom, building, car, crowd, forest, beach, city, street, day, night, from above, from below"

■For reference, I will also share my simple ComfyUI workflow and OneTrainer training settings data.

If you want to use ComfyUI for inference, you need to install the "ExtraModels" plugin. I will also share the URLs of the "vae" and "T5" that I use.

I don't know if it can be used with other WebUI.

Other people have shared their workflows, so it might be a good idea to refer to them.

■ExtraModels

https://github.com/city96/ComfyUI_ExtraModels?tab=readme-ov-file#installation

■vae

https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/diffusion_pytorch_model.safetensors

■T5

https://huggingface.co/theunlikely/t5-v1_1-xxl-fp16/tree/main

It's the same as the T5 on sd3, so you can probably use the 8bit T5 on sd3 as well. That should load faster.

■Base model: Please download it when you want to try other resolutions.

https://huggingface.co/PixArt-alpha/PixArt-Sigma/tree/main

■The 1024px diffusers model is required during training. Please specify this as the base model and train.

https://huggingface.co/PixArt-alpha/PixArt-Sigma-XL-2-1024-MS

■ 512px Model.

https://huggingface.co/PixArt-alpha/PixArt-Sigma-XL-2-512-MS

Compared to the 1024px model, it has lower hardware requirements and training speed is about 4 times faster, making it accessible for more people to train. Apart from the transformer, it uses the same data as the 1024px model, so please transfer the data from the URL above.

■If you have room in your GPU, loading T5 on the GPU will make inference faster and less stressful.

By converting T5 to 4-bit, inference is possible even with lower specifications.

A 12GB GPU should be fine. If you convert it to 4-bit, you might be able to load it on an 8GB GPU... If that doesn't work, don't worry: you can load it into your system RAM!

If an error occurs even after installing ExtraModels with ComfyUI Manager, follow the instructions at the ExtraModels URL, activate the venv, and reinstall the requirements.

When I tried to convert T5 to 4-bit, an error occurred with bitsandbytes, but reinstalling the requirements solved the problem.

I don't know much about it either, so it may be difficult for me to provide support for installation...

■I'm new to civitai, so if you have any opinions, I'd appreciate it if you could let me know.

I'm not good at training, but I would be happy if I could share the potential of pixart with as many people as possible.

PixArt-Sigma has potential.

My dream is to see more Pixart models. I'd love to see the models you've trained as well!

The training requirements are low, 12GB is fine!

The total number of downloads has exceeded 1000. Thank you for your interest in my immature model! Thank you very much for your many likes. m(＿＿)m

Thank you for the buzz as well!

This fine-tuning itself isn't particularly exceptional, but I hope the information about my training can help someone interested in Pixart!

■Below I will list the GPU and training time I used for my training. Please use it as a reference for your training!

If you want to know the exact settings, please download the onetrainer data.

GPU: RTX 4060 Ti 16GB

■512px

Batch size: 48

70,000 / 48 ≈ 1,458 steps (roughly 1,500)

1 epoch: 5 hours

15 epochs: 75 hours

GPU usage: 13GB

With this batch size and epoch time, I think the speed isn't much different from SD1.5. It's fast.

I feel the 512px model is like a successor to SD1.5.

■1024px (testing)

Batch size: 12

70,000 / 12 ≈ 5,833 steps

1 epoch: 30 hours

5 epochs: 150 hours

GPU usage: 15GB

The reason it doesn't take exactly four times longer is due to the difference in batch size.
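The figures above can be rechecked with a few lines of arithmetic. This sketch uses only the numbers listed in this section (70,000 images, batch sizes 48 and 12, epoch times of 5 and 30 hours). The helper name is hypothetical, and it ignores details such as aspect-ratio buckets or dropped partial batches, so the step counts are approximate (about 1,458 for 512px, which the listing rounds to roughly 1,500).

```python
def epoch_stats(num_images, batch_size, epoch_hours):
    """Return (approximate steps per epoch, seconds spent per image)."""
    steps = num_images / batch_size
    seconds_per_image = epoch_hours * 3600 / num_images
    return steps, seconds_per_image


steps_512, sec_512 = epoch_stats(70_000, 48, 5)
steps_1024, sec_1024 = epoch_stats(70_000, 12, 30)

print(f"512px:  {steps_512:.0f} steps/epoch, {sec_512:.2f} s/image")
print(f"1024px: {steps_1024:.0f} steps/epoch, {sec_1024:.2f} s/image")
print(f"per-image slowdown at 1024px: {sec_1024 / sec_512:.1f}x")
```

Comparing seconds per image rather than seconds per step removes the batch size from the comparison, which makes the two resolutions easier to compare directly.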

In my environment, I felt it was impossible to train a 1024px SDXL model, so I haven't tried it and don't know if it's fast or slow. But I think the batch size is good!

■Full fine-tuning: With 12GB, 1024px training is not a problem.

I have 16GB, so my batch size is slightly larger.

If you lower the batch size, the VRAM usage decreases significantly.

With a batch size of 1 or 2, it might be fine even with 8GB.

I use CAME as the optimizer, which slightly increases GPU usage. I liked it because the quality was good.

With Adafactor or AdamW8bit, VRAM usage is significantly reduced.

Since the text encoder is T5 and very large, it might be difficult for now because training requires a lot of VRAM...

With the advent of SD3, this discussion will progress and training methods will be established. Until then, a large amount of VRAM might be necessary...

If you want guidelines for full fine-tuning settings, you can use these as a reference.

However, it may sometimes lead to overfitting or be challenging due to your PC specifications.

While referring to these, try to find settings that work best for you.

I was able to achieve the same settings by switching to BF16 training to reduce GPU usage, so that's what I use.

https://github.com/PixArt-alpha/PixArt-sigma/blob/master/configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py

https://github.com/PixArt-alpha/PixArt-sigma/blob/master/configs/pixart_sigma_config/PixArt_sigma_xl2_img1024_internalms.py

Note!

■When training with OneTrainer, the number of tokens may be limited to 120.

For tag training, the impact should be minimal since tag shuffling is performed.

Honestly, I have never had any issues with 120 tokens for tags.

However, for natural language, the length of the caption is important, so unintended truncation might occur.

■Relevant part: "max_token_length=120" This value is the token limit.

https://github.com/Nerogar/OneTrainer/blob/23006f0c2543e52a9376b0557e7a78016d489acc/modules/dataLoader/PixArtAlphaBaseDataLoader.py#L244
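To illustrate why tag shuffling softens the effect of a token limit, here is a small sketch: when tags are shuffled before truncation, different tags survive the cut on each pass, so no tag is permanently lost. This is not OneTrainer's actual code. The token counter is a crude stand-in (one token per word plus one for the comma) rather than the real T5 tokenizer, and whole tags are dropped instead of cutting mid-tag.

```python
import random


def rough_token_count(tag):
    # Crude stand-in for a real tokenizer: one token per word plus the comma.
    return len(tag.split()) + 1


def shuffle_and_truncate(caption, max_tokens=120, seed=None):
    """Shuffle comma-separated tags, then keep tags until the budget is used up."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    random.Random(seed).shuffle(tags)
    kept, used = [], 0
    for tag in tags:
        cost = rough_token_count(tag)
        if used + cost > max_tokens:
            break
        kept.append(tag)
        used += cost
    return ", ".join(kept)


caption = "1girl, green hair, school uniform, looking at viewer, upper body, outdoors"
# A tiny budget makes the truncation visible; each seed keeps different tags.
for seed in range(3):
    print(shuffle_and_truncate(caption, max_tokens=8, seed=seed))
```

A long natural language caption has no such protection, since its sentences cannot be reordered freely, which matches the caution about unintended truncation above.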

■In the case of xformers, errors occurred beyond 256 tokens. With sdp, there were no issues up to 300 tokens, but at 512 tokens, the generated images broke down.

It seems that more tokens do not necessarily mean better results.

Due to the increase in cache size, if the cost-effectiveness is not promising, 120 tokens might be sufficient.

There is no guarantee of quality improvement, but it might be worth investigating.

Since there is no certainty, please let me know if there are any mistakes!

If you have any questions, please feel free to ask!

Questions in Japanese are also welcome, so feel free to reach out!

Description

Pruned Model fp16 (1.15 GB): Inference checkpoint.
Training Data (2.69 GB): Fine-tuning diffusers model + OneTrainer config data + 10-epoch inference checkpoint.
Config (12.48 KB): ComfyUI workflow. Due to upload requirements, the extension has been changed to .yml. Please change it to .json before use.
Pruned Model fp32 (2.29 GB): Full model. Unless you have a special reason, there is no need to use it, as it is heavy. There is no difference in quality.
Pruned Model bf16 (1.61 GB): Experimental merge model v04. I have updated to my v0.2 model, and it feels better than before. I also changed the way I merge, so please be aware that the realistic style is strong.
This model is very experimental, so it is a secret between you and me who found it... The extension has been changed to .bin for upload purposes. Please change it to .safetensors for use.
I feel like the merge model creates a calmer atmosphere.
It is fine to upload images generated with this merged model!
However, if you merge this experimental merge model with other models, please do not upload the resulting merged model.
This can lead to the dangerous practice of model inbreeding! Let's enjoy it within the scope of personal use. This request is only for the experimental merge model.
It is okay to use my anime fine-tuning model as merge material to create merged models. You can upload them as well!

■1024px_v0.1 has been updated.

Tag training has become comparable to the 512px model.

Anatomy has improved compared to before. There are still many flaws, though...

However, I recently realized that instead of worrying about and fixing flaws, I find it more enjoyable to come up with prompts that create images with a good atmosphere. It's enjoyable to provide ChatGPT with a theme and have it generate natural language prompts!

Generating with natural language is fun, but the accuracy might have decreased due to the progress in tag training. Therefore, it might be interesting to merge it with the previous model or other models fine-tuned with natural language.

I am also sharing the training settings and test ComfyUI workflow.

It uses the "Uncond-Zero-for-ComfyUI" extension node, so please install it if you encounter any errors.

Please use the latest version as much as possible since some nodes used do not exist in the older versions of ComfyUI.

This is not the definitive solution, so please try various workflows!

For those who don’t want to waste time, I think it’s a good idea to practice with the 512px model first to see which tags work before using the 1024px model.

This model has been trained for 12 epochs, but I also like the 10-epoch model because it is stable, so I am sharing it as well. The tag training is worse, but it might be better for natural language generation.

Merging might also be interesting.

An example of an interesting merging experiment: Perform a differential merge of my 512px model with a 512px base model to extract only the fine-tuning elements.

Then, add about 0.1-0.25 of the extracted elements to a 1024px model. There are more failures, but it's fun because it emphasizes the style.
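The differential merge described above can be written as `target + alpha * (finetuned - base)`. The sketch below shows that formula on toy data. It assumes all three checkpoints share the same keys and shapes; plain Python lists stand in for weight tensors, the key name is hypothetical, and `alpha=0.2` is just one value from the 0.1-0.25 range mentioned above.

```python
def add_difference(target, finetuned, base, alpha=0.2):
    """Return target + alpha * (finetuned - base) for every shared key."""
    if not (target.keys() == finetuned.keys() == base.keys()):
        raise ValueError("all three checkpoints must contain the same keys")
    return {
        key: [t + alpha * (f - b)
              for t, f, b in zip(target[key], finetuned[key], base[key])]
        for key in target
    }


# Toy stand-ins: the 1024px model receives the 512px fine-tuning delta.
model_1024 = {"blocks.0.weight": [1.0, 1.0]}
tuned_512 = {"blocks.0.weight": [0.5, 2.0]}
base_512 = {"blocks.0.weight": [0.0, 1.0]}

result = add_difference(model_1024, tuned_512, base_512, alpha=0.2)
print(result)
```

With `alpha=0` the target model is returned unchanged, which is a quick sanity check before trying larger values.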

Comments (18)

User37 · Jul 15, 2024 · 2 reactions

Yo, I (and I speak for many others,) am tremendously appreciative towards your work. Many models that haven't gained much traction, such as Cascade, have no anime models. That said, you were the legend that trained an anime model on this base model. You a real one. Carrying us weebs.

hjhf (Author) · Jul 15, 2024 · 2 reactions

Thank you!

My wish is for many people to know that there are potential base models available, and for a diverse community to be formed where many people can choose a model that suits them and participate in inference and fine-tuning or merge.

Therefore, your words really make me happy!

I love pixart-sigma, cascade, and playground_v2, but they are not very popular. I feel sad that they might be forgotten as outdated architectures, even though all of these models are actually wonderful.

It is sad to see a monopoly where one model is considered superior and the rest are deemed worthless, as it diminishes many possibilities... They all have their pros and cons, but they're all great models.

That’s why I am fine-tuning these models, hoping someone will notice their potential and want to try fine-tuning or inference with them.

I will continue training without giving up on any model.

I am not very good at fine-tuning, but if someone becomes interested in these models, that would make me the happiest!

lolkemo · Jul 16, 2024 · 1 reaction

Hello, would it be all right if I fine-tuned using your model as a base?

hjhf (Author) · Jul 16, 2024

Yes, no problem! You are welcome to upload it as well!

If there is any data you are missing or information you want for fine-tuning, I will share as much as I can!

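lolkemo · Jul 17, 2024 · 1 reaction

@hjhf Thank you! In that case, may I ask about things like the image resolution used during training? So far I have collected all my material at 1024x1024, so I was curious which resolution you trained at.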

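hjhf (Author) · Jul 17, 2024 · 1 reaction

@lolkemo I train at the same resolution as the model!

So 1024px for the 1024px model, and so on.

Training is not done on squares; I use aspect-ratio bucketing to preserve the original image's ratio as much as possible.

So in practice, taking resizing and cropping into account, I think the sizes end up being something like 1152x896.

The ratios vary a lot, so there are probably images that never make it into a batch and some bias in the ratios, but I try not to worry about it too much...

My dataset contains a lot of game CG, so I think most of it is in landscape ratios such as 16:9 and 4:3.

The images in the 70,000-image dataset are themselves larger than the training resolution: the short side is 1024px or more.

I did not pre-select only images of that size or larger; smaller images were upscaled to 1024px or more when building the dataset.

Training with the original aspect ratio and a dataset short side of 1024px or more is not for quality reasons or anything like that!

It is because I train on a general-purpose dataset that can be reused for training other models, by keeping it high resolution and avoiding destructive edits to the original images. Images are resized automatically during training, and building a dedicated dataset is a lot of work, so I use the same dataset as-is for 512px training.

So training on squares is fine too!

If you have any other questions, feel free to ask!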

lolkemo · Jul 19, 2024 · 1 reaction

@hjhf Thank you. I ran 10,000 steps for now, but the training is nowhere near enough...

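hjhf (Author) · Jul 19, 2024

@lolkemo I am struggling too...

This is subjective, but with Sigma the style can be reproduced at an early stage, while other concepts and poses seem difficult without some ingenuity.

I am currently at about 80,000 steps with CAME at 2e-5. The style was already quite ideal at 10,000-20,000 steps and has been reinforced since, which I am happy with, but I am not confident that concepts or anatomy are improving...

Compared with the samples from 10,000 steps earlier, the training samples change only by about "maybe it learned a few more concepts".

I get the feeling that larger batch sizes make learning concepts and the like go better.

Things that were learned quickly at 512px were not being learned at 1024px, and setting the batch size to the one I used for 512px improved it somewhat.

Sigma's learning speed and capability feel close to SDXL. It has learned simple concepts fairly well, but pose concepts like wariza and interactions between characters are complex, and it has not learned them yet.

For concepts, I am tempted to consider training the text encoder, but looking at Hunyuan-DiT's results, I would like to reach that level by training the transformer alone...

Also, other people have managed to train poses quite well, so I want to believe this is not the limit yet.

In my case, some of the concepts it has not learned are simply rare in the dataset, so I am optimistic that expanding the dataset with a wider range of concepts and adding realistic images, men, and so on for balance will improve things...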

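lolkemo · Jul 19, 2024 · 1 reaction

@hjhf Hunyuan-DiT is tempting because its hands and feet are already perfect in the base model...

However, training the model takes nearly three times as long as SDXL, so personally I am not that keen on it.

The one I am currently training is the 0.9B model, but the weak hands and feet remain unchanged.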

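hjhf (Author) · Jul 19, 2024

@lolkemo PixArt-Sigma's hands-and-feet problem runs deep... Honestly, I think it would be perfect if this were solved.

It has low spec requirements and is easy to train, and with 300 tokens, natural language captions are easy to use.

I feel the hands-and-feet problem is shared to a greater or lesser degree by recent models trained on synthetic data, but I would like to fix it if I can...

Hunyuan-DiT presumably has a large-scale dataset, and it is almost like being handed a NovelAI model, so for anime fine-tuning I think it is ideal and the only choice, but even SDXL is tough for me to train...

I basically train models that have a 512px version. Having a low-resolution model makes for good practice, so I would be glad if Hunyuan-DiT offered that option too. I wish SD3 would release a 0.8B model at 512px... It would become the successor to SD1.5...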

lolkemo · Jul 19, 2024

@hjhf Hunyuan-DiT plans to release a 0.7B model soon, so we will see how that turns out. (If the hands and feet are decent, I might jump ship to that one.)

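hjhf (Author) · Jul 19, 2024

@lolkemo Sounds good! If training becomes easier without a drop in quality, it will surely take the crown.

Having decent hands and feet and already recognizing tags such as wariza and hugging own legs is a strength other models lack.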

yamatazen · Jul 21, 2024 · 1 reaction

Should I use natural language or booru prompts?

hjhf (Author) · Jul 21, 2024 · 2 reactions

Either way is fine!

Recently, I've been enjoying generating natural language prompts, so I use them more often.

The training itself is done entirely with booru tags, so using just the tags is also fine.

My examples include both natural language and tags.

I use three different approaches:

1. If I want to generate a beautiful image with a sense of story, I use only natural language. Although the model is trained only with tags, it works surprisingly well.

2. For features like a person's hairstyle or body type, and concepts that the base model might not know, I use tags.

3. I sometimes combine 1 and 2. For poses, the base model may already know them, so it's fine to rely on natural language instead of booru tags. For example, you could instruct ChatGPT as follows:

Please generate a natural language prompt using these tags: "bare shoulders, small breasts, red eyes, short hair, white hair, dress". The background setting should have a random theme, and it should depict a girl with her hands together in prayer.

hjhf (Author) · Jul 21, 2024 · 1 reaction

Additionally, installing the "Uncond-Zero-for-ComfyUI" node might make the generation process more enjoyable, as it automatically adjusts even if the cfg is set high!

It's a matter of personal preference, so it's not mandatory. Feel free to try different things!

yamatazen · Jul 21, 2024

@hjhf Do I need to use low CFG scale?

hjhf (Author) · Jul 21, 2024 · 1 reaction

@yamatazen Yes, I believe there is some recommended cfg information for sigma somewhere. If I remember correctly, the recommendation is around 2-4.5.

I prefer to set it freely, so I use the custom node mentioned earlier for automatic adjustment.

The "Euler cfg_pp" and "Euler A cfg_pp" samplers are easy to use even with a low cfg. I also like schedulers such as "simple" and "GITS".

Step 40 is indeed good. Normally, I don't use such high steps, but the quality of the hands and feet has noticeably improved! Since I'm also searching for the optimal solution, I'm happy to see the generations from various people.

hjhf (Author) · Jul 21, 2024 · 2 reactions

I also tried generating images using your prompt. I hope you find it helpful!

I used the "Uncond-Zero-for-ComfyUI" node along with the "GITS" scheduler. For samplers, I used SDE and euler_A_cfg_pp.

For SDE, I used cfg4, and for euler_A_cfg_pp, I used cfg1.

It's not perfect, but both significantly reduced the distortions in the hands and feet with 50 steps!

Checkpoint

PixArt E

by hjhf

anime