What is this?
https://github.com/AbstractEyes/anima-trainer
A tool for using JSON with Anima. The model does not require JSON; JSON prompting gives added control, and the model is also capable of many new kinds of plain English prompting.
Trained with the same trainer Anima was originally trained with, diffusion-pipe, snapped together with a new dataset organization system so I could run it on either Runpod or in notebooks.
The trigger is NOT the exact token "JSON"; it is literal JSON in string form.
Prelim 1k
https://huggingface.co/datasets/AbstractPhil/diffusion-pretrain-set-ft1
These are 1k images, randomly sampled and subject-bucketed from the 80k image dataset "qwen_90k" that will be trained next.
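To give a rough idea of what sampling with subject buckets can look like, here is a minimal sketch. It is only one plausible reading, not the actual scheme (the subject-bucketing article linked further down describes that). The function name `sample_by_subject`, the `(image_id, subject)` input shape, and the round-robin draw are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_by_subject(items, n_total, seed=0):
    """Draw n_total image ids, spread as evenly as possible across subjects.

    items: iterable of (image_id, subject) pairs (hypothetical input shape).
    """
    rng = random.Random(seed)

    # Group image ids into one bucket per subject label.
    buckets = defaultdict(list)
    for image_id, subject in items:
        buckets[subject].append(image_id)

    # Shuffle inside each bucket so the draw within a subject is random.
    for ids in buckets.values():
        rng.shuffle(ids)

    picked = []
    # Round-robin over the buckets so rare subjects are not drowned out
    # by common ones; stop when enough ids are picked or all buckets are empty.
    while len(picked) < n_total and any(buckets.values()):
        for subject in sorted(buckets):
            if buckets[subject] and len(picked) < n_total:
                picked.append(buckets[subject].pop())
    return picked
```

With 100 images of one subject and 2 of another, a draw of 4 returns 2 of each, where plain random sampling would almost always return only the common subject.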
https://huggingface.co/AbstractPhil/Qwen3.5-0.8B-json-captioner
Each image was captioned twice: once by the VLM, using its ViT, with JSON as the output format, and once by a variant of the AnimeTIMM ViT, whose captions were then processed into JSON as well.
Training ran 12 epochs on the VLM JSON captions, then the same images went back in for 8 more epochs with the AnimeTIMM JSON. These are the results of subject-bucketing with JSON.
More specifically
https://huggingface.co/blog/AbstractPhil/subject-bucketing
This is a subject-bucket trained JSON finetune.
The targets are experimental: better accuracy and more fidelity in finetunes, while simultaneously training a proof-of-concept paradigm for subject-bucketing.
TLDR Subject Bucketing
Dataset balancing. Normally you end up with a series of problems from finetunes: breakpoints, kinks, distortions, faults, and so on.
This is meant as an experiment to solve those exact problems. By finetuning a model with JSON, you provide a form of differentiated perspective to the AI. By grouping subjects into a more complex paradigm, as described in the article, the differentiation becomes robust.
A little longer, still short.
Each token separator is another format of language that QWEN already understands and recognizes. The more you combine in sequence, the more QWEN will understand this process - providing more utilizable structure to the diffusion system.
The more robust and orderly the encodings provided to the diffusion system, with differentiated lesser-used tokens in conjunction with more common-use tokens, the more useful the training outcomes.
Why?
The smaller-scale non-bucketed variants were successful, so it's time to train the real thing. The tool itself, and the tool yields.
Now the first 1k image train for the direct tool has been successful. The results are yielding and powerful. This merits a full uptick in training.
Comments (14)
is this like bringing the json prompt capabilities from ideogram v4 to anima?
I haven't played with Ideogram V4 but I've been planning this one for a couple months. My dataset consists of over 700k fully prepared dual-prompt images with my shared QWEN 3.5 0.8b model as the catalyst for the entire system.
SDXL took to it like a bag of rocks, but Anima took it fairly cleanly.
What exactly does this LoRA do? Can I just use it to generate prompts in JSON format? What exactly does that look like?
It accepts plain English prompting as well as JSON prompting.
@AbstractPhila But if this LoRA is not for enhancing understanding of the JSON prompt structure, what is the idea behind it? What is it for?
@VKilko The model becomes more selective, with larger margins between the LLM inputs. The LLM itself isn't particularly smart, so sparser captions have trouble. This strengthens small chains of tokens by giving them scaffolding with JSON, and it also trains subject symbolism from the LLM into the diffusion mechanism. That allows the model to align to specifics in a different way; in this case JSON was the catalyst and plain English was the mechanism.
@VKilko https://huggingface.co/datasets/AbstractPhil/anima-90k-cache/tree/main/vlm This will give a good idea of what's in there.
Here is one with a viewer, same images.
https://huggingface.co/datasets/AbstractPhil/sdxl-qwen-phase0
@AbstractPhila What is the structure / format of the JSON?
I did some testing and ... I can't see any difference with or without this LoRA using Anima base.
Modern models, surprisingly, do understand JSON, some more and others less; e.g. Anima gives roughly 60/40 positive results, while Krea2 jumps to 90/10.
I used the Ideogram JSON description from KJ and am surprised how well it works for Krea2. It is not ideal (that is the "AI" shtick these days: "good enough, so we should all use it"), but it is much better than in Anima.
The most problematic part is the bbox coordinates, which Anima seems to ignore about half the time (50/50).
I haven't trained bounding box coordinates yet, so you need to use difference offsets for now, such as "to the left of" or "the upper right corner of the image".
The next structure I create will be substantially more powerful. I'm scaling up to full VIT classification capacity; text identification, rotation, offset, depth, scale, bounding boxes, and considerably more identified capacities all packed into JSON.
To that end I'm going to find the strongest VLM that can run on the RTX 6000 Pro's 95 gigs of VRAM, and with that, version 2 will be considerably more powerful.
Version 1 is currently cooking, and the subject semantics association preview shows that it will in fact yield - but my eyes are now open to something much much more powerful.
As the sample images don't show any JSON in their prompts, could you give us an example?
It's a bit barebones for now, but it'll get the model started for the next batch.
There's an actual Qwen model you can use to translate your plain English prompt directly into the JSON format that this model learned.
https://huggingface.co/AbstractPhil/anima-prelim-1k-r64/tree/main/comfy-qwen-json
The Qwen node works in ComfyUI, but I haven't packaged it up into its own repo yet. It requires transformers >5.4.
I suggest appending the plain English + booru tags after the JSON-formatted data, which provides the necessary solidity to the prompt.
What do you think about XML as an input structure, like NewbieAi has?
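A minimal sketch of that ordering, in case it helps: JSON first, then the plain English sentence, then the booru tags. The helper name `build_prompt`, the example keys, and the newline separator are assumptions for illustration only; the real JSON layout is whatever the translator model in the repo above produces.

```python
import json


def build_prompt(structured: dict, plain_english: str, booru_tags: list[str]) -> str:
    """Join JSON, plain English, and booru tags into one prompt string."""
    # Serialize the structured part as literal JSON in string form.
    json_part = json.dumps(structured, ensure_ascii=False)
    tag_part = ", ".join(booru_tags)
    # Order matters here: JSON first, then the sentence, then the tags.
    parts = [json_part, plain_english.strip(), tag_part]
    # Skip any part that is empty so there are no blank lines in the prompt.
    return "\n".join(p for p in parts if p)


# Hypothetical keys; not the schema the model was trained on.
prompt = build_prompt(
    {"subject": "1girl", "position": "to the left of the tree"},
    "A girl standing to the left of a tree, waving.",
    ["1girl", "tree", "waving"],
)
print(prompt)
```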
Example prompt:
<character_1>
<n>$character_1$</n>
<gender>1girl</gender>
<appearance>chibi, red_eyes, blue_hair, long_hair, hair_between_eyes, head_tilt, tareme, closed_mouth</appearance>
<clothing>school_uniform, serafuku, white_sailor_collar, white_shirt, short_sleeves, red_neckerchief, bow, blue_skirt, miniskirt, pleated_skirt, blue_hat, mini_hat, thighhighs, grey_thighhighs, black_shoes, mary_janes</clothing>
<expression>happy, smile</expression>
<action>standing, holding, holding_briefcase</action>
<position>center_left</position>
</character_1>
<character_2>
<n>$character_2$</n>
<gender>1girl</gender>
<appearance>chibi, red_eyes, pink_hair, long_hair, very_long_hair, multi-tied_hair, open_mouth</appearance>
<clothing>school_uniform, serafuku, white_sailor_collar, white_shirt, short_sleeves, red_neckerchief, bow, red_skirt, miniskirt, pleated_skirt, hair_bow, multiple_hair_bows, white_bow, ribbon_trim, ribbon-trimmed_bow, white_thighhighs, black_shoes, mary_janes, bow_legwear, bare_arms</clothing>
<expression>happy, smile</expression>
<action>standing, holding, holding_briefcase, waving</action>
<position>center_right</position>
</character_2>
<general_tags>
<count>2girls, multiple_girls</count>
<style>anime_style, digital_art</style>
<background>white_background, simple_background</background>
<atmosphere>cheerful</atmosphere>
<quality>high_resolution, detailed</quality>
<objects>briefcase</objects>
<other>alternate_costume</other>
</general_tags>
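For comparison, a prompt in this XML layout can be converted into JSON mechanically. The sketch below assumes the fragment is wrapped in a single root element first, since the sample has several top-level elements and is not a well-formed XML document on its own. The function name `xml_prompt_to_json` and the `prompt` wrapper tag are made up for illustration, and the resulting JSON is not the schema this model was trained on.

```python
import json
import xml.etree.ElementTree as ET


def xml_prompt_to_json(fragment: str) -> str:
    """Convert a two-level XML prompt fragment into a JSON string."""
    # Wrap in a hypothetical root tag so the fragment parses as one document.
    root = ET.fromstring(f"<prompt>{fragment}</prompt>")
    data = {}
    for section in root:  # e.g. character_1, general_tags
        data[section.tag] = {
            # Split each comma-separated tag list into a clean list of strings.
            field.tag: [t.strip() for t in (field.text or "").split(",") if t.strip()]
            for field in section
        }
    return json.dumps(data, indent=2)
```

Each section becomes a JSON object and each field becomes a list of tags, so `<action>standing, holding</action>` turns into `"action": ["standing", "holding"]`.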