Qwen 3.5 4B Text Encoder for Anima 2B
NEW → Now supported on Forge Neo (sd-webui-forge-neo) as a native extension! See the Forge Neo install instructions below.
Installation
ComfyUI
Clone the repo into your ComfyUI custom_nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/GumGum10/comfyui-qwen35-anima.git
Then restart ComfyUI.
Forge Neo
Clone or copy the extension into your Forge Neo extensions folder:
cd sd-webui-forge-neo/extensions
git clone https://github.com/GumGum10/sd-forge-qwen-35-encoder.git
Then restart Forge Neo. Dependencies (transformers, safetensors) install automatically on first launch.
What Is This?
A drop-in upgrade for Anima 2B's text encoder. The stock Anima ships with a tiny 0.6B parameter text encoder — it works, but it struggles with complex prompts. This replaces it with a 4B parameter encoder that understands your prompts significantly better.
The trade-off: the larger encoder needs alignment work to "speak the same language" as the diffusion model. We've done that work and ship the alignment files with this release. You just need to place files in the right folders and toggle a couple of settings.
What You Get
Pros:
Much better understanding of complex/long prompts (7× more parameters dedicated to reading your text)
Better handling of detailed scene descriptions, multiple subjects, and nuanced instructions
Alignment controls let you blend between raw 4B output and 0.6B-compatible output
Cons:
Uses more VRAM than the stock 0.6B encoder (~4GB vs ~0.6GB for the text encoder portion)
Slightly slower encoding (more parameters to run)
Alignment is an approximation — the diffusion model was trained against the 0.6B, so we're rotating the 4B's output to match. It's very good (0.96 cosine similarity) but not identical (a sketch of how that number can be measured follows this list)
This is a reverse-engineered implementation — the original author's private code may differ in subtle ways
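For the curious, the 0.96 figure is a mean cosine similarity between rotated 4B embeddings and stock 0.6B embeddings over a set of prompts. Here is a minimal sketch of how such a number can be measured — the tensor shapes and the idea of a single rotation matrix mapping one space to the other are assumptions for illustration, not the repo's actual evaluation code:

```python
import torch
import torch.nn.functional as F

def mean_alignment_similarity(emb_4b, emb_06b, rotation):
    """Mean cosine similarity between rotated 4B and stock 0.6B embeddings.

    emb_4b:   (num_tokens, d_4b) embeddings for some prompts   (assumed shapes)
    emb_06b:  (num_tokens, d_06b) 0.6B embeddings for the same prompts
    rotation: (d_4b, d_06b) alignment rotation matrix
    """
    aligned = emb_4b @ rotation  # rotate the 4B output into the 0.6B's space
    return F.cosine_similarity(aligned, emb_06b, dim=-1).mean().item()
```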
File Placement
All files are available at: lylogummy/anima2b-qwen-3.5-4b
ComfyUI
You'll download 4 files:
ComfyUI/
├── models/
│ └── text_encoders/
│ └── qwen35_4b.safetensors ← THE TEXT ENCODER WEIGHTS
│
└── custom_nodes/
└── comfyui-qwen35-anima/ ← THIS CUSTOM NODE FOLDER
├── __init__.py ← (comes with the node)
├── calibration_params.safetensors ← MAGNITUDE CALIBRATION
├── rotation_matrix.safetensors ← ALIGNMENT ROTATION
└── qwen35_tokenizer/ ← TOKENIZER FILES
├── tokenizer.json
├── vocab.json
└── merges.txt
Forge Neo
You only need to download 1 file — the calibration params, rotation matrix, and tokenizer are already bundled with the extension:
sd-webui-forge-neo/
├── models/
│ └── text_encoder/
│ ├── qwen_3_06b_base.safetensors ← STOCK 0.6B (you already have this)
│ └── qwen35_4b.safetensors ← DOWNLOAD THIS
│
└── extensions/
└── sd_forge_qwen35_encoder/ ← THIS EXTENSION
├── scripts/ ← (comes with extension)
├── lib_qwen35/ ← (comes with extension)
├── calibration_params.safetensors ← (bundled)
├── rotation_matrix.safetensors ← (bundled)
└── qwen35_tokenizer/ ← (bundled)
Forge Neo note: Keep qwen_3_06b_base.safetensors selected in the top VAE/Text Encoder dropdown — its LLM adapter is still required. Do not put qwen35_4b.safetensors in that top dropdown.
Where to download each file:
qwen35_4b.safetensors (both ComfyUI and Forge Neo)
→ Download from: text_encoders/
→ Place in: ComfyUI/models/text_encoders/ or sd-webui-forge-neo/models/text_encoder/
→ What it does: the actual 4B text encoder model weights

calibration_params.safetensors + rotation_matrix.safetensors (ComfyUI only — bundled in Forge Neo)
→ Download from: calibration/
→ Place in: ComfyUI/custom_nodes/comfyui-qwen35-anima/
→ What they do: calibration scales the 4B output to match the 0.6B's magnitude per dimension; the rotation matrix rotates the 4B's concept directions to match what the adapter expects. A sketch of how the two files are applied follows this list.

qwen35_tokenizer/ folder (ComfyUI only — bundled in Forge Neo)
→ Download from: tokenizer/
→ Place in: ComfyUI/custom_nodes/comfyui-qwen35-anima/qwen35_tokenizer/
→ What it does: the correct tokenizer (vocab=248K, NOT the default Qwen3 tokenizer)
→ Note: this will auto-download from HuggingFace on first use if you don't place it manually.
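As promised above, a minimal sketch of what the two alignment files do when applied to the encoder's output. The tensor key names ("rotation", "scale") and shapes are assumptions; the actual node/extension may store and apply them differently:

```python
import torch
from safetensors.torch import load_file

rotation = load_file("rotation_matrix.safetensors")["rotation"]  # assumed key, (d_4b, d_06b)
scale = load_file("calibration_params.safetensors")["scale"]     # assumed key, (d_06b,)

def align(emb_4b: torch.Tensor, use_calibration: bool = False) -> torch.Tensor:
    out = emb_4b @ rotation   # rotate 4B concept directions into the space the adapter expects
    if use_calibration:
        out = out * scale     # per-dimension magnitude match to the 0.6B
    return out
```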
How to Use
ComfyUI
Add the "Load Qwen3.5 CLIP (Anima)" node (found under
loaders → Anima)Select
qwen35_4b.safetensorsfrom the dropdownConnect the CLIP output to a CLIPTextEncode node
Use with your Anima 2B checkpoint as normal
Forge Neo
Load an Anima 2B checkpoint
Make sure qwen_3_06b_base.safetensors is in the top VAE/Text Encoder dropdown
In the generation tab, expand "Qwen3.5 Text Encoder (Anima)" and enable it
Select qwen35_4b.safetensors in the extension's Model File dropdown
Generate as normal — the extension intercepts text encoding automatically
Recommended Settings to Start (both)
use_alignment: ON
alignment_strength: 0.5
use_calibration: OFF
output_scale: 1.0
That's it. Generate some images and compare against the stock 0.6B.
Tuning Guide
What the settings actually do (plain English):
use_alignment — Rotates the 4B's internal "compass" so that when it says "from the side" or "looking up", it points in the same direction the diffusion model expects. Without this, the 4B understands your prompt fine — it just communicates it in a way the diffusion model misreads.
alignment_strength (0.0 – 1.0) — The rotation (direction fix) is always on when alignment is enabled. This slider controls how much the magnitude shifts to match the 0.6B:
0.0 = Directions fixed, but keep the 4B's own signal strength
0.5 = Halfway blend ← start here
1.0 = Fully match the 0.6B's signal strength
use_calibration — A finer-grained magnitude adjustment (per dimension instead of uniform). Can help, can also over-correct. Try it on and off and compare.
output_scale — A simple multiplier on the final output. Leave at 1.0 unless you know what you're doing.
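Putting the four settings together, here is a plausible sketch of the encode path. The blend formula is an assumption based on the descriptions above: rotation is always applied when alignment is on, and alignment_strength interpolates each token's magnitude toward a precomputed 0.6B target norm (target_norm here is a hypothetical stat, not a real parameter name):

```python
import torch

def encode_aligned(emb_4b, rotation, scale, target_norm,
                   use_alignment=True, alignment_strength=0.5,
                   use_calibration=False, output_scale=1.0):
    out = emb_4b
    if use_alignment:
        out = out @ rotation                 # direction fix: always on when alignment is enabled
        norm = out.norm(dim=-1, keepdim=True)
        blended = (1 - alignment_strength) * norm + alignment_strength * target_norm
        out = out / norm * blended           # 0.0 keeps the 4B's magnitude, 1.0 fully matches the 0.6B
    if use_calibration:
        out = out * scale                    # finer, per-dimension magnitude adjustment
    return out * output_scale                # simple final multiplier
```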
Recommended workflow:
Generate with alignment OFF first — see what the raw 4B gives you. The text understanding will be better, but poses/viewpoints may be off.
Turn alignment ON, set strength to 0.5 — generate the same prompts again. You should see better pose/viewpoint adherence while keeping the 4B's improved understanding.
Adjust strength — bump it up if spatial stuff is still off, pull it back if quality degrades.
Optionally enable calibration — compare on/off, keep whichever looks better for your use case.
FAQ
Q: Do I need both calibration AND alignment files? A: The alignment file (rotation_matrix.safetensors) is the most important one. Calibration is optional and supplementary. You can use alignment without calibration.
Q: Will this work with any Anima 2B checkpoint? A: Yes — any checkpoint built on Anima 2B that uses the standard text encoder pipeline.
Q: Does this need extra Python packages? A: For ComfyUI — no, everything ships with ComfyUI already. For Forge Neo — transformers and safetensors install automatically on first launch.
Q: How much extra VRAM does this use? A: The 4B encoder weights are FP8 quantized, so roughly ~4GB for the text encoder. The stock 0.6B is under 1GB. Your total VRAM usage depends on your diffusion model + VAE + this.
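The arithmetic behind that estimate is simple: FP8 stores one byte per parameter, so the weights alone scale linearly with parameter count (activations add a bit on top):

```python
params_4b = 4e9
print(params_4b * 1 / 1e9)    # 1 byte/param at FP8 → ~4.0 GB

params_06b = 0.6e9
print(params_06b * 1 / 1e9)   # ~0.6 GB at the same precision
```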
Q: Why not just scale the output by 10× instead of all this alignment stuff? A: Uniform scaling fixes the magnitude but not the directions. The 4B encodes "from the side" as a vector pointing in a completely different direction than the 0.6B. The rotation matrix fixes that. Scaling alone would be like shouting the wrong directions louder.
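A toy demonstration of that point: cosine similarity (direction agreement) is invariant to uniform scaling, so amplifying a wrong direction leaves it just as wrong. The vectors here are made up purely for illustration:

```python
import torch
import torch.nn.functional as F

expected = torch.tensor([1.0, 0.0])  # direction the adapter expects for "from the side"
raw_4b = torch.tensor([0.0, 0.1])    # hypothetical raw 4B direction: orthogonal, and quiet

print(F.cosine_similarity(expected, raw_4b, dim=0))       # tensor(0.) → wrong direction
print(F.cosine_similarity(expected, raw_4b * 10, dim=0))  # tensor(0.) → louder, still wrong
```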
Q: Is this better than the stock 0.6B? A: For text understanding — yes, meaningfully. For raw out-of-the-box image quality — it depends on your alignment settings and prompts. The 0.6B has the advantage of being exactly what the model was trained against. The 4B has the advantage of actually understanding complex prompts. With alignment at 0.5, most users see comparable or better results, especially on detailed prompts where the 0.6B falls short.
Q: Can I use this with img2img? A: Yes — works for both txt2img and img2img on both ComfyUI and Forge Neo.
Q: Why does Forge Neo still need the 0.6B model loaded? A: The Anima pipeline uses a small LLM adapter that lives on the 0.6B model. This adapter converts text embeddings into the format the diffusion model expects. The 4B provides the text understanding, but the adapter (on the 0.6B) still handles the final conversion. Both models are needed.
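In pseudocode terms, the flow looks roughly like this (function and attribute names are placeholders, not the extension's real API):

```python
def encode_prompt(prompt: str):
    tokens = qwen35_tokenizer(prompt)  # 248K-vocab tokenizer bundled with the extension
    emb = qwen35_4b.encode(tokens)     # text understanding: the 4B model
    emb = align(emb)                   # rotation (+ optional calibration), as sketched earlier
    return qwen3_06b.llm_adapter(emb)  # adapter on the 0.6B produces what the diffusion model expects
```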
Credits
Anima 2B: circlestone-labs
Qwen 3.5 4B for Anima: nightknocker/cosmos-qwen3.5
Custom Node, Alignment & Forge Neo Port: GumGum10
Comments
Disclaimer: This is more of an adapter than a fully-fledged Qwen 3.5 implementation. Are the results better? Not necessarily. Are they worse? Again, not necessarily... it's all subjective. Qwen 3.5 4B is several times bigger than Qwen 3 0.6B, so the model understands a lot more concepts, and it has multilingual support too. Please test and let me know your thoughts; if you see issues with prompt following, set alignment to 1.0 in ComfyUI and that should fix it.
Is it possible to use this with Forge?
Not at the moment, but I can look into supporting it. Which Forge is being used right now? Neo?
@LyloGummy Definitely Neo!
@LyloGummy Yes Neo would be your best bet.
Thanks! Will get Neo shipped today/tomorrow
@ujustgotcyberfuuck213 @sneedingonmyligma420 @compgamer1337267 Forge Neo is now supported:
@LyloGummy Nice. Well, I ran it with the suggested settings on your GitHub page. Overall it crushed prompt adherence, though it maybe improved the prompt I ran in a specific way I was looking for; results varied wildly. The project certainly has potential.
@sneedingonmyligma420 Yep, noticed that as well; if you set alignment to 1 it's going to work better. This is due to the fact that Anima was trained with the 0.6B. This will improve as the TE is trained more (if it is)
@LyloGummy sorry, i cant get what to do, where do i find this extension?
@compgamer1337267 Hey! Just download this archive/clone the repo, and unzip/place in sd-forge-neo folder -> extensions https://github.com/GumGum10/sd-forge-qwen-35-encoder
@LyloGummy Sorry for the stupid question, but I just don't understand how to download the file...
This is a good job, but my 4050 only has 6GB of video memory, so I won't try it.
Hi, thanks for the feedback. I will look into adding the option to offload the TE to CPU/RAM so more people can try it. Also, if the original author of the TE open-sources the 2B variant I can add support for that as well, as it should fit within 6 GB https://huggingface.co/nightknocker/cosmos-qwen3.5
I do use ComfyUI for video generation and some Qwen/Flux, but for anime style I use Forge because it's faster and easier, and inpainting/img2img is easier too. I wish this worked on Forge.
Forge is now supported!
@LyloGummy I got this error:
AttributeError: 'Qwen3_06B' object has no attribute 'llm_adapter'
@Seii1 Hey, please open an issue here and post the full console logs/output, will take a look:
https://github.com/GumGum10/sd-forge-qwen-35-encoder
I never got the results I was hoping for, but even on an RTX 3060 it ran quite briskly.
This looks promising, but is it even worth it? On some of the pictures the Qwen3 0.6B looks better, and the prompt adherence looks similar too.
It's all subjective, imo. We make use of Qwen 3.5's 4B parameters and use an adapter to generate the embeddings that Anima was trained on. That being said, this is just the initial release and I'm researching ways to improve it further; if anything, this is a proof of concept to show that Anima is compatible with larger LLMs. I do appreciate the feedback! ❤
@LyloGummy It might be worth researching Rouwei-Gemma for Illustrious, as I think it's an LLM/T5 adapter for the Illustrious CLIP. I haven't used it, but I have heard good things about it. https://civitai.com/models/1782437/rouwei-gemma
It might help with future models that you make, although it does use the older SDXL/Illustrious architecture and T5 rather than Qwen3...
@LyloGummy Wait, isn't that what it already does, but you're adding another layer of it? Qwen 3.5 -> 3 translator -> I forgot what cosmos used. Or are you just replacing the 3 part with your own thing?
Also, how are you training it? If you're training it on outputs of the 0.6B, it might just learn its mistakes. Base training, then some kind of RL with prompt adherence rated somehow, would probably be best. Don't ask me how to do that at scale, though; maybe VLMs or smaller models that try to generate tags from images.
Qwen is probably best for now, but do you think a model outside the Qwen series would be doable?
Any LLM should be doable in theory. The real problem, if we wish to avoid re-training, is how we align the embeddings of another LLM so they match the 0.6B embeddings. And are the results enough better to make it worthwhile? Which model did you have in mind, if I may ask?
Seems to not work with natural language prompts at all. I tried with different checkpoints with and without lora, with different samplers etc.
As soon as I switched to tag based prompts both encoders worked similarly fine, but with natural language the encoder basically only focused on a single paragraph. Some images even came out as flat colors (or in one case, a green flat color with two very tiny figures visible xD).
I uploaded my results here https://civitai.com/posts/27161701 though I wasn't able to tag which images used which encoder. They all contain their workflows though, so you can check that way.
But for reference, none of the images in that post with ice cream or other people were produced with your encoder. The best image (the one where the character is showing a peace sign) still ignored most of the prompt.
PS: I also had multiple crashes with dynamic vram enabled, though I'm not sure what exactly caused them.
Yeah, I noticed that as well. The majority of the prompt I was using was completely ignored; I'm using about 30% tags and 70% natural language, and only the tags came through in the generation.
Not to mention that generations took about 3 times longer than normal (using Forge Neo).
Really cool proof of concept, and I really want you to experiment with this idea and maybe get something way better than what we have, but in my testing it's just way worse than the normal text encoder. To be super clear though, I really do not want to discourage you at the end of the day.
Appreciate the honest feedback! This is what I lacked during my testing, lol. Stay tuned for the next version; I am addressing all the issues.
Works very badly: artists don't look like artists, it ignores "from side", limbs go missing. With alignment at 1.0 or without. Turning on calibration generates a grid of noise.
First, if you use calibration and alignment at the same time, you will get noise. But you can get a picture by using either calibration or alignment alone. So strange.
This is a great technical attempt, and at the same time I must agree with some comments that the performance of this version is currently not as good as the original.
Anima was trained on the 0.6B; the 4B has different embeddings. First of all, solve this problem.
Hi! Thanks for all the great feedback! The community interest is overwhelming; that tells me I have to make this project meet expectations! I'm posting here as it's easier than responding to each comment, but rest assured I am reading all of them. This is more a proof of concept than anything else, and I am actively working to address all the issues, including NL, quality, artists, limbs, etc. I still believe that with some more iterations this can become a good replacement for Qwen 3 0.6B. Stay tuned!
So I gave it a go; unfortunately, it didn't work very well. The checkpoint I'm using is AnimaYume, and I'm also using the RDBT - Anima stability LoRA. With this LLM adapter it completely ignores the prompt and doesn't give me what I want. I don't know if I'm doing it right, but I've tried multiple times and it's not really working.
I had hopes, but nope, this is not good for me. We completely lose character knowledge, and that's not good. Instant delete, sorry.
This has great potential! But it didn't work for me either in its current state. I experimented all night using different alignment strengths. Not just character knowledge was lost, but other things like landmarks too (Tokyo Tower became Eiffel Tower). Looking forward to future improvements!
Hey bro, cool work. I was just thinking: Anima really loves good prompts out of the box, with every detail clearly written. In Illustrious, you could get a cool image by typing a few words. The model, as I understand it, had a built-in prompt enhancer. Is it possible to do something like this for Anima, so that you can get varied, logical images from a single sentence without requiring a lot of careful thought?