Lens is a 3.8B-parameter text-to-image diffusion model from Microsoft Research, built around a 48-block Multimodal DiT denoiser, the FLUX.2 semantic VAE, and multi-layer GPT-OSS text features. The headline pitch from the team: competitive quality with substantially less training compute than larger T2I models, plus flexible resolution up to 1440x1440 across aspect ratios from 1:2 to 2:1.
Lens was originally released by Microsoft on Hugging Face under the MIT license. All credit for the model goes to the Microsoft Research team listed below. Civitai is hosting a mirror so creators can run it on-site. For weights and updates, or to follow the project directly, please head to the original Hugging Face repo or the GitHub project.
Built by
Microsoft Research. Project leads: Dong Chen, Fangyun Wei, Ziyu Wan. Core contributors: Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang. Full contributor list (alphabetical): Baining Guo, Chong Luo, Dong Chen, Dongdong Chen, Fangyun Wei, Ji Li, Jianmin Bao, Jiawei Zhang, Jinjing Zhao, Lei Shi, Qinhong Yang, Sirui Zhang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yang Yue, Yitong Wang, Yunuo Chen, Zhiyang Liang, Ziyu Wan.
Versions mirrored on Civitai
Two builds are hosted here:
- Lens - the default, RL-tuned build. 20 inference steps at CFG 5.0. Use this when quality matters more than speed.
- Lens-Turbo - a distilled fast variant. 4 inference steps at CFG 1.0. Use this when you want quick iterations or are batching at volume.
Upstream also publishes Lens-Base, a supervised baseline (50 steps, no RL or distillation) intended for research comparisons. That build is not mirrored on Civitai. Grab it from the Hugging Face repo if you need it.
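For anyone scripting against these builds, the recommended settings above can be kept in a small lookup table. This is a minimal sketch, not part of the Lens codebase: `SamplerPreset`, `PRESETS`, and `preset_for` are hypothetical names chosen for illustration, and the CFG value for Lens-Base is left unset because this page does not state one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplerPreset:
    """Recommended sampling settings for one build (hypothetical helper type)."""
    steps: int
    cfg_scale: Optional[float]  # None means this page does not state a value


# Values copied from the build descriptions above.
PRESETS = {
    "Lens": SamplerPreset(steps=20, cfg_scale=5.0),
    "Lens-Turbo": SamplerPreset(steps=4, cfg_scale=1.0),
    "Lens-Base": SamplerPreset(steps=50, cfg_scale=None),  # CFG not stated here
}


def preset_for(build: str) -> SamplerPreset:
    """Return the preset for a build name, or raise ValueError if unknown."""
    try:
        return PRESETS[build]
    except KeyError:
        known = ", ".join(sorted(PRESETS))
        raise ValueError(f"unknown build {build!r}; expected one of: {known}") from None


if __name__ == "__main__":
    turbo = preset_for("Lens-Turbo")
    print(turbo.steps, turbo.cfg_scale)
```

Keeping the numbers in one place makes it harder to run the Turbo build with the default build's step count by accident, or the reverse.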
Architecture and training
The denoiser is a 48-block MMDiT operating in the FLUX.2 semantic VAE latent space, conditioned on concatenated multi-layer features from GPT-OSS, which serves as the text encoder. Training used the Lens-800M corpus - 800M image-text pairs with long, dense GPT-4.1 captions - combined with mixed-resolution learning, so the model handles multiple aspect ratios natively rather than being locked to a single training resolution.
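To make "concatenated multi-layer features" concrete, here is a toy sketch of one possible reading: hidden states from several text-encoder layers are joined per token along the channel axis. This is an assumption for illustration only. The page does not say which layers are used or along which axis they are concatenated, and `concat_layer_features` is a hypothetical helper, not a function from the Lens code.

```python
from typing import List, Sequence

Token = List[float]  # feature vector for one token


def concat_layer_features(layers: Sequence[Sequence[Token]]) -> List[Token]:
    """Join per-token features from several layers along the channel axis.

    `layers[i][t]` is the feature vector of token `t` at the i-th selected layer.
    Assumption: channel-wise concatenation; the real model may differ.
    """
    if not layers:
        raise ValueError("need at least one layer of features")
    seq_len = len(layers[0])
    if any(len(layer) != seq_len for layer in layers):
        raise ValueError("all layers must cover the same number of tokens")
    # For each token, append the vectors from every layer in order.
    return [[x for layer in layers for x in layer[t]] for t in range(seq_len)]


if __name__ == "__main__":
    # Two tokens, two layers, two channels per layer (toy sizes).
    shallow = [[1.0, 2.0], [3.0, 4.0]]
    deep = [[5.0, 6.0], [7.0, 8.0]]
    print(concat_layer_features([shallow, deep]))
```

Under this reading, the conditioning width grows with the number of layers selected while the token count stays fixed.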
Resolution and aspect ratios
Native generation up to 1440x1440 with aspect ratios from 1:2 to 2:1. Nine standard ratios are supported out of the box. Prompt-following is reported to be strong on long, descriptive captions, and the GPT-OSS text stack handles multilingual prompts.
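The limits above can be turned into a quick size check before queuing a generation. The sketch below is hedged on two assumptions that this page does not confirm: that "up to 1440x1440" means neither side may exceed 1440 pixels, and that dimensions should be rounded down to a multiple of 16. `largest_size` is a hypothetical helper, and it does not enumerate the nine standard ratios, which are not listed here.

```python
from typing import Tuple

MAX_SIDE = 1440  # assumed per-side cap, read from "up to 1440x1440"


def largest_size(ratio_w: int, ratio_h: int, multiple: int = 16) -> Tuple[int, int]:
    """Largest (width, height) for a ratio within the stated 1:2 to 2:1 range.

    The long side is pinned to MAX_SIDE and the short side is scaled to match,
    then both are rounded down to `multiple` (an assumed latent-grid constraint).
    """
    if ratio_w <= 0 or ratio_h <= 0:
        raise ValueError("ratio terms must be positive")
    # Integer form of 1/2 <= w/h <= 2, which avoids floating-point edge cases.
    if 2 * ratio_w < ratio_h or ratio_w > 2 * ratio_h:
        raise ValueError(f"{ratio_w}:{ratio_h} is outside the 1:2 to 2:1 range")
    if ratio_w >= ratio_h:
        width, height = MAX_SIDE, MAX_SIDE * ratio_h // ratio_w
    else:
        width, height = MAX_SIDE * ratio_w // ratio_h, MAX_SIDE
    return width - width % multiple, height - height % multiple


if __name__ == "__main__":
    for w, h in [(1, 1), (2, 1), (1, 2), (16, 9)]:
        print(f"{w}:{h} ->", largest_size(w, h))
```

Rounding down can shift the effective ratio slightly (16:9 becomes 1440x800 rather than 1440x810), so treat the result as a starting point and prefer the sizes offered by the on-site generator or the upstream model card where they differ.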
Intended use
Microsoft positions Lens as a research model rather than a product. It is appropriate for controlled research settings with human oversight, not for unattended production deployment. Honor the upstream model card if you build on top of it.
Links
- Source (Lens): huggingface.co/microsoft/Lens
- Source (Lens-Turbo): huggingface.co/microsoft/Lens-Turbo
- Code: github.com/microsoft/Lens
- License: MIT





