CivArchive
    Microsoft Lens - Turbo

    Lens is a 3.8B-parameter text-to-image diffusion model from Microsoft Research, built around a 48-block Multimodal DiT denoiser, the FLUX.2 semantic VAE, and multi-layer GPT-OSS text features. The headline pitch from the team: competitive quality with substantially less training compute than larger T2I models, plus flexible resolution up to 1440x1440 across aspect ratios from 1:2 to 2:1.

    Microsoft originally released Lens on Hugging Face under the MIT license. All credit for the model goes to the Microsoft Research team listed below. Civitai hosts a mirror so creators can run it on-site. For weights, updates, and to follow the project directly, head to the original Hugging Face repo or the GitHub project.

    Built by

    Microsoft Research. Project leads: Dong Chen, Fangyun Wei, Ziyu Wan. Core contributors: Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang. Full contributor list (alphabetical): Baining Guo, Chong Luo, Dong Chen, Dongdong Chen, Fangyun Wei, Ji Li, Jianmin Bao, Jiawei Zhang, Jinjing Zhao, Lei Shi, Qinhong Yang, Sirui Zhang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yang Yue, Yitong Wang, Yunuo Chen, Zhiyang Liang, Ziyu Wan.

    Versions mirrored on Civitai

    Two builds are hosted here:

    • Lens - the default, RL-tuned build. 20 inference steps at CFG 5.0. Use this when quality matters more than speed.
    • Lens-Turbo - a distilled fast variant. 4 inference steps at CFG 1.0. Use this when you want quick iterations or are batching at volume.

    Upstream also publishes Lens-Base, a supervised baseline (50 steps, no RL or distillation) intended for research comparisons. That build is not mirrored on Civitai. Grab it from the Hugging Face repo if you need it.
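
    The step counts and CFG values above can be captured as a small lookup table. This is a minimal sketch, not upstream code: the names SamplerPreset, PRESETS, and cfg_combine are hypothetical, and the listing gives no CFG value for Lens-Base, so that field is left unset. The cfg_combine helper shows the standard classifier-free guidance mix; whether Lens uses exactly this formulation is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, List


@dataclass(frozen=True)
class SamplerPreset:
    """Hypothetical container for the sampler settings quoted in the listing."""
    steps: int
    cfg_scale: Optional[float]  # None where the listing gives no value


# Values copied from the listing; the dictionary itself is illustrative only.
PRESETS = {
    "Lens": SamplerPreset(steps=20, cfg_scale=5.0),
    "Lens-Turbo": SamplerPreset(steps=4, cfg_scale=1.0),
    "Lens-Base": SamplerPreset(steps=50, cfg_scale=None),  # CFG not stated
}


def cfg_combine(uncond: Sequence[float], cond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    Assumption: this is the common textbook form, not confirmed for Lens.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    turbo = PRESETS["Lens-Turbo"]
    # At scale 1.0 the result equals the conditional prediction.
    print(cfg_combine([0.0, 1.0], [2.0, 3.0], turbo.cfg_scale))
```

    Under this formulation a CFG of 1.0 returns the conditional prediction unchanged, which is one reason distilled few-step models are commonly run at that setting.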

    Architecture and training

    The denoiser is a 48-block MMDiT operating in the FLUX.2 semantic VAE latent space, conditioned on concatenated multi-layer features from GPT-OSS as the text encoder. Training used the Lens-800M corpus - 800M image-text pairs with long, dense GPT-4.1 captions - combined with mixed-resolution learning so the model handles multiple aspect ratios natively rather than locking to a single training resolution.

    Resolution and aspect ratios

    Lens generates natively at up to 1440x1440, with aspect ratios from 1:2 to 2:1. Nine standard ratios are supported out of the box. Prompt-following is reported to be strong on long, descriptive captions, and the GPT-OSS text stack handles multilingual prompts.
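
    As a rough illustration of the limits above, the sketch below picks an output size for a requested aspect ratio. It rests on several assumptions: "up to 1440x1440" is read as a 1440-pixel cap on each side, dimensions are rounded down to a multiple of 16 (a common latent-grid convention that the listing does not state), and the function name fit_resolution is hypothetical. The listing does not enumerate the nine standard ratios, so none are hard-coded here.

```python
from fractions import Fraction
from typing import Tuple

MAX_SIDE = 1440              # from the listing: up to 1440x1440
MIN_RATIO = Fraction(1, 2)   # from the listing: 1:2
MAX_RATIO = Fraction(2, 1)   # from the listing: 2:1
MULTIPLE = 16                # assumption: not stated in the listing


def fit_resolution(ratio_w: int, ratio_h: int,
                   max_side: int = MAX_SIDE,
                   multiple: int = MULTIPLE) -> Tuple[int, int]:
    """Return (width, height) for ratio_w:ratio_h with the long side at max_side."""
    if ratio_w <= 0 or ratio_h <= 0:
        raise ValueError("ratio terms must be positive integers")
    ratio = Fraction(ratio_w, ratio_h)
    if not (MIN_RATIO <= ratio <= MAX_RATIO):
        raise ValueError(f"aspect ratio {ratio_w}:{ratio_h} is outside 1:2 to 2:1")
    if ratio >= 1:
        # Landscape or square: width is the long side.
        width, height = max_side, int(max_side / ratio)
    else:
        # Portrait: height is the long side.
        width, height = int(max_side * ratio), max_side
    # Round each side down to the assumed grid multiple.
    return width - width % multiple, height - height % multiple


if __name__ == "__main__":
    for w, h in [(1, 1), (2, 1), (1, 2), (16, 9)]:
        print(f"{w}:{h} ->", fit_resolution(w, h))
```

    Ratios outside the 1:2 to 2:1 range raise a ValueError rather than being clamped, so an out-of-range request fails visibly.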

    Intended use

    Microsoft positions Lens as a research model rather than a product. It is appropriate for controlled research settings with human oversight, not for unattended production deployment. Honor the upstream model card if you build on top of it.

    Checkpoint
    Lens

    Details

    Downloads: 0
    Platform: CivitAI
    Platform Status: Available
    Created: 5/27/2026
    Updated: 5/27/2026
    Deleted: -

    Files

    microsoftLens_turbo.safetensors

    Mirrors

    HuggingFace (1 mirror)

    microsoftLens_turbo.safetensors