Controllable diffusion methods have substantially expanded the practical utility of diffusion models, but they are typically developed as isolated, backbone-specific systems with incompatible training pipelines, parameter formats, and runtime hooks. This fragmentation makes it difficult to reuse infrastructure across tasks, transfer capabilities across backbones, or compose multiple controls within a single generation pipeline.
We present Diffusion Templates, a unified and open plugin framework that decouples base-model inference from controllable capability injection. The framework is organized around three components: Template models, which map arbitrary task-specific inputs to an intermediate capability representation; a Template cache, which serves as a standardized interface for capability injection; and a Template pipeline, which loads, merges, and injects one or more Template caches into the base diffusion runtime. Because the interface is defined at the systems level rather than tied to a specific control architecture, heterogeneous capability carriers such as KV-Cache and LoRA can be supported under the same abstraction.
Based on this design, we build a diverse model zoo spanning structural control, brightness adjustment, color adjustment, image editing, super-resolution, sharpness enhancement, aesthetic alignment, content reference, local inpainting, and age control. These case studies show that Diffusion Templates can unify a broad range of controllable generation tasks while preserving modularity, composability, and practical extensibility across rapidly evolving diffusion backbones. All resources, including code, models, and datasets, are open-sourced.
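To make the three-component design concrete, here is a minimal sketch of how a Template pipeline could load, merge, and inject Template caches. The module `diffusion_templates` and every class and method shown are illustrative assumptions, not the released API.

```python
# Illustrative sketch only: `diffusion_templates`, TemplatePipeline,
# TemplateCache, and all method names below are hypothetical.
from diffusion_templates import TemplateCache, TemplatePipeline

# Template pipeline: wraps the base diffusion runtime.
pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")

# Template caches: standardized carriers produced by Template models.
# Heterogeneous carriers (KV-Cache, LoRA, ...) share the same interface.
brightness = TemplateCache.load("templates/brightness", value=0.8)
structure = TemplateCache.load("templates/structure-depth", image="depth_map.png")

# Inject one or more caches into the base runtime, then generate.
pipeline.inject([brightness, structure])
image = pipeline("a sunlit reading room, wide-angle view")
image.save("output.png")
```

The point of the sketch is the shape of the interface: composing two controls is the same call as injecting one.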
A diverse set of Template models trained on FLUX.2-klein-base-4B, covering structural control, attribute adjustment, image editing, and more.
Guides spatial structure, contours, and perspective via reference images. Supports depth, outline, human pose, and normal maps.
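A usage sketch with the same hypothetical API as above; the template path and depth-map file are placeholders.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Condition generation on a depth map; outline, pose, and normal maps follow the same pattern.
structure = TemplateCache.load("templates/structure-depth", image="room_depth.png")
pipeline.inject([structure])
image = pipeline("a cozy attic bedroom matching the given depth layout")
```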






Adjusts image brightness with a scalar value normalized to [0,1].
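Sketched with the same hypothetical API; the scalar is the only task-specific input.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Brightness is a single scalar normalized to [0, 1].
brightness = TemplateCache.load("templates/brightness", value=0.2)  # low-key, dim lighting
pipeline.inject([brightness])
image = pipeline("a candlelit dinner table")
```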






Controls color tone and temperature via R/G/B channel values.
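Under the same hypothetical API, the per-channel values are assumed keyword arguments.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Warm tone: raise R, lower B.
color = TemplateCache.load("templates/color", r=0.9, g=0.6, b=0.3)
pipeline.inject([color])
image = pipeline("a portrait at golden hour")
```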






Edits images precisely via natural-language instructions, running ~1.8x faster than the base model.
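One way this could look under the hypothetical API: the source image rides in the cache, and the edit instruction is the prompt. Whether the real interface splits them this way is an assumption.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Source image in the cache; the natural-language edit is the prompt.
edit = TemplateCache.load("templates/edit", image="photo.png")
pipeline.inject([edit])
edited = pipeline("replace the sky with a starry night")
```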






Performs aesthetic alignment via a LoRA carrier; the control value generalizes beyond the [0,1] training range.
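A sketch of the extrapolation behavior under the same hypothetical API; the value 1.5 is deliberately outside the training range.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Trained on [0, 1], but the scalar extrapolates beyond the training range.
aesthetic = TemplateCache.load("templates/aesthetic-lora", value=1.5)
pipeline.inject([aesthetic])
image = pipeline("a misty mountain lake at dawn")
```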






Controls portrait age with a scalar parameter (10–90), trained on IMDB-WIKI.
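Sketched with the same hypothetical API; note the scalar here is unnormalized, unlike the brightness template.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Age is an unnormalized scalar in [10, 90].
age = TemplateCache.load("templates/age", value=65)
pipeline.inject([age])
image = pipeline("a studio portrait, neutral background")
```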






Takes a low-resolution input, preserves its composition and semantics, and recovers high-frequency details.
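A sketch under the same hypothetical API; the output-size arguments are an assumption.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Low-resolution input in the cache; output size arguments are assumed.
sr = TemplateCache.load("templates/super-resolution", image="photo_256.png")
pipeline.inject([sr])
image = pipeline(height=1024, width=1024)
```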






Encodes reference images with SigLIP2 and converts the embedding into a LoRA for flexible reference-based generation.
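Under the hypothetical API, an extract-once, reuse-many pattern would look like this; the encode-to-LoRA step happens inside the cache.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# The reference image is encoded (SigLIP2) and carried as a LoRA internally,
# so the same cache can be reused across many prompts.
ref = TemplateCache.load("templates/reference", image="my_cat.png")
pipeline.inject([ref])
image = pipeline("the same cat wearing a tiny wizard hat")
```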






Precisely controls sharpness and detail level; lower values yield soft results, while higher values produce crisp details.
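Sketched here composed with the brightness template, to illustrate the composability claim; same hypothetical API as above.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Two caches, one injection: sharpness composed with brightness.
sharp = TemplateCache.load("templates/sharpness", value=0.85)   # crisp details
bright = TemplateCache.load("templates/brightness", value=0.7)
pipeline.inject([sharp, bright])
image = pipeline("a macro shot of a dragonfly on a leaf")
```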






Accepts an input image and a mask, then generates new content in the masked region from a natural-language prompt, blending it seamlessly with the surrounding background.
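A sketch with the same hypothetical API; the mask convention (white marks the editable region) is an assumption.

```python
from diffusion_templates import TemplateCache, TemplatePipeline  # hypothetical, as above

pipeline = TemplatePipeline.from_pretrained("FLUX.2-klein-base-4B")
# Assumed convention: white mask pixels mark the region to regenerate.
inpaint = TemplateCache.load("templates/inpaint", image="street.png", mask="mask.png")
pipeline.inject([inpaint])
image = pipeline("a red vintage bicycle leaning against the wall")
```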










A fun easter egg model that generates hilarious panda-head meme stickers.






A large-scale open dataset collection (~1.2 TB) for training Diffusion Templates models, covering text-to-image generation and diverse image editing tasks. All datasets are released under Apache License 2.0.
@article{duan2025diffusion,
author = {Duan, Zhongjie and Zhang, Hong and Chen, Yingda},
title = {Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion},
year = {2025},
}