Guide
Diffusion Models
A practical guide to how diffusion models work, their training process, and applications in image, video, and multimodal generation.
What Are Diffusion Models?
Diffusion models are a class of generative models that produce data by learning to reverse a gradual noising process. During training, the model is shown real data samples at various stages of corruption — from the original clean sample all the way to pure Gaussian noise — and learns to predict and remove that noise. At inference time, generation begins from pure noise, and the model iteratively denoises over hundreds or thousands of steps until a coherent sample emerges.
This approach has produced state-of-the-art results across image, audio, and video generation, decisively outperforming GANs on perceptual quality benchmarks while being substantially more stable to train. Compared with GANs, diffusion models are far less prone to mode collapse and training instability, and unlike VAEs, they do not impose a strong prior on the latent space. The trade-off is inference cost: naive sampling requires hundreds or thousands of sequential neural network evaluations, making generation far slower than a single GAN forward pass.
The Forward and Reverse Process
The forward process is a fixed Markov chain that adds small amounts of Gaussian noise at each of T timesteps, gradually destroying the structure of the original data. A key mathematical property is that this process is analytically tractable: given any data sample x₀, you can compute the noised version at any arbitrary timestep t directly, without stepping through all intermediate timesteps. This closed-form computation makes training efficient — you can sample random timesteps and compute losses in parallel rather than simulating the full chain sequentially.
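The closed-form jump to an arbitrary timestep can be sketched in a few lines of numpy. This is a minimal illustration, not production code: `q_sample` is a hypothetical helper name, the linear beta schedule values follow the original DDPM setup, and the 32×32 array stands in for a real data sample.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Noise a clean sample x0 directly to timestep t (hypothetical helper).

    Closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, where
    abar_t is the cumulative product of the per-step signal-keep rates.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Linear beta schedule over T steps, as in the original DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))  # stand-in for a clean data sample
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Note that no loop over intermediate timesteps is needed — `alpha_bar[t]` collapses the whole chain up to step t into one multiply-and-add.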
The reverse process is the learned component. A neural network — typically parameterised as a noise prediction network — takes a noisy sample and the current timestep as inputs, and predicts the noise that was added. Training minimises the mean-squared error between predicted and actual noise, a surprisingly simple objective that yields powerful generative models. The timestep input is critical: it tells the model how much noise it should expect to see and therefore how aggressively to denoise. At inference, the reverse process is applied from t=T down to t=1, with each step slightly reducing noise until a clean sample is recovered.
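A single training step under this objective can be sketched as follows. The `dummy_model` here is a stand-in for the real noise prediction network (a U-Net or DiT); everything else — the schedule, the random timestep, the MSE loss — mirrors the standard recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def dummy_model(x_t, t):
    """Stand-in for the learned noise prediction network."""
    return np.zeros_like(x_t)

# One training step: sample a random timestep, noise the clean sample
# with the closed-form forward process, then regress the added noise.
x0 = rng.standard_normal((32, 32))
t = int(rng.integers(0, T))
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

loss = np.mean((dummy_model(x_t, t) - eps) ** 2)  # simple MSE objective
```

Because the forward process is tractable at any t, a batch can mix timesteps freely — each example in the batch trains the network at a different noise level.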
Training Diffusion Models
Training a diffusion model from scratch requires large, high-quality datasets of the target modality. For text-to-image models, hundreds of millions of image-caption pairs are the baseline — Stable Diffusion was trained on filtered subsets of LAION-5B, a dataset of 5.85 billion image-text pairs. Data quality matters significantly more than raw quantity: filtering for aesthetic quality, caption alignment, and content safety produces meaningfully better models than simply training on everything available. CLIP-score filtering and aesthetic classifiers are standard preprocessing steps.
Classifier-free guidance (CFG) is the dominant conditioning technique and is baked into the training procedure itself. During training, the conditioning signal — typically a text embedding — is randomly dropped for some fraction of examples, forcing the model to learn both conditional and unconditional generation simultaneously. At inference, CFG interpolates between these two modes, amplifying the conditioning signal's influence. Higher guidance scale increases prompt adherence at the cost of sample diversity and sometimes introduces artifacts. Compute requirements are substantial: training a competitive image generation model from scratch requires tens of thousands of GPU-hours, which is why fine-tuning from existing checkpoints (using LoRA, DreamBooth, or full fine-tuning) is the practical approach for most teams.
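The inference-time CFG combination is a one-line extrapolation. A minimal sketch, with toy two-element arrays standing in for the model's conditional and unconditional noise predictions:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. A scale of 1.0 recovers plain
    conditional sampling; larger values amplify the prompt's influence.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # toy unconditional prediction
eps_c = np.array([1.0, -1.0])  # toy conditional prediction
guided = cfg_combine(eps_u, eps_c, guidance_scale=7.5)
```

In practice this doubles inference cost, since each denoising step needs two forward passes (one with the conditioning, one without), unless the two are batched together.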
Conditioning and Control
Text conditioning is the most common control mechanism: a language model (CLIP, T5, or a combined text encoder) converts the input prompt into embeddings, which are injected into the noise prediction network via cross-attention layers. The quality of the text encoder and the alignment of its training data with the image training data are major determinants of prompt adherence. SDXL and later models use dual text encoders to capture both semantic meaning and stylistic nuance from prompts.
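The cross-attention injection can be sketched as a single-head attention block where image tokens query the text embeddings. This is an illustrative toy: the weight matrices are random stand-ins for learned parameters, and the token counts and dimension are arbitrary.

```python
import numpy as np

def cross_attention(x, text_emb, Wq, Wk, Wv):
    """Single-head cross-attention: image tokens (queries) attend over
    text-prompt embeddings (keys/values), injecting the conditioning."""
    Q, K, V = x @ Wq, text_emb @ Wk, text_emb @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Numerically stable softmax over the text tokens.
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V

rng = np.random.default_rng(2)
d = 16
x = rng.standard_normal((64, d))     # 64 image (latent) tokens
text = rng.standard_normal((8, d))   # 8 text-embedding tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(x, text, Wq, Wk, Wv)
```

Real models use multi-head attention with separate key/value dimensions and run this block at several layers of the network, but the query-from-image, key/value-from-text pattern is the same.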
Beyond text, diffusion models can be conditioned on a wide variety of control signals. ControlNet adds trainable adapter layers that condition generation on structured inputs — edge maps, depth maps, human poses, segmentation masks — allowing precise spatial control without retraining the base model. IP-Adapter enables image-based conditioning, allowing a reference image to guide the style or subject of generation. For video, temporal attention layers and frame-level conditioning signals allow control over motion, camera movement, and scene continuity. Each additional conditioning modality requires its own paired training data and introduces complexity in the inference pipeline.
Architectures: U-Net and DiT
The U-Net has been the dominant architecture for diffusion model noise prediction networks since the original DDPM paper. It is a convolutional encoder-decoder with skip connections between corresponding encoder and decoder layers, preserving spatial resolution while learning hierarchical features. Timestep conditioning is injected at each residual block via learned embeddings derived from sinusoidal position encodings. Attention layers — initially added at lower resolutions to manage compute — are now used throughout the network in state-of-the-art models. Latent diffusion models (the architecture underlying Stable Diffusion) operate in a compressed latent space produced by a VAE rather than in pixel space, reducing computational cost by an order of magnitude.
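The sinusoidal timestep embedding mentioned above is simple to write down: a bank of sin/cos features at geometrically spaced frequencies, which a small learned MLP then projects before injection into each residual block. A minimal sketch (the `max_period` default follows common practice):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep into a dim-sized vector.

    Frequencies are geometrically spaced so nearby timesteps get similar
    embeddings while distant ones remain distinguishable.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(t=500, dim=128)
```

This is the same construction as the positional encoding in the original Transformer, repurposed to encode "how noisy is this sample" rather than "where is this token".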
The Diffusion Transformer (DiT), introduced by Peebles and Xie in 2022, replaces the convolutional U-Net backbone with a Vision Transformer operating on patch tokens. DiT demonstrates significantly better scaling behaviour than U-Net architectures — perceptual quality improves predictably as model size, compute, and training steps increase — and has become the architecture of choice for frontier models including Stable Diffusion 3, FLUX, and Sora. The shift to transformer backbones also simplifies the integration of multimodal conditioning, since attention-based text injection is architecturally natural in a transformer framework. Efficient attention implementations (Flash Attention) are essential at scale, as the quadratic attention cost over image patch sequences is the dominant compute bottleneck.
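The patch-token input format for a DiT can be sketched as a reshape: split the spatial grid into non-overlapping patches and flatten each one into a token. A toy example (real models patchify VAE latents rather than raw pixels, and follow this with a learned linear projection):

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) array into a sequence of flattened patch tokens,
    the input format for a transformer backbone."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    tokens = (img.reshape(H // patch, patch, W // patch, patch, C)
                 .transpose(0, 2, 1, 3, 4)    # group by patch position
                 .reshape(-1, patch * patch * C))
    return tokens

# A 32x32 latent with 4 channels and patch size 2 yields 256 tokens of
# dimension 16 -- it is this sequence length that attention is quadratic in.
img = np.arange(32 * 32 * 4, dtype=np.float64).reshape(32, 32, 4)
tokens = patchify(img, patch=2)
```

Smaller patch sizes give finer spatial resolution but lengthen the token sequence, which is exactly the compute trade-off the quadratic attention cost makes expensive.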
Applications and Use Cases
Image generation is the highest-profile application but a narrow slice of the practical value diffusion models deliver. Synthetic data generation for training downstream models is one of the highest-ROI enterprise use cases: in domains where real labeled data is scarce or expensive — medical imaging, satellite imagery, rare defect inspection — synthetic generation can dramatically expand training datasets. Style-consistent asset generation for marketing, ecommerce product imagery, and game development reduces production costs and cycle times. Inpainting and outpainting extend or restore images in a context-aware manner, enabling workflows that would previously have required hours of manual editing.
Video generation is the frontier: models like Sora, Wan, and Kling demonstrate coherent multi-second video clips from text prompts, with physically plausible motion and camera dynamics. Audio diffusion models — applied to speech, music, and sound effects — follow the same mathematical framework as image models, treating spectrograms or waveforms as the data domain. In drug discovery, diffusion models generate novel molecular structures conditioned on target binding properties, with RFDiffusion and AlphaFold-derived techniques showing genuine experimental validation. The common thread across all these domains is the same training recipe: define a noising process over your data modality, train a denoising network, and scale data and compute to the task.
Building with diffusion models?
Our data engine provides the high-quality labeled and synthetic training data that diffusion models need to perform at production quality.
See Lean Data Engine