Diffusion Model
Useful Links
- Denoising Diffusion Probabilistic Models - by Jonathan Ho et al
- Score-Based Generative Modeling through Stochastic Differential Equations - by Yang Song et al
- The Principles of Diffusion Models - by Chieh-Hsin Lai et al
- What are Diffusion Models? - by Lilian Weng
- Generative Modeling by Estimating Gradients of the Data Distribution - by Yang Song
- Score-based Diffusion Models | Generative AI Animated - by Deepia
Introduction
There are three perspectives from which to understand diffusion models [Lai et al. 2025]: the variational (VAE) perspective, the score-based perspective, and the flow-based perspective. The most classical one is the VAE perspective; the most general is the score-based generative model perspective through SDEs (stochastic differential equations). This blog is a summary from the VAE perspective.
Training
To train the UNet to predict the noise mixed into an image at diffusion step $t$, the true training objective is
\[
\mathcal{L} = \mathop{\mathbb{E}}\limits_{t} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_0 } \ \mathop{\mathbb{E}}\limits_{\boldsymbol{\epsilon} }
\left[ \| \boldsymbol{\epsilon} - \text{UNet}_\theta(\mathbf{x}_t, t) \|^2 \right]
\]
where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}$.
Here $t$, $\mathbf{x}_0$, and $\boldsymbol{\epsilon}$ are random variables
with distributions $t \sim \mathcal{U}(\{1,\ldots, T\})$, $\mathbf{x}_0 \sim p_{\text{data}}$, and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ respectively; $\mathbf{x}_t$ is the random variable they determine through the formula above.
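To make the forward noising formula concrete, here is a minimal PyTorch sketch. It assumes the linear $\beta_t$ schedule of Ho et al. [2020]; the number of steps, the schedule endpoints, the image shape, and the helper name `noisy_sample` are all illustrative assumptions.

```python
import torch

T = 1000                                    # number of diffusion steps (assumed)
# Linear beta schedule; the endpoints 1e-4 and 0.02 follow Ho et al. [2020].
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)    # \bar{\alpha}_t, cumulative product

def noisy_sample(x0: torch.Tensor, t: int, eps: torch.Tensor) -> torch.Tensor:
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bar[t - 1]                   # t runs over 1..T in the text
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

x0 = torch.randn(3, 32, 32)                 # a stand-in "image"
eps = torch.randn_like(x0)                  # the noise the UNet must predict
x_t = noisy_sample(x0, t=500, eps=eps)
```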
In practice, the expectations can only be approximated using finite samples drawn from these distributions.
By drawing many samples, the training objective can be approximated by the Monte Carlo average
\[
\mathcal{L} \approx \frac{1}{N} \sum_{\tilde{t}}\ \sum_{\tilde{\mathbf{x}}_0} \ \sum_{\tilde{\boldsymbol{\epsilon}}}
\| \tilde{\boldsymbol{\epsilon}} - \text{UNet}_\theta(\tilde{\mathbf{x}}_{\tilde{t}}, \tilde{t}) \|^2
\]
where $\tilde{\mathbf{x}}_{\tilde{t}} = \sqrt{\bar{\alpha}_{\tilde{t}}} \tilde{\mathbf{x}}_0+\sqrt{1-\bar{\alpha}_{\tilde{t}}} \tilde{\boldsymbol{\epsilon}}$.
Here $\tilde{t}$, $\tilde{\mathbf{x}}_0$, $\tilde{\mathbf{x}}_{\tilde{t}}$ and $\tilde{\boldsymbol{\epsilon}}$ are samples (realizations)
drawn from their respective distributions, and $N$ is the total number of sampled triples $(\tilde{t}, \tilde{\mathbf{x}}_0, \tilde{\boldsymbol{\epsilon}})$.
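Continuing the snippet above, the Monte Carlo average is just the mean of single-sample losses over the sampled triples. The function name `mc_loss` is hypothetical, and `model` is a stand-in for $\text{UNet}_\theta$.

```python
def mc_loss(model, images: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Monte Carlo estimate of the objective over n_samples (t, x0, eps) triples."""
    total = torch.tensor(0.0)
    for _ in range(n_samples):
        x0 = images[torch.randint(len(images), (1,))]   # sample an image
        t = int(torch.randint(1, T + 1, (1,)))          # sample a time step
        eps = torch.randn_like(x0)                      # sample a noise
        pred = model(noisy_sample(x0, t, eps), t)       # UNet_theta(x_t, t)
        total = total + ((eps - pred) ** 2).sum()       # squared L2 norm
    return total / n_samples
```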
Computing the loss over all samples is still too expensive. In practice, we adopt SGD (stochastic gradient descent):
in each iteration, we pick an image $\tilde{\mathbf{x}}_0$ from the training set, a time step $\tilde{t}$, and a noise sample $\tilde{\boldsymbol{\epsilon}}$,
then calculate the loss for this single sample,
\[
\mathcal{L} =
\| \tilde{\boldsymbol{\epsilon}} - \text{UNet}_\theta(\tilde{\mathbf{x}}_{\tilde{t}}, \tilde{t}) \|^2
\]
The UNet parameters $\theta$ are then updated by one gradient step downhill [Ho et al. 2020], giving the following training loop (a code sketch follows the list):
- repeat
- Sample an image $\tilde{\mathbf{x}}_0$ from the training set $\mathcal{D}$
- Sample a time step $\tilde{t}$ uniformly from $\{1, \ldots, T\}$
- Sample a noise $\tilde{\boldsymbol{\epsilon}}$ from Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$
- Calculate the gradient $\nabla_\theta \left\| \tilde{\boldsymbol{\epsilon}} - \text{UNet}_\theta\left(\sqrt{\bar{\alpha}_{\tilde{t}}}\tilde{\mathbf{x}}_0 + \sqrt{1 - \bar{\alpha}_{\tilde{t}}}\tilde{\boldsymbol{\epsilon}}, \tilde{t}\right) \right\|^2$
- Take a step in the direction opposite to the gradient (i.e., downhill)
- until converged
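This loop maps directly to code. Below is a minimal sketch continuing the earlier snippets; the tiny convolutional network standing in for the UNet, its crude time conditioning, the optimizer settings, and the random stand-in dataset are all assumptions made for illustration.

```python
import torch.nn as nn

# A toy stand-in for the UNet: a small conv net mapping (x_t, t) to predicted
# noise. A real implementation uses a full UNet with attention and sinusoidal
# time embeddings; appending t/T as a channel is only for illustration.
class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x_t: torch.Tensor, t: int) -> torch.Tensor:
        t_chan = torch.full_like(x_t[:, :1], float(t) / T)  # t/T as a channel
        return self.net(torch.cat([x_t, t_chan], dim=1))

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)   # assumed optimizer/lr
dataset = torch.randn(128, 3, 32, 32)                 # stand-in training set

for step in range(10_000):                            # repeat ... until converged
    x0 = dataset[torch.randint(len(dataset), (1,))]   # sample an image
    t = int(torch.randint(1, T + 1, (1,)))            # sample a time step
    eps = torch.randn_like(x0)                        # sample a noise
    pred = model(noisy_sample(x0, t, eps), t)         # UNet_theta(x_t, t)
    loss = ((eps - pred) ** 2).sum()                  # single-sample loss
    opt.zero_grad()
    loss.backward()                                   # gradient w.r.t. theta
    opt.step()                                        # one step downhill
```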
At each step, we know exactly the noise value $\tilde{\boldsymbol{\epsilon}}$ that was mixed into the clean image $\tilde{\mathbf{x}}_0$ to produce $\tilde{\mathbf{x}}_{\tilde{t}}$, whether large or small. We move the UNet parameters $\theta$ a little bit in the direction of predicting $\tilde{\boldsymbol{\epsilon}}$. We are not making it predict exactly $\tilde{\boldsymbol{\epsilon}}$ at this step, since we take just one gradient step.
For example, suppose the previous training steps already gave us a model $\text{UNet}_{\theta}$ whose prediction is $\text{UNet}_{\theta}(\tilde{\mathbf{x}}_{\tilde{t}}, \tilde{t})=0.3$ (treating the output as a scalar for illustration). At this training step we know the true noise value $\tilde{\boldsymbol{\epsilon}}$ is 0.5, so we calculate the gradient and update $\theta$ so that the UNet predicts 0.31 instead of 0.3. Over many training iterations, it learns a function that predicts the expected noise mixed into the image at timestep $\tilde{t}$, as the toy example below illustrates.
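To see the "0.3 becomes 0.31" intuition in a toy one-parameter setting: one gradient step on the squared error nudges the prediction slightly toward the target 0.5 rather than jumping to it. The step size below is an assumption, chosen so the result lands at roughly 0.31.

```python
import torch

pred = torch.tensor(0.3, requires_grad=True)  # the model's current "prediction"
target = torch.tensor(0.5)                    # the true noise value

loss = (target - pred) ** 2                   # squared error, as in the objective
loss.backward()                               # d(loss)/d(pred) = -0.4
with torch.no_grad():
    pred -= 0.025 * pred.grad                 # small step downhill (lr assumed)
print(pred.item())                            # ~0.31: nudged toward 0.5, not set to it
```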
Sampling
After training, we have $\text{UNet}_{\theta^*}$ and can use it to generate new images: we start from a pure-noise image and iteratively denoise it with the UNet. This generation process is also called sampling, since at each step we sample from a posterior distribution $p_{\theta^*}(\mathbf{x}_{t-1} \mid \mathbf{x}_t = \tilde{\mathbf{x}}_t)$ (a code sketch follows the steps below).
- First, sample initial noise image $\tilde{\mathbf{x}}_T$ from the prior distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$
- Second, pass $\tilde{\mathbf{x}}_T$ and the time step $T$ through the deterministic network to get the predicted noise $\text{UNet}_{\theta^*}(\tilde{\mathbf{x}}_T, T)$;
then calculate mean $\boldsymbol{\mu}_{\theta^*}(\tilde{\mathbf{x}}_T, T)=\frac{1}{\sqrt{\alpha_T}}\left(\tilde{\mathbf{x}}_T-\frac{1-\alpha_T}{\sqrt{1-\bar{\alpha}_T}} \textcolor[RGB]{0,61,165}{\text{UNet}_{\theta^*}(\tilde{\mathbf{x}}_T, T)}\right)$,
which fully describes the estimated posterior Gaussian distribution
$\qquad p_{\theta^*}(\mathbf{x}_{T-1} \mid \mathbf{x}_T = \tilde{\mathbf{x}}_T) = \mathcal{N}(\mathbf{x}_{T-1}; \boldsymbol{\mu}_{\theta^*}(\tilde{\mathbf{x}}_T, T), \sigma_T^2\mathbf{I})$
- Third, sample an image from this posterior $p_{\theta^*}(\mathbf{x}_{T-1} \mid \mathbf{x}_T = \tilde{\mathbf{x}}_T)$.
Since this posterior is Gaussian, this is equivalent to computing
$\qquad \tilde{\mathbf{x}}_{T-1} = \boldsymbol{\mu}_{\theta^*}(\tilde{\mathbf{x}}_T, T) + \sigma_T \tilde{\mathbf{z}}$, where $\tilde{\mathbf{z}}$ is a sample from $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
Conceptually, this is "mean + randomness".
- Fourth, repeat the second and third steps, decreasing the time step from $T$ down to $1$, to obtain the final generated image $\tilde{\mathbf{x}}_0$. At the last step ($t=1$), no noise is added, i.e. $\tilde{\mathbf{z}} = \mathbf{0}$ [Ho et al. 2020].
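These steps also map directly to code. A minimal sketch, continuing the earlier snippets (using the trained `model` as $\text{UNet}_{\theta^*}$) and assuming the variance choice $\sigma_t^2 = \beta_t$ from Ho et al. [2020]:

```python
@torch.no_grad()
def sample(model, shape=(1, 3, 32, 32)):
    x = torch.randn(shape)                          # x_T ~ N(0, I), pure noise
    for t in range(T, 0, -1):                       # t = T, T-1, ..., 1
        eps_pred = model(x, t)                      # predicted noise
        a_t, ab_t = alphas[t - 1], alpha_bar[t - 1]
        # Posterior mean mu(x_t, t), computed from the predicted noise.
        mean = (x - (1.0 - a_t) / (1.0 - ab_t).sqrt() * eps_pred) / a_t.sqrt()
        if t > 1:
            sigma = betas[t - 1].sqrt()             # sigma_t^2 = beta_t (assumed)
            x = mean + sigma * torch.randn_like(x)  # "mean + randomness"
        else:
            x = mean                                # no noise at the final step
    return x

image = sample(model)                               # one generated image
```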
Mathematical derivations summary
This is a summary of the mathematical derivations of the ELBO that appear throughout the literature. Coming soon...