Wenxin Pan

Diffusion Model

March 8, 2026  ·  10 min read  ·  Wenxin Pan

Introduction

There are three perspectives from which to understand diffusion models [Lai et al. 2025]. The most classical one is the VAE perspective. The most general is the score-based generative modeling perspective through SDEs (stochastic differential equations). This blog is a summary from the VAE perspective.

Training

To train the UNet to predict the noise that corrupts the image, the true training objective is \[ \mathcal{L} = \mathop{\mathbb{E}}\limits_{t} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_0 } \ \mathop{\mathbb{E}}\limits_{\boldsymbol{\epsilon} } \left[ \| \boldsymbol{\epsilon} - \text{UNet}_\theta(\mathbf{x}_t, t) \|^2 \right] \] where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}$.
Notations $t$, $\mathbf{x}_0$, $\mathbf{x}_t$ and $\boldsymbol{\epsilon}$ are all random variables,
with distributions $t \sim \mathcal{U}(\{1,..., T\})$, $\mathbf{x}_0 \sim p_{\text{data}}$, and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ respectively.
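The closed-form corruption $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}$ is easy to implement directly. Below is a minimal sketch in NumPy; the linear beta schedule, the number of steps $T$, and the function name `forward_noise` are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

# Assumed linear beta schedule (endpoints and T are illustrative choices).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{\alpha}_t, stored at index t - 1

def forward_noise(x0, t, eps):
    """Closed-form forward corruption:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                 # toy stand-in for an image
eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
x_t = forward_noise(x0, t=500, eps=eps)
```

Note that $\mathbf{x}_t$ is reached in a single computation rather than by simulating $t$ individual noising steps.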

The expectations can only be calculated using finite samples drawn from the distributions.
By drawing many samples from these distributions, the training objective can be approximated (up to a constant normalization factor) by \[ \mathcal{L} = \sum_{\tilde{t}}\ \sum_{\tilde{\mathbf{x}}_0} \ \sum_{\tilde{\boldsymbol{\epsilon}}} \left[ \| \tilde{\boldsymbol{\epsilon}} - \text{UNet}_\theta(\tilde{\mathbf{x}}_{\tilde{t}}, \tilde{t}) \|^2 \right] \] where $\tilde{\mathbf{x}}_{\tilde{t}} = \sqrt{\bar{\alpha}_{\tilde{t}}} \tilde{\mathbf{x}}_0+\sqrt{1-\bar{\alpha}_{\tilde{t}}} \tilde{\boldsymbol{\epsilon}}$.
Notations $\tilde{t}$, $\tilde{\mathbf{x}}_0$, $\tilde{\mathbf{x}}_{\tilde{t}}$ and $\tilde{\boldsymbol{\epsilon}}$ are all samples (realizations),
drawn from their respective distributions.

Computing the loss over all samples is still too expensive. In practice, we adopt SGD (stochastic gradient descent). In each iteration, we pick an image $\tilde{\mathbf{x}}_0$ from the training set, a noise value $\tilde{\boldsymbol{\epsilon}}$, and a time value $\tilde{t}$, and calculate the loss for this single sample, \[ \mathcal{L} = \| \tilde{\boldsymbol{\epsilon}} - \text{UNet}_\theta(\tilde{\mathbf{x}}_{\tilde{t}}, \tilde{t}) \|^2 \] The UNet parameters $\theta$ are then updated by one gradient step downhill [Ho et al. 2020].

Algorithm 1 Training with SGD
  1. repeat
  2. Sample an image $\tilde{\mathbf{x}}_0$ from the training set $\mathcal{D}$
  3. Sample a time step $\tilde{t}$ uniformly from $\{1, \ldots, T\}$
  4. Sample a noise $\tilde{\boldsymbol{\epsilon}}$ from Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$
  5. Calculate the gradient $\nabla_\theta \left\| \tilde{\boldsymbol{\epsilon}} - \text{UNet}_\theta\left(\sqrt{\bar{\alpha}_{\tilde{t}}}\tilde{\mathbf{x}}_0 + \sqrt{1 - \bar{\alpha}_{\tilde{t}}}\tilde{\boldsymbol{\epsilon}}, \tilde{t}\right) \right\|^2$
  6. Take a gradient descent step, i.e. move $\theta$ against the gradient's direction
  7. until converged
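Algorithm 1 can be sketched in a few lines of NumPy. The elementwise-linear `unet` below is a deliberately trivial stand-in for a real UNet, and the dataset, learning rate, and step count are illustrative assumptions; only the loop structure mirrors the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

theta = np.zeros((4, 4))                    # toy "network" parameters
def unet(x_t, t, theta):
    return theta * x_t                      # elementwise linear stand-in, not a real UNet

dataset = [rng.standard_normal((4, 4)) for _ in range(16)]  # toy training set
lr = 1e-3
for step in range(200):                               # 1. repeat
    x0 = dataset[rng.integers(len(dataset))]          # 2. sample an image
    t = int(rng.integers(1, T + 1))                   # 3. sample a timestep
    eps = rng.standard_normal(x0.shape)               # 4. sample noise
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    pred = unet(x_t, t, theta)
    grad = -2.0 * (eps - pred) * x_t                  # 5. gradient of ||eps - pred||^2 w.r.t. theta
    theta -= lr * grad                                # 6. gradient descent step
```

With a real model, `theta -= lr * grad` would be handled by an autodiff framework and an optimizer such as Adam; the hand-written gradient here is only valid for this toy elementwise predictor.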

At each training step, we know exactly the noise value $\tilde{\boldsymbol{\epsilon}}$ used to corrupt $\tilde{\mathbf{x}}_0$ into $\tilde{\mathbf{x}}_{\tilde{t}}$ at time $\tilde{t}$, whether it is large or small. We nudge the UNet parameters $\theta$ in the direction of predicting $\tilde{\boldsymbol{\epsilon}}$ a little bit. We are not making it predict exactly $\tilde{\boldsymbol{\epsilon}}$ at this step, since we only move one gradient step forward.

For example, suppose that from the previous training steps we already have a model $\text{UNet}_{\theta}$ and it gives a prediction $\text{UNet}_{\theta}(\tilde{\mathbf{x}}_{\tilde{t}}, \tilde{t})=0.3$. At this training step we know the true noise value $\tilde{\boldsymbol{\epsilon}}$ is 0.5, so we calculate the gradient and update $\theta$ so that the UNet predicts 0.31 instead of 0.3. Over many training iterations, it learns a function that predicts the expected noise given the noisy image $\tilde{\mathbf{x}}_{\tilde{t}}$ and the timestep $\tilde{t}$.
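The arithmetic behind this one-step nudge can be checked directly. Treating the prediction itself as a free scalar (an idealization; in reality the update goes through $\theta$), one gradient step on $\mathcal{L} = (\tilde{\boldsymbol{\epsilon}} - \text{pred})^2$ with a learning rate of $0.025$ (chosen here purely so the numbers match) reproduces $0.3 \to 0.31$:

```python
# One gradient step on L = (eps - pred)^2, with pred treated as a free scalar.
pred, eps_true, lr = 0.3, 0.5, 0.025
grad = -2.0 * (eps_true - pred)   # dL/dpred = -2 (eps - pred) = -0.4
pred_new = pred - lr * grad       # one step downhill: 0.3 + 0.01 = 0.31
```

The step moves the prediction toward the target 0.5, but only by a small amount proportional to the learning rate.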

Sampling

After training, we have the $\text{UNet}_{\theta^*}$. We can use it to generate new images. We start from a pure noise image and iteratively denoise it using the UNet model. This generation process is also called sampling, since at each step we are sampling from a posterior distribution $p_{\theta^*}(\mathbf{x}_{t-1} \mid \mathbf{x}_t = \tilde{\mathbf{x}}_t)$.

Algorithm 2 Posterior sampling
  1. First, sample initial noise image $\tilde{\mathbf{x}}_T$ from the prior distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$
  2. Second, pass $\tilde{\mathbf{x}}_T$ and the time step $T$ through the deterministic network to get the predicted noise $\text{UNet}_{\theta^*}(\tilde{\mathbf{x}}_T, T)$;
    then calculate mean $\boldsymbol{\mu}_{\theta^*}(\tilde{\mathbf{x}}_T, T)=\frac{1}{\sqrt{\alpha_T}}\left(\tilde{\mathbf{x}}_T-\frac{1-\alpha_T}{\sqrt{1-\bar{\alpha}_T}} \textcolor[RGB]{0,61,165}{\text{UNet}_{\theta^*}(\tilde{\mathbf{x}}_T, T)}\right)$,
    which fully describes the estimated posterior Gaussian distribution
    $\qquad p_{\theta^*}(\mathbf{x}_{T-1} \mid \mathbf{x}_T = \tilde{\mathbf{x}}_T) = \mathcal{N}(\mathbf{x}_{T-1}; \boldsymbol{\mu}_{\theta^*}(\tilde{\mathbf{x}}_T, T), \sigma_T^2\mathbf{I})$
  3. Third, sample an image from this posterior $p_{\theta^*}(\mathbf{x}_{T-1} \mid \mathbf{x}_T = \tilde{\mathbf{x}}_T)$.
    Since this posterior is Gaussian, this process is equivalent to computing
    $\qquad \tilde{\mathbf{x}}_{T-1} = \boldsymbol{\mu}_{\theta^*}(\tilde{\mathbf{x}}_T, T) + \sigma_T \tilde{\mathbf{z}}$, where $\tilde{\mathbf{z}}$ is a sample from $\mathcal{N}(\mathbf{0}, \mathbf{I})$
    Conceptually, this is "mean + randomness"
  4. Repeat the second and third steps $T$ times, decrementing the time step from $T$ down to $1$, to get the final generated image $\tilde{\mathbf{x}}_0$. (At the last step, $t = 1$, no noise is added.)
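Algorithm 2 can be sketched as the loop below. The placeholder `unet` (which just returns zeros) stands in for a trained $\text{UNet}_{\theta^*}$, the beta schedule is an illustrative assumption, and $\sigma_t^2 = \beta_t$ is one common choice for the posterior variance; everything else follows the three steps above.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
sigma = np.sqrt(betas)               # one common choice: sigma_t^2 = beta_t

def unet(x_t, t):
    return np.zeros_like(x_t)        # placeholder for a trained UNet_{theta*}

x = rng.standard_normal((4, 4))      # step 1: x_T ~ N(0, I)
for t in range(T, 0, -1):
    a, abar = alphas[t - 1], alpha_bar[t - 1]
    # step 2: posterior mean mu_{theta*}(x_t, t)
    mean = (x - (1.0 - a) / np.sqrt(1.0 - abar) * unet(x, t)) / np.sqrt(a)
    # step 3: "mean + randomness"; no noise is added at the last step (t = 1)
    z = rng.standard_normal(x.shape) if t > 1 else np.zeros_like(x)
    x = mean + sigma[t - 1] * z
```

With a real trained network in place of the placeholder, the final `x` would be a generated image $\tilde{\mathbf{x}}_0$.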

Mathematical derivations summary

This will be a summary of the mathematical derivations of the ELBO that appears everywhere. Coming soon...