Wenxin Pan

Score-based Diffusion Models

 ·  · Wenxin Pan

Introduction

Score-based diffusion models overview
Score-based models from [Song et al. 2021].

Physics

Forward SDE: (stochastic) \[ \mathrm{d} \mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t) \mathrm{d} t + g(t) \mathrm{d} \mathbf{w}_t \] Reverse SDE: (stochastic) \[ \mathrm{d} \mathbf{x}_t = \left[ \mathbf{f}(\mathbf{x}_t, t) - g(t)^2 \textcolor{red}{\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)} \right] \mathrm{d} t + g(t) \mathrm{d} \mathbf{\bar{w}}_t \] The corresponding probability flow ODE: (deterministic) \[ \mathrm{d} \mathbf{x}_t = \left[ \mathbf{f}(\mathbf{x}_t, t) - \frac{1}{2}g(t)^2 \textcolor{red}{\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)} \right] \mathrm{d} t \]

Corresponding means the probability flow ODE is designed to share the same marginal distribution $\{p_t(\mathbf{x}_t)\}_{t=0}^T$ as the SDE. Notably, there is a factor $\frac{1}{2}$.

Notations

The forward process is known and designed. The generation process is based on the reverse SDE, but we don't know the score function in this reverse SDE. So we need to utilize the forward SDE to train a UNet $\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_t, t)$ to predict the score function at each diffusion step, such that we are able to use the reverse process to generate new images.

Training

To predict the score function at each diffusion step, with a UNet with parameters $\boldsymbol{\phi}$, the true training objective is \[ \mathop{\mathbb{E}}\limits_{t\sim U(0,T)} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_t \sim p_t(\mathbf{x}_t)} \left[ \| \mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_t, t) - \textcolor{red}{\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)} \|^2 \right] \] We are not able to train $\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_t, t)$ with this loss function since we don't know $p_t(\mathbf{x}_t)$ (and its score is what we want to estimate!). The magic is to use a conditional trick. Finding the optimal $\boldsymbol{\phi}^*$ is equivalent to minimizing the following loss, \[ \mathop{\mathbb{E}}\limits_{t\sim U(0,T)} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p_0(\mathbf{x}_0)} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_t \sim p_{0t}(\mathbf{x}_t | \mathbf{x}_0)} \left[ \| \mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_t, t) - \textcolor{Green}{\nabla_{\mathbf{x}_t} \log p_{0t}(\mathbf{x}_t | \mathbf{x}_0)} \|^2 \right] \] The two losses are not identical, but they give the same optimal $\boldsymbol{\phi}^*$.

When we design function $\mathbf{f}(\mathbf{x}_t, t)$ to be linear in $\mathbf{x}_t$ as $\mathbf{f}(\mathbf{x}_t, t) = f(t) \mathbf{x}_t$, we have $\textcolor{Green}{p_{0t}(\mathbf{x}_t | \mathbf{x}_0)}$ to be Gaussian with closed form, then the objective function becomes tractable.

In general, let this known Gaussian distribution be $\mathcal{N}(\mathbf{x}_t; \ \mu_t\mathbf{x}_0, \textcolor{Green}{\gamma_t^2}\mathbf{I})$, functions $\mu(t)$ and $\textcolor{Green}{\gamma(t)}$ can be derived from function $f(t)$ and $g(t)$. The score function can be calculated in closed-form as $\textcolor{Green}{\nabla_{\mathbf{x}_t} \log p_{0t} (\mathbf{x}_t | \mathbf{x}_0)} =-\frac{\mathbf{x}_t - \mu_t \mathbf{x}_0}{\textcolor{Green}{\gamma_t^2}}$. The training objective becomes $$ \begin{aligned} &\mathop{\mathbb{E}}\limits_{t\sim U(0,T)} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p_0(\mathbf{x}_0)} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_t \sim p_{0t}(\mathbf{x}_t | \mathbf{x}_0)} \left[ \left\| \mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_t, t) + \frac{\mathbf{x}_t - \mu_t \mathbf{x}_0}{\textcolor{Green}{\gamma_t^2}} \right\|^2 \right] \\ = &\mathop{\mathbb{E}}\limits_{t\sim U(0,T)} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p_0(\mathbf{x}_0)} \ \mathop{\mathbb{E}}\limits_{\boldsymbol{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[\left\|\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_t, t)+ \frac{\boldsymbol{\epsilon}}{\textcolor{Green}{\gamma_t}} \right\|_2^2\right] \end{aligned} $$ $$ \text{with}\ \mathbf{x}_t = \mu_t \mathbf{x}_0+\textcolor{Green}{\gamma_t} \cdot\boldsymbol{\epsilon} $$ If we choose the weighting $\lambda(t) = \textcolor{Green}{\gamma_t^2}$, the training objective becomes $$ \mathop{\mathbb{E}}\limits_{t\sim U(0,T)} \ \mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p_0(\mathbf{x}_0)} \ \mathop{\mathbb{E}}\limits_{\boldsymbol{\epsilon}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[\left\|\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_t, t) \cdot \textcolor{Green}{\gamma_t} + \boldsymbol{\epsilon} \right\|_2^2\right] $$ which has the same per-timestep optimum but changes the relative importance of different time steps (across noise levels).

Similar to DDPM, with SGD (stochastic gradient descent), the loss function in each iteration is, In each iteration, we pick an image $\tilde{\mathbf{x}}_0$ from the training set, a noise value $\tilde{\boldsymbol{\epsilon}}$, and a time value $\tilde{t}$. Then calculate the loss for this single sample, \[ \mathcal{L} = \left\| \mathbf{s}_{\boldsymbol{\phi}}(\tilde{\mathbf{x}}_{\tilde{t}}, \tilde{t})\cdot\textcolor{Green}{\gamma_{\tilde{t}}} + \tilde{\boldsymbol{\epsilon}} \right\|^2 \] The UNet parameters $\boldsymbol{\phi}$ are updated by one gradient step downhill [Ho et al. 2020].

Algorithm 1 Training with SGD
  1. repeat
  2. Sample an image $\tilde{\mathbf{x}}_0$ from the training set $\mathcal{D}$
  3. Sample a time step $\tilde{t}$ uniformly from $\{1, \ldots, T\}$
  4. Sample a noise $\tilde{\boldsymbol{\epsilon}}$ from Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$
  5. Calculate the gradient $\nabla_\phi \left\| \mathbf{s}_{\boldsymbol{\phi}}(\tilde{\mathbf{x}}_{\tilde{t}}, \tilde{t})\cdot\textcolor{Green}{\gamma_{\tilde{t}}} + \tilde{\boldsymbol{\epsilon}} \right\|^2$
  6. Take a step against the gradient's direction
  7. until converged

Key points of training:

Connection to DDPM and noise predictor

In DDPM we defined the discrete-time kernels $$ \mathbf{x}_t := \sqrt{1-\beta_t} \cdot \mathbf{x}_{t-1}+ \sqrt{\beta_t} \cdot \boldsymbol{\epsilon} $$ and we can derive $$ \textcolor{Green}{p_{0t}(\mathbf{x}_t | \mathbf{x}_0)} =\mathcal{N} \left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t}\cdot \mathbf{x}_0 , \ (1-\bar{\alpha}_t) \mathbf{I} \right) $$ In the continuous-time SDE formulation, DDPM is equivalent to a VP-SDE with $$ \mathrm{d} \mathbf{x}_t=-\frac{1}{2} \beta(t) \mathbf{x}_t \mathrm{d} t+\sqrt{\beta(t)} \mathrm{d} \mathbf{w}_t . $$ and we can derive $$ \textcolor{Green}{p_{0t}(\mathbf{x}_t | \mathbf{x}_0)} = \mathcal{N}\bigl(\mathbf{x}_t; \mu(t)\mathbf{x}_0, \textcolor{Green}{\gamma(t)^2} \mathbf{I}\bigr), \quad \text{with} \ \mu(t) = e^{-\frac{1}{2} \int_0^t \beta(s) \mathrm{d} s}, \quad \textcolor{Green}{\gamma(t)} = \sqrt{1-e^{-\int_0^t \beta(s) \mathrm{d} s}} $$ The discrete and continuous formulations are consistent through $$ \sqrt{\bar{\alpha}_t} \approx \mu(t), \quad \sqrt{1-\bar{\alpha}_t} \approx \gamma(t) $$ This follows from the $\log \left(1-\beta_s\right) \approx-\beta_s$ when $\beta_s$ is small.

Compare the training loss with DDPM we can observe that, a score predictor and a noise predictor are equivalent, only with a difference in a scaling factor as $$ \textcolor{red}{\mathbf{s}_{\phi^*}(\mathbf{x}_t, t)} = \textcolor{Green}{-\frac{1}{\sqrt{1-\bar{\alpha}_t}}} \boldsymbol{\epsilon}_{\mathbf{\theta}^*}\left(\mathbf{x}_t, t\right) = \textcolor{Green}{- \frac{1}{\gamma_t}} \boldsymbol{\epsilon}_{\mathbf{\theta}^*}\left(\mathbf{x}_t, t\right) $$ Therefore the choice of predictors is not a fundamental modeling difference but only a implementation difference and can be used (mostly) interchangeably.
Taking into account of the scaling factor, the above training loss for score predictor is equivalent to the training loss for noise predictor in DDPM.

Sampling

After training, we have the $\textcolor{red}{\mathbf{s}_{\boldsymbol{\phi}^*}(\mathbf{x}_t, t)}$. Thus, we have the following known reverse process and we can use it to generate new images. $$ \mathrm{d} \mathbf{x}_t=\left[\mathbf{f}(\mathbf{x}_t, t)-g(t)^2 \color{red}\mathbf{s}_{\boldsymbol{\phi}^*}(\mathbf{x}_t, t)\right] \mathrm{d} t+g(t) \mathrm{d} \overline{\mathbf{w}}_t $$ The most simple one is the Euler-Maruyama method.

Algorithm 2 Posterior sampling by Euler-Maruyama
  1. Initialize $t \leftarrow T$ and iterate until $t \approx 0$ ($t$ here acts as an iteration parameter, not a random variable or a realization)
  2. Let $\mathrm{d} \mathbf{x} = \mathbf{x}_{t+\Delta t}-\mathbf{x}_t$, $\mathrm{d}t= |\Delta t|$, $\mathrm{d} \mathbf{\bar{w}}_t = \sqrt{|\Delta t|} \cdot \boldsymbol{\epsilon}_t$ with sample $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Note that $\Delta t < 0$. Then $$ \mathbf{x}_{t+\Delta t} = \underbrace{ \mathbf{x}_t + \left[\mathbf{f}(\mathbf{x}_t, t) - g(t)^2 \color{red}\mathbf{s}_{\boldsymbol{\phi}^*}(\mathbf{x}_t, t)\right] \Delta t }_{\color{red}\text{mean}} + \underbrace{ g(t)\sqrt{|\Delta t|} \cdot \boldsymbol{\epsilon}_t }_{\color{NavyBlue}\text{randomness}} $$
  3. This discretization step is a tractable approximation of the posterior $p(\mathbf{x}_{t+\Delta t} \mid \mathbf{x}_{t})$ with $\Delta t < 0$. The process is equivalent to sampling from this distribution, which is Gaussian $p(\mathbf{x}_{t+\Delta t} \mid \mathbf{x}_{t}) =\mathcal{N}( \mathbf{x}_{t+\Delta t}\ ;\textcolor{red}{\text{mean}},\ g(t)^2 |\Delta t| \mathbf{I})$.
The last step does not include any noise but only return the mean.

For a VP-SDE like DDPM, the intial image is sampled from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ at $t=T$. For a typical-schedule VE-SDE with $\sigma_{\min }=\sigma_1<\sigma_2<\cdots<\sigma_N=\sigma_{\max }$, the initial image is sampled from $\mathcal{N}(\mathbf{0}, \sigma_{\max }^2 \mathbf{I})$ at $t=T$.

Inverse Problem Solving

With this score-based SDE framework, we are able to solve inverse problems with diffusion prior with only modification in the sampling process.

Assume that we know the forward physical model $\mathbf{y} = \mathcal{A}(\mathbf{x}_0) + \boldsymbol{\epsilon}$, where $\mathcal{A}(\cdot)$ is a known operator, $\mathbf{x}_0$ is the ground-truth clean image, and $\boldsymbol{\epsilon}$ is the noise. We want to solve the inverse problem by sampling from the posterior distribution $p(\mathbf{x}_0 \mid \mathbf{y})$.

Conditional Generation to Solving Inverse Problems

Conditional generation for inverse problems
Conditional generation

Inverse problem solving can be viewed as a special case of conditional generation, since in both case we are generating images based on some given conditions $\mathbf{y}$.

The simliest example might be, if an unconditional generative model can generate animals, then the conditional generative model can generate dogs controlled by the given label $\mathbf{y}=\text{"dog"}$.

To solve inverse problem, the reconstruction image is then generated by the pre-trained unconditional generative model being controlled by the given measurement $\mathbf{y}$.

Modifications of sampling process

Unconditional generation: $$ \mathrm{d} \mathbf{x}_t =\left[f(t) \mathbf{x}_t-g(t)^2 \textcolor{red}{\nabla_{\mathbf{x}_t} \log p_t\left(\mathbf{x}_t \right)}\right] \mathrm{d} t+g(t) \mathrm{d} \overline{\mathbf{w}}_t $$ Conditional generation (controllable generation):
with the key fact that $p_t\left(\mathbf{x}_t \mid \mathbf{y}\right) = \frac{p_t (\mathbf{x}_t) \cdot p_t(\mathbf{y} \mid \mathbf{x}_t)}{p_t (\mathbf{y})}$ and $\nabla_{\mathbf{x}_t} p_t (\mathbf{y}) =0$, $$ \begin{aligned} \mathrm{d} \mathbf{x}_t &=\left[f(t) \mathbf{x}_t-g(t)^2 \textcolor{red}{\nabla_{\mathbf{x}_t} \log p_t\left(\mathbf{x}_t \mid \mathbf{y}\right)}\right] \mathrm{d} t+g(t) \mathrm{d} \overline{\mathbf{w}}_t \\ &=\left[f(t) \mathbf{x}_t-g(t)^2 \textcolor{red}{\left[\nabla_{\mathbf{x}_t} \log p_t\left(\mathbf{x}_t \right) + \textcolor{magenta}{ \nabla_{\mathbf{x}_t} \log p_t\left(\mathbf{y} \mid \mathbf{x}_t \right)} \right]} \right] \mathrm{d} t+g(t) \mathrm{d} \overline{\mathbf{w}}_t , \ t \in[0,1]. \end{aligned} $$

In this conditional generation,

Algorithm 3 Posterior sampling in inverse problem solving by Euler-Maruyama
  1. Initialize $t \leftarrow T$, and iterate until $t \approx 0$.
  2. Let $\mathrm{d} \mathbf{x} = \mathbf{x}_{t+\Delta t}-\mathbf{x}_t$, $\mathrm{d}t= |\Delta t|$, $\mathrm{d} \mathbf{\overline{w}}_t = \sqrt{|\Delta t|} \cdot \boldsymbol{\epsilon}_t$ with sample $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Note that $\Delta t < 0$. Then $$ \mathbf{x}_{t+\Delta t} = \underbrace{ \mathbf{x}_t + \left[\mathbf{f}(\mathbf{x}_t, t) - g(t)^2 \left[\color{red}\mathbf{s}_{\boldsymbol{\phi}^*}(\mathbf{x}_t, t) \color{magenta} + \nabla_{\mathbf{x}_t} \log p_t(\mathbf{y} \mid \mathbf{x}_t) \right]\right] \Delta t }_{\color{magenta}\text{mean}} + \underbrace{ g(t)\sqrt{|\Delta t|} \cdot \boldsymbol{\epsilon}_t }_{\color{NavyBlue}\text{randomness}} $$
  3. This discretization step is a tractable approximation of the posterior $p(\mathbf{x}_{t+\Delta t} \mid \mathbf{x}_{t}, \textcolor{magenta}{\mathbf{y}})$ with $\Delta t < 0$. The process is equivalent to sampling from this distribution, which is Gaussian $p(\mathbf{x}_{t+\Delta t} \mid \mathbf{x}_{t}, \textcolor{magenta}{\mathbf{y}}) =\mathcal{N}( \mathbf{x}_{t+\Delta t}\ ;\textcolor{magenta}{\text{mean}},\ g(t)^2 |\Delta t| \mathbf{I})$.

Estimation of $\color{magenta}{\nabla_{\mathbf{x}_t} \log p_t\left(\mathbf{y} \mid \mathbf{x}_t \right)}$

The above sampling already assumes that we are able to estimate $\color{magenta}{\nabla_{\mathbf{x}_t} \log p_t\left(\mathbf{y} \mid \mathbf{x}_t \right)}$. There are various methods trying to approximate $p(\mathbf{y} \mid \mathbf{x}_t)$. One standardized method is DPS by [Chung et al. 2022].

Compute and approximate $p(\mathbf{y} \mid \mathbf{x}_t)$: $$ \begin{aligned} p(\mathbf{y} \mid \mathbf{x}_t) & =\int p(\mathbf{y} \mid \mathbf{x}_0, \mathbf{x}_t) \cdot p(\mathbf{x}_0 \mid \mathbf{x}_t) \mathrm{d} \mathbf{x}_0 \\ & =\int p(\mathbf{y} \mid \mathbf{x}_0) \cdot p(\mathbf{x}_0 \mid \mathbf{x}_t) \mathrm{d} \mathbf{x}_0 \\ & =\mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p(\mathbf{x}_0 \mid \mathbf{x}_t)}[p(\mathbf{y} \mid \mathbf{x}_0)] \\[-4pt] & \hspace{14em} \Big\downarrow \ \text{key approximation} \\[-4pt] & \simeq p(\mathbf{y} \mid \textcolor{blue}{\mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p(\mathbf{x}_0 \mid \mathbf{x}_t)}[\mathbf{x}_0 \mid \mathbf{x}_t]}) \\ & \textcolor{blue}{:=} p(\mathbf{y} \mid \textcolor{blue}{\hat{\mathbf{x}}_0(\mathbf{x}_t)}) \end{aligned} $$ where $\textcolor{blue}{\mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p(\mathbf{x}_0 \mid \mathbf{x}_t)}[\mathbf{x}_0 \mid \mathbf{x}_t]}$ is the posterior mean of $\mathbf{x}_0$ given $\mathbf{x}_t$ — the average over all plausible clean images consistent with the noisy observation $\mathbf{x}_t$ — and can also be written as $\textcolor{blue}{\mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p(\mathbf{x}_0 \mid \mathbf{x}_t)}[\mathbf{x}_0 ]}$.
The approximation replaces the expectation of $p(\mathbf{y} \mid \mathbf{x}_0)$ with $p(\mathbf{y} \mid \cdot)$ evaluated at the posterior mean, effectively swapping the expectation and the nonlinear function.

We denote this posterior mean as $\textcolor{blue}{\hat{\mathbf{x}}_0(\mathbf{x}_t)}$ and treat it as our estimate of the clean image given the current noisy state $\mathbf{x}_t$.

The Tweedie's formula theoretically gives \begin{aligned} \textcolor{blue}{\hat{\mathbf{x}}_0(\mathbf{x}_t) := \mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p(\mathbf{x}_0 \mid \mathbf{x}_t)}[\mathbf{x}_0 \mid \mathbf{x}_t]} = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t+(1-\bar{\alpha}_t) \textcolor{red}{\nabla_{\mathbf{x}_t} \log p_t\left(\mathbf{x}_t\right)}\right) \end{aligned} Then we are able to estimate it by the pre-trained unconditional score-predictor model $\textcolor{red}{\mathbf{s}_{\boldsymbol{\phi}}^* (\mathbf{x}_t, t)}$ or noise-predictor model $\boldsymbol{\epsilon}_{\theta^*}(\mathbf{x}_t, t)$ as $$ \begin{aligned} \textcolor{blue}{\hat{\mathbf{x}}_0(\mathbf{x}_t) :=\mathop{\mathbb{E}}\limits_{\mathbf{x}_0 \sim p(\mathbf{x}_0 \mid \mathbf{x}_t)}[\mathbf{x}_0 \mid \mathbf{x}_t]} & \simeq \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t\textcolor{Green}{+(1-\bar{\alpha}_t)} \textcolor{red}{\mathbf{s}_{\boldsymbol{\phi}}^* (\mathbf{x}_t, t)}\right) \\ & = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t\textcolor{Green}{-\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_{\theta^*}\left(\mathbf{x}_t, t\right)\right)\\ & = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t\textcolor{Green}{-\gamma_t} \boldsymbol{\epsilon}_{\theta^*}\left(\mathbf{x}_t, t\right)\right) \end{aligned} $$

Overall DPS Algorithm

Algorithm 4 DPS (Score-Prediction Form)
Require: \(T,\; \mathbf{y},\; \{\zeta_t\}_{t=1}^T,\; \{\textcolor{Orange}{\sigma_t}\}_{t=1}^T\)
  1. \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\)
  2. for \(t = T\) to \(1\) do
  3. \(\hat{\mathbf{s}} \leftarrow s_{\phi^*}(\mathbf{x}_t, t)\)
  4. \(\hat{\mathbf{x}}_0 \leftarrow \dfrac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t \textcolor{Green}{+(1-\bar{\alpha}_t)}\, \hat{\mathbf{s}} \right)\)
  5. \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\)
  6. \(\mathbf{x}_{t-1} \leftarrow \dfrac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \dfrac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{\mathbf{x}}_0 + \textcolor{Orange}{\sigma_t} \mathbf{z}\)
  7. \(\mathbf{x}_{t-1} \leftarrow \mathbf{x}_{t-1} - \zeta_t \nabla_{\mathbf{x}_t} \left\| \mathbf{y} - \mathcal{A}(\hat{\mathbf{x}}_0) \right\|_2^2\)
  8. end for
  9. return \(\hat{\mathbf{x}}_0\)
Algorithm 5 DPS (Noise-Prediction Form)
Require: \(T,\; \mathbf{y},\; \{\zeta_t\}_{t=1}^T,\; \{\textcolor{Orange}{\sigma_t}\}_{t=1}^T\)
  1. \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\)
  2. for \(t = T\) to \(1\) do
  3. \(\hat{\boldsymbol{\epsilon}} \leftarrow \epsilon_{\theta^*}(\mathbf{x}_t, t)\)
  4. \(\hat{\mathbf{x}}_0 \leftarrow \dfrac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t \textcolor{Green}{- \sqrt{1-\bar{\alpha}_t}}\,\hat{\boldsymbol{\epsilon}}\right)\)
  5. \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\)
  6. \(\mathbf{x}_{t-1} \leftarrow \tfrac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \tfrac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\boldsymbol{\epsilon}} \right) + \textcolor{Orange}{\sigma_t} \mathbf{z}\)  OR  \(\mathbf{x}_{t-1} \leftarrow \tfrac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \tfrac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{\mathbf{x}}_0 + \textcolor{Orange}{\sigma_t} \mathbf{z}\)
  7. \(\mathbf{x}_{t-1} \leftarrow \mathbf{x}_{t-1} - \zeta_t \nabla_{\mathbf{x}_t} \left\| \mathbf{y} - \mathcal{A}(\hat{\mathbf{x}}_0) \right\|_2^2\)
  8. end for
  9. return \(\hat{\mathbf{x}}_0\)

The two algorithms are equivalent, with the only difference of using a noise-predictor or a score-predictor.
In the noise-prediction form, in line 6, the two updates are equivalent, and can be verified by substituting the $\hat{\mathbf{x}}_0$ in line 4 into the update in line 6.

In the noise-prediction form, line 4 and 7 are used for inverse problem solving, the rest are identical to unconditional sampling DDPM.

P.S. The original implementation of DPS is using noise-predictor with update $\mathbf{x}_{t-1} \leftarrow \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{\mathbf{x}}_0 + \textcolor{Orange}{\sigma_t} \mathbf{z}$

In the final step (line 9) we can return $\hat{\mathbf{x}}_0$ or $\mathbf{x}_0$. When $t=1$, $\bar{\alpha}_0=1$, then the posterior mean (reverse mean) is exactly $\boldsymbol{\mu}_{\boldsymbol{\theta}^*}(\mathbf{x}_1,1)=\hat{\mathbf{x}}_0(\mathbf{x}_1)$.