Diffusion Models: Study Notes
I. Probabilistic Modeling and Mathematical Derivation
1. Forward Diffusion Process
Markov chain assumption
Diffusion models assume the data $x_0$ is gradually corrupted into pure noise $x_T$ by a Markov chain that adds Gaussian noise at each step:
$$q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1})$$
where the single-step transition probability is:
$$q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$
Cumulative noise coefficient (key derivation)
Introducing $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the state $x_t$ at any timestep can be written directly in terms of $x_0$:
$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Derivation:
- Recursive expansion: starting from the data $x_0$, unroll the forward process step by step (each step applies $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_{t-1}$):
$$\begin{aligned} x_1 &= \sqrt{\alpha_1}\,x_0 + \sqrt{\beta_1}\,\epsilon_0, \\ x_2 &= \sqrt{\alpha_2}\,x_1 + \sqrt{\beta_2}\,\epsilon_1 = \sqrt{\alpha_2\alpha_1}\,x_0 + \sqrt{\alpha_2\beta_1}\,\epsilon_0 + \sqrt{\beta_2}\,\epsilon_1, \\ x_3 &= \sqrt{\alpha_3}\,x_2 + \sqrt{\beta_3}\,\epsilon_2 = \sqrt{\alpha_3\alpha_2\alpha_1}\,x_0 + \sqrt{\alpha_3\alpha_2\beta_1}\,\epsilon_0 + \sqrt{\alpha_3\beta_2}\,\epsilon_1 + \sqrt{\beta_3}\,\epsilon_2. \end{aligned}$$
- By induction, the expression at any timestep $t$ is:
$$x_t = \sqrt{\prod_{s=1}^{t}\alpha_s}\;x_0 + \sum_{k=0}^{t-1}\sqrt{\beta_{t-k}\prod_{m=1}^{k}\alpha_{t-m+1}}\;\epsilon_k.$$
- Merge the variance terms using the additivity of independent Gaussians: if $\epsilon_1,\dots,\epsilon_t \sim \mathcal{N}(0,I)$ are mutually independent, then:
$$\sum_{k=1}^{t} a_k\epsilon_k \sim \mathcal{N}\left(0,\ \Big(\sum_{k=1}^{t} a_k^2\Big) I\right)$$
- Applying this property, the squared coefficients telescope to $1 - \prod_{s=1}^{t}\alpha_s$ (using $\beta_s = 1 - \alpha_s$), so:
$$\sum_{k=0}^{t-1}\sqrt{\beta_{t-k}\prod_{m=1}^{k}\alpha_{t-m+1}}\;\epsilon_k \sim \mathcal{N}\left(0,\ \Big[1-\prod_{s=1}^{t}\alpha_s\Big] I\right)$$
- Final closed form:
$$q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\big)$$
where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
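As a sanity check, this closed form can be verified numerically against the step-by-step chain. A minimal sketch, assuming a linear beta schedule (the schedule and all constants here are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # alpha_bar_t

x0 = torch.randn(100_000)  # toy 1-D "data"
t = 500

# Step-by-step chain: x_s = sqrt(alpha_s) * x_{s-1} + sqrt(beta_s) * eps
x = x0.clone()
for s in range(t):
    x = torch.sqrt(alphas[s]) * x + torch.sqrt(betas[s]) * torch.randn_like(x)

# Closed form: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
ab = alphas_cumprod[t - 1]
x_closed = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * torch.randn_like(x0)

# The two marginals should match in mean and variance (about 0 and 1 here)
print(x.mean().item(), x.var().item())
print(x_closed.mean().item(), x_closed.var().item())
```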
2. Reverse Denoising Process
- Constructing the variational lower bound (ELBO)
The goal is to maximize the log-likelihood $\log p_\theta(x_0)$. Introducing the variational distribution $q(x_{1:T}|x_0)$ and applying Jensen's inequality:
$$\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \text{ELBO}$$
The joint distribution factorizes as:
$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)$$
Expanding in detail (splitting off the $t=1$ term):
$$\begin{aligned}\text{ELBO} &= \mathbb{E}_q\left[\log p(x_T) + \sum_{t=1}^{T}\log\frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right] \\ &= \mathbb{E}_q\left[\log p(x_T) + \sum_{t=2}^{T}\log\frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log\frac{p_\theta(x_0|x_1)}{q(x_1|x_0)}\right]\end{aligned}$$
- Converting to KL divergence terms
By the Markov property, $q(x_t|x_{t-1}) = q(x_t|x_{t-1},x_0)$; rewriting each factor with Bayes' rule as $q(x_t|x_{t-1},x_0) = \frac{q(x_{t-1}|x_t,x_0)\,q(x_t|x_0)}{q(x_{t-1}|x_0)}$ and telescoping the marginal ratios across the sum gives:
$$\text{ELBO} = \mathbb{E}_q\left[\log p(x_T) - \log q(x_T|x_0) + \sum_{t=2}^{T}\log\frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} + \log p_\theta(x_0|x_1)\right]$$
Analysis of the key terms:
• $D_{KL}(q(x_T|x_0)\,\|\,p(x_T))$: a constant with no learnable parameters, so it does not affect optimization
• $\sum_{t=2}^{T}\mathbb{E}_{q(x_t|x_0)}\big[D_{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t))\big]$: the main optimization target
• $\mathbb{E}_{q(x_1|x_0)}\big[-\log p_\theta(x_0|x_1)\big]$: the final reconstruction term
- Deriving the true posterior
By Bayes' theorem:
$$q(x_{t-1}|x_t,x_0) = \frac{q(x_t|x_{t-1},x_0)\,q(x_{t-1}|x_0)}{q(x_t|x_0)}$$
Substituting the Gaussian expressions and completing the square yields:
$$q(x_{t-1}|x_t,x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t,x_0),\ \tilde{\beta}_t I\big)$$
where:
$$\tilde{\mu}_t = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$
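Both quantities depend only on the noise schedule, so they can be precomputed. A small sketch under the same assumed linear schedule; the assert checks the consistency identity that $\tilde{\mu}_t = \sqrt{\bar{\alpha}_{t-1}}\,x_0$ when $x_t = \sqrt{\bar{\alpha}_t}\,x_0$ carries no noise:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
alphas_cumprod_prev = torch.cat([torch.ones(1), alphas_cumprod[:-1]])

# mu_tilde_t = coef_xt * x_t + coef_x0 * x_0
coef_xt = torch.sqrt(alphas) * (1 - alphas_cumprod_prev) / (1 - alphas_cumprod)
coef_x0 = torch.sqrt(alphas_cumprod_prev) * betas / (1 - alphas_cumprod)

# Posterior variance beta_tilde_t (zero at t=1, since alpha_bar_0 = 1)
beta_tilde = (1 - alphas_cumprod_prev) / (1 - alphas_cumprod) * betas

# Noise-free consistency check
assert torch.allclose(
    coef_xt * torch.sqrt(alphas_cumprod) + coef_x0,
    torch.sqrt(alphas_cumprod_prev),
    atol=1e-5,
)
```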
3. Parameterization and Loss Simplification
- Mean parameterization trick
Parameterize the model mean $\mu_\theta$ as:
$$\mu_\theta(x_t,t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t,t)\right)$$
Derivation:
Solving the closed-form forward process for $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t\big)$ and substituting into the true posterior mean gives:
$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right)$$
Replacing $\epsilon_t$ with the neural network's prediction $\epsilon_\theta(x_t,t)$ yields $\mu_\theta$.
- Computing the KL divergence
For the two Gaussians $q(x_{t-1}|x_t,x_0) = \mathcal{N}(\tilde{\mu}_t, \tilde{\beta}_t I)$ and $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \sigma_t^2 I)$, the KL divergence is:
$$D_{KL} = \frac{1}{2\sigma_t^2}\,\|\tilde{\mu}_t - \mu_\theta\|^2 + \frac{1}{2}\left(\frac{\tilde{\beta}_t}{\sigma_t^2} - 1 - \ln\frac{\tilde{\beta}_t}{\sigma_t^2}\right)$$
Simplifying assumption:
• Fix the variance to $\sigma_t^2 = \tilde{\beta}_t$; the variance term vanishes and the KL reduces to:
$$D_{KL} = \frac{1}{2\sigma_t^2}\,\|\tilde{\mu}_t - \mu_\theta\|^2$$
- Final form of the loss
Substituting the parameterized mean into the KL term:
$$\begin{aligned}\|\tilde{\mu}_t - \mu_\theta\|^2 &= \left\|\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\,(\epsilon_t - \epsilon_\theta)\right\|^2 \\ &= \frac{\beta_t^2}{\alpha_t(1-\bar{\alpha}_t)}\,\|\epsilon_t - \epsilon_\theta\|^2\end{aligned}$$
Weighting (the simplified loss):
The resulting per-timestep weights vary widely across $t$; experiments in the DDPM paper showed that simply dropping them improves sample quality (see the sketch after the loss below):
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,x_0,\epsilon}\,\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2$$
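To see how uneven the dropped weights are, the coefficient $\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t(1-\bar{\alpha}_t)}$ (with $\sigma_t^2=\tilde{\beta}_t$) can be evaluated over a schedule; a sketch under the same assumed linear schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
alphas_cumprod_prev = torch.cat([torch.ones(1), alphas_cumprod[:-1]])
beta_tilde = (1 - alphas_cumprod_prev) / (1 - alphas_cumprod) * betas

# The VLB weight that L_simple replaces with 1 (undefined at t=1, where beta_tilde=0)
w = betas**2 / (2 * beta_tilde * alphas * (1 - alphas_cumprod))
print(w[1].item(), w[T // 2].item(), w[-1].item())  # early steps weigh far more
```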
4. Sampling Derivation (DDPM)
- Reverse iteration formula
Starting from $p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t,t),\ \sigma_t^2 I)$:
$$x_{t-1} = \mu_\theta(x_t,t) + \sigma_t z, \qquad z \sim \mathcal{N}(0,I)$$
Substituting the parameterized mean:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\right) + \sigma_t z$$
- Variance choices
With $\sigma_t^2 = \beta_t$:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\right) + \sqrt{\beta_t}\,z$$
With $\sigma_t^2 = \tilde{\beta}_t$:
$$x_{t-1} = \tilde{\mu}_t + \sqrt{\tilde{\beta}_t}\,z$$
- DDIM sampling formula
DDIM introduces a non-Markovian forward process (whose marginals match DDPM's):
$$q_\sigma(x_{t-1}|x_t,x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\,x_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\frac{x_t-\sqrt{\bar{\alpha}_t}\,x_0}{\sqrt{1-\bar{\alpha}_t}},\ \sigma_t^2 I\right)$$
Deterministic sampling (with $\sigma_t = 0$):
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\underbrace{\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta}{\sqrt{\bar{\alpha}_t}}\right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta$$
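A sketch of the corresponding deterministic sampler over a strided timestep subsequence, which is what enables few-step sampling; `model` and `alphas_cumprod` are assumed to match the snippets in Section II:

```python
import torch

@torch.no_grad()
def ddim_sample(model, alphas_cumprod, num_steps=50, shape=(1, 3, 64, 64)):
    T = alphas_cumprod.shape[0]
    # Strided subsequence, e.g. 1000 training steps -> 50 model evaluations
    timesteps = torch.linspace(T - 1, 0, num_steps).long()
    x = torch.randn(shape)
    for i, t in enumerate(timesteps):
        ab_t = alphas_cumprod[t]
        ab_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, torch.full((shape[0],), int(t), dtype=torch.long))
        x0_pred = (x - torch.sqrt(1 - ab_t) * eps) / torch.sqrt(ab_t)      # predicted x_0
        x = torch.sqrt(ab_prev) * x0_pred + torch.sqrt(1 - ab_prev) * eps  # sigma_t = 0
    return x.clamp(-1, 1)
```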
II. Network Architecture and Implementation Details
All code below is simplified for clarity.
1. U-Net Architecture
Core components
```python
class DenoiseNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Timestep embedding
        self.t_embed = nn.Sequential(
            SinusoidalPositionEmbeddings(dim),
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )
        # Downsampling path
        self.down = nn.ModuleList([
            ResBlock(3, dim),           Downsample(dim),
            ResBlock(dim, dim * 2),     Downsample(dim * 2),
            ResBlock(dim * 2, dim * 4), Downsample(dim * 4),
        ])
        # Bottleneck
        self.mid = ResBlock(dim * 4, dim * 4)
        # Upsampling path (ResBlock inputs are doubled by skip concatenation)
        self.up = nn.ModuleList([
            Upsample(dim * 4), ResBlock(dim * 8, dim * 2),
            Upsample(dim * 2), ResBlock(dim * 4, dim),
            Upsample(dim),     ResBlock(dim * 2, 3),
        ])

    def forward(self, x, t):
        t = self.t_embed(t)  # embed the timestep once
        skips = []
        for layer in self.down:
            if isinstance(layer, ResBlock):
                x = layer(x, t)
                skips.append(x)  # save features for skip connections
            else:
                x = layer(x)     # Downsample takes no time embedding
        x = self.mid(x, t)
        for layer in self.up:
            if isinstance(layer, ResBlock):
                x = torch.cat([x, skips.pop()], dim=1)  # skip connection
                x = layer(x, t)
            else:
                x = layer(x)     # Upsample takes no time embedding
        return x
```
Key technical points
- Timestep embedding: a sinusoidal position encoding maps each discrete timestep to a continuous vector (a quick shape check follows the code below)
```python
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=t.device) * -emb)
        emb = t[:, None] * emb[None, :]
        return torch.cat([emb.sin(), emb.cos()], dim=-1)
```
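Assuming the `torch` and `math` imports used above, a quick shape check:

```python
emb = SinusoidalPositionEmbeddings(64)
t = torch.randint(0, 1000, (8,))  # a batch of 8 timesteps
print(emb(t).shape)               # torch.Size([8, 64])
```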
- Residual block design: each residual block projects the time embedding and adds it to its feature maps
```python
class ResBlock(nn.Module):
    def __init__(self, in_c, out_c):
        super().__init__()
        # Project the time embedding (dim=64, matching DenoiseNet) to out_c channels
        self.mlp = nn.Sequential(nn.Linear(64, out_c), nn.GELU())
        self.conv = nn.Sequential(
            nn.Conv2d(in_c, out_c, 3, padding=1),
            nn.GroupNorm(8, out_c),
            nn.GELU(),
            nn.Conv2d(out_c, out_c, 3, padding=1),
            nn.GroupNorm(8, out_c),
        )
        # 1x1 conv so the residual also works when in_c != out_c
        self.skip = nn.Conv2d(in_c, out_c, 1) if in_c != out_c else nn.Identity()

    def forward(self, x, t):
        h = self.conv(x)
        h = h + self.mlp(t)[:, :, None, None]  # inject time embedding
        return h + self.skip(x)                # residual connection
```
2. Training and Sampling Algorithms
Training loop
```python
def train_step(batch):
    optimizer.zero_grad()
    # Sample random timesteps
    t = torch.randint(0, T, (batch.size(0),))
    # Forward noising via the closed form
    alpha_bar = alphas_cumprod[t][:, None, None, None]
    noise = torch.randn_like(batch)
    noisy = torch.sqrt(alpha_bar) * batch + torch.sqrt(1 - alpha_bar) * noise
    # Predict the noise and regress against it
    pred = model(noisy, t)
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    return loss.item()
```
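The training and sampling snippets reference module-level names (`T`, `betas`, `alphas_cumprod`, `alphas_cumprod_prev`, `model`, `optimizer`) that are not defined here. A minimal setup sketch, assuming the linear DDPM schedule and an illustrative learning rate:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t
alphas_cumprod_prev = torch.cat([torch.ones(1), alphas_cumprod[:-1]])

model = DenoiseNet(dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # illustrative lr
```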
Sampling loop (DDPM)
```python
@torch.no_grad()
def sample(steps=1000):
    x = torch.randn(1, 3, 64, 64)  # start from pure noise
    for t in reversed(range(steps)):
        ts = torch.full((1,), t, dtype=torch.long)
        pred_noise = model(x, ts)
        # Coefficients
        alpha = 1 - betas[t]
        alpha_bar_prev = alphas_cumprod_prev[t]
        # Posterior-mean update
        x = (1 / torch.sqrt(alpha)) * (
            x - (betas[t] / torch.sqrt(1 - alphas_cumprod[t])) * pred_noise
        )
        if t > 0:
            # Add noise with variance beta_tilde_t
            noise = torch.randn_like(x)
            x += torch.sqrt(
                (1 - alpha_bar_prev) * betas[t] / (1 - alphas_cumprod[t])
            ) * noise
    return x.clamp(-1, 1)
```
III. Loss Functions and Optimization
1. Theoretical Loss
The full variational bound:
$$\mathcal{L}_{\text{VLB}} = \underbrace{D_{KL}(q(x_T|x_0)\,\|\,p(x_T))}_{\text{constant}} + \sum_{t=2}^{T}\mathbb{E}_{q(x_t|x_0)}\big[D_{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t))\big] + \mathbb{E}_{q(x_1|x_0)}\big[-\log p_\theta(x_0|x_1)\big]$$
2. Simplified Loss in Practice
In practice, a denoising score-matching objective is used:
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,x_0,\epsilon}\left[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\right]$$
• Loss weighting: the original DDPM paper recommends no weighting; later work such as Progressive Distillation adjusts it
IV. Extensions and Recent Improvements
1. DDIM (Denoising Diffusion Implicit Models)
Core improvements
• Introduces a non-Markovian forward process that admits faster, deterministic sampling
• The generative process satisfies (with $f_\theta(x_t,t)$ denoting the predicted $x_0$):
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,f_\theta(x_t,t) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t,t) + \sigma_t\epsilon$$
Advantages
• Sampling can be reduced to 50-100 steps (vs. 1000 originally)
• Efficiency improves while sample quality is largely preserved
2. Stable Diffusion
Key innovations
- Latent-space diffusion: runs diffusion in a VAE's latent space, greatly reducing compute
- Cross-attention: conditions image generation on text
```python
class CrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(768, 2 * dim)  # text-encoder output dimension is 768
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, context):
        Q = self.q(x)
        K, V = self.kv(context).chunk(2, dim=-1)
        attn = (Q @ K.transpose(-2, -1)) * (1.0 / math.sqrt(Q.size(-1)))
        attn = F.softmax(attn, dim=-1)
        return self.proj(attn @ V)
```
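A hypothetical shape walk-through; the 77-token, 768-dimensional context mimics a CLIP text-encoder output, and all sizes are illustrative:

```python
import torch

attn = CrossAttention(dim=320)
x = torch.randn(2, 4096, 320)      # flattened 64x64 latent feature map
context = torch.randn(2, 77, 768)  # 77 text tokens, 768-dim embeddings
print(attn(x, context).shape)      # torch.Size([2, 4096, 320])
```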
3. Conditional Diffusion Models
Implementation approaches
• Condition the model directly on $y$: $p_\theta(x_{t-1}|x_t, y)$
• Classifier guidance: correct the noise prediction at sampling time
$$\hat{\epsilon}_\theta(x_t,t) = \epsilon_\theta(x_t,t) - \sqrt{1-\bar{\alpha}_t}\;\gamma\,\nabla_{x_t}\log p_\phi(y|x_t)$$
where $\gamma$ is the guidance scale.
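A sketch of this correction inside one sampling step; `classifier` (a noise-aware classifier approximating $p_\phi(y|x_t)$) and `guidance_scale` ($\gamma$) are assumptions:

```python
import torch
import torch.nn.functional as F

def guided_eps(model, classifier, x, t, y, alphas_cumprod, guidance_scale=1.0):
    # grad_{x_t} log p_phi(y | x_t) via backprop through the classifier
    with torch.enable_grad():
        x_in = x.detach().requires_grad_(True)
        log_probs = F.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    # Shift the noise prediction against the classifier gradient
    eps = model(x, t)
    scale = torch.sqrt(1 - alphas_cumprod[t])[:, None, None, None]
    return eps - scale * guidance_scale * grad
```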
Conclusion
Diffusion models pair a forward corruption process with a learned reverse reconstruction, yielding a powerful generative framework:
- Theoretical strengths: likelihood-based training encourages mode coverage, avoiding the mode collapse that plagues GANs
- Implementation traits: the U-Net is well suited to pixel-level prediction, and time embeddings let all denoising steps share one set of parameters
- Extensions: coupling with language models (DALL·E 2) and cross-modal generation (Stable Diffusion) demonstrate strong potential
Comparing VAEs and diffusion models:

| Property | VAE | Diffusion model |
|---|---|---|
| Sample quality | Often blurry | High fidelity |
| Training stability | Easy to train | Needs careful tuning |
| Theoretical grounding | ELBO optimization | Rigorous score-matching derivation |
| Sampling speed | Single forward pass | Iterative (10-1000 steps) |
Reference implementations:
• PyTorch: https://github.com/lucidrains/denoising-diffusion-pytorch
• Official Stable Diffusion: https://github.com/CompVis/stable-diffusion