Denoising diffusion probabilistic models (DDPM)
DDPM: https://arxiv.org/abs/2006.11239
Input: noised image + step index t
Model: noise predictor
Output: predicted noise -> subtract it from the input to denoise
- text encoder: text -> vector
- generation model (diffusion): denoising U-Net
- decoder: intermediate result -> final image in pixel space
parallel training
Midjourney: during the generation process, it shows intermediate results decoded from the partially denoised images
- text encoding: GPT / BERT
- criteria: CLIP score/ FID-10k
FID (Fréchet Inception Distance): the standard metric. Images are fed to a pre-trained CNN classification model to get representations; FID is the distance between the representations of the generated images and the representations of the real images, assuming both follow Gaussian distributions (i.e. the distance between two distributions).
Limitation: needs a large number of generated images.
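A minimal sketch of the distance computation described above, assuming the representations are already available as `N x D` arrays. The function name `frechet_distance` and the random stand-in features are illustrative assumptions; a real FID uses features from a pre-trained Inception network, which is not reproduced here.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets (N x D arrays)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g (real and non-negative for PSD inputs).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

# Stand-in features (assumption): random vectors instead of CNN representations.
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))
shifted = real + 2.0  # same covariance, mean moved by 2 in every dimension
print(frechet_distance(real, real), frechet_distance(real, shifted))
```

Because the means and covariances are estimated from samples, the estimate is noisy for small `N`, which is one way to read the "large number of generated images" limitation.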
CLIP: uses an additional image encoder model.
CLIP score: the similarity between the vector of the encoded text and the vector of the encoded generated image.
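A hedged sketch of the similarity step only, taking cosine similarity as the similarity measure (an assumption; the notes say only "similarity"). The embeddings here are made-up vectors; in practice they would come from CLIP's text and image encoders, which are not called in this example.

```python
import numpy as np

def cosine_similarity(text_emb, image_emb):
    """Cosine similarity between one text embedding and one image embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_emb / np.linalg.norm(image_emb)
    return float(t @ v)

# Hypothetical embeddings standing in for encoder outputs.
text_emb = np.array([1.0, 2.0, 0.5])
image_emb = np.array([0.9, 2.1, 0.4])
score = cosine_similarity(text_emb, image_emb)
```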
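Decoder feature: it can be trained without knowing the correspondence between images and text.
Intermediate result, two options:
- compressed image: downsample the images -> train the decoder to restore the originals
- latent representation: train an auto-encoder and use its decoder
- input: H*W*3, latent: h*w*c (c can exceed the 3 channels that can be viewed as an image)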
input: noised image + text
output: intermediate
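text (additional): condition (can be ignored during inference)
The noising process is instead applied to the intermediate representation, which is produced by the encoder part of the auto-encoder.
Train a noise predictor.
Denoising: initialized with noise sampled from a normal distribution.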
Loss function during training:
2. x_0 -> clean image
4. sample the target noise
the larger t is, the larger the proportion of noise added
the predicted noise is compared with the target noise sampled in step 4
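The training steps above can be sketched as follows. This is a sketch under stated assumptions: the linear beta schedule (1e-4 to 0.02, T = 1000) follows the DDPM paper, the names `training_example` and `noise_prediction_loss` are illustrative, and the loss is a mean squared error, which differs from the paper's squared norm only by a constant factor. No network is included; `eps_pred` stands for whatever the noise predictor outputs.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise schedule (linear, as in DDPM)
alpha_bars = np.cumprod(1.0 - betas)    # cumulative product, decreasing in t

def training_example(x0, rng):
    """Build one (x_t, t, eps) training triple from a clean image x0."""
    t = int(rng.integers(1, T + 1))           # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)       # target noise (step 4)
    a_bar = alpha_bars[t - 1]
    # Noise is mixed in one shot; larger t means a larger share of noise.
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and target noise."""
    return float(np.mean((eps_pred - eps) ** 2))
```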
Difference from the step-by-step picture:
noising and denoising step by step vs. DDPM training: the noise is added at once and predicted at once
Why?
Oddity in image generation: remove the predicted noise, then add fresh noise afterward (the plus sign in the sampling step)
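A sketch of one reverse step showing where the fresh noise enters, following the sampling algorithm in the DDPM paper with the choice sigma_t^2 = beta_t (one of the options the paper discusses). The function name `sampling_step` is an assumption, and `eps_pred` is a placeholder for the noise predictor's output.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def sampling_step(x_t, t, eps_pred, rng):
    """One reverse step x_t -> x_{t-1}, for t in 1..T."""
    beta_t = betas[t - 1]
    a_bar = alpha_bars[t - 1]
    # Remove the (scaled) predicted noise from the current sample.
    mean = (x_t - beta_t / np.sqrt(1.0 - a_bar) * eps_pred) / np.sqrt(1.0 - beta_t)
    # Add fresh Gaussian noise back, except at the final step t = 1.
    z = rng.standard_normal(x_t.shape) if t > 1 else np.zeros_like(x_t)
    return mean + np.sqrt(beta_t) * z
```

With this form, two runs from the same `x_t` diverge at every step except the last, so the sampler is stochastic rather than a deterministic cleanup.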
Goal: map the generated distribution to the real-world data distribution.
Q: how to measure the similarity of the two distributions?
A: maximum likelihood estimation (MLE),
computed on samples from the real data.
This is the objective of all image generation models.
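KL divergence: measures how much two distributions differ.
Definition: $D_{KL}(P \| Q) = \int p(x) \log \left(\frac{p(x)}{q(x)}\right) dx$
Property: asymmetric, i.e. $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general.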
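A small numerical illustration of the asymmetry, using the discrete analogue of the integral (a sum over outcomes). The function name `kl_divergence` and the two example distributions are assumptions chosen for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q); terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)   # D_KL(P || Q)
reverse = kl_divergence(q, p)   # D_KL(Q || P), a different value
```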
q(z|x): distribution of z (mostly Gaussian) given the data x (x -> image to imitate)
Maximize a lower bound of $\log P(x)$:
VAE: $\mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$
DDPM (x -> $x_0$, z -> $x_1:x_T$): $\mathbb{E}_{q(x_1:x_T|x_0)}\left[\log \frac{p(x_0:x_T)}{q(x_1:x_T|x_0)}\right]$
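For reference, a short derivation of why the VAE expression is a lower bound, via Jensen's inequality. The true posterior $p(z|x)$ is a symbol not used in the notes above and is introduced here only to name the gap.

```latex
\begin{aligned}
\log p(x) &= \log \int p(x,z)\,dz
           = \log \mathbb{E}_{q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right] \\
          &\ge \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] \\
\log p(x) - \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]
          &= D_{KL}\big(q(z|x) \,\|\, p(z|x)\big) \ge 0
\end{aligned}
```

The DDPM bound has the same form with $x$ replaced by $x_0$ and $z$ replaced by $x_1:x_T$.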
course:C5 000
Prior approaches have focused on dataset filtering [30], post-generation filtering [29], or inference guiding [38].
- post-hoc removal or guidance: using a classifier after training, or adding guidance to the inference process [38] (SOTA guidance-based approach)
- image cloaking: adding adversarial perturbations
[38] Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. arXiv preprint arXiv:2211.05105, 2022.
GAN -> diffusion model: a new subject is represented by a token trained using only a handful of images
previous assumption: unintentional memorization; undesired knowledge is identifiable in a set of training data points
ours: erase a high-level visual concept
set-like composition?
energy-based models (EBM)
"A and not B": modeled as the difference between the log probability densities for A and B
[10], [11], [37], [38]
score-based composition
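A toy sketch of the "A and not B" idea, assuming A and B are 1-D Gaussians so that the log densities and their gradients (scores) have closed forms. All function names and the example parameters are illustrative assumptions; real systems would use learned models and typically a weighting factor, and the composed density need not be normalizable in general.

```python
import numpy as np

def gaussian_log_density(x, mean, std):
    """Log density of a 1-D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def gaussian_score(x, mean, std):
    """Score: gradient of the log density with respect to x."""
    return -(x - mean) / std ** 2

def a_not_b_log_density(x, a, b):
    """Unnormalized log density of 'A and not B': log p_A(x) - log p_B(x)."""
    return gaussian_log_density(x, *a) - gaussian_log_density(x, *b)

def a_not_b_score(x, a, b):
    """The same composition at the score level: score_A(x) - score_B(x)."""
    return gaussian_score(x, *a) - gaussian_score(x, *b)

a = (0.0, 1.0)   # (mean, std) for concept A
b = (2.0, 3.0)   # (mean, std) for concept B
composed = a_not_b_score(0.5, a, b)
```

Because differentiation is linear, subtracting log densities and subtracting scores describe the same composition, which is why it can be carried out directly on the outputs of score-based models.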
reference:
(source/Kimi.jpg) future: EBM, stable, practical