CS 180: Computer Vision & Computational Photography, Fall 2024

Project 5: Diffusion Models

Calvin Vu

Overview

In this assignment, we will explore diffusion models, implement diffusion sampling loops, and use them for other tasks such as inpainting and creating optical illusions. We will also train our own diffusion model on the MNIST dataset.

Part A: The Power of Diffusion Models!

Part A - Section 0: Setup

First, we need to install our dependencies, load the models, set a random seed, and test the model with the provided sampling loops. We are using seed 180.



3 Prompts - 20 Inference Steps


3 Prompts - 40 Inference Steps

Part A - Section 1: Sampling Loops

Starting with a clean image x0, we can iteratively add noise, obtaining progressively noisier versions xt, until we're left with essentially pure noise at timestep t = T. When t = 0 we have a clean image, and for larger t the image contains more noise. A diffusion model tries to reverse this process by denoising. Given a noisy xt and the timestep t, the model predicts the noise in the image. With the predicted noise, we can either remove all of it at once, obtaining an estimate of x0, or remove just a portion of it, obtaining an estimate of xt-1 with slightly less noise.

To generate images with the diffusion model (sampling), we start with pure noise xT at timestep T, sampled from a Gaussian distribution. We then predict and remove part of the noise, giving us xT-1, and repeat this process until we arrive at a clean image x0.

For stage 1, we will use a test image, resized down to 64x64 and converted to a (1, 3, 64, 64) tensor.



Test Image

Part A - Section 1.1: Implementing the Forward Process

The forward process can be defined by this equation: \[ x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, 1) \] That is, given a clean image x0, we get a noisy image xt at timestep t by sampling from a Gaussian with mean $$\sqrt{\bar{\alpha}_t}x_0$$ and variance $$(1 - \bar{\alpha}_t).$$ Note that the forward process is not just adding noise -- we also scale the image.
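As a minimal sketch, assuming alphas_cumprod is the schedule tensor of cumulative products ᾱ (indexed by timestep) pulled from the stage-1 scheduler, the forward process might look like:

import torch

def forward(im, t):
    # Noise a clean image to timestep t: scale it by sqrt(alpha_bar_t)
    # and add Gaussian noise scaled by sqrt(1 - alpha_bar_t).
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    return torch.sqrt(alpha_bar) * im + torch.sqrt(1 - alpha_bar) * eps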

Part A - Section 1.2: Classical Denoising

We will try to remove noise from the image using Gaussian blur filtering, but this mostly fails: blurring smooths the noise into the image rather than recovering the lost detail.

Part A - Section 1.3: Implementing One Step Denoising

Instead of blurring, we'll use a pretrained diffusion model to denoise. The actual denoiser can be found at stage_1.unet. Given the model's noise estimate, we can recover an estimate of the clean image by inverting the forward process: \[ \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_{\text{estimate}}}{\sqrt{\bar{\alpha}_t}} \]
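A hedged sketch of this estimate, assuming the UNet follows the diffusers UNet2DConditionModel interface (its output stacks noise and predicted-variance channels, so we keep the first three) and that prompt_embeds holds precomputed text embeddings:

with torch.no_grad():
    out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
noise_est = out[:, :3]  # first three channels are the noise estimate
alpha_bar = alphas_cumprod[t]
x0_est = (x_t - torch.sqrt(1 - alpha_bar) * noise_est) / torch.sqrt(alpha_bar)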



Part A - Section 1.4: Implementing Iterative Denoising

We saw that the denoising UNet does a much better job of projecting the image onto the natural image manifold, but it does get worse as you add more noise. This makes sense, as the problem is much harder with more noise! Diffusion models, however, are designed to denoise iteratively.

We can speed up this iterative process by skipping steps. To skip steps, we create a list of timesteps called strided_timesteps, which is much shorter than the full list of 1000 timesteps. strided_timesteps[0] corresponds to the noisiest image (and thus the largest t) and strided_timesteps[-1] corresponds to a clean image (t = 0). One simple way of constructing this list is with a regular stride (e.g., a stride of 30 works well).

On the ith denoising step we are at t = strided_timesteps[i], and want to get to t' = strided_timesteps[i+1] (from more noisy to less noisy). To actually do this, we have the following formula (a code sketch of this update follows the list): \[ x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma \]

  • xt is your image at timestep t
  • xt' is your noisy image at timestep t', where t' < t (less noisy)
  • ᾱt is defined by alphas_cumprod
  • αt = ᾱt / ᾱt'
  • βt = 1 − αt
  • x0 is our current estimate of the clean image (computed as in Section 1.3)
  • vσ is random noise, which in the case of DeepFloyd is also predicted
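Putting this together, a minimal sketch of one update step, reusing the global alphas_cumprod from the forward-process sketch; x0_est is the clean-image estimate from Section 1.3 and v_sigma is the variance term supplied separately:

def denoise_step(x_t, x0_est, t, t_prime, v_sigma):
    # Move from the noisier timestep t to the cleaner timestep t' < t.
    a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    a_t = a_bar_t / a_bar_tp  # alpha_t = alpha_bar_t / alpha_bar_t'
    b_t = 1 - a_t             # beta_t
    return (torch.sqrt(a_bar_tp) * b_t / (1 - a_bar_t)) * x0_est \
         + (torch.sqrt(a_t) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t \
         + v_sigma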


Part A - Section 1.5: Diffusion Model Sampling

We can use the iterative_denoise function to generate images from scratch by setting i_start = 0 and passing in random noise. This effectively denoises pure noise.



    Sample 1
    Sample 2
    Sample 3
    Sample 4
    Sample 5

    Part A - Section 1.6: Classifier Free Guidance

You may have noticed that some of the generated images in the prior section are not very good. We can improve the output with Classifier-Free Guidance (CFG). We define a new noise estimate \[ \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) \] where εu is the unconditional noise estimate, εc is the conditional noise estimate, and γ controls the strength of CFG. Notice that for γ = 0 we get the unconditional noise estimate, and for γ = 1 we get the conditional noise estimate. The magic happens when γ > 1.
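In code, assuming eps_u and eps_c are the noise estimates from two UNet passes (one with a null prompt, one with the conditioning prompt), the CFG estimate is a one-liner:

gamma = 7  # CFG strength; values > 1 amplify the conditional signal
eps = eps_u + gamma * (eps_c - eps_u)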



    Sample 1
    Sample 2
    Sample 3
    Sample 4
    Sample 5

    Part A - Section 1.7: Image-to-image Translation

    In Section 1.4, we take a real image, add noise to it, and then denoise. This effectively allows us to make edits to existing images. The more noise we add, the larger the edit will be. This works because in order to denoise an image, the diffusion model must to some extent "hallucinate" new things -- the model has to be "creative." Another way to think about it is that the denoising process "forces" a noisy image back onto the manifold of natural images. Here, we're going to take the original test image, noise it a little, and force it back onto the image manifold without any conditioning. Effectively, we're going to get an image that is similar to the test image (with a low-enough noise level).



    campanile
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20


    car
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20


    statue
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20

    Part A - Section 1.7.1: Editing Hand-Drawn and Web Images



    avocado
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20


    face
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20


    sword
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20


    dog
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20

    Part A - Section 1.7.2: Inpainting

Given an image xorig and a binary mask m, we can create a new image that has the same content where m is 0, but new content wherever m is 1. To do this, we run the diffusion denoising loop, but at every step, after obtaining xt, we "force" xt to have the same pixels as xorig where m is 0: \[ x_t \leftarrow m \, x_t + (1 - m)\, \text{forward}(x_{orig}, t) \]
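A sketch of this per-step constraint, assuming mask is a binary tensor broadcastable against x_t (1 where we want new content) and forward is the noising function from Section 1.1:

# After each denoising step, re-impose the known pixels at noise level t.
x_t = mask * x_t + (1 - mask) * forward(x_orig, t)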



    inpainted
    image
    mask


    inpainted
    image
    mask


    inpainted
    image
    mask

    Part A - Section 1.7.3: Text-Conditional Image-to-image Translation

    We will do the same thing as Section 1.7, but guide the projection with a text prompt. We will use the prompt "a rocket ship".



    campanile
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20


    car
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20


    statue
    i_start = 1
    i_start = 3
    i_start = 5
    i_start = 7
    i_start = 10
    i_start = 20

    Part A - Section 1.8: Visual Anagrams

We can create optical illusions with diffusion models, where the image looks like one thing right side up but another upside down. At each step, we denoise the image xt with the first prompt to get one noise estimate. At the same time, we flip the image upside down, denoise it with the second prompt, and flip the resulting estimate back to right side up. We then average the two estimates and use the result for the reverse diffusion step: \[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \] \[ \epsilon = (\epsilon_1 + \epsilon_2) / 2 \]
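A sketch of the combined estimate, where unet_eps is a hypothetical helper that returns the (CFG) noise estimate for a given prompt embedding:

eps1 = unet_eps(x_t, t, p1)  # estimate for the right-side-up prompt
# Flip the image vertically, estimate noise for the second prompt, flip back.
eps2 = torch.flip(unet_eps(torch.flip(x_t, dims=[2]), t, p2), dims=[2])
eps = (eps1 + eps2) / 2      # average the two estimates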



    (old man + people around campfire) anagram
    anagram flipped




(man wearing hat + snowy mountain village) anagram
    anagram flipped




    (skull + dog) anagram


    anagram flipped

    Part A - Section 1.9: Hybrid Images

To create hybrid images with a diffusion model, we can use a similar technique as in Section 1.8. We create a composite noise estimate by estimating the noise with two different text prompts, then combining the low frequencies of one estimate with the high frequencies of the other. The algorithm is: \[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{UNet}(x_t, t, p_2) \] \[ \epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \]
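A sketch using a Gaussian blur as the low-pass filter and its complement as the high-pass (kernel size 33 and σ = 2 are assumptions here); unet_eps is the same hypothetical helper as above:

import torchvision.transforms.functional as TF

eps1 = unet_eps(x_t, t, p1)  # prompt for the low-frequency structure
eps2 = unet_eps(x_t, t, p2)  # prompt for the high-frequency detail
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2)
eps = low + (eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2))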



    skull + waterfall
dog + Amalfi coast
    rocket ship + pencil




    Part B: Diffusion Models from Scratch!

    Part B - Section 1: Training a Single-Step Denoising UNet

Given a noisy image z, we aim to train a denoiser Dθ that maps z to the clean image x. To do so, we optimize an L2 loss: \[ L = \mathbb{E}_{z,x} \| D_\theta(z) - x \|^2 \]

    Part B - Section 1.1: Implementing the UNet



    UNet Operations

Our UNet is composed of various simple and composed operations. We can implement these operations' forward passes as building blocks for our net; a PyTorch sketch of a few of these blocks follows the list.

  • Conv - A 2D convolution with stride 1, followed by 2D batch normalization and a GELU activation
  • DownConv - A 2D convolution with stride 2 (downsampling), followed by 2D batch normalization and a GELU activation
  • UpConv - A 2D transposed convolution with stride 2 and input padding (upsampling), followed by 2D batch normalization and a GELU activation
  • Flatten - 2D average pooling with kernel size 7
  • Unflatten - A 2D transposed convolution with kernel size 7 and stride 1
  • ConvBlock - Two 'Conv' operations applied in sequence
  • DownBlock - A 'DownConv' operation followed by a 'ConvBlock'
  • UpBlock - An 'UpConv' operation followed by a 'ConvBlock'
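A minimal sketch of a few of these blocks, assuming 3x3 kernels with padding 1 for Conv and DownConv (the kernel sizes aren't pinned down above):

import torch.nn as nn

class Conv(nn.Module):
    # Conv2d (stride 1) -> BatchNorm2d -> GELU; spatial size is preserved.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class ConvBlock(nn.Module):
    # Two Conv operations back to back.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(Conv(in_ch, out_ch), Conv(out_ch, out_ch))

    def forward(self, x):
        return self.net(x)

class DownBlock(nn.Module):
    # DownConv (stride-2 Conv2d + BN + GELU) followed by a ConvBlock.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
        self.conv = ConvBlock(out_ch, out_ch)

    def forward(self, x):
        return self.conv(self.down(x))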


Unconditional

For the unconditional UNet, the order of operations will be \[DownBlock \rightarrow DownBlock \rightarrow Flatten \rightarrow Unflatten \rightarrow UpBlock \rightarrow UpBlock \rightarrow Conv, \] where DownBlock and UpBlock each contain a ConvBlock as described above.

    Part B - Section 1.2: Using the UNet to Train a Denoiser

To train our denoiser, we need to generate training pairs (z, x), where x is a clean MNIST digit and z is its noised version. For each training batch, we generate z from x using the following noising process: \[ z = x + \sigma\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I). \]
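In code, generating a pair for one batch is a single line:

import torch

sigma = 0.5
z = x + sigma * torch.randn_like(x)  # x: a clean MNIST batch; z: its noisy version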



    Noising Process

    Part B - Section 1.2.1: Training

Let's load our MNIST dataset and initialize our model, optimizer, loss function, and hyperparameters:

  • Batch Size is 256
  • 5 Epochs
  • Learning Rate is 1e-4
  • Sigma is 0.5
  • 128 Hidden Dimensions

For each epoch, we go through the training dataset, add noise to each clean digit, predict the denoised output with the model, then compute the loss between the prediction and the clean image.
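A minimal sketch of this loop, where model is a UNet built with the hyperparameters above (model, train_loader, and device are placeholder names):

import torch

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

for epoch in range(5):
    for x, _ in train_loader:              # MNIST labels are unused here
        x = x.to(device)
        z = x + 0.5 * torch.randn_like(x)  # noise the clean digits (sigma = 0.5)
        loss = loss_fn(model(z), x)        # L2 between prediction and clean image
        opt.zero_grad()
        loss.backward()
        opt.step()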


    After 1 Epoch

    After 5 Epochs

    Training Loss Over Training Steps

    Part B - Section 1.2.2: Out-of-Distribution Testing

The model was trained with σ = 0.5. Let's see the results with varying sigma values.


    Out-of-Distribution Sigma Values

    Part B - Section 2: Training a Diffusion Model

We now train a UNet εθ to predict the noise added to a clean image, optimizing: \[ L = \mathbb{E}_{\epsilon, x_0, t} \| \epsilon_\theta(x_t, t) - \epsilon \|^2. \]

We have a new loss function, but it also introduces a new variable: the timestep t. This means we need to inject this value into our UNet.

    Part B - Section 2.1: Adding Time Conditioning to UNet


    Time Conditioning

We need to add an FCBlock to the Unflatten and UpBlock stages, where FCBlock is defined as: \[Linear(GELU(Linear(tensor)))\]
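A sketch of FCBlock as an nn.Module; reshaping the output so it broadcasts over the spatial dimensions of the feature map it conditions is an assumption here:

import torch.nn as nn

class FCBlock(nn.Module):
    # Linear -> GELU -> Linear, used to embed the (normalized) timestep.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        # t has shape (B, in_dim); return (B, out_dim, 1, 1) so it
        # broadcasts against a (B, out_dim, H, W) feature map.
        return self.net(t).unsqueeze(-1).unsqueeze(-1)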

    Part B - Section 2.2: Training the UNet

To train the model, we load our optimizer, scheduler, and model again. We use an Adam optimizer with an exponential learning-rate decay scheduler. In the forward process, we randomly sample a timestep t and add noise to the input image. We look up the ᾱ value from the DDPM noise schedule, then compute xt as: \[ x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon. \] We then get the predicted noise from the UNet given xt and t, and run an MSE loss between the predicted noise and the noise ε we actually added.
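One hedged training step, assuming T = 300, a precomputed alphas_cumprod tensor indexed by timestep, and that t is normalized to [0, 1] before entering the FCBlock:

T = 300
t = torch.randint(1, T + 1, (x0.shape[0],), device=device)  # random timesteps
a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
eps = torch.randn_like(x0)
x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps  # forward process
t_norm = (t.float() / T).view(-1, 1)
loss = loss_fn(model(x_t, t_norm), eps)  # MSE against the noise we added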


    Training Losses

    Part B - Section 2.3: Sampling from the UNet

To sample from our model, we start from pure noise and iterate t from T = 300 down to 1. At each step, we look up the schedule values for t and draw a noise tensor z ~ N(0, I), setting z to zero on the final step (t = 1) so no extra noise is added. We then get the predicted noise ε from the UNet given xt and t, and update: \[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon \right) + \sqrt{1 - \alpha_t}\, z \]
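A sketch of this loop, with alphas and alphas_cumprod as schedule tensors indexed by timestep and n the number of samples:

x = torch.randn(n, 1, 28, 28, device=device)  # start from pure noise
for t in range(T, 0, -1):
    z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
    t_norm = torch.full((n, 1), t / T, device=device)
    eps = model(x, t_norm)                    # predicted noise at step t
    a, a_bar = alphas[t], alphas_cumprod[t]
    x = (x - (1 - a) / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a) \
        + torch.sqrt(1 - a) * z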


    5 Epochs

    20 Epochs

    Part B - Section 2.4: Adding Class-Conditioning to UNet

Here, we add two more FCBlocks to our UNet. We have a class-conditioning vector c, which we convert to a one-hot vector before passing it into the FCBlock. As with time conditioning, we add the new FCBlocks to the Unflatten and UpBlock. To still allow unconditional generation (needed for classifier-free guidance at sampling time), we occasionally drop the class vector to zero during training.


    Training Losses

    Part B - Section 2.5: Sampling from the Class-Conditioned UNet

For sampling, we include classifier-free guidance with a gamma value of 5.0: \[ \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) \] where εu is the unconditional predicted noise and εc is the conditional predicted noise. Then, at each step, we compute as before: \[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon \right) + \sqrt{1 - \alpha_t}\, z \]
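A sketch of the guided estimate, where the zero vector plays the role of the "no class" condition (matching the class dropout used in training):

import torch.nn.functional as F

gamma = 5.0
c_onehot = F.one_hot(c, num_classes=10).float()       # c: integer class labels
eps_c = model(x, t_norm, c_onehot)                    # conditional estimate
eps_u = model(x, t_norm, torch.zeros_like(c_onehot))  # unconditional estimate
eps = eps_u + gamma * (eps_c - eps_u)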


    5 Epochs

    20 Epochs