This article is aimed at those who want to understand exactly how diffusion models work, with no prior knowledge expected. I’ve tried to use illustrations wherever possible to provide visual intuitions on each part of these models. I’ve kept mathematical notation and equations to a minimum, and where they are necessary I’ve tried to define and explain them as they occur.
Intro
I’ve framed this article around three key questions:
- What exactly is it that diffusion models learn?
- How and why do diffusion models work?
- Once you’ve trained a model, how do you get useful stuff out of it?
The examples will be based on the glyffuser, a minimal text-to-image diffusion model that I previously implemented and wrote about. The architecture of this model is a standard text-to-image denoising diffusion model without any bells or whistles. It was trained to generate pictures of new “Chinese” glyphs from English definitions. Have a look at the picture below — even if you’re not familiar with Chinese writing, I hope you’ll agree that the generated glyphs look quite similar to the real ones!
What exactly is it that diffusion models learn?
Generative AI models are often said to take a huge pile of data and “learn” it. For text-to-image diffusion models, the data takes the form of pairs of images and descriptive text. But what exactly is it that we want the model to learn? First, let’s forget about the text for a moment and concentrate on what we are trying to generate: the images.
Probability distributions
Broadly, we can say that we want a generative AI model to learn the underlying probability distribution of the data. What does this mean? Consider the one-dimensional normal (Gaussian) distribution below, commonly written 𝒩(μ,σ²) and parameterized with mean μ = 0 and variance σ² = 1. The black curve below shows the probability density function. We can sample from it: drawing values such that over a large number of samples, the set of values reflects the underlying distribution. These days, we can simply write something like x = random.gauss(0, 1) in Python to sample from the standard normal distribution, although the computational sampling process itself is non-trivial!

We could think of a set of numbers sampled from the above normal distribution as a simple dataset, like that shown as the orange histogram above. In this particular case, we can calculate the parameters of the underlying distribution using maximum likelihood estimation, i.e. by determining the mean and variance. The normal distribution estimated from the samples is shown by the dotted line above. To take some liberties with terminology, you might consider this as a simple example of “learning” an underlying probability distribution. We can also say that here we explicitly learnt the distribution, in contrast with the implicit methods that diffusion models use.
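To make this concrete, here is a minimal sketch of that explicit “learning” in plain Python: we draw samples as above, then recover the distribution’s parameters via maximum likelihood estimation (for a Gaussian, simply the sample mean and variance).

```python
import random

# Draw a simple 1-D dataset from the standard normal distribution.
samples = [random.gauss(0, 1) for _ in range(10_000)]

# Maximum likelihood estimates for a Gaussian are just the sample
# mean and the (biased) sample variance.
mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)

print(f"estimated mean ~ {mean:.3f}, estimated variance ~ {variance:.3f}")
```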
Conceptually, this is all that generative AI is doing — learning a distribution, then sampling from that distribution!
Data representations
What, then, does the underlying probability distribution of a more complex dataset look like, such as that of the image dataset we want to use to train our diffusion model?
First, we need to know what the representation of the data is. Generally, a machine learning (ML) model requires data inputs with a consistent representation, i.e. format. For the example above, it was simply numbers (scalars). For images, this representation is commonly a fixed-length vector.
The image dataset used for the glyffuser model is ~21,000 pictures of Chinese glyphs. The images are all the same size, 128 × 128 = 16384 pixels, and greyscale (single-channel color). Thus an obvious choice for the representation is a vector x of length 16384, where each element corresponds to the color of one pixel: x = (x₁,x₂,…,x₁₆₃₈₄). We can call the domain of all possible images for our dataset “pixel space”.
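As a sketch (assuming the glyphs are stored as ordinary greyscale image files; the file path here is hypothetical), converting one image into this vector representation might look like:

```python
import numpy as np
from PIL import Image

# Load a 128 x 128 greyscale glyph image (hypothetical file path).
img = Image.open("glyph.png").convert("L")

# Flatten the 2-D pixel grid into a single vector in "pixel space",
# scaled to [0, 1]. Its length is 128 * 128 = 16384.
x = np.asarray(img, dtype=np.float32).flatten() / 255.0
print(x.shape)  # (16384,)
```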

Dataset visualization
We make the assumption that our individual data samples, x, are actually sampled from an underlying probability distribution, q(x), in pixel space, much as the samples from our first example were sampled from an underlying normal distribution in 1-dimensional space. Note: the notation x ∼ q(x) is commonly used to mean: “the random variable x sampled from the probability distribution q(x).”
This distribution is clearly much more complex than a Gaussian and cannot be easily parameterized — we need to learn it with an ML model, which we’ll discuss later. First, let’s try to visualize the distribution to gain a better intuition.
As humans find it difficult to see in more than 3 dimensions, we need to reduce the dimensionality of our data. A small digression on why this works: the manifold hypothesis posits that natural datasets lie on lower-dimensional manifolds embedded in a higher-dimensional space — think of a line embedded in a 2-D plane, or a plane embedded in 3-D space. We can use a dimensionality reduction technique such as UMAP to project our dataset from 16384 to 2 dimensions. The 2-D projection retains a lot of structure, consistent with the idea that our data lie on a lower-dimensional manifold embedded in pixel space. In our UMAP, we see two large clusters corresponding to characters in which the components are arranged either horizontally (e.g. 明) or vertically (e.g. 草). An interactive version of the plot below with popups on each datapoint is linked here.

Let’s now use this low-dimensional UMAP dataset as a visual shorthand for our high-dimensional dataset. Remember, we assume that these individual points have been sampled from a continuous underlying probability distribution q(x). To get a sense of what this distribution might look like, we can apply a KDE (kernel density estimation) over the UMAP dataset. (Note: this is just an approximation for visualization purposes.)
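A minimal sketch of this visualization pipeline, assuming the glyph images are stored as a NumPy array of flattened vectors and that the umap-learn and scipy packages are available (the file name is illustrative):

```python
import numpy as np
import umap
from scipy.stats import gaussian_kde

# data: (n_images, 16384) array of flattened glyph images (assumed).
data = np.load("glyphs.npy")

# Project from pixel space down to 2-D for visualization.
embedding = umap.UMAP(n_components=2).fit_transform(data)

# Kernel density estimate over the 2-D projection -- a rough
# visual stand-in for the underlying distribution q(x).
kde = gaussian_kde(embedding.T)
density_at_points = kde(embedding.T)
```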

This gives a sense of what q(x) should look like: clusters of glyphs correspond to high-probability regions of the distribution. The true q(x) lies in 16384 dimensions — this is the distribution we want to learn with our diffusion model.
We showed that for a simple distribution such as the 1-D Gaussian, we could calculate the parameters (mean and variance) from our data. However, for complex distributions such as images, we need to call on ML methods. Moreover, what we will find is that diffusion models in practice, rather than parameterizing the distribution directly, learn it implicitly through the process of learning how to transform noise into data over many steps.
Takeaway
The aim of generative AI such as diffusion models is to learn the complex probability distributions underlying their training data and then sample from these distributions.
How and why do diffusion models work?
Diffusion models have recently come into the spotlight as a particularly effective method for learning these probability distributions. They generate convincing images by starting from pure noise and gradually refining it. To whet your interest, have a look at the animation below, which shows the denoising process generating 16 samples.

In this section we’ll only talk about the mechanics of how these models work, but if you’re interested in how they arose from the broader context of generative models, have a look at the further reading section below.
What is “noise”?
Let’s first precisely define noise, since the term is thrown around a lot in the context of diffusion. In particular, we’re talking about Gaussian noise: consider the samples we talked about in the section on probability distributions. You can think of each sample as an image of a single pixel of noise. An image that is “pure Gaussian noise”, then, is one in which each pixel value is sampled from an independent standard Gaussian distribution, 𝒩(0,1). For a pure noise image in the domain of our glyph dataset, this would be noise drawn from 16384 separate Gaussian distributions. You can see this in the previous animation. One thing to keep in mind is that we can choose the means of these noise distributions, i.e. center them, on particular values — the pixel values of an image, for instance.
For convenience, you’ll often find the noise distributions for image datasets written as a single multivariate distribution 𝒩(0,I) where I is the identity matrix, a covariance matrix with all diagonal entries equal to 1 and zeroes elsewhere. This is simply a compact notation for a set of multiple independent Gaussians — i.e. there are no correlations between the noise on different pixels. In the basic implementations of diffusion models, only uncorrelated (a.k.a. “isotropic”) noise is used. This article contains an excellent interactive introduction to multivariate Gaussians.
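In code, sampling such an isotropic noise image for our dataset’s pixel space is a one-liner (a small NumPy sketch):

```python
import numpy as np

# Each of the 16384 pixels is an independent draw from N(0, 1) --
# equivalent to one sample from the multivariate Gaussian N(0, I).
noise_image = np.random.standard_normal(16384)

# We can also center the noise on an existing image's pixel values,
# i.e. a draw from N(x, I) for some image vector x.
x = np.zeros(16384)  # placeholder for a real image vector
noised = x + np.random.standard_normal(16384)
```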
Diffusion process overview
Below is an adaptation of the somewhat-famous diagram from Ho et al.’s seminal paper “Denoising Diffusion Probabilistic Models”, which gives an overview of the whole diffusion process:

I found that there was a lot to unpack in this diagram and that simply understanding what each component meant was very helpful, so let’s go through it and define everything step by step.
We previously used x ∼ q(x) to refer to our data. Here, we’ve added a subscript, xₜ, to denote timestep t, indicating how many steps of “noising” have taken place. We refer to the samples noised to a given timestep as xₜ ∼ q(xₜ). x₀ is clean data and xₜ (t = T) ∼ 𝒩(0,I) is pure noise.
We define a forward diffusion process whereby we corrupt samples with noise. This process is described by the distribution q(xₜ|xₜ₋₁). If we could access the hypothetical reverse process q(xₜ₋₁|xₜ), we could generate samples from noise. Since we cannot access it directly because we would need to know x₀, we use ML to learn the parameters, θ, of a model of this process, pθ(xₜ₋₁∣xₜ). (That should be p subscript θ but Medium cannot render it.)
In the following sections we go into detail on how the forward and reverse diffusion processes work.
Forward diffusion, or “noising”
Used as a verb, “noising” an image refers to applying a transformation that moves it towards pure noise by scaling down its pixel values towards 0 while adding proportional Gaussian noise. Mathematically, this transformation is a multivariate Gaussian distribution centered on the pixel values of the preceding image.
In the forward diffusion process, this noising distribution is written as q(xₜ|xₜ₋₁), where the vertical bar symbol “|” is read as “given” or “conditional on”, to indicate that the pixel means are passed forward from q(xₜ₋₁). At t = T, where T is a large number (commonly 1000), we aim to end up with images of pure noise (which, somewhat confusingly, is also a Gaussian distribution, as discussed previously).
The marginal distributions q(xₜ) represent the distributions that have accumulated the effects of all the previous noising steps (marginalization refers to integration over all possible conditions, which recovers the unconditioned distribution).
Since the conditional distributions are Gaussian, what about their variances? They are determined by a variance schedule that maps timesteps to variance values. Initially, an empirically determined schedule of linearly increasing values from 0.0001 to 0.02 over 1000 steps was presented in Ho et al. Later research by Nichol & Dhariwal suggested an improved cosine schedule. They state that a schedule is most effective when the rate of information destruction through noising is relatively even per step throughout the whole noising process.
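As a sketch, both schedules might be implemented as follows (the cosine version follows Nichol & Dhariwal’s published formulation, including their small offset s that prevents vanishing values near t = 0):

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing variances, as in Ho et al."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T=1000, s=0.008):
    """Cosine schedule from Nichol & Dhariwal: define the cumulative
    signal level alpha_bar with a squared cosine, then derive betas."""
    t = np.arange(T + 1)
    alpha_bar = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)
```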
Forward diffusion intuition
Since we encounter Gaussian distributions both as pure noise q(xₜ, t = T) and as the noising distribution q(xₜ|xₜ₋₁), I’ll try to draw the distinction by giving a visual intuition of the distribution for a single noising step, q(x₁∣x₀), for some arbitrary, structured 2-dimensional data:

The distribution q(x₁∣x₀) is Gaussian, centered around each point in x₀, shown in blue. Several example points x₀⁽ⁱ⁾ are picked to illustrate this, with q(x₁∣x₀ = x₀⁽ⁱ⁾) shown in orange.
In practice, the main use of these distributions is to generate specific instances of noised samples for training (discussed further below). We can calculate the parameters of the noising distributions at any timestep t directly from the variance schedule, since the chain of Gaussians is itself also Gaussian. This is very convenient, as we don’t need to perform noising sequentially — for any given starting data x₀⁽ⁱ⁾, we can calculate the noised sample xₜ⁽ⁱ⁾ by sampling from q(xₜ∣x₀ = x₀⁽ⁱ⁾) directly.
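A sketch of this single-step noising, using the standard reparameterization xₜ = √α̅ₜ·x₀ + √(1 − α̅ₜ)·ε with ε ∼ 𝒩(0, I), and reusing the linear schedule from above:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # variance schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative signal level

def noise_to_timestep(x0, t):
    """Sample x_t ~ q(x_t | x_0) in a single step."""
    eps = np.random.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

# e.g. noise a flattened glyph image halfway through the schedule:
x0 = np.zeros(16384)  # placeholder for a real image vector
x_500 = noise_to_timestep(x0, 500)
```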
Forward diffusion visualization
Let’s now return to our glyph dataset (once again using the UMAP visualization as a visual shorthand). The top row of the figure below shows our dataset sampled from distributions noised to various timesteps: xₜ ∼ q(xₜ). As we increase the number of noising steps, you can see that the dataset starts to resemble pure Gaussian noise. The bottom row visualizes the underlying probability distribution q(xₜ).

Reverse diffusion overview
It follows that if we knew the reverse distributions q(xₜ₋₁∣xₜ), we could repeatedly subtract a small amount of noise, starting from a pure noise sample xₜ at t = T, to arrive at a data sample x₀ ∼ q(x₀). In practice, however, we cannot access these distributions without knowing x₀ beforehand. Intuitively, it’s easy to make a known image much noisier, but given a very noisy image, it’s much harder to guess what the original image was.
So what are we to do? Since we have a large amount of data, we can train an ML model to accurately guess the original image that any given noisy image came from. Specifically, we learn the parameters θ of an ML model that approximates the reverse noising distributions, pθ(xₜ₋₁ ∣ xₜ), for t = 0, …, T. In practice, this is embodied in a single noise prediction model trained over many different samples and timesteps. This allows it to denoise any given input, as shown in the figure below.

Next, let’s go over how this noise prediction model is implemented and trained in practice.
How the model is implemented
First, we define the ML model — generally a deep neural network of some sort — that will act as our noise prediction model. This is what does the heavy lifting! In practice, any ML model that inputs and outputs data of the correct size can be used; the U-net, an architecture particularly suited to learning images, is what we use here and is usually chosen in practice. More recent models also use vision transformers.

Then we run the training loop depicted in the figure above (a minimal code sketch follows the list):
- We take a random image from our dataset and noise it to a random timestep t. (In practice, we speed things up by doing many examples in parallel!)
- We feed the noised image into the ML model and train it to predict the (known to us) noise in the image. We also perform timestep conditioning by feeding the model a timestep embedding, a high-dimensional unique representation of the timestep, so that the model can distinguish between timesteps. This can be a vector the same size as our image added directly to the input (see here for a discussion of how this is implemented).
- The model “learns” by minimizing the value of a loss function, some measure of the difference between the predicted and actual noise. The mean squared error (the mean of the squares of the pixel-wise differences between the predicted and actual noise) is used in our case.
- Repeat until the model is well trained.
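Here is that minimal sketch in PyTorch style, under stated assumptions: model is some noise-prediction network (e.g. a U-net) taking (images, timesteps), and dataloader and optimizer are set up elsewhere. This shows the shape of the algorithm, not the glyffuser’s actual code:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# model: assumed noise-prediction network (e.g. a U-net) that takes
# (images, timesteps) and returns a tensor shaped like the images.
for images, _ in dataloader:                     # assumed image dataloader
    t = torch.randint(0, T, (images.shape[0],))  # random timestep per image
    eps = torch.randn_like(images)               # the noise we will add

    # Closed-form forward noising: x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * images + (1 - a_bar).sqrt() * eps

    # Train the model to predict the known noise (MSE loss).
    loss = F.mse_loss(model(x_t, t), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```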
Note: A neural network is essentially a function with a huge number of parameters (on the order of 10⁶ for the glyffuser). Neural network ML models are trained by iteratively updating their parameters using backpropagation to minimize a given loss function over many training data examples. This is an excellent introduction. These parameters effectively store the network’s “knowledge”.
A noise prediction model trained in this way eventually sees many different combinations of timesteps and data examples. The glyffuser, for example, was trained over 100 epochs (runs through the whole dataset), so it saw around 2 million data samples. Through this process, the model implicitly learns the reverse diffusion distributions over the entire dataset at all the different timesteps. This allows the model to sample the underlying distribution q(x₀) by stepwise denoising starting from pure noise. Put another way, given an image noised to any given level, the model can predict how to reduce the noise based on its guess of what the original image was. By doing this repeatedly, updating its guess of the original image each time, the model can transform any noise into a sample that lies in a high-probability region of the underlying data distribution.
Reverse diffusion in practice

We can now revisit this video of the glyffuser denoising process. Recall that a large number of steps from sample to noise, e.g. T = 1000, is used during training to make the noise-to-sample trajectory very easy for the model to learn, since changes between steps will be small. Does that mean we need to run 1000 denoising steps every time we want to generate a sample?
Luckily, this is not the case. Essentially, we can run the single-step noise prediction but then rescale it to any given step, although it might not be very good if the gap is too large! This allows us to approximate the full sampling trajectory with fewer steps. The video above uses 120 steps, for instance (most implementations will allow the user to set the number of sampling steps).
Recall that predicting the noise at a given step is equivalent to predicting the original image x₀, and that we can access the equation for any noised image deterministically using only the variance schedule and x₀. Thus, we can calculate xₜ₋ₖ based on any denoising step. The closer the steps are, the better the approximation will be.
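A sketch of this few-step sampling, reusing alphas_bar (as a torch tensor) from the training sketch above: at each step we predict the noise, recover an estimate of x₀, then deterministically re-noise it to the next, earlier timestep. This is essentially a DDIM-style deterministic update under these assumptions, not any particular library’s sampler:

```python
import torch

@torch.no_grad()
def sample(model, shape, steps=120, T=1000):
    timesteps = torch.linspace(T - 1, 0, steps).long()
    x = torch.randn(shape)                  # start from pure noise
    for i, t in enumerate(timesteps):
        a_bar = alphas_bar[t]
        eps = model(x, t.expand(shape[0]))  # predicted noise
        # Equivalent prediction of the original image x_0:
        x0_pred = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        if i + 1 < len(timesteps):
            # Re-noise the x_0 estimate to the next (earlier) timestep.
            a_bar_next = alphas_bar[timesteps[i + 1]]
            x = a_bar_next.sqrt() * x0_pred + (1 - a_bar_next).sqrt() * eps
        else:
            x = x0_pred
    return x
```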
With too few steps, however, the results become worse, since the steps become too large for the model to effectively approximate the denoising trajectory. If we only use 5 sampling steps, for example, the sampled characters don’t look very convincing at all:

There is then a whole literature on more advanced sampling methods beyond what we’ve discussed so far, allowing effective sampling with far fewer steps. These often reframe the sampling as a differential equation to be solved deterministically, giving an eerie quality to the sampling videos — I’ve included one at the end if you’re interested. In production-level models, these are usually preferred over the simple method discussed here, but the basic principle of deducing the noise-to-sample trajectory is the same. A full discussion is beyond the scope of this article, but see e.g. this paper and its corresponding implementation in the Hugging Face diffusers library for more information.
Alternative intuition from the score function
To me, it was still not 100% clear why training the model on noise prediction generalizes so well. I found that an alternative interpretation of diffusion models known as “score-based modeling” filled some of the gaps in intuition (for more information, refer to Yang Song’s definitive article on the topic).

I try to give a visual intuition in the bottom row of the figure above: essentially, learning the noise in our diffusion model is equivalent (up to a constant factor) to learning the score function, which is the gradient of the log of the probability distribution: ∇ₓ log q(x). As a gradient, the score function represents a vector field with vectors pointing towards the regions of highest probability density. Subtracting the noise at each step is then equivalent to following the directions in this vector field towards regions of high probability density.
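Concretely, for the Gaussian noising used here, the score of the noised conditional can be written out explicitly, which makes the equivalence visible (a standard identity, stated in the notation used above):

```latex
% For x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
% with \epsilon \sim \mathcal{N}(0, I):
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t}
  = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}
```

A network trained to predict ε is therefore, up to the factor −1/√(1 − α̅ₜ), estimating the score function.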
As long as there is some signal, the score function effectively guides sampling, but in regions of low probability it tends towards zero as there is little to no gradient to follow. Using many steps to cover different noise levels allows us to avoid this, as we smear out the gradient field at high noise levels, allowing sampling to converge even if we start from low-probability-density regions of the distribution. The figure shows that as the noise level is increased, more of the domain is covered by the score function vector field.
Summary
- The aim of diffusion models is to learn the underlying probability distribution of a dataset and then be able to sample from it. This requires forward and reverse diffusion (noising) processes.
- The forward noising process takes samples from our dataset and gradually adds Gaussian noise (pushes them off the data manifold). This forward process is computationally efficient because any level of noise can be added in closed form in a single step.
- The reverse noising process is challenging because we need to predict how to remove the noise at each step without knowing the original data point in advance. We train an ML model to do this by giving it many examples of data noised to different timesteps.
- Using very small steps in the forward noising process makes it easier for the model to learn to reverse these steps, since the changes are small.
- By applying the reverse noising process iteratively, the model refines noisy samples step by step, eventually producing a realistic data point (one that lies on the data manifold).
Takeaway
Diffusion models are a powerful framework for learning complex data distributions. The distributions are learnt implicitly by modelling a sequential denoising process. This process can then be used to generate samples similar to those in the training distribution.
Once you’ve trained a model, how do you get useful stuff out of it?
Earlier uses of generative AI such as “This Person Does Not Exist” (ca. 2019) made waves simply because it was the first time most people had seen AI-generated photorealistic human faces. A generative adversarial network, or “GAN”, was used in that case, but the principle remains the same: the model implicitly learnt an underlying data distribution — in that case, human faces — then sampled from it. So far, our glyffuser model does a similar thing: it samples randomly from the distribution of Chinese glyphs.
The question then arises: can we do something more useful than just sample randomly? You’ve likely already encountered text-to-image models such as Dall-E. They are able to incorporate extra meaning from text prompts into the diffusion process — this is commonly known as conditioning. Likewise, diffusion models for scientific applications like protein (e.g. Chroma, RFdiffusion, AlphaFold3) or inorganic crystal structure generation (e.g. MatterGen) become much more useful if they can be conditioned to generate samples with desirable properties such as a specific symmetry, bulk modulus, or band gap.
Conditional distributions
We can think of conditioning as a way to guide the diffusion sampling process towards particular regions of our probability distribution. We mentioned conditional distributions in the context of forward diffusion. Below we show how conditioning can be thought of as reshaping a base distribution.

Consider the figure above. Think of p(x) as a distribution we want to sample from (i.e., the images) and p(y) as conditioning information (i.e., the text dataset). These are the marginal distributions of a joint distribution p(x, y). Integrating p(x, y) over y recovers p(x), and vice versa.
Sampling from p(x), we are equally likely to get x₁ or x₂. However, we can condition on y = y₁ to obtain p(x∣y = y₁). You can think of this as taking a slice through p(x, y) at a given value of y. In this conditioned distribution, we are much more likely to sample at x₁ than x₂.
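A toy numerical sketch of this “slicing” with a small discrete joint distribution (all numbers here are invented for illustration):

```python
import numpy as np

# Joint distribution p(x, y) over two x values and two y values.
# Rows: x1, x2; columns: y1, y2. Entries sum to 1.
p_xy = np.array([[0.40, 0.10],
                 [0.10, 0.40]])

p_x = p_xy.sum(axis=1)  # marginal p(x) -> [0.5, 0.5]
p_y = p_xy.sum(axis=0)  # marginal p(y) -> [0.5, 0.5]

# Conditioning on y = y1: slice the joint at y1 and renormalize.
p_x_given_y1 = p_xy[:, 0] / p_y[0]  # -> [0.8, 0.2]
print(p_x, p_x_given_y1)            # x1 is now much more likely than x2
```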
In practice, in order to condition on a text dataset, we need to convert the text into a numerical form. We can do this using large language model (LLM) embeddings that can be injected into the noise prediction model during training.
Embedding text with an LLM
In the glyffuser, our conditioning information is in the form of English text definitions. We have two requirements: 1) ML models prefer fixed-length vectors as input. 2) The numerical representation of our text must understand context — if we have the words “lithium” and “element” nearby, the meaning of “element” should be understood as “chemical element” rather than “heating element”. Both of these requirements can be met by using a pre-trained LLM.
The diagram below shows how an LLM converts text into fixed-length vectors. The text is first tokenized (LLMs break text into tokens, small chunks of characters, as their basic unit of interaction). Each token is converted into a base embedding, which is a fixed-length vector of the size of the LLM input. These vectors are then passed through the pre-trained LLM (here we use the encoder portion of Google’s T5 model), where they are imbued with additional contextual meaning. We end up with an array of n vectors of the same length d, i.e. an (n, d) sized tensor.
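As a sketch using the Hugging Face transformers library (the checkpoint name and example text are illustrative; the glyffuser’s exact setup may differ):

```python
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

# Tokenize a definition, then embed it with the T5 encoder.
tokens = tokenizer("gold; metal; money", return_tensors="pt")
embeddings = encoder(**tokens).last_hidden_state

# (batch, n tokens, d) -- one contextual vector per token.
print(embeddings.shape)  # e.g. torch.Size([1, 8, 512]) for t5-small
```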

Note: in some models, notably Dall-E, additional image-text alignment is performed using contrastive pretraining. Imagen seems to show that we can get away without doing this.
Training the diffusion model with text conditioning
The exact method by which this embedding tensor is injected into the model can vary. In Google’s Imagen model, for example, the embedding tensor is pooled (combined into a single vector in the embedding dimension) and added into the data as it passes through the noise prediction model; it is also incorporated in a different way using cross-attention (a method of learning contextual information between sequences of tokens, most famously used in the transformer models that form the basis of LLMs like ChatGPT).

In the glyffuser, we only use cross-attention to introduce this conditioning information. While a significant architectural change is required to introduce this additional information into the model, the loss function for our noise prediction model remains exactly the same.
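A minimal sketch of the wiring: queries come from the image features while keys and values come from the text embeddings, so each image location can attend to the relevant parts of the prompt (all dimensions here are arbitrary placeholders, not the glyffuser’s actual sizes):

```python
import torch
import torch.nn as nn

d_model = 256                        # shared attention width (assumed)
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
text_proj = nn.Linear(512, d_model)  # project T5 embeddings (d = 512)

image_features = torch.randn(1, 16 * 16, d_model)  # flattened U-net feature map
text_embeddings = torch.randn(1, 8, 512)           # (batch, n tokens, d)

# Queries from the image; keys and values from the text -- this is what
# lets each image location "look at" the prompt during denoising.
kv = text_proj(text_embeddings)
conditioned, _ = attn(query=image_features, key=kv, value=kv)
```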
Testing the conditioned diffusion model
Let’s do a simple test of the fully trained conditioned diffusion model. In the figure below, we try to denoise in a single step with the text prompt “Gold”. As touched upon in our interactive UMAP, Chinese characters often contain components known as radicals which can convey sound (phonetic radicals) or meaning (semantic radicals). A common semantic radical is derived from the character meaning “gold”, “金”, and is used in characters that are in some broad sense associated with gold or metals.

The figure shows that even though a single step is insufficient to approximate the denoising trajectory very well, we have moved into a region of our probability distribution with the “金” radical. This indicates that the text prompt is effectively guiding our sampling towards a region of the glyph probability distribution related to the meaning of the prompt. The animation below shows a 120-step denoising sequence for the same prompt, “Gold”. You can see that every generated glyph has either the 釒 or 钅 radical (the same radical in traditional and simplified Chinese, respectively).

Takeaway
Conditioning allows us to sample meaningful outputs from diffusion models.
Further remarks
I found that with the help of tutorials and existing libraries, it was possible to implement a working diffusion model despite not having a full understanding of what was going on under the hood. I think this is a good way to start learning, and I highly recommend Hugging Face’s tutorial on training a simple diffusion model using their diffusers Python library (which now contains my small bugfix!).
I’ve omitted some topics that are important to how production-grade diffusion models function but are unnecessary for core understanding. One is the question of how to generate high-resolution images. In our example, we did everything in pixel space, but this becomes very computationally expensive for large images. The general approach is to perform diffusion in a smaller space, then upscale in a separate step. Techniques include latent diffusion (used in Stable Diffusion) and cascaded super-resolution models (used in Imagen). Another topic is classifier-free guidance, a very elegant method for boosting the conditioning effect to give much better prompt adherence. I show the implementation in my previous post on the glyffuser and highly recommend this article if you want to learn more.
Further reading
A non-exhaustive list of materials I found very helpful:
Fun extras

Diffusion sampling using the DPMSolverSDEScheduler developed by Katherine Crowson and implemented in Hugging Face diffusers — note the smooth transition from noise to data.