to invertibility and beyond

may 2025

i put thing in black box, then i want thing back, but how?

this blog aims to give a structured introduction to how we can make neural networks invertible. note that this blog assumes a good working knowledge of probability theory, linear algebra, and machine learning principles. suggestions always welcome, just drop me a message!

this ended up being incredibly long and technical, thanks to jake.

a primer on invertibility

in caveman terms, an invertible process is one we know how to undo. moving a banana is invertible, eating the banana is not.

mathematically, an invertible function is a bijective one, and a bijective function is both injective and surjective. these are defined as follows:

Injective and Surjective Functions

Injective: \( F(x) = F(y) \implies x = y \)

Surjective: \( \forall \, b \in B, \exists \, a \in A \) such that \( F(a) = b \)

an injective function is one where no two distinct inputs give the same output, and a surjective function is one where every possible output is mapped to by some input. together, they give a bijection: given any output, i know the input that mapped to it exists and is unique.

example o' clock, the function \(y = 4x + 10\) is invertible, because given any \(y\), i can give you back \(x\) via the inverse function \(x = \frac{y - 10}{4}\). note that this function is analytically invertible, because we are able to give a closed form expression of the inverse function.

from that alone, it is pretty easy to see that most neural networks are not analytically invertible in their usual forms. for one, the most common choice of activation function, \( ReLU(x) = max(0, x) \), renders the whole thing non-invertible. if we receive a strictly positive value as output from ReLU, we can be sure that it is the same number that we put in. but we are certainly out of luck if a zero is spit out, since every non-positive input gets mapped to zero.
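a two-second pytorch demonstration of the problem:

```python
import torch

relu = torch.nn.ReLU()
print(relu(torch.tensor([3.0])))    # tensor([3.]) -> a positive output pins down the input
print(relu(torch.tensor([-3.0])))   # tensor([0.])
print(relu(torch.tensor([-7.0])))   # tensor([0.]) -> same output as above: no way back
```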

since a chain of functions can only be undone if each link can be undone, using ReLU as the activation function on a linear layer makes the whole thing not analytically invertible. beyond ReLU and linear layers, common operations like convolutions on images and message passing on graphs are also not analytically invertible because of their aggregation steps. simply put, the result of an addition tells you nothing about the things you added to get there.

so is all hope lost? can we go home now?

no, not yet. buckle in.

the hard and soft ways to invertibility

before we move on to discussions about neural networks, it is useful to make a distinction between hard and soft invertibility. this will make our mission precise, as well as yield a paved path for us to pursue.

loosely speaking, soft invertibility is encouraged, and hard invertibility is enforced. encouraging a model to be invertible involves regularizing it via a loss term that punishes poor invertibility. enforcing a model to be invertible, on the other hand, means architectural designs that guarantee the model is invertible, regardless of how the training run goes.

a classic example of a softly invertible model is the autoencoder, which is trained to learn an efficient encoding of unlabeled data. it consists of two networks, an encoder \( E_{\phi} \) and a decoder \( D_{\theta} \): the former encodes the input data into a latent space, and the latter decodes it back into the original data space. the autoencoder is trained to minimize a reconstruction loss, which measures the distance between \(x\), the original input, and \( x' = D_{\theta}(E_{\phi}(x)) \), the result of pushing \( x \) through the network and back. that is to say, with appropriate encouragement, an autoencoder is softly invertible.
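to make the encourage-don't-enforce flavor concrete, here is a minimal pytorch sketch of one autoencoder training step; the layer sizes, learning rate, and random stand-in batch are all arbitrary choices for illustration, not any particular published setup.

```python
import torch
import torch.nn as nn

# a minimal autoencoder: nothing about this architecture is invertible by
# construction, we just punish it for reconstructing badly.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

x = torch.rand(64, 784)             # a stand-in batch of (flattened) data
z = encoder(x)                      # encode into the latent space
x_recon = decoder(z)                # decode back into data space
loss = ((x - x_recon) ** 2).mean()  # reconstruction loss: the "soft" invertibility pressure
loss.backward()
optimizer.step()
```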

super important to note here is that models regularized to be invertible may only be accurately invertible on the data distribution they are trained on. in other words, an autoencoder that encodes and decodes pictures of bananas losslessly might not be able to do the same on pictures of oranges. on the other hand, hard invertibility gives us a guarantee that invertibility holds for all elements in the domain. in this blog, we are in pursuit of hard invertibility, and so our next stop is naturally to look at constraining our model architecture to enforce this property.

normalizing flows: flowing towards the normal

a comprehensive overview of normalizing flows can be found in this review

normalizing flows are one class of generative models that bake hard invertibility into their architecture. unlike a discriminative model, whose objective is to estimate a label conditioned on its input variables: \( p (y|x_1, x_2, ..., x_n) \), a generative model aims to learn the joint probability distribution \( p(x_1, x_2, ..., x_n) \) over all its input variables. generative models are attractive because of their flexibility - given a well trained model, we can turn it into a discriminator using bayes' rule, use it to perform density estimation, generate new samples from the original data distribution, and much more.

see section 1.1 of kingma et al.'s introduction to variational autoencoders for a great explanation of discriminative vs. generative modeling and motivation for the latter.

the key idea behind normalizing flows is the following: all of our data is drawn from some unknown data distribution, but this distribution may be arbitrarily complex and therefore hard to sample new data points from. however, if we are able to transform this input data distribution into a simpler distribution via a series of invertible and differentiable mappings, then we would be able to generate new samples simply by sampling from the simple distribution (say that 10 times fast) and applying the inverse of the mappings one by one. most commonly, that simpler distribution is the standard normal distribution, hence the name of the model - we are flowing our data distribution towards the normal.

generative models aim to capture the underlying data generation process, and so one of the central objectives during training is to maximize the likelihood of seeing the observed data under our model. in other words, we want to optimize the parameters so that the model assigns high probability to the data points we do observe. this process is known as maximum likelihood estimation, and it is the primary learning objective for models that aim to explicitly model the entire joint distribution. our current subject of study, the normalizing flow family of models, is an example of such explicit generative models.

for some great intuition on maximum likelihood and bayesian statistics, see think bayes 2

maximum likelihood estimation requires us to know how to compute the likelihood. lucky us, the transformations used by normalizing flows are invertible and differentiable, which enables exact computation of the likelihood via the change of variables formula. in short, the formula describes exactly how we would go about computing likelihoods resulting from these mappings:

Change of Variables Formula

Given \( Z \in \mathbb{R}^D \), a random variable with a known and tractable probability density function \( p_Z: \mathbb{R}^D \to \mathbb{R} \), if \( g \) is an invertible and differentiable function, let \( f = g^{-1} \) and \( Y = g(Z) \). Then the probability density function of the random variable \( Y \) is:

\[ p_Y(y) = p_Z(f(y)) \left| \det \left( \frac{\partial f}{\partial y} \right) \right| \]

Here, \( \det \left( \frac{\partial f}{\partial y} \right) \) denotes the determinant of the Jacobian matrix of \( f \), evaluated at \( y \).
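as a sanity check, here is the formula at work in one dimension (where the jacobian determinant collapses to a scalar derivative), reusing the \( y = 4z + 10 \) example from earlier; numpy and scipy stand in for any real flow here.

```python
import numpy as np
from scipy.stats import norm

# 1-d check of the change of variables formula for y = g(z) = 4z + 10,
# so f(y) = (y - 10) / 4 and |det(df/dy)| = 1/4.
z_density = norm(loc=0.0, scale=1.0)            # tractable base density p_Z
y = np.linspace(-5.0, 25.0, 200)

p_y_formula = z_density.pdf((y - 10.0) / 4.0) * (1.0 / 4.0)
p_y_known = norm(loc=10.0, scale=4.0).pdf(y)    # the true density of Y, N(10, 4^2)

assert np.allclose(p_y_formula, p_y_known)      # the formula agrees with the known answer
```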

this formula is ✨ illuminating ✨ because it prescribes the necessary conditions to build an invertible model that allows for exact and tractable likelihood computations throughout training: we need a bijection that is expressive when stacked in layers, and we need this bijection, its inverse, and the determinant of its jacobian to be speedily computable. if we had such a function, we would have collected all the puzzle pieces to an invertible and explicitly generative model.

and of course, we do have such a function.

a couple of coupling layers

original paper by dinh et al. here

coupling layers as proposed by dinh et al. give us just that. the core idea behind this sort of layer is to split the input \( x \) into two pieces, \(x_1\) and \(x_2\), and perform the following computation, with \( m \) being an arbitrarily complex function:

\begin{align*} y_1 &= x_1 \\ y_2 &= x_2 + m(x_1) \end{align*}

a little bit of rearranging very trivially gives us the inverse function:

\begin{align*} x_1 &= y_1 \\ x_2 &= y_2 - m(y_1) \end{align*}

note that both the forward and inverse computation of the entire coupling layer can happen with just the forward pass of the function \(m\), which can be as simple or as complex as the builder of the model desires it to be.

of course, the coupling layer leaves part of the input completely unchanged, but this issue of expressivity can be fixed simply by having multiple coupling layers, alternating which block receives the identity function at each turn. the composition of invertible functions is invertible, and so this stacking of coupling layers is invertible.
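here is a minimal pytorch sketch of such a stack of additive coupling layers; the hidden width of \( m \) and the way i alternate the halves are arbitrary choices for illustration, not dinh et al.'s exact setup.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """y1 = x1, y2 = x2 + m(x1), with the roles of the two halves optionally swapped."""
    def __init__(self, dim_half, swap=False):
        super().__init__()
        self.swap = swap
        # m can be as simple or as complex as we like; it is never inverted.
        self.m = nn.Sequential(nn.Linear(dim_half, 64), nn.ReLU(), nn.Linear(64, dim_half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        if self.swap:
            x1, x2 = x2, x1
        y1, y2 = x1, x2 + self.m(x1)
        if self.swap:
            y1, y2 = y2, y1
        return torch.cat([y1, y2], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        if self.swap:
            y1, y2 = y2, y1
        x1, x2 = y1, y2 - self.m(y1)        # only the forward pass of m is ever needed
        if self.swap:
            x1, x2 = x2, x1
        return torch.cat([x1, x2], dim=-1)

# stack layers, alternating which half gets the identity
layers = [AdditiveCoupling(2, swap=(i % 2 == 1)) for i in range(4)]
x = torch.randn(8, 4)
y = x
for layer in layers:
    y = layer(y)
x_back = y
for layer in reversed(layers):
    x_back = layer.inverse(x_back)
assert torch.allclose(x, x_back, atol=1e-5)     # the stack undoes itself
```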

all we need now is a fast way to compute the determinant of the jacobian of this function...

do you remember your matrix determinant rules?

O_O

bitesize determinants

the determinant of a matrix \(A\) is nonzero iff \(A\) is invertible, and it is intuitively the factor by which space is stretched after applying \(A\).

there is a beautiful 3b1b video with great visualizations on this stretching action!

how to compute it? well. you can sit down with your pencil and get it via laplace expansion (please don't, that's \( O(n!) \)), or via LU decomposition and a variety of other methods with a time complexity of \( O(n^3) \). or, if your matrix is well behaved (not a mathematical definition), you can apply a number of properties of the determinant to compute it way faster.

one such useful property is that the determinant of a triangular matrix is simply the product of its diagonal entries:

\[ \text{If } A \text{ is triangular, then } \det(A) = \prod_{i=1}^{n} A_{ii} \]

proof is left as an exercise for the readers (:
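in lieu of a proof, a quick numerical vote of confidence (the seed and matrix size are arbitrary):

```python
import numpy as np

# a random lower triangular matrix, its determinant, and the product of its diagonal
rng = np.random.default_rng(0)
A = np.tril(rng.standard_normal((5, 5)))

print(np.linalg.det(A))       # the general O(n^3) route
print(np.prod(np.diag(A)))    # product of the diagonal: same number, up to float error
```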

in the case of coupling layers, the determinant of the jacobian is simply 1. here are all the formulas again, explicitly written out:

\[ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \quad y_1 = x_1, \quad y_2 = x_2 + m(x_1) \]

\[ J = \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix} \]

since \(y_1 = x_1\), \( \frac{\partial y_1}{\partial x_1} = I \); similarly, \( \frac{\partial y_2}{\partial x_2} = I \)

and since \( y_1 \) does not depend on \(x_2\) at all, \( \frac{\partial y_1}{\partial x_2} = 0 \)

putting those blocks together, we get a lower triangular matrix, with ones along the diagonal:

\[ J = \begin{bmatrix} 1 & 0 \\ \frac{\partial y_2}{\partial x_1} & 1 \end{bmatrix} \]

by our aside on determinants, the determinant of the jacobian of this particular coupling layer is just 1, and we are able to drop this term entirely from our change of variables formula. this significantly simplifies our likelihood computations, and finally gives us the analytically invertible, explicitly generative model that we have worked so hard for.
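you don't have to take my word (or the block matrix) for it: autograd can compute the jacobian of a toy coupling map directly. the particular \( m \) below is made up purely to have something nonlinear to differentiate.

```python
import torch

def m(x1):
    # any smooth nonlinear function will do here
    return torch.sin(3.0 * x1) + x1 ** 2

def coupling(x):
    x1, x2 = x[:2], x[2:]
    return torch.cat([x1, x2 + m(x1)])

x = torch.randn(4)
J = torch.autograd.functional.jacobian(coupling, x)
print(J)              # lower triangular, ones on the diagonal
print(torch.det(J))   # 1.0, up to float error
```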

many other types of coupling layers can be constructed using a similar logic of designing the jacobian to be easily computable. dinh et al. cover a few more variations, and prove that they also satisfy the conditions for invertibility and computational efficiency. similarly, coupling layers are not the only way to build normalizing flows, but again, similar principles apply.

but wait, you keep saying analytic invertibility like there is somehow a secret, second form of invertibility... can invertibility exist without an explicit inverse?

*nods mysteriously, with a smug smirk*

invertible residual networks

behrmann et al.

central to behrmann et al.'s thesis is that there is a very nice similarity between residual layers and euler's method for numerically solving ordinary differential equation (ODE) initial value problems (IVPs):

\[ x_{t+1} \leftarrow x_t + g_{\theta_t}(x_t) \quad \text{ResNet} \]

\[ x_{t+1} \leftarrow x_t + h f_{\theta_t}(x_t) \quad \text{ODE IVP} \]

for clarity, here's a summary on the notation for each:

| component | residual layer | euler's method |
|---|---|---|
| update rule | \( x_{t+1} = x_t + g_{\theta_t}(x_t) \) | \( x_{t+1} = x_t + h f_{\theta_t}(x_t) \) |
| \( x_t \) | input at layer \( t \) | state at time \( t \) |
| \( g_{\theta_t} \), \( f_{\theta_t} \) | learned residual function | derivative function |
| \( h \) | (no equivalent) | step size |

given this similarity, behrmann et al. argued that we can think of a residual neural network as an ODE problem, and interpret the inverse of this ODE's IVP as the dynamics of the system going backwards in time. this backwards step looks as follows:

\[ x_{t} \leftarrow x_{t+1} - g_{\theta_t}(x_t) \quad \text{ResNet} \]

\[ x_{t} \leftarrow x_{t+1} - h f_{\theta_t}(x_t) \quad \text{ODE IVP} \]

solving these backwards dynamics would implement an inverse of our function. in other words, if we are able to use \( g_{\theta_t} \) and \( x_{t+1} \) to get \( x_t \), then we would be able to invert a residual layer as defined above.

concretely, this can be done by finding the fixed point of the function \(T\) defined as follows:

\[ T(x) = x_{t + 1} - g_{\theta_t} (x) \]

what is a fixed point, you ask? well...

Fixed Point

Formally, \( c \) is a fixed point of a function \( f \) if \( c \) belongs to both the domain and the codomain of \( f \), and \(f(c) = c\).

Less formally, a fixed point is a value that does not change under a given transformation. For functions, it is an element that is mapped to itself under the function.

if we are able to find the fixed point of \( T \), we would have solved the backwards dynamics, since this fixed point \(x\) is exactly the inverse:

\[ T(x) = x \quad \text{by definition of a fixed point} \]

\[ T(x) = x_{t + 1} - g_{\theta_t} (x) \]

\[ x = x_{t + 1} - g_{\theta_t} (x) \implies x_{t + 1} = x + g_{\theta_t}(x) \]

...so \( x \) was the input to the residual layer whose output was \( x_{t + 1} \), and thus the fixed point \( x = x_t \) is the solution to the backwards dynamics.

as is with a lot of mathematics, we've solved one problem, and our reward is two more: does a fixed point of \( T \) even exist (and is it unique)? and if it does, how do we actually compute it?

thankfully, banach's fixed point theorem addresses both of our problems in one fell swoop.

banach's fixed point theorem (and lipschitz continuity)

wikipedia pages for banach's fixed point theorem and contraction mapping

to start, obligatory math definitions, in very serious capitalizations.

Banach's Fixed Point Theorem

Let \( (X, d) \) be a non-empty complete metric space with a contraction mapping \( T: X \to X \). Then \( T \) admits a unique fixed point \( x^* \in X \) (i.e., \( T(x^*) = x^* \)). Furthermore, \( x^* \) can be found as follows: start with an arbitrary element \( x_0 \in X \) and define a sequence \( (x_n)_{n \in \mathbb{N}} \) by \( x_n = T(x_{n-1}) \, \text{for } n \geq 1 \). Then: \( \lim_{n \to \infty} x_n = x^* \).

banach's fixed point theorem guarantees us the existence of a unique fixed point, and provides a numerical way to compute it via what is aptly named fixed point iteration. in turn, it requires that our function be a contraction mapping.
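to see the recipe in action on something tiny: the \( T \) below is a made-up scalar contraction (it shrinks distances by a factor of 0.5), and fixed point iteration from an arbitrary starting point marches straight to its unique fixed point at 2.

```python
# fixed point iteration on a toy contraction T(x) = 0.5 * x + 1,
# whose unique fixed point is x* = 2 (since 2 = 0.5 * 2 + 1).
def T(x):
    return 0.5 * x + 1.0

x = 100.0                 # arbitrary starting point, as the theorem allows
for _ in range(50):
    x = T(x)
print(x)                  # -> 2.0 (up to float error)
```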

now. what is a contraction mapping?

Contraction Mapping

A contraction mapping on a metric space \((M, d)\) is a function \(f\) from \(M\) to itself, with the property that there is some real number \(0 \leq k < 1\) such that for all \(x\) and \(y\) in \(M\),

\[ d(f(x), f(y)) \leq k \cdot d(x, y). \]

Any contraction mapping is Lipschitz continuous, and the infimum of all such constants \( k \) is known as the Lipschitz constant of \( f \). A contraction mapping is thus a special case of Lipschitz continuous functions, with the added condition that the Lipschitz constant be \( < 1 \).

from now on, we will refer to functions with a lipschitz constant < 1, i.e. contraction mappings, as 1-lipschitz functions (a slight abuse of the usual term, which only asks for a constant of at most 1, but strict inequality is what we need here).

a good anchor for intuition that i found while thinking about lipschitz continuity is that the lipschitz constant puts an upper bound on how much a function's output can change relative to the distance between any two input points. the wikipedia article on lipschitz continuity has an animation that offers a similar, but more visual intuition.

thus far, we have a function \( T(x) = x_{t + 1} - g_{\theta_t} (x) \) for which we want to find the fixed point, and we know that a unique fixed point exists (and can be computed via fixed point iteration) if this function is 1-lipschitz.

making \( T(x) = x_{t + 1} - g_{\theta_t} (x) \) 1-lipschitz

note that for all ensuing analysis, we pick the norm to be the euclidean norm

the function \( T \) is 1-lipschitz if:

\[ \|T(x) - T(y)\| \leq L \|x - y\|, \quad L < 1 \]

expanding and rewriting the left hand side:

\[ T(x) - T(y) = -g_{\theta_t} (x) + g_{\theta_t} (y) = -(g_{\theta_t} (x) - g_{\theta_t} (y)) \]

\[ \Rightarrow \|T(x) - T(y)\| = \|g_{\theta_t} (x) - g_{\theta_t} (y) \| \]

\[ \Rightarrow \|g_{\theta_t} (x) - g_{\theta_t} (y) \| \leq L \|x - y\| \]

...we trivially get to the conclusion that making \( T \) 1-lipschitz requires only that we make \( g_{\theta_t} (x) \) 1-lipschitz.

making \( g_{\theta_t} (x) \) 1-lipschitz

miyato et al. spectral normalization for GANs, section 2.1

\( g_{\theta_t} \) can be any arbitrary function, so long as it is 1-lipschitz. we'd like it to be a neural network, which in its simplest form is composed of a linear layer, followed by a nonlinear activation function:

\[ f(h) = \phi(Wh) \]

important to note is that the composition of 1-lipschitz functions is 1-lipschitz; this can be shown as follows:

\[ \| f(x) - f(y) \| \leq L_f \| x - y \|, \quad L_f < 1 \]

\[ \| g(x) - g(y) \| \leq L_g \| x - y \|, \quad L_g < 1 \]

\[ \| f(g(x)) - f(g(y)) \| \leq L_f \| g(x) - g(y) \| \]

\[ \leq L_f L_g \| x - y \|, \quad L_f L_g < 1 \]

...so we just need both the activation function \( \phi \) and the linear function \( f(h) = Wh \) to be 1-lipschitz.

there are many popular nonlinearities that fit the bill, so we can just pick our favorite, ReLU (whose lipschitz constant is exactly 1, which is fine so long as the linear layers bring the overall constant strictly below 1). for the linear layers, it's a tad bit more complicated.

by definition, \( \|g\|_{\text{Lip}} = \sup_{h} \sigma\left( \nabla g(h) \right) \), where \( \sigma(\cdot) \) denotes the largest singular value. for a linear layer where \( g(h) = Wh \), \( \nabla g(h) \) is simply \( W \): the map is already linear, so its jacobian at every point is just the weight matrix. then we have:

\[ \|g\|_{\text{Lip}} = \sup_{h} \sigma\left( \nabla g(h) \right) = \sup_{h} \sigma (W) = \sigma (W) \]

which says, essentially, that the lipschitz constant of a linear layer is its largest singular value.
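a quick numpy check of this claim: the stretch factor \( \|Wh\| / \|h\| \) over a pile of random directions never exceeds, and creeps up towards, the largest singular value (the matrix size and sample count here are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))

h = rng.standard_normal((10, 100_000))              # lots of random input directions
ratios = np.linalg.norm(W @ h, axis=0) / np.linalg.norm(h, axis=0)

print(ratios.max())                                  # close to (but never above) ...
print(np.linalg.svd(W, compute_uv=False)[0])         # ... the largest singular value
```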

knowing this equality, we now have the power to do ✨anything✨, because there are lots of ways to fix a matrix \( A \) so that \( \sigma(A) < 1\).

the simplest way, and also the approach taken by behrmann et al., is to use power iteration to estimate \( \sigma(W) \), and then scale the weight matrix \( W \) by \( c / \sigma \), where the small constant \( c < 1 \) accounts for the fact that power iteration undershoots the true largest singular value. this operation is applied before every forward pass of the model.

\[ W = \begin{cases} \dfrac{c}{\sigma} W, & \text{if } \dfrac{c}{\sigma} < 1 \\[0.5em] W, & \text{otherwise} \end{cases} \]
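here is a numpy sketch of that recipe: estimate \( \sigma(W) \) with power iteration, then rescale. behrmann et al. run this before each forward pass and carry the power iteration vectors across training steps, which i'm not bothering with here; the matrix size, iteration count, and \( c \) are all arbitrary.

```python
import numpy as np

def power_iteration(W, n_iters=50):
    """estimate the largest singular value of W."""
    u = np.random.default_rng(0).standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v                          # always <= the true largest singular value

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))

sigma_est = power_iteration(W)
sigma_true = np.linalg.svd(W, compute_uv=False)[0]
print(sigma_est, sigma_true)                  # the estimate undershoots (or matches) the truth

c = 0.9                                       # target lipschitz coefficient, < 1
if c / sigma_est < 1:
    W = (c / sigma_est) * W                   # the rescaling from the formula above
print(np.linalg.svd(W, compute_uv=False)[0])  # now roughly 0.9
```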

intuition for why this scaling works to reduce \( \sigma(W) \) for a matrix \( W \) can be found by thinking about singular value decomposition (SVD), which decomposes \( W \) into three pieces:

\[ W = U \Sigma V^T \]

where \( U \) and \(V^T\) are orthogonal matrices (rotations and reflections), which have all singular values equal to 1 since they do not stretch space, and \( \Sigma \) is a diagonal matrix with the singular values along the diagonal. dividing the whole matrix \( W \) by some constant \( k \) directly corresponds to factoring \( \frac{1}{k} \) into \( \Sigma \), so all singular values are scaled down by \( k \).

\[ W = U \Sigma V^T \Rightarrow \frac{1}{k}W = U \left( \frac{1}{k} \Sigma \right) V^T \]

there are of course many other ways to constrain the largest singular value of a matrix, for example by projecting the weight matrices onto the orthogonal group, but we are plenty satisfied with this approach, which miyato et al. call spectral normalization.

combining everything finally gives us a hard, implicitly invertible neural network layer: making \( g_{\theta_t} (x) \) 1-lipschitz makes \( T(x) \) 1-lipschitz, which guarantees that \( T \) is a contraction mapping, so banach's fixed point theorem holds, which in turn guarantees the existence and uniqueness of a fixed point; this fixed point finally solves the backwards formulation of the residual layer dynamics that we initially set out to address.

we need not be satisfied with a single layer. the composition of invertible functions is invertible, and building a network simply involves stacking many of these layers together, with the inverse being computed residual layer by residual layer using fixed point iteration.
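putting the pieces together, here is a minimal pytorch sketch of an invertible residual block: the residual function is squashed into a contraction using pytorch's built-in spectral_norm (which normalizes each weight's largest singular value to roughly 1) plus an extra scaling constant \( c < 1 \), which is a simplification of behrmann et al.'s \( c/\sigma \) rescaling; the inverse is plain fixed point iteration. layer sizes, \( c \), and iteration counts are arbitrary choices here.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class InvertibleResidualBlock(nn.Module):
    """y = x + g(x), with g (approximately) a contraction; inverted by fixed point iteration."""
    def __init__(self, dim, c=0.5):
        super().__init__()
        self.c = c  # extra scaling to push the lipschitz constant strictly below 1
        # spectral_norm keeps each weight's largest singular value near 1, and
        # ReLU is 1-lipschitz, so c * (spectral-normed MLP) is a contraction.
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(dim, dim), n_power_iterations=5),
            nn.ReLU(),
            spectral_norm(nn.Linear(dim, dim), n_power_iterations=5),
        )

    def g(self, x):
        return self.c * self.net(x)

    def forward(self, x):
        return x + self.g(x)

    def inverse(self, y, n_iters=100):
        # fixed point iteration on T(x) = y - g(x), starting (arbitrarily) at y itself
        x = y
        for _ in range(n_iters):
            x = y - self.g(x)
        return x

block = InvertibleResidualBlock(dim=8)
x = torch.randn(4, 8)
with torch.no_grad():
    y = block(x)
    x_back = block.inverse(y)
print((x - x_back).abs().max())   # ~0: fixed point iteration recovered the input
```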

what this framework also gives us is a blueprint for extending invertibility beyond fully connected layers like the ones we've covered. given any residual formulation of a neural network \( y = F(x) + x \), we just need to prove that the residual function \( F(x) \) is 1-lipschitz. of course, much more care is needed in practice to ensure the expressivity and usability of such layers, but we won't worry about that for now.

this must be enough math for one day.

postscript 1: i'm *holds up fingers* three months deep in this invertibility rabbit hole. from autoencoders, which are invertible only with gentle parenting and lots of tlc, to finding normalizing flows as a niche idea for analytical invertibility with significant architectural constraints, to accidentally stumbling upon a seemingly soft invertible network that was just a little too invertible on out of distribution data, to finally finding the mathematical framework that backs its invertibility after digging through wikipedia and many many references. it was quite the journey.

postscript 2: shoutout to eddie, my trusty companion on this long and treacherous road to invertibility and beyond. i am so grateful for the countless hours of back and forth yaps and sanity checks when it genuinely feels like i'm losing it.

postscript 3: next blog will be silly. mark my words.