<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://javier-cramirez.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://javier-cramirez.github.io/" rel="alternate" type="text/html" /><updated>2025-03-12T06:03:46+00:00</updated><id>https://javier-cramirez.github.io/feed.xml</id><title type="html">javi</title><subtitle>My official website.</subtitle><entry><title type="html">Laplacians and Spectral Clustering</title><link href="https://javier-cramirez.github.io/2024/05/08/LaplacianMatrices.html" rel="alternate" type="text/html" title="Laplacians and Spectral Clustering" /><published>2024-05-08T00:00:00+00:00</published><updated>2024-05-08T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/05/08/LaplacianMatrices</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/05/08/LaplacianMatrices.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p>When thinking of partitioning data points into groups, we intuitively turn to clustering. A very classical example is K-means.
Briefly, the algorithm seeks to minimize the sum of squared Euclidean distances between the points in each cluster $C$ and that cluster's mean. Of course, partitioning the data requires
that (1) the union of all clusters is equal to the data set and (2) none of these clusters overlap. </p>
<p>The algorithm is very often associated with Lloyd's algorithm: select $k$ distinct initial means, assign each point to the closest cluster mean $c_{i}$, recompute each mean as the centroid of its assigned points, and repeat. Very informally, of course. However, these traditional methods favor compact, roughly spherical clusters, leading to worse performance on more unusual cluster shapes. As it turns out, spectral clustering is a very nice method leveraging the eigenvalues and eigenvectors of a special structure called the similarity matrix. For this, we must turn to principles from graph theory. </p>
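Very informally indeed, so here is a minimal pure-Python sketch of a single Lloyd iteration (the helper name <code>lloyd_step</code> and the toy points are mine, purely illustrative):

```python
def lloyd_step(points, means):
    """One Lloyd iteration: assign each point to its nearest mean,
    then recompute each mean as the centroid of its cluster."""
    k = len(means)
    clusters = [[] for _ in range(k)]
    for p in points:
        # index of the mean with the smallest squared Euclidean distance to p
        i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
        clusters[i].append(p)
    new_means = []
    for j, c in enumerate(clusters):
        if c:
            new_means.append(tuple(sum(x) / len(c) for x in zip(*c)))
        else:
            new_means.append(means[j])  # keep an empty cluster's mean unchanged
    return new_means, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
means, clusters = lloyd_step(points, [(0.0, 0.0), (5.0, 5.0)])
```

Iterating `lloyd_step` until the assignments stop changing is the whole algorithm.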
<p>Define the graph $G=(V,E)$ where $E\subseteq \binom{V}{2}$, with $E$ a set of edges and $V$ a set of vertices. $\binom{V}{2}$ is the set of all 2-element subsets of $V$ given that $V$ is finite. We say that $u\in V$ and $v\in V$ are adjacent if $\{u,v\}\in E$. This works both ways, which also means $\{v,u\}\in E$. To define the degree of a vertex, first note that the neighborhood $N(v)$ of a vertex $v\in V$ is the set of all vertices adjacent to $v$. Thus, $\text{deg}(v)=|N(v)|$. The adjacency matrix, $\mathbf{A}$, has its $(i,j)$-th entry set to $1$ if $v_{i}$ is adjacent to $v_{j}$ and $0$ otherwise. Next, define the incidence matrix, $\mathbf{M}$, as having its $(i,j)$-th entry set to $1$ if $v_{i}\in e_{j}$ and $0$ otherwise. Generally, if $v\in e$, then we say that $v$ is incident with $e$ and vice versa. </p>
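To make these definitions concrete, here is a quick sketch that builds $\mathbf{A}$ and $\mathbf{M}$ for a tiny graph (the edge list is made up) and recovers the degrees as row sums:

```python
# A small undirected graph on vertices 0..3 (this edge list is illustrative).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Adjacency matrix: A[i][j] = 1 iff {v_i, v_j} is an edge (symmetric,
# since adjacency works both ways).
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

# Incidence matrix: M[i][j] = 1 iff v_i is incident with edge e_j.
M = [[1 if i in e else 0 for e in edges] for i in range(n)]

# deg(v_i) = |N(v_i)| = the i-th row sum of A.
deg = [sum(row) for row in A]
```

Note that each column of $\mathbf{M}$ sums to $2$, since every edge is incident with exactly two vertices.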

<p>Further note that $\sum^{n}_{j=1}A(i,j)=\text{deg}(v_{i})=\sum^{n}_{j=1}A(j,i)$. Now, define $\mathbf{D}$ as the diagonal matrix with entries $\mathbf{D}(i,i)=\text{deg}(v_{i})$. We define the unnormalized Laplacian as $\mathbf{L}=\mathbf{D}-\mathbf{A}$. Notable properties of $\mathbf{L}$ are that (1) its smallest eigenvalue, $\lambda_{1}$, is $0$, with the constant eigenvector $\mathbb{1}$, (2) the multiplicity of the eigenvalue $0$ is equal to the number of connected components of $G$, and (3) it has $n$ non-negative real-valued eigenvalues, which we list in increasing order. Did I mention $\mathbf{L}$ is positive semi-definite and symmetric?</p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Tensor.scatter_() for Dummies</title><link href="https://javier-cramirez.github.io/2024/03/15/HowToTensorScatter.html" rel="alternate" type="text/html" title="Tensor.scatter_() for Dummies" /><published>2024-03-15T00:00:00+00:00</published><updated>2024-03-15T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/03/15/HowToTensorScatter</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/03/15/HowToTensorScatter.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p>This little function utilizes the parameters <code>dim</code>, <code>index</code>, <code>src</code>, and <code>reduce</code>.
<br /></p>
<blockquote>
  <p><code>Tensor.scatter_()</code> essentially uses the information from <code>index</code> to place <code>src</code> into our beloved <code>Tensor</code>.</p>
</blockquote>

<p>Suppose we have the following code:</p>
<pre><code class="language-python">src = torch.arange(1, 11).reshape((2, 5))
</code></pre>
<p>Our tensor looks like:</p>
<pre><code class="language-python">&gt;&gt; tensor([[ 1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10]])
</code></pre>
<p>Now define the <code>index</code> and <code>target</code> tensors (note that <code>index</code> must have the same number of dimensions as <code>src</code>, so we make it $1\times 5$):</p>
<pre><code class="language-python">idx = torch.tensor([[2, 2, 2, 2, 2]])
target = torch.zeros(3, 5, dtype=src.dtype)
</code></pre>
<p>Our <code>target</code> tensor will be the victim of our <code>scatter_()</code> bloodbath. It is simply a $3\times 5$ tensor of all zeroes. Meanwhile, <code>idx</code> will define “where” we place the elements of <code>src</code> into <code>target</code>. Here, it essentially acts as a middleman. Let’s see what happens if we call <code>scatter_()</code> with <code>dim=0</code> to keep things simple:</p>

<pre><code class="language-python">target.scatter_(0, idx, src)
&gt;&gt; tensor([[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [1, 2, 3, 4, 5]])
</code></pre>
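Under the hood, the <code>dim=0</code> rule from the PyTorch docs is <code>self[index[i][j]][j] = src[i][j]</code>, iterating over the shape of <code>index</code>. Here is a pure-Python sketch of that bookkeeping (plain nested lists, not the real implementation) reproducing the result above:

```python
def scatter_dim0(target, index, src):
    """Apply target[index[i][j]][j] = src[i][j] for every (i, j)
    in index's shape -- the dim=0 rule of Tensor.scatter_()."""
    for i, row in enumerate(index):
        for j, dest in enumerate(row):
            target[dest][j] = src[i][j]
    return target

src = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
idx = [[2, 2, 2, 2, 2]]          # only one row, so only src's first row is read
target = [[0] * 5 for _ in range(3)]
scatter_dim0(target, idx, src)   # target's third row becomes [1, 2, 3, 4, 5]
```

Since the loops range over <code>index</code>'s shape, any element of <code>src</code> that <code>index</code> never points at is simply never copied.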
<p>Mama mia. We have essentially moved the first row of <code>src</code> into the third row of <code>target</code>. Notice that the second row of <code>src</code> is nowhere to be found. This is because we specified <code>idx</code> as a $1\times 5$ tensor, not as a $2\times 5$ tensor.</p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Vector Quantized Variational Autoencoders</title><link href="https://javier-cramirez.github.io/2024/03/13/VQVAEs.html" rel="alternate" type="text/html" title="Vector Quantized Variational Autoencoders" /><published>2024-03-13T00:00:00+00:00</published><updated>2024-03-13T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/03/13/VQVAEs</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/03/13/VQVAEs.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p><strong>IN PROGRESS</strong></p>

<p>Discretization of the gradient? The big catch with a VQ-VAE is that it performs *just as well* as its continuous counterparts.</p>

<p>&nbsp;Vector quantisation is essentially finding, for some vector $x$ you want to encode, the template vector $e_{i}$ that minimizes the squared loss to $x$. 
Our "codebook" is the set of all these template vectors $e_{i}$. However, this does not scale well if we desire accuracy, as that requires large codebooks. 
This is where our variational autoencoder comes along. It's all about compressing the input to a lower-dimensional latent space. But, of 
course, here we are quantizing latent representations (not the input itself). </p>

<p>As output, the encoder gives us a continuous latent distribution $z_{e}(x)$ which we need to discretize. In vanilla VAEs,
we would just sample from the encoder's output distribution, usually a multivariate Gaussian with diagonal covariance. 
Our new approach, however, utilizes $K$ embedding vectors which are learned during training and stored within the codebook. 
We seek the index $k^{*}$ of the embedding vector closest to $z_{e}(x)$ under the L2 norm, or: </p>

\[k^{*}=\text{argmin}_{j}||z_{e}(x)-e_{j}||_{2}\]
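As a sketch, this lookup is nothing more than a nearest-neighbor search over the codebook (the codebook entries and query below are made up for illustration):

```python
def nearest_code(z_e, codebook):
    """Return the index k* minimizing ||z_e - e_j||_2 over the codebook."""
    def sq_dist(e):
        # squared L2 distance; the minimizer is the same as for the L2 norm
        return sum((a - b) ** 2 for a, b in zip(z_e, e))
    return min(range(len(codebook)), key=lambda j: sq_dist(codebook[j]))

codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]   # K = 3 vectors, D = 2
k_star = nearest_code((0.9, 1.1), codebook)        # -> 1
```

In a real VQ-VAE this search runs per latent vector, and the chosen $e_{k^{*}}$ replaces $z_{e}(x)$ on the way into the decoder.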

<p>Suppose we have a latent embedding space $e\in \mathbb{R}^{K\times D}$ where $K$ is the size of the discrete latent space (also the number of our embedding vectors $e_{j}$!)
and $D$ is the dimension of each embedding vector $e_{j}$. Then, we have that </p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Formula of The Month</title><link href="https://javier-cramirez.github.io/2024/03/10/Formula.html" rel="alternate" type="text/html" title="Formula of The Month" /><published>2024-03-10T00:00:00+00:00</published><updated>2024-03-10T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/03/10/Formula</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/03/10/Formula.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p align="center"><u>Formula of the Month:</u></p>
<p align="center">
  The Lagrange basis function 
  <br />
  $$\displaystyle L_{i}(x)=\prod^{N}_{j=1,\ j\neq i}\frac{x-x_{j}}{x_{i}-x_{j}}$$
  This bad boy follows from the following result:
  <br />
  Suppose we have $f:[a,b]\rightarrow\mathbb{R}$ and $x_{1} &lt; \dots &lt; x_{N} \in [a,b]$. Then, there exists a unique polynomial, $p(x)$, with degree no greater than $N-1$ that interpolates $f$ at the points $x_{1},\dots , x_{N}$. This interpolating polynomial can be expressed as:
  <br />
  $$\displaystyle p(x)=\sum^{N}_{i=1}f(x_{i})L_{i}(x)$$ 
  Expanding this formula, we can see that:
  $$\displaystyle L_{i}(x)=\frac{(x-x_{1})\cdots (x-x_{i-1})(x-x_{i+1})\cdots (x-x_{N})}{(x_{i}-x_{1})\cdots (x_{i}-x_{i-1})(x_{i}-x_{i+1})\cdots (x_{i}-x_{N})}$$
  <br />
  Why should we expect polynomials to do this job at all? I am referring to the Weierstrass Approximation Theorem, which (very informally) states that every continuous function on a closed interval is *almost* a polynomial. This seemingly harmless theorem has paved the way for many fields involving computational mathematics. Approximation is king; however, not every function is approximated equally. 
    <br />
    Take, for example, the Runge phenomenon. This is a classic example where vanilla polynomial interpolation *simply fails*. At small degrees, our approximation is going nicely, like a drive up Sedona. Unfortunately, as we increase the degree of our polynomials on equally spaced nodes, the error grows without bound (wild oscillations occur near the endpoints). I leave you with the following illustration: 
</p>
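(First, if you want to reproduce the oscillations yourself, here is a small sketch of $L_{i}(x)$ and $p(x)$ applied to Runge's function $f(x)=1/(1+25x^{2})$ on equally spaced nodes; the illustration follows below.)

```python
def lagrange_basis(nodes, i, x):
    """L_i(x) = prod over j != i of (x - x_j) / (x_i - x_j)."""
    out = 1.0
    for j, xj in enumerate(nodes):
        if j != i:
            out *= (x - xj) / (nodes[i] - xj)
    return out

def interpolate(nodes, values, x):
    """p(x) = sum over i of f(x_i) * L_i(x)."""
    return sum(v * lagrange_basis(nodes, i, x) for i, v in enumerate(values))

f = lambda x: 1.0 / (1.0 + 25.0 * x * x)           # Runge's function
nodes = [-1.0 + 2.0 * k / 10 for k in range(11)]   # 11 equally spaced nodes on [-1, 1]
vals = [f(x) for x in nodes]

# p reproduces f exactly at the nodes, but the error between nodes
# near the endpoints dwarfs the error in the middle of the interval.
err_mid = abs(interpolate(nodes, vals, 0.05) - f(0.05))
err_edge = abs(interpolate(nodes, vals, 0.95) - f(0.95))
```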

<p align="center"><img src="https://www.mscroggs.co.uk/img/full/runge-uniform.gif" /></p>
<p><br />
<a align="center" href="https://www.mscroggs.co.uk">Credit to Matthew Scroggs.</a></p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Variational Inference: First Principles</title><link href="https://javier-cramirez.github.io/2024/03/08/your-new-blog-post.html" rel="alternate" type="text/html" title="Variational Inference: First Principles" /><published>2024-03-08T00:00:00+00:00</published><updated>2024-03-08T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/03/08/your-new-blog-post</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/03/08/your-new-blog-post.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p align="center"><img src="https://64.media.tumblr.com/122cb8fcdabd68832c61b62a403bf49c/9eb1947e2ed393cf-ee/s540x810/06c37a959200146a91c2799c5175f6a9956276ae.jpg" /></p>

<p>&nbsp; In Bayesian statistics, we are often interested in making inferences about new data based on what we already have. Equivalently, it is the posterior distribution (which encodes our uncertainty about yet-to-be-observed variables) that is our topic of study. From Bayes' theorem, we set up the problem: </p>

\[\displaystyle p(z|x;\theta)=\frac{p(x|z;\theta)p(z;\theta)}{p(x;\theta)}\]

<p> This just reinforces that the posterior is proportional to the likelihood times the prior. Note that $p(x;\theta)=\int p(x|z;\theta)p(z;\theta) \, dz$ is the evidence, an integral that can be very high-dimensional (and hence intractable). This is *no bueno* and is the source of many headaches, but also of some cool algorithms. </p>
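When the model is simple enough, we can sanity-check that integral by naive Monte Carlo over the prior, $p(x;\theta)\approx\frac{1}{S}\sum_{s}p(x|z_{s};\theta)$ with $z_{s}\sim p(z;\theta)$. A toy sketch with a 1-D Gaussian model, chosen (by me, for illustration) so the evidence has a closed form to compare against:

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1).
# Here the evidence is known exactly: p(x) = N(x; 0, 2).
x = 0.7
S = 200_000
estimate = sum(normal_pdf(x, random.gauss(0, 1), 1.0) for _ in range(S)) / S
exact = normal_pdf(x, 0.0, 2.0)
```

In realistic models there is no such closed form and this naive estimator has hopeless variance, which is exactly the gap variational inference fills.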

<blockquote>
  <p>Variational inference seeks to give better approximations when our posterior density is not so tractable. </p>
</blockquote>

<p>Let me make your life worse by presenting a set of local variational parameters $\phi_{i}$ which belong to $q(z|x_{i};\phi_{i})$. This pretty much says that the distribution of the latent vector $z$ given each observation $x_{i}$ is governed by its own local parameters. Now, given that we also have global parameters $\theta$, we could possibly update each $\phi_{i}$ upon each successive observation. As we update, we can get closer and closer to our global $\theta$. More on this later.</p>

<p>One of the central approaches to tackling this optimization problem is through the Kullback-Leibler (KL) divergence. Horribly informally, it measures the dissimilarity between our approximating distribution $q(z)$ and the true density $p(z)$. Now, we have the objective function:</p>

\[\displaystyle \mathcal{C}=\sum^{N}_{i=1}D_{KL}(q(z|x_{i};\phi_{i})\ ||\ p(z|x_{i};\theta))\]

<p>where $D_{KL}(q\ ||\ p)=\mathbb{E}_{q}[\log q(z|x_{i};\phi_{i})-\log p(z|x_{i};\theta)]$</p>
<p><br /></p>
<p>This is just the expectation w.r.t. $q$ of the difference between the log densities.
However, the forward $D_{KL}$ involves the intractable posterior directly, so it does not yield a closed form. So we turn to approximations. Now, with a little bit of algebra, observe the following: </p>

\[\displaystyle 
\begin{align} 
\mathcal{C}&=\sum^{N}_{i=1} \mathbb{E}_{q}\left[\log q(z|x_{i};\phi_{i})-\log p(z|x_{i};\theta)\right] 
\\
&=\sum^{N}_{i=1}\mathbb{E}_{q}\left[\log q(z|x_{i};\phi_{i})-\log\frac{p(x_{i},z;\theta)}{p(x_{i};\theta)}\right]
\\
&=\sum^{N}_{i=1}\mathbb{E}_{q}\left[\log q(z|x_{i};\phi_{i})-\log p(x_{i},z;\theta)\right]+\sum^{N}_{i=1}\mathbb{E}_{q}\left[\log p(x_{i};\theta)\right]
\end{align}\]

<p>The mean field approximation assumes that our variational posterior is fully factorizable</p>

\[\displaystyle q(z_{1},\dots,z_{N})=\prod^{N}_{k=1}q(z_{k})\]

<p>by partitioning elements of $z$ into disjoint groupings $z_{k}$.</p>
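As a tiny illustration of what that factorization costs, here is a sketch comparing a correlated discrete joint $p(z_{1},z_{2})$ against the product of its own marginals, the simplest factorized candidate (the joint's values are made up):

```python
import math

# A correlated joint over two binary latents (illustrative values).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals of p; their product is a fully factorized q(z1, z2) = q1(z1) q2(z2).
q1 = {a: sum(v for (i, _), v in p.items() if i == a) for a in (0, 1)}
q2 = {b: sum(v for (_, j), v in p.items() if j == b) for b in (0, 1)}
q = {(a, b): q1[a] * q2[b] for a in (0, 1) for b in (0, 1)}

def kl(p, q):
    """D_KL(p || q) = sum over z of p(z) * log(p(z) / q(z))."""
    return sum(pz * math.log(pz / q[z]) for z, pz in p.items() if pz > 0)

gap = kl(p, q)   # strictly positive: a factorized q cannot capture the correlation
```

(Mean-field VI actually optimizes the reverse direction, $D_{KL}(q\ ||\ p)$, over the factorized family; the point here is only that factorization throws away correlation, and KL measures how much.)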

<p align="center"><img src="https://media.tenor.com/BXQgJskV7LgAAAAj/9999.gif" /></p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry></feed>