<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://javier-cramirez.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://javier-cramirez.github.io/" rel="alternate" type="text/html" /><updated>2025-03-12T06:03:46+00:00</updated><id>https://javier-cramirez.github.io/feed.xml</id><title type="html">javi</title><subtitle>My official website.</subtitle><entry><title type="html">Laplacians and Spectral Clustering</title><link href="https://javier-cramirez.github.io/2024/05/08/LaplacianMatrices.html" rel="alternate" type="text/html" title="Laplacians and Spectral Clustering" /><published>2024-05-08T00:00:00+00:00</published><updated>2024-05-08T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/05/08/LaplacianMatrices</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/05/08/LaplacianMatrices.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p>When thinking of partitioning data points into groups, we intuitively turn to clustering. A very classical example is K-means.
Briefly, the algorithm seeks to minimize the sum of squared Euclidean distances between the points in each cluster $C$ and that cluster's mean. Of course, partitioning the data requires
that (1) the union of all clusters is equal to the data set and (2) none of these clusters overlap. </p>
<p>The algorithm is very often associated with Lloyd's algorithm: select $k$ distinct initial means, assign each point to the closest cluster mean $c_{i}$, recompute each mean as the centroid of its assigned points, and repeat. Very informally, of course. However, these traditional methods favor compact, roughly spherical clusters, leading to worse performance on more unusual cluster shapes. As it turns out, spectral clustering is a very nice method leveraging the eigenvalues and eigenvectors of a special structure called the similarity matrix. For this, we must turn to principles from graph theory. </p>
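Very informally indeed, so here is a minimal pure-Python sketch of a single Lloyd iteration (the helper name <code>lloyd_step</code> and the toy points are mine, purely illustrative):

```python
def lloyd_step(points, means):
    """One Lloyd iteration: assign each point to its nearest mean,
    then recompute each mean as the centroid of its cluster."""
    k = len(means)
    clusters = [[] for _ in range(k)]
    for p in points:
        # index of the mean with the smallest squared Euclidean distance to p
        i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
        clusters[i].append(p)
    new_means = []
    for j, c in enumerate(clusters):
        if c:
            new_means.append(tuple(sum(x) / len(c) for x in zip(*c)))
        else:
            new_means.append(means[j])  # keep an empty cluster's mean unchanged
    return new_means, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
means, clusters = lloyd_step(points, [(0.0, 0.0), (5.0, 5.0)])
```

Iterating `lloyd_step` until the assignments stop changing is the whole algorithm.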
<p>Define the graph $G=(V,E)$ where $E\subseteq \binom{V}{2}$, with $E$ a set of edges and $V$ a set of vertices. $\binom{V}{2}$ is the set of all 2-element subsets of $V$ given that $V$ is finite. We say that $u\in V$ and $v\in V$ are adjacent if $\{u,v\}\in E$. This works both ways, which also means $\{v,u\}\in E$. To define the degree of a vertex, first note that the neighborhood $N(v)$ of a vertex $v\in V$ is the set of all vertices adjacent to $v$. Thus, $\text{deg}(v)=|N(v)|$. The adjacency matrix, $\mathbf{A}$, has its $(i,j)$-th entry set to $1$ if $v_{i}$ is adjacent to $v_{j}$ and $0$ otherwise. Next, define the incidence matrix, $\mathbf{M}$, as having its $(i,j)$-th entry set to $1$ if $v_{i}\in e_{j}$ and $0$ otherwise. Generally, if $v\in e$, then we say that $v$ is incident with $e$ and vice versa. </p>
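To make these definitions concrete, here is a quick sketch that builds $\mathbf{A}$ and $\mathbf{M}$ for a tiny graph (the edge list is made up) and recovers the degrees as row sums:

```python
# A small undirected graph on vertices 0..3 (this edge list is illustrative).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Adjacency matrix: A[i][j] = 1 iff {v_i, v_j} is an edge (symmetric,
# since adjacency works both ways).
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

# Incidence matrix: M[i][j] = 1 iff v_i is incident with edge e_j.
M = [[1 if i in e else 0 for e in edges] for i in range(n)]

# deg(v_i) = |N(v_i)| = the i-th row sum of A.
deg = [sum(row) for row in A]
```

Note that each column of $\mathbf{M}$ sums to $2$, since every edge is incident with exactly two vertices.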

<p>Further note that $\sum^{n}_{j=1}A(i,j)=\text{deg}(v_{i})=\sum^{n}_{j=1}A(j,i)$. Now, define $\mathbf{D}$ as the diagonal matrix with entries $\mathbf{D}(i,i)=\text{deg}(v_{i})$. We define the unnormalized Laplacian as $\mathbf{L}=\mathbf{D}-\mathbf{A}$. Notable properties of $\mathbf{L}$ are that (1) its smallest eigenvalue, $\lambda_{1}$, is $0$, with the constant eigenvector $\mathbb{1}$, (2) the multiplicity of the eigenvalue $0$ is equal to the number of connected components of $G$, and (3) it has $n$ non-negative real-valued eigenvalues, which we list in increasing order. Did I mention $\mathbf{L}$ is positive semi-definite and symmetric?</p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Tensor.scatter_() for Dummies</title><link href="https://javier-cramirez.github.io/2024/03/15/HowToTensorScatter.html" rel="alternate" type="text/html" title="Tensor.scatter_() for Dummies" /><published>2024-03-15T00:00:00+00:00</published><updated>2024-03-15T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/03/15/HowToTensorScatter</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/03/15/HowToTensorScatter.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p>This little function utilizes the parameters <code>dim</code>, <code>index</code>, <code>src</code>, and <code>reduce</code>.
<br /></p>
<blockquote>
  <p><code>Tensor.scatter_()</code> essentially uses the information from <code>index</code> to place <code>src</code> into our beloved <code>Tensor</code>.</p>
</blockquote>

<p>Suppose we have the following code:</p>
<pre><code class="language-python">src = torch.arange(1, 11).reshape((2, 5))
</code></pre>
<p>Our tensor looks like:</p>
<pre><code class="language-python">&gt;&gt; tensor([[ 1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10]])
</code></pre>
<p>Now define the <code>index</code> and <code>target</code> tensors (note that <code>index</code> must have the same number of dimensions as <code>src</code>, so we make it $1\times 5$):</p>
<pre><code class="language-python">idx = torch.tensor([[2, 2, 2, 2, 2]])
target = torch.zeros(3, 5, dtype=src.dtype)
</code></pre>
<p>Our <code>target</code> tensor will be the victim of our <code>scatter_()</code> bloodbath. It is simply a $3\times 5$ tensor of all zeroes. Meanwhile, <code>idx</code> will define “where” we place the elements of <code>src</code> into <code>target</code>. Here, it essentially acts as a middleman. Let’s see what happens if we call <code>scatter_()</code> with <code>dim=0</code> to keep things simple:</p>

<pre><code class="language-python">target.scatter_(0, idx, src)
&gt;&gt; tensor([[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [1, 2, 3, 4, 5]])
</code></pre>
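Under the hood, the <code>dim=0</code> rule from the PyTorch docs is <code>self[index[i][j]][j] = src[i][j]</code>, iterating over the shape of <code>index</code>. Here is a pure-Python sketch of that bookkeeping (plain nested lists, not the real implementation) reproducing the result above:

```python
def scatter_dim0(target, index, src):
    """Apply target[index[i][j]][j] = src[i][j] for every (i, j)
    in index's shape -- the dim=0 rule of Tensor.scatter_()."""
    for i, row in enumerate(index):
        for j, dest in enumerate(row):
            target[dest][j] = src[i][j]
    return target

src = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
idx = [[2, 2, 2, 2, 2]]          # only one row, so only src's first row is read
target = [[0] * 5 for _ in range(3)]
scatter_dim0(target, idx, src)   # target's third row becomes [1, 2, 3, 4, 5]
```

Since the loops range over <code>index</code>'s shape, any element of <code>src</code> that <code>index</code> never points at is simply never copied.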
<p>Mama mia. We have essentially moved the first row of <code>src</code> into the third row of <code>target</code>. Notice that the second row of <code>src</code> is nowhere to be found. This is because we specified <code>idx</code> as a $1\times 5$ tensor, not as a $2\times 5$ tensor.</p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Vector Quantized Variational Autoencoders</title><link href="https://javier-cramirez.github.io/2024/03/13/VQVAEs.html" rel="alternate" type="text/html" title="Vector Quantized Variational Autoencoders" /><published>2024-03-13T00:00:00+00:00</published><updated>2024-03-13T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/03/13/VQVAEs</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/03/13/VQVAEs.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p><strong>IN PROGRESS</strong></p>

<p>Discretization of the gradient? The big catch with a VQ-VAE is that it performs *just as well* as its continuous counterparts.</p>

<p>&nbsp;Vector quantisation is essentially finding, for some vector $x$ you want to encode, the template vector $e_{i}$ that minimizes the squared loss to $x$. 
Our "codebook" is the set of all these template vectors $e_{i}$. However, this does not scale well if we desire accuracy, as that requires large codebooks. 
This is where our variational autoencoder comes along. It's all about compressing the input to a lower-dimensional latent space. But, of 
course, here we are quantizing latent representations (not the input itself). </p>

<p>As output, the encoder gives us a continuous latent distribution $z_{e}(x)$ which we need to discretize. In vanilla VAEs,
we would just sample from the encoder's output distribution, usually a multivariate Gaussian with diagonal covariance. 
Our new approach, however, utilizes $K$ embedding vectors which are learned during training and stored within the codebook. 
We seek the index $k^{*}$ of the embedding vector closest to $z_{e}(x)$ under the L2 norm, or: </p>

\[k^{*}=\text{argmin}_{j}||z_{e}(x)-e_{j}||_{2}\]
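As a sketch, this lookup is nothing more than a nearest-neighbor search over the codebook (the codebook entries and query below are made up for illustration):

```python
def nearest_code(z_e, codebook):
    """Return the index k* minimizing ||z_e - e_j||_2 over the codebook."""
    def sq_dist(e):
        # squared L2 distance; the minimizer is the same as for the L2 norm
        return sum((a - b) ** 2 for a, b in zip(z_e, e))
    return min(range(len(codebook)), key=lambda j: sq_dist(codebook[j]))

codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]   # K = 3 vectors, D = 2
k_star = nearest_code((0.9, 1.1), codebook)        # -> 1
```

In a real VQ-VAE this search runs per latent vector, and the chosen $e_{k^{*}}$ replaces $z_{e}(x)$ on the way into the decoder.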

<p>Suppose we have a latent embedding space $e\in \mathbb{R}^{K\times D}$ where $K$ is the size of the discrete latent space (also the number of our embedding vectors $e_{j}$!)
and $D$ is the dimension of each embedding vector $e_{j}$. Then, we have that </p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Formula of The Month</title><link href="https://javier-cramirez.github.io/2024/03/10/Formula.html" rel="alternate" type="text/html" title="Formula of The Month" /><published>2024-03-10T00:00:00+00:00</published><updated>2024-03-10T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/03/10/Formula</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/03/10/Formula.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p align="center"><u>Formula of the Month:</u></p>
<p align="center">
  The Lagrange basis function 
  <br />
  $$\displaystyle L_{i}(x)=\prod^{N}_{j=1,\ j\neq i}\frac{x-x_{j}}{x_{i}-x_{j}}$$
  This bad boy follows from the following result:
  <br />
  Suppose we have $f:[a,b]\rightarrow\mathbb{R}$ and $x_{1} &lt; \dots &lt; x_{N} \in [a,b]$. Then, there exists a unique polynomial, $p(x)$, with degree no greater than $N-1$ that interpolates $f$ at the points $x_{1},\dots , x_{N}$. This interpolating polynomial can be expressed as:
  <br />
  $$\displaystyle p(x)=\sum^{N}_{i=1}f(x_{i})L_{i}(x)$$ 
  Expanding this formula, we can see that:
  $$\displaystyle L_{i}(x)=\frac{(x-x_{1})\cdots (x-x_{i-1})(x-x_{i+1})\cdots (x-x_{N})}{(x_{i}-x_{1})\cdots (x_{i}-x_{i-1})(x_{i}-x_{i+1})\cdots (x_{i}-x_{N})}$$
  <br />
  Why should we expect polynomials to do this job at all? I am referring to the Weierstrass Approximation Theorem, which (very informally) states that every continuous function on a closed interval is *almost* a polynomial. This seemingly harmless theorem has paved the way for many fields involving computational mathematics. Approximation is king; however, not every function is approximated equally. 
    <br />
    Take, for example, the Runge phenomenon. This is a classic example where vanilla polynomial interpolation *simply fails*. At small degrees, our approximation is going nicely, like a drive up Sedona. Unfortunately, as we increase the degree of our polynomials on equally spaced nodes, the error grows without bound (wild oscillations occur near the endpoints). I leave you with the following illustration: 
</p>
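(First, if you want to reproduce the oscillations yourself, here is a small sketch of $L_{i}(x)$ and $p(x)$ applied to Runge's function $f(x)=1/(1+25x^{2})$ on equally spaced nodes; the illustration follows below.)

```python
def lagrange_basis(nodes, i, x):
    """L_i(x) = prod over j != i of (x - x_j) / (x_i - x_j)."""
    out = 1.0
    for j, xj in enumerate(nodes):
        if j != i:
            out *= (x - xj) / (nodes[i] - xj)
    return out

def interpolate(nodes, values, x):
    """p(x) = sum over i of f(x_i) * L_i(x)."""
    return sum(v * lagrange_basis(nodes, i, x) for i, v in enumerate(values))

f = lambda x: 1.0 / (1.0 + 25.0 * x * x)           # Runge's function
nodes = [-1.0 + 2.0 * k / 10 for k in range(11)]   # 11 equally spaced nodes on [-1, 1]
vals = [f(x) for x in nodes]

# p reproduces f exactly at the nodes, but the error between nodes
# near the endpoints dwarfs the error in the middle of the interval.
err_mid = abs(interpolate(nodes, vals, 0.05) - f(0.05))
err_edge = abs(interpolate(nodes, vals, 0.95) - f(0.95))
```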

<p align="center"><img src="https://www.mscroggs.co.uk/img/full/runge-uniform.gif" /></p>
<p><br />
<a align="center" href="https://www.mscroggs.co.uk">Credit to Matthew Scroggs.</a></p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Variational Inference: First Principles</title><link href="https://javier-cramirez.github.io/2024/03/08/your-new-blog-post.html" rel="alternate" type="text/html" title="Variational Inference: First Principles" /><published>2024-03-08T00:00:00+00:00</published><updated>2024-03-08T00:00:00+00:00</updated><id>https://javier-cramirez.github.io/2024/03/08/your-new-blog-post</id><content type="html" xml:base="https://javier-cramirez.github.io/2024/03/08/your-new-blog-post.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']]
  },
  svg: {
    fontCache: 'global'
  }
};
</script>

<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

<style> body { font-family: "Roboto Mono", monospace; } </style>

<p align="center"><img src="https://64.media.tumblr.com/122cb8fcdabd68832c61b62a403bf49c/9eb1947e2ed393cf-ee/s540x810/06c37a959200146a91c2799c5175f6a9956276ae.jpg" /></p>

<p>&nbsp; In Bayesian statistics, we are often interested in making inferences about new data based on what we already have. Equivalently, it is the posterior distribution (which encodes our uncertainty about yet-to-be-observed variables) that is our topic of study. From Bayes' theorem, we set up the problem: </p>

\[\displaystyle p(z|x;\theta)=\frac{p(x|z;\theta)p(z;\theta)}{p(x;\theta)}\]

<p> This just reinforces that the posterior is proportional to the likelihood times the prior. Note that $p(x;\theta)=\int p(x|z;\theta)p(z;\theta) \, dz$ is the evidence, an integral that can be very high-dimensional (and hence intractable). This is *no bueno* and is the source of many headaches, but also of some cool algorithms. </p>
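When the model is simple enough, we can sanity-check that integral by naive Monte Carlo over the prior, $p(x;\theta)\approx\frac{1}{S}\sum_{s}p(x|z_{s};\theta)$ with $z_{s}\sim p(z;\theta)$. A toy sketch with a 1-D Gaussian model, chosen (by me, for illustration) so the evidence has a closed form to compare against:

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1).
# Here the evidence is known exactly: p(x) = N(x; 0, 2).
x = 0.7
S = 200_000
estimate = sum(normal_pdf(x, random.gauss(0, 1), 1.0) for _ in range(S)) / S
exact = normal_pdf(x, 0.0, 2.0)
```

In realistic models there is no such closed form and this naive estimator has hopeless variance, which is exactly the gap variational inference fills.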

<blockquote>
  <p>Variational inference seeks to give better approximations when our posterior density is not so tractable. </p>
</blockquote>

<p>Let me make your life worse by presenting a set of local variational parameters $\phi_{i}$ which belong to $q(z|x_{i};\phi_{i})$. This pretty much says that the distribution of the latent vector $z$ given each observation $x_{i}$ is governed by its own local parameters. Now, given that we also have global parameters $\theta$, we could possibly update each $\phi_{i}$ upon each successive observation. As we update, we can get closer and closer to our global $\theta$. More on this later.</p>

<p>One of the central approaches to tackling this optimization problem is through the Kullback-Leibler (KL) divergence. Horribly informally, it measures the dissimilarity between our approximating distribution $q(z)$ and the true density $p(z)$. Now, we have the objective function:</p>

\[\displaystyle \mathcal{C}=\sum^{N}_{i=1}D_{KL}(q(z|x_{i};\phi_{i})\ ||\ p(z|x_{i};\theta))\]

<p>where $D_{KL}(q\ ||\ p)=\mathbb{E}_{q}[\log q(z|x_{i};\phi_{i})-\log p(z|x_{i};\theta)]$</p>
<p><br /></p>
<p>This is just the expectation w.r.t. $q$ of the difference between the log densities.
However, the forward $D_{KL}$ involves the intractable posterior directly, so it does not yield a closed form. So we turn to approximations. Now, with a little bit of algebra, observe the following: </p>

\[\displaystyle 
\begin{align} 
\mathcal{C}&=\sum^{N}_{i=1} \mathbb{E}_{q}\left[\log q(z|x_{i};\phi_{i})-\log p(z|x_{i};\theta)\right] 
\\
&=\sum^{N}_{i=1}\mathbb{E}_{q}\left[\log q(z|x_{i};\phi_{i})-\log\frac{p(x_{i},z;\theta)}{p(x_{i};\theta)}\right]
\\
&=\sum^{N}_{i=1}\mathbb{E}_{q}\left[\log q(z|x_{i};\phi_{i})-\log p(x_{i},z;\theta)\right]+\sum^{N}_{i=1}\mathbb{E}_{q}\left[\log p(x_{i};\theta)\right]
\end{align}\]

<p>The mean field approximation assumes that our variational posterior is fully factorizable</p>

\[\displaystyle q(z_{1},\dots,z_{N})=\prod^{N}_{k=1}q(z_{k})\]

<p>by partitioning elements of $z$ into disjoint groupings $z_{k}$.</p>
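As a tiny illustration of what that factorization costs, here is a sketch comparing a correlated discrete joint $p(z_{1},z_{2})$ against the product of its own marginals, the simplest factorized candidate (the joint's values are made up):

```python
import math

# A correlated joint over two binary latents (illustrative values).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals of p; their product is a fully factorized q(z1, z2) = q1(z1) q2(z2).
q1 = {a: sum(v for (i, _), v in p.items() if i == a) for a in (0, 1)}
q2 = {b: sum(v for (_, j), v in p.items() if j == b) for b in (0, 1)}
q = {(a, b): q1[a] * q2[b] for a in (0, 1) for b in (0, 1)}

def kl(p, q):
    """D_KL(p || q) = sum over z of p(z) * log(p(z) / q(z))."""
    return sum(pz * math.log(pz / q[z]) for z, pz in p.items() if pz > 0)

gap = kl(p, q)   # strictly positive: a factorized q cannot capture the correlation
```

(Mean-field VI actually optimizes the reverse direction, $D_{KL}(q\ ||\ p)$, over the factorized family; the point here is only that factorization throws away correlation, and KL measures how much.)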

<p align="center"><img src="https://media.tenor.com/BXQgJskV7LgAAAAj/9999.gif" /></p>]]></content><author><name></name></author><category term="Other" /><summary type="html"><![CDATA[]]></summary></entry></feed>