<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://rupesh4604.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://rupesh4604.github.io//" rel="alternate" type="text/html" /><updated>2026-05-07T12:46:36+00:00</updated><id>https://rupesh4604.github.io//feed.xml</id><title type="html">Rupesh’s Blog</title><subtitle>A personal blog on technology, research, books, and more.</subtitle><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><entry><title type="html">Hello, blog</title><link href="https://rupesh4604.github.io//2026/04/15/welcome-to-my-blog/" rel="alternate" type="text/html" title="Hello, blog" /><published>2026-04-15T00:00:00+00:00</published><updated>2026-04-15T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/04/15/welcome-to-my-blog</id><content type="html" xml:base="https://rupesh4604.github.io//2026/04/15/welcome-to-my-blog/"><![CDATA[<p>I’ve been meaning to start a blog for a while. The goal is to write about things I’m actually working through — not polished explainers, but the kind of notes I’d write for myself with a slightly more careful eye toward structure.</p>

<h2 id="what-to-expect">What to expect</h2>

<p><strong>Technical posts</strong> — ML papers I’m reading, algorithms from courses, implementation notes. I’ll show code and explain the intuition behind it rather than just dropping formulas.</p>

<p><strong>Book notes</strong> — I read a lot, especially in rationality, cognitive science, and history of science. I’ll share what stuck and why.</p>

<p><strong>Series</strong> — Some topics don’t fit in one post. For those, I’ll write a series: ordered parts that build on each other. You’ll see a navigation box at the top of each part showing the full thread and where you are in it.</p>

<h2 id="how-the-site-works">How the site works</h2>

<ul>
  <li><strong>Archive</strong> — all posts by date</li>
  <li><strong>Tags</strong> — browse by topic; clicking a tag on the home page filters the feed</li>
  <li><strong>Series</strong> — standalone page listing all multi-part threads</li>
  <li><strong>Search</strong> — press <code class="language-text highlighter-rouge">/</code> (or click the icon) for full-text search across all posts</li>
</ul>

<p>All posts are written in Markdown, the site is static, and it’s hosted for free on GitHub Pages.</p>

<p>Let’s see how this goes.</p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="meta" /><summary type="html"><![CDATA[A quick note on what this space is for and how it's organised.]]></summary></entry><entry><title type="html">Understanding Transformers — Part 3: Positional Encoding</title><link href="https://rupesh4604.github.io//2026/03/10/transformers-part-3-positional-encoding/" rel="alternate" type="text/html" title="Understanding Transformers — Part 3: Positional Encoding" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/03/10/transformers-part-3-positional-encoding</id><content type="html" xml:base="https://rupesh4604.github.io//2026/03/10/transformers-part-3-positional-encoding/"><![CDATA[<p>Here’s a surprising property of the attention mechanism we built: <strong>it’s entirely position-blind</strong>.</p>

<p>Given the sequence <code class="language-text highlighter-rouge">[A, B, C]</code> or <code class="language-text highlighter-rouge">[C, A, B]</code>, the self-attention output for token <code class="language-text highlighter-rouge">A</code> is identical — because dot products don’t care about order, only similarity.</p>

<p>For language, order is everything. We need to inject position information before the model can distinguish “dog bites man” from “man bites dog.”</p>
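
<p>You can verify the position-blindness numerically. Below is a minimal sketch (no learned projections, just <code class="language-text highlighter-rouge">softmax(XX^T/sqrt(d)) X</code>, and the variable names are mine) showing that reordering the sequence leaves each token's output unchanged:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))        # embeddings for tokens [A, B, C]

def self_attn(X):
    # attention with identity projections: softmax(X X^T / sqrt(d)) X
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

out_abc = self_attn(X)             # A's output is row 0
out_cab = self_attn(X[[2, 0, 1]])  # reordered as [C, A, B]; A is now row 1
print(np.allclose(out_abc[0], out_cab[1]))   # True: A's output is identical
</code></pre></div></div>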

<h2 id="two-approaches">Two approaches</h2>

<ol>
  <li>
    <p><strong>Learned positional embeddings</strong> — add a trainable embedding vector $p_i$ for each position $i$. Simple, flexible, but can’t generalise beyond the training sequence length.</p>
  </li>
  <li>
    <p><strong>Sinusoidal positional encoding</strong> (Vaswani et al.’s choice) — deterministic function of position and dimension. Can extrapolate beyond training length and has a beautiful structure.</p>
  </li>
</ol>

<p>The original paper uses sinusoidal. Let’s understand why.</p>

<h2 id="the-sinusoidal-encoding">The sinusoidal encoding</h2>

<p>For position $pos$ and dimension $i$ (out of $d_{model}$ total):</p>

\[PE_{(pos,\ 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\ 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]

<p>Even dimensions get sine, odd dimensions get cosine, with the frequency decreasing as $i$ increases.</p>

<h2 id="intuition-a-continuous-ruler">Intuition: a continuous “ruler”</h2>

<p>Think of it like a clock — but instead of one hand rotating at one frequency, you have $d_{model}/2$ pairs of hands, each rotating at a different frequency.</p>

<ul>
  <li><strong>High-frequency dimensions</strong> (small $i$) — change rapidly, distinguish nearby positions</li>
  <li><strong>Low-frequency dimensions</strong> (large $i$) — change slowly, encode coarse position (early vs. late in sequence)</li>
</ul>

<p>Together, every position $pos$ gets a unique fingerprint across all frequencies.</p>

<h2 id="why-10000">Why 10000?</h2>

<p>The base 10000 ensures the lowest frequency completes roughly one cycle over sequences of length ~62,800 ($2\pi \times 10000$). For realistic sequences (&lt; 10K tokens), the lowest-frequency dimensions therefore stay in a nearly-linear regime of the sinusoid — which empirically helps the model interpolate smoothly between positions.</p>
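
<p>To make that concrete: the wavelength of dimension pair $i$ is $2\pi \cdot 10000^{2i/d_{model}}$. A quick sketch of the range this spans for $d_{model} = 512$:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

d_model = 512
for i in [0, 64, 128, 255]:                       # dimension-pair index
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"pair {i:3d}: wavelength ~ {wavelength:,.0f} positions")
# pair   0: wavelength ~ 6 positions
# pair  64: wavelength ~ 63 positions
# pair 128: wavelength ~ 628 positions
# pair 255: wavelength ~ 60,611 positions
</code></pre></div></div>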

<h2 id="the-key-property-relative-positions">The key property: relative positions</h2>

<p>Here’s the elegant part. For any fixed offset $k$:</p>

\[PE_{pos+k} = f(PE_{pos})\]

<p>…where $f$ is a <em>linear transformation</em> (a block-diagonal stack of 2×2 rotations, one per frequency) that depends only on $k$, not on $pos$.</p>

<p>This means the model can, in principle, learn to attend to “two positions to the right” by learning the appropriate linear transformation of the key — without ever seeing $pos$ explicitly. Relative relationships are linearly recoverable.</p>
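
<p>This is easy to check numerically: each (sin, cos) dimension pair at frequency $\omega$ advances by a fixed 2×2 rotation when the position shifts by $k$. A sketch (helper names are mine):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import torch

def pe_pair(pos, omega):
    # one (sin, cos) dimension pair of the encoding at frequency omega
    return torch.tensor([math.sin(pos * omega), math.cos(pos * omega)])

omega, k = 1 / 10000 ** 0.25, 7    # one mid-range frequency, fixed offset
rot = torch.tensor([[ math.cos(k * omega), math.sin(k * omega)],
                    [-math.sin(k * omega), math.cos(k * omega)]])

for pos in [0, 5, 42]:
    assert torch.allclose(rot @ pe_pair(pos, omega),
                          pe_pair(pos + k, omega), atol=1e-6)
# the same `rot` works at every pos: the map depends only on k
</code></pre></div></div>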

<h2 id="implementation">Implementation</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">math</span>

<span class="k">def</span> <span class="nf">positional_encoding</span><span class="p">(</span><span class="n">max_len</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="s">"""
    Returns PE matrix of shape (max_len, d_model).
    Add to token embeddings before the first encoder layer.
    """</span>
    <span class="n">pe</span>  <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
    <span class="n">pos</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_len</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span>   <span class="c1"># (max_len, 1)
</span>
    <span class="c1"># Division term: 1 / 10000^(2i / d_model)
</span>    <span class="n">div</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span> <span class="o">*</span> <span class="o">-</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mf">10000.0</span><span class="p">)</span> <span class="o">/</span> <span class="n">d_model</span><span class="p">)</span>
    <span class="p">)</span>

    <span class="n">pe</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">pos</span> <span class="o">*</span> <span class="n">div</span><span class="p">)</span>   <span class="c1"># even dims
</span>    <span class="n">pe</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cos</span><span class="p">(</span><span class="n">pos</span> <span class="o">*</span> <span class="n">div</span><span class="p">)</span>   <span class="c1"># odd dims
</span>
    <span class="k">return</span> <span class="n">pe</span>   <span class="c1"># (max_len, d_model)
</span>

<span class="k">class</span> <span class="nc">TokenPlusPositionEmbedding</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">5000</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">token_emb</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span>   <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>

        <span class="n">pe</span> <span class="o">=</span> <span class="n">positional_encoding</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'pe'</span><span class="p">,</span> <span class="n">pe</span><span class="p">)</span>   <span class="c1"># not a parameter, but saved with model
</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># x: (batch, seq_len) token indices
</span>        <span class="n">tok</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">token_emb</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">d_model</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">tok</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">pe</span><span class="p">[:</span><span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)])</span>
</code></pre></div></div>

<p>Note the $\sqrt{d_{model}}$ scaling on the token embeddings — this ensures the token signal and the positional signal are on the same order of magnitude.</p>
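
<p>A quick smoke test of the module above with toy sizes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>emb = TokenPlusPositionEmbedding(vocab_size=1000, d_model=512)
tokens = torch.randint(0, 1000, (2, 20))   # (batch=2, seq_len=20)
print(emb(tokens).shape)                   # torch.Size([2, 20, 512])
</code></pre></div></div>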

<h2 id="alternatives-used-in-modern-llms">Alternatives used in modern LLMs</h2>

<p>The Transformer paper’s sinusoidal encoding is elegant but most modern LLMs use variants:</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Used in</th>
      <th>Key idea</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Learned absolute PE</td>
      <td>BERT, GPT-2</td>
      <td>Trainable vectors per position</td>
    </tr>
    <tr>
      <td>Relative PE (Shaw et al.)</td>
      <td>T5, Music Transformer</td>
      <td>Bias attention scores by relative distance</td>
    </tr>
    <tr>
      <td>RoPE (Su et al.)</td>
      <td>LLaMA, Mistral, Gemma</td>
      <td>Rotate Q and K vectors; naturally encodes relative position in dot product</td>
    </tr>
    <tr>
      <td>ALiBi</td>
      <td>BLOOM, MPT</td>
      <td>Add a linear bias to attention scores based on distance</td>
    </tr>
  </tbody>
</table>

<p><strong>RoPE</strong> has become the dominant choice: it encodes relative position directly in the query-key dot product, adds no parameters, and adapts well to context-window extension techniques such as position interpolation — a crucial property for modern long-context models.</p>

<h2 id="wrapping-up-the-series">Wrapping up the series</h2>

<p>We now have all the primitives:</p>
<ul>
  <li><strong>Scaled dot-product attention</strong> (Part 1) — the core lookup mechanism</li>
  <li><strong>Multi-head attention</strong> (Part 2) — parallel specialised heads</li>
  <li><strong>Positional encoding</strong> (Part 3) — injecting order into a permutation-invariant operation</li>
</ul>

<p>The remaining pieces of the Transformer (layer norm, feed-forward sub-layers, encoder-decoder stack, training recipe) build on these foundations. I’ll cover those in a future series on building GPT from scratch.</p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="machine-learning" /><category term="deep-learning" /><category term="nlp" /><category term="transformers" /><summary type="html"><![CDATA[Attention is permutation-invariant — it treats 'the cat sat' identically to 'sat cat the' without help. Positional encoding is the elegant fix. Here's the sinusoidal construction and why it works.]]></summary></entry><entry><title type="html">Understanding Transformers — Part 2: Multi-Head Attention</title><link href="https://rupesh4604.github.io//2026/02/20/transformers-part-2-multi-head/" rel="alternate" type="text/html" title="Understanding Transformers — Part 2: Multi-Head Attention" /><published>2026-02-20T00:00:00+00:00</published><updated>2026-02-20T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/02/20/transformers-part-2-multi-head</id><content type="html" xml:base="https://rupesh4604.github.io//2026/02/20/transformers-part-2-multi-head/"><![CDATA[<p>In <a href="/2026/02/05/transformers-part-1-attention/">Part 1</a>, we built scaled dot-product attention from scratch. A single attention head works — but it can only look at one “type” of relationship at a time.</p>

<p><strong>Multi-head attention</strong> runs $h$ independent attention operations in parallel, each with its own learned projections, then concatenates and re-projects the results.</p>

<h2 id="motivation-one-head-isnt-enough">Motivation: one head isn’t enough</h2>

<p>Consider the sentence: <em>“John said that he hurt himself.”</em></p>

<p>A single attention head must simultaneously track:</p>
<ul>
  <li><code class="language-text highlighter-rouge">he</code> → <code class="language-text highlighter-rouge">John</code> (coreference)</li>
  <li><code class="language-text highlighter-rouge">himself</code> → <code class="language-text highlighter-rouge">John</code> (reflexive)</li>
  <li><code class="language-text highlighter-rouge">hurt</code> → <code class="language-text highlighter-rouge">himself</code> (predicate-argument)</li>
</ul>

<p>These are structurally different relationships. If you force a single head to represent all of them with one weight matrix, it’s forced to compromise.</p>

<p>Multiple heads let the model <strong>specialise</strong>: empirically, different heads in trained models learn to track different syntactic and semantic patterns (coreference, positional proximity, subject-verb agreement, etc.).</p>

<h2 id="the-mechanics">The mechanics</h2>

<p>For each head $i$, we learn three projection matrices:</p>

\[W^Q_i \in \mathbb{R}^{d_{model} \times d_k}, \quad W^K_i \in \mathbb{R}^{d_{model} \times d_k}, \quad W^V_i \in \mathbb{R}^{d_{model} \times d_v}\]

<p>Each head then computes:</p>

\[\text{head}_i = \text{Attention}(X W^Q_i,\ X W^K_i,\ X W^V_i)\]

<p>Outputs are concatenated and projected through $W^O$:</p>

\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\ W^O\]

<p>where $W^O \in \mathbb{R}^{h \cdot d_v \times d_{model}}$.</p>

<h2 id="dimension-arithmetic">Dimension arithmetic</h2>

<p>The original paper uses $d_{model} = 512$, $h = 8$, which gives $d_k = d_v = 512/8 = 64$.</p>

<p>So each head operates on a 64-dim subspace. The total computation is no more expensive than a single head on the full 512 dimensions — we just distribute it.</p>

<h2 id="pytorch-implementation">PyTorch implementation</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">math</span>

<span class="k">class</span> <span class="nc">MultiHeadAttention</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="k">assert</span> <span class="n">d_model</span> <span class="o">%</span> <span class="n">num_heads</span> <span class="o">==</span> <span class="mi">0</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">d_model</span>    <span class="o">=</span> <span class="n">d_model</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_heads</span>  <span class="o">=</span> <span class="n">num_heads</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span>        <span class="o">=</span> <span class="n">d_model</span> <span class="o">//</span> <span class="n">num_heads</span>

        <span class="c1"># Projections for Q, K, V and the output
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">W_q</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_k</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_v</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_o</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
        <span class="c1"># x: (batch, seq, d_model) → (batch, heads, seq, d_k)
</span>        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">B</span> <span class="o">=</span> <span class="n">query</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

        <span class="c1"># Project and split
</span>        <span class="n">Q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_q</span><span class="p">(</span><span class="n">query</span><span class="p">),</span> <span class="n">B</span><span class="p">)</span>   <span class="c1"># (B, h, seq_q, d_k)
</span>        <span class="n">K</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_k</span><span class="p">(</span><span class="n">key</span><span class="p">),</span>   <span class="n">B</span><span class="p">)</span>   <span class="c1"># (B, h, seq_k, d_k)
</span>        <span class="n">V</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_v</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">B</span><span class="p">)</span>   <span class="c1"># (B, h, seq_k, d_k)
</span>
        <span class="c1"># Scaled dot-product attention per head
</span>        <span class="n">scores</span> <span class="o">=</span> <span class="n">Q</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">mask</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">scores</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">mask</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mf">1e9</span><span class="p">)</span>
        <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># (B, h, seq_q, seq_k)
</span>        <span class="n">context</span> <span class="o">=</span> <span class="n">attn_weights</span> <span class="o">@</span> <span class="n">V</span>                    <span class="c1"># (B, h, seq_q, d_k)
</span>
        <span class="c1"># Concatenate heads and project
</span>        <span class="n">context</span> <span class="o">=</span> <span class="n">context</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="n">contiguous</span><span class="p">().</span><span class="n">view</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">d_model</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_o</span><span class="p">(</span><span class="n">context</span><span class="p">),</span> <span class="n">attn_weights</span>
</code></pre></div></div>
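
<p>A quick shape check with toy sizes; for self-attention, query, key, and value are all the same tensor:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)      # (batch, seq, d_model)
out, attn = mha(x, x, x)         # self-attention: Q, K, V all come from x
print(out.shape)                 # torch.Size([2, 10, 512])
print(attn.shape)                # torch.Size([2, 8, 10, 10])
</code></pre></div></div>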

<h2 id="three-variants-of-attention-in-a-transformer">Three variants of attention in a Transformer</h2>

<p>The same <code class="language-text highlighter-rouge">MultiHeadAttention</code> module is used in three different roles:</p>

<table>
  <thead>
    <tr>
      <th>Location</th>
      <th>Q comes from</th>
      <th>K, V come from</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Encoder self-attention</strong></td>
      <td>encoder input</td>
      <td>encoder input</td>
      <td>Each token attends to all others in the source</td>
    </tr>
    <tr>
      <td><strong>Decoder self-attention</strong></td>
      <td>decoder input</td>
      <td>decoder input</td>
      <td>Each output token attends to prior outputs (masked)</td>
    </tr>
    <tr>
      <td><strong>Cross-attention</strong></td>
      <td>decoder</td>
      <td>encoder output</td>
      <td>Decoder attends to the full encoded source</td>
    </tr>
  </tbody>
</table>

<p>The masking in decoder self-attention is crucial. At inference time future tokens simply don’t exist yet; during training, the whole target sequence is in the batch, so each position must be explicitly prevented from peeking ahead. Both cases are handled by a causal mask — a lower-triangular matrix of 1s.</p>
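
<p>Building that mask is one line in PyTorch; continuing the example above, it broadcasts over the batch and head dimensions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = may attend, 0 = blocked
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
out, _ = mha(x[:, :seq_len], x[:, :seq_len], x[:, :seq_len], mask=causal_mask)
</code></pre></div></div>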

<h2 id="what-do-heads-actually-learn">What do heads actually learn?</h2>

<p>Clark et al. (2019) probed BERT’s attention heads and found that different heads consistently specialise:</p>
<ul>
  <li>Some heads track <strong>direct syntactic objects</strong> (verb → direct object)</li>
  <li>Some track <strong>coreferents</strong> (pronoun → antecedent)</li>
  <li>A few “broad” heads attend somewhat uniformly, possibly acting as no-ops or residual paths</li>
</ul>

<p>This specialisation emerges from gradient descent alone — it’s not hardcoded.</p>

<h2 id="key-takeaways">Key takeaways</h2>

<ul>
  <li>Multi-head attention = $h$ independent attention heads, each with learned projections, concatenated and projected.</li>
  <li>Dimensionality per head = $d_{model} / h$ — total cost is the same as one full-width head.</li>
  <li>Three different configurations (self, masked self, cross) power the full Transformer.</li>
  <li>Heads empirically specialise on different linguistic relationships.</li>
</ul>

<p>Next: positional encoding — how the Transformer knows <em>where</em> each token is, since attention itself is position-blind.</p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="machine-learning" /><category term="deep-learning" /><category term="nlp" /><category term="transformers" /><summary type="html"><![CDATA[One attention head is a single lens. Multi-head attention runs several lenses in parallel — each free to specialise on a different relationship type. Here's exactly how and why.]]></summary></entry><entry><title type="html">Understanding Transformers — Part 1: The Attention Mechanism</title><link href="https://rupesh4604.github.io//2026/02/05/transformers-part-1-attention/" rel="alternate" type="text/html" title="Understanding Transformers — Part 1: The Attention Mechanism" /><published>2026-02-05T00:00:00+00:00</published><updated>2026-02-05T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/02/05/transformers-part-1-attention</id><content type="html" xml:base="https://rupesh4604.github.io//2026/02/05/transformers-part-1-attention/"><![CDATA[<p>The Transformer architecture, introduced in Vaswani et al.’s <em>Attention Is All You Need</em> (2017), is the backbone of essentially every modern language model. But it can feel like an intimidating wall of matrix multiplications. This series builds it up piece by piece.</p>

<p>We start with the one idea everything else depends on: <strong>scaled dot-product attention</strong>.</p>

<h2 id="why-attention-at-all">Why attention at all?</h2>

<p>Consider translation: to translate “The animal didn’t cross the street because <strong>it</strong> was too tired” to French, you need to know that <em>it</em> refers to <em>animal</em>, not <em>street</em>. Earlier RNN encoder-decoder models passed a single fixed-size “context vector” from encoder to decoder — a bottleneck that crushed this kind of long-range dependency.</p>

<p><strong>Attention</strong> lets the decoder, at each step, <em>directly look back at every encoder state</em> and decide which parts matter most. Instead of summarising everything into one vector, the model learns to weight its own memory dynamically.</p>

<h2 id="the-three-players-query-key-value">The three players: Query, Key, Value</h2>

<p>Attention takes three matrices as input:</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Role</th>
      <th>Analogy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Query (Q)</strong></td>
      <td>“What am I looking for?”</td>
      <td>A search query</td>
    </tr>
    <tr>
      <td><strong>Key (K)</strong></td>
      <td>“What does each position offer?”</td>
      <td>A database index</td>
    </tr>
    <tr>
      <td><strong>Value (V)</strong></td>
      <td>“What information does each position hold?”</td>
      <td>The database contents</td>
    </tr>
  </tbody>
</table>

<p>The intuition: you compute how well each Key matches your Query (a dot product), normalise those scores into weights (softmax), then take a weighted sum of the Values.</p>

<h2 id="the-formula">The formula</h2>

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

<p>Breaking it down:</p>

<ol>
  <li><strong>$QK^T$</strong> — dot product between every query and every key. Shape: <code class="language-text highlighter-rouge">[seq_len, seq_len]</code>. High value = high relevance.</li>
  <li><strong>$/\sqrt{d_k}$</strong> — scale by the square root of key dimension. Without this, large $d_k$ pushes dot products into the saturated tail of softmax, killing gradients.</li>
  <li><strong>$\text{softmax}(\cdot)$</strong> — turn scores into a probability distribution over positions.</li>
  <li><strong>$\cdot V$</strong> — weighted average of value vectors. Shape: <code class="language-text highlighter-rouge">[seq_len, d_v]</code>.</li>
</ol>

<h2 id="python-implementation-numpy">Python implementation (NumPy)</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">scaled_dot_product_attention</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="s">"""
    Q: (batch, seq_q, d_k)
    K: (batch, seq_k, d_k)
    V: (batch, seq_k, d_v)
    """</span>
    <span class="n">d_k</span> <span class="o">=</span> <span class="n">Q</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>

    <span class="c1"># (batch, seq_q, seq_k)
</span>    <span class="n">scores</span> <span class="o">=</span> <span class="n">Q</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">d_k</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">mask</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">scores</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">mask</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mf">1e9</span><span class="p">,</span> <span class="n">scores</span><span class="p">)</span>

    <span class="n">weights</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>   <span class="c1"># attention weights
</span>    <span class="n">output</span>  <span class="o">=</span> <span class="n">weights</span> <span class="o">@</span> <span class="n">V</span>                <span class="c1"># (batch, seq_q, d_v)
</span>    <span class="k">return</span> <span class="n">output</span><span class="p">,</span> <span class="n">weights</span>

<span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">):</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">x</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="n">axis</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">e</span> <span class="o">/</span> <span class="n">e</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="n">axis</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
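
<p>A sanity check reusing the functions just defined — every row of the weight matrix should be a probability distribution:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rng = np.random.default_rng(42)
Q = rng.normal(size=(1, 4, 8))    # (batch, seq_q, d_k)
K = rng.normal(size=(1, 6, 8))    # (batch, seq_k, d_k)
V = rng.normal(size=(1, 6, 16))   # (batch, seq_k, d_v)

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                          # (1, 4, 16)
print(np.allclose(w.sum(axis=-1), 1.0))   # True: rows sum to 1
</code></pre></div></div>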

<h2 id="visualising-attention">Visualising attention</h2>

<p>A useful mental model: the attention weight matrix is a <code class="language-text highlighter-rouge">[seq_len × seq_len]</code> grid. Entry <code class="language-text highlighter-rouge">[i, j]</code> tells you “how much does position <em>i</em> attend to position <em>j</em>?”</p>

<p>In a well-trained translation model, when generating “elle” (she), position <code class="language-text highlighter-rouge">[output_elle, input_animal]</code> should be high — that’s the model correctly resolving the pronoun.</p>

<p>This interpretability is one reason attention became so popular even before it dominated benchmarks.</p>

<h2 id="the-scaling-trick--why-sqrtd_k">The scaling trick — why $\sqrt{d_k}$?</h2>

<p>Suppose $d_k = 64$. The dot product of two random vectors whose components are independent with mean 0 and variance 1 has variance $d_k = 64$ — each of the 64 component products contributes unit variance. That’s a standard deviation of 8, large enough to push softmax outputs close to 0 or 1, making gradients vanish.</p>

<p>Dividing by $\sqrt{64} = 8$ normalises the variance back to 1. Empirically, this stabilises training and makes convergence significantly faster.</p>
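
<p>You can see the effect directly by sampling vectors with unit-variance components and comparing the spread of raw and scaled scores:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(100_000, d_k))
k = rng.normal(size=(100_000, d_k))

raw = (q * k).sum(axis=-1)                    # dot products
print(round(raw.std(), 1))                    # ~8.0, i.e. sqrt(d_k)
print(round((raw / np.sqrt(d_k)).std(), 1))   # ~1.0 after scaling
</code></pre></div></div>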

<h2 id="key-takeaways">Key takeaways</h2>

<ul>
  <li>Attention is a <strong>differentiable, soft lookup</strong>: query against keys, retrieve weighted values.</li>
  <li>Scaling by $\sqrt{d_k}$ is essential for stable gradients.</li>
  <li>The attention matrix gives you a direct read on which positions influence each output position — useful for interpretability.</li>
</ul>

<p>In Part 2, we’ll stack multiple attention heads in parallel and explore why that helps the model capture different kinds of relationships simultaneously.</p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="machine-learning" /><category term="deep-learning" /><category term="nlp" /><category term="transformers" /><summary type="html"><![CDATA[Before multi-head attention, before positional encoding, before the encoder-decoder stack — there's one core idea that makes Transformers work. Let's build it from scratch.]]></summary></entry><entry><title type="html">Book Notes: Thinking, Fast and Slow</title><link href="https://rupesh4604.github.io//2026/01/20/book-review-thinking-fast-and-slow/" rel="alternate" type="text/html" title="Book Notes: Thinking, Fast and Slow" /><published>2026-01-20T00:00:00+00:00</published><updated>2026-01-20T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/01/20/book-review-thinking-fast-and-slow</id><content type="html" xml:base="https://rupesh4604.github.io//2026/01/20/book-review-thinking-fast-and-slow/"><![CDATA[<p>Daniel Kahneman’s <em>Thinking, Fast and Slow</em> is one of those books that reorganises how you see your own mind. Published in 2011, it summarises decades of research on cognitive biases, heuristics, and the dual-process theory of human judgment. Here are my notes, organised around what I found most useful.</p>

<h2 id="the-core-frame-system-1-and-system-2">The Core Frame: System 1 and System 2</h2>

<p>Kahneman divides cognition into two metaphorical “systems”:</p>

<ul>
  <li><strong>System 1</strong> — fast, automatic, emotional, associative. Fires without effort. Handles driving a familiar route, reading a face, or recognising that 2+2=4.</li>
  <li><strong>System 2</strong> — slow, deliberate, effortful, logical. Needed for any non-trivial calculation, careful argument-following, or fighting an impulse.</li>
</ul>

<p>The central insight: <em>we think we’re mostly using System 2, but we’re actually mostly using System 1 — and System 2 is often just rationalising what System 1 already decided.</em></p>

<h2 id="heuristics-and-their-failures">Heuristics and Their Failures</h2>

<p>System 1 navigates the world with mental shortcuts (heuristics). These work well on average but fail systematically in predictable ways.</p>

<h3 id="anchoring">Anchoring</h3>

<p>If I ask you to estimate the population of Istanbul, and first show you the number 10 million, your estimate will be higher than if I showed you 2 million — even if you consciously dismiss the anchor as irrelevant.</p>

<p>Lesson: negotiators, appraisers, and judges all fall for anchoring. Always generate your own estimate before seeing any external figure.</p>

<h3 id="availability-heuristic">Availability heuristic</h3>

<p>We estimate frequency by how easily examples come to mind. Plane crashes feel more frequent than they are because they’re vivid and covered extensively; heart disease kills far more but doesn’t make the evening news as dramatically.</p>

<h3 id="wysiati--what-you-see-is-all-there-is">WYSIATI — What You See Is All There Is</h3>

<p>System 1 builds the most coherent story it can from <em>available</em> evidence, without flagging what it doesn’t know. This is why:</p>
<ul>
  <li>We’re overconfident (our story feels complete)</li>
  <li>We jump to conclusions on thin evidence</li>
  <li>We don’t naturally ask “what’s missing?”</li>
</ul>

<blockquote>
  <p>“The confidence that individuals have in their beliefs depends mostly on the quality of the story they can tell about what they see, even if they see little.”</p>
</blockquote>

<h2 id="the-planning-fallacy">The Planning Fallacy</h2>

<p>One of the most practically important sections: we systematically underestimate how long and costly projects will be because we focus on the <em>inside view</em> (our specific plan) rather than the <em>outside view</em> (base rates for similar projects).</p>

<p>Fix: reference class forecasting. Before estimating, ask <em>“What is the track record of similar projects?”</em> Then adjust from that anchor.</p>

<p>This is something I now apply to any estimate I’m asked to make. For ML projects especially, the outside view is brutal but correct.</p>

<h2 id="what-hasnt-aged-well">What Hasn’t Aged Well</h2>

<p>Since publication, the replication crisis hit many of the priming studies Kahneman relied on. The “Florida effect” (thinking about the elderly makes you walk slower) has failed to replicate. Ego depletion is contested.</p>

<p>The dual-system framework itself is a useful <em>metaphor</em>, not a literal description of brain architecture. Don’t mistake it for neuroscience.</p>

<p>The core findings — anchoring, WYSIATI, prospect theory, loss aversion — hold up well. But treat the priming chapters with more scepticism than Kahneman himself might endorse today.</p>

<h2 id="prospect-theory-the-part-i-re-read-most">Prospect Theory (the Part I re-read most)</h2>

<p>Kahneman and Tversky’s Nobel-winning contribution: humans don’t evaluate outcomes in absolute terms, but <em>relative to a reference point</em>, and losses loom larger than equivalent gains (roughly 2:1).</p>

<p>This predicts:</p>
<ul>
  <li>Why people gamble to avoid losses but accept certain smaller gains</li>
  <li>Why framing matters (same thing sounds better as “90% survival” than “10% mortality”)</li>
  <li>Why investors hold losing stocks too long</li>
</ul>

<h2 id="key-takeaways">Key Takeaways</h2>

<p>After reading and re-reading sections:</p>

<ol>
  <li><strong>Slow down on important decisions</strong> — force System 2 to engage before committing.</li>
  <li><strong>Pre-mortem on any major project</strong> — assume it failed; now explain why. Surfaces blind spots.</li>
  <li><strong>Always ask for the outside view</strong> — before planning, find base rates.</li>
  <li><strong>Identify your reference point in any negotiation</strong> — it’s controlling you whether or not you notice it.</li>
  <li><strong>Be suspicious of “obvious” conclusions</strong> — if a story feels complete, ask what it’s missing.</li>
</ol>

<hr />

<p><em>Next in this series: notes on Kahneman’s collaborators and critics — Gigerenzer’s “Rationality for Mortals” and Thaler’s “Misbehaving”.</em></p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="books" /><category term="cognitive-science" /><category term="rationality" /><category term="psychology" /><summary type="html"><![CDATA[Kahneman's magnum opus on the two systems of thought — what still holds up, what's been replicated, and what I take away as a practitioner.]]></summary></entry></feed>