<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://rupesh4604.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://rupesh4604.github.io//" rel="alternate" type="text/html" /><updated>2026-05-07T12:46:36+00:00</updated><id>https://rupesh4604.github.io//feed.xml</id><title type="html">Rupesh’s Blog</title><subtitle>A personal blog on technology, research, books, and more.</subtitle><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><entry><title type="html">Hello, blog</title><link href="https://rupesh4604.github.io//2026/04/15/welcome-to-my-blog/" rel="alternate" type="text/html" title="Hello, blog" /><published>2026-04-15T00:00:00+00:00</published><updated>2026-04-15T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/04/15/welcome-to-my-blog</id><content type="html" xml:base="https://rupesh4604.github.io//2026/04/15/welcome-to-my-blog/"><![CDATA[<p>I’ve been meaning to start a blog for a while. The goal is to write about things I’m actually working through — not polished explainers, but the kind of notes I’d write for myself with a slightly more careful eye toward structure.</p>

<h2 id="what-to-expect">What to expect</h2>

<p><strong>Technical posts</strong> — ML papers I’m reading, algorithms from courses, implementation notes. I’ll show code and explain the intuition behind it rather than just dropping formulas.</p>

<p><strong>Book notes</strong> — I read a lot, especially in rationality, cognitive science, and history of science. I’ll share what stuck and why.</p>

<p><strong>Series</strong> — Some topics don’t fit in one post. For those, I’ll write a series: ordered parts that build on each other. You’ll see a navigation box at the top of each part showing the full thread and where you are in it.</p>

<h2 id="how-the-site-works">How the site works</h2>

<ul>
  <li><strong>Archive</strong> — all posts by date</li>
  <li><strong>Tags</strong> — browse by topic; clicking a tag on the home page filters the feed</li>
  <li><strong>Series</strong> — standalone page listing all multi-part threads</li>
  <li><strong>Search</strong> — press <code class="language-text highlighter-rouge">/</code> (or click the icon) for full-text search across all posts</li>
</ul>

<p>All posts are written in Markdown, the site is static, and it’s hosted for free on GitHub Pages.</p>

<p>Let’s see how this goes.</p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="meta" /><summary type="html"><![CDATA[A quick note on what this space is for and how it's organised.]]></summary></entry><entry><title type="html">Understanding Transformers — Part 3: Positional Encoding</title><link href="https://rupesh4604.github.io//2026/03/10/transformers-part-3-positional-encoding/" rel="alternate" type="text/html" title="Understanding Transformers — Part 3: Positional Encoding" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/03/10/transformers-part-3-positional-encoding</id><content type="html" xml:base="https://rupesh4604.github.io//2026/03/10/transformers-part-3-positional-encoding/"><![CDATA[<p>Here’s a surprising property of the attention mechanism we built: <strong>it’s entirely position-blind</strong>.</p>

<p>Given the sequence <code class="language-text highlighter-rouge">[A, B, C]</code> or <code class="language-text highlighter-rouge">[C, A, B]</code>, the self-attention output for token <code class="language-text highlighter-rouge">A</code> is identical — because dot products don’t care about order, only similarity.</p>

<p>For language, order is everything. We need to inject position information before the model can distinguish “dog bites man” from “man bites dog.”</p>
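
<p>You can verify the position-blindness numerically. Below is a minimal sketch (no learned projections, just <code class="language-text highlighter-rouge">softmax(XX^T/sqrt(d)) X</code>, and the variable names are mine) showing that reordering the sequence leaves each token's output unchanged:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))        # embeddings for tokens [A, B, C]

def self_attn(X):
    # attention with identity projections: softmax(X X^T / sqrt(d)) X
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

out_abc = self_attn(X)             # A's output is row 0
out_cab = self_attn(X[[2, 0, 1]])  # reordered as [C, A, B]; A is now row 1
print(np.allclose(out_abc[0], out_cab[1]))   # True: A's output is identical
</code></pre></div></div>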

<h2 id="two-approaches">Two approaches</h2>

<ol>
  <li>
    <p><strong>Learned positional embeddings</strong> — add a trainable embedding vector $p_i$ for each position $i$. Simple, flexible, but can’t generalise beyond the training sequence length.</p>
  </li>
  <li>
    <p><strong>Sinusoidal positional encoding</strong> (Vaswani et al.’s choice) — deterministic function of position and dimension. Can extrapolate beyond training length and has a beautiful structure.</p>
  </li>
</ol>

<p>The original paper uses sinusoidal. Let’s understand why.</p>

<h2 id="the-sinusoidal-encoding">The sinusoidal encoding</h2>

<p>For position $pos$ and dimension $i$ (out of $d_{model}$ total):</p>

\[PE_{(pos,\ 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\ 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]

<p>Even dimensions get sine, odd dimensions get cosine, with the frequency decreasing as $i$ increases.</p>

<h2 id="intuition-a-continuous-ruler">Intuition: a continuous “ruler”</h2>

<p>Think of it like a clock — but instead of one hand rotating at one frequency, you have $d_{model}/2$ pairs of hands, each rotating at a different frequency.</p>

<ul>
  <li><strong>High-frequency dimensions</strong> (small $i$) — change rapidly, distinguish nearby positions</li>
  <li><strong>Low-frequency dimensions</strong> (large $i$) — change slowly, encode coarse position (early vs. late in sequence)</li>
</ul>

<p>Together, every position $pos$ gets a unique fingerprint across all frequencies.</p>

<h2 id="why-10000">Why 10000?</h2>

<p>The base 10000 ensures the lowest frequency completes roughly one cycle over sequences of length ~62,800 ($2\pi \times 10000$). For realistic sequences (&lt; 10K tokens), the lowest-frequency dimensions therefore stay in a nearly-linear regime of the sinusoid — which empirically helps the model interpolate smoothly between positions.</p>
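
<p>To make that concrete: the wavelength of dimension pair $i$ is $2\pi \cdot 10000^{2i/d_{model}}$. A quick sketch of the range this spans for $d_{model} = 512$:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

d_model = 512
for i in [0, 64, 128, 255]:                       # dimension-pair index
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"pair {i:3d}: wavelength ~ {wavelength:,.0f} positions")
# pair   0: wavelength ~ 6 positions
# pair  64: wavelength ~ 63 positions
# pair 128: wavelength ~ 628 positions
# pair 255: wavelength ~ 60,611 positions
</code></pre></div></div>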

<h2 id="the-key-property-relative-positions">The key property: relative positions</h2>

<p>Here’s the elegant part. For any fixed offset $k$:</p>

\[PE_{pos+k} = f(PE_{pos})\]

<p>…where $f$ is a <em>linear transformation</em> (a block-diagonal stack of 2×2 rotations, one per frequency) that depends only on $k$, not on $pos$.</p>

<p>This means the model can, in principle, learn to attend to “two positions to the right” by learning the appropriate linear transformation of the key — without ever seeing $pos$ explicitly. Relative relationships are linearly recoverable.</p>
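
<p>This is easy to check numerically: each (sin, cos) dimension pair at frequency $\omega$ advances by a fixed 2×2 rotation when the position shifts by $k$. A sketch (helper names are mine):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import torch

def pe_pair(pos, omega):
    # one (sin, cos) dimension pair of the encoding at frequency omega
    return torch.tensor([math.sin(pos * omega), math.cos(pos * omega)])

omega, k = 1 / 10000 ** 0.25, 7    # one mid-range frequency, fixed offset
rot = torch.tensor([[ math.cos(k * omega), math.sin(k * omega)],
                    [-math.sin(k * omega), math.cos(k * omega)]])

for pos in [0, 5, 42]:
    assert torch.allclose(rot @ pe_pair(pos, omega),
                          pe_pair(pos + k, omega), atol=1e-6)
# the same `rot` works at every pos: the map depends only on k
</code></pre></div></div>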

<h2 id="implementation">Implementation</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">math</span>

<span class="k">def</span> <span class="nf">positional_encoding</span><span class="p">(</span><span class="n">max_len</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="s">"""
    Returns PE matrix of shape (max_len, d_model).
    Add to token embeddings before the first encoder layer.
    """</span>
    <span class="n">pe</span>  <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
    <span class="n">pos</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_len</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span>   <span class="c1"># (max_len, 1)
</span>
    <span class="c1"># Division term: 1 / 10000^(2i / d_model)
</span>    <span class="n">div</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span> <span class="o">*</span> <span class="o">-</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mf">10000.0</span><span class="p">)</span> <span class="o">/</span> <span class="n">d_model</span><span class="p">)</span>
    <span class="p">)</span>

    <span class="n">pe</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">pos</span> <span class="o">*</span> <span class="n">div</span><span class="p">)</span>   <span class="c1"># even dims
</span>    <span class="n">pe</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cos</span><span class="p">(</span><span class="n">pos</span> <span class="o">*</span> <span class="n">div</span><span class="p">)</span>   <span class="c1"># odd dims
</span>
    <span class="k">return</span> <span class="n">pe</span>   <span class="c1"># (max_len, d_model)
</span>

<span class="k">class</span> <span class="nc">TokenPlusPositionEmbedding</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">5000</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">token_emb</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span>   <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>

        <span class="n">pe</span> <span class="o">=</span> <span class="n">positional_encoding</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'pe'</span><span class="p">,</span> <span class="n">pe</span><span class="p">)</span>   <span class="c1"># not a parameter, but saved with model
</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># x: (batch, seq_len) token indices
</span>        <span class="n">tok</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">token_emb</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">d_model</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">tok</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">pe</span><span class="p">[:</span><span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)])</span>
</code></pre></div></div>

<p>Note the $\sqrt{d_{model}}$ scaling on the token embeddings — this ensures the token signal and the positional signal are on the same order of magnitude.</p>
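
<p>A quick smoke test of the module above with toy sizes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>emb = TokenPlusPositionEmbedding(vocab_size=1000, d_model=512)
tokens = torch.randint(0, 1000, (2, 20))   # (batch=2, seq_len=20)
print(emb(tokens).shape)                   # torch.Size([2, 20, 512])
</code></pre></div></div>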

<h2 id="alternatives-used-in-modern-llms">Alternatives used in modern LLMs</h2>

<p>The Transformer paper’s sinusoidal encoding is elegant but most modern LLMs use variants:</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Used in</th>
      <th>Key idea</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Learned absolute PE</td>
      <td>BERT, GPT-2</td>
      <td>Trainable vectors per position</td>
    </tr>
    <tr>
      <td>Relative PE (Shaw et al.)</td>
      <td>T5, Music Transformer</td>
      <td>Bias attention scores by relative distance</td>
    </tr>
    <tr>
      <td>RoPE (Su et al.)</td>
      <td>LLaMA, Mistral, Gemma</td>
      <td>Rotate Q and K vectors; naturally encodes relative position in dot product</td>
    </tr>
    <tr>
      <td>ALiBi</td>
      <td>BLOOM, MPT</td>
      <td>Add a linear bias to attention scores based on distance</td>
    </tr>
  </tbody>
</table>

<p><strong>RoPE</strong> has become the dominant choice: it encodes relative position directly in the query-key dot product, adds no parameters, and adapts well to context-window extension techniques such as position interpolation — a crucial property for modern long-context models.</p>

<h2 id="wrapping-up-the-series">Wrapping up the series</h2>

<p>We now have all the primitives:</p>
<ul>
  <li><strong>Scaled dot-product attention</strong> (Part 1) — the core lookup mechanism</li>
  <li><strong>Multi-head attention</strong> (Part 2) — parallel specialised heads</li>
  <li><strong>Positional encoding</strong> (Part 3) — injecting order into a permutation-invariant operation</li>
</ul>

<p>The remaining pieces of the Transformer (layer norm, feed-forward sub-layers, encoder-decoder stack, training recipe) build on these foundations. I’ll cover those in a future series on building GPT from scratch.</p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="machine-learning" /><category term="deep-learning" /><category term="nlp" /><category term="transformers" /><summary type="html"><![CDATA[Attention is permutation-invariant — it treats 'the cat sat' identically to 'sat cat the' without help. Positional encoding is the elegant fix. Here's the sinusoidal construction and why it works.]]></summary></entry><entry><title type="html">Understanding Transformers — Part 2: Multi-Head Attention</title><link href="https://rupesh4604.github.io//2026/02/20/transformers-part-2-multi-head/" rel="alternate" type="text/html" title="Understanding Transformers — Part 2: Multi-Head Attention" /><published>2026-02-20T00:00:00+00:00</published><updated>2026-02-20T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/02/20/transformers-part-2-multi-head</id><content type="html" xml:base="https://rupesh4604.github.io//2026/02/20/transformers-part-2-multi-head/"><![CDATA[<p>In <a href="/2026/02/05/transformers-part-1-attention/">Part 1</a>, we built scaled dot-product attention from scratch. A single attention head works — but it can only look at one “type” of relationship at a time.</p>

<p><strong>Multi-head attention</strong> runs $h$ independent attention operations in parallel, each with its own learned projections, then concatenates and re-projects the results.</p>

<h2 id="motivation-one-head-isnt-enough">Motivation: one head isn’t enough</h2>

<p>Consider the sentence: <em>“John said that he hurt himself.”</em></p>

<p>A single attention head must simultaneously track:</p>
<ul>
  <li><code class="language-text highlighter-rouge">he</code> → <code class="language-text highlighter-rouge">John</code> (coreference)</li>
  <li><code class="language-text highlighter-rouge">himself</code> → <code class="language-text highlighter-rouge">John</code> (reflexive)</li>
  <li><code class="language-text highlighter-rouge">hurt</code> → <code class="language-text highlighter-rouge">himself</code> (predicate-argument)</li>
</ul>

<p>These are structurally different relationships. If you force a single head to represent all of them with one weight matrix, it’s forced to compromise.</p>

<p>Multiple heads let the model <strong>specialise</strong>: empirically, different heads in trained models learn to track different syntactic and semantic patterns (coreference, positional proximity, subject-verb agreement, etc.).</p>

<h2 id="the-mechanics">The mechanics</h2>

<p>For each head $i$, we learn three projection matrices:</p>

\[W^Q_i \in \mathbb{R}^{d_{model} \times d_k}, \quad W^K_i \in \mathbb{R}^{d_{model} \times d_k}, \quad W^V_i \in \mathbb{R}^{d_{model} \times d_v}\]

<p>Each head then computes:</p>

\[\text{head}_i = \text{Attention}(X W^Q_i,\ X W^K_i,\ X W^V_i)\]

<p>Outputs are concatenated and projected through $W^O$:</p>

\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\ W^O\]

<p>where $W^O \in \mathbb{R}^{h \cdot d_v \times d_{model}}$.</p>

<h2 id="dimension-arithmetic">Dimension arithmetic</h2>

<p>The original paper uses $d_{model} = 512$, $h = 8$, which gives $d_k = d_v = 512/8 = 64$.</p>

<p>So each head operates on a 64-dim subspace. The total computation is no more expensive than a single head on the full 512 dimensions — we just distribute it.</p>

<h2 id="pytorch-implementation">PyTorch implementation</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">math</span>

<span class="k">class</span> <span class="nc">MultiHeadAttention</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="k">assert</span> <span class="n">d_model</span> <span class="o">%</span> <span class="n">num_heads</span> <span class="o">==</span> <span class="mi">0</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">d_model</span>    <span class="o">=</span> <span class="n">d_model</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">num_heads</span>  <span class="o">=</span> <span class="n">num_heads</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span>        <span class="o">=</span> <span class="n">d_model</span> <span class="o">//</span> <span class="n">num_heads</span>

        <span class="c1"># Projections for Q, K, V and the output
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">W_q</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_k</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_v</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_o</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
        <span class="c1"># x: (batch, seq, d_model) → (batch, heads, seq, d_k)
</span>        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">B</span> <span class="o">=</span> <span class="n">query</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

        <span class="c1"># Project and split
</span>        <span class="n">Q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_q</span><span class="p">(</span><span class="n">query</span><span class="p">),</span> <span class="n">B</span><span class="p">)</span>   <span class="c1"># (B, h, seq_q, d_k)
</span>        <span class="n">K</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_k</span><span class="p">(</span><span class="n">key</span><span class="p">),</span>   <span class="n">B</span><span class="p">)</span>   <span class="c1"># (B, h, seq_k, d_k)
</span>        <span class="n">V</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">split_heads</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_v</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">B</span><span class="p">)</span>   <span class="c1"># (B, h, seq_k, d_k)
</span>
        <span class="c1"># Scaled dot-product attention per head
</span>        <span class="n">scores</span> <span class="o">=</span> <span class="n">Q</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">mask</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">scores</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">mask</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mf">1e9</span><span class="p">)</span>
        <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># (B, h, seq_q, seq_k)
</span>        <span class="n">context</span> <span class="o">=</span> <span class="n">attn_weights</span> <span class="o">@</span> <span class="n">V</span>                    <span class="c1"># (B, h, seq_q, d_k)
</span>
        <span class="c1"># Concatenate heads and project
</span>        <span class="n">context</span> <span class="o">=</span> <span class="n">context</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="n">contiguous</span><span class="p">().</span><span class="n">view</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">d_model</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_o</span><span class="p">(</span><span class="n">context</span><span class="p">),</span> <span class="n">attn_weights</span>
</code></pre></div></div>
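
<p>A quick shape check with toy sizes; for self-attention, query, key, and value are all the same tensor:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)      # (batch, seq, d_model)
out, attn = mha(x, x, x)         # self-attention: Q, K, V all come from x
print(out.shape)                 # torch.Size([2, 10, 512])
print(attn.shape)                # torch.Size([2, 8, 10, 10])
</code></pre></div></div>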

<h2 id="three-variants-of-attention-in-a-transformer">Three variants of attention in a Transformer</h2>

<p>The same <code class="language-text highlighter-rouge">MultiHeadAttention</code> module is used in three different roles:</p>

<table>
  <thead>
    <tr>
      <th>Location</th>
      <th>Q comes from</th>
      <th>K, V come from</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Encoder self-attention</strong></td>
      <td>encoder input</td>
      <td>encoder input</td>
      <td>Each token attends to all others in the source</td>
    </tr>
    <tr>
      <td><strong>Decoder self-attention</strong></td>
      <td>decoder input</td>
      <td>decoder input</td>
      <td>Each output token attends to prior outputs (masked)</td>
    </tr>
    <tr>
      <td><strong>Cross-attention</strong></td>
      <td>decoder</td>
      <td>encoder output</td>
      <td>Decoder attends to the full encoded source</td>
    </tr>
  </tbody>
</table>

<p>The masking in decoder self-attention is crucial. At inference time future tokens simply don’t exist yet; during training, the whole target sequence is in the batch, so each position must be explicitly prevented from peeking ahead. Both cases are handled by a causal mask — a lower-triangular matrix of 1s.</p>
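
<p>Building that mask is one line in PyTorch; continuing the example above, it broadcasts over the batch and head dimensions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = may attend, 0 = blocked
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
out, _ = mha(x[:, :seq_len], x[:, :seq_len], x[:, :seq_len], mask=causal_mask)
</code></pre></div></div>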

<h2 id="what-do-heads-actually-learn">What do heads actually learn?</h2>

<p>Clark et al. (2019) probed BERT’s attention heads and found that different heads consistently specialise:</p>
<ul>
  <li>Some heads track <strong>direct syntactic objects</strong> (verb → direct object)</li>
  <li>Some track <strong>coreferents</strong> (pronoun → antecedent)</li>
  <li>A few “broad” heads attend somewhat uniformly, possibly acting as no-ops or residual paths</li>
</ul>

<p>This specialisation emerges from gradient descent alone — it’s not hardcoded.</p>

<h2 id="key-takeaways">Key takeaways</h2>

<ul>
  <li>Multi-head attention = $h$ independent attention heads, each with learned projections, concatenated and projected.</li>
  <li>Dimensionality per head = $d_{model} / h$ — total cost is the same as one full-width head.</li>
  <li>Three different configurations (self, masked self, cross) power the full Transformer.</li>
  <li>Heads empirically specialise on different linguistic relationships.</li>
</ul>

<p>Next: positional encoding — how the Transformer knows <em>where</em> each token is, since attention itself is position-blind.</p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="machine-learning" /><category term="deep-learning" /><category term="nlp" /><category term="transformers" /><summary type="html"><![CDATA[One attention head is a single lens. Multi-head attention runs several lenses in parallel — each free to specialise on a different relationship type. Here's exactly how and why.]]></summary></entry><entry><title type="html">Understanding Transformers — Part 1: The Attention Mechanism</title><link href="https://rupesh4604.github.io//2026/02/05/transformers-part-1-attention/" rel="alternate" type="text/html" title="Understanding Transformers — Part 1: The Attention Mechanism" /><published>2026-02-05T00:00:00+00:00</published><updated>2026-02-05T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/02/05/transformers-part-1-attention</id><content type="html" xml:base="https://rupesh4604.github.io//2026/02/05/transformers-part-1-attention/"><![CDATA[<p>The Transformer architecture, introduced in Vaswani et al.’s <em>Attention Is All You Need</em> (2017), is the backbone of essentially every modern language model. But it can feel like an intimidating wall of matrix multiplications. This series builds it up piece by piece.</p>

<p>We start with the one idea everything else depends on: <strong>scaled dot-product attention</strong>.</p>

<h2 id="why-attention-at-all">Why attention at all?</h2>

<p>Consider translation: to translate “The animal didn’t cross the street because <strong>it</strong> was too tired” to French, you need to know that <em>it</em> refers to <em>animal</em>, not <em>street</em>. Earlier RNN encoder-decoder models passed a single fixed-size “context vector” from encoder to decoder — a bottleneck that crushed this kind of long-range dependency.</p>

<p><strong>Attention</strong> lets the decoder, at each step, <em>directly look back at every encoder state</em> and decide which parts matter most. Instead of summarising everything into one vector, the model learns to weight its own memory dynamically.</p>

<h2 id="the-three-players-query-key-value">The three players: Query, Key, Value</h2>

<p>Attention takes three matrices as input:</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Role</th>
      <th>Analogy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Query (Q)</strong></td>
      <td>“What am I looking for?”</td>
      <td>A search query</td>
    </tr>
    <tr>
      <td><strong>Key (K)</strong></td>
      <td>“What does each position offer?”</td>
      <td>A database index</td>
    </tr>
    <tr>
      <td><strong>Value (V)</strong></td>
      <td>“What information does each position hold?”</td>
      <td>The database contents</td>
    </tr>
  </tbody>
</table>

<p>The intuition: you compute how well each Key matches your Query (a dot product), normalise those scores into weights (softmax), then take a weighted sum of the Values.</p>

<h2 id="the-formula">The formula</h2>

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

<p>Breaking it down:</p>

<ol>
  <li><strong>$QK^T$</strong> — dot product between every query and every key. Shape: <code class="language-text highlighter-rouge">[seq_len, seq_len]</code>. High value = high relevance.</li>
  <li><strong>$/\sqrt{d_k}$</strong> — scale by the square root of key dimension. Without this, large $d_k$ pushes dot products into the saturated tail of softmax, killing gradients.</li>
  <li><strong>$\text{softmax}(\cdot)$</strong> — turn scores into a probability distribution over positions.</li>
  <li><strong>$\cdot V$</strong> — weighted average of value vectors. Shape: <code class="language-text highlighter-rouge">[seq_len, d_v]</code>.</li>
</ol>

<h2 id="python-implementation-numpy">Python implementation (NumPy)</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">scaled_dot_product_attention</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="s">"""
    Q: (batch, seq_q, d_k)
    K: (batch, seq_k, d_k)
    V: (batch, seq_k, d_v)
    """</span>
    <span class="n">d_k</span> <span class="o">=</span> <span class="n">Q</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>

    <span class="c1"># (batch, seq_q, seq_k)
</span>    <span class="n">scores</span> <span class="o">=</span> <span class="n">Q</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">d_k</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">mask</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">scores</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">mask</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mf">1e9</span><span class="p">,</span> <span class="n">scores</span><span class="p">)</span>

    <span class="n">weights</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>   <span class="c1"># attention weights
</span>    <span class="n">output</span>  <span class="o">=</span> <span class="n">weights</span> <span class="o">@</span> <span class="n">V</span>                <span class="c1"># (batch, seq_q, d_v)
</span>    <span class="k">return</span> <span class="n">output</span><span class="p">,</span> <span class="n">weights</span>

<span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">):</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">x</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="n">axis</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">e</span> <span class="o">/</span> <span class="n">e</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="n">axis</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
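
<p>A sanity check reusing the functions just defined — every row of the weight matrix should be a probability distribution:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rng = np.random.default_rng(42)
Q = rng.normal(size=(1, 4, 8))    # (batch, seq_q, d_k)
K = rng.normal(size=(1, 6, 8))    # (batch, seq_k, d_k)
V = rng.normal(size=(1, 6, 16))   # (batch, seq_k, d_v)

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                          # (1, 4, 16)
print(np.allclose(w.sum(axis=-1), 1.0))   # True: rows sum to 1
</code></pre></div></div>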

<h2 id="visualising-attention">Visualising attention</h2>

<p>A useful mental model: the attention weight matrix is a <code class="language-text highlighter-rouge">[seq_len × seq_len]</code> grid. Entry <code class="language-text highlighter-rouge">[i, j]</code> tells you “how much does position <em>i</em> attend to position <em>j</em>?”</p>

<p>In a well-trained translation model, when generating “elle” (she), position <code class="language-text highlighter-rouge">[output_elle, input_animal]</code> should be high — that’s the model correctly resolving the pronoun.</p>

<p>This interpretability is one reason attention became so popular even before it dominated benchmarks.</p>

<h2 id="the-scaling-trick--why-sqrtd_k">The scaling trick — why $\sqrt{d_k}$?</h2>

<p>Suppose $d_k = 64$. The dot product of two random vectors whose components are independent with mean 0 and variance 1 has variance $d_k = 64$ — each of the 64 component products contributes unit variance. That’s a standard deviation of 8, large enough to push softmax outputs close to 0 or 1, making gradients vanish.</p>

<p>Dividing by $\sqrt{64} = 8$ normalises the variance back to 1. Empirically, this stabilises training and makes convergence significantly faster.</p>
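
<p>You can see the effect directly by sampling vectors with unit-variance components and comparing the spread of raw and scaled scores:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(100_000, d_k))
k = rng.normal(size=(100_000, d_k))

raw = (q * k).sum(axis=-1)                    # dot products
print(round(raw.std(), 1))                    # ~8.0, i.e. sqrt(d_k)
print(round((raw / np.sqrt(d_k)).std(), 1))   # ~1.0 after scaling
</code></pre></div></div>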

<h2 id="key-takeaways">Key takeaways</h2>

<ul>
  <li>Attention is a <strong>differentiable, soft lookup</strong>: query against keys, retrieve weighted values.</li>
  <li>Scaling by $\sqrt{d_k}$ is essential for stable gradients.</li>
  <li>The attention matrix gives you a direct read on which positions influence each output position — useful for interpretability.</li>
</ul>

<p>In Part 2, we’ll stack multiple attention heads in parallel and explore why that helps the model capture different kinds of relationships simultaneously.</p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="machine-learning" /><category term="deep-learning" /><category term="nlp" /><category term="transformers" /><summary type="html"><![CDATA[Before multi-head attention, before positional encoding, before the encoder-decoder stack — there's one core idea that makes Transformers work. Let's build it from scratch.]]></summary></entry><entry><title type="html">Book Notes: Thinking, Fast and Slow</title><link href="https://rupesh4604.github.io//2026/01/20/book-review-thinking-fast-and-slow/" rel="alternate" type="text/html" title="Book Notes: Thinking, Fast and Slow" /><published>2026-01-20T00:00:00+00:00</published><updated>2026-01-20T00:00:00+00:00</updated><id>https://rupesh4604.github.io//2026/01/20/book-review-thinking-fast-and-slow</id><content type="html" xml:base="https://rupesh4604.github.io//2026/01/20/book-review-thinking-fast-and-slow/"><![CDATA[<p>Daniel Kahneman’s <em>Thinking, Fast and Slow</em> is one of those books that reorganises how you see your own mind. Published in 2011, it summarises decades of research on cognitive biases, heuristics, and the dual-process theory of human judgment. Here are my notes, organised around what I found most useful.</p>

<h2 id="the-core-frame-system-1-and-system-2">The Core Frame: System 1 and System 2</h2>

<p>Kahneman divides cognition into two metaphorical “systems”:</p>

<ul>
  <li><strong>System 1</strong> — fast, automatic, emotional, associative. Fires without effort. Handles driving a familiar route, reading a face, or recognising that 2+2=4.</li>
  <li><strong>System 2</strong> — slow, deliberate, effortful, logical. Needed for any non-trivial calculation, careful argument-following, or fighting an impulse.</li>
</ul>

<p>The central insight: <em>we think we’re mostly using System 2, but we’re actually mostly using System 1 — and System 2 is often just rationalising what System 1 already decided.</em></p>

<h2 id="heuristics-and-their-failures">Heuristics and Their Failures</h2>

<p>System 1 navigates the world with mental shortcuts (heuristics). These work well on average but fail systematically in predictable ways.</p>

<h3 id="anchoring">Anchoring</h3>

<p>If I ask you to estimate the population of Istanbul, and first show you the number 10 million, your estimate will be higher than if I showed you 2 million — even if you consciously dismiss the anchor as irrelevant.</p>

<p>Lesson: negotiators, appraisers, and judges all fall for anchoring. Always generate your own estimate before seeing any external figure.</p>

<h3 id="availability-heuristic">Availability heuristic</h3>

<p>We estimate frequency by how easily examples come to mind. Plane crashes feel more frequent than they are because they’re vivid and covered extensively; heart disease kills far more but doesn’t make the evening news as dramatically.</p>

<h3 id="wysiati--what-you-see-is-all-there-is">WYSIATI — What You See Is All There Is</h3>

<p>System 1 builds the most coherent story it can from <em>available</em> evidence, without flagging what it doesn’t know. This is why:</p>
<ul>
  <li>We’re overconfident (our story feels complete)</li>
  <li>We jump to conclusions on thin evidence</li>
  <li>We don’t naturally ask “what’s missing?”</li>
</ul>

<blockquote>
  <p>“The confidence that individuals have in their beliefs depends mostly on the quality of the story they can tell about what they see, even if they see little.”</p>
</blockquote>

<h2 id="the-planning-fallacy">The Planning Fallacy</h2>

<p>One of the most practically important sections: we systematically underestimate how long and costly projects will be because we focus on the <em>inside view</em> (our specific plan) rather than the <em>outside view</em> (base rates for similar projects).</p>

<p>Fix: reference class forecasting. Before estimating, ask <em>“What is the track record of similar projects?”</em> Then adjust from that anchor.</p>

<p>This is something I now apply to any estimate I’m asked to make. For ML projects especially, the outside view is brutal but correct.</p>

<h2 id="what-hasnt-aged-well">What Hasn’t Aged Well</h2>

<p>Since publication, the replication crisis hit many of the priming studies Kahneman relied on. The “Florida effect” (thinking about the elderly makes you walk slower) has failed to replicate. Ego depletion is contested.</p>

<p>The dual-system framework itself is a useful <em>metaphor</em>, not a literal description of brain architecture. Don’t mistake it for neuroscience.</p>

<p>The core findings — anchoring, WYSIATI, prospect theory, loss aversion — hold up well. But treat the priming chapters with more scepticism than Kahneman himself might endorse today.</p>

<h2 id="prospect-theory-the-part-i-re-read-most">Prospect Theory (the Part I re-read most)</h2>

<p>Kahneman and Tversky’s Nobel-winning contribution: humans don’t evaluate outcomes in absolute terms, but <em>relative to a reference point</em>, and losses loom larger than equivalent gains (roughly 2:1).</p>

<p>This predicts:</p>
<ul>
  <li>Why people gamble to avoid losses but accept certain smaller gains</li>
  <li>Why framing matters (same thing sounds better as “90% survival” than “10% mortality”)</li>
  <li>Why investors hold losing stocks too long</li>
</ul>

<h2 id="key-takeaways">Key Takeaways</h2>

<p>After reading and re-reading sections:</p>

<ol>
  <li><strong>Slow down on important decisions</strong> — force System 2 to engage before committing.</li>
  <li><strong>Pre-mortem on any major project</strong> — assume it failed; now explain why. Surfaces blind spots.</li>
  <li><strong>Always ask for the outside view</strong> — before planning, find base rates.</li>
  <li><strong>Identify your reference point in any negotiation</strong> — it’s controlling you whether or not you notice it.</li>
  <li><strong>Be suspicious of “obvious” conclusions</strong> — if a story feels complete, ask what it’s missing.</li>
</ol>

<hr />

<p><em>Next in this series: notes on Kahneman’s collaborators and critics — Gigerenzer’s “Rationality for Mortals” and Thaler’s “Misbehaving”.</em></p>]]></content><author><name>Rupesh Kumar Yadav Mediboyina</name><email>rupesh32003@gmail.com</email></author><category term="books" /><category term="cognitive-science" /><category term="rationality" /><category term="psychology" /><summary type="html"><![CDATA[Kahneman's magnum opus on the two systems of thought — what still holds up, what's been replicated, and what I take away as a practitioner.]]></summary></entry></feed>