The transformer architecture, introduced in Attention Is All You Need (Vaswani et al., 2017), fundamentally changed how we build sequence models. At its core lies a deceptively simple idea: scaled dot-product attention.
Scaled Dot-Product Attention
Given a set of queries Q, keys K, and values V, the attention function is defined as:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimensionality of the key vectors. The division by sqrt(d_k) is critical — without it, the dot products grow large in magnitude for high-dimensional vectors, pushing the softmax into regions with extremely small gradients.
Consider what happens geometrically. Each query vector is compared against every key vector via a dot product, producing a similarity score. These scores are normalized through softmax to form a probability distribution, which then weights the value vectors. The output is a weighted sum of values, where the weights reflect how “relevant” each key is to the query.
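The lookup described above fits in a few lines. Here is a minimal sketch of the attention function (shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    # Similarity scores: each query dotted against every key.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Normalize each row of scores into a probability distribution.
    weights = torch.softmax(scores, dim=-1)
    # Output is a weighted sum of the value vectors.
    return weights @ V, weights

Q = torch.randn(1, 5, 64)  # (batch, seq_len, d_k)
K = torch.randn(1, 5, 64)
V = torch.randn(1, 5, 64)
out, weights = scaled_dot_product_attention(Q, K, V)
# Each row of weights sums to 1; out has the same shape as V.
```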
Multi-Head Attention
Instead of performing a single attention pass, transformers use multi-head attention:
```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch = Q.size(0)
        # Project, then split into heads: (batch, n_heads, seq_len, d_k)
        Q = self.W_q(Q).view(batch, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch, -1, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention within each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)
        # Recombine heads, then apply the output projection
        context = context.transpose(1, 2).contiguous().view(batch, -1, self.n_heads * self.d_k)
        return self.W_o(context)
```

Each head learns to attend to different aspects of the input. In practice, some heads specialize in syntactic relationships (subject-verb agreement), while others capture semantic similarities or positional patterns.
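The shape bookkeeping in `forward` is the subtle part. A standalone sketch of the same split-and-merge logic (function name is illustrative):

```python
import torch

def split_heads(x, n_heads):
    # (batch, seq, d_model) -> (batch, n_heads, seq, d_k)
    b, s, d = x.shape
    return x.view(b, s, n_heads, d // n_heads).transpose(1, 2)

x = torch.randn(2, 10, 512)
h = split_heads(x, n_heads=8)
# Each head now sees a 64-dimensional slice of the 512-dim model space.
assert h.shape == (2, 8, 10, 64)

# Merging reverses the transpose and view, recovering the input exactly.
merged = h.transpose(1, 2).contiguous().view(2, 10, 512)
assert torch.equal(merged, x)
```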
Positional Encoding
Since attention is permutation-equivariant — it has no inherent notion of order — we must inject positional information. The original transformer uses sinusoidal encodings:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This encoding has a useful property: for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos). This allows the model to learn relative positions through linear transformations.
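The two formulas translate directly into code. A sketch of the sinusoidal table, with even dimensions getting sines and odd dimensions cosines:

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    """Sketch of the original sinusoidal positional encoding table."""
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    dim = torch.arange(0, d_model, 2).float()          # even dimension indices 2i
    div = torch.exp(-math.log(10000.0) * dim / d_model)  # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)
# Row p is the encoding added to the token embedding at position p.
```

Each dimension pair oscillates at its own wavelength, from 2π up to 10000·2π, which is what makes relative offsets expressible as fixed linear maps.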
Modern architectures have moved toward Rotary Position Embeddings (RoPE), which encode relative positions directly into the attention computation by rotating query and key vectors in a frequency-dependent manner. RoPE naturally decays attention scores with distance and extends more gracefully to longer sequences.
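A hedged sketch of the rotation idea, assuming the common pairing of consecutive features (this is a simplified illustration, not a production RoPE implementation):

```python
import torch

def rope_rotate(x, base=10000.0):
    """Rotate consecutive feature pairs by position-dependent angles,
    so that rotated q · rotated k depends only on relative position."""
    seq_len, d = x.shape[-2], x.shape[-1]
    half = d // 2
    freqs = base ** (-torch.arange(half).float() / half)     # one frequency per pair
    angles = torch.arange(seq_len).float()[:, None] * freqs  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Apply a 2-D rotation to each (x1, x2) feature pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8, 64)  # (seq_len, head_dim)
k = torch.randn(8, 64)
# Rotation is orthogonal, so vector norms are preserved; the resulting
# scores[i, j] depend on the offset (i - j) rather than on i and j separately.
scores = rope_rotate(q) @ rope_rotate(k).T
```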
The Computational Reality
Standard self-attention has O(n²) complexity in sequence length, both in time and memory. For a sequence of length 4096 with 32 attention heads and d_model=4096, each attention matrix is 4096×4096 — and you have 32 of them per layer, across 32+ layers.
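The memory footprint implied by those numbers can be tallied directly (fp32 assumed for this back-of-the-envelope sketch):

```python
seq_len, n_heads, n_layers = 4096, 32, 32
bytes_per_float = 4  # fp32; fp16/bf16 would halve this

per_head = seq_len * seq_len * bytes_per_float  # one 4096x4096 score matrix
per_layer = per_head * n_heads
total = per_layer * n_layers
print(per_head / 2**20, "MiB per head")      # 64 MiB
print(per_layer / 2**30, "GiB per layer")    # 2 GiB
print(total / 2**30, "GiB across layers")    # 64 GiB
```

Materializing every attention matrix at once would dwarf the memory spent on the model weights themselves, which is why the matrices are the target of optimization.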
This is why techniques like Flash Attention (Dao et al., 2022) matter. Flash Attention restructures the computation to exploit GPU memory hierarchy — performing attention in tiles that fit in SRAM rather than materializing the full attention matrix in HBM. The algorithm is mathematically identical but 2-4x faster.
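The tiling trick rests on an "online softmax": a running maximum and normalizer let each tile's contribution be folded in without ever holding the full score matrix. A single-head sketch of the idea (simplified, no mask, not the actual kernel):

```python
import math
import torch

def tiled_attention(Q, K, V, tile=32):
    """Flash-Attention-style tiling sketch with an online softmax.
    Processes K/V in tiles, never materializing the full (n x n) matrix."""
    n, d = Q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros(n, d)
    row_max = torch.full((n, 1), float('-inf'))  # running max per query row
    row_sum = torch.zeros(n, 1)                  # running softmax normalizer
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = (Q @ Kt.T) * scale                   # (n, tile) partial scores
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        # Rescale previously accumulated output and normalizer to the new max
        correction = torch.exp(row_max - new_max)
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ Vt
        row_max = new_max
    return out / row_sum

Q, K, V = (torch.randn(128, 64) for _ in range(3))
ref = torch.softmax((Q @ K.T) / math.sqrt(64), dim=-1) @ V
# The tiled result matches standard attention up to floating-point error.
```

The real kernel fuses this loop into GPU SRAM tiles; the sketch only demonstrates the mathematical equivalence.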
Key Takeaways
- Attention is a soft dictionary lookup — queries search for relevant keys, and values are the retrieved content
- The sqrt(d_k) scaling prevents gradient vanishing in softmax
- Multi-head attention provides multiple representation subspaces operating in parallel
- Position information must be explicitly injected since attention is order-agnostic
- The O(n²) cost is the primary bottleneck driving research into efficient attention variants