LLM 1.2. Attention! Is All You Need
2025-11-12
I continue to study "Build a Large Language Model (From Scratch)" by Sebastian Raschka. Previously, I learned how to prepare input text for training LLMs. This involved tokenizing text—splitting it into individual words and subword tokens—then converting those tokens into embeddings. The embedding layer functions as a lookup table: each token ID retrieves its corresponding vector representation.
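As a quick refresher on that lookup-table behavior, here is a minimal sketch; the vocabulary size and embedding dimension below are arbitrary toy values, not anything from the book:

```python
import torch

torch.manual_seed(123)

# A toy embedding layer: 6 tokens in the vocabulary, 3-dimensional vectors.
embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)

# Each token ID simply indexes a row of the embedding weight matrix.
token_ids = torch.tensor([2, 3, 5, 1])
print(embedding(token_ids))   # shape: (4, 3)
print(embedding.weight[2])    # same vector as the first row above
```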
Now, the focus shifts to coding attention mechanisms. The title of this article is a playful nod to the landmark paper "Attention Is All You Need," which introduced the transformer architecture that revolutionized modern AI. And true to that title, attention mechanisms are indeed the core of what makes LLMs work.
What I've just learned
Chapter 3 - Coding attention mechanisms
What attention mechanisms are
Bahdanau attention mechanism for RNNs - modifies the encoder/decoder RNN so that the decoder can selectively access different parts of the input sequence at each decoding step
"Self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence
Implementing a self-attention mechanism without trainable weights
Context vectors
Attention scores - dot products between the query and each input vector
Normalizing attention weights (torch.softmax()), so they sum up to 1
Computing the context vector - a weighted sum of all input vectors x, using the attention weights (see the sketch after this list).
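To make those steps concrete, here is a minimal sketch of the simplified (non-trainable) self-attention computation for a single query token; the toy input values and sizes are arbitrary, not the book's exact example:

```python
import torch

torch.manual_seed(123)
inputs = torch.rand(6, 3)        # six tokens, each a 3-dimensional embedding

query = inputs[1]                # pick the second token as the query

# Attention scores: dot product between the query and every input vector.
attn_scores = inputs @ query     # shape: (6,)

# Normalize the scores with softmax so the attention weights sum to 1.
attn_weights = torch.softmax(attn_scores, dim=0)

# Context vector: weighted sum of all input vectors.
context_vec = attn_weights @ inputs   # shape: (3,)
print(attn_weights.sum(), context_vec)
```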
Causal/Masked Attention
Restricts a model so that during training, each token can only attend to itself and previous tokens (not future tokens). This prevents "cheating" by looking ahead.
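A minimal sketch of how such a causal mask can be applied, assuming the raw attention scores have already been computed (the values below are arbitrary toy data):

```python
import torch

torch.manual_seed(123)
num_tokens = 5
# Pretend these are raw attention scores (queries x keys) for one sequence.
attn_scores = torch.rand(num_tokens, num_tokens)

# Upper-triangular mask: True wherever a token would attend to a future position.
mask = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)

# Set future positions to -inf so softmax assigns them zero weight.
masked_scores = attn_scores.masked_fill(mask, float("-inf"))
attn_weights = torch.softmax(masked_scores, dim=-1)

print(attn_weights)   # each row sums to 1; entries above the diagonal are 0
```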
Multi-Head Attention - the heart of transformers. It implements scaled dot-product attention with causal masking, which prevents tokens from attending to future positions.
Query: projects input to "what am I looking for?"
Key: projects input to "what do I contain?"
Value: projects input to "what information should I pass?"
out_proj: combines all heads back into a single representation
Steps: compute Q, K, and V; split them into multiple heads and rearrange for parallel computation; compute the attention scores; apply the causal mask; turn the scores into attention weights and apply dropout; compute the context vectors; concatenate the heads and apply the output projection (a sketch follows below).
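Putting those steps together, here is a compact sketch of a multi-head attention module along those lines. It follows the general structure described above, but the variable names, defaults, and sizes are my own placeholders rather than the book's implementation:

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        # Q, K, V: project the input into query, key, and value spaces.
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        # out_proj: combines all heads back into a single representation.
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # Causal mask: ones above the diagonal mark future positions.
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape

        # 1. Compute Q, K, and V.
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        # 2. Split into heads and rearrange for parallel computation:
        #    (b, num_tokens, d_out) -> (b, num_heads, num_tokens, head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # 3. Scaled dot-product attention scores.
        attn_scores = queries @ keys.transpose(2, 3) / self.head_dim ** 0.5

        # 4. Apply the causal mask so tokens cannot attend to future positions.
        mask = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores = attn_scores.masked_fill(mask, float("-inf"))

        # 5. Attention weights plus dropout.
        attn_weights = self.dropout(torch.softmax(attn_scores, dim=-1))

        # 6. Context vectors: concatenate the heads and apply the output projection.
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)


# Example usage with hypothetical sizes:
torch.manual_seed(123)
mha = MultiHeadAttention(d_in=3, d_out=6, context_length=10, dropout=0.1, num_heads=2)
x = torch.rand(2, 5, 3)     # batch of 2 sequences, 5 tokens each, 3-dim embeddings
print(mha(x).shape)         # torch.Size([2, 5, 6])
```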