#1 Building My Own LLM From Scratch
2025-11-05
Why this project?
I’ve recently started reading “Build a Large Language Model (From Scratch)” by Sebastian Raschka. I discovered it thanks to Radek (velvetshark.com), someone I play with in a fantasy basketball dynasty league (sport really does connect people across many different areas).
Raschka explains the structure of LLMs and the tools used to build them, step by step.
Instead of just reading, I'm planning to actively implement the pieces he provides from the ground up.
The goal
I'm not sure how much time it will eventually take, but my goal is to have a small GPT-like conversational model working in roughly three months.
# a tiny teaser: the shape of a standard training loop
for step in range(steps):
    logits = model(tokens)                 # forward pass
    loss = compute_loss(logits, targets)   # e.g. cross-entropy against the next-token targets
    optimizer.zero_grad()                  # clear gradients from the previous step
    loss.backward()                        # backpropagation
    optimizer.step()                       # update the weights
Some of the pieces I plan to implement: a custom tokenizer, transformer layers, sampling (top-k / nucleus), context windowing, and training on a curated dataset.
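To make the sampling part concrete, here is a rough sketch of top-k sampling, assuming a 1-D tensor of next-token logits; the function name and parameters are my own placeholders, not code from the book.

import torch

# Rough sketch of top-k sampling (placeholder names, not from the book).
# Only the k highest-scoring tokens keep a nonzero probability; everything else is masked out.
def sample_top_k(logits, k=5, temperature=1.0):
    top_values, _ = torch.topk(logits, k)
    cutoff = top_values[-1]                                    # smallest logit still inside the top k
    masked = torch.where(logits < cutoff, torch.tensor(float("-inf")), logits)
    probs = torch.softmax(masked / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()      # draw one token ID

logits = torch.randn(50)            # stand-in for a model's next-token logits
print(sample_top_k(logits, k=5))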
What I've just learned
Chapter 1 - Understanding Large Language Models
- Transformer architecture, transformers vs LLMs
- BERT-like and GPT-like LLMs
- Few-shot and zero-shot learning
- GPT architecture - left-to-right processing (see the causal-mask sketch after this list)
- The stages of building LLMs: pretraining and finetuning
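The "left-to-right processing" point is easiest to see in code. A minimal sketch, assuming PyTorch and toy attention scores (not the book's exact code): a causal mask blocks each position from attending to anything that comes after it.

import torch

# Minimal causal-mask sketch: position i may only attend to positions <= i.
seq_len = 6
scores = torch.randn(seq_len, seq_len)                            # toy attention scores
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))                  # hide future positions
weights = torch.softmax(scores, dim=-1)                           # each row sums to 1 over past positions
print(weights[0])                                                 # the first token can only attend to itself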
Chapter 2 - Working with Text Data
- Tokenizing text, word embedding techniques
- Converting raw data (e.g. video, audio, text) into an n-dimensional numerical vector (an embedding)
- Building a vocabulary by tokenizing the entire text in a training dataset into individual tokens
- Encoding and decoding text with a tokenizer (see the tokenizer sketch after this list)
- Special tokens: <|unk|> to represent new and unknown words that were not part of the training data, and <|endoftext|> to separate unrelated text sources
- Byte pair encoding: BPE tokenizers break down unknown words into subwords and individual characters (see the BPE sketch below)
- Implementing efficient data loaders: a tensor x containing the inputs and a tensor y containing the targets, i.e. the next words (see the data-loader sketch below)
- Converting token IDs into token embedding vectors (see the embedding sketch below)
- Backpropagation algorithm
- Embedding layers perform a look-up operation, retrieving the embedding vector corresponding to the token ID from the embedding layer's weight matrix
- Main types of positional embeddings: absolute and relative
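A few sketches to make these notes concrete. First, the word-level tokenizer idea: build a vocabulary from a training text, map unknown words to <|unk|>, and encode/decode with the resulting look-up tables. The class and variable names here are my own, not the book's exact code.

import re

class SimpleTokenizer:
    def __init__(self, text):
        tokens = re.findall(r"\w+|[^\w\s]", text)                 # split into words and punctuation
        vocab = sorted(set(tokens)) + ["<|endoftext|>", "<|unk|>"]
        self.str_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    def encode(self, text):
        tokens = re.findall(r"\w+|[^\w\s]", text)
        unk = self.str_to_id["<|unk|>"]
        return [self.str_to_id.get(tok, unk) for tok in tokens]   # unknown words -> <|unk|>

    def decode(self, ids):
        return " ".join(self.id_to_str[i] for i in ids)

tok = SimpleTokenizer("the quick brown fox jumps over the lazy dog .")
print(tok.decode(tok.encode("the quick cat .")))                  # "cat" comes back as <|unk|>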
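Byte pair encoding is easier to appreciate by poking at a real BPE tokenizer. This sketch uses the tiktoken library's gpt2 encoding; the made-up word is only there to show that unknown words become subword pieces instead of <|unk|>.

import tiktoken

enc = tiktoken.get_encoding("gpt2")                               # GPT-2's BPE vocabulary
ids = enc.encode("Akwirw ier <|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)                                                        # the made-up word becomes several subword IDs
print([enc.decode([i]) for i in ids])                             # inspect each subword piece
print(enc.decode(ids))                                            # decoding restores the original text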
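The input/target pairing behind the data loaders can be sketched with a sliding window: x is a chunk of token IDs and y is the same chunk shifted one position to the right, so every position's target is simply the next token. A minimal version, assuming PyTorch and placeholder names:

import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    def __init__(self, token_ids, context_length=4, stride=4):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - context_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + context_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + context_length + 1]))  # shifted by one

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(20))                                       # stand-in for tokenized text
loader = DataLoader(NextTokenDataset(token_ids), batch_size=2, shuffle=False)
x, y = next(iter(loader))
print(x.shape, y.shape)                                           # both torch.Size([2, 4])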
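Finally, the embedding look-up plus absolute positional embeddings, with made-up sizes (the vocabulary size matches GPT-2's, the rest are arbitrary):

import torch
import torch.nn as nn

vocab_size, emb_dim, context_length = 50257, 256, 4
token_emb = nn.Embedding(vocab_size, emb_dim)                     # look-up table: token ID -> row of the weight matrix
pos_emb = nn.Embedding(context_length, emb_dim)                   # absolute positional embeddings

token_ids = torch.tensor([[40, 367, 2885, 1464]])                 # one batch of four arbitrary token IDs
tok_vectors = token_emb(token_ids)                                # (1, 4, 256): one row per ID
pos_vectors = pos_emb(torch.arange(context_length))               # (4, 256): one vector per position
input_embeddings = tok_vectors + pos_vectors                      # broadcast add -> (1, 4, 256)
print(input_embeddings.shape)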