#1 Building My Own LLM From Scratch
2025-11-05
Why this project?
I’ve recently started reading “Build a Large Language Model (From Scratch)” by Sebastian Raschka. I discovered it thanks to Radek (velvetshark.com), someone I play with in a fantasy basketball dynasty league (sport really does connect people across many different areas).
Raschka explains the structure of LLMs and the tools used to build them, step by step.
Instead of just reading, I'm planning to actively implement the pieces he provides from the ground up.
The goal
I'm not sure how much time it will eventually take, but my goal is to have a small GPT-like conversational model working in roughly three months.
# a tiny teaser: the shape of a standard training loop
for step in range(steps):
    logits = model(tokens)                 # forward pass
    loss = compute_loss(logits, targets)   # e.g. cross-entropy against the next-token targets
    optimizer.zero_grad()                  # clear gradients from the previous step
    loss.backward()                        # backpropagation
    optimizer.step()                       # update the weights
Some of the pieces I plan to implement: a custom tokenizer, transformer layers, sampling (top-k / nucleus), context windowing, and training on a curated dataset.
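To make the sampling part concrete, here is a rough sketch of top-k sampling, assuming a 1-D tensor of next-token logits; the function name and parameters are my own placeholders, not code from the book.

import torch

# Rough sketch of top-k sampling (placeholder names, not from the book).
# Only the k highest-scoring tokens keep a nonzero probability; everything else is masked out.
def sample_top_k(logits, k=5, temperature=1.0):
    top_values, _ = torch.topk(logits, k)
    cutoff = top_values[-1]                                    # smallest logit still inside the top k
    masked = torch.where(logits < cutoff, torch.tensor(float("-inf")), logits)
    probs = torch.softmax(masked / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()      # draw one token ID

logits = torch.randn(50)            # stand-in for a model's next-token logits
print(sample_top_k(logits, k=5))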
What I've just learned
Chapter 1 - Understanding Large Language Models
- Transformer architecture, transformers vs LLMs
- BERT-like and GPT-like LLMs
- Few-shot and zero-shot learning
- GPT architecture - left-to-right processing (see the causal-mask sketch after this list)
- The stages of building LLMs: pretraining and finetuning
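The "left-to-right processing" point is easiest to see in code. A minimal sketch, assuming PyTorch and toy attention scores (not the book's exact code): a causal mask blocks each position from attending to anything that comes after it.

import torch

# Minimal causal-mask sketch: position i may only attend to positions <= i.
seq_len = 6
scores = torch.randn(seq_len, seq_len)                            # toy attention scores
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))                  # hide future positions
weights = torch.softmax(scores, dim=-1)                           # each row sums to 1 over past positions
print(weights[0])                                                 # the first token can only attend to itself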
Chapter 2 - Working with Text Data
- Tokenizing text, word embedding techniques
- Converting raw data (e.g. video, audio, text) into an n-dimensional numerical vector (an embedding)
- Building a vocabulary by tokenizing the entire text in a training dataset into individual tokens
- Encoding and decoding text with a tokenizer (see the tokenizer sketch after this list)
- Special tokens: <|unk|> to represent new and unknown words that were not part of the training data, and <|endoftext|> to separate unrelated text sources
- Byte pair encoding: BPE tokenizers break down unknown words into subwords and individual characters (see the BPE sketch below)
- Implementing efficient data loaders: a tensor x containing the inputs and a tensor y containing the targets, i.e. the next words (see the data-loader sketch below)
- Converting token IDs into token embedding vectors (see the embedding sketch below)
- Backpropagation algorithm
- Embedding layers perform a look-up operation, retrieving the embedding vector corresponding to the token ID from the embedding layer's weight matrix
- Main types of positional embeddings: absolute and relative
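A few sketches to make these notes concrete. First, the word-level tokenizer idea: build a vocabulary from a training text, map unknown words to <|unk|>, and encode/decode with the resulting look-up tables. The class and variable names here are my own, not the book's exact code.

import re

class SimpleTokenizer:
    def __init__(self, text):
        tokens = re.findall(r"\w+|[^\w\s]", text)                 # split into words and punctuation
        vocab = sorted(set(tokens)) + ["<|endoftext|>", "<|unk|>"]
        self.str_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    def encode(self, text):
        tokens = re.findall(r"\w+|[^\w\s]", text)
        unk = self.str_to_id["<|unk|>"]
        return [self.str_to_id.get(tok, unk) for tok in tokens]   # unknown words -> <|unk|>

    def decode(self, ids):
        return " ".join(self.id_to_str[i] for i in ids)

tok = SimpleTokenizer("the quick brown fox jumps over the lazy dog .")
print(tok.decode(tok.encode("the quick cat .")))                  # "cat" comes back as <|unk|>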
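Byte pair encoding is easier to appreciate by poking at a real BPE tokenizer. This sketch uses the tiktoken library's gpt2 encoding; the made-up word is only there to show that unknown words become subword pieces instead of <|unk|>.

import tiktoken

enc = tiktoken.get_encoding("gpt2")                               # GPT-2's BPE vocabulary
ids = enc.encode("Akwirw ier <|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)                                                        # the made-up word becomes several subword IDs
print([enc.decode([i]) for i in ids])                             # inspect each subword piece
print(enc.decode(ids))                                            # decoding restores the original text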
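The input/target pairing behind the data loaders can be sketched with a sliding window: x is a chunk of token IDs and y is the same chunk shifted one position to the right, so every position's target is simply the next token. A minimal version, assuming PyTorch and placeholder names:

import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    def __init__(self, token_ids, context_length=4, stride=4):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - context_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + context_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + context_length + 1]))  # shifted by one

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(20))                                       # stand-in for tokenized text
loader = DataLoader(NextTokenDataset(token_ids), batch_size=2, shuffle=False)
x, y = next(iter(loader))
print(x.shape, y.shape)                                           # both torch.Size([2, 4])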
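Finally, the embedding look-up plus absolute positional embeddings, with made-up sizes (the vocabulary size matches GPT-2's, the rest are arbitrary):

import torch
import torch.nn as nn

vocab_size, emb_dim, context_length = 50257, 256, 4
token_emb = nn.Embedding(vocab_size, emb_dim)                     # look-up table: token ID -> row of the weight matrix
pos_emb = nn.Embedding(context_length, emb_dim)                   # absolute positional embeddings

token_ids = torch.tensor([[40, 367, 2885, 1464]])                 # one batch of four arbitrary token IDs
tok_vectors = token_emb(token_ids)                                # (1, 4, 256): one row per ID
pos_vectors = pos_emb(torch.arange(context_length))               # (4, 256): one vector per position
input_embeddings = tok_vectors + pos_vectors                      # broadcast add -> (1, 4, 256)
print(input_embeddings.shape)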