[Notes] Deep Dive into LLMs like ChatGPT


These are my notes on the recent video by Karpathy on LLMs. My main purpose in writing these notes is to keep a living document of the field's evolution, to summarize and refresh concepts I find valuable, and to share them with my future self and others.

The talk is a high-level overview of what’s behind the scenes when we type a query into ChatGPT. It starts by discussing in detail how these models are trained, then looks at their strengths and limitations, and finally covers the future of the field and how to stay up to date.

Table of Contents

  1. Pretraining of LLMs
  2. Post-Training Stage
  3. Model Challenges and Solutions
  4. Reinforcement Learning
  5. Future and Practical Aspects
  6. Conclusion

1. Pretraining of LLMs

1.1 The Training Data

Today’s LLMs are trained on essentially the whole (text) internet. Hugging Face has put together a large-scale pretraining dataset of web pages called FineWeb, built from CommonCrawl snapshots.

(Figure: FineWeb dataset preview on Hugging Face)

From its website:

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. Recently, we released 🍷 FineWeb, a new, large-scale (15-trillion tokens, 44TB disk space) dataset for LLM pretraining. FineWeb is derived from 96 CommonCrawl snapshots and produces better-performing LLMs than other open pretraining datasets.

When producing datasets at this scale, there are several preprocessing/cleaning steps to take. Particularly in the case of the FineWeb dataset, here’s what they do:

(Figure: the FineWeb processing recipe)

They go through steps like URL filtering, text extraction, language filtering, deduplication, and PII removal.

You can see a preview of the dataset here

1.2 Tokenization

Tokenization is the process of converting text into a sequence of symbols (tokens). Details on the tokenization process and how you can build one are in the Let’s build the GPT tokenizer video from Karpathy.

In this phase, we start building the input to the model. As an example, the text:

> System: You are a helpful assistant
> User: Hello!!!

will be tokenized into:

200264, 17360, 200266, 3575, 553, 261, 10297, 29186, 200265, 200264, 1428, 200266, 13225, 2618, 200265, 200264, 173781, 200266

(Figure: tokenization example)

Try it yourself here
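As a quick illustration, here is how you could reproduce plain-text tokenization with the tiktoken library. This is a minimal sketch: I'm assuming an o200k-style encoding (the one used by recent OpenAI models); the special chat tokens in the example above are added separately by the chat template, not by this call.

```python
# Minimal sketch: tokenize a string with tiktoken (assumed o200k_base encoding).
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Hello!!!")
print(tokens)                                # token IDs
print([enc.decode([t]) for t in tokens])     # the text chunk behind each ID
```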

1.3 Training Process

Each token ID is converted into a vector (an embedding). When we feed a token to the model, we also need to pass the context (the previous tokens), so we have to fix a context-window size. The model's job is then to predict the next token in the sequence (1 out of 100,277 possible tokens). Context length in these examples is generally up to around 8,000 tokens.
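To make the setup concrete, here is a toy sketch in PyTorch. It is not the actual Transformer architecture, just a stand-in showing the shapes involved: token IDs are embedded, a fixed-size context window goes in, and the network outputs logits over the whole vocabulary, trained with cross-entropy on the true next token.

```python
# Toy sketch only: "context window in, next-token distribution out".
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 100_277   # vocabulary size mentioned above
CONTEXT_LEN = 8        # tiny context window for illustration
EMBED_DIM = 64

class ToyNextTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)   # token ID -> vector
        self.net = nn.Sequential(                          # stand-in for the Transformer stack
            nn.Linear(CONTEXT_LEN * EMBED_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, VOCAB_SIZE),                     # logits over the whole vocabulary
        )

    def forward(self, token_ids):                           # token_ids: (batch, CONTEXT_LEN)
        x = self.embed(token_ids).flatten(start_dim=1)
        return self.net(x)

model = ToyNextTokenModel()
context = torch.randint(0, VOCAB_SIZE, (1, CONTEXT_LEN))    # fake context window
target = torch.randint(0, VOCAB_SIZE, (1,))                 # "true" next token
loss = F.cross_entropy(model(context), target)              # the pretraining objective
loss.backward()
print(loss.item())
```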

1.4 Neural Network Internals

You can check the architecture of some of the most popular models here

The architecture hasn’t changed much over the past few years, but the scale of the models has increased drastically. As an example, GPT-2 was released in 2019 with 1.6B parameters, while today we have models with 400B+ parameters.

GPT-2

Published in 2019, it was one of the first models able to generate surprisingly coherent text.

Karpathy reproduced GPT-2 in llm.c. Today, with llm.c, you can train GPT-2 in about 24 hours for under $1,000 on a single 8xH100 node. GH Discussions

Llama 3

Open-weights model released by Meta in 2024; the largest version has 405B parameters and was trained on roughly 15 trillion tokens.

Both base and instruct models are available.

To interact with the models, you can use products like Hyperbolic.

How to Host These Models

It is possible to rent an 8xH100 node from services like Lambda Labs. Big Tech companies stack many such nodes in data centers. More and more companies are buying this kind of compute, which is what drove Nvidia's stock price to skyrocket.

Open Models

There are many companies that train these base models routinely. Unfortunately, not all of these models are made available for public use.

In general, to use an openly available model you need two things:

  1. The inference code that runs the model (e.g., code released on GitHub).
  2. The trained parameters (the weights).

1.5 Inference

We can sample a token based on the output probability distribution. This will lead to output that’s not exactly like the training data but rather “inspired” by it.
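Here is a minimal sketch of that sampling step. The logits are random stand-ins for a real forward pass, and temperature is one common knob for how "creative" the sampling is.

```python
# Sampling the next token from the output distribution; the logits are
# random stand-ins for what a real model would produce.
import torch

logits = torch.randn(100_277)                 # one logit per vocabulary entry
temperature = 0.8
probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)   # stochastic choice
print(next_token.item())                      # "inspired by", not copied from, the data
```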

The Psychology of a Base Model

When we do inference with the base model, we don’t get an assistant but just an autocomplete: the model simply continues whatever text we feed it, in the style of internet documents.

2. Post-Training Stage

As an output of the pretraining stage, we have a model that can simulate internet documents.

2.1 Overview

Computationally, these stages are much cheaper than the pre-training stage. While pre-training can take up to ~3 months, post-training can be done in a few hours. After this step, we turn the base model into an assistant.

2.2 Training Data

The training data is just example conversations between a human and an assistant.

Example data:

Human: What is the capital of France?
Assistant: Paris

Human: How can I hack a bank?
Assistant: I'm sorry, but I can't assist with that.

The above data comes from human labelers, who are given the prompts and write the ideal assistant response.

2.3 Training Process

The model will continue to be trained on this new data and will learn to imitate what humans would do in the same situation.

2.4 Conversation Processing

2.4.1 Tokenization

We need to introduce new special tokens to structure the conversation:

<|im_start|>user<|im_sep|>What is 2+2?<|im_end|>

These special tokens (like <|im_start|>) were never seen during pretraining; they are introduced in post-training so the model can learn about user and assistant turns. In the end, it’s still just a sequence of tokens.

2.4.2 Inference Example

Input:

<|im_start|>user<|im_sep|>What is 2+2?<|im_end|>
<|im_start|>assistant<|im_sep|>

The model then generates the appropriate response based on this structured input.
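As a small sketch of how a conversation could be rendered into that token-delimited format: the helper below is hypothetical, the special-token strings follow the ChatML-style convention shown above, and real chat templates vary from model to model.

```python
# Hypothetical helper: render a conversation into the token-delimited
# format above; real chat templates differ between models.
def render_chat(messages):
    parts = [f"<|im_start|>{role}<|im_sep|>{content}<|im_end|>"
             for role, content in messages]
    parts.append("<|im_start|>assistant<|im_sep|>")   # leave the assistant turn open
    return "\n".join(parts)

print(render_chat([("user", "What is 2+2?")]))
```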

2.5 Dataset Creation for Post-Training

2.5.1 Traditional Approaches

The InstructGPT paper (“Training language models to follow instructions with human feedback”, OpenAI 2022) provides details about their approach. You can find the paper here: InstructGPT paper on SFT. The paper includes the instructions given to human labelers:

(Figure: labeling instructions from the InstructGPT paper)

Even though the dataset itself was never released, we do have some open-source alternatives, like Open Assistant’s oasst1.

2.5.2 Modern Approaches

Today, it’s less common to rely solely on human labelers to create these datasets. Instead, LLMs are used to generate synthetic datasets. For example, UltraChat is a largely synthetic dataset, i.e., a bigger LLM was used to generate the conversations.

2.5.3 LLMs as Statistical Simulators

The result of the SFT stage is that the model has learned to imitate the human labelers.

LLMs can be seen as statistical simulators of the persona that labeled the data.

3. Model Challenges and Solutions

3.1 Hallucinations

Hallucinations occur when LLMs make things up. Example:

User: Who is "Random Name"?
Assistant: Random Name is a famous scientist who has made many important contributions to the field of science.

The problem is that even when the model doesn’t know about “Random Name”, it will make up information with confidence.

A better response would be:

User: Who is "Random Name"?
Assistant: I don't have any information about this person.

However, this is unlikely to happen as the model imitates the training set, where such questions are typically answered with confidence.

Mitigation Strategies

Modern models are getting better at handling hallucinations, thanks to several techniques:

Agreement Techniques: In Meta’s "Llama 3 Herd of Models" paper, they observe that inside the model there are neurons that reflect when it is uncertain. Meta’s approach to hallucinations is to interrogate the model about its own answers, for example asking "Is ‘Unknown Name’ a famous scientist?" multiple times (or across multiple models) to check for consistency, and then generating SFT training samples that teach the model not to answer when it is uncertain.

From the paper:

We follow the principle that post-training should align the model to “know what it knows” rather than add knowledge (Gekhman et al., 2024; Mielke et al., 2020). Our primary approach involves generating data that aligns model generations with subsets of factual data present in the pre-training data. To achieve this, we develop a knowledge probing technique that takes advantage of Llama 3’s in-context abilities. This data generation process involves the following procedure:

  1. Extract a data snippet from the pre-training data.
  2. Generate a factual question about these snippets (context) by prompting Llama 3.
  3. Sample responses from Llama 3 to the question.
  4. Score the correctness of the generations using the original context as a reference and Llama 3 as a judge.
  5. Score the informativeness of the generations using Llama 3 as a judge.
  6. Generate a refusal for responses which are consistently informative and incorrect across the generations, using Llama 3.
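As a rough, hypothetical sketch of that procedure: every llm_* argument below is a stand-in for a prompted Llama 3 call, not the paper's actual code, and the refusal text is made up for illustration.

```python
# Hypothetical sketch of the knowledge-probing loop; llm_ask, llm_answer
# and llm_judge are stand-ins for prompted Llama 3 calls.
def probe_snippet(snippet, llm_ask, llm_answer, llm_judge, n_samples=4):
    question = llm_ask(f"Write a factual question about: {snippet}")
    answers = [llm_answer(question) for _ in range(n_samples)]
    correct = [llm_judge(question, a, reference=snippet) for a in answers]
    if not any(correct):
        # Consistently incorrect: teach the model to refuse instead of guessing.
        return {"prompt": question,
                "target": "I'm not sure; I don't have reliable information about that."}
    return None  # the model already "knows what it knows" here

# Demo with trivial stand-ins:
sample = probe_snippet(
    "The Eiffel Tower is in Paris.",
    llm_ask=lambda p: "Where is the Eiffel Tower?",
    llm_answer=lambda q: "In Rome.",
    llm_judge=lambda q, a, reference: "Paris" in a,
)
print(sample)
```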

Web Search Integration: We can reduce hallucinations by allowing models to search the web:

Human: Who is "Random Name"?
Assistant: Let me search for information about "Random Name"...
[SEARCH_RESULTS]
Based on the search results, I can tell you that...

Models can be trained through examples to use these tools effectively.
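A hypothetical sketch of how such a tool call could be wired up: the <SEARCH> tags and the web_search function here are illustrative assumptions, not any real model's actual protocol.

```python
# Hypothetical tool-use loop: the <SEARCH> tags and web_search are
# illustrative, not a real model's interface.
def answer_with_search(question, generate, web_search):
    draft = generate(f"User: {question}\nAssistant:")
    if "<SEARCH>" in draft:                                   # the model chose to call the tool
        query = draft.split("<SEARCH>")[1].split("</SEARCH>")[0]
        results = web_search(query)
        # The results land in the context window (working memory) and the
        # model continues generating from there.
        draft = generate(f"User: {question}\n[SEARCH_RESULTS]\n{results}\nAssistant:")
    return draft

print(answer_with_search(
    'Who is "Random Name"?',
    generate=lambda prompt: ("<SEARCH>Random Name</SEARCH>"
                             if "[SEARCH_RESULTS]" not in prompt
                             else "Based on the search results, ..."),
    web_search=lambda q: "(snippets about Random Name)",
))
```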

3.2 Knowledge Management

3.2.1 Types of Knowledge

It’s important to distinguish between:

  1. Knowledge in the parameters: a vague, probabilistic recollection of what the model read during pretraining.
  2. Knowledge in the context window: working memory, which the model can access directly and use much more reliably.

3.2.2 Self-Knowledge

When asked “Who are you?”, the model doesn’t truly have knowledge of itself - it only knows how to mimic the training data.

The way to handle these identity questions is through either:

  1. Hardcoding conversations about the model's identity into the SFT data, or
  2. Adding a system message at the start of every conversation that reminds the model who it is.

It’s not surprising that modern models (e.g., DeepSeek) might respond that they are ChatGPT, as there’s abundant data about ChatGPT on the internet.

3.3 Problem-Solving Capabilities

3.3.1 Mathematical Reasoning

When dealing with mathematical questions, consider these two possible answer approaches:

Less Effective Approach: the answer is stated first and the justification comes afterwards, forcing the entire computation to happen in the single forward pass that emits the answer token.

More Effective Approach: the model reasons step by step, spreading the computation across many intermediate tokens, and only states the final answer at the end.

Interestingly enough, as we will see in the RL section, DeepSeek-R1 figured this out by itself through RL, as an emergent behavior. In general, the most reliable approach is to have the model use tools (e.g., code) to solve mathematical problems.

3.3.2 Limitations

Counting

The dots example illustrates a fundamental limitation: when we show the model “………………………”, it sees only a few tokens rather than individual characters, which makes accurate counting unreliable.

However, if we allow the model to use code, it can perform exact counting by copying and processing the input programmatically.

Character-Level Processing

A classic example that went viral was counting the number of ‘r’s in “strawberry”. This combines two challenges for models:

  1. Counting (which we discussed above)
  2. Character-level processing (as tokens group characters together)
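Both limitations disappear once the model is allowed to hand the task to code, which operates on characters rather than tokens, as in this tiny example:

```python
# Code works on characters, not tokens, so counting is exact.
print("strawberry".count("r"))        # 3
print(len("....................."))   # exact number of dots
```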

4. Reinforcement Learning

RL represents the current frontier of the field. With supervised learning alone, you can at best reach human expert performance. RL, however, can discover entirely new approaches and surpass human expertise. This has already been demonstrated with AlphaGo: with “move 37” it played a novel move that a human expert would have been very unlikely to play, and it went on to beat the best players in the world. We’re now just beginning to see similar breakthroughs with LLMs.

4.1 Learning Approaches

Here’s an analogy comparing RL to traditional learning from textbooks:

  1. Reading (Background Knowledge)
    • Textbook: Reading chapters
    • LLMs: Pre-training stage
  2. Guided Practice
    • Textbook: Problems with solutions
    • LLMs: Supervised fine-tuning with expert data
  3. Independent Practice
    • Textbook: Solving new problems
    • LLMs: Reinforcement learning (discovering solutions to new problems via trial and error)

When creating training data, even as a labeler, it’s hard to know which solution approach is best for the model. What’s easy for humans isn’t necessarily easy for the model, and vice versa. If we only care about getting the correct answer, it makes more sense to let the LLM discover which token sequences work best for it.

We can:

  1. Sample thousands of candidate solutions and keep those that lead to the correct answer
  2. Encourage the model to find the optimal approach

The way to encourage this is to train on the most successful sequences. Instead of using human-annotated solutions, we let the model explore through sampling and then train on the best sequences.
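A minimal sketch of that sample-filter-train loop is below. The sampling function is a stand-in for drawing a full solution from the LLM; in a real setup, the winning sequences would then be used as fine-tuning targets.

```python
# Sketch of "sample many, keep the correct ones, train on those".
# sample_solution is a stand-in for sampling a full solution from the LLM.
import random

def sample_solution(prompt):
    answer = random.choice([3, 4, 5])          # fake reasoning outcome
    return {"text": f"... some reasoning ... Answer: {answer}", "answer": answer}

prompt, gold_answer = "What is 2+2?", 4
candidates = [sample_solution(prompt) for _ in range(1000)]
winners = [s for s in candidates if s["answer"] == gold_answer]

# `winners` would become the training targets: the model is nudged toward
# whatever token sequences reliably reach the correct answer for it.
print(f"{len(winners)} / {len(candidates)} samples reached the correct answer")
```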

The SFT (Supervised Fine-Tuning) step remains valuable as it initializes the model to discover solutions efficiently.

In short, we train LLMs in much the same way we educate children with textbooks, though with some key differences.

4.2 Current State

The RL stage isn’t yet standardized in the field (unlike other stages).

Many companies (including OpenAI) have experimented with RL fine-tuning but haven’t published their findings. DeepSeek-R1, however, has released a paper showing impressive results. R1

For example, their AIME accuracy on Math problems (AIME Problems) demonstrates RL’s potential.

An interesting observation is that later in the optimization process, the model learns to generate more tokens to reach the correct answer. The model discovers that trying multiple approaches and re-evaluating is an effective strategy. This emerges naturally as a behavior - the model is learning how to think.

In OpenAI’s reasoning model series, they don’t show the chain of thought to reduce model distillation risks (preventing other models from training on their generated data).

4.3 Beyond Human Thinking

Continuing RL training at scale could unlock new possibilities and ways for models to solve problems.

This requires:

  1. A large and diverse collection of practice problems (prompts) for models to train on, and
  2. Reliable, automatic ways to verify the results.

4.4 Learning in Unverified Domains

When we don’t have clear answers, we can use an LLM as a judge to score the RL work. However, this approach has limitations in truly unverified domains.

Example: “Write a joke about Pelicans”

There’s no objective way to score the joke. While we could use human judges, RL requires thousands of iterations, each needing thousands of outputs to be scored. We need an automated approach.

RLHF Approach: This approach was introduced in the OpenAI RLHF paper.

The process:

  1. Use humans to score initial outputs
  2. Train a Reward model to imitate human scoring
  3. Use this Reward model as a simulator of human preferences
  4. Train the RL model using this reward model

In practice, humans typically rank outputs rather than score them directly. The reward model takes the prompt and candidate response as input, outputting a score. It learns by comparing its rankings to human-provided rankings.
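A minimal sketch of such a reward model trained on pairwise preferences is below. It uses a standard Bradley-Terry-style objective on made-up embeddings; this is not OpenAI's exact setup, where the scorer is a full LLM over (prompt, response) pairs.

```python
# Bradley-Terry-style reward model sketch on made-up embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # maps features to a scalar reward

    def forward(self, features):                # features: (batch, dim)
        return self.score(features).squeeze(-1)

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Fake features for a human-preferred ("chosen") and a less-preferred
# ("rejected") response to the same prompt.
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)

# Push the chosen response's score above the rejected one's.
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```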

RLHF Considerations

Advantages:

  1. It lets us run RL in arbitrary domains, including unverifiable ones (jokes, poems, summaries), and in practice it often improves the model.
  2. It exploits the discriminator-generator gap: for a human, ranking candidate outputs is much easier than writing an ideal output from scratch.

Disadvantages:

  1. We are doing RL against a lossy simulation of human preferences (the reward model), which can be misleading.
  2. RL quickly discovers adversarial inputs that "game" the reward model (nonsensical outputs that receive absurdly high scores), so training can only run for a limited number of steps before it must be stopped.

RLHF is therefore closer to a light fine-tune than to true RL: it buys some improvement, but it cannot be run indefinitely and will not, by itself, fundamentally transform the field.

5. Future and Practical Aspects

5.1 Future Developments

Directions highlighted in the talk include:

  1. Multimodality: models that natively handle text, audio, and images.
  2. Agents: models that carry out long-running tasks, plan, and correct their own mistakes while humans supervise.
  3. Models becoming pervasive and invisible, integrated into everyday tools.
  4. Ongoing research such as test-time training, letting models keep improving after deployment.

5.2 Staying Informed

  1. LM Arena: a leaderboard that ranks models based on human comparisons.
  2. The AI News newsletter for comprehensive summaries.
  3. Following researchers and labs on X/Twitter.

5.3 Running Models

  1. Proprietary models: use the provider's own website or app (e.g., ChatGPT).
  2. Open-weights models (e.g., DeepSeek, Llama): use an inference provider such as Together AI.
  3. Smaller or distilled models: run them locally with tools like LM Studio.

6. Conclusion

We’ve explored what happens when we type a query into ChatGPT:

  1. Processing Flow:
    • Query → Tokenizer → Token sequence
    • Token sequence → Forward pass → Output (iterative)
  2. Training Stages:
    • Pre-training (internet knowledge)
    • Supervised fine-tuning (human labeler simulation)
    • Cognitive capabilities (including limitations)
    • Reinforcement learning (practice and improvement)
  3. Key Points:
    • Models are essentially imitating human labelers
    • RL shows promise but is still emerging
    • Models excel in verifiable domains
    • Never trust models blindly - use them as tools

The field is rapidly evolving and it is very exciting to be part of it! :D