Karpathy hand-crafted a ChatGPT clone in 8,000 lines of code for just $100, and after 12 hours of training it beats GPT-2 on CORE. Here's a step-by-step tutorial!


A hand-built clone of ChatGPT, for $100 and 8,000 lines of code!

Andrej Karpathy, former director of AI at Tesla, founding member of OpenAI, and the AI guru who announced he would devote himself to education full-time, has been quiet for a long while, and now, finally, finally, finally, he is teaching again!

He describes the new project, nanochat, as one of his most "insane", self-indulgent works.

It is a minimalist, full-stack training/inference pipeline built from scratch, implementing a simple version of ChatGPT in a single code base with minimal dependencies.

Spin up a cloud GPU server, run one script, and in as little as 4 hours you can chat with the large language model you just trained through a ChatGPT-like web interface.

The whole project is about 8,000 lines of code and covers the following:

Training a tokenizer, with a new implementation written in Rust

Pre-training a Transformer language model on the FineWeb dataset and evaluating the CORE score across a battery of metrics

Midtraining on the SmolTalk user-assistant conversation dataset, a multiple-choice question dataset, and a tool-use dataset

Instruction fine-tuning (SFT) and evaluation of the chat model on world-knowledge multiple choice (ARC-E/C), math (GSM8K), and code (HumanEval)

Optional reinforcement learning (RL) of the model on the GSM8K dataset with the "GRPO" algorithm

Efficient inference in an engine with KV caching and a simple prefill/decode loop, tool use (a Python interpreter in a lightweight sandbox), and interaction with the model via a CLI or a ChatGPT-like WebUI

A single Markdown report card that summarizes and "gamifies" the whole training and inference pipeline (scores, progress, and so on)

Total cost: only about $100 (4 hours of training on 8×H100) to train a simplified ChatGPT clone that can hold basic conversations, write little stories and poems, and answer simple questions.

The overall performance indicators are as follows:

After about 12 hours of training, the model surpasses GPT-2 on the CORE metric.

If the budget is pushed up to about $1,000 (roughly 41.6 hours of training), model performance improves significantly, and it can solve simple math/coding problems and multiple-choice questions.

To give a concrete example: after 24 hours of training, a depth-30 model reaches over 40 on MMLU, over 70 on ARC-Easy, and over 20 on GSM8K (that is roughly the compute of GPT-3 Small 125M, about one-thousandth of GPT-3).

Karpathy says his goal is to pack this complete "strong baseline" stack into a single cohesive, minimal, readable, hackable, and easily forkable codebase.

Nanochat will be the culminating project for the LLM101n course (which is still under development).

I think it has the potential to grow into a research harness or benchmark, much like nanoGPT before it. The project is nowhere near fully tuned (there is actually a lot of low-hanging fruit), but the overall skeleton is complete enough to put up on GitHub, and every module downstream can be improved by the community.

Netizens who had been waiting for his next release went wild; as soon as the project dropped, its GitHub star count shot up to 4.8k:

Cool! I can put “ML Engineer” on my resume after running this project once!

What you release isn't just code, it's wisdom you can actually understand. Explosive value, thank you ("Shuan Q")!

In the comments, Karpathy also explained that nanochat's basic architecture is similar to Llama's, but simpler, and that it borrows some design from modded-nanoGPT. The overall goal was to settle on a solid baseline configuration for models at this scale.

And the project was essentially written entirely by hand.

I did try using agents like Claude or Codex to help, but the results were terrible and ultimately of little use. Probably because the structure of this repo deviates from the distribution of their training data, they just didn't "click".

Without further ado, here is a detailed guide to getting started with nanochat.

The best ChatGPT you can create for $100

I spun up an 8×H100 node on Lambda GPU Cloud at about $24 per hour, so the clock was ticking.

Environment Setup

Clone the project:
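For reference, the corresponding commands look roughly like this (a sketch; check the repo for the current version):

```bash
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# the whole $100 "speedrun" can also be run end-to-end with:
# bash speedrun.sh
```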

The goal is to train the best ChatGPT-style model that $100 can buy, a "speedrun." The speedrun.sh script is designed to run end-to-end on a fresh server.

Below, though, Karpathy walks through each step.

First, make sure the popular uv project manager is installed. Install uv, create a new virtual environment in the .venv directory, pull in all the dependencies, and activate the environment, so that typing python uses the project's virtual environment rather than the system Python:
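A sketch of that setup (the uv one-liner is uv's standard installer; the rest mirrors the walkthrough and may differ slightly in the current repo):

```bash
# install uv if it isn't already on the machine
command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
# create the project-local virtual environment and install dependencies
uv venv
uv sync
# make `python` point at the project's venv rather than the system Python
source .venv/bin/activate
```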

Next, install Rust/Cargo so that the custom Rust tokenizer can be compiled. Introducing a new/custom tokenizer is a bit of a hassle, but Karpathy felt the Python implementation in his earlier minbpe project was too slow, and the Hugging Face tokenizers were too bloated and confusing.

So he wrote a new tokenizer specifically for training (tested to match the Python version), while still using OpenAI's tiktoken for efficiency at inference time.

Now let's compile the tokenizer:
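Roughly as follows (the rustup line is the standard installer; the build step follows the speedrun script and may differ in the current repo):

```bash
# install Rust/Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
# compile the custom Rust BPE tokenizer (the rustbpe crate) as a Python extension
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
```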

Training the tokenizer

Next, we need to fetch the pre-training data so that we can 1) train the tokenizer and 2) pre-train the model.

The pre-training data is text from a huge number of web pages; here we use the FineWeb-EDU dataset.

Normally you could just call Hugging Face's datasets.load_dataset(), but Karpathy found it too bloated and heavy for what is really simple logic, so he repackaged the entire dataset into simple, fully shuffled shards that can be accessed easily and efficiently, and re-uploaded the ~100B-token sample as karpathy/fineweb-edu-100b-shuffle.

On that dataset page you can preview sample text. Each shard is a simple Parquet file of roughly 0.25B characters, which compresses to about 100MB on disk (gzip). There are 1,822 shards in total, but only 240 are needed to train a depth-20 model.

Now let's download the data. It's about 24GB in total, and downloads are usually fast on a cloud box:
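Per the walkthrough, the download is a single module invocation, roughly (flag names may have changed; verify against the repo):

```bash
# fetch the first 240 shards (~24GB), enough for the depth-20 speedrun
python -m nanochat.dataset -n 240
```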

By default, all of these will be downloaded to the ~/.cache/nanochat directory.

Once the download finishes, we train the tokenizer, the component that converts back and forth between strings and sequences of codebook symbols. The default vocabulary size is 2¹⁶ = 65,536 tokens (a nice round number), with a few tokens reserved as special tokens for the chat format later. Training runs on 2B characters of text and takes only about a minute.

The training algorithm is exactly the same as OpenAI's (regex splitting, byte-level BPE). For more background, see Karpathy's video on tokenization.
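Training the tokenizer is then one command, along these lines (script name and flag per the walkthrough; double-check against the repo):

```bash
# train a 2^16 = 65,536-token BPE vocabulary on ~2B characters of FineWeb-EDU
python -m scripts.tok_train --max_chars=2000000000
```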

After training is complete, you can evaluate the tokenizer:
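Again roughly (same hedge about the exact script name):

```bash
python -m scripts.tok_eval
```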

The evaluation shows a compression ratio of about 4.8 (on average, 4.8 characters of raw text per token), along with comparisons against the GPT-2 and GPT-4 tokenizers.

Compared to GPT-2 (vocabulary of 50,257 tokens), it compresses text better overall and is only slightly worse on mathematical content:

Against GPT-4 it doesn't win, but that has to be weighed against GPT-4's much larger vocabulary (100,277 tokens). GPT-4 is clearly better on multilingual text (unsurprising, given how heavily FineWeb skews toward English), and it also leads on code and math:

Even so, with a smaller vocabulary we still edge out GPT-4 slightly on FineWeb itself. That is the dataset we trained on, so our tokenizer is fitted exactly to that document distribution (for instance, it is likely better at compressing this kind of English text).

Pre-training

Before starting pre-training, you need to download one more file, which Karpathy calls the "eval bundle."

During pre-training, the script will periodically evaluate the CORE metric. You can read some details in the DCLM paper, but it's essentially a good, standardized, and general metric for measuring how well a model performs on a large number of autocomplete datasets.

These datasets include HellaSwag, Jeopardy, BIG-bench QA WikiData, ARC-Easy/Challenge, COPA, CommonsenseQA, PIQA, LAMBADA, Winograd, BoolQ, and more (22 in total).

Download and unzip the eval bundle, and place its directory in the base directory at ~/.cache/nanochat/eval_bundle:
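A sketch of that step (the actual bundle URL is given in the walkthrough and is replaced here with a placeholder):

```bash
# download and unpack the eval bundle, then move it into the nanochat base dir
curl -L -o eval_bundle.zip <EVAL_BUNDLE_URL>
unzip -q eval_bundle.zip && rm eval_bundle.zip
mkdir -p ~/.cache/nanochat
mv eval_bundle ~/.cache/nanochat/
```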

One more piece of setup is recommended, though optional: configure wandb so you can view nice charts during training. uv has already installed wandb, but you still need to create an account and log in:
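That is just the standard wandb login flow:

```bash
uv run wandb login
```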

Now we can start pre-training! This is the most computationally intensive part. We train a large language model to compress internet text by predicting the next token in a sequence, and in the process it picks up a great deal of knowledge about the world:

Here we launch training on 8 GPUs via the scripts/base_train.py script, training a 20-layer Transformer. By default, each GPU processes 32 rows of 2048 tokens per forward/backward pass, so each optimizer step covers 8 × 32 × 2048 = 524,288 ≈ 0.5M tokens.

If you have already set up wandb, you can add --run=speedrun (all training scripts support this parameter) to set the run name and record relevant data.
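The launch command is roughly the following (per the walkthrough; double-check flags against the current repo):

```bash
# pretrain a depth-20 Transformer on 8 GPUs
# (append --run=speedrun to also log the run to wandb)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=20
```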

When you start training, the script prints the model configuration and a summary of the training plan (the full log is omitted here for brevity).

From it, you can see that this Transformer has 1280 channels and 10 attention heads of dim=128 each, roughly 560M parameters in total. Following the Chinchilla heuristic of about 20 tokens per parameter, that means training on 560M × 20 ≈ 11.2B tokens.

Since each step of the optimizer processes 524,288 tokens, this means 11.2B/0.5M ≈ 21,400 iterations.

By multiplying the estimated FLOPs per token by the total number of tokens, we can see that this will be a model with a computational cost of approximately 4e19 FLOPs.
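As a sanity check, the usual 6ND rule of thumb for dense Transformers (FLOPs ≈ 6 × parameters × tokens) lands in the same ballpark:

$$\text{FLOPs} \approx 6ND = 6 \times (560\times10^{6}) \times (11.2\times10^{9}) \approx 3.8\times10^{19} \approx 4\mathrm{e}19$$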

The learning rate is automatically scaled by 1/sqrt(dim) because larger models prefer smaller learning rates.

We use Muon to optimize the matrices and AdamW to optimize the embedding and unembedding; there are no other trainable parameters in this model (no biases, no rmsnorm parameters, etc.). Training periodically reports "validation bpb", the bits per byte on the validation set.

Bits per byte is a better metric than the typical cross-entropy loss because it further normalizes the per-token loss by the number of bytes per token, making the metric independent of the tokenizer.

So, whether you use a tokenizer with a small vocabulary or a tokenizer with a large vocabulary, this value is comparable, while the raw cross entropy loss is not.
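Concretely, bits per byte is just the cross-entropy loss re-expressed in base 2 and divided by how many raw bytes each token covers on average (the standard definition, not nanochat-specific notation):

$$\text{bpb} \;=\; \frac{\text{cross-entropy loss (nats per token)}}{\ln 2 \;\times\; \text{average bytes per token}}$$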

Note that each step takes about 0.5 seconds; lrm is the learning-rate decay multiplier (it decays linearly to 0 toward the end of training); and the reported MFU (model flops utilization) looks quite healthy at nearly 50%, meaning we are making good use of the available bfloat16 compute.

Now, wait about 3 hours until those 4e19 FLOPs of computation are complete … In your wandb graph, you should see something like this:

Over time, the bpb decreases, which is a good sign (indicating the model is getting more accurate at predicting the next token). In addition, the CORE score is increasing.

In addition to these approximate metrics, you can also evaluate the model more comprehensively:
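Per the walkthrough, the fuller evaluation is roughly the following (script names may differ in the current repo; the comments reflect what each is reported to measure):

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.base_loss   # train/val bpb and sample completions
torchrun --standalone --nproc_per_node=8 -m scripts.base_eval   # CORE score
```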

You can see the train/validation bpb reach about 0.81 and the CORE score rise to 0.22.

For comparison, the eval bundle includes CORE scores for the GPT-2 series: 0.22 sits slightly above GPT-2 large (0.21) but below GPT-2 xl, i.e., the "full" GPT-2, at 0.26.

At this point the model is essentially an advanced autocomplete, so we can run a few prompts to get a feel for the knowledge stored in its weights. The base_loss.py script runs a set of hard-coded prompts and prints the model's completions (omitted here).

So, the model knows that Paris is the capital of France, Au stands for gold, Saturday comes after Friday, "cold" is the opposite of "hot," and even knows the planets in our solar system.

However, it is still unsure about the color of the sky and has difficulty doing simple math problems.

For a model that cost about $72 to train so far, that's not bad at all. Inference runs through a custom Engine class that uses KV caching for efficiency and implements the two common inference stages, prefill and decode, in a simple way.

Our Engine class also supports the use of tools (such as the Python interpreter), which is useful when training on the GSM8K dataset (described in detail later).

Mid-training

Next comes mid-training, which fine-tunes the model further on the smol-SmolTalk dataset.

The algorithm is exactly the same as pre-training, but the data is now conversations, and the model adapts to the new special tokens that build up the multi-turn dialogue structure. Each conversation now looks roughly like this, loosely following the OpenAI Harmony chat format:
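The original post shows a rendered example conversation; purely as an illustration (the token names here are indicative only, the exact set lives in the repo's tokenizer code), one conversation is flattened into a single token stream along these lines:

```
<|bos|><|user_start|>Why is the sky blue?<|user_end|><|assistant_start|>Because of Rayleigh scattering ...<|assistant_end|>
```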

Tokens written like <|example|> are special tokens following OpenAI's special-token convention. The mid-training phase is useful for several adaptations of the model:

The model learns the special tokens used to structure multi-turn dialogue; none of these exist during base-model pre-training, apart from the <|bos|> token used to separate documents.

The model adapts to the data distribution of conversations rather than the data distribution of Internet documents.

Crucially, this is where the model is taught to handle multiple-choice questions, something it cannot pick up from random internet text at such a small scale. Concretely, it must learn to associate a handful of options with a handful of letters (such as A/B/C/D) and then output the letter of the correct answer; this is taught by mixing in 100,000 multiple-choice questions from the MMLU auxiliary training set. The point is not that the model lacks the underlying knowledge, but that it does not yet understand how multiple-choice questions work and so cannot express that knowledge. This matters because many common evaluations, MMLU included, are multiple choice.

You can teach the model to use various tools. For our purposes, we'll teach the model to use the Python interpreter by enclosing Python commands between the special tokens <|python_start|> and <|python_end|>. This will be useful later when solving GSM8K problems.

There are many other adaptations you can also train for during mid-training, such as context length expansion (not explored yet).

The default mid-training data mixture, defined in the training script, combines the pieces above: SmolTalk conversations, multiple-choice questions drawn from the MMLU auxiliary training set, and tool-use examples.

Then start it as follows:
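That is, roughly (per the walkthrough; verify the script name against the repo):

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.mid_train
```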

This run takes only about 8 minutes, far shorter than the roughly 3 hours of pre-training. The model is now a real chat model that can play the assistant role and answer user questions, so it can be evaluated:
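Per the walkthrough, something like the following (the -i flag is understood to select which checkpoint stage to evaluate; exact flags may differ):

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i mid
```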

The evaluation prints a table of scores for the model at this stage.

You can see:

World knowledge: the first three items (ARC-E/C and MMLU) are multiple-choice tests probing the model's knowledge across many domains. With four options (A, B, C, D), random guessing gets about 25%, so the model already does better than chance. (Multiple choice is genuinely hard for a model this small.)

Math: GSM8K consists of grade-school math word problems. The baseline here is 0% because the model must produce the actual numeric answer. Performance is still weak, with only about 2% of problems solved.

Code: HumanEval is a Python coding benchmark; again, the random baseline is 0%.

ChatCORE: this is Karpathy's attempt to mirror how the CORE score works for base models and extend it to chat models. Specifically, each metric above has its random baseline subtracted, yielding a score between 0 and 1 (so a random model scores 0, rather than 25% as on MMLU), and the average across all tasks is reported, giving a single number summarizing how strong the model currently is.
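One plausible reading of that description (not verified against the code) is a baseline-centered, rescaled mean over the T tasks:

$$\text{ChatCORE} \;=\; \frac{1}{T}\sum_{t=1}^{T}\frac{s_t - b_t}{1 - b_t}$$

where $s_t$ is the model's raw score on task $t$ and $b_t$ is the random-guessing baseline for that task.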

These assessments are still quite incomplete, and there are many other aspects that could be measured but are not yet measured.

There isn't really a single great chart for this step, but here is an example from an earlier mid-training run of a different, larger model, just to give a sense of how these metrics climb during a fine-tuning run:

Supervised fine-tuning

Mid-training is followed by a supervised fine-tuning (SFT) stage.

This is one more round of fine-tuning on conversational data; ideally you would hand-pick the highest-quality examples here, and it is also where safety training would go (for instance, training the assistant to refuse inappropriate requests).

Our model still isn't sure what color the sky is, so when it comes to things like biohazards it is probably safe for now. One domain adaptation that does happen here: SFT stretches examples out one per row, with padding, to mimic the format seen at test time.

In other words, examples are no longer randomly concatenated into long rows for training efficiency, as they were during pre-training/mid-training. Fixing this train/test domain mismatch is another small "tightening of the screws." We can run SFT and then re-evaluate:
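Roughly (same hedge as before about exact script names and flags):

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft
```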

This run also takes only about 7 minutes, and you should see a slight improvement across all the metrics.

Finally, we can talk to the model as a user!

You could already talk to it after mid-training, but it is better now. You can chat either from the terminal (method 1) or through the web interface (method 2):
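Per the walkthrough, roughly:

```bash
# method 1: chat in the terminal
python -m scripts.chat_cli
# method 2: ChatGPT-style web UI (served on port 8000)
python -m scripts.chat_web
```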

The chat_web script serves the Engine via FastAPI. To reach it on Lambda, for example, use your node's public IP plus the port, e.g. http://209.20.xxx.xxx:8000/.

The result looks quite nice: a simple, ChatGPT-style chat window.

It won't be winning any physics or poetry competitions anytime soon, but that said, it's pretty cool to have gotten this far on such a small budget, and the project is far from fully polished.

Reinforcement Learning

The final stage of the speedrun is reinforcement learning.

Reinforcement learning from human feedback (RLHF) is nice for squeezing out a few extra percentage points and for mitigating many failure modes that come from the sampling loop itself, such as hallucinations and infinite loops.

But at our scale, these are not major considerations. That being said, of all the datasets we have used so far, GSM8K is the only one that has a clear, objective reward function (the correct answer to the math problem).

So we can run the RL (GRPO) script and directly improve performance on those answers with a simple reinforcement learning loop that alternates sampling and training:
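Roughly (script name per the walkthrough; verify against the repo):

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.chat_rl
```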

During RL, the model cycles through the GSM8K training questions, samples completions for each, and we then reward those samples and train on the ones that receive high reward.

We use a heavily simplified GRPO training loop: no trust region (the reference model and KL regularization are dropped), fully on-policy (PPO's ratio and clipping go away), GAPO-style normalization (at the token level rather than the sequence level), and an advantage that is just the reward shifted by the group mean (no z-score normalization by the standard deviation).

What remains looks much more like the REINFORCE algorithm, while keeping the GR ("group relative") part for computing the advantage. That works well enough at this scale and for this simple task; see the script for details.
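In other words, for a group of $G$ sampled completions of the same question with rewards $r_1, \dots, r_G$, the advantage collapses to a mean shift rather than a full z-score:

$$A_i \;=\; r_i - \frac{1}{G}\sum_{j=1}^{G} r_j \qquad \text{instead of} \qquad A_i = \frac{r_i - \mu}{\sigma}$$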

Reinforcement learning is currently commented out by default, because it isn't well tuned yet and there is no full, general RLHF stage.

Reinforcement learning is run only on GSM8K, which is also why the -a flag is used to restrict evaluation to GSM8K. Because reinforcement learning is a bit like sucking supervision through a straw, this stage runs for quite a while.
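The matching evaluation call is roughly:

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i rl -a GSM8K
```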

For example, after running for about 1.5 hours with the default settings, the GSM8K score improves modestly.

Report card

The last thing Karpathy points out is the report.md file that appears in the project directory. It contains a lot of details about the run, plus a nice summary table at the end:

Characters: 333,989

Lines: 8,304

Files: 44

Tokens (approx.): 83,497

Dependencies (uv.lock lines): 2,004

Total time: 3 hours and 51 minutes

Note that because reinforcement learning support is still limited, it is excluded from the total time. Everything up through the supervised fine-tuning (SFT) stage took 3 hours 51 minutes, for a total cost of (3 + 51/60) × $24/hr ≈ $92.4 (with RL included, the total would be closer to 5 hours).

There was even $8 left for ice cream.

It's your turn

With nanochat, you can tune any part.

There are many ideas you can try: changing the tokenizer, modifying arbitrary data, adjusting hyperparameters, improving the optimization process. You might also want to train a larger model. This repository is set up to make it easy for you to do so.

Simply change the number of layers with the --depth parameter and everything else adjusts from it, so depth acts as a single complexity dial: the number of channels grows, the learning rate is adjusted accordingly, and so on.

In principle, just by varying the depth, you can explore a whole "mini-family" of nanochat models. Using a larger depth and waiting longer should theoretically give you significantly better results.

You pass the depth parameter to base_train.py during the pre-training phase. For example, to get a model with a CORE score around 0.25, close to GPT-2, depth=26 is a good choice.

However, when training larger models you also need to lower the per-device batch size, for example from 32 to 16:
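For example, something along these lines (the --device_batch_size flag name follows the walkthrough and may differ in the current repo):

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --device_batch_size=16
```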

The code detects the change and automatically compensates by running 2 gradient-accumulation passes to reach the target batch size of 0.5M tokens. To train a depth=30 model, reduce it further:
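e.g., roughly (the flag name and the value of 8 follow the walkthrough and may differ in the current repo):

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=30 --device_batch_size=8
```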

And so on. You are encouraged to read the code: Karpathy has done his best to keep it readable and well commented, so it is clean and easy to follow.

Of course, you can also package all the content and ask your favorite large language model, or even more simply, use Devin/Cognition's DeepWiki to ask questions about this code repository. Just change the URL of the code repository from github.com to deepwiki.com, such as nanochat DeepWiki.

That's it, tweak any part of the process, rerun it, and have fun!

A hugely popular figure in AI, now focused on education

Karpathy previously led AI at Tesla, then returned to OpenAI, before leaving the company in February last year.

He is extremely popular across the AI community, and a large part of that popularity comes from his courses.

That teaching spans his early blog posts, his later series of YouTube video tutorials, and CS231n, "Convolutional Neural Networks for Visual Recognition", Stanford's first deep learning course, which he created together with Fei-Fei Li.

Many of today’s scholars and entrepreneurs started their careers with him.

Karpathy's passion for teaching goes back to his student days, when he taught people how to solve the Rubik's Cube online.

In July of last year, having left OpenAI, Karpathy suddenly announced his startup: a new AI-native school called Eureka Labs.

What does "AI-native" mean here?

Imagine studying high-quality materials with Feynman, who will guide you 1-on-1 every step of the way.

Unfortunately, even if we could find a master like Feynman in every discipline, they wouldn't be able to personally tutor the 8 billion people on the planet.

But AI can. AI has infinite patience and is fluent in every language in the world.

So Karpathy wants to build a "teacher + AI symbiosis" that can run an entire curriculum on a single shared platform.

If we succeed, it will become easier for anyone to learn anything, expanding the scope and extent of education itself.

Eureka Labs' first product is also its first course, LLM101n.

It walks you step by step through building a ChatGPT-style story-generation model, together with a companion web application.

GitHub repo: https://github.com/karpathy/nanochat

Detailed guide: https://github.com/karpathy/nanochat/discussions/1

Reference link: https://x.com/karpathy/status/1977755427569111362

This article comes from the WeChat public account "Quantum Bit" (QbitAI), author: Xifeng, and is published by 36Kr with authorization.
