"This is one of the most insane things I've ever written." Andrej Karpathy, former AI director at Tesla and founding member of OpenAI, just released his latest open source project, a repository called nanochat. As of now, the project has surpassed 7.9k stars on GitHub!
GitHub repository: https://github.com/karpathy/nanochat
According to reports, unlike Karpathy's earlier repository nanoGPT, which covered only pre-training, nanochat is a minimalist, end-to-end training/inference toolchain built from scratch. It can be used to build a simplified ChatGPT-style model, and everything lives in a single, cohesive codebase with very few dependencies.
A model trained in half a day for $100 beats GPT-2
"The best ChatGPT you can buy for $100," Kapathy described nanochat in the announcement. With nanochat, you simply spin up a cloud GPU server, run a script, and in as little as four hours, you can be conversing with your trained Large Language Model (LLM) on a ChatGPT-like web interface.
Specifically, the project covers the following:
- Train a tokenizer with a new Rust implementation
- Pre-train a Transformer LLM on the FineWeb dataset and evaluate the CORE score across a range of metrics
- Midtrain on the SmolTalk user-assistant conversation dataset, a multiple-choice question dataset, and a tool-use dataset
- Run supervised fine-tuning (SFT) on the chat model and evaluate it on world-knowledge multiple-choice questions (ARC-E/C, MMLU), math problems (GSM8K), and coding tasks (HumanEval)
- Optionally train the model with reinforcement learning (RL) on the GSM8K dataset using the "GRPO" algorithm (a toy sketch of the group-relative idea follows this list)
- Run efficient inference in an engine with KV caching, supporting a simple prefill/decode flow and tool use (a Python interpreter in a lightweight sandbox), with interaction via a command-line interface (CLI) or a ChatGPT-like web UI (a minimal prefill/decode sketch also follows this list)
- Automatically generate a single Markdown "report card" that summarizes the whole run and presents the metrics in a "gamified" way
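The "GRPO" item above refers to group-relative policy optimization: for each prompt, several answers are sampled and rewarded, and each answer's advantage is computed relative to its own group rather than against a learned value function. The sketch below is only an illustration of that general shape under my own assumptions (function names, reward scheme, and loss form are all invented here); it is not nanochat's actual RL objective.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (group_size,) scores for several sampled answers to the same prompt,
    # e.g. 1.0 if the final GSM8K answer is correct, else 0.0.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def policy_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # logprobs: (group_size,) summed token log-probabilities of each sampled answer
    # under the current policy. Advantages are detached so only logprobs carry gradient.
    advantages = grpo_advantages(rewards).detach()
    return -(advantages * logprobs).mean()

# Toy example: 4 sampled answers to one problem, two of them correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
logprobs = torch.randn(4, requires_grad=True)
policy_loss(logprobs, rewards).backward()
```

Answers scored above their group's average get a positive advantage and are reinforced; below-average answers are pushed down.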
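For the inference-engine item, here is a minimal, self-contained illustration of the prefill/decode pattern behind KV caching: the prompt is processed once to populate the key/value cache, then each new token is decoded with a single-position forward pass that attends to the cached keys and values instead of recomputing the whole prompt. The single attention head, random weights, and shapes are toy values for illustration only, not nanochat's engine.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 16
Wq, Wk, Wv = (torch.randn(dim, dim) * 0.1 for _ in range(3))

def attend(x, k_cache, v_cache):
    # x: (t, dim) newly arrived positions; caches hold K/V rows from all earlier positions.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = torch.cat(k_cache), torch.cat(v_cache)
    scores = (q @ K.T) / dim ** 0.5
    # Causal mask: new position i may attend only to cached positions 0 .. past + i.
    past = K.size(0) - x.size(0)
    allowed = torch.arange(K.size(0)) <= (past + torch.arange(x.size(0))).unsqueeze(1)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ V

# Prefill: the whole prompt goes through in one pass and fills the cache.
k_cache, v_cache = [], []
prompt = torch.randn(5, dim)
_ = attend(prompt, k_cache, v_cache)

# Decode: one position at a time, reusing (not recomputing) the cached keys/values.
for _ in range(3):
    next_token = torch.randn(1, dim)
    out = attend(next_token, k_cache, v_cache)

print(torch.cat(k_cache).shape)  # torch.Size([8, 16]): 5 prompt + 3 decoded positions cached
```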
According to Karpathy, even at a cost as low as $100 (about 4 hours of training on an 8×H100 node), nanochat can train a simple ChatGPT-like conversational model that writes stories and poems and answers simple questions. After roughly 12 hours of training, the model surpasses GPT-2 on the CORE metric.
On GitHub, Karpathy documents the detailed process of quickly training the best ChatGPT-like model that $100 can buy.
Detailed technical steps: https://github.com/karpathy/nanochat/discussions/1
If the budget is raised to roughly $1,000 (about 41.6 hours of training), the model's coherence improves significantly: it can solve simple math and coding problems and complete multiple-choice tests. For example, after 24 hours of training, a depth-30 model (roughly FLOPs-equivalent to GPT-3 Small at 125 million parameters, or about 1/1000 the compute of GPT-3) scores over 40 on MMLU, over 70 on ARC-Easy, and over 20 on GSM8K.
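For context, compute comparisons like this are usually made with a standard rule of thumb (an approximation, not something stated in the article) that training FLOPs scale with model size times data size:

$$C_{\text{train}} \approx 6\,N\,D$$

where $N$ is the number of model parameters and $D$ is the number of training tokens. Under this estimate, a small model trained on enough tokens can be "FLOPs-comparable" to a larger model trained on fewer tokens.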
Karpathy's goal is to bundle this entire "strong baseline" stack into one cohesive, minimal, readable, hackable, and forkable repository. "nanochat will be the capstone project of the LLM101n course (which is still under development). I also think it has the potential to grow into a research harness or a benchmark, much like nanoGPT before it."
He noted that the project is far from final: neither thorough tuning nor performance optimization has been done. Still, the overall skeleton is complete enough to publish on GitHub, and all of the downstream modules can be improved further by the community. Karpathy added that nanochat still has many easily achievable optimization wins.
8,000 lines of hand-written code: "the agents couldn't help"
The whole project runs to only about 8,000 lines of code, yet Karpathy emphasized that "the code structure is quite clear." Moreover, the repository was essentially written entirely by hand, with nothing more than tab autocompletion.
"I've tried using Claude or Codex's Agent several times before, but the results were very poor and in the end they didn't help. This may be because the code style and functionality of this repository deviate too much from the regular code in the training data of these tools." Kapasi said.
On nanochat's model architecture, Karpathy said it is broadly similar to Llama but simpler in structure, and it also borrows some design ideas from modded-nanoGPT (an improved fork of nanoGPT).
He tried to pin down a solid baseline architecture for models of this scale, summarized below (a minimal sketch of several of these choices follows the list):
- Dense Transformer (no sparse structure)
- Rotary embeddings (RoPE) only; no other positional encoding
- QK normalization (QK norm: normalize the query vectors Q and key vectors K)
- Untied weights for the embedding and unembedding layers
- Normalization applied to the token embeddings
- relu² (squared ReLU) activation in the multi-layer perceptron (MLP)
- RMSNorm (root-mean-square normalization) without learnable parameters
- No biases in linear layers
- Multi-Query Attention (MQA)
- Logit softcap (bounding the logit values to stabilize training)
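To make several of these choices concrete, here is a minimal PyTorch sketch of a transformer block with parameter-free RMSNorm, QK normalization, a relu² MLP, and bias-free linear layers. It is an illustration under assumptions, not nanochat's actual code: rotary embeddings, MQA, the logit softcap, and the untied (un)embedding are omitted, and the names (rms_norm, MLP, Attention, Block) and the 4× MLP expansion are my own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x: torch.Tensor) -> torch.Tensor:
    # RMSNorm without learnable parameters: rescale by the root-mean-square only.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)

class MLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)    # no biases in linear layers
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)) ** 2)       # relu^2 activation

class Attention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = rms_norm(q), rms_norm(k)                  # QK norm on query and key vectors
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

class Block(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = Attention(dim, n_heads)
        self.mlp = MLP(dim, 4 * dim)                     # conventional 4x expansion (assumption)

    def forward(self, x):
        x = x + self.attn(rms_norm(x))                   # pre-norm residual connections
        x = x + self.mlp(rms_norm(x))
        return x

print(Block(128, 4)(torch.randn(2, 10, 128)).shape)      # torch.Size([2, 10, 128])
```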
nanochat's optimizer is a Muon + AdamW combination, a design heavily inspired by modded-nanoGPT. Karpathy has a to-do item to try removing the dependency on Muon by tuning Adam's learning rates (for example, setting dedicated learning rates for different modules), but he has not yet put much time into it. A rough sketch of what such a split can look like appears below.
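The following is a simplified sketch under assumptions, not nanochat's actual optimizer code: AdamW handles the embedding/unembedding parameters, while 2-D weight matrices receive a Muon-style update, i.e. momentum followed by an approximate orthogonalization of the update direction. The Newton-Schulz coefficients follow the widely shared modded-nanoGPT recipe, and the toy model, learning rates, and function names are all illustrative.

```python
import torch
import torch.nn as nn

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize a 2-D update with a quintic Newton-Schulz iteration
    # (coefficients taken from the commonly cited modded-nanoGPT recipe; treat as an assumption).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def muon_step(p: torch.Tensor, buf: torch.Tensor, lr: float = 0.02, beta: float = 0.95):
    # Momentum on the raw gradient, then apply the orthogonalized direction as the update.
    buf.mul_(beta).add_(p.grad)
    p.data.add_(newton_schulz_orthogonalize(buf), alpha=-lr)

# Toy "model": an embedding, one weight matrix, and an unembedding layer.
model = nn.ModuleDict({
    "embed":   nn.Embedding(1000, 64),
    "hidden":  nn.Linear(64, 64, bias=False),
    "unembed": nn.Linear(64, 1000, bias=False),
})

# Split: AdamW for the (un)embedding parameters, Muon-style updates for the matrix weights.
adamw = torch.optim.AdamW(
    list(model["embed"].parameters()) + list(model["unembed"].parameters()),
    lr=3e-4, betas=(0.9, 0.95), weight_decay=0.0,
)
muon_params = list(model["hidden"].parameters())
momentum = [torch.zeros_like(p) for p in muon_params]

# One illustrative step on dummy data.
tokens = torch.randint(0, 1000, (4, 16))
logits = model["unembed"](model["hidden"](model["embed"](tokens)))
loss = logits.logsumexp(dim=-1).mean()  # dummy scalar loss
loss.backward()
adamw.step()
adamw.zero_grad(set_to_none=True)
for p, buf in zip(muon_params, momentum):
    muon_step(p, buf)
    p.grad = None
```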
Netizens: run it once and add "machine learning engineer" to your resume
Beyond GitHub, the newly released nanochat is also drawing plenty of attention on social media.
"I have always liked the Nano series of projects! This minimalist end-to-end training/inference toolchain will definitely have a profound impact on many machine learning learners and researchers." said one netizen.
Another netizen wrote, "Personally, I think this repository is an excellent resource for future learning: it's very helpful for understanding the low-level Rust-based pieces, and, more fundamentally, deep learning development in Python." He also asked, "If everyone could use this repository to train their own large language models (LLMs) with minimal effort, wouldn't that erode the technical moat of companies like Anthropic and OpenAI? After all, there are plenty of excellent engineers out there, and with enough resources they are fully capable of training even more powerful models."
Another commenter pointed out, "I think the biggest audience for this repository is researchers. Many people have ideas for improving large language models (LLMs), but turning those ideas into complete implementations takes a lot of effort, and the results are uncertain. Now there is a ready-made toolchain and workflow that anyone can use to experiment. What used to be a daydream of 'what if this could be done?' becomes a concrete plan of 'I'll try to implement this idea next weekend.'"
One netizen even joked, “After running this, I will definitely add the title of ‘machine learning engineer’ to my resume.”
Reference Links:
https://x.com/karpathy/status/1977755427569111362
https://github.com/karpathy/nanochat
This article is from the WeChat public account "AI Frontline" , compiled by Hua Wei, and published by 36Kr with authorization.