In the early hours of October 14, Andrej Karpathy, a well-known expert in the field of AI, released a new open-source project called "nanochat", describing it as one of the most unrestrained and crazy projects he had ever written.
Unlike the earlier nanoGPT, which covered only pre-training, nanochat is a minimalist, full-stack training and inference pipeline built from scratch: a single codebase with minimal dependencies that produces a simple version of ChatGPT end to end.
Using nanochat is simple: rent a cloud GPU server, run a single script, and in as little as four hours you can chat with the large language model (LLM) you trained yourself, in a web interface similar to ChatGPT's.
What is nanochat?
Based on Karpathy's description, nanochat packages all the steps and tools needed to build a chatbot from scratch, including:
1. Data preparation: Starting from raw web text (such as the FineWeb dataset), train a tokenizer that converts massive amounts of text into numbers the model can understand (a tokenizer sketch follows this list).
2. Model pre-training: Train a basic Transformer model on large-scale data to allow it to learn language grammar, facts, and basic reasoning. This is the most time-consuming and core step.
3. Alignment fine-tuning:
- a. Instruction fine-tuning: Use high-quality question-answering and dialogue data to teach the model how to follow instructions and communicate with people like an assistant.
- b. Reinforcement learning (optional): Further improve the model's performance on specific tasks (such as math problem solving) through rewards and penalties.
4. Inference: Provides an efficient engine that lets you chat with your trained model in real time, from the command line or a web interface similar to ChatGPT's.
5. Evaluation: After training is complete, the system automatically generates a detailed "report card" showing the model's performance on multiple standard benchmarks (such as mathematics, coding, and common-sense reasoning).
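As an illustration of step 1, the sketch below trains a minimal byte-level BPE tokenizer with the Hugging Face tokenizers library. This is not nanochat's own tokenizer (the project trains its own); the input file name, vocabulary size, and special token here are assumptions made for the example.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Build a byte-level BPE tokenizer and train it on a local text dump
# (illustrative file name standing in for a FineWeb shard).
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=65536, special_tokens=["<|bos|>"])
tok.train(files=["fineweb_sample.txt"], trainer=trainer)

# Text in, token ids out: the "numbers the model can understand".
ids = tok.encode("Training a small ChatGPT clone is not magic.").ids
print(ids[:10], tok.decode(ids))
```

All the later stages, from pre-training to inference, operate on token ids like these rather than on raw text.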
Karpathy's previous project, nanoGPT, focused only on step 2, model pre-training. It was a minimalist GPT training codebase intended for educational purposes, to help people understand how large models are trained.
nanochat, by contrast, is a full-stack project: it not only includes the pre-training part of nanoGPT, but also covers all the subsequent key steps (instruction fine-tuning, reinforcement learning, inference, and a web UI), ultimately delivering a chatbot that can actually hold a conversation.
All of this is achieved in roughly 8,000 lines of code written by Karpathy.
What is the significance of Karpathy building nanochat?
First, education and learning: it is arguably the best resource currently available for understanding how to build a ChatGPT from scratch. It lets ordinary developers and researchers create their own small chat models at relatively low cost and experience the full journey from raw text to an intelligent conversational assistant.
Second, research and experimentation: it gives researchers a lightweight, controllable, and reproducible experimental platform. They can use the framework to quickly test new model architectures, training methods, or alignment techniques without expensive large-scale compute.
Finally, a user on X spotted another possibility: he believed the system could become a new benchmark for hardware evaluation.
This is awesome! This should become the new benchmark for hardware evaluation – all we have to do is report an ordered triple:
● Total end-to-end training cost (USD)
● Total end-to-end training time (minutes)
● Overall performance on a specific test set
And the whole process is highly reproducible.
$100 to train an AI from scratch
So how little does it actually cost to train a model with nanochat?
● For only about $100 (roughly 4 hours of training on an 8×H100 node), you can train a small ChatGPT clone that can hold basic conversations, write stories and poems, and answer simple questions
(The web interface shows a conversation with a nanochat model that took four hours and cost $100. It can already write poetry.)
(The nanochat report card shows some summary metrics generated by this $100 speedrun. Overall, it’s pretty good.)
● It takes about 12 hours of training to surpass GPT-2 on the CORE metric
● If the budget is increased to approximately $1,000 (41.6 hours of training), the model becomes noticeably more coherent and can solve simple math and programming problems and pass multiple-choice tests. For example, a depth-30 model trained for 24 hours (a compute budget roughly equivalent to GPT-3 Small 125M, or about 1/1000 of GPT-3) reaches scores in the 40s on MMLU, the 70s on ARC-Easy, and the 20s on GSM8K.
Karpathy explains the technology behind it
On X, Karpathy answered users' questions about nanochat, sharing development details and the technology behind the project.
The following are selected questions and answers:
Q: What model design are the training and infrastructure based on?
Karpathy: nanochat's architecture is broadly similar to Meta's Llama models, with some simplifications, and it also borrows some design ideas from the modded-nanoGPT project. The goal is to establish a solid baseline for models of this scale.
Key architectural features include (a code sketch follows the list):
● Dense Transformer
● Rotary position embeddings (RoPE), with no explicit learned positional embeddings
● QK Norm (normalizes the query and key vectors)
● Untied embedding and unembedding weights
● Normalization applied immediately after the token embedding
● ReLU² activation in the MLP
● RMSNorm with no learnable parameters
● Bias-free linear layers
● Multi-Query Attention (MQA)
● Logit softcapping at the output layer
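Taken together, these choices can be sketched in a small amount of PyTorch. The code below is an illustrative reconstruction of the features listed above, not nanochat's actual implementation; the model dimensions, vocabulary size, rotary base, and softcap value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x: torch.Tensor) -> torch.Tensor:
    # RMSNorm with no learnable scale or bias.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)

def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotary position embedding applied over the head dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class MQABlock(nn.Module):
    """Pre-norm transformer block: multi-query attention + ReLU^2 MLP, no biases."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wkv = nn.Linear(dim, 2 * self.head_dim, bias=False)  # one shared K/V head (MQA)
        self.wo = nn.Linear(dim, dim, bias=False)
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x, cos, sin):
        B, T, C = x.shape
        h = rms_norm(x)
        q = self.wq(h).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.wkv(h).split(self.head_dim, dim=-1)
        k = k.view(B, T, 1, self.head_dim).transpose(1, 2)
        v = v.view(B, T, 1, self.head_dim).transpose(1, 2)
        # QK norm plus rotary embeddings on queries and keys.
        q = apply_rotary(rms_norm(q), cos, sin)
        k = apply_rotary(rms_norm(k), cos, sin)
        att = F.scaled_dot_product_attention(
            q, k.expand(-1, self.n_heads, -1, -1), v.expand(-1, self.n_heads, -1, -1),
            is_causal=True)
        x = x + self.wo(att.transpose(1, 2).reshape(B, T, C))
        x = x + self.w2(F.relu(self.w1(rms_norm(x))).square())  # ReLU^2 MLP
        return x

class NanoStyleLM(nn.Module):
    def __init__(self, vocab: int, dim: int = 256, n_heads: int = 4,
                 n_layers: int = 4, max_seq: int = 512, softcap: float = 15.0):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)             # untied from the output head
        self.blocks = nn.ModuleList([MQABlock(dim, n_heads) for _ in range(n_layers)])
        self.lm_head = nn.Linear(dim, vocab, bias=False)  # separate unembedding weights
        self.softcap = softcap
        head_dim = dim // n_heads
        inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
        freqs = torch.outer(torch.arange(max_seq).float(), inv_freq)
        self.register_buffer("cos", freqs.cos())
        self.register_buffer("sin", freqs.sin())

    def forward(self, idx):
        T = idx.size(1)
        x = rms_norm(self.embed(idx))                     # norm right after token embedding
        for blk in self.blocks:
            x = blk(x, self.cos[:T], self.sin[:T])
        logits = self.lm_head(rms_norm(x))
        # Logit softcapping squashes the output logits into (-softcap, softcap).
        return self.softcap * torch.tanh(logits / self.softcap)

model = NanoStyleLM(vocab=512)
logits = model(torch.randint(0, 512, (1, 16)))            # shape: (1, 16, 512)
```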
The optimizer is a Muon + AdamW combination, again largely influenced by modded-nanoGPT. Karpathy mentioned that he plans to try removing Muon in the future by carefully tuning the learning rate of each module under the Adam optimizer, but that work is not yet complete.
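For concreteness, here is a sketch of what such a Muon + AdamW split can look like, reusing the `model` from the previous block. This is not nanochat's optimizer code: the Newton-Schulz coefficients and the hyperparameters are assumptions borrowed from the modded-nanoGPT lineage, and the Muon class below is deliberately simplified.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes a 2-D
    # update; the core trick behind Muon (coefficients are an assumption here).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.float()
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    x = x / (x.norm() + 1e-7)
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

class Muon(torch.optim.Optimizer):
    # Simplified Muon: momentum SGD whose per-matrix update is orthogonalized.
    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p].setdefault("momentum", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(newton_schulz_orthogonalize(buf), alpha=-group["lr"])

# 2-D weight matrices inside the blocks go to Muon; embeddings and the
# unembedding go to AdamW.
muon_names = {n for n, p in model.named_parameters()
              if p.ndim == 2 and "embed" not in n and "lm_head" not in n}
optimizers = [
    Muon([p for n, p in model.named_parameters() if n in muon_names]),
    torch.optim.AdamW([p for n, p in model.named_parameters() if n not in muon_names],
                      lr=3e-4, betas=(0.9, 0.95), weight_decay=0.0),
]

# One illustrative update step.
model(torch.randint(0, 512, (1, 16))).mean().backward()
for opt in optimizers:
    opt.step()
    opt.zero_grad()
```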
Q: Can I train it on my own data, such as all my Notion notes, my health data, and my conversations with other large models, to build a personal chatbot that truly understands me?
Karpathy: I don't think the codebase is well suited to that. Think of these miniature models as very young children (like kindergartners); they simply don't have the raw intelligence of the larger models. If you fine-tune or train them on your own data, you might get some amusing responses that seem to mimic your writing style, but the result will be crude.
To get the result you want, you would probably need to first curate your raw data, rewrite it into a large amount of synthetic data (a step that is tricky, highly uncertain, and an active research area), and then fine-tune a leading open-source large model. That process would likely also require mixing in plenty of pre-training data so the model does not lose its general intelligence during fine-tuning.
So, to be honest, getting this process to work perfectly is still cutting-edge research.
Currently, the most feasible non-technical option is to import all your data into a tool like NotebookLM, which processes it with RAG (retrieval-augmented generation). Your information reaches the model through the context window, while the model's weights remain unchanged. The model won't truly "know" you, but this is probably the easiest approximation to implement.
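To make the distinction concrete, here is a minimal sketch of the retrieval pattern Karpathy is describing. It is not NotebookLM's API: the embedding function is a random stand-in for a real embedding model, and the notes are placeholder strings. The point is that retrieved chunks enter the prompt through the context window while the model's weights stay frozen.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

notes = ["Monday meeting notes ...", "Blood pressure log ...", "Chat export ..."]
index = np.stack([embed(n) for n in notes])      # one vector per chunk of personal data

def build_prompt(question: str, k: int = 2) -> str:
    scores = index @ embed(question)             # cosine similarity (unit vectors)
    top = [notes[i] for i in np.argsort(-scores)[:k]]
    # Retrieved chunks are pasted into the context window; no weights are updated.
    return "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What did I discuss on Monday?"))
```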
Q: How much of this code did you write by hand?
Karpathy: The code is mostly handwritten (with tab autocompletion). I tried using AI coding assistants like Claude and Codex a few times, but they were completely ineffective and on the whole unhelpful. Perhaps my codebase's style deviates too much from the distribution of their training data.
This article is from Tencent Technology (author: Jinlu) and is published by 36Kr with authorization.