Kaiming He has also entered the field of language modeling.
This time, however, his team did not use the familiar autoregressive "next token prediction" paradigm behind ChatGPT.
Instead, they turned to an approach that has become hugely popular in image generation over the past few years and is increasingly being adopted for text: the diffusion language model (DLM).
In their latest paper, Kaiming He's team unveiled a new continuous diffusion language model: ELF (Embedded Language Flows).
Unlike many diffusion language models that still operate at the token level, ELF keeps the entire generation process in a continuous embedding space until the final step, when it re-discretizes the representation back into tokens.
With this design, ELF outperformed a number of mainstream diffusion language models using only 105M parameters, 45B training tokens, and 32 sampling steps.
The most intuitive metric is that it reduced the generative perplexity to 24 on OpenWebText.
A quick note on generative perplexity: essentially, a strong reference language model "grades" the generated text, measuring how closely it resembles real human-written language.
The lower the value, the higher the quality of the generated samples, and the less AI-like and more natural the model's output.
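To make this concrete, here is a minimal sketch of how generative perplexity is typically computed: a pretrained language model scores the generated text, and the exponentiated average negative log-likelihood is the perplexity. GPT-2 Large is used here purely as an illustration; the paper's exact evaluator model may differ.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical evaluator; the paper's exact scoring model may differ.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def generative_perplexity(texts):
    """Perplexity of generated samples under a reference LM."""
    nll_sum, token_count = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        # passing labels=ids makes the model return the mean next-token NLL
        out = model(ids, labels=ids)
        n = ids.size(1) - 1              # number of predicted tokens
        nll_sum += out.loss.item() * n
        token_count += n
    return float(torch.exp(torch.tensor(nll_sum / token_count)))

print(generative_perplexity(["Sample text produced by the model."]))
```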
Compared with mainstream diffusion language models, ELF achieves lower generative perplexity while requiring nearly 10 times fewer training tokens and fewer sampling steps.
For a long time, nearly all progress in diffusion language models came from the discrete side.
ELF is the first to show that the continuous route not only runs, it runs well.
What exactly did ELF do?
To understand ELF, you must first understand what the diffusion language model is actually doing.
There are two main technical approaches to diffusion language models. One is the discrete approach, represented by MDLM and Duo, which performs diffusion directly in the token space, processing discrete random variables at each step.
The second category includes continuous methods such as Diffusion-LM, CDCD, and DiffuSeq, which map tokens into continuous embeddings and denoise them in continuous space.
In prior work, discrete approaches such as MDLM, LLaDA, and Dream 7B have dominated. The reason seems obvious: language itself is discrete.
Against this seemingly common-sense view, Kaiming He's team offers the exact opposite assessment:
the problem may not be that "language must be discrete," but that earlier work never let a continuous path stay continuous all the way through.
Methods like Diffusion-LM do denoise in the embedding space, but they compute a token-level cross-entropy at every step, tying the continuous trajectory back to the vocabulary.
Later, LD4LG and Cosmos took a latent-diffusion route, which keeps the denoising continuous but requires training a separate decoder to map the latents back to tokens, effectively bolting on an extra module.
Building on this observation, ELF keeps all denoising in the continuous embedding space; tokens are only recovered at the final step t = 1.
Specifically, during training, ELF first encodes the discrete tokens into continuous embeddings and then adds noise to form z_t. The model is trained either to recover the clean embedding (an MSE loss) or to predict the tokens directly (a cross-entropy loss).
During inference, the model starts from Gaussian noise z_0 and denoises continuously in embedding space until the last step, when it switches to decode mode and maps the embeddings back to tokens.
For the first time, ELF cleanly separates "continuous representation" from "discrete output," two things previously thought to require repeated alignment:
The intermediate denoising is entirely handled by the continuous space; the final language generation is left to the last step of discretization.
Without hard-aligning to the vocabulary at every step, and without needing to train an additional decoder, the entire generation process truly achieves this for the first time:
Continuous is continuous, and discrete is discrete.
This is precisely the key reason why ELF can outperform a host of diffusion language models with fewer sampling steps and fewer training tokens.
ELF is not "diffuse first, then decode"
In its specific implementation, ELF also solves three problems:
How do you make tokens continuous? How do you denoise them in continuous space? And how do you finally turn them back into tokens?
Convert the token into a continuous embedding
To apply continuous diffusion to language, the first step is to transform discrete tokens into continuous representations.
In the paper, the text is first tokenized into a token sequence and then mapped into a continuous embedding space. There are several options for how to perform this mapping.
By default, ELF uses a pre-trained T5 encoder to produce bidirectional contextual embeddings. The paper also tests alternatives such as jointly trained embeddings and random embeddings.
Notably, this encoder is used only during training and adds no extra module at inference time.
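As an illustration of this step, the sketch below uses Hugging Face's T5EncoderModel to turn a token sequence into bidirectional contextual embeddings. The "t5-base" checkpoint and the lack of any projection layer are assumptions for the sake of a runnable example, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# "t5-base" is an illustrative choice; ELF's exact encoder checkpoint may differ.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

@torch.no_grad()
def embed(text):
    """Map text -> token ids -> bidirectional contextual embeddings."""
    batch = tokenizer(text, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state   # (1, seq_len, d_model)
    return batch.input_ids, hidden

ids, x_clean = embed("Diffusion in embedding space.")
print(ids.shape, x_clean.shape)   # the embeddings serve as clean targets for diffusion
```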
Perform flow matching in the continuous embedding space
After obtaining the continuous representation, ELF performs flow matching in the embedding space.
Simply put, Flow Matching defines a continuous flow path from noise to real data:
At t=0, it is Gaussian noise;
At t=1, the embedding is clean;
All intermediate states are linear interpolations of the two, which is the rectified flow in the paper.
In traditional flow matching, the network typically predicts the "velocity field" v directly.
However, ELF did not do that. Instead, it adopted the approach proposed by Kaiming's team six months ago in "Back to Basics: Let Denoising Generative Models Denoise"—
Directly predict the clean embedding x, known as x-prediction.
The training objective is to minimize the mean squared error (MSE) between the predicted embedding and the actual embedding.
As for why x-prediction was used, the paper gave two reasons:
First, it is more stable in high-dimensional representations—such as 768-dimensional or even higher-dimensional token embeddings; second, it is naturally aligned with the target of the final step, "predicting clean tokens".
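Putting these pieces together, a denoising-mode training step looks roughly like the sketch below: sample a time t, interpolate linearly between Gaussian noise and the clean embedding (the rectified-flow path), and regress the network's output against the clean embedding with MSE. The `denoiser` here is a placeholder for ELF's Transformer, and its conditioning interface is an assumption for illustration.

```python
import torch

def denoising_loss(denoiser, x_clean):
    """One x-prediction training step on the rectified-flow path.

    x_clean: (batch, seq_len, dim) clean token embeddings.
    denoiser: any network mapping (z_t, t) -> predicted clean embedding.
    """
    b = x_clean.size(0)
    t = torch.rand(b, 1, 1)                      # t ~ U(0, 1)
    z0 = torch.randn_like(x_clean)               # Gaussian noise at t = 0
    z_t = (1.0 - t) * z0 + t * x_clean           # linear (rectified-flow) interpolation
    x_pred = denoiser(z_t, t)                    # x-prediction: output the clean embedding
    return torch.mean((x_pred - x_clean) ** 2)   # MSE training objective
```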
The paper also specifically mentions that although it is theoretically possible to predict the velocity v first and then convert it to x, this would make it difficult to establish the weight sharing between denoising and decoding.
In their experiments, they also found that once weights were shared, the performance of v-prediction deteriorated significantly.
From continuous embeddings back to discrete tokens
The generated language must ultimately be output as discrete tokens.
So ELF only needs to map the continuous embeddings back into token space at the last time step (t = 1).
However, unlike many latent diffusion methods, ELF does not train an additional decoder in this step. Instead, it treats the final step as a continuous-to-discrete decoding process.
In other words, the decoder and the denoiser mentioned earlier are actually the same network.
To keep the final training step from being trivial (in theory, as t → 1 the input is already very close to a clean embedding), ELF adds an extra token-level corruption step at the final step to construct a perturbed input.
Subsequently, the same network outputs a clean embedding, which is then projected into token logits through a learnable unembedding matrix W.
The training objective is the standard token-level cross-entropy loss. The whole network shares one set of parameters and additionally receives a binary mode token: denoising mode or decoding mode.
During inference, ELF starts from Gaussian noise and denoises continuously in the embedding space until the last step t = 1, at which point it switches to decode mode and outputs the final tokens via argmax.
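The sketch below shows how the two modes could fit together at sampling time: the shared network is queried in denoising mode along an Euler discretization of the flow (its x-prediction is converted to a velocity via v = (x̂ − z_t)/(1 − t), one common conversion on the linear path), and at t = 1 it is queried once in decode mode, with a learnable unembedding matrix W turning the predicted embedding into token logits. The mode flag, step schedule, and function signatures are illustrative assumptions, not the paper's exact recipe.

```python
import torch

@torch.no_grad()
def sample(model, unembed_W, shape, steps=32):
    """Sample tokens: denoise in embedding space, discretize only at t = 1.

    model(z, t, mode): shared network; mode=0 -> denoising (predict clean
    embedding), mode=1 -> decoding (predict the embedding fed to unembedding).
    unembed_W: (dim, vocab_size) learnable unembedding matrix.
    """
    z = torch.randn(shape)                 # start from Gaussian noise (t = 0)
    for i in range(steps):
        t = i / steps
        x_hat = model(z, t, mode=0)        # x-prediction in denoising mode
        v = (x_hat - z) / (1.0 - t)        # convert to a velocity on the linear path
        z = z + v / steps                  # Euler step toward t = 1
    x_final = model(z, 1.0, mode=1)        # single decode-mode pass at t = 1
    logits = x_final @ unembed_W           # project embeddings to token logits
    return logits.argmax(dim=-1)           # discrete tokens via argmax
```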
It's worth mentioning that CFG (classifier-free guidance), one of the most commonly used techniques in image generation, has also been incorporated into ELF.
ELF uses self-conditioning as the conditioning signal and applies training-time CFG (one forward pass standing in for the usual two inference passes, so there is no inference overhead), transplanting the technique directly from the image side.
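For reference, the standard inference-time form of classifier-free guidance combines a conditional and an unconditional prediction as shown below; per the paper, ELF folds this into training so that one forward pass suffices, so this sketch illustrates the underlying image-side idea rather than ELF's exact mechanism. The drop probability and function names are assumptions.

```python
import torch

def cfg_combine(pred_cond, pred_uncond, w=2.0):
    """Standard classifier-free guidance: extrapolate away from the
    unconditional prediction by guidance scale w."""
    return pred_uncond + w * (pred_cond - pred_uncond)

def maybe_drop_condition(cond, p_drop=0.1):
    """In standard CFG training, the condition (here, the self-conditioning
    signal) is randomly dropped so one network learns both branches."""
    keep = (torch.rand(cond.size(0), 1, 1) > p_drop).float()
    return cond * keep
```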
Experimental comparison
In the experimental section, ELF essentially answered a question that had been lingering for the past two years:
Can continuous diffusion language models really win? The answer is: not only can they win, but for the first time, they win simultaneously in three dimensions: quality, speed, and training cost.
As mentioned at the beginning, on OpenWebText generation ELF needs only 32 sampling steps, without any distillation, to bring generative perplexity down to 24.
Previously, mainstream discrete diffusion models often required 1024 steps to approach this level.
Even more remarkably, ELF achieved this result with only 45B training tokens.
Competitors at a comparable level are generally trained on 500B+ tokens. In other words, with an order of magnitude fewer sampling steps and an order of magnitude less training data, the results are better.
And ELF also performed admirably on conditional generation tasks, where many diffusion models are most prone to falling behind.
Whether it's WMT14 machine translation or XSum text summarization, ELF consistently outperforms existing diffusion language models, even surpassing many autoregressive baselines.
The paper concludes with a rather restrained statement: ELF achieves a strong trade-off between generation quality, sampling efficiency, and training cost.
In plain terms: the continuous route was never incapable of working; it simply had never been kept continuous all the way through.
Author Introduction
Finally, a look at the authors of the paper.
The two first authors contributed equally; the order of their names was decided by a coin toss.
Hu Keya is one of the two first authors. She is a first-year PhD student in EECS at MIT and one of Kaiming He's first PhD students there, currently co-advised by Kaiming He and Jacob Andreas.
She graduated from the ACM class at Shanghai Jiao Tong University with a bachelor's degree. Her current research interests mainly lie at the intersection of language and vision, and she is committed to building intelligent agents with higher data efficiency and stronger generalization ability.
It's worth mentioning that on Kaiming He's MIT homepage, Hu Keya is listed first among the graduate students, making her arguably the most senior student in the group.
The second first author, Linlu Qiu, is also a PhD student at MIT, advised by Yoon Kim.
She graduated from the University of Hong Kong with a bachelor's degree and from the Georgia Institute of Technology with a master's degree. She also worked as an AI Resident at Google.
Interestingly, this isn't her first collaboration with Kaiming He. Just recently, she and Kaiming He's team put out the CVPR 2026 paper "ARC Is a Vision Problem!", which recasts the ARC reasoning problem as a vision problem.
Another author, Hanhong Zhao, is an undergraduate student at MIT. He attended the High School Affiliated to Renmin University of China and was a gold medalist in the International Physics Olympiad (IPhO).
Another author, Lu Yiyang, has a background with a bit of a "prodigy" feel to it.
He is a sophomore in Tsinghua University's Yao Class and is currently interning at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) under the supervision of Kaiming He. His main research areas are computer vision and deep generative models.
During his high school years, he was a physics competition student. He ranked first among Jiangsu Province students and ninth nationwide, winning the gold medal in the 39th National Physics Olympiad for Secondary School Students (CPhO) in 2022.
Previously, he co-authored a paper with Kaiming titled "Bidirectional Normalizing Flow: From Data to Noise and Back".
Another core author, Li Tianhong, is a postdoctoral fellow in Kaiming He's group.
He received his undergraduate degree from Tsinghua University's Yao Class and his doctorate from MIT. He was the first author of the paper "Back to Basics: Let Denoising Generative Models Denoise" published half a year ago.
In addition, the paper's other authors include Yoon Kim and Jacob Andreas, two MIT EECS professors specializing in language modeling, as well as Kaiming He himself.
Reference link [1] https://arxiv.org/pdf/2605.10938
This article is from the WeChat public account "Quantum Bit", author: henry, and is published with authorization from 36Kr.