The classic PPO algorithm: once rejected by NeurIPS.


Being rejected does not mean failure.

Article author and source: Machine Heart

That's really surprising.

PPO (Proximal Policy Optimization), a classic algorithm that was later widely used in RLHF and large model training, was rejected by NIPS 2017.

This matter was recently brought up by John Schulman, the author of PPO. He summarized the story in just one sentence: PPO was rejected by NIPS 2017.

The paper, first published in July 2017, initially looked like just a simpler, more engineering-friendly policy optimization algorithm. Its goal was to reduce implementation complexity while preserving the stability of TRPO, making reinforcement learning training easier to run and more practical.
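For readers unfamiliar with the algorithm, the core of PPO is its clipped surrogate objective, which limits how far a single update can move the policy away from the one that collected the data. The following is a minimal sketch in plain Python, not the authors' implementation: the function name `ppo_clip_objective`, the argument names, and the list-based inputs are illustrative assumptions, and the default clip range of 0.2 is the value used in the PPO paper's experiments.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    new_logps:  log-probabilities of the taken actions under the current policy
    old_logps:  log-probabilities of the same actions under the data-collecting policy
    advantages: advantage estimates for those actions
    clip_eps:   clip range (hypothetical default, 0.2)
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Ratio restricted to the interval [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # One sample whose ratio is 1.5 with a positive advantage:
    # the objective is capped at (1 + clip_eps) * advantage.
    print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
```

Once the ratio rises above 1 + clip_eps on a sample with positive advantage, the objective stops increasing, so the optimizer gains nothing by pushing the policy further on that sample. This simple first-order objective stands in for the constrained optimization step that TRPO requires, which is where the reduction in implementation complexity comes from.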

But a few years later, what truly propelled PPO to a larger stage was not traditional reinforcement learning tasks like Atari and robot control, but large language models.

From RLHF to today's RLVR, PPO has become one of the fundamental algorithms that large model training cannot do without. According to Schulman, PPO has enjoyed a second wave of popularity in the LLM era, for reasons that went beyond anything the original paper anticipated.

This reads less like Schulman complaining about the rejection and more like a reflection after the fact: the true impact of a technology often unfolds in ways its inventor did not initially anticipate.

Seeing this, many people will naturally wonder: Why was PPO rejected back then?

Schulman later explained that the paper was considered to have limited innovation at the time, and its improvement over existing baseline methods was not significant enough.

One netizen commented, "This actually reflects a mismatch between academic evaluation and real industry needs. The academic community often values novelty and improvement over the baseline in small-scale, controlled experimental environments, while the real world cares more about whether the method can scale, whether it remains stable in complex systems, and whether it actually runs."

Schulman also seemed quite calm about it. He said that was a long time ago, and he hoped that over the years, the academic community had gradually come to understand and absorb this "simple but scalable" aesthetic.

What truly surprised him was that the PPO paper and its objective function could have such a lasting impact. Whether an algorithmic change is merely a minor tweak that will be quickly forgotten and replaced, or whether it will remain in the system for a long time and become a fundamental component that is difficult to surpass, is often difficult to determine from the outset.

The story of PPO perfectly illustrates this point.

In fact, it's not just PPO. Many works in the history of AI that later proved to have a profound impact were rejected by top conferences when they were first submitted.

LSTM: Rejected by NIPS in 1996 because it was considered too complex and lacking biological rationale. It later became a core technology for sequence modeling tasks such as speech recognition and machine translation.

SIFT: Rejected by ICCV 1997 and CVPR 1998 because its engineering pipeline was seen as cumbersome and inelegant. It went on to dominate computer vision for more than a decade before the deep learning era.

Dropout: Rejected by NIPS in 2012 because it was seen as an engineering hack lacking theoretical rigor. It later became one of the most important regularization methods for deep neural networks and won the NeurIPS Test of Time Award.

Sometimes, time is the most rigorous and fairest judge.
