Pre-training sped up by 2 to 3x, but Nous's new scheme, TST, is embroiled in controversy over "colliding" with a competitor's earlier work.

According to ME News, on May 14 (UTC+8), citing Beating's monitoring, Nous Research released a new pre-training scheme for large models called Tense Stacking Training (TST). At the same compute budget, the scheme cuts pre-training time by a factor of 2 to 3 by packing and compressing adjacent tokens during the early phase of training.

TST has two stages. For the first 20% to 40% of training, the model no longer reads tokens one at a time; instead it packs adjacent tokens together, averages them to form a single input, and at the output predicts which tokens (regardless of their order within the pack) will appear in the next pack. For the remainder of training, the model reverts to standard next-token prediction (a rough sketch of this objective switch is given below). Because the underlying architecture is not modified, the resulting model behaves exactly like a conventionally trained model at inference time. The method has been validated on MoE models with up to 10 billion parameters.

In essence, the scheme trades data for compute: it consumes the corpus faster in exchange for shorter training time. If high-quality text becomes scarce in the future, this accelerated data consumption could turn into a weakness. Moreover, within hours of the paper's publication, readers pointed out striking similarities between TST's mechanism and a 2024 paper, *Beyond Next Token Prediction*. The authors subsequently acknowledged on Hugging Face that this was "unfortunate convergent research" and promised to update the paper with additional citations. (Source: ME)
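For readers curious about the mechanics, here is a minimal sketch of how the two-stage objective described above could look in PyTorch. It is a guess at the method, not reference code: the pack size, the 30% switch point, the use of embedding averaging for the packed input, and the treatment of stage 1 as multi-label set prediction over the vocabulary are all assumptions, and names such as `pack_tokens`, `stage1_set_loss`, and the `model` interface are purely illustrative.

```python
# Hypothetical sketch of a TST-style two-stage objective (not Nous's released code).
import torch
import torch.nn.functional as F


def pack_tokens(token_ids, embedding, pack_size):
    """Stage 1 input: average the embeddings of each group of `pack_size` adjacent tokens."""
    B, T = token_ids.shape
    T_packed = T // pack_size
    ids = token_ids[:, : T_packed * pack_size].view(B, T_packed, pack_size)
    packed_emb = embedding(ids).mean(dim=2)          # (B, T_packed, d_model)
    return ids, packed_emb


def stage1_set_loss(logits, next_pack_ids, vocab_size):
    """Stage 1 output: predict WHICH tokens appear in the next pack, ignoring order.
    Modeled here as multi-label classification over the vocabulary (an assumption)."""
    targets = torch.zeros(*logits.shape[:-1], vocab_size, device=logits.device)
    targets.scatter_(-1, next_pack_ids, 1.0)         # mark every token present in the pack
    return F.binary_cross_entropy_with_logits(logits, targets)


def stage2_next_token_loss(logits, token_ids):
    """Stage 2: ordinary next-token prediction, so inference is unchanged."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )


def training_step(model, token_ids, step, total_steps, pack_size=4, stage1_frac=0.3):
    """Switch objectives at an assumed 30% of training (the article reports 20%-40%).
    `model` is assumed to expose `embedding`, `vocab_size`, and to return per-position logits."""
    if step < stage1_frac * total_steps:
        ids, packed_emb = pack_tokens(token_ids, model.embedding, pack_size)
        logits = model(inputs_embeds=packed_emb)     # (B, T_packed, vocab)
        # each packed position predicts the token set of the *next* pack
        return stage1_set_loss(logits[:, :-1], ids[:, 1:], model.vocab_size)
    logits = model(input_ids=token_ids)              # (B, T, vocab)
    return stage2_next_token_loss(logits, token_ids)
```

Under these assumptions only the batching and the loss change between stages, which is consistent with the article's claim that the final checkpoint is architecturally identical to a regular model at inference time.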
