Listened to this plus Gavin’s AI thoughts post. He seems very confident in pre-training scaling laws holding and I’m just… not so sure? The argument is very focused on advancements in compute pushing pre-training, but, definitionally, there need to be commensurate increases in data to scale, right?
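To put a rough number on “commensurate” (going off the Chinchilla paper as I remember it, so take the exact exponents loosely): for N parameters trained on D tokens, training compute is roughly

\[ C \approx 6ND, \qquad N_{\mathrm{opt}} \propto C^{0.5}, \qquad D_{\mathrm{opt}} \propto C^{0.5}, \]

i.e. compute-optimal training wants something like 10x the tokens for every 100x the compute. So compute gains alone don’t get you there; the data has to keep pace.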
We all know the famous Ilya line about pre-training data, so my question is, of course: where is this data coming from? People seem to be pointing to synthetic data being fed back into pre-training, but that idea has never really sat right with me.
I’ve held this intuitive sense that a model creating its own data to pre-train on should lead to a messy ouroboros of a system unable to progress. It’s learning in isolation, unexposed to novel data from different creators. BUT, I haven’t actually read any papers on the benefits or limitations of pre-training models on self-generated synthetic data.
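To make that ouroboros intuition concrete, here’s a toy sketch: nothing like a real pre-training setup, just a Gaussian fit-and-resample loop with sample counts I picked arbitrarily for illustration. Each “generation” fits a distribution to samples drawn from the previous generation’s fit, and the fitted spread steadily collapses.

```python
# Toy "ouroboros" sketch: each generation fits a Gaussian to samples drawn from
# the previous generation's fit. With a finite sample, the fitted variance
# shrinks in expectation every round, so diversity collapses over generations.
# (n_samples and n_generations are arbitrary choices for illustration.)
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0                   # the "real" data distribution we start from
n_samples, n_generations = 100, 500    # toy settings, not tuned to anything

for gen in range(n_generations):
    # generation t emits "synthetic data" from its current fit
    samples = rng.normal(mu, sigma, size=n_samples)
    # generation t+1 "pre-trains" on that data, i.e. refits mean and std
    mu, sigma = samples.mean(), samples.std()
    if gen % 100 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f}, sigma={sigma:.3f}")

print(f"final:   mu={mu:+.3f}, sigma={sigma:.3f}   (started at mu=0.000, sigma=1.000)")
```

A two-parameter Gaussian is obviously nothing like an LLM, so treat this as my intuition in code form rather than evidence, which is exactly why I’m asking for actual papers.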
Anyone else have this thought and/or research to point to? And I’ll note this is specifically about pre-training, not SFT, post-training, etc.