Model progress only looks slow because new data centers are slow to build; O1 already represents 100 times the training compute of GPT-4.
Recently, Bob McGrew, former Chief Research Officer of OpenAI, sat down for an in-depth conversation on the Unsupervised Learning podcast that was extremely information-dense and full of practical insight. The full interview is highly recommended.
Bob McGrew spent six and a half years at OpenAI, most recently as Chief Research Officer, and left in 2024.
In this interview, he discussed in depth the current status and future of AI, covering important topics such as the progress of pre-trained models, breakthroughs in multimodal AI, the future of robotics, the organization and culture of AI research, and the impact of AI on society.
McGrew predicts that the AI field will see huge changes in the next few years. The computing power competition will further heat up, video generation models and robotics technology will usher in explosive growth, and multimodal AI will profoundly change our lives.
McGrew pointed out that although GPT-5 is still some time away from release, AI development is not stagnant. OpenAI is focusing on "test-time computing" technology, which can achieve computing power growth without building new data centers. This means that OpenAI is expected to continue to improve the performance of AI models without significantly increasing costs, bringing continued confidence to investors.
McGrew also predicted that video generation models will completely change the way movies are made in the next two years. He believes that award-winning movies generated entirely by AI will soon appear. This is undoubtedly a disruptive change for the film and television industry, and it also brings huge imagination space for investors in related fields.
In addition, McGrew believes that robotics technology will be widely used in five years. Retail, warehouse and other working environments will be the first to usher in the "robot revolution." This also means that companies in the robotics industry chain will usher in unprecedented development opportunities.
Facing the AI wave, McGrew reminded us to pay attention to the cultivation of AI talents. He believes that mathematics, programming and writing skills will be the core competitiveness of future talents, and investment in related education fields will also reap rich returns.
01 The key points are as follows
When will GPT-5 arrive? A breakthrough in test-time computing
Many people think that progress on large language models has stalled since the release of GPT-4, but insiders have a very different view. Developing large language models like GPT requires a massive amount of computing power, which in turn depends on building new data centers, a slow, multi-year process.
Going from GPT-4 to GPT-5 will require a 100-fold increase in computing power, which will take time. Before GPT-5 is officially released, we may see a transitional version with a 10-fold increase in computing power.
Currently, OpenAI is focused on "test-time computing": investing more compute while the model generates an answer, in order to obtain longer and more coherent chains of thought. For example, OpenAI built O1 on top of GPT-4, achieving a 100-fold increase in effective computing power.
"Test-time computing" does not require the construction of new data centers, so there is still a lot of room for algorithm improvement. In the next few years, "test-time computing" will be one of the most exciting developments in the field of AI.
A breakthrough in multimodal AI: How Sora is leading the video generation revolution
Unlike other modalities (such as images), video is an extended sequence of events that requires a full user interface to account for how the story unfolds over time. In addition, video models are very expensive to train and run.
Sora is the first high-quality video generation model that addresses some of the challenges in video generation through a storyboard feature that allows users to place checkpoints at different points in time to guide the generation of a video.
In the future, video models will be much better, generate longer videos, and cost less. Just as with LLMs, you will be able to get very beautiful, realistic videos, and they will cost almost nothing.
Expect to see award-winning films generated entirely by AI within two years. The appeal of these films will be how directors use video models to realize their creative vision and do things in the medium that they could not film themselves.
The future of robotics: In five years, we will be interacting with robots in our daily lives
Robotics will see widespread adoption in five years, albeit with some limitations. The emergence of foundational models is a major breakthrough in robotics, allowing robots to start quickly and generalize in important ways.
Training robots in the real world has advantages over simulation. Simulators are good at simulating rigid bodies, but in the real world, many objects are soft, like cloth or cardboard, which simulators are not good at handling.
For any robot to be truly general, training it in the real world is essential.
In five years, we will be interacting with robots in our daily lives in ways that feel strange today. Robots will be widely used in work environments such as retail and warehouses.
OpenAI's culture: a blend of entrepreneurial spirit and collaborative spirit
OpenAI's culture is similar to that of a startup, emphasizing collaboration and shared goals. There is broad agreement about the right direction, and researchers are given a lot of freedom to explore the areas that interest them.
OpenAI's culture encourages collaboration and ensures that people work together to build a product rather than publish many papers. This is in stark contrast to the culture of academia, which is more focused on individual glory and competition.
About AGI
Many people worry that AI will lead to mass unemployment, but in reality AI can only automate individual tasks. Most jobs contain some tasks that cannot be automated, even programming.
Advances in AI will continue, it will be exciting, and it won’t slow down, but it will change. We are transitioning from a world where intelligence may be a critical scarce factor in society to a world where intelligence is ubiquitous and free.
When intelligence is no longer scarce, agency will become the scarce factor of production. Agency is the ability to ask the right questions and pursue the right projects. We need to think about how to develop this agency so that we can work with AI.
The future will be continuous, with AI advances changing our lives incrementally. We should focus on areas that require infinite patience, such as double-checking spending or comparison shopping, where AI can do a better job.
How to train children to adapt to the AI era?
Even though AI is developing rapidly, we should not change the way we educate our children. We should still teach them math, programming, and writing because these skills can help them think about problems in a structured way.
The future is unpredictable, and how AI actually works will remain mysterious and will reveal itself to us over time. We should encourage children to try things that push the limits of their abilities and build their resilience.
02 Full Interview
Jacob, host: Bob McGrew was the Chief Research Officer at OpenAI for six and a half years. He left just a few months ago, and we were lucky enough to host one of his first appearances since then on the Unsupervised Learning podcast. So we had the opportunity to ask him all about the future of AI. We talked about whether models have hit a plateau, and we talked about robot models, video models, computer use models, and what timelines and capabilities Bob sees in the future. We talked about the unique culture of OpenAI and what makes its research so effective, and some of the key decision points and what it was like to go through those decisions. We explored why AGI might feel not that different from today, and Bob also shared why he left OpenAI and what he plans to do next. I think you're going to enjoy this episode. Without further ado, here's Bob. Bob, thank you so much for being on the podcast.
Bob McGrew: Thanks for having me, and I'm looking forward to this conversation.
Jacob: Really glad to have you here. I know we're going to talk about a lot of different topics. I thought we might as well start with one that I think is the biggest issue on everyone's mind right now, which is the heated debate about whether models have hit a plateau. We'd love to hear your thoughts on that and how much potential you think there is left in pre-training.
Bob McGrew: Well, I think this is probably where there's the biggest divergence of opinion between outside observers and people inside the big labs. I think if you look at it from the outside, a lot of people initially started paying attention to AI because of ChatGPT. And then six months later, GPT-4 came out. It felt like everything was accelerating really fast and progress was being made. And yet, GPT-4 was released a year and a half ago, and everyone knew it had been trained before that. So, what's happening now? Why isn't there anything new coming out, right?
On the outside, people are wondering, are we hitting a data bottleneck? What’s going on? But you have to remember that progress in pre-training, in particular, requires a massive increase in compute. From GPT-2 to GPT-3, or from GPT-3 to GPT-4, the effective amount of compute has increased by a factor of 100. That’s what this increment represents. You do this by adding more floating point operations, adding more chips, expanding data centers, and improving algorithms. Algorithmic improvements can give some gains—50%, 2x, or 3x is great. But fundamentally, you have to wait for new data centers to be built.
There's no shortage of new data centers being built. You just have to look at the news to see that places like Meta, X, and other cutting-edge labs are also building new data centers, even if that doesn't always make the headlines. But fundamentally, this is a very slow process that takes years. In fact, before you see a full generational transition, like from GPT-4 to GPT-5, you're going to see something that's only a 10x improvement. People often forget that we went from GPT-3 to GPT-3.5 to GPT-4.
Now what's interesting is that pre-training is being done. I think we'll have to wait and see when the next generation of models are released. If you look at something like O1, we've been able to make progress using reinforcement learning. O1 represents 100 times more computation than GPT-4 by various metrics. Some people may not realize this because of the decision to name it O1 instead of GPT-5. However, in reality, this is a new generation of models.
When the next generation, hypothetical GPT-4.5, is trained, the interesting question is how does this pre-training progress compare to the reinforcement learning process? I think we will just have to wait and see what announcements come out.
Moderator Jordan: That brings up the question, given this multi-year process heading into 2025, do you think there will be as much progress in AI next year as there was last year, or do you think things will start to slow down?
Bob McGrew: Well, I think there will be progress. I think it will be a different kind of progress. One thing is, when you go into any next generation, you always run into problems that you didn't see in the previous generation. So even if the data centers are built, it takes time for people to work out the problems and finish training the models.
The reinforcement learning process that OpenAI used to train O1 creates a longer, more coherent chain of thought, effectively packing more compute into the answer. So, you know, if one model takes a few seconds to generate an answer and another model takes, say, a few hours to generate an answer, then that's 10,000 times more compute, if you can actually leverage it, right?
Honestly, we've been thinking about how to use test-time computing since about 2020. And in the end, I think this is actually the real answer to how to do this, how to do it without wasting a lot of computing resources. The benefit of this is that it doesn't require new data centers. There's a lot of room for improvement here because this is a new technology that's just getting started, and there's a lot of opportunity for algorithmic enhancements.
In theory, there’s no reason that the same fundamental principles and ideas used to get O1 from, say, a few seconds of what GPT-4 can do, to O1 taking 30 seconds, a minute, or several minutes to think can’t scale to hours or even days. Just like from GPT-3 to GPT-4, there’s no fundamental new technology; both are trained in roughly the same way, but scaling is very difficult.
So that's really the core of the question: Can you actually scale? I think that's going to be the type of progress we're going to see, and it's going to be the most exciting.
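OpenAI has not published how O1 is trained, so the following is only a toy sketch of the general idea Bob describes: spending more compute at answer time buys reliability. Here the extra compute is naive best-of-n sampling with majority voting over a simulated model, not O1's reinforcement-learned chain of thought; every name and number in it is made up for illustration.

```python
import collections
import random

def sample_answer(problem: str) -> str:
    """Stand-in for one sampled reasoning chain from a model (purely simulated here)."""
    # Pretend the model answers this problem correctly 60% of the time.
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

def answer_with_more_compute(problem: str, n_samples: int) -> str:
    """Spend more inference-time compute: sample n chains and majority-vote the answer."""
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return collections.Counter(answers).most_common(1)[0][0]

# More samples -> more test-time compute -> higher reliability, with no new data centers.
for n in (1, 10, 100):
    trials = 1000
    correct = sum(answer_with_more_compute("toy problem", n) == "42" for _ in range(trials))
    print(f"{n:3d} samples per question -> accuracy ≈ {correct / trials:.0%}")
```

The point of the toy is only that accuracy rises as you let the model "think" more per question, which is the scaling axis Bob is describing.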
Jacob: Yeah, in 2025. Given the focus on test-time compute and the current use of O1, I think it's really interesting to think about how people will actually use these models, right? I think you recently tweeted something interesting about needing new form factors to unlock the capabilities of certain models. So maybe expand on that a little bit. For example, have you seen any early form factors that you think are interesting for using these models?
Bob McGrew: Well, yeah. To explain this, chatbots have been around for a while. Most of the interactions that people have with chatbots today, GPT-4-level models can do those tasks very well. You know, if you ask ChatGPT, who was the fourth Roman emperor? Or how do I heat basmati rice? Most of our day-to-day conversations are handled pretty well.
When we were thinking about releasing O1 in preview, there were a lot of questions about whether people would use it and whether they would find anything to do with it. I think those questions are valid. It's about understanding what you need to do with this model to really get value from it. Programming is a great use case for this because it presents a structured problem where you're trying to make progress over a long period of time, and it leverages reasoning capabilities significantly.
Another example is if you are writing a policy brief. In this case, you need to write a long document that needs to be meaningful and cohesive. The truth is, while there are a lot of programmers out there, most non-programmers don’t have to solve tasks like this on a daily basis. However, coming back to the potential breakthrough here, it’s important to have a coherent chain of thought and a structured approach to solving the problem.
This process involves more than just thinking about a problem; it can also involve taking action and developing a plan of action. What I’m most excited about with models like O1 — and I’m sure other labs will soon be rolling out similar models — is using them to enable long-term actions, essentially acting as agents. While I think the term “agent” is overused and doesn’t clearly communicate what we’re trying to achieve, there are many tasks in my life where I want models to order things for me, shop for me, and solve problems in ways that involve interacting with the rest of the world.
I think that's the product shape we really need to solve: understanding what it is and how we can deploy it effectively. Right now, I don't think anyone has figured that out yet.
Jacob: That's so interesting. I mean, it makes total sense. I think everyone, you know, has a lot of imagination about what these agents can do and what problems they can solve for people and businesses. So what are the biggest barriers to making that happen today? Obviously, you've seen some early models, like the computer use model that Anthropic has released, and I'm sure other labs are working on this as well. But when you think about what's holding us back from getting there, what are some of the hard problems that still need to be solved?
Bob McGrew: Yeah, there are a lot of issues. I think the most immediate issue is reliability. So, you know, if I ask something to be done, let’s put the action aside, right? If I ask an agent to do something on my behalf, even if it’s just thinking or writing some code for me, and I need to step away for five minutes or an hour to let it do its work, if it strays from the task and makes a mistake, and when I come back it hasn’t done anything, then I just wasted an hour. That’s a big problem.
Now add to that the fact that this agent is going to be performing actions in the real world. Maybe it's buying something for me. Maybe it's submitting a pull request. Maybe it's sending a note, an email, a Slack message on my behalf. If it doesn't do a good job, there will be consequences. I'll be embarrassed at the very least, and I might even lose some money. So reliability becomes even more important than it has been in the past.
I think a rule of thumb when thinking about reliability is that going from 90% reliability to 99% reliability is probably going to increase the amount of computation by an order of magnitude. That's a 10x improvement. To go from 99% reliability to 99.9% reliability is going to require another order of magnitude improvement. So every additional "9" requires a huge leap in model performance. That 10x improvement is significant and represents a year or two of work.
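Treating that as a rough heuristic rather than a published scaling law, the rule of thumb can be written out directly: each additional "nine" of reliability costs roughly another order of magnitude of compute. A minimal sketch:

```python
import math

def extra_compute_factor(current: float, target: float) -> float:
    """Rule of thumb from the interview: every extra 'nine' of reliability
    (0.90 -> 0.99 -> 0.999) costs roughly another 10x of compute."""
    nines_now = -math.log10(1.0 - current)      # 0.90 -> 1 nine, 0.99 -> 2 nines
    nines_target = -math.log10(1.0 - target)
    return 10.0 ** (nines_target - nines_now)

print(extra_compute_factor(0.90, 0.99))    # ~10x
print(extra_compute_factor(0.99, 0.999))   # ~10x
print(extra_compute_factor(0.90, 0.999))   # ~100x, i.e. two of those year-or-two leaps
```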
So I think that's the first question we have to face. I think the second interesting question is that everything we've talked about so far has been for consumers, right? Not embedded in the enterprise. But when you're talking about agents performing tasks, for a lot of us that's going to be something we do at work, embedded in the enterprise. I think that opens up a whole host of other considerations.
Moderator Jordan: That's interesting. We see in the enterprise today that a lot of consulting firms are actually doing a good job because currently deploying these technologies to the enterprise requires a lot of hand-holding. Do you think this hand-holding and the need for help from the enterprise will continue for a while? Or do you think it will become more accessible and enterprises can deploy these large language models very easily in the future?
Bob McGrew: Yeah, I think that's a really interesting question. And, I mean, even starting to build, what's the problem with deploying large language models in the enterprise? Well, if it's going to automate a task for you or do your job, it probably needs context. Because in the consumer space, there's not a lot of context. Okay, you like red, great. Not very interesting.
Host Jacob: Thank you for using red as an example (the podcast is by Redpoint).
Bob McGrew: But, you know, in the enterprise, you know, who are your colleagues? What projects are you working on? What is your code base? You know, what have people tried? What do people like and don’t like? All of this information exists in the enterprise in an ambient way. It’s in your Slack. It’s in your Docs. You know, maybe it’s in your Figma or whatever. So how do you gain access to that?
Well, you need to build some of these one-off things yourself. I think there's definitely a path where people build libraries of these connectors and then you can come in and use them. And that's very similar to what we did at Palantir, where the fundamental problem Palantir solves is integrating data across the enterprise. I think that's one of the reasons why something like Palantir's AI platform, AIP, is so interesting. So I think that's the first path, where you're building a library of these connectors, and you can build an entire platform on top of that.
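To make the "library of connectors" idea concrete, here is a minimal sketch of what such a library could look like. The class and function names are hypothetical and are not any actual Palantir or OpenAI API; a real system would call the underlying Slack or document-store APIs.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol

@dataclass
class ContextDocument:
    source: str   # e.g. "slack", "docs", "figma"
    text: str

class Connector(Protocol):
    """One connector per enterprise system; each yields documents the agent can use as context."""
    def fetch(self, query: str) -> Iterable[ContextDocument]: ...

class SlackConnector:
    def fetch(self, query: str) -> Iterable[ContextDocument]:
        # A real implementation would call the Slack search API here.
        yield ContextDocument(source="slack", text=f"(messages matching {query!r})")

class DocsConnector:
    def fetch(self, query: str) -> Iterable[ContextDocument]:
        # A real implementation would query the document store here.
        yield ContextDocument(source="docs", text=f"(documents matching {query!r})")

def build_context(connectors: list[Connector], query: str) -> str:
    """Gather ambient enterprise context from every connected system into one prompt block."""
    docs = [doc for c in connectors for doc in c.fetch(query)]
    return "\n".join(f"[{doc.source}] {doc.text}" for doc in docs)

print(build_context([SlackConnector(), DocsConnector()], "Q3 roadmap"))
```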
The other path is computer use. Instead of having to integrate in this very specific and potentially customized way, you now have one tool you can use to do everything. Anthropic launched this, and it's really interesting; we were already talking about these computer-use agents at OpenAI back in 2020, before the people who went on to found Anthropic left, and Google DeepMind has published papers on this. Every lab has thought about this problem and is working on it.
The difference between computer-use agents and these programmatic API integrations is that, because you are controlling a mouse and keyboard, the actions you take involve many more steps. You may need 10 or even 100 times as many tokens as you would with programmatic integrations.
So now, what are we back to? You need a model with a very long and coherent chain of thought that can consistently solve problems over a long period of time, which is exactly the kind of problem that O1 solves. I’m sure there are other ways to solve this problem. But I think this is going to be a breakthrough that we’re going to see in the next few years.
Jacob: Next year. How do you think this will ultimately play out? Because I think on the one hand, obviously, a general model where you can use computers in any context seems very attractive. I think it's probably hard to get to 99.999% reliability. And, you know, there are a lot of steps that can go wrong at different points. You know, another view on how this would work is, I'm sure some of these problems could be simplified if you opened up the underlying application APIs in some way, right? Or some other approach, or you could provide a specific model for using Salesforce or some specific tool, I don't know. If you can access the underlying experience, then integrations will ultimately be a huge advantage. So you can do things in an instant, rather than sitting there and watching a computer do things on a screen.
Bob McGrew: Yeah, well, I mean, I think you're definitely going to see a mix of these approaches, some of which use these integrations and some of which, you know, computer usage becomes a fallback if you don't have something custom built to use. And then maybe you look at what people use and if that works, you'll come up with a more detailed integration.
As for the question of whether you're going to see a computer-use agent specifically for Salesforce, technically that doesn't make a lot of sense to me, because fundamentally you're leveraging data: someone went out and collected a massive dataset of how Salesforce is used.
You can throw this data in -- it's to Salesforce's benefit to share these datasets with Anthropic, OpenAI, and Google. They train their own models. I think every application provider would want this to be public and part of every base model. So I don't think, you know, it doesn't seem like a reason to have specialized models in this way.
Moderator Jacob: No, that's a really compelling point because I think when you're in a competitive space and your competitors are making their data public and their products are becoming easier to use, you definitely want your product to be like that, too.
Bob McGrew: Yeah, it's a little bit of a mystery to me why there hasn't been this ecosystem of people cramming data into large language models. This is really the equivalent of Google's SEO.
Jacob: That's a really interesting point. How far do you think we are from widespread adoption of computer use?
Bob McGrew: Well, I mean, I think a good rule of thumb for these things is when you see a demo, it's super attractive, but it doesn't work very well. It's a pain to use. And then, you know, give it a year, it's ten times better. And that improvement is log-linear. So ten times better, you know, is just one level of improvement. But one level of improvement is pretty remarkable. You'll start to see it being used in limited use cases. And then give it another year. By then, it'll be amazingly effective, but you still can't rely on it every time. We're at that point with chatbots today; you still have to worry about them hallucinating. So the question of adoption really depends on the level of reliability that you require. Any area that can tolerate errors will be automated faster than areas that can't.
Jacob: So I want to go back to Jordan's original question, basically, it makes perfect sense that right now you need a lot of assistance to integrate into the right data and define custom safeguards and workflows. So what kind of middle layer will exist between, "Hey, great computer usage model, are businesses ready to sign up?" What will that middle layer look like?
Bob McGrew: Man, I think there should be startups that define it. You know, I don’t think we fully know the answer yet. I think you’ll see an interesting phenomenon when you have a general tool like computer use, the problems it solves are fractal in difficulty, it can solve a lot of problems. But then you’ll see a really important problem and you can’t quite solve it. And then you say, okay, now we’re going to do something very specific about this, and maybe we’ll go with a programmatic approach to this. So I think we’ll see a mix of approaches for a while.
Jordan: I'm curious, you've obviously been working on the research side and are responsible for some really cutting-edge research. We talked a little bit about test-time computing. What other areas are you particularly interested in?
Bob McGrew: Well, I think we've talked about pre-training. We've talked about test-time compute. Another really exciting thing is multimodality. It's a big day for multimodality: today was the announcement of Sora. And this is actually kind of the culmination of a long journey. Large language models were, let's say, invented in 2018. And obviously you can apply the Transformer and some of the same techniques to other modalities. So you have vision, you have image output, audio input, and audio output.
First of all, these things started out as auxiliary models like DALLE or Whisper. Eventually, they were integrated into the main model. The modality that resisted this for a long time was video. I think Sora was the first to do a demonstration; other companies, like Runway, and some other models followed. Now Sora itself has been released. I think there are two really interesting and different things about video compared to other modalities.
When you're creating an image, you probably really just want to create an image from a prompt. Maybe you try it a few times. If you're a professional graphic designer, you might edit some of the details in the image. But let's be honest, none of us are. A lot of the uses here are: do you need some slides? Do you want an image to go with your tweet or presentation? It's a pretty straightforward process.
Whereas, for video, wow. I mean, it's an extended series of events. It's not a prompt. So now you actually need a whole UI. You need to think about how to make this story unfold over time. And I think that's one of the things we saw with the Sora launch. Sora spent a lot more time thinking about that; the product team put a lot more effort into that than some of the other platforms.
Another thing you have to consider is that video is very expensive. It's very expensive to train these models, it's very expensive to run these models. So while it's interesting to see Sora quality video -- and I think Sora quality is indeed better -- you have to pay a little attention to see that it's better quality, at least if you're only watching a short clip.
Sora is now available to anyone with a Plus account, and OpenAI has released a $200/month Pro account that includes unlimited slower-queue Sora generations. Once you have this level of quality and this level of distribution, you've solved two problems at once, and that's a high bar for other competitors to reach.
Moderator Jacob: What will the development of video models look like in the next few years? I mean, obviously in the field of large language models, we have seen huge progress, and it feels like last year's models are now ten times cheaper and much faster. Do you think there will be similar improvements in video?
Bob McGrew: I think the analogy is pretty direct, actually. So if I think about the difference between today's video models and two years from now, first of all, the quality is going to be better. The instantaneous quality is already very good now: you can point to all the hard problems it gets right, oh look, there's a reflection there, there's some smoke. The hard part is extended, coherent generation.
So the Sora product team built a storyboard feature that allows you to set checkpoints at different points in time, like every five seconds or every ten seconds, to help guide the generation. You know, fundamentally, if you want to go from a few seconds of video to an hour of video, that's a very difficult problem. I think that's something you're going to see in the next generation of models.
On the other hand, another analogy is that I actually think it's going to be very much like large language models, where if you want a GPT-3 quality token, it's 100 times cheaper than when GPT-3 first came out. And the same thing will be true with Sora, where you're going to be able to see these really beautiful, realistic videos, and they cost next to nothing.
Jacob: I think the dream is to have a full movie generated by AI that wins some awards or something, and you know, just as a shameless podcast question, when do you think we'll have a movie like that?
Bob McGrew: I can only guess. Oh, gosh. Yeah. Honestly, winning an award is kind of a low bar, right? I think there are a lot of award shows. Really, is this a movie you actually want to see? Yeah. I think we'll see it in two years, but it's actually going to be less impressive than I just said, because the reason you want to see it is not because of the video itself, but because there's a director who has a creative vision and uses the video model to achieve his creative vision. I think they do that because they can do things in this medium that they couldn't film. We can imagine it. None of us here are directors, but we can all imagine a lot of possibilities. We're not graphic designers, we're not directors, but, yes, it will be like that in the future.
Jordan: Exactly. Yes, we have some very specific skills here. Yes, we're seeing a lot of companies popping up trying to be the Pixar of AI. And we always ask the question, when is this actually going to be feasible? So it sounds like it's going to be a lot sooner than we at least expected.
Bob McGrew: That's my guess. Once something gets to the point where you can demonstrate it, progress will be very rapid. Before that, progress is very slow, or at least it's not visible.
Moderator Jordan: I want to move from the video to robotics. You originally joined OpenAI to work on a lot of robotics stuff. We'd love to get your thoughts on the field and where we are today and where you think it's going.
Bob McGrew: This is a really personal question. When I left Palantir, one of my thoughts was that robotics was going to be the area where deep learning became real, not just a button on someone’s website. So, I spent a year between Palantir and OpenAI getting really deep into robotics, writing some early code on vision with deep learning. It’s a very challenging area. At the time, I thought it might be another five years; that was 2015, and that was completely wrong. But, I think it’s right now. I believe that robotics will be widely used in five years, albeit with some limitations. So I think it’s a good time to start a robotics company.
One fairly obvious point is that foundation models have been a huge breakthrough in getting robots up and running quickly, enabling them to generalize in important ways. There are a few different aspects to this. One of the more obvious ones is the ability to use vision and turn it into action plans, which is what the foundation model brings. A slightly less obvious and perhaps more interesting aspect is the whole ecosystem that has grown up around this. Since leaving OpenAI, I've spent some time with founders, including some robotics founders. One of them told me that they've actually set the robot up to hold a conversation. That's really cool and a lot easier; you can tell the robot what to do, it'll understand the gist, and it uses specialized models to carry out the action. Before, it was cumbersome to write out what you wanted, and you had to sit in front of a computer instead of looking at the robot. Now you just talk to it.
I think one major difference in results that we still don't understand is whether you're learning in simulation or learning in the real world. Our major contribution in robotics over the last two years has been showing that you can train in a simulator and have it generalize to the real world. There are a lot of reasons to use a simulator; for example, it's cumbersome to run in a production system or in the real world. You can do free testing and so on. But simulators are good at simulating rigid bodies. If you're doing a grasping and placing task with rigid objects, that's great. But a lot of things in the world are squishy objects. You have to deal with cloth, or, when you think about warehouses, cardboard. Unfortunately, simulators don't do a particularly good job of handling those scenarios. So for anything that wants to be truly generalizable, our only way right now is to use real-world demonstrations. And as you can see from some of the work that's come out recently, this can actually produce promising results.
Jacob: It works really well. And then, I guess, obviously this is somewhat unknowable, like, you know, when people figure out scaling laws in robotics and how much data people might need for teleoperation, but do you think we're close to it? Or, I mean, obviously, you know, in 2015, you thought it was five years away. How close do you think we are to the moment where people say robotics is like ChatGPT, where people say, oh, that's really great, that looks different and it works.
Bob McGrew: In terms of predictions, especially about robotics, you really have to think about the setting. So I'm pretty pessimistic about mass consumer adoption of robotics because it's scary to have a robot in your home. Robotic arms are lethal. They could kill you, and more importantly, they could kill your children. And, you know, you can use different kinds of robotic arms that don't have those disadvantages, but they have other disadvantages. The home is a very unconstrained environment.
But I do think that in all forms of retail or other work environments, I think we'll see this in five years. If you go to an Amazon warehouse, you can even see this; they already have robots that solve their mobility problems. You know, they're working on pick and place. I think you're going to see a lot of robots rolled out in warehouse environments.
And then, you know, it's going to be incremental over a period of time, on a field-by-field basis. I'm not going to predict when it's going to be in the home, but I think you're going to see it being used widely. I think in five years, we're going to be interacting with robots in our daily lives in ways that feel strange today.
Jacob: I mean, obviously there are some standalone robotics companies. And to some extent, obviously robotics leverages the foundational, you know, advances in LLM. I'm curious, like, you know, is this all going to converge? Obviously there are companies that just do video models. There are companies that focus on biological, material science. When you think about where this is going long term, you know, is there going to be one giant model that encompasses all of this?
Bob McGrew: At the cutting edge of model scaling, I think you should continue to expect these companies to come out with a model that is going to be the best at every dimension in every form of data that they have. That's an important caveat.
What specialization really buys you is price/performance. Over the past year, you've seen cutting-edge labs get better at small models that have a lot of intelligence that can do chatbot-like use cases at a very low cost.
If you're a company, a very common pattern at this point is that you figure out what you want AI to do for you, and then you run it using the most cutting-edge model you like. Then you generate a huge database and fine-tune some smaller models to do that. You know, this is a very common practice; OpenAI provides this as a service, and I'm sure this is a common pattern on every platform.
You can say, you know, this is very, very cheap. Now, if you trained a chatbot like this, your customer service chatbot is trained like this, if someone deviates from the script, it's not going to be as good as if you had used the original cutting-edge model. But that's OK; this is the price/performance ratio that people are willing to accept.
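The distill-from-a-frontier-model pattern Bob describes can be sketched roughly as below with the openai Python SDK. The model names, file name, and prompts are placeholders, and a real pipeline would add evaluation and far more data; this is an illustrative sketch, not OpenAI's documented recipe.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Use a frontier model to answer examples of your narrow task.
prompts = ["Where is my order #1234?", "How do I reset my password?"]
with open("distill.jsonl", "w") as f:
    for prompt in prompts:
        reply = client.chat.completions.create(
            model="gpt-4o",  # placeholder "cutting-edge" model
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Chat-format fine-tuning examples: the small model learns to imitate the big one.
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]}) + "\n")

# 2. Fine-tune a smaller, cheaper model on the generated dataset.
training_file = client.files.create(file=open("distill.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini",  # placeholder small-model name
)
print("fine-tune job:", job.id)
```

The trade-off is exactly the one Bob names: the distilled model is much cheaper per call, but degrades more than the frontier model when users go off-script.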
Jacob: One thing I found interesting when we were chatting earlier is your macro view of AI progress: basically, that in 2018 we expected that by 2024 we would have all kinds of model capabilities which, from first principles, you would think would change everything, so that the world would be almost unrecognizable relative to 2018. And while AI has had a huge impact on the wider world, I can't say its spread has completely changed the way the world works. Why do you think that is?
Bob McGrew: Well, I just want to recap a little bit, I think, as strange as it sounds, the right mindset to have about AI is to be deeply pessimistic. Like, why is progress so slow? Why, you know, some people say that AI is responsible for 0.1% of GDP growth. But that's not because of the productivity gains from using AI, it's because of the capital expenditures incurred to build the data centers that are needed to train AI. So why isn't AI evident in the productivity statistics? Just like people said in the 1990s about the internet.
I think there are a few reasons for this. First, the idea in 2018 that once you can talk to it and it can write code, then everything will be automated instantly. This is the same idea that engineers have when they're asked to write a feature. You might think, "Oh yeah, I can do this in a couple of weeks." But then you start writing code and you realize, "Oh, actually, this feature is a lot more complicated than I thought." If you're a good engineer, you might estimate two weeks, but in reality the project might take two months. If it's a bad engineer, they might find that the feature is impossible to write.
I think this is what happens when we really dig into how humans get their jobs done. Yes, you might talk to them on the phone, but that doesn't mean all they do is talk to you. There's real work involved. Fundamentally, all that AI can automate is one task. However, a job is made up of many tasks. When you look closely at real jobs, you'll find that for most jobs, there are tasks that can't be automated.
Even if you look at programming, for example, the boilerplate code is optimized first, and the trickier parts, like figuring out what exactly you want to do, are solved last. So I think as we continue to roll out AI, we're going to find more and more complexities and limitations in how it can automate the full scope of human work.
Moderator Jordan: So with that in mind, in terms of progress, what areas do you think are underappreciated today and should be getting more attention than they are getting?
Bob McGrew: Well, okay. Here’s an answer, the startups that I’m really interested in are the ones where people are using AI to solve some really boring problems.
Imagine you run a company and you can hire all the smart people you want to do super boring things like go through all your spending and make sure you're price-matching appropriately. If your procurement department was full of people like Elon Musk, for example, who really, really carefully controlled spending, you could probably save a lot of money.
Nobody does that because, you know, the people who can actually save money, they'll get bored. They'll hate the job, right? But AI is infinitely patient.
It doesn't have to be infinitely smart. And, you know, I think anywhere that you're running your business where you can get value from what people who are infinitely patient do, then that's what AI should automate.
Jacob: That's interesting, because I always thought of consultants as an arbitrage: a way to get smart people to solve boring problems or work in boring industries. And obviously, with cutting-edge AI models, you can apply very high-IQ effort to problems you could never get a smart person to work on.
Bob McGrew: Yeah, I mean, the first time I heard about this, somebody did a productivity study and it showed that AI was really delivering a 20 to 50 percent improvement, I thought, wow, that's great. And then I realized, oh, it's consultants. Well, you know, AI is really good at bullshitting, and consultants are bullshitting. So maybe we shouldn't be surprised that productivity gains are showing up there first.
Host Jacob: Yeah, I think the improvement is also the biggest among the lower half of the group, right?
Bob McGrew: Exactly. Well, actually, I think this is a little bit hopeful. Because if you look at the lower half of the people who are performing, you know, they have skills that humans have that are hard to automate, which is the hopeful version of this. They know what they’re doing, but they don’t know how to write code to do it. And then the model comes along and it says, oh, I know how to write code to do it, but I don’t know what I’m supposed to do. So now these lower performers can actually get really good at what they do. So I think this is very hopeful.
Moderator Jordan: I guess, in terms of performance, you have worked and are working with some of the best researchers in the world. What do you think makes an AI researcher the best?
Bob McGrew: There are many different types of researchers doing different things. Think about someone like Alec Radford, who invented the GPT family and CLIP, who basically invented large language models and then went on to do various forms of multimodal research. Alec is someone who likes to work alone at odd hours late at night. In contrast, other outstanding people like Ilya Sutskever and Jakub Pachocki, the first and second chief scientists of OpenAI respectively, have great ideas and visions. They help others solve challenges and play a key role in setting the overall roadmap for the company.
One key characteristic that the best scientists all share is a certain amount of perseverance. I’ll always remember watching Aditya Ramesh, who invented DALL-E, struggle with the problem of generating an image that wasn’t in the training set, to prove that neural networks could be creative. The original idea for DALL-E was to see if it could create a picture of a pink panda skating on ice, an image that Aditya was certain didn’t exist in the training data. He worked on it for 18 months, maybe two years, trying to achieve this goal.
I remember about a year later, he came over to show me a picture and said, "Look, this is the latest generation. It's really starting to work." All I saw was a blur, with faint pink at the top and white at the bottom, just pixels starting to come together. I couldn't make out much at the time, but Aditya persisted. This kind of tenacity is what every successful researcher must have when solving fundamental problems. They must see this as their "last battle" and be determined to keep at it for years, if necessary.
Jacob: To make it work. What did you learn from forming a research institute like this with a group of people like this?
Bob McGrew: Well, interestingly, the best analogy I can think of actually comes from Alex Karp at Palantir, who always says engineers are artists. And that makes a lot of sense. When you talk to a really good engineer, they just want to create. They have something in their mind. Code is how they bring that sculpture in their mind to life.
At Palantir, you know, you have to ask them to fix bugs, but every time you do, their artist side gets sad. You have to have a process to get people to work together, but that makes their artist side sad too. The fact is, engineers are artists, a 10x engineer is a 10x artist, and a researcher is 100 times the artist of any engineer.
There are a lot more things to consider when setting up an organization with researchers. There is a way of engineering management where you say it would be great if everyone is interchangeable parts and you have a process that allows them to work together. However, working with researchers requires very close attention because it is critical that you do not stifle their artistry.
It is the passion for the vision in their heads that makes them willing to take on all the challenges of turning that vision into reality.
Moderator Jordan: You are lucky to have worked at Palantir and OpenAI, and there are a lot of articles discussing how Palantir's culture is very special. When you think about OpenAI, I believe there will be a lot of articles about its culture in the future. What do you think these articles will say?
Bob McGrew: Yeah. I mean, I think one of the things is working with researchers like we just talked about. The other crazy thing about OpenAI is how many transformations it's gone through, or I prefer to think of it as multiple rebuilds. So when I joined OpenAI, it was a nonprofit. The vision of the company was to build AGI by writing papers. We knew that was wrong; it didn't feel right. A lot of the people in the early days, Sam, Greg, and I, were entrepreneurial people, and this path to AGI didn't feel right.
A few years later, the company transitioned from a nonprofit to a for-profit organization. That was very controversial within the company, in part because we knew at some point we were going to have to interact with the product. We had to think about how to make money. The partnership with Microsoft became another rebuilding moment, which was also very controversial. I mean, maybe making money is one thing, but giving it to Microsoft, to a big tech company, wow, that was bad.
And then, just as importantly, we decided to say, okay, not only are we going to work with Microsoft, we're going to build our own product using the API. And then finally, adding consumer services to enterprise services through ChatGPT. These are the defining transformations that a startup goes through. At OpenAI, it feels like every 18 months or every two years, we're fundamentally changing the purpose of the company and the identity of the people who work there.
We went from the idea that writing papers was your job to the idea of building a model that everyone in the world could use. What's really crazy is that if you asked us in 2017 what the right mission was, it wouldn't be to achieve AGI by writing papers; it would be that we wanted to build a model that everyone could use. But we didn't know how to achieve that, so we just had to explore and figure out all these things along the way.
Host Jacob: What do you think makes you so successful in making these major transitions?
Bob McGrew: Well, I mean, first of all, necessity. These aren't arbitrary choices, right? You have a nonprofit, you're running out of money, maybe you need to find a way to raise money; maybe in order to raise money, you have to become a for-profit company. Your partnership with Microsoft, maybe they don't see the value of the models that you're creating, so you need to build an API because it might actually work. And then you can show them that people actually want these models.
ChatGPT, I think, is something we really believed in after GPT-3: that with the right advancements, the right form would not just be an API where people have to go through an intermediary to talk to the model, but something you can talk to directly. So that part was very deliberate. But the way it happened, as we all know, was an accident. We were working toward it. We had actually trained GPT-4, and we hoped to release it once the model was good enough that we were using it every day.
We all looked at ChatGPT in November and we thought, does it pass the bar? Not exactly. John Schulman, one of the co-founders who was leading the team, said, look, I really just want to launch it. I want to get some outside experience. I remember thinking, if a thousand people use it, that would be a success. You know, our bar for success was pretty low. We made a decision not to put it behind a wait list.
And then, you know, the world forced our hand again, and all of a sudden, everybody in the world wanted to use it.
Host: What were those first few days like when you launched it?
Bob McGrew: Oh, my goodness, it was very intense. At first, there was some disbelief that this was actually happening. There was some anxiety. We quickly tried to figure out how to get GPUs, so we temporarily moved some of our research computing resources over.
And then there's this question of, when is it going to stop? Is this going to continue or is this going to be a fad? Because we almost went through something similar with DALL-E. The DALL-E 2 model was a sensation on the internet and then it just disappeared. So people were worried that ChatGPT would actually disappear as well. This is where I'm very confident that it's not going to disappear, it's actually going to be more important than the API.
Jacob: I mean, what an interesting experience. I think one of the cool things is that you're so close to cutting-edge AI research. I'm curious, what thoughts have changed in the past year in the field of AI?
Bob McGrew: The interesting thing is, I don’t think I’ve changed my mind about anything. After GPT-3, going into 2020, 2021, if you’re in the middle of it, a lot of what needs to happen over the next four or five years feels like a given. We’re going to have these models. We’re going to make the models bigger, they’re going to be multimodal. Even in 2021, we’re talking about how we need to use RL on language models and trying to figure out how to make it work. And, the real difference between 2021 and 2024 is not what needs to happen, but the fact that we’re able to make it happen. And, you know, we, the field as a whole, are able to make it happen. But in a sense, it also feels a little bit destined that we’re where we are now.
Jacob: I guess, looking forward, when you think about scaling pre-training and scaling test-time compute, does it feel like it's destined to reach AGI with just those two? Or, how do you think about that?
Bob McGrew: I have a hard time wrapping my head around the idea of AGI. And I think if anything, one of my deep critiques of AGI is that there's not a single moment, and really, these problems are fractal. And we're going to see more and more things being automated. But somehow we're - I don't know. I have a feeling it's going to get really banal, and somehow we're all going to be driving self-driving cars to the office and commanding armies of AI there. And then we're going to be like, oh, this is a little boring. It still feels like being in the office, and my boss is still an idiot. And that's presumably our AGI future. We're going to be waiting for the clock to go off at 5 p.m. or something like that.
On a more serious note, I've always felt, and I think this is a common view within OpenAI and other cutting-edge labs as well, that solving reasoning is the last fundamental challenge needed to scale to human-level intelligence. You had to solve pre-training, you had to solve multimodality, you had to solve reasoning. At that point, the remaining challenge is scaling. But that matters a lot.
Scaling is very hard. There's really not a lot of fundamental ideas at all. Almost all of the work is in how to scale them up to take more and more compute. It's a systems problem. It's a hardware problem. It's an optimization problem. It's a data problem. It's a pre-training problem. All of the problems are really just about scaling. So, yes, I think to some extent it's already doomed. The work here is to scale it up, but it's hard. A lot of work.
Jacob: Obviously, I think people are talking about the societal impact of these models expanding their capabilities. I think we're still in the early stages of that discussion and there's probably a lot of different conversations that need to be had. But what are some of the areas that you're particularly interested in and passionate about that you think we should be talking about?
Bob McGrew: Yeah. I think what’s most interesting is that we’re moving from an era where intelligence is probably the scarcest resource in society to an era where intelligence is going to be ubiquitous and free. So what is the scarce factor of production? And, I don’t think we know. I’m guessing it’s agency. That is, the ability to get things done. What are the right questions you need to ask? What are the right projects you need to pursue? I think those types of problems are going to be very hard for AI to solve for us. I think those are going to be the core problems that humans are going to need to figure out. And, not everyone is good at that. So, I think what we need to think about is how do we develop the kind of agency that allows us to work with it.
Host Jordan: Do you think this is now or in the future?
Bob McGrew: I think it will feel very continuous. It's an exponential curve. And the thing about exponential curves is, they have no memory. It always feels like you're always going at the same speed, the same rhythm.
Jacob: Won't these models eventually figure that out too? I mean the agency you just mentioned a few times, figuring out what to do or what project to pursue. For example, you could imagine, at the most basic level in the future, saying to the model: hey, build a good company, or create an interesting work of art, or make a movie, and so on. As these models become more powerful, what happens to this agency? Maybe talk about that.
Bob McGrew: Yeah, I mean, can you just ask an AI to figure out everything? Well, I think you can, and you'll get some results. But let's take Sora as an example. If you're making a video, and you give it a very vague prompt, it's going to create a video for you completely. Maybe it's going to be a really cool video. Maybe it's going to be better than the coolest video you could think of. But it's probably not going to be the video you wanted.
So you can also interact with it, you give it a very detailed prompt, you say, I made these specific choices about the videos I want to see. This allows you to create videos that please yourself or your audience.
I think this tension will persist no matter how advanced AI becomes, because how you fill in the blanks will determine a lot of what the final product will be.
Host Jacob: How do you use the state-of-the-art O1 model today?
Bob McGrew: My preferred way of understanding and interacting with models is that I spend a lot of time teaching my eight-year-old son to program. He loves to ask questions, so I'm always thinking about how to connect what he's interested in today to the lesson I want to teach him.
For example, one day he said, "Dad, what is a web crawler? How does it work?" That gave me an opportunity, and I said, well, can I use a short program to teach him how the web works? I tried to use an O1 model, working hard to create a program that was short enough and didn't introduce too many new concepts that I hadn't already taught him.
The goal was to teach him how the web works, which was the core concept I wanted him to understand, while making sure the content was accessible to an eight-year-old. It took some time to tweak the program, but I believe part of the learning process is experimentation, and testing different ideas is an important aspect of that.
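This is not the program from the conversation, just a sketch of the sort of tiny crawler such a lesson might end up with, using only the requests library and a regular expression:

```python
import re
import requests

def crawl(start_url: str, max_pages: int = 5) -> None:
    """Visit pages breadth-first, printing each URL: a toy picture of how the web links together."""
    to_visit, seen = [start_url], set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        print("visiting:", url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        # Every absolute link found on the page goes onto the queue.
        to_visit.extend(re.findall(r'href="(https?://[^"]+)"', html))

crawl("https://example.com")
```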
Moderator Jordan: I guess in terms of testing, when you think about it from a research testing perspective, what are the core assessments that you typically do when new models come out and which ones do you rely on the most?
Bob McGrew: Well, I mean, the first thing to point out here is that it changes with each generation of models. You know, when we were developing the O1 model, the right metric to look at was GPQA, which stands for Google Proof Question Answering. However, by the time we were ready to launch, it was no longer a very interesting metric because we had gone from having almost nothing at the beginning to it being completely saturated. The few remaining questions at the end were usually poorly worded or not very interesting questions. So the metric that you choose depends a lot on what you're trying to do in your research, and I think that's a general rule of thumb.
One thing that has stayed useful over the last few years, though, is programming. Programming is a structured task that many people, including myself and other researchers, can understand, and it matters a great deal. It scales from completing a single line of code to writing an entire website. We are not yet at the point where programming is fully solved, and I think we still have a long way to go; I believe there are several orders of magnitude of progress left before these models can really do the work of a true software engineer.
Jacob: One thing that's clear from your early career is that you were doing a PhD in computer science, and I remember at least part of it was focused on game theory. Obviously, I think there are a lot of interesting implications for using these models to explore topics in game theory. I wanted to ask, in general, how do you think AI will change social science research, policy making, and other related fields? If you were to revisit your previous work today with the power of these models, what would you try to do?
Bob McGrew: First of all, I'm actually very disappointed with academia. I think it has a terrible set of incentives. In some ways, I designed OpenAI's organization in contrast to academia, to create a place where collaboration could flourish.
One of the interesting aspects of business is that a lot of product management work is similar to experimental social science. You have an idea and you want to test it on humans, and you want to see how it works using sound methods. A/B testing is a good example; when you run one, you're really doing a kind of social science.
That's one of the things I'm particularly excited about: if you're doing an A/B test, why not take all the interactions you have with your users right now, fine-tune a model with that data, and then all of a sudden you have a simulated user that reacts the same way your actual users would? This means you can do an A/B test without ever putting it into production. And maybe afterwards, you can do an in-depth interview with one of those simulated users to get their thoughts.
Is this feasible today? I don’t know. I haven’t tried it yet, but maybe it will be possible tomorrow. I think this is a good general principle: whenever you find yourself wanting someone to do something for you, consider whether you could ask an AI to do it instead. And, an AI can probably handle hundreds of tasks, while a human might only be able to do one, and struggle with it.
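As an aside, the sketch below illustrates this simulated-user idea in rough form, assuming Python and the OpenAI client library. The interaction data, model names, file paths, and prompts are all placeholders, and the interview does not describe any specific implementation.

```python
# Sketch: turn past user interactions into fine-tuning data, then use the
# resulting model as a "simulated user" to compare two product variants.
# Model names, file paths, example data, and prompts are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

# 1. Convert logged interactions (what we showed the user, how they reacted)
#    into chat-format training examples. A real run needs far more than two.
interactions = [
    {"shown": "New onboarding screen with a 3-step checklist",
     "reply": "This is clear, I finished setup quickly."},
    {"shown": "Old onboarding screen with a long settings form",
     "reply": "Too many fields, I gave up halfway."},
]
with open("simulated_user_train.jsonl", "w") as f:
    for row in interactions:
        f.write(json.dumps({
            "messages": [
                {"role": "system", "content": "You respond the way our real users respond."},
                {"role": "user", "content": row["shown"]},
                {"role": "assistant", "content": row["reply"]},
            ]
        }) + "\n")

# 2. Fine-tune on those examples (the base model name is a placeholder).
training_file = client.files.create(
    file=open("simulated_user_train.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-4o-mini-2024-07-18"
)

# 3. Once the job finishes (job.fine_tuned_model is only set then), ask the
#    fine-tuned model how a "user" would react to each variant before shipping.
def simulated_reaction(model_name: str, variant_description: str) -> str:
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You respond the way our real users respond."},
            {"role": "user", "content": variant_description},
        ],
    )
    return response.choices[0].message.content

# reaction_a = simulated_reaction(job.fine_tuned_model, "Variant A: one-click checkout")
# reaction_b = simulated_reaction(job.fine_tuned_model, "Variant B: three-page checkout form")
```

The point of the sketch is only the shape of the loop: log real interactions, fine-tune, then query the model with each variant instead of exposing live users to it.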
Jordan: Yeah, I had Jacob do a lot of the tasks for me, so.
Jacob: Yeah, you should stop doing that and start asking a model instead; that would save me a lot of time. You mentioned, I think, the incentives that exist in academia and that you designed the OpenAI organization in contrast to them. Can you talk more about that?
Bob McGrew: Yeah, yeah. I mean, think back to 2017, 2018, 2019. At that time, AI research labs were not a big industry. They were just research labs. A lot of the people involved in them came from academia. If you look at the structure of academia, it had a set of incentives that were good enough for its original design. However, there was a lot of focus on credit—who actually did this? In what order are the names on the paper listed? That was very important to people with an academic background.
Maybe you don't want to collaborate with others because it dilutes your credit for the results. If two people work on a problem together, it often feels more like a competition than an opportunity to work twice as fast. Against that backdrop, I think DeepMind's idea was to build a lab that mimicked academia but operated within a corporate framework, so that its leadership could direct people and focus them on deep learning.
Google Brain, on the other hand, had as its original goal gathering scholars to do exploratory research in a very academic way: direction wasn't imposed from above, and product managers sat outside the lab, hoping to catch the great ideas and turn them into products. At OpenAI, by contrast, we were a group of entrepreneurial people, along with some outstanding researchers, including people like Ilya. Our view was that a research lab should be run like a startup.
We think it's important to give people a lot of freedom while being clear about the direction we're going in, especially to exceptional researchers—some of whom we didn't even realize were exceptional at the time. Our goal is to let them find the hills they want to climb to create the exceptional work they aspire to create. We emphasize collaboration, making sure people are working together toward a unified goal, rather than just focusing on publishing a lot of papers.
Jacob: I like that. Earlier you walked through some of the most famous decisions in OpenAI's history, from the move away from the nonprofit structure to the partnership with Microsoft and the release of ChatGPT's API. Is there a decision point that isn't as famous but that you think was pivotal? Or a decision that was especially hard to make, or that really changed the direction of the organization?
Bob McGrew: I think one decision that I haven't talked about before, but that was quite controversial at the time, was the decision to double down on language modeling and make it the central focus of OpenAI. It was a complicated decision for a number of reasons: changes like this mean reorganizing teams, and people have to change what they work on.
Our initial culture encouraged trying a variety of approaches to see what worked. Our first major push was a concerted effort to play Dota 2, continuing the great tradition of AI solving increasingly difficult games: from chess to Go, and then to Dota 2 and StarCraft, which somehow felt less cool. I can assure you, though, that mathematically these games really are harder than chess and Go, even if they are less elegant.
The Dota 2 project was a huge success, and it taught us a lot. From that experience, we came to believe that you can solve problems by scaling them up, and we built a set of technical tools for doing that. So the decision to shut down the more exploratory projects, like the robotics team and the games team, and really refocus on language models and generative models in general, including multimodal work, was, I believe, a very critical choice, even though it was very painful at the time.
Jacob: One thing I noticed earlier is that you mentioned testing these models with your eight-year-old. In the time you've been a parent, the world has obviously changed a great deal, in large part because of the advances you've helped drive in artificial intelligence. I'm wondering whether anything has changed in how you live and how you're raising your children, based on your updated beliefs about how quickly the power of these models will show up in the world?
Bob McGrew: Yeah, I think the fact is that I haven't changed anything. And I think that's probably one of my failings, right? Like, who better to figure out what kids should be learning than me? And yet, I think I'm pretty much still trying to teach them the same things that I was eight years ago.
Why would I teach my eight-year-old son to code when ChatGPT can code for him? Honestly, that is a puzzle. In a sense the broad direction of the future is set, but the details of how it will actually work are still mysterious and will be revealed to us over time.
So I think the old truth of trying things that are just at the edge of your ability is really important. You try to learn math, you try to learn to code, you try to write, you try to write well, you try to read widely. I think those are going to develop the skills that kids and frankly, adults are going to need, no matter what AI ends up doing.
Because fundamentally, it's not about coding. It's not about math. It's about you learning how to think about problems in a structured way.
Jordan: Okay, that's all awesome. I'm sure we could talk to you for hours, but we like to end with some quick Q&A. The first question is: what is overhyped and what is underhyped in the field of AI today?
Bob McGrew: Wow, okay. Well, the simple answer for what's overhyped is new architectures. There are a lot of new architectures out there; they look interesting, but they tend to break down at scale. So if there's an architecture that doesn't break down at scale, then it's not overhyped. Until then, they're all overhyped. As for underhyped, I think it's O1. It's been hyped a lot, but is it hyped appropriately? No. I think it's underhyped.
Jacob: I know our listeners will be curious, so I'll ask: can you share some thoughts about why you left OpenAI when you did?
Bob McGrew: Well, the fact is, I was there for eight years, and I really felt like I had accomplished most of the things I wanted to accomplish when I came. It's no coincidence that I announced my departure right after the O1 preview was released. We had set out specific research problems: pre-training, multimodal, reasoning. Those problems were solved. It was hard work, frankly. When I felt like I had accomplished what I needed to do, it was time to hand things off to the next generation of people who are passionate about this work and committed to solving the remaining problems. I think the problems in front of them are very exciting.
As for what comes next: after I left Palantir, I spent two years before joining OpenAI. I started a robotics company and tried a lot of things. I built things myself and talked to a lot of people. To be honest, I made a lot of mistakes, but none of them really mattered. In the process I learned a lot and formed my own theories about what matters to the world and what the nature of technological progress is.
All of these experiences, the people I met, and the ideas I came up with helped me get to OpenAI. It turned out to be a much better job than anything I could have chosen in the first six months after leaving Palantir. So, I'm in no rush. I'm going to continue meeting people and figuring things out. I'm really enjoying the process of thinking about and learning new things.
Jacob: Now that you have more time, are there any areas that you would like to delve deeper into, or any areas that you have always wanted to spend more time on?