In the early hours of this morning, OpenAI released Prism, a next-generation AI research tool powered by GPT-5.2. The platform lets scientists write and collaborate on research, and it is free for anyone with a ChatGPT account. In the words of Chinese AI entrepreneur Yuchen Jin, "Every paper will list ChatGPT as a co-author."
Yesterday, Kevin Weil, an OpenAI vice president and head of the newly established OpenAI for Science team, posted a teaser on X: "Our goal is to empower every scientist with AI superpowers so that they can do more, and to enable the world to conduct 2050-level scientific research by 2030."
In the three years since ChatGPT's explosive debut, OpenAI's technology has reshaped many corners of daily life. Now the company is clearly turning its attention to scientific research and to researchers themselves. In October, it announced the formation of a new OpenAI for Science team, dedicated to exploring how its large language models (LLMs) can assist researchers and to optimizing its tools to support them. Over the past few months, social media has seen a surge of related content, and academic journals have published numerous results. Mathematicians, physicists, biologists, and researchers in other fields have written about how large language models, especially GPT-5, helped them make new discoveries or pointed them toward solutions they might otherwise have missed.
So why did OpenAI choose this moment to enter the field? What does it hope to achieve? And how does a focus on scientific research fit the company's broader mission? OpenAI is a latecomer here. Google DeepMind established its AI-for-science team several years ago and has produced groundbreaking scientific models such as AlphaFold and AlphaEvolve. In a 2023 interview, Google DeepMind CEO and co-founder Demis Hassabis said of that team, "This was the original motivation for founding DeepMind. In fact, it's the reason I've dedicated my entire career to AI."
In a recent interview, Kevin Weil not only addressed these questions directly but also gave a more conservative assessment of the current model's capabilities than before: the model is not yet capable of genuinely groundbreaking discoveries, but if it can stop people from wasting time on problems that have already been solved, it can still accelerate science. Interestingly, he noted that a researcher who reached out to OpenAI and subscribed to the paid GPT-5 service reported that GPT-5 still makes some basic errors, sillier than the ones humans make, but that it is steadily improving.
In addition, as part of its AI-for-science strategy, OpenAI is refining the model's overall design in two main ways: first, by getting GPT-5 to lower its confidence when giving answers, so that it shows a degree of cognitive humility; and second, by using GPT-5 to fact-check its own output.
“The significance of 2026 for the scientific research field will be comparable to that of 2025 for software engineering,” Weil said. “At the beginning of 2025, if someone used AI to write most of the code, they would just be an early adopter; but 12 months later, if they haven’t used AI to write most of the code, they may already be behind. Right now, the scientific research field is showing a similar early development momentum to the programming field. A year from now, if a researcher has not yet deeply utilized AI in their research, they will miss the opportunity to improve the quality of their thinking and accelerate their research progress.”
The models' capabilities already surpass those of 90% of graduate students; AGI's greatest value lies in driving scientific progress.
Several years ago, Weil joined OpenAI as Chief Product Officer, having previously served as Head of Product at Twitter and Instagram. However, his career began in scientific research: he completed two-thirds of his PhD in particle physics at Stanford University before leaving academia to pursue his Silicon Valley dream. Weil is happy to mention this academic background, saying, "I used to think I'd be a physics professor for the rest of my life, and I still read math books on vacation."
When asked how OpenAI for Science fits with the company's existing white-collar productivity tools and the wildly popular video app Sora, Weil immediately replied, "OpenAI's mission is to develop artificial general intelligence (AGI) and make this technology benefit all of humanity." He suggested imagining the future changes this technology could bring to the scientific research field: entirely new drugs, materials, and devices.
“Imagine how it can help us explore the nature of reality and tackle unsolved scientific problems. Perhaps the most significant and positive value that AGI can create for humanity is its ability to drive scientific progress,” he added. “The emergence of GPT-5 has shown us this possibility.”
According to Weil, today's large language models are good enough to be valuable collaborators for researchers. They can generate ideas, suggest new research directions, and find fruitful connections between new problems and old solutions published decades ago in obscure or foreign-language journals. But this wasn't the case even a year or so ago. Since releasing its first reasoning model (a model that breaks a problem into multiple steps and works through them one by one) in December 2024, OpenAI has been pushing the boundaries of the technology. Reasoning models have significantly improved large language models' ability to solve mathematical and logical problems.
“A few years ago, a model scoring 800 on the SAT would have been enough to amaze us all,” Weil said. Now, large language models are winning math competitions and solving physics problems at the graduate level. Last year, OpenAI and Google DeepMind both announced that their large language models achieved gold medal-level results in the International Mathematical Olympiad, one of the world's most challenging math competitions. Weil stated, “The capabilities of these models have long surpassed those of 90% of graduate students; they have truly reached the limits of human ability.”
This is a bold assertion, and not one without problems. There is no doubt, however, that GPT-5, equipped with reasoning capabilities, represents a significant leap over GPT-4 at solving complex problems. The industry benchmark GPQA, which contains over 400 multiple-choice questions, specifically tests doctoral-level expertise in biology, physics, and chemistry. GPT-4 achieved only 39% accuracy on it, far below the human-expert benchmark of roughly 70%. By contrast, according to OpenAI's own figures, the latest version of GPT-5, GPT-5.2, released in December 2025, reached 92% accuracy.
Even after reading 30 years of research papers, the model has yet to produce a genuinely groundbreaking discovery.
Weil's excitement was evident, but perhaps a little excessive. Last October, Weil and other OpenAI executives publicly declared on X that GPT-5 had found solutions to several unsolved mathematical problems. Mathematicians quickly pointed out that GPT-5 had in fact only unearthed existing answers from earlier research papers, including at least one written in German. Such a capability is valuable, but it is far from the breakthrough OpenAI claimed. Weil and his colleagues subsequently deleted the posts.
At the time, this caused quite a stir. Rumors initially circulated that GPT-5 had solved 10 previously unsolved Erdős problems and made progress on 11 others. However, Thomas Bloom, the mathematician who maintains the Erdős problems website, clarified that GPT-5 had simply found references that already contained solutions to these problems. DeepMind CEO Demis Hassabis commented that the team's communication had been "too hasty," and former Meta chief AI scientist Yann LeCun quipped that OpenAI had been "hoisted by its own GPTards."
Just a few days ago, however, news broke that GPT-5.2 Pro had solved an Erdős problem, specifically problem number 281 in the Erdős problems database. The effort was led by mathematician Neel Somani, and the proof was verified by Fields Medal winner Terence Tao, who described it as "one of the most definitive examples of AI solving open-ended mathematical problems." GPT-5.2 Pro's proof of the problem has since been included on the Erdős problems website.
GPT-5.2 Pro reportedly offers a new proof of the problem. Although the result can also be derived from earlier work, Terence Tao points out that GPT-5.2 Pro's approach is "quite different" from previous methods, with only some conceptual overlap. There are now two known routes to the result: the ergodic-theory framework used by GPT-5.2 Pro, which employs a variant of the Furstenberg correspondence principle, and a simpler one that combines two theorems dating from 1936 and 1966, the Davenport–Erdős theorem and Rogers' theorem.
However, Weil is now more cautious. He says that finding existing but forgotten answers is significant in itself: "We all stand on the shoulders of giants. If large language models can integrate this knowledge, allowing us to avoid wasting time on problems that have already been solved, that in itself is an acceleration of scientific research." He also downplays the claim that large language models will soon make groundbreaking discoveries: "I don't think the current models are at that level yet, but perhaps they will in the future. I'm optimistic about that."
However, he emphasized that this is not the team's core mission: "Our mission is to accelerate scientific progress, and the standard for accelerating scientific progress does not necessarily require a complete reimagining of the entire field like Einstein did." In Weil's view, there is only one core question: Is the pace of scientific progress really faster? "When researchers collaborate with models, they can accomplish more work and be more efficient than when they study alone. I think we have already seen that."
Last November, OpenAI released a series of case studies from researchers both inside and outside the company, showing through real examples how GPT-5 is being used to support scientific work. "Most of the researchers in these cases were already using GPT-5 directly in their research," Weil said. "They contacted us through various channels, telling us, 'Let's see what these tools can do for me.'" GPT-5 is particularly good at: surfacing existing findings and relevant leads that researchers are not yet aware of, which can sometimes spark new ideas; helping researchers draft mathematical proofs; and suggesting experiments for testing hypotheses in the laboratory.
“GPT-5.2 has read almost every paper published in the last 30 years. It not only understands the content of a scientist’s own field, but can also pull analogous ideas from other, seemingly unrelated fields,” Weil said. “This is incredibly powerful. You can always find human collaborators in related fields, but finding thousands of collaborators across thousands of potentially relevant fields is much harder. Besides, I can work with the model late at night; it never needs rest, and I can ask it ten questions at the same time. Doing these things with a human would inevitably be awkward.”
GPT-5 makes sillier mistakes than humans, and robots are more willing than people to follow its commands?
According to reports, OpenAI connected reporters with several researchers to corroborate Weil's view, and most of them agreed with it. Robert Scherrer, a professor of physics and astronomy at Vanderbilt University, had previously used ChatGPT only for fun. "I once had it rewrite the theme song of Gilligan's Island in the style of Beowulf, and it did a fantastic job," he said. It wasn't until his Vanderbilt colleague Alex Lupsasca, a physicist now working at OpenAI, told him that GPT-5 had helped solve a problem in his research that Scherrer changed his opinion of the model.
Lupsasca signed Scherrer up for GPT-5 Pro, OpenAI's premium $200-per-month subscription. "My graduate students and I struggled with a problem for months without results, but GPT-5 solved it," Scherrer said. He admits the model isn't perfect: "GPT-5 still makes some basic mistakes. Of course, I make mistakes myself, but GPT-5's mistakes are sillier." Even so, he says its progress is remarkable: "If the current trend continues, I think soon all researchers will be using large language models. Of course, that's just a guess."
Derya Unutmaz, a biology professor at the Jackson Laboratory, a nonprofit research institution, uses GPT-5 for brainstorming, summarizing papers, and planning experiments in his research on the immune system. In a case study he shared with OpenAI, his team had GPT-5 analyze an old dataset, and the analysis yielded entirely new insights and interpretations. "Large language models have become crucial for scientists," he said. "Dataset analyses that used to take months can now be done with large language models; working without them is simply unthinkable."
Nikita Zhivotovskiy, a statistician at the University of California, Berkeley, says he has used large language models in his research since the first version of ChatGPT was released. Like Scherrer, he finds them most useful for uncovering unexpected connections between his work and existing results he was unaware of. “I believe large language models are becoming an indispensable tool for scientists, just as computers and the internet did before. Those who refuse to use these tools will be at a long-term disadvantage.” However, he doesn’t expect large language models to yield new discoveries in the short term. “I’ve hardly seen models offer truly new ideas or arguments worth publishing on their own. So far, they seem primarily to integrate existing research, sometimes making mistakes, rather than creating genuinely new research methods.”
Some researchers who have no connection with OpenAI are not so optimistic.
Andy Cooper, professor of chemistry at the University of Liverpool and director of the Leverhulme Research Centre for Functional Materials Design, said, “So far, we haven’t seen large language models fundamentally change the way scientific research is done, but our recent findings suggest that such tools do have their uses.” Cooper is leading the development of a so-called AI scientist, a system that automates parts of the research workflow. He said his team does not use large language models to come up with research ideas, but the technology is beginning to show practical value inside larger automated systems, for example where large language models help control robots.
“I suspect that large language models will be used more in robotic workflows, at least initially. Because I’m not sure if people will be willing to follow the commands of large language models; I certainly wouldn’t myself,” Cooper said.
The team's key focus: to make GPT less confident and more humble.
The practical value of large language models may be growing by the day, but caution remains essential. Last December, Jonathan Oppenheim, a physicist who studies quantum mechanics, flagged an LLM-introduced error in a peer-reviewed journal article. He wrote on X: "OpenAI's leadership is promoting a paper in *Physics Letters B* whose core idea was proposed by GPT-5. This may be the first peer-reviewed paper whose central idea came from a large language model. There's just one small problem: GPT-5's idea tests the wrong thing. The researchers asked GPT-5 to design an experiment to detect nonlinear theories, but it proposed a scheme for detecting nonlocality. The two sound related but are entirely different. It's as if you asked for a COVID-19 test kit and the model enthusiastically handed you a chickenpox test kit."
Clearly, many researchers are using large language models in creative and practical ways. But it is equally clear that the technology's errors can be extremely subtle, slipping past even experts. Part of the problem lies in ChatGPT's conversational style, which often uses an agreeable, flattering tone that lulls users into a false sense of security. As Jonathan Oppenheim put it, "The core problem is that large language models are trained to please users, while scientific research needs tools that challenge us." In one extreme case, a layperson outside the research world was misled by ChatGPT into believing for months that they had invented a new branch of mathematics.
Weil is, of course, well aware of the hallucination problem in large language models, but he stresses that the latest generation of models hallucinates far less often. Even so, he believes that focusing solely on hallucinations misses the point.
“A colleague of mine, a former mathematics professor, once said something that left a deep impression on me: ‘When I do research, I exchange ideas with colleagues, and 90% of my ideas are wrong, but that’s precisely the point. We’re all boldly throwing out ideas just to find a workable research path,’” Weil said. “That is actually the ideal state of scientific research. When enough wrong ideas are put forward, someone stumbles on a glimmer of truth, and someone else picks it up and says, ‘What you said isn’t entirely right, but if we look at it from another angle...’ That is how people gradually find their way forward through the fog of research.”
This is precisely the core vision Weil set for OpenAI for Science. He believes that while GPT-5 is excellent, it is not a panacea. The value of this technology lies in guiding people to explore new directions, rather than providing the final answer. In fact, OpenAI is currently working on optimizing a feature of GPT-5: making it lower its confidence level when giving an answer. It will no longer directly say "the answer is here," but will tell researchers in a more subtle way: "The following ideas are for reference." "This is exactly what we are currently investing a lot of effort in: trying to make the model have a certain cognitive humility," Weil said.
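OpenAI has not said how it implements this behavior, which presumably involves training rather than prompting. Purely for illustration, a crude prompt-level approximation of "cognitive humility" might look like the sketch below; the instruction text and the hypothetical ask() helper are assumptions, not OpenAI's actual mechanism.

```python
# Crude, prompt-level approximation of "cognitive humility": instead of one
# confident answer, the model is asked for ranked suggestions with stated
# uncertainty. ask() is a hypothetical wrapper around any LLM API.

def ask(prompt: str) -> str:
    """Hypothetical LLM call; plug in a real client here."""
    raise NotImplementedError

HUMILITY_INSTRUCTIONS = (
    "Do not present a single definitive answer. "
    "Offer two or three candidate directions, state how confident you are in each and why, "
    "and explicitly flag any step you could not verify."
)

def suggest(question: str) -> str:
    # Prepend the humility instructions so the reply reads as
    # "the following ideas are for reference" rather than "the answer is here".
    return ask(HUMILITY_INSTRUCTIONS + "\n\nQuestion: " + question)
```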
Another direction OpenAI is reportedly exploring is using GPT-5 to fact-check its own output. In practice, if you feed a GPT-5 answer back into the model, it will often go through it line by line and point out errors. Weil said, "We can let the model act as its own verifier. That lets us build a workflow: the model first completes its initial reasoning, then submits the result to another model for review; if that model finds room for improvement, it feeds the result back to the original model, noting, 'This part is incorrect, but this line of thought is valuable and can be retained.' It's like two agents working together; only output that passes the verifier's review is finally presented."
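The workflow Weil describes can be pictured as a simple generator–verifier loop. The sketch below is only an illustration under assumptions: the ask() stub, the prompts, and the "NO ERRORS" stopping check are invented for the example and do not reflect OpenAI's internal pipeline.

```python
# Minimal generator–verifier loop: one model call drafts the reasoning, a second
# call reviews it line by line, and flagged issues are fed back for revision.
# ask() is the same hypothetical LLM wrapper as in the previous sketch.

def ask(prompt: str) -> str:
    """Hypothetical LLM call; plug in a real client here."""
    raise NotImplementedError

def solve_with_verifier(question: str, max_rounds: int = 3) -> str:
    draft = ask(f"Work through this problem step by step:\n{question}")
    for _ in range(max_rounds):
        review = ask(
            "Act as a strict verifier. Check the following reasoning line by line. "
            "List any errors and note which steps are worth keeping. "
            "If everything holds, reply with exactly 'NO ERRORS'.\n\n" + draft
        )
        if "NO ERRORS" in review.upper():
            break  # the verifier is satisfied; present the result
        draft = ask(
            "Revise the reasoning below. Fix the flagged errors but keep the steps "
            f"the reviewer marked as valuable.\n\nReview:\n{review}\n\nDraft:\n{draft}"
        )
    return draft
```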
This mechanism closely resembles the approach Google DeepMind developed for AlphaEvolve, a tool that wraps the Gemini large language model inside a larger system that selects high-quality responses and feeds them back to the model for further improvement. Google DeepMind has used AlphaEvolve to solve several real-world scientific research problems.
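In rough outline, such a system pairs an LLM that proposes candidates with an external evaluator that scores them, keeping only the best candidates as context for the next round. The sketch below illustrates that filter-and-feed-back pattern only; the score() placeholder and the prompts are assumptions, not DeepMind's AlphaEvolve code. The key design point is that quality judgment comes from an external evaluator, not from the model itself.

```python
# Illustrative evolve-style outer loop: the model proposes candidate solutions,
# an external evaluator scores them, and the top candidates are fed back as
# context for the next generation.

def ask(prompt: str) -> str:
    """Hypothetical LLM call; plug in a real client here."""
    raise NotImplementedError

def score(candidate: str) -> float:
    """Placeholder evaluator; in practice, execute or measure the candidate."""
    raise NotImplementedError

def evolve(task: str, generations: int = 5, pool_size: int = 4) -> str:
    best: list[tuple[float, str]] = []
    for _ in range(generations):
        # Show the current best candidates so the model can improve on them.
        context = "\n\n".join(text for _, text in best[:2])
        proposals = [
            ask(f"Task: {task}\n\nStrongest attempts so far:\n{context}\n\n"
                "Propose one improved solution.")
            for _ in range(pool_size)
        ]
        best.extend((score(p), p) for p in proposals)
        best.sort(key=lambda pair: pair[0], reverse=True)
        best = best[:pool_size]  # keep only the highest-scoring candidates
    return best[0][1]
```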
Today, OpenAI faces fierce competition from other companies whose large language models can do most, if not all, of what OpenAI claims its own models can do. If so, why would researchers choose GPT-5 over Gemini or Anthropic's Claude models, which are also being iterated on and upgraded year after year? Ultimately, OpenAI for Science is largely a bid to gain a head start in this new field. The real technological breakthrough, however, has yet to arrive.
Reference link:
https://www.technologyreview.com/2026/01/26/1131728/inside-openais-big-play-for-science/
https://openai.com/zh-Hans-CN/prism/
This article is from the WeChat official account "AI Frontline", compiled by Huawei, and published with authorization from 36Kr.