Six days after launching ChatGPT Health, OpenAI was overtaken on its own healthcare benchmark.

ME News
01-14
Have you ever asked an AI assistant about your health?

Article author: Li Yuan

Article source: MarsBit

If you're a heavy user of AI like me, you've probably tried it too.

According to data provided by OpenAI, health has become one of the most common use cases for ChatGPT, with more than 230 million people worldwide asking health and wellness-related questions every week.

Therefore, as we enter 2026, the health sector is showing signs of becoming a fiercely contested battleground in the field of AI.

On January 7, OpenAI released ChatGPT Health, which lets users connect electronic medical records and various health applications to receive more targeted medical responses. On January 12, Anthropic followed with Claude for Healthcare, emphasizing the new model's capabilities in medical scenarios.

Interestingly, this time, Chinese companies did not fall behind; in fact, they seemed poised to take the lead.

On January 13, Baichuan Intelligence announced the release of its Baichuan M3 model, which surpassed OpenAI's GPT-5.2 High on HealthBench, a healthcare evaluation benchmark set released by OpenAI, achieving state-of-the-art (SOTA) performance.

After facing considerable skepticism for going all in on healthcare, Baichuan Intelligence seems to have finally proven itself. GeekPark also interviewed Wang Xiaochuan about Baichuan Intelligence's views on the M3 model's capabilities and the ultimate future of AI in healthcare.

01 Surpassing OpenAI on a healthcare benchmark for the first time.

One of the most striking results for the newly released M3 model is that it surpassed OpenAI's GPT-5.2 High on HealthBench, the healthcare benchmark that OpenAI itself released, achieving state-of-the-art (SOTA) results.

[Figure: SOTA on HealthBench, HealthBench Hard, and hallucination evaluation]

HealthBench is a healthcare evaluation dataset released by OpenAI in May 2025. Built jointly by 262 doctors from 60 countries, it contains 5,000 highly realistic multi-turn medical dialogues and is one of the most authoritative and realistic clinical evaluation datasets in the world.

Since its release, OpenAI's models have consistently topped its leaderboard.

This time, Baichuan Intelligence's new-generation open-source large medical model, Baichuan-M3, achieved an overall score of 65.1, ranking first in the world. It also took first place on HealthBench Hard, which specifically tests complex decision-making, setting a new record.

Baichuan also released the results of a hallucination test. The M3 model's hallucination rate is 3.5%, the lowest in the world.

Notably, this is the medical hallucination rate in a pure-model setting, without relying on external retrieval tools.

Baichuan Intelligence said that the key to both results is a reinforcement learning algorithm tailored to medical applications.

Baichuan applied Fact Aware RL (reinforcement learning) for the first time in the M3 model, which keeps the model from either retreating into platitudes or making things up.

This is crucial in the medical field.

Asking an unoptimized model a medical question tends to produce one of two problems. Either the model fabricates symptoms you never reported and guesses at a disease, or it gives a vague answer that ends by telling you to see a doctor, which helps neither doctors nor patients.

One reason is that many models use the raw hallucination rate as the optimization objective. A model can then dilute its overall hallucination rate by piling up simple, correct facts. Baichuan introduces semantic clustering and importance weighting: clustering removes the interference of redundant statements, and weighting ensures that core medical conclusions count for more.
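The effect of clustering and weighting can be illustrated with a minimal Python sketch. It is not Baichuan's implementation: the claim fields (`cluster`, `weight`, `supported`) and the rule that a cluster takes the largest weight among its members are assumptions made only for illustration.

```python
from collections import defaultdict

def naive_rate(claims):
    """Raw hallucination rate: unsupported claims / all claims."""
    if not claims:
        return 0.0
    return sum(not c["supported"] for c in claims) / len(claims)

def weighted_hallucination_rate(claims):
    """Cluster-level, importance-weighted rate (illustrative assumption).

    Each claim is a dict with hypothetical keys:
      'cluster'   - id of its semantic cluster (restatements share an id)
      'weight'    - importance of the claim (core conclusions weigh more)
      'supported' - True if the claim is factually supported
    """
    clusters = defaultdict(list)
    for c in claims:
        clusters[c["cluster"]].append(c)

    total = 0.0
    wrong = 0.0
    for members in clusters.values():
        # Redundant restatements collapse into a single cluster vote.
        weight = max(m["weight"] for m in members)
        # A cluster counts as hallucinated if any member is unsupported.
        hallucinated = any(not m["supported"] for m in members)
        total += weight
        if hallucinated:
            wrong += weight
    return wrong / total if total else 0.0
```

With four restatements of one trivial, correct fact and one unsupported core conclusion, the raw rate looks mild while the clustered, weighted rate does not.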

At the same time, simply introducing a high-weight hallucination penalty can easily force the model into a conservative strategy of "saying less and making fewer mistakes". Therefore, the Fact Aware RL algorithm also incorporates a dynamic weight adjustment mechanism to adaptively balance these two objectives based on the model's current capability level. In the capability building phase, the focus is on learning and expressing medical knowledge (high Task Weight); after the capability matures, the factual constraints are gradually tightened (increasing Hallucination Weight).
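The dynamic balance described above might look like the following sketch. The linear ramp, the bounds `h_min` and `h_max`, and the `capability` estimate are all hypothetical; the article does not say how Baichuan actually schedules the two weights.

```python
from typing import Tuple

def reward_weights(capability: float,
                   h_min: float = 0.1,
                   h_max: float = 0.5) -> Tuple[float, float]:
    """Return (task_weight, hallucination_weight) for a capability level.

    capability is a hypothetical estimate in [0, 1] of how well the model
    currently performs the task. Low capability favors the task reward;
    as capability grows, the hallucination penalty is tightened.
    """
    c = min(max(capability, 0.0), 1.0)  # clamp to [0, 1]
    hallucination_weight = h_min + (h_max - h_min) * c
    task_weight = 1.0 - hallucination_weight
    return task_weight, hallucination_weight

def combined_reward(task_score: float,
                    hallucination_rate: float,
                    capability: float) -> float:
    """Task reward minus a capability-dependent hallucination penalty."""
    task_w, halluc_w = reward_weights(capability)
    return task_w * task_score - halluc_w * hallucination_rate
```

Early in training the same hallucination rate costs the model little, so it is not pushed into saying less; later the identical rate is penalized more heavily.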

When online search is available, Baichuan also adds an online verification module based on multi-round search, along with an efficient caching system to align the model with massive amounts of medical knowledge.
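As a rough illustration of multi-round verification with caching, the sketch below checks a claim against a search function over several rounds and memoizes the verdict. The `CachedVerifier` class, the substring match, and the query refinement step are all hypothetical stand-ins; the article gives no details of Baichuan's module.

```python
from typing import Callable, Dict, List

class CachedVerifier:
    """Verify claims through repeated searches, caching each verdict."""

    def __init__(self, search: Callable[[str], List[str]], max_rounds: int = 3):
        self.search = search          # caller-supplied search function
        self.max_rounds = max_rounds  # upper bound on search rounds per claim
        self.cache: Dict[str, bool] = {}
        self.search_calls = 0

    def verify(self, claim: str) -> bool:
        key = claim.strip().lower()
        if key in self.cache:
            return self.cache[key]    # cache hit: no new searches

        query = claim
        supported = False
        for _ in range(self.max_rounds):
            self.search_calls += 1
            snippets = self.search(query)
            # Placeholder check: a real system would judge entailment,
            # not look for a literal substring.
            if any(key in s.lower() for s in snippets):
                supported = True
                break
            # Hypothetical refinement for the next round.
            query = query + " clinical guideline"

        self.cache[key] = supported
        return supported
```

Repeated claims then cost no additional searches, which matters when the same medical facts recur across many training samples.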

02 Its diagnostic capabilities surpass those of human doctors, entering a usable stage.

However, surpassing OpenAI on HealthBench was not the only highlight of the release.

What's even more interesting is that Baichuan has built its own benchmark dataset, SCAN-bench. Compared with topping a benchmark that OpenAI created, this self-built benchmark may better illustrate the direction in which Baichuan Intelligence wants to optimize in the medical field.

The evaluation dataset Baichuan built centers on optimizing "end-to-end consultation capabilities." This stems from Baichuan's own experimental finding: every 2% increase in consultation accuracy raises the accuracy of treatment results by 1%.

In other words, while OpenAI's HealthBench still mainly focuses on "whether AI can answer questions," Baichuan's SCAN-bench aims to evaluate whether AI can gather useful information over the course of a conversation and then provide a correct diagnosis and sound medical advice.

Typically, when we ask an AI assistant a question, simply telling it "you are an experienced doctor" won't yield very good results. This is because real doctors follow a highly standardized consultation process, which Baichuan summarizes as the four-quadrant SCAN principle: Safety Stratification, Clarity Matters, Association & Inquiry, and Normative Protocol.

Based on the SCAN principle, Baichuan drew on the OSCE method, which has been used in medical education for a long time, and collaborated with more than 150 front-line doctors to build the SCAN-bench evaluation system. The system breaks down the diagnosis and treatment process into three major stages: medical history collection, auxiliary examination, and accurate diagnosis. It conducts assessments in a dynamic and multi-round manner to fully simulate the entire process of doctors from consultation to diagnosis, and optimizes the model by obtaining better results in each of these processes.
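One way to picture a staged evaluation like this is a per-case record with a score for each stage, averaged across cases. The sketch below is an assumption for illustration only: the stage identifiers, the `CaseResult` type, and the equal-weight averaging are not taken from SCAN-bench, whose scoring the article does not describe.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical identifiers for the three stages named in the text.
STAGES = ("history_collection", "auxiliary_examination", "diagnosis")

@dataclass
class CaseResult:
    case_id: str
    stage_scores: Dict[str, float]  # score in [0, 1] for each stage

def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Average each stage's score over all evaluated cases."""
    if not results:
        return {stage: 0.0 for stage in STAGES}
    totals = {stage: 0.0 for stage in STAGES}
    for result in results:
        for stage in STAGES:
            totals[stage] += result.stage_scores[stage]
    return {stage: totals[stage] / len(results) for stage in STAGES}
```

Reporting the stages separately shows where a model loses ground, for example strong history taking but weak final diagnosis.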

Baichuan also released the M3 model's evaluation results on SCAN-bench.

The results were quite interesting. Baichuan also compared the model with real doctors, and in all four quadrants the doctors lagged behind the level the model achieved.

GeekPark specifically asked the Baichuan team about this. Their response: the evaluation pitted real specialist doctors against the model on specific cases. The model came out ahead partly because it is more patient but, more importantly, because of its broader interdisciplinary knowledge.

For example, in one case, a 10-year-old child had recurrent fevers. Fever is a very complex medical phenomenon. If we only ask about the condition of the lungs, such as coughing, we may easily overlook serious problems in the joints and urinary system and misdiagnose it as a common infection.

Human doctors are usually only good at treating diseases within their own specialties. This is why complex symptoms often require expert consultation, or why experts often have to consult books and information for difficult and complicated diseases.

Models that are merely prompted to play the role of a doctor, without specialized training, often find it difficult to answer these kinds of questions well.

03 Next Steps: Gradually begin developing consumer-facing (C-end) products and advance more serious medical practices.

For Baichuan Intelligence, surpassing the level of human doctors is of great significance: it means that AI has begun to cross the threshold of usability and can be deployed in real use cases.

Since January 13, users have been able to try the M3 model's answers on the BaiXiaoYing website and app.

The current website design is quite interesting. Both versions use the M3 model, but doctors and patients get separate interfaces. In the doctor's version, the answers are more concise, cite more references, and read less like everyday speech. In the patient's version, the model almost never gives an answer all at once; instead, it asks follow-up questions to reach a more specific diagnosis.

Baichuan Intelligence mentioned that the model's thinking in the background is very interesting. "We often see the model saying in its thought process, 'This patient didn't respond to my question, but I still have to ask it.' We've even seen extreme cases where the model has asked the patient 20 times, exceeding the maximum set number of rounds, but it still insists on asking the question. This is because during training, the model won't receive a reward for using clever or persuasive language; it must genuinely obtain enough key information and arrive at a correct diagnosis to receive a reward. This is a significant difference between how we train models and how others do it."
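The reward rule described in this quote can be sketched as a simple outcome-based function. The function name, the `min_coverage` threshold, and the all-or-nothing reward are assumptions for illustration; the article does not give Baichuan's actual reward design.

```python
from typing import Iterable

def consultation_reward(collected_facts: Iterable[str],
                        key_facts: Iterable[str],
                        predicted_diagnosis: str,
                        true_diagnosis: str,
                        min_coverage: float = 0.8) -> float:
    """Reward only a correct diagnosis backed by enough key information.

    Nothing in the signature scores wording or persuasiveness, so
    fluent but uninformed answers earn no reward.
    """
    key = set(key_facts)
    collected = set(collected_facts)
    # Share of the case's key facts that the model actually elicited.
    coverage = len(key & collected) / len(key) if key else 1.0
    correct = predicted_diagnosis == true_diagnosis
    return 1.0 if (correct and coverage >= min_coverage) else 0.0
```

Under a rule like this, a model that guesses the right diagnosis without eliciting the key facts earns nothing, which is consistent with the persistent questioning the team describes.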

Recently, many AI companies have begun moving into the medical field. Baichuan Intelligence believes its biggest difference from them is its focus on more serious medical issues.

"This means that when Baichuan chooses scenarios, it doesn't just look at which scenario is easiest to work on. On the contrary, Baichuan insists on continuously pushing its technological capabilities and tackling more difficult problems," Wang Xiaochuan said.

A typical example: Baichuan will prioritize solutions for oncology, while psychological healing will be a lower priority.

In popular opinion, AI is generally considered simpler and easier to implement in providing psychological healing. Baichuan's reasoning differs. They believe that the field of oncology has more rigorous scientific evidence. Here, AI is more likely to achieve serious medical results, thus reaching or surpassing the level of human doctors. In contrast, the field of psychology lacks this definitive scientific anchor.

As another example, some companies choose to create AI clones of doctors, but Wang Xiaochuan believes this is not the direction Baichuan wants to pursue. A doctor's clone cannot fully replicate, let alone surpass, the doctor's skill level. Such AI will ultimately become nothing more than a facade and a customer acquisition tool, and cannot truly advance serious healthcare.

This insistence on seriousness has profoundly influenced many of Baichuan's business choices.

This directly relates to Wang Xiaochuan's thoughts on the fundamental issues of the next stage of medical AI. He believes that the most important task at this stage is to gradually provide more medical services based on enhancing AI capabilities.

For many years, China has been trying to promote a tiered medical system and a general practitioner system. The original intention was to allow ordinary people to see a doctor at the grassroots level first, in order to solve the problems of difficulty in making appointments, long queues, and overcrowding at large hospitals.

The reason this system is difficult to implement is essentially due to insufficient medical resources. Primary healthcare institutions lack highly skilled doctors. People are willing to queue at top-tier hospitals even for a simple cold because they lack confidence in the medical care provided by primary care facilities.

This is precisely where medical AI comes into play. Large models enable cutting-edge medical knowledge to be distributed at scale. They fill the supply gap at the grassroots level, giving every community and every family access to the same diagnostic and treatment capabilities as specialists in top-tier hospitals.

In the long run, this could have a broader impact, potentially shifting decision-making power in healthcare from doctors to patients. In traditional healthcare settings, patients are the beneficiaries but often lack decision-making power, which is concentrated in the hands of doctors. This power asymmetry often leads to higher communication costs and discomfort during treatment.

Baichuan hopes to use AI to make it easier for patients to access high-quality medical resources. "Many people think that medicine is too complicated and that patients will never understand it. But we think of the jury system in the US judicial system. Law is also a very professional matter, and ordinary people on the jury don't understand it. So we require judges, lawyers, and prosecutors to lead the debate, to make the arguments clear, to a level that ordinary people can judge as guilty or innocent, so that ordinary people can make a normal judgment based on logic," Wang Xiaochuan said.

This is one of the reasons why Baichuan Intelligence is unwilling to focus on simple scenarios, but instead hopes to continuously advance towards more complex and serious medical treatments.

When asked whether solving highly complex problems is the most commercially rewarding path, Wang Xiaochuan gave a thoughtful answer.

He believes that solving minor issues like colds and fevers is unlikely to build sufficient trust among users. Healthcare is an industry highly dependent on trust. Only when AI can solve complex problems such as serious illnesses can a true foundation of trust be established.

From a business perspective, patients facing serious health issues are more willing to pay for high-quality AI services. This trust is not only a prerequisite for commercial returns but also the core reason why AI in healthcare can be applied on a large scale.

In a more fundamental sense, healthcare, for Baichuan Intelligence and Wang Xiaochuan himself, still represents a path close to artificial general intelligence (AGI).

Wang Xiaochuan believes that AI has already found practical solutions in fields such as humanities, sciences, engineering, and arts, while medicine is a very unique field. Humanity's exploration of medicine is far from exhausted, and AI is still in the exploratory stage in this area.

Baichuan's roadmap is very clear. First, it aims to improve diagnostic efficiency through AI, addressing the current shortage of medical resources. Building on this foundation, Baichuan is committed to establishing deep trust with patients. When patients are willing to use AI tools for long-term medical consultations, AI can accumulate real and high-quality medical data through this sustained interaction.

The ultimate goal of this data is to build mathematical models of life. This is a path that human doctors have yet to fully explore, and it is highly likely that AI will be the first to achieve it. If a model of the essence of life can be completed, it will be a key step in pushing artificial general intelligence to a higher level.
