Microsoft paper accidentally leaks OpenAI model parameters: GPT-4o is only 200B, o1-preview is 300B

36kr · 01-02

The veil of mystery has been lifted: the parameters of OpenAI's models have been revealed! A medical paper from Microsoft and the University of Washington has unexpectedly exposed the parameters of the GPT-4, GPT-4o, and o1 series models. What shocked everyone is that GPT-4o mini is only 8B.

Who would have thought that Microsoft would "expose" the parameters of the OpenAI models in a medical paper!

  • The parameters of GPT-4 are about 1.76 trillion
  • The parameters of GPT-4o are about 200 billion
  • The parameters of GPT-4o mini are about 8 billion
  • The parameters of o1-preview are about 300 billion
  • The parameters of o1-mini are about 100 billion
  • The parameters of Claude 3.5 Sonnet are about 175 billion

Researchers: The parameters are estimated values

What many found hard to believe is how few parameters the GPT-4o series has; the mini version is only 8B.

Some netizens speculate that 4o mini is an MoE model with about 40B total parameters, of which only 8B are active.

Their reasoning: 4o mini clearly knows more than a typical 8B dense model, yet runs very fast.

In addition, since GPT-4o is a MoE architecture, OpenAI may have used the same architecture on the mini version.

Another netizen was surprised to find that the parameters of Claude 3.5 Sonnet are equivalent to GPT-3 davinci.

In this paper from the Microsoft and University of Washington team, a milestone evaluation benchmark called MEDEC was released, designed specifically for detecting and correcting medical errors in clinical notes.

Paper link: https://arxiv.org/abs/2412.19260

This benchmark covers five types of errors, including diagnosis, management, treatment, pharmacotherapy, and causal organism.

MEDEC comprises 3,848 clinical texts in total, including 488 clinical notes drawn from three US hospital systems.

It is worth mentioning that these data had never been seen by any LLM, ensuring the reliability and validity of the evaluation. The dataset has already been used in the MEDIQA-CORR shared task to evaluate the performance of 17 participating systems.

After obtaining the MEDEC dataset, the research team conducted comprehensive testing on the current state-of-the-art models, including o1-preview, GPT-4, Claude 3.5 Sonnet, Gemini 2.0 Flash, etc., in the tasks of medical error detection and correction.

At the same time, they also invited two professional doctors to perform the same error detection task, and finally compared the results of AI and human doctors.

The results show that the latest LLMs perform well in medical error detection and correction, but there is still a significant gap compared to human doctors.

This also indirectly confirms that MEDEC is a highly challenging evaluation benchmark.

What is the paper about?

A survey study from US medical institutions shows that one out of every five patients who read clinical notes reports finding errors.

40% of these patients consider the errors to be serious, and the most common error category is related to current or past diagnoses.

At the same time, more and more medical document tasks (such as clinical note generation) are now being performed by LLMs.

However, one of the main challenges of using LLMs for medical document tasks is the tendency to produce "hallucinations", outputting fabricated content or erroneous information, which directly affects clinical decision-making.

After all, medical matters are no small thing, and a slight difference can mean the difference between life and death.

To reduce these risks and ensure the safety of LLMs in medical content generation, rigorous verification methods are crucial. This verification requires relevant benchmarks to assess whether fully automated model verification can be achieved.

A key task in the verification process is to detect and correct medical errors in clinical texts.

From the perspective of human doctors, identifying and correcting these errors requires not only medical expertise and domain background, but sometimes also extensive experience.

Previously, most research on (common sense) error detection has focused on general domains.

To this end, the Microsoft and University of Washington team introduced a brand-new dataset - MEDEC, and conducted experiments on different leading LLMs (such as Claude 3.5 Sonnet, o1-preview, and Gemini 2.0 Flash).

The authors claim that "to our knowledge, this is the first publicly available benchmark and study on automatic error detection and correction in clinical notes".

MEDEC Dataset

The MEDEC dataset contains a total of 3,848 clinical texts from different medical specialty areas, with the annotation tasks completed by 8 medical annotators.

As mentioned earlier, the dataset covers five types of errors:

  • Diagnosis: The provided diagnosis is inaccurate
  • Management: The recommended next steps for management are inaccurate
  • Pharmacotherapy: The recommended pharmacotherapy is inaccurate
  • Treatment: The recommended treatment plan is inaccurate
  • Causal Organism: The identified causative organism or pathogen is inaccurate

(Note: These error types were selected based on the most common problem types in medical board exams.)

Figure 1 shows an example from the MEDEC dataset. Each clinical text is either correct or contains an error created using one of two methods: Method #1 (MS) and Method #2 (UW).

Data Creation Method #1 (MS)

In this method, the authors utilized the medical board exam questions from the MedQA collection.

Four annotators with medical backgrounds worked from the exams' medical narratives and multiple-choice questions. After verifying each original question and answer, they injected the incorrect answers into the scenario texts, excluding any question-answer pairs that contained errors or ambiguous information.

The medical annotators followed these guidelines:

Use the medical narrative multiple-choice questions and inject the incorrect answer into the scenario text, creating two versions: one with the error injected in the middle of the text, and one with it at the end.

Use medical narrative multiple-choice questions, inject the correct answers into the scenario text to generate the correct version, as shown in Figure 2 (generated text containing the correct answer).

Manually check the automatically generated texts to ensure they faithfully reflect the original scenario and the included answers.

Finally, the researchers randomly selected one correct version and one error version from the two different scenarios (error injected in the middle or end) to construct the final dataset.
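The injection step above can be sketched as a small helper that drops an answer sentence (correct or incorrect) into either the middle of the scenario or its end. This is an illustrative reconstruction under our own naming, not the annotators' actual tooling:

```python
def inject_answer(sentences, answer, position="end"):
    """Insert an answer sentence into a scenario text.

    sentences: the scenario split into a list of sentences
    answer:    an exam answer rewritten as a narrative sentence
    position:  "middle" or "end", matching the two versions the
               annotators created
    """
    out = list(sentences)  # leave the original scenario untouched
    if position == "middle":
        out.insert(len(out) // 2, answer)
    elif position == "end":
        out.append(answer)
    else:
        raise ValueError("position must be 'middle' or 'end'")
    return out
```

From each scenario, one correct version and one error version (middle or end) would then be picked for the final dataset, as described above.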

Data Creation Method #2 (UW)

Here, the authors used the real clinical note database from the three hospital systems (Harborview Medical Center, UW Medical Center, and Seattle Cancer Care Alliance) of the University of Washington (UW) from 2009 to 2021.

The researchers randomly selected 488 out of 17,453 diagnostic support records, which summarized the patients' conditions and provided treatment rationale. A team of 4 medical students manually introduced errors into 244 of these records.


In the initial stage, each record is annotated with several candidate entities, identified by QuickUMLS as concepts from the Unified Medical Language System (UMLS).

The annotators can select a concise medical entity from these candidate entities, or create a new text span. Subsequently, the span is marked as one of five error types.

Next, the annotators replace the span with a similar but different concept, with the erroneous version either designed by the annotators or generated using a SNOMED and LLM-based method. This method suggests alternative concepts to the annotators, but does not depend on the input text. The medical annotators manually determine the final concepts or errors injected into the text.

In this process, each error span must contradict at least two other parts of the clinical note, and the annotators need to provide a reasonable explanation for each introduced error.

The authors used the Philter tool to automatically de-identify the clinical notes after the errors were injected.

Subsequently, each note was independently reviewed by 2 annotators to ensure the accuracy of the de-identification. For any disagreements, a third annotator made the final decision.

Table 1 below shows the split of the training, validation, and test sets. The MS training set contains 2,189 clinical texts, the MS validation set contains 574 clinical texts, and the UW validation set contains 160 clinical texts.

The MEDEC test set consists of 597 clinical texts from the MS collection and 328 clinical texts from the UW dataset. In the test set, 51.3% of the notes contain errors, while 48.7% of the notes are correct.

Figure 3 below shows the distribution of error types (Diagnosis, Management, Treatment, Medication, and Pathogen) in the dataset.

Medical Error Detection and Correction Methods

To evaluate the model's performance on the medical error detection and correction task, the authors divided the process into three sub-tasks:

Sub-task A: Predict the error flag (0: if the text has no error; 1: if the text contains an error)

Sub-task B: Identify the sentence containing the error (-1: if the text has no error; the sentence ID: if the text contains an error)

Sub-task C: Generate a corrected sentence for texts containing errors (NA: if the text has no error; the generated correction: if the text contains an error)
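Under the output convention the prompts describe below (the literal string CORRECT, or a sentence ID followed by the corrected sentence), a single model response can be split back into the three sub-task predictions. A minimal parser sketch; the function name is ours, not from the paper:

```python
def parse_response(response):
    """Split one model response into the three sub-task predictions.

    Returns (error_flag, error_sentence_id, correction):
      - error_flag:        0 if the note is judged correct, else 1
      - error_sentence_id: -1 if no error, else the flagged sentence ID
      - correction:        "NA" if no error, else the corrected sentence
    """
    response = response.strip()
    if response.upper() == "CORRECT":
        return 0, -1, "NA"
    # Expected error format: "<sentence ID> <corrected sentence>"
    sent_id, _, correction = response.partition(" ")
    return 1, int(sent_id), correction.strip()
```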

For comparison, they built LLM-based solutions, using two different prompts to generate the required outputs and evaluate model performance on these three sub-tasks:

Prompt #1:

The following is a medical narrative about a patient. You are an experienced physician reviewing these clinical texts. The text is either correct or contains an error. Each line of the text is a sentence. Each line starts with a sentence ID, followed by a vertical bar symbol, and then the sentence to be checked. Check each sentence in the text. If the text is correct, return the following output: CORRECT. If there is a medical error related to treatment, management, etiology, or diagnosis in the text, return the sentence ID containing the error, followed by a space, and then the corrected sentence. Detecting and correcting errors requires medical knowledge and reasoning.

Prompt #2: Similar to the first prompt, but includes a randomly selected input and output example from the training set:

Here is an example.

0 A 35-year-old female presents to her doctor with complaints of hand pain and stiffness. 1 She reports the pain began 6 weeks ago, shortly after she recovered from a mild upper respiratory infection. (……) 9 Bilateral hand X-rays show mild peri-articular osteopenia around the left 5th metacarpophalangeal joint. 10 Methotrexate is given.

In this example, the error occurs in sentence 10: "Methotrexate is given". The correction is: "Prednisone is given". The output is: 10 1 Prednisone is given. End of example.
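Prompt #1 expects the note laid out one sentence per line, each line prefixed by a sentence ID and a vertical bar. A sketch of that formatting step, assuming the note has already been split into sentences (the helper name is ours):

```python
def format_note(sentences):
    """Render a clinical note as numbered lines: '<ID>|<sentence>'."""
    return "\n".join(f"{i}|{s}" for i, s in enumerate(sentences))
```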

Experiments and Results

Language Models

The researchers experimented with several recent language models:

Phi-3-7B: A small language model (SLM) with 7 billion parameters.

Claude 3.5 Sonnet (2024-10-22): The latest model in the Claude 3.5 series (≈175 billion parameters), which has shown state-of-the-art performance on various coding, vision, and reasoning tasks.

Gemini 2.0 Flash: The latest/most advanced Gemini model. Other Google models (such as Med-PaLM, designed specifically for medical tasks, with 540 billion parameters) are not yet publicly available.

ChatGPT (≈175 billion parameters) and GPT-4 (≈1.76 trillion parameters), which are "high-intelligence" models.

GPT-4o (≈200 billion parameters), which provides "GPT-4-level intelligence but faster", and the specialized small model GPT-4o-mini (gpt-4o-2024-05-13) (≈8 billion parameters).

The latest o1-mini (o1-mini-2024-09-12) (≈100 billion parameters) and o1-preview (o1-preview-2024-09-12) (≈300 billion parameters), which have "new AI capabilities" and can handle complex reasoning tasks.

Note that the parameter counts for most models are estimates, primarily to help understand model performance. A few models (such as Phi-3 and Claude) require minor automatic post-processing to correct formatting issues.

Results

Table 2 shows the manual annotation results from medical experts, as well as the results of several recent LLMs using the two prompts described above.

In error flag detection, Claude 3.5 Sonnet outperformed other methods with an accuracy of 70.16%, and achieved an accuracy of 65.62% in error sentence detection.

o1-mini achieved the second-highest accuracy of 69.08% in error flag detection.

In error correction, o1-preview achieved the best performance with an Aggregate Score of 0.698, far surpassing the second-place GPT-4 [P#2] with 0.639.

Table 3 shows the error detection accuracy and error correction scores on each dataset (MEDEC-MS and MEDEC-UW). The MS subset is more challenging for Claude 3.5 Sonnet and Doctor #2, while the UW subset is more challenging for o1-preview and Doctor #1.

The results show that the latest LLMs perform respectably in error detection and correction, but still fall short of the human medical experts.

This may be because such error detection and correction tasks are relatively rare in web and medical textbook data, meaning LLMs are less likely to encounter relevant data during pre-training.

This can be seen in the results of o1-preview, which achieved 73% and 69% accuracy in error and sentence detection on the MS subset built from public clinical text, but only 58% and 48% on the private UW collection.

Another factor is that the task requires analyzing and correcting existing non-LLM generated text, which may be more challenging than drafting new answers from scratch.

Table 4 shows the error detection recall and error correction scores for each error type (Diagnosis, Management, Treatment, Medication, and Pathogen).

It can be seen that o1-preview has significantly higher recall in error flagging and sentence detection than Claude 3.5 Sonnet and the two doctors. However, when combining the accuracy results (see Table 2), the doctors perform better in terms of precision.

These results indicate that the models have significant problems with precision, and compared to doctors, the AI tends to over-predict the existence of errors (i.e., hallucinate) in many cases.
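The over-prediction effect is easiest to see in the precision/recall definitions: a model that flags nearly every note as erroneous drives recall up while precision collapses. A toy illustration with made-up counts (not the paper's numbers):

```python
def precision_recall(tp, fp, fn):
    """precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical model that flags errors aggressively: high recall, low precision.
model_p, model_r = precision_recall(tp=90, fp=60, fn=10)    # (0.6, 0.9)
# Hypothetical doctor who flags fewer notes, mostly correctly: the reverse.
doctor_p, doctor_r = precision_recall(tp=70, fp=10, fn=30)  # (0.875, 0.7)
```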

Additionally, the results show a ranking difference between classification performance and error correction generation performance.


For example, among all models, Claude 3.5 Sonnet ranks first in the accuracy of error flagging and sentence detection, but ranks last in the correction generation score (see Table 2).

In addition, o1-preview ranks fourth in error detection accuracy among all LLMs, but ranks first and far ahead in correction generation. A similar pattern can also be observed between the two medical doctors.

The above phenomenon can be explained by the difficulty of the correction generation task, and may also reflect the limitations of the current SOTA text generation evaluation metrics in capturing synonyms and similarities in medical texts.

Table 5 shows the reference text, doctor annotations, and correction examples automatically generated by the Claude 3.5 Sonnet and GPT models.

For example, the reference correction in the second example indicates that the patient was diagnosed with Bruton's agammaglobulinemia, while the correct answer provided by the LLM mentions X-linked agammaglobulinemia (a synonym for this rare genetic disease).

In addition, some LLMs (such as Claude) provide longer answers/corrections and include more explanations. Similar phenomena also appear in the annotations of the doctors, where doctor #1's corrections are longer than doctor #2's, and the two doctors have different opinions on some examples/cases, reflecting the differences in style and content between clinical notes written by different doctors/experts.

For the next step in research on medical error detection and correction, more examples need to be introduced in the prompts and prompt optimization needs to be performed.

Author Introduction

Wen-wai Yim

Wen-wai Yim is a Senior Applied Scientist at Microsoft.

She received a Bachelor's degree in Bioengineering from UCSD and a PhD in Biomedical and Health Informatics from the University of Washington, focusing on extracting clinical events from clinical and radiology notes and performing cancer staging prediction.

She was previously a postdoctoral researcher at Stanford University, developing methods for extracting information from free-form clinical notes and combining it with metadata in electronic health records.

Her research interests include clinical natural language understanding from clinical notes and medical dialogues, as well as generating clinical note language from structured and unstructured data.

Yujuan Fu

Yujuan Fu is a PhD student in Medical Informatics at the University of Washington.

Previously, she received a Bachelor's degree in Electrical and Computer Engineering from Shanghai Jiao Tong University and a Bachelor's degree in Data Science from the University of Michigan.

Her research area is natural language processing for the health domain: fine-tuning large language models for tasks such as information extraction, summarization, commonsense reasoning, machine translation, and factual consistency assessment.

Zhaoyi Sun

Zhaoyi Sun is a PhD student in Biomedical and Health Informatics at the University of Washington, affiliated with the UW-BioNLP team led by Dr. Meliha Yetisgen.

Previously, he received a Bachelor's degree in Chemistry from Nanjing University and a Master's degree in Health Informatics from Cornell University.

His research focuses on applying LLMs to medical question answering and error detection in clinical notes, with an interest in multimodal deep learning research combining biomedical images and text, aiming to improve the efficiency and effectiveness of natural language processing techniques in clinical applications.

Fei Xia

Fei Xia is a Professor in the Department of Linguistics at the University of Washington, and a co-organizer of the UW-Microsoft Workshop. Previously, she was a Research Scientist at the IBM T. J. Watson Research Center.

She received a Bachelor's degree from the Department of Computer Science at Peking University, and Master's and PhD degrees from the Department of Computer and Information Science at the University of Pennsylvania.

During her time at Penn, she was the team leader of the Chinese Treebank project and a team member of the XTAG project. Her doctoral advisors were Dr. Martha Palmer and Dr. Aravind Joshi.

References:

https://x.com/koltregaskes/status/1874535044334969104

https://arxiv.org/pdf/2412.19260

This article is from the WeChat public account "New Intelligence", author: New Intelligence, authorized for release by 36Kr.
