Microsoft's next-generation 14B small model Phi-4 has arrived! With synthetic data making up about 40% of its training mix, it outperforms GPT-4o on mathematical benchmarks, and the 36-page technical report has now been released.
14 billion parameters, 40% synthetic data: the SLM king of the year is born!
Recently, Microsoft's next-generation small model Phi-4 officially debuted. On the GPQA and MATH benchmarks, its mathematical performance directly crushes GPT-4o and Gemini Pro 1.5.
Moreover, Phi-4 outstrips other small models, with performance on par with Llama-3.3-70B-Instruct.
Even on the November 2024 AMC math competition problems, Phi-4 achieved an average score of 91.8.
Sebastien Bubeck, the former head of the Phi series, was very surprised to see this result.
The following example demonstrates Phi-4's capabilities in mathematical reasoning, which are not only lightning-fast but also accurate.
Digging deeper, Phi-4 inherits the tradition of previous Phi-series generations: it, too, was trained on textbook-quality "synthetic data".
The proportion of synthetic data is as high as 40%.
Synthetic data is one of three core technical breakthroughs; the others are carefully curated organic data and leading post-training techniques, such as Pivotal Token Search (PTS) in DPO.
Phi-4's success indirectly rebuts the "data wall" view advanced by several industry leaders, such as Ilya Sutskever and Alexandr Wang.
Currently, the new model is available on Microsoft Azure AI Foundry, and it will be launched on HuggingFace next week.
01 Defeating GPT-4o in Mathematics, 36-page Technical Report Released
Unlike most language models, whose pre-training relies mainly on naturally occurring data sources such as web content or code, Phi-4 strategically incorporates synthetic data throughout the training process.
While the performance of previous Phi-series models derived mainly from distilling the capabilities of teacher models (especially GPT-4), Phi-4 significantly surpasses its teacher model on STEM-related question answering, demonstrating that data generation and post-training techniques can deliver greater capability gains than model distillation.
Paper link: https://arxiv.org/abs/2412.08905
Phi-4 is mainly composed of three core technologies:
- Synthetic data for pre-training and mid-training
- Screening and filtering of high-quality organic data
- Post-training
Thanks to these innovations, Phi-4's performance on reasoning-related tasks is on par with or even exceeds larger models.
For example, on many widely used reasoning-related benchmark tests, its performance reaches or exceeds that of Llama-3.1-405B.
From Table 1, we can see that Phi-4 significantly outperforms its teacher model GPT-4o on the GPQA (graduate-level STEM Q&A) and MATH (math competition) benchmarks.
Table 1 Performance of Phi-4 on classic benchmarks
To verify whether Phi-4 has overfitting or data contamination issues, the researchers tested the model on the AMC-10 and AMC-12 math competitions in November 2024.
The problems from these two competitions appeared only after all training data had been collected, so performance on them serves as an effective indicator of the model's generalization ability.
As shown in the figure below, although Phi-4 is only 14B, its average score significantly exceeds that of its teacher model GPT-4o.
Phi-4 outperforms many larger models, including Gemini Pro 1.5, on math competition problems.
02 The Advantages of Synthetic Data
Synthetic data constitutes a large portion of Phi-4's training data, generated through various techniques, including multi-agent prompting, self-revision workflows, and instruction reversal.
These technical methods can build datasets that drive the model to have stronger reasoning and problem-solving capabilities, addressing some weaknesses in traditional unsupervised datasets.
Synthetic data is not a cheap substitute for organic data, but rather has several direct advantages over organic data.
Structured Data and Support for Incremental Learning
In organic datasets, the relationships between tokens are often complex and indirect. It may require many inference steps to connect the current token to the next token, making it difficult for the model to learn effectively from the next-token prediction objective.
In contrast, synthetic data is generated token by token, each predicted from the preceding ones; this explicit, structured progression aligns with the next-token prediction objective and makes training more efficient.
Aligning Training and Inference Contexts
Synthetic data can avoid the model learning some data characteristics from organic datasets that are not suitable for subsequent training.
For example, online forums often have their own specific communication styles and language habits, while people's language style and interaction logic when conversing with large models are different.
If forum data is used for training as-is, content with a very distinctive style will appear to the model as highly unlikely in a chat setting; as a result, during inference in later dialogues, the model fails to connect the conversation to the corresponding forum knowledge.
Synthetic data rewrites the content of online forums into a language style that is more compatible with the context of LLM chat inference.
Synthetic data also plays a key role in Phi-4's post-training, where new methods such as rejection sampling and direct preference optimization (DPO) are used to optimize the model's output.
03 The Source of Synthetic Data
Pre-training and Mid-training Data
For these stages, the research team created 50 broad types of synthetic datasets, each relying on a different set of seeds and a multi-stage prompting procedure, covering a variety of topics, skills, and interaction properties, and totaling about 400 billion unweighted tokens.
Through the following methods, they ensured that the synthetic data was not contaminated by low-quality web data, making it a high-quality training dataset.
Construction of Seed Datasets
1. Web and code seeds: Extract excerpts and code snippets from web pages, books, and code repositories, focusing on content with high complexity, reasoning depth, and educational value. To ensure quality, the team adopted a two-stage filtering process: first, identifying the key high-value pages to focus on, and then dividing the selected pages into paragraphs and scoring the objectivity and reasoning content of each paragraph.
2. Question datasets: A large number of questions were collected from websites, forums, and Q&A platforms, then filtered with a voting technique to balance difficulty. Specifically, the team generated multiple independent answers for each question and applied majority voting to assess answer consistency; questions whose answers were all identical (indicating the question is too easy) or completely inconsistent (too hard or ambiguous) were discarded (see the sketch after this list).
3. Creating Q&A pairs from multiple sources: Language models were used to extract Q&A pairs from organic sources such as books, scientific papers, and code. Rather than relying only on explicit Q&A pairs found in the text, the pipeline detects chains of reasoning or logical processes: the language model identifies the key steps of a reasoning or problem-solving passage and rephrases them as questions with corresponding answers. Experiments show that training on such generated content (with improvements on academic and internal benchmarks) can be more effective than training on the original content.
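As a concrete illustration of the voting filter in item 2, here is a minimal sketch; the sample count of 8 and the exact decision rule are our illustrative assumptions, not specifics from the report.

```python
from collections import Counter

def keep_question(answers: list[str]) -> bool:
    """Difficulty filter via answer agreement.

    `answers` holds several independently sampled model answers to one
    question (e.g. 8 samples at temperature > 0). Unanimous agreement
    suggests the question is too easy; no agreement at all suggests it
    is too hard or ambiguous. Either way, the question is dropped.
    """
    top_count = Counter(answers).most_common(1)[0][1]
    return 1 < top_count < len(answers)

# Illustrative usage with 8 sampled answers per question:
print(keep_question(["42"] * 8))                                        # False: too easy
print(keep_question(["a", "b", "c", "d", "e", "f", "g", "h"]))          # False: ambiguous
print(keep_question(["42", "42", "41", "42", "40", "42", "42", "42"]))  # True: kept
```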
Rewriting and augmentation: Seed content is transformed into synthetic data through a multi-step prompting workflow. This includes rewriting most of the useful content in a given paragraph into exercises, discussions, or structured reasoning tasks.
Self-revision: Initial responses are iteratively optimized through a feedback loop where the model self-evaluates based on criteria focused on reasoning and factual accuracy, and then improves its own output.
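The self-revision loop might look like the following sketch, where `model` stands in for any chat-completion call and the critique rubric is our paraphrase of the criteria named above.

```python
def self_revise(prompt: str, model, n_rounds: int = 2) -> str:
    """Generate an answer, then iteratively critique and rewrite it.

    `model(text)` is a stand-in for an LLM completion call; the
    prompts below are illustrative, not the report's exact wording.
    """
    draft = model(prompt)
    for _ in range(n_rounds):
        # Self-evaluate against the criteria the report emphasizes:
        # reasoning quality and factual accuracy.
        critique = model(
            "Assess this answer for reasoning quality and factual "
            f"accuracy, listing concrete flaws:\n\n{draft}"
        )
        # Improve the draft using the model's own feedback.
        draft = model(
            f"Rewrite the answer to fix the listed flaws.\n\n"
            f"Answer:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft
```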
Instruction reversal for code and other tasks: To improve the model's ability to generate outputs from instructions, the team employed instruction reversal. For example, they selected existing code snippets from a code corpus and used them to generate instructions containing problem descriptions or task prompts. Only instructions whose regenerated code showed high similarity to the original were retained, ensuring the instructions matched the output content.
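A sketch of this instruction-reversal filter follows; `model` is again a stand-in LLM call, and the similarity metric and 0.8 cutoff are illustrative assumptions (the report does not publish its threshold).

```python
import difflib

def reverse_instruction(snippet: str, model) -> str | None:
    """Derive an instruction from existing code, keeping it only if
    the instruction can reproduce code similar to the original."""
    instruction = model(
        f"Write a task description whose solution is this code:\n{snippet}"
    )
    # Solve the derived task from scratch, then compare to the source.
    regenerated = model(instruction)
    similarity = difflib.SequenceMatcher(None, snippet, regenerated).ratio()
    # Keep the (instruction, snippet) pair only when they match well.
    return instruction if similarity >= 0.8 else None
```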
Post-training Data
The post-training dataset primarily consists of two parts:
- Supervised fine-tuning (SFT) dataset: Using carefully curated user prompts from public datasets and synthetic data, generating multiple model responses, and selecting the best response using an LLM-based evaluation process.
- Direct Preference Optimization (DPO) dataset: DPO pairs generated via rejection sampling and LLM evaluation, some of them built with the pivotal-token method described later.
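A minimal sketch of how DPO pairs could be built by rejection sampling, with `model` and `judge` as stand-in LLM calls; the sample count and best-vs-worst pairing are our assumptions, not the report's exact recipe.

```python
def build_dpo_pair(prompt: str, model, judge, n: int = 8) -> dict:
    """Sample n candidate responses, score each with an LLM judge,
    and pair the best (chosen) against the worst (rejected)."""
    candidates = [model(prompt) for _ in range(n)]
    scored = sorted(candidates, key=judge)  # judge returns a quality score
    return {"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]}
```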
The researchers used the generated SFT data and DPO data pairs to mitigate the model's hallucination issues.
As Figure 6 shows, this approach significantly reduced hallucinations on SimpleQA.
04 Pre-training
Phi-4 is built on the Transformer architecture, with 14B parameters and a default context length of 4,096 tokens; during mid-training, this was expanded to 16K.
Since pre-trained models are not adept at following instructions, zero-shot evaluations that require answers in a specific format (e.g., simple-evals) are not very informative.
Therefore, the team used an internally implemented benchmark for pre-training evaluation, which uses a mix of log-likelihood and few-shot prompts for various tasks.
Specifically, they used log-likelihood evaluation for MMLU (5-shot), MMLU-pro, and ARC-C (1-shot), while using 1, 3, 4, and 8 few-shot examples for TriviaQA (TQA), MBPP, MATH, and GSM8k, respectively, to help the model follow the answer format.
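For illustration, one common way to implement log-likelihood scoring of multiple-choice answers looks like the sketch below, using Hugging Face transformers with a placeholder model; this is a generic recipe, not the team's internal harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the team's internal evaluation setup is not public.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log p(continuation tokens | context) under the model."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    # The token at position p is predicted by the logits at position p - 1.
    return sum(
        logprobs[0, p - 1, ids[0, p]].item()
        for p in range(ctx_len, ids.shape[1])
    )

# Multiple-choice scoring: pick the answer the model finds most likely.
prompt = "Question: What is 2 + 2?\nAnswer:"
choices = [" 3", " 4", " 5"]
print(max(choices, key=lambda c: continuation_logprob(prompt, c)))  # " 4"
```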
Table 2 Improvement of Phi-4 over Phi-3-medium in pre-training benchmark evaluation
In the long-context HELMET benchmark, Phi-4 achieved near-leading performance on metrics like recall and maximum context.
05 Post-training
As mentioned earlier, the most important technique in the post-training process is Pivotal Token Search (PTS), so what is it?
Pivotal Token Search (PTS)
When a model generates a response to a prompt token-by-token, each token corresponds to a prefix of the model's answer.
For each such prefix, two quantities can be considered: the conditional probability that the model's final answer is correct given that prefix, and the probability increment each token contributes, i.e., the difference in that correctness probability before and after the token is generated.
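In symbols (notation ours; the report describes these quantities in prose):

```latex
% For a prompt Q and an answer prefix t_1, \dots, t_i:
p_i = \Pr(\mathrm{success} \mid Q,\, t_1, \dots, t_i)
% the probability that a completion of this prefix yields a correct answer
\Delta_i = p_i - p_{i-1}
% the probability increment contributed by token t_i;
% tokens with large |\Delta_i| are the pivotal tokens PTS looks for
```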
In fact, when an AI model generates an answer, often only a few key tokens determine the correctness of the entire answer.
In the research, the team observed an interesting phenomenon: when the model was solving math problems, simply generating the key token "negative" could turn a potentially failing solution into a successful one.
Conversely, generating a subsequent token like "(a" could drastically reduce the probability of correctness.
Now, examining the DPO training method through this lens, the team discovered several noteworthy issues.
As shown in Figure 3, many tokens have probabilities far lower than that of the pivotal token "negative" (0.31); in DPO training, these tokens introduce noise and dilute the effective signal from the pivotal tokens.
Worse, tokens like "(a" that derail the solution actually receive a strong positive learning signal precisely because of their low probability (0.12).
Moreover, intuition suggests that once two texts differ substantively in content, comparing their next-token probabilities (as DPO does) loses meaning.
In summary, more meaningful signals should come from the initial tokens where the text starts to deviate.
To address the previous issues, the Microsoft team proposed an innovative method called Pivotal Token Search (PTS).
This method specifically targets the generation of preference data for individual key tokens, using DPO optimization to precisely target the effect on specific tokens.
The core task of PTS is to find those key tokens in the full token sequence (T_full = t1, t2, ...) that significantly impact the success rate, i.e., p(success | t1, ..., ti).
PTS converts the discovered pivotal tokens into training data, using Q + t1, ..., t(i-1) as the query baseline and taking single tokens that increase or decrease the success rate as "accepted" and "rejected" samples, respectively.
While the binary search algorithm used in PTS cannot guarantee finding all key tokens, it has two important properties:
- The tokens found are guaranteed to be key tokens
- If the success probability changes monotonically during the problem-solving process, it can find all key tokens
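A minimal sketch of this subdivision, under stated assumptions: `p_est(i)` stands in for a Monte-Carlo estimate of p(success | prompt, tokens[:i]), obtained by sampling completions of the prefix and scoring them with an oracle, and the 0.2 threshold is illustrative rather than taken from the report.

```python
def pivotal_tokens(tokens, p_est, threshold=0.2):
    """Binary-search-style subdivision in the spirit of PTS.

    tokens    : the model's full answer, as a token list
    p_est(i)  : estimated p(success | prompt, tokens[:i])
    threshold : minimum probability jump for a token to count as
                pivotal (illustrative value, not from the report)
    """
    found = []

    def search(lo, hi, p_lo, p_hi):
        if abs(p_hi - p_lo) < threshold:
            return                    # endpoints agree: skip this span
        if hi - lo == 1:
            found.append((lo, tokens[lo], p_hi - p_lo))  # pivotal token
            return
        mid = (lo + hi) // 2          # split the span and recurse
        p_mid = p_est(mid)
        search(lo, mid, p_lo, p_mid)
        search(mid, hi, p_mid, p_hi)

    search(0, len(tokens), p_est(0), p_est(len(tokens)))
    return found
```

Each hit can then be turned into a DPO pair as described above: the query is Q plus the tokens before position i, with a probability-raising token as the "accepted" sample and a probability-lowering alternative as the "rejected" one.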
Figure 5 shows an example of the preference data generated using PTS.
In the math question-answering example, the researchers found an interesting phenomenon: the key tokens are often not obvious mistakes, but rather choice points that guide the model towards different solution paths.
For example, Method A - multiply by the denominator separately; Method B - directly cross-multiply.
While both methods are mathematically correct, the latter is often more robust for the model.
The training data generated by PTS can help Phi-4 make better choices at these key decision points.
06 Punching above its weight, Phi-4 wins big
Based on the above technical innovations, Phi-4 was able to demonstrate impressive performance across various benchmarks.
In Table 1, compared to the Qwen-2.5-14B-Instruct model of a similar scale, Phi-4 outperformed in nine out of the 12 benchmark tests.
Furthermore, the researchers believe Phi-4's performance on SimpleQA is actually better than Qwen.
In fact, their base model achieved a higher benchmark score on SimpleQA than Qwen-2.5-14B-Instruct, but the team intentionally modified the model's behavior during post-training to optimize the user experience rather than pursuing a higher benchmark score.
Additionally, Phi-4 has demonstrated exceptional capabilities on STEM question-answering tasks.
For example, on GPQA (graduate-level STEM questions) and MATH (math competition), it even surpassed its teacher model GPT-4o.
In terms of coding ability, as measured by HumanEval and HumanEval+, it also scored higher than any other open-source model (including larger Llama models).
The areas where Phi-4 performed poorly were SimpleQA, DROP, and IFEval.
As for the first two, the researchers believe the numbers reported by simple-evals are an oversimplification that does not accurately reflect the model's performance on those benchmark problems.
However, IFEval revealed a real weakness of Phi-4: difficulty strictly following instructions.
In the next step of the research, the researchers believe that through targeted synthetic data, the instruction following performance of the Phi series models can be significantly improved.
Personally, I'm really looking forward to the next small model in the Phi series.
References:
https://x.com/iScienceLuvr/status/1867377384145727635
https://x.com/peteratmsr/status/1867375567739482217
https://x.com/VentureBeat/status/1867376462589739098
This article is from the WeChat public account "New Intelligence", author: New Intelligence, authorized by 36Kr for release.