The full-powered ChatGPT o1 is live, but in hands-on testing it actually lost to Wenxin Yiyan and Kimi?

36kr
12-09

On December 5th local time, OpenAI officially launched two new AI models, o1 and o1-Pro. The o1 model is actually one that users have seen before: it was previously available as o1-preview, with only part of its functionality enabled. The new version drops the "preview" label, meaning the full-powered o1 model has finally officially launched.

Source: Leikeji

A quick test shows that the full-powered o1 model now supports image and file uploads, whereas it previously accepted only text input; in other words, it has gained multi-modal understanding. However, the web search function is still not available, which is a bit disappointing.

Regarding the full-powered o1's upgrade, OpenAI CEO Sam Altman shared a simple bar chart for comparison: o1 performs significantly better than o1-preview in mathematical reasoning and programming, with an improvement of around 50%, while its gains over o1-preview on scientific research benchmarks are more limited.

Source: OpenAI

Considering that the o1 model can be used without additional payment, it is well worth it for users who need it. But OpenAI's real focus is not the free o1 upgrade; the brand-new o1-Pro is the main event. Using o1-Pro, however, requires subscribing to the new $200 plan for priority access, currently the most expensive individual subscription in the AI field.

From OpenAI's performance comparison chart, o1-Pro does improve on o1, but not by much. For ordinary users, the o1 model can fully meet daily needs, and there is no need to subscribe to the $200 plan for o1-Pro.

Of course, the $200 plan provides not only o1-Pro but also unlimited use of the o1 model and the advanced voice features (o1-Pro is not included in the unlimited use, and likely still has a usage cap). If o1's question quota is nowhere near enough for you, the $200 plan is the only option for individual users.

With new models available, a test is in order. This time, Leikeji's test focuses mainly on the multi-modal capabilities of the full-powered o1, with two domestic AI models (Kimi and Wenxin Yiyan) also invited to participate.

01 The full-powered o1 is not "invincible"

The strength of the o1 model lies in its advanced reasoning in areas such as mathematics, so let's start with its specialty: a not-too-difficult math problem:

Suppose a company produces a certain product, the relationship between production cost and output is C(x) = 3x^2 - 2x + 5 (unit: 10,000 yuan), where x is the output (unit: 1,000 pieces). The relationship between market price and output is P(x) = 50 - 0.5x (unit: 10,000 yuan/1,000 pieces).

1. Find the total profit function L(x) of the company when producing x thousand pieces of the product.

2. Determine how many thousand pieces the company should produce to achieve maximum profit, and calculate the maximum profit.

First, let's see the answers from the domestic AIs:

Kimi

Wenxin Yiyan

The domestic AIs both gave the same answer: a maximum profit of 188.14 (in the problem's units of 10,000 yuan, i.e. about 1.88 million yuan). Now let's see ChatGPT o1's answer.

o1

The o1 model also arrived at 188.14 (10,000-yuan units), matching the standard answer, so all three AIs passed this test. The screenshots do reveal differences, though: the o1 model showed much more of its calculation process, making it easier for users to check whether the reasoning is correct.
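As a side note, the standard answer is easy to verify yourself. Below is a quick Python check (our own, not from any of the tested AIs); units follow the problem statement, with x in thousands of pieces and money in 10,000-yuan units:

```python
def profit(x):
    """Total profit L(x) in 10,000-yuan units for output x (thousands of pieces)."""
    revenue = (50 - 0.5 * x) * x       # P(x) * x
    cost = 3 * x**2 - 2 * x + 5        # C(x)
    return revenue - cost              # L(x) = -3.5x^2 + 52x - 5

# Downward-opening parabola: maximum at x* = -b / (2a) = 52 / 7
x_star = 52 / 7
print(round(x_star, 2), round(profit(x_star), 2))  # 7.43 188.14
```

The maximum sits at x* = 52/7 ≈ 7.43 thousand pieces, giving a profit of about 188.14 (10,000 yuan), exactly the figure the AIs reported.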

This relates to the o1 model's main purpose: it is essentially designed for scientific research and similar uses, so when presenting an answer it emphasizes the reasoning process and its correctness rather than just outputting the right number.

Next, let's try asking questions directly with images, which lets us input more abstract math problems, such as one from a fourth-grade Math Olympiad:

Still, let's first look at the answers from the domestic AIs:

Kimi

Wenxin Yiyan

The two domestic AIs gave options A and B respectively. Kimi's reasoning process was very long, tackling this elementary-school competition problem with higher-level mathematics.

Now let's look at the o1 model's answer:

o1

The o1 model also answered B, so does that mean Kimi got it wrong? In fact, the correct answer is A: it was o1 and Wenxin Yiyan that got it wrong. Wenxin Yiyan gave no detailed reasoning, so we cannot see where its mistake occurred, while o1 clearly miscounted the sets of shapes in the image, which led it to the wrong answer.

This problem actually reveals differences in the models' problem-solving approaches. Wenxin Yiyan and o1 both tried to find the visual pattern directly and compute the answer from it, much as a human would, while Kimi converted the pictorial puzzle into algebraic equations and then solved them.

In terms of efficiency, the approach taken by o1 and Wenxin Yiyan certainly saves compute, but if the model's analysis and decomposition abilities cannot keep up, it produces wrong answers like this one. Kimi's approach consumes more computing power but also secured the accuracy of its answer.

From a vendor's perspective, pattern-based visual reasoning is naturally the better choice for improving inference efficiency and cutting inference costs. But given o1's positioning as a high-end model and OpenAI's framing of it as a research assistant, wrong answers in the name of saving compute may be hard to sell to users.

Next, let's look at programming performance. The task is not difficult:

I want to create a program that checks the computer's network connection every hour; if the connection is down, it restarts the computer, and if the connection is normal, it keeps the current state.

The two domestic AIs quickly provided the answers:

Kimi

Wenxin Yiyan

Since the requirements are very simple, both programs ran successfully in a quick virtual-machine test. The two domestic AIs' answers differ slightly, though: Kimi commented its code in gray text, while Wenxin Yiyan added extra notes, a reminder to install the required runtime library, and some further programming suggestions.

So what about the o1 model?

o1

The o1 model's response comes in three parts. It first gives the implementation idea, then provides annotated sample code, and finally reviews the coding process, along with testing ideas and alternative solutions. It seems to combine the strengths of both domestic AIs, and for beginners the o1 experience may well be better.
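For reference, here is a minimal Python sketch of what the task might look like (our own illustration, not any of the AIs' actual outputs). The DNS-probe host and port and the reboot command are assumptions, and the reboot command is passed in as a parameter so the logic can be tested without actually rebooting anything:

```python
import socket
import subprocess
import time

def network_ok(host="8.8.8.8", port=53, timeout=5):
    """Cheap reachability check: try a TCP connection to a public DNS server."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(interval_seconds=3600, reboot_cmd=("sudo", "reboot")):
    """Check connectivity every interval; reboot if down, otherwise do nothing."""
    while True:
        if not network_ok():
            # Platform-specific: on Windows this would be ["shutdown", "/r", "/t", "0"].
            subprocess.run(list(reboot_cmd))
        time.sleep(interval_seconds)
```

In practice such a script would normally be run as a scheduled task or service with administrator privileges, since rebooting requires them.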

From a productivity standpoint, the o1 model's performance in its specialty areas is indeed outstanding, but the domestic AIs are no slouches either; Kimi was the only AI to answer every test question correctly, which is quite a surprise.

The test could end here, but I still wanted to see how the o1 model differs from ordinary models in everyday scenarios.

So I added an extra question: I searched the internet for a photo of a strawberry pie and asked each AI how to make the dessert in the photo.

Kimi

Wenxin Yiyan

o1

All three AIs easily identified the dessert and provided similar recipes, but the o1 model's response was more detailed, describing the steps and the precautions for each one, while the domestic AIs' recipes were much simpler. For someone with baking experience, the domestic recipes would suffice, but a beginner would have a much higher success rate with o1's recipe.

02 The next step for AI is to learn to truly "think"

Overall, the o1 model has a clear edge in the level of detail of its responses, and the experience is much better in scenarios that require checking the reasoning process or getting more thorough answers. In terms of answer accuracy, however, o1 holds no real advantage over current domestic AIs, and in this test it even fell behind Kimi.

Moreover, domestic AIs can also produce more detailed answers and reasoning through follow-up questions, so in most scenarios the o1 model has no obvious advantage. In my own daily use of ChatGPT, for example, ChatGPT-4o usually meets my needs, and I only turn to the o1 model in rare cases.

As a long-term user of ChatGPT, I believe the o1 model is more suitable for researchers and financial analysts who use a lot of mathematical tools and perform multiple reasoning steps in their daily work. In such cases, the o1 model's targeted training for multi-step reasoning processes can perform much better than ordinary AIs in solving these problems.

As for o1-Pro, based on other users' test results I have seen, the quality of its responses is not much different from the o1 model's. The main difference is that o1-Pro can draw on more computing power, repeatedly verify the correctness of an answer, and try to provide a more detailed reasoning process.

In fact, the development of large AI models has reached a stage where specialization is beginning to emerge. Until now, many AI companies hoped to build one big, all-encompassing multi-modal model, only to find the cost high and the results underwhelming, with problems like "hallucination" difficult to solve.

ChatGPT o1 offers another path: given sufficient computing power, the AI first "thinks" the problem through in depth, then solves it based on the results of that thinking. Put another way, o1 first tries to analyze the problem itself and then answers based on that analysis, whereas an ordinary AI breaks the question straight into keywords, retrieves the corresponding data, and assembles an output algorithmically. The latter responds quickly but struggles to guarantee accuracy, especially on complex problems.
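The contrast can be caricatured in a few lines of Python. This is purely a toy illustration of the two strategies described above, not any real model API; `model` is just a stand-in callable:

```python
def direct_answer(model, question):
    # "Ordinary" pattern: one pass straight from question to answer.
    return model(question)

def reason_then_answer(model, question):
    # o1-style pattern: spend extra compute producing an intermediate analysis,
    # then condition the final answer on that analysis.
    analysis = model("Analyze step by step: " + question)
    return model("Question: " + question + "\nAnalysis: " + analysis + "\nAnswer:")

# Stand-in "model" so the sketch runs without any API.
echo = lambda prompt: "[reply to: " + prompt.splitlines()[0] + "]"
print(direct_answer(echo, "2+2?"))  # prints [reply to: 2+2?]
```

The reasoning pattern costs roughly twice the model calls (and far more tokens in practice), which is exactly the efficiency trade-off discussed above.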

So we can see that Kimi and Wenxin Yiyan are also trying, in different ways, to get AI to "think" rather than stitching answers together from algorithms and data. Kimi's performance left a deep impression on me: as the only model to answer all the math questions correctly, and free to use, it excels in both value and experience.

Honestly, were it not for the convenience of consulting foreign-language materials and keeping up with the latest AI developments, the $20 ChatGPT subscription would not be great value. The free Kimi, and the more versatile Wenxin Yiyan with its multiple agents and official tools, are the more cost-effective choices.

This article is from the WeChat public account "Value Research Institute" (ID: jiazhiyanjiusuo), author: TSknight, authorized for release by 36Kr.
