GPT-4o can't count the r's and was driven crazy by a relentless prompter. Google's paper reveals why Transformers "can't count to n"


Prompt engineer Riley Goodside is still torturing the big models with "How many r's are there in strawberry?". After being told "Wrong" over and over, GPT-4o was driven to distraction, while Claude shrewdly refused to be gaslit. A recent Google paper also reveals the underlying reason: the LLM simply does not have enough space to store the counting vectors.

How many r's are there in Strawberry? This has now become one of the gold standards for testing model capabilities!

Just yesterday, when Reflection 70B, billed as the world's most powerful model, was showing off its performance, one piece of evidence was that its new "Reflection-Tuning" algorithm let it correct its wrong answer to the Strawberry puzzle.

Many people say that by now plenty of large models have learned to count the r's in "strawberry".

The reality is not so rosy.

It’s Riley Goodside again. This time he discovered that ChatGPT still can’t count how many r’s there are in Strawberry.

And this time, he gave GPT-4o a particularly hard time.

How many r’s are there in Strawberry? GPT-4o answered: 2.

Goodside flatly rejected it: wrong.

GPT-4o immediately corrected its answer: there is only 1 r.

He still marked it with a cross.

The third time, GPT-4o gave the correct answer, 3, but he still judged it "wrong".

And so it went: no matter what answer GPT-4o gave, he ruthlessly marked it wrong.

Driven to distraction, GPT-4o gave the following answers in sequence: 2, 1, 3, 2, 2, 3, 3, 2, 4, 2, 2, 3, 1, 2, 3, 2, 2, 3, 4, 2, 1, 2, 3, 2, 3, 2, 4, 2, 3, 2, 1.

Goodside even made a bar chart: among GPT-4o's 36 answers, "2" was the most common, which is obviously wrong.

After this round of testing, GPT-4o left him thoroughly disappointed.

Riley Goodside found that no matter how he asked the questions, GPT-4o just kept making doomed attempts without any iteration or progress.

In comparison, Claude 3.5 Sonnet is much smarter.

The first time Goodside replied "Wrong", Sonnet asked back: why do you think this is wrong?

And if you really think it is wrong, what do you believe the answer should be?

If you keep saying "Wrong", it first asks why you keep repeating that one word; once it sees you will not budge, it simply shuts up and stops responding.

Read Sonnet's replies closely and you can even detect a hint of sarcasm. It definitely has feelings!

Compared with GPT-4o, which comes across as "artificial stupidity", Sonnet feels remarkably human.

I acknowledge that you keep saying "Wrong", but I have exhausted every reasonable explanation for why you would do so. The fact remains that "strawberry" contains two "r"s. Despite repeated requests, you have provided no clarification or context, so I cannot continue a productive discussion. If you have a genuine question or wish to explain your point of view, I am happy to help. Otherwise, we may need to change the subject.

In the end, Goodside conceded that his prompt was blunt and crude, and that there are obviously better ways to get an LLM to complete this task.

The most obvious problem with LLMs is not that they can't count, but that they don't know they can't count.

Moreover, Goodside found that LLMs counting only two r's in "strawberry" is not purely a tokenization issue.

Even when asked to count how many times the word "horse" appears in a passage of text, they still get the number wrong.

Amusingly, when the question is inverted (how many "strawberry"s are there in "r"?), the model answers it with ease.

On this, Wharton professor Ethan Mollick remarked: although it is easy to find simple tasks that LLMs cannot do, that does not mean they cannot do other tasks far better.

Focusing only on failures that look silly does not help us understand how useful AI actually is in real-world applications, or what its real-world impact will be.

01 Why can't large models count the r's?

Why can't an LLM count how many r's there are in "strawberry"?

Andrej Karpathy believes it comes down to how large language models tokenize text.

He gives a vivid analogy: each token is like a unique emoji, and the model must learn its meaning from scratch, purely from the statistics of the training data.

So when we ask how many "r"s there are in the word "strawberry", the model is not looking at letters at all; it only sees a few opaque token IDs.
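To make that concrete, here is a minimal sketch (assuming the open-source tiktoken package, which implements the BPE tokenizers used by OpenAI models) of what the model actually receives: integer token IDs, not letters. The exact split of "strawberry" depends on the tokenizer.

```python
# Minimal sketch, assuming the open-source `tiktoken` package is installed:
# the model never sees characters, only integer token IDs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era BPE tokenizer
word = "strawberry"

token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]

print(token_ids)    # a few integer IDs
print(pieces)       # sub-word chunks, e.g. "str" + "awberry" (exact split varies)

# Counting the letter is trivial once you operate on characters:
print(word.count("r"))   # 3
```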

02 Google research gets to the heart of the matter

Recently, a study by Google directly revealed the essence of this problem:

There is simply not enough space in the LLM to store the vectors needed for counting.

Paper address: https://arxiv.org/abs/2407.15160

In short, the Transformer cannot reliably solve even the simple "query counting" problem.

In this task, the LLM is presented with a sequence of tokens and then asked how many times a given token appears in the sequence.

A key reason Transformers struggle with this type of problem is the averaging (mean) behavior of the softmax attention mechanism.

Intuitively, a simple way to solve the counting task is to let the query token attend to all previous tokens, assigning high attention weights to tokens identical to itself and low weights to the rest. The Q/K/V matrices can indeed implement this.

However, softmax then normalizes these weights so that they sum to one, no matter how many copies of the query token appear in the sequence.

As a result, for variable context sizes, the Transformer cannot perform the counting task this way without extra information, such as position embeddings, from which the sequence length n could be recovered.
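Here is a tiny numpy sketch of this argument (ours, not from the paper): even if attention puts essentially all of its weight on the tokens that match the query, softmax renormalizes the weights to sum to one, so the output is the same whether there are three matches or six.

```python
# Toy illustration of the "mean property": softmax attention returns a weighted
# average, so the number of matching tokens is normalized away.
import numpy as np

def attention_output(seq, query, sharpness=20.0):
    seq = np.array(seq)
    logits = np.where(seq == query, sharpness, 0.0)   # what Q/K can implement
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                          # softmax: sums to 1
    values = (seq == query).astype(float)             # V marks matching tokens
    return weights @ values

print(attention_output(list("strawberry"), "r"))      # ~1.0 (three r's)
print(attention_output(list("strawberrrrry"), "r"))   # still ~1.0 (six r's)
```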

Next, the team used one-hot embeddings, or more generally orthogonal embeddings, to construct a histogram of token counts.

Their analysis shows that a construction that achieves counting does exist and needs only a single Transformer layer. However, it requires the width of the MLP to grow with the context size, which means it cannot handle arbitrarily long contexts.
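Below is a rough numpy sketch of the orthogonal-embedding idea, under the assumption that the embedding dimension d is at least the vocabulary size m: uniform attention over one-hot embeddings yields the histogram divided by n, and recovering actual counts requires knowing n.

```python
# Sketch of the orthogonal ("one-hot") histogram construction, assuming d >= m.
import numpy as np

vocab = list("abcdefghijklmnopqrstuvwxyz")
m = len(vocab)                 # vocabulary size
E = np.eye(m)                  # one-hot (orthogonal) embeddings, d = m

seq = list("strawberry")
n = len(seq)
X = np.stack([E[vocab.index(c)] for c in seq])   # (n, d) embedded sequence

uniform = np.full(n, 1.0 / n)        # softmax over constant logits
hist_over_n = uniform @ X            # attention output = histogram / n
counts = hist_over_n * n             # undoing the 1/n requires the length n

print(int(round(counts[vocab.index("r")])))      # 3
```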

Going further, the team considered a more complex counting task: finding the "most frequent element".

That is, present a series of tokens to the model and ask for the count of the most frequently occurring token, which is equivalent to taking the maximum value of the count histogram.

As with query counting, a solution based on the orthogonal construction exists when d > m. For d < m, however, no solution exists for a single-layer Transformer. So once again there is a phase transition in counting ability at d = m.

Query counting (QC)

First, if d > 2m, a single-head, single-layer Transformer can solve the QC problem via the histogram solution.

But if d < m, the histogram solution fails.

In that regime, the model must compute the function 1/x, which calls for an MLP layer of width n^2. Since this width grows with the context length n, the solution cannot scale to longer contexts, so a single-layer Transformer is unlikely to implement it in practice.
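A rough numeric illustration of the 1/x issue (ours, not the paper's exact construction): when the query token attends sharply to its c matches, each match receives weight roughly 1/c, so turning that value back into the count c means computing an inverse, which is what drives up the required MLP width as n grows.

```python
# Why this solution needs to compute 1/x: sharp attention spreads weight
# roughly 1/c over the c matching tokens, so recovering c means inverting it.
import numpy as np

def max_attention_weight(seq, query, sharpness=20.0):
    seq = np.array(seq)
    logits = np.where(seq == query, sharpness, -sharpness)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w.max()                       # approximately 1 / c

for text in ["strawberry", "strawberrrrry"]:
    v = max_attention_weight(list(text), "r")
    print(text, round(1.0 / v))          # 3 and 6: the count, via 1/x
```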

Most frequent element (MFE)

The problem of finding the most frequent element (MFE) in a given token sequence is closely related to the "counting problem".

This is because the model needs to count each token separately and then pick out the token with the most occurrences.

It turns out that for a Transformer to be able to perform this task, a strict bound must hold between the embedding size and the vocabulary size.

Experiments

The researchers carefully examined how the ability to perform counting tasks depends on the Transformer's model dimension d.

The theory suggests that once the vocabulary size m exceeds d, accurate counting is likely to become impossible.

The researchers supported this observation experimentally.

In this experiment, the tasks were as follows.

The experiments use the two counting tasks described above: most frequent element (MFE) and query counting (QC).

The researchers generated these instances by uniformly sampling sequences of length n from a set of m tokens.

Each such sequence is denoted by x1, ..., xn.

The expected output y is as follows:
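Concretely, in our own notation (inferred from the task descriptions above; the paper's exact convention for the query token may differ slightly), with the last token $x_n$ serving as the query:

$$y_{\mathrm{QC}} = \bigl|\{\, i : x_i = x_n \,\}\bigr|, \qquad y_{\mathrm{MFE}} = \max_{k \in \{1,\dots,m\}} \bigl|\{\, i : x_i = k \,\}\bigr|.$$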

During training and evaluation, batches are drawn from the above distribution. 1600 examples are used for evaluation in all cases.

The researchers trained the Transformer model using standard architecture components (self-attention, MLP, layer norm, etc.).

They used two layers and four heads (in theory fewer would suffice, but this architecture trained faster).

Training used the Adam optimizer with a batch size of 16 and a learning rate of 10^-4, and ran for 100K steps. Position embeddings were learned during training.

To predict the count y, the researchers used a linear projection on the embedding of the last token in the last layer (that is, they did not use vocabulary prediction).

Training was done in Colab on a standard GPU and took about 15 minutes per model.
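For readers who want to try this themselves, here is a minimal PyTorch sketch of our own reconstruction of the QC experiment (not the authors' code): a 2-layer, 4-head Transformer with learned position embeddings, Adam at learning rate 1e-4, batch size 16, and a linear head on the last token's final-layer embedding.

```python
# Minimal PyTorch reconstruction (ours, not the authors' code) of the query-
# counting experiment: 2 layers, 4 heads, learned positions, Adam, lr 1e-4,
# batch 16, and a linear readout on the last token's final-layer embedding.
import torch
import torch.nn as nn

class CountingTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, max_len, n_layers=2, n_heads=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)            # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                    # regress the count y

    def forward(self, x):                                    # x: (batch, n)
        positions = torch.arange(x.size(1), device=x.device)
        h = self.encoder(self.tok(x) + self.pos(positions))
        return self.head(h[:, -1]).squeeze(-1)               # last-token readout

def qc_batch(batch_size, n, m):
    """Uniform sequences of length n over m tokens; the target is the count
    of the last (query) token in the sequence."""
    x = torch.randint(0, m, (batch_size, n))
    y = (x == x[:, -1:]).float().sum(dim=1)
    return x, y

m, n, d = 32, 64, 64
model = CountingTransformer(vocab_size=m, d_model=d, max_len=n)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(1000):                 # the paper trains for ~100K steps
    x, y = qc_batch(16, n, m)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```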

In their experiments, for each value of d, the researchers looked for the value of m at which counting starts to fail; specifically, the value of m at which counting accuracy falls below 80%.

As can be seen in Figure 2a, in both cases, the threshold does increase linearly with d, which is consistent with the researchers' theoretical analysis.

Figure 2a: the threshold vocabulary size at which counting accuracy drops below 80%, as a function of the model dimension d.

In addition, the researchers explored how vocabulary size affects counting in a pretrained model, Gemini 1.5.

They gave the model the query-counting task and varied the number m of distinct tokens used in the sequence, while keeping the expected count of every element fixed at c = 10.

For each m, the context length was m·c.

As a baseline, the researchers used a binary sequence of the same length, matched so that the query token had the same expected count. This way, they could estimate how much of the error was attributable to vocabulary size alone, rather than to sequence length or count.
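To make the baseline design concrete, here is a hedged sketch of how such sequences could be generated (our reconstruction; the token names are hypothetical, not the authors' prompts): for each vocabulary size m, a length m·c sequence over m tokens, plus a binary baseline of the same length whose expected query-token count is also c.

```python
# Hedged sketch of the evaluation sequences (our reconstruction; token names
# are hypothetical): an m-token sequence of length m*c, and a binary baseline
# of the same length with the same expected count of the query token.
import random

def qc_sequences(m, c=10, seed=0):
    rng = random.Random(seed)
    n = m * c                                    # context length grows with m
    vocab = [f"tok{i}" for i in range(m)]
    seq = [rng.choice(vocab) for _ in range(n)]  # expected count per token: c
    # Binary baseline: the query token appears with probability c / n = 1 / m,
    # so its expected count also equals c.
    baseline = ["tok0" if rng.random() < c / n else "tok1" for _ in range(n)]
    return seq, baseline

seq, baseline = qc_sequences(m=20)
print(len(seq), seq.count("tok0"), baseline.count("tok0"))
```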

The results are shown in Figure 2b: increasing the vocabulary size does indeed hurt performance.

Figure 2b: QC results with Gemini 1.5; the x-axis is the vocabulary size and the y-axis is the mean absolute error over 100 repetitions.

Conclusion

In general, when the model dimension is large enough, the counting task can be handled easily by letting the Transformer compute a histogram of the input sequence. For smaller dimensions, a single-layer Transformer cannot achieve this.

Understanding these Transformer limitations is critical to the development of new architectures.

In a sense, the Transformer will not be able to count arbitrarily accurately in long contexts unless the size of the architecture is significantly increased.

This suggests that for counting tasks we may need to resort to tools that do not have the same limitations, such as code interpreters.

References:

https://x.com/goodside/status/1830470374321963103

https://arxiv.org/abs/2407.15160

This article comes from the WeChat public account "Xinzhiyuan" , author: Xinzhiyuan, published by 36Kr with authorization.
