New research from Technion, Google Research, and Apple shows that large language models (LLMs) encode far more information about the truthfulness of their own outputs than previously assumed.
A major issue with LLMs is their tendency to generate outputs that are erroneous or nonsensical, a phenomenon often referred to as "hallucination." The term has no commonly agreed-upon definition and covers a wide range of LLM failures.
In this latest study, the researchers adopt a broad interpretation: they treat hallucinations as encompassing all errors produced by an LLM, including factual mistakes, biases, and other real-world errors.
Most previous research has analyzed the external behavior of LLMs and how users perceive their errors. The new study instead examines what happens inside the model, focusing on the "exact answer tokens": the response tokens that, if modified, would change whether the answer is correct.
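As a concrete illustration, one simple way to locate these tokens is to find where a reference answer string appears in the generated response and map that character span back to token positions. The helper below is a hypothetical sketch built on the Hugging Face tokenizer for Mistral 7B; it assumes the exact answer appears verbatim in the response, and the paper's own extraction procedure may be more involved.

```python
from transformers import AutoTokenizer


def exact_answer_token_positions(response: str, exact_answer: str, tokenizer) -> list[int]:
    """Indices of the response tokens that cover the exact answer span."""
    start = response.find(exact_answer)
    if start == -1:
        return []  # answer string not found verbatim
    end = start + len(exact_answer)
    offsets = tokenizer(
        response, return_offsets_mapping=True, add_special_tokens=False
    )["offset_mapping"]
    # Keep every token whose character span overlaps the answer span.
    return [i for i, (s, e) in enumerate(offsets) if s < end and e > start]


tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
print(exact_answer_token_positions("The capital of Connecticut is Hartford.", "Hartford", tokenizer))
```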
The researchers ran experiments on four variants of the Mistral 7B and Llama 2 models across 10 datasets and found that information about the truthfulness of an output is concentrated in these exact answer tokens. Focusing error-detection methods on those tokens significantly improves their performance.
"These patterns are consistent across nearly all datasets and models, suggesting a common mechanism by which LLMs encode and process accuracy during text generation," the researchers stated.
To predict hallucinations, the researchers trained "probing classifiers": lightweight models that predict the truthfulness of a generated output from the LLM's internal activations. Training these probes on the exact answer tokens significantly improved error detection.
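A minimal sketch of such a probe, assuming a small set of labeled generations is already available, could look like the following; the model ID, the probed layer, the token indices, and the choice of a logistic-regression probe are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative choice
LAYER = 16  # which hidden layer to probe; a tunable assumption

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)


@torch.no_grad()
def activation_at(text: str, token_index: int) -> np.ndarray:
    """Hidden state of the probed layer at one token position of `text`."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, token_index].float().cpu().numpy()


# Each example: full prompt-plus-response text, the index of an exact answer
# token (token indices here are illustrative), and a correctness label.
examples = [
    ("Q: Capital of France? A: The capital is Paris.", 12, 1),
    ("Q: Capital of Australia? A: The capital is Sydney.", 12, 0),
]  # a real probe would be trained on thousands of labeled generations

X = np.stack([activation_at(text, idx) for text, idx, _ in examples])
y = np.array([label for *_, label in examples])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict_proba(X)[:, 1])  # estimated probability each answer is correct
```

A linear probe is a common choice for this kind of analysis because it tests whether correctness can be read from a single hidden state with a simple decision boundary.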
They also investigated whether a probing classifier trained on one dataset could detect errors in other datasets, and found that the classifiers do not generalize across different tasks, though they do generalize between tasks that require similar skills.
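Once probe features have been collected per dataset, that kind of generalization test reduces to training on one task and scoring the probe on another, for example with ROC-AUC. The snippet below only demonstrates the mechanics, using random stand-in features and hypothetical task names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Random stand-ins for probe features and correctness labels that would be
# collected (as above) from two different tasks; names are hypothetical.
X_trivia, y_trivia = rng.normal(size=(500, 64)), rng.integers(0, 2, 500)
X_math, y_math = rng.normal(size=(500, 64)), rng.integers(0, 2, 500)

probe = LogisticRegression(max_iter=1000).fit(X_trivia, y_trivia)
print("in-task AUC:   ", roc_auc_score(y_trivia, probe.predict_proba(X_trivia)[:, 1]))
print("cross-task AUC:", roc_auc_score(y_math, probe.predict_proba(X_math)[:, 1]))
```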
Additional experiments showed that the probing classifiers can predict not only whether an error will occur but also the type of error the model is likely to make. The researchers also found cases where the model's internal activations encode the correct answer even though the model goes on to generate a wrong one.
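One way to make that gap concrete is to sample several candidate answers and let the trained probe pick the one it scores as most likely to be correct. The sketch below is a simplified illustration rather than the paper's exact procedure: it reuses the tokenizer, model, activation_at helper, and trained probe from the earlier snippet, uses arbitrary sampling settings, and scores each candidate at its last token instead of at the exact answer tokens.

```python
import numpy as np
import torch


@torch.no_grad()
def best_of_k(prompt: str, k: int = 10) -> str:
    """Sample k answers and return the one the probe rates most likely correct."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        num_return_sequences=k,
        max_new_tokens=32,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs
    ]
    scores = []
    for seq in outputs:
        full_text = tokenizer.decode(seq, skip_special_tokens=True)
        # Probe the final token as a simplification; the paper's analysis
        # centers on the exact answer tokens instead.
        feat = activation_at(full_text, token_index=-1)
        scores.append(probe.predict_proba(feat.reshape(1, -1))[0, 1])
    return candidates[int(np.argmax(scores))]


print(best_of_k("Q: What is the capital of Australia? A:"))
```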
Finally, the findings suggest that current evaluation methods may not accurately reflect the true capabilities of LLMs. Understanding and better leveraging the internal knowledge of these models could significantly reduce errors.
The research findings could help design better systems to mitigate hallucination. However, the techniques described require access to the LLM's internal representations, which is mainly feasible with open-source models.
Leading AI labs like OpenAI, Anthropic, and Google DeepMind have been working on various techniques to interpret the inner workings of language models. These studies may help build more reliable systems.