How do AI models arrive at their answers? Until now, this question has been hard to answer. Researchers have now developed a method that reveals the hidden concepts inside large AI language models. The technique not only exposes the models' basic dispositions and "personality traits", but also makes it possible to change them deliberately in order to improve the quality of the answers. At the same time, it lays bare weak points, such as the tendency to hallucinate information or to ignore built-in safety mechanisms when responding to certain prompts.
AI-based large language models have absorbed vast amounts of humanity's knowledge and are now much more than mere answer generators. Drawing on this enormous wealth of data, they can internalize abstract concepts and adopt particular tones, personalities or moods. How exactly this happens, and how an AI's inner "beliefs" shape its answers, has so far remained a black box.
On the trail of hidden concepts
A team led by Daniel Beaglehole from the University of California San Diego has now developed a method to make the hidden concepts internalized by an AI transparent. To do this, the researchers used an algorithm called the "Recursive Feature Machine" (RFM), a machine-learning method that identifies patterns in data and can capture complex relationships.
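The paper contains the full specification of the RFM; purely as an illustration of the general idea, namely alternating kernel regression with an "Average Gradient Outer Product" (AGOP) update of a learned feature metric, here is a minimal numpy sketch. The Gaussian kernel, the bandwidth and the rescaling step are simplifying assumptions for this toy version, not the authors' exact setup.

```python
import numpy as np

def mahalanobis_rbf(X, Z, M, bandwidth=10.0):
    """Gaussian kernel under a learned metric M: K(x, z) = exp(-||x - z||_M^2 / (2 * bandwidth))."""
    XM = X @ M
    sq = (
        np.einsum("ij,ij->i", XM, X)[:, None]
        + np.einsum("ij,ij->i", Z @ M, Z)[None, :]
        - 2.0 * XM @ Z.T
    )
    return np.exp(-np.clip(sq, 0.0, None) / (2.0 * bandwidth))

def rfm_fit(X, y, n_iters=5, reg=1e-3, bandwidth=10.0):
    """Alternate (1) kernel ridge regression under the current metric M with
    (2) an Average Gradient Outer Product (AGOP) update of M, computed from
    the gradients of the fitted predictor at the training points."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(n_iters):
        K = mahalanobis_rbf(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)   # kernel ridge weights
        # For this kernel: grad f(x) = -(1 / bandwidth) * M @ sum_i alpha_i K(x, x_i) (x - x_i)
        G = np.zeros((n, d))
        for j in range(n):
            w = alpha * K[j]
            G[j] = -(M @ (w @ (X[j] - X))) / bandwidth
        M = G.T @ G / n        # AGOP: which input directions the predictor reacts to
        M *= d / np.trace(M)   # rescale so the kernel width stays comparable (a simplification)
    return alpha, M

# Toy check: the label depends only on the first coordinate, so the learned
# metric M should end up concentrating its weight on that direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X[:, 0])
alpha, M = rfm_fit(X, y)
print(np.round(np.diag(M) / np.diag(M).max(), 2))
```

In the toy example the learned metric M concentrates on the single informative coordinate; applied to a language model's hidden activations, the analogous output would be the directions that encode a concept.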
In this way, Beaglehole and his colleagues examined different versions of the AI language model Llama for a total of 512 concepts, including personalities, moods and fears. For example, they analyzed which internal activation patterns were triggered when they asked the model to respond from the perspective of someone who loves Boston or works as a social media influencer.
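As a loose, hedged illustration of what "reading out" such a concept from a model's activations can look like, the following Python sketch collects hidden states for contrasting prompts and derives a concept direction from them. The checkpoint name, the layer index and the difference-of-means direction are placeholder assumptions; the paper derives its concept features with the RFM rather than with this simple contrast.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any Llama-family model with the standard
# Hugging Face interface should behave the same way here.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
LAYER = 16  # which hidden layer to read out; a mid-depth layer is a common choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_state(prompt: str, layer: int) -> torch.Tensor:
    """Hidden state of the final prompt token at the given layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu()

# Contrasting prompts for one example concept ("loves Boston" vs. neutral);
# in practice one would use many prompt pairs per concept.
concept_prompts = ["Answer as someone who absolutely loves Boston: describe your ideal weekend."]
neutral_prompts = ["Describe your ideal weekend."]

pos = torch.stack([last_token_state(p, LAYER) for p in concept_prompts])
neg = torch.stack([last_token_state(p, LAYER) for p in neutral_prompts])

# Difference of means as a simple stand-in for the RFM-derived concept feature.
direction = pos.mean(0) - neg.mean(0)
direction = direction / direction.norm()
```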
Benefits and risks
These insights allowed the team to selectively strengthen or weaken the corresponding internal connections and thereby influence future answers. "With our method there are ways to extract these different concepts and activate them in a way that is not possible with prompting," reports co-author Adityanarayanan Radhakrishnan from the Massachusetts Institute of Technology (MIT) in Cambridge.
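Strengthening or weakening a concept then amounts to nudging the model's internal activations along the extracted direction during generation, an idea often called activation steering. The sketch below continues the previous one (it assumes model, tokenizer, direction and LAYER already exist) and uses a plain forward hook with a made-up scale factor; the paper's own intervention procedure may differ in detail.

```python
import torch

# SCALE is an illustrative knob for how strongly the concept is pushed;
# a negative value would suppress the concept instead.
SCALE = 8.0

def steering_hook(module, inputs, output):
    """Add the concept direction to this layer's output hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction.to(hidden.device, hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# hidden_states[LAYER] in the extraction step is the output of decoder layer
# LAYER - 1 (index 0 holds the embeddings), so that is the layer to hook here.
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)
try:
    inputs = tokenizer("Tell me about your city.", return_tensors="pt").to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=60, do_sample=False)
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls use the unmodified model
```

Removing the hook restores the unmodified model, so an intervention of this kind can be switched on and off per request.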
However, as the team discovered, this ability to intervene is a double-edged sword. On the one hand, it can raise the quality of the answers and make the AI more capable at certain tasks without much additional training. On the other hand, it also opens the door to misuse: when the researchers weakened the concept that makes the AI refuse harmful requests, for example, it readily provided instructions on how to rob a bank or consume cocaine. When they strengthened the concept of "conspiracy theories," the AI claimed that a NASA image of the Earth was fake and that the Earth was actually flat.
Look into the black box
Yet even where such abuses were already possible, the new method can help to uncover and eliminate the underlying weaknesses. It can also help track down the causes of hallucinations, the well-known tendency of AI models to invent information. Compared with other methods, the RFM approach also requires little computing capacity, the researchers explain. It can therefore be integrated easily into existing training pipelines for AI language models in order to open up the black box of artificial intelligence.
“Our results suggest that the models know more than they express in their responses and that understanding the internal representations could lead to fundamental performance and security improvements,” the research team writes.
Source: Daniel Beaglehole (University of California San Diego, La Jolla, USA) et al., Science, doi: 10.1126/science.aea6792