Highlights:

  • According to OpenAI’s researchers, GPT-2’s neuron-based architecture makes it possible to break the model down into parts that can be examined individually.
  • The researchers say the work may one day improve LLM performance by minimizing drawbacks such as bias and toxicity.

ChatGPT’s creator, OpenAI LP, is developing a tool that it says will eventually enable it to understand which components of a large language model are responsible for its behavior.

Although the tool is incomplete, the company has open-sourced the code and made it accessible on GitHub so that others can examine and improve it.

OpenAI explained in a blog post that LLMs are often compared to a “black box”: it is difficult to understand why a generative artificial intelligence model responds the way it does to particular inputs. Its “interpretability research” aims to better understand the factors that influence LLM behavior.

Researchers at OpenAI stated, “Language models have become more capable and more broadly deployed, but our understanding of how they work internally is still very limited. For example, it might be difficult to detect from their outputs whether they use biased heuristics or engage in deception.”

Ironically, OpenAI’s new tool depends on an LLM to attempt to determine how certain parts of other, less complex LLMs function. In their research, OpenAI tried to understand one of its predecessors, GPT-2, using GPT-4, its most recent and advanced LLM.

It’s essential to first understand how LLMs function. They loosely resemble the human brain in that they are composed of numerous “neurons,” each of which picks up on a particular pattern in text and influences how the model responds to a given input. For example, if a model is asked which superheroes have the best superpowers, a neuron attuned to Marvel superheroes may boost the likelihood that the LLM names characters from the Marvel comic and film universe.
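As a rough, hands-on illustration of what a “neuron” means here, the sketch below records the activations of a single MLP neuron in the public GPT-2 model using the Hugging Face transformers library. The layer and neuron indices are arbitrary placeholders, and this is not OpenAI’s own tooling; it only shows where such neurons live and how their activations can be read off.

```python
# Minimal sketch, assuming the Hugging Face implementation of GPT-2.
# Each transformer block has an MLP; its post-activation outputs are the
# "neurons" discussed in the article. LAYER and NEURON are arbitrary,
# chosen purely for illustration.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

LAYER, NEURON = 5, 123  # hypothetical indices, for illustration only
recorded = {}

def hook(module, inputs, output):
    # output has shape (batch, sequence_length, mlp_width);
    # keep the activation of one neuron at every token position.
    recorded["acts"] = output[0, :, NEURON].detach()

# The post-GELU activation module of one block's MLP.
handle = model.h[LAYER].mlp.act.register_forward_hook(hook)

text = "Spider-Man and Iron Man are Marvel superheroes."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, act in zip(tokens, recorded["acts"].tolist()):
    print(f"{tok!r}: {act:.3f}")
```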

According to the researchers at OpenAI, this neuron-based architecture makes it possible to break GPT-2 down into parts that can be examined individually. The tool runs text sequences through the model and looks for cases where a particular neuron activates strongly and consistently. GPT-4 is then shown the text snippets that most excite that neuron and asked to generate a natural-language explanation of the pattern it responds to.
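A minimal sketch of that first stage might look like the following: scan a small corpus for the snippets that most strongly activate one neuron, then assemble a prompt asking a stronger model to explain the pattern. The corpus, indices, and prompt wording are illustrative assumptions rather than OpenAI’s actual pipeline or prompts.

```python
# Sketch of the "find top-activating text, then ask for an explanation"
# step. Uses the public Hugging Face GPT-2; not OpenAI's released tool.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

LAYER, NEURON = 5, 123  # hypothetical neuron, for illustration only

def token_activations(text):
    """Return (tokens, activations of one MLP neuron) for a text."""
    acts = {}
    def hook(module, inputs, output):
        acts["a"] = output[0, :, NEURON].detach()
    handle = model.h[LAYER].mlp.act.register_forward_hook(hook)
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        model(**enc)
    handle.remove()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return tokens, acts["a"]

# Tiny illustrative corpus; a real run would scan far more text.
corpus = [
    "Spider-Man swung between the Manhattan rooftops.",
    "The recipe calls for two cups of flour and an egg.",
    "Thor and Loki argued about the throne of Asgard.",
]

# Rank snippets by their peak activation on this neuron.
scored = sorted(corpus, key=lambda t: token_activations(t)[1].max().item(),
                reverse=True)

# Assemble an explanation request (hypothetical wording, not OpenAI's prompt).
explanation_prompt = (
    "Here are text snippets that strongly excite one neuron.\n"
    + "\n".join(f"- {s}" for s in scored[:2])
    + "\nIn a short phrase, what pattern does this neuron respond to?"
)
print(explanation_prompt)  # this prompt would then be sent to GPT-4
```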

Specifically, the tool asks GPT-4 to predict how the neuron would behave on new text, using only the explanation. The accuracy of those predictions is then evaluated by comparing them with the neuron’s actual activations. According to OpenAI, this methodology lets it both explain each neuron’s behavior within the GPT-2 system and score each explanation by how well it matches what the neuron actually does when prompted.
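The scoring idea can be sketched in a few lines: compare the neuron’s real activations with the activations GPT-4 simulates from the explanation alone. A simple correlation is used below for illustration; the exact scoring procedure is defined in OpenAI’s released code and may differ.

```python
# Sketch of the scoring step: agreement between real activations and the
# activations a model simulates from the explanation. Toy numbers only.
import numpy as np

def explanation_score(actual, simulated):
    """Correlation between real and explanation-simulated activations."""
    actual = np.asarray(actual, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    if actual.std() == 0 or simulated.std() == 0:
        return 0.0
    return float(np.corrcoef(actual, simulated)[0, 1])

# Hypothetical per-token activations for one snippet.
actual_acts    = [0.1, 0.0, 2.3, 0.2, 1.9, 0.0]
simulated_acts = [0.0, 0.0, 2.0, 0.0, 2.0, 0.0]  # what GPT-4 predicted

print(f"explanation score: {explanation_score(actual_acts, simulated_acts):.2f}")
```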

The total number of neurons in GPT-2 is 307,200, and according to OpenAI’s researchers, they were able to generate an explanation for every single one of them. A database containing these explanations was then created and released as open source alongside the tool itself.
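For readers who download the released database, browsing it might look roughly like the sketch below. The file name and field names (“layer,” “neuron,” “explanation,” “score”) are hypothetical placeholders; the real schema is documented in OpenAI’s repository on GitHub.

```python
# Hypothetical sketch of browsing a local copy of the explanation database.
# File name and record fields are assumptions, not the actual schema.
import json

with open("gpt2_neuron_explanations.json") as f:  # hypothetical local file
    records = json.load(f)

# List the highest-scoring explanations first.
for rec in sorted(records, key=lambda r: r["score"], reverse=True)[:5]:
    print(f"layer {rec['layer']}, neuron {rec['neuron']}: "
          f"{rec['explanation']} (score {rec['score']:.2f})")
```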

According to OpenAI’s researchers, the research may one day contribute to improving LLM performance by minimizing drawbacks like “toxicity” or “bias.” The team behind it acknowledged it would take some time before the tool is actually useful for this purpose.

The results show, however, that the tool was able to explain with high confidence only roughly 1,000 of GPT-2’s neurons, about 0.3% of the total. Much work remains to better understand and predict the behavior of the remaining 306,000 neurons.

OpenAI stated that there is much room for advancement in its studies. For instance, even though the work concentrated on brief explanations in natural language, it acknowledged that some neurons may exhibit considerably more sophisticated behavior that is difficult to sum up in such a brief manner. The researchers mentioned, “For example, neurons could be highly polysemantic (representing many distinct concepts) or could represent single concepts that humans don’t understand or have words for.”

One of OpenAI’s stated objectives is to move beyond individual neurons to identify and understand entire neural circuits that carry out more sophisticated behaviors; these circuits include neurons and the “attention heads” that interact with them. The researchers would also like to explain the mechanisms by which each neuron produces a specific behavior.

The researchers wrote, “We explained the behavior of neurons without attempting to explain the mechanisms that produce that behavior. This means that even high-scoring explanations could do very poorly on out-of-distribution texts since they are simply describing a correlation.”

OpenAI expressed excitement about its progress in employing LLMs to generate, test, and iterate on hypotheses, much as a human interpretability researcher would. However, there is still much work to be done.