A recent paper by a group of researchers at OpenAI outlines a novel approach to addressing one of the limitations of current deep neural networks (DNNs), namely their lack of interpretability. Using GPT-4, the researchers aim to build a technique to explain what causes a neuron to activate, as a first step towards automating DNN interpretability.
OpenAI's approach to DNN interpretability consists of three steps: generating an explanation of the neuron's behavior, simulating the neuron's activation based on the explanation, and calculating a score for the explanation.
In the first step, a prompt showing a text excerpt together with the neuron's activations is sent to the explainer model, which generates a natural-language explanation of the neuron's behavior. For example, one explanation could look like: "Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community".
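The following sketch illustrates what this explanation step could look like in practice, assuming access to the OpenAI chat completions API. The prompt template and the (token, activation) pairs shown are simplified illustrations, not the exact prompt used in the paper.

```python
# Minimal sketch of the explanation step. The prompt format here is a
# simplified illustration, not the paper's actual template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical (token, activation) pairs recorded for the neuron under study,
# with activations already rescaled to a 0-10 integer range.
neuron_activations = [
    ("the", 0), ("town", 3), ("gathered", 7), ("as", 0),
    ("a", 0), ("community", 10), ("to", 0), ("help", 6),
]

prompt = (
    "We are studying a neuron in a language model. Below are tokens from a "
    "text excerpt with the neuron's activation (0-10) after each token.\n\n"
    + "\n".join(f"{tok}\t{act}" for tok, act in neuron_activations)
    + "\n\nExplanation of neuron behavior: the main thing this neuron does is"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=60,
)
explanation = response.choices[0].message.content.strip()
print(explanation)  # e.g. "find phrases related to community"
```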
Once an explanation is available, the next step is to use it to simulate the neuron's behavior, that is, to determine how the neuron activates for each token in a given sequence under the hypothesis that the explanation is correct. This produces, for each token, an integer between 0 and 10 representing the predicted activation strength.
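A rough sketch of this simulation step is shown below. It assumes a simulator model that is given only the explanation and a list of tokens and is asked to output one integer per token; the prompt wording and the parsing of the reply are simplified illustrations rather than the paper's exact setup.

```python
# Minimal sketch of the simulation step: given only the explanation, ask a
# simulator model to predict an activation (0-10) for each token. The prompt
# and parsing below are simplified illustrations, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

explanation = "find phrases related to community"
tokens = ["the", "town", "gathered", "as", "a", "community", "to", "help"]

prompt = (
    "A neuron in a language model is described as follows: "
    f"'the main thing this neuron does is {explanation}'.\n"
    "For each token below, predict the neuron's activation as an integer "
    "from 0 to 10. Answer with one 'token<TAB>activation' pair per line.\n\n"
    + "\n".join(tokens)
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200,
)

# Parse the simulator's reply into per-token predicted activations.
simulated = []
for line in response.choices[0].message.content.strip().splitlines():
    token, value = line.rsplit("\t", 1)
    simulated.append((token, int(value)))
print(simulated)  # e.g. [("the", 0), ("town", 2), ..., ("community", 10), ...]
```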
In the third step, the aim is to score the explanation by comparing the simulated and the actual neuron behavior. This can be done by comparing the list produced in the simulation step with the output of the real neuron for the same sequence of tokens. This step is the most complex of the three and admits a number of different algorithms producing distinct results.
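One natural choice, sketched below, is to score an explanation by the correlation between the simulated and the real activations over the same token sequence. This is only one of the possible scoring algorithms; the values used here are hypothetical.

```python
# Minimal sketch of one possible scoring approach: the correlation between
# simulated and real activations over the same token sequence. Other scoring
# algorithms would replace this function.
import numpy as np

def correlation_score(simulated, real):
    """Return the Pearson correlation between simulated and real activations."""
    simulated = np.asarray(simulated, dtype=float)
    real = np.asarray(real, dtype=float)
    if simulated.std() == 0 or real.std() == 0:
        return 0.0  # degenerate case: a constant series carries no signal
    return float(np.corrcoef(simulated, real)[0, 1])

# Hypothetical values for the same eight tokens as in the sketches above.
simulated = [0, 2, 6, 0, 0, 10, 0, 5]
real      = [0, 3, 7, 0, 0, 10, 0, 6]

print(f"explanation score: {correlation_score(simulated, real):.3f}")
```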
Using this strategy, OpenAI researchers have been able to find likely explanations for non-trivial neurons, such as a neuron for phrases related to certainty and confidence, another for things done correctly, and many more. The results are still preliminary, though, as a number of fundamental questions remain, including whether a neuron's behavior admits an explanation at all, say the researchers.
DNN interpretability is still very much a research topic, pursuing the goal of explaining DNN behavior in terms that are understandable to a human and related to the application domain.
Interpretability is key to allowing a human supervisor to understand whether a DNN is behaving as expected and can therefore be trusted. This property can be crucial where DNN failure may have catastrophic consequences. Additionally, it can help engineers identify the root causes of DNN misbehavior.
Interpretability also has ethical and legal implications. For example, European law establishes that people have the right not to be subject to decisions based solely on automated processing and to obtain human intervention, which would be impossible if the human controller had no means of interpreting the algorithmic decision.
If you are interested in the details of OpenAI's approach to interpreting DNNs, do not miss their original article, which includes prompt examples and a full discussion of scoring validation techniques, results, limitations, and alternative evaluation algorithms.