NeuroAI for AI Safety

Patrick Mineault

Niccolò Zanichelli*

Joanne Zichen Peng*

Anton Arkhipov

Eli Bingham

Julian Jara-Ettinger

Emily Mackevicius

Adam Marblestone

Marcelo Mattar

Andrew Payne

Sophia Sanborn

Karen Schroeder

Zenna Tavares

Andreas Tolias


Abstract

As AI systems become increasingly powerful, the need for safe AI has become more pressing. Humans are an attractive model for AI safety: as the only known agents capable of general intelligence, they perform robustly even under conditions that deviate significantly from prior experiences, explore the world safely, understand pragmatics, and can cooperate to meet their intrinsic goals. Intelligence, when coupled with cooperation and safety mechanisms, can drive sustained progress and well-being. These properties are a function of the architecture of the brain and the learning algorithms it implements. Neuroscience may thus hold important keys to technical AI safety that are currently underexplored and underutilized. In this roadmap, we highlight and critically evaluate several paths toward AI safety inspired by neuroscience: emulating the brain’s representations, information processing, and architecture; building robust sensory and motor systems from imitating brain data and bodies; fine-tuning AI systems on brain data; advancing interpretability using neuroscience methods; and scaling up cognitively-inspired architectures. We make several concrete recommendations for how neuroscience can positively impact AI safety.


Use neuroscience methods for interpretability

Core idea

Over the past several decades, neuroscientists have built a suite of tools to understand single neuron and population computation [578]. Neuroscientists trying to understand the brain face many of the same problems as AI researchers when opening the black box that is an artificial neural network: both deal with a complex system, designed for performance rather than transparency, containing millions or billions of distinct subunits, that iteratively reformats its inputs in baffling ways to create emergent adaptive behavior.

Many of the methods of neuroscience are highly relevant to interpretability research, which seeks to understand and control artificial neural networks at the level of weights, neurons, subnetworks, and representations [579]. Mechanistic interpretability (MechInt) seeks to build human-understandable, bottom-up, circuit-based explanations of deep neural networks, often by examining weights and activations of neural networks to identify circuits that implement particular behaviors [580, 581, 582]. Representation engineering [583] seeks to understand neural networks from the top-down at the representation level, and control these representations to make networks safer, less toxic, and more truthful. Both of these approaches have been inspired in part by neuroscience. Transparency and control are important components of safe AI systems, and there are still many more ideas in neuroscience that await fruitful deployment in AI safety.

Why does it matter for AI safety and why is neuroscience relevant?

Take the point of view of mechanistic interpretability [580]:

Understanding AI systems’ inner workings is critical for ensuring value alignment and safety. [...] Mechanistic interpretability [seeks to] reverse-engineer the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding.

Researchers in the field have enumerated several theories of impact by which interpretability could improve AI safety [584]; these include:

  1. Auditing: models can be checked for safety characteristics using the methods of mechanistic interpretability. Traditional behavior checks and red-teaming can only enumerate problems that are thought of by human checkers. By contrast, leveraging knowledge about a model’s internals has the potential to uncover undesirable outcomes that would otherwise be hard to elicit, e.g. textual adversarial attacks in vision-language models [585]. It also extends to instances where we need to track down the source of undesirable behavior, e.g. telling apart lack of knowledge, parroting of incorrect information, or deception [583, 586].

  2. Correction: by understanding mechanistically why models fail to be aligned, we can correct them, e.g. by steering the system away from unintended behavior [587], ablating components of the system that are causing the issue [588], or patching/engineering the system so it doesn’t display the undesirable behavior [589].
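
To make the correction route concrete, the following is a minimal, illustrative sketch (not from the roadmap itself) of how one might ablate a hidden unit or steer a hidden representation in a toy PyTorch model using forward hooks; the model, layer choice, and steering direction are arbitrary placeholders.

```python
# Minimal sketch: ablating a unit and steering a hidden representation with
# PyTorch forward hooks. The model, layer index, and steering vector are
# illustrative placeholders, not a specific system from the text.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 4),
)

def ablate_unit(unit_idx):
    """Zero out one hidden unit's activation (a crude 'correction')."""
    def hook(module, inputs, output):
        output = output.clone()
        output[:, unit_idx] = 0.0
        return output
    return hook

def steer(direction, alpha=2.0):
    """Nudge the hidden state along a chosen direction."""
    def hook(module, inputs, output):
        return output + alpha * direction
    return hook

x = torch.randn(8, 16)
direction = torch.randn(32)

h1 = model[2].register_forward_hook(ablate_unit(unit_idx=5))
h2 = model[2].register_forward_hook(steer(direction))
patched_out = model(x)
h1.remove(); h2.remove()
baseline_out = model(x)
print("mean change in output:", float((patched_out - baseline_out).abs().mean()))
```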

Reverse-engineering the black box that is the brain at a causal level is precisely what much of neuroscience tries to accomplish [53, 590, 591, 592]. Insofar as neuroscientists have had a multi-decade head start in trying to break down complex black-box neural networks, their methods are relevant for reverse-engineering computational mechanisms in silico, and can thus positively affect AI safety.

Box: Similarities and differences between ANN and BNN interpretability

AI researchers have noted similarities and differences between the goals of neuroscience and artificial interpretability research [42, 590, 593, 594]. Not every tool in the neuroscience toolbox is relevant to artificial interpretability research, and vice-versa. Some of the main ways in which artificial and biological neural networks differ from an interpretability perspective include:

  • Access to every weight, activation, full model architecture, and response to arbitrary stimuli: AI researchers have access to far more information than is typically available in neuroscience experiments. It has only recently become feasible to access the complete connectome of an animal larger than C. elegans [595, 596]. Access to nearly complete brain activity of an organism at the single-neuron level is currently limited to C. elegans [597] and zebrafish [180], while whole-brain recording at a coarser level than single neurons is feasible in Drosophila [598]. Coverage in mammals is improving [137, 138], with a recent report of up to a million neurons simultaneously imaged in mice [139].

  • Expressive editing tools: AI researchers have access to a number of advanced causal interventions that neuroscientists could only dream of: replacing the latent activity of one token for another via activation patching [589], steering neural activity toward desirable outcomes [587], or deleting or swapping layers of trained models [599]. By contrast, patterned optogenetics [600, 601] and multi-electrode stimulation have seen limited adoption, and we only have very limited means of reshaping the activity of a neural circuit.

  • Noise and spikes: cortical neurons are noisy [602]. Inferring latent rates from single trials generally requires powerful denoising models [162, 163]. This creates a different set of challenges than in AI, where full, noise-free activity, including activity prior to nonlinearities, is available. Many interpretability methods in neuroscience involve careful consideration of sample efficiency to overcome noise [603], a concern that largely does not arise in AI interpretability.

  • Recurrence vs. feed-forward computing: Many neural systems are recurrent, which facilitates their analysis as dynamical systems [604]. By contrast, in AI, while RNNs can be fruitfully analyzed through a dynamical systems lens [605], they have largely been displaced by transformers [606], which don’t have a natural state-space interpretation [607]. Resurging interest in state-space models [608, 609] may make these methods relevant again in the near future.

  • Capacity and modularity: the human brain contains about 86 billion neurons and on the order of 100 trillion synaptic connections. The number of distinct computational elements and weights in the largest-scale transformers [158] is at least two orders of magnitude lower than that of brains, despite these models being exposed to far more information than a human could absorb in a lifetime (e.g. all of Wikipedia, Google Books, arXiv, and a significant proportion of the internet). Empirical and theoretical work on the superposition hypothesis [610] and earlier work on sparse coding [611] point to a more tangled, dense representation in artificial neural networks than is found in the brain, which might hinder interpretation. Modularity and hierarchy in brains [612] and topographical within-area organization might facilitate interpretation, as similarly tuned neurons are spatially clustered [93, 613, 614, 615, 616]. Note, however, that modularity may occur in otherwise generic ANNs composed of equivalent units via symmetry breaking [4, 473, 617, 618, 619].

  • Evolution and development: As Dobzhansky famously noted [620], nothing in biology makes sense except in light of evolution. Natural neural networks emerge through developmental processes guided by genomic bottlenecks [75], which constrain their architecture and function. The resulting principles are highly conserved through evolutionary lineages, making phylogenetic and cross-species analyses particularly enlightening for understanding neural organization [39, 575]. While the evolution of deep learning architectures has been compared to natural selection [621], and some work has explored self-organizing and evolving architectures [622, 623], fundamental differences persist. Unlike biological systems, conventional artificial neural networks lack true inheritance mechanisms and developmental constraints. Consequently, traditional tools from evolutionary neuroscience, such as comparative analysis and developmental trajectories, would need to be substantially reformulated to apply to artificial architectures.

Details

Neuroscientists have long been interested in understanding how neural circuits compute. Yet neurons are noisy, and it is difficult to know how they operate in situ due to a combination of sparse sampling, noise, limited recording time, and complexity. Over the years, scientists have built a set of tools to analyze neural computation both at the single-neuron (aka Sherringtonian) level and at the population (aka Hopfieldian) level [578].

Single-neuron (Sherringtonian) view

Many methods have been developed to characterize the responses of sensory neurons, capturing their selectivity, preferred and non-preferred stimuli, and transfer functions. These include:

  • Characterizing the receptive fields of neurons [624]

    • Characterizing areas driving excitation and inhibition through a qualitative, manual analysis [106, 625]

    • Characterizing neurons’ transfer functions using noise stimuli in terms of a Wiener-Volterra expansion. First-order methods (i.e. including only a linear term in the expansion) include spike-triggered averaging [626, 627, 628]; second-order methods include spike-triggered covariance [629] and related second-order forms [630] (a toy spike-triggered averaging sketch appears after this list).

    • Characterizing a neuron’s transfer function to naturalistic stimuli using parametric methods, such as Linear-Nonlinear-Poisson models [86, 631], nonlinear input model (NIM) methods composed of stacks of additive layers [101, 632], or deep learning models [83, 87, 88, 93, 633, 634, 635]

    • Attributing visual decisions to particular parts of a stimulus through methods involving random masking, such as Bubbles [636]

  • Characterizing the tuning curves of neurons using parameterized stimuli [102, 637, 638]

  • Using active stimulus design to characterize a system

    • Finding preferred, anti-preferred, or diverse stimuli using inception loops [67, 68, 112]

    • Characterizing iso-response curves [639]

    • Finding stimuli that maximize the delta between predictions of different models to facilitate model comparisons [522, 640]

  • Relating an organism’s behavior in decision-making tasks to the responses of single neurons, for example through neurometric curves [641]

These methods aim to characterize the input-output functions of single neurons through a rich set of tools. Although they can exhaustively characterize the response properties of neurons, they tend not to directly address the relationship of single neurons to perception and behavior [642]. Causal manipulation techniques, including micro-stimulation, optogenetics, and thermogenetics, can be used to activate or deactivate a neuron or a set of neurons in situ, thus establishing a causal link between single neurons and behavior [643, 644].
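
As a toy illustration of the first-order systems-identification methods listed above, the following sketch estimates a receptive field by spike-triggered averaging on simulated white-noise stimuli; the data, filter, and nonlinearity are synthetic and chosen only for illustration.

```python
# Toy sketch: estimating a receptive field by spike-triggered averaging (STA)
# on simulated white-noise stimuli. Filter, nonlinearity, and data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 50_000, 64
true_rf = np.exp(-0.5 * ((np.arange(n_pixels) - 32) / 4.0) ** 2)  # Gaussian "receptive field"

stimulus = rng.standard_normal((n_frames, n_pixels))          # white-noise frames
drive = stimulus @ true_rf                                    # linear filtering
rate = np.exp(0.5 * drive - 2.0)                              # pointwise nonlinearity
spikes = rng.poisson(np.clip(rate, 0, 10))                    # Poisson spike counts

# The STA is the spike-weighted average stimulus; for white noise it is
# proportional to the linear filter of an LNP-style neuron.
sta = (spikes @ stimulus) / spikes.sum()
correlation = np.corrcoef(sta, true_rf)[0, 1]
print(f"correlation between STA and true filter: {correlation:.2f}")
```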

Population-level (Hopfieldian) view

In contrast to the single-neuron view, the Hopfieldian view [578] seeks to understand neural representation from the perspective of how multiple neurons coordinate to create and transform a representation. The population-level view has been popularized by the rise of large-scale recordings of single neurons: Utah arrays, multi-electrode arrays, flexible probes, V-probes, Neuropixels, and multi-photon calcium imaging [138]. This shift in perspective first took hold in the study of the motor cortex [604, 645]. The view that populations of neurons should be the primary object of interest in neuroscience has been referred to as the neural population doctrine [646, 647] or neural manifolds [181, 648, 649]. In this view, the brain’s state evolves over time in a phase space of lower dimensionality than the ambient dimension [182], tracing out manifolds; neurons are treated as (linear) projections of the phase space.

In parallel, a large body of work on noninvasive human measurements, in particular functional magnetic resonance imaging (fMRI), has focused on the geometry of representations [418, 650]. This is partly due to methodological constraints: fMRI and other noninvasive modalities average over the responses of hundreds of thousands of neurons, making single-neuron analysis impractical. However, by the Johnson-Lindenstrauss lemma [651], some aspects of representational geometry, such as the distances between stimuli in voxel space, are approximately preserved under random projections. This allows fruitful investigation of the geometry of representations despite the indirectness of the measurement.
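
The following minimal sketch illustrates the Johnson-Lindenstrauss intuition with synthetic data: pairwise distances between population response patterns are approximately preserved after a random projection to many fewer measurement channels. The dimensions and noise-free mixing are illustrative assumptions, not a model of fMRI.

```python
# Minimal sketch of the Johnson-Lindenstrauss intuition: pairwise distances
# between "stimulus" representations survive a random linear projection to
# fewer channels. All data are synthetic.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
n_stimuli, n_neurons, n_channels = 100, 10_000, 500

neural = rng.standard_normal((n_stimuli, n_neurons))        # "true" population responses
projection = rng.standard_normal((n_neurons, n_channels)) / np.sqrt(n_channels)
measured = neural @ projection                              # coarse, mixed measurement

d_neural = pdist(neural)        # pairwise distances in the full space
d_measured = pdist(measured)    # pairwise distances after projection
print("distance correlation:", np.corrcoef(d_neural, d_measured)[0, 1])
```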

Population-level analyses are an integral part of the toolkit of the neuroscientist seeking to understand the brain. These methods include:

  • Methods that seek to find sparse representations of neural and/or artificial data. Strictly speaking, these are often machine learning methods, but many were motivated from a neuroscience perspective, either from first principles or by specific use cases:

    • ICA [652]

    • Sparse coding [524]

    • Nonnegative Matrix Factorization [653]

    • Tensor component analysis and their variants [654, 655]

  • Methods that seek to perform decoding of neural activity

    • Multivariate pattern analysis [656]

    • Encoding and decoding models [657]

  • Methods that compare representations (a minimal RSA sketch follows this list)

    • Representational Similarity Analysis [418]

    • Canonical Correlation Analysis [658]

    • Neural shape metrics [115, 116]

  • Methods adapted from dynamical systems analysis, including phase-space analysis and the analysis of attractors and bifurcations [659, 660]

    • Estimating latent dimensions from noisy neural data [164]

    • Methods based on delay embeddings, e.g. Empirical Dynamic Modelling [603]

    • Dynamical Similarity Analysis [226]
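
As one concrete example from the representation-comparison family above, the following sketch computes representational dissimilarity matrices (RDMs) for two synthetic systems responding to the same stimuli and compares them with a rank correlation. The data are simulated, and the choice of correlation distance plus Spearman comparison is one common convention among several.

```python
# Minimal RSA sketch: build representational dissimilarity matrices (RDMs)
# for two systems over the same stimuli and compare them with a rank
# correlation. The "neural" and "model" responses are synthetic.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n_stimuli = 50
latent = rng.standard_normal((n_stimuli, 5))                 # shared stimulus structure

neural = latent @ rng.standard_normal((5, 200)) + 0.5 * rng.standard_normal((n_stimuli, 200))
model = latent @ rng.standard_normal((5, 300)) + 0.5 * rng.standard_normal((n_stimuli, 300))

rdm_neural = pdist(neural, metric="correlation")             # condensed RDM (upper triangle)
rdm_model = pdist(model, metric="correlation")

rho, _ = spearmanr(rdm_neural, rdm_model)
print(f"RDM similarity (Spearman rho): {rho:.2f}")
```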

Alternative views on neural computation

Intermediate between the single-neuron view and population view is that of axis-aligned coding. In axis-aligned coding, single neurons have non-accidental selectivity, but population activity nevertheless has a central role [661, 662]. A number of metrics have been proposed to characterize axis-aligned coding [115], although no consensus metric has thus far emerged.

An additional line of research has focused on characterizing the connections between neurons, as opposed to their activity [663, 664, 665, 666]. This edge-centric viewpoint has been popularized in the subfield of connectomics. Connectomics has made remarkable progress in explaining neural circuits, particularly in the fly, where reference connectomes have been published, first of the hemibrain [595, 596] and more recently of the entire brains of larvae and adults [147, 241]. Connectomics-centric approaches have been used to explain the mushroom body [545], ring attractors for navigation [667], functional properties of the visual system [151, 233, 668], and sensorimotor transformations [149].

Parallel analysis methods in neuroscience and mechanistic interpretability.

| Method | Neuroscience | Mechanistic Interpretability |
| --- | --- | --- |
| Tuning Curves | Characterizing neuron responses to parameterized stimuli [102, 637, 638] | Probing artificial neurons with parameterized inputs [669] |
| Receptive Fields / Preferred Stimuli | Manual characterization of excitation and inhibition areas [106, 625]; finding preferred stimuli using inception loops [67, 68, 112]; systems identification methods | Calculating maximizing stimuli for artificial neurons [670, 671] |
| Causal Manipulations | Micro-stimulation, optogenetics, and thermogenetic tools for neural manipulation [643, 644] | Ablations [672]; activation patching [673]; steering [583, 674]; causal mediation analysis [675] |
| Circuit Analysis | Connectomics studies [147, 241, 595, 596]; functional circuit analysis [149, 151, 233, 545, 667, 668] | Identifying circuits across layers in neural networks [582, 671] |
| Population-level Analysis | Analyzing coordinated activity of multiple neurons [646, 647]; neural manifolds [181, 648, 649] | Examining distributed representations in neural networks [583] |
| Dimensionality Reduction | PCA and ICA [652]; NMF [653]; tensor component analysis [654, 655] | Similar techniques applied to artificial neural network activations |
| Sparse Representations | Sparse coding analysis [524]; tensor component analysis [654, 655] | Sparse autoencoders for interpreting neural network activations [610, 676, 677] |
| Representation Comparison | Representational Similarity Analysis [418]; Canonical Correlation Analysis [658]; neural shape metrics [115, 116] | Centered Kernel Alignment (CKA) [419] |
| Decoding | Multivariate pattern analysis [656]; encoding and decoding models [657] | Using classifier probes to understand representations in artificial networks [678] |
| Dynamical Systems Analysis | Analysis of attractors and bifurcations [659, 660]; Empirical Dynamic Modelling [603] | Applying similar concepts to understand the dynamics of artificial neural networks [605] |

Mechanistic interpretability

Interpreting the inner workings of artificial neural networks has a long history. Early ANNs were often hand-designed in a bottom-up fashion, and the relationship between weights, architectures, and capabilities was clear [679]. During the first wave of neural networks trained through backpropagation, it became common practice to visualize the weights of neural networks (e.g. as Hinton diagrams). During the early deep learning resurgence, it was common to visualize the weights of networks trained on visual tasks or to visualize samples from their learned generative models [4, 680, 681]. Later, methods were devised to attribute decisions to particular elements of inputs such as images or text [682]. These methods can potentially give insights into what drives a decision and identify when algorithms use non-robust features to perform a task (shortcut learning) [41].

Mechanistic interpretability [580] aspires to reverse-engineer the algorithms implemented by artificial neural networks [581, 582]. This approach has a broader scope than earlier methods of interpretability, in that it seeks to build mechanistic (e.g. pseudocode) breakdowns of computations in artificial neural networks through the bottom-up analysis of neurons, connections, and activations. This is often complemented by an analysis of the mathematics of neural networks [582]. The analogy between mechanistic interpretability (MechInt) and neuroscience is not lost on the field; its leading practitioners have sometimes referred to it as the “neuroscience of artificial neural networks” [42, 594].

Early work on mechanistic interpretability on image classification models [581] cut across many of the concepts and tools from decades of visual neuroscience, and no doubt would be familiar to Hubel & Wiesel. This included:

  • Calculating tuning curves to parametrized stimuli

  • Identifying maximizing stimuli (i.e. receptive fields)

  • Identifying circuits across layers (analogous to circuits across visual areas) that implement specific functions

An example of this line of work is the identification of a circuit for the detection of curves in images in a large-scale convolutional neural network trained on image classification. A combination of estimating preferred stimuli using gradient descent [683, 684] and visualizing natural image patches causing high activations [685] identified a subset of neurons that were putatively selective for curves. These neurons were then probed using parameterized stimuli to confirm their specific tuning to orientation and curvature, yielding tuning curves. This identified a family of curve detectors in a layer of the network, selective for similar curvature but at different orientations. A bottom-up circuit with hand-tuned, programmatically generated weights was then built, recapitulating the main findings from the backpropagation-trained model [669]. This work can be seen as recapitulating long-standing work in neuroscience mapping the selectivity for contours in primate areas V1, V2, and V4 [102, 106, 686].
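
The core of the preferred-stimulus estimation step can be illustrated with a short, hedged sketch: gradient ascent on the input to maximize one channel of a small, untrained convolutional network. Real feature-visualization pipelines add regularization and transformation robustness; this shows only the basic mechanism, with an arbitrary network and channel.

```python
# Hedged sketch of "preferred stimulus" estimation by gradient ascent on the
# input, applied to one channel of a small, untrained CNN. The network and
# channel index are arbitrary placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=5, padding=2), nn.ReLU(),
)

image = torch.zeros(1, 3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)
channel = 3                                   # which unit (channel) to maximize

for step in range(200):
    optimizer.zero_grad()
    activation = cnn(image)[0, channel]       # feature map of the chosen channel
    loss = -activation.mean()                 # ascend the mean activation
    loss.backward()
    optimizer.step()

print("mean activation of estimated preferred stimulus:",
      float(cnn(image)[0, channel].mean()))
```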

This approach toward mechanistic interpretability has since been adapted to the analysis of large language models [579]. Specialized circuits implementing important atomic functions have been identified in LLMs, such as induction heads [687], indirect object identification [672], factual recall [688], and addition [689, 690]. A standard mechanistic interpretability toolkit has slowly formed [42, 580] around visualization techniques, classifier probes, causal methods that patch activations between circuits, and automated circuit discovery methods [589, 673, 691, 692]. These analyses are complemented by the explicit construction of transformer micro-circuits with domain-specific languages [693, 694, 695].
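
Of the toolkit elements above, classifier probes are perhaps the simplest to sketch: fit a linear classifier on hidden activations to test whether some property of the input is linearly decodable. The activations and labels below are synthetic stand-ins for activations harvested from a real model.

```python
# Minimal sketch of a classifier probe: a linear classifier fit on hidden
# activations to test whether a property of the input is linearly decodable.
# Activations and labels are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_examples, n_hidden = 2_000, 256
labels = rng.integers(0, 2, size=n_examples)                  # property of interest
signal = np.outer(labels - 0.5, rng.standard_normal(n_hidden))
activations = signal + rng.standard_normal((n_examples, n_hidden))

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```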

Distributed representations and representation engineering

Early mechanistic interpretability work was focused on a neuron-and-circuits-centric view of artificial neural networks. Even at this stage, however, one could see the cracks in the underlying assumptions: convolutional neural networks for image classification contain polysemantic units, which respond to a variety of concepts [684]. Polysemanticity cast doubt on the feasibility of finding crisp explanations for single neurons, e.g. a curve or cat detector. By contrast, the first generation of neural networks was motivated by the idea of distributed representation as the locus of conceptual information [679, 696, 697, 698, 699]. Later parallel work in computational neuroscience showed that mixed representations were both ubiquitous in the brain and advantageous in terms of coding capacity [700]. If representations are distributed rather than localized, where does this leave the project of understanding neural networks from the bottom up?

These conceptual issues became particularly relevant as the field shifted to analyzing large language models, where polysemanticity is the norm [610]. The exact proportion of monosemantic units in LLMs varies by architecture, capacity, and criterion, but it has been estimated at anywhere between 0% [701] and 35% [610]. That is, most units represent a variety of concepts and defy simple explanations. A related phenomenon is that undesirable behaviors often cannot be removed simply by ablating the units involved, because representations and mechanisms are redundant [702].

Just as the field of neuroscience moved from thinking mostly in terms of single neurons to thinking about populations of neurons (i.e. manifolds) over the 2010s, so has the field of mechanistic interpretability. An example of this trend is the recent work on sparse autoencoders (SAEs), which attempts to explain the functioning of LLMs in terms of multiple partially overlapping representations. Activations of neural networks are decomposed using a sparse autoencoder [676, 701], similar in spirit to sparse coding in neuroscience [524]. The activations of the discovered factors tend to be more interpretable than the activations of the underlying network, as evidenced by human judges and by automated labeling and verification by other large language models [674]. There is now an entire menagerie of SAE variants (classic, top-k, JumpReLU, gated, and others) with different sparsity properties and training efficiency [677, 703, 704]. As training SAEs requires significant computational resources [705], it has become common to release pretrained latent factors in the form of atlases for community-based investigation. The discovered features, labeled automatically by other large language models, can be perused and visualized, much as neurophysiology lab monitors are filled with visualizations of neurons, tuning curves, and manifolds. SAE features can also form a new basis for automated circuit discovery [706].
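
A minimal sketch of the idea, under simplified assumptions, is a top-k sparse autoencoder trained to reconstruct synthetic activation vectors that are sparse mixtures of ground-truth feature directions. A real pipeline would instead train on activations harvested from an LLM and typically adds further tricks (decoder normalization, dead-feature resampling) omitted here.

```python
# Hedged sketch of a top-k sparse autoencoder (SAE) trained on activation
# vectors. The "activations" are synthetic mixtures of sparse features.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_dict, k = 64, 512, 8

# Synthetic data: each activation is a sparse combination of ground-truth directions.
true_features = torch.randn(d_dict, d_model)
def sample_batch(n=256):
    codes = torch.zeros(n, d_dict)
    idx = torch.randint(0, d_dict, (n, k))
    codes.scatter_(1, idx, torch.rand(n, k))
    return codes @ true_features

class TopKSAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        pre = torch.relu(self.enc(x))
        topk = torch.topk(pre, k, dim=-1)             # keep only the k largest latents
        codes = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.dec(codes), codes

sae = TopKSAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(2_000):
    x = sample_batch()
    recon, codes = sae(x)
    loss = ((recon - x) ** 2).mean()                  # reconstruction loss; sparsity via top-k
    opt.zero_grad(); loss.backward(); opt.step()

print("final reconstruction error:", float(loss))
```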

An important tool for the verification of these discovered latent factors is steering. By manipulating the activations of networks in the (linear) direction of discovered latent factors, one can shift the output of networks toward desired ends. An evocative example of this method is Golden Gate Claude, a version of the Claude large language model that consistently steers conversations back to discussions of the Golden Gate Bridge [674]. These methods can also potentially be used for safety applications, suppressing undesirable behaviors such as sycophancy, repeating imitative falsehoods, or complying with instructions to generate dangerous, biased, or toxic outputs.

Latent feature steering is just one member of a large family of techniques for causally intervening on large language models, often grouped under the umbrella of activation patching. These techniques allow one to measure the causal role of a set of neurons and activations in a particular behavior. They include techniques that graft activations from one sentence or token to another [589]; measure contrasts between two sets of related inputs to determine a steering direction [583, 587]; and determine potential steering dimensions through a self-consistency criterion [586]. Thanks to the vast amount of linear structure in modern transformers [707], linear steering is surprisingly potent in leading models toward desirable behaviors.
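
The following toy sketch illustrates activation patching and contrast-based steering on a small, untrained transformer layer: cache a hidden state from a "clean" input, graft it into the run on a "corrupted" input, and compare outputs. The model, inputs, and patched position are arbitrary placeholders, not a recipe from the cited work.

```python
# Toy sketch of activation patching: transplant a cached hidden state from a
# clean run into a corrupted run and measure the effect on the output. A
# contrast between runs also yields a crude steering direction.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
readout = nn.Linear(d_model, 1)

def run(x, patch=None, position=2):
    h = layer(x)
    if patch is not None:
        h = h.clone()
        h[:, position] = patch            # graft the cached activation at one token position
    return readout(h.mean(dim=1))

clean = torch.randn(1, 6, d_model)
corrupted = clean + torch.randn(1, 6, d_model) * 0.5

with torch.no_grad():
    cached = layer(clean)[:, 2]           # activation to transplant
    y_clean, y_corr = run(clean), run(corrupted)
    y_patched = run(corrupted, patch=cached)
    steering_direction = (layer(clean) - layer(corrupted)).mean(dim=(0, 1))

print("clean:", float(y_clean), "corrupted:", float(y_corr), "patched:", float(y_patched))
print("steering direction norm:", float(steering_direction.norm()))
```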

Evaluation

Neuroscience was and continues to be a rich source of inspiration for AI interpretability. Neuroscience and AI interpretability have a common goal: understanding the mechanisms by which black-box systems form representations and display flexible, adaptive behaviors. AI interpretability has an important role in building safer systems and facilitating the assurance of existing AI systems. However, not every tool in neuroscience applies to AI safety (see the box on similarities and differences between ANN and BNN interpretability above). Much of neuroscience tooling is designed to overcome the limits of noisy and partial recordings by building efficient estimators. In addition, some of the most powerful tools in neuroscience interpretability are dedicated to understanding recurrent dynamical systems, which don’t have clear analogs in many popular AI systems. It may be that many of the insights from neuroscience have already been incorporated into the AI interpretability toolbox, which continues to evolve on its own.

The opposite arrow of influence — moving from AI interpretability to neuroscience — is underexplored and likely to benefit neuroscience. AI interpretability is, in a sense, an ideal neuroscience [593]: a neuroscience where one can measure every neuron, every weight, every activation, control the flow of information, causally mediate, edit, and ablate, all without resorting to difficult and slow biological experiments. Finding a minimal set of tools that have proven useful in mechanistic interpretability and bringing them back to neuroscience, where they can accelerate discoveries–some of which may have a potential impact on AI safety–is a promising avenue of research, particularly when applied to digital twins [67, 68].

Opportunities

In summary, we highlight several promising avenues for neuroscience to influence AI safety from the interpretability standpoint:

  • Adapt interpretability methods from neuroscience to AI

    • Focus on methods that have been seldom applied in AI interpretability, including dynamical systems analysis and methods from connectomics.

    • Apply interpretability to model types better adapted to neuroscience tools, e.g. state-space models [609, 708, 709] and newer variants of LSTMs [608].

  • Build models, inspired by neuroscience, which are transparent by design

    • Build modular [710, 711, 712], spatially embedded [618, 713], and sparse models [610] to facilitate interpretation.

    • Build models with learnable activation functions on edges [632, 714, 715].

  • Build tools to facilitate mechanistic interpretability in neuroscience

    • Build tools to record from [138, 139, 446, 716] and stimulate [601, 643, 644, 717] large populations of neurons to facilitate causal mediation analysis [154].

  • Define new criteria for interpretability based on insights from cognitive science

    • Find operationalizations anchored in cognitive science for what it means to be “interpretable”, e.g. Minimum Description Length (MDL) or Kolmogorov complexity.

    • Develop new benchmarks for interpretability based on insights from neuroscience and cognitive science.