NeuroAI for AI Safety

Patrick Mineault

Niccolò Zanichelli*

Joanne Zichen Peng*

Anton Arkhipov

Eli Bingham

Julian Jara-Ettinger

Emily Mackevicius

Adam Marblestone

Marcelo Mattar

Andrew Payne

Sophia Sanborn

Karen Schroeder

Zenna Tavares

Andreas Tolias


Abstract

As AI systems become increasingly powerful, the need for safe AI has become more pressing. Humans are an attractive model for AI safety: as the only known agents capable of general intelligence, they perform robustly even under conditions that deviate significantly from prior experiences, explore the world safely, understand pragmatics, and can cooperate to meet their intrinsic goals. Intelligence, when coupled with cooperation and safety mechanisms, can drive sustained progress and well-being. These properties are a function of the architecture of the brain and the learning algorithms it implements. Neuroscience may thus hold important keys to technical AI safety that are currently underexplored and underutilized. In this roadmap, we highlight and critically evaluate several paths toward AI safety inspired by neuroscience: emulating the brain’s representations, information processing, and architecture; building robust sensory and motor systems by imitating brain data and bodies; fine-tuning AI systems on brain data; advancing interpretability using neuroscience methods; and scaling up cognitively-inspired architectures. We make several concrete recommendations for how neuroscience can positively impact AI safety.


Discussion

This roadmap has evaluated several promising approaches for how neuroscience could positively impact AI safety, including:

  1. Building digital twins of sensory systems and reverse engineering their robust representations

  2. Developing embodied digital twins through large-scale neural recordings and virtual embodiment

  3. Pursuing biophysically detailed models through connectomics and biophysical modeling

  4. Developing better cognitive architectures

  5. Using brain data to fine-tune AI systems

  6. Inferring loss functions from neural data and behavior

  7. Leveraging neuroscience methods for mechanistic interpretability

Several key themes have emerged from this analysis:

Focus on safety over capabilities

Much of NeuroAI has historically focused on increasing capabilities: creating systems that leverage reasoning, agency, embodiment, compositional representations, and related ingredients to display adaptive behavior over a broader range of circumstances than conventional AI. We highlighted several ways in which NeuroAI could also enhance safety without dramatically increasing capabilities. This is a promising and potentially impactful niche for NeuroAI as AI systems develop more autonomous capabilities.

Data and tooling bottlenecks

Some of the most impactful ways in which neuroscience could affect AI safety are infeasible today because of a lack of tooling and data. Neuroscience is more data-rich than at any time in the past (Box [box-available-neural-data]), but it remains fundamentally data-poor. Recording technologies are advancing exponentially (Box [box-scaling-trends]), with capabilities doubling every 5.2 years for electrophysiology and every 1.6 years for imaging, but this is dwarfed by the pace of progress in AI: AI compute is estimated to double every 6-10 months [718]. Taking seriously the idea that neuroscience can affect AI safety will require large-scale investments in data and tooling to record neural data from animals and humans during high-entropy natural tasks, to measure structure and its mapping to function, and to access frontier-model-scale compute.
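To make the gap concrete, here is a back-of-the-envelope calculation using the doubling times cited above; the 10-year horizon is an illustrative assumption of ours, not a figure from the roadmap.

```python
# Compare growth implied by the doubling times cited in the text:
# ~5.2 y (electrophysiology), ~1.6 y (imaging), ~6-10 months (AI compute).

def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth after `years`, given a fixed doubling time."""
    return 2.0 ** (years / doubling_time_years)

horizon = 10.0  # years (assumed for illustration)
for name, t_double in [
    ("electrophysiology", 5.2),
    ("neural imaging", 1.6),
    ("AI compute (10-month doubling)", 10 / 12),
    ("AI compute (6-month doubling)", 6 / 12),
]:
    print(f"{name:32s} x{growth_factor(horizon, t_double):>12,.0f}")

# Over a decade: electrophysiology grows ~4x, imaging ~76x, while AI
# compute grows by roughly three to six orders of magnitude -- the gap
# the text describes as "dwarfed by the pace of progress in AI".
```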

Need for theoretical frameworks

While we have identified promising empirical approaches, stronger theoretical frameworks are needed to understand when and why brain-inspired approaches enhance safety. This includes better understanding when robustness can be transferred from structural and functional data to AI models; the range of validity of simulations of neural systems and their ability to self-correct; and improved evaluation frameworks for robustness and simulations.

Breaking down research silos

When we originally set out to write a technical overview of neuroscience for AI safety, we did not foresee that our work would balloon into a 100-page manuscript. What we found is that much of the relevant research lives in different silos: AI safety research has a different culture than AI research as a whole; neuroscience has only recently started to engage with scaling-law research; structure-focused and representation-focused work rarely overlap, with the recent exception of structure-to-function studies enabled by connectomics [150, 719]; and insights from mechanistic interpretability have yet to shape much research in neuroscience. Some prescient proposals have synthesized these areas of research into programs, including Byrnes’ brain-like AGI safety proposal (Section 7.3.4) and Sarma et al.’s framework for AGI safety based on top-down neuropsychology and bottom-up biophysical simulations [229, 460, 461]. We hope to catalyze more positive exchanges between these fields by building a strong common base of knowledge from which AI safety and neuroscience researchers can have productive interactions.

Moving forward: recommendations

We’ve identified several distinct neuroscientific approaches that could positively impact AI safety. Some of these approaches, focused on building tools and data, would benefit from coordinated execution within a national lab, a focused research organization [720], a research non-profit, or a moonshot startup. Well-targeted tools and data serve a dual purpose: a direct shot-on-goal at improving AI safety, and an indirect benefit of accelerating neuroscience research and neurotechnology translation. Projects that we identify as good targets for a coordinated effort within the next 7 years include:

  • Development of high-bandwidth neural interfaces, including next-generation chronic recording capabilities in animals and humans spanning electrophysiology and functional ultrasound imaging. We believe that decreasing the doubling time for electrophysiology capabilities to 2 years is structurally feasible given sufficient funding and pressure

  • Large-scale naturalistic neural recordings during rich behavior in animals and humans, including the aggregation of data collected in humans in a distributed fashion

  • Development of detailed virtual animals with bodies and environments, a shot-on-goal toward passing the embodied Turing test [144]

  • Bottom-up reconstruction of circuits underlying robust behavior, including simulation of the whole mouse cortex at the point neuron level (the point-neuron abstraction is sketched after this list)

  • Development of multimodal foundation models for neuroscience to simulate neural activity at the level of representations and dynamics across a broad range of target species
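For readers unfamiliar with the "point neuron" level of abstraction referenced above, the minimal sketch below simulates a small recurrent network of leaky integrate-and-fire units: each neuron is a single voltage variable with no dendritic morphology or channel biophysics. All parameters are generic textbook values chosen for illustration, not those of any specific mouse-cortex model.

```python
# Leaky integrate-and-fire (LIF) point-neuron network: an illustration
# of the level of abstraction, not a model of any particular circuit.
import numpy as np

rng = np.random.default_rng(0)

n_neurons, n_steps, dt = 1000, 1000, 1e-3            # 1 s of simulated time
tau, v_rest, v_thresh, v_reset = 20e-3, -70.0, -50.0, -65.0  # s, mV

# Sparse random recurrent weights (mV of deflection per presynaptic spike)
weights = rng.normal(0.0, 0.5, (n_neurons, n_neurons)) * (
    rng.random((n_neurons, n_neurons)) < 0.05
)

v = np.full(n_neurons, v_rest)
spikes = np.zeros((n_steps, n_neurons), dtype=bool)

for t in range(n_steps):
    external = rng.normal(1.0, 2.0, n_neurons)        # noisy external drive (mV)
    recurrent = weights @ spikes[t - 1] if t > 0 else 0.0
    v += dt / tau * (v_rest - v) + external + recurrent  # leaky integration
    spikes[t] = v >= v_thresh                         # threshold crossing
    v[spikes[t]] = v_reset                            # reset after a spike

print(f"mean firing rate: {spikes.mean() * (1 / dt):.1f} Hz")
```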

Other approaches are focused on building knowledge and insight, and could be addressed through conventional and distributed academic research:

  • Improving robustness through neural data augmentation (a minimal sketch follows this list)

  • Developing better tools for mechanistic interpretability

  • Creating benchmarks for human-aligned representation learning
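One concrete instantiation of the first item above, in the spirit of neural-regularization work, is to add an auxiliary loss that pulls a model’s representational geometry toward that measured from neural recordings of the same stimuli. The sketch below is illustrative: the module names (backbone, classifier), the cosine-similarity loss form, and the weighting lam are our assumptions, not the roadmap’s prescribed method.

```python
# Hedged sketch of neural data augmentation as an auxiliary training loss.
import torch
import torch.nn.functional as F

def similarity_matrix(x: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between rows (stimuli)."""
    x = F.normalize(x.flatten(1), dim=1)
    return x @ x.T

def neural_similarity_loss(features: torch.Tensor, neural: torch.Tensor) -> torch.Tensor:
    """Penalize mismatch between model and neural representational geometry."""
    return F.mse_loss(similarity_matrix(features), similarity_matrix(neural))

def training_step(backbone, classifier, images, labels, neural_responses, lam=0.1):
    features = backbone(images)                       # model representation of the stimuli
    task_loss = F.cross_entropy(classifier(features), labels)
    aux_loss = neural_similarity_loss(features, neural_responses)
    return task_loss + lam * aux_loss                 # joint objective
```

The weighting lam trades off task performance against agreement with the neural data; in published neural-regularization studies, modest values of this kind of auxiliary term were reported to improve robustness without large accuracy costs.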

Nothing about safer AI is inevitable: progress requires sustained investment and focused research effort. By thoughtfully combining insights from neuroscience with advances in AI, we can work toward systems that are more robust, interpretable, and aligned with human values. However, this field is still in its early stages. Many of the approaches we’ve evaluated remain speculative and will require significant advances in both neuroscience and AI to realize their potential. Success will require close collaboration between neuroscientists, AI researchers, and the broader scientific community.

Our review suggests that neuroscience has unique and valuable contributions to make to AI safety, particularly in understanding how biological systems implement robust, safe, and aligned intelligence. The challenge now is to translate these insights into practical approaches for developing safer AI systems.

Acknowledgements

We would like to thank reviewers who provided critical feedback on early versions of this manuscript: Ed Boyden, Bing Brunton, Milan Cvitkovic, Jan Leike, Grace Lindsay, Alex Murphy, Sumner Norman, Bence Ölveczky, Raiany Romanni, Jarod Rutledge and Paul Scotti, as well as dozens of neuroscientists and AI researchers who shaped the narrative of this manuscript. Finally, we would like to thank James Fickel for his unwavering support and vision in advancing neuroscience and AI safety.