NeuroAI for AI Safety

Patrick Mineault

Niccolò Zanichelli*

Joanne Zichen Peng*

Anton Arkhipov

Eli Bingham

Julian Jara-Ettinger

Emily Mackevicius

Adam Marblestone

Marcelo Mattar

Andrew Payne

Sophia Sanborn

Karen Schroeder

Zenna Tavares

Andreas Tolias


Abstract

As AI systems become increasingly powerful, the need for safe AI has become more pressing. Humans are an attractive model for AI safety: as the only known agents capable of general intelligence, they perform robustly even under conditions that deviate significantly from prior experiences, explore the world safely, understand pragmatics, and can cooperate to meet their intrinsic goals. Intelligence, when coupled with cooperation and safety mechanisms, can drive sustained progress and well-being. These properties are a function of the architecture of the brain and the learning algorithms it implements. Neuroscience may thus hold important keys to technical AI safety that are currently underexplored and underutilized. In this roadmap, we highlight and critically evaluate several paths toward AI safety inspired by neuroscience: emulating the brain’s representations, information processing, and architecture; building robust sensory and motor systems from imitating brain data and bodies; fine-tuning AI systems on brain data; advancing interpretability using neuroscience methods; and scaling up cognitively-inspired architectures. We make several concrete recommendations for how neuroscience can positively impact AI safety.


Methods

Scaling laws for digital twins

To compile empirical scaling laws for digital twins (Figure 6), we manually captured the mean test set performance reported in the main text or tables of papers; where relevant, we used WebPlotDigitizer to capture the data in graphs. In cases where correlation coefficients were reported, we squared them to obtain a measure that approximates FEVE. Although recording length was often reported in terms of trials, repeats, and number of images, we standardized everything to report recording time in seconds; for example, 12 repeats of 300 images presented in one-second trials would add up to an hour.
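
As a minimal illustration of this standardization (the values here are hypothetical; the actual extraction was done by hand):

# Hypothetical paper: reports r = 0.6 on the test set, with 12 repeats
# of 300 images shown in one-second trials.
r = 0.6
feve_approx = r ** 2  # squared correlation as an approximation of FEVE

n_repeats, n_images, trial_length_s = 12, 300, 1.0
recording_time_s = n_repeats * n_images * trial_length_s  # 3600 s, i.e. one hour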

We fit scaling laws by minimizing the squared loss between the points and a sigmoid as a function of log recording time, using the curve_fit function in scipy.optimize. For the V4 data, we only used the data from action potentials to fit the scaling law, as the other dataset used noisier single-photon calcium imaging.
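
In essence, the fit was the following (a minimal sketch; the parameterization of the sigmoid and the initial values are illustrative rather than the exact ones used):

import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_t, feve_max, slope, midpoint):
    # Saturating performance as a function of log10 recording time.
    return feve_max / (1.0 + np.exp(-slope * (log_t - midpoint)))

def fit_scaling_law(recording_time_s, feve):
    # recording_time_s: recording time in seconds; feve: reported performance.
    log_t = np.log10(recording_time_s)
    p0 = [0.8, 1.0, np.median(log_t)]  # rough initial guess
    popt, _ = curve_fit(sigmoid, log_t, feve, p0=p0, maxfev=10000)
    return popt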

Simulations

We simulated how the number of datapoints used to train the readout weights from a core affects the prediction accuracy. We generated data from a linear-nonlinear-Poisson (LNP) model:

\[\boldsymbol{\mu} = \exp(a\mathbf{Xw} + b)\] \[\mathbf{y} \sim \textrm{Poisson}(\boldsymbol{\mu})\]

Where the weights of the model were taken to be \(\mathbf{w} \sim \textrm{Normal}(0, 1/M)\), and the design matrix was iid Gaussian, \(\mathbf{X} \sim \textrm{Normal}(0, 1)\). We set \(a = 0.4\) and \(b = 0.1\).

\(\mathbf{X}\) has size \(N \times M\), where \(N\) is the number of simulated trials or timepoints, and \(M\) is the number of simulated readout weights. We estimated the parameters of this model using MAP with a prior matched to the weights, using the iteratively reweighted least squares algorithm. We evaluated the accuracy of the predictions on newly generated data to estimate the validated FEVE. We then fit scaling laws to this data as above.
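
A condensed sketch of this simulation and the MAP fit (our own illustrative implementation of the iteratively reweighted least squares / Newton updates; constants follow the text):

import numpy as np

rng = np.random.default_rng(0)
a, b = 0.4, 0.1
M = 100    # number of simulated readout weights
N = 2000   # number of simulated trials or timepoints

w_true = rng.normal(0.0, np.sqrt(1.0 / M), size=M)   # w ~ Normal(0, 1/M)
X = rng.normal(0.0, 1.0, size=(N, M))                # iid Gaussian design matrix
mu = np.exp(a * X @ w_true + b)
y = rng.poisson(mu)

def fit_map_irls(X, y, a, b, prior_precision, n_iter=50):
    # MAP estimate of the readout weights under a Normal(0, 1/prior_precision)
    # prior, via Newton / iteratively reweighted least squares updates.
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(a * X @ w + b)
        grad = a * X.T @ (y - mu) - prior_precision * w
        hess = a**2 * X.T @ (X * mu[:, None]) + prior_precision * np.eye(len(w))
        w = w + np.linalg.solve(hess, grad)
    return w

w_hat = fit_map_irls(X, y, a, b, prior_precision=M)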

Wrong core

We simulated a scenario where a neuron is driven by two components of a core: one that the model accounts for, and another that it does not. For example, imagine a visual neuron in area V1 halfway between a simple cell and a complex cell: it displays some phase sensitivity as well as tolerance for the position of an oriented bar. Fitting that cell with a linear model means that, no matter how much data we use to fit the model, predictions will be imperfect: the linear model cannot account for the (quadratic) nonlinearity underlying the phase insensitivity. Thus, the mean response was modeled as:

\[\boldsymbol{\mu} = \exp(a\alpha^2 \mathbf{X}_c\mathbf{w}_c + a(1-\alpha^2) \mathbf{X}_i\mathbf{w}_i + b)\]

Since we assumed the weights \(\mathbf{w}_i\) couldn’t be estimated, in practice, \(a(1-\alpha^2) \mathbf{X}_i\mathbf{w}_i\) reduced to a source of normal noise:

\[\boldsymbol{\mu} = \exp(a\alpha^2 \mathbf{X}_c\mathbf{w}_c + a(1-\alpha^2) \epsilon + b)\]

Where \(\epsilon\sim\textrm{Normal}(0, 1)\). We varied \(\alpha\) to simulate scenarios ranging from the core being completely incorrect (\(\alpha=0\)) to the core being completely correct (\(\alpha=1\)).
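
A sketch of the corresponding data-generating step (variable names are ours):

import numpy as np

rng = np.random.default_rng(0)
a, b = 0.4, 0.1
M, N = 100, 2000

w_c = rng.normal(0.0, np.sqrt(1.0 / M), size=M)  # weights on the correct part of the core
X_c = rng.normal(size=(N, M))                    # features of the correct part of the core
eps = rng.normal(size=N)                         # stand-in for the unmodeled component

def simulate_wrong_core(alpha):
    # alpha = 1: core completely correct; alpha = 0: core completely incorrect.
    mu = np.exp(a * alpha**2 * X_c @ w_c + a * (1 - alpha**2) * eps + b)
    return rng.poisson(mu)

y = simulate_wrong_core(alpha=0.5)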

Simulations of adversarial robustness

To investigate the feasibility of transferring robustness through neural data, we took adversarially trained networks and attempted to distill them into student networks based on either their behavioral output or their internal representations. We first conducted proof-of-concept experiments using MNIST-1D, a simplified, algorithmically generated one-dimensional digit classification task [721] consisting of 40 element vectors. We trained a robust teacher network (a 3-layer convolutional neural network with 64 channels per layer) using \(L_\infty\)-norm adversarial training with \(\epsilon=0.3\) with a PGD attack with 50 iterations. The teacher was trained using the Adam optimizer with an initial learning rate of 0.01 (decreased by 10x halfway through training) for 250 epochs with data augmentation. For student training, we leveraged MNIST-1Ds algorithmic generator to create multiple training datasets with different random seeds. Students of identical architecture were trained for 256 epochs using Adam with learning rate 0.01 (decreased by 10x halfway through training). The training objective, \(L = (1-\beta)L_{CLASS} + \beta L_{RSA}\), combined a teacher prediction matching loss (\(L_{CLASS}\)) and a representation matching one (\(L_{RSA}\)), with \(\beta\), ranging from 0 to 300, controlling the strength of representation matching. To simulate training with different amounts of data while controlling for training time, we always ran 256 epochs of 5,000 examples, varying the number of distinct training examples from 5,000 (labeled 1X) to 640,000 (128X); we recycled examples appropriately across epochs.
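
The training objective can be sketched as follows (PyTorch-style; the KL-based prediction-matching term and the similarity-matrix form of the RSA term are plausible instantiations, not necessarily the exact ones used):

import torch
import torch.nn.functional as F

def rsa_loss(student_feats, teacher_feats):
    # Match within-batch representational similarity matrices (one common
    # RSA-style formulation).
    s = F.normalize(student_feats.flatten(1), dim=1)
    t = F.normalize(teacher_feats.flatten(1), dim=1)
    return F.mse_loss(s @ s.T, t @ t.T)

def distillation_loss(student_logits, teacher_logits,
                      student_feats, teacher_feats, beta):
    # L = (1 - beta) * L_CLASS + beta * L_RSA; beta ranged from 0 to 300.
    l_class = F.kl_div(F.log_softmax(student_logits, dim=1),
                       F.softmax(teacher_logits, dim=1),
                       reduction="batchmean")
    l_rsa = rsa_loss(student_feats, teacher_feats)
    return (1 - beta) * l_class + beta * l_rsa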

For CIFAR-10, we used a pre-trained \(L_\infty\)-adversarially robust WideResNet-28-10 teacher [722]. Student networks of identical architecture were trained for 200 epochs using SGD with momentum 0.9, weight decay 5e-4, an initial learning rate of 0.01 (decreased by 10x halfway through training), and the same training objective as in the MNIST-1D experiments. Teacher features were precomputed on clean images, while student training utilized standard CIFAR-10 augmentation (random crops and horizontal flips). The RSA loss was only computed over middle block group features and was rescaled dynamically during training to maintain a consistent magnitude relative to the classification loss. To simulate neural recording noise, we added Gaussian noise of varying magnitude (0%, 5%, or 10% of the feature magnitude) to the teacher's features before computing the RSA loss. Models were evaluated against 10-step PGD attacks with \(\epsilon=8/255\).
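
Robustness was assessed with a standard projected gradient descent attack, along these lines (a generic \(L_\infty\) PGD sketch; the step size is a common default rather than the exact value used):

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, steps=10, alpha=2 / 255):
    # Standard L_inf PGD: random start, signed-gradient steps, projection
    # back into the eps-ball and the valid image range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()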

Scaling of neural recordings

To extend the results of [137] and [138], we retrieved all neuroscience abstracts from bioRxiv since its inception (44,000 abstracts, cutoff date of September 1st, 2024) and filtered them using an LLM (gpt-4o-2024-08-06) to obtain 513 promising papers; a sketch of the corresponding API call follows the query excerpt below. We used a query that started with:

You are tasked with analyzing an abstract from a scientific paper to
determine if the full paper is likely to contain useful information about
state-of-the-art neural recording methods. Focus only on invasive methods, 
including penetrating electrodes, ECoG arrays, and calcium imaging.
Also consider functional ultrasound, which is sometimes referred to as
non-invasive but typically requires a craniotomy.

Here is the abstract you need to analyze:

<abstract>{{abstract}}</abstract>

Carefully read through the abstract and consider the following criteria
to determine if the paper is promising:

1. Does the abstract mention methods development--specifically methods
that enable large-scale recordings, for example new hardware or
indicators--as its primary goal?

2. Is there any indication of a large or massive dataset being recorded?

3. Does the abstract suggest that the research would only have been
possible with large datasets?

4. Are there mentions of advancements in terms of:

a) Size of the dataset
b) Number of probes used
c) Number of neurons recorded per session
d) The percentage of the brain that can be recorded, for instance
whole-brain imaging or near whole-brain imaging

If the paper meets one or more of these criteria, consider it promising.
<further instructions for formatting in JSON>
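
The filtering step amounted to one structured-output call per abstract, roughly as follows (a sketch using the OpenAI Python client; the full prompt and the JSON formatting instructions are abbreviated above):

import json
from openai import OpenAI

client = OpenAI()

def screen_abstract(abstract: str, prompt_template: str) -> dict:
    # prompt_template is the query shown above, with a {{abstract}} placeholder.
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user",
                   "content": prompt_template.replace("{{abstract}}", abstract)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)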

We then fed the full text of each of these papers to the same LLM, with a query that started like this:


You are tasked with analyzing a scientific paper to extract information
about neural recording techniques and the number of simultaneously
recorded neurons. Here's how to proceed:

1. Carefully read through the following PDF content:

<pdf_content>
{{PDF_content}}
</pdf_content>

2. As you read, look for information related to:

- The species studied
- The neural recording technology used
- The number of neurons, unit, multi-units, channels, electrodes,
animals, ROIs, probes, sessions, voxels, arrays.
- Whether the recording was chronic or acute
- Any details about the recording sessions or animals used

3. Extract relevant quotes from the PDF content and list them using
<quote> tags. For example:

<quote>We recorded from 10,000 neurons across 5
mice using two-photon calcium imaging.</quote>

4. After listing the relevant quotes, analyze the information to
determine:

- The number of simultaneously recorded neurons per session
- Whether the recording was chronic or acute
- The specific neural recording technology used

<further instructions for formatting results in JSON>

One of the authors (PM) then manually curated the results. After joining with previous databases [137, 138], we deduplicated and filtered out papers that reported a lower number of neurons per recording than at least 10 prior papers, as suggested by [137]. This process resulted in a total of 151 papers. We then fit the following Bayesian linear regression using PyMC [723], \(\log(\textrm{neurons}) \sim at + b\), separately for electrophysiology and calcium imaging recordings after 1990.
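
The regression itself is a few lines of PyMC; a sketch, with weakly informative priors of our own choosing:

import numpy as np
import pymc as pm

def fit_neuron_scaling(years, n_neurons):
    # years: publication year of each paper; n_neurons: simultaneously
    # recorded neurons reported in that paper.
    t = np.asarray(years) - 1990
    with pm.Model():
        a = pm.Normal("a", mu=0.0, sigma=1.0)      # slope (log neurons per year)
        b = pm.Normal("b", mu=0.0, sigma=10.0)     # intercept
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        pm.Normal("log_neurons", mu=a * t + b, sigma=sigma,
                  observed=np.log(n_neurons))
        trace = pm.sample(2000, tune=1000, target_accept=0.9)
    return trace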

Dimensionality of neural data

To estimate the dimensionality of neural data across model organisms, we retrieved six representative datasets: two for C. elegans [185, 186], one for larval zebrafish [187], and three for mice [139, 183, 188]. The data were converted to HDF5 format for analysis and included both neural activity recordings and the three-dimensional spatial coordinates of each recorded neuron, allowing us to take the spatial organization of neurons into account.

For all datasets, we employed Shared Variance Component Analysis (SVCA) to estimate the number of dimensions of the data that could be reliably estimated [139, 724]. We split the data into train and test sets that did not overlap along either the time or neuron axes. SVCA then calculated the covariance between the two groups of neurons for time points belonging to the train set, applied randomized singular value decomposition (SVD) to calculate orthogonal bases for both sets of neural activity, and assessed how well this basis explained the variance in the test set. This amount of variance, normalized by the total variance, was termed reliable variance; dimensions were considered significantly reliable only if their percentage of reliable variance was greater than four standard deviations above the mean of the reliable variance for two shuffled datasets, following [139].
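
A condensed sketch of the SVCA computation (our own simplified implementation: a random neuron split and alternating train/test timepoints; the actual analysis used spatially alternating neuron bins and the shuffle-based significance criterion described above):

import numpy as np
from sklearn.utils.extmath import randomized_svd

def svca(activity, n_dims=128, seed=0):
    # activity: (neurons, timepoints) array of z-scored responses.
    rng = np.random.default_rng(seed)
    n_neurons, n_time = activity.shape

    # Non-overlapping neuron groups and train/test timepoints.
    perm = rng.permutation(n_neurons)
    g1, g2 = perm[: n_neurons // 2], perm[n_neurons // 2:]
    train = np.arange(n_time) % 2 == 0
    test = ~train

    A, B = activity[g1], activity[g2]

    # Orthogonal bases from the train-set cross-covariance between the groups.
    U, _, Vt = randomized_svd(A[:, train] @ B[:, train].T,
                              n_components=n_dims, random_state=seed)

    # Shared ("reliable") variance of each dimension on held-out timepoints.
    pa, pb = U.T @ A[:, test], Vt @ B[:, test]
    reliable = np.sum(pa * pb, axis=1)
    total = 0.5 * (np.sum(pa ** 2, axis=1) + np.sum(pb ** 2, axis=1))
    return reliable / total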

To assess how dimensionality scaled with the number of recorded neurons, we randomly sampled subsets of neurons from each dataset, z-scored their activity, and applied the procedure described above. The number of neurons sampled was varied either linearly or logarithmically, depending on the dataset's total neuron count. To prevent spurious correlations, neurons were divided into spatially alternating bins before splitting into training and testing sets. We then fit a linear regression in log-log space to estimate the exponent of a scaling law for neural dimensionality as a function of the recording size, \(\textrm{dim}\sim \alpha (\# \text{neurons})^\beta\).
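
The scaling exponent \(\beta\) was then read off from an ordinary least-squares fit in log-log space, e.g.:

import numpy as np

def fit_dimensionality_scaling(n_neurons, dims):
    # Fit dim ~ alpha * n_neurons**beta by linear regression in log-log space.
    beta, log_alpha = np.polyfit(np.log(n_neurons), np.log(dims), 1)
    return np.exp(log_alpha), beta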

Estimating freely available data

To estimate the total amount of freely available neural data, we analyzed datasets available on three major data repositories: DANDI [725], OpenNeuro [719], and iEEG.org. For iEEG.org, we used a script to scrape recording length information from the HTML source of the website. For DANDI datasets, we used the DANDI Python client to access Neurodata Without Borders [726] files, leveraging partial downloads to minimize data transfer. Extracted metadata included recording modality (e.g., calcium imaging or electrophysiology), subject species, number of recorded neurons, and recording duration for each file. The resulting data was then processed to standardize species and recording technology nomenclature, as well as to compute subject- and dataset-level statistics.
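
For DANDI, the enumeration step looked roughly like this (a sketch using the public dandi Python client; the partial-download logic and the per-file NWB metadata extraction are omitted):

from dandi.dandiapi import DandiAPIClient

def list_nwb_assets(dandiset_id: str):
    # Yield basic per-file information for the NWB assets of one dandiset.
    with DandiAPIClient() as client:
        dandiset = client.get_dandiset(dandiset_id)
        for asset in dandiset.get_assets():
            if asset.path.endswith(".nwb"):
                yield {"path": asset.path, "size_bytes": asset.size}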

For OpenNeuro datasets, we used the OpenNeuro GraphQL API endpoint to identify and analyze datasets containing neuroimaging data. We then retrieved session metadata for BIDS-formatted [727] recordings, extracting metadata such as recording modality (e.g., MRI, EEG, or MEG) and number of subjects, and using the accompanying event files to estimate the recording's duration. The resulting data was then filtered to exclude likely artifacts, such as single sessions lasting more than four hours, and to compute dataset-level statistics.

Given their outsized contribution to the total amount of available data, we manually validated the largest datasets in each repository, cross-referencing them against published papers, technical documentation, and code repositories when available, and manually correcting them in case of discrepancies. Since many large-scale neuroimaging datasets are not hosted on these repositories, we supplemented this data with the UK Biobank [728], the 1000 Functional Connectomes Project [729], the Human Connectome Project [730], the Adolescent Brain Cognitive Development Study [731], the Healthy Brain Network [732], the Natural Scenes Dataset [430], and the Courtois NeuroMod project [733, 734, 735], incorporating this information into our analysis.

Brain-Score leaderboard trends

We pulled the results of the vision leaderboard from brainscore.org using a script. We filtered the data to those models for which we had scores on 9 neural datasets: one from V1, one from V2, three from V4, and four from IT. The submission dates for the models are not listed, so we instead used the order in which each model was submitted as a proxy for the date of submission. We used seaborn to plot the results, using a quadratic model to fit each temporal trend separately.
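
The plotting step can be sketched as follows (column names are illustrative of our processed table, not of the raw leaderboard export):

import seaborn as sns
import matplotlib.pyplot as plt

def plot_benchmark_trends(df):
    # df: one row per (model, benchmark) with columns 'submission_order',
    # 'score', and 'benchmark'; a quadratic trend is fit per benchmark.
    g = sns.lmplot(data=df, x="submission_order", y="score",
                   hue="benchmark", order=2, ci=None,
                   scatter_kws={"s": 10})
    g.set_axis_labels("Submission order (proxy for date)", "Score")
    plt.show()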