Author: Saeed Taghavi – Researcher with a PhD in Physics at the Instituto Zapata-Briceño de Neurociencia
Neuroscience has long advanced through specialization, with vision researchers, language researchers, and auditory researchers each building models in relative isolation. TRIBE v2, a collaboration between Meta FAIR and École Normale Supérieure – PSL, challenges that fragmentation with a single, unified approach.
Rather than generating text or images like most AI models, TRIBE v2 predicts patterns of brain activity. Feed it a movie clip, a spoken sentence, a piece of music, or a paragraph of text, and it returns an estimate of how the cortex would respond.
A Unified Map of the Brain
Trained on large-scale fMRI data from hundreds of subjects watching naturalistic content, TRIBE v2 learns a single mapping across visual, auditory, and language inputs.
The practical payoff is immediate: classical neuroscience experiments, such as presenting faces to identify the fusiform face area or contrasting sentences with word lists to probe language networks, can now be run computationally, without collecting new data.
When the authors applied standard experimental paradigms directly to the model, it recovered well-established cortical patterns it was never explicitly trained on.
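Computationally, a paradigm of this kind reduces to a contrast: average the predicted responses for each condition and subtract. The sketch below illustrates only that arithmetic. The function name `contrast_map` and the numbers are hypothetical placeholders, not TRIBE v2's API or its output; in practice the response vectors would come from the model.

```python
from statistics import mean


def contrast_map(responses_a, responses_b):
    """Per-region difference of mean predicted responses (condition A minus B).

    Each argument is a list of response vectors, one per stimulus,
    and every vector has one value per cortical region.
    """
    n_regions = len(responses_a[0])
    return [
        mean(r[i] for r in responses_a) - mean(r[i] for r in responses_b)
        for i in range(n_regions)
    ]


# Synthetic placeholder values, NOT model output:
# 3 regions, 2 stimuli per condition.
faces = [[0.9, 0.2, 0.1], [1.1, 0.3, 0.1]]
objects = [[0.3, 0.2, 0.2], [0.5, 0.3, 0.0]]

contrast = contrast_map(faces, objects)

# Index of the region that most prefers faces over objects in this toy data.
peak_region = max(range(len(contrast)), key=lambda i: contrast[i])
print(peak_region, contrast)
```

A region with a large positive value responds more to the first condition than to the second, which is the logic behind a face-selectivity localizer.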
What It Actually Predicts (and What It Doesn't)
TRIBE v2 predicts BOLD signals, the hemodynamic responses measured by fMRI. The BOLD signal is a temporally coarse proxy for neural activity, lagging it by several seconds.
It tells you where the brain is broadly engaged, not how it computes moment to moment.
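The delay can be made concrete by convolving a brief neural event with a hemodynamic response function (HRF). The sketch below assumes a simplified single-gamma HRF with shape 6 and scale 1 s, an illustrative choice rather than anything taken from the TRIBE v2 paper, and a 1 s sampling step.

```python
import math


def gamma_hrf(t, shape=6.0, scale=1.0):
    """Simplified single-gamma HRF; peaks at (shape - 1) * scale seconds."""
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)) / (
        scale ** shape * math.gamma(shape)
    )


def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of `signal`."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        if s == 0:
            continue
        for j, k in enumerate(kernel):
            if i + j < len(out):
                out[i + j] += s * k
    return out


dt = 1.0  # seconds per sample (assumed)
hrf = [gamma_hrf(i * dt) for i in range(30)]

neural = [0.0] * 40
neural[2] = 1.0  # a brief neural event at t = 2 s

bold = convolve(neural, hrf)
peak_time = bold.index(max(bold)) * dt
print(peak_time - 2.0)  # lag between the event and the BOLD peak: 5.0
```

An event lasting a fraction of a second is smeared into a response that peaks seconds later, which is why the signal cannot resolve moment-to-moment computation.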
Crucially, the model treats the brain as a passive receiver of stimuli. Real brains are not. Attention, expectations, and behavioral goals all shape how a stimulus is processed, and TRIBE v2 captures none of this.
For passive viewing conditions it is a reasonable approximation; for active cognition or clinical populations, it is a real limitation.
From Lab Tool to Engineering Layer
Early applications hint at broader uses. Researchers have already combined TRIBE v2 with AI music generation to explore how rhythm and emotional valence engage cortical networks, without a single neuroimaging session.
Similar logic could extend to content production, interface design, or education: evaluate predicted neural engagement computationally before committing to expensive empirical testing.
A Useful Addition, Not a Revolution
TRIBE v2 predicts population-averaged responses, not individual ones, and generalizes best to stimuli similar to its training data.
What it offers is a reproducible computational layer between stimulus design and data collection, one that, used carefully, can make experimental programs more efficient and hypothesis-driven.
Computational pre-screening does not replace lab work; it makes lab work smarter.
Whether its predictions are faithful enough to change the decisions researchers actually make is the essential open question, and systematic benchmarking against real neural data is the necessary next step.
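One common way to frame such a benchmark is to correlate predicted and measured time series region by region. The sketch below assumes Pearson correlation as the score, which is not necessarily the metric the authors use. The region names and values are synthetic placeholders, not real or model-generated data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def score_regions(predicted, measured):
    """Map each region name to the correlation of predicted vs. measured signal."""
    return {
        region: pearson(predicted[region], measured[region])
        for region in predicted
    }


# Synthetic placeholder time series (5 time points per region).
predicted = {
    "region_a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "region_b": [0.0, 1.0, 2.0, 3.0, 4.0],
}
measured = {
    "region_a": [1.0, 3.0, 5.0, 7.0, 9.0],  # tracks the prediction exactly
    "region_b": [2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to the prediction
}

scores = score_regions(predicted, measured)
print(scores)
```

A map of such scores shows where predictions track measurements and where they do not, which is the information needed to decide whether the model is trustworthy for a given region or stimulus class.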
To explore the demo:
https://aidemos.atmeta.com/tribev2
To read the paper:
https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/