Beyond The Transformer

Published in 2017 by researchers at Google Brain, 'Attention Is All You Need' [1] introduced the Transformer neural network architecture, which parallelised previously sequential computation through its multi-head self-attention mechanism, expressed in the language of linear algebra.
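For readers who have not seen it written out, the sketch below gives a minimal NumPy version of the scaled dot-product attention at the heart of that mechanism. The head count, sequence length and dimensions are illustrative assumptions, chosen only to show that every head reduces to batched matrix multiplication, which is exactly what makes the computation so parallelisable.

```python
# Minimal sketch of scaled dot-product attention (after Vaswani et al., 2017).
# Shapes and variable names are illustrative assumptions, not the paper's code.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, applied per head."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ V                                      # (heads, seq, d_v)

# Eight heads over a ten-token sequence: every head, and every token's
# attention over the whole sequence, is just a batched matrix product.
heads, seq_len, d_k = 8, 10, 64
Q = np.random.randn(heads, seq_len, d_k)
K = np.random.randn(heads, seq_len, d_k)
V = np.random.randn(heads, seq_len, d_k)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (8, 10, 64)
```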

The authors trained their model on eight NVIDIA P100 GPUs. Notably, this GPU incorporated HBM2 (3D-stacked, high-bandwidth memory) and NVLink (a high-speed interconnect for data exchange between multiple GPUs), technologies positioned to address the bandwidth and memory-density issues facing high-performance compute workloads at the time. The card did NOT have dedicated matrix-multiplication hardware units, relying instead on highly optimised software libraries and its memory innovations (HBM2) to handle the complex data movement across CUDA cores. 'Tensor Cores' were introduced in 2017 with NVIDIA's Volta architecture, which implemented 4x4 matrix multiply-add instructions at the hardware level: a 20-40x speedup for these workloads.
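To make "matrix multiply-add at the hardware level" concrete, the toy sketch below emulates in NumPy the fused operation a single Tensor Core instruction performs: D = A×B + C on 4x4 tiles, with half-precision inputs accumulated in single precision. On Volta the entire fused operation executes as one hardware instruction; this code is only an illustration of the arithmetic involved.

```python
# Toy emulation of the per-instruction Tensor Core operation: D = A @ B + C
# on 4x4 matrices, FP16 inputs with FP32 accumulation (illustrative only).
import numpy as np

A = np.random.randn(4, 4).astype(np.float16)  # half-precision input tile
B = np.random.randn(4, 4).astype(np.float16)  # half-precision input tile
C = np.random.randn(4, 4).astype(np.float32)  # single-precision accumulator

D = A.astype(np.float32) @ B.astype(np.float32) + C  # 4x4 matrix multiply-add
print(D.shape)  # (4, 4)
```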

A positive feedback loop followed: innovations in parallelised neural architectures drove innovations in the hardware built to accelerate their core mathematical operations, both feeding on ever-larger datasets, resulting in the LLMs of today and an assortment of specialised accelerators (GPUs, TPUs, NPUs, etc.) on which to deploy them.

Importantly, our scientific understanding of the human brain is not static.

While low-level neural processes are highly parallelised, a paper in Nature Communications (Qiuhai Yue et al., 2025) demonstrated that "...there is a central bottleneck of information processing distinct from perceptual and motor stages that limits our ability to carry out two cognitively demanding tasks at once, resulting in the serial queueing of task information processing" (p1) [2].

The authors state "the empirical identification of the serial bottleneck as the MD [multiple-demand] network is not without important implications. In particular, since this network is flexibly engaged across multiple cognitive domains to form a domain-general cognitive operating system, the present results are consistent with the hypothesis that such system's flexibility comes at a cost of limiting the multiplexing of information through that system" (p14).

To restate: the high-level cognitive control, planning and decision-making largely associated with the multiple-demand network in the human brain require a central attention mechanism, one that queues demanding tasks serially.

We have multi-billion-dollar data centres, fed with hundreds of megawatts of power, purpose-built to accelerate foundation models built on multi-head, attention-based neural architectures.

Yet the human brain you are equipped with runs on less power than a dim light bulb (roughly 20 watts), with its higher cognitive functions bottlenecked into inherently serial processing.

This raises questions. Perhaps attention is not all you need?


References
[1] Vaswani, A., et al. (2017). Attention Is All You Need. https://doi.org/10.48550/arXiv.1706.03762
[2] Yue, Q., et al. (2025). Nature Communications. https://doi.org/10.1038/s41467-025-58228-0
