Consolidation of model architecture
October 1, 2023
Consolidation of model architecture:
- in 2000s
- completely independent architectures (ie: vision, speech, NLP, RL), some not even ML-based
- little collaboration
- in 2010s
- diverse architectures, but a transition to ML and specifically neural networks
- easier to collaborate
- in 2020s
- convergence on the transformer as underlying architecture
- extremely simple/flexible modelling framework (ie: train on sequences of words/text, image patches, or state / action / reward transitions; see the sketch after this list)
- research ideas are easily shared and relevant across domains
- reinforcing cycle of progress
- concentrates software, hardware and infrastructure effort on a single architecture
- contrast to biology
- neocortex has a highly uniform architecture across input modalities, indicating that a unified architecture might be an efficient design principle
Distinguishing features between transformers:
- the data
- the input / output specification that maps the problem into and out of a sequence of vectors
- the type of positional encoder and problem-specific structured sparsity pattern in the attention mask
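A small numpy sketch of those last two bullets: the attention computation itself is identical across problems; what changes per problem is the positional encoding added to the inputs and the mask over which positions may attend to which (the function names and the block-sparsity pattern below are illustrative choices, not taken from a particular codebase).

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Classic fixed sinusoidal positional encoding (one common choice)."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10_000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))   # (T, d)

def causal_mask(T):
    """Language modelling: position t may only attend to positions <= t."""
    return np.tril(np.ones((T, T), dtype=bool))

def block_mask(T, block):
    """Example of a problem-specific structured sparsity pattern:
    attention restricted to local blocks (eg: long documents, audio)."""
    idx = np.arange(T) // block
    return idx[:, None] == idx[None, :]

def attention(x, mask):
    """Plain single-head self-attention; identical regardless of the mask used."""
    q, k, v = x, x, x                                    # projections omitted for brevity
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, d = 8, 16
x = np.random.default_rng(0).normal(size=(T, d)) + sinusoidal_positions(T, d)
lm_out  = attention(x, causal_mask(T))    # decoder-style, for next-token prediction
loc_out = attention(x, block_mask(T, 4))  # same attention code, different mask
```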
In-context learning:
- special-purpose computers: previous neural network architectures
- general-purpose computers: the transformer
- two ingredients for general-purpose computers
- appropriate architecture
- a training objective hard enough to force the optimisation to converge on such a computer in the weight space of the network
- transformer architecture
- language modelling (next-word prediction) is a great objective: simple to define, easy to collect data for at scale, and implicitly multi-task across domains (sketched after this list)
- ability to learn via activations at runtime rather than via changes to the weights of the model
- reconfigurable at runtime to run natural language programs (see the prompt example after this list)
- emergent attributes, observed only at scale
- the core unlock: a general-purpose neural-network computer reached via simple, scalable objectives with a strong training signal
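The language-modelling bullet above, sketched in numpy with a toy vocabulary and random logits standing in for a real model: every position predicts a distribution over the next token, and the loss is just the average negative log-probability of the token that actually followed.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """logits: (T, vocab_size) raw model outputs for a sequence of T tokens.
    token_ids: the T observed tokens. Position t is scored against token t+1;
    that one-position shift is the whole next-word-prediction objective."""
    preds, targets = logits[:-1], np.asarray(token_ids)[1:]
    preds = preds - preds.max(axis=-1, keepdims=True)                  # stability
    logp = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))   # log-softmax
    return -logp[np.arange(len(targets)), targets].mean()

# toy example: 5 tokens from a 10-token vocabulary, random logits as the "model"
rng = np.random.default_rng(0)
tokens = [3, 1, 4, 1, 5]
print(next_token_loss(rng.normal(size=(len(tokens), 10)), tokens))
```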
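And a sketch of the "learn via activations, not weights" bullets: the model is frozen, the few-shot examples placed in the context window act as the program, and generation executes it purely in the forward pass. The `model.generate` call below is a hypothetical stand-in, not a specific API.

```python
# Hypothetical API: the weights never change; the task is specified
# entirely inside the context window at runtime.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe => girafe en peluche\n"
    "cheese sandwich =>"
)
# completion = model.generate(prompt)
# expected completion: " sandwich au fromage"
```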