A life is worth 10^x tokens
Information is a precursor to intelligence: there must exist something to observe, understand, and learn patterns from. The architecture behind frontier LLMs is the transformer decoder (TFD) [https://arxiv.org/pdf/1706.03762]; it takes sequences of data and learns to understand and generate that data. These are information processing and generating systems. They are, or soon will be, the best programmers in the world, the best translators, the best pilots, the best information processing systems. Historically humans have held these titles, but the gap is quickly shrinking. Humans are general information processing and generating systems: they read and write books, study the universe, and pursue the unknown.
How do these two systems compare? What are the similarities and differences?
Humans operate on raw sensory inputs: sensors (eye, ear, skin, …) produce readings which are compressed and presented to the brain as useful information. The brain operates on this information and produces outputs (speech, movement, …). TFDs should strive for a similar design philosophy.
The first stage of training a TFD is to create a static dictionary of tokens, a critical design choice that determines how the model expresses information. Tokens are not themselves a useful unit of information; they encode one, the embedding. An embedding is a continuous vector of semantic meaning that stores information about a concept. Tokens are an inefficiency of the TFD design process: they require learning a separate offline algorithm to compress bytes of information (BPE, https://en.wikipedia.org/wiki/Byte-pair_encoding) and should be folded into the training recipe. Byte Latent Transformers (BLT) [https://arxiv.org/abs/2412.09871] operate on sequences of bytes rather than tokens: text is broken up into chars (bytes) which the BLT learns to encode into concepts natively.
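To make the "offline algorithm" concrete, here is a minimal sketch of the core BPE loop: repeatedly merge the most frequent adjacent symbol pair into a new symbol. The toy corpus and merge count are illustrative; real tokenizers run this over bytes at vastly larger scale.

```python
# Minimal byte-pair encoding (BPE) sketch: repeatedly merge the most
# frequent adjacent symbol pair. Toy corpus; real tokenizers operate on
# raw bytes with tens of thousands of merges.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as {word-as-symbol-tuple: frequency}; start from single characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
for _ in range(3):  # learn 3 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged:", pair)
```

The learned merge table is frozen before TFD training begins, which is exactly the offline, static step that byte-level approaches like BLT absorb into the model itself.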
VLMs have historically trained with text as a “first-class citizen”: they are initialised from a frozen pre-trained language model and then augmented to understand other modalities. The world is, however, significantly more than just text; all modalities should be learnt jointly. A recent paper from Meta trains a TFD from scratch on text, image, video, and action data, levelling the modality playing field [https://arxiv.org/abs/2603.03276]. In this form the concept of a token is inaccurate: each element in the sequence is a vector of semantic information, abstracted away from the modality that generated it. The same model learns to process information in latent space, generating text, image, and video outputs. While the architecture they present is somewhat clumsy (e.g. manually switching decoder heads at inference time), it is a step towards a general system.
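The head-switching pattern can be sketched as follows. The names, sizes, and simple linear heads here are illustrative assumptions, not the paper's actual architecture:

```python
# Sketch: modality-agnostic latent vectors routed through per-modality
# decoder heads. Toy linear heads with random weights, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared latent dimension

heads = {  # one decoder head per output modality
    "text": rng.normal(size=(d, 100)),  # -> vocabulary logits
    "image": rng.normal(size=(d, 3)),   # -> RGB values
}

def decode(latent, modality):
    """Route a shared latent vector through the manually chosen modality head."""
    return latent @ heads[modality]

latent = rng.normal(size=(d,))
print(decode(latent, "text").shape, decode(latent, "image").shape)  # (100,) (3,)
```

The clumsiness is visible in the `modality` argument: something outside the model must decide which head to use at inference time, rather than the model choosing its output space itself.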
To make further progress, TFDs should operate on raw signal streams (i.e. bytes/bits rather than tokens). To make this computation tractable across multiple modalities we need encoders that compress signals into useful representations → concepts. François Chollet gives a nice intuition: if you were to double the resolution of the human eye, would the amount of information received from the eye double? In terms of raw bits, yes; in terms of useful concepts to operate on, no. Compression into useful concepts is necessary, and we fundamentally care about these concepts, not the raw signal that generated them. Perceiver [https://arxiv.org/abs/2103.03206] addresses this by learning latent representations from streams of bytes, encoding raw signals.
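The core trick can be sketched in a few lines: a small, fixed set of latent vectors cross-attends to an arbitrarily long raw input stream, so downstream compute scales with the latent budget rather than the input length. The sizes below are illustrative, not the paper's configuration:

```python
# Perceiver-style compression sketch: 64 latent vectors read a 10,000-element
# raw input stream via single-head cross-attention. Illustrative sizes only.
import numpy as np

rng = np.random.default_rng(0)
d = 32                               # feature dimension
raw = rng.normal(size=(10_000, d))   # long stream of raw "byte" embeddings
latents = rng.normal(size=(64, d))   # small, fixed set of learned latents

def cross_attend(q, kv):
    """Queries (latents) attend over keys/values (the raw stream)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])        # (64, 10000)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the stream
    return weights @ kv                             # (64, d)

compressed = cross_attend(latents, raw)
print(compressed.shape)  # (64, 32): the model now operates on 64 concepts
```

Doubling the input length doubles only this one cross-attention, not the depth of processing on the latents, mirroring the eye-resolution intuition: more raw bits, the same concept budget.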
Additionally, we should not be too concerned about the specification of individual sensors, as long as they are diverse. Humans have 5 primary senses but not all are necessary; see Helen Keller (https://en.wikipedia.org/wiki/Helen_Keller). We need some set of diverse sensors, but their specification is not strict. We compress raw sensory signals into concepts of information, the TFD learns and generates this latent sequence, and decoders then project back to output spaces (text, image, video, etc.).
Suppose we designed such a system, how many concepts of information would it need?
It takes ~20 years for a human to become economically valuable (a rough proxy for intelligence): 20 years * 365 days * 18 waking hours * 60 minutes * 60 seconds ≈ 4.7 * 10^8 wakeful seconds. Estimating the number of concepts experienced per second is non-trivial; not all information is equal. Sitting in a lecture theatre produces more concepts of information than watching paint dry, despite the raw bits being similar. The novelty of the information matters: as we age we are presented with less surprising data, so our information gain per second falls. A child might acquire 10^1 useful concepts per second, while by age 20 this might drop to 10^-2. If we assume an average of 10^0 useful concepts per second, then a human accumulates ~4.7 * 10^8 concepts of information before becoming economically valuable.
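The back-of-envelope above, written out (the 18 waking hours and 10^0 concepts/second are the essay's assumed averages, not measurements):

```python
# Wakeful seconds in 20 years, at an assumed 18 waking hours per day.
wakeful_seconds = 20 * 365 * 18 * 60 * 60
print(f"{wakeful_seconds:.1e}")  # 4.7e+08

# At an assumed lifetime average of 10^0 useful concepts per second:
concepts_per_second = 1
human_concepts = wakeful_seconds * concepts_per_second  # ~4.7e8 concepts
```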
To find an equivalence for TFDs we can construct a toy conversion using the number of training datapoints. Suppose a text dataset contains 10^11 tokens. If we estimate that each sentence provides a useful concept, and a sentence is ~10^1 tokens, our 10^11-token dataset provides 10^10 concepts. Similar heuristics apply to other modalities: for example, each image provides ~10^1 concepts and each video ~10^2 concepts. We assume equal concept counts across modalities (though this is likely not the case, as humans are vision dominant). Current LLMs achieve ~human-like language performance with text datasets of ~10^10 concepts (~10^11 tokens), so given that humans have 5 primary senses we would need 5 * 10^10 concepts of information, or ~5 * 10^11 token-equivalents.
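The toy conversion as arithmetic (all per-modality rates are the heuristic estimates above, not measured quantities):

```python
# Toy concept accounting across modalities, using the essay's heuristics.
tokens_per_concept = 10            # ~10^1 tokens per sentence, one concept each
text_tokens = 10**11
text_concepts = text_tokens // tokens_per_concept    # 10^10 concepts

n_modalities = 5                   # five primary senses, equal concept budgets
total_concepts = n_modalities * text_concepts        # 5 * 10^10 concepts
token_equivalents = total_concepts * tokens_per_concept  # 5 * 10^11 tokens
print(f"{total_concepts:.0e} concepts ~ {token_equivalents:.0e} token-equivalents")
```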
This would imply that our current datasets are within a couple of orders of magnitude of what humans experience by the age of ~20, so why can’t current models do a lot of the things humans can (e.g. learn to drive with ~16 hours of experience)? Current LLMs are already significantly more intelligent than humans across a range of tasks: knowledge breadth, coding skill, translation. However, this intelligence is narrow and sharp (i.e. not general), and current LLMs are text dominant. If training data were distributed evenly across modalities (sensors), would we expect broad human-level performance? Also no, for the following reasons:
- Humans are embodied: they actively generate data rather than passively learning from static data as LLMs do. This plays a significant role in the quality of information; if you can interact with the world you can hunt for “surprising” information, making your acquired data much more valuable.
- Humans have better learning algorithms: they can learn from concepts significantly more efficiently than current ML training recipes allow.
If we were to solve these issues, would we expect this hypothetical AI model, given 5 * 10^10 concepts of information, to be a general information processing system? P(yes) > 0.