The AI research community continues to find new ways to improve large language models (LLMs), the latest being a new architecture introduced by scientists at Meta and the University of Washington.
Their technique, Byte Latent Transformer (BLT), could be the next important paradigm for making LLMs more versatile and scalable.
BLT addresses one of the longstanding problems of LLMs by operating at the byte level instead of on tokens. It could open the way for new models that can process rare data, are robust to changes in their input and don’t rely on fixed vocabularies.
Tokens vs bytes
Most LLMs are trained based on a static set of tokens, predefined groups of byte sequences.
During inference, a tokenizer breaks the input sequence down into tokens before passing it to the LLM.
This makes the models more efficient in using compute resources but also creates biases that can degrade the model’s performance when it encounters text that its vocabulary does not cover well.
For example, many leading language models can become slower and more costly when faced with languages that are underrepresented on the web, because their words were not included in the model’s token vocabulary. Misspelled words can also cause the input to be tokenized incorrectly. And tokenized models can struggle with character-level tasks, such as manipulating sequences of characters.
Moreover, modifying the vocabulary requires the model to be retrained. And expanding the token vocabulary can require architectural changes to the model to accommodate the added complexity.
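The vocabulary-coverage problems described above can be illustrated with a toy tokenizer. The sketch below is not how any production tokenizer works; it assumes a tiny, hypothetical vocabulary (`VOCAB`) and a simple greedy longest-match rule with single-byte fallback, just to show how a misspelling or an uncovered script inflates the token count.

```python
# Hypothetical five-entry vocabulary, for illustration only.
# Real tokenizers use tens of thousands of learned entries.
VOCAB = {b"trans", b"former", b"token", b"the", b" "}


def greedy_tokenize(text: str, vocab: set = VOCAB) -> list:
    """Split text into the longest matching vocabulary entries.

    Any byte that no vocabulary entry covers becomes its own
    one-byte token (an assumed fallback rule for this sketch).
    """
    data = text.encode("utf-8")
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(data) - i), 0, -1):
            piece = data[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No entry matched: fall back to a single byte.
            tokens.append(data[i:i + 1])
            i += 1
    return tokens


if __name__ == "__main__":
    print(len(greedy_tokenize("transformer")))   # 2 tokens
    print(len(greedy_tokenize("trnasformer")))   # 6 tokens (misspelled)
    print(len(greedy_tokenize("трансформер")))   # 22 tokens (uncovered script)
```

The same word costs 2 tokens when the vocabulary covers it, 6 when two letters are swapped, and 22 when it is written in a script the vocabulary has never seen, which is the kind of cost gap the article describes for underrepresented languages.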
Alternatively, LLMs can be trained directly on single bytes, which can solve many of the above-mentioned problems. However, byte-level LLMs are prohibitively costly to train at scale and can’t handle very long sequences, which is why tokenization remains an essential part of current LLMs.
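A rough back-of-the-envelope sketch shows why byte-level sequences get expensive. It rests on two assumptions that are not from the article: that the model uses standard self-attention, whose pairwise comparisons grow with the square of the sequence length, and that a token covers about 4 bytes on average (the `bytes_per_token` parameter, a hypothetical round number).

```python
def attention_pairs(seq_len: int) -> int:
    """Number of position pairs compared by full self-attention."""
    return seq_len * seq_len


def byte_vs_token_cost(text: str, bytes_per_token: float = 4.0) -> dict:
    """Compare sequence length and attention pairs for bytes vs tokens.

    bytes_per_token is an assumed average, not a measured value.
    """
    n_bytes = len(text.encode("utf-8"))
    n_tokens = max(1, round(n_bytes / bytes_per_token))
    return {
        "byte_len": n_bytes,
        "token_len": n_tokens,
        "byte_pairs": attention_pairs(n_bytes),
        "token_pairs": attention_pairs(n_tokens),
        "pair_ratio": attention_pairs(n_bytes) / attention_pairs(n_tokens),
    }


if __name__ == "__main__":
    stats = byte_vs_token_cost("a" * 400)
    print(stats["byte_len"], stats["token_len"], stats["pair_ratio"])
    # 400 100 16.0
```

Under these assumptions, a sequence that is 4 times longer in bytes needs 16 times as many attention comparisons, which gives a sense of why naive byte-level models struggle with long inputs.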
Byte latent transformer (BLT)
Byte Latent Transformer (BLT) is a tokenizer-free architecture that learns directly from raw bytes and matches the performance of tokenization-based models. To solve the inefficiencies of other byte-level LLMs, BLT uses a dynamic method that groups bytes based on the level of information th …