As enterprises continue to adopt large language models (LLMs) in various applications, one of the key challenges they face is improving the factual knowledge of models and reducing hallucinations. In a new paper, researchers at Meta AI propose “scalable memory layers,” which could be one of several possible solutions to this problem.
Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute resources. The architecture is useful for applications where you can spare extra memory for factual knowledge but also want the inference speed of nimbler models.
Dense and memory layers
Traditional language models use “dense layers” to encode vast amounts of information in their parameters. In dense layers, all parameters are used at their full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but increasing their size requires additional computational and energy resources.
In contrast, for simple factual knowledge, much simpler layers with associative memory architectures would be more efficient and interpretable. This is what memory layers do. They use simple sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Sparse layers take up more memory than dense layers but only use a small portion of the parameters at once, which makes them much more compute-efficient.
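To make the contrast concrete, here is a minimal sketch, in PyTorch, of a dense feed-forward layer next to a toy key-value memory layer. This is not Meta’s implementation; the slot count, top-k value and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseLayer(nn.Module):
    """Dense feed-forward block: every parameter participates in every forward pass."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.relu(self.up(x)))

class MemoryLayer(nn.Module):
    """Sparse key-value memory: a query reads only the top-k of many stored slots,
    so most parameters sit idle on any single forward pass."""
    def __init__(self, dim, num_slots, k=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)    # lookup keys
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # stored knowledge
        self.k = k

    def forward(self, x):                                # x: (batch, dim)
        scores = x @ self.keys.t()                       # similarity to every key
        top_scores, top_idx = scores.topk(self.k, dim=-1)  # keep only k best-matching slots
        weights = F.softmax(top_scores, dim=-1)          # normalize over the selected slots
        selected = self.values[top_idx]                  # (batch, k, dim) gathered values
        return (weights.unsqueeze(-1) * selected).sum(dim=1)

x = torch.randn(2, 64)
out = MemoryLayer(dim=64, num_slots=10_000, k=4)(x)      # 10k slots stored, only 4 read per query
```

The memory layer can hold many more parameters than the dense block, but each query touches only a handful of slots, which is what keeps its inference cost low despite the larger memory footprint.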
Memory layers have existed for several years but are rarely used in modern deep learning architectures, largely because they are not optimized for current hardware accelerators.
Current frontier LLMs usually use some form of “mixture of experts” (MoE) architecture, which uses a mechanism vaguely similar to memory layers. MoE models are composed of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which experts are activated based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts.
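For comparison, the routing idea behind MoE layers can be sketched roughly as follows; the expert count, top-k value and module names are assumptions for illustration, not PEER’s or any production router’s design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a router scores experts per token and
    only the top-k experts run, so compute stays roughly constant as experts grow."""
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)                        # scores each expert
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.k = k

    def forward(self, x):                                # x: (batch, dim)
        logits = self.router(x)
        top_scores, top_idx = logits.topk(self.k, dim=-1)  # route each token to k experts only
        weights = F.softmax(top_scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                       # combine the chosen experts' outputs
            idx = top_idx[:, slot]
            for e in idx.unique():
                mask = idx == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

y = TopKMoE(dim=64)(torch.randn(4, 64))
```

Like the memory layer sketched above, the router activates only a small slice of the model per input; the difference is that MoE experts are small networks selected per task-like pattern, while memory slots are individual key-value pairs looked up by similarity.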