As agentic AI moves from experiments to real production workloads, a quiet but serious infrastructure problem is coming into focus: memory. Not compute. Not models. Memory.

Under the hood, today’s GPUs simply don’t have enough space to hold the Key-Value (KV) caches that modern, long-running AI agents depend on to maintain context. The result is a lot of invisible waste — GPUs redoing work they’ve already done, cloud costs climbing, and performance taking a hit. It’s a problem that’s already showing up in production environments, even if most people haven’t named it yet.

At a recent stop on the VentureBeat AI Impact Series, WEKA CTO Shimon Ben-David joined VentureBeat CEO Matt Marshall to unpack the industry’s emerging “memory wall,” and why it’s becoming one of the biggest blockers to scaling truly stateful agentic AI — systems that can remember and build on context over time. The conversation didn’t just diagnose the issue; it laid out a new way to think about memory entirely, through an approach WEKA calls token warehousing.

The GPU memory problem

“When we’re looking at the infrastructure of inferencing, it is not a GPU cycles challenge. It’s mostly a GPU memory problem,” said Ben-David.

The root of the issue comes down to how transformer models work. To generate responses, they rely on KV caches that store contextual information for every token in a conversation. The longer the context window, the more memory those caches consume, and it adds up fast. A single 100,000-token sequence can require roughly 40GB of GPU memory, noted Ben-David.

That wouldn’t be a problem if …
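For readers who want to sanity-check the roughly 40GB figure, KV-cache size grows linearly with sequence length: each token stores a key and a value vector per layer per KV head. The sketch below works that out in Python for an assumed 70B-class model with grouped-query attention; the layer count, head count, head dimension, and precision are illustrative assumptions, not details from the talk, but they land in the same ballpark.

```python
# Back-of-the-envelope KV-cache sizing. The model parameters below are
# assumptions for illustration (a 70B-class model with grouped-query
# attention), not figures quoted in the article.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 80,       # assumed transformer layers
                   num_kv_heads: int = 8,      # assumed GQA key/value heads
                   head_dim: int = 128,        # assumed per-head dimension
                   bytes_per_value: int = 2):  # fp16/bf16 storage
    # Per token: a key and a value vector (factor of 2) for every layer
    # and every KV head, each of length head_dim.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return seq_len * per_token


if __name__ == "__main__":
    gb = kv_cache_bytes(100_000) / 1e9
    print(f"~{gb:.0f} GB of KV cache for a 100,000-token sequence")  # ~33 GB
```

Different architectures shift the number — more KV heads, higher precision, or full multi-head attention push it well past 40GB — but the point stands: a single long-context sequence can consume a large share of a GPU’s memory on its own.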