What Is Power Retention?

Authors: Jacob Buckman, Carles Gelada, Sean Zhang

Published: September 23, 2025

Power retention is a new architecture developed by Manifest that enables AI to handle millions of tokens at a fraction of today’s cost. It replaces attention, the de facto standard at the heart of today’s Transformer-based large language models, and unlocks long-context applications that were previously prohibitively expensive or impractically slow.

The Problem With Transformers

Transformers are an incredible architecture. Their invention represented a massive leap forward in AI progress. But that does not make them perfect. The Transformer architecture has limitations, which are now beginning to come into focus.

The fatal flaw of Transformers is that they remember the past in perfect detail.

In theory, perfect detail might seem desirable. In practice, it is crippling. Imagine telling a doctor about your symptoms of slurred speech and facial droop, and then patiently waiting while he spends thirty hours mentally replaying every detail of his past – childhood, adolescence, medical school, career, all the way up to the present moment – before diagnosing you as being at risk of an imminent stroke. That is exactly how Transformers process information. A human who operates this way would not be useful in most jobs, and yet this is how the current generation of AIs is designed to operate.1

Our new technology changes this paradigm and makes AI memory look a lot more like human memory. Power retention equips LLMs with a fixed-size memory, and because that space is limited, it becomes precious. Each new experience is carefully incorporated into the worldview of the AI. There are certain aspects of human intelligence – such as ignoring irrelevant details, synthesizing evidence to reach conclusions, or reminiscing about the past – that become crucial for retention models; aspects that Transformers completely sidestep in favor of naive memorization. The upshot is that, thanks to power retention, our aforementioned “doctor” will soon be able to offer a diagnosis as quickly at the end of his career as at the beginning, but better informed by a rich history of experiences.

This difference represents an enormous unlock in practical capabilities for AI. No longer will you need to start a brand-new session for each interaction. Instead, all of your interactions will form a continual stream, shaping the AI’s understanding of your needs and enabling it to assist you more effectively. No longer will agentic platforms need to break up tasks before running out of context and getting confused – a single agent will be able to remember all aspects of the task and see it through to its conclusion. No longer will you be forced to use a patchwork of RAG heuristics to wade through a sea of tokens and pray that the relevant ones make it into the context – power retention means that all the information can simply be made available, and the AI will decide for itself which aspects are relevant when constructing its response. For those who dream of achieving AGI, it is clear that this technology represents a necessary stepping stone on the path to general human-level capabilities.

Why Power Retention?

In our architecture, retention serves the same purpose as a Transformer’s attention layer, in that it is a technique for synthesizing the information contained in the context (the past) into good predictions about upcoming tokens (the future). Attending to the past requires remembering it – that’s why Transformers, which use attention, are memorization machines. In contrast, retention learns to identify which parts of the past are important and retain only the most relevant ones. Retention can also be understood as a unification of attention with an older idea, recurrence.2 By combining the best parts of attention (hardware-friendly design) with the best parts of recurrence (judicious use of fixed-size memory), we get retention.
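
To make that combination concrete, here is a minimal NumPy sketch of the two computational patterns. This is our own illustration, not the Manifest kernels, and the function names are ours: attention reads from a cache that grows with every token, while a retention-style layer folds each token into a fixed-size state and reads from that.

```python
import numpy as np

def attention_step(q_t, K_cache, V_cache):
    """Attention: compare the query against every past token, so the cache
    (and the per-step cost) grows with the length of the context."""
    scores = K_cache @ q_t                      # one score per remembered token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                    # weighted sum over the full history

def retention_step(q_t, k_t, v_t, S, z):
    """Retention-style recurrence: fold the new token into a fixed-size state S,
    then read out with constant cost, no matter how long the context is.
    (In practice, queries and keys would first pass through a non-negative
    feature map; this sketch omits that for brevity.)"""
    S = S + np.outer(k_t, v_t)                  # absorb the new (key, value) pair
    z = z + k_t                                 # running normalizer
    y_t = (q_t @ S) / (q_t @ z + 1e-6)          # read the state with the query
    return y_t, S, z
```

The point of the contrast is that S and z never grow: each step costs the same and uses the same memory whether one token or one million tokens came before.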

But it is the power in power retention that represents the heart of the breakthrough. Retention has been studied3 since at least 2016, but until now, it has been mostly relegated to the status of academic curiosity. The reason is simple: in the era of LLMs, scale is king, and no previous retention model could be scaled up as well as Transformers can. Our critical insight was to discover that a specific form of retention, derived from a mathematical idea known as the symmetric power, has incredible scaling properties. The enormous potential of power retention became clear, and the team at Manifest dove into action to make it practical and usable.
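
As a rough intuition for what the symmetric power buys (a toy sketch under our own naming, not the Manifest implementation): a degree-p score of the form (q · k)^p can be rewritten exactly as an ordinary dot product between fixed-size feature vectors, which is precisely the kind of computation a fixed-size recurrent state can accumulate.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import factorial

def sym_power_embed(x, p=2):
    """Degree-p symmetric power embedding: all degree-p monomials of x, scaled by
    square roots of multinomial coefficients, so that
    sym_power_embed(q, p) @ sym_power_embed(k, p) == (q @ k) ** p."""
    d = len(x)
    feats = []
    for idx in combinations_with_replacement(range(d), p):
        coeff = factorial(p)
        for i in set(idx):
            coeff //= factorial(idx.count(i))   # multinomial coefficient p! / prod(alpha_i!)
        feats.append(np.sqrt(coeff) * np.prod(x[list(idx)]))
    return np.array(feats)

# Sanity check: a power of a dot product equals a dot product of embeddings.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(4), rng.standard_normal(4)
assert np.isclose(sym_power_embed(q, 2) @ sym_power_embed(k, 2), (q @ k) ** 2)
```

Because the embedded vectors have a fixed dimension that depends only on the head size and the power p, they can be accumulated into a fixed-size state rather than compared against every past token individually; the hard part, and the contribution of our kernels, is making that computation fast on real hardware.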

We’ve developed custom GPU kernels for power retention training and inference, which we have now open-sourced. Anyone training long-context models can install them right now and see immediate speedups.4 If you are interested in understanding power retention in more detail, or in seeing rigorous comparisons between power retention, attention, and other approaches, we have also released a paper titled Scaling Context Requires Rethinking Attention. And we are releasing an open-source, open-weights power retention coding model, PowerCoder-3B.

If you want to stay ahead of the curve on the next generation of foundational AI models, join our community Discord and subscribe to our mailing list below.

Footnotes

  1. The only reason why current Transformer-based chatbots don’t take thirty hours to respond is that they are very short-lived. Since they have fewer experiences to mentally replay, they can respond quickly enough to be tolerable, but the limited context hurts their usefulness.↩︎

  2. Before Transformers, the best sequence architectures were “recurrent neural networks”.↩︎

  3. Although by other names, e.g. this work.↩︎

  4. Beyond the inherent benefits of retention over attention, our power retention kernels are implemented so efficiently that they achieve higher GPU utilization than Flash Attention.↩︎