AI systems today are optimised for direct, single-step performance on digital-assistant tasks. An agent might be asked to conduct deep research on a specific topic, prepare a brief for an upcoming meeting, or act as a customer-service agent that answers questions and manages flights in accordance with a particular airline's policies. Agents are even beginning to excel at increasingly complex tasks beyond the realm of simple "app layer" companies (i.e. companies that focus primarily on filling economically viable gaps with applied-AI, prompt-engineering-focussed products). The crucial distinction, however, is that most of these tasks are ungrounded: they carry very little associated context and rarely require any prior context or understanding.
This lack of grounding implies that most of the tasks being solved right now are still low-hanging fruit, since nearly all of the variables and understanding an agent needs can be imparted in broad, general strokes during pretraining. Most human knowledge work is not like this. A natural next question, then, is how we might modify agents to use and reason across their memories more effectively, not unlike the questions we ask about our own memory. Specifically, as touched upon in this blogpost:
- How might we emulate humans' ability to determine which information from any given task is most valuable to remember? (generalisation)
- How do we effectively associate this information with past knowledge we have already crystallised, or connect it to other new concepts to form new "circuits of understanding"? (integration)
The platonic ideal of a memory system
To approach some semblance of this platonic ideal, we must first closely examine the capabilities of existing models, in order to understand exactly what their memories need to contain for them to maximise the utility of their present skills.
I. Verifiable tasks
Code generation serves as a representative starting point. Consider a task of the form: "write a function that sorts a list of integers in ascending order." The task is unambiguous, the solution space is well-defined, and — critically — the correctness of the output is verifiable by execution. Even substantially more complex variants of this, such as implementing a web scraper against a specific schema or diagnosing a bug in an existing function, share the same essential property: there exists an objective ground truth against which any candidate output can be evaluated.
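As a concrete (and deliberately trivial) sketch, the verifier for such a task is just execution against known cases. The function and case names below are illustrative, not a reference implementation:

```python
# A verifiable task: the specification itself doubles as the reward signal,
# because any candidate output can be scored objectively by execution.

def sort_ascending(xs: list[int]) -> list[int]:
    """A candidate solution, as a model might produce it."""
    return sorted(xs)

def verify(candidate, cases):
    """Execution-based verifier: fraction of (input, expected) cases passed."""
    return sum(candidate(inp) == out for inp, out in cases) / len(cases)

test_cases = [([3, 1, 2], [1, 2, 3]), ([], []), ([5, 5, 1], [1, 5, 5])]
reward = verify(sort_ascending, test_cases)  # dense, immediate, unambiguous
```

The point is not the sorting itself but the shape of the reward: a scalar available the instant the output exists, with no dependence on who asked or why.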
Current models exhibit strong performance on tasks of this form, and the explanation follows directly from the structure of the training pipeline that produced them. The dominant paradigm — pretraining on a large text corpus followed by fine-tuning via RLHF and related techniques — is well-suited to dense-reward, single-step tasks. As CollabLLM demonstrates, models are typically optimised to maximise the quality of a single response, rather than a full trajectory of responses. More precisely, the reward signal is $r_t$ at a single timestep $t$, which creates a direct inductive bias toward maximising the immediate utility of each individual output.
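Schematically (a restatement of the single-step bias, not CollabLLM's actual objective), the trained objective scores each response in isolation, while multi-turn quality is governed by the trajectory return:

```latex
J_{\text{single}}(\pi) = \mathbb{E}_{\pi}\,[\, r_t \,]
\qquad \text{vs.} \qquad
J_{\text{traj}}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}\right]
```

where $\gamma$ discounts future turns; optimising the left-hand objective assigns zero weight to every $k > 0$ term on the right.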
Benchmark structure compounds this further. The dominant evaluation benchmarks — HumanEval, MMLU, GSM8K, SWE-bench — are fundamentally single-shot: a prompt is issued, an output is produced, and that output is scored in isolation. There is no intermediate reward, no signal on the quality of the reasoning process, and no credit assigned across turns. The result is a training objective that rewards clean, complete, happy-path solutions — precisely the kind that verifiable tasks demand. For this class of task, the alignment between training objective and task structure is essentially exact: the ground truth is static, the reward is immediate, and the model's inductive biases are well-matched to what the task requires.
II. Non-verifiable tasks or variable reward structures
The limitations of this regime become apparent as soon as we depart from it. Consider a qualitatively different class of task: "suggest a research direction for my dissertation, given the following constraints," or "evaluate whether I should accept this job offer based on the following context." These tasks are non-verifiable in the strict sense — there is no objective ground truth against which a response can be scored, and what constitutes a correct answer is contingent on the user's particular context, mental models, values, and history, most of which are unavailable to the model.
The characteristic failure mode here is not outright incorrectness but something subtler: responses that are structurally coherent and superficially plausible, yet systematically miscalibrated — overconfident where uncertainty is warranted, underspecified where detail is required, or insensitive to the longer-horizon consequences implicit in the task.
The CollabLLM analysis is again instructive. A model trained to maximise the quality of a single response learns to optimise for surface-level helpfulness — outputs that appear maximally useful when evaluated in isolation. For tasks requiring longer-horizon reasoning, however, the optimal action at time $t$ may not be an immediately helpful response at all; it may be a clarifying question, an explicit acknowledgement of uncertainty, or an integration of context accumulated across prior turns. Under a single-step reward structure, these behaviours receive no positive signal, precisely because they trade immediate reward for future reward — and no component of the training objective captures that trade.
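To make that trade concrete, here is a toy numeric sketch with invented per-turn rewards. Nothing about the numbers is empirical; they only illustrate how the two objectives can rank the same two strategies differently:

```python
# Toy illustration (invented numbers): under a single-step reward an
# immediately "helpful" answer beats a clarifying question; under a
# discounted trajectory return, the ordering flips.

GAMMA = 0.9  # discount factor over future turns

# Per-turn rewards for two strategies across a three-turn interaction.
answer_now = [0.8, 0.3, 0.3]     # plausible answer, degraded follow-ups
clarify_first = [0.2, 0.9, 0.9]  # low immediate reward, better later turns

def trajectory_return(rewards, gamma=GAMMA):
    """Discounted sum of rewards over the whole interaction."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# A single-step objective sees only turn 0, where answer_now wins;
# the trajectory objective prefers clarify_first.
```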
The evaluation landscape reinforces this gap. There are effectively no widely-adopted long-horizon, intermediate-reward benchmarks in current use. Those that exist are poorly standardised, infrequently applied, and absent from the hill-climbing targets that drive model development in practice. As a result, the capacity to reason coherently across turns and integrate consequences over time remains largely out-of-distribution — a behaviour the model was never incentivised to develop.
A final and more subtle point: RL-based fine-tuning maximises expected reward in the statistical sense — the average reward over the full distribution of users and contexts. The practical consequence is a model that is well-calibrated for the population in aggregate, but poorly suited to strongly optimising for any particular user's preferences or contextual constraints. The optimal policy under this objective is to hedge across the distribution, producing responses that score well in expectation rather than responses that are maximally appropriate for a specific individual. For non-verifiable tasks — where correctness is inherently contextual and individual — this constitutes a structural mismatch between training objective and task requirement.
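A toy sketch of this hedging effect, with invented utilities and response labels:

```python
# Two user types with opposite preferences over three candidate response
# styles. The population-expected-reward optimum is the hedged response,
# even though no individual user prefers it. All values are invented.

rewards = {
    "terse":   {"type_a": 1.0, "type_b": 0.0},
    "verbose": {"type_a": 0.0, "type_b": 1.0},
    "hedged":  {"type_a": 0.6, "type_b": 0.6},
}
population = {"type_a": 0.5, "type_b": 0.5}  # user-type distribution

def expected_reward(response):
    """Reward averaged over the user population."""
    return sum(p * rewards[response][u] for u, p in population.items())

best = max(rewards, key=expected_reward)  # the hedge wins in expectation
```

For either user type individually, the optimal response scores 1.0; the population-optimal policy leaves 0.4 of that on the table for everyone.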
III. Actions and consequences
The contrast between sections I and II points to a single underlying variable: the tractability of consequence attribution. In verifiable tasks, the consequence of an action is immediate, unambiguous, and context-independent — the code either executes correctly or it does not. In non-verifiable tasks, the consequence is delayed, diffuse, and conditioned on information the model does not have access to. The model's action policy $\pi$ cannot improve under these conditions, not because it lacks the representational capacity to reason, but because the signal required to update that policy arrives at a temporal distance that no component of the current training objective captures.
This framing is developed in depth in one of my prior posts on action-consequence as a learning policy, but the key formalisation is worth restating here. Any task can be represented as a sequence $(s_0, a_0, c_0),\ (s_1, a_1, c_1),\ \ldots$ where each action $a_t$ generates a consequence $c_t$ conditioned on the state transition from $s_t$ to $s_{t+1}$. Effective learning over this sequence requires two distinct operations: correctly extracting the information content of each consequence, and correctly assigning credit, i.e. searching across the prior sequence of $(a, c)$ pairs to determine which actions at which timesteps should be updated and in what direction (the same generalisation and integration, from Jessy Lin's post, that we addressed above).
These case studies make clear that current models fail at the second operation in particular. Training under a single-step reward provides no mechanism for credit assignment across turns; the model has no way to attribute a consequence observed at time $t$ to an action taken at time $t - k$ for any non-trivial $k$. For verifiable tasks, this is largely immaterial: the consequence is immediate and $k \approx 0$, so credit assignment is trivial. For non-verifiable tasks, $k$ is variable, consequences are context-dependent, and the absence of any multi-turn credit assignment mechanism is precisely the structural cause of the failure modes described in section II. A memory system that supports effective performance across both regimes must therefore provide the substrate for this kind of temporal credit attribution, not as a post-hoc reasoning capability appended to an otherwise unchanged architecture, but as a first-class concern.
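A minimal sketch of what that second operation looks like mechanically. The names and the exponential decay kernel are assumptions for illustration, standing in for whatever credit-assignment rule a real system would learn:

```python
# Given a trajectory of (state, action, consequence) tuples, attribute a
# delayed consequence observed at step t back to the actions at t - k.

from dataclasses import dataclass

@dataclass
class Step:
    state: str
    action: str
    consequence: float  # scalar signal; 0.0 until feedback arrives

def assign_credit(trajectory, t, decay=0.8):
    """Spread the consequence observed at step t across prior actions,
    geometrically discounted by temporal distance k."""
    signal = trajectory[t].consequence
    return {
        trajectory[t - k].action: signal * (decay ** k)
        for k in range(t + 1)
    }

traj = [Step("draft", "answer_immediately", 0.0),
        Step("revise", "ignore_user_constraint", 0.0),
        Step("final", "submit", -1.0)]  # negative feedback arrives late

credit = assign_credit(traj, t=2)
# Earlier actions receive geometrically less of the blame.
```

The verifiable-task regime is the degenerate case of this sketch: feedback arrives at $k = 0$ and the whole attribution search collapses to a single lookup.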
From this exploration, we can distill what is truly important when building a memory system:
- Trajectory-level credit assignment. The memory system must preserve a structured representation of past $(s, a, c)$ tuples at sufficient resolution to support attribution of outcomes to prior actions across arbitrary time horizons — the temporal analogue of backpropagation, applied at the episodic level.
- Contextual personalisation. Correct consequence evaluation for non-verifiable tasks requires access to user-specific preferences, constraints, and mental models that are not derivable from the pretraining distribution. The memory system must encode and surface this information as a first-class retrieval target.
- Consequence-driven generalisation. Raw episodic storage is insufficient; the memory system must compress specific experiences into transferable representations, supporting updates to the action policy that generalise beyond the particular episode in which a consequence was observed.
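One way these three requirements might compose, sketched with hypothetical names and a deliberately crude compression rule:

```python
# An episodic store of (s, a, c) tuples keyed per user: full-resolution
# storage for credit assignment, per-user keys for personalisation, and a
# compression hook standing in for consequence-driven generalisation.

from collections import defaultdict

class EpisodicMemory:
    def __init__(self):
        self.episodes = defaultdict(list)  # user_id -> [(s, a, c), ...]
        self.lessons = defaultdict(list)   # user_id -> distilled takeaways

    def record(self, user_id, state, action, consequence):
        """Trajectory-level storage: keep tuples at full resolution."""
        self.episodes[user_id].append((state, action, consequence))

    def retrieve(self, user_id, state):
        """Contextual personalisation: this user's history for this state."""
        return [e for e in self.episodes[user_id] if e[0] == state]

    def compress(self, user_id):
        """Consequence-driven generalisation (stub): distil episodes into
        transferable lessons, here just actions with negative outcomes."""
        bad = {a for _, a, c in self.episodes[user_id] if c < 0}
        self.lessons[user_id] = sorted(bad)
        return self.lessons[user_id]

mem = EpisodicMemory()
mem.record("u1", "scheduling", "double_book", -1.0)
mem.record("u1", "scheduling", "confirm_first", 1.0)
lessons = mem.compress("u1")
```

The hard part, of course, is everything this stub elides: a real `compress` must decide what generalises beyond the episode, which is exactly the open problem.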
Today's versions of memory
Of the three requirements above, consequence-driven generalisation is the one that existing systems have made the most serious attempt to address. The other two, trajectory-level credit assignment and contextual personalisation, remain largely unsolved, and the gap between the solved and unsolved halves of the problem is precisely what motivates the distinction between today's two dominant memory paradigms.
Weight-space memory
The most direct attempt at consequence-driven generalisation is to route new experience through the same mechanism that produced the model's capabilities in the first place: gradient descent into the weights. If pretraining can compress petabytes of text into a fixed parameter set, continuous updates to those parameters should, in principle, accumulate experience over a lifetime in the same way.
The obvious risk here, however, is catastrophic forgetting. Because a model's weights are an entangled compression, updating them for any new datum risks corrupting the representations that encode everything previously learned. But the deeper problem is not forgetting per se; it is granularity. As in my previous writing, learning can be viewed at a high level as a search problem: locate the precise region of the model's knowledge that the new experience bears on, and modify that region, and only that region, effectively.
The correct primitive, then, is sparsity — activating and updating only the small fraction of parameters semantically relevant to a given input, leaving the rest undisturbed. Current weight-space architectures, from sparse memory layers to Titans' online MLP-based memory modules, all converge on this idea: rather than diffuse writes across the full parameter space, route each update through a tiny, targeted subset. The result is a high-capacity store that can absorb new knowledge without contaminating existing representations.
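A minimal sketch of the sparsity primitive (not Titans' actual mechanism, and with an assumed per-parameter relevance score in place of a learned router):

```python
# Given a dense gradient, update only the k parameters most relevant to
# the current input and leave all others bit-identical.

import numpy as np

def sparse_update(params, grad, relevance, k, lr=0.1):
    """Apply the gradient only where |relevance| ranks in the top k."""
    idx = np.argsort(-np.abs(relevance))[:k]  # k most relevant parameters
    new_params = params.copy()
    new_params[idx] -= lr * grad[idx]         # targeted write
    return new_params

rng = np.random.default_rng(0)
params = rng.normal(size=1000)
grad = rng.normal(size=1000)
relevance = rng.normal(size=1000)  # stand-in for a learned routing signal

updated = sparse_update(params, grad, relevance, k=10)
# 990 of the 1000 parameters are untouched by the write.
```

The design question the sketch hides is where `relevance` comes from: in real architectures it is a learned key-addressing or gating mechanism, not a given vector.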
These architectures share a limitation, however. The granularity of update is the token or fact, not the $(s, a, c)$ trajectory, and because gradient descent operates at the population level no matter how sparse the write, contextual personalisation remains structurally out of reach: the model can only internalise what generalises across all users, which is exactly the averaging behaviour that non-verifiable tasks expose as insufficient.
Token-space memory
The two failures of weight-space memory — on credit assignment and personalisation — share a common cause: both require the memory system to be individual-aware, and gradient descent into shared weights is, by construction, individual-blind. The alternative is a conceptually simple inversion: leave the model's weights entirely fixed and treat the agent's accumulated context as the memory substrate instead. Rather than asking "how do we update the model to remember this?", the question becomes "how do we update what the model is told?"
This immediately resolves the personalisation problem. Learned context is per-agent by construction, human-readable, and trivially editable — the user's specific preferences, constraints, and history live explicitly in the context rather than being entangled somewhere in the weights. Catastrophic forgetting is structurally impossible, since the model's parameters are never touched.
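In miniature, with illustrative names, token-space memory is little more than an editable per-user document assembled into the prompt:

```python
# Per-user facts live as plain text, are trivially editable, and are
# prepended to the query; the model's weights are never touched.

user_memory = {
    "u1": ["Prefers concise answers.",
           "Works in UTC+1.",
           "Dissertation topic: continual learning."],
}

def build_prompt(user_id, query, memory=user_memory):
    """Assemble the context the frozen model actually sees."""
    facts = "\n".join(f"- {fact}" for fact in memory.get(user_id, []))
    return f"Known about this user:\n{facts}\n\nQuery: {query}"

# Editing memory is a list operation, not a gradient step:
user_memory["u1"].append("Accepted the Berlin job offer.")
prompt = build_prompt("u1", "Draft my weekly plan.")
```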
Compared directly with weight updates, however, token-space memory quickly becomes far more expensive and offers only marginal, if any, improvements in robustness. Accumulated tokens must be reprocessed on every forward pass, and long contexts actively degrade reasoning: the "context rot" failure mode, where an overloaded context window confuses rather than informs. More fundamentally, there is no gradient-based distillation, no mechanism by which token-space memory extracts the generalised lesson from a specific episode and discards the noise; it must instead rely on base-model capabilities that aren't yet there. The shortfall on consequence-driven generalisation is therefore structural, the precise inverse of weight-space memory's failure on personalisation.
How continual learning and preference learning paradigms will affect memory
Through this exploration, we arrive at a clearer picture of where the field actually stands. The storage mechanisms are converging — weight-space for generalisation, token-space for personalisation, and an increasingly clear case for a hybrid that leverages both. What has not yet been seriously addressed is the harder problem sitting above the storage layer: not where to put information, but what is worth keeping, and how to retrieve and associate it in ways that are genuinely useful to a specific individual at a specific moment. Without that, even a well-engineered hybrid memory system is still just a more sophisticated log.
At ThirdLayer, we believe that the unlock here is not another architectural improvement to how memories are stored, but a fundamentally better understanding of the individual — their preferences, their action policy, and the personal context that makes any given memory meaningful or irrelevant. The infrastructure of memory is maturing rapidly; what lags behind is the intelligence to use it. That intelligence, we think, has to be built on a deep, continuously updated model of who the user is and how they work — which is precisely the problem we're focused on.