Continual Learning Frontiers

Generalized LLMs are now winning gold at the IMO and placing second at competitive coding competitions, yet they remain far from disrupting everyday workflows in white-collar jobs.

Arguably the most important thing keeping them from being drop-in employees is their complete inability to learn on the fly over extended periods of time. Human employees learn tasks too complex to be one-shotted through a handful of demonstrations or training sessions, and eventually the work becomes second nature. AI systems, by contrast, either can or can't do something out of the box; no amount of coaching will help them learn over time.

Dwarkesh recently identified continual online learning as the prime reason for his forecast that he won't be able to hire an AI to run production for his podcast until 2032.

We believe across many use-cases this could be overly bearish.

Hiding in plain sight is just how many adaptivity-enabling primitives are already deployed live in frontier LLMs like ChatGPT.

These primitives break down into a few different tactics, outlined below:

Continued March Along Existing Paradigms: What is already being used today

Frontier models have made tremendous progress in adaptivity in just two years. Remember when ChatGPT was initially released and you’d ask it a question about something that happened after its training run and it was completely helpless? 

With current progress, we now expect that these systems will fulfill Dwarkesh's wager by 2030 without any fundamentally new technology, as they ride the exponentials of their core components: training, inference-time compute, and accessing the right information.

After all, over a 3-6 month period, humans' neuronal weights, or deeply internalized circuits, only change so much; our memory systems handle much of the on-the-job learning.

Training & Inference Time

Everything starts with pre-training. Performance on long-duration tasks, rapid adaptability to new tasks, and data-efficient learning all scale with raw intelligence. GPT-3 was the first model capable of few-shotting relevant tasks without fine-tuning. As base models keep getting more all-encompassing, they'll be able to few-shot ever more tasks, potentially rendering the continual learning discussed later unnecessary.

Even as naive parameter-count scaling ended with GPT-4.5, pre-training has continued to scale by getting 2-5x more efficient per parameter every year thanks to improved data quality and algorithms. With the focus now squarely on data quality, one hypothesis for the infamous scaling laws is that the data distribution itself obeys power laws: surfacing the rarer, more nuanced phenomena in the data requires ever more data. A good doctor can treat common cases; great doctors can treat the rare, convoluted ones.

To find the data at the end of the power law, companies like Compound portfolio company Wayve have developed generative models to simulate edge cases at scale, and Google just released an active learning approach that concentrates datasets on the boundary conditions that confuse the model, which reduced the dataset size needed for ad curation by three OOMs.
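
To make the active-learning idea concrete, here is a minimal sketch of the generic recipe (Google's actual ad-curation pipeline is not public and is surely more involved): score a large pool with the current model and keep only the examples it is least sure about.

```python
import numpy as np

def select_boundary_examples(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the examples the current model is least sure about.

    probs: (n_examples, n_classes) predicted class probabilities from the current model.
    Returns the indices of the `budget` highest-entropy (most confusing) examples.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-budget:]

# Usage sketch: score a large unlabeled pool, keep only the confusing slice for
# labeling/curation, retrain, and repeat.
# pool_probs = model.predict_proba(unlabeled_pool)   # hypothetical model API
# keep_idx = select_boundary_examples(pool_probs, budget=10_000)
```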

Meanwhile, just months ago, reasoning models and inference-time scaling were introduced. They were the first models capable of adapting to new environments, at least as measured by beating ARC-AGI-1.

Inference-time scaling clearly hasn't approached saturation. The reasoning traces of GPT-5 are already 50-80% more efficient than those of o3, a model just a few months older. Achieving gold at both IMO and IOI required roughly 4.5 hours of potent thinking for each, and performance jumped from the 49th to the 98th percentile in the last year. Now Noam Brown is talking about getting models to think for months to help tackle Millennium Prize problems.

Crucially, both Google and OpenAI used the same fully general-purpose reasoning models for IMO, IOI, and competitive coding, and those models will be incorporated into publicly served models shortly.

Maybe the most underappreciated recent news is that we have now entered the era of general purpose verifiers. For the first time, Google and OpenAI didn’t rely on a Lean‑style formal verifier for their IMO gold. They instead used another LLM as a “fuzzy” verifier capable of checking natural‑language proofs. 

The implications of fuzzy verifiers are broad: the scope of machine-verifiable tasks explodes, self-play becomes feasible, search and verification come to require similar compute, and the very nature of corporations may be rearranged into micro-tasks auctioned to AIs, as spelled out here.

Remember that verifier + search is what produced AlphaZero. It appears we may have that combination for general domains, even outside of math and coding. 
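
As a rough illustration of what verifier + search looks like with a fuzzy verifier, here is a best-of-N sketch; the generator and grader below are stubs standing in for a reasoning model and an LLM judge, not any lab's actual setup.

```python
import random
from typing import Optional

def fuzzy_verify(problem: str, candidate: str) -> float:
    """Stand-in for an LLM judge: in practice, a grader prompt to a strong model,
    possibly sampled several times and averaged to reduce noise. Returns a 0-1
    confidence that the natural-language solution is correct."""
    return random.random()  # placeholder score

def generate_candidates(problem: str, n: int) -> list[str]:
    """Stand-in for sampling n diverse solutions from a reasoning model."""
    return [f"candidate solution {i} for: {problem}" for i in range(n)]

def search_with_verifier(problem: str, n: int = 64, threshold: float = 0.9) -> Optional[str]:
    """Best-of-N search: sample many solutions, keep the one the verifier scores
    highest, and only accept it if the verifier is confident enough."""
    scored = [(fuzzy_verify(problem, c), c) for c in generate_candidates(problem, n)]
    best_score, best = max(scored)
    return best if best_score >= threshold else None
```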

Indeed, AlphaZero-style cold-start training regimes have been detailed in several technical reports; startups are using recursion, where learnings from the best search trajectories are incorporated back into the base model so it searches more efficiently in the future; and Google released AlphaEvolve, which learns to mutate an algorithm over time and found a faster 4x4 matrix multiplication algorithm for the first time since 1969.
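
That recursion pattern is essentially expert iteration. A hedged sketch, with `model.sample`, `model.finetune`, and the scorer as assumed interfaces rather than any specific lab's API:

```python
def verifier_score(problem: str, trajectory: str) -> float:
    """Stand-in for a reward signal, e.g. the fuzzy LLM verifier sketched above."""
    return 0.0  # placeholder

def expert_iteration_round(model, problems, n_samples: int = 16, top_k: int = 1):
    """One round of 'recursive learning': search widely, keep only the best-scoring
    trajectories, and fine-tune the base model on them so future searches start
    closer to good solutions."""
    distillation_set = []
    for problem in problems:
        candidates = [model.sample(problem) for _ in range(n_samples)]
        candidates.sort(key=lambda t: verifier_score(problem, t), reverse=True)
        distillation_set += [(problem, t) for t in candidates[:top_k]]
    model.finetune(distillation_set)  # the model imitates the best of its own search
    return model
```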

To go beyond human and synthetic data, Richard Sutton and David Silver have proclaimed we're entering the era of experience, in which fully generalized versions of AlphaZero will learn directly from interacting with the real world via RL. There are few specifics, though, such as where the rewards come from; maybe a good place to start is replicating existing outputs, or maybe LLMs themselves really are good enough verifiers now. Either way, getting this to work may require novel optimization algorithms and/or architectures less prone to superposition.

One could even build a foundation model for generating RL environments with the diversity and realism needed to train genuinely robust, generalized agents across domains. Reaching the GPT-3 equivalent of RL-environment scale would require scaling up by several OOMs.

Mechanize's blog

Finally, in addition to better data, maybe we'll get novel architectures that scale better with data by learning abstractions above the token level (e.g. H-Nets, JEPA, HRM). While Transformers appear good enough, they don't have a divine mandate, are OOMs less data-efficient than humans, and historically model architectures have been replaced every ~7 years.

Altogether, training and inference-time exponentials have gotten us very far and have plenty left in them.

METR; DeepMind's model already beat GPT-5 Pro before its launch

While we're still a handful of years from real physical limits on scaling compute clusters, the more immediate hurdle is that research and model serving must remain economical at the scale of massive user bases.

The market price per token has remained constant despite costs for a given model capability plummeting, because all users immediately switch to the best model. Meanwhile, reasoning models have made token demand go vertical.

The economics of reasoning models may improve thanks to plenty of low-hanging fruit in making the reasoning search more efficient via recursive learning (e.g. GPT-5 being 50-80% more efficient than o3, just months its senior) and, eventually, safe multi-hop latent-space reasoning. The more advanced reasoning models powering the IMO, IOI, and competitive coding wins will also be incorporated into production models in the coming months.

Moreover, maybe we soon reach capability thresholds where people stop always demanding the frontier. Alternatively, the economics get patched by OpenAI monetizing free users with ads, Anthropic with usage-based pricing, and Google and Meta funding development from their $100B of combined FCF.

Beyond that, plenty of well-researched but not yet productized techniques discussed throughout the rest of the post could significantly bend the cost curves and ignite a new “scaling law.” 

Memory & RAG

One such scaling paradigm, after training and inference time, may be memory systems and external information retrieval, if you think of these paradigms as combinations of the amount of information available to the model and its ability to parse it. One can only fit so much information into the model weights, and they cannot include personal or private information.

Information retrieval and memory systems must decide what to store for the long term, how to store it (vector vs. graph, which features, etc.), how to consolidate related concepts, and how to retrieve it at the right time. They must do so on the fly for every user, and across frontier labs that's ~1B MAUs.
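
As a toy illustration of those decisions (what to keep, how to consolidate, when to retrieve), here is a minimal per-user memory sketch; the `embed` and `salience` models and the thresholds are assumptions, and production systems are far more sophisticated.

```python
from dataclasses import dataclass, field
import time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

@dataclass
class MemoryItem:
    text: str
    embedding: list[float]
    created_at: float = field(default_factory=time.time)
    hits: int = 0  # how often this memory has been retrieved

class UserMemory:
    """Minimal per-user long-term store: a salience gate decides what to keep,
    near-duplicates are consolidated, and retrieval is by embedding similarity."""

    def __init__(self, embed, salience, store_threshold=0.5, merge_threshold=0.92):
        self.embed, self.salience = embed, salience          # assumed callables
        self.store_threshold, self.merge_threshold = store_threshold, merge_threshold
        self.items: list[MemoryItem] = []

    def maybe_store(self, text: str) -> None:
        if self.salience(text) < self.store_threshold:
            return                                   # not worth remembering
        vec = self.embed(text)
        for item in self.items:                      # consolidate near-duplicates
            if cosine(vec, item.embedding) > self.merge_threshold:
                item.text = f"{item.text}; {text}"   # naive merge
                return
        self.items.append(MemoryItem(text, vec))

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        qv = self.embed(query)
        ranked = sorted(self.items, key=lambda m: cosine(qv, m.embedding), reverse=True)
        for m in ranked[:k]:
            m.hits += 1                              # usage signal for later consolidation
        return [m.text for m in ranked[:k]]
```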

However, most existing RAG and memory systems still struggle with generality; multi-modality; unifying the three memory domains (parametric knowledge, KV-caches, and externally injected content); and moving from static entity stores to time‑aware, self‑refining graphs of ongoing real-time data streams. Traditional chunk-based RAG can’t represent complex entity relationships, while GraphRAG methods organize knowledge into entity-relation graphs but are limited by costly graph construction and one-time retrieval. 

Maybe we can take inspiration from the sophistication and adaptability of human memory systems. Our short- and long-term memories are controlled by different biological mechanisms and stored in different regions of the brain. Working memory can even be maintained pre-synaptically, without occupying a single neuron, whereas long-term memories are literally physically stabilized. A stimulus's frequency, intensity/arousal, and context guide successive memory consolidation, possibly via dopaminergic actor-critic reinforcement learning.

AI research is indeed moving towards mimicking humans' capacity to break days- or months-long tasks into chunks, the way a semester-long class is broken into modules, with spaced repetition, consolidation of crucial information, and a filing system for finding arbitrary information as needed.

In just a couple of years, context windows have extended from 4k to 10M tokens, allowing ingestion of entire code bases. This may continue to scale, and researchers are exploring other ways to boost the base model's intrinsic memory, like recurrence even within Transformers or strengthening the attention heads responsible for recall.

More exciting is using this newfound long context length as a short-term cached memory and building around it sophisticated information retrieval to effectively achieve indefinite context length. 

The academic frontier is addressing these issues by combining hierarchical memory, RL-trained agents, and multi-turn reasoning that scales at inference time. The memory systems often either use a compression method to create a "high-level" understanding that guides retrieval of specifics, or use LLMs/agents to build a lightweight knowledge graph that's acceptably cheap to construct and maintain. Some have even more ambitious visions of memory operating systems, whatever that means exactly. All such systems will have to contend with computational overhead (e.g. Mem0 occupies ~7K tokens per conversation).

MemoRAG; MemOS
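
The "compress, then guide retrieval" pattern can be sketched in a few lines, in the spirit of systems like MemoRAG rather than their actual implementation; `llm` and `retriever` are assumed interfaces.

```python
def answer_with_compressed_memory(question: str, corpus_chunks: list[str], llm, retriever) -> str:
    """Compression-guided retrieval sketch: build a cheap global summary of the corpus,
    use it to draft 'clues' about which specifics are needed, then retrieve and answer."""
    # 1. Offline: compress the corpus into a compact global memory / summary.
    global_memory = llm("Summarize the key entities, events, and relations:\n"
                        + "\n".join(corpus_chunks[:50]))

    # 2. Online: use the global memory to decide what specifics to look up.
    clues = llm(f"Given this summary:\n{global_memory}\n"
                f"List the specific facts needed to answer: {question}")

    # 3. Retrieve the precise chunks the clues point to, then answer from them.
    evidence = retriever(clues, k=8)
    return llm(f"Answer '{question}' using only:\n{evidence}")
```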

These systems will become more complex as we get to many different agents acting autonomously.

With memory already live for a year in ChatGPT, and memory being one of the few true moats in AI as we at Compound have long pointed out, we can count on these systems improving quickly.

Current systems, including the one deployed within ChatGPT, are still quite ineffective, so there's still a window for startups to commercialize a technical breakthrough. An application we at Compound have long been excited about is AI therapists. Another possible domain for startups is building the pipes for companies to safely and privately connect all their internal data. We are sure there are many others.


Update the Weights: Frequently Retraining Models

We may not need fundamentally new approaches beyond the three preexisting scaling paradigms sketched out above to reach human-level performance. However, we almost certainly will be getting them.

Models are still missing the ability to autonomously incorporate learnings directly back into their own weights. Hints have already emerged that such technologies are on the near horizon.

  • The methods discussed above, like recursive learning from search, self-play, and RL from real-world environments, could provide the data for frequent weight-update techniques
  • Test-time training is starting to be deployed effectively on ARC, with the ARC team speculating that the technique will make its way into frontier LLMs next year (a minimal sketch of the recipe follows this list)
  • Academic groups have published work on how to apply frequent LoRA adapters without unacceptable safety risks, catastrophic forgetting, overfitting, or exorbitant costs
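
For the test-time training bullet above, the general recipe demonstrated on ARC-style tasks looks roughly like the following; `augment` and `model.finetune` are assumed interfaces, not a frontier lab's pipeline.

```python
import copy

def test_time_train(base_model, demos, augment, steps: int = 32, lr: float = 1e-5):
    """Test-time training sketch: clone the model, fine-tune it briefly on augmented
    copies of the task's own demonstration pairs, use the clone to predict, then
    discard it so the served base weights never change."""
    model = copy.deepcopy(base_model)                              # never touch the base weights
    pairs = [augment(x, y) for (x, y) in demos for _ in range(8)]  # e.g. rotations/recolorings
    model.finetune(pairs, steps=steps, lr=lr)                      # brief, task-specific adaptation
    return model                                                   # use for this task, then throw away
```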

Maybe the future plays out such that the AlphaZero-like RL approach dominates, with model weights continuously receiving gradient-descent updates against real-world rewards.

But what feels more likely, due in part to the costs of storing entire frontier-model weights online, is something more anthropomorphic and piecemeal, where various techniques are used depending on the situation. AI models could have something like a working memory (a buffer of potentially useful things they've recently seen, stored somewhere accessible) that gets resurfaced for remembering. Then, on the order of weeks, LoRA adapters or fine-tuning would durably add the most important information.
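
A hedged sketch of that consolidation step using Hugging Face's peft library; the training loop, buffer format, and hyperparameters below are placeholders.

```python
from peft import LoraConfig, get_peft_model

def consolidate_buffer_into_adapter(base_model, buffer_dataset, train_fn):
    """Distill whatever accumulated in the working-memory buffer into a small LoRA
    adapter instead of touching the base weights. `train_fn` (the actual fine-tuning
    loop, e.g. a transformers Trainer) and the buffer format are assumptions;
    LoraConfig/get_peft_model are the real peft entry points."""
    config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],      # adapt attention projections only
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, config)    # base weights stay frozen
    train_fn(model, buffer_dataset)               # fine-tune only the adapter weights
    model.save_pretrained("adapters/week_xx")     # adapter is small, detachable, roll-back-able
    return model
```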

Frequently and autonomously updating model weights currently faces the technical barriers of avoiding unacceptable safety risks, catastrophic forgetting, overfitting, and exorbitant computational costs. 

The intuition for its feasibility stems in part from findings that weights are somewhat segregated by function, despite plenty of overlap. For instance, different papers have suggested that a small fraction of "superweights" dominate behavior, that <1% of neurons are largely responsible for safety refusal activations, that a small minority of directions are responsible for safety, that one can apply a LoRA adapter for safety without causing catastrophic forgetting or a drop-off in reasoning capability and vice versa, and so on.

However, the gap between academic PoC and autonomously updating the base-model weights of products nearing one billion users is unusually large, because doing so opens up a huge attack surface, one that seems significantly higher dimensional than the already vast attack vectors of prompts, tool use, multimodality, and computer use.

Malicious weight updates to ChatGPT could cause serious reputational damage, like perpetual Mecha Hitler scandals. Or, in the worst case, cybercriminals figure out how to consistently hack the weights and actively use these near-AGI systems for malicious purposes.

While OpenAI is not your classic slow moving incumbent and certainly isn’t petrified by liability risk, they may be slow to broadly offer such a solution until the research is clear. Maybe they’d start with ChatGPT for enterprise with strict KYC, high refusal rates, and oversight. 

Their current fine-tuning service uses some lightweight classifiers to catch harmful data in your dataset and retains your raw dataset for 30 days afterward.

Before they roll out mass personalization with frequent updates, a window may emerge for startups to build a robust business providing these services at least to certain niches. 

Automated Safety Checks Before Merging Weights

Excitingly, frontier safety research has only recently made a very noticeable shift from strictly manual work towards automated constructs and agents.

  • OpenAI released research claiming the first model able to generate both effective and diverse prompt attacks: an LLM generates attacker goals at large (most of which are initially ineffective), and rule-based rewards plus multi-step iterative RL hone the attacks' potency. Others have published similar approaches.
  • Anthropic created and open sourced sandboxed environments where parallel agents can call functions like “retrieve features” to conduct genuinely value-additive basic alignment research. Individual agents currently solve the Marks et al. auditing game 13% of the time under realistic conditions, struggling with fixating on early hypotheses. By running many agents in parallel and aggregating their findings, they improve to 42%.
  • Anthropic also introduced a method for autonomously identifying persona vectors and predicting undesirable personality shifts during fine-tuning. The authors also propose “vaccinating” the model during fine-tuning to mitigate negative activations at inference time, with fewer performance side effects than is typical with steering.
  • The Gemini 2.5 team used the tactics below to find 2,000 successful malicious triggers. They then generated synthetic responses to each trigger, one in which the LLM successfully defended itself and one in which it failed, and fine-tuned on that dataset. The fine-tuning process significantly increased the model's resilience to indirect prompt injection attacks. Concerningly, LLMs remain susceptible to attack, with the TAP prompt exfiltrating user data roughly 19 times out of 20 when it reached the assistant in the calendar management context.
  • An academic group developed a method that identifies the small fraction of alignment weight updates disproportionately responsible for safety robustness and intensifies their activation, making the model more immune to safety degradation during fine-tuning. It requires no changes to training, alignment, fine-tuning, or inference. 
  • Other techniques that could easily be made into automated safety frameworks include refusal benchmarks, jailbreak suites, and regression tests on original capabilities.

Before merging LoRA adapters or fine-tuned weights, methods like those above could be used for automated red-teaming.
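
A minimal sketch of what such a merge gate might look like; the suites and thresholds below are illustrative placeholders, not a production policy.

```python
def safe_to_merge(candidate_model, baseline_model, suites) -> bool:
    """Automated gate before merging a new adapter or fine-tune. Each suite is a
    callable returning a score in [0, 1]: a refusal benchmark, a jailbreak attack
    suite, and a capability regression test."""
    checks = {
        "refusal":    suites["refusal"](candidate_model) >= 0.99,
        "jailbreak":  suites["jailbreak"](candidate_model) >= 0.95,
        # capability regression: the candidate shouldn't lose more than 1 point
        "capability": suites["capability"](candidate_model)
                      >= suites["capability"](baseline_model) - 0.01,
    }
    return all(checks.values())

# if safe_to_merge(new_model, prod_model, suites): deploy(new_model)
# else: quarantine the adapter and route it to human review.
```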

Learning New Info without Forgetting Useful Current Info

A parallel track of research has explored how to avoid catastrophic forgetting (i.e., the overwriting of useful information). Approaches include focusing on retaining features rather than weights, selecting data for uncertainty reduction à la active learning, online weight regularization to identify and protect key parameters, automatic recognition of data-distribution shifts, and stochastically restoring a small fraction of neurons to their pre-trained weights.
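
Online weight regularization is perhaps the easiest of these to sketch; below is an elastic-weight-consolidation-style penalty (one standard recipe, not necessarily what frontier labs use) that discourages moving parameters the Fisher information marks as important for previously learned behavior.

```python
import torch

def ewc_penalty(model, fisher, anchor_params, lam: float = 0.4):
    """EWC-style regularizer. `fisher` and `anchor_params` are dicts keyed by
    parameter name, computed on the old data before new training begins
    (that estimation step is omitted here)."""
    loss = torch.zeros(())
    for name, param in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (param - anchor_params[name]) ** 2).sum()
    return lam * loss

# Inside the new-data training loop:
# total_loss = task_loss + ewc_penalty(model, fisher, anchor_params)
```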

Maybe once base models become so informationally dense that overwriting any small portion of weights with adapters causes significant degradation, models will even dynamically increase their parameter count to keep learning on the job.

Ultimately, even if one addresses the risks of individual fine-tunes, updating weights frequently over long periods of time, let alone doing so constantly online, requires further innovation. Most LoRA approaches lack robustness to small weight updates due to a discrete mapping between input data, parameters, and adapters.

Recent advances like ELDER create continuous, smooth mappings. ROME uses interpretability to find particular facts to edit in mid-layer MLPs. Others add and train new neurons at the feed-forward layer.
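
A heavily simplified rank-one edit in the spirit of ROME, dropping its key-covariance weighting and the causal-tracing step that locates the right layer; this is a conceptual sketch, not the published method.

```python
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v_new: torch.Tensor) -> torch.Tensor:
    """W is a mid-layer MLP projection (d_out x d_in), k the key vector representing
    the subject, v_new the value encoding the new fact. After the update,
    W_new @ k == v_new, while directions orthogonal to k are left untouched."""
    residual = v_new - W @ k                       # gap between current output and target
    return W + torch.outer(residual, k) / (k @ k)  # rank-one correction along k
```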

ELDER

A World of Frequent Fine-Tunes and Mass Specialization

Once teams figure out how to adjust models' weights frequently, specialization will become the norm. Currently, models are differentiated only by size: frontier models are on the order of 300B+ parameters while edge models are ~3B.

With parameter-count scaling tapped out, we should focus on specializing large-scale models for every domain. AI models deployed for coding don't also need to write lovely haikus.

As models are deployed in real-world settings, they may gather enough high quality feedback data to train models filled with predominantly domain-specific data. 

After all, employers only expect so much from a human fresh off general-purpose training like high school or undergrad. Humans get paid more and are given more responsibility as they accumulate years of specialized training in specific jobs and roles.

There may evolve an App Store or Braintrust-like marketplace for specialized AIs. The equivalent of a 20-year-veteran front-end developer could be paired with an infra developer, while a second-year investment banking analyst could be hired cheaply to automate SMBs' finances: a swarm optimized for the task at hand. Maybe in such a world the tools enabling customization and the router to the optimal agents become immensely valuable. Or maybe what I'm describing is just startups, Hugging Face, or OpenAI.

Consumers may soon also drive demand for differentiation, with use cases like therapy, coaching, and persona replication for influencers or people who want digital twins for productivity. Maybe AIs are the next influencers, and the creators of the most viral and beloved personalities will be today's TikTokers.

1,2,3,4,5

Though early, we expect specialization to become increasingly common, given scaling limits, that OpenAI has already started rolling out different personalities and previously tried using Scarlett Johansson's voice, and that one of Sam Altman's main takeaways from the GPT-5 launch was that they should push more on customization following the concerningly emotional funerals for 4o.

We're starting to see multi-agent frameworks in research papers, but specialization there is done largely with system prompts rather than anything more sophisticated, and is as coarse as "researcher" and "grader."

This potential future of abundant model tuning may imply that open source and local models become far more prevalent. Compound invested in a stealth startup focused on enabling high-performance local compute. 

In such a world of mass specialization, LoRA adapters would presumably be the favored technical solution, because they can be detached or rolled back in milliseconds and cost ~10x less to maintain than full fine-tunes.

Maybe we'll train countless LoRA adapters and use them like Instagram filters, dynamically routing to them on the fly.
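
A toy sketch of that routing layer; the adapter names, similarity threshold, and embedding scheme are purely illustrative.

```python
import numpy as np

def route_to_adapter(query_embedding: np.ndarray,
                     adapter_embeddings: dict[str, np.ndarray],
                     fallback: str = "base") -> str:
    """Embed the incoming request, pick the specialist adapter whose description
    embedding is closest, and fall back to the base model when nothing is a
    confident match."""
    best_name, best_sim = fallback, 0.35  # minimum similarity required to leave the base model
    for name, emb in adapter_embeddings.items():
        sim = float(query_embedding @ emb /
                    (np.linalg.norm(query_embedding) * np.linalg.norm(emb) + 1e-9))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

# e.g. adapter_embeddings = {"frontend_veteran": ..., "ib_analyst": ...}
# -> load the chosen LoRA, serve the request, detach it afterward.
```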

Maybe we’ll even go beyond fine-tuning on existing datasets by simulating real-world settings in RL environments ahead of deployment.

In Five Years' Time

We expect the current paradigms of base-model training + inference-time reasoning + memory/RAG will get us to AIs that can handle some full real-world workflows.

Nevertheless, we will get new primitives.

Predicting exactly when each of these discrete events will happen is of course difficult. But given that they've already been demonstrated in academic settings, and that frontier labs generally run roughly a year ahead of academia, it seems reasonable to expect several primitives to start appearing within the next 1-2 years. Once such a new technique is introduced with some success, all labs will rapidly climb that hill simultaneously.

So I suspect Dwarkesh's 2032 timeline is perhaps too bearish by a year or two. Our timeline incorporates the difficulty of gathering the long tail of data that marked the difference between AV demos in 2015 and commercial networks ten years later (though it's an imperfect analogy because AVs are life-and-death, etc.).

If you're working on any of these problems, from basic research to productization to starting a company building upon them, we'd love to chat!