Paper review: Bayesian continual learning and forgetting in neural networks


📄 Paper

Bayesian Continual Learning and Forgetting in Neural Networks
Djohan Bonnet et al.
Nature Communications (2025)
DOI: https://doi.org/10.1038/s41467-025-64601-w


Why I read this paper today

Continual learning always feels like a fundamental challenge between plasticity and stability:

  • If the model keeps learning, it forgets (catastrophic forgetting);
  • If it remembers too well, it stops learning (catastrophic remembering).

A straightforward way to prevent catastrophic forgetting is to rigidly regularize the network so that its parameters remain close to their old values, but this creates the opposite failure mode: catastrophic remembering, where the model becomes too inflexible to learn anything new. Together, these effects expose the fundamental tension between stability and plasticity in continual learning: preventing forgetting through overly strong constraints blocks adaptation, so effective methods must strike a balance rather than freezing the model.

Biological Inspirations

Bayesian principles: biological synapses may operate according to Bayesian principles, maintaining an “error bar” on synaptic weight values to gauge uncertainty.

Meta-plasticity: Synapses adapt their plasticity based on the importance of prior tasks → widely regarded as a plausible mechanism that enables the brain to balance memory retention and flexibility.

Comparison with other metaplasticity-based continual learning algorithms

One of the earliest works that explicitly applied the concept of metaplasticity to mitigate catastrophic forgetting in neural networks is Synaptic metaplasticity in binarized neural networks (Nature Communications, 2021). In this paper, metaplasticity is introduced in a heuristic manner within binarized neural networks, where synaptic states are endowed with additional internal variables that control their plasticity. While this approach significantly improves memory retention across tasks, it relies on explicit task boundaries to trigger plasticity modulation.

In subsequent work, Statistical mechanics of continual learning (Phys. Rev. E, 2023), my collaborators and I developed a theoretically grounded Bayesian framework for continual learning in binary neural networks. By formulating learning as a variational inference problem, metaplasticity emerges naturally rather than being imposed heuristically. The resulting Bayesian dynamics explicitly link synaptic plasticity to uncertainty, yielding adaptive learning rates at the level of individual synapses. Similar to the earlier work, this framework still assumes known task boundaries, but it goes further by enabling a full statistical-mechanics analysis of the learning process using the Franz–Parisi potential, which characterizes the stability and memory structure of solutions across tasks.

💡 Core idea in one sentence

The algorithm MESU introduces forgetting directly at the posterior level of a Bayesian neural network, so that uncertainty, plasticity, and memory emerge naturally from variational inference.


Truncated posterior instead of infinite memory

Standard Bayesian continual learning accumulates evidence forever. For a sequence of datasets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_t$, the posterior is recursively updated as:

\[p\left(\boldsymbol{\omega} \mid \mathcal{D}_1, \ldots, \mathcal{D}_t\right)=\frac{p\left(\mathcal{D}_t \mid \boldsymbol{\omega}\right) \cdot p\left(\boldsymbol{\omega} \mid \mathcal{D}_1, \ldots, \mathcal{D}_{t-1}\right)}{p\left(\mathcal{D}_t\right)}\]

Limitations. First, the framework treats all tasks as equally important, ignoring differences in task relevance or difficulty, which can lead to inefficient use of model capacity as the number of tasks grows. Second, when the model repeatedly revisits the same dataset, uncertainty continues to shrink, causing the system to become increasingly over-confident and less adaptable, ultimately reducing its ability to accommodate new or changing data distributions.

To resolve these issues, the authors introduce a forgetting mechanism using a truncated posterior (a finite memory window):

\[p\left(\boldsymbol{\omega} \mid \mathcal{D}_{t-N}, \ldots, \mathcal{D}_{t}\right)=\frac{p\left(\mathcal{D}_{t} \mid \boldsymbol{\omega}\right) \cdot p\left(\boldsymbol{\omega} \mid \mathcal{D}_{t-N}, \ldots, \mathcal{D}_{t-1}\right)}{p\left(\mathcal{D}_{t}\right)}\]

The model retains information only from the most recent N tasks, so knowledge from earlier tasks is explicitly discarded rather than compressed or integrated. As written, however, the update is not recursive: the prior term conditions on D_{t-N}, …, D_{t-1}, which is not the previous truncated posterior (conditioned on D_{t-N-1}, …, D_{t-1}), so each step cannot be computed from the previous posterior alone, which would make the scheme impractical for truly lifelong or streaming settings. To recover a recursive form, we rewrite the prior term using Bayes’ rule:

\[p\left(\boldsymbol{\omega} \mid \mathcal{D}_{t-N-1}, \ldots, \mathcal{D}_{t-1}\right)=\frac{p\left(\mathcal{D}_{t-N-1} \mid \boldsymbol{\omega}\right) \cdot p\left(\boldsymbol{\omega} \mid \mathcal{D}_{t-N}, \ldots, \mathcal{D}_{t-1}\right)}{p\left(\mathcal{D}_{t-N-1}\right)}.\]

Rearranging this expression and substituting it into the update above yields the recursive posterior:

\[p\left(\boldsymbol{\omega} \mid \mathcal{D}_{t-N}, \ldots, \mathcal{D}_t\right) = \frac{p\left(\mathcal{D}_t \mid \boldsymbol{\omega}\right) p\left(\boldsymbol{\omega} \mid \mathcal{D}_{t-N-1}, \ldots, \mathcal{D}_{t-1}\right)}{p\left(\mathcal{D}_t\right)} \cdot \frac{p\left(\mathcal{D}_{t-N-1}\right)}{p\left(\mathcal{D}_{t-N-1} \mid \boldsymbol{\omega}\right)}.\]
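
To see the “forgetting by division” mechanism concretely, here is a minimal toy sketch (my own, not the paper’s code) for a conjugate Gaussian model with a single unknown mean: multiplying in a likelihood adds its precision, and dividing one out subtracts exactly the precision that dataset had contributed. The names NOISE_VAR and N_WINDOW are illustrative assumptions.

```python
import numpy as np

# Toy model: a single unknown mean w with known observation noise.
# Natural parameters: tau = posterior precision, eta = precision-weighted mean.
rng = np.random.default_rng(0)
NOISE_VAR = 1.0   # assumed known observation noise (illustrative)
N_WINDOW = 3      # memory window N: number of datasets retained

tau, eta = 1.0, 0.0   # start from the prior N(0, 1)
window = []           # sufficient statistics of each dataset in the window

for t in range(10):
    x = rng.normal(loc=2.0, scale=np.sqrt(NOISE_VAR), size=20)   # dataset D_t
    d_tau, d_eta = len(x) / NOISE_VAR, x.sum() / NOISE_VAR

    # learning: multiply in the new likelihood p(D_t | w)
    tau, eta = tau + d_tau, eta + d_eta
    window.append((d_tau, d_eta))

    # forgetting: once the window is full, divide out p(D_{t-N-1} | w)
    if len(window) > N_WINDOW:
        old_tau, old_eta = window.pop(0)
        tau, eta = tau - old_tau, eta - old_eta

    print(f"t={t}  mean={eta / tau:.3f}  var={1.0 / tau:.4f}")

# Without the forgetting step, the variance would keep shrinking toward zero
# (over-confidence); with it, the variance plateaus at a finite value.
```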

Learning vs Forgetting in the free energy

The MESU objective decomposes cleanly into:

  • Learning: fit the current dataset
  • Forgetting: gently de-consolidate old synapses

The forgetting term explicitly encourages:

  • pulling the mean back toward the prior
  • increasing posterior variance to free capacity

This makes forgetting an active, controlled process, not a failure mode.
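
Schematically (my paraphrase, not the paper’s exact expression), the objective combines a data-fitting term with a weak pull back toward the prior, whose strength is set by the memory window N:

\[\mathcal{F}_t(\boldsymbol{\theta}) \approx \underbrace{\mathbb{E}_{q_{\boldsymbol{\theta}}}\left[\mathcal{C}_t(\boldsymbol{\omega})\right]}_{\text{learning: fit } \mathcal{D}_t}+\underbrace{\frac{1}{N} D_{KL}\left[q_{\boldsymbol{\theta}}(\boldsymbol{\omega}) \,\middle\|\, p_{\text{prior}}(\boldsymbol{\omega})\right]}_{\text{forgetting: relax toward the prior}}\]

Differentiating the forgetting term is what produces the corrections that pull the means and variances back toward their prior values in the update rules derived below.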


Metaplasticity emerges automatically

In the variational inference method, each weight is modeled as a Gaussian:

\[\omega_i \sim \mathcal N(\mu_i, \sigma_i^2),\]

and we assume a factorized (mean-field) trial distribution:

\[q_{\boldsymbol{\theta}_t}(\boldsymbol{\omega})=\prod_i \mathcal{N}\left(\omega_i ; \mu_{t, i}, \sigma_{t, i}^2\right).\]

To approximate the posterior

\[p\left(\boldsymbol{\omega} \mid \mathcal{D}_{t-N}, \ldots, \mathcal{D}_{t}\right),\]

we minimize the KL divergence between the trial distribution and this posterior:

\[D_{KL}\left[q_{\boldsymbol{\theta}_t}(\boldsymbol{\omega}) \,\middle\|\, p\left(\boldsymbol{\omega} \mid \mathcal{D}_{t-N}, \ldots, \mathcal{D}_{t}\right)\right].\]

Up to a constant, this objective is the variational free energy. Minimizing it yields the update rules:

\[\begin{gathered} \Delta \boldsymbol{\sigma}=-\frac{\boldsymbol{\sigma}_{t-1}^2}{2} \frac{\partial \mathcal{C}_t}{\partial \boldsymbol{\sigma}_{t-1}}+\frac{\boldsymbol{\sigma}_{t-1}}{2 N \boldsymbol{\sigma}_{\text{prior}}^2}\left(\boldsymbol{\sigma}_{\text{prior}}^2-\boldsymbol{\sigma}_{t-1}^2\right) \\ \Delta \boldsymbol{\mu}=-\boldsymbol{\sigma}_{t-1}^2 \frac{\partial \mathcal{C}_t}{\partial \boldsymbol{\mu}_{t-1}}+\frac{\boldsymbol{\sigma}_{t-1}^2}{N \boldsymbol{\sigma}_{\text{prior}}^2}\left(\boldsymbol{\mu}_{\text{prior}}-\boldsymbol{\mu}_{t-1}\right) \end{gathered}\]

This update rule requires no task boundaries; it operates directly on the data stream. Notice also that the synaptic variance (the uncertainty of each connection) plays the role of a learning rate, scaling every update by that parameter’s uncertainty. So:

  • Large epistemic uncertainty → fast learning
  • Small epistemic uncertainty → slow learning

Plasticity is no longer a global hyperparameter — it is learned per parameter.

This is exactly what neuroscientists refer to as metaplasticity.
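
As a minimal sketch of the two update equations above (my own toy implementation, assuming the gradients of the cost with respect to the means and standard deviations are already available, e.g. from backprop through a reparameterized sample; function and variable names are mine, not the paper’s):

```python
import numpy as np

def mesu_style_step(mu, sigma, grad_mu, grad_sigma,
                    mu_prior=0.0, sigma_prior=0.1, n_window=1e5):
    """One update of mean-field Gaussian parameters, following the rule above."""
    # learning terms: gradients of the cost, scaled by the current uncertainty
    d_mu = -(sigma**2) * grad_mu
    d_sigma = -0.5 * sigma**2 * grad_sigma
    # forgetting terms: relax mean and variance back toward the prior,
    # at a rate set by the memory window n_window
    d_mu += sigma**2 * (mu_prior - mu) / (n_window * sigma_prior**2)
    d_sigma += sigma * (sigma_prior**2 - sigma**2) / (2 * n_window * sigma_prior**2)
    return mu + d_mu, sigma + d_sigma

# Toy usage with random "gradients": uncertain weights (large sigma) move a lot,
# consolidated weights (small sigma) barely move at all.
rng = np.random.default_rng(1)
mu = rng.normal(0.0, 0.05, size=5)
sigma = np.array([0.1, 0.05, 0.02, 0.01, 0.001])
mu_new, sigma_new = mesu_style_step(mu, sigma, rng.normal(size=5), rng.normal(size=5))
print(np.abs(mu_new - mu))   # update magnitude tracks sigma**2
```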


Connection to Hessian and EWC

One of my favorite aspects of the paper is the theoretical unification:

  • Posterior variance scales inversely with the Hessian diagonal
  • In the large-window limit, MESU recovers EWC / Synaptic Intelligence
  • The update resembles a diagonal Newton step on the variational free energy

From a Bayesian perspective:

  • EWC = Laplace approximation with frozen uncertainty
  • MESU = Laplace approximation with dynamic uncertainty

This explains why EWC often over-consolidates parameters and loses flexibility.
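
For reference, the textbook form of the EWC penalty and the Laplace reading of the posterior variance (standard expressions, not copied from this paper) make the comparison explicit:

\[\mathcal{L}_{\text{EWC}}(\boldsymbol{\omega})=\mathcal{C}_t(\boldsymbol{\omega})+\sum_i \frac{\lambda}{2} F_i\left(\omega_i-\omega_i^{*}\right)^2, \qquad \sigma_i^2 \sim \frac{1}{F_i} \approx \frac{1}{H_{ii}}\]

Here the diagonal Fisher information F_i is computed once at the previous optimum and then frozen, whereas in MESU the corresponding precision keeps evolving and can be relaxed again by the forgetting term.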


Epistemic uncertainty and OOD behavior

Because MESU avoids variance collapse, it preserves epistemic uncertainty even after many tasks.

Empirically:

  • In-distribution samples show low epistemic uncertainty
  • OOD samples retain high uncertainty
  • Deterministic methods fail by construction
  • Bayesian methods without forgetting become over-confident

This reinforces a key message:

OOD detection requires not just Bayesian modeling, but controlled forgetting.
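
To make this measurable, here is a small sketch (my own; predict_probs is a hypothetical forward pass returning class probabilities for one weight sample) of the standard Monte Carlo recipe for reading epistemic uncertainty out of a mean-field Gaussian posterior, via the mutual-information split of the predictive entropy:

```python
import numpy as np

def epistemic_uncertainty(x, mu, sigma, predict_probs, n_samples=30, eps=1e-12):
    """Mutual information between prediction and weights, by MC sampling of q(w)."""
    rng = np.random.default_rng(0)
    probs = []
    for _ in range(n_samples):
        w = mu + sigma * rng.normal(size=mu.shape)    # sample weights from q(w)
        probs.append(predict_probs(x, w))             # class probabilities for x
    probs = np.stack(probs)                           # shape (n_samples, n_classes)
    mean_p = probs.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum()                 # predictive entropy
    aleatoric = -(probs * np.log(probs + eps)).sum(axis=1).mean()  # expected entropy
    return total - aleatoric   # epistemic part: near 0 in-distribution, larger for OOD

# If the posterior variance has collapsed (sigma ~ 0), every sample gives the same
# prediction and this quantity stays near 0 even for OOD inputs -- hence the need
# for controlled forgetting to keep the variance alive.
```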


Open questions I’m thinking about

Some directions that feel very natural to explore next:

  • Is there an efficient way to implement these computations in Bayesian neural networks?
  • What happens beyond mean-field Gaussian posteriors?
  • Can the memory window (N) be learned instead of fixed?
  • Is there a clean DMFT / field-theoretic interpretation of variance dynamics?
  • How does this connect to ergodicity breaking in learning dynamics?

It also feels closely related to ideas from:

  • Online Laplace
  • Broken ergodicity
  • Critical slowing down near task saturation

Takeaway

This paper convinced me that:

Forgetting is not a bug in Bayesian learning — it is a missing term.

Once forgetting is introduced at the posterior level, many problems in continual learning line up in a surprisingly clean way.

Definitely a paper I’ll return to.