🧠 Chan’s Curiosity Log — January 5, 2026
Daily reflections on new papers, theories, and open questions.
🧩 Paper: Scaling laws and representation learning in simple hierarchical languages: Transformers versus convolutional architectures
📄 Phys. Rev. E (2025)
1.1 Background
Different neural network architectures are typically matched to different tasks: transformers for language modeling, convolutional networks for computer vision.
Power laws are among the most prominent empirical findings in modern deep learning: they appear remarkably consistent across domains and architectures and now serve as practical heuristics for guiding resource allocation in large-scale models.
Their ubiquity suggests the existence of fundamental statistical phenomena underlying deep learning.
If these phenomena are truly universal, they must reside in the data itself—although personally, I think they arise from the interaction between neural networks and datasets, not solely from deterministic statistical properties of the data.
1.2 Questions
How do neural language models acquire a language’s structure when trained for next-token prediction?
More specifically, how do different architectures learn hierarchically compositional data, and how do architectural differences affect the scaling of performance?
1.3 Key Idea
The authors focus on synthetic datasets generated by the random hierarchy model (RHM)—an ensemble of hierarchically compositional generative processes corresponding to simple context-free grammars (CFGs).
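To fix the idea for myself, here is a minimal sketch of an RHM-style generator. This is my own simplification, not the authors' code: a depth-`L` grammar where every symbol at each level has `m` candidate production rules, each expanding it into `s` symbols drawn from a vocabulary of size `v` (the actual RHM samples distinct production rules per symbol; I just draw them randomly here).

```python
import random

def make_grammar(L, v, m, s, seed=0):
    """Random CFG-like grammar: rules[level][symbol] -> m productions,
    each a tuple of s symbols for the next level down."""
    rng = random.Random(seed)
    return [
        {a: [tuple(rng.randrange(v) for _ in range(s)) for _ in range(m)]
         for a in range(v)}
        for _ in range(L)
    ]

def sample_string(grammar, root, seed=None):
    """Expand the root symbol level by level; returns the s**L leaf tokens."""
    rng = random.Random(seed)
    symbols = [root]
    for rules in grammar:
        symbols = [child for a in symbols for child in rng.choice(rules[a])]
    return symbols

grammar = make_grammar(L=3, v=8, m=2, s=2)
tokens = sample_string(grammar, root=0, seed=1)
print(len(tokens))  # → 8, i.e. s**L leaves
```

The point of the construction is that each leaf token is only indirectly tied to the root, through the hierarchy of latent production choices, which is what makes the correlation structure between tokens informative.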
In previous work, they developed a theory of representation learning based on data correlations, explaining how deep models capture hierarchical structure sequentially, layer by layer.
Here, they extend this theoretical framework to explicitly account for architectural differences.
1.4 Key Conclusions and Results
- Deep networks infer the hierarchical structure of the data by exploiting correlations between tokens, and the authors derive the resulting scaling of test error with the number of training samples.
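As a reminder of how such a sample-wise scaling law is usually extracted in practice: fit test error as A · P^(−α) on log-log axes. The data below are synthetic placeholders I made up, not numbers from the paper.

```python
import numpy as np

# Pretend measurements: test error at several training-set sizes P,
# generated here from an exact power law with exponent 0.33.
P = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
err = 2.0 * P ** (-0.33)

# Linear fit in log-log space; the slope is minus the scaling exponent.
slope, logA = np.polyfit(np.log(P), np.log(err), 1)
alpha = -slope
print(f"fitted exponent alpha ≈ {alpha:.2f}")  # → 0.33 for this clean data
```

On real learning curves the fit only holds over a finite window of P, so the window choice matters as much as the fit itself.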
1.5 Why It Is Interesting
This work introduces an analytical model for language together with a theoretical method to calculate scaling laws, offering a rare bridge between language modeling and statistical physics–style analysis.
1.6 Questions Worth Exploring
- How many distinct theoretical approaches currently exist for deriving scaling laws?
- How do the statistical properties of the data and the learning dynamics interact to drive the evolution of representations?
