🧠 Chan’s Curiosity Log — January 5, 2026
Daily reflections on new papers, theories, and open questions.
🧩 Paper: Scaling laws and representation learning in simple hierarchical languages: Transformers versus convolutional architectures
📄 Phys. Rev. E (2025)
1.1 Background
Different neural network architectures are typically matched to different tasks: transformers for language modeling, convolutional networks for computer vision.
Power laws are among the most prominent empirical findings in modern deep learning: they appear remarkably consistent across domains and architectures and now serve as practical heuristics for guiding resource allocation in large-scale models.
Their ubiquity suggests the existence of fundamental statistical phenomena underlying deep learning.
If these phenomena are truly universal, they must reside in the data itself—although personally, I think they arise from the interaction between neural networks and datasets, not solely from deterministic statistical properties of the data.
1.2 Questions
How do neural language models acquire a language’s structure when trained for next-token prediction?
More specifically, how do different architectures learn hierarchically compositional data, and how do architectural differences affect the scaling of performance?
1.3 Key Idea
The authors focus on synthetic datasets generated by the random hierarchy model (RHM)—an ensemble of hierarchically compositional generative processes corresponding to simple context-free grammars (CFGs).
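To fix the idea for myself, here is a minimal sketch of an RHM-style generator. This is my own simplification, not the authors' code: a depth-`L` grammar where every symbol at each level has `m` candidate production rules, each expanding it into `s` symbols drawn from a vocabulary of size `v` (the actual RHM samples distinct production rules per symbol; I just draw them randomly here).

```python
import random

def make_grammar(L, v, m, s, seed=0):
    """Random CFG-like grammar: rules[level][symbol] -> m productions,
    each a tuple of s symbols for the next level down."""
    rng = random.Random(seed)
    return [
        {a: [tuple(rng.randrange(v) for _ in range(s)) for _ in range(m)]
         for a in range(v)}
        for _ in range(L)
    ]

def sample_string(grammar, root, seed=None):
    """Expand the root symbol level by level; returns the s**L leaf tokens."""
    rng = random.Random(seed)
    symbols = [root]
    for rules in grammar:
        symbols = [child for a in symbols for child in rng.choice(rules[a])]
    return symbols

grammar = make_grammar(L=3, v=8, m=2, s=2)
tokens = sample_string(grammar, root=0, seed=1)
print(len(tokens))  # → 8, i.e. s**L leaves
```

The point of the construction is that each leaf token is only indirectly tied to the root, through the hierarchy of latent production choices, which is what makes the correlation structure between tokens informative.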
In previous work, they developed a theory of representation learning based on data correlations, explaining how deep models capture hierarchical structure sequentially, layer by layer.
Here, they extend this theoretical framework to explicitly account for architectural differences.
1.4 Key Conclusions and Results
- Deep networks infer the hierarchical structure of the data by exploiting correlations between tokens, and the authors derive the resulting scaling of test error with the number of training samples.
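As a reminder of how such a sample-wise scaling law is usually extracted in practice: fit test error as A · P^(−α) on log-log axes. The data below are synthetic placeholders I made up, not numbers from the paper.

```python
import numpy as np

# Pretend measurements: test error at several training-set sizes P,
# generated here from an exact power law with exponent 0.33.
P = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
err = 2.0 * P ** (-0.33)

# Linear fit in log-log space; the slope is minus the scaling exponent.
slope, logA = np.polyfit(np.log(P), np.log(err), 1)
alpha = -slope
print(f"fitted exponent alpha ≈ {alpha:.2f}")  # → 0.33 for this clean data
```

On real learning curves the fit only holds over a finite window of P, so the window choice matters as much as the fit itself.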
1.5 Why It Is Interesting
This work introduces an analytical model for language together with a theoretical method to calculate scaling laws, offering a rare bridge between language modeling and statistical physics–style analysis.
1.6 Questions Worth Exploring
- How many distinct theoretical approaches currently exist for deriving scaling laws?
- How do the statistical properties of the data and the learning dynamics interact to drive the evolution of representations?
