Paper Reading | Inverse Depth Scaling From Most Layers Being Similar
We know that deeper models generally have larger capacity, but what mechanism lets them put that depth to use? Because the independent-teacher regime matches the empirical signatures of real LLMs more closely, ensemble averaging, in which many similar layers each contribute an estimate and their errors average out, is the most plausible explanation for how real LLMs use depth.
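To make the link between ensemble averaging and inverse depth scaling concrete, here is a minimal toy sketch. It is not the paper's experiment. It assumes each "layer" produces the true target plus independent Gaussian noise, and the function name `ensemble_mse` and its parameters are hypothetical, chosen only for illustration. Under that assumption, the mean squared error of the average of L estimates should fall roughly as 1/L.

```python
import random


def ensemble_mse(num_layers: int, noise_std: float = 1.0,
                 trials: int = 20000, seed: int = 0) -> float:
    """Estimate the MSE of averaging `num_layers` independent noisy estimates.

    Toy assumption (not taken from the paper): every "layer" outputs the true
    target, fixed at 0.0 here, plus independent Gaussian noise with standard
    deviation `noise_std`.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    total_sq_error = 0.0
    for _ in range(trials):
        # Average the independent estimates, as an ensemble would.
        avg = sum(rng.gauss(0.0, noise_std) for _ in range(num_layers)) / num_layers
        total_sq_error += avg * avg  # squared error against the target 0.0
    return total_sq_error / trials


if __name__ == "__main__":
    for depth in (1, 2, 4, 8, 16):
        mse = ensemble_mse(depth)
        # If error scales as 1/depth, depth * mse stays close to noise_std**2.
        print(f"depth={depth:2d}  mse={mse:.4f}  depth*mse={depth * mse:.3f}")
```

In this sketch the product of depth and error stays roughly constant, which is the signature of inverse scaling. The independence of the per-layer noise is the key assumption: if the layers' errors were strongly correlated, averaging would help much less.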