
There’s a word researchers use to describe the best kind of data for training AI. The word is “sloppy.” And it means the opposite of what you’d expect.

Most people assume good data is clean, well-labelled, and balanced. Those things matter. But there’s another quality — one that’s invisible to most people building AI — that turns out to be just as important. It’s the shape of the data’s internal structure.

What “Sloppy” Actually Means

Think about a large collection of photos. If you were to measure how much variation exists across the images, you’d find that most of it is concentrated in a small number of patterns. Is it indoors or outdoors? Dark or bright? Does it contain a face? A handful of patterns like these explain the bulk of what makes each image different from every other.

Once you’ve accounted for those main patterns, everything else — the tiny texture variations, the minor lighting shifts, the background details — matters much less. The remaining variation trails off fast.

That’s what researchers mean by sloppy data. A few dimensions carry most of the signal. The rest carry almost nothing.

The opposite would be data where every dimension matters equally. Imagine purely random noise: no hierarchy, no dominant patterns, every variation as important as every other. That’s non-sloppy. And, it turns out, terrible for training AI.
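The contrast is easy to see numerically. Here is a small sketch using NumPy and made-up synthetic data (not the datasets from the research): one dataset where the importance of each direction falls off quickly, and one that is pure noise. Measuring how much of the total variation the top five directions explain makes the difference obvious.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 100

# "Sloppy" stand-in: direction k matters roughly 1/k as much as the first,
# mimicking a few big patterns (bright/dark, indoor/outdoor) plus a long
# tail of minor details.
scales = 1.0 / np.arange(1, d + 1)
structured = rng.standard_normal((n, d)) * scales

# Non-sloppy stand-in: pure noise, every direction equally important.
noise = rng.standard_normal((n, d))

def spectrum(X):
    """Fraction of total variance explained by each principal direction."""
    var = np.linalg.svd(X - X.mean(axis=0), compute_uv=False) ** 2
    return var / var.sum()

# Top 5 of 100 directions: most of the variance for structured data,
# barely more than 5% for noise.
print(spectrum(structured)[:5].sum())
print(spectrum(noise)[:5].sum())
```

The exact numbers depend on the decay rate chosen here, but the shape of the result does not: structured data concentrates its variation in a handful of directions, while noise spreads it evenly.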

Why Structure Does the Heavy Lifting

In research published in PNAS (2024), and expanded in the PhD thesis “The Hidden Geometry of Learning” by Jialin Mao (University of Pennsylvania, 2025, supervised by Pratik Chaudhari), researchers discovered something striking: the sloppiness of your data flows directly into the model that learns from it.

When data has a sharp hierarchy — a few big patterns, then a rapid drop-off — the model’s internal sensitivity mirrors that same shape. The model effectively learns that only a small number of things truly matter. It constrains itself.

This has a practical consequence. A model with millions of parameters could memorise every quirk of the training data. But when the data is sloppy, the structure steers the model toward the real patterns instead of the noise. The data is doing the regularisation — the term researchers use for techniques that stop a model from over-fitting, that is, learning the training data so specifically that it fails on anything new.

The Experiment That Made This Concrete

The researchers tested this directly. They trained two sets of models: one on sloppy data (real-world structure, with the rapid drop-off), and one on non-sloppy data (flat spectrum, where all dimensions mattered equally). Both sets of models were trained until they fit the training data perfectly. Zero errors on examples they’d seen.

Then they tested both on new examples the models hadn’t seen.

The models trained on sloppy data generalised well. The models trained on non-sloppy data failed badly.

Same model. Same training approach. The only difference was the structure of the data.

What’s more, the researchers showed that smaller models trained on sloppy data could generalise as well as much larger models trained on less sloppy data. The structure of the data was, in effect, making the model more efficient.
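A toy version of this contrast can be reproduced with an over-parameterised linear model — the simplest setting where a model has enough capacity to fit its training data perfectly. This is an illustrative sketch only, not the paper's actual experimental setup: the feature spectra, dimensions, and signal are all made up. Both fits below reach essentially zero training error, but only the one trained on sloppy-spectrum data generalises.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 50, 500, 500  # far more parameters than training examples

def run(eigenvalues):
    # Draw features whose variance along direction k is eigenvalues[k].
    scale = np.sqrt(eigenvalues)
    X_train = rng.standard_normal((n_train, d)) * scale
    X_test = rng.standard_normal((n_test, d)) * scale
    # The true signal lives in the dominant directions.
    w_star = np.zeros(d)
    w_star[:5] = 1.0
    y_train, y_test = X_train @ w_star, X_test @ w_star
    # Minimum-norm interpolator: fits the training data exactly (d > n_train).
    w_hat = np.linalg.pinv(X_train) @ y_train
    train_err = np.mean((X_train @ w_hat - y_train) ** 2)
    test_err = np.mean((X_test @ w_hat - y_test) ** 2)
    return train_err, test_err

sloppy = 1.0 / np.arange(1, d + 1) ** 2  # rapid drop-off
flat = np.ones(d)                         # every direction matters equally

print(run(sloppy))  # near-zero training error, low test error
print(run(flat))    # near-zero training error, much higher test error
```

Both models memorise their training sets completely. The one fed sloppy data still does well on fresh examples, because the data's structure pointed it at the directions that actually carry the signal; the one fed flat-spectrum data has no such guide.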

What This Means in Practice

This finding helps explain something that has puzzled researchers for years: why does AI trained on real-world data generalise at all? These models are huge. They could, in theory, just memorise everything. Why don’t they?

Part of the answer is that real-world data — images, text, audio — is naturally sloppy. Language has structure at multiple scales. Images have dominant patterns. Audio has consistent rhythms and timbres. In every case, a small number of dimensions carry most of the information.

That structure quietly constrains what the model learns. It doesn’t need to be told to ignore irrelevant details. The data already demotes them.

Clean data is good. But structured data — data with a natural hierarchy of what matters — is what makes AI genuinely reliable.


This article draws on research by Jialin Mao, Itay Griniasty, Han Kheng Teoh, Rahul Ramesh, Rubing Yang, Mark K. Transtrum, James P. Sethna, and Pratik Chaudhari. The key findings were published in the Proceedings of the National Academy of Sciences (2024). The full dissertation, “The Hidden Geometry of Learning,” is available at pratikac.github.io/pub/mao.thesis25.pdf.
