In 1874, Georg Cantor proved something that unsettled the mathematical world: there are different sizes of infinity.[1]

The natural numbers (1, 2, 3, ...) are infinite. The real numbers (every point on the number line, including irrationals like π and √2) are also infinite. But Cantor showed these two infinities aren't the same size. An infinite set is countable if its elements can be paired off, one to one, with the natural numbers; Cantor proved that no such pairing can exhaust the reals. The real numbers are uncountably infinite, a strictly larger infinity than the countable infinity of the naturals. No matter how clever your pairing scheme, you'll always miss real numbers. There are more of them, in a precise mathematical sense, than there are natural numbers to count them with.
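The diagonal argument, Cantor's best-known proof of uncountability, is short enough to run as code. The sketch below treats infinite binary sequences as stand-ins for real numbers: given any purported enumeration of them, it constructs a sequence that differs from the n-th entry at position n, and so appears nowhere in the list. (The example enumeration is invented for illustration.)

```python
# Cantor's diagonal argument as code: given any enumeration of
# infinite binary sequences (a stand-in for reals in [0, 1]),
# build a sequence guaranteed to be absent from the list.

def diagonal(enumeration, n_digits):
    """Return the first n_digits of a sequence missing from the enumeration.

    enumeration(i) -> the i-th listed sequence, as a function digit(j) -> 0/1.
    The constructed sequence flips the i-th digit of the i-th sequence,
    so it differs from sequence i at position i, for every i.
    """
    return [1 - enumeration(i)(i) for i in range(n_digits)]

# Toy "enumeration": sequence i has digit (i + j) % 2 at position j.
listed = lambda i: (lambda j: (i + j) % 2)

missing = diagonal(listed, 5)
# missing differs from sequence i at position i, for every i,
# so it cannot be the i-th entry for any i.
assert all(missing[i] != listed(i)(i) for i in range(5))
```

No matter how the enumeration is chosen, the same construction works, which is exactly why no pairing between naturals and reals can ever be complete.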

This was radical enough that some of Cantor's contemporaries rejected it outright. But the proof is airtight, and its implications reach further than mathematics. They reach into every data center, every analytics dashboard, and every AI training pipeline.

The Data Explosion That Isn't What It Seems

The world generated roughly 120 zettabytes of data in 2023, and that figure doubles approximately every two years.[2] At first glance, this looks like a straightforward scaling problem: more storage, more compute, more bandwidth. But Cantor's insight suggests a subtler issue. Not all of that data is the same kind of infinity.

Structured data, the kind that fits neatly into rows and columns, is like the natural numbers: countable, orderly, and relatively easy to work with. A database table with a billion rows is large, but it's tractable. You can index it, query it, aggregate it.

[Image: a vast library with infinite shelves receding into darkness, a single figure searching the stacks]
Borges imagined a library containing every possible book. Enterprise data lakes are getting uncomfortably close.

Unstructured data is more like the real numbers. Text, images, video, sensor streams, log files: these don't fit into tidy schemas. The space of possible meanings within unstructured data is vastly larger than the space of possible queries you can run against it. You can store it all, but understanding it is a different problem entirely, one that grows faster than the storage does.

This is Cantor's hierarchy playing out in engineering. The countable infinity of data points is manageable. The uncountable infinity of potential meaning within that data is not.

The Library of Babel Problem

In 1941, Jorge Luis Borges imagined a library containing every possible book: every combination of characters, every possible arrangement of words.[3] Such a library would contain the cure for cancer, the complete works of every author who will ever live, and a perfect prediction of tomorrow's weather. It also contains every possible wrong version of each of those things. The library is infinite and therefore useless, because finding anything meaningful requires searching through an incomprehensibly larger space of noise.

Enterprise data lakes often become miniature Libraries of Babel. Organizations collect everything because storage is cheap and the instinct is that more data means more insight. But 80 to 90 percent of enterprise data is "dark data," collected and stored but never analyzed.[4] It sits in data lakes that gradually become data swamps: vast, murky, and expensive to maintain.

The problem isn't storage capacity. Storage scales well. The problem is that the space of meaningful patterns within the data grows combinatorially while the tools to extract those patterns grow linearly at best. You can double your storage every two years. You can't double your understanding on the same schedule.

The Curse of Dimensionality

Machine learning encounters Cantor's hierarchy in a concrete way through the curse of dimensionality. As the number of features in a dataset increases, the volume of the feature space grows exponentially. Data points that seemed close together in low dimensions become sparse and distant in high dimensions. The amount of data you need to maintain statistical significance grows faster than most teams can collect it.[5]

This is a finite echo of Cantor's proof. Adding dimensions to your data doesn't just make the problem bigger; it makes it a different kind of bigger. Ten features require a certain density of data points. A hundred features don't require ten times as many; they require exponentially more. The space you're searching through has grown in a way that linear scaling can't address.
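The sparsity effect can be measured directly. A standard demonstration (a toy illustration, not tied to any dataset in the text) is to draw random points in a unit hypercube and watch the relative gap between the nearest and farthest point shrink as dimensions are added, until "nearest neighbor" barely means anything:

```python
# Distance concentration: in high dimensions, random points become
# nearly equidistant, so proximity-based reasoning degrades.
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from the origin,
    over n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
# The contrast shrinks steadily as dim grows: in high dimensions the
# farthest point is barely farther away than the nearest one.
```

The same number of points that densely covered the low-dimensional space now sits in a vast, nearly uniform void, which is why each added feature demands disproportionately more data.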

LLM training datasets illustrate a version of this. Early scaling of training data produced dramatic improvements in model capability. But the Chinchilla scaling laws showed that simply adding more data yields progressively smaller gains without proportional increases in model size.[6] The low-hanging patterns are captured first. Each additional increment of data contributes less new information.
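The shape of those diminishing returns can be made concrete. Hoffmann et al. fit a parametric loss of the form loss = E + A/N^α + B/D^β, where N is parameter count and D is token count; the constants below are the fits reported in that paper. Holding the model size fixed and doubling the data repeatedly shows the flattening curve:

```python
# Chinchilla's fitted scaling law: loss = E + A / N**alpha + B / D**beta,
# with N = model parameters and D = training tokens.
# Constants are the fits reported in Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Hold the model fixed at 70B parameters and keep doubling the data:
n = 70e9
for d in (1e12, 2e12, 4e12, 8e12):
    print(f"{d:.0e} tokens -> loss {loss(n, d):.4f}")
# Each doubling of data shaves off less loss than the one before,
# and the curve can never drop below the irreducible term E.
```

The power-law exponent β < 1 is the whole story: data must grow multiplicatively to buy additive improvements, and the floor E never moves.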

The Scaling Wall: LLMs and the Limits of More Data

Large language models have pushed Cantor's hierarchy into the headlines. The industry's dominant strategy for the past several years has been straightforward: more data, more parameters, more compute, better models. And it worked, spectacularly, for a while.

But the strategy is hitting walls that look distinctly Cantorian.

The first wall is supply. Researchers at Epoch AI have projected that the stock of publicly available human-generated text on the internet, roughly 300 trillion tokens, could be exhausted as training data sometime between 2026 and 2032 at current consumption rates.[10] The countable infinity of tokens on the internet turns out to be countable in a very practical sense: we can count to the end of it.

The second wall is quality. Not all tokens are equal. A peer-reviewed paper, a well-edited novel, and a spam comment are all tokens, but their informational density varies enormously. As models consume the high-quality data first, each additional tranche of training data contains proportionally more noise. The countable infinity of available text grows, but the meaningful signal within it doesn't grow at the same rate.

The third wall is recursion. As AI-generated content floods the internet, future models increasingly train on text produced by previous models. Research published in Nature has shown that this recursive training leads to "model collapse," where the tails of the original data distribution disappear and the model's outputs become increasingly homogeneous and degraded.[11] It's a feedback loop where shadows train on shadows, each generation losing fidelity to the original signal.
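A toy version of the collapse dynamic (my illustration, not the Nature paper's experiment): fit a normal distribution to a handful of samples, resample from the fit, refit, and repeat. Estimation noise compounds across generations, and the fitted spread decays toward zero, mirroring the disappearing tails:

```python
# Toy model collapse: each "generation" fits a normal distribution to
# samples drawn from the previous generation's fit, then resamples.
# Estimation noise compounds, and the distribution's tails erode.
import random
import statistics

def collapse(generations=300, samples_per_gen=10, seed=42):
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # the original "human" data distribution
    history = [sigma]
    for _ in range(generations):
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_gen)]
        mu = statistics.mean(data)       # refit on the model's own output
        sigma = statistics.pstdev(data)  # population std dev of the samples
        history.append(sigma)
    return history

history = collapse()
print(f"initial spread {history[0]:.3f} -> final spread {history[-1]:.6f}")
```

Nothing in the loop is adversarial; the collapse is a structural consequence of training on your own finite-sample output, generation after generation.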

This is Cantor's hierarchy in action. The tokens are countable. The knowledge, understanding, and reasoning those tokens are supposed to encode belong to a denser, richer space that doesn't scale linearly with token count. You can double the training data and get a fraction of the improvement. You can triple it and get less still. The curve of capability versus data is flattening, and no amount of additional tokens changes the fundamental relationship between the countable infinity of words and the uncountable infinity of meaning.

When More Data Means Less Knowledge

There's a counterintuitive failure mode where collecting more data actively degrades your ability to understand it.

CERN's Large Hadron Collider generates approximately one petabyte of collision data per second. The physicists keep a tiny fraction, roughly one in a million events, selected by trigger algorithms that decide in real time what's worth saving.[7] This is deliberate, principled data destruction. The alternative, keeping everything, would produce a dataset so large that no analysis pipeline could process it meaningfully. More data would mean less physics.
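The trigger idea, pared to its essence (a hypothetical sketch, not CERN's actual trigger code): a cheap predicate runs over the event stream in real time, and only the rare events that clear a threshold survive to storage. The exponential "energy" model below is invented for illustration, and this toy keeps roughly one event in ten thousand; the LHC's real selection is far harsher.

```python
# A trigger-style filter in miniature: score each event cheaply and
# persist only the rare ones that clear a threshold.
import random

def event_stream(n, seed=7):
    rng = random.Random(seed)
    for _ in range(n):
        yield {"energy": rng.expovariate(1.0)}  # most events are low-energy

def trigger(events, threshold):
    """Keep only events whose energy exceeds the threshold."""
    for ev in events:
        if ev["energy"] > threshold:
            yield ev

kept = list(trigger(event_stream(1_000_000), threshold=9.2))
print(f"kept {len(kept)} of 1,000,000 events")
```

The design point is that the filter must be cheap enough to run at line rate: the decision to discard is made once, in real time, because storing everything and deciding later is exactly the option that doesn't scale.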

DNA sequencing faces a similar dynamic. Genomic data is growing faster than Moore's Law, outpacing the computational capacity to analyze it.[8] Sequencing a genome is now cheap. Understanding what the sequence means is not. The data is countable. The biological meaning is something closer to uncountable.

Social media platforms generate billions of posts daily. The vast majority is noise by any analytical standard. Collecting it all is trivial. Extracting signal from it is a research problem that hasn't scaled with the data volume.

The Right to Be Finite

The instinct to collect everything and sort it out later is understandable. Storage is cheap. Deletion feels wasteful. What if that data turns out to be valuable someday?

But Cantor's insight suggests a different framing. Not all data is equally valuable, and the relationship between volume and value isn't linear. At some point, more data doesn't help. It just makes the haystack bigger without adding needles.

Data retention policies, often treated as compliance burdens, are actually concessions to Cantor's hierarchy. They acknowledge that infinite accumulation doesn't produce infinite insight. The EU's right to be forgotten, whatever its legal merits, encodes a mathematical truth: unbounded data persistence doesn't serve unbounded understanding.[9]

The wisest data strategy might not be "collect everything." It might be "collect what you can understand, and understand what you collect." Cantor showed that some infinities are bigger than others. The infinity of data you can store will always be smaller than the infinity of meaning you'd need to extract from it. The gap between those two infinities is where data strategies go to drown.

References

[1] Georg Cantor, "Über eine Eigenschaft des Inbegriffes aller reellen algebraischen Zahlen," Journal für die reine und angewandte Mathematik, vol. 77, 1874, pp. 258–262.

[2] Statista Research Department, "Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025," Statista, September 2023. https://www.statista.com/statistics/871513/worldwide-data-created/

[3] Jorge Luis Borges, "The Library of Babel," in Ficciones, 1944. English translation by Andrew Hurley, Penguin Books, 1998.

[4] Splunk, "The State of Dark Data," Splunk Research Report, 2019. https://www.splunk.com/en_us/form/the-state-of-dark-data.html

[5] Richard Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, 1961. Bellman coined the term "curse of dimensionality."

[6] Jordan Hoffmann et al., "Training Compute-Optimal Large Language Models," arXiv:2203.15556, March 2022. https://arxiv.org/abs/2203.15556

[7] CERN, "Processing: What to record?" CERN Accelerating Science. https://home.cern/science/computing/processing-what-record

[8] Zachary D. Stephens et al., "Big Data: Astronomical or Genomical?" PLOS Biology, Vol. 13, No. 7, July 2015. https://doi.org/10.1371/journal.pbio.1002195

[9] European Parliament and Council, "General Data Protection Regulation (GDPR)," Article 17: Right to Erasure, 2016. https://gdpr-info.eu/art-17-gdpr/

[10] Pablo Villalobos et al., "Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data," Epoch AI, 2024. Reported by PBS NewsHour, June 6, 2024. https://www.pbs.org/newshour/economy/ai-gold-rush-for-chatbot-training-data-could-run-out-of-human-written-text-as-early-as-2026

[11] Ilia Shumailov et al., "AI models collapse when trained on recursively generated data," Nature, Vol. 631, July 2024, pp. 755–759. https://arxiv.org/abs/2305.17493