The end of data domains

Jasper Gilley
San Francisco, California
April 2025

One aspect of the ARC-AGI saga that I think has gone underdiscussed is how big of a deal it is that o3 was able to extrapolate completely out of domain to solve ARC tasks. We now have convincing evidence that reasoning models can adapt to any textual task not seen during training, given sufficient compute. This means that data domains don't really exist anymore - it's just a matter of how much compute you're comfortable spending to venture into them.

Visual reasoning

A complaint I've heard about ARC as a benchmark for LLMs is that what ARC asks you to do is not really a language task - it's more of a visual processing task that LLMs happen to be able to bumble their way through sometimes. This complaint is entirely reasonable, and some of the early marketing copy around ARC ("the average 5-year-old can solve this task but LLMs can't!") didn't help.

But the weirdness of the measured task is precisely what's impressive about reasoning models being able to bumble their way through to a ~75% success rate. ARC is useful as a benchmark for language models because it's both completely out-of-domain and not amenable to reward hacking.

I'm excited about the implications of this because a lot of useful tasks involve data that is somewhat OOD for language models, but less OOD than ARC. Suppose you have object tracking data in the form of (x, y, z) coordinates over time, encoded in JSON. It probably wouldn't be reasonable to expect a traditional language model to be able to draw inferences about the spatial patterns in this data over time, since these models aren't natively trained to do so. But a model capable of solving ARC tasks at a decent clip could probably draw some useful inferences, provided with a few examples of what to look for.
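To make the object-tracking example concrete, here's a minimal sketch (every field name, coordinate, and label is invented for illustration) of JSON trajectory data folded into a few-shot prompt for a reasoning model:

```python
import json

# Hypothetical object-tracking data: (x, y, z) positions over time.
# All values here are invented for illustration.
track = [
    {"t": 0.0, "x": 0.0, "y": 0.0, "z": 10.0},
    {"t": 0.5, "x": 1.2, "y": 0.1, "z": 8.8},
    {"t": 1.0, "x": 2.4, "y": 0.2, "z": 7.1},
    {"t": 1.5, "x": 3.6, "y": 0.3, "z": 4.9},
]

# A few labeled examples of "what to look for" - the handful of
# ground truth the model generalizes from.
examples = [
    ("z decreasing steadily while x advances", "falling"),
    ("constant z, x and y tracing a circle", "orbiting"),
]

def build_prompt(track, examples):
    """Fold the JSON trajectory and labeled examples into one prompt."""
    shots = "\n".join(f"Pattern: {desc} -> Label: {label}"
                      for desc, label in examples)
    return (
        "Classify the motion pattern in this trajectory.\n"
        f"{shots}\n"
        f"Trajectory: {json.dumps(track)}\n"
        "Label:"
    )

prompt = build_prompt(track, examples)
```

The point isn't the prompt format, which is arbitrary here - it's that the spatial pattern only exists across the numbers, which is exactly the kind of structure a traditional language model wasn't trained to pick out.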

Nor should we expect this sort of extrapolation ability to be limited to massive frontier models. People are already doing really cool projects with local RL on data domains that are a little unconventional for language models. Some cool ones I've seen include recommender systems, sudoku, 2048, and debating.
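To make "local RL on an unconventional domain" concrete, here's a sketch of a verifiable reward function for the sudoku case - the kind of dense, reward-hack-resistant signal you could plug into an RL loop. The grid format and partial-credit scoring are my own assumptions, not taken from any particular project:

```python
def sudoku_reward(grid):
    """Reward = fraction of row/column/box constraints satisfied.

    `grid` is a 9x9 list of lists of ints 1-9 (a model's completed
    solution). A fully valid solution scores 1.0; partial credit keeps
    the training signal dense. This scoring is an assumption, not a
    standard.
    """
    units = []
    units += [row for row in grid]                               # 9 rows
    units += [[grid[r][c] for r in range(9)] for c in range(9)]  # 9 cols
    units += [
        [grid[br + r][bc + c] for r in range(3) for c in range(3)]
        for br in (0, 3, 6) for bc in (0, 3, 6)
    ]                                                            # 9 boxes
    valid = sum(sorted(u) == list(range(1, 10)) for u in units)
    return valid / len(units)
```

The appeal of domains like this for RL is that the reward is checkable by a few lines of code, so the model can't talk its way into a high score.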

What's especially interesting to me about these mini-domains is that without frontier reasoning models or local RL, they'd be pretty much unsolvable by AI. You couldn't train a model on the mini-domain alone, since the domain itself doesn't encode much data. Wasteful as it may seem to burn thousands of dollars on a random ARC task, starting from a nearby capabilities point and spending compute to infiltrate the new mini-domain is probably the only way these mini-domains would ever get solved.

Text eats all

All this points to a near future in which it often makes sense to throw random textual-ish data into reasoning models and trust them to figure it out. You could imagine this working with:

  1. Biological data
  2. 3D modeling/CAD data
  3. Human computer use workflows
  4. Physics data
  5. Financial market data
  6. Musical data/music notation???
  7. ...

As long as you have a textual representation of real-world data and a few ground truth labels to work from, text is poised to eat all.
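One way this might look in practice - with the domain, field names, figures, and labels all invented for illustration - is serializing each row of a market-data table into a line of text, attaching the handful of ground-truth labels you have, and handing the rest to the model:

```python
# Hypothetical daily market bars; every number here is invented.
rows = [
    {"date": "2025-01-02", "open": 101.2, "close": 103.5, "volume": 1200000},
    {"date": "2025-01-03", "open": 103.6, "close": 102.1, "volume": 950000},
    {"date": "2025-01-06", "open": 102.0, "close": 104.8, "volume": 1500000},
]

# The few ground-truth labels available; the model fills in the rest.
labels = {"2025-01-02": "accumulation", "2025-01-03": "distribution"}

def to_text(row):
    """One bar -> one line of text a language model can read."""
    return (f"{row['date']}: opened {row['open']}, closed {row['close']}, "
            f"volume {row['volume']}")

lines = []
for row in rows:
    line = to_text(row)
    if row["date"] in labels:
        line += f" => {labels[row['date']]}"
    lines.append(line)

prompt = "Label the final day:\n" + "\n".join(lines)
```

Nothing here is specific to finance; swap in protein sequences, CAD operations, or keystroke logs and the recipe is the same.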

