
đź’ˇSnake eating its tail: how can synthetic data possibly work for training AI?

Joke: Why did the Github project board go to therapy?

(Answer below ⬇️)

LLM training has exhausted nearly all existing human data. So, we're now training LLMs on synthetic datasets, i.e., data produced by the LLM itself.

The strange thing is that training on synthetic data works. As in, training an LLM on data that it produced yields a better LLM. But how can this work?

Isn't this the snake eating its tail? Isn't this like a dung beetle eating its own feces? It almost seems to go against conservation of energy.

After all, LLMs are just interpolation, aren't they? This process seems like it would lead to a closed loop of self-cannibalism where quality inevitably degrades.

My hypothesis is that LLMs do more than this, and the key to understanding why synthetic data works is through analogy.

Hypothesis: Thinking in an Empty Room

Let's start with a thought experiment. We, as humans, explore ideas just by thinking more.

If you were put in an empty room with only a blank notepad and a pen, could you come out of that room with significantly more knowledge than when you went in, given time and no other external stimulation?

This is a situation where there's very little new data about the world, yet my answer is yes. You'd still be able to create new knowledge.

If you were left in this room forever, would you run out of things to think about?

I think the answer is no. You would never run out of things to think about, even though there's no new external stimulus. (You'd go mad due to the lack of social stimulation, but that's beside the point. You'd keep thinking in your madness.)

Assuming you came out of the room before losing your sanity, I'd imagine you'd have gained significant knowledge while in there.

The analogy here is that the LLM is the thing that can keep thinking forever, using just its existing thoughts and building on them.

Like us in the empty room, it can create new knowledge by combining and varying existing data.

Dreams as varied training data

This is similar to how I think of dreams. We know that when you go to sleep, you wake up smarter, having assimilated the day’s events, particularly if you’re sleeping enough (e.g., over 7.5 hours).

Dreams are often strange, yet relatable. Your brain makes up situations that are subtly different from, but still somehow connected to, the data it already knows as reality.

In this sense, dreams can act like data augmentation.

Analogy: Data Augmentation

This brings us to the most direct technical analogy. In machine learning, you can improve the performance of image classifiers (e.g., for cats and dogs) simply by doing data augmentation.

A quick explanation of data augmentation for anyone unfamiliar with it:

Data augmentation is a standard ML technique used during training that transforms your data to vary it, without changing its essence, with the effect of improving your final model's ability to generalize.

Data augmentation can be very simple. Taking an example from the Fast.ai course, you can take a set of cat photos and apply some basic transforms. Just by rotating the cats 90 degrees, skewing them, making them bigger, and so on - making them different while remaining distinctively "catty" - you can significantly improve your model's ability to generalize, i.e., to understand what "catness" is.

This is data augmentation. Variation to produce better understanding.
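To make the idea concrete, here's a minimal sketch in plain Python. The "image" is just a 2D grid of pixel values, and the transforms are hypothetical stand-ins for what a real pipeline (e.g., torchvision or Fast.ai's transforms) would do; the point is that one labeled example becomes several label-preserving variants.

```python
# Toy data augmentation: produce varied copies of an "image"
# (a 2D grid of pixel values) without changing what it depicts.

def rotate_90(image):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2D grid left-to-right."""
    return [row[::-1] for row in image]

def augment(image):
    """Return the original plus several label-preserving variants."""
    return [
        image,
        rotate_90(image),
        rotate_90(rotate_90(image)),  # 180 degrees
        flip_horizontal(image),
    ]

# One "cat photo" becomes four training examples, all still a cat.
cat = [[1, 2],
       [3, 4]]
variants = augment(cat)
print(len(variants))  # 4
```

A real augmentation pipeline adds randomness (random crops, small rotations, color jitter) so the model rarely sees the exact same pixels twice - the analogue of the brain dreaming up subtly varied situations.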

Conclusion: Contemplation not Self-Cannibalism

My hypothesis: training a model with synthetic data is less like a snake eating its own tail and more like learning from imaginative variations of things you already know.

Modelling potential new situations in your mind is useful. You're not just repeating the same information. You're creating new ideas by varying the things you already know to gain new knowledge - without the need for external stimuli.

Therefore, the learning process with synthetic data isn't a closed loop of degradation. It seems more like a form of internal contemplation, an exploration of the latent space.

Note: I'm (shortly) releasing an AI photo generator I built with Next.js

Over the past few days, I've been building an AI photo generator product in a heavily AI-enhanced way (I built a very similar product back in 2022). If you want to see the just-working version (still featuring placeholders), it's here: https://amazing.photos

I'm going to polish it tomorrow, then I'll send you another email walking you through how I built it with Next.js in just a few days.

Joke: Why did the Github project board go to therapy?

→ It had too many unresolved issues.