Is it possible for an AI to be trained just on data generated by another AI? It might sound like a harebrained idea. But it’s one that’s been around for quite some time — and as new, real data is increasingly hard to come by, it’s been gaining traction.
Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.
But why does AI need data in the first place — and what kind of data does it need? And can this data really be replaced by synthetic data?
The importance of annotations
AI systems are statistical machines. Trained on a lot of examples, they learn the patterns in those examples to make predictions, like that “to whom” in an email typically precedes “it may concern.”
Annotations, usually text labeling the meaning or parts of the data these systems ingest, are a key piece of these examples. They serve as guideposts, “teaching” a model to distinguish among things, places, and ideas.
Consider a photo-classifying model shown lots of pictures of kitchens labeled with the word “kitchen.” As it trains, the model will begin to make associations between “kitchen” and general characteristics of kitchens (e.g. that they contain fridges and countertops). After training, given a photo of a kitchen that wasn’t included in the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled “cow,” it would identify them as cows, which emphasizes the importance of good annotation.)
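To make the role of those labels concrete, here is a minimal sketch, in PyTorch, of the kind of training loop a photo classifier like this might use. The folder layout, paths, and hyperparameters are hypothetical; the point is simply that the model adjusts its weights to match whatever label each image carries, which is why mislabeled kitchens would become “cows.”

```python
# Minimal illustrative sketch, not any particular company's pipeline.
# Assumes a hypothetical folder layout: photos/train/kitchen/*.jpg, photos/train/bedroom/*.jpg, ...
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# ImageFolder derives each photo's label (its annotation) from the directory name.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_data = datasets.ImageFolder("photos/train", transform=transform)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

# A small off-the-shelf network, with its final layer sized to the number of labels.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, len(train_data.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    # The loss compares the model's guess against the human-supplied label,
    # so the model learns whatever association the annotation encodes.
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```

Nothing in the loop checks whether the labels are sensible; if every kitchen photo were annotated “cow,” the model would dutifully learn to call kitchens cows.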
The appetite for AI, and the need to provide labeled data for its development, has caused the market for annotation services to balloon. Dimension Market Research estimates that it’s worth $838.2 million today and will be worth $10.34 billion in the next 10 years. While there aren’t precise estimates of how many people engage in labeling work, a 2022 paper pegs the number in the “millions.”
Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay reasonably well, particularly if the labeling requires specialized knowledge (e.g. math expertise). Others can be backbreaking. Annotat …