Elon Musk and former OpenAI chief scientist Ilya Sutskever say that AI companies have run out of real-world data to train generative models on.
“We’ve now exhausted basically the cumulative sum of human knowledge … in AI training,” Musk told Stagwell chairman Mark Penn during an X livestream yesterday, per TechCrunch. “That happened basically last year.”
Musk’s comments came just a few days after Sutskever, who helped build ChatGPT, told the annual NeurIPS conference that “we have achieved peak data and there’ll be no more.”
If true, it means that all of the available data on the internet has already been used up to train AI models. PetaPixel reported on this phenomenon back in November when it came to light that OpenAI was struggling with its new model, Orion, which was reportedly falling short of internal expectations.
Similarly, Google’s newest iteration of Gemini is reportedly not much better than the previous one, while Anthropic has delayed the release of its next Claude model.
One of the reasons cited is that “it’s become increasingly difficult to find new, untapped sources of high-quality, human-made training data that can be used to build more advanced AI systems.”
Synthetic Data
Musk suggested that the way for AI companies to plug this gap is synthetic data, i.e., content that generative AI models themselves produce.
“The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data],” Musk says. “With synthetic data … [AI] will sort of grade itself and go through this process of self-learning.”
However, this method is far from proven. One study suggested that AI models trained on AI-generated images start churning out garbage images, with the lead author comparing the effect to species inbreeding.
“If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Hany Farid, a computer scientist at the University of California, Berkeley.
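To make the “inbreeding” analogy concrete, here is a minimal toy sketch in Python — not taken from any of the studies or companies mentioned above — in which each “model” is simply a Gaussian fitted to samples drawn from the previous generation’s model. Because each round trains only on a finite batch of synthetic data, estimation errors compound and the learned distribution drifts away from the original, human-generated data.

```python
# Toy illustration of the model-collapse idea (hypothetical example, not
# the method used by any AI lab): each generation's "model" is a Gaussian
# fitted to samples produced by the previous generation's model.
import numpy as np

rng = np.random.default_rng(0)

real_data = rng.normal(loc=0.0, scale=1.0, size=1_000)  # stand-in for human-made data
mu, sigma = real_data.mean(), real_data.std()           # generation 0: fit to real data

for generation in range(1, 11):
    synthetic = rng.normal(mu, sigma, size=200)    # model generates its own training set
    mu, sigma = synthetic.mean(), synthetic.std()  # next model trained only on synthetic data
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")

# The fitted spread tends to wander and shrink across generations instead of
# staying near the true value of 1.0 -- the "gene pool" narrows each round.
```

The sketch is deliberately simplistic, but it captures the mechanism researchers worry about: once models learn only from their own outputs, errors and blind spots get amplified rather than corrected.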
Nevertheless, TechCrunch notes that Microsoft, Meta, OpenAI, and Anthropic are all already using synthetic data to train AI models. While this approach has obvious benefits such as cutting costs, a model’s capabilities could be compromised by limitations inherent in the training data.