Here’s why synthetic data is bound to cause more problems than it solves


As AI becomes deeply embedded across sectors, the quality of its foundations becomes more important. One trend receiving growing attention is the use of synthetic data – artificially generated datasets used to train AI systems when real-world data is difficult to obtain. While synthetic data promises speed and scalability, its reliability in complex, operational environments is far from guaranteed.

This raises broader concerns about the role and impact of AI in our daily lives. We live in an age where AI drives decision-making, reshaping industries and redefining how work gets done. Moreover, it is enabling systems to learn, adapt, and act autonomously.

But amid this acceleration, one principle remains non-negotiable: “what cannot be measured cannot be improved – and what cannot be verified cannot be trusted”. As AI systems become more embedded in critical infrastructure, the integrity of their data must meet the same standard.

Given these high stakes, synthetic data appears on the surface to be a logical step in AI’s evolution. But a closer look reveals serious limitations in what artificially generated data can deliver.

The fundamental issue is that synthetic data reflects what we already know or expect. However, the world we operate in, particularly in industries such as infrastructure, manufacturing, and energy, rarely behaves as expected.

These are high-pressure environments where systems must constantly adapt to human behavior and unpredictable conditions. In such settings, AI trained predominantly on artificial data risks missing the very complexity it’s supposed to manage.

AI must be grounded in reality. Otherwise, we are building instruments that look capable in simulations but fall short when faced with the noise and nuance of real-world deployment.


Simulation isn’t enough

Synthetic data has earned its place in AI development because it enables systems to be trained when operational data is unavailable. It also allows for the controlled creation of specific scenarios that may be difficult to encounter otherwise.

One technique is synthetic scene generation – where realistic digital environments are built to help AI learn how to respond to complex or unusual situations before being tested in the real world. It’s particularly common in areas like industrial automation, where replicating every possible scenario physically is impractical.

These virtual environments can speed up early development, but they remain simplified versions of reality. They’re built on assumptions about lighting, positioning, and movement that inevitably shape how AI systems learn.

This introduces blind spots that often go unnoticed until the system is deployed. A model trained to “see” in synthetic scenes might pass lab tests but fail to recognize subtle, real-world deviations it never encountered during training.

This is especially true in complex industrial environments. For example, in manufacturing quality control, synthetic data often misses the subtle variations in materials, environmental conditions, and human interactions that cause real-world defects.

These subtle dynamics are difficult to simulate, let alone anticipate, without a base of real-world observation. The result is AI systems that perform well in controlled tests but fail when deployed in actual production environments.
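To make the limitation concrete, here is a minimal, purely illustrative Python sketch (the parameter names and ranges are hypothetical, not taken from any real system): a synthetic scene generator can only sample the conditions its authors thought to encode, so a real observation outside those ranges is simply never seen during training.

```python
import random

# Hypothetical, simplified parameter ranges a synthetic scene generator might assume.
# Anything outside these ranges (unusual glare, occlusions, worn tooling) never
# appears in the training set at all.
ASSUMED_LIGHTING_LUX = (300, 800)      # assumed indoor lighting range
ASSUMED_PART_OFFSET_MM = (-2.0, 2.0)   # assumed part placement tolerance
ASSUMED_CONVEYOR_SPEED = (0.4, 0.6)    # assumed belt speed, m/s

def generate_synthetic_scene() -> dict:
    """Sample one training scene from the assumed parameter ranges."""
    return {
        "lighting_lux": random.uniform(*ASSUMED_LIGHTING_LUX),
        "part_offset_mm": random.uniform(*ASSUMED_PART_OFFSET_MM),
        "conveyor_speed": random.uniform(*ASSUMED_CONVEYOR_SPEED),
    }

def scene_is_covered(real_scene: dict) -> bool:
    """Check whether a real-world observation falls inside the assumed ranges."""
    ranges = {
        "lighting_lux": ASSUMED_LIGHTING_LUX,
        "part_offset_mm": ASSUMED_PART_OFFSET_MM,
        "conveyor_speed": ASSUMED_CONVEYOR_SPEED,
    }
    return all(lo <= real_scene[key] <= hi for key, (lo, hi) in ranges.items())

# A real observation with harsh sunlight through a window falls outside the
# assumptions, so a model trained only on the synthetic scenes has never "seen" it.
real_observation = {"lighting_lux": 2400, "part_offset_mm": 1.1, "conveyor_speed": 0.5}
print(scene_is_covered(real_observation))  # False: a blind spot at deployment time
```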

The case for reality-first development

In contrast, real-world data, which is collected from sensors, field operations, and machines, offers a more accurate foundation. Unlike synthetic datasets, it captures unpredictability as it happens.

It captures the anomalies, fluctuations, and evolving patterns that characterize live environments. Its growing use also reflects a broader shift across industries toward spatial intelligence powered by reality capture technologies.

Specifically, spatial intelligence transforms raw environmental data into actionable insights by understanding the relationships between objects, spaces, and processes in real-time. Through advanced reality capture sensors and data visualization platforms, we can create comprehensive digital twins that reflect actual conditions, not theoretical ones.
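As a rough illustration of the idea (the element names and readings below are invented for the example, not drawn from any particular platform), a digital twin can be thought of as a model that is continuously updated with measured values and compared against its as-designed state:

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class TwinElement:
    """One monitored element of a digital twin, e.g. a pipe section or machine bed."""
    name: str
    designed_value: float                          # theoretical, as-designed value
    observed: list = field(default_factory=list)   # rolling real-world measurements

    def update(self, measurement: float) -> None:
        """Feed in a new reading from a reality capture sensor."""
        self.observed.append(measurement)

    def deviation(self) -> float:
        """Difference between observed reality and the theoretical model."""
        return mean(self.observed) - self.designed_value if self.observed else 0.0

# As-designed vs as-measured: the twin reflects what the sensors actually report.
beam = TwinElement(name="support_beam_07", designed_value=250.0)  # mm clearance
for reading in (249.1, 248.7, 248.2):   # e.g. laser-scan clearances over time
    beam.update(reading)

print(f"{beam.name}: drifting {beam.deviation():+.1f} mm from design")
```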

By grounding our models in this kind of accurate, real-world data, we move beyond assumptions and simulations. This type of data does not require us to guess what might happen. It shows us what actually happened. And this matters, because the most valuable AI systems are not those that merely replicate our existing knowledge, but those that help us uncover what we did not yet know to ask.

Leading technology companies have put this into practice, where AI systems trained on real-world data learn to adapt instantly and develop sensitivity to shifts in context. The most effective approaches deploy instruments at the edge rather than relying on cloud-based synthetic training, thus enabling immediate decision-making where and when it matters.
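As a loose sketch of that edge pattern (the sensor, threshold, and action names are hypothetical), the point is that the device reads its own sensors and decides locally, with no cloud round trip in the loop:

```python
import random
import time

VIBRATION_LIMIT = 4.5  # hypothetical alarm threshold, mm/s RMS

def read_vibration_sensor() -> float:
    """Stand-in for a local sensor read; a real deployment would query hardware."""
    return random.gauss(3.0, 1.0)

def act_locally(reading: float) -> str:
    """Decide on-device, immediately, rather than waiting on a remote service."""
    return "slow_down_spindle" if reading > VIBRATION_LIMIT else "continue"

for _ in range(5):
    reading = read_vibration_sensor()
    print(f"{reading:5.2f} mm/s -> {act_locally(reading)}")
    time.sleep(0.1)  # in practice, paced by the sensor's sample rate
```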

Ultimately, by grounding AI in digital reality, organizations can build systems that reflect the environments they serve. Not just once at the point of deployment, but continuously, throughout their lifecycle.

Practical systems, not theoretical ones

This shift, from synthetic-first to reality-first, requires a change in how we think about intelligence. AI should not be treated as a one-time model to be built and deployed. Just as LLM-based generative AI requires new datasets to evolve, so do reality-first spatial intelligence systems.

This is not a limitation. It’s a strength. When AI is shaped by lived experience, it becomes more than just a prediction engine; it responds to change and reflects the intricacies of the physical world.

Rebuilding trust through transparency

There is also a wider implication. As AI is used to guide more and more decisions, such as planning maintenance, questions of transparency and accountability come to the fore. Synthetic data can be difficult to trace. It does not show its origins or highlight its assumptions. Real-world data, by comparison, is measurable. We know where it came from. We understand how it evolved.

In critical industries, regulatory requirements often mandate verified data sources and complete audit trails, standards that synthetic data cannot meet. Indeed, safety regulations, compliance frameworks, and accountability measures all depend on the traceability that only real-world data can provide.
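In practice, that traceability can start with attaching verifiable provenance to every measurement. The sketch below is illustrative only (the field names and the hashing choice are assumptions, not a description of any specific compliance framework): each reading carries its source, location, timestamp, and a checksum so tampering is detectable later in the audit trail.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Minimal audit-trail entry tying a measurement to a verifiable source."""
    sensor_id: str      # which physical instrument produced the reading
    site: str           # where it was captured
    captured_at: str    # when, in UTC
    value: float        # the measurement itself
    checksum: str = ""  # hash of the payload, so tampering is detectable

    def seal(self) -> "ProvenanceRecord":
        """Compute a checksum over everything except the checksum field itself."""
        payload = json.dumps(
            {k: v for k, v in asdict(self).items() if k != "checksum"},
            sort_keys=True,
        )
        self.checksum = hashlib.sha256(payload.encode()).hexdigest()
        return self

record = ProvenanceRecord(
    sensor_id="laser_scanner_12",
    site="plant_a/line_3",
    captured_at=datetime.now(timezone.utc).isoformat(),
    value=248.2,
).seal()

print(record.checksum[:16], "->", record.value)
```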

Far from being a mere regulatory requirement, traceability underpins the trust we place in AI systems. If we are to deploy these systems into public infrastructure, critical industries, or frontline workflows, we must be confident in the accuracy and reliability of the source data.

Grounding AI in real-world experience is an ethical necessity. As complexity grows, clarity becomes our most valuable currency. Measurable and verifiable results are no longer just best practices, but the foundation of credibility and growth, giving companies the power to not just guess better, but to know better.

Looking ahead

Synthetic data will continue to play a role in AI development. It offers value in areas where data is sensitive, access is limited, or testing needs are extreme. But it should not replace the insights that come from genuine, real-world inputs.

To unlock AI’s full potential, we must shift our focus from simulated possibilities to the rich, real-time signals already present in the environments in which we operate. The greatest opportunity lies in embracing the operational and environmental data we already generate and building systems that reflect and respond to it.

These are the systems that will endure, scale, and adapt. These are the systems that will be trusted. As we move forward, the challenge is not to simulate intelligence, but to connect it more deeply with the reality it is meant to support. The future belongs to those who can prove it.



Burkhard Boeckem is CTO at Hexagon.
