Evolving observability architecture for cloud-scale event data


Modern businesses produce massive amounts of telemetry, and as their systems grow more complex, the risk of failure grows with them.

Under normal conditions, observability platforms perform well, delivering responsive dashboards and reliable alerts. The trouble starts when conditions stop being normal: during an incident, teams need answers to questions no dashboard anticipated.


Chief Architect at Imply.

When the Architecture Doesn’t Match the Workload

The scale and complexity of modern telemetry are exposing a core architectural problem. Many observability platforms are designed for predictable, steady-state monitoring – not the unpredictable, exploratory queries required during incidents.

Monolithic, detection-oriented observability systems assume they can support both predefined monitoring and open-ended investigation. But in practice, these systems are optimized for workflows where questions are known in advance.

When unexpected issues arise, these systems struggle to handle ad hoc queries across large data sets and multiple teams.

This is not a feature gap. It is an architectural mismatch.


How Evolving Observability Environments Scale Complexity and Cost

Today’s digital environments, driven by microservices, cloud-native infrastructure, and AI workloads, produce far more telemetry than organizations have ever had to manage. As data grows, teams rely on large historical datasets to investigate issues, often with multiple users querying them concurrently.

At the same time, cloud economics have shifted. Cloud storage is relatively inexpensive, but compute (especially during large, exploratory queries) has become a dominant cost driver. Many observability platforms tightly couple storage, indexing, and compute, forcing organizations to scale all three together.

This creates a structural inefficiency: costs rise with data volume, even when that data is only queried occasionally. Teams are left choosing between retaining less data, accepting slower investigations, or overprovisioning IT infrastructure for peak performance.
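To make that inefficiency concrete, here is a toy back-of-the-envelope model in Python. Every price, volume, and ratio in it is an illustrative assumption rather than a benchmark or vendor figure; the point is only that coupled compute grows with retained data, while decoupled compute grows with actual query demand.

```python
# Toy cost model contrasting coupled vs. decoupled scaling.
# All prices and volumes below are illustrative assumptions.

STORAGE_PRICE_PER_TB_MONTH = 20.0     # assumed object-storage price
COMPUTE_PRICE_PER_NODE_MONTH = 500.0  # assumed always-on query-node price
HOURS_PER_MONTH = 730

data_tb = 200      # telemetry retained, in TB
query_hours = 40   # hours of heavy investigation per month (bursty)

# Coupled model: compute is provisioned in proportion to data volume,
# e.g. one query node per 10 TB, running 24/7 whether queried or not.
coupled_nodes = data_tb / 10
coupled_cost = (data_tb * STORAGE_PRICE_PER_TB_MONTH
                + coupled_nodes * COMPUTE_PRICE_PER_NODE_MONTH)

# Decoupled model: storage scales with data, compute scales with demand,
# e.g. a burst of 8 nodes billed only for the hours actually used.
decoupled_cost = (data_tb * STORAGE_PRICE_PER_TB_MONTH
                  + 8 * COMPUTE_PRICE_PER_NODE_MONTH
                      * (query_hours / HOURS_PER_MONTH))

print(f"coupled:   ${coupled_cost:,.0f}/month")    # ~$14,000
print(f"decoupled: ${decoupled_cost:,.0f}/month")  # ~$4,200
```

Under these assumed numbers, retained data drives almost the entire coupled bill, even though the cluster sits mostly idle between investigations.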

Why Observability is Moving Toward Decoupled, Event-Native Systems

To address these challenges, observability architectures are shifting away from tightly coupled systems toward a decoupled model in which storage, compute, and data visualization operate as independent layers. This separation introduces flexibility and allows organizations to store more telemetry, scale compute on demand, and access the same underlying data across multiple tools without duplication.

At the same time, the structure of the data itself is also evolving. Event-native systems treat events (such as application logs, user requests, and API calls) as the fundamental unit of analysis. Instead of relying on predefined indexing strategies, data is stored in formats optimized for large-scale scanning and high-cardinality queries.

This enables more flexible investigation and lets multiple teams query the same data concurrently without significant performance degradation.
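A minimal Python sketch of that pattern follows. It is a toy, not a production engine: the field names are hypothetical, and real event-native stores add compression, segment files, and distributed scans. But the core idea, a columnar layout plus full-scan group-bys with no pre-built index, is visible even at toy scale.

```python
from collections import Counter, defaultdict

# Raw events are kept whole; the field names here are hypothetical.
events = [
    {"ts": 1700000000, "service": "checkout", "trace_id": "a1", "status": 500},
    {"ts": 1700000001, "service": "checkout", "trace_id": "b2", "status": 200},
    {"ts": 1700000002, "service": "search",   "trace_id": "c3", "status": 500},
]

# Columnar layout: one array per field, optimized for scanning.
columns = defaultdict(list)
for e in events:
    for field, value in e.items():
        columns[field].append(value)

def group_count(group_field, filter_field, predicate):
    """Full-scan group-by: works for any field, even high-cardinality
    ones like trace_id, because no index has to exist in advance."""
    counts = Counter()
    for key, val in zip(columns[group_field], columns[filter_field]):
        if predicate(val):
            counts[key] += 1
    return counts

# An ad hoc question no one anticipated: which services are erroring?
print(group_count("service", "status", lambda s: s >= 500))
# Counter({'checkout': 1, 'search': 1})
```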

Systems built on this model, such as Apache Druid, have demonstrated the ability to support bursty, high-concurrency workloads while maintaining interactive performance at scale. This is critical for investigation workflows, where query patterns are unpredictable and often collaborative.
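For instance, Druid exposes a SQL API over HTTP (a POST to /druid/v2/sql), so an ad hoc investigation query can be issued directly from a script. In the sketch below, the host, the telemetry_events datasource, and the column names are assumptions for illustration; the endpoint and the __time column are part of Druid itself.

```python
import requests

# Assumed router/broker address; adjust for a real deployment.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Hypothetical datasource and columns, querying recent errors by service.
query = """
SELECT service, COUNT(*) AS errors
FROM telemetry_events
WHERE status >= 500
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '15' MINUTE
GROUP BY service
ORDER BY errors DESC
LIMIT 20
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=30)
resp.raise_for_status()
for row in resp.json():  # default result format: one JSON object per row
    print(row["service"], row["errors"])
```

Because nothing about this question had to be anticipated at ingest time, the same endpoint serves steady-state dashboards and one-off incident queries alike.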

Event-native systems solve the data exploration problem, while decoupled architectures address scaling and cost. Modern, effective observability requires both: the right data model and the right architecture. This shift mirrors patterns seen in other data domains.

Business intelligence platforms, for example, initially relied on tightly coupled architectures before evolving toward decoupled systems. As data volumes grew, separating storage, processing, and visualization enabled each layer to scale independently and innovate more rapidly. Observability is now following a similar path.

Observability Warehouses: A Dedicated Data Layer

The next architectural evolution for observability systems introduces a purpose-built data layer that sits underneath traditional observability platforms like Splunk, Grafana, and Kibana.

Known as an Observability Warehouse, this layer allows organizations to store large amounts of telemetry, query it quickly during investigations, and scale compute with demand rather than with data size. This improves cost efficiency and operational resilience.

Separating the data layer gives teams more flexibility to analyze the same event-native data across multiple tools, adapt to new technologies over time, and avoid being constrained by a single platform’s architecture. Retention strategies, query engines, and visualization layers can evolve independently.
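The shape of that arrangement can be sketched in a few lines of Python. The Warehouse class and the two "frontends" below are hypothetical stand-ins, not a real API; the point is only that every tool reads the same event-native store rather than maintaining its own copy of the data.

```python
class Warehouse:
    """Hypothetical shared event-native data layer. Query engines and
    visualization tools above it can change independently of storage."""

    def __init__(self):
        self._events = []  # in production: object storage + segments

    def ingest(self, event: dict) -> None:
        self._events.append(event)

    def scan(self, predicate) -> list:
        # Full scan stands in for a real query engine.
        return [e for e in self._events if predicate(e)]

warehouse = Warehouse()
warehouse.ingest({"service": "checkout", "status": 500})
warehouse.ingest({"service": "search", "status": 200})

# Two different tools ask different questions of the same data,
# with no duplication and no per-tool ingestion pipeline.
dashboard_view = warehouse.scan(lambda e: e["status"] >= 500)
audit_view = warehouse.scan(lambda e: e["service"] == "search")
print(dashboard_view, audit_view)
```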

As telemetry volumes continue to grow, investigations will become more collaborative and data-intensive. Organizations that align their observability architecture with how these workloads behave will be better positioned to respond quickly, control costs, and operate reliably at scale.



