Article | FullStackS GmbH

In traditional monitoring of applications and infrastructure, telemetry data such as logs, metrics and traces was processed and prepared for problem analysis by different, highly specialised backends.

Each of these backends often required its own proprietary agent to monitor an application or infrastructure. From the application developer's point of view, it was therefore not practical to define business-level metrics directly in the application code (and thus enable even more targeted analyses), because vendor lock-in in the source code itself is undesirable.

The following overview outlines traditional monitoring of applications and infrastructure:

Another major disadvantage of this architecture is that the different kinds of telemetry data can be correlated with each other only with great effort, if at all. This makes it hard to obtain a comprehensive picture of the state of your own application or infrastructure.

The business challenge:

Organisations monitor parts of their infrastructure separately, which leads to isolated data silos. Because the data is not correlated, the root cause is found too slowly or not at all.

OpenTelemetry

OpenTelemetry sets out to solve these problems. It originated from the merger of the OpenTracing and OpenCensus projects and is now one of the most actively developed projects in the CNCF ecosystem.

OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

OpenTelemetry itself is based on open standards, such as the W3C Trace Context or the W3C Baggage Specification.
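To make the role of W3C Trace Context more concrete, here is a minimal sketch of how the `traceparent` HTTP header could be parsed and propagated. It covers only version 00 of the header and uses nothing beyond the Python standard library. The names `TraceParent`, `parse_traceparent` and `child_traceparent` are our own illustrative choices, not part of any OpenTelemetry SDK, which normally handles this propagation for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: version "-" trace-id "-" parent-id "-" trace-flags,
# all in lowercase hexadecimal (2, 32, 16 and 2 characters respectively).
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    """Hypothetical container for the parsed header fields."""
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed version-00 header, or None if it is malformed."""
    match = _TRACEPARENT_V00.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace IDs and parent IDs are invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags field is the "sampled" flag.
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
print(outgoing)
```

Because every service forwards the same trace ID while generating its own span ID, a backend can later stitch the individual spans into one end-to-end trace.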

The OpenTelemetry Collector is optional in principle, but it is a very important component. It enables vendor-neutral reception, processing and forwarding of telemetry data. We therefore clearly recommend using it in your own observability architecture.
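The receive, process and forward idea can be illustrated with a toy model. The following sketch is not the Collector itself (which is a separate program driven by its own configuration); it only mimics the concept in plain Python. All names (`run_pipeline`, `add_environment`, `drop_debug_logs`) and the record fields are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]
Processor = Callable[[Record], Optional[Record]]  # returning None drops the record
Exporter = Callable[[List[Record]], None]


def add_environment(env: str) -> Processor:
    """Build a processor that stamps every record with an environment name."""
    def process(record: Record) -> Optional[Record]:
        return {**record, "environment": env}
    return process


def drop_debug_logs(record: Record) -> Optional[Record]:
    """Processor that filters out debug-level log records."""
    if record.get("signal") == "log" and record.get("severity") == "DEBUG":
        return None
    return record


def run_pipeline(
    received: Iterable[Record],
    processors: List[Processor],
    exporters: List[Exporter],
) -> None:
    """Apply each processor in order, then hand the surviving batch to every exporter."""
    batch: List[Record] = []
    for record in received:
        current: Optional[Record] = record
        for processor in processors:
            current = processor(current)
            if current is None:
                break  # a processor dropped this record
        if current is not None:
            batch.append(current)
    for export in exporters:
        export(list(batch))  # each backend receives its own copy of the batch


# Two in-memory lists stand in for two different backends.
backend_a: List[Record] = []
backend_b: List[Record] = []

run_pipeline(
    received=[
        {"signal": "log", "severity": "DEBUG", "body": "cache miss"},
        {"signal": "log", "severity": "ERROR", "body": "upstream returned 500"},
        {"signal": "metric", "name": "request.latency_ms", "value": 480},
    ],
    processors=[drop_debug_logs, add_environment("staging")],
    exporters=[backend_a.extend, backend_b.extend],
)
print(len(backend_a), len(backend_b))
```

The point of the design is that the applications only ever talk to the pipeline. Adding or swapping a backend means changing the list of exporters, not the instrumented code.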

The following diagram outlines a possible architecture:

Splunk Observability Cloud

Say goodbye to blind spots, guesswork and swivel-chair monitoring with all of your metrics, logs and traces automatically correlated in one place.

Most people know Splunk for Splunk Core, its powerful tool for processing and preparing logs. However, Splunk now offers a much larger product portfolio. The individual products are linked via the Splunk Platform, which gives customers a great deal of flexibility in terms of both functionality and licensing options.

One of the Splunk Platform products is the Splunk Observability Cloud (O11y Cloud), which is a native OpenTelemetry backend. Thanks to native OpenTelemetry support, proprietary agents are a thing of the past. Splunk also offers its own distributions of the OpenTelemetry Agents/Profiler and the OpenTelemetry Collector, so that enterprise support for these components is also included in the scope of the licence.

The O11y Cloud also has a low barrier to entry and a gentle learning curve, which sets it apart from Splunk Core: Splunk Core offers an extremely powerful query language but takes more effort to learn. The O11y Cloud ships with many ready-made dashboards and a user experience geared towards DevOps teams. The two tools complement each other well.

Many areas of the Splunk Observability Cloud can also be set up automatically via Terraform with the Splunk Observability Cloud Terraform Provider. This allows observability to be integrated into CI/CD pipelines and enables a true "shift left".

Finally, the following example provides an overview of how telemetry data is linked in the O11y Cloud and how a root cause analysis is carried out with just a few user interface interactions.

From the metric to the cause of the error

Problem analysis usually starts when an alert is triggered because a metric has fallen below or exceeded a limit value. In the O11y Cloud, determining these limit values and classifying whether a situation is actually exceptional is AI-supported.
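The difference between a fixed limit and a limit derived from the data can be shown with a small sketch. This is explicitly not how the O11y Cloud determines its limits. It is a deliberately simple statistical stand-in (mean plus or minus k standard deviations) with made-up latency values, and the function names are our own.

```python
from statistics import mean, stdev
from typing import List, Optional, Tuple


def breaches_limit(
    value: float,
    lower: Optional[float] = None,
    upper: Optional[float] = None,
) -> bool:
    """True if the value fell below the lower limit or exceeded the upper limit."""
    if lower is not None and value < lower:
        return True
    if upper is not None and value > upper:
        return True
    return False


def dynamic_limits(history: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive limits from recent history as mean +/- k sample standard deviations."""
    centre = mean(history)
    spread = stdev(history)  # requires at least two data points
    return centre - k * spread, centre + k * spread


# Hypothetical latency samples in milliseconds.
recent_latency_ms = [120.0, 118.0, 125.0, 122.0, 119.0, 121.0]
low, high = dynamic_limits(recent_latency_ms)

print(breaches_limit(123.0, low, high))  # within the usual band
print(breaches_limit(480.0, low, high))  # far outside the usual band
```

A fixed upper limit would have to be chosen and maintained by hand for every service. Deriving the band from recent behaviour is one simple way to let the data decide what counts as exceptional.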

The classic image of the iceberg fits root cause analysis very well. However, in our view, logs are nowadays only the last resort for localising the cause of a problem. If tracing and metrics are set up well, a glance at a trace is often enough to recognise the cause.

If you go directly to the Application Performance Monitoring (APM) menu item in the O11y Cloud, you immediately see overviews of various metrics over time. In the example shown, the printing service has a high latency. Clicking on the chart opens a pop-up from which you can navigate directly to the associated traces:

The pop-up already shows that the O11y Cloud has recognised a problem. In the detailed view of the trace, the cause of the problem is immediately visible (the author service returns a server error):

The bottom section of the screenshot is also worth noting. The "Infrastructure" and "Logs" buttons can be used to link directly to infrastructure monitoring or the log observer. This feature is called Related Content and contributes significantly to efficient problem analysis thanks to the correlation of the various telemetry data.

If you follow the link to Infrastructure Monitoring, you immediately get an overview of the nodes on which the printing service is running. From there you can drill down to container level, so that specific, detailed information is available for each infrastructure level.

If you follow the link to the Log Observer, a search for the corresponding trace ID is started automatically. It displays all log entries for that trace ID across all services involved.
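Conceptually, this correlation works because every log entry carries the ID of the trace it belongs to. The following sketch shows the idea with an in-memory list of log entries. The field names, trace IDs and messages are invented for illustration and do not reflect the Log Observer's actual data model or query syntax.

```python
from typing import Dict, Iterable, List

LogEntry = Dict[str, str]


def logs_for_trace(entries: Iterable[LogEntry], trace_id: str) -> List[LogEntry]:
    """Return all entries carrying the given trace ID, oldest first."""
    matching = [entry for entry in entries if entry.get("trace_id") == trace_id]
    # ISO 8601 timestamps in the same format and time zone sort correctly as strings.
    return sorted(matching, key=lambda entry: entry["timestamp"])


# Hypothetical log entries from two services, in arbitrary order.
logs: List[LogEntry] = [
    {"timestamp": "2024-05-01T10:00:03Z", "service": "printing-service",
     "trace_id": "trace-a", "message": "request failed"},
    {"timestamp": "2024-05-01T10:00:01Z", "service": "printing-service",
     "trace_id": "trace-a", "message": "calling author service"},
    {"timestamp": "2024-05-01T10:00:02Z", "service": "author-service",
     "trace_id": "trace-a", "message": "server error"},
    {"timestamp": "2024-05-01T10:00:02Z", "service": "printing-service",
     "trace_id": "trace-b", "message": "request ok"},
]

related = logs_for_trace(logs, "trace-a")
for entry in related:
    print(entry["timestamp"], entry["service"], entry["message"])
```

The result is a single chronological view of one request across service boundaries, which is only possible when all services write the trace ID into their logs.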

The business challenge:

Organisations monitor parts of their infrastructure separately, which leads to isolated data silos. Because the data is not correlated, the root cause is found too slowly or not at all.

Our solution:

Observability integrates monitoring data from various sources to quickly isolate problems and efficiently identify the root cause.

To stay with the image of the iceberg: the example outlined above shows only a small part of what the O11y Cloud can do. We are happy to present further functions of the O11y Cloud in detail during demo sessions 🚀