Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry & Causal AI

Original photo by MART PRODUCTION

Implementing OpenTelemetry at the core of our observability strategy for Causely’s SaaS product was a natural decision. In this article I would like to share some background on our rationale and how the combination of OpenTelemetry and Causal AI addresses several critical requirements that enable us to scale our services more efficiently.

Avoiding Pitfalls Based on Our Prior Experience

We already know from decades of experience working in and with operations teams in the most challenging environments, that bridging the gap between the vast ocean of observability data and actionable insights has and continues to be a major pain point. This is especially true in the complex world of cloud-native applications.

Missing application insights

Application observability remains an elusive beast for many, especially in complex microservices architectures. While infrastructure monitoring has become readily available, neglecting application data paints an incomplete picture, hindering effective troubleshooting and operations.

Siloed solutions

Traditional observability solutions have relied on siloed, proprietary agents and data sources, leading to fragmented visibility across teams and technologies. This makes it difficult to understand the complete picture of service composition and dependencies.

To me this is like trying to solve a puzzle with missing pieces – that’s essentially a problem that many DevOps teams face today – piecing together a picture of how microservices, serverless functions, databases, and other elements interact with one  another, and underlying infrastructure and cloud services they run on. This hinders collaboration and troubleshooting efforts, making it challenging to pinpoint the root cause of performance issues or outages.

Vendor lock-in

Many vendors’ products also lock customers’ data into their cloud services. This can result in customers paying through the nose, because licensing costs are predicated on the volume of data that is being collected and stored in the service providers’ backend SaaS services. It can also be very hard to exit these services once locked in.

These are all pitfalls we want to avoid at Causely as we build out our Causal AI services.

The Pillars of Our Observability Architecture Pointed Us to OpenTelemetry

OpenTelemetry provides us with a path to break free from these limitations, establishing a common framework that transcends programming languages and platforms that we are using to build our services, and satisfying the requirements laid out in the pillars of our observability architecture:

Precise instrumentation

OpenTelemetry offers automatic instrumentation options that minimize the amount of work we need to do on manual code modifications and streamline the integration of our internal observability capabilities into our chosen backend applications.

Unified picture

By providing a standardized data model powered by semantic conventions, OpenTelemetry enables us to paint an end to end picture of how all of our services are composed including application and infrastructure dependencies. We can also gain access to critical telemetry information, leveraging this semantically consistent data across multiple backend microservices even when written in different languages.

Vendor-neutral data management

OpenTelemetry enables us to avoid locking our application data into 3rd party vendors’ services by decoupling it from proprietary vendor formats. This gives us the freedom to choose the best tools on an ongoing basis based on the value they provide, and if something new comes along that we want to exploit, we can easily plug it into our architecture.

Resource-optimized observability

OpenTelemetry enables us to take a top down approach to data collection, starting with the problems we are looking to solve and eliminating unnecessary information. In doing so, this minimizes our storage costs and optimizes compute resources we need to support our observability pipeline.

We believe that following these pillars and building our Causal AI platform on top of OpenTelemetry will propel our product’s performance, enable rock-solid reliability, and ensure consistent service experiences for our customers as we scale our business. We will also minimize our ongoing operational costs, creating a win-win for us and our customers.

OpenTelemetry + Causal AI: Scaling for Performance and Cost Efficiency

Ultimately, observability aims to illuminate the behavior of distributed systems, enabling proactive maintenance and swift troubleshooting. Yet isolated failures manifest as cascading symptoms across interconnected services.

While OpenTelemetry enables back-end applications to use this data to provide a unified picture in maps, graphs and dashboards, the job of figuring out the cause and effect in the correlated data still requires highly skilled resources. This process can also be very time consuming, tying up personnel across multiple teams, with ownership for different elements of overall services.

There is a lot of noise in the industry right now about how AI and LLMs are going to magically come to the rescue, but reality paints a different picture. All of the solutions available in the market today focus on correlating data versus uncovering a direct understanding of causal relationships between problems and the symptoms they cause, leaving devops teams with noise, not answers.

Traditional AI and LLMs also require massive amounts of data as input for training and learning behaviors on a continuous basis. This is data that ultimately ends up being transferred and stored in some form of SaaS. Processing these large datasets is very computationally intensive. This all translates into significant cost overheads for the SaaS providers as customer datasets grow overtime – costs that ultimately result in ever increasing bills for customers.

By contrast, this is where Causal AI comes into its own, taking a fundamentally different approach. Causal AI provides operations and engineering teams with an understanding of the “why”, which is crucial for effective and timely troubleshooting and decision-making.

Application: Database Connection Noisy Neighbor causing service and infrastructure symptoms

Example causality chain: Database Connection Noisy Neighbor causing service and infrastructure symptoms

Causal AI uses predefined models of how problems behave and propagate. When combined with real-time information about a system’s specific structure, Causal AI computes a map linking all potential problems to their observable symptoms.

This map acts as a reference guide, eliminating the need to analyze massive datasets every time Causal AI encounters an issue. Think of it as checking a dictionary instead of reading an entire encyclopedia.

The bottom line is, in contrast to traditional AI, Causal AI operates on a much smaller dataset, requires far less resources for computation and provides more meaningful actionable insights, all of which translate into lower ongoing operational costs and profitable growth.

Summing it up

There’s massive potential for Causal AI and OpenTelemetry to come together to tackle the limitations of traditional AI to get to the “why.”  This is what we’re building at Causely. Doing so will result in numerous benefits:

  • Less time on Ops, more time on Dev: OpenTelemetry provides standardized data while Causal AI analyzes it to automate the root cause analysis (RCA) process, which will significantly reduce the time our devops teams have to spend on troubleshooting.
  • Instant gratification, no training lag: We can eliminate AI’s slow learning curve, because Causal AI leverages OpenTelemetry’s semantic language and Causal AI’s domain knowledge of cause and effect to deliver actionable results, right out of the box without massive amounts of data and with no training lag!
  • Small data, lean computation, big impact: Unlike traditional AI’s data gluttony and significant computational overheads, Causal AI thrives on targeted data streams. OpenTelemetry’s smart filtering keeps the information flow lean, allowing Causal AI to identify the root causes with a significantly smaller dataset and compute footprint.
  • Fast root cause identification: Traditional AI might tell us “ice cream sales and shark attacks rise together,” but Causal AI reveals the truth – it’s the summer heat and not the sharks, driving both! By understanding cause-and-effect relationships, Causal AI cuts through the noise and identifies the root causes behind performance degradation and service malfunctions.

Having these capabilities is critical if we want to move beyond the labor intensive processes associated with how RCA is performed in devops today. This is why we are eating our own dog food and using Causely as part of our tech stack to manage the services we provide to customers.

If you would like to learn how to unplug from the Matrix of guesswork and embrace the opportunity offered through the combination of OpenTelemetry and Causal AI, don’t hesitate to reach out! The team and I at Causely are here to share our experience and help you navigate the path.


Related Resources

Leave a Reply

Your email address will not be published. Required fields are marked *