Author: Karina Babcock

Causely Platform Overview

Causality diagram in Causely

Causely assures continuous reliability of cloud applications. The causal AI platform automatically captures cause and effect relationships based on real-time, dynamic data across the entire application environment. This means that we can detect, remediate and even prevent problems that result in service impact. With Causely, Dev and Ops teams are better equipped to plan for ongoing changes to code, configurations or load patterns, and they stay focused on achieving service-level and business objectives instead of firefighting.

Watch the demo to see the causal AI platform in action, or schedule time for a live walk-through.

 

Real-time Data & Modern UXs: The Power and the Peril When Things Go Wrong

real-time data streams are complex to troubleshoot

Imagine a world where user experiences adapt to you in real time. Personalized recommendations appear before you even think of them, updates happen instantaneously, and interactions flow seamlessly. This captivating world is powered by real-time data, the lifeblood of modern applications.

But this power comes at a cost. The intricate architecture behind real-time services can make troubleshooting issues a nightmare. Organizations that rely on real-time data to deliver products and services face a critical challenge: ensuring data is delivered fresh and on time. Missing data or delays can cripple the user experience and demand resolutions within minutes, if not seconds.

This article delves into the world of real-time data challenges. We’ll explore the business settings where real-time data is king, highlighting the potential consequences of issues. Then I will introduce a novel approach that injects automation into the troubleshooting process, saving valuable time and resources, but most importantly mitigating the business impact when problems arise.

Lags & Missing Data: The Hidden Disruptors Across Industries

Lags and missing data can be silent assassins, causing unseen disruptions that ripple through various industries. Let’s dig into the specific ways these issues can impact different business sectors.

Disruptions in real-time data can cause business impact
Financial markets

  • Trading: In high-frequency trading, even milliseconds of delay can mean the difference between a profitable and losing trade. Real-time data on market movements is crucial for making informed trading decisions.
  • Fraud detection: Real-time monitoring of transactions allows financial institutions to identify and prevent fraudulent activity as it happens. Delays in data can give fraudsters a window of opportunity.
  • Risk management: Real-time data on market volatility, creditworthiness, and other factors helps businesses assess and manage risk effectively. Delays can lead to inaccurate risk assessments and potentially large losses.

Supply chain management

  • Inventory management: Real-time data on inventory levels helps businesses avoid stockouts and optimize inventory costs. Delays can lead to overstocking or understocking, impacting customer satisfaction and profitability.
  • Logistics and transportation: Real-time tracking of shipments allows companies to optimize delivery routes, improve efficiency, and provide accurate delivery estimates to customers. Delays can disrupt logistics and lead to dissatisfied customers.
  • Demand forecasting: Real-time data on customer behavior and sales trends allows businesses to forecast demand accurately. Delays can lead to inaccurate forecasts and production issues.

Customer service

  • Live chat and phone support: Real-time access to customer data allows support agents to personalize interactions and resolve issues quickly. Delays can lead to frustration and longer resolution times.
  • Social media monitoring: Real-time tracking of customer sentiment on social media allows businesses to address concerns and build brand reputation. Delays can lead to negative feedback spreading before it’s addressed.
  • Personalization: Real-time data on customer preferences allows businesses to personalize website experiences, product recommendations, and marketing campaigns. Delays can limit the effectiveness of these efforts.

Manufacturing

  • Machine monitoring: Real-time monitoring of machine performance allows for predictive maintenance, preventing costly downtime. Delays can lead to unexpected breakdowns and production delays.
  • Quality control: Real-time data on product quality allows for immediate identification and correction of defects. Delays can lead to defective products reaching customers.
  • Process optimization: Real-time data on production processes allows for continuous improvement and optimization. Delays can limit the ability to identify and address inefficiencies.

Other examples

  • Online gaming: Real-time data is crucial for smooth gameplay and a fair playing field. Delays can lead to lag, disconnects, and frustration for players.
  • Healthcare: Real-time monitoring of vital signs and patient data allows for faster diagnosis and treatment. Delays can have serious consequences for patient care.
  • Energy management: Real-time data on energy consumption allows businesses and utilities to optimize energy use and reduce costs. Delays can lead to inefficient energy usage and higher costs.
  • Cybersecurity: Real-time data is the backbone of modern cybersecurity, enabling rapid threat detection, effective incident response, and accurate security analytics. However, delays in the ability to see and understand this data can create critical gaps in your defenses. From attackers having more time to exploit vulnerabilities to outdated security controls and hindered automated responses, data lags can significantly compromise your ability to effectively combat cyber threats.

As we’ve seen, the consequences of lags and missing data can be far-reaching. From lost profits in financial markets to frustrated customers and operational inefficiencies, these issues pose a significant threat to business success. The ability to identify root cause and impact, and to remediate issues with precision and speed, is imperative to mitigating the business impact.

Causely automatically captures cause and effect relationships based on real-time, dynamic data across the entire application environment. Request a demo to see it in action.

The Delicate Dance: A Web of Services and Hidden Culprits

Modern user experiences that leverage real-time data rely on complex chains of interdependent services – a delicate dance of microservices, databases, messaging platforms, and virtualized compute infrastructure. A malfunction in any one element can create a ripple effect, impacting the freshness and availability of data for users. This translates to frustrating delays, lags, or even complete UX failures.

microservices environments are complex

Let’s delve into the hidden culprits behind these issues and see how seemingly minor bottlenecks can snowball into major UX problems:

Slowdown Domino with Degraded Microservice

  • Scenario: A microservice responsible for product recommendations experiences high latency due to increased user traffic and internal performance degradation (e.g., memory leak, code inefficiency).
  • Impact 1: The overloaded and degraded microservice takes significantly longer to process requests and respond to the database.
  • Impact 2: The database, waiting for the slow microservice response, experiences delays in retrieving product information.
  • Impact 3: Due to the degradation, the microservice might also have issues sending messages efficiently to the message queue. These messages contain updates on product availability, user preferences, or other relevant data for generating recommendations.
  • Impact 4: Messages pile up in the queue due to slow processing by the microservice, causing delays in delivering updates to other microservices responsible for presenting information to the user.
  • Impact 5: The cache, not receiving timely updates from the slow microservice and the message queue, relies on potentially outdated data.
  • User Impact: Users experience significant delays in seeing product recommendations. The recommendations themselves might be inaccurate or irrelevant due to outdated data in the cache, hindering the user experience and potentially leading to missed sales opportunities. Additionally, users might see inconsistencies between product information displayed on different pages (due to some parts relying on the cache and others waiting for updates from the slow microservice).

Message Queue Backup

  • Scenario: A sudden spike in user activity overwhelms the message queue handling communication between microservices.
  • Impact 1: Messages pile up in the queue, causing delays in communication between microservices.
  • Impact 2: Downstream microservices waiting for messages experience delays in processing user actions.
  • Impact 3: The cache, not receiving updates from slow microservices, might provide outdated information.
  • User Impact: Users experience lags in various functionalities – for example, slow loading times for product pages, delayed updates in shopping carts, or sluggish responsiveness when performing actions.
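The pile-up dynamic above is easy to quantify: whenever producers enqueue faster than consumers can drain, backlog (and therefore end-to-end delay) grows linearly with time. A minimal back-of-the-envelope sketch, with all rates and numbers hypothetical:

```python
def queue_backlog(produce_rate, consume_rate, seconds, initial_backlog=0):
    """Messages queued after `seconds` when producers outpace consumers.
    Rates are in messages per second."""
    growth = produce_rate - consume_rate
    return max(0, initial_backlog + growth * seconds)

def time_to_drain(backlog, produce_rate, consume_rate):
    """Seconds until the queue empties once consumers outpace producers."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return float("inf")  # backlog never drains; delays keep growing
    return backlog / surplus

# A traffic spike: 500 msg/s arriving, consumers handling only 300 msg/s.
backlog = queue_backlog(produce_rate=500, consume_rate=300, seconds=60)
print(backlog)  # 12000 messages queued after one minute

# Once the spike subsides to 200 msg/s, draining still takes two minutes.
print(time_to_drain(backlog, produce_rate=200, consume_rate=300))  # 120.0
```

The takeaway is that even a short spike leaves a backlog that keeps delaying downstream microservices long after traffic returns to normal.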

Cache Miss Cascade

  • Scenario: A cache experiences a high rate of cache misses due to frequently changing data (e.g., real-time stock availability).
  • Impact 1: The microservice needs to constantly retrieve data from the database, increasing the load on the database server.
  • Impact 2: The database, overloaded with requests from the cache, experiences performance degradation.
  • Impact 3: The slow database response times further contribute to cache misses, creating a feedback loop.
  • User Impact: Users experience frequent delays as the system struggles to retrieve data for every request, leading to a sluggish and unresponsive user experience.
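One common way to break the miss → database-load → slower-responses feedback loop described above is request coalescing: concurrent requests that miss on the same key share a single backing-store fetch instead of each issuing their own. A simplified single-process sketch (the cache class, loader, and key names are illustrative, not a real library API):

```python
import threading

class CoalescingCache:
    """Cache-aside with request coalescing: one backing-store load per
    missing key, no matter how many callers miss concurrently."""
    def __init__(self, loader):
        self._loader = loader          # fetches a value from the backing store
        self._data = {}
        self._locks = {}               # one lock per in-flight key
        self._guard = threading.Lock()

    def get(self, key):
        if key in self._data:          # cache hit: no database traffic
            return self._data[key]
        with self._guard:              # find or create the per-key lock
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                     # only one thread loads; the rest wait
            if key not in self._data:
                self._data[key] = self._loader(key)
            return self._data[key]

# The loader records each call so we can see coalescing at work.
calls = []
cache = CoalescingCache(loader=lambda k: calls.append(k) or k.upper())

threads = [threading.Thread(target=cache.get, args=("sku-42",)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(len(calls))  # 1: eight concurrent misses triggered a single load
```

Without coalescing, all eight misses would have hit the database at once, which is exactly the stampede that overloads it in the scenario above.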

Kubernetes Lag

  • Scenario: A resource bottleneck occurs within the Kubernetes cluster, limiting the processing power available to microservices.
  • Impact 1: Microservices experience slow response times due to limited resources.
  • Impact 2: Delays in microservice communication and processing cascade throughout the service chain.
  • Impact 3: The cache might become stale due to slow updates, and message queues could experience delays.
  • User Impact: Users experience lags across various functionalities, from slow page loads and unresponsive buttons to delayed updates in real-time data like stock levels or live chat messages.

Even with advanced monitoring tools, pinpointing the root cause of these and other issues can be a time-consuming detective hunt. The triage & troubleshooting process often requires a team effort, bringing together experts from various disciplines. Together, they sift through massive amounts of observability data – traces, metrics, logs, and the results of diagnostic tests – to piece together the evidence and draw the right conclusions so they can accurately determine cause and effect. The speed and accuracy of the process are largely determined by the skills of the people available when issues arise.

Only when the root cause is understood can the responsible team make informed decisions to resolve the problem and restore reliable service.

 

Transforming Incident Response: Automation of the Triage & Troubleshooting Process

Traditional methods of incident response, often relying on manual triage and troubleshooting, can be slow, inefficient, and prone to human error. This is where automation comes in, particularly with the advancements in Artificial Intelligence (AI). Specifically, a subfield of AI called Causal AI presents a revolutionary approach to transforming incident response.

what troubleshooting looks like before and after causal AI

Causal AI goes beyond correlation, directly revealing cause-and-effect relationships between incidents and their root causes. In an environment where services rely on real-time data and fast resolution is critical, Causal AI offers significant benefits:

  • Automated Triage: Causal AI analyzes alerts and events to prioritize incidents based on severity and impact. It can also pinpoint the responsible teams, freeing resources from chasing false positives.
  • Machine Speed Root Cause Identification: By analyzing causal relationships, Causal AI quickly identifies the root cause, enabling quicker remediation and minimizing damage.
  • Smarter Decisions: A clear understanding of the causal chain empowers teams to make informed decisions for efficient incident resolution.
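To make the idea concrete, here is a toy version of causal triage: given a dependency graph and a set of observed symptoms, rank each candidate root cause by how many of the symptoms it can explain through the causal edges it reaches. The graph, symptom names, and scoring rule are all invented for illustration; Causely's actual causal models are far richer than a reachability count.

```python
# Directed edges: "a failure in X can cause a symptom or failure in Y".
CAUSAL_EDGES = {
    "db-congestion": ["checkout-latency", "search-latency"],
    "cache-stale":   ["wrong-recommendations"],
    "queue-backlog": ["checkout-latency", "delayed-notifications"],
}

def explains(cause, symptoms, edges):
    """Observed symptoms reachable from `cause` in the causal graph."""
    seen, frontier = set(), [cause]
    while frontier:
        node = frontier.pop()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return seen & symptoms

def rank_root_causes(symptoms, edges):
    """Rank candidate causes by how many observed symptoms they explain."""
    scores = {cause: len(explains(cause, symptoms, edges)) for cause in edges}
    return sorted(scores.items(), key=lambda kv: -kv[1])

observed = {"checkout-latency", "search-latency"}
print(rank_root_causes(observed, CAUSAL_EDGES))
# db-congestion explains both observed symptoms, so it ranks first
```

Even this toy shows the shift in mindset: instead of paging every team whose service shows a symptom, the ranking points directly at the one entity that accounts for all of them.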

Causely is leading the way in applying Causal AI to incident response for modern cloud-native applications. Causely’s technology utilizes causal reasoning to automate triage and troubleshooting, significantly reducing resolution times and mitigating business impact. Additionally, Causal AI streamlines post-incident analysis by automatically documenting the causal chain.

Beyond reactive incident response, Causal AI offers proactive capabilities that focus on measures to reduce the probability of future incidents and service disruptions, through improved hygiene, predictions and “what if” analysis.

The solution is built for the modern world of real-time data: applications that communicate synchronously and asynchronously and that leverage modern cloud building blocks (databases, caching, messaging & streaming platforms, and Kubernetes).

This is just the beginning of the transformative impact Causal AI is having on incident response. As the technology evolves, we can expect even more advancements that will further streamline and strengthen organizations’ ability to continuously assure the reliability of applications.

If you would like to learn more about Causal AI and its applications in the world of real-time data and cloud-native applications, don’t hesitate to reach out.

You may also want to check out an article by Endre Sara which explains how Causely is using Causely to manage its own SaaS service, which is built around a real-time data architecture.


Related Resources

 

Crossing the Chasm, Revisited

Sometimes there’s a single book (or movie, podcast or Broadway show) that seems to define a particular time in your life. In my professional life, Geoffrey Moore’s Crossing the Chasm has always been that book. When I started my career as VP Marketing in the 1990s, this was the absolute bible for early-stage B2B startups launching new products. Fast forward to today, and people still refer to it as a touchstone. Even as go-to-market motions have evolved and become more agile and data-driven, the need to identify a beachhead market entry point and solve early-adopter pain points fully before expanding to the mainstream market has remained relevant and true. I still use the positioning framework for every new product and company.

The gap between early adopters and early majority

Graphic from “Crossing the Chasm” showing the gap between early adopters and early majority. Image source: Patel, Neeral & Patlas, Michael & Mafeld, Sebastian. (2020). 

 

Recently while hosting the Causely team at their beautiful new offices for our quarterly meetup, our investors at 645 Ventures gave everyone a copy of the latest edition of Crossing the Chasm. It was an opportunity for me to review the basic concepts. Re-reading it brought back years of memories of startups past and made me think about the book in a new context: how have Moore’s fundamental arguments withstood the decades of technology trends I’ve experienced personally? Specifically, what does “crossing the chasm” actually mean when new product adoption can be so different from one technology shift to another?

A Quick Refresher

One of Moore’s key insights is that innovators and early adopters are willing to try a new product and work with a new company because it meets some specific needs – innovators love being first to try cool things, and early adopters see new technology as a way to solve problems not currently being met by existing providers. These innovators/early adopters then share their experiences with others in their organizations and industries, who trust and respect their knowledge. This allows the company to reach a broader market over time, cross the chasm and begin adoption by the early majority. Many years can go by during this process, much venture funding will be spent, and still the company may only have penetrated a small percentage of the market. Only years later (and with many twists and turns) will the company reach the late majority and finally the laggards.

The Chasm Looks Different Over Time

Netezza and Data Warehousing

I started to think about this in terms of technology shifts that I’ve lived through. Earlier in my career I had the good fortune to be part of a company that crossed the chasm: Netezza. We built a data warehousing system that was 100x the performance of existing solutions from IBM and Oracle, at half the cost. While this was clearly a breakthrough product, the data warehousing industry had not changed in any meaningful way for over a decade and the database admins who ran the existing solutions were in no rush to try something new or different, for all the usual reasons. Within 10 years we created a new category, the “data warehouse appliance.” We gained traction first with some true innovators and then with early adopters who brought the product into their data centers, proved the value and then used it more widely as a platform. However, “crossing the chasm” took many more years – we had a couple of hundred customers at the time of IPO – and only once the company was acquired by IBM did more mainstream adopters become ready to buy (since no one ever gets fired for buying IBM, etc). The product was so good that it remained in the market for over 20 years until the cloud revolution changed things, but it’s hard to argue that it ever gained broad market adoption compared with more traditional data warehouses.

Datadog and Cloud Observability

A second example, which is closer to the market my current company operates in, is Datadog in the observability space. Fueled by the cloud computing revolution (which itself was in the process of crossing the chasm when Datadog was founded in 2010), Datadog rode the new technology wave and solved old problems of IT operations management for new cloud-based applications. While Datadog did not necessarily create a new category, the company moved very quickly from early cloud innovators and adopters to mainstream customers, rocketing to around $1B in revenues in 10 years. What’s more impressive is that Datadog has become the de facto standard for cloud application observability; today the company has almost 30,000 customers and is still growing quickly in the “early majority” part of the observability market. Depending on which market size numbers you use, Datadog has already crossed the chasm or is well underway, with plenty of room to expand with “late majority” customers.

OpenAI and GenAI

Finally, think about the market adoption in the current GenAI revolution. 100 million users adopted ChatGPT within two months of its initial release in late 2022 and OpenAI claims that over 80% of F500 companies are already using it. No need for me to add more statistics here (e.g., comparisons with the adoption rates of internet and social technologies) – it’s clear that this is one of the fastest chasm-crossings in history, although it’s not yet clear how companies plan to use the new AI products even as they adopt them. The speed and market confusion make it hard to envision what crossing the chasm will mean for mainstream adopters and how the technology will fully solve a specific set of problems.

Defining Success as You Cross

Thinking through these examples made me realize some things I hadn’t understood earlier in my career:

  • It’s easy to confuse a large financial outcome (through IPO or acquisition) with “crossing the chasm,” since people often assume such an outcome implies sufficient market success. In fact, the two are not necessarily related. It’s possible to have a large $ acquisition or even a successful IPO (as Netezza did) without having yet crossed to mainstream adoption.
  • The market and technology trends that surround and support a new company and product can lead to very different experiences in crossing the chasm: You can have a breakthrough and exciting product in a slow-moving market without major technology tailwinds (e.g., data warehousing in the early 2000s) but you can also have a huge tailwind like cloud computing that drives a new product to more mainstream adoption within 10 years (e.g., Datadog’s cloud-based observability). Or you can have a hyper-growth technology shift like GenAI that shrinks the entire process into a few years, leaving the early and mainstream adopters jumbled together and trying to determine how to turn the new products into something truly useful.
  • It can be hard to tell if you’ve really crossed the chasm, since people use many different metrics to gauge adoption: % of customers in the total addressable market (Moore defines a bell curve with percentages for each stage, but I’ve rarely seen people use these strictly), number of monthly active users, revenue market share, penetration within enterprise accounts, etc. At the early majority phase, the company can also see so much excitement from early customers and analysts (“We’re a leader in the Gartner Magic Quadrant!”) that founders confuse market awareness and marketing “noise” with true adoption. Mainstream customers may still be waiting for more proof points and additional product capabilities that weren’t as critical for the early adopters, and it’s important to keep your eye on these requirements to avoid stalling out once you’ve reached the other side of the chasm.

I would love to hear from other founders who have made this journey! Please share your thoughts on lessons learned and how you’re thinking about the chasm in the new AI-centric world.


 


Bridging the Gap Between Observability and Automation with Causal Reasoning

Bridging the gap between observability and automation with causal reasoning

Observability has become a growing ecosystem and a common buzzword. Increasing visibility with observability and monitoring tools is helpful, but stopping at visibility isn’t enough. Observability lacks causal reasoning and relies mostly on people to connect application issues with potential causes.

Causal reasoning solves a problem that observability can’t

Combining observability with causal reasoning can revolutionize automated troubleshooting and boost application health. By pinpointing the “why” behind issues, causal reasoning reduces human error and labor.

This triggers a lot of questions from application owners and developers, including:

  • What is observability?
  • What is the difference between causal reasoning and observability?
  • How does knowing causality increase performance and efficiency?

Let’s explore these questions to see how observability pairs with causal reasoning for automated troubleshooting and more resilient application health.

What is Observability?

Observability can be described as inferring the internal state of a system from its outputs. The three common sources of observability data are logs, metrics, and traces.

  • Logs provide detailed records of ordered events.
  • Metrics offer quantitative but unordered data on performance.
  • Traces show the journey of specific requests through the system.
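In practice each of the three signals is just structured data stamped with time and context. A minimal, library-free sketch of what a single request might emit (the field names are illustrative; real systems would use something like OpenTelemetry rather than hand-rolled dicts):

```python
import time
import uuid

def observe_request(route, duration_ms, status):
    """Emit a toy log record, metric sample, and trace span for one request."""
    trace_id = uuid.uuid4().hex  # shared id that correlates all three signals
    log = {"ts": time.time(), "level": "INFO", "trace_id": trace_id,
           "msg": f"{route} -> {status}"}                  # ordered event record
    metric = {"name": "http.request.duration_ms",          # quantitative sample
              "value": duration_ms, "route": route}
    span = {"trace_id": trace_id, "name": route,           # one hop of the journey
            "duration_ms": duration_ms, "status": status}
    return log, metric, span

log, metric, span = observe_request("/checkout", duration_ms=182, status=200)
print(log["trace_id"] == span["trace_id"])  # True: signals link via the trace id
```

The shared trace id is what lets engineers (or software) stitch the silos together; without it, the correlation work described below falls entirely on people.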

The goal of observability is to provide insight into system behavior and performance to help identify and resolve any issues that are happening. However, traditional monitoring tools are “observing” and reporting in silos.

“Observability is not control. Not being blind doesn’t make you smarter.” – Shmuel Kliger, Causely founder in our recent podcast interview

Unfortunately, this falls short of the goal above and requires tremendous human effort to connect alerts, logs, and anecdotal application knowledge with possible root cause issues.

For example, if a website experiences a sudden spike in traffic and starts to slow down, observability tools can show logs of specific requests and provide metrics on server response times. Furthermore, engineers digging around inside these tools may be able to piece together the flow of traffic through different components of the system to identify candidate bottlenecks.

The detailed information can help engineers identify and address the root cause of the performance degradation. But we are forced to rely on human and anecdotal knowledge to augment observability. This human touch may provide guiding information and understanding that machines alone are not able to match today, but that comes at the cost of increased labor, staff burnout, and lost productivity.

Data is not knowledge

Observability tools collect and analyze large amounts of data. This has created a new wave of challenges among IT operations teams and SREs, who are now left trying to solve a costly and complex big data problem.

The tool sprawl you experience, where each observability tool offers a unique piece of the puzzle, makes this situation worse and promotes inefficiency. For example, if an organization invests in multiple observability tools that each offer different data insights, it can create a fragmented and overwhelming system that hinders rather than enhances understanding of the system’s performance holistically.

This results in wasted resources spent managing multiple tools and an increased likelihood of errors due to the complexity of integrating and analyzing data from various sources. The resulting situation ultimately undermines the original goal of improving observability.

Data is not action

Even with a comprehensive observability practice, the fundamental issue remains: how do you utilize observability data to enhance the overall system? The problem is not about having some perceived wealth of information at your fingertips. The problem is relying on people and processes to interpret, correlate, and then decide what to do based on this data.

You need to be able to analyze and make informed decisions in order to effectively troubleshoot and assure continuous application performance. Once again, we find ourselves leaving the decisions and action plans to the team members, which is a cost and a risk to the business.

Causal reasoning: cause and effect

Analysis is essential to understanding the root cause of issues and making informed decisions to improve the overall system. By diving deep into the data and identifying patterns, trends, and correlations, organizations can proactively address potential issues before they escalate into major problems.

Causal reasoning uses available data to determine the cause of events, identifying whether code, resources, or infrastructure are the root cause of an issue. This deep analysis helps proactively and preventatively address potential problems before they escalate.

For example, a software development team may have been alerted about transaction slowness in their application. Is this a database availability problem? Have there been infrastructure issues happening that could be affecting database performance?

When you make changes based on observed behavior, it’s extremely important to consider how these changes will affect other applications and systems. Changes made without the full context are risky.

Figure 1: A PostgreSQL-based application experiencing database congestion

 

Using causal reasoning based on the observed environment shows that a recent update to the application code is causing crashes for users during specific transactions. A code update may have introduced inefficient database calls, which is affecting the performance of the application. That change can also go far beyond just the individual application.
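The "inefficient database calls" case is very often the classic N+1 pattern: a loop that issues one query per item instead of a single batched query. A self-contained sketch using an in-memory SQLite database (the schema and data are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO products VALUES (?, ?)",
               [(i, f"product-{i}") for i in range(100)])

wanted = list(range(100))

# N+1: one round trip per product -- 100 separate queries hitting the database.
names_slow = [db.execute("SELECT name FROM products WHERE id = ?", (i,)).fetchone()[0]
              for i in wanted]

# Batched: the same data fetched in a single query.
placeholders = ",".join("?" * len(wanted))
rows = db.execute(f"SELECT id, name FROM products WHERE id IN ({placeholders})",
                  wanted).fetchall()
names_fast = [name for _id, name in sorted(rows)]

print(names_slow == names_fast)  # True: same result, 1 query instead of 100
```

Against a local in-memory database the difference is invisible, but over a network each extra round trip adds latency and load – which is exactly how a small code change degrades a shared database for every adjacent application.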

If a company decides to update their software without fully understanding how it interacts with other systems, it could result in technical issues that disrupt operations and lead to costly downtime. This is especially challenging in shared infrastructure where noisy neighbors can affect every adjacent application.

Figure 2: Symptoms, causes, and impact determination

 

This is an illustration showing how causal AI software can connect the problem to active symptoms, while understanding the likelihood of each potential cause. This is causal reasoning in action that also understands the effect on the rest of the environment as we evaluate potential resolutions.

Now that we have causal reasoning for the true root cause, we can go even further by introducing remediation steps.

Automated remediation and system reliability

Automated remediation involves the real-time detection and resolution of issues without the need for human intervention. Automated remediation plays an indispensable role in reducing downtime, enhancing system reliability, and resolving issues before they affect users.

Yet, implementing automated remediation presents challenges, including the potential for unintended consequences like incorrect fixes that could worsen issues. Causal reasoning takes more information into account to drive the decision about root cause, impact, remediation options, and the effect of initiating those remediation options.
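A guarded remediation loop can make that "unintended consequences" risk explicit: every candidate action must pass a safety check, and defaults to a dry run, before anything executes. This is a skeleton only – the playbook entries, safety rule, and incident fields are all placeholders, not Causely's actual interface:

```python
def remediate(incident, playbook, is_safe, execute, dry_run=True):
    """Pick the playbook action for a diagnosed root cause, but only run it
    if a safety check approves; otherwise escalate to a human."""
    action = playbook.get(incident["root_cause"])
    if action is None:
        return "escalate: no known remediation"
    if not is_safe(action, incident):
        return f"escalate: {action} blocked by safety check"
    if dry_run:
        return f"would run: {action}"
    execute(action)
    return f"ran: {action}"

playbook = {"queue-backlog": "scale-out-consumers",
            "cache-stale":   "invalidate-cache"}

# Illustrative safety rule: never act automatically during a deploy freeze.
is_safe = lambda action, incident: not incident.get("deploy_freeze", False)

incident = {"root_cause": "queue-backlog", "deploy_freeze": False}
print(remediate(incident, playbook, is_safe, execute=print))
# "would run: scale-out-consumers" -- dry run by default
```

The point of the structure is that the decision (which action, and whether it is safe right now) is driven by the causal diagnosis, while execution stays gated until there is enough trust to disable the dry run.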

This is why a whole-environment view combined with real-time causal analysis is required: it makes it possible to troubleshoot and take remedial action safely, while also reducing the labor and effort required of operations teams.

Prioritizing action over visibility

Observability is a component of how we monitor and observe modern systems. Extending beyond observability with causal reasoning, impact determination, and automated remediation is the missing key to reducing human error and labor.

In order to move toward automation, you need trustworthy, data-driven decisions that are based on a real-time understanding of the impact of behavioral changes in your systems. Those decisions can be used to trigger automation and the orchestration of actions, ultimately leading to increased efficiency and productivity in operations.

Automated remediation can resolve issues before they escalate, and potentially before they occur at all. The path to automated remediation requires an in-depth understanding of the components of the system and how they behave as an overall system.

Integrating observability with automated remediation empowers organizations to boost their application performance and reliability. It’s important to assess your observability practices and incorporate causal reasoning to boost reliability and efficiency. The result is increased customer satisfaction, IT team satisfaction, and risk reduction.



What is Causal AI & why do DevOps teams need it?

Causal AI can help IT and DevOps professionals be more productive, freeing hours of time spent troubleshooting so they can instead focus on building new applications. But when applying Causal AI to IT use cases, there are several domain-specific intricacies that practitioners and developers must be mindful of.

The relationships between application and infrastructure components are complex and constantly evolving, which means relationships and related entities are dynamically changing too. It’s important not to conflate correlation with causation, or to assume that all application issues stem from infrastructure limitations.

In this webinar, Endre Sara defines Causal AI, explains what it means for IT, and talks through specific use cases where it can help IT and DevOps practitioners be more efficient.

We’ll dive into practical implementations, best practices, and lessons learned when applying Causal AI to IT. Viewers will leave with tangible ideas about how Causal AI can help them improve productivity and concrete next steps for getting started.

 

Tight on time? Check out these highlights

 

Building Startup Culture Isn’t Like It Used To Be

When does culture get established in a startup? I’d say the company’s DNA is set during the first year or two, and the founding team should do everything possible to make this culture intentional vs a series of disconnected decisions. Over the years, I’ve seen many great startup cultures that led to successful products and outcomes (and others that were hobbled from the beginning by poor DNA). However, as we plan for our upcoming Causely quarterly team meetup in New York City, I’m struck by how things have changed in culture-building since my previous ventures.

Startup culture used to happen organically

Back in the day, we took a small office space, gathered the initial team and started designing and building. Our first few months were often in the incubator space at one of our early investors. This was a fun and formative experience, at least until we got big enough to be kicked out (“you’re funded, now get your own space!”). Sitting with a small group crowded around a table and sharing ideas with each other on all topics may not have been very comfortable or even efficient. But it did create a foundational culture based on jokes, stories and decisions we would refer back to for years to come. Also, it established the extremely open and non-hierarchical cultural norms we wanted to encourage as we added people.

Once we hit initial critical mass and needed more space for breakouts or private discussions, it was off to the Boston real estate market to see what could possibly be both affordable and reasonable for commutes. The more basic the space, the better in many ways, since it emphasized the need to run lean and spend money only on the things that mattered – hiring, engineering tools, early sales and marketing investments, etc. But most important was to spend on things that would encourage the team to get to know each other and build trust. Lunches, dinners, parties, local activities were all important, as was having the right snacks, drinks and games in the kitchen area to encourage people to hang out together (it’s amazing how much the snacks matter).

Building culture in a startup requires in-person get-togethers

The new normal

Fast forward to now, post-Covid and all the changes that have occurred in office space and working remotely. Causely is a remote-from-birth company, with people scattered around the US and a couple internationally. I would never have considered building a company this way before Covid, but when Shmuel and I decided to start the company, it just didn’t seem that big an issue anymore. We committed ourselves to making the extra effort required to build culture remotely, and talked about it frequently with the early team and investors.

PS: We’re hiring! Want to help shape the Causely culture? Check out our open roles.

In my experiences hanging out with the local Boston tech community and hearing stories from other entrepreneurs, I’ve noticed some of the following trends (which I believe are typical for the startup world):

  • Most companies have one or more offices that people come to weekly, but not daily; attendance varies by team and is tied to days of the week that the team is gathering for meetings or planning. Peak days are Tues-Thurs but even then, attendance may vary widely.
  • Senior managers show up more frequently to encourage their teams to come in, but they don’t typically enforce scheduling.
  • The office has become more a place to build social and mentoring relationships and less about getting work done, which may honestly be more efficient from home.
  • Employees like to come in, and more junior staff in particular benefit from in-person interaction with peers and managers, as well as having a separate workspace from their living space. But the flexibility of working remotely is very hard to give up and is something people value.
  • Gathering the entire company together regularly (and smaller groups in local offices/meetups) is much more important than it used to be for creating a company-wide culture and helping people build relationships with others in different teams and functional areas.

Given this new normal, I’ve been wondering where this takes us for the next generation of startup companies. It matters to me that people have a shared sense of the company’s vision and feel bound to each other on a company mission. Without this, joining a startup loses a big element of its appeal and it becomes harder to do the challenging, creative, exhausting and sometimes nutty things it takes to launch and scale. There are only so many hours anyone can spend on Zoom before fatigue sets in. And it’s harder to have the casual and serendipitous exchanges that used to generate new ideas and energize long-running brainstorming discussions.

Know where you want to go before you start

Building culture in the current startup world requires intention. Here are some things I think are critical to doing this well. I would love to hear about things that are working for other entrepreneurs!

  1. Founders: spend more time sharing your vision on team calls and 1:1 with new hires – this is the “glue” that holds the company together.
  2. Managers: schedule more frequent open-ended, 1:1 calls to chat about what’s on people’s minds and hear ideas on any topic. Leave open blocks of time on your weekly calendar so people can “drop by” for a “visit.”
  3. Encourage local meetups as often as practical – make it easy for local teams to get together where and when they want.
  4. Invest in your all-team meetups, and make these as fun and engaging as possible. (We’ve tried packed agendas with all-day presentations and realized that this was too much scheduling). Leave time for casual hangouts and open discussions while people are working or catching up on email/Slack.
  5. Do even more sharing of information about company updates and priorities – there’s no way for people to hear these informally, so more communications are needed and repetition is good 🙂
  6. Encourage newer/younger employees to share their work and ideas with the rest of the team – it’s too easy for them to lack feedback or mentoring and to lose engagement.
  7. Consider what you will do in an urgent situation that requires team coordination: simulations and reviews of processes are much more important than in the past.

There’s no silver bullet to building great company culture, but instead a wide range of approaches that need to be tried and tested iteratively. These approaches also change as the company grows – building cross-functional knowledge and creativity requires all the above but even more leadership by the founders and management team (and a commitment to traveling regularly between locations to share knowledge). Recruiting, already such a critical element of building culture, now has an added dimension: will the person being hired succeed in this particular culture without many of the supporting structures they used to have? Will they thrive and help build bridges between roles and teams?

It’s easy to lose sight of the overall picture and trends amidst the day-to-day urgency, so it’s important to take a moment when you’re starting the company to actually write down what you want your company culture to be. Then check it as you grow and make updates as you see what’s working and where there are gaps. The founding team still sets the direction, but today more explicit and creative efforts are needed to stay on track and create a cultural “mesh” that scales.


Related reading

Assure application reliability with Causely

In this video, we’ll show how easy it is to continuously assure application reliability using Causely’s causal AI platform.

 

In a modern production microservices environment, the number of alerts from observability tooling can quickly amount to hundreds or even thousands, and it’s extremely difficult to understand how all these alerts relate to each other and to the actual root cause. At Causely, we believe these overwhelming alerts should be consumed by software, and root cause analysis should be conducted at machine speed.

Our causal AI platform automatically associates active alerts with their root cause, drives remedial actions, and enables review of historical problems as well. This information streamlines post-mortem analysis, frees DevOps time from complex, manual processes, and helps IT teams plan for upcoming changes that will impact their environment.

Causely installs in minutes and is SOC 2 compliant. Share your troubleshooting stories below or request a live demo – we’d love to see how Causely can help!

Cause and Effect: Solving the Observability Conundrum

Causely's Ellen Rubin on Inside Analysis

The pressure on application teams has never been greater. Whether for Cloud-Native Apps, Hybrid Cloud, IoT, or other critical business services, these teams are accountable for solving problems quickly and effectively, regardless of growing complexity. The good news? There’s a whole new array of tools and technologies for helping enable application monitoring and troubleshooting. Observability vendors are everywhere, and the maturation of machine learning is changing the game. The bad news? It’s still largely up to these teams to put it all together. Check out this episode of InsideAnalysis to learn how Causal AI can solve this challenge. As the name suggests, this technology focuses on extracting signal from the noise of observability streams in order to dynamically ascertain root cause analysis, and even fix mistakes automatically.

Tune in to hear Host Eric Kavanagh interview Ellen Rubin of Causely, as they explore how this fascinating new technology works.

Fool’s Gold or Future Fixer: Can AI-powered Causality Crack the RCA Code for Cloud Native Applications?

Can AI-powered causality crack the RCA code for cloud-native applications?

The idea of applying AI to determine causality in an automated Root Cause Analysis solution sounds like the Holy Grail, but it’s easier said than done. There’s a lot of misinformation surrounding RCA solutions. This article cuts the confusion and provides a clear picture. I will outline the essential functionalities needed for automated root cause analysis. Not only will I define these capabilities, I will also showcase some examples to demonstrate their impact.

By the end, you’ll have a clearer understanding of what a robust RCA solution powered by causal AI can offer, how it can empower your IT team to better navigate the complexities of your cloud-native environment, and, most importantly, how it can dramatically reduce MTTx.

The Rise (and Fall) of the Automated Root Cause Analysis Holy Grail

Modern organizations are tethered to technology. IT systems, once monolithic and predictable, have fractured into a dynamic web of cloud-native applications. This shift towards agility and scalability has come at a cost: unprecedented complexity.

Troubleshooting these intricate ecosystems is a constant struggle for DevOps teams. Pinpointing the root cause of performance issues and malfunctions can feel like navigating a labyrinth – a seemingly endless path of interconnected components, each with the potential to be the culprit.

For years, automating Root Cause Analysis (RCA) has been the elusive “Holy Grail” for service assurance, as the business consequences of poorly performing systems are undeniable, especially as organizations become increasingly reliant on digital platforms.

Despite its importance, commercially available solutions for automated RCA remain scarce. While some hyperscalers and large enterprises have the resources and capital to attempt to develop in-house solutions to address the challenge (like Capital One’s example), these capabilities are out of reach for most organizations.

See how Causely can help your organization eliminate human troubleshooting. Request a demo of the Causal AI platform. 

Beyond Service Status: Unraveling the Cause-and-Effect Relations in Cloud Native Applications

Highly distributed systems, regardless of technology, are vulnerable to failures that cascade and impact interconnected components. Cloud-native environments, due to their complex web of dependencies, are especially prone to this domino effect. Imagine a single malfunction in a microservice triggering a chain reaction that disrupts related microservices. Similarly, a database issue can ripple outwards, affecting its clients and, in turn, everything that relies on them.

The same applies to infrastructure services like Kubernetes, Kafka, and RabbitMQ. Problems in these platforms are not always immediately obvious from the symptoms they cause within their own domain. Symptoms also manifest in the applications they support, and can propagate further to related applications, creating situations where the root cause problem and the symptoms it causes are separated by several layers.

Although many observability tools offer maps and graphs to visualize infrastructure and application health, these can become overwhelming during service disruptions and outages. While a sea of red icons in a topology map might highlight one or more issues, they fail to illuminate cause-and-effect relationships. Users are then left to decipher the complex interplay of problems and symptoms to work out the root cause. This is even harder to decipher when multiple root causes are present that have overlapping symptoms.

While topology maps show the status of services, they leave their users to interpret cause & effect


In addition to topology-based correlation, DevOps teams may also have experience with other types of correlation, including event deduplication, time-based correlation and path-based analysis, all of which attempt to reduce the noise in observability data. Don’t lose sight of the fact that this is not root cause analysis, just correlation, and correlation does not equal causation. I covered this subject further in a previous article, Unveiling The Causal Revolution in Observability.

The Holy Grail of troubleshooting lies in understanding causality. Moving beyond topology maps and graphs, we need solutions that represent causality itself, depicting the complex chains of cause-and-effect relationships with clear lines of responsibility. Precise root cause identification that clearly explains the relationship between root causes and the symptoms they cause, spanning the technology domains that support application service composition, empowers DevOps teams to:

  • Accelerate Resolution: By pinpointing the exact source of an issue and the symptoms it causes, responsible teams are notified instantly and can prioritize fixes based on a clear understanding of the problem’s magnitude. This laser focus translates to faster resolution times.
  • Minimize Triage: Teams managing impacted services are spared the burden of extensive troubleshooting. They can receive immediate notification of the issue’s origin, impact, and ownership, eliminating unnecessary investigation and streamlining recovery.
  • Enhance Collaboration: With a clear understanding of complex chains of cause-and-effect relationships, teams can collaborate more effectively. The root cause owner can concentrate on fixing the issue, while impacted service teams can implement mitigating measures to minimize downstream effects.
  • Automate Responses: Understanding cause and effect is also an enabler for automated workflows. This might include automatically notifying relevant teams through collaboration tools, notification systems and the service desk, as well as triggering remedial actions based on the identified problem.
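The “Automate Responses” idea above can be sketched in a few lines. This is a hypothetical illustration, not Causely’s actual API: the `RootCause` structure, the remediation mapping, and the team names are all invented for the example.

```python
# Hypothetical sketch: routing one identified root cause to its owning team
# and selecting an automated remediation, instead of paging every team whose
# service merely shows a symptom. All names here are illustrative.

from dataclasses import dataclass, field

@dataclass
class RootCause:
    entity: str                       # e.g. the congested Pod
    problem: str                      # e.g. "CPUCongestion"
    owner: str                        # team responsible for the entity
    symptoms: list = field(default_factory=list)

# Illustrative mapping from problem type to a remedial action.
REMEDIATIONS = {
    "CPUCongestion": "scale_up_replicas",
    "NoisyNeighborQuery": "throttle_client",
}

def respond(cause: RootCause) -> dict:
    """Build a single notification + action plan from one root cause."""
    return {
        "notify": cause.owner,
        "action": REMEDIATIONS.get(cause.problem, "open_incident"),
        # downstream symptoms need no separate triage
        "suppress_alerts_for": cause.symptoms,
    }

plan = respond(RootCause(
    entity="payments-pod-7f9c",
    problem="CPUCongestion",
    owner="platform-team",
    symptoms=["checkout high latency", "cart high latency"],
))
print(plan["notify"], plan["action"])
```

In practice the `notify` and `action` fields would feed collaboration tools, the service desk, or an orchestration workflow, as described above.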

Bringing This To Life With Real World Examples

The following examples will showcase the concept of causality relations, illustrating the precise relationships between root cause problems and the symptoms they trigger in interrelated components that make up application services.

This knowledge is crucial for several reasons. First, it allows for targeted notifications. By understanding the cause-and-effect sequences, the right teams can be swiftly alerted when issues arise, enabling faster resolution. Second, service owners impacted by problems can pinpoint the responsible parties. This clarity empowers them to take mitigating actions within their own services whenever possible and not waste time troubleshooting issues that fall outside of their area of responsibility.

Infra Problem Impacting Multiple Services

In this example, CPU congestion in a Kubernetes Pod is the root cause, producing symptoms – high latency – in the application services it hosts. This in turn results in high latency in other application services. The causal relationships are clearly explained.

Causality graph generated by Causely

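A minimal sketch of the propagation logic behind this example, assuming an invented three-service topology (this is not Causely’s actual algorithm): a breadth-first walk over the inverted dependency graph finds every service whose chain of dependencies reaches the congested Pod.

```python
# Sketch: which services inherit symptoms from a single root cause?
# Topology is invented for illustration; edges run dependent -> dependency.

from collections import deque

depends_on = {
    "frontend": ["checkout-svc"],
    "checkout-svc": ["payments-pod"],
}

def impacted_by(root: str, depends_on: dict) -> set:
    """Return every service whose dependency chain reaches `root`."""
    # Invert the graph: dependency -> its direct dependents.
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for svc in dependents.get(node, []):
            if svc not in seen:
                seen.add(svc)
                queue.append(svc)
    return seen

print(impacted_by("payments-pod", depends_on))  # checkout-svc and frontend
```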

A Microservice Hiccup Leads to Consumer Lag

Imagine you’re relying on a real-time data feed, but the information you see is outdated. In this scenario, a bug within a microservice (the data producer) disrupts its ability to send updates. This creates a backlog of events, causing downstream consumers (the services that use the data) to fall behind. As a result, users/customers end up seeing stale data, impacting the overall user experience and potentially leading to inaccurate decisions. Very often the first time DevOps find out about these types of issues is when end users and customers complain about the service experience.

Causality graph generated by Causely

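The “consumer lag” symptom in this scenario has a simple definition that can be sketched directly: lag is the gap, per partition, between the newest offset the producer has written and the last offset the consumer group has committed. The offsets below are made-up numbers for illustration.

```python
# Illustrative sketch of consumer lag as seen in event streaming platforms
# like Kafka. A growing total signals that downstream consumers are falling
# behind and users are seeing stale data.

def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: newest written offset minus last committed offset."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end_offsets = {0: 1500, 1: 980}   # latest offsets written by the producer
committed   = {0: 1200, 1: 980}   # offsets the consumer group has processed

lag = consumer_lag(end_offsets, committed)
print(lag, "total:", sum(lag.values()))  # {0: 300, 1: 0} total: 300
```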

Database Problems

In this example, the clients of a database are experiencing performance issues because one client is issuing particularly resource-intensive queries. Symptoms of this include:

  • Slow query response times: Other queries submitted to the database take a significantly longer time to execute.
  • Increased wait times for resources: Applications using the database experience high error rates as they wait for resources like CPU or disk access that are heavily utilized by the resource-intensive queries.
  • Database connection timeouts: If the database becomes overloaded due to the resource-intensive queries, applications might experience timeouts when trying to connect.

Causality graph generated by Causely
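To make the resource-intensive client scenario concrete, here is a hedged sketch of how such an offender might be surfaced from query statistics of the kind PostgreSQL’s pg_stat_statements view exposes. The rows and client names are invented; in practice they would come from the database’s own statistics.

```python
# Invented query statistics: a few fast, frequent queries versus one
# expensive reporting query that dominates total database time.

sample_stats = [
    {"client": "reporting-svc", "query": "SELECT ... large join", "calls": 40,    "total_ms": 480000},
    {"client": "checkout-svc",  "query": "SELECT ... by id",      "calls": 90000, "total_ms": 45000},
    {"client": "cart-svc",      "query": "UPDATE ... quantity",   "calls": 30000, "total_ms": 21000},
]

def top_offenders(stats, n=1):
    """Rank clients by total database time consumed, most expensive first."""
    return sorted(stats, key=lambda r: r["total_ms"], reverse=True)[:n]

worst = top_offenders(sample_stats)[0]
print(worst["client"], worst["total_ms"])  # reporting-svc dominates DB time
```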

Summing Up

Cloud-native systems bring agility and scalability, but troubleshooting can be a nightmare. Here’s what you need to conquer Root Cause Analysis (RCA) in this complex world:

  • Automated Analysis: Move beyond time-consuming manual RCA. Effective solutions automate data collection and analysis to pinpoint cause-and-effect relationships swiftly.
  • Causal Reasoning: Don’t settle for mere correlations. True RCA tools understand causal chains, clearly and accurately explaining “why” things happen and the impact that they have.
  • Dynamic Learning: Cloud-native environments are living ecosystems. RCA solutions must continuously learn and adapt to maintain accuracy as the landscape changes.
  • Abstraction: Cut through the complexity. Effective RCA tools provide a clear view, hiding unnecessary details and highlighting crucial troubleshooting information.
  • Time Travel: Post-incident analysis requires clear explanations. Go back in time to understand why problems occurred and the impact they had.
  • Hypothesis: Understand the impact that degradation or failures in application services and infrastructure will have before they happen.

These capabilities unlock significant benefits:

  • Faster Mean Time to Resolution (MTTR): Get back to business quickly.
  • More Efficient Use Of Resources: Eliminate wasted time chasing the symptoms of problems and get to the root cause immediately.
  • Free Up Expert Resources From Troubleshooting: Empower less specialized teams to take ownership of the work.
  • Improved Collaboration: Foster teamwork because everyone understands the cause-and-effect chain.
  • Reduced Costs & Disruptions: Save money and minimize business interruptions.
  • Enhanced Innovation & Employee Satisfaction: Free up resources for innovation and create a smoother work environment.
  • Improved Resilience: Take action now to prevent problems that could impact application performance and availability in the future

If you would like to avoid the glitter of “Fool’s Gold” and get to the Holy Grail of service assurance with automated Root Cause Analysis, don’t hesitate to reach out to me directly, or contact the team at Causely today to discuss your challenges and discover how they can help you.


Related Resources

 

On security platforms

🎧 This Tech Tuesday Podcast features Endre Sara, Founding Engineer at Causely!

Causely is bridging observability with automated orchestration for self-managed, resilient applications at scale.

In this episode, Amir and Endre discuss leadership, how to make people’s lives easier by operating complex, large software systems, and why Endre thinks IaC should be boring!

Dr. Shmuel Kliger on Causely, Causal AI, and the Challenging Journey to Application Health

Shmuel Kliger on the DiscoPosse Podcast about Causal AI

Dr. Shmuel Kliger, the founder of Causely.io, discusses his journey in the IT industry and the development of Causely. With a strong focus on reducing labor associated with IT operations, Dr. Kliger emphasizes the importance of understanding causality and building intelligent systems to drive insights and actions in complex IT environments. He highlights the need to focus on purpose-driven analytics and structured causality models to effectively manage and control IT systems.

Dr. Kliger also touches on the role of human interaction in influencing system behavior, mentioning the importance of defining constraints and trade-offs to guide automated decision-making processes. He envisions a future where humans provide high-level objectives and constraints, allowing the system to automatically configure, deploy, and optimize applications based on these inputs. By combining human knowledge with machine learning capabilities, Dr. Kliger aims to create a more efficient and effective approach to IT management, ultimately reducing troubleshooting time and improving system performance.

Tight on time?

Get the cliff notes from these clips:

Other ways to watch/listen

See the Causely platform in action

Are you ready to eat your own dogfood?

Photo by Design Wala on Unsplash

It’s a truism of all cloud SaaS companies that we should run our businesses on our own technology. After all, if this technology is so valuable and innovative that customers with dozens of existing vendors, tools and processes need to adopt the new offering, shouldn’t the startup use it internally as well? It also seems obvious that a new company should reality check the value it’s claiming to provide to the industry by seeing it first-hand, and that any major gaps should ideally be experienced first by the home team rather than inflicting these on customers.

Sometimes it’s easier said than done.

It can be surprising how difficult this sometimes turns out to be for new companies.

When the technical and product teams focus on the ideal customer profile and likely user, it’s not always the case that these match the startup’s internal situation. Very often, the intended user is really part of a larger team that interacts with other teams in a complex set of processes and relationships, and the realistic environment that the new product will face is far larger and more diverse than anything a startup would have internally. This makes it difficult to apply the new technology in a meaningful way and can also make the value proposition less obvious.

For example, if a major claim of the innovation is to simplify complex, manual or legacy processes, or to reduce wasteful spending through optimization, these benefits may be less obvious in a smaller/newer company environment. As a result, the “eat your own dogfood” claim may be just a marketing slogan without real meaning.

And then there’s the other truism to consider: the cobbler’s children often have no shoes. When a startup is running fast and focused on building innovation – with a small team that prioritizes the core value of a new product and customer engagement over any internal efforts – it’s easy to push off eating your own dogfood for another day. Ironically, if there are challenges in using the product internally, these may not be seen as the highest priority to fix or improve, vs. the urgent customer-facing ones that are tied to adoption and revenue. It’s always easy to say, “Well, customers haven’t seen these issues yet, so it’s probably ok for now.”

But, in fact, this is at the heart of the innovation process.

No one cares more about your product than the team that built it. You’re always challenging yourself:

Is it really easy to use?

Does it work reliably?

Where are the hidden “gotchas”?

But for a breakthrough new product, this isn’t enough.

As your earliest design partners start testing the product in their environments, it’s equally important to consider how you will use it internally as well. This is not just about testing for bugs or functionality as you build your software development processes. It’s about becoming the user in every way possible and seeing the product through their eyes and daily jobs.

The world can look quite different from this viewpoint. Using the product internally raises the bar higher than responding to customer feedback or watching them during usability testing. It gives you a direct, visceral reaction to your own product:

Does it delight you as a user?

Would you use it on a daily basis?

Does it make your job easier?

Does it provide value that’s beyond other products you’re currently using?

Even if your company is not a perfect match with your ICP and your employees are not the typical users, you can still learn a great deal. For example:

  • Do a job that your customer would do, from end to end, and see whether the product made your work easier/better.
  • Show someone else in your company (who’s less familiar with the product) an output from the product and ask if this was helpful and understandable.
  • Think of what you’d want your product to do next if you were a paying customer and considering a renewal.

At Causely, we decided early on that a high priority was to run “Causely on Causely.” Since we are developing our own SaaS application (which of course is not nearly as complex or mature as our customers’ application environments, but still has many of the same potential cloud-native problems and failure scenarios), we also need to troubleshoot when things go wrong. So we wanted to make sure that Causely would automatically identify our own emerging issues, root cause them correctly, remediate them faster and prevent end-user impact. We judge our progress based on whether WE would find this compelling and validate the claims we are making to customers, such as enabling them to have healthier, more resilient applications without human troubleshooting.

As a team, this requires us to discuss our own experiences as a customer and makes it easier to imagine the experiences of larger, inter-connected teams of users running massive applications at scale. Eating our own dogfood helps us improve the product so it’s easier to use, more understandable and reliable. And it has laid the foundation for how we will develop and operate our own applications as we scale. Of course eating your own dogfood is not a substitute for all other required approaches to testing and improving the product, but it’s a critical element in a startup’s development that should be hard-wired into company culture.

I would love to hear about other founders’ dog-fooding experiences and what’s worked well (or not) as you build your products. Please share!


Related resources

The Fast Track to Fixes: How to Turbo Charge Application Instrumentation & Root Cause Analysis

In the fast-paced world of cloud-native development, ensuring application health and performance is critical. The application of Causal AI, with its ability to understand cause and effect relationships in complex distributed systems, offers the potential to streamline this process.

A key enabler for this is application instrumentation that facilitates an understanding of application services and how they interact with one another through distributed tracing. This is particularly important with complex microservices architectures running in containerized environments like Kubernetes, where manually instrumenting applications for observability can be a tedious and error-prone task.

This is where Odigos comes in.

In this article, we’ll share our experience working with the Odigos community to automate application instrumentation for cloud-native deployments in Kubernetes.

Thanks to Amir Blum for adding resource attributes to native OpenTelemetry instrumentation based on our collaboration. And I appreciate the community accepting my PR to allow easy deployment using a Helm chart in addition to using the CLI in your K8s cluster!

This collaboration enables customers to implement universal application instrumentation and automate the root cause analysis process in just a matter of hours.

The Challenges of Instrumenting Applications to Support Distributed Tracing

Widespread application instrumentation remains a hurdle for many organizations. Traditional approaches rely on deploying vendor agents, often with complex licensing structures and significant deployment effort. This adds another layer of complexity to the already challenging task of instrumenting applications.

Because of the complexities and costs involved, many organizations struggle with making the business case for universal deployment, and are therefore very selective about which applications they choose to instrument.

While OpenTelemetry offers a step forward with auto-instrumentation, it doesn’t eliminate the burden entirely. Application teams still need to add library dependencies and deploy the code. In many situations this may meet resistance from product managers who prioritize development of functional requirements over operational benefits.

As applications grow more intricate, maintaining consistent instrumentation across a large codebase is a major challenge, and any gaps leave blind spots in an organization’s observability capabilities.

Odigos to the Rescue: Automating Application Instrumentation

Odigos offers a refreshing alternative. Their solution automates the process of instrumenting all applications running in Kubernetes clusters with just a few Kubernetes API calls. This eliminates the need to call in application developers to facilitate the process, which may take time and require approval from product managers. This not only saves development time and effort but also ensures consistent and comprehensive instrumentation across all applications.

Benefits of Using Odigos

Here’s how Odigos is helping Causely and its customers to streamline the process:

  • Reduced development time: Automating instrumentation requires zero effort from development teams.
  • Improved consistency: Odigos ensures consistent instrumentation across all applications, regardless of the developer or team working on them.
  • Enhanced observability: Automatic instrumentation provides a more comprehensive view of application behavior.
  • Simplified maintenance: With Odigos handling instrumentation, maintaining and updating is simple.
  • Deeper insights into microservice communication: Odigos goes beyond HTTP interactions. It automatically instruments asynchronous communication through message queues, including producers and consumer flows.
  • Database and cache visibility: Odigos doesn’t stop at message queues. It also instruments database interactions and caches, giving a holistic view of data flow within applications.
  • Key performance metric capture: Odigos automatically instruments key performance metrics that can be consumed by any OpenTelemetry compliant backend application.

Using Distributed Tracing Data to Automate Root Cause Analysis

Causely consumes distributed tracing data along with observability data from Kubernetes, messaging platforms, databases and caches, whether self-hosted or running in the cloud, for the following purposes:

  • Mapping application interactions for causal reasoning: Odigos’ tracing data empowers Causely to build a comprehensive dependency graph. This depicts how application services interact, including:
    • Synchronous and asynchronous communication: Both direct calls and message queue interactions between services are captured.
    • Database and cache dependencies: The graph shows how services rely on databases and caches for data access.
    • Underlying infrastructure: The compute and messaging infrastructure that supports the application services is also captured.

Example dependency graph depicting how application services interact

This dependency graph can be visualized, but it is also crucial to Causely’s causal reasoning engine. By understanding the interconnectedness of services and infrastructure, Causely can pinpoint the root cause of issues more effectively.

  • Precise state awareness: Causely only consumes the observability data needed to analyze the state of application and infrastructure entities for causal reasoning, ensuring efficient resource utilization.
  • Automated root cause analysis: Through its causal reasoning capability, Causely automatically identifies the detailed chain of cause-and-effect relationships between problems and their symptoms in real time, when performance degrades or malfunctions occur in applications and infrastructure. These can be visualized through causal graphs, which clearly depict the relationships between root cause problems and the symptoms/impacts they cause.
  • Time travel: Causely provides the ability to go back in time so DevOps teams can retrospectively review root cause problems and the symptoms/impacts they caused in the past.
  • Assess application resilience: Causely enables users to reason about what the effect would be if specific performance degradations or malfunctions were to occur in application services or infrastructure.
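To make the dependency-graph idea concrete, here is a minimal, self-contained sketch of how a graph might be derived from trace spans and used to narrow down root cause candidates. The span fields, service names, and traversal logic are illustrative assumptions, not Causely’s actual data model or reasoning engine:

```python
# Minimal sketch of the idea: build a service dependency graph from trace spans,
# then use it to suggest which downstream entity could explain observed symptoms.
# Span fields and service names are illustrative, not Causely's data model.
from collections import defaultdict

spans = [
    {"service": "frontend", "calls": "checkout"},
    {"service": "checkout", "calls": "payments-db"},
    {"service": "checkout", "calls": "orders-queue"},
    {"service": "orders",   "calls": "orders-queue"},  # async producer/consumer edge
]

# Dependency graph: service -> set of things it depends on
deps = defaultdict(set)
for s in spans:
    deps[s["service"]].add(s["calls"])

def possible_root_causes(symptomatic_service: str) -> set:
    """Everything the symptomatic service transitively depends on is a candidate."""
    seen, stack = set(), [symptomatic_service]
    while stack:
        node = stack.pop()
        for dep in deps.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# If "frontend" shows high latency, its transitive dependencies are suspects.
print(sorted(possible_root_causes("frontend")))
# -> ['checkout', 'orders-queue', 'payments-db']
```

A real causal reasoning engine does far more than transitive reachability (it weighs symptoms, models failure modes, and ranks hypotheses), but the dependency graph is the substrate that makes that reasoning possible.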

Want to see Causely in action? Request a demo. 


Example causal graph depicting relationships between root cause problems and the symptoms/impacts that they cause

Conclusion

Working with Odigos has been a very smooth and efficient experience. They have enabled our customers to instrument their applications and exploit Causely’s causal reasoning engine within a matter of hours. In doing so they were able to:

  • Instrument their entire application stack efficiently: Eliminating developer overheads and roadblocks without the need for costly proprietary agents.
  • Assure continuous application reliability: Ensuring that KPIs, SLAs and SLOs are continually met by proactively identifying and resolving issues.
  • Improve operational efficiency: By minimizing the labor, data, and tooling costs with faster MTTx.

If you would like to learn more about our experience of working together, don’t hesitate to reach out to the teams at Odigos or Causely, or join us in contributing to the Odigos open source observability plane.


Related Resources

Time to Rethink DevOps Economics? The Path to Sustainable Success

As organizations transform their IT applications and adopt cloud-native architectures, scaling seamlessly while minimizing resource overheads becomes critical. DevOps teams can play a pivotal role in achieving this by embracing automation across various facets of the service delivery process.

Automation shines in areas such as infrastructure provisioning and scaling, continuous integration and delivery (CI/CD), testing, security and compliance, but the practice of automated root cause analysis remains elusive.

While automation aids observability data collection and data correlation, understanding the relationships between cause and effect still requires the judgment and expertise of skilled personnel. This work falls on the shoulders of developers and SREs who have to manually decode the signals – from metrics, traces and logs – in order to get to the root cause when the performance of services degrades.

Individual incidents can take hours and even days to troubleshoot, demanding significant resources from multiple teams. The consistency of the process can also vary greatly depending on the skills that are available when these situations occur.

Service disruptions can also have significant financial consequences. Negative customer experiences directly impact revenue, and place an additional resource burden on the business functions responsible for appeasing unhappy customers. Depending on the industry you operate in and the type of services you provide, service disruptions may result in costly chargebacks and fines, making mitigation even more crucial.

Quantifying the load of manual root cause analysis on IT teams


Shining a Light on the Root Cause Analysis Problem in DevOps

While decomposing applications into microservices through the adoption of cloud-native architectures has enabled DevOps teams to increase the velocity with which they can release new functionality, it has also created a new set of operational challenges that have a significant impact on ongoing operational expenses and service reliability.

Increased complexity: With more services comes greater complexity, more moving parts, and more potential interactions that can lead to issues. This means diagnosing the root cause of problems becomes more difficult and time-consuming.

Distributed knowledge: In cloud-native environments, knowledge about different services often resides in different teams, who have limited knowledge of the wider system architecture. As the number of services scales, finding the right experts and getting them to collaborate on troubleshooting problems becomes more challenging. This adds to the time and effort required to coordinate and carry out root cause analysis and post incident analysis.  

Service proliferation fuels troubleshooting demands: Expanding your service landscape, whether through new services or simply additional instances, inevitably amplifies troubleshooting needs, which translates into greater resource requirements in DevOps teams over time.

Testing regimes cannot cover all scenarios: DevOps, with its CI/CD approach, releases frequent updates to individual services. This agility can reveal unforeseen interactions or behavioral changes in production, leading to service performance issues. While rollbacks provide temporary relief, identifying the root cause is crucial. Traditional post-rollback investigations might fall short due to unreproducible scenarios. Instead, real-time root cause analysis of these situations as they happen is important to ensure swift fixes and prevent future occurrences.

Telling the Story with Numbers

As cloud-native services scale, troubleshooting demands also grow exponentially, in a similar way to compounding interest on a savings account. As service footprints expand, more DevOps cycles are consumed by troubleshooting versus delivering new code, creating barriers to innovation. Distributed ownership and unclear escalation paths can also mask the escalating time that is consumed by troubleshooting. 

Below is a simple model that can be customized with company-specific data to illustrate the challenge in numbers. This model helps paint a picture of the current operational costs associated with troubleshooting. It also demonstrates how these are going to escalate over time, driven by the growth in cloud-native services (more microservices, serverless functions, etc).   

The model also illustrates the impact of efficiency gains through automation versus the current un-optimized state. The gap highlights the size of the opportunity available to create more cycles for productive development while reducing the need for additional headcount into the future, by automating troubleshooting.
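As a rough illustration of the model’s shape, here is a minimal sketch in code. All figures are placeholder assumptions to be replaced with company-specific data; this is not the actual model referenced above:

```python
# Illustrative version of the cost model: troubleshooting hours compound with
# service growth, and automation shrinks them. All numbers are placeholders.
def annual_troubleshooting_cost(services, incidents_per_service_per_year,
                                hours_per_incident, loaded_hourly_rate):
    # Total yearly spend on manual troubleshooting labor.
    return (services * incidents_per_service_per_year
            * hours_per_incident * loaded_hourly_rate)

def project(years, services, growth_rate, automation_factor=1.0, **kw):
    """Yearly cost as the service footprint grows; automation_factor < 1
    models the efficiency gain from automated root cause analysis."""
    costs = []
    for _ in range(years):
        costs.append(annual_troubleshooting_cost(services, **kw) * automation_factor)
        services *= (1 + growth_rate)  # footprint compounds year over year
    return costs

kw = dict(incidents_per_service_per_year=4, hours_per_incident=6,
          loaded_hourly_rate=120)
baseline  = project(3, services=100, growth_rate=0.3, **kw)
optimized = project(3, services=100, growth_rate=0.3, automation_factor=0.4, **kw)
print([round(b - o) for b, o in zip(baseline, optimized)])  # annual savings
```

Even with modest placeholder inputs, the compounding growth term dominates: the gap between the un-optimized and optimized modes widens every year.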

This model illustrates the costs and impact of efficiency gains that can be achieved through automated root cause analysis


Beyond the cost of human capital, a number of other factors have a direct impact on troubleshooting costs. These include the escalating costs of infrastructure and third-party SaaS services dedicated to the management of observability data. These costs are well publicized and are highlighted in a recent article published by Causely Founding Engineer Endre Sara, which discusses avoiding the pitfalls of escalating costs when building out Causely’s own SaaS offering.

While DevOps operations have a cost, it pales in comparison to the financial consequences of service disruptions. With automated root cause analysis, DevOps teams can mitigate these risks, saving the business time, money and reputation.

Future iterations of the model will account for these additional dimensions. 

If you would like to put your data to work and see the quantifiable benefits of automated root cause analysis in numbers, complete the short form to get started. 

Translating Theory into Reality

Did you know companies like Meta and Capital One automate root cause analysis in DevOps, achieving 50% faster troubleshooting? However, the custom solutions built by these industry giants require vast resources and deep expertise in data science to build and maintain, putting automated root cause analysis capabilities out of reach for most companies.

The team at Causely is changing this dynamic. Armed with decades of experience applying AI to the management of distributed systems, networks, and application resources, they offer a powerful SaaS solution that removes the roadblocks to automated root cause analysis in DevOps environments. The Causely solution enables:

  • Clear, explainable insights: Instead of receiving many notifications when issues arise, teams receive clear notifications that explain the root cause along with the symptoms that led to these conclusions.
  • Faster resolution times: Teams can get straight to work on problem resolution and even automate resolutions, versus spending time diagnosing problems.
  • Business impact reduction: Problems can be prevented, early in their cycle, from escalating into critical situations that might otherwise have resulted in significant business disruption.
  • Clearer communication & collaboration: RCA pinpoints issue owners, reducing triage time and wasted efforts from other teams.
  • Simplified post-incident analysis: All of the knowledge about the cause and effect of prior problems is stored and available to simplify the process of post incident analysis and learning.

Wrapping Up

In this article we discussed the key challenges associated with troubleshooting and highlighted the cost implications of today’s approach. Addressing these issues is important because the business consequences of today’s approach are significant.

  1. Troubleshooting is costly because it consumes the time of skilled resources.
  2. Troubleshooting steals time from productive activities which impacts the ability of DevOps to deliver new capabilities.
  3. Service disruptions have business consequences: The longer they persist, the bigger the impact to customers and business.

If you don’t clearly understand your future resource requirements and costs associated with troubleshooting as you scale out cloud-native services, the model we’ve developed provides a simple way to capture this.

Want to see quantifiable benefits from automated root cause analysis?

Turn “The Optimized Mode of Operation” in the model from vision to reality with Causely.

The Causely service enables you to measure and showcase the impact of automating RCA in your organization. Through improved service performance and availability, faster resolution times, and improved team efficiency, the team will show you the “art of the possible.”


Related Resources

The Fellowship of the Uptime Ring: A Quest for Site Reliability

Reposted with permission from its original source on LinkedIn


A digital chill swept through King Reginald as he materialized back in his royal chambers, having returned from the Cloud Economic World Forum. The summit had cast a long shadow, its discussions echoing through his processors with a stark warning. One of the most concerning threats to the global economy, the forum declared, was the alarming unreliability of digital infrastructure. Countless nations, their economies heavily reliant on online services, were teetering on the brink of collapse due to ailing applications.

Haunted by the pleas of desperate leaders, King Reginald, a champion of digital prosperity, knew he could not stand idly by. He summoned his most trusted advisors: Causely, the griffin with an eagle eye for detail; OpenTelemetry, the ever-vigilant data whisperer; and Odigos, the enigmatic hobbit renowned for his eBPF magic.

“My friends,” boomed the King, his voice laced with urgency, “we face a momentous challenge. We must embark on a quest, a mission to guide failing Digital Realms back to stability, preventing a domino effect that could plunge the global economy into chaos. We need to formulate a Charter: a beacon of hope, a blueprint for success that will empower these realms to assure the reliability of their online services.”

Causely, his gaze sharp as a hawk’s, swiveled his head. “A noble undertaking, Your Majesty. But how do we convince the denizens of these realms to join us? We need not just a path to success, but a compelling reason for Business Knights, Technical Leaders, and Individual Contributors to rally behind this cause.”

King Reginald hollered, “The Charter will be a clarion call that not only illuminates the path to success but also unveils the treasures that await those who embark on this journey with us. We cannot afford to delay in crafting this crucial charter. Therefore, I propose we convene an offsite meeting to dedicate our collective wisdom to this critical task.”

A Pint and a Plan

King Reginald’s favourite place for such gatherings would normally be Uncle Sam’s Oyster Bar & Grill, renowned for its delightful fare and stimulating intellectual discourse, but it was fully booked for the next six months, thanks to the ever-growing popularity of their chatbot steaks.

A hint of amusement flickered across Odigos’ eyes. “Fear not, my friends,” he declared, a mischievous glint in his voice. “I know of a place steeped in history, and perhaps, even edible sustenance – The Hypervisor Arms!”

Causely’s gaze narrowed slightly. “The Hypervisor Arms?” he echoed, a hint of apprehension tinging his voice. “While once a legendary establishment, I’ve heard whispers…”

Odigos shrugged nonchalantly. “It recently changed hands, coming under the new management of Bob Broadcom. While some may cast aspersions, I believe it may hold the key to a productive and, hopefully, reasonably comfortable brainstorming session.”

Intrigued by the prospect of a new experience, the group followed Odigos to the Hypervisor Arms. An initial buzz of anticipation filled them as they approached the weathered facade. However, upon entering, their enthusiasm waned. The once-vibrant pub was a shadow of its former self. The menu offered a meager selection, the ale taps a disappointingly limited variety, and murmurs of discontent filled the air. Several patrons grumbled about long wait times, and a few bore the mark of undercooked meals on their disgruntled faces.

A hint of disappointment flickered across Odigos’ face. “Perhaps,” he started hesitantly, “we should reconsider another venue.”

King Reginald, however, straightened his virtual shoulders. “While the circumstances are not ideal,” he declared with unwavering resolve, “we shall not allow them to deter us. The ailing applications in these struggling economies demand our expertise. We may need to adapt, but our mission remains. Let us find a suitable table and proceed with our task. The fate of countless digital realms depends on it.”

Despite the underwhelming ambiance, fueled by their sense of purpose, the group settled in and began brainstorming. After hours of passionate discussion, fueled by stale ale and questionable mutton stew (courtesy of the Hypervisor Arms), they emerged with a finalized Charter:

The Charter: A Quest for Site Reliability

1. Building the Observatory (aka Not Using Carrier Pigeons):

  • Leverage OpenTelemetry and its Magical Data Spyglass: Gain X-ray vision into your applications’ health, identifying issues before they wreak havoc like a toddler with a marker in an art museum.
  • Unify with Semantic Conventions, the Universal App Translator: Forget the Tower of Babel, all your applications will speak the same language, making communication smoother than a greased piglet on an ice rink.
  • Embrace Auto-Instrumentation and Odigos, the eBPF Whisperer: For stubborn applications that refuse to cooperate (like those grumpy dragons guarding their treasure), Odigos will use his mystical eBPF magic to pry open the data treasure chest.

2. Causely, the wise Griffin (Who Doesn’t Need Coffee):

  • Empower Causely’s Detective Skills: Let Causely analyze your data like Sherlock Holmes on a caffeine bender, pinpointing the root cause of problems faster than you can say “performance degradation.”
  • Direct Notification and Swift Action: Causely will alert the responsible teams with the urgency of a squirrel who just spotted a peanut, ensuring problems get squashed quicker than a bug under a magnifying glass.

3. The Spoils of Victory (Beyond Just Bragging Rights):

For Business Knights:

  • Increased Productivity: Say goodbye to endless troubleshooting and hello to more time for conquering new territories (aka strategic initiatives).
  • Stronger Financial Position: No more SLA and SLO violations that make customers grumpier than a hangry troll. This means fewer chargebacks, fines, and share price nosedives – more like rocketships, baby!

For Technical Leaders:

  • Empower Your Team: Become the Gandalf of your team, guiding them on a path of innovation instead of getting bogged down in the troubleshooting trenches.
  • Focus on the Future: Less time spent putting out fires means more time spent building the next big thing (and maybe finally getting that nap you deserve).

For Individual Contributors:

  • Do More Than Just Slay Bugs: Break free from the monotony of troubleshooting and tackle exciting challenges you wouldn’t have had time for before.
  • Level Up Your Skills: Spend more time learning new things and becoming the ultimate coding ninja.

Adding Value For Customers:

  • Happy Customers, Happy Life: Customer service teams will be transformed from fire-fighters to customer experience superheroes, helping your customers achieve their goals with ease.

Digital Talent Optimization:

  • Work in a Place That Doesn’t Feel Like Mordor: Create a work environment as inviting as a hobbit’s hole, attracting and retaining top talent who are tired of battling application dragons all day.

King Reginald declared, “This charter is not just a document, it’s a beacon of hope for Digital Realms seeking application serenity (and maybe some decent snacks), and we must promote it immediately to spread the word for our cause throughout the Realms.”

The Call to Arms: Beyond the Glitches, Lies a Digital Utopia

In a public broadcast to the Realms King Reginald announced the charter “This is not just a charter, it’s a battle cry, a war horn echoing across the Digital Realms. We stand united against a common foe: application dragons spewing fire (and error messages) that threaten to devour productivity and sanity.

But fear not, brave knights and valiant developers! We offer you the holy grail of application serenity, and to help you understand the opportunity that lies ahead we offer:

  • White Papers & Videos: That explain the details, the mechanics of the journey and the treasures that this unlocks.
  • Demo Days: Witness the magic unfold before your very eyes as we showcase the power of this system in live demonstrations.
  • Proof of Concepts: Doubt no more! We’ll provide tangible proof that this solution is the elixir your Digital Realm craves.
  • Interactive ROI calculators: Quantify the value of the journey for your Digital Realm.
  • Disco Tech Takeover: Prepare to groove! I’ll be guest-starring on the legendary podcast “Disco Tech,” hosted by the infamous (and influential) blogger, Eric Wrong. Millions of ears will hear the call, and the revolution will begin.

To the Business Knights:

Lead your teams to digital El Dorado, a land overflowing with increased productivity and financial prosperity.

To the Technical Leaders:

Become the Gandalf of your team, guiding them on quests of innovation and unleashing their full potential.

To the Individual Contributors:

Slay the bugs no more! Ascend to new heights of skill and mastery with newfound time and resources.

And to the Customers:

Rejoice! You shall be treated as kings and queens, receiving flawless service and exceptional experiences.

The call to action is clear: Join the quest! Together, we’ll vanquish the performance dragons, build a digital utopia, and snack on subpar catered lunches (we’re working on the catering part).

Remember, brave heroes, the fate of the Digital Realms rests in your hands (and keyboards)!


If you missed the first and second episodes in this epic trilogy, you can find them here:

Don’t Forget These 3 Things When Starting a Cloud Venture

I’ve been in the cloud since 2008. These were the early days of enterprise cloud adoption and the very beginning of hybrid cloud, which to this day remains the dominant form of enterprise cloud usage. Startups that deliver breakthrough infrastructure technology to enterprise customers have their own dynamic (which differs from consumer-focused companies), and although the plots may differ, the basic storylines stay the same. I’ve learned from several startup experiences (and a fair share of battle scars) that there are several things you should start planning for right from the beginning of a new venture.

These aren’t the critical ones you’re probably already doing, such as selecting a good initial investor, building a strong early team and culture, speaking with as many early customers as possible to find product-market fit (PMF), etc.

Instead, here are three of the less-known but equally important things you should prioritize in your first years of building a company.

1. Build your early customer advisor relationships

As always, begin with the customer. Of course you will cast a wide net in early customer validation and market research efforts to refine your ideal customer profile (ICP) and customer persona(s). But as you’re iterating on this, you also want to build stronger relationships with a small group of customers who provide more intensive and strategic input. Often these early customers become part of a more formal customer advisory board, but even before this, you want to have an ongoing dialog with them 1:1 as you’re thinking through your product strategy.

These customers are strategic thinkers who love new technologies and startups. They may be senior execs but not necessarily – they are typically the people who think about how to bring innovation into their complex enterprise organizations and enjoy acting as “sherpas” to guide entrepreneurs on making their products and companies more valuable. I can remember the names of all the customer advisors from my previous companies and they have had an enormous impact on how things progressed. So beyond getting initial input and then focusing only on sales opportunities, build advisor relationships that are based on sharing vision and strategy, with open dialogue about what’s working/not working and where you want to take the company. Take the time to develop these relationships and share more deeply with these advisors.

2. Begin your patent work

Many B2B and especially infrastructure-oriented startups have meaningful IP that is the foundation of their companies and products. Patenting this IP requires the founding engineering team to spend significant time documenting the core innovation and its uniqueness, and the legal fees can run to thousands of dollars. As a result, there’s often a desire to hold off filing patents until the company is farther along and has raised more capital. Also, the core innovation may change as the company gets further into the market and founders realize more precisely what the truly valuable technology includes – or even if the original claims are not aligned with PMF.

However, it’s important to begin documenting your thinking early, as the company and development process are unfolding, so you have a written record and are prepared for the legal work ahead. In the end, the real value patents provide for startups is less about protecting your IP from larger players, who will always have more lawyers and money than you do and will be willing to take legal action to protect their own patent portfolios or keep your technology off-market. Rather, it is about capturing the value of your innovation for future funding and M&A scenarios, and demonstrating to larger players in a buy-out that you’ve taken steps to protect your IP and innovation. In the US, a one-year clock for filing the initial patent (frequently with a provisional patent filed first to preserve an early filing date) begins ticking upon external disclosure, such as when you launch your product, and you don’t want to address these issues in a rush. It’s important to get legal advice early on about whether you have something patentable, how much work it will be to write up the patent application, who will be named as inventors, and whether you want to file in more than one country.

3. Start your SOC2 process as you’re building MVP

Yes, SOC2. Not very sexy but as time has gone by, it’s become absolute table stakes for enabling early-adopter customers to test your initial product, even in a dev/test environment. In the past, it was possible to find a champion who loves new technology to “unofficially” test your early product since s/he was respected and trusted by their organizations to bring in great, new stuff. But as cloud and SaaS have matured and become so widespread at enterprise customers, the companies that provide these services – even to other startups – are requiring ALL their potential vendors to keep them aligned with their own SOC2 requirements. It’s like a ladder of compliance that is built on each vendor supporting the vendors and customers above them. There is typically a new vendor onboarding process and security review, even for testing out a new product, and these are more consistently enforced than in the past.

As a result, it has become more urgent to start your SOC2 process right away, so you can say you’re underway even as you’re building your minimum viable product (MVP) and development processes. It’s much easier now to automate and track SOC2 processes and prepare for the audit (this is my third time doing this, and it is far less manual than in the past). But if you launch your product and then go back later to set up security/compliance policies and processes, it will be much harder and more complicated, and you’ll be under much more pressure from your sales teams to check this box.

There’s no question that other must-dos should be added to the above list (and it would be great to hear from other founders on this). With so many things to consider, it’s hard to prioritize in the early days and sometimes the less obvious things can be pushed for “later”. But as I build my latest cloud venture, Causely, I’ve kept these 3 priorities top of mind since I know how important they are to lay a strong foundation for growth.


Related resources

Mission Impossible? Cracking the Code of Complex Tracing Data

In this video, we’ll show how Causely leverages OpenTelemetry. (For more on how and why we use OpenTelemetry in our causal AI platform, read the blog from Endre Sara.)


Distributed tracing gives you a bird’s-eye view of transactions across your microservices. Far beyond what logs and metrics can offer, it helps you trace the path of a request across service boundaries. Setting up distributed tracing has never been easier. In addition to OpenTelemetry and existing tracing tools such as Tempo and Jaeger, open source tools like Grafana Beyla and Keyval Odigos let you enable distributed tracing in your system without changing a single line of code.

These tools allow instrumented applications to start sending traces immediately. But with potentially hundreds of spans in each trace and millions of traces generated per minute, you can easily become overwhelmed. Even with a bird’s-eye view, you might feel like you’re flying blind.

That’s where Causely comes in. Causely efficiently consumes and analyzes tracing data, automatically constructs cause-and-effect relationships, and pinpoints root causes.
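As a toy illustration of why aggregation beats reading traces by hand (this is a crude stand-in, not Causely’s actual analysis, and the trace data is made up), consider scoring which service most often sits at the bottom of slow call chains:

```python
# Toy illustration: with millions of traces, you don't read them one by one;
# you aggregate. Here we count which service most often appears as the deepest
# span in slow call chains. Data and threshold are illustrative assumptions.
from collections import Counter

# Each trace is a call chain from entry point to deepest dependency,
# with per-span durations in milliseconds.
slow_traces = [
    [("frontend", 900),   ("checkout", 850), ("payments-db", 800)],
    [("frontend", 700),   ("checkout", 640), ("payments-db", 610)],
    [("mobile-api", 500), ("checkout", 480), ("payments-db", 450)],
]

def leaf_suspects(traces, threshold_ms=400):
    """Count the deepest span in each slow trace; repeated leaves are suspects."""
    suspects = Counter()
    for trace in traces:
        service, duration = trace[-1]  # deepest span in the chain
        if duration >= threshold_ms:
            suspects[service] += 1
    return suspects

print(leaf_suspects(slow_traces).most_common(1))
# -> [('payments-db', 3)]
```

Even this naive aggregation surfaces a pattern no human would spot by scrolling through individual traces; causal analysis takes the same raw material much further.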

Interested in seeing how Causely makes it faster and easier to use tracing data in your environment so you can understand the root cause of challenging problems?

Comment here or contact us. We hope to hear from you!


Related resources

Cloud Cuckoo Calamity: The eBPF-Wielding Hobbit and the Root Cause Griffin Save the Realm!

Reposted with permission from its source on LinkedIn


The fate of the realm hangs in the balance. Join the mayhem in Cloud Cuckoo Calamity, the thrilling sequel to Data, Dragons & Digital Dreams.

A mournful dirge echoed through the digital realms, lamenting the passing of King Bartholomew the Bold, ruler of the fabled Cloud Cuckoo Land. With no direct heir apparent, the crown landed upon the brow of King Reginald Mainframe, already the esteemed sovereign of Microservice Manor. He inherited not just a kingdom, but a brewing chaos: Cloud Cuckoo Land, riddled with uninstrumented applications and plagued by performance woes, teetered on the brink of digital disaster.

The developers, pressured by the relentless Business Knights, had prioritized new features over the crucial art of instrumentation. Applications, like untended gardens, were choked with bugs and unpredictable performance. User complaints thundered like disgruntled dragons, and service reliability plummeted faster than a dropped byte.

King Reginald, a leader known for his strategic prowess, faced a quandary. Directly mandating changes would likely spark a rebellion amongst the Business Knights, jeopardizing the kingdom’s already fragile state. He needed a solution both swift and subtle, a whisper in the storm that could transform the chaos into a symphony of efficiency.

That’s when OpenTelemetry, his ever-watchful advisor, swooped in with a glimmer of hope. “Your Majesty,” she chirped, “I bring Odigos, the mystical hobbit rumored to possess not only the ability to weave instrumentation magic into even the most tangled code, but also mastery over the art of eBPF exploration!”

From the vibrant digital realm of Ahlan, renowned for its sand-swept servers and cunning code whisperers, Odigos arrived. Clad in a flowing robe adorned with intricate patterns, he bowed low. “Indeed, your Majesty,” he rasped in a warm, accented voice, “my eBPF skills allow me to delve into the deepest recesses of your applications, unearthing their secrets and revealing their inner workings. But fear not, for I can enchant them with instrumentation in a single, elegant command, seamlessly integrating into your Kubernetes clusters without disrupting your developers’ flow.”

With a twinkle in his eye, Odigos uttered the magical incantation, and lines of code danced across the screen. As if by magic, applications across Cloud Cuckoo Land blossomed with instrumentation, whispering their performance metrics to OpenTelemetry’s attentive ear. The developers, initially wary, were astonished to discover the instrumentation seamlessly woven into their workflows.

But amidst the data deluge, chaos threatened to return. Enter Causely, the griffin detective, with his razor-sharp intellect and unwavering determination. Like a beacon in a storm, his gaze pierced through the mountains of metrics, his feathers bristling with the thrill of the hunt. He possessed an uncanny ability to weed out the root cause even in the most complex of situations, his logic as intricate as the threads of fate.

Armed with OpenTelemetry’s data, Odigos’s newfound insights into the instrumented applications, and Causely’s expert analysis, King Reginald’s forces launched a targeted counteroffensive. Bottlenecks were unraveled, inefficiencies eradicated, and applications hummed with newfound stability. User complaints became echoes of praise, and Cloud Cuckoo Land emerged from the brink, its digital sun shining brighter than ever.

News of their triumph spread like wildfire, inspiring kingdoms across the digital landscape who were themselves struggling with instrumentation and root cause analysis woes. Odigos, the single-command code whisperer, and Causely, the master of root cause analysis, became legends. Their names were synonymous with efficiency, clarity, and the ability to tame even the most chaotic of digital realms.

But as celebrations died down, whispers of a new threat emerged, a malevolent force lurking in the shadows. Would their combined talents be enough to face this looming challenge? Stay tuned, dear reader, for the next chapter in the legend of King Reginald and his ever-evolving digital realm.


Read the next post in this series: The Fellowship of the Uptime Ring: A Quest for Site Reliability.

Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry & Causal AI

Original photo by MART PRODUCTION

Implementing OpenTelemetry at the core of our observability strategy for Causely’s SaaS product was a natural decision. In this article I would like to share some background on our rationale and how the combination of OpenTelemetry and Causal AI addresses several critical requirements that enable us to scale our services more efficiently.

Avoiding Pitfalls Based on Our Prior Experience

We already know from decades of experience working in and with operations teams in the most challenging environments that bridging the gap between the vast ocean of observability data and actionable insights has been, and continues to be, a major pain point. This is especially true in the complex world of cloud-native applications.

Missing application insights

Application observability remains an elusive beast for many, especially in complex microservices architectures. While infrastructure monitoring has become readily available, neglecting application data paints an incomplete picture, hindering effective troubleshooting and operations.

Siloed solutions

Traditional observability solutions have relied on siloed, proprietary agents and data sources, leading to fragmented visibility across teams and technologies. This makes it difficult to understand the complete picture of service composition and dependencies.

To me, this is like trying to solve a puzzle with missing pieces. That’s essentially the problem many DevOps teams face today: piecing together a picture of how microservices, serverless functions, databases, and other elements interact with one another, and how they depend on the underlying infrastructure and cloud services they run on. This hinders collaboration and troubleshooting efforts, making it challenging to pinpoint the root cause of performance issues or outages.

Vendor lock-in

Many vendors’ products also lock customers’ data into their cloud services. Because licensing costs are predicated on the volume of data being collected and stored in the providers’ backend SaaS services, customers can end up paying through the nose. It can also be very hard to exit these services once locked in.

These are all pitfalls we want to avoid at Causely as we build out our Causal AI services.

Want to see Causely in action? Request a demo. 

The Pillars of Our Observability Architecture Pointed Us to OpenTelemetry

OpenTelemetry provides us with a path to break free from these limitations, establishing a common framework that transcends programming languages and platforms that we are using to build our services, and satisfying the requirements laid out in the pillars of our observability architecture:

Precise instrumentation

OpenTelemetry offers automatic instrumentation options that minimize the manual code changes we need to make and streamline the integration of our internal observability capabilities into our chosen backend applications.
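As a concrete illustration of the mechanism, here is a deliberately simplified, dependency-free Python sketch (not the actual OpenTelemetry API): automatic instrumentation wraps existing functions from the outside, so they emit timing spans without any change to the application code itself.

```python
import functools
import time

SPANS = []  # collected telemetry; a stand-in for a real exporter


def auto_instrument(fn):
    """Wrap a function so every call records a timing span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            SPANS.append({"name": fn.__name__,
                          "duration_s": time.perf_counter() - start})
    return wrapper


# The application code stays untouched; instrumentation is applied from outside.
def handle_checkout():
    time.sleep(0.01)
    return "ok"


handle_checkout = auto_instrument(handle_checkout)
result = handle_checkout()
print(result, SPANS[0]["name"])  # → ok handle_checkout
```

In the real OpenTelemetry SDKs, the same wrap-from-outside idea is applied to whole libraries (HTTP clients, database drivers, frameworks), which is why developers rarely need to touch their code by hand.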

Unified picture

By providing a standardized data model powered by semantic conventions, OpenTelemetry enables us to paint an end-to-end picture of how all of our services are composed, including application and infrastructure dependencies. We can also gain access to critical telemetry information, leveraging this semantically consistent data across multiple backend microservices, even when they are written in different languages.
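A small sketch of why this matters (the span records below are invented; the attribute key names follow OpenTelemetry’s semantic conventions): because every service emits the same keys, a single query works across services regardless of implementation language.

```python
# Invented span data; the attribute keys ("service.name",
# "http.response.status_code") follow OpenTelemetry semantic conventions.
spans = [
    {"service.name": "cart",     "language": "go",     "http.response.status_code": 500},
    {"service.name": "shipping", "language": "python", "http.response.status_code": 200},
    {"service.name": "cart",     "language": "go",     "http.response.status_code": 503},
]

# One query, every service, any language:
errors_by_service = {}
for span in spans:
    if span["http.response.status_code"] >= 500:
        name = span["service.name"]
        errors_by_service[name] = errors_by_service.get(name, 0) + 1

print(errors_by_service)  # → {'cart': 2}
```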

Vendor-neutral data management

OpenTelemetry enables us to avoid locking our application data into third-party vendors’ services by decoupling it from proprietary vendor formats. This gives us the freedom to choose the best tools on an ongoing basis based on the value they provide, and if something new comes along that we want to exploit, we can easily plug it into our architecture.
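The decoupling can be pictured like this (a hedged sketch; the exporter classes are illustrative stand-ins, not real OpenTelemetry classes): telemetry is produced once against a neutral interface, so swapping backends is a one-line change rather than a re-instrumentation project.

```python
# Illustrative stand-ins for pluggable exporters; not real OpenTelemetry classes.
class ConsoleExporter:
    def export(self, batch):
        return [f"console:{m}" for m in batch]


class VendorXExporter:
    def export(self, batch):
        return [f"vendor-x:{m}" for m in batch]


def pipeline(metrics, exporter):
    # Application code never references a specific vendor.
    return exporter.export(metrics)


batch = ["cpu.usage=0.7", "queue.depth=42"]
today = pipeline(batch, ConsoleExporter())      # current backend
tomorrow = pipeline(batch, VendorXExporter())   # swapped, no app changes
print(today[0], tomorrow[0])  # → console:cpu.usage=0.7 vendor-x:cpu.usage=0.7
```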

Resource-optimized observability

OpenTelemetry enables us to take a top-down approach to data collection, starting with the problems we are looking to solve and eliminating unnecessary information. This minimizes our storage costs and optimizes the compute resources we need to support our observability pipeline.
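A minimal sketch of what top-down collection means in practice (the thresholds and sample rate are invented for illustration): keep everything that answers the questions we care about, here failures, and only a sample of the rest.

```python
import random

random.seed(7)  # deterministic for the example


def keep(span, sample_rate=0.10):
    """Top-down policy: always keep failures, sample healthy traffic."""
    if span["status_code"] >= 500:
        return True
    return random.random() < sample_rate


spans = [{"status_code": 200}] * 1000 + [{"status_code": 503}] * 5
kept = [s for s in spans if keep(s)]

# Every failure survives, but storage shrinks by roughly 10x.
print(len(spans), "->", len(kept))
```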

We believe that following these pillars and building our Causal AI platform on top of OpenTelemetry will propel our product’s performance, enable rock-solid reliability, and ensure consistent service experiences for our customers as we scale our business. We will also minimize our ongoing operational costs, creating a win-win for us and our customers.

OpenTelemetry + Causal AI: Scaling for Performance and Cost Efficiency

Ultimately, observability aims to illuminate the behavior of distributed systems, enabling proactive maintenance and swift troubleshooting. Yet isolated failures manifest as cascading symptoms across interconnected services.

While OpenTelemetry enables back-end applications to use this data to provide a unified picture in maps, graphs and dashboards, the job of figuring out cause and effect in the correlated data still requires highly skilled people. This process can also be very time consuming, tying up personnel across multiple teams, each with ownership for different elements of the overall service.

There is a lot of noise in the industry right now about how AI and LLMs are going to magically come to the rescue, but reality paints a different picture. The solutions available in the market today focus on correlating data rather than uncovering a direct understanding of the causal relationships between problems and the symptoms they cause, leaving devops teams with noise, not answers.

Traditional AI and LLMs also require massive amounts of data as input for training and for learning behaviors on a continuous basis. This data ultimately ends up being transferred and stored in some form of SaaS, and processing these large datasets is very computationally intensive. It all translates into significant cost overheads for SaaS providers as customer datasets grow over time – costs that ultimately result in ever-increasing bills for customers.

This is where Causal AI comes into its own, taking a fundamentally different approach. Causal AI provides operations and engineering teams with an understanding of the “why”, which is crucial for effective and timely troubleshooting and decision-making.


Example causality chain: Database Connection Noisy Neighbor causing service and infrastructure symptoms

Causal AI uses predefined models of how problems behave and propagate. When combined with real-time information about a system’s specific structure, Causal AI computes a map linking all potential problems to their observable symptoms.

This map acts as a reference guide, eliminating the need to analyze massive datasets every time Causal AI encounters an issue. Think of it as checking a dictionary instead of reading an entire encyclopedia.
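In code, the idea might look something like this (a toy sketch; the causes, symptoms, and scoring are invented for illustration, not Causely’s actual model): diagnosis becomes a lookup and best-match against a precomputed map, not a fresh analysis of the full dataset.

```python
# Toy "codebook": each potential root cause maps to the symptoms it would
# produce. All entries are invented for illustration.
causality_map = {
    "db_connection_pool_exhausted": {"shipping_latency_high", "ratings_5xx", "db_conns_maxed"},
    "kafka_broker_misconfigured":   {"consumer_lag_high", "stale_data"},
    "cart_memory_leak":             {"cart_oom_restarts", "cart_latency_high"},
}


def root_cause(observed_symptoms):
    # Pick the cause whose predicted symptoms best match what we observe.
    def score(cause):
        predicted = causality_map[cause]
        return len(predicted & observed_symptoms) / len(predicted)
    return max(causality_map, key=score)


observed = {"shipping_latency_high", "ratings_5xx", "db_conns_maxed"}
print(root_cause(observed))  # → db_connection_pool_exhausted
```

Because the map is built once from the system’s structure, each new incident costs a cheap lookup, which is the dictionary-versus-encyclopedia point above.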

The bottom line: in contrast to traditional AI, Causal AI operates on a much smaller dataset, requires far fewer resources for computation and provides more meaningful, actionable insights, all of which translates into lower ongoing operational costs and profitable growth.

Summing it up

There’s massive potential for Causal AI and OpenTelemetry to come together to tackle the limitations of traditional AI and get to the “why.” This is what we’re building at Causely. Doing so will result in numerous benefits:

  • Less time on Ops, more time on Dev: OpenTelemetry provides standardized data while Causal AI analyzes it to automate the root cause analysis (RCA) process, which will significantly reduce the time our devops teams have to spend on troubleshooting.
  • Instant gratification, no training lag: We can eliminate AI’s slow learning curve, because Causal AI combines OpenTelemetry’s semantic conventions with its own domain knowledge of cause and effect to deliver actionable results right out of the box, without massive amounts of data and with no training lag!
  • Small data, lean computation, big impact: Unlike traditional AI’s data gluttony and significant computational overheads, Causal AI thrives on targeted data streams. OpenTelemetry’s smart filtering keeps the information flow lean, allowing Causal AI to identify the root causes with a significantly smaller dataset and compute footprint.
  • Fast root cause identification: Traditional AI might tell us “ice cream sales and shark attacks rise together,” but Causal AI reveals the truth – it’s the summer heat, not the sharks, driving both! By understanding cause-and-effect relationships, Causal AI cuts through the noise and identifies the root causes behind performance degradation and service malfunctions.
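The shark example is easy to demonstrate with a few lines of plain Python (synthetic numbers; a sketch of the statistical point, not of either product): two series driven by a common cause correlate strongly even though neither causes the other.

```python
import random
from statistics import mean, stdev

random.seed(0)

# Common cause: daily temperature. Both series depend on it, not on each other.
temperature = [random.uniform(10, 35) for _ in range(200)]
ice_cream_sales = [3.0 * t + random.gauss(0, 5) for t in temperature]
shark_attacks = [0.2 * t + random.gauss(0, 1) for t in temperature]


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


r = pearson(ice_cream_sales, shark_attacks)
print(round(r, 2))  # strong positive correlation, zero causation
```

A correlation-only tool would flag these two series as related; a causal model, knowing the structure, points at the temperature instead.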

Having these capabilities is critical if we want to move beyond the labor intensive processes associated with how RCA is performed in devops today. This is why we are eating our own dog food and using Causely as part of our tech stack to manage the services we provide to customers.

If you would like to learn how to unplug from the Matrix of guesswork and embrace the opportunity offered through the combination of OpenTelemetry and Causal AI, don’t hesitate to reach out! The team and I at Causely are here to share our experience and help you navigate the path.


Related Resources

Data, Dragons & Digital Dreams: The Saga Of Microservice Manor

Reposted with permission from its source on LinkedIn


In the bustling kingdom of Microservice Manor, where code flowed like rivers and servers hummed like contented bees, all was not well. Glitches lurked like mischievous sprites, transactions hiccuped like startled unicorns, and user satisfaction plummeted faster than a dropped byte.

King Reginald Mainframe, a wise old server with a crown of blinking LEDs, wrung his metaphorical hands. His loyal advisors, the Observability Owls, hooted in frustration. Metrics were a jumbled mess, logs spoke in cryptic tongues, and tracing requests felt like chasing greased squirrels through a server maze. Root cause analysis? More like root canal analysis!

King Reginald Mainframe

Then, like a ray of sunshine in a server storm, arrived OpenTelemetry. This sprightly young service, armed with instrumentation libraries and tracing tools, promised a unified telemetry framework to illuminate the Manor’s inner workings. Intrigued, King Reginald gave OpenTelemetry a royal trial, equipping all his microservices with its magical sensors.

OpenTelemetry

The transformation was instant. Metrics flowed like crystal rivers, logs sang in perfect harmony, and tracing requests became a delightful waltz through a well-mapped network. The Observability Owls, no longer befuddled, hooted gleefully as they pinpointed glitches with pinpoint precision.

The Observability Owls

But King Reginald, ever the wise ruler, knew true peace required more than just clear data. He needed someone to interpret the whispers of the metrics, to make sense of the digital symphony. Enter Causely, his new Digital Assistant. Causely, a majestic data analysis griffin with a keen eye and a beak imbued with the power of Causal AI, could see patterns where others saw only noise.

Causely

Together, OpenTelemetry and Causely formed the ultimate root cause analysis dream team. OpenTelemetry, the tireless scout, would reveal the Manor’s secrets, while Causely, the wise griffin, would decipher their meaning.

First, they tackled a rogue shopping cart service, hoarding transactions like a squirrel with acorns. From the whispers of OpenTelemetry, Causely revealed the culprit: a hidden bug causing carts to multiply like gremlins. With a swift code fix, the carts vanished, and checkout flowed like a well-oiled machine.

Then, a more insidious challenge arose. Whispers of delayed messages and sluggish microservices echoed through the Manor halls. OpenTelemetry traced the issue to the Message Queue Meadow, where messages piled up like autumn leaves. But the cause remained elusive.

Causely, feathers bristling with urgency, soared over OpenTelemetry’s data streams. He spotted a pattern: certain message types were clogging the queues, causing a domino effect of delays. Further investigation revealed a compatibility issue between a recently updated microservice and the messaging format.

With Causely’s guidance, the developers swiftly adjusted the messaging format, and the queues sprang back to life. Messages flowed like a rushing river, microservices danced in sync, and user satisfaction soared higher than a cloud-surfing unicorn.

But the saga wasn’t over. Fresh data, the lifeblood of the Manor, seemed to stagnate. Transactions stuttered, user complaints echoed like mournful owls, and the Observability Owls flapped in confusion. OpenTelemetry led them to the Kafka Canal, where messages, instead of flowing freely, were backed up like barges in a narrow lock.

Causely, the griffin detective perched atop OpenTelemetry’s digital cityscape, surveyed the scene. His gaze, piercing through the stream of metrics, snagged on a glaring imbalance: certain topics, once bustling avenues, now overflowed like bursting dams, while others lay vacant as silent streets. With a determined glint in his eye, Causely unearthed the culprit: a misconfigured Kafka broker, its settings acting like twisted locks, choking the data flow.

With Causely’s guidance, the developers swiftly adjusted the broker configuration. The Kafka Canal unclogged, messages surged forward, and consumers feasted on fresh data. Transactions hummed back to life, user complaints turned to delighted chirps, and King Reginald’s crown shone brighter than ever.

The legend of OpenTelemetry and Causely spread far and wide. News of their triumphs, from rogue carts to stagnant data, reached other kingdoms, battling their own digital dragons and mischievous sprites. Causely, the wise griffin, and OpenTelemetry, the tireless scout, became symbols of hope, their teamwork a testament to the power of clear data, insightful analysis, and a dash of griffin magic.

Digital Realms

And so, the quest for perfect digital harmony continued, fuelled by collaboration and a never-ending pursuit of efficiency. Microservice Manor, once plagued by glitches and stale data, became a beacon of smooth operations and fresh information, a reminder that even the trickiest of digital challenges can be overcome with the right tools and the unwavering spirit of a dream team.

The End. (But who knows, perhaps one day, even dragons will learn to code efficiently!)

Does chaos reign in your Digital Kingdom? Banish glitches and optimize your realm with the dynamic duo – OpenTelemetry and Causely! Don’t hesitate to reach out if you would like an introduction to this powerful team.


Related reading

Read the next articles in this saga: 

Causely for asynchronous communication

Causely for async communication - broker OOM

Managing microservices-based applications at scale is challenging, especially when it comes to troubleshooting and pinpointing root causes.

In a microservices-based environment, when a failure occurs, it causes a flood of anomalies across the entire system. Pinpointing the root cause can be as difficult as searching for a needle in a haystack. In this video, we’ll share how Causely can eliminate human heavy lifting and automate the troubleshooting process.

 

Causely is the operating system to assure application service delivery by automatically preventing failures, pinpointing root causes, and remediating. Causely captures and analyzes cause and effect relationships so you can explore interesting insights and questions about your application environment.

Does this resonate with you? Feel free to share your troubleshooting stories here. We’d love to explore the ways Causely can help you!

Moving Beyond Traditional RCA In DevOps

Reposted with permission from LinkedIn

Modernization Of The RCA Process

Over the past month, I have spent a significant amount of time researching what vendors and customers are doing in the devops space to streamline the process of root cause analysis (RCA).

My conclusion is that the underlying techniques and processes used in operational environments today to perform RCA remain human-centric. As a consequence, troubleshooting remains complex and resource-intensive, and requires skilled practitioners to perform the work.

So, how do we break free from this human bottleneck? Brace yourselves for a glimpse into a future powered by AI. In this article, we’ll dissect the critical issues, showcase how cutting-edge AI advancements can revolutionize RCA, and hear first hand from operations and engineering leaders who have experienced these capabilities and shared their perspectives on this transformative tech.

Troubleshooting In The Cloud Native Era With Monitoring & Observability

Troubleshooting is hard because when degradations or failures occur in components of a business service, they spread like a disease to related service entities which also become degraded or fail.

This problem is amplified in the world of cloud-native applications, where we have decomposed business logic into many separate but interrelated service entities. Today, an organization might have hundreds or thousands of interrelated service entities (microservices, databases, caches, messaging…).

To complicate things even further, change is a constant – code changes, fluctuating demand patterns, and the inherent unpredictability of user behavior. These changes can result in service degradations or failures.

Testing for all possible permutations in this ever-shifting environment is akin to predicting the weather on Jupiter – an impossible feat – amplifying the importance of a fast, effective and consistent root cause analysis process, to maintain the availability, performance and operational resilience of business systems.

While observability tools have made strides in data visualization and correlation, their inherent inability to explain the cause-and-effect relationships behind problems leaves us dependent on human expertise to navigate the vast seas of data to determine the root cause of service degradation and failures.

This dependence becomes particularly challenging due to siloed devops teams, each responsible for supporting individual service entities within the complex web of service entities that make up business services. In this context, individual teams may struggle to pinpoint the source of a service degradation or failure, as the entity they support might be the culprit, or a victim of another service entity’s malfunction.

The availability of knowledge and skills within these teams also fluctuates due to business priorities, vacations, holidays, and even daily working cycles. This resource variability can lead to significant inconsistencies in problem identification and resolution times.

Causal AI To The Rescue: Automating The Root Cause Analysis Process For Cloud Native DevOps

For those who are not aware, Causal AI is a distinct field in Artificial Intelligence. It is already used extensively in many different industries but until recently there has been no application of the technology in the world of devops.

Causely is a new pioneer championing the use of Causal AI in the world of cloud-native applications. The platform embodies an understanding of causality, so that when service entities are degraded or failing and affecting other service entities that make up business services, it can explain the cause and effect by showing the relationship between the problem and the symptoms it causes.

Through this capability, the team with responsibility for the failing or degraded service can be immediately notified and get to work on resolving the problem. Other teams might also be provided with notifications to let them know that their services are affected, along with an explanation for why this occurred. This eliminates the need for complex triage processes that would otherwise involve multiple teams and managers to orchestrate the process.

Understanding the cause-and-effect relationships in software systems serves as an enabler for automated remediation, predictive maintenance, and planning/gaming out operational resilience.

By using software in this way to automate the process of root cause analysis, organizations can reduce the time and effort and increase the consistency in the troubleshooting process, all of which leads to lower operational costs, improved service availability and less business disruption.

Customer Reactions: Unveiling the Transformative Impact of Causal AI for Cloud-Native DevOps

After sharing insights into Causely’s groundbreaking approach to root cause analysis (RCA) with operational and engineering leaders across various organizations, I’ve gathered a collection of anecdotes that highlight the profound impact this technology is poised to have in the world of cloud-native devops.

Streamlined Incident Resolution and Reduced Triage

“By accurately pinpointing the root cause, we can immediately engage the teams directly responsible for the issue, eliminating the need for war rooms and time-consuming triage processes. This ability to swiftly identify the source of problems and involve the appropriate teams will significantly reduce the time to resolution, minimizing downtime and its associated business impacts.”

Automated Remediation: A Path to Efficiency

“Initially, we’d probably implement a ‘fix it’ button that triggers remediation actions manually. However, as we gain confidence in the results, we can gradually automate the remediation process. This phased approach ensures that we can seamlessly integrate Causely into our existing workflows while gradually transitioning towards a more automated and efficient remediation strategy.”

Empowering Lower-Skilled Team Members

“Lower-skilled team members can take on more responsibilities, freeing up our top experts to focus on code development. By automating RCA tasks and providing clear guidance for remediation, Causely will empower less experienced team members to handle a wider range of issues, allowing senior experts to dedicate their time to more strategic initiatives.”

Building Resilience through Reduced Human Dependency

“Causely will enable us to build greater resilience into our service assurance processes by reducing our reliance on human knowledge and intuition. By automating RCA and providing data-driven insights, Causely will help us build a more resilient infrastructure that is less susceptible to human error and fluctuations in expertise.”

Enhanced Support Beyond Office Hours

“We face challenges maintaining consistent support outside of office hours due to reduced on-call expertise. Causely will enable us to handle incidents with the same level of precision and efficiency regardless of the time of day. Causely’s ability to provide automated RCA and remediation even during off-hours ensures that organizations can maintain a high level of service continuity around the clock.”

Automated Runbook Creation and Maintenance

“I was planning to create runbooks to guide other devops team members through troubleshooting processes. Causely can automatically generate and maintain these runbooks for me. This automated runbook generation eliminates the manual effort required to create and maintain comprehensive troubleshooting guides, ensuring that teams have easy access to the necessary information when resolving issues.”

Simplified Post-Incident Analysis

“Post-incident analysis will become much simpler as we’ll have a detailed record of the cause and effect for every incident. Causely’s comprehensive understanding of cause and effect provides a valuable resource for post-incident analysis, enabling us to improve processes, and prevent similar issues from recurring.”

Faster Problem Identification and Reduced Business Impacts

“Problems will be identified much faster, and there will be fewer business consequences. By automating RCA and providing actionable insights, Causely can significantly reduce the time it takes to identify and resolve problems, minimizing their impact on business operations and customer experience.”

These anecdotes underscore the transformative potential of Causely, offering a compelling vision of how root cause analysis is automated, remediation is streamlined, and operational resilience in cloud-native environments is enhanced. As Causely progresses, the company’s impact on the IT industry is poised to be profound and far-reaching.

Summing Things Up

Troubleshooting in cloud-native environments is complex and resource-intensive, but Causal AI can automate the process, streamline remediation, and enhance operational resilience.

If you would like to learn more about how Causal AI might benefit your organization, don’t hesitate to reach out to me or Causely directly.


Related Resources

Root Cause Chronicles: Connection Collapse

The post below is reposted with permission from its original source on the InfraCloud Technologies blog.

This MySQL connection draining issue highlights how complex troubleshooting today’s environments can be, and provides a great illustration of the many rabbit holes SREs find themselves in. It’s critical to understand the ‘WHY’ behind each problem, as it paves the way for faster and more precise resolutions. This is exactly what we at Causely are on a mission to improve using causal AI.


On a usual Friday evening, Robin had just wrapped up their work, wished their colleagues a happy weekend, and turned in for the night. At exactly 3 am, Robin receives a call from the organization’s automated paging system: “High P90 Latency Alert on Shipping Service: 9.28 seconds”.

Robin works as an SRE for Robot-Shop, an e-commerce company that sells various robotics parts and accessories, and this message does not bode well for them tonight. They prepare themselves for a long, arduous night ahead and turn on their work laptop.

Setting the Field

Robot-Shop runs a sufficiently complex cloud native architecture to address the needs of their million-plus customers.

  • Traffic from the load balancer is routed via a gateway service optimized for traffic ingestion, called Web, which distributes the traffic across various other services.
  • User handles user registrations and sessions.
  • Catalogue maintains the inventory in a MongoDB datastore.
  • Customers can see the ratings of available products via the Ratings service APIs.
  • They choose products they like and add them to the Cart, a service backed by Redis cache to temporarily hold the customer’s choices.
  • Once the customer pays via the Payment service, the purchased items are published to a RabbitMQ channel.
  • These are consumed by the Dispatch service and prepared for shipping. Shipping uses MySQL as its datastore, as does Ratings.

(Figure 1: High Level Architecture of Robot-shop Application stack)

Troubles in the Dark

“OK, let’s look at the latency dashboards first.” Robin clicks on the attached Grafana dashboard on the Slack notification for the alert sent by PagerDuty. This opens up the latency graph of the Shipping service.

“How did it go from 1s to ~9.28s within 4-5 minutes? Did traffic spike?” Robin decides to focus on the Gateway ops/sec panel of the dashboard. The number is around ~140 ops/sec. Robin knows this data is coming from their Istio gateway and is reliable. The current number is well within the cluster’s capacity, though there is a steady uptick in the request count for Robot-Shop.

None of the other services show any signs of wear and tear, only Shipping. Robin understands this is a localized incident and decides to look at the shipping logs. The logs are sourced from Loki, and the widget is conveniently placed right beneath the latency panel, showing logs from all services in the selected time window. Nothing in the logs, and no errors regarding connection timeouts or failed transactions. So far the only thing going wrong is the latency, but no requests are failing yet; they are only getting delayed by a very long time. Robin makes a note: We need to adjust frontend timeouts for these APIs. We should have already gotten a barrage of request timeout errors as an added signal.

Did a developer deploy an unapproved change yesterday? Usually, the support team is informed of any urgent hotfixes before the weekend. Robin decides to check the ArgoCD Dashboards for any changes to shipping or any other services. Nothing there either, no new feature releases in the last 2 days.

Did the infrastructure team make any changes to the underlying Kubernetes cluster? Any version upgrades? The Infrastructure team uses Atlantis to gate and deploy the cluster updates via Terraform modules. The last date of change is from the previous week.

With no errors seen in the logs and partial service degradation as the only signal available to them, Robin cannot make any more headway into this problem. Something else may be responsible, could it be an upstream or downstream service that the shipping service depends on? Is it one of the datastores? Robin pulls up the Kiali service graph that uses Istio’s mesh to display the service topology to look at the dependencies.

Robin sees that Shipping has now started throwing its first 5xx errors, and both Shipping and Ratings are talking to something labeled as PassthroughCluster. The support team does not maintain any of these platforms and does not have access to the runtimes or the codebase. “I need to get relevant people involved at this point and escalate to folks in my team with higher access levels,” Robin thinks.

Stakeholders Assemble

It’s already been 5 minutes since the first report, and customers are now being affected.

(Figure 5: Detailed Kubernetes native architecture of Robot-shop)

Robin’s team lead Blake joins in on the call, and they also add the backend engineer who owns Shipping service as an SME. The product manager responsible for Shipping has already received the first complaints from the customer support team who has escalated the incident to them; they see the ongoing call on the #live-incidents channel on Slack, and join in. P90 latency alerts are now clogging the production alert channel as the metric has risen to ~4.39 minutes, and 30% of the requests are receiving 5xx responses.

The team now has multiple signals converging on the problem. Blake digs through shipping logs again and sees errors around MySQL connections. At this time, the Ratings service also starts throwing 5xx errors – the problem is now getting compounded.

The Product Manager (PM) says their customer support team is reporting frustration from more and more users who are unable to see the shipping status of the orders they have already paid for and who are supposed to get the deliveries that day. Users who just logged in are unable to see product ratings and are refreshing the pages multiple times to see if the information they want is available.

“If customers can’t make purchase decisions quickly, they’ll go to our competitors,” the PM informs the team.

Blake looks at the PassthroughCluster node on Kiali, and it hits them: It’s the RDS instance. The platform team had forgotten to add RDS as an External Service in their Istio configuration. It was an honest oversight that could cost Robot-Shop significant revenue today.

“I think MySQL is unable to handle new connections for some reason,” Blake says. They pull up the MySQL metrics dashboards and look at the number of Database Connections. It has gone up significantly and then flattened. “Why don’t we have an alert threshold here? It seems like we might have maxed out the MySQL connection pool!”

To verify their hypothesis, Blake looks at the Parameter Group for the RDS Instance. It uses the default-mysql-5.7 Parameter group, and max_connections is set to:

{DBInstanceClassMemory/12582880}
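To make the formula concrete: it divides the instance class memory in bytes by 12582880, i.e., it budgets roughly 12 MiB per connection. A quick sketch of the arithmetic, assuming a hypothetical instance class with 8 GiB of memory:

```python
# DBInstanceClassMemory is the instance's memory in bytes.
# 8 GiB is a hypothetical value for illustration; the real number
# depends on the RDS instance class in use.
instance_memory_bytes = 8 * 1024**3   # 8589934592 bytes
max_connections = instance_memory_bytes // 12582880
print(max_connections)                 # 682
```

So a modest instance class tops out in the high hundreds of connections, which is easy to exhaust with a leaking connection pool.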

But what does that number really mean? Blake decides not to waste time checking the RDS instance type and computing the number. Instead, they log into the RDS instance with the mysql CLI and run:

mysql> SHOW VARIABLES LIKE 'max_connections';

Then Blake runs:

mysql> SHOW PROCESSLIST;

“I need to know exactly how many,” Blake thinks, and runs:

mysql> SELECT COUNT(host) FROM information_schema.processlist;

It’s more than max_connections. Their hypothesis is now validated: Blake sees that many connections have been in Sleep state for more than ~1000 seconds, and all of them were created by the shipping user.

(Figure 13: Affected Subsystems of Robot-shop)

“I think we have it,” Blake says, “Shipping is not properly handling connection timeouts with the DB; it’s not refreshing its unused connection pool.” The backend engineer pulls up the Java JDBC datasource code for shipping and says that it’s using defaults for max-idle, max-wait, and various other Spring datasource configurations. “These need to be fixed,” they say.

“That would need significant time,” the PM responds, “and we need to mitigate this incident ASAP. We cannot have unhappy customers.”

Blake knows that RDS has a stored procedure to kill idle/bad processes.

mysql> CALL mysql.rds_kill(processID);

Blake tests this out and asks Robin to quickly write a bash script to kill all idle processes.

#!/bin/bash

# MySQL connection details
MYSQL_USER="<user>"
MYSQL_PASSWORD="<passwd>"
MYSQL_HOST="<rds-name>.<id>.<region>.rds.amazonaws.com"

# Get process list IDs
# Get the IDs of idle (sleeping) connections held by the shipping user
PROCESS_IDS=$(MYSQL_PWD="$MYSQL_PASSWORD" mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -N -s -e "SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE USER='shipping' AND COMMAND='Sleep'")

for ID in $PROCESS_IDS; do
MYSQL_PWD="$MYSQL_PASSWORD" mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -e "CALL mysql.rds_kill($ID)"
echo "Terminated idle connection with ID $ID for user 'shipping'"
done

The team runs this immediately and the connection pool frees up, for now. Everyone lets out a visible sigh of relief. “But this won’t hold for long; we need a hotfix on DataSource handling in Shipping,” Blake says. The backend engineer says they are on it, and soon they have a patch that sets better defaults for

spring.datasource.max-active
spring.datasource.max-age
spring.datasource.max-idle
spring.datasource.max-lifetime
spring.datasource.max-open-prepared-statements
spring.datasource.max-wait
spring.datasource.maximum-pool-size
spring.datasource.min-evictable-idle-time-millis
spring.datasource.min-idle
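A hotfix along these lines might look like the following application.properties fragment. The values are illustrative only, and the exact property names depend on the Spring Boot version and connection pool implementation in use:

```properties
# Illustrative pool limits; tune against the database's max_connections
spring.datasource.max-active=50
spring.datasource.max-idle=10
spring.datasource.min-idle=5
spring.datasource.max-wait=10000
spring.datasource.max-lifetime=1800000
spring.datasource.min-evictable-idle-time-millis=60000
```

The key idea is to cap the pool well below the database’s connection limit and evict idle connections long before they can accumulate.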

The team approves the hotfix and deploys it, finally mitigating a ~30-minute incident.

Key Takeaways

Incidents like this can occur in any organization with a sufficiently complex architecture involving microservices written in different languages and frameworks, datastores, queues, caches, and cloud native components. A lack of end-to-end architectural understanding, combined with information silos, only lengthens mitigation timelines.

During the RCA, the team identifies several areas for improvement:

  • Frontend code had long timeouts and allowed for large latencies in API responses.
  • The L1 Engineer did not have an end-to-end understanding of the whole architecture.
  • The service mesh dashboard on Kiali did not show External Services correctly, causing confusion.
  • RDS MySQL database metrics dashboards did not send an early alert, as no max_connection (alert) or high_number_of_connections (warning) thresholds were set.
  • The database connection code was written with the assumption that sane defaults for connection pool parameters were good enough, which proved incorrect.

The pressure to resolve incidents quickly, which often comes from peers, leadership, and members of affected teams, only adds to the chaos of incident management and causes more human errors. Coordinating incidents like this through a dedicated Incident Commander role has produced more controllable outcomes for organizations around the world. An Incident Commander assumes responsibility for managing resources, planning, and communications during a live incident, effectively reducing conflict and noise.

When multiple stakeholders are affected by an incident, resolutions need to be handled in order of business priority: work on immediate mitigations first, then get the customer experience back to nominal levels, and only afterward focus on long-term prevention. Coordinating these priorities across stakeholders is one of the most important functions of an Incident Commander.

Troubleshooting complex architectures remains challenging. However, with the Blameless RCA Framework coupled with periodic metric reviews, a team can focus on incremental but constant improvements to its system observability. The team can also convert successful resolutions into playbooks for L1 SREs and support teams, ensuring that similar errors are handled well in the future.

A concerted effort around a clear feedback loop of Incident -> Resolution -> RCA -> Playbook Creation eventually rids the system of most unknown-unknowns, allowing teams to focus on product development instead of chaotic incident handling.

 

That’s a Wrap

Hope you all enjoyed that story about a hypothetical but complex troubleshooting scenario. We see incidents like this, and more, across the various clients we work with at InfraCloud. The scenario above can be reproduced using our open source repository, and we are working on adding more such reproducible production outages and their mitigations to it.

We would love to hear from you about your own 3 am incidents. If you have any questions, you can connect with me on Twitter and LinkedIn.

 

Related Resources

Understanding failure scenarios when architecting cloud-native applications

Developing and architecting complex, large cloud-native applications is hard. In this short demo, we’ll show how Causely helps to understand failure scenarios before something actually fails in the environment.

In the demo environment we have a dozen applications with database servers and caches running in a cluster, providing multiple services. If we drill into these services and focus on the application, we can only see how the application is behaving right now. But Causely automatically identifies the potential root causes, the alerts they would raise, and the services that would be impacted by failures.

For example, a congested service would cause high latency across a number of different downstream dependencies. A malfunction of this service would make services unavailable and cause high error rates on the dependent services.

Causely is able to reason about the specific dependencies and all the possible root causes – not just for services, but for the applications – in terms of: what would happen if their database query takes too long, if their garbage collection time takes too long, if their transaction latency is high? What services would be impacted, and what alerts would it receive?

This allows developers to design a more resilient system, and operators can understand how to run the environment with their actual dependencies.

We’re hoping that Causely can help application owners avoid production failures and service impact by architecting applications to be resilient in the first place.

What do you think? Share your comments on this use case below.

Troubleshooting cloud-native applications with Causely

Running large, complex, distributed cloud-native applications is hard. This short demo shows how Causely can help.

In this environment, we are running a number of applications with database servers and caches in a cluster, across multiple services, pods, and containers. At any point in time, we could be getting multiple alerts showing high latency, high CPU utilization, high garbage collection time, or high memory utilization across multiple microservices. Troubleshooting the root cause of each of these alerts is really difficult.

Causely automatically identifies the root cause and shows how the one service that is actually congested is causing all of these downstream alerts on its dependent services. Instead of individual teams troubleshooting their respective alerts, the team responsible for the product catalog service can focus on remediating and restoring it, while all of the other impacted services are shown so those teams are aware that their problems are caused by congestion in this one service. This can significantly reduce the time to detect, remediate, and restore a service.

What do you think? Share your comments on this use case below.

Unveiling the Causal Revolution in Observability

Reposted with permission from LinkedIn.

OpenTelemetry and the Path to Understanding Complex Systems

Decades ago, the IETF (Internet Engineering Task Force) developed an innovative protocol, SNMP, revolutionizing network management. This standardization spurred a surge of innovation, fostering a new software vendor landscape dedicated to streamlining operational processes in network management, encompassing Fault, Configuration, Accounting, Performance, and Security (FCAPS). Today, SNMP reigns as the world’s most widely adopted network management protocol.

On the cusp of a similar revolution stands the realm of application management. For years, the absence of management standards compelled vendors to develop proprietary telemetry for application instrumentation to enable manageability. Many of these vendors also built applications to report on and visualize managed environments, in an attempt to streamline the processes of incident and performance management.

OpenTelemetry’s emergence is poised to transform the application management market dynamics in a similar way by commoditizing application telemetry instrumentation, collection, and export methods. Consequently, numerous open-source projects and new companies are emerging, building applications that add value around OpenTelemetry.

This evolution is also compelling established vendors to embrace OpenTelemetry. Their futures hinge on their ability to add value around this technology, rather than solely providing innovative methods for application instrumentation.

Adding Value Above OpenTelemetry

While OpenTelemetry simplifies the process of collecting and exporting telemetry data, it doesn’t guarantee the ability to pinpoint the root cause of issues. This is because understanding the causal relationships between events and metrics requires more sophisticated analysis techniques.

Common approaches to analyzing OpenTelemetry data that get devops teams closer to this goal include:

  • Visualization and Dashboards: Creating effective visualizations and dashboards is crucial for extracting insights from telemetry data. These visualizations should present data in a clear and concise manner, highlighting trends, anomalies, and relationships between metrics.
  • Correlation and Aggregation: To correlate logs, metrics, and traces, you need to establish relationships between these data streams. This can be done using techniques like correlation IDs or trace identifiers, which can be embedded in logs and metrics to link them to their corresponding traces.
  • Pattern Recognition and Anomaly Detection: Once you have correlated data, you can apply pattern recognition algorithms to identify anomalies or outliers in metrics, which could indicate potential issues. Anomaly detection tools can also help identify sudden spikes or drops in metrics that might indicate performance bottlenecks or errors.
  • Machine Learning and AI: Machine learning and AI techniques can be employed to analyze telemetry data and identify patterns, correlations, and anomalies that might be difficult to detect manually. These techniques can also be used to predict future performance or identify potential issues before they occur.
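As a minimal sketch of the correlation-ID technique above (names are hypothetical; in a real deployment the ID would typically be an OpenTelemetry trace identifier propagated across services):

```python
import logging
import uuid

# Attach one correlation ID to every log record emitted while handling a
# request, so logs can later be joined with metrics and traces that carry
# the same ID.
class CorrelationIdFilter(logging.Filter):
    def __init__(self, correlation_id: str):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = self.correlation_id
        return True  # never drop records, only annotate them

def make_logger(correlation_id: str) -> logging.Logger:
    logger = logging.getLogger(f"request.{correlation_id}")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.addFilter(CorrelationIdFilter(correlation_id))
    logger.setLevel(logging.INFO)
    return logger

# One ID is generated at the edge, then attached to every record
request_id = uuid.uuid4().hex
log = make_logger(request_id)
log.info("order lookup started")
```

Because every record now carries the same ID as the corresponding trace, a log search for one slow request can be pivoted directly to its spans and metrics.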

While all of these techniques might help to increase the efficiency of the troubleshooting process, human expertise is still essential for interpreting and understanding the results. This is because these approaches to analyzing telemetry data are based on correlation and lack an inherent understanding of cause and effect (causation).

Avoiding The Correlation Trap: Separating Coincidence from Cause and Effect

In the realm of analyzing observability data, correlation often takes center stage, highlighting the apparent relationship between two or more variables. However, correlation does not imply causation, a crucial distinction that software-driven causal analysis effectively addresses, producing better outcomes in the following ways:

Operational Efficiency And Control: Correlation-based approaches often leave us grappling with the question of “why,” hindering our ability to pinpoint the root cause of issues. This can lead to inefficient troubleshooting efforts, involving multiple teams in a devops environment as they attempt to unravel the interconnectedness of service entities.

Software-based causal analysis empowers us to bypass this guessing game, directly identifying the root cause and enabling targeted corrective actions. This not only streamlines problem resolution but also empowers teams to proactively implement automations to mitigate future occurrences. It also frees up the time of experts in the devops organizations to focus on shipping features and working on business logic.

Consistency In Responding To Adverse Events: The speed and effectiveness of problem resolution often hinge on the expertise and availability of individuals, a variable factor that can delay critical interventions. Software-based causal analysis removes this human dependency, providing a consistent and standardized approach to root cause identification.

This consistency is particularly crucial in distributed devops environments, where multiple teams manage different components of the system. By leveraging software, organizations can ensure that regardless of the individuals involved, problems are tackled with the same level of precision and efficiency.

Predictive Capabilities And Risk Mitigation: Correlations provide limited insights into future behavior, making it challenging to anticipate and prevent potential problems. Software-based causal analysis, on the other hand, unlocks the power of predictive modeling, enabling organizations to proactively identify and address potential issues before they materialize.

This predictive capability becomes increasingly valuable in complex cloud-native environments, where the interplay of numerous microservices and data pipelines can lead to unforeseen disruptions. By understanding cause and effect relationships, organizations can proactively mitigate risks and enhance operational resilience.

Conclusion

OpenTelemetry marks a significant step towards standardized application management, laying a solid foundation for a more comprehensive understanding of complex systems. However, to truly unlock its full potential, the integration of software-driven causal analysis, also referred to as Causal AI, is essential. By transcending correlation, software-driven causal analysis empowers devops organizations to understand the cause and effect of system behavior, enabling proactive problem detection, predictive maintenance, operational risk mitigation, and automated remediation.

The founding team of Causely participated in the standards-driven transformation that took place in the network management market more than two decades ago at a company called SMARTS. The core of their solution was built on Causal AI. SMARTS became the market leader of Root Cause Analysis in networks and was acquired by EMC in 2005. The team’s rich experience in Causal AI is now being applied at Causely, to address the challenges of managing cloud native applications.

Causely’s embrace of OpenTelemetry stems from the recognition that this standardized approach will only accelerate the advancement of application management. By streamlining telemetry data collection, OpenTelemetry creates a fertile ground for Causal AI to flourish.

If you are intrigued and would like to learn more about Causal AI the team at Causely would love to hear from you, so don’t hesitate to get in touch.

All Sides of the Table

Reflecting on the boardroom dynamics that truly matter

This past month has been an eventful one. Like everyone in the tech world, I’m riveted by the drama unfolding at OpenAI, wondering how the board and CEO created such an extreme situation. I’ve been thinking a lot about board dynamics – and how different things look as a founder/CEO vs board member, especially at very different stages of company growth.

Closer to home and far less dramatic, last week we had our quarterly Causely team meetup in NYC, including our first in-person board meeting with 645 Ventures, Amity Ventures and Cervin Ventures. As a remote company, it was great to actually sit together in one room and discuss company and product strategy, including a demo of our latest iteration of the Causely product. Getting aligned and hearing the board’s input was truly helpful as we plan for 2024.

Also in the past month, my board experiences at Corvus Insurance and Chase Corporation came to an end. Corvus (a late-stage insurtech) announced it’s being acquired by Travelers Insurance for $435M, and Chase (a 75-year old manufacturing company) closed its acquisition by KKR for $1.3B. These exits were the culmination of years of work by the management teams (and support of their boards) to create significant shareholder value.

Each of these experiences has shown me models of board interaction and highlighted how critical it is for board members to build trust with the CEO and each other. I thought I’d share some thoughts on the most valuable traits or contributions a board member can offer, and what executives should look for in board members that will make a meaningful impact on the business, depending on the stage and size of the company.

From the startup CEO/founder view

As a founder who’s built and managed my own boards for the past 15 years, I’ve learned a lot about what kinds of board members are most impactful and productive for early-stage life. Here are a few examples of what these board members do: 

  • They get hands-on to provide real value through introductions to design partners, customers and potential key hires. 
  • They ask the operational questions relevant for the company’s current stage – for example, is this product decision (or hire) really the one we need to make now, while we are trying to validate product market fit, or can it be deferred? 
  • They hold the CEO accountable and keep board discussions focused, by asking questions like, “Ellen, are we actually talking about the topic that keeps you up at night?!”
  • They don’t project concerns from other companies or past experiences that might be irrelevant.
  • They stay calm through ends-of-quarters and acquisition processes, and balance the needs of investors and common shareholders.

From the board view

As an independent board member, I now appreciate these board members even more. It can be hard to step back from the operational role (“What do I need to do next?”) and provide guidance and support, sometimes just by asking the right question at the right moment in a board meeting. I find it very helpful to check in with the CEO and other board members before any official meeting, so I understand where the “hot” issues are and what decisions need to be made as a group. 

In a public company, the challenge is even greater. The independent board member must maintain this same operator/advisor perspective, but also weigh decisions as they relate to corporate governance and enterprise risk management across a wide range of products, markets and countries. For example, how fast can management drive product innovation which may cause new cyber risk or data management concerns? And unlike in private and early-stage companies, which tend to focus almost entirely on top-line growth, what is the right balance of growth vs profitability for the more mature public company?

Building trust is key

As the recent chaos at OpenAI shows (albeit in an extreme way), strong board relationships and ongoing communications between the board and management are critical. 

If you’re building a company and/or adding new board seats, think about what a new board member should bring to the table that will help you reach the next phase of growth and major milestones — and stay laser-focused on finding someone that meets your criteria. 

If you’re considering serving on the board of a company, think about what kinds of companies you’re best suited to help, and find one where you can work closely with the CEO and where existing board members will complement your skill set and experience.  

Regardless of which side of the table you’re on, take the time to build strong relationships and trust. Lead directors, who have taken a more central role in the past several years, can ensure that communications don’t break down. But even in earlier stage companies, it’s the job of everyone around the table to make sure there’s clarity on the key strategic issues the company is facing, and to provide the support that the CEO and management need to make the best decisions for the business.


Related reading

Why do this startup thing all over again? Our reasons for creating Causely

Why be a serial entrepreneur?

It’s a question that my co-founder, Shmuel, and I are asked many times. Both of us have been to this rodeo twice before – Shmuel, with SMARTS and Turbonomic, myself with ClearSky Data and CloudSwitch. There are all the usual reasons, including love of hard challenges, creation of game-changing products and working with teams and customers who inspire you. And of course there’s more than a small share of insanity (or as one of our founding engineers, Endre Sara, might call it, an addiction to Sisyphean tasks?) 

I’ve been pondering this as we build our new venture, Causely. The motivation behind Causely was a long-standing goal of Shmuel’s to tackle a problem he’s addressed in both previous companies, but still feels is unresolved: How to remove the burden of human troubleshooting from the IT industry? (Shmuel is not interested in solving small problems.) 

 Although there are tools galore and so much data being gathered that it takes data lakes to manage it all, at the heart of the IT industry there is a central problem that hasn’t fundamentally changed in decades. When humans try to determine the root cause of things that make applications slow down or break, they rely on the (often specific) expertise of people on their teams. And the scale, complexity and rate of change of their application environments will always grow faster than humans can keep up.   

I saw this during my time at AWS, while I was running some global cloud storage services. No matter how incredible the people were, how well-architected the services, or how robust the mechanisms and tools, when things went wrong (usually at 3 am) it always came down to having the right people online at that moment to figure out what was really happening. Much of the human effort went into stabilizing things ASAP for customers and then digging in for days (or longer) to understand what had happened.

When Peter Bell, our founding investor at Amity Ventures, originally introduced me and Shmuel, it was clear that Shmuel had the answer to this never-ending cycle of applications changing, scaling, breaking, requiring human troubleshooting, writing post-mortems… and starting all over again. He was thinking about the problem from the perspective of what’s actually missing: the ability to capture causality in software. AI and ML technologies are advancing beyond our wildest dreams, but they are still challenged by the inability to automate causation (vs correlation, which they do quite well). By building a causal AI platform that can tackle this huge challenge, we believe Causely can eliminate the need for humans to keep pushing the same rocks up the same hills. 

So why do this startup thing all over again?

Because for each new venture there’s always been a big, messy problem that needs to be fixed. One that requires a solution that’s never been done before, that will be exciting to build with smart, creative people. 

And so today, we announce funding for Causely and the opening of our Early Access program for initial users. We’ve been quietly designing and building for the past year, working with some awesome design partners. We’re thrilled to have 645 Ventures join us on this journey and are already seeing the impact of support from Aaron, Vardan and the team. We also welcome new investors Glasswing Ventures and Tau Ventures. We hope early users will love the Causely premise and early version of the product and give us the input we need to build something that truly changes how applications are built and operated.

Please take a look and let us know your thoughts. 

Learn more

 

Causely raises $8.8M in Seed funding to deliver IT industry’s first causal AI platform

Automation of causality will eliminate human troubleshooting and enable faster, more resilient cloud application management

Boston, June 29, 2023 – Causely, Inc., the causal AI company, today announced it has raised $8.8M in Seed funding, led by 645 Ventures with participation from founding investor Amity Ventures, and including new investors Glasswing Ventures and Tau Ventures. The funding will enable Causely to build its causal AI platform for IT and launch an initial service for applications running in Kubernetes environments. This financing brings the company’s total funding to over $11M since it was founded in 2022.

For years, the IT industry has struggled to make sense of the overwhelming amounts of data coming from dozens of observability platforms and monitoring tools. In a dynamic world of cloud and edge computing, with constantly increasing application complexity and scale, these systems gather metrics and logs about every aspect of application and IT environments. In the end, all this data still requires human troubleshooting to respond to alerts, make sense of patterns, identify root cause, and ultimately determine the best action for remediation. This process, which has not changed fundamentally in decades, is slow, reactive, costly and labor-intensive. As a result, many problems can cause end-user and business impact, especially in situations where complex problems propagate across multiple layers and components of an application.  

Causely’s breakthrough approach is to remove the need for human intervention from the entire process by capturing causality in software. By enabling end-to-end automation, from detection through remediation, Causely closes the gap between observability and action, speeding time to remediation and limiting business impact. Unlike existing solutions, Causely’s core technology goes beyond correlation and anomaly detection to identify root cause in dynamic systems, making it possible to see causality relationships across any IT environment in real time.

The founding team, led by veterans Ellen Rubin (founder of ClearSky Data, acquired by Amazon Web Services, and CloudSwitch, acquired by Verizon) and Shmuel Kliger (founder of Turbonomic, acquired by IBM, and SMARTS, acquired by EMC), brings together world-class expertise from the IT Ops, cloud-native and Kubernetes communities and decades of experience successfully building and scaling companies. 

“In a world where developers and operators are overwhelmed by data, alert storms and incidents, the current solutions can’t keep up,” said Ellen Rubin, Causely CEO and Founder. “Causely’s vision is to enable self-managed, resilient applications and eliminate the need for human troubleshooting. We are excited to bring this vision to life, with the support and partnership of 645 Ventures, Amity Ventures and our other investors, working closely with our early design partners.”

“Causality is the missing link in automating IT operations,” said Aaron Holiday, Managing Partner at 645 Ventures. “The Causely team is uniquely able to address this difficult and long-standing challenge and we are proud to be part of the next phase of the company’s growth.”

“Having worked with the founding team for many years, I’m excited to see them tackle an unsolved industry problem and automate the entire process from observability through remediation,” said Peter Bell, Chairman at Amity Ventures.

The initial Causely service, for DevOps and SRE users who are building and supporting apps in Kubernetes, is now available in an Early Access program. To learn more, please visit www.causely.io/early-access

About Causely

Causely, the causal AI company, automates the end-to-end detection, prevention and remediation of critical defects that can cause user and business impact in application environments. Led by veterans from Turbonomic, AWS, SMARTS and other cloud infrastructure companies, Causely is backed by Amity Ventures and 645 Ventures. To learn more, please visit www.causely.io

About 645 Ventures 

645 Ventures is an early-stage venture capital firm that partners with exceptional founders who are building iconic companies. We invest at the Seed and Series A stages and leverage our Voyager software platform to enable our Success team and Connected Network to help founders scale to the growth stage. Select companies that have reached the growth stage include Iterable, Goldbelly, Resident, Eden Health, FiscalNote, Lunchbox, and Squire. 645 has $550m+ in AUM across 5 funds, and is growing fast with backing from leading institutional investors, including university endowments, funds of funds, and pension funds. The firm has offices in New York and SF, and you can learn more at www.645ventures.com. 

About Amity Ventures 

Amity Ventures is a venture capital firm based in San Francisco, CA. We are a closely knit team with 40+ years of collective experience partnering deeply with entrepreneurs building category-defining technology businesses at the early stage. Amity intentionally invests in a small number of startups per year, and currently has a portfolio of about 25 companies across multiple funds. 

Contact

Kelsey Cullen, KCPR
kelsey@kcpr.com
650.438.1063