Author: Karina Babcock

Mission Impossible? Cracking the Code of Complex Tracing Data

In this video, we’ll show how Causely leverages OpenTelemetry. (For more on how and why we use OpenTelemetry in our causal AI platform, read the blog from Endre Sara.)

 

 

Distributed tracing gives you a bird’s eye view of transactions across your microservices. Far beyond what logs and metrics can offer, it helps you trace the path of a request across service boundaries. Setting up distributed tracing has never been easier. In addition to OpenTelemetry and existing tracing tools such as Tempo and Jaeger, open source tools like Grafana Beyla and Keyval Odigos let you enable distributed tracing in your system without changing a single line of code.

These tools allow the instrumented applications to start sending traces immediately. But with potentially hundreds of spans in each trace and millions of traces generated per minute, you can easily become overwhelmed. Even with a bird’s eye view, you might feel like you’re flying blind.

That’s where Causely comes in. Causely efficiently consumes and analyzes tracing data, automatically constructs cause-and-effect relationships, and pinpoints root causes.

Interested in seeing how Causely makes it faster and easier to use tracing data in your environment so you can understand the root cause of challenging problems?

Comment here or contact us. We hope to hear from you!


Related resources

Don’t Forget These 3 Things When Starting a Cloud Venture

I’ve been in the cloud since 2008. These were the early days of enterprise cloud adoption and the very beginning of hybrid cloud, which to this day remains the dominant form of enterprise cloud usage. Startups that deliver breakthrough infrastructure technology to enterprise customers have their own dynamic (which differs from consumer-focused companies), and although the plots may differ, the basic storylines stay the same. I’ve learned from several startup experiences (and a fair share of battle scars) that there are several things you should start planning for right from the beginning of a new venture.

These aren’t the critical ones you’re probably already doing, such as selecting a good initial investor, building a strong early team and culture, speaking with as many early customers as possible to find product-market fit (PMF), etc.

Instead, here are three of the less-known but equally important things you should prioritize in your first years of building a company.

1. Build your early customer advisor relationships

As always, begin with the customer. Of course you will cast a wide net in early customer validation and market research efforts to refine your ideal customer profile (ICP) and customer persona(s). But as you’re iterating on this, you also want to build stronger relationships with a small group of customers who provide more intensive and strategic input. Often these early customers become part of a more formal customer advisory board, but even before this, you want to have an ongoing dialog with them 1:1 as you’re thinking through your product strategy.

These customers are strategic thinkers who love new technologies and startups. They may be senior execs but not necessarily – they are typically the people who think about how to bring innovation into their complex enterprise organizations and enjoy acting as “sherpas” to guide entrepreneurs on making their products and companies more valuable. I can remember the names of all the customer advisors from my previous companies and they have had an enormous impact on how things progressed. So beyond getting initial input and then focusing only on sales opportunities, build advisor relationships that are based on sharing vision and strategy, with open dialogue about what’s working/not working and where you want to take the company. Take the time to develop these relationships and share more deeply with these advisors.

2. Begin your patent work

Many B2B and especially infrastructure-oriented startups have meaningful IP that is the foundation of their companies and products. Patenting this IP requires the founding engineering team to spend significant time documenting the core innovation and its uniqueness, and the legal fees can run to thousands of dollars. As a result, there’s often a desire to hold off filing patents until the company is farther along and has raised more capital. Also, the core innovation may change as the company gets further into the market and founders realize more precisely what the truly valuable technology includes – or even that the original claims are not aligned with PMF.

However, it’s important to begin documenting your thinking early, as the company and development process are unfolding, so you have a written record and are prepared for the legal work ahead. In the end, the real value patents provide for startups is less about protecting your IP from larger players — who will always have more lawyers and money than you do, and will be willing to take legal action to protect their own patent portfolios or keep your technology off the market — and more about capturing the value of your innovation for future funding and M&A scenarios, and showing larger players that you’ve taken steps to protect your IP and innovation in a buy-out. In the US, a one-year clock for filing the initial patent (frequently with a provisional patent filed first to preserve an early filing date) begins ticking upon external disclosure, such as when you launch your product, and you don’t want to address these issues in a rush. It’s important to get legal advice early on about whether you have something patentable, how much work it will be to write up the patent application, who will be named as inventors, and whether you want to file in more than one country.

3. Start your SOC2 process as you’re building MVP

Yes, SOC2. Not very sexy, but over time it’s become absolute table stakes for enabling early-adopter customers to test your initial product, even in a dev/test environment. In the past, it was possible to find a champion who loved new technology to “unofficially” test your early product, since they were respected and trusted by their organization to bring in great new stuff. But as cloud and SaaS have matured and become so widespread at enterprise customers, the companies that provide these services – even to other startups – are requiring ALL their potential vendors to keep them aligned with their own SOC2 requirements. It’s like a ladder of compliance, built on each vendor supporting the vendors and customers above them. There is typically a new vendor onboarding process and security review, even for testing out a new product, and these are more consistently enforced than in the past.

As a result, it has become more urgent to start your SOC2 process right away, so you can say you’re underway even as you’re building your minimum viable product (MVP) and development processes. It’s much easier now to automate and track SOC2 processes and prepare for the audit (this is my third time doing this, and it is far less manual than in the past). But if you launch your product and then go back later to set up security/compliance policies and processes, it will be much harder and more complicated, and you’ll be under much more pressure from your sales team to check this box.

There’s no question that other must-dos should be added to the above list (and it would be great to hear from other founders on this). With so many things to consider, it’s hard to prioritize in the early days, and sometimes the less obvious things get pushed off until “later”. But as I build my latest cloud venture, Causely, I’ve kept these three priorities top of mind since I know how important they are to laying a strong foundation for growth.


Related resources

Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry & Causal AI

Original photo by MART PRODUCTION

Implementing OpenTelemetry at the core of our observability strategy for Causely’s SaaS product was a natural decision. In this article I would like to share some background on our rationale and how the combination of OpenTelemetry and Causal AI addresses several critical requirements that enable us to scale our services more efficiently.

Avoiding Pitfalls Based on Our Prior Experience

We know from decades of experience working in and with operations teams in the most challenging environments that bridging the gap between the vast ocean of observability data and actionable insights has been, and continues to be, a major pain point. This is especially true in the complex world of cloud-native applications.

Missing application insights

Application observability remains an elusive beast for many, especially in complex microservices architectures. While infrastructure monitoring has become readily available, neglecting application data paints an incomplete picture, hindering effective troubleshooting and operations.

Siloed solutions

Traditional observability solutions have relied on siloed, proprietary agents and data sources, leading to fragmented visibility across teams and technologies. This makes it difficult to understand the complete picture of service composition and dependencies.

To me this is like trying to solve a puzzle with missing pieces – that’s essentially the problem many DevOps teams face today – piecing together a picture of how microservices, serverless functions, databases, and other elements interact with one another, and with the underlying infrastructure and cloud services they run on. This hinders collaboration and troubleshooting efforts, making it challenging to pinpoint the root cause of performance issues or outages.

Vendor lock-in

Many vendors’ products also lock customers’ data into their cloud services. This can result in customers paying through the nose, because licensing costs are predicated on the volume of data that is being collected and stored in the service providers’ backend SaaS services. It can also be very hard to exit these services once locked in.

These are all pitfalls we want to avoid at Causely as we build out our Causal AI services.

The Pillars of Our Observability Architecture Pointed Us to OpenTelemetry

OpenTelemetry provides us with a path to break free from these limitations, establishing a common framework that transcends the programming languages and platforms we use to build our services and satisfying the requirements laid out in the pillars of our observability architecture:

Precise instrumentation

OpenTelemetry offers automatic instrumentation options that minimize the amount of work we need to do on manual code modifications and streamline the integration of our internal observability capabilities into our chosen backend applications.

Unified picture

By providing a standardized data model powered by semantic conventions, OpenTelemetry enables us to paint an end-to-end picture of how all of our services are composed, including application and infrastructure dependencies. We can also gain access to critical telemetry information, leveraging this semantically consistent data across multiple backend microservices even when they are written in different languages.
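To make this concrete, here is a minimal sketch using the OpenTelemetry Python SDK (the service and attribute values are illustrative, not taken from our actual stack) of how semantic conventions keep span data consistent no matter which service emits it:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Resource attributes use standard semantic-convention keys, so any backend can
# identify the service the same way, regardless of the language it is written in.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout", "service.namespace": "demo-shop"})
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("orders-lookup") as span:
    # Span attributes also follow semantic conventions (db.system, db.statement, ...),
    # which is what lets tools stitch application and infrastructure dependencies together.
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM orders WHERE id = ?")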

Vendor-neutral data management

OpenTelemetry enables us to avoid locking our application data into 3rd party vendors’ services by decoupling it from proprietary vendor formats. This gives us the freedom to choose the best tools on an ongoing basis based on the value they provide, and if something new comes along that we want to exploit, we can easily plug it into our architecture.

Resource-optimized observability

OpenTelemetry enables us to take a top-down approach to data collection, starting with the problems we are looking to solve and eliminating unnecessary information. This minimizes our storage costs and optimizes the compute resources we need to support our observability pipeline.
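As one small illustration of this top-down mindset, a head-based sampler keeps only a fraction of traces at the source. This is a minimal sketch with the OpenTelemetry Python SDK; the 10% ratio is an arbitrary example rather than our actual configuration:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample roughly 10% of new traces; child spans follow their parent's decision,
# so sampled traces stay complete while overall volume (and storage cost) drops.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))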

We believe that following these pillars and building our Causal AI platform on top of OpenTelemetry will propel our product’s performance, enable rock-solid reliability, and ensure consistent service experiences for our customers as we scale our business. We will also minimize our ongoing operational costs, creating a win-win for us and our customers.

OpenTelemetry + Causal AI: Scaling for Performance and Cost Efficiency

Ultimately, observability aims to illuminate the behavior of distributed systems, enabling proactive maintenance and swift troubleshooting. Yet isolated failures manifest as cascading symptoms across interconnected services.

While OpenTelemetry enables back-end applications to use this data to provide a unified picture in maps, graphs and dashboards, the job of figuring out cause and effect in the correlated data still requires highly skilled people. This process can also be very time consuming, tying up personnel across multiple teams that own different elements of the overall service.

There is a lot of noise in the industry right now about how AI and LLMs are going to magically come to the rescue, but reality paints a different picture. All of the solutions available in the market today focus on correlating data versus uncovering a direct understanding of causal relationships between problems and the symptoms they cause, leaving devops teams with noise, not answers.

Traditional AI and LLMs also require massive amounts of data as input for training and learning behaviors on a continuous basis. This is data that ultimately ends up being transferred and stored in some form of SaaS. Processing these large datasets is very computationally intensive. This all translates into significant cost overheads for the SaaS providers as customer datasets grow over time – costs that ultimately result in ever-increasing bills for customers.

By contrast, this is where Causal AI comes into its own, taking a fundamentally different approach. Causal AI provides operations and engineering teams with an understanding of the “why”, which is crucial for effective and timely troubleshooting and decision-making.

Example causality chain: Database Connection Noisy Neighbor causing service and infrastructure symptoms

Causal AI uses predefined models of how problems behave and propagate. When combined with real-time information about a system’s specific structure, Causal AI computes a map linking all potential problems to their observable symptoms.

This map acts as a reference guide, eliminating the need to analyze massive datasets every time Causal AI encounters an issue. Think of it as checking a dictionary instead of reading an entire encyclopedia.
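As a toy illustration of the idea (in Python, with entirely made-up problem and symptom names – this is not Causely’s actual algorithm), the problem-to-symptom map is computed ahead of time from the causal model and the system’s topology, and at runtime the observed symptoms are simply matched against it:

# Precomputed "codebook": each hypothetical root cause maps to the set of
# symptoms it would produce, given the current service topology.
CAUSAL_MAP = {
    "db_connection_pool_exhausted": {"shipping_high_latency", "shipping_5xx", "ratings_5xx"},
    "cache_node_down": {"cart_high_latency", "cart_errors"},
}

def rank_root_causes(observed: set[str]) -> list[tuple[str, float]]:
    # Score each candidate cause by how well its expected symptoms match the observed ones.
    scores = []
    for cause, expected in CAUSAL_MAP.items():
        if expected & observed:
            scores.append((cause, len(expected & observed) / len(expected | observed)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

print(rank_root_causes({"shipping_high_latency", "shipping_5xx"}))
# [('db_connection_pool_exhausted', 0.666...)]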

The bottom line is, in contrast to traditional AI, Causal AI operates on a much smaller dataset, requires far less resources for computation and provides more meaningful actionable insights, all of which translate into lower ongoing operational costs and profitable growth.

Summing it up

There’s massive potential for Causal AI and OpenTelemetry to come together to tackle the limitations of traditional AI to get to the “why.”  This is what we’re building at Causely. Doing so will result in numerous benefits:

  • Less time on Ops, more time on Dev: OpenTelemetry provides standardized data while Causal AI analyzes it to automate the root cause analysis (RCA) process, which will significantly reduce the time our devops teams have to spend on troubleshooting.
  • Instant gratification, no training lag: We can eliminate AI’s slow learning curve, because Causal AI leverages OpenTelemetry’s semantic conventions and its own domain knowledge of cause and effect to deliver actionable results right out of the box – without massive amounts of data and with no training lag!
  • Small data, lean computation, big impact: Unlike traditional AI’s data gluttony and significant computational overheads, Causal AI thrives on targeted data streams. OpenTelemetry’s smart filtering keeps the information flow lean, allowing Causal AI to identify the root causes with a significantly smaller dataset and compute footprint.
  • Fast root cause identification: Traditional AI might tell us “ice cream sales and shark attacks rise together,” but Causal AI reveals the truth – it’s the summer heat, not the sharks, driving both! By understanding cause-and-effect relationships, Causal AI cuts through the noise and identifies the root causes behind performance degradation and service malfunctions.

Having these capabilities is critical if we want to move beyond the labor intensive processes associated with how RCA is performed in devops today. This is why we are eating our own dog food and using Causely as part of our tech stack to manage the services we provide to customers.

If you would like to learn how to unplug from the Matrix of guesswork and embrace the opportunity offered through the combination of OpenTelemetry and Causal AI, don’t hesitate to reach out! The team and I at Causely are here to share our experience and help you navigate the path.


Related Resources

Causely for asynchronous communication

Causely for async communication - broker OOM

Managing microservices-based applications at scale is challenging, especially when it comes to troubleshooting and pinpointing root causes.

In a microservices-based environment, when a failure occurs, it causes a flood of anomalies across the entire system. Pinpointing the root cause can be as difficult as searching for a needle in a haystack. In this video, we’ll share how Causely can eliminate human heavy lifting and automate the troubleshooting process.

Causely is the operating system that assures application service delivery by automatically preventing failures, pinpointing root causes, and remediating them. Causely captures and analyzes cause and effect relationships so you can explore interesting insights and questions about your application environment.

Does this resonate with you? Feel free to share your troubleshooting stories here. We’d love to explore the ways Causely can help you!

Moving Beyond Traditional RCA In DevOps

Reposted with permission from LinkedIn

Modernization Of The RCA Process

Over the past month, I have spent a significant amount of time researching what vendors and customers are doing in the devops space to streamline the process of root cause analysis (RCA).

My conclusion is that the underlying techniques and processes used in operational environments today to perform RCA remain human-centric. As a consequence, troubleshooting remains complex and resource-intensive, and requires skilled practitioners to perform the work.

So, how do we break free from this human bottleneck? Brace yourselves for a glimpse into a future powered by AI. In this article, we’ll dissect the critical issues, showcase how cutting-edge AI advancements can revolutionize RCA, and hear directly from operations and engineering leaders who have experienced these capabilities first hand and shared their perspectives on this transformative tech.

Troubleshooting In The Cloud Native Era With Monitoring & Observability

Troubleshooting is hard because when degradations or failures occur in components of a business service, they spread like a disease to related service entities which also become degraded or fail.

This problem is amplified in the world of cloud-native applications, where we have decomposed business logic into many separate but interrelated service entities. Today an organization might have hundreds or thousands of interrelated service entities (microservices, databases, caches, messaging…).

To complicate things even further, change is a constant – code changes, fluctuating demand patterns, and the inherent unpredictability of user behavior. These changes can result in service degradations or failures.

Testing for all possible permutations in this ever-shifting environment is akin to predicting the weather on Jupiter – an impossible feat – amplifying the importance of a fast, effective and consistent root cause analysis process, to maintain the availability, performance and operational resilience of business systems.

While observability tools have made strides in data visualization and correlation, their inherent inability to explain the cause-and-effect relationships behind problems leaves us dependent on human expertise to navigate the vast seas of data to determine the root cause of service degradation and failures.

This dependence becomes particularly challenging due to siloed devops teams that have responsibility for supporting individual service entities within the complex web of service entities that make up business services. In this context, individual teams may frequently struggle to pinpoint the source of service degradation or failure, as the entity they support might be the culprit, or a victim of another service entity’s malfunction.

The availability of knowledge and skills within these teams also fluctuates due to business priorities, vacations, holidays, and even daily working cycles. This resource variability can lead to significant inconsistencies in problem identification and resolution times.

Causal AI To The Rescue: Automating The Root Cause Analysis Process For Cloud Native DevOps

For those who are not aware, Causal AI is a distinct field in Artificial Intelligence. It is already used extensively in many different industries but until recently there has been no application of the technology in the world of devops.

Causely is a new pioneer championing the use of Causal AI in the world of cloud-native applications. Its platform embodies an understanding of causality, so that when service entities degrade or fail and affect other service entities that make up business services, it can explain the cause and effect by showing the relationship between the problem and the symptoms it causes.

Through this capability, the team with responsibility for the failing or degraded service can be immediately notified and get to work on resolving the problem. Other teams might also be provided with notifications to let them know that their services are affected, along with an explanation for why this occurred. This eliminates the need for complex triage processes that would otherwise involve multiple teams and managers to orchestrate the process.

Understanding the cause-and-effect relationships in software systems serves as an enabler for automated remediation, predictive maintenance, and planning/gaming out operational resilience.

By using software in this way to automate the process of root cause analysis, organizations can reduce the time and effort and increase the consistency in the troubleshooting process, all of which leads to lower operational costs, improved service availability and less business disruption.

Customer Reactions: Unveiling the Transformative Impact of Causal AI for Cloud-Native DevOps

After sharing insights into Causely’s groundbreaking approach to root cause analysis (RCA) with operational and engineering leaders across various organizations, I’ve gathered a collection of anecdotes that highlight the profound impact this technology is poised to have in the world of cloud-native devops.

Streamlined Incident Resolution and Reduced Triage

“By accurately pinpointing the root cause, we can immediately engage the teams directly responsible for the issue, eliminating the need for war rooms and time-consuming triage processes. This ability to swiftly identify the source of problems and involve the appropriate teams will significantly reduce the time to resolution, minimizing downtime and its associated business impacts.”

Automated Remediation: A Path to Efficiency

“Initially, we’d probably implement a ‘fix it’ button that triggers remediation actions manually. However, as we gain confidence in the results, we can gradually automate the remediation process. This phased approach ensures that we can seamlessly integrate Causely into our existing workflows while gradually transitioning towards a more automated and efficient remediation strategy.”

Empowering Lower-Skilled Team Members

“Lower-skilled team members can take on more responsibilities, freeing up our top experts to focus on code development. By automating RCA tasks and providing clear guidance for remediation, Causely will empower less experienced team members to handle a wider range of issues, allowing senior experts to dedicate their time to more strategic initiatives.”

Building Resilience through Reduced Human Dependency

“Causely will enable us to build greater resilience into our service assurance processes by reducing our reliance on human knowledge and intuition. By automating RCA and providing data-driven insights, Causely will help us build a more resilient infrastructure that is less susceptible to human error and fluctuations in expertise.”

Enhanced Support Beyond Office Hours

“We face challenges maintaining consistent support outside of office hours due to reduced on-call expertise. Causely will enable us to handle incidents with the same level of precision and efficiency regardless of the time of day. Causely’s ability to provide automated RCA and remediation even during off-hours ensures that organizations can maintain a high level of service continuity around the clock.”

Automated Runbook Creation and Maintenance

“I was planning to create runbooks to guide other devops team members through troubleshooting processes. Causely can automatically generate and maintain these runbooks for me. This automated runbook generation eliminates the manual effort required to create and maintain comprehensive troubleshooting guides, ensuring that teams have easy access to the necessary information when resolving issues.”

Simplified Post-Incident Analysis

“Post-incident analysis will become much simpler as we’ll have a detailed record of the cause and effect for every incident. Causely’s comprehensive understanding of cause and effect provides a valuable resource for post-incident analysis, enabling us to improve processes, and prevent similar issues from recurring.”

Faster Problem Identification and Reduced Business Impacts

“Problems will be identified much faster, and there will be fewer business consequences.  By automating RCA and providing actionable insights, Causely can significantly reduce the time it takes to identify and resolve problems, minimizing their impact on business operations and customer experience.”

These anecdotes underscore the transformative potential of Causely, offering a compelling vision of how root cause analysis is automated, remediation is streamlined, and operational resilience in cloud-native environments is enhanced. As Causely progresses, the company’s impact on the IT industry is poised to be profound and far-reaching.

Summing Things Up

Troubleshooting in cloud-native environments is complex and resource-intensive, but Causal AI can automate the process, streamline remediation, and enhance operational resilience.

If you would like to learn more about how Causal AI might benefit your organization, don’t hesitate to reach out to me or Causely directly.


Related Resources

Root Cause Chronicles: Connection Collapse

The below post is reposted with permission from its original source on the InfraCloud Technologies blog.

This MySQL connection draining issue highlights the complexity of troubleshooting today’s complex environments, and provides a great illustration of the many rabbit holes SREs find themselves in. It’s critical to understand the ‘WHY’ behind each problem, as it paves the way for faster and more precise resolutions. This is exactly what we at Causely are on a mission to improve using causal AI.


On a usual Friday evening, Robin had just wrapped up their work, wished their colleagues a happy weekend, and turned in for the night. At exactly 3 am, Robin receives a call from the organization’s automated paging system: “High P90 Latency Alert on Shipping Service: 9.28 seconds”.

Robin works as an SRE for Robot-Shop, an e-commerce company that sells various robotics parts and accessories, and this message does not bode well for them tonight. They prepare themselves for a long, arduous night ahead and turn on their work laptop.

Setting the Field

Robot-Shop runs a sufficiently complex cloud native architecture to address the needs of their million-plus customers.

  • The traffic from load-balancer is routed via a gateway service optimized for traffic ingestion, called Web, which distributes the traffic across various other services.
  • User handles user registrations and sessions.
  • Catalogue maintains the inventory in a MongoDB datastore.
  • Customers can see the ratings of available products via the Ratings service APIs.
  • They choose products they like and add them to the Cart, a service backed by Redis cache to temporarily hold the customer’s choices.
  • Once the customer pays via the Payment service, the purchased items are published to a RabbitMQ channel.
  • These are consumed by the Dispatch service and prepared for shipping. Shipping uses MySQL as its datastore, as does Ratings.

(Figure 1: High Level Architecture of Robot-shop Application stack)

Troubles in the Dark

“OK, let’s look at the latency dashboards first.” Robin clicks on the attached Grafana dashboard on the Slack notification for the alert sent by PagerDuty. This opens up the latency graph of the Shipping service.

“How did it go from 1s to ~9.28s within 4-5 minutes? Did traffic spike?” Robin decides to focus on the Gateway ops/sec panel of the dashboard. The number is around ~140 ops/sec. Robin knows this data is coming from their Istio gateway and is reliable. The current number is more than affordable for Robot-Shop’s cluster, though there is a steady uptick in the request-count for Robot-Shop.

None of the other services show any signs of wear and tear, only Shipping. Robin understands this is a localized incident and decides to look at the shipping logs. The logs are sourced from Loki, and the widget is conveniently placed right beneath the latency panel, showing logs from all services in the selected time window. Nothing in the logs, and no errors regarding connection timeouts or failed transactions. So far the only thing going wrong is the latency, but no requests are failing yet; they are only getting delayed by a very long time. Robin makes a note: We need to adjust frontend timeouts for these APIs. We should have already gotten a barrage of request timeout errors as an added signal.

Did a developer deploy an unapproved change yesterday? Usually, the support team is informed of any urgent hotfixes before the weekend. Robin decides to check the ArgoCD Dashboards for any changes to shipping or any other services. Nothing there either, no new feature releases in the last 2 days.

Did the infrastructure team make any changes to the underlying Kubernetes cluster? Any version upgrades? The Infrastructure team uses Atlantis to gate and deploy the cluster updates via Terraform modules. The last date of change is from the previous week.

With no errors seen in the logs and partial service degradation as the only signal available to them, Robin cannot make any more headway into this problem. Something else may be responsible, could it be an upstream or downstream service that the shipping service depends on? Is it one of the datastores? Robin pulls up the Kiali service graph that uses Istio’s mesh to display the service topology to look at the dependencies.

Robin sees that Shipping has now started throwing its first 5xx errors, and both Shipping and Ratings are talking to something labeled as PassthroughCluster. The support team does not maintain any of these platforms and does not have access to the runtimes or the codebase. “I need to get relevant people involved at this point and escalate to folks in my team with higher access levels,” Robin thinks.

Stakeholders Assemble

It’s already been 5 minutes since the first report and customers are now getting affected.

(Figure 5: Detailed Kubernetes native architecture of Robot-shop)

Robin’s team lead Blake joins in on the call, and they also add the backend engineer who owns Shipping service as an SME. The product manager responsible for Shipping has already received the first complaints from the customer support team who has escalated the incident to them; they see the ongoing call on the #live-incidents channel on Slack, and join in. P90 latency alerts are now clogging the production alert channel as the metric has risen to ~4.39 minutes, and 30% of the requests are receiving 5xx responses.

The team now has multiple signals converging on the problem. Blake digs through shipping logs again and sees errors around MySQL connections. At this time, the Ratings service also starts throwing 5xx errors – the problem is now getting compounded.

The Product Manager (PM) says their customer support team is reporting frustration from more and more users who are unable to see the shipping status of the orders they have already paid for and who are supposed to get the deliveries that day. Users who just logged in are unable to see product ratings and are refreshing the pages multiple times to see if the information they want is available.

“If customers can’t make purchase decisions quickly, they’ll go to our competitors,” the PM informs the team.

Blake looks at the PassthroughCluster node on Kiali, and it hits them: It’s the RDS instance. The platform team had forgotten to add RDS as an External Service in their Istio configuration. It was an honest oversight that could cost Robot-Shop significant revenue loss today.

“I think MySQL is unable to handle new connections for some reason,” Blake says. They pull up the MySQL metrics dashboards and look at the number of Database Connections. It has gone up significantly and then flattened. “Why don’t we have an alert threshold here? It seems like we might have maxed out the MySQL connection pool!”

To verify their hypothesis, Blake looks at the Parameter Group for the RDS Instance. It uses the default-mysql-5.7 Parameter group, and max_connections is set to:

{DBInstanceClassMemory/12582880}

But, what does that number really mean? Blake decides not to waste time with checking the RDS Instance Type and computing the number. Instead, they log into the RDS instance with mysql-cli and run:

mysql> SHOW VARIABLES LIKE "max_connections";

Then Blake runs:

mysql> SHOW PROCESSLIST;

“I need to know exactly how many,” Blake thinks, and runs:

mysql> SELECT COUNT(host) FROM information_schema.processlist;

It’s more than the number of max_connections. Their hypothesis is now validated: Blake sees a lot of connections are in sleep() mode for more than ~1000 seconds, and all of these are being created by the shipping user.

(Figure 13: Affected Subsystems of Robot-shop)

“I think we have it,” Blake says, “Shipping is not properly handling connection timeouts with the DB; it’s not refreshing its unused connection pool.” The backend engineer pulls up the Java JDBC datasource code for shipping and says that it’s using defaults for max-idle, max-wait, and various other Spring datasource configurations. “These need to be fixed,” they say.

“That would need significant time,” the PM responds, “and we need to mitigate this incident ASAP. We cannot have unhappy customers.”

Blake knows that RDS has a stored procedure to kill idle/bad processes.

mysql> CALL mysql.rds_kill(processID);

Blake tests this out and asks Robin to quickly write a bash script to kill all idle processes.

#!/bin/bash

# MySQL connection details
MYSQL_USER="<user>"
MYSQL_PASSWORD="<passwd>"
MYSQL_HOST="<rds-name>.<id>.<region>.rds.amazonaws.com"

# Get process list IDs
PROCESS_IDS=$(MYSQL_PWD="$MYSQL_PASSWORD" mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -N -s -e "SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE USER='shipping'")

# Kill each connection owned by the 'shipping' user via the RDS stored procedure
for ID in $PROCESS_IDS; do
MYSQL_PWD="$MYSQL_PASSWORD" mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -e "CALL mysql.rds_kill($ID)"
echo "Terminated connection with ID $ID for user 'shipping'"
done

The team runs this immediately and the connection pool frees up momentarily. Everyone lets out a visible sigh of relief. “But this won’t hold for long, we need a hotfix on DataSource handling in Shipping,” Blake says. The backend engineer informs the team they are on it, and soon they have a patch-up that adds better defaults for:

spring.datasource.max-active
spring.datasource.max-age
spring.datasource.max-idle
spring.datasource.max-lifetime
spring.datasource.max-open-prepared-statements
spring.datasource.max-wait
spring.datasource.maximum-pool-size
spring.datasource.min-evictable-idle-time-millis
spring.datasource.min-idle

The team approves the hotfix and deploys it, finally mitigating a ~30 minute long incident.

Key Takeaways

Incidents such as this can occur in any organization with sufficiently complex architecture involving microservices written in different languages and frameworks, datastores, queues, caches, and cloud native components. A lack of understanding of end-to-end architecture and information silos only adds to the mitigation timelines.

During this RCA, the team finds that they need to improve on multiple fronts.

  • Frontend code had long timeouts and allowed for large latencies in API responses.
  • The L1 Engineer did not have an end-to-end understanding of the whole architecture.
  • The service mesh dashboard on Kiali did not show External Services correctly, causing confusion.
  • RDS MySQL database metrics dashboards did not send an early alert, as no max_connection (alert) or high_number_of_connections (warning) thresholds were set.
  • The database connection code was written with the assumption that sane defaults for connection pool parameters were good enough, which proved incorrect.

Pressure to resolve incidents quickly – which often comes from peers, leadership, and members of affected teams – only adds to the chaos of incident management, causing more human errors. Coordinating incidents like this through a dedicated Incident Commander role has produced more controllable outcomes for organizations around the world. An Incident Commander assumes the responsibility of managing resources, planning, and communications during a live incident, effectively reducing conflict and noise.

When multiple stakeholders are affected by an incident, resolutions need to be handled in order of business priority, working on immediate mitigations first, then getting the customer experience back at nominal levels, and only afterward focusing on long-term preventions. Coordinating these priorities across stakeholders is one of the most important functions of an Incident Commander.

Troubleshooting complex architecture remains a challenging activity to date. However, with the Blameless RCA Framework coupled with periodic metric reviews, a team can focus on incremental but constant improvements of their system observability. The team could also convert all successful resolutions to future playbooks that can be used by L1 SREs and support teams, making sure that similar errors can be handled well.

Concerted effort around a clear feedback loop of Incident -> Resolution -> RCA -> Playbook Creation eventually rids the system of most unknown-unknowns, allowing teams to focus on Product Development, instead of spending time on chaotic incident handling.

 

That’s a Wrap

Hope you all enjoyed that story about a hypothetical but complex troubleshooting scenario. We see incidents like this and more across various clients we work with at InfraCloud. The above scenario can be reproduced using our open source repository. We are working on adding more such reproducible production outages and subsequent mitigations to this repository.

We would love to hear from you about your own 3 am incidents. If you have any questions, you can connect with me on Twitter and LinkedIn.


Related Resources

Understanding failure scenarios when architecting cloud-native applications

Developing and architecting complex, large cloud-native applications is hard. In this short demo, we’ll show how Causely helps to understand failure scenarios before something actually fails in the environment.

In the demo environment we have a dozen applications with database servers and caches running in a cluster, providing multiple services. If we drill into these services and focus on the application, we can only see how the application is behaving right now. But Causely automatically identifies the potential root causes, the alerts they would cause, and the services that would be impacted by failures.

For example, a congested service would cause high latency across a number of different downstream dependencies. A malfunction of this service would make services unavailable and cause high error rates on the dependent services.

Causely is able to reason about the specific dependencies and all the possible root causes – not just for services, but for the applications – in terms of: what would happen if their database query takes too long, if their garbage collection time takes too long, or if their transaction latency is high? What services would be impacted, and what alerts would they receive?

This allows developers to design a more resilient system, and operators can understand how to run the environment with their actual dependencies.

We’re hoping that Causely can help application owners avoid production failures and service impact by architecting applications to be resilient in the first place.

What do you think? Share your comments on this use case below.

Troubleshooting cloud-native applications with Causely

Running large, complex, distributed cloud-native applications is hard. This short demo shows how Causely can help.

In this environment, we are running a number of applications with database servers and caches in a cluster, along with multiple services, pods, and containers. At any one point in time, we would be getting multiple alerts showing high latency, high CPU utilization, high garbage collection time, and high memory utilization across multiple microservices. Troubleshooting the root cause of each one of these alerts is really difficult.

Causely automatically identifies the root cause and shows how the service that is actually congested is causing all of these downstream alerts on its dependent services. Instead of individual teams troubleshooting their respective alerts, the team responsible for this product catalog service can focus on remediating and restoring it, while all of the other impacted services are shown so those teams are aware that their problems are caused by congestion in this service. This can significantly reduce the time to detect, remediate, and restore a service.

What do you think? Share your comments on this use case below.

Unveiling the Causal Revolution in Observability

Reposted with permission from LinkedIn.

OpenTelemetry and the Path to Understanding Complex Systems

Decades ago, the IETF (Internet Engineering Task Force) developed an innovative protocol, SNMP, revolutionizing network management. This standardization spurred a surge of innovation, fostering a new software vendor landscape dedicated to streamlining operational processes in network management, encompassing Fault, Configuration, Accounting, Performance, and Security (FCAPS). Today, SNMP reigns as the world’s most widely adopted network management protocol.

On the cusp of a similar revolution stands the realm of application management. For years, the absence of management standards compelled vendors to develop proprietary telemetry for application instrumentation, to enable manageability. Many of the vendors also built applications to report on and visualize managed environments, in an attempt to streamline the processes of incident and performance management.

OpenTelemetry’s emergence is poised to transform the application management market dynamics in a similar way by commoditizing application telemetry instrumentation, collection, and export methods. Consequently, numerous open-source projects and new companies are emerging, building applications that add value around OpenTelemetry.

This evolution is also compelling established vendors to embrace OpenTelemetry. Their futures hinge on their ability to add value around this technology, rather than solely providing innovative methods for application instrumentation.

Adding Value Above OpenTelemetry

While OpenTelemetry simplifies the process of collecting and exporting telemetry data, it doesn’t guarantee the ability to pinpoint the root cause of issues. This is because understanding the causal relationships between events and metrics requires more sophisticated analysis techniques.

Common approaches to analyzing OpenTelemetry data that get devops teams closer to this goal include:

  • Visualization and Dashboards: Creating effective visualizations and dashboards is crucial for extracting insights from telemetry data. These visualizations should present data in a clear and concise manner, highlighting trends, anomalies, and relationships between metrics.
  • Correlation and Aggregation: To correlate logs, metrics, and traces, you need to establish relationships between these data streams. This can be done using techniques like correlation IDs or trace identifiers, which can be embedded in logs and metrics to link them to their corresponding traces (a brief sketch follows this list).
  • Pattern Recognition and Anomaly Detection: Once you have correlated data, you can apply pattern recognition algorithms to identify anomalies or outliers in metrics, which could indicate potential issues. Anomaly detection tools can also help identify sudden spikes or drops in metrics that might indicate performance bottlenecks or errors.
  • Machine Learning and AI: Machine learning and AI techniques can be employed to analyze telemetry data and identify patterns, correlations, and anomalies that might be difficult to detect manually. These techniques can also be used to predict future performance or identify potential issues before they occur.
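As a brief sketch of the correlation-ID idea mentioned above (Python; the logger name and log format are arbitrary), the IDs of the current trace and span can be stamped onto every log line so that a log entry can be joined back to its trace:

import logging

from opentelemetry import trace

logging.basicConfig(format="%(levelname)s %(message)s trace_id=%(trace_id)s span_id=%(span_id)s")
logger = logging.getLogger("payments")

def log_with_trace_context(message: str) -> None:
    # Read the IDs of the currently active span and attach them to the log record,
    # so a log backend can link this line to the corresponding trace.
    ctx = trace.get_current_span().get_span_context()
    logger.warning(
        message,
        extra={"trace_id": format(ctx.trace_id, "032x"), "span_id": format(ctx.span_id, "016x")},
    )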

While all of these techniques might help to increase the efficiency of the troubleshooting process, human expertise is still essential for interpreting and understanding the results. This is because these approaches to analyzing telemetry data are based on correlation and lack an inherent understanding of cause and effect (causation).

Avoiding The Correlation Trap: Separating Coincidence from Cause and Effect

In the realm of analyzing observability data, correlation often takes center stage, highlighting the apparent relationship between two or more variables. However, correlation does not imply causation – a crucial distinction that software-driven causal analysis can effectively address, leading to better outcomes in the following ways:

Operational Efficiency And Control: Correlation-based approaches often leave us grappling with the question of “why,” hindering our ability to pinpoint the root cause of issues. This can lead to inefficient troubleshooting efforts, involving multiple teams in a devops environment as they attempt to unravel the interconnectedness of service entities.

Software-based causal analysis empowers us to bypass this guessing game, directly identifying the root cause and enabling targeted corrective actions. This not only streamlines problem resolution but also empowers teams to proactively implement automations to mitigate future occurrences. It also frees up the time of experts in the devops organizations to focus on shipping features and working on business logic.

Consistency In Responding To Adverse Events: The speed and effectiveness of problem resolution often hinge on the expertise and availability of individuals, a variable factor that can delay critical interventions. Software-based causal analysis removes this human dependency, providing a consistent and standardized approach to root cause identification.

This consistency is particularly crucial in distributed devops environments, where multiple teams manage different components of the system. By leveraging software, organizations can ensure that regardless of the individuals involved, problems are tackled with the same level of precision and efficiency.

Predictive Capabilities And Risk Mitigation: Correlations provide limited insights into future behavior, making it challenging to anticipate and prevent potential problems. Software-based causal analysis, on the other hand, unlocks the power of predictive modeling, enabling organizations to proactively identify and address potential issues before they materialize.

This predictive capability becomes increasingly valuable in complex cloud-native environments, where the interplay of numerous microservices and data pipelines can lead to unforeseen disruptions. By understanding cause and effect relationships, organizations can proactively mitigate risks and enhance operational resilience.

Conclusion

OpenTelemetry marks a significant step towards standardized application management, laying a solid foundation for a more comprehensive understanding of complex systems. However, to truly unlock the full potential, the integration of software-driven causal analysis, also referred to as Causal AI, is essential. By transcending correlation, software-driven causal analysis empowers devops organizations to understand cause and effect of system behavior, enabling proactive problem detection, predictive maintenance, operational risk mitigation and automated remediation.

The founding team of Causely participated in the standards-driven transformation that took place in the network management market more than two decades ago at a company called SMARTS. The core of their solution was built on Causal AI. SMARTS became the market leader of Root Cause Analysis in networks and was acquired by EMC in 2005. The team’s rich experience in Causal AI is now being applied at Causely, to address the challenges of managing cloud native applications.

Causely’s embrace of OpenTelemetry stems from the recognition that this standardized approach will only accelerate the advancement of application management. By streamlining telemetry data collection, OpenTelemetry creates a fertile ground for Causal AI to flourish.

If you are intrigued and would like to learn more about Causal AI, the team at Causely would love to hear from you, so don’t hesitate to get in touch.

All Sides of the Table

Reflecting on the boardroom dynamics that truly matter

This past month has been an eventful one. Like everyone in the tech world, I’m riveted by the drama unfolding at OpenAI, wondering how the board and CEO created such an extreme situation. I’ve been thinking a lot about board dynamics – and how different things look as a founder/CEO vs board member, especially at very different stages of company growth.

Closer to home and far less dramatic, last week we had our quarterly Causely team meetup in NYC, including our first in-person board meeting with 645 Ventures, Amity Ventures and Cervin Ventures. As a remote company, it was great to actually sit together in one room and discuss company and product strategy, including a demo of our latest iteration of the Causely product. Getting aligned and hearing the board’s input was truly helpful as we plan for 2024.

Also in the past month, my board experiences at Corvus Insurance and Chase Corporation came to an end. Corvus (a late-stage insurtech) announced it’s being acquired by Travelers Insurance for $435M, and Chase (a 75-year old manufacturing company) closed its acquisition by KKR for $1.3B. These exits were the culmination of years of work by the management teams (and support of their boards) to create significant shareholder value.

Each of these experiences has shown me models of board interaction and highlighted how critical it is for board members to build trust with the CEO and each other. I thought I’d share some thoughts on the most valuable traits or contributions a board member can offer, and what executives should look for in board members that will make a meaningful impact on the business, depending on the stage and size of the company.

From the startup CEO/founder view

As a founder who’s built and managed my own boards for the past 15 years, I’ve learned a lot about what kinds of board members are most impactful and productive for early-stage life. Here are a few examples of what these board members do: 

  • They get hands-on to provide real value through introductions to design partners, customers and potential key hires. 
  • They ask the operational questions relevant for the company’s current stage – for example, is this product decision (or hire) really the one we need to make now, while we are trying to validate product market fit, or can it be deferred? 
  • They hold the CEO accountable and keep board discussions focused, by asking questions like, “Ellen, are we actually talking about the topic that keeps you up at night?!”
  • They don’t project concerns from other companies or past experiences that might be irrelevant.
  • They stay calm through ends-of-quarters and acquisition processes, and balance the needs of investors and common shareholders.

From the board view

As an independent board member, I now appreciate these board members even more. It can be hard to step back from the operational role (“What do I need to do next?”) and provide guidance and support, sometimes just by asking the right question at the right moment in a board meeting. I find it very helpful to check in with the CEO and other board members before any official meeting, so I understand where the “hot” issues are and what decisions need to be made as a group. 

In a public company, the challenge is even greater. The independent board member must maintain this same operator/advisor perspective, but also weigh decisions as they relate to corporate governance and enterprise risk management across a wide range of products, markets and countries. For example, how fast can management drive product innovation which may cause new cyber risk or data management concerns? And unlike in private and early-stage companies, which tend to focus almost entirely on top-line growth, what is the right balance of growth vs profitability for the more mature public company?

Building trust is key

As the recent chaos at OpenAI shows (albeit in an extreme way), strong board relationships and ongoing communications between the board and management are critical. 

If you’re building a company and/or adding new board seats, think about what a new board member should bring to the table that will help you reach the next phase of growth and major milestones — and stay laser-focused on finding someone that meets your criteria. 

If you’re considering serving on the board of a company, think about what kinds of companies you’re best suited to help, and find one where you can work closely with the CEO and where existing board members will complement your skill set and experience.  

Regardless of which side of the table you’re on, take the time to build strong relationships and trust. Lead directors, who have taken a more central role in the past several years, can ensure that communications don’t break down. But even in earlier stage companies, it’s the job of everyone around the table to make sure there’s clarity on the key strategic issues the company is facing, and to provide the support that the CEO and management need to make the best decisions for the business.


Related reading

Why do this startup thing all over again? Our reasons for creating Causely

Why be a serial entrepreneur?

It’s a question that my co-founder, Shmuel, and I are asked many times. Both of us have been to this rodeo twice before – Shmuel, with SMARTS and Turbonomic, myself with ClearSky Data and CloudSwitch. There are all the usual reasons, including love of hard challenges, creation of game-changing products and working with teams and customers who inspire you. And of course there’s more than a small share of insanity (or as one of our founding engineers, Endre Sara, might call it, an addiction to Sisyphean tasks?) 

I’ve been pondering this as we build our new venture, Causely. The motivation behind Causely was a long-standing goal of Shmuel’s to tackle a problem he’s addressed in both previous companies, but still feels is unresolved: How to remove the burden of human troubleshooting from the IT industry? (Shmuel is not interested in solving small problems.) 

 Although there are tools galore and so much data being gathered that it takes data lakes to manage it all, at the heart of the IT industry there is a central problem that hasn’t fundamentally changed in decades. When humans try to determine the root cause of things that make applications slow down or break, they rely on the (often specific) expertise of people on their teams. And the scale, complexity and rate of change of their application environments will always grow faster than humans can keep up.   

I saw this during my time at AWS, while I was running some global cloud storage services. No matter how incredible the people were, how well-architected the services, or how robust the mechanisms and tools, when things went wrong (usually at 3 am) it always came down to having the right people online at that moment to figure out what was really happening. Much of the human effort went into stabilizing things as quickly as possible for customers, and then digging in for days (or longer) to understand what had happened.

When Peter Bell, our founding investor at Amity Ventures, originally introduced Shmuel and me, it was clear that Shmuel had the answer to this never-ending cycle of applications changing, scaling, breaking, requiring human troubleshooting, writing post-mortems… and starting all over again. He was thinking about the problem from the perspective of what’s actually missing: the ability to capture causality in software. AI and ML technologies are advancing beyond our wildest dreams, but they are still challenged by the inability to automate causation (vs. correlation, which they do quite well). By building a causal AI platform that can tackle this huge challenge, we believe Causely can eliminate the need for humans to keep pushing the same rocks up the same hills.
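To make the causation-vs-correlation point concrete, here is a minimal, purely hypothetical sketch in Python. It is not Causely’s implementation; the service names, symptoms and scoring are invented for illustration. The idea is simply that an explicit cause-and-effect map lets software rank which root cause best explains the observed symptoms, whereas correlation alone can only say that certain symptoms tend to fire together.

    # Illustrative sketch only (not Causely's implementation): a tiny cause->effect
    # map from hypothetical root causes to the symptoms each would produce, plus a
    # scoring function that ranks candidate causes by how well they explain what
    # was actually observed.

    CAUSAL_MODEL = {
        # hypothetical root cause -> symptoms it is expected to produce
        "db_connection_pool_exhausted": {"checkout_latency_high", "orders_5xx", "db_wait_time_high"},
        "cache_node_down": {"checkout_latency_high", "cache_miss_rate_high"},
        "bad_deploy_of_orders_service": {"orders_5xx", "orders_error_logs_spike"},
    }

    def rank_root_causes(observed):
        """Score each candidate cause by overlap between predicted and observed symptoms."""
        scores = []
        for cause, predicted in CAUSAL_MODEL.items():
            score = len(predicted & observed) / len(predicted | observed)  # Jaccard similarity
            scores.append((cause, score))
        return sorted(scores, key=lambda item: item[1], reverse=True)

    if __name__ == "__main__":
        observed_symptoms = {"checkout_latency_high", "orders_5xx", "db_wait_time_high"}
        for cause, score in rank_root_causes(observed_symptoms):
            print(f"{cause}: {score:.2f}")

In any real environment the causal model and the symptoms would come from live instrumentation rather than a hard-coded dictionary; the sketch only shows why an explicit causal model, rather than correlation, is what allows software to point back at a root cause.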

So why do this startup thing all over again?

Because for each new venture there’s always been a big, messy problem that needs to be fixed. One that requires a solution that’s never been done before, that will be exciting to build with smart, creative people. 

And so today, we announce funding for Causely and the opening of our Early Access program for initial users. We’ve been quietly designing and building for the past year, working with some awesome design partners. We’re thrilled to have 645 Ventures join us on this journey and are already seeing the impact of support from Aaron, Vardan and the team. We also welcome new investors Glasswing Ventures and Tau Ventures. We hope early users will love the Causely premise and early version of the product and give us the input we need to build something that truly changes how applications are built and operated.

Please take a look and let us know your thoughts. 

Learn more

Causely raises $8.8M in Seed funding to deliver IT industry’s first causal AI platform

Automation of causality will eliminate human troubleshooting and enable faster, more resilient cloud application management

Boston, June 29, 2023 – Causely, Inc., the causal AI company, today announced it has raised $8.8M in Seed funding, led by 645 Ventures with participation from founding investor Amity Ventures, and including new investors Glasswing Ventures and Tau Ventures. The funding will enable Causely to build its causal AI platform for IT and launch an initial service for applications running in Kubernetes environments. This financing brings the company’s total funding to over $11M since it was founded in 2022.

For years, the IT industry has struggled to make sense of the overwhelming amounts of data coming from dozens of observability platforms and monitoring tools. In a dynamic world of cloud and edge computing, with constantly increasing application complexity and scale, these systems gather metrics and logs about every aspect of application and IT environments. In the end, all this data still requires human troubleshooting to respond to alerts, make sense of patterns, identify root cause, and ultimately determine the best action for remediation. This process, which has not changed fundamentally in decades, is slow, reactive, costly and labor-intensive. As a result, many problems can cause end-user and business impact, especially in situations where complex problems propagate across multiple layers and components of an application.  

Causely’s breakthrough approach is to remove the need for human intervention from the entire process by capturing causality in software. By enabling end-to-end automation, from detection through remediation, Causely closes the gap between observability and action, speeding time to remediation and limiting business impact. Unlike existing solutions, Causely’s core technology goes beyond correlation and anomaly detection to identify root cause in dynamic systems, making it possible to see causality relationships across any IT environment in real time.

The founding team, led by veterans Ellen Rubin (founder of ClearSky Data, acquired by Amazon Web Services, and CloudSwitch, acquired by Verizon) and Shmuel Kliger (founder of Turbonomic, acquired by IBM, and SMARTS, acquired by EMC), brings together world-class expertise from the IT Ops, cloud-native and Kubernetes communities and decades of experience successfully building and scaling companies. 

“In a world where developers and operators are overwhelmed by data, alert storms and incidents, the current solutions can’t keep up,” said Ellen Rubin, Causely CEO and Founder. “Causely’s vision is to enable self-managed, resilient applications and eliminate the need for human troubleshooting. We are excited to bring this vision to life, with the support and partnership of 645 Ventures, Amity Ventures and our other investors, working closely with our early design partners.”

“Causality is the missing link in automating IT operations,” said Aaron Holiday, Managing Partner at 645 Ventures. “The Causely team is uniquely able to address this difficult and long-standing challenge and we are proud to be part of the next phase of the company’s growth.”

“Having worked with the founding team for many years, I’m excited to see them tackle an unsolved industry problem and automate the entire process from observability through remediation,” said Peter Bell, Chairman at Amity Ventures.

The initial Causely service, for DevOps and SRE users who are building and supporting apps in Kubernetes, is now available in an Early Access program. To learn more, please visit www.causely.io/early-access

About Causely

Causely, the causal AI company, automates the end-to-end detection, prevention and remediation of critical defects that can cause user and business impact in application environments. Led by veterans from Turbonomic, AWS, SMARTS and other cloud infrastructure companies, Causely is backed by Amity Ventures and 645 Ventures. To learn more, please visit www.causely.io

About 645 Ventures 

645 Ventures is an early-stage venture capital firm that partners with exceptional founders who are building iconic companies. We invest at the Seed and Series A stages and leverage our Voyager software platform to enable our Success team and Connected Network to help founders scale to the growth stage. Select companies that have reached the growth stage include Iterable, Goldbelly, Resident, Eden Health, FiscalNote, Lunchbox, and Squire. 645 has $550m+ in AUM across 5 funds, and is growing fast with backing from leading institutional investors, including university endowments, funds of funds, and pension funds. The firm has offices in New York and SF, and you can learn more at www.645ventures.com. 

About Amity Ventures 

Amity Ventures is a venture capital firm based in San Francisco, CA. We are a closely knit team with 40+ years of collective experience partnering deeply with entrepreneurs building category-defining technology businesses at the early stage. Amity intentionally invests in a small number of startups per year, and currently has a portfolio of about 25 companies across multiple funds. 

Contact

Kelsey Cullen, KCPR
kelsey@kcpr.com
650.438.1063