Author: Karina Babcock
The use of eBPF – at Netflix, in GPU infrastructure, on Windows and more
Takeaways from eBPF Summit 2024
How are organizations applying eBPF to solve real problems in observability, security, profiling, and networking? It’s a question I’ve found myself asking as I work in and around the observability space – and I was pleasantly surprised when Isovalent’s recent eBPF Summit provided some answers.
For those new to eBPF, it’s an open source technology that lets sandboxed programs run safely inside the operating system kernel, which makes it a powerful foundation for observability. Many organizations and vendors have adopted it as a data source (including Causely, where we use it to enhance our instrumentation for Kubernetes).
Many of the eBPF sessions highlighted real challenges companies faced and how they used eBPF to overcome them. In the spirit of helping others, my cliff notes and key takeaways from eBPF Summit are below.
Organizations like Netflix and Datadog are using eBPF in new, creative ways
The use of eBPF in Netflix
One of the keynote presentations was delivered by Shweta Saraf, who described specific problems Netflix overcame using eBPF, such as noisy neighbors – a common problem for companies running cloud-native environments.
Netflix uses eBPF to measure how long processes spend waiting in the CPU run queue after being scheduled. When that wait grows, it usually indicates a performance bottleneck on CPU resources, such as CPU throttling or over-allocation. (Netflix’s compute and performance team recently released a blog with much more detail on the subject.) In solving the noisy neighbor problem, the Netflix team also created a tool called bpftop, which measures the CPU usage of the eBPF code they instrumented.
The Netflix team released bpftop for the community to use, and it will ultimately help organizations implement efficient eBPF programs. It is especially useful when an eBPF program is misbehaving, allowing teams to quickly identify how much overhead that program is adding. We have come full circle: monitoring our monitoring programs 😁.
The use of eBPF in Datadog
Another use case for eBPF – and one that can be easily overlooked – is in chaos engineering. Scott Gerring, a technical advocate at Datadog, shared his experience on the matter. This quote resonated with me: “with eBPF… we have this universal language of destruction” – controlled destruction that is.
The benefit of eBPF here is that we can inject failures into cloud-native systems without having to rewrite application code. Interestingly, there are already open source chaos engineering projects that use eBPF, such as Chaos Mesh.
Scott listed a few examples: kernel probes (kprobes) attached to the openat system call that return access-denied errors for 50% of calls made by processes the user selects or defines, or using the traffic control (tc) subsystem to drop packets for sockets on a process you want to mark for failure.
eBPF will underpin AI development
Isovalent Co-founder and CTO Thomas Graf presented the eBPF roadmap and what he is most excited about. Notably: eBPF will deliver value in enabling the GPU and DPU infrastructure wave fueled by AI. AI is undoubtedly one of the hottest topics in tech right now. Many companies are using GPUs and DPUs to accelerate AI and machine learning (ML) workloads, because CPUs cannot deliver the processing power demanded by today’s AI models.
As Tom mentioned, whether the AI wave produces anything meaningful is up for debate, but companies will undoubtedly try, and they will make significant investments in GPUs and DPUs along the way. The capabilities of eBPF will be applied to this new wave of infrastructure in the same way they were to CPUs.
GPUs and DPUs are expensive, so companies do not want to waste processing power on programs that drive up utilization. The efficiency of eBPF programs can help maximize the performance of costly GPUs. For example, eBPF can be used for GPU profiling by hooking into GPU events such as memory operations, synchronization, and kernel launches. Unlocking this data helps teams understand which GPU kernels run most frequently, improving the efficiency of AI development.
eBPF support for Windows is growing
Another interesting milestone in eBPF’s journey is support for Windows. In fact, there is a growing GitHub repository for eBPF on Windows that exists today: https://github.com/microsoft/ebpf-for-windows
The project supports Windows 10 or later and Windows Server 2019 or later, and while it does not yet have feature parity with Linux, there is a lot of development in this space. The community is hard at work porting over the tooling that exists for eBPF on Linux, but it is a challenging endeavor because hook points and components (like Just-In-Time compilation or eBPF bytecode signing) differ on Windows.
It will be exciting to watch the same networking, security, and observability eBPF capabilities on Linux become available for Windows.
The need for better observability is fueling eBPF ecosystem growth
The community has created eBPF tools for both application and infrastructure use cases. There are nine major application projects and more than 30 exciting emerging ones. Notably, while there are a few production-ready runtimes and tools within the infrastructure ecosystem (like Linux and the LLVM compiler), there are many emerging projects such as eBPF for Windows.
With a user base that spans Meta, Apple, Capital One, LinkedIn, and Walmart (just to name a few), we can expect the number of eBPF projects to grow considerably in the coming years. The overall number of projects is forecast to reach triple digits by the end of 2025.
One of the top catalysts for growth? The urgent need for better observability. Of all the topics at last year’s KubeCon in Chicago, observability ranked the highest, beating competing topics like cost and automation. As with any other tool, eBPF lets organizations gather a lot of data, but the “why” is important. Are you using that data to create more noise and more alerts, or are you applying it to get to the root cause of problems that surface (or to other applications)?
It is exciting to watch the eBPF community develop and implement creative new ways to use eBPF, and the 2024 eBPF Summit was (and still is) an excellent source of real-world eBPF use cases and community-generated tooling.
Related resources
- Read the blog: Using OpenTelemetry and the OTel Collector for Logs, Metrics, and Traces
- Read the blog: Preventing OOM Kills in Kubernetes
- Watch the webinar: What is Causal AI and why do DevOps teams need it?
The “R” in MTTR: Repair or Recover? What’s the difference?
Finding meaning in a world of acronyms
There are so many ways to measure application reliability today, with hundreds of key performance indicators (KPIs) to measure availability, error rates, user experiences, and quality of service (QoS). Yet every organization I speak with struggles to effectively use these metrics. Some applications and services require custom metrics around reliability while others can be measured with just uptime vs. downtime.
In my role at Causely, I work with companies every day who are trying to improve the reliability, resiliency, and agility of their applications. One method of measuring reliability that I keep a close eye on is MTT(XYZ). Yes, I made that up, but it’s meant to capture all the different variations of mean time to “X” out there. We have MTTR, MTTI, MTTF, MTTA, MTBF, MTTD, and the list keeps going. In fact, some of these acronyms have multiple definitions. The one whose meaning I want to discuss today is MTTR.
So, what’s the meaning of MTTR anyway?
Before cloud-native applications, MTTR meant one thing – Mean Time to Repair. It’s a metric focused on how quickly an organization can respond to and fix problems that cause downtime or performance degradation. It’s simple to calculate too:
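MTTR = Total time spent on repairs ÷ Number of repairs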
Total time spent on repairs is the length of time IT spends fixing issues, and number of repairs is the number of times a fix has been implemented. Some organizations look at this over a week or a month in production. It’s a great metric to understand how resilient your system is and how quickly the team can fix a known issue. Unfortunately, data suggests that most IT organizations’ MTTR is increasing every year, despite massive investments in the observability stack.
For monolithic applications, MTTR has historically been an excellent measurement; as soon as a fix is applied, the entire application is usually back online and performing well. Now that IT is moving toward serverless and cloud-native applications, it is a much different story. When a failure occurs in Kubernetes – where there are many different containers, services, applications, and more all communicating in real time – the entire system can take much longer to recover.
The new MTTR: Mean Time to Recover
I am seeing more and more organizations redefine the meaning of MTTR from “mean time to repair” to “mean time to recover.” Recover means that not only is everything back online, but the system is performing well and satisfying its QoS targets and SLAs – AND a preventative approach has been implemented.
For example, take a common problem within Kubernetes: a pod enters a CrashLoopBackOff state. There are many reasons why a pod might continuously restart, including deployment errors, resource constraints, DNS resolution errors, missing K8s dependencies, and so on. But let’s say you completed your investigation and found that your pod did not have sufficient memory and was therefore crashing and restarting. So you increased the limit on the container or the deployment, and the pod(s) seem to be running fine for a bit… but wait, one just got evicted.
The node now has increased memory usage and pods are being evicted. Or, what if we have now created noisy neighbors, and that pod is “stealing” resources like memory from others on the same node? This is why organizations are moving away from “repair”: just because an applied fix brings everything back online doesn’t mean the system is healthy. “Repaired” can be a subjective term. Furthermore, sometimes the fix is merely a band-aid, and the problem returns hours, days, or weeks later.
Waiting for the entire application system to become healthy and applying a preventative measure will get us better insight into reliability. It is a more accurate way to measure how long it takes from a failure event to a healthy environment. After all, just because something is online does not mean it is performing well. The tricky issue here is: How do you measure “healthy”? In other words, how do we know the entire system is healthy and our preventative patch is truly preventing problems? There are some good QoS benchmarks like response time or transactions per second, but there is usually some difficulty in defining these thresholds. An improvement in MTBF (mean time between failures) is another good benchmark to test to see if your preventative approach is working.
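For reference, MTBF follows the same simple arithmetic: MTBF = Total operating time ÷ Number of failures. If that number trends upward after your fix, the preventative measure is likely doing its job.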
How can we improve Mean Time to Recover?
There are many ways to improve system recovery, and ultimately the best way to improve MTTR is to improve all the MTT(XYZ) that come before it on incident management timelines.
- Automation: Automating tasks like ticket creation, assigning incidents to appropriate teams, and probably most importantly, automating the fix can all help reduce the time from problem identification to recovery. But, the more an organization scrutinizes every single change and configuration, the longer it takes to implement a fix. Becoming less strict drives faster results.
- Well-defined Performance Benchmarks: Lots of customers I speak with have a couple of KPIs they track, but the more specific the better. For example, instead of making a blanket statement that every application needs a response time of 200 ms or less, set these metrics on an app-by-app basis.
- Chaos Engineering: This is an often-overlooked methodology to improve recovery rate. Practicing and simulating failures helps improve how quickly we can react, troubleshoot, and apply a fix. It does take a lot of time though, so it is not an easy strategy to adhere to.
- Faster Alerting Mechanisms: This is simple: The faster we get notified of a problem, the quicker we can fix it. We need to not just identify the symptoms but also quickly find the root cause. I see many companies try to set up proactive alerts, but they often get more smoke than fire.
- Knowledge Base: This was so helpful for me in a previous role. Building a KB in a system like Atlassian, SharePoint, or JIRA can help immensely in the troubleshooting process. The KB needs to be searchable and always changing as the environment evolves. Being able to search for a specific string in an error message within a KB can immediately highlight not just a root cause but also a fix.
To summarize, MTTR is a metric that needs to capture the state of a system from the moment of failure until the entire system is healthy again. This is a much more accurate representation of how fast we recover from a problem and how resilient the application architecture is. MTTR is a principle that extends beyond the world of IT; it applies in security, mechanics, even healthcare. Just remember: a good surgeon is measured not only by how quickly they can repair a broken bone, but by how quickly the patient recovers.
Improving application resilience and reliability is something we spend a lot of time thinking about at Causely. We’d love to hear how you’re handling this today, and what metric you’ve found most useful toward this goal. Comment here or contact us with your thoughts!
Related resources
- Read the blog: The Rising Cost of Digital Incidents: Understanding and Mitigating Outage Impact
- Read the blog: Bridging the Gap Between Observability and Automation with Causal Reasoning
- Learn more about the Causal Reasoning Platform from Causely
Intelligence Augmentation: An Important Step in the Journey to Continuous Application Reliability
In an article that I published nearly two years ago titled Are Humans Actually Underrated, I talked about how technology can be used to augment human intelligence to empower humans to work better, smarter and faster.
The notion that technology can enhance human capabilities is far from novel. Often termed Intelligence Augmentation, Intelligence Amplification, Cognitive Augmentation, or Machine Augmented Intelligence, this concept revolves around leveraging information technology to bolster human intellect. Its roots trace back to the 1950s and 60s, a testament to its enduring relevance.
From the humble mouse and graphical user interface to the ubiquitous iPhone and the cutting-edge advancements in Artificial Intelligence like ChatGPT, Intelligence Augmentation has been steadily evolving. These tools and platforms serve as tangible examples of how technology can be harnessed to augment our cognitive abilities and propel us towards greater efficiency and innovation.
An area of scientific development closely aligned with Intelligence Augmentation is the field of Causal Reasoning. Before diving into this, it’s essential to underscore the fundamental importance of causality. Understanding why things happen, not just what happened, is the cornerstone of effective problem-solving, decision-making, and innovation.
Humans Crave Causality
Our innate curiosity drives us to seek explanations for the world around us. This deep-rooted desire to understand cause-and-effect relationships is fundamental to human cognition. Here’s why:
Survival: At the most basic level it all boils down to survival. By understanding cause-and-effect, we can learn what actions lead to positive outcomes (food, shelter, safety) and avoid negative ones (danger, illness, death).
Learning: Understanding cause-and-effect is fundamental to learning and acquiring knowledge. We learn by observing and making connections between events, forming a mental model of how the world works.
Prediction: Being able to predict what will happen allows us to plan for the future and make informed choices. We can anticipate the consequences of our actions and prepare for them.
Problem-solving: Cause-and-effect is crucial for solving problems efficiently. By identifying the cause of an issue, we can develop solutions that address the root cause rather than just treating the symptoms.
Scientific Discovery: This innate desire to understand causality drives scientific inquiry. By seeking cause-and-effect relationships, we can unravel the mysteries of the universe and develop new technologies.
Technological Advancement: Technology thrives on our ability to understand cause-and-effect. From inventing tools to building machines, understanding how things work allows us to manipulate the world around us.
Societal Progress: When we understand the causes of social issues, we can develop solutions to address them. Understanding cause-and-effect fosters cooperation and allows us to build a better future for ourselves and future generations.
Understanding Cause & Effect In The Digital World
In the complex digital age, this craving for causality remains as potent as ever. Nowhere is this more evident than in the world of cloud native applications. These intricate systems, composed of interconnected microservices and distributed components, can be challenging to manage and troubleshoot. When things go wrong, pinpointing the root cause can be akin to searching for a needle in a haystack.
This is increasingly important today because so many businesses rely on modern applications and real time data to conduct their daily business. Delays, missing data and malfunctions can have a crippling effect on business processes and customer experiences which in turn can have significant financial consequences.
Understanding causality in this context is paramount. It’s the difference between reacting to symptoms and addressing the underlying issue. For instance, a sudden spike in error rates might be attributed to increased traffic. While this might be a contributing factor, the root cause could lie in a misconfigured database, a network latency issue, or a bug in a specific microservice. Without a clear understanding of the causal relationships between these components, resolving the problem becomes a matter of trial and error.
Today, site reliability engineers (SREs) and developers, tasked with ensuring the reliability and performance of cloud native systems, rely heavily on causal reasoning. They construct mental models of how different system components interact, attempt to anticipate potential failure points, and develop strategies to mitigate risks.
When incidents occur, SREs and developers work together, employing a systematic approach to identify the causal chain, from the initial trigger to the eventual impact on users. Organizations rely heavily on their knowledge to implement effective remediation steps and prevent future occurrences.
In the intricate world of cloud native applications, where complexity reigns, this innate ability to connect cause and effect is essential for building resilient, high-performing systems.
The Crucial Role of Observability in Understanding Causality in Cloud Native Systems
OpenTelemetry and its ecosystem of observability tools provide a window into the complex world of cloud native systems. By collecting and analyzing vast amounts of data, engineers can gain valuable insights into system behavior. However, understanding why something happened – establishing causality – still remains a significant challenge.
The inability to rapidly pinpoint the root cause of issues is a costly affair. A recent PagerDuty customer survey revealed that the average time to resolve digital incidents is a staggering 175 minutes. This delay hurts service reliability, erodes customer satisfaction and revenue, and consumes significant engineering cycles in the process, often leaving engineering teams overwhelmed and firefighting.
To drive substantial improvements in system reliability and performance, organizations must accelerate their ability to understand causality. This requires a fundamental shift in how we approach observability. By investing in advanced analytics that can reason about causality, we can empower engineers to quickly identify root causes and their effects so they can prioritize what is important and implement effective solutions.
Augmenting Human Ingenuity with Causal Reasoning
In this regard, causal reasoning software like Causely represents a quantum leap forward in the evolution of human-machine collaboration. By combining this capability with OpenTelemetry, the arduous task of causal reasoning can be automated, liberating SREs and developers from the firefighting cycle. Instead of being perpetually mired in troubleshooting, they can dedicate more cognitive resources to innovation and strategic problem-solving.
Imagine these professionals equipped with the ability to process vast quantities of observability data in mere seconds, unveiling intricate causal relationships that would otherwise remain hidden. This is the power of causal reasoning software built to amplify the processes of reliability engineering: it amplifies human intelligence, transforming SREs and developers from reactive problem solvers into proactive architects of system reliability.
By accelerating incident resolution from today’s averages (175 minutes, as documented in PagerDuty’s customer survey) to mere minutes, these platforms not only enhance customer satisfaction but also unlock significant potential for business growth. With freed-up time, teams can focus on developing new features, improving system performance, and preventing future issues. Moreover, the insights derived from causal reasoning software can be leveraged to proactively identify vulnerabilities and optimize system performance, elevating the overall reliability and resilience of cloud native architectures.
The convergence of human ingenuity and machine intelligence, embodied in causal reasoning software, is ushering in a new era of problem-solving. This powerful combination enables us to tackle unprecedented challenges with unparalleled speed, accuracy, and innovation.
In the context of reliability engineering, the combination of OpenTelemetry and causal reasoning software offers a significant opportunity to accelerate progress towards continuous application reliability.
Related resources
- Read the blog: Explainability: The Black Box Dilemma in the Real World
- Watch the video: See how Causely leverages OpenTelemetry
- Take the interactive tour: Experience Causely first-hand
Preventing Out-of-Memory (OOM) Kills in Kubernetes: Tips for Optimizing Container Memory Management
Running containerized applications at scale with Kubernetes demands careful resource management. One complicated but common challenge is preventing Out-of-Memory (OOM) kills, which occur when a container’s memory consumption surpasses its allocated limit. This abrupt termination by the Linux kernel’s OOM killer disrupts application stability and can affect application availability and the health of your overall environment.
In this post, we’ll explore the reasons that OOM kills can occur and provide tactics to combat and prevent them.
Before diving in, it’s worth noting that OOM kills represent one symptom that can have a variety of root causes. It’s important for organizations to implement a system that solves the root cause analysis problem with speed and accuracy, allowing reliability engineering teams to respond rapidly, and to potentially prevent these occurrences in the first place.
Deep dive into an OOM kill
An Out-Of-Memory (OOM) kill in Kubernetes occurs when a container exceeds its memory limit, causing the Linux kernel’s OOM killer to terminate the container. This impacts application stability and requires immediate attention.
Several factors can trigger OOM kills in your Kubernetes environment, including:
- Memory limits exceeded: This is the most common culprit. If a container consistently pushes past its designated memory ceiling, the OOM killer steps in to prevent a system-wide meltdown.
- Memory leaks: Applications can develop memory leaks over time, where they allocate memory but fail to release it properly. This hidden, unexpected growth eventually leads to OOM kills.
- Resource overcommitment: Co-locating too many resource-hungry pods onto a single node can deplete available memory. When the combined memory usage exceeds capacity, the OOM killer springs into action.
- Bursting workloads: Applications with spiky workloads can experience sudden memory surges that breach their limits, triggering OOM kills.
As an example, a web server with a memory-leak bug may gradually consume more and more memory until the OOM killer intervenes to protect the rest of the node.
Another case could be when a Kubernetes cluster over-commits resources by scheduling too many pods on a single node. The OOM killer may need to step in to free up memory and ensure system stability.
The devastating effects of OOM kills: Why they matter
OOM kills aren’t benign, routine events. They can trigger a cascade of negative consequences for your applications and the overall health of the cluster, such as:
- Application downtime: When a container is OOM-killed, it abruptly terminates, causing immediate application downtime. Users may experience service disruptions and outages.
- Data loss: Applications that rely on in-memory data or stateful sessions risk losing critical information during an OOM kill.
- Performance degradation: Frequent OOM kills force containers to restart repeatedly. This constant churn degrades overall application performance and user experience.
- Service disruption: Applications often interact with each other. An OOM kill in one container can disrupt inter-service communication, causing cascading failures and broader service outages.
If a container running a critical database service experiences an OOM kill, it could result in data loss and corruption. This leads to service disruptions for other containers that rely on the database for information, causing cascading failures across the entire application ecosystem.
Combating OOM kills
There are a few different tactics to combat OOM kills and operate a memory-efficient Kubernetes environment.
Set appropriate resource requests and limits
For example, you can set a memory request of 200Mi and a memory limit of 300Mi for a particular container in your Kubernetes deployment. Requests ensure the container gets at least 200Mi of memory, while limits cap it at 300Mi to prevent excessive consumption.
resources:
  requests:
    memory: "200Mi"
  limits:
    memory: "300Mi"
While this may mitigate potential memory use issues, it is a very manual process and does not deal at all with the dynamic nature of what we can achieve with Kubernetes. It also doesn’t solve the source issue, which may be a code-level problem triggering memory leaks or failed GC processes.
Transition to autoscaling
Leveraging autoscaling capabilities is a core dynamic option for resource allocation. There are two autoscaling methods:
- Vertical Pod Autoscaling (VPA): VPA dynamically adjusts resource requests and limits based on real-time memory usage patterns. This ensures containers have enough memory to function but avoids over-provisioning. (A minimal VPA example follows the HPA configuration below.)
- Horizontal Pod Autoscaling (HPA): HPA scales the number of pods running your application up or down based on memory utilization. This distributes memory usage across multiple pods, preventing any single pod from exceeding its limit. The following HPA configuration shows an example of scaling based on memory usage:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
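For comparison, here is a minimal VPA object. This is only a sketch: it assumes the Vertical Pod Autoscaler CRDs and controller are installed in your cluster, and that the target Deployment is named my-app, as above.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"  # let the VPA apply its recommendations automatically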
Monitor memory usage
Proactive monitoring is key. For instance, you can configure Prometheus to scrape memory metrics from your Kubernetes pods every 15 seconds and set up Grafana dashboards to visualize memory usage trends over time. Additionally, you can create alerts in Prometheus to trigger notifications when memory usage exceeds a certain threshold.
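As a sketch of what that alert might look like – assuming the cluster exposes the usual cAdvisor and kube-state-metrics series – the following Prometheus rule fires when a container’s working set stays above 90% of its memory limit for five minutes:

groups:
- name: memory-alerts
  rules:
  - alert: ContainerApproachingMemoryLimit
    # Ratio of actual memory working set to the configured memory limit
    expr: |
      max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
        /
      max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
        > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} is above 90% of its memory limit"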
Optimize application memory usage
Don’t underestimate the power of code optimization. Address memory leaks within your applications and implement memory-efficient data structures to minimize memory consumption.
Pod disruption budgets (PDB)
When deploying updates, PDBs ensure a minimum number of pods remain available, even during rollouts. This mitigates the risk of widespread OOM kills during deployments. Here is a PDB configuration example that helps ensure minimum pod availability.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: my-app
Manage node resources
You can apply a node selector to ensure that a memory-intensive pod is only scheduled on nodes with a minimum of 8GB of memory. Additionally, you can use taints and tolerations to dedicate specific nodes with high memory capacity for memory-hungry applications, preventing OOM kills due to resource constraints.
# Hypothetical example: assumes nodes with at least 8GB of memory have been
# labeled memory-tier=high and tainted with dedicated=high-memory:NoSchedule
nodeSelector:
  memory-tier: high
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "high-memory"
  effect: "NoSchedule"
Use QoS classes
Kubernetes offers Quality of Service (QoS) classes that prioritize resource allocation for critical applications. Assign the highest QoS class to applications that can least tolerate OOM kills. Here is a sample resource configuration; because the requests equal the limits, Kubernetes assigns the pod the Guaranteed QoS class:
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"
These are a few potential strategies to help prevent OOM kills. The challenge comes with the frequency with which they can occur, and the risk to your applications when they happen.
As you can imagine, it’s not possible to manually manage resource utilization at scale and still guarantee the stability and performance of the containerized applications in your Kubernetes environment.
Manual thresholds = Rigidity and risk
These techniques can help reduce the risk of OOM kills. The issue is not entirely solved though. By setting manual thresholds and limits, you’re removing many of the dynamic advantages of Kubernetes.
A more ideal way to solve the OOM kill problem is to use adaptive, dynamic resource allocation. Even if you get resource allocation right on initial deployment, many changing factors affect how your application consumes resources over time. There is also risk because application and resource issues don’t just affect one pod or one container. Resource issues can reach every part of the cluster and degrade the other running applications and services.
Which strategy works best to prevent OOM kills?
Vertical Pod Autoscaling (VPA) and Horizontal Pod Autoscaling (HPA) are common strategies used to manage resource limits in Kubernetes containers. VPA adjusts resource limits based on real-time memory usage patterns, while HPA scales pods based on memory utilization.
Monitoring with tools like Prometheus may help with the troubleshooting of memory usage trends. Optimizing application memory usage is no easy feat because it’s especially challenging to identify whether it is infrastructure or code causing the problem.
Pod Disruption Budgets (PDB) may help ensure a minimum number of pods remain available during deployments, while node resources can be managed using node selectors and taints. Quality of Service (QoS) classes prioritize resource allocation for critical applications.
One thing is certain: OOM kills are a common and costly challenge to manage using traditional monitoring tools and methods.
At Causely, we’re focused on applying causal reasoning software to help organizations keep applications healthy and resilient. By automating root cause analysis, issues like OOM kills can be resolved in seconds, and unintended consequences of new releases or application changes can be avoided.
Related resources
- Read the blog: Understanding the Kubernetes Readiness Probe: A Tool for Application Health
- Read the blog: Bridging the gap between observability and automation with causal reasoning
- Watch the webinar: What is Causal AI and why do DevOps teams need it?
Causely brings on a new CEO to accelerate growth
Yotam Yemini joins Causely as CEO after departing Cisco and previously leading go-to-market efforts at Oort, Quantum Metric, and IBM Turbonomic
Thursday, August 22, 2024 – Today, Causely is excited to welcome Yotam Yemini as the company’s Chief Executive Officer. In this role, Yotam will be instrumental in helping Causely fulfill its mission to enable continuous application reliability and streamline the software development lifecycle for modern applications.
“We are excited to welcome Yotam to the team,” said Causely Founder Shmuel Kliger. “Yotam brings an exceptional track record of building and scaling go-to-market strategy. His leadership will be pivotal as we deliver our Causal Reasoning Platform to Engineering and DevOps teams.”
The news comes on the heels of strong market signals from its early access program and the hiring of Francis Cordón as Chief Customer Officer in July. Francis is a domain expert and seasoned technology leader. His experience includes customer success leadership at Quantum Metric and IBM Turbonomic, resiliency and performance architecture at BNY Mellon, and sales engineering at Dynatrace.
“Causely’s causal reasoning software is a well-timed and much needed innovation for the tech industry,” said Alex Sukennik, CIO, Semrush. “The team behind Causely is uniquely suited to help engineering teams assure service levels for business-critical applications.”
In addition to today’s news, the company shares its deepest gratitude to Causely Founder Ellen Rubin for her work building the company, early product and team over the past few years. We wish Ellen the best as she moves on to her next adventures as board member, investor, advisor, and builder in the Boston startup ecosystem.
See Causely first-hand through a self-guided tour at causely.io/resources/experience-causely or sign up today at causely.io/trial.
About Causely
Causely is the leading provider of causal reasoning software, which enables continuous application reliability and streamlines the software development lifecycle for modern applications. Whereas reliability engineering today tends to be overly complex and labor-intensive, Causely amplifies engineering productivity through its patent-pending Causal Reasoning Platform. The platform identifies cause and effect relationships in runtime to automate the process of root cause and impact analysis. This drastically shortens mean-time-to-repair (MTTR), reduces the number of incidents that occur, and empowers engineering teams to build more resilient applications and business services. Causely, Inc. is a remote-first company headquartered in New York City. Visit causely.io to learn more.
Media Contact
Karina Babcock
kbabcock@causely.io
The Rising Cost of Digital Incidents: Understanding and Mitigating Outage Impact
Digital disruptions have reached alarming levels. Incident response in modern application environments is frequent, time-consuming and labor intensive. Our team has first-hand experience dealing with the far-reaching impacts of these disruptions and outages, having spent decades in IT Ops.
PagerDuty recently published a study1 that shines a light on how broken our existing incident response systems and practices are. The recent CrowdStrike debacle is further evidence of this. Even with all the investment in observability, AIOps, automation, and playbooks, things aren’t improving. In some ways, they’re actually worse: we’re collecting more and more data and we’re overloaded with tooling, creating confusion between users and teams who struggle to understand the holistic environment and all of its interdependencies.
With a mean resolution time of 175 minutes, each customer-impacting digital incident costs both time and money. The industry needs to reset and revisit current processes so we can evolve and change the trajectory.
The impact of outages and application downtime
Outages erode customer trust. 90% of IT leaders report that disruptions have reduced customer confidence. Protecting sensitive data, ensuring swift service restoration, and providing real-time customer updates are essential for maintaining trust when digital incidents happen. Thorough, action-oriented postmortems are critical post-incident to prevent recurrences. And – at risk of reinforcing the obvious – IT organizations need to put operational practices in place to minimize outages from happening in the first place.
Yet even though IT leaders understand the implications on customer confidence, incident frequency continues to rise. 59% of IT leaders report an increase in customer-impacting incidents, and it’s not going to get better unless we change the way we observe and mitigate problems in our applications.
Automation can help, but adoption is slow
Despite the growing threat, many organizations are lagging behind in incident response automation:
- Over 70% of IT leaders report that key incident response tasks are not yet fully automated.
- 38% of responders’ time is spent dealing with manual incident response processes.
- Organizations with manual processes take on average 3 hours 58 minutes to resolve customer-impacting incidents, compared to 2 hours 40 minutes for those with automated processes.
It doesn’t take an IT expert to know that spending nearly half their time in manual processes is a waste of resources. And those that have automated operations are still taking almost 3 hours to resolve incidents. Why is incident response still so slow?
It’s not just about process automation. We also need to accelerate decision automation, driven by a deep understanding of the state of applications and infrastructure.
Causal AI for DevOps: The missing link
Causal AI for DevOps promises a bridge between observability and automated incident response. By “Causal AI for DevOps,” I’m referring to causal reasoning software that applies machine learning to automatically capture cause and effect relationships. Causal AI has the potential to help Dev and Ops teams better plan for changes to code, configurations or load patterns, so they can stay focused on achieving service-level and business objectives instead of firefighting.
With Causal AI for DevOps, many of the incident response tasks that are currently manual can be automated:
- When service entities are degraded or failing and affecting other entities that make up business services, causal reasoning software surfaces the relationship between the problem and the symptoms it’s causing.
- The team with responsibility for the failing or degraded service is immediately notified so they can get to work resolving the problem. Some problems can be remediated automatically.
- Notifications can be sent to end users and other stakeholders, letting them know that their services are affected along with an explanation for why this occurred and when things will be back to normal.
- Postmortem documentation is automatically generated.
- There are no more complex triage processes that would otherwise require multiple teams and managers to orchestrate. Incidents and outages are reduced and root cause analysis is automated, so DevOps teams spend less time troubleshooting and more time shipping code.
Introducing Causely
This potential to transform the way DevOps teams work is why we built Causely. Our Causal Reasoning Platform automatically pinpoints the root cause of observed symptoms based on real-time, dynamic data across the entire application environment. Causely transforms incident response and improves mean time to resolution (MTTR), so DevOps teams focus on building new services and innovations that propel the business forward.
By automatically understanding cause-and-effect relationships in application environments, Causely also enables predictive maintenance and better overall operational resilience. It can help to prevent outages and identify the root cause of potential issues before they escalate.
Here’s how it works, at a high level:
- Our Causal Reasoning Platform is shipped with out-of-the-box Causal Models that drive the platform’s behavior.
- Once deployed, Causely automatically discovers the application environment and generates a Topology Graph of it.
- A Causality Graph is generated by instantiating the Causal Models with the Topology Graph to reflect cause and effect relationships between the root causes and their symptoms, specific to that environment at that point in time.
- A Codebook is generated from the Causality Graph.
- Using the Codebook, our Causal Reasoning Platform automatically and continuously pinpoints the root cause of issues.
Users can dig into incidents, understand their root causes, take remediation steps, and proactively plan for new releases and application changes – all within Causely.
This decreases downtime, enhances operational efficiency, and improves customer trust long-term.
It’s time for a new approach
It’s time to shift from manual to automated incident response. Causal AI for DevOps can help teams prevent outages, reduce risk, cut costs, and build sustainable customer trust.
Don’t hesitate to contact us about how to bring automation into your organization, or you can see Causely for yourself.
1 “Customer impacting incidents increased by 43% during the past year- each incident costs nearly $800,000.” PagerDuty. (2024, June 26). https://www.pagerduty.com/resources/learn/cost-of-downtime/
Related resources
- Read the blog: DevOps may have cheated death, but do we all need to work for the king of the underworld?
- Learn about our Causal Reasoning Platform
- See Causely for yourself
Explainability: The Black Box Dilemma in the Real World
The software industry is at a crossroads. I believe those who embrace explainability as a key part of their strategy will emerge as leaders. Those who resist will risk losing customer confidence and market share. The time for obfuscation is over. The era of explainability has begun.
What Is The Black Box Dilemma?
Imagine a masterful illusionist, their acts so breathtakingly deceptive that the secrets behind them remain utterly concealed. Today’s software is much the same. We marvel at its abilities to converse, diagnose, drive, and defend, yet the inner workings often remain shrouded in mystery. This is often referred to as the “black box” problem.
The recent CrowdStrike incident is also a stark reminder of the risks of this opacity. A simple software update, intended to enhance security, inadvertently caused widespread system crashes. It’s as if the magician’s assistant accidentally dropped the secret prop, revealing the illusion for what it truly was – an error-prone process with no resilience. Had organizations understood the intricacies of CrowdStrike’s software release process, they might have been better equipped to mitigate risks and prevent the disruptions that followed.
This incident, coupled with the rapid advancements in AI, underscores the critical importance of explainability. Understanding the entire lifecycle of software – from conception to operation – is no longer optional but imperative. It is the cornerstone of trust, a shield against catastrophic failures, and an important foundation for accountability.
As our world becomes increasingly reliant on complex systems, understanding their inner workings is no longer a luxury but a necessity. Explainability acts as a key to unlocking the black box, revealing the logic and reasoning behind complex outputs. By shedding light on the decision-making processes of software, AI, and other sophisticated systems, we foster trust, accountability, and a deeper comprehension of their impact.
The Path Forward: Cultivating Explainability in Software
Achieving explainability demands a comprehensive approach that addresses several critical dimensions.
- Software Centric Reasoning and Ethical Considerations: Can the system’s decision-making process be transparently articulated, justified, and aligned with ethical principles? Explainability is essential for building trust and ensuring that systems used to support decision making operate fairly and responsibly.
- Effective Communication and User Experience: Is the system able to communicate its reasoning clearly and understandably to both technical and non-technical audiences? Effective communication enhances collaboration, knowledge sharing, and user satisfaction by empowering users to make informed decisions.
- Robust Data Privacy and Security: How can sensitive information be protected while preserving the transparency necessary for explainability? Rigorous data handling and protection are crucial for safeguarding privacy and maintaining trust in the system.
- Regulatory Compliance and Continuous Improvement: Can the system effectively demonstrate adherence to relevant regulations and industry standards for explainability? Explainability is a dynamic process requiring ongoing evaluation, refinement, and adaptation to stay aligned with the evolving regulatory landscape.
By prioritizing these interconnected elements, software vendors and engineering teams can create solutions where explainability is not merely a feature, but a cornerstone of trust, reliability, and competitive advantage.
An Example of Explainability in Action with Causely
Causely is a pioneer in applying causal reasoning to revolutionize cloud-native application reliability. Our platform empowers operations teams to rapidly identify and resolve the root causes of service disruptions, preventing issues before they impact business processes and customers. This enables dramatic reductions in Mean Time to Repair (MTTR), minimizing business disruptions and safeguarding customer experiences.
Causely also uses its Causal Reasoning Platform to manage its SaaS offering, detecting and responding to service disruptions, and ensuring swift resolution with minimal impact. You can learn more about this in Endre Sara’s article “Eating Our Own Dogfood: Causely’s Journey with OpenTelemetry & Causal AI”.
Causal Reasoning, often referred to as Causal AI, is a specialized field in computer science dedicated to uncovering the underlying cause-and-effect relationships within complex data. As the foundation of explainable AI, it surpasses the limitations of traditional statistical methods, which frequently produce misleading correlations.
Unlike opaque black-box models, causal reasoning illuminates the precise mechanisms driving outcomes, providing transparent and actionable insights. When understanding the why behind an event is equally important as predicting the what, causal reasoning offers superior clarity and reliability.
By understanding causation, we transcend mere observation to gain predictive and interventional power over complex systems.
Causely is built on a deep-rooted foundation of causal reasoning. The founding team, pioneers in applying this science to IT operations, led the charge at System Management Arts (SMARTS). SMARTS revolutionized root cause analysis for distributed systems and networks, empowering more than 1500 global enterprises and service providers to ensure the reliability of mission-critical IT services. Their groundbreaking work earned industry accolades and solidified SMARTS as a leader in the field.
Explainability is a cornerstone of the Causal Reasoning Platform from Causely. The company is committed to transparently communicating how its software arrives at its conclusions, encompassing both the underlying mechanisms used in causal reasoning and the practical application within organizations’ operational workflows.
Explainable Operations: Enhancing Workflow Efficiency for Continuous Application Reliability
Causely converts raw observability data into actionable insights by pinpointing root causes and their cascading effects within complex application and infrastructure environments.
Today incident response is a complex, resource-intensive process of triage and troubleshooting that often diverts critical teams from strategic initiatives. This reactive approach hampers innovation, erodes efficiency, and can lead to substantial financial losses and reputational damage, when application reliability is not continuously assured.
The complex, interconnected nature of cloud-native environments magnifies this impact, leading to cascading disruptions that can propagate across services when root cause problems occur.
By automating the identification and explanation of cause-and-effect relationships, Causely accelerates incident response. Relevant teams responsible for root cause problems receive immediate alerts, complete with detailed explanations of the cause & effect, empowering them to prioritize remediation based on impact. Simultaneously, teams whose services are impacted gain insights into the root causes and who is responsible for resolution, enabling proactive risk mitigation without the need for extensive troubleshooting.
For certain types of root cause problems, it may also be possible to automate remediation without human intervention.
By maintaining historical records of the cause and effect of past Root Cause problems and identifying recurring patterns, Causely enables reliability engineering teams to anticipate future potential problems and implement targeted mitigation strategies.
Causely’s ability to explain the effect of potential degradations and failures before they even happen – through “what if” analysis – also empowers reliability engineering teams to identify single points of failure and changes in load patterns, and to assess the impact of planned changes on related applications, business processes, and customers.
The result? Through explainability organizations can dramatically reduce MTTR, improve business continuity, and increase cycles for development and innovation. Causely turns reactive troubleshooting into proactive prevention, ensuring application reliability can be continuously assured. This short video tells the story.
Unveiling Causely: How Our Platform Delivers Actionable Insights
Given the historical challenges and elusive nature of automating root cause analysis in IT operations, customer skepticism is warranted in this field. This problem has long been considered the “holy grail” of the industry, with countless vendors falling short of delivering a viable solution.
As a consequence Causely has found that it is important to prioritize transparency and explainability around how our Causal Reasoning Platform works and produces the results described earlier.
Much has been written about this — learn more here.
This is an approach which is grounded in sound scientific principles, making it both effective and comprehensible.
Beyond understanding how the platform works, customers also value transparency around data handling. In this regard our approach to data management offers unique benefits in terms of data privacy and data management cost savings. You can learn more about this here.
In Summary
Explainability is the cornerstone of Causely’s mission. As we advance our technology, our dedication to transparency and understanding will only grow stronger. Don’t hesitate to visit the website or reach out to me or members of the Causely team to learn more about the approach and to experience Causely firsthand.
Understanding the Kubernetes Readiness Probe: A Tool for Application Health
Application reliability is a dynamic challenge, especially in cloud-native environments. Ensuring that your applications are running smoothly is make-or-break when it comes to user experience. One essential tool for this is the Kubernetes readiness probe. This blog will explore the concept of a readiness probe, explaining how it works and why it’s a key component for managing your Kubernetes clusters.
What is a Kubernetes Readiness Probe?
A readiness probe is essentially a check that Kubernetes performs on a container to ensure that it is ready to serve traffic. This check is needed to prevent traffic from being directed to containers that aren’t fully operational or are still in the process of starting up.
By using readiness probes, Kubernetes can manage the flow of traffic to only those containers that are fully prepared to handle requests, thereby improving the overall stability and performance of the application.
Readiness probes also help in preventing unnecessary disruptions and downtime by only including healthy containers in the load balancing process. This is an essential part of a comprehensive SRE operational practice for maintaining the health and efficiency of your Kubernetes clusters.
How Readiness Probes Work
Readiness probes are configured in the pod specification and can be of three types:
- HTTP Probes: These probes send an HTTP request to a specified endpoint. If the response is successful, the container is considered ready.
- TCP Probes: These probes attempt to open a TCP connection to a specified port. If the connection is successful, the container is considered ready.
- Command Probes: These probes execute a command inside the container. If the command returns a zero exit status, the container is considered ready.
Below is an example demonstrating how to configure a readiness probe in a Kubernetes deployment:
apiVersion: v1
kind: Pod
metadata:
  name: readiness-example
spec:
  containers:
  - name: readiness-container
    image: your-image
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
This YAML file defines a Kubernetes Pod with a readiness probe, configured with the following parameters:
- apiVersion: v1 – Specifies the API version used for the configuration.
- kind: Pod – Indicates that this configuration is for a Pod.
- metadata:
  - name: readiness-example – Sets the name of the Pod to “readiness-example.”
- spec – Describes the desired state of the Pod.
  - containers:
    - name: readiness-container – Names the container within the Pod as “readiness-container.”
    - image: your-image – Specifies the container image to use, named “your-image.”
    - readinessProbe – Configures a readiness probe to check if the container is ready to receive traffic.
      - httpGet:
        - path: /healthz – Sends an HTTP GET request to the /healthz path.
        - port: 8080 – Targets port 8080 for the HTTP GET request.
      - initialDelaySeconds: 5 – Waits 5 seconds before performing the first probe after the container starts.
      - periodSeconds: 10 – Repeats the probe every 10 seconds.
This relatively simple configuration creates a Pod named “readiness-example” with a single container running “your-image.” It includes a readiness probe that checks the /healthz endpoint on port 8080, starting 5 seconds after the container launches and repeating every 10 seconds to determine if the container is ready to accept traffic.
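For completeness, the same probe could instead be expressed as a TCP or command (exec) check. The snippet below is only a sketch – the port and the /bin/check-ready.sh script are hypothetical placeholders for whatever your container actually exposes:

readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

Or, using a command probe:

readinessProbe:
  exec:
    command:
    - /bin/check-ready.sh
  initialDelaySeconds: 5
  periodSeconds: 10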
Importance of Readiness Probes
The goal is to make sure you can prevent traffic from being directed to a container that is still starting up or experiencing issues. This helps maintain the overall stability and reliability of your application by only sending traffic to containers that are ready to handle it.
Readiness probes can be used in conjunction with liveness probes to further enhance the health checking capabilities of your containers.
Readiness probes are important for a few reasons:
- Prevent traffic to unready pods: They ensure that only ready pods receive traffic, preventing downtime and errors.
- Facilitate smooth rolling updates: By making sure new pods are ready before sending traffic to them.
- Enhanced application stability: They can help with the overall stability and reliability of your application by managing traffic flow based on pod readiness.
Remember that readiness probes only check for availability; they don’t understand why a container is not available. A readiness probe failure is a symptom that can stem from many root causes. It’s important to know their purpose and limitations before you rely too heavily on them for overall application health.
Related: Causely solves the root cause analysis problem, applying Causal AI to DevOps. Learn about our Causal Reasoning Platform.
Best Practices for Configuring Readiness Probes
To make the most of Kubernetes readiness probes, consider the following practices:
- Define Clear Health Endpoints: Ensure your application exposes a clear and reliable health endpoint.
- Set Appropriate Timing: Configure initialDelaySeconds and periodSeconds based on your application’s startup and response time.
- Monitor and Adjust: Continuously monitor the performance and adjust the probe configurations as needed.
For example, if your application requires a database connection to be fully established before it can serve requests, you can set up a readiness probe that checks for the availability of the database connection.
By configuring the initialDelaySeconds and periodSeconds appropriately, you can ensure that your application is only considered ready once the database connection is fully established. This will help prevent any potential issues or errors that may occur if the application is not fully prepared to handle incoming requests.
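For illustration only, assuming a PostgreSQL dependency and a container image that ships the pg_isready client, an exec-style readiness probe for that scenario could look roughly like this (the host and timings are placeholders):
readinessProbe:
  exec:
    command: ["pg_isready", "-h", "db.example.internal", "-p", "5432"]   # hypothetical database host
  initialDelaySeconds: 20    # allow time for the connection pool to be established
  periodSeconds: 10
  failureThreshold: 3        # marked unready after roughly 30 seconds of consecutive failures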
Limitations of Readiness Probes
Readiness probes are handy, but they only check for the availability of a specific resource and do not take into account the overall health of the application. This means that even if the database connection is established, there could still be other issues within the application that may prevent it from properly serving requests.
Additionally, readiness probes do not automatically restart the application if it fails the check, so it is important to monitor the results and take appropriate action if necessary. Readiness probes are still a valuable tool for ensuring the stability and reliability of your application in a Kubernetes environment, even with these limitations.
Troubleshooting Kubernetes Readiness Probes: Common Issues and Solutions
Slow Container Start-up
Problem: If your container’s initialization tasks exceed the initialDelaySeconds of the readiness probe, the probe may fail.
Solution: Increase the initialDelaySeconds to give the container enough time to start and complete its initialization. Additionally, optimize the startup process of your container to reduce the time required to become ready.
Unready Services or Endpoints
Problem: If your container relies on external services or dependencies (e.g., a database) that aren’t ready when the readiness probe runs, it can fail. Race conditions may also occur if your application’s initialization depends on external factors.
Solution: Ensure that external services or dependencies are ready before the container starts. Use tools like Helm Hooks or init containers to coordinate the readiness of these components with your application. Implement synchronization mechanisms in your application to handle race conditions, such as using locks, retry mechanisms, or coordination with external components.
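For example, here is a minimal sketch of an init container that blocks application startup until a hypothetical database Service name resolves in cluster DNS (the image and Service name are placeholders):
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36      # placeholder utility image
    command: ["sh", "-c", "until nslookup db-service; do echo waiting for db-service; sleep 2; done"]
  containers:
  - name: app
    image: your-image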
Misconfiguration of the Readiness Probe
Problem: Misconfigured readiness probes, such as incorrect paths or ports, can cause probe failures.
Solution: Double-check the readiness probe configuration in your Pod’s YAML file. Ensure the path, port, and other parameters are correctly specified.
Application Errors or Bugs
Problem: Application bugs or issues, such as unhandled exceptions, misconfigurations, or problems with external dependencies, can prevent it from becoming ready, leading to probe failures.
Solution: Debug and resolve application issues. Review application logs and error messages to identify the problems preventing the application from becoming ready. Fix any bugs or misconfigurations in your application code or deployment.
Insufficient Resources
Problem: If your container is running with resource constraints (CPU or memory limits), it might not have the resources it needs to become ready, especially under heavy loads.
Solution: Adjust the resource limits to provide the container with the necessary resources. You may also need to optimize your application to use resources more efficiently.
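As a sketch, resource requests and limits sit on the container spec alongside the probes; the values below are placeholders that should be sized from observed usage:
containers:
- name: readiness-container
  image: your-image
  resources:
    requests:
      cpu: "250m"            # placeholder values; tune to your workload
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"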
Conflicts Between Probes
Problem: Misconfigured liveness and readiness probes might interfere with each other, causing unexpected behavior.
Solution: Ensure that your probes are configured correctly and serve their intended purposes. Make sure that the settings of both probes do not conflict with each other.
Cluster-Level Problems
Problem: Kubernetes cluster issues, such as kubelet or networking problems, can result in probe failures.
Solution: Monitor your cluster for any issues or anomalies and address them according to Kubernetes best practices. Ensure that the kubelet and other components are running smoothly.
These are common issues to keep an eye out for. Watch for problems that the readiness probes are not surfacing or that might be preventing them from acting as expected.
Summary
Ensuring that your applications are healthy and ready to serve traffic is necessary for maximizing uptime. The Kubernetes readiness probe is one helpful tool for managing Kubernetes clusters; it should be a part of a comprehensive Kubernetes operations plan.
Readiness probes can be configured in pod specifications and can be HTTP, TCP, or command probes. They help prevent disruptions and downtime by ensuring only healthy containers are included in the load-balancing process.
They also prevent traffic from reaching unready pods, which enables smooth rolling updates and enhances application stability. Good practice for readiness probes includes defining clear health endpoints, setting appropriate timing, and monitoring and adjusting configurations over time.
Don’t forget that readiness probes have clear limitations, as they only check for the availability of a specific resource and do not automatically restart the application if it fails the check. A Kubernetes readiness probe failure is merely a symptom that can be attributed to many root causes. To automate root cause analysis across your entire Kubernetes environment, check out Causely for Cloud-Native Applications.
Related resources
- Webinar: What is Causal AI and why do DevOps teams need it?
- Blog: Bridging the gap between observability and automation with causal reasoning
- Product Overview: Causely for Cloud-Native Applications
Beyond the Blast Radius: Demystifying and Mitigating Cascading Microservice Issues
Microservices architectures offer many benefits, but they also introduce new challenges. One such challenge is the cascading effect of simple failures. A seemingly minor issue in one microservice can quickly snowball, impacting other services and ultimately disrupting user experience.
The Domino Effect: From Certificate Expiry to User Frustration
Imagine a scenario where a microservice’s certificate expires. This seemingly trivial issue prevents it from communicating with others. This disruption creates a ripple effect:
- Microservice Certificate Expiry: The seemingly minor issue is a certificate going past its expiration date.
- Communication Breakdown: This expired certificate throws a wrench into the works, preventing the microservice from securely communicating with other dependent services. It’s like the microservice is suddenly speaking a different language that the others can’t understand.
- Dependent Service Unavailability: Since the communication fails, dependent services can no longer access the data or functionality provided by the failing microservice. Imagine a domino not receiving the push because the first one didn’t fall.
- Errors and Outages: This lack of access leads to errors within dependent services. They might malfunction or crash entirely, causing outages – the domino effect starts picking up speed.
- User Frustration (500 Errors): Ultimately, these outages translate to error messages for the end users. They might see cryptic “500 errors” or experience the dreaded “service unavailable” message – the domino effect reaches the end user, who experiences the frustration.
The Challenge: Untangling the Web of Issues
Cascading failures pose a significant challenge due to the following reasons:
- Network Effect: The root cause gets obscured by the chain reaction of failures, making it difficult to pinpoint the source.
- Escalation Frenzy: Customer complaints trigger incident tickets, leading to a flurry of investigations across multiple teams (DevOps Teams, Service Desk, customer support, etc.).
- Resource Drain: Troubleshooting consumes valuable time from developers, SREs, and support personnel, diverting them from core tasks.
- Hidden Costs: The financial impact of lost productivity and customer dissatisfaction often goes unquantified.
Beyond Certificate Expiry: The Blast Radius of Microservice Issues
Certificate expiry is just one example. Other issues with similar cascading effects include:
- Noisy Neighbors: A resource-intensive microservice can degrade performance for others sharing the same resources (databases, applications), which in turn impacts other services that depend on them.
- Code Bugs: Code errors within a microservice can lead to unexpected behavior and downstream impacts.
- Communication Bottlenecks: Congestion or malfunctioning in inter-service communication channels disrupts data flow and service availability.
- Third-Party Woes: Outages or performance issues in third-party SaaS services integrated with your microservices can create a ripple effect.
Platform Pain Points: When Infrastructure Falters
The impact can extend beyond individual microservices. Platform-level issues can also trigger cascading effects:
- Load Balancer Misconfigurations: Incorrectly configured load balancers can disrupt service delivery to clients and dependent services.
- Container Cluster Chaos: Problems within Kubernetes pods or nodes can lead to application failures and service disruptions.
Blast Radius and Asynchronous Communication: The Data Lag Challenge
Synchronous communication provides immediate feedback, allowing the sender to know if the message was received successfully. In contrast, asynchronous communication introduces a layer of complexity:
- Unpredictable Delivery: Messages may experience varying delays or, in extreme cases, be lost entirely. This lack of real-time confirmation makes it difficult to track the message flow and pinpoint the exact location of a breakdown.
- Limited Visibility: Unlike synchronous communication where a response is readily available, troubleshooting asynchronous issues requires additional effort. You may only have user complaints as a starting point, which can be a delayed and incomplete indicator of the problem.
The root cause could stem from several factors that result in delays or lost messages in asynchronous communication:
Microservice Issues:
- Congestion: A microservice overloaded with tasks may struggle to process or send messages promptly, leading to delays.
- Failures: A malfunctioning microservice may be entirely unable to process or send messages, disrupting the flow of data.
Messaging Layer Issues:
Problems within the messaging layer itself can also cause disruptions:
- Congestion: Congestion in message brokers, clusters, or cache instances can lead to delays in message delivery.
- Malfunctions: Malfunctions within the messaging layer can cause messages to be lost entirely.
The Cause & Effect Engine: Unveiling the Root of Microservice Disruptions in Real Time
So what can we do to tame this chaos?
Imagine a system that acts like a detective for your application services. It understands all of the cause-and-effect relationships within your complex architecture. It does this by automatically discovering and analyzing your environment to maintain an up-to-date picture of services, infrastructure, and dependencies, and from this it computes a dynamic knowledge base of root causes and the effects they will have.
This knowledge is automatically computed in a Causality Graph that depicts all of the relationships between the potential root causes that could occur and the symptoms they may cause. In an environment with thousands of entities, it might represent hundreds of thousands of problems and the set of symptoms each one will cause.
A separate data structure, called a “Codebook,” is derived from this. This table is like a giant symptom checker, mapping all the potential root causes (problems) to the symptoms (errors) they might trigger.
Each root cause in the Codebook has a unique signature, a vector of m probabilities, that identifies it. Using the Codebook, the system quickly searches for and pinpoints root causes based on the observed symptoms.
The Causality Graph and Codebook are constantly updated as application services and infrastructure evolve. This ensures the knowledge in the Causality Graph and Codebook stays relevant and adapts to changes.
These powerful capabilities enable:
- Machine Speed Root Cause Identification: Unlike traditional troubleshooting, the engine can pinpoint the culprit in real time, saving valuable time and resources.
- Prioritization Based on Business Impact: By revealing the effects of specific root causes on related services, problem resolution can be prioritized.
- Reduced Costs: Faster resolution minimizes downtime and associated costs.
- Improved Collaboration: Teams responsible for failing services receive immediate notifications and a visualized Causality Graph explaining the issue’s origin and impact. This streamlines communication and prioritizes remediation efforts based on the effects of the root cause problem.
- Automated Actions: In specific cases, the engine can even trigger automated fixes based on the root cause type.
- Empowered Teams: Teams affected by the problem are notified but relieved of troubleshooting burdens. They can focus on workarounds or mitigating downstream effects, enhancing overall system resilience.
The system represents a significant leap forward in managing cloud native applications. By facilitating real-time root cause analysis and intelligent automation, it empowers teams to proactively address disruptions and ensure the smooth operation of their applications.
The knowledge in the system is not just relevant to optimizing the incident response process. It is also valuable for performing “what if” analysis to understand the impact that future changes and planned maintenance will have, so that steps can be taken to proactively understand and mitigate the risks of these activities.
Through its understanding of cause and effect, it can also play a role in business continuity planning, enabling teams to identify single points of failure in complex services to improve service resilience.
The system can also be used to streamline the process of incident postmortems because it contains the prior history of previous root cause problems, why they occurred and what the effect was — their causality. This avoids the complexity and time involved in reconstructing what happened and enables mitigating steps to be taken to avoid recurrences.
The Types of Root Cause Problems & Their Effects
The system computes its causal knowledge based on Causal Models. These describe how root cause problems propagate symptoms along relationships to dependent entities, independently of any given environment. This knowledge is instantiated through service and infrastructure auto-discovery to create the Causality Graph and Codebook.
Examples of the types of root cause problems modeled in the system include the noisy neighbors, congestion, malfunctions, and misconfigurations described in the scenarios above.
Science Fiction or Reality
The inventions behind the system go back to the 1990s and were groundbreaking at the time; they still are. They were successfully deployed, at scale, by some of the largest telcos, system integrators, and Fortune 500 companies in the early 2000s. You can read about the original inventions here.
Today the problems that these inventions set out to address have not changed and the adoption of cloud-native technologies has only heightened the need for a solution. As real-time data has become pervasive in today’s application architectures, every second of service disruption is a lost business opportunity.
Causely has engineered these inventions into a modern, commercially available platform to address the challenges of assuring continuous application reliability in the cloud-native world. The founding engineering team at Causely created the technology behind two high-growth companies: SMARTS and Turbonomic.
If you would like to learn more about this, don’t hesitate to reach out to me directly or comment here.
Using OpenTelemetry and the OTel Collector for Logs, Metrics, and Traces
OpenTelemetry (fondly known as OTel) is an open-source project that provides a unified set of APIs, libraries, agents, and instrumentation to capture and export logs, metrics, and traces from applications. The project’s goal is to standardize observability across various services and applications, enabling better monitoring and troubleshooting.
Our team at Causely has adopted OpenTelemetry within our own platform, which prompted us to share a production-focused guide. Our goal is to help developers, DevOps engineers, software engineers, and SREs understand what OpenTelemetry is, its core components, and a detailed look at the OpenTelemetry Collector (OTel Collector). This background will help you use OTel and the OTel Collector as part of a comprehensive strategy to monitor and observe applications.
What Data Does OpenTelemetry Collect?
There are 3 types of data that are gathered by OpenTelemetry using the OTel Collector: logs, metrics, and traces.
Logs
Logs are records of events that occur within an application. They provide a detailed account of what happened, when it happened, and any relevant data associated with the event. Logs are helpful for debugging and understanding the behavior of applications.
OpenTelemetry collects and exports logs, providing insights into events and errors that occur within the system. For example, if a user reports a slow response time in a specific feature of the application, engineers can use OpenTelemetry logs to trace back the events leading up to the reported issue.
Metrics
Metrics are the quantitative data that measure the performance and health of an application. Metrics help in tracking system behavior and identifying trends over time. OpenTelemetry collects metrics data, which helps in tracking resource usage, system performance, and identifying anomalies.
For instance, if a spike in CPU usage is detected using OpenTelemetry metrics, engineers can investigate the potential issue using the OTel data collected and make necessary adjustments to optimize performance.
Developers use OpenTelemetry metrics to see granular resource utilization data, which helps understand how the application is functioning under different conditions.
Traces
Traces provide a detailed view of request flows within a distributed system. Traces help understand the execution path, diagnose application behaviors, and see the interactions between different services.
For example, if a user reports slow response times on a website, developers can use trace data to help better identify which service is experiencing issues. Traces can also help in debugging issues such as failed requests or errors by providing a step-by-step view of how requests are processed through the system.
Introduction to OTel Collector
You can deploy the OTel Collector as a standalone agent or as a sidecar alongside your application. The OTel Collector also includes some helpful features for sampling, filtering, and transforming data before sending it to a monitoring backend.
How it Works
The OTel Collector works by receiving telemetry data from many different sources, processing it based on configured pipelines, and exporting it to chosen backends. This modular architecture allows for customization and scalability.
The OTel Collector acts as a central data pipeline for collecting, processing, and exporting telemetry data (metrics, logs, traces) within an observability stack.
Here’s a technical breakdown:
Data Ingestion:
- Leverages pluggable receivers for specific data sources (e.g., Redis receiver, MySQL receiver).
- Receivers can be configured for specific endpoints, authentication, and data collection parameters.
- Supports various data formats (e.g., native application instrumentation libraries, vendor-specific formats) through receiver implementations.
Data Processing:
- Processors can be chained to manipulate the collected data before export.
- Common processing functions include:
- Batching: Improves efficiency by sending data in aggregates.
- Filtering: Selects specific data based on criteria.
- Sampling: Reduces data volume by statistically sampling telemetry.
- Enrichment: Adds contextual information to the data.
Data Export:
- Utilizes exporters to send the processed data to backend systems.
- Exporters are available for various observability backends (e.g., Jaeger, Zipkin, Prometheus).
- Exporter configurations specify the destination endpoint and data format for the backend system.
Internal Representation:
- Leverages OpenTelemetry’s internal Protobuf data format (pdata) for efficient data handling.
- Receivers translate source-specific data formats into pdata format for processing.
- Exporters translate pdata format into the backend system’s expected data format.
Scalability and Configurability:
- Designed for horizontal scaling by deploying multiple collector instances.
- Configuration files written in YAML allow for dynamic configuration of receivers, processors, and exporters.
- Supports running as an agent on individual hosts or as a standalone service.
The OTel Collector is format-agnostic and flexible, built to work with various backend observability systems.
Setting up the OpenTelemetry (OTel) Collector
Starting with OpenTelemetry for your new system is a straightforward process that takes only a few steps:
- Download the OTel Collector: Obtain the latest version from the official OpenTelemetry website or your preferred package manager.
- Configure the OTel Collector: Edit the configuration file to define data sources and export destinations.
- Run the OTel Collector: Start the Collector to begin collecting and processing telemetry data.
Keep in mind that the example we will show here is relatively simple. A large scale production implementation will require fine-tuning to ensure optimal results. Make sure to follow your OS-specific instructions to deploy and run the OTel collector.
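To set expectations, a minimal collector configuration follows the receivers / processors / exporters / service structure described earlier. The sketch below is a simplified, hypothetical config.yaml; exact component names and fields can vary between collector versions, and the backend endpoint is a placeholder:
receivers:
  otlp:                        # accepts OTLP data from instrumented applications
    protocols:
      grpc:
      http:
processors:
  batch:                       # batches telemetry before export for efficiency
exporters:
  otlp:
    endpoint: backend.example.com:4317    # placeholder backend endpoint
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]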
Next, we need to configure some exporters for your application stack.
Integration with Popular Tools and Platforms
Let’s use an example system running a multi-tier web application using NGINX, MySQL, and Redis. Each source platform will have some application-specific configuration parameters.
Configuring Receivers
redisreceiver:
- Replace receiver_name with redisreceiver.
- Set endpoint to the port where your Redis server is listening (default: 6379).
- You can configure additional options like authentication and collection intervals in the receiver configuration. Refer to the official documentation for details.
mysqlreceiver:
- Replace receiver_name with mysqlreceiver.
- Set endpoint to the connection string for your MySQL server (e.g., mysql://user:password@localhost:3306/database).
- Similar to the Redis receiver, you can configure authentication and collection intervals. Refer to the documentation for details.
nginxreceiver:
- Replace receiver_name with nginxreceiver.
- No endpoint configuration is needed, as it scrapes metrics from the NGINX process.
- You can configure which metrics to collect and scraping intervals in the receiver configuration. Refer to the documentation for details.
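Put together, the receivers section for this stack might look roughly like the following. The field names are indicative and should be checked against the collector-contrib documentation for your version; the endpoints and credentials are placeholders:
receivers:
  redis:
    endpoint: "localhost:6379"     # default Redis port
    collection_interval: 10s
  mysql:
    endpoint: "localhost:3306"     # or a connection string, per the documentation
    username: otel                 # placeholder credentials
    password: ${env:MYSQL_PASSWORD}
  nginx: {}                        # scrapes the local NGINX process; add an endpoint if your version requires one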
The OpenTelemetry Collector can export data to multiple providers including Prometheus, Jaeger, Zipkin, and, of course, Causely. This flexibility allows users to leverage their existing tools while adopting OpenTelemetry.
Configuring Exporters
Replace exporter_name with the actual exporter type for your external system. Here are some common options:
- jaeger for a Jaeger backend
- zipkin for a Zipkin backend
- otlp/causely for the Causely backend
- There are exporters for many other systems as well. Refer to the documentation for a complete list.
Set endpoint to the URL of the external system where you want to send the collected telemetry data. You might need to configure additional options specific to the chosen exporter (e.g., authentication for Jaeger).
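Wiring these together, a hedged sketch of the exporters and service sections might look like this, with the Causely endpoint as a placeholder (a Jaeger or Zipkin exporter would be configured the same way):
exporters:
  otlp/causely:
    endpoint: "collector.causely.internal:4317"   # placeholder endpoint for the Causely backend
service:
  pipelines:
    metrics:
      receivers: [redis, mysql, nginx]
      exporters: [otlp/causely]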
There is also a growing list of supporting vendors who consume OpenTelemetry data.
Conclusion
OpenTelemetry provides a standardized approach to collecting and exporting logs, metrics, and traces. Implementing OpenTelemetry and the OTel Collector offers a scalable and flexible solution for managing telemetry data, making it a popular and effective tool for modern applications.
You can use OpenTelemetry as part of your monitoring and observability practice in order to gather data that can help drive better understanding of the state of your applications. The most valuable part of OpenTelemetry is the ability to ingest the data for deeper analysis.
How Causely Works with OpenTelemetry
At Causely, we leverage OpenTelemetry as one of many data sources to assure application reliability for our clients. OpenTelemetry data is ingested by our Causal Reasoning Platform, which detects and remediates application failures in complex cloud-native environments. Causely is designed to be an intuitive, automated way to view and maintain the health of your applications and to eliminate the need for manual troubleshooting.
Related Resources
- Read the blog: Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry and Causal AI
- Watch the video: Cracking the Code of Complex Tracing Data: How Causely Uses OpenTelemetry
- Read the blog: Bridging the Gap Between Observability and Automation with Causal Reasoning
Causely Overview
Causely assures continuous reliability of cloud applications. Causely for Cloud-Native Applications, built on our Causal Reasoning Platform, automatically captures cause and effect relationships based on real-time, dynamic data across the entire application environment. This means that we can detect, remediate and even prevent problems that result in service impact. With Causely, Dev and Ops teams are better equipped to plan for ongoing changes to code, configurations or load patterns, and they stay focused on achieving service-level and business objectives instead of firefighting.
Watch the video to see Causely in action, or take the product for a self-guided tour.
The State of AI in Observability Today
Real-time Data & Modern UXs: The Power and the Peril When Things Go Wrong
Imagine a world where user experiences adapt to you in real time. Personalized recommendations appear before you even think of them, updates happen instantaneously, and interactions flow seamlessly. This captivating world is powered by real-time data, the lifeblood of modern applications.
But this power comes at a cost. The intricate architecture behind real-time services can make troubleshooting issues a nightmare. Organizations that rely on real-time data to deliver products and services face a critical challenge: ensuring data is delivered fresh and on time. Missing data or delays can cripple the user experience and demand resolutions within minutes, if not seconds.
This article delves into the world of real-time data challenges. We’ll explore the business settings where real-time data is king, highlighting the potential consequences of issues. Then I will introduce a novel approach that injects automation into the troubleshooting process, saving valuable time and resources, but most importantly mitigating the business impact when problems arise.
Lags & Missing Data: The Hidden Disruptors Across Industries
Lags and missing data can be silent assassins, causing unseen disruptions that ripple through various industries. Let’s dig into the specific ways these issues can impact different business sectors.
Financial markets
- Trading: In high-frequency trading, even milliseconds of delay can mean the difference between a profitable and losing trade. Real-time data on market movements is crucial for making informed trading decisions.
- Fraud detection: Real-time monitoring of transactions allows financial institutions to identify and prevent fraudulent activity as it happens. Delays in data can give fraudsters a window of opportunity.
- Risk management: Real-time data on market volatility, creditworthiness, and other factors helps businesses assess and manage risk effectively. Delays can lead to inaccurate risk assessments and potentially large losses.
Supply chain management
- Inventory management: Real-time data on inventory levels helps businesses avoid stockouts and optimize inventory costs. Delays can lead to overstocking or understocking, impacting customer satisfaction and profitability.
- Logistics and transportation: Real-time tracking of shipments allows companies to optimize delivery routes, improve efficiency, and provide accurate delivery estimates to customers. Delays can disrupt logistics and lead to dissatisfied customers.
- Demand forecasting: Real-time data on customer behavior and sales trends allows businesses to forecast demand accurately. Delays can lead to inaccurate forecasts and production issues.
Customer service
- Live chat and phone support: Real-time access to customer data allows support agents to personalize interactions and resolve issues quickly. Delays can lead to frustration and longer resolution times.
- Social media monitoring: Real-time tracking of customer sentiment on social media allows businesses to address concerns and build brand reputation. Delays can lead to negative feedback spreading before it’s addressed.
- Personalization: Real-time data on customer preferences allows businesses to personalize website experiences, product recommendations, and marketing campaigns. Delays can limit the effectiveness of these efforts.
Manufacturing
- Machine monitoring: Real-time monitoring of machine performance allows for predictive maintenance, preventing costly downtime. Delays can lead to unexpected breakdowns and production delays.
- Quality control: Real-time data on product quality allows for immediate identification and correction of defects. Delays can lead to defective products reaching customers.
- Process optimization: Real-time data on production processes allows for continuous improvement and optimization. Delays can limit the ability to identify and address inefficiencies.
Other examples
- Online gaming: Real-time data is crucial for smooth gameplay and a fair playing field. Delays can lead to lag, disconnects, and frustration for players.
- Healthcare: Real-time monitoring of vital signs and patient data allows for faster diagnosis and treatment. Delays can have serious consequences for patient care.
- Energy management: Real-time data on energy consumption allows businesses and utilities to optimize energy use and reduce costs. Delays can lead to inefficient energy usage and higher costs.
- Cybersecurity: Real-time data is the backbone of modern cybersecurity, enabling rapid threat detection, effective incident response, and accurate security analytics. However, delays in the ability to see and understand this data can create critical gaps in your defenses. From attackers having more time to exploit vulnerabilities to outdated security controls and hindered automated responses, data lags can significantly compromise your ability to effectively combat cyber threats.
As we’ve seen, the consequences of lags and missing data can be far-reaching. From lost profits in financial markets to frustrated customers and operational inefficiencies, these issues pose a significant threat to business success. The capability to identify the root cause and impact, and to remediate issues with precision and speed, is imperative to mitigating the business impact.
Causely automatically captures cause and effect relationships based on real-time, dynamic data across the entire application environment. Request a demo to see it in action.
The Delicate Dance: A Web of Services and Hidden Culprits
Modern user experiences that leverage real-time data rely on complex chains of interdependent services – a delicate dance of microservices, databases, messaging platforms, and virtualized compute infrastructure. A malfunction in any one element can create a ripple effect, impacting the freshness and availability of data for users. This translates to frustrating delays, lags, or even complete UX failures.
Let’s delve into the hidden culprits behind these issues and see how seemingly minor bottlenecks can snowball into major UX problems:
Slowdown Domino with Degraded Microservice
- Scenario: A microservice responsible for product recommendations experiences high latency due to increased user traffic and internal performance degradation (e.g., memory leak, code inefficiency).
- Impact 1: The overloaded and degraded microservice takes significantly longer to process requests and respond to the database.
- Impact 2: The database, waiting for the slow microservice response, experiences delays in retrieving product information.
- Impact 3: Due to the degradation, the microservice might also have issues sending messages efficiently to the message queue. These messages contain updates on product availability, user preferences, or other relevant data for generating recommendations.
- Impact 4: Messages pile up in the queue due to slow processing by the microservice, causing delays in delivering updates to other microservices responsible for presenting information to the user.
- Impact 5: The cache, not receiving timely updates from the slow microservice and the message queue, relies on potentially outdated data.
- User Impact: Users experience significant delays in seeing product recommendations. The recommendations themselves might be inaccurate or irrelevant due to outdated data in the cache, hindering the user experience and potentially leading to missed sales opportunities. Additionally, users might see inconsistencies between product information displayed on different pages (due to some parts relying on the cache and others waiting for updates from the slow microservice).
Message Queue Backup
- Scenario: A sudden spike in user activity overwhelms the message queue handling communication between microservices.
- Impact 1: Messages pile up in the queue, causing delays in communication between microservices.
- Impact 2: Downstream microservices waiting for messages experience delays in processing user actions.
- Impact 3: The cache, not receiving updates from slow microservices, might provide outdated information.
- User Impact: Users experience lags in various functionalities – for example, slow loading times for product pages, delayed updates in shopping carts, or sluggish responsiveness when performing actions.
Cache Miss Cascade
- Scenario: A cache experiences a high rate of cache misses due to frequently changing data (e.g., real-time stock availability).
- Impact 1: The microservice needs to constantly retrieve data from the database, increasing the load on the database server.
- Impact 2: The database, overloaded with requests from the cache, experiences performance degradation.
- Impact 3: The slow database response times further contribute to cache misses, creating a feedback loop.
- User Impact: Users experience frequent delays as the system struggles to retrieve data for every request, leading to a sluggish and unresponsive user experience.
Kubernetes Lag
- Scenario: A resource bottleneck occurs within the Kubernetes cluster, limiting the processing power available to microservices.
- Impact 1: Microservices experience slow response times due to limited resources.
- Impact 2: Delays in microservice communication and processing cascade throughout the service chain.
- Impact 3: The cache might become stale due to slow updates, and message queues could experience delays.
- User Impact: Users experience lags across various functionalities, from slow page loads and unresponsive buttons to delayed updates in real-time data like stock levels or live chat messages.
Even with advanced monitoring tools, pinpointing the root cause of these and other issues can be a time-consuming detective hunt. The triage & troubleshooting process often requires a team effort, bringing together experts from various disciplines. Together, they sift through massive amounts of observability data – traces, metrics, logs, and the results of diagnostic tests – to piece together the evidence and draw the right conclusions so they can accurately determine cause and effect. The speed and accuracy of the process are largely determined by the skills of the people available when issues arise.
Only when the root cause is understood can the responsible team make informed decisions to resolve the problem and restore reliable service.
Transforming Incident Response: Automation of the Triage & Troubleshooting Process
Traditional methods of incident response, often relying on manual triage and troubleshooting, can be slow, inefficient, and prone to human error. This is where automation comes in, particularly with the advancements in Artificial Intelligence (AI). Specifically, a subfield of AI called Causal AI presents a revolutionary approach to transforming incident response.
Causal AI goes beyond correlation, directly revealing cause-and-effect relationships between incidents and their root causes. In an environment where services rely on real-time data and fast resolution is critical, Causal AI offers significant benefits:
- Automated Triage: Causal AI analyzes alerts and events to prioritize incidents based on severity and impact. It can also pinpoint the responsible teams, freeing resources from chasing false positives.
- Machine Speed Root Cause Identification: By analyzing causal relationships, Causal AI quickly identifies the root cause, enabling quicker remediation and minimizing damage.
- Smarter Decisions: A clear understanding of the causal chain empowers teams to make informed decisions for efficient incident resolution.
Causely is leading the way in applying Causal AI to incident response for modern cloud-native applications. Causely’s technology utilizes causal reasoning to automate triage and troubleshooting, significantly reducing resolution times and mitigating business impact. Additionally, Causal AI streamlines post-incident analysis by automatically documenting the causal chain.
Beyond reactive incident response, Causal AI offers proactive capabilities that focus on measures to reduce the probability of future incidents and service disruptions, through improved hygiene, predictions and “what if” analysis.
The solution is built for the modern world: applications that incorporate real-time data, communicate synchronously and asynchronously, and leverage modern cloud building blocks (databases, caching, messaging and streaming platforms, and Kubernetes).
This is just the beginning of the transformative impact Causal AI is having on incident response. As the technology evolves, we can expect even more advancements that will further streamline and strengthen organizations’ ability to continuously assure the reliability of applications.
If you would like to learn more about Causal AI and its applications in the world of real-time data and cloud-native applications, don’t hesitate to reach out.
You may also want to check out an article by Endre Sara which explains how Causely is using Causely to manage its own SaaS service, which is built around a real-time data architecture.
Related Resources
- Watch the on-demand webinar: What is Causal AI and why do DevOps teams need it?
- Read the blog: Bridging the gap between observability and automation with causal reasoning
- See causal AI in action: Request a demo of Causely
Crossing the Chasm, Revisited
Sometimes there’s a single book (or movie, podcast or Broadway show) that seems to define a particular time in your life. In my professional life, Geoffrey Moore’s Crossing the Chasm has always been that book. When I started my career as VP Marketing in the 1990s, this was the absolute bible for early-stage B2B startups launching new products. Fast forward to today, and people still refer to it as a touchstone. Even as go-to-market motions have evolved and become more agile and data-driven, the need to identify a beachhead market entry point and solve early-adopter pain points fully before expanding to the mainstream market has remained relevant and true. I still use the positioning framework for every new product and company.
Recently while hosting the Causely team at their beautiful new offices for our quarterly meetup, our investors at 645 Ventures gave everyone a copy of the latest edition of Crossing the Chasm. It was an opportunity for me to review the basic concepts. Re-reading it brought back years of memories of startups past and made me think about the book in a new context: how have Moore’s fundamental arguments withstood the decades of technology trends I’ve experienced personally? Specifically, what does “crossing the chasm” actually mean when new product adoption can be so different from one technology shift to another?
A Quick Refresher
One of Moore’s key insights is that innovators and early adopters are willing to try a new product and work with a new company because it meets some specific needs – innovators love being first to try cool things, and early adopters see new technology as a way to solve problems not currently being addressed by existing providers. These innovators/early adopters then share their experiences with others in their organizations and industries, who trust and respect their knowledge. This allows the company to reach a broader market over time, cross the chasm and begin adoption by the early majority. Many years can go by during this process, much venture funding will be spent, and still the company may only have penetrated a small percent of the market. Only years later (and with many twists and turns) will the company reach the late majority and finally the laggards.
The Chasm Looks Different Over Time
Netezza and Data Warehousing
I started to think about this in terms of technology shifts that I’ve lived through. Earlier in my career I had the good fortune to be part of a company that crossed the chasm: Netezza. We built a data warehousing system that was 100x the performance of existing solutions from IBM and Oracle, at half the cost. While this was clearly a breakthrough product, the data warehousing industry had not changed in any meaningful way for over a decade and the database admins who ran the existing solutions were in no rush to try something new or different, for all the usual reasons. Within 10 years we created a new category, the “data warehouse appliance.” We gained traction first with some true innovators and then with early adopters who brought the product into their data centers, proved the value and then used it more widely as a platform. However, “crossing the chasm” took many more years – we had a couple of hundred customers at the time of IPO – and only once the company was acquired by IBM did more mainstream adopters become ready to buy (since no one ever gets fired for buying IBM, etc). The product was so good that it remained in the market for over 20 years until the cloud revolution changed things, but it’s hard to argue that it ever gained broad market adoption compared with more traditional data warehouses.
Datadog and Cloud Observability
A second example, which is closer to the market my current company operates in, is Datadog in the observability space. Fueled by the cloud computing revolution (which itself was in the process of crossing the chasm when Datadog was founded in 2010), Datadog rode the new technology wave and solved old problems of IT operations management for new cloud-based applications. While this is not necessarily creating a new category, the company moved very quickly from early cloud innovators and adopters to mainstream customers, rocketing to around $1B in revenues in 10 years. What’s more impressive is that Datadog has become the de facto standard for cloud application observability; today the company has almost 30,000 customers and is still growing quickly in the “early majority” part of the observability market. Depending on which market size numbers you use, Datadog has already crossed the chasm or is well underway, with plenty of room to expand with “late majority” customers.
OpenAI and GenAI
Finally, think about the market adoption in the current GenAI revolution. 100 million users adopted ChatGPT within two months of its initial release in late 2022 and OpenAI claims that over 80% of F500 companies are already using it. No need for me to add more statistics here (e.g., comparisons vs adoptions of internet and social technologies) – it’s clear that this is one of the fastest chasm-crossings in history, although it’s not yet clear how companies plan to use the new AI products even as they adopt them. The speed and market confusion make it hard to envision what crossing the chasm will mean for mainstream adopters and how the technology will fully solve a specific set of problems.
Defining Success as You Cross
Thinking through these examples made me realize some things I hadn’t understood earlier in my career:
- It’s easy to confuse a large financial outcome (through IPO or acquisition) with “crossing the chasm”, since the assumption is often that you’ve had enough market success for the outcome. In fact, these are not necessarily related issues. It’s possible to have a large $ acquisition or even a successful IPO (as Netezza did) without having yet crossed to mainstream adoption.
- The market and technology trends that surround and support a new company and product can lead to very different experiences in crossing the chasm: You can have a breakthrough and exciting product in a slow-moving market without major technology tailwinds (e.g., data warehousing in the early 2000s) but you can also have a huge tailwind like cloud computing that drives a new product to more mainstream adoption within 10 years (e.g., Datadog’s cloud-based observability). Or you can have a hyper-growth technology shift like GenAI that shrinks the entire process into a few years, leaving the early and mainstream adopters jumbled together and trying to determine how to turn the new products into something truly useful.
- It can be hard to tell if you’ve really crossed the chasm since people think of many metrics that indicate adoption: % of customers in the total addressable market (Moore defines a bell curve with percentages for each stage, but I’ve rarely seen people use these strictly), number of monthly active users, revenue market share, penetration within enterprise accounts, etc. Also at the early majority phase, the company can see so much excitement from early customers and analysts (“We’re a leader in the Gartner Magic Quadrant!”) that founders can confuse market awareness and marketing “noise” with true adoption by customers that are waiting for more proof points and additional product capabilities that weren’t as critical for the early adopters. It’s important to keep your eye on these requirements to avoid stalling out once you’ve reached the other side of the chasm.
I would love to hear from other founders who have made this journey! Please share your thoughts on lessons learned and how you’re thinking about the chasm in the new AI-centric world.
Bridging the Gap Between Observability and Automation with Causal Reasoning
Observability has become a growing ecosystem and a common buzzword. Increasing visibility with observability and monitoring tools is helpful, but stopping at visibility isn’t enough. Observability lacks causal reasoning and relies mostly on people to connect application issues with potential causes.
Causal reasoning solves a problem that observability can’t
Combining observability with causal reasoning can revolutionize automated troubleshooting and boost application health. By pinpointing the “why” behind issues, causal reasoning reduces human error and labor.
This triggers a lot of questions from application owners and developers, including:
- What is observability?
- What is the difference between causal reasoning and observability?
- How does knowing causality increase performance and efficiency?
Let’s explore these questions to see how observability pairs with causal reasoning for automated troubleshooting and more resilient application health.
What is Observability?
Observability can be described as observing the state of a system based on its outputs. The three common sources for observability data are logs, metrics, and traces.
- Logs provide detailed records of ordered events.
- Metrics offer quantitative but unordered data on performance.
- Traces show the journey of specific requests through the system.
The goal of observability is to provide insight into system behavior and performance to help identify and resolve any issues that are happening. However, traditional monitoring tools are “observing” and reporting in silos.
“Observability is not control. Not being blind doesn’t make you smarter.” – Shmuel Kliger, Causely founder in our recent podcast interview
Unfortunately, this falls short of the goal above and requires tremendous human effort to connect alerts, logs, and anecdotal application knowledge with possible root cause issues.
For example, if a website experiences a sudden spike in traffic and starts to slow down, observability tools can show logs of specific requests and provide metrics on server response times. Furthermore, engineers digging around inside these tools may be able to piece together the flow of traffic through different components of the system to identify candidate bottlenecks.
The detailed information can help engineers identify and address the root cause of the performance degradation. But we are forced to rely on human and anecdotal knowledge to augment observability. This human touch may provide guiding information and understanding that machines alone are not able to match today, but that comes at the cost of increased labor, staff burnout, and lost productivity.
Data is not knowledge
Observability tools collect and analyze large amounts of data. This has created a new wave of challenges among IT operations teams and SREs, who are now left trying to solve a costly and complex big data problem.
The tool sprawl you experience, where each observability tool offers a unique piece of the puzzle, makes this situation worse and promotes inefficiency. For example, if an organization invests in multiple observability tools that each offer different data insights, it can create a fragmented and overwhelming system that hinders rather than enhances understanding of the system’s performance holistically.
This results in wasted resources spent managing multiple tools and an increased likelihood of errors due to the complexity of integrating and analyzing data from various sources. The resulting situation ultimately undermines the original goal of improving observability.
Data is not action
Even with a comprehensive observability practice, the fundamental issue remains: how do you utilize observability data to enhance the overall system? The problem is not about having some perceived wealth of information at your fingertips. The problem is relying on people and processes to interpret, correlate, and then decide what to do based on this data.
You need to be able to analyze and make informed decisions in order to effectively troubleshoot and assure continuous application performance. Once again, we find ourselves leaving the decisions and action plans to the team members, which is a cost and a risk to the business.
Causal reasoning: cause and effect
Analysis is essential to understanding the root cause of issues and making informed decisions to improve the overall system. By diving deep into the data and identifying patterns, trends, and correlations, organizations can proactively address potential issues before they escalate into major problems.
Causal reasoning uses available data to determine the cause of events, identifying whether code, resources, or infrastructure are the root cause of an issue. This deep analysis helps proactively and preventatively address potential problems before they escalate.
For example, a software development team may have been alerted about transaction slowness in their application. Is this a database availability problem? Have there been infrastructure issues happening that could be affecting database performance?
When you make changes based on observed behavior, it’s extremely important to consider how these changes will affect other applications and systems. Changes made without the full context are risky.
Using causal reasoning based on the observed environment shows that a recent update to the application code is causing crashes for users during specific transactions. A code update may have introduced inefficient database calls, which is affecting the performance of the application. That change can also go far beyond just the individual application.
If a company decides to update their software without fully understanding how it interacts with other systems, it could result in technical issues that disrupt operations and lead to costly downtime. This is especially challenging in shared infrastructure where noisy neighbors can affect every adjacent application.
Causal AI software can connect the problem to active symptoms while understanding the likelihood of each potential cause. This is causal reasoning in action, and it also takes into account the effect on the rest of the environment as potential resolutions are evaluated.
Now that we have causal reasoning for the true root cause, we can go even further by introducing remediation steps.
Automated remediation and system reliability
Automated remediation involves the real-time detection and resolution of issues without the need for human intervention. Automated remediation plays an indispensable role in reducing downtime, enhancing system reliability, and resolving issues before they affect users.
Yet, implementing automated remediation presents challenges, including the potential for unintended consequences like incorrect fixes that could worsen issues. Causal reasoning takes more information into account to drive the decision about root cause, impact, remediation options, and the effect of initiating those remediation options.
This is why a whole environment view combined with real-time causal analysis is required to be able to safely troubleshoot and take remedial actions without risk while also reducing the labor and effort required by operations teams.
Prioritizing action over visibility
Observability is a component of how we monitor and observe modern systems. Extending beyond observability with causal reasoning, impact determination, and automated remediation is the missing key to reducing human error and labor.
In order to move toward automation, you need trustworthy, data-driven decisions that are based on a real-time understanding of the impact of behavioral changes in your systems. Those decisions can be used to trigger automation and the orchestration of actions, ultimately leading to increased efficiency and productivity in operations.
Automated remediation can resolve issues before they escalate, and potentially before they occur at all. The path to automated remediation requires an in-depth understanding of the components of the system and how they behave as an overall system.
Integrating observability with automated remediation empowers organizations to boost their application performance and reliability. It’s important to assess your observability practices and incorporate causal reasoning to boost reliability and efficiency. The result is increased customer satisfaction, IT team satisfaction, and risk reduction.
Related resources
- What is causal AI and why do DevOps teams need it? Watch the webinar.
- Moving beyond traditional RCA in DevOps: Read the blog.
- Assure application reliability with Causely: See the product.
What is Causal AI & why do DevOps teams need it?
Causal AI can help IT and DevOps professionals be more productive, freeing hours of time spent troubleshooting so they can instead focus on building new applications. But when applying Causal AI to IT use cases, there are several domain-specific intricacies that practitioners and developers must be mindful of.
The relationships between application and infrastructure components are complex and constantly evolving, which means relationships and related entities are dynamically changing too. It’s important not to conflate correlation with causation, or to assume that all application issues stem from infrastructure limitations.
In this webinar, Endre Sara defines Causal AI, explains what it means for IT, and talks through specific use cases where it can help IT and DevOps practitioners be more efficient.
We’ll dive into practical implementations, best practices, and lessons learned when applying Causal AI to IT. Viewers will leave with tangible ideas about how Causal AI can help them improve productivity and concrete next steps for getting started.
Tight on time? Check out these highlights
- What is root cause and what is it not? Endre defines what we mean by “root cause” and how to know you’ve correctly identified it.
- How do you install Causely? What resources does it demand? Endre shows how easy it is.
Building Startup Culture Isn’t Like It Used To Be
When does culture get established in a startup? I’d say the company’s DNA is set during the first year or two, and the founding team should do everything possible to make this culture intentional vs a series of disconnected decisions. Over the years, I’ve seen many great startup cultures that led to successful products and outcomes (and others that were hobbled from the beginning by poor DNA). However, as we plan for our upcoming Causely quarterly team meetup in New York City, I’m struck by how things have changed in culture-building since my previous ventures.
Startup culture used to happen organically
Back in the day, we took a small office space, gathered the initial team and started designing and building. Our first few months were often in the incubator space at one of our early investors. This was a fun and formative experience, at least until we got big enough to be kicked out (“you’re funded, now get your own space!”). Sitting with a small group crowded around a table and sharing ideas with each other on all topics may not have been very comfortable or even efficient. But it did create a foundational culture based on jokes, stories and decisions we would refer back to for years to come. Also, it established the extremely open and non-hierarchical cultural norms we wanted to encourage as we added people.
Once we hit initial critical mass and needed more space for breakouts or private discussions, it was off to the Boston real estate market to see what could possibly be both affordable and reasonable for commutes. The more basic the space, the better in many ways, since it emphasized the need to run lean and spend money only on the things that mattered – hiring, engineering tools, early sales and marketing investments, etc. But most important was to spend on things that would encourage the team to get to know each other and build trust. Lunches, dinners, parties, local activities were all important, as was having the right snacks, drinks and games in the kitchen area to encourage people to hang out together (it’s amazing how much the snacks matter).
The new normal
Fast forward to now, post-Covid and all the changes that have occurred in office space and working remotely. Causely is a remote-from-birth company, with people scattered around the US and a couple internationally. I would never have considered building a company this way before Covid, but when Shmuel and I decided to start the company, it just didn’t seem that big an issue anymore. We committed ourselves to making the extra effort required to build culture remotely, and talked about it frequently with the early team and investors.
PS: We’re hiring! Want to help shape the Causely culture? Check out our open roles.
In my experiences hanging out with the local Boston tech community and hearing stories from other entrepreneurs, I’ve noticed some of the following trends (which I believe are typical for the startup world):
- Most companies have one or more offices that people come to weekly, but not daily; attendance varies by team and is tied to days of the week that the team is gathering for meetings or planning. Peak days are Tues-Thurs but even then, attendance may vary widely.
- Senior managers show up more frequently to encourage their teams to come in, but they don’t typically enforce scheduling.
- The office has become more a place to build social and mentoring relationships and less about getting work done, which may honestly be more efficient from home.
- Employees like to come in, and more junior staff in particular benefit from in-person interaction with peers and managers, as well as having a separate workspace from their living space. But the flexibility of working remotely is very hard to give up and is something people value.
- Gathering the entire company together regularly (and smaller groups in local offices/meetups) is much more important than it used to be for creating a company-wide culture and helping people build relationships with others in different teams and functional areas.
Given this new normal, I’ve been wondering where this takes us for the next generation of startup companies. It matters to me that people have a shared sense of the company’s vision and feel bound to each other on a company mission. Without this, joining a startup loses a big element of its appeal and it becomes harder to do the challenging, creative, exhausting and sometimes nutty things it takes to launch and scale. There are only so many hours anyone can spend on Zoom before fatigue sets in. And it’s harder to have the casual and serendipitous exchanges that used to generate new ideas and energize long-running brainstorming discussions.
Know where you want to go before you start
Building culture in the current startup world requires intention. Here are some things I think are critical to doing this well. I would love to hear about things that are working for other entrepreneurs!
- Founders: spend more time sharing your vision on team calls and 1:1 with new hires – this is the “glue” that holds the company together.
- Managers: schedule more frequent open-ended, 1:1 calls to chat about what’s on people’s minds and hear ideas on any topic. Leave open blocks of time on your weekly calendar so people can “drop by” for a “visit.”
- Encourage local meetups as often as practical – make it easy for local teams to get together where and when they want.
- Invest in your all-team meetups, and make these as fun and engaging as possible. (We’ve tried packed agendas with all-day presentations and realized that this was too much scheduling). Leave time for casual hangouts and open discussions while people are working or catching up on email/Slack.
- Do even more sharing of information about company updates and priorities – there’s no way for people to hear these informally, so more communication is needed and repetition is good 🙂
- Encourage newer/younger employees to share their work and ideas with the rest of the team – it’s too easy for them to lack feedback or mentoring and to lose engagement.
- Consider what you will do in an urgent situation that requires team coordination: simulations and reviews of processes are much more important than in the past.
There’s no silver bullet to building great company culture, but instead a wide range of approaches that need to be tried and tested iteratively. These approaches also change as the company grows – building cross-functional knowledge and creativity requires all the above but even more leadership by the founders and management team (and a commitment to traveling regularly between locations to share knowledge). Recruiting, already such a critical element of building culture, now has an added dimension: will the person being hired succeed in this particular culture without many of the supporting structures they used to have? Will they thrive and help build bridges between roles and teams?
It’s easy to lose sight of the overall picture and trends amidst the day-to-day urgency, so it’s important to take a moment when you’re starting the company to actually write down what you want your company culture to be. Then check it as you grow and make updates as you see what’s working and where there are gaps. The founding team still sets the direction, but today more explicit and creative efforts are needed to stay on track and create a cultural “mesh” that scales.
Related reading
Assure application reliability with Causely
In this video, we’ll show how easy it is to continuously assure application reliability using Causely’s causal AI platform.
In a modern production microservices environment, the number of alerts from observability tooling can quickly amount to hundreds or even thousands, and it’s extremely difficult to understand how all these alerts relate to each other and to the actual root cause. At Causely, we believe these overwhelming alerts should be consumed by software, and root cause analysis should be conducted at machine speed.
Our causal AI platform automatically associates active alerts with their root cause, drives remedial actions, and enables review of historical problems as well. This information streamlines post-mortem analysis, frees DevOps time from complex, manual processes, and helps IT teams plan for upcoming changes that will impact their environment.
Causely installs in minutes and is SOC 2 compliant. Share your troubleshooting stories below or request a live demo – we’d love to see how Causely can help!
Cause and Effect: Solving the Observability Conundrum
The pressure on application teams has never been greater. Whether for Cloud-Native Apps, Hybrid Cloud, IoT, or other critical business services, these teams are accountable for solving problems quickly and effectively, regardless of growing complexity. The good news? There’s a whole new array of tools and technologies to help enable application monitoring and troubleshooting. Observability vendors are everywhere, and the maturation of machine learning is changing the game. The bad news? It’s still largely up to these teams to put it all together. Check out this episode of InsideAnalysis to learn how Causal AI can solve this challenge. As the name suggests, this technology focuses on extracting signal from the noise of observability streams in order to dynamically determine root cause, and even fix mistakes automatically.
Tune in to hear Host Eric Kavanagh interview Ellen Rubin of Causely, as they explore how this fascinating new technology works.
Fools Gold or Future Fixer: Can AI-powered Causality Crack the RCA Code for Cloud Native Applications?
The idea of applying AI to determine causality in an automated Root Cause Analysis solution sounds like the Holy Grail, but it’s easier said than done. There’s a lot of misinformation surrounding RCA solutions. This article cuts through the confusion and provides a clear picture. I will outline the essential functionalities needed for automated root cause analysis. Not only will I define these capabilities, I will also showcase some examples to demonstrate their impact.
By the end, you’ll have a clearer understanding of what a robust RCA solution powered by causal AI can offer and how it can empower your IT team to better navigate the complexities of your cloud-native environment and, most importantly, dramatically reduce MTTx.
The Rise (and Fall) of the Automated Root Cause Analysis Holy Grail
Modern organizations are tethered to technology. IT systems, once monolithic and predictable, have fractured into a dynamic web of cloud-native applications. This shift towards agility and scalability has come at a cost: unprecedented complexity.
Troubleshooting these intricate ecosystems is a constant struggle for DevOps teams. Pinpointing the root cause of performance issues and malfunctions can feel like navigating a labyrinth – a seemingly endless path of interconnected components, each with the potential to be the culprit.
For years, automating Root Cause Analysis (RCA) has been the elusive “Holy Grail” for service assurance, as the business consequences of poorly performing systems are undeniable, especially as organizations become increasingly reliant on digital platforms.
Despite its importance, commercially available solutions for automated RCA remain scarce. While some hyperscalers and large enterprises have the resources and capital to attempt to develop in-house solutions to address the challenge (like Capital One’s example), these capabilities are out of reach for most organizations.
See how Causely can help your organization eliminate human troubleshooting. Request a demo of the Causal AI platform.
Beyond Service Status: Unraveling the Cause-and-Effect Relations in Cloud Native Applications
Highly distributed systems, regardless of technology, are vulnerable to failures that cascade and impact interconnected components. Cloud-native environments, due to their complex web of dependencies, are especially prone to this domino effect. Imagine a single malfunction in a microservice triggering a chain reaction that disrupts related microservices. Similarly, a database issue can ripple outwards, affecting its clients and, in turn, everything that relies on them.
The same applies to infrastructure services like Kubernetes, Kafka, and RabbitMQ. Problems in these platforms might not be immediately obvious from the symptoms they cause within their own domain. Furthermore, symptoms manifest themselves within the applications they support. The problem can then propagate further to related applications, creating a situation where the root cause problem and the symptoms it causes are separated by several layers.
Although many observability tools offer maps and graphs to visualize infrastructure and application health, these can become overwhelming during service disruptions and outages. While a sea of red icons in a topology map might highlight one or more issues, it fails to illuminate cause-and-effect relationships. Users are then left to decipher the complex interplay of problems and symptoms to work out the root cause. This is even harder when multiple root causes with overlapping symptoms are present.
In addition to topology-based correlation, DevOps teams may also have experience with other types of correlation, including event deduplication, time-based correlation, and path-based analysis, all of which attempt to reduce the noise in observability data. Don’t lose sight of the fact that this is not root cause analysis, just correlation, and correlation does not equal causation. This subject is covered further in a previous article I published, Unveiling the Causal Revolution in Observability.
The Holy Grail of troubleshooting lies in understanding causality. Moving beyond topology maps and graphs, we need solutions that represent causality depicting the complex chains of cause-and-effect relationships, with clear lines of responsibility. Precise root cause identification that clearly explains the relationship between root causes and the symptoms they cause, spanning the technology domains that support application service composition, empowers DevOps teams to:
- Accelerate Resolution: By pinpointing the exact source of the issue and the symptoms it causes, responsible teams can be notified instantly and can prioritize fixes based on a clear understanding of the magnitude of the problem. This laser focus translates to faster resolution times.
- Minimize Triage: Teams managing impacted services are spared the burden of extensive troubleshooting. They can receive immediate notification of the issue’s origin, impact, and ownership, eliminating unnecessary investigation and streamlining recovery.
- Enhance Collaboration: With a clear understanding of complex chains of cause-and-effect relationships, teams can collaborate more effectively. The root cause owner can concentrate on fixing the issue, while impacted service teams can implement mitigating measures to minimize downstream effects.
- Automate Responses: Understanding cause and effect is also an enabler for automated workflows. This might include automatically notifying relevant teams through collaboration tools, notification systems and the service desk, as well as triggering remedial actions based on the identified problem.
Bringing This To Life With Real World Examples
The following examples will showcase the concept of causality relations, illustrating the precise relationships between root cause problems and the symptoms they trigger in interrelated components that make up application services.
This knowledge is crucial for several reasons. First, it allows for targeted notifications. By understanding the cause-and-effect sequences, the right teams can be swiftly alerted when issues arise, enabling faster resolution. Second, service owners impacted by problems can pinpoint the responsible parties. This clarity empowers them to take mitigating actions within their own services whenever possible and not waste time troubleshooting issues that fall outside of their area of responsibility.
Infra Problem Impacting Multiple Services
In this example, CPU congestion in a Kubernetes Pod is the root cause, and it causes symptoms – high latency – in the application services the Pod is hosting. In turn, this results in high latency in other application services. In this situation, the causal relationships are clearly explained.
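A minimal sketch of that propagation, assuming an invented dependency graph (the Pod and service names are placeholders, not a real environment), might look like this:

```python
# Hypothetical sketch: propagate a root cause through a dependency graph
# to predict which services should show high-latency symptoms.
# The topology below is invented for illustration.

from collections import defaultdict, deque

# "X -> Y" means Y depends on X, so a problem in X can surface as a symptom in Y.
IMPACTS = defaultdict(list, {
    "pod-1":          ["checkout-svc"],     # checkout-svc runs on pod-1
    "checkout-svc":   ["storefront-svc"],   # storefront calls checkout
    "storefront-svc": ["edge-gateway"],     # the gateway fronts the storefront
})

def predicted_symptoms(root_cause_entity: str) -> list:
    """Walk the impact graph breadth-first from the root cause entity."""
    seen, order = {root_cause_entity}, []
    queue = deque([root_cause_entity])
    while queue:
        current = queue.popleft()
        for downstream in IMPACTS[current]:
            if downstream not in seen:
                seen.add(downstream)
                order.append(downstream)
                queue.append(downstream)
    return order

# CPU congestion on pod-1 explains high latency on all three services.
print(predicted_symptoms("pod-1"))   # ['checkout-svc', 'storefront-svc', 'edge-gateway']
```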
A Microservice Hiccup Leads to Consumer Lag
Imagine you’re relying on a real-time data feed, but the information you see is outdated. In this scenario, a bug within a microservice (the data producer) disrupts its ability to send updates. This creates a backlog of events, causing downstream consumers (the services that use the data) to fall behind. As a result, users/customers end up seeing stale data, impacting the overall user experience and potentially leading to inaccurate decisions. Very often the first time DevOps find out about these types of issues is when end users and customers complain about the service experience.
Database Problems
In this example, the clients of a database are experiencing performance issues because one of the clients is issuing queries that are particularly resource-intensive (a simple way to spot such a client is sketched after the list of symptoms below). Symptoms of this include:
- Slow query response times: Other queries submitted to the database take a significantly longer time to execute.
- Increased wait times for resources: Applications using the database experience high error rates as they wait for resources like CPU or disk access that are being heavily utilized by the resource-intensive queries.
- Database connection timeouts: If the database becomes overloaded due to the resource-intensive queries, applications might experience timeouts when trying to connect.
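As referenced above, here is a back-of-the-envelope sketch for spotting the client responsible; the per-client query statistics are invented purely for illustration.

```python
# Hypothetical sketch: flag the database client whose queries consume a
# disproportionate share of total query time. All numbers are made up.

client_query_seconds = {
    "orders-svc":    42.0,
    "reporting-job": 510.0,   # the resource-intensive offender
    "billing-svc":   37.5,
}

total = sum(client_query_seconds.values())
for client, seconds in sorted(client_query_seconds.items(), key=lambda kv: kv[1], reverse=True):
    share = seconds / total
    flag = "  <-- likely noisy client" if share > 0.5 else ""
    print(f"{client:14s} {share:5.1%}{flag}")
```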
Summing Up
Cloud-native systems bring agility and scalability, but troubleshooting can be a nightmare. Here’s what you need to conquer Root Cause Analysis (RCA) in this complex world:
- Automated Analysis: Move beyond time-consuming manual RCA. Effective solutions automate data collection and analysis to pinpoint cause-and-effect relationships swiftly.
- Causal Reasoning: Don’t settle for mere correlations. True RCA tools understand causal chains, clearly and accurately explaining “why” things happen and the impact that they have.
- Dynamic Learning: Cloud-native environments are living ecosystems. RCA solutions must continuously learn and adapt to maintain accuracy as the landscape changes.
- Abstraction: Cut through the complexity. Effective RCA tools provide a clear view, hiding unnecessary details and highlighting crucial troubleshooting information.
- Time Travel: Post-incident analysis requires clear explanations. Go back in time to understand why problems occurred and the impact they had.
- Hypothesis: Understand the impact that degradation or failures in application services and infrastructure will have before they happen.
These capabilities unlock significant benefits:
- Faster Mean Time to Resolution (MTTR): Get back to business quickly.
- More Efficient Use Of Resources: Eliminate wasted time chasing the symptoms of problems and get to the root cause immediately.
- Free Up Expert Resources From Troubleshooting: Empower less specialized teams to take ownership of the work.
- Improved Collaboration: Foster teamwork because everyone understands the cause-and-effect chain.
- Reduced Costs & Disruptions: Save money and minimize business interruptions.
- Enhanced Innovation & Employee Satisfaction: Free up resources for innovation and create a smoother work environment.
- Improved Resilience: Take action now to prevent problems that could impact application performance and availability in the future
If you would like to avoid the glitter of “Fools Gold” and reach the Holy Grail of service assurance with automated Root Cause Analysis, don’t hesitate to reach out to me directly, or contact the team at Causely today to discuss your challenges and discover how they can help you.
Related Resources
- Request a demo of the Causal AI platform from Causely
- Check out the podcast interview: Dr. Shmuel Kliger on Causely, Causal AI, and the Challenging Journey to Application Health
- Read the blog: Unveiling the Causal Revolution in Observability
On security platforms
🎧 This Tech Tuesday Podcast features Endre Sara, Founding Engineer at Causely!
Causely is bridging observability with automated orchestration for self-managed, resilient applications at scale.
In this episode, Amir and Endre discuss leadership, how to make people’s lives easier by operating complex, large software systems, and why Endre thinks IaC should be boring!
Dr. Shmuel Kliger on Causely, Causal AI, and the Challenging Journey to Application Health
Dr. Shmuel Kliger, the founder of Causely.io, discusses his journey in the IT industry and the development of Causely. With a strong focus on reducing labor associated with IT operations, Dr. Kliger emphasizes the importance of understanding causality and building intelligent systems to drive insights and actions in complex IT environments. He highlights the need to focus on purpose-driven analytics and structured causality models to effectively manage and control IT systems.
Dr. Kliger also touches on the role of human interaction in influencing system behavior, mentioning the importance of defining constraints and trade-offs to guide automated decision-making processes. He envisions a future where humans provide high-level objectives and constraints, allowing the system to automatically configure, deploy, and optimize applications based on these inputs. By combining human knowledge with machine learning capabilities, Dr. Kliger aims to create a more efficient and effective approach to IT management, ultimately reducing troubleshooting time and improving system performance.
Tight on time?
Get the cliff notes from these clips:
- How to move the needle beyond stuck DevOps industry practices
- Where state-of-the-art observability is falling short
- Shmuel’s career-long focus to reduce the labor associated with IT Operations
- The importance of understanding entity relationships and causality relationships
Other ways to watch/listen
See the Causely platform in action
Are you ready to eat your own dogfood?
It’s a truism of all cloud SaaS companies that we should run our businesses on our own technology. After all, if this technology is so valuable and innovative that customers with dozens of existing vendors, tools and processes need to adopt the new offering, shouldn’t the startup use it internally as well? It also seems obvious that a new company should reality check the value it’s claiming to provide to the industry by seeing it first-hand, and that any major gaps should ideally be experienced first by the home team rather than inflicting these on customers.
Sometimes it’s easier said than done.
It can be surprising how difficult this sometimes turns out to be for new companies.
When the technical and product teams focus on the ideal customer profile and likely user, it’s not always the case that these match the startup’s internal situation. Very often, the intended user is really part of a larger team that interacts with other teams in a complex set of processes and relationships, and the realistic environment that the new product will face is far larger and more diverse than anything a startup would have internally. This makes it difficult to apply the new technology in a meaningful way and can also make the value proposition less obvious.
For example, if a major claim of the innovation is to simplify complex, manual or legacy processes, or to reduce wasteful spending through optimization, these benefits may be less obvious in a smaller/newer company environment. As a result, the “eat your own dogfood” claim may be just a marketing slogan without real meaning.
And then there’s the other truism to consider: the cobbler’s children often have no shoes. When a startup is running fast and focused on building innovation – with a small team that prioritizes the core value of a new product and customer engagement over any internal efforts – it’s easy to push off eating your own dogfood for another day. Ironically, if there are challenges in using the product internally, these may not be seen as the highest priority to fix or improve, vs. the urgent customer-facing ones that are tied to adoption and revenue. It’s always easy to say, “Well, customers haven’t seen these issues yet, so it’s probably ok for now.”
But, in fact, this is at the heart of the innovation process.
No one cares more about your product than the team that built it. You’re always challenging yourself:
Is it really easy to use?
Does it work reliably?
Where are the hidden “gotchas”?
But for a breakthrough new product, this isn’t enough.
As your earliest design partners start testing the product in their environments, it’s equally important to consider how you will use it internally as well. This is not just about testing for bugs or functionality as you build your software development processes. It’s about becoming the user in every way possible and seeing the product through their eyes and daily jobs.
The world can look quite different from this viewpoint. Using the product internally raises the bar higher than responding to customer feedback or watching them during usability testing. It gives you a direct, visceral reaction to your own product:
Does it delight you as a user?
Would you use it on a daily basis?
Does it make your job easier?
Does it provide value that’s beyond other products you’re currently using?
Even if your company is not a perfect match with your ICP and your employees are not the typical users, you can still learn a great deal. For example:
- Do a job that your customer would do, from end to end, and see whether the product made your work easier/better.
- Show someone else in your company (who’s less familiar with the product) an output from the product and ask if this was helpful and understandable.
- Think of what you’d want your product to do next if you were a paying customer and considering a renewal.
At Causely, we decided early on that a high priority was to run “Causely on Causely.” Since we are developing our own SaaS application (which of course is not nearly as complex or mature as our customers’ application environments, but still has many of the same potential cloud-native problems and failure scenarios), we also need to troubleshoot when things go wrong. So we wanted to make sure that Causely would automatically identify our own emerging issues, root cause them correctly, remediate them faster and prevent end-user impact. We judge our progress based on whether WE would find this compelling and validate the claims we are making to customers, such as enabling them to have healthier, more resilient applications without human troubleshooting.
As a team, this requires us to discuss our own experiences as a customer and makes it easier to imagine the experiences of larger, inter-connected teams of users running massive applications at scale. Eating our own dogfood helps us improve the product so it’s easier to use, more understandable and reliable. And it has laid the foundation for how we will develop and operate our own applications as we scale. Of course eating your own dogfood is not a substitute for all other required approaches to testing and improving the product, but it’s a critical element in a startup’s development that should be hard-wired into company culture.
I would love to hear about other founders’ dog-fooding experiences and what’s worked well (or not) as you build your products. Please share!
Related resources
- Read the blog: Don’t forget these 3 things when starting a cloud venture
- See Causely in action: Request a demo
Cause and Defect: Root Cause Analysis with Causely
The Fast Track to Fixes: How to Turbo Charge Application Instrumentation & Root Cause Analysis
In the fast-paced world of cloud-native development, ensuring application health and performance is critical. The application of Causal AI, with its ability to understand cause and effect relationships in complex distributed systems, offers the potential to streamline this process.
A key enabler for this is application instrumentation that facilitates an understanding of application services and how they interact with one another through distributed tracing. This is particularly important with complex microservices architectures running in containerized environments like Kubernetes, where manually instrumenting applications for observability can be a tedious and error-prone task.
This is where Odigos comes in.
In this article, we’ll share our experience working with the Odigos community to automate application instrumentation for cloud-native deployments in Kubernetes.
Thanks to Amir Blum for adding resource attributes to native OpenTelemetry instrumentation based on our collaboration. And I appreciate the community accepting my PR to allow easy deployment using a Helm chart in addition to using the CLI in your K8s cluster!
This collaboration enables customers to implement universal application instrumentation and automate the root cause analysis process in just a matter of hours.
The Challenges of Instrumenting Applications to Support Distributed Tracing
Widespread application instrumentation remains a hurdle for many organizations. Traditional approaches rely on deploying vendor agents, often with complex licensing structures and significant deployment effort. This adds another layer of complexity to the already challenging task of instrumenting applications.
Because of the complexities and costs involved, many organizations struggle with making the business case for universal deployment, and are therefore very selective about which applications they choose to instrument.
While OpenTelemetry offers a step forward with auto-instrumentation, it doesn’t eliminate the burden entirely. Application teams still need to add library dependencies and deploy the code. In many situations this may meet resistance from product managers who prioritize development of functional requirements over operational benefits.
As applications grow more intricate, maintaining consistent instrumentation across a large codebase is a major challenge, and any gaps leave blind spots in an organization’s observability capabilities.
Odigos to the Rescue: Automating Application Instrumentation
Odigos offers a refreshing alternative. Their solution automates the process of instrumenting all applications running in Kubernetes clusters with just a few Kubernetes API calls. This eliminates the need to call in application developers to facilitate the process, which can take time and may require approval from product managers. This not only saves development time and effort but also ensures consistent and comprehensive instrumentation across all applications.
Benefits of Using Odigos
Here’s how Odigos is helping Causely and its customers to streamline the process:
- Reduced development time: Automating instrumentation requires zero effort from development teams.
- Improved consistency: Odigos ensures consistent instrumentation across all applications, regardless of the developer or team working on them.
- Enhanced observability: Automatic instrumentation provides a more comprehensive view of application behavior.
- Simplified maintenance: With Odigos handling instrumentation, maintaining and updating is simple.
- Deeper insights into microservice communication: Odigos goes beyond HTTP interactions. It automatically instruments asynchronous communication through message queues, including producer and consumer flows.
- Database and cache visibility: Odigos doesn’t stop at message queues. It also instruments database interactions and caches, giving a holistic view of data flow within applications.
- Key performance metric capture: Odigos automatically instruments key performance metrics that can be consumed by any OpenTelemetry compliant backend application.
Using Distributed Tracing Data to Automate Root Cause Analysis
Causely consumes distributed tracing data along with observability data from Kubernetes, messaging platforms, databases and caches, whether they are self hosted or running in the cloud, for the following purposes:
- Mapping application interactions for causal reasoning: Odigos’ tracing data empowers Causely to build a comprehensive dependency graph (a minimal sketch of deriving such a graph from trace spans follows this list). This depicts how application services interact, including:
- Synchronous and asynchronous communication: Both direct calls and message queue interactions between services are captured.
- Database and cache dependencies: The graph shows how services rely on databases and caches for data access.
- Underlying infrastructure: The compute and messaging infrastructure that supports the application services is also captured.
This dependency graph can be visualized, but it is also crucial for Causely’s causal reasoning engine. By understanding the interconnectedness of services and infrastructure, Causely can pinpoint the root cause of issues more effectively.
- Precise state awareness: Causely only consumes the observability data needed to analyze the state of application and infrastructure entities for causal reasoning, ensuring efficient resource utilization.
- Automated root cause analysis: Through its causal reasoning capability Causely is able to automatically identify the detailed chain of cause and effect relationships between problems and their symptoms in real time, when performance degrades or malfunctions occur in applications and infrastructure. These can be visualized through causal graphs which clearly depict the relationships between root cause problems and the symptoms/impacts that they cause.
- Time travel: Causely provides the ability to go back in time so devops teams can retrospectively review root cause problems and the symptoms/impacts they caused in the past.
- Assess application resilience: Causely enables users to reason about what the effect would be if specific performance degradations or malfunctions were to occur in application services or infrastructure.
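As referenced in the first point above, here is a minimal sketch of deriving service-to-service dependencies from simplified OpenTelemetry-style spans. The span fields mirror OpenTelemetry’s parent/child model, but the data and parsing are simplified assumptions rather than Causely’s actual pipeline.

```python
# Hypothetical sketch: derive a service dependency graph from simplified
# OpenTelemetry-style spans. Real pipelines consume OTLP; this just shows the idea.

from collections import defaultdict

# Each record carries its service (the service.name resource attribute), its
# span_id, and the parent_span_id of the calling span.
spans = [
    {"service": "edge-gateway",   "span_id": "a1", "parent_span_id": None},
    {"service": "storefront-svc", "span_id": "b2", "parent_span_id": "a1"},
    {"service": "checkout-svc",   "span_id": "c3", "parent_span_id": "b2"},
    {"service": "postgres",       "span_id": "d4", "parent_span_id": "c3"},
]

def build_dependency_graph(spans):
    """Return {caller_service: {callee_service, ...}} from parent/child span links."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(set)
    for span in spans:
        parent = by_id.get(span["parent_span_id"])
        if parent and parent["service"] != span["service"]:
            edges[parent["service"]].add(span["service"])
    return edges

for caller, callees in build_dependency_graph(spans).items():
    print(f"{caller} -> {sorted(callees)}")
```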
Want to see Causely in action? Request a demo.
Conclusion
Working with Odigos has been a very smooth and efficient experience. They have enabled our customers to instrument their applications and exploit Causely’s causal reasoning engine within a matter of hours. In doing so they were able to:
- Instrument their entire application stack efficiently: Eliminating developer overheads and roadblocks without the need for costly proprietary agents.
- Assure continuous application reliability: Ensuring that KPIs, SLAs, and SLOs are continually met by proactively identifying and resolving issues.
- Improve operational efficiency: By minimizing the labor, data, and tooling costs with faster MTTx.
If you would like to learn more about our experience of working together, don’t hesitate to reach out to the teams at Odigos or Causely, or join us in contributing to the Odigos open source observability plane.
Related Resources
- Request a demo to experience automated root cause analysis with Causely first-hand
- Read the blog: Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry & Causal AI
- Watch the video: Causely for asynchronous communication
- Join the Odigos community on Slack
Time to Rethink DevOps Economics? The Path to Sustainable Success
As organizations transform their IT applications and adopt cloud-native architectures, scaling seamlessly while minimizing resource overheads becomes critical. DevOps teams can play a pivotal role in achieving this by embracing automation across various facets of the service delivery process.
Automation shines in areas such as infrastructure provisioning and scaling, continuous integration and delivery (CI/CD), testing, security and compliance, but the practice of automated root cause analysis remains elusive.
While automation aids observability data collection and data correlation, understanding the relationships between cause and effect still requires the judgment and expertise of skilled personnel. This work falls on the shoulders of developers and SREs who have to manually decode the signals – from metrics, traces and logs – in order to get to the root cause when the performance of services degrades.
Individual incidents can take hours and even days to troubleshoot, demanding significant resources from multiple teams. The consistency of the process can also vary greatly depending on the skills that are available when these situations occur.
Service disruptions can also have significant financial consequences. Negative customer experiences directly impact revenue, and place an additional resource burden on the business functions responsible for appeasing unhappy customers. Depending on the industry you operate in and the type of services you provide, service disruptions may result in costly chargebacks and fines, making mitigation even more crucial.
Shining a Light on the Root Cause Analysis Problem in DevOps
While decomposing applications into microservices through the adoption of cloud-native architectures has enabled DevOps teams to increase the velocity with which they can release new functionality, it has also created a new set of operational challenges that have a significant impact on ongoing operational expenses and service reliability.
Increased complexity: With more services comes greater complexity, more moving parts, and more potential interactions that can lead to issues. This means diagnosing the root cause of problems becomes more difficult and time-consuming.
Distributed knowledge: In cloud-native environments, knowledge about different services often resides in different teams, who have limited knowledge of the wider system architecture. As the number of services scales, finding the right experts and getting them to collaborate on troubleshooting problems becomes more challenging. This adds to the time and effort required to coordinate and carry out root cause analysis and post incident analysis.
Service proliferation fuels troubleshooting demands: Expanding your service landscape, whether through new services or simply additional instances, inevitably amplifies troubleshooting needs, which translates into more resource requirements for DevOps troubleshooting over time.
Testing regimes cannot cover all scenarios: DevOps, with its CI/CD approach, releases frequent updates to individual services. This agility can reveal unforeseen interactions or behavioral changes in production, leading to service performance issues. While rollbacks provide temporary relief, identifying the root cause is crucial. Traditional post-rollback investigations might fall short due to unreproducible scenarios. Instead, real-time root cause analysis of these situations as they happen is important to ensure swift fixes and prevent future occurrences.
Telling the Story with Numbers
As cloud-native services scale, troubleshooting demands also grow exponentially, in a similar way to compounding interest on a savings account. As service footprints expand, more DevOps cycles are consumed by troubleshooting versus delivering new code, creating barriers to innovation. Distributed ownership and unclear escalation paths can also mask the escalating time that is consumed by troubleshooting.
Below is a simple model that can be customized with company-specific data to illustrate the challenge in numbers. This model helps paint a picture of the current operational costs associated with troubleshooting. It also demonstrates how these are going to escalate over time, driven by the growth in cloud-native services (more microservices, serverless functions, etc).
The model also illustrates the impact of efficiency gains through automation versus the current un-optimized state. The gap highlights the size of the opportunity available to create more cycles for productive development while reducing the need for additional headcount into the future, by automating troubleshooting.
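For illustration, a stripped-down version of such a model might look like the sketch below; every growth rate, incident count, and cost figure is a placeholder to be replaced with company-specific data.

```python
# Illustrative cost model: how troubleshooting spend compounds as the
# number of services grows. All numbers are placeholder assumptions.

SERVICES_TODAY        = 50
SERVICE_GROWTH        = 0.30   # 30% more services per year
INCIDENTS_PER_SERVICE = 4      # incidents per service per year
HOURS_PER_INCIDENT    = 6      # average engineer-hours to troubleshoot manually
HOURLY_COST           = 100    # fully loaded cost per engineer-hour ($)
AUTOMATION_SAVINGS    = 0.50   # fraction of troubleshooting time removed by automated RCA

def annual_cost(services: float, automated: bool) -> float:
    hours = services * INCIDENTS_PER_SERVICE * HOURS_PER_INCIDENT
    if automated:
        hours *= (1 - AUTOMATION_SAVINGS)
    return hours * HOURLY_COST

services = SERVICES_TODAY
for year in range(1, 4):
    manual = annual_cost(services, automated=False)
    optimized = annual_cost(services, automated=True)
    print(f"Year {year}: {services:.0f} services | manual ${manual:,.0f} | optimized ${optimized:,.0f}")
    services *= (1 + SERVICE_GROWTH)
```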
Beyond the cost of human capital, a number of other factors have a direct impact on troubleshooting costs. These include the escalating costs of infrastructure and third-party SaaS services dedicated to managing observability data. These are well publicized and highlighted in a recent article published by Causely Founding Engineer Endre Sara that discusses avoiding the pitfalls of escalating costs when building out Causely’s own SaaS offering.
While DevOps teams have a cost, it pales in comparison to the financial consequences of service disruptions. With automated root cause analysis, DevOps teams can mitigate these risks, saving the business time, money and reputation.
Future iterations of the model will account for these additional dimensions.
If you would like to put your data to work and see the quantifiable benefits of automated root cause analysis in numbers, complete the short form to get started.
Translating Theory into Reality
Did you know companies like Meta and Capital One automate root cause analysis in DevOps, achieving 50% faster troubleshooting? However, the custom solutions built by these industry giants require vast resources and deep expertise in data science to build and maintain, putting automated root cause analysis capabilities out of reach for most companies.
The team at Causely is changing this dynamic. Armed with decades of experience – applying AI to the management of distributed systems, networks, and application resource management – they offer a powerful SaaS solution that removes the roadblocks to automated root cause analysis in DevOps environments. The Causely solution enables:
- Clear, explainable insights: Instead of receiving many notifications when issues arise, teams receive clear notifications that explain the root cause along with the symptoms that led to these conclusions.
- Faster resolution times: Teams can get straight to work on problem resolution and even automate resolutions, versus spending time diagnosing problems.
- Business impact reduction: Problems can be prevented, early in their cycle, from escalating into critical situations that might otherwise have resulted in significant business disruption.
- Clearer communication & collaboration: RCA pinpoints issue owners, reducing triage time and wasted efforts from other teams.
- Simplified post-incident analysis: All of the knowledge about the cause and effect of prior problems is stored and available to simplify the process of post incident analysis and learning.
Wrapping Up
In this article we discussed the key challenges associated with troubleshooting and highlighted the cost implications of today’s approach. Addressing these issues is important because the business consequences of today’s approach are significant.
- Troubleshooting is costly because it consumes the time of skilled resources.
- Troubleshooting steals time from productive activities which impacts the ability of DevOps to deliver new capabilities.
- Service disruptions have business consequences: The longer they persist, the bigger the impact to customers and business.
If you don’t clearly understand your future resource requirements and costs associated with troubleshooting as you scale out cloud-native services, the model we’ve developed provides a simple way to capture this.
Want to see quantifiable benefits from automated root cause analysis?
Turn “The Optimized Mode of Operation” in the model from vision to reality with Causely.
The Causely service enables you to measure and showcase the impact of automating RCA in your organization. Through improved service performance and availability, faster resolution times, and improved team efficiency, the team will show you the “art of the possible.”
Related Resources
- Watch the video: Mission Impossible: Cracking the Code of Complex Tracing Data
- Read the blog: Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry & Causal AI
- Request a demo: See Causely in action
What is causal AI?
The Fellowship of the Uptime Ring: A Quest for Site Reliability
Reposted with permission from its original source on LinkedIn.
A digital chill swept through King Reginald as he materialized back in his royal chambers, having returned from the Cloud Economic World Forum. The summit had cast a long shadow, its discussions echoing through his processors with a stark warning. One of the most concerning threats to the global economy, the forum declared, was the alarming unreliability of digital infrastructure. Countless nations, their economies heavily reliant on online services, were teetering on the brink of collapse due to ailing applications.
Haunted by the pleas of desperate leaders, King Reginald, a champion of digital prosperity, knew he could not stand idly by. He summoned his most trusted advisors: Causely, the griffin with an eagle eye for detail; OpenTelemetry, the ever-vigilant data whisperer; and Odigos, the enigmatic hobbit renowned for his eBPF magic.
“My friends,” boomed the King, his voice laced with urgency, “we face a momentous challenge. We must embark on a quest, a mission to guide failing Digital Realms back to stability, preventing a domino effect that could plunge the global economy into chaos. We need to formulate a Charter: a beacon of hope, a blueprint for success that will empower these realms to assure the reliability of their online services.”
Causely, his gaze sharp as a hawk’s, swiveled his head. “A noble undertaking, Your Majesty. But how do we convince the denizens of these realms to join us? We need not just a path to success, but a compelling reason for Business Knights, Technical Leaders, and Individual Contributors to rally behind this cause.”
King Reginald hollered, “The Charter will be a clarion call that not only illuminates the path to success but also unveils the treasures that await those who embark on this journey with us. We cannot afford to delay in crafting this crucial charter. Therefore, I propose we convene an offsite meeting to dedicate our collective wisdom to this critical task.”
A Pint and a Plan
King Reginald’s favourite place for such gatherings would normally have been Uncle Sam’s Oyster Bar & Grill, renowned for its delightful fare and stimulating intellectual discourse, but it was fully booked for the next six months, thanks to the ever-growing popularity of their chatbot steaks.
A hint of amusement flickered across Odigos’ eyes. “Fear not, my friends,” he declared, a mischievous glint in his voice. “I know of a place steeped in history, and perhaps, even edible sustenance – The Hypervisor Arms!”
Causely’s gaze narrowed slightly. “The Hypervisor Arms?” he echoed, a hint of apprehension tinging his voice. “While once a legendary establishment, I’ve heard whispers…”
Odigos shrugged nonchalantly. “It recently changed hands, coming under the new management of Bob Broadcom. While some may cast aspersions, I believe it may hold the key to a productive and, hopefully, reasonably comfortable brainstorming session.”
Intrigued by the prospect of a new experience, the group followed Odigos to the Hypervisor Arms. An initial buzz of anticipation filled them as they approached the weathered facade. However, upon entering, their enthusiasm waned. The once-vibrant pub was a shadow of its former self. The menu offered a meager selection, the ale taps a disappointingly limited variety, and murmurs of discontent filled the air. Several patrons grumbled about long wait times, and a few bore the mark of undercooked meals on their disgruntled faces.
A hint of disappointment flickered across Odigos’ face. “Perhaps,” he started hesitantly, “we should reconsider another venue.”
King Reginald, however, straightened his virtual shoulders. “While the circumstances are not ideal,” he declared with unwavering resolve, “we shall not allow them to deter us. The ailing applications in these struggling economies demand our expertise. We may need to adapt, but our mission remains. Let us find a suitable table and proceed with our task. The fate of countless digital realms depends on it.”
Despite the underwhelming ambiance, fueled by their sense of purpose, the group settled in and began brainstorming. After hours of passionate discussion, fueled by stale ale and questionable mutton stew (courtesy of the Hypervisor Arms), they emerged with a finalized Charter:
The Charter: A Quest for Site Reliability
1. Building the Observatory (aka Not Using Carrier Pigeons):
- Leverage OpenTelemetry and its Magical Data Spyglass: Gain X-ray vision into your applications’ health, identifying issues before they wreak havoc like a toddler with a marker in an art museum.
- Unify with Semantic Conventions, the Universal App Translator: Forget the Tower of Babel, all your applications will speak the same language, making communication smoother than a greased piglet on an ice rink.
- Embrace Auto-Instrumentation and Odigos, the eBPF Whisperer: For stubborn applications that refuse to cooperate (like those grumpy dragons guarding their treasure), Odigos will use his mystical eBPF magic to pry open the data treasure chest.
2. Causely, the wise Griffin (Who Doesn’t Need Coffee):
- Empower Causely’s Detective Skills: Let Causely analyze your data like Sherlock Holmes on a caffeine bender, pinpointing the root cause of problems faster than you can say “performance degradation.”
- Direct Notification and Swift Action: Causely will alert the responsible teams with the urgency of a squirrel who just spotted a peanut, ensuring problems get squashed quicker than a bug under a magnifying glass.
3. The Spoils of Victory (Beyond Just Bragging Rights):
For Business Knights:
- Increased Productivity: Say goodbye to endless troubleshooting and hello to more time for conquering new territories (aka strategic initiatives).
- Stronger Financial Position: No more SLA and SLO violations that make customers grumpier than a hangry troll. This means fewer chargebacks, fines, and share price nosedives – more like rocketships, baby!
For Technical Leaders:
- Empower Your Team: Become the Gandalf of your team, guiding them on a path of innovation instead of getting bogged down in the troubleshooting trenches.
- Focus on the Future: Less time spent putting out fires means more time spent building the next big thing (and maybe finally getting that nap you deserve).
For Individual Contributors:
- Do More Than Just Slay Bugs: Break free from the monotony of troubleshooting and tackle exciting challenges you wouldn’t have had time for before.
- Level Up Your Skills: Spend more time learning new things and becoming the ultimate coding ninja.
Adding Value For Customers:
- Happy Customers, Happy Life: Customer service teams will be transformed from fire-fighters to customer experience superheroes, helping your customers achieve their goals with ease.
Digital Talent Optimization:
- Work in a Place That Doesn’t Feel Like Mordor: Create a work environment as inviting as a hobbit’s hole, attracting and retaining top talent who are tired of battling application dragons all day.
King Reginald declared, “This charter is not just a document, it’s a beacon of hope for Digital Realms seeking application serenity (and maybe some decent snacks), and we must promote it immediately to spread the word for our cause throughout the Realms.”
The Call to Arms: Beyond the Glitches, Lies a Digital Utopia
In a public broadcast to the Realms, King Reginald announced the charter: “This is not just a charter, it’s a battle cry, a war horn echoing across the Digital Realms. We stand united against a common foe: application dragons spewing fire (and error messages) that threaten to devour productivity and sanity.
But fear not, brave knights and valiant developers! We offer you the holy grail of application serenity, and to help you understand the opportunity that lies ahead we offer:
- White Papers & Videos: That explain the details, the mechanics of the journey and the treasures that this unlocks.
- Demo Days: Witness the magic unfold before your very eyes as we showcase the power of this system in live demonstrations.
- Proof of Concepts: Doubt no more! We’ll provide tangible proof that this solution is the elixir your Digital Realm craves.
- Interactive ROI calculators: Quantify the value of the journey for your Digital Realm.
- Disco Tech Takeover: Prepare to groove! I’ll be guest-starring on the legendary podcast “Disco Tech,” hosted by the infamous (and influential) blogger, Eric Wrong. Millions of ears will hear the call, and the revolution will begin.
To the Business Knights:
Lead your teams to digital El Dorado, a land overflowing with increased productivity and financial prosperity.
To the Technical Leaders:
Become the Gandalf of your team, guiding them on quests of innovation and unleashing their full potential.
To the Individual Contributors:
Slay the bugs no more! Ascend to new heights of skill and mastery with newfound time and resources.
And to the Customers:
Rejoice! You shall be treated as kings and queens, receiving flawless service and exceptional experiences.
The call to action is clear: Join the quest! Together, we’ll vanquish the performance dragons, build a digital utopia, and snack on subpar catered lunches (we’re working on the catering part).
Remember, brave heroes, the fate of the Digital Realms rests in your hands (and keyboards)!
If you missed the first and second episodes in this Epic trilogy you can find them here:
Don’t Forget These 3 Things When Starting a Cloud Venture
I’ve been in the cloud since 2008. These were the early days of enterprise cloud adoption and the very beginning of hybrid cloud, which to this day remains the dominant form of enterprise cloud usage. Startups that deliver breakthrough infrastructure technology to enterprise customers have their own dynamic (which differs from consumer-focused companies), and although the plots may differ, the basic storylines stay the same. I’ve learned from several startup experiences (and a fair share of battle scars) that there are a few things you should start planning for right from the beginning of a new venture.
These aren’t the critical ones you’re probably already doing, such as selecting a good initial investor, building a strong early team and culture, speaking with as many early customers as possible to find product-market fit (PMF), etc.
Instead, here are three of the less-known but equally important things you should prioritize in your first years of building a company.
1. Build your early customer advisor relationships
As always, begin with the customer. Of course you will cast a wide net in early customer validation and market research efforts to refine your ideal customer profile (ICP) and customer persona(s). But as you’re iterating on this, you also want to build stronger relationships with a small group of customers who provide more intensive and strategic input. Often these early customers become part of a more formal customer advisory board, but even before this, you want to have an ongoing dialog with them 1:1 as you’re thinking through your product strategy.
These customers are strategic thinkers who love new technologies and startups. They may be senior execs but not necessarily – they are typically the people who think about how to bring innovation into their complex enterprise organizations and enjoy acting as “sherpas” to guide entrepreneurs on making their products and companies more valuable. I can remember the names of all the customer advisors from my previous companies and they have had an enormous impact on how things progressed. So beyond getting initial input and then focusing only on sales opportunities, build advisor relationships that are based on sharing vision and strategy, with open dialogue about what’s working/not working and where you want to take the company. Take the time to develop these relationships and share more deeply with these advisors.
2. Begin your patent work
Many B2B and especially infrastructure-oriented startups have meaningful IP that is the foundation of their companies and products. Patenting this IP requires the founding engineering team to spend significant time documenting the core innovation and its uniqueness, and the legal fees can run to thousands of dollars. As a result, there’s often a desire to hold off filing patents until the company is farther along and has raised more capital. Also, the core innovation may change as the company gets further into the market and founders realize more precisely what the truly valuable technology includes – or even if the original claims are not aligned with PMF.
However, it’s important to begin documenting your thinking early, as the company and development process are unfolding, so you have a written record and are prepared for the legal work ahead. In the end, the real value patents provide for startups is less about protecting your IP from larger players, who will always have more lawyers and money than you do and will be willing to take legal action to protect their own patent portfolios or keep your technology off the market. It is more about capturing the value of your innovation for future funding and M&A scenarios, and about showing larger players during a buy-out that you’ve taken steps to protect your IP and innovation. In the US, a one-year clock for filing the initial patent (frequently with a provisional patent filed first to preserve an early filing date) begins ticking upon external disclosure, such as when you launch your product, and you don’t want to address these issues in a rush. It’s important to get legal advice early on about whether you have something patentable, how much work it will be to write up the patent application, who will be named as inventors, and whether you want to file in more than one country.
3. Start your SOC2 process as you’re building MVP
Yes, SOC2. Not very sexy, but over time it has become absolute table stakes for enabling early-adopter customers to test your initial product, even in a dev/test environment. In the past, it was possible to find a champion who loved new technology to “unofficially” test your early product, because they were respected and trusted by their organization to bring in great new stuff. But as cloud and SaaS have matured and become so widespread at enterprise customers, the companies that provide these services – even to other startups – are requiring ALL of their potential vendors to stay aligned with their own SOC2 requirements. It’s like a ladder of compliance, with each vendor supporting the vendors and customers above them. There is typically a new vendor onboarding process and security review, even for testing out a new product, and these are more consistently enforced than in the past.
As a result, it has become more urgent to start your SOC2 process right away, so you can say you’re underway even as you’re building your minimum viable product (MVP) and development processes. It’s much easier now to automate and track SOC2 processes and prepare for the audit (this is my third time doing this, and it is far less manual than in the past). But if you launch your product and only then go back to set up security/compliance policies and processes, the work will be harder and more complicated, and you’ll be under far more pressure from your sales team to check this box.
There’s no question that other must-dos should be added to the above list (and it would be great to hear from other founders on this). With so many things to consider, it’s hard to prioritize in the early days, and sometimes the less obvious things get pushed off until “later”. But as I build my latest cloud venture, Causely, I’ve kept these three priorities top of mind because I know how important they are for laying a strong foundation for growth.
Related resources
- Read the blog: Why do this startup thing all over again? Our reasons for creating Causely
- Read the blog: All sides of the table: Reflecting on the boardroom dynamics that really matter
- Read the 2-part Cervin Founder Spotlight
Mission Impossible? Cracking the Code of Complex Tracing Data
In this video, we’ll show how Causely leverages OpenTelemetry. (For more on how and why we use OpenTelemetry in our causal AI platform, read the blog from Endre Sara.)
Distributed tracing gives you a bird’s eye view of transactions across your microservices. Far beyond what logs and metrics can offer, it helps you trace the path of a request across service boundaries. Setting up distributed tracing has never been easier. In addition to OpenTelemetry and existing tracing tools such as Tempo and Jaeger, open source tools like Grafana Beyla and Keyval Odigos let you enable distributed tracing in your system without changing a single line of code.
These tools allow the instrumented applications to start sending traces immediately. But with potentially hundreds of spans in each trace and millions of traces generated per minute, you can easily become overwhelmed. Even with a bird’s eye view, you might feel like you’re flying blind.
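If spans and traces are new to you, here is a minimal, hedged sketch using the OpenTelemetry Python SDK of the kind of data this instrumentation emits; the service, span names and attribute are made up, and the console exporter is used only so the example is self-contained. Each unit of work becomes a span, and nested spans share a single trace.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
# Print finished spans to stdout; a real setup would export them to a tracing backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-demo")
# One trace, two spans: the child records the downstream call made by the parent.
with tracer.start_as_current_span("checkout") as parent:
    parent.set_attribute("cart.items", 3)
    with tracer.start_as_current_span("charge-payment"):
        pass  # the call to the payment service would go here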
That’s where Causely comes in. Causely efficiently consumes and analyzes tracing data, automatically constructs a cause and effect relationship, and pinpoints the root cause.
Interested in seeing how Causely makes it faster and easier to use tracing data in your environment so you can understand the root cause of challenging problems?
Comment here or contact us. We hope to hear from you!
Related resources
- Read the blog: Using OpenTelemetry and the OTel Collector for Logs, Metrics and Traces
- Read the blog: Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry & Causal AI
- Watch the video: Causely for Asynchronous Communication
Cloud Cuckoo Calamity: The eBPF-Wielding Hobbit and the Root Cause Griffin Save the Realm!
Reposted with permission from its source on LinkedIn.
The fate of the realm hangs in the balance. Join the mayhem in Cloud Cuckoo Calamity, the thrilling sequel to Data, Dragons & Digital Dreams.
A mournful dirge echoed through the digital realms, lamenting the passing of King Bartholomew the Bold, ruler of the fabled Cloud Cuckoo Land. With no direct heir apparent, the crown landed upon the brow of King Reginald Mainframe, already the esteemed sovereign of Microservice Manor. He inherited not just a kingdom, but a brewing chaos: Cloud Cuckoo Land, riddled with uninstrumented applications and plagued by performance woes, teetered on the brink of digital disaster.
The developers, pressured by the relentless Business Knights, had prioritized new features over the crucial art of instrumentation. Applications, like untended gardens, were choked with bugs and unpredictable performance. User complaints thundered like disgruntled dragons, and service reliability plummeted faster than a dropped byte.
King Reginald, a leader known for his strategic prowess, faced a quandary. Directly mandating changes would likely spark a rebellion amongst the Business Knights, jeopardizing the kingdom’s already fragile state. He needed a solution both swift and subtle, a whisper in the storm that could transform the chaos into a symphony of efficiency.
That’s when OpenTelemetry, his ever-watchful advisor, swooped in with a glimmer of hope. “Your Majesty,” she chirped, “I bring Odigos, the mystical hobbit rumored to possess not only the ability to weave instrumentation magic into even the most tangled code, but also mastery over the art of eBPF exploration!”
From the vibrant digital realm of Ahlan, renowned for its sand-swept servers and cunning code whisperers, Odigos arrived. Clad in a flowing robe adorned with intricate patterns, he bowed low. “Indeed, your Majesty,” he rasped in a warm, accented voice, “my eBPF skills allow me to delve into the deepest recesses of your applications, unearthing their secrets and revealing their inner workings. But fear not, for I can enchant them with instrumentation in a single, elegant command, seamlessly integrating into your Kubernetes clusters without disrupting your developers’ flow.”
With a twinkle in his eye, Odigos uttered the magical incantation, and lines of code danced across the screen. As if by magic, applications across Cloud Cuckoo Land blossomed with instrumentation, whispering their performance metrics to OpenTelemetry’s attentive ear. The developers, initially wary, were astonished to discover the instrumentation seamlessly woven into their workflows.
But amidst the data deluge, chaos threatened to return. Enter Causely, the griffin detective, with his razor-sharp intellect and unwavering determination. Like a beacon in a storm, his gaze pierced through the mountains of metrics, his feathers bristling with the thrill of the hunt. He possessed an uncanny ability to weed out the root cause even in the most complex of situations, his logic as intricate as the threads of fate.
Armed with OpenTelemetry’s data, Odigos’s newfound insights into the instrumented applications, and Causely’s expert analysis, King Reginald’s forces launched a targeted counteroffensive. Bottlenecks were unraveled, inefficiencies eradicated, and applications hummed with newfound stability. User complaints became echoes of praise, and Cloud Cuckoo Land emerged from the brink, its digital sun shining brighter than ever.
News of their triumph spread like wildfire, inspiring kingdoms across the digital landscape who were themselves struggling with instrumentation and root cause analysis woes. Odigos, the single-command code whisperer, and Causely, the master of root cause analysis, became legends. Their names were synonymous with efficiency, clarity, and the ability to tame even the most chaotic of digital realms.
But as celebrations died down, whispers of a new threat emerged, a malevolent force lurking in the shadows. Would their combined talents be enough to face this looming challenge? Stay tuned, dear reader, for the next chapter in the legend of King Reginald and his ever-evolving digital realm.
Read the next post in this series: The Fellowship of the Uptime Ring: A Quest for Site Reliability.
Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry & Causal AI
Implementing OpenTelemetry at the core of our observability strategy for Causely’s SaaS product was a natural decision. In this article I would like to share some background on our rationale and how the combination of OpenTelemetry and Causal AI addresses several critical requirements that enable us to scale our services more efficiently.
Avoiding Pitfalls Based on Our Prior Experience
We already know, from decades of experience working in and with operations teams in the most challenging environments, that bridging the gap between the vast ocean of observability data and actionable insights has been, and continues to be, a major pain point. This is especially true in the complex world of cloud-native applications.
Missing application insights
Application observability remains an elusive beast for many, especially in complex microservices architectures. While infrastructure monitoring has become readily available, neglecting application data paints an incomplete picture, hindering effective troubleshooting and operations.
Siloed solutions
Traditional observability solutions have relied on siloed, proprietary agents and data sources, leading to fragmented visibility across teams and technologies. This makes it difficult to understand the complete picture of service composition and dependencies.
To me, this is like trying to solve a puzzle with missing pieces. That’s essentially the problem many DevOps teams face today: piecing together a picture of how microservices, serverless functions, databases, and other elements interact with one another and with the underlying infrastructure and cloud services they run on. This hinders collaboration and troubleshooting efforts, making it challenging to pinpoint the root cause of performance issues or outages.
Vendor lock-in
Many vendors’ products also lock customers’ data into their cloud services. This can result in customers paying through the nose, because licensing costs are predicated on the volume of data that is being collected and stored in the service providers’ backend SaaS services. It can also be very hard to exit these services once locked in.
These are all pitfalls we want to avoid at Causely as we build out our Causal AI services.
Want to see Causely in action? Request a demo.
The Pillars of Our Observability Architecture Pointed Us to OpenTelemetry
OpenTelemetry provides us with a path to break free from these limitations, establishing a common framework that transcends the programming languages and platforms we use to build our services, and satisfying the requirements laid out in the pillars of our observability architecture:
Precise instrumentation
OpenTelemetry offers automatic instrumentation options that minimize the manual code modifications we need to make and streamline the integration of our internal observability capabilities into our chosen backend applications.
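As a rough illustration of what this looks like in practice (in Python, assuming the opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests packages are installed; the app itself is hypothetical), a couple of calls wrap an existing web app and its outbound HTTP calls in spans without touching any handler code:
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
app = Flask(__name__)
# Automatically create spans for inbound requests to this app and for
# outbound calls made through the requests library; no handler changes needed.
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
@app.route("/health")
def health():
    return "ok"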
Unified picture
By providing a standardized data model powered by semantic conventions, OpenTelemetry enables us to paint an end-to-end picture of how all of our services are composed, including application and infrastructure dependencies. We also gain access to critical telemetry information, leveraging this semantically consistent data across multiple backend microservices even when they are written in different languages.
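For a concrete, if simplified, taste of those semantic conventions: the snippet below (Python; the attribute values are placeholders) attaches standard resource attributes such as service.name to every span a service emits, so any backend can agree on what the data describes.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
# Semantic-convention attribute keys with illustrative values.
resource = Resource.create({
    "service.name": "shipping",
    "service.version": "1.4.2",
    "deployment.environment": "production",
})
# Every span produced by this provider now carries these attributes.
trace.set_tracer_provider(TracerProvider(resource=resource))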
Vendor-neutral data management
OpenTelemetry enables us to avoid locking our application data into 3rd party vendors’ services by decoupling it from proprietary vendor formats. This gives us the freedom to choose the best tools on an ongoing basis based on the value they provide, and if something new comes along that we want to exploit, we can easily plug it into our architecture.
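As a sketch of what that decoupling looks like in code (Python; the collector address is an assumption), the application only ever speaks OTLP to a collector, and which vendor or open source backend ultimately stores the data is a collector configuration choice rather than a code change:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Ship spans to a local OpenTelemetry Collector over OTLP/gRPC. Where the
# collector forwards them (and to how many backends) is configured there, not here.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)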
Resource-optimized observability
OpenTelemetry enables us to take a top-down approach to data collection, starting with the problems we are looking to solve and eliminating unnecessary information. In doing so, we minimize our storage costs and optimize the compute resources needed to support our observability pipeline.
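One simple example of this top-down trimming is head sampling, keeping only a fraction of traces; the 10% ratio below is arbitrary and shown purely as a sketch.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Record roughly 10% of new traces; child spans follow their parent's decision,
# so the traces that are kept stay complete end to end.
sampler = ParentBased(TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))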
We believe that following these pillars and building our Causal AI platform on top of OpenTelemetry will propel our product’s performance, enable rock-solid reliability, and ensure consistent service experiences for our customers as we scale our business. We will also minimize our ongoing operational costs, creating a win-win for us and our customers.
OpenTelemetry + Causal AI: Scaling for Performance and Cost Efficiency
Ultimately, observability aims to illuminate the behavior of distributed systems, enabling proactive maintenance and swift troubleshooting. Yet isolated failures manifest as cascading symptoms across interconnected services.
While OpenTelemetry enables back-end applications to use this data to provide a unified picture in maps, graphs and dashboards, the job of figuring out cause and effect in the correlated data still requires highly skilled people. This process can also be very time consuming, tying up personnel across the multiple teams that own different elements of the overall service.
There is a lot of noise in the industry right now about how AI and LLMs are going to magically come to the rescue, but reality paints a different picture. All of the solutions available in the market today focus on correlating data versus uncovering a direct understanding of causal relationships between problems and the symptoms they cause, leaving devops teams with noise, not answers.
Traditional AI and LLMs also require massive amounts of data as input for training and learning behaviors on a continuous basis. This is data that ultimately ends up being transferred and stored in some form of SaaS. Processing these large datasets is very computationally intensive. This all translates into significant cost overheads for the SaaS providers as customer datasets grow over time – costs that ultimately result in ever-increasing bills for customers.
By contrast, this is where Causal AI comes into its own, taking a fundamentally different approach. Causal AI provides operations and engineering teams with an understanding of the “why”, which is crucial for effective and timely troubleshooting and decision-making.
Causal AI uses predefined models of how problems behave and propagate. When combined with real-time information about a system’s specific structure, Causal AI computes a map linking all potential problems to their observable symptoms.
This map acts as a reference guide, eliminating the need to analyze massive datasets every time Causal AI encounters an issue. Think of it as checking a dictionary instead of reading an entire encyclopedia.
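To make the dictionary analogy a bit more concrete, here is a deliberately tiny, hypothetical sketch in Python. It is not Causely’s model or code; it only illustrates the general idea of precomputing a map from potential root causes to the symptoms they would produce over a known dependency graph, then matching observed symptoms against that map. The services, symptom names and scoring rule are all made up.
# Toy illustration only: a hand-rolled "codebook" of cause -> expected symptoms.
DEPENDS_ON = {
    "web": ["cart", "shipping"],
    "cart": ["redis"],
    "shipping": ["mysql"],
}
def symptoms_of(problem_service):
    """Assume congestion raises latency on the service itself and every upstream caller."""
    affected = {problem_service}
    changed = True
    while changed:
        changed = False
        for caller, callees in DEPENDS_ON.items():
            if caller not in affected and any(c in affected for c in callees):
                affected.add(caller)
                changed = True
    return {f"high_latency:{svc}" for svc in affected}
# Precompute the map (the "dictionary") from every potential root cause to its symptoms.
SERVICES = ["web", "cart", "shipping", "redis", "mysql"]
CODEBOOK = {f"congestion:{svc}": symptoms_of(svc) for svc in SERVICES}
def best_explanation(observed):
    """Pick the root cause whose expected symptoms best match what was observed."""
    def score(expected):
        return len(expected & observed) - len(expected ^ observed)
    return max(CODEBOOK, key=lambda cause: score(CODEBOOK[cause]))
observed = {"high_latency:mysql", "high_latency:shipping", "high_latency:web"}
print(best_explanation(observed))  # -> congestion:mysql
A real system is of course vastly richer (many problem types, probabilistic symptoms, changing topology), but the point stands: the expensive reasoning is baked into the map up front, so diagnosis at runtime is a lookup-and-match rather than a big-data crunch.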
The bottom line is, in contrast to traditional AI, Causal AI operates on a much smaller dataset, requires far less resources for computation and provides more meaningful actionable insights, all of which translate into lower ongoing operational costs and profitable growth.
Summing it up
There’s massive potential for Causal AI and OpenTelemetry to come together to tackle the limitations of traditional AI to get to the “why.” This is what we’re building at Causely. Doing so will result in numerous benefits:
- Less time on Ops, more time on Dev: OpenTelemetry provides standardized data while Causal AI analyzes it to automate the root cause analysis (RCA) process, which will significantly reduce the time our devops teams have to spend on troubleshooting.
- Instant gratification, no training lag: We can eliminate AI’s slow learning curve, because Causal AI pairs OpenTelemetry’s semantic conventions with built-in domain knowledge of cause and effect to deliver actionable results right out of the box – without massive amounts of data and with no training lag!
- Small data, lean computation, big impact: Unlike traditional AI’s data gluttony and significant computational overheads, Causal AI thrives on targeted data streams. OpenTelemetry’s smart filtering keeps the information flow lean, allowing Causal AI to identify the root causes with a significantly smaller dataset and compute footprint.
- Fast root cause identification: Traditional AI might tell us “ice cream sales and shark attacks rise together,” but Causal AI reveals the truth – it’s the summer heat, not the sharks, driving both! By understanding cause-and-effect relationships, Causal AI cuts through the noise and identifies the root causes behind performance degradation and service malfunctions.
Having these capabilities is critical if we want to move beyond the labor intensive processes associated with how RCA is performed in devops today. This is why we are eating our own dog food and using Causely as part of our tech stack to manage the services we provide to customers.
If you would like to learn how to unplug from the Matrix of guesswork and embrace the opportunity offered through the combination of OpenTelemetry and Causal AI, don’t hesitate to reach out! The team and I at Causely are here to share our experience and help you navigate the path.
Related Resources
- Watch the video to see how Causely uses OpenTelemetry under the hood: Cracking the Code of Complex Tracing Data
- Read the blog: Using OpenTelemetry and the OTel Collector for Logs, Traces and Metrics
- See Causely in action: Request a demo
Data, Dragons & Digital Dreams: The Saga Of Microservice Manor
Reposted with permission from its source on LinkedIn.
In the bustling kingdom of Microservice Manor, where code flowed like rivers and servers hummed like contented bees, all was not well. Glitches lurked like mischievous sprites, transactions hiccuped like startled unicorns, and user satisfaction plummeted faster than a dropped byte.
King Reginald Mainframe, a wise old server with a crown of blinking LEDs, wrung his metaphorical hands. His loyal advisors, the Observability Owls, hooted in frustration. Metrics were a jumbled mess, logs spoke in cryptic tongues, and tracing requests felt like chasing greased squirrels through a server maze. Root cause analysis? More like root canal analysis!
Then, like a ray of sunshine in a server storm, arrived OpenTelemetry. This sprightly young service, armed with instrumentation libraries and tracing tools, promised a unified telemetry framework to illuminate the Manor’s inner workings. Intrigued, King Reginald gave OpenTelemetry a royal trial, equipping all his microservices with its magical sensors.
The transformation was instant. Metrics flowed like crystal rivers, logs sang in perfect harmony, and tracing requests became a delightful waltz through a well-mapped network. The Observability Owls, no longer befuddled, hooted gleefully as they pinpointed glitches with pinpoint precision.
But King Reginald, ever the wise ruler, knew true peace required more than just clear data. He needed someone to interpret the whispers of the metrics, to make sense of the digital symphony. Enter Causely, his new Digital Assistant. Causely, a majestic data analysis griffin with a keen eye and a beak imbued with the power of Causal AI, could see patterns where others saw only noise.
Together, OpenTelemetry and Causely formed the ultimate root cause analysis dream team. OpenTelemetry, the tireless scout, would reveal the Manor’s secrets, while Causely, the wise griffin, would decipher their meaning.
First, they tackled a rogue shopping cart service, hoarding transactions like a squirrel with acorns. From the whispers of OpenTelemetry, Causely revealed the culprit: a hidden bug causing carts to multiply like gremlins. With a swift code fix, the carts vanished, and checkout flowed like a well-oiled machine.
Then, a more insidious challenge arose. Whispers of delayed messages and sluggish microservices echoed through the Manor halls. OpenTelemetry traced the issue to the Message Queue Meadow, where messages piled up like autumn leaves. But the cause remained elusive.
Causely, feathers bristling with urgency, soared over OpenTelemetry’s data streams. He spotted a pattern: certain message types were clogging the queues, causing a domino effect of delays. Further investigation revealed a compatibility issue between a recently updated microservice and the messaging format.
With Causely’s guidance, the developers swiftly adjusted the messaging format, and the queues sprang back to life. Messages flowed like a rushing river, microservices danced in sync, and user satisfaction soared higher than a cloud-surfing unicorn.
But the saga wasn’t over. Fresh data, the lifeblood of the Manor, seemed to stagnate. Transactions stuttered, user complaints echoed like mournful owls, and the Observability Owls flapped in confusion. OpenTelemetry led them to the Kafka Canal, where messages, instead of flowing freely, were backed up like barges in a narrow lock.
Causely, the griffin detective perched atop OpenTelemetry’s digital cityscape, surveyed the scene. His gaze, piercing through the stream of metrics, snagged on a glaring imbalance: certain topics, once bustling avenues, now overflowed like bursting dams, while others lay vacant as silent streets. With a determined glint in his eye, Causely unearthed the culprit: a misconfigured Kafka broker, its settings acting like twisted locks, choking the data flow.
With Causely’s guidance, the developers swiftly adjusted the broker configuration. The Kafka Canal unclogged, messages surged forward, and consumers feasted on fresh data. Transactions hummed back to life, user complaints turned to delighted chirps, and King Reginald’s crown shone brighter than ever.
The legend of OpenTelemetry and Causely spread far and wide. News of their triumphs, from rogue carts to stagnant data, reached other kingdoms, battling their own digital dragons and mischievous sprites. Causely, the wise griffin, and OpenTelemetry, the tireless scout, became symbols of hope, their teamwork a testament to the power of clear data, insightful analysis, and a dash of griffin magic.
And so, the quest for perfect digital harmony continued, fuelled by collaboration and a never-ending pursuit of efficiency. Microservice Manor, once plagued by glitches and stale data, became a beacon of smooth operations and fresh information, a reminder that even the trickiest of digital challenges can be overcome with the right tools and the unwavering spirit of a dream team.
The End. (But who knows, perhaps one day, even dragons will learn to code efficiently!)
Does chaos reign in your Digital Kingdom? Banish glitches and optimize your realm with the dynamic duo – OpenTelemetry and Causely! Don’t hesitate to reach out if you would like an introduction to this powerful team.
Related reading
Read the next articles in this saga:
Cervin Founder Spotlight: Ellen Rubin of Causely, Part 2
Cervin Founder Spotlight: Ellen Rubin of Causely, Part 1
Causely for asynchronous communication
Managing microservices-based applications at scale is challenging, especially when it comes to troubleshooting and pinpointing root causes.
In a microservices-based environment, when a failure occurs, it causes a flood of anomalies across the entire system. Pinpointing the root cause can be as difficult as searching for a needle in a haystack. In this video, we’ll share how Causely can eliminate human heavy lifting and automate the troubleshooting process.
Causely is the operating system to assure application service delivery by automatically preventing failures, pinpointing root causes, and remediating. Causely captures and analyzes cause and effect relationships so you can explore interesting insights and questions about your application environment.
Does this resonate with you? Feel free to share your troubleshooting stories here. We’d love to explore the ways Causely can help you!
Moving Beyond Traditional RCA In DevOps
Reposted with permission from LinkedIn.
Modernization Of The RCA Process
Over the past month, I have spent a significant amount of time researching what vendors and customers are doing in the devops space to streamline the process of root cause analysis (RCA).
My conclusion is that the underlying techniques and processes used in operational environments today to perform RCA remain human-centric. As a consequence, troubleshooting remains complex and resource-intensive, and requires skilled practitioners to perform the work.
So, how do we break free from this human bottleneck? Brace yourselves for a glimpse into a future powered by AI. In this article, we’ll dissect the critical issues, showcase how cutting-edge AI advancements can revolutionize RCA, and hear directly from operations and engineering leaders who have experienced these capabilities first hand and shared their perspectives on this transformative tech.
Troubleshooting In The Cloud Native Era With Monitoring & Observability
Troubleshooting is hard because when degradations or failures occur in components of a business service, they spread like a disease to related service entities which also become degraded or fail.
This problem is amplified in the world of cloud-native applications, where we have decomposed business logic into many separate but interrelated service entities. Today an organization might have hundreds or thousands of interrelated service entities (microservices, databases, caches, messaging…).
To complicate things even further, change is a constant – code changes, fluctuating demand patterns, and the inherent unpredictability of user behavior. These changes can result in service degradations or failures.
Testing for all possible permutations in this ever-shifting environment is akin to predicting the weather on Jupiter – an impossible feat – amplifying the importance of a fast, effective and consistent root cause analysis process, to maintain the availability, performance and operational resilience of business systems.
While observability tools have made strides in data visualization and correlation, their inherent inability to explain the cause-and-effect relationships behind problems leaves us dependent on human expertise to navigate the vast seas of data to determine the root cause of service degradation and failures.
This dependence becomes particularly challenging due to siloed devops teams, each responsible for supporting individual service entities within the complex web that makes up business services. In this context, individual teams may frequently struggle to pinpoint the source of service degradation or failure, since the entity they support might be the culprit, or a victim of another service entity’s malfunction.
The availability of knowledge and skills within these teams also fluctuates due to business priorities, vacations, holidays, and even daily working cycles. This resource variability can lead to significant inconsistencies in problem identification and resolution times.
Causal AI To The Rescue: Automating The Root Cause Analysis Process For Cloud Native DevOps
For those who are not aware, Causal AI is a distinct field in Artificial Intelligence. It is already used extensively in many different industries but until recently there has been no application of the technology in the world of devops.
Causely is a new pioneer championing the use of Causal AI in the world of cloud-native applications. Their platform embodies an understanding of causality: when degraded or failing service entities affect other entities that make up a business service, it can explain the cause and effect by showing the relationship between the root-cause problem and the symptoms it produces.
Through this capability, the team with responsibility for the failing or degraded service can be immediately notified and get to work on resolving the problem. Other teams might also be provided with notifications to let them know that their services are affected, along with an explanation for why this occurred. This eliminates the need for complex triage processes that would otherwise involve multiple teams and managers to orchestrate the process.
Understanding the cause-and-effect relationships in software systems serves as an enabler for automated remediation, predictive maintenance, and planning/gaming out operational resilience.
By using software in this way to automate the process of root cause analysis, organizations can reduce the time and effort and increase the consistency in the troubleshooting process, all of which leads to lower operational costs, improved service availability and less business disruption.
Customer Reactions: Unveiling the Transformative Impact of Causal AI for Cloud-Native DevOps
After sharing insights into Causely’s groundbreaking approach to root cause analysis (RCA) with operational and engineering leaders across various organizations, I’ve gathered a collection of anecdotes that highlight the profound impact this technology is poised to have in the world of cloud-native devops.
Streamlined Incident Resolution and Reduced Triage
“By accurately pinpointing the root cause, we can immediately engage the teams directly responsible for the issue, eliminating the need for war rooms and time-consuming triage processes. This ability to swiftly identify the source of problems and involve the appropriate teams will significantly reduce the time to resolution, minimizing downtime and its associated business impacts.”
Automated Remediation: A Path to Efficiency
“Initially, we’d probably implement a ‘fix it’ button that triggers remediation actions manually. However, as we gain confidence in the results, we can gradually automate the remediation process. This phased approach ensures that we can seamlessly integrate Causely into our existing workflows while gradually transitioning towards a more automated and efficient remediation strategy.”
Empowering Lower-Skilled Team Members
“Lower-skilled team members can take on more responsibilities, freeing up our top experts to focus on code development. By automating RCA tasks and providing clear guidance for remediation, Causely will empower less experienced team members to handle a wider range of issues, allowing senior experts to dedicate their time to more strategic initiatives.”
Building Resilience through Reduced Human Dependency
“Causely will enable us to build greater resilience into our service assurance processes by reducing our reliance on human knowledge and intuition. By automating RCA and providing data-driven insights, Causely will help us build a more resilient infrastructure that is less susceptible to human error and fluctuations in expertise.”
Enhanced Support Beyond Office Hours
“We face challenges maintaining consistent support outside of office hours due to reduced on-call expertise. Causely will enable us to handle incidents with the same level of precision and efficiency regardless of the time of day. Causely’s ability to provide automated RCA and remediation even during off-hours ensures that organizations can maintain a high level of service continuity around the clock.”
Automated Runbook Creation and Maintenance
“I was planning to create runbooks to guide other devops team members through troubleshooting processes. Causely can automatically generate and maintain these runbooks for me. This automated runbook generation eliminates the manual effort required to create and maintain comprehensive troubleshooting guides, ensuring that teams have easy access to the necessary information when resolving issues.”
Simplified Post-Incident Analysis
“Post-incident analysis will become much simpler as we’ll have a detailed record of the cause and effect for every incident. Causely’s comprehensive understanding of cause and effect provides a valuable resource for post-incident analysis, enabling us to improve processes, and prevent similar issues from recurring.”
Faster Problem Identification and Reduced Business Impacts
“Problems will be identified much faster, and there will be fewer business consequences. By automating RCA and providing actionable insights, Causely can significantly reduce the time it takes to identify and resolve problems, minimizing their impact on business operations and customer experience.”
These anecdotes underscore the transformative potential of Causely, offering a compelling vision of how root cause analysis is automated, remediation is streamlined, and operational resilience in cloud-native environments is enhanced. As Causely progresses, the company’s impact on the IT industry is poised to be profound and far-reaching.
Summing Things Up
Troubleshooting in cloud-native environments is complex and resource-intensive, but Causal AI can automate the process, streamline remediation, and enhance operational resilience.
If you would like to learn more about how Causal AI might benefit your organization, don’t hesitate to reach out to me or Causely directly.
Related Resources
- Learn about the causal AI platform from Causely
- Watch the video: Troubleshooting cloud-native applications with Causely
- Request a demo to see Causely in action
Root Cause Chronicles: Connection Collapse
The below post is reposted with permission from its original source on the InfraCloud Technologies blog.
This MySQL connection draining issue highlights the complexity of troubleshooting today’s complex environments, and provides a great illustration of the many rabbit holes SREs find themselves in. It’s critical to understand the ‘WHY’ behind each problem, as it paves the way for faster and more precise resolutions. This is exactly what we at Causely are on a mission to improve using causal AI.
On a usual Friday evening, Robin had just wrapped up their work, wished their colleagues a happy weekend, and turned in for the night. At exactly 3 am, Robin receives a call from the organization’s automated paging system: “High P90 Latency Alert on Shipping Service: 9.28 seconds”.
Robin works as an SRE for Robot-Shop, an e-commerce company that sells various robotics parts and accessories, and this message does not bode well for them tonight. They prepare themselves for a long, arduous night ahead and turn on their work laptop.
Setting the Field
Robot-Shop runs a sufficiently complex cloud native architecture to address the needs of their million-plus customers.
- The traffic from the load balancer is routed via a gateway service optimized for traffic ingestion, called Web, which distributes the traffic across various other services.
- User handles user registrations and sessions.
- Catalogue maintains the inventory in a MongoDB datastore.
- Customers can see the ratings of available products via the Ratings service APIs.
- They choose products they like and add them to the Cart, a service backed by Redis cache to temporarily hold the customer’s choices.
- Once the customer pays via the Payment service, the purchased items are published to a RabbitMQ channel.
- These are consumed by the Dispatch service and prepared for shipping. Shipping uses MySQL as its datastore, as does Ratings.
Troubles in the Dark
“OK, let’s look at the latency dashboards first.” Robin clicks on the Grafana dashboard link attached to the Slack notification for the alert sent by PagerDuty. This opens up the latency graph of the Shipping service.
“How did it go from 1s to ~9.28s within 4-5 minutes? Did traffic spike?” Robin decides to focus on the Gateway ops/sec panel of the dashboard. The number is around ~140 ops/sec. Robin knows this data is coming from their Istio gateway and is reliable. The current number is more than affordable for Robot-Shop’s cluster, though there is a steady uptick in the request-count for Robot-Shop.
None of the other services show any signs of wear and tear, only Shipping. Robin understands this is a localized incident and decides to look at the shipping logs. The logs are sourced from Loki, and the widget is conveniently placed right beneath the latency panel, showing logs from all services in the selected time window. Nothing in the logs, and no errors regarding connection timeouts or failed transactions. So far the only thing going wrong is the latency, but no requests are failing yet; they are only getting delayed by a very long time. Robin makes a note: We need to adjust frontend timeouts for these APIs. We should have already gotten a barrage of request timeout errors as an added signal.
Did a developer deploy an unapproved change yesterday? Usually, the support team is informed of any urgent hotfixes before the weekend. Robin decides to check the ArgoCD Dashboards for any changes to shipping or any other services. Nothing there either, no new feature releases in the last 2 days.
Did the infrastructure team make any changes to the underlying Kubernetes cluster? Any version upgrades? The Infrastructure team uses Atlantis to gate and deploy the cluster updates via Terraform modules. The last date of change is from the previous week.
With no errors seen in the logs and partial service degradation as the only signal available to them, Robin cannot make any more headway into this problem. Something else may be responsible, could it be an upstream or downstream service that the shipping service depends on? Is it one of the datastores? Robin pulls up the Kiali service graph that uses Istio’s mesh to display the service topology to look at the dependencies.
Robin sees that Shipping has now started throwing its first 5xx errors, and both Shipping and Ratings are talking to something labeled as PassthroughCluster. The support team does not maintain any of these platforms and does not have access to the runtimes or the codebase. “I need to get relevant people involved at this point and escalate to folks in my team with higher access levels,” Robin thinks.
Stakeholders Assemble
It’s already been 5 minutes since the first report and customers are now getting affected.
Robin’s team lead Blake joins in on the call, and they also add the backend engineer who owns Shipping service as an SME. The product manager responsible for Shipping has already received the first complaints from the customer support team who has escalated the incident to them; they see the ongoing call on the #live-incidents channel on Slack, and join in. P90 latency alerts are now clogging the production alert channel as the metric has risen to ~4.39 minutes, and 30% of the requests are receiving 5xx responses.
The team now has multiple signals converging on the problem. Blake digs through shipping logs again and sees errors around MySQL connections. At this time, the Ratings service also starts throwing 5xx errors – the problem is now getting compounded.
The Product Manager (PM) says their customer support team is reporting frustration from more and more users who are unable to see the shipping status of the orders they have already paid for and who are supposed to get the deliveries that day. Users who just logged in are unable to see product ratings and are refreshing the pages multiple times to see if the information they want is available.
“If customers can’t make purchase decisions quickly, they’ll go to our competitors,” the PM informs the team.
Blake looks at the PassthroughCluster node on Kiali, and it hits them: it’s the RDS instance. The platform team had forgotten to add RDS as an External Service in their Istio configuration. It was an honest oversight that could cost Robot-Shop significant revenue today.
“I think MySQL is unable to handle new connections for some reason,” Blake says. They pull up the MySQL metrics dashboards and look at the number of Database Connections. It has gone up significantly and then flattened. “Why don’t we have an alert threshold here? It seems like we might have maxed out the MySQL connection pool!”
To verify their hypothesis, Blake looks at the Parameter Group for the RDS Instance. It uses the default-mysql-5.7 Parameter group, and max_connections is set to:
{DBInstanceClassMemory/12582880}
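(For a rough sense of scale: DBInstanceClassMemory is, roughly, the instance’s memory in bytes less what RDS itself reserves, so a hypothetical instance with about 8 GiB available to the engine would default to roughly 8,589,934,592 / 12,582,880 ≈ 682 connections. The exact figure depends on the instance class.)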
But what does that number really mean? Blake decides not to waste time checking the RDS instance type and computing the number. Instead, they log into the RDS instance with mysql-cli and run:
mysql> SHOW VARIABLES LIKE "max_connections";
Then Blake runs:
mysql> SHOW PROCESSLIST;
“I need to know exactly how many,” Blake thinks, and runs:
mysql> SELECT COUNT(host) FROM information_schema.processlist;
It’s more than the number of max_connections. Their hypothesis is now validated: Blake sees a lot of connections are in sleep() mode for more than ~1000 seconds, and all of these are being created by the shipping user.
“I think we have it,” Blake says, “Shipping is not properly handling connection timeouts with the DB; it’s not refreshing its unused connection pool.” The backend engineer pulls up the Java JDBC datasource code for shipping and says that it’s using defaults for max-idle, max-wait, and various other Spring datasource configurations. “These need to be fixed,” they say.
“That would need significant time,” the PM responds, “and we need to mitigate this incident ASAP. We cannot have unhappy customers.”
Blake knows that RDS has a stored procedure to kill idle/bad processes.
mysql> CALL mysql.rds_kill(processID);
Blake tests this out and asks Robin to quickly write a bash script to kill all idle processes.
#!/bin/bash
# Kill long-idle MySQL connections opened by the 'shipping' user on the RDS instance.
# MySQL connection details
MYSQL_USER="<user>"
MYSQL_PASSWORD="<passwd>"
MYSQL_HOST="<rds-name>.<id>.<region>.rds.amazonaws.com"
# Get the IDs of 'shipping' connections that are sitting idle (Sleep)
PROCESS_IDS=$(MYSQL_PWD="$MYSQL_PASSWORD" mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -N -s -e "SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE USER='shipping' AND COMMAND='Sleep'")
# Terminate each idle connection via the RDS-provided stored procedure
for ID in $PROCESS_IDS; do
  MYSQL_PWD="$MYSQL_PASSWORD" mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -e "CALL mysql.rds_kill($ID)"
  echo "Terminated connection with ID $ID for user 'shipping'"
done
The team runs this immediately and the connection pool frees up momentarily. Everyone lets out a visible sigh of relief. “But this won’t hold for long, we need a hotfix on DataSource handling in Shipping,” Blake says. The backend engineer informs the team they are on it, and soon they have a patch ready that sets better defaults for:
spring.datasource.max-active
spring.datasource.max-age
spring.datasource.max-idle
spring.datasource.max-lifetime
spring.datasource.max-open-prepared-statements
spring.datasource.max-wait
spring.datasource.maximum-pool-size
spring.datasource.min-evictable-idle-time-millis
spring.datasource.min-idle
The team approves the hotfix and deploys it, finally mitigating a ~30 minute long incident.
Key Takeaways
Incidents such as this can occur in any organization with a sufficiently complex architecture involving microservices written in different languages and frameworks, datastores, queues, caches, and cloud native components. A lack of understanding of the end-to-end architecture, combined with information silos, only adds to the mitigation timelines.
During this RCA, the team finds out that they have to improve in multiple areas.
- Frontend code had long timeouts and allowed for large latencies in API responses.
- The L1 Engineer did not have an end-to-end understanding of the whole architecture.
- The service mesh dashboard on Kiali did not show External Services correctly, causing confusion.
- RDS MySQL database metrics dashboards did not send an early alert, as no max_connection (alert) or high_number_of_connections (warning) thresholds were set.
- The database connection code was written with the assumption that sane defaults for connection pool parameters were good enough, which proved incorrect.
The pressure to resolve incidents quickly, which often comes from peers, leadership, and members of affected teams, only adds to the chaos of incident management and causes more human errors. Coordinating incidents like this through a dedicated Incident Commander role has produced more controllable outcomes for organizations around the world. An Incident Commander assumes the responsibility of managing resources, planning, and communications during a live incident, effectively reducing conflict and noise.
When multiple stakeholders are affected by an incident, resolutions need to be handled in order of business priority, working on immediate mitigations first, then getting the customer experience back at nominal levels, and only afterward focusing on long-term preventions. Coordinating these priorities across stakeholders is one of the most important functions of an Incident Commander.
Troubleshooting complex architectures remains a challenging activity. However, with the Blameless RCA Framework coupled with periodic metric reviews, a team can focus on incremental but constant improvements to their system’s observability. The team can also convert successful resolutions into future playbooks that can be used by L1 SREs and support teams, making sure that similar errors can be handled well.
Concerted effort around a clear feedback loop of Incident -> Resolution -> RCA -> Playbook Creation eventually rids the system of most unknown-unknowns, allowing teams to focus on Product Development, instead of spending time on chaotic incident handling.
That’s a Wrap
Hope you all enjoyed that story about a hypothetical but complex troubleshooting scenario. We see incidents like this and more across various clients we work with at InfraCloud. The above scenario can be reproduced using our open source repository. We are working on adding more such reproducible production outages and subsequent mitigations to this repository.
We would love to hear from you about your own 3 am incidents. If you have any questions, you can connect with me on Twitter and LinkedIn.
References
- Ending a session or query – Amazon Relational Database Service
- Blameless Root Cause Analyses – The GitLab Handbook
- Configuring a Tomcat Connection Pool in Spring Boot – Baeldung
- Controlling Database Connections: Spring Framework
- Istio / Accessing External Services
- Monitoring Blocked and Passthrough External Service Traffic
- Quotas and constraints for Amazon RDS – Amazon Relational Database Service
- The role of the incident commander – Atlassian
Related Resources
- For a very similar troubleshooting example that uses causal AI to automatically detect and identify root cause in real time, take a look at Enlin Xu’s post “One million ways to slow down your application response time and throughput.”
- To see how Causely works to automate troubleshooting in situations like this, check out these short videos.
Understanding failure scenarios when architecting cloud-native applications
Developing and architecting complex, large cloud-native applications is hard. In this short demo, we’ll show how Causely helps to understand failure scenarios before something actually fails in the environment.
In the demo environment, we have a dozen applications with database servers and caches running in a cluster, providing multiple services. If we drill into these services and focus on the application, we can only see how the application is behaving right now. But Causely automatically identifies the potential root causes, the alerts they would raise, and the services that would be impacted by failures.
For example, a congested service would cause high latency across a number of different downstream dependencies. A malfunction of this service would make services unavailable and cause high error rates on the dependent services.
Causely is able to reason about the specific dependencies and all the possible root causes – not just for services, but for the applications – in terms of: what would happen if their database query takes too long, if their garbage collection time takes too long, or if their transaction latency is high? What services would be impacted, and what alerts would they receive?
This allows developers to design a more resilient system, and operators can understand how to run the environment with their actual dependencies.
We’re hoping that Causely can help application owners avoid production failures and service impact by architecting applications to be resilient in the first place.
What do you think? Share your comments on this use case below.
Troubleshooting cloud-native applications with Causely
Running large, complex, distributed cloud-native applications is hard. This short demo shows how Causely can help.
In this environment, we are running a number of applications with database servers and caches in a cluster, with multiple services, pods, and containers. At any one point in time, we would be getting multiple alerts showing high latency, high CPU utilization, high garbage collection time, and high memory utilization across multiple microservices. Troubleshooting the root cause of each one of these alerts is really difficult.
Causely automatically identifies the root cause and shows how the one service that is actually congested is causing all of these downstream alerts on its dependent services. Instead of individual teams troubleshooting their respective alerts, the team responsible for this product catalog service can focus on remediating and restoring it, while all of the other impacted services are shown so their teams are aware that their problems are caused by congestion in this service. This can significantly reduce the time to detect, remediate, and restore a service.
What do you think? Share your comments on this use case below.
Unveiling the Causal Revolution in Observability
Reposted with permission from LinkedIn.
OpenTelemetry and the Path to Understanding Complex Systems
Decades ago, the IETF (Internet Engineering Task Force) developed an innovative protocol, SNMP, revolutionizing network management. This standardization spurred a surge of innovation, fostering a new software vendor landscape dedicated to streamlining operational processes in network management, encompassing Fault, Configuration, Accounting, Performance, and Security (FCAPS). Today, SNMP reigns as the world’s most widely adopted network management protocol.
On the cusp of a similar revolution stands the realm of application management. For years, the absence of management standards compelled vendors to develop proprietary telemetry for application instrumentation, to enable manageability. Many of the vendors also built applications to report on and visualize managed environments, in an attempt to streamline the processes of incident and performance management.
OpenTelemetry’s emergence is poised to transform the application management market dynamics in a similar way by commoditizing application telemetry instrumentation, collection, and export methods. Consequently, numerous open-source projects and new companies are emerging, building applications that add value around OpenTelemetry.
This evolution is also compelling established vendors to embrace OpenTelemetry. Their futures hinge on their ability to add value around this technology, rather than solely providing innovative methods for application instrumentation.
Adding Value Above OpenTelemetry
While OpenTelemetry simplifies the process of collecting and exporting telemetry data, it doesn’t guarantee the ability to pinpoint the root cause of issues. This is because understanding the causal relationships between events and metrics requires more sophisticated analysis techniques.
Common approaches to analyzing OpenTelemetry data that get devops teams closer to this goal include:
- Visualization and Dashboards: Creating effective visualizations and dashboards is crucial for extracting insights from telemetry data. These visualizations should present data in a clear and concise manner, highlighting trends, anomalies, and relationships between metrics.
- Correlation and Aggregation: To correlate logs, metrics, and traces, you need to establish relationships between these data streams. This can be done using techniques like correlation IDs or trace identifiers, which can be embedded in logs and metrics to link them to their corresponding traces.
- Pattern Recognition and Anomaly Detection: Once you have correlated data, you can apply pattern recognition algorithms to identify anomalies or outliers in metrics, which could indicate potential issues. Anomaly detection tools can also help identify sudden spikes or drops in metrics that might indicate performance bottlenecks or errors (see the short sketch after this list).
- Machine Learning and AI: Machine learning and AI techniques can be employed to analyze telemetry data and identify patterns, correlations, and anomalies that might be difficult to detect manually. These techniques can also be used to predict future performance or identify potential issues before they occur.
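To make the anomaly detection idea above a little more concrete, here is a hedged, minimal sketch in Python using a rolling z-score over hypothetical per-minute p90 latency samples. The window size, threshold, and numbers are arbitrary; production tools use far more robust techniques.
from statistics import mean, stdev
def detect_anomalies(samples, window=20, threshold=3.0):
    """Flag indices whose value sits more than `threshold` standard deviations
    away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
# Hypothetical p90 latency (ms) for a service, one sample per minute.
latency_ms = [110, 120, 115, 118, 112, 121, 117, 119, 113, 116,
              114, 122, 118, 115, 120, 117, 111, 119, 116, 118,
              940]  # sudden spike
print(detect_anomalies(latency_ms))  # -> [20], the spike at the end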
While all of these techniques might help to increase the efficiency of the troubleshooting process, human expertise is still essential for interpreting and understanding the results. This is because these approaches to analyzing telemetry data are based on correlation and lack an inherent understanding of cause and effect (causation).
Avoiding The Correlation Trap: Separating Coincidence from Cause and Effect
In the realm of analyzing observability data, correlation often takes center stage, highlighting the apparent relationship between two or more variables. However, correlation does not imply causation, a crucial distinction that software-driven causal analysis effectively addresses, leading to better outcomes in the following ways:
Operational Efficiency And Control: Correlation-based approaches often leave us grappling with the question of “why,” hindering our ability to pinpoint the root cause of issues. This can lead to inefficient troubleshooting efforts, involving multiple teams in a devops environment as they attempt to unravel the interconnectedness of service entities.
Software-based causal analysis empowers us to bypass this guessing game, directly identifying the root cause and enabling targeted corrective actions. This not only streamlines problem resolution but also empowers teams to proactively implement automations to mitigate future occurrences. It also frees up the time of experts in the devops organizations to focus on shipping features and working on business logic.
Consistency In Responding To Adverse Events: The speed and effectiveness of problem resolution often hinge on the expertise and availability of individuals, a variable factor that can delay critical interventions. Software-based causal analysis removes this human dependency, providing a consistent and standardized approach to root cause identification.
This consistency is particularly crucial in distributed devops environments, where multiple teams manage different components of the system. By leveraging software, organizations can ensure that regardless of the individuals involved, problems are tackled with the same level of precision and efficiency.
Predictive Capabilities And Risk Mitigation: Correlations provide limited insights into future behavior, making it challenging to anticipate and prevent potential problems. Software-based causal analysis, on the other hand, unlocks the power of predictive modeling, enabling organizations to proactively identify and address potential issues before they materialize.
This predictive capability becomes increasingly valuable in complex cloud-native environments, where the interplay of numerous microservices and data pipelines can lead to unforeseen disruptions. By understanding cause and effect relationships, organizations can proactively mitigate risks and enhance operational resilience.
Conclusion
OpenTelemetry marks a significant step towards standardized application management, laying a solid foundation for a more comprehensive understanding of complex systems. However, to truly unlock the full potential, the integration of software-driven causal analysis, also referred to as Causal AI, is essential. By transcending correlation, software-driven causal analysis empowers devops organizations to understand cause and effect of system behavior, enabling proactive problem detection, predictive maintenance, operational risk mitigation and automated remediation.
The founding team of Causely participated in the standards-driven transformation that took place in the network management market more than two decades ago at a company called SMARTS. The core of their solution was built on Causal AI. SMARTS became the market leader in root cause analysis for networks and was acquired by EMC in 2005. The team’s rich experience in Causal AI is now being applied at Causely to address the challenges of managing cloud-native applications.
Causely’s embrace of OpenTelemetry stems from the recognition that this standardized approach will only accelerate the advancement of application management. By streamlining telemetry data collection, OpenTelemetry creates a fertile ground for Causal AI to flourish.
If you are intrigued and would like to learn more about Causal AI, the team at Causely would love to hear from you, so don’t hesitate to get in touch.
All Sides of the Table
Reflecting on the boardroom dynamics that truly matter
This past month has been an eventful one. Like everyone in the tech world, I’m riveted by the drama unfolding at OpenAI, wondering how the board and CEO created such an extreme situation. I’ve been thinking a lot about board dynamics – and how different things look as a founder/CEO vs board member, especially at very different stages of company growth.
Closer to home and far less dramatic, last week we had our quarterly Causely team meetup in NYC, including our first in-person board meeting with 645 Ventures, Amity Ventures and Cervin Ventures. As a remote company, it was great to actually sit together in one room and discuss company and product strategy, including a demo of our latest iteration of the Causely product. Getting aligned and hearing the board’s input was truly helpful as we plan for 2024.
Also in the past month, my board experiences at Corvus Insurance and Chase Corporation came to an end. Corvus (a late-stage insurtech) announced it’s being acquired by Travelers Insurance for $435M, and Chase (a 75-year old manufacturing company) closed its acquisition by KKR for $1.3B. These exits were the culmination of years of work by the management teams (and support of their boards) to create significant shareholder value.
Each of these experiences has shown me models of board interaction and highlighted how critical it is for board members to build trust with the CEO and each other. I thought I’d share some thoughts on the most valuable traits or contributions a board member can offer, and what executives should look for in board members that will make a meaningful impact on the business, depending on the stage and size of the company.
From the startup CEO/founder view
As a founder who’s built and managed my own boards for the past 15 years, I’ve learned a lot about what kinds of board members are most impactful and productive for early-stage life. Here are a few examples of what these board members do:
- They get hands-on to provide real value through introductions to design partners, customers and potential key hires.
- They ask the operational questions relevant for the company’s current stage – for example, is this product decision (or hire) really the one we need to make now, while we are trying to validate product market fit, or can it be deferred?
- They hold the CEO accountable and keep board discussions focused, by asking questions like, “Ellen, are we actually talking about the topic that keeps you up at night?!”
- They don’t project concerns from other companies or past experiences that might be irrelevant.
- They stay calm through end-of-quarter pushes and acquisition processes, and balance the needs of investors and common shareholders.
From the board view
As an independent board member, I now appreciate these board members even more. It can be hard to step back from the operational role (“What do I need to do next?”) and provide guidance and support, sometimes just by asking the right question at the right moment in a board meeting. I find it very helpful to check in with the CEO and other board members before any official meeting, so I understand where the “hot” issues are and what decisions need to be made as a group.
In a public company, the challenge is even greater. The independent board member must maintain this same operator/advisor perspective, but also weigh decisions as they relate to corporate governance and enterprise risk management across a wide range of products, markets and countries. For example, how fast can management drive product innovation that may introduce new cyber risks or data management concerns? And unlike in private and early-stage companies, which tend to focus almost entirely on top-line growth, what is the right balance of growth vs. profitability for a more mature public company?
Building trust is key
As the recent chaos at OpenAI shows (albeit in an extreme way), strong board relationships and ongoing communications between the board and management are critical.
If you’re building a company and/or adding new board seats, think about what a new board member should bring to the table that will help you reach the next phase of growth and major milestones — and stay laser-focused on finding someone that meets your criteria.
If you’re considering serving on the board of a company, think about what kinds of companies you’re best suited to help, and find one where you can work closely with the CEO and where existing board members will complement your skill set and experience.
Regardless of which side of the table you’re on, take the time to build strong relationships and trust. Lead directors, who have taken a more central role in the past several years, can ensure that communications don’t break down. But even in earlier stage companies, it’s the job of everyone around the table to make sure there’s clarity on the key strategic issues the company is facing, and to provide the support that the CEO and management need to make the best decisions for the business.
Related reading
New England Venture Capital Association Unveils Nominees and Academies for the 11th Annual NEVY Awards
SiliconANGLE: ‘Causal AI’ company Causely raises $8.8M to automate IT operations
VentureBeat: Causely launches Causal AI for Kubernetes, raises $8.8M in Seed funding
Why do this startup thing all over again? Our reasons for creating Causely
Why be a serial entrepreneur?
It’s a question that my co-founder, Shmuel, and I are asked many times. Both of us have been to this rodeo twice before: Shmuel with SMARTS and Turbonomic, myself with ClearSky Data and CloudSwitch. There are all the usual reasons, including love of hard challenges, creation of game-changing products and working with teams and customers who inspire you. And of course there’s more than a small share of insanity (or, as one of our founding engineers, Endre Sara, might call it, an addiction to Sisyphean tasks?).
I’ve been pondering this as we build our new venture, Causely. The motivation behind Causely was a long-standing goal of Shmuel’s to tackle a problem he’s addressed in both previous companies, but still feels is unresolved: How to remove the burden of human troubleshooting from the IT industry? (Shmuel is not interested in solving small problems.)
Although there are tools galore and so much data being gathered that it takes data lakes to manage it all, at the heart of the IT industry there is a central problem that hasn’t fundamentally changed in decades. When humans try to determine the root cause of things that make applications slow down or break, they rely on the (often specific) expertise of people on their teams. And the scale, complexity and rate of change of their application environments will always grow faster than humans can keep up.
I saw this during my time at AWS, while I was running some global cloud storage services. No matter how incredible the people were, how well-architected the services, or how robust the mechanisms and tools, when things went wrong (usually at 3 am) it always came down to having the right people online at that moment to figure out what was really happening. Much of the human effort went into stabilizing things as quickly as possible for customers and then digging in for days (or longer) to understand what had happened.
When Peter Bell, our founding investor at Amity Ventures, originally introduced me and Shmuel, it was clear that Shmuel had the answer to this never-ending cycle of applications changing, scaling, breaking, requiring human troubleshooting, writing post-mortems… and starting all over again. He was thinking about the problem from the perspective of what’s actually missing: the ability to capture causality in software. AI and ML technologies are advancing beyond our wildest dreams, but they are still challenged by the inability to automate causation (vs correlation, which they do quite well). By building a causal AI platform that can tackle this huge challenge, we believe Causely can eliminate the need for humans to keep pushing the same rocks up the same hills.
So why do this startup thing all over again?
Because for each new venture there’s always been a big, messy problem that needs to be fixed. One that requires a solution that’s never been done before, that will be exciting to build with smart, creative people.
And so today, we announce funding for Causely and the opening of our Early Access program for initial users. We’ve been quietly designing and building for the past year, working with some awesome design partners. We’re thrilled to have 645 Ventures join us on this journey and are already seeing the impact of support from Aaron, Vardan and the team. We also welcome new investors Glasswing Ventures and Tau Ventures. We hope early users will love the Causely premise and early version of the product and give us the input we need to build something that truly changes how applications are built and operated.
Please take a look and let us know your thoughts.
Learn more
- Read the press release: Causely raises $8.8M in Seed funding to deliver IT industry’s first causal AI platform
- Learn about our causal AI platform for IT
- Request a demo to see Causely in action
Causely raises $8.8M in Seed funding to deliver IT industry’s first causal AI platform
Automation of causality will eliminate human troubleshooting and enable faster, more resilient cloud application management
Boston, June 29, 2023 – Causely, Inc., the causal AI company, today announced it has raised $8.8M in Seed funding, led by 645 Ventures with participation from founding investor Amity Ventures, and including new investors Glasswing Ventures and Tau Ventures. The funding will enable Causely to build its causal AI platform for IT and launch an initial service for applications running in Kubernetes environments. This financing brings the company’s total funding to over $11M since it was founded in 2022.
For years, the IT industry has struggled to make sense of the overwhelming amounts of data coming from dozens of observability platforms and monitoring tools. In a dynamic world of cloud and edge computing, with constantly increasing application complexity and scale, these systems gather metrics and logs about every aspect of application and IT environments. In the end, all this data still requires human troubleshooting to respond to alerts, make sense of patterns, identify root cause, and ultimately determine the best action for remediation. This process, which has not changed fundamentally in decades, is slow, reactive, costly and labor-intensive. As a result, many issues still cause end-user and business impact, especially when complex problems propagate across multiple layers and components of an application.
Causely’s breakthrough approach is to remove the need for human intervention from the entire process by capturing causality in software. By enabling end-to-end automation, from detection through remediation, Causely closes the gap between observability and action, speeding time to remediation and limiting business impact. Unlike existing solutions, Causely’s core technology goes beyond correlation and anomaly detection to identify root cause in dynamic systems, making it possible to see causality relationships across any IT environment in real time.
The founding team, led by veterans Ellen Rubin (founder of ClearSky Data, acquired by Amazon Web Services, and CloudSwitch, acquired by Verizon) and Shmuel Kliger (founder of Turbonomic, acquired by IBM, and SMARTS, acquired by EMC), brings together world-class expertise from the IT Ops, cloud-native and Kubernetes communities and decades of experience successfully building and scaling companies.
“In a world where developers and operators are overwhelmed by data, alert storms and incidents, the current solutions can’t keep up,” said Ellen Rubin, Causely CEO and Founder. “Causely’s vision is to enable self-managed, resilient applications and eliminate the need for human troubleshooting. We are excited to bring this vision to life, with the support and partnership of 645 Ventures, Amity Ventures and our other investors, working closely with our early design partners.”
“Causality is the missing link in automating IT operations,” said Aaron Holiday, Managing Partner at 645 Ventures. “The Causely team is uniquely able to address this difficult and long-standing challenge and we are proud to be part of the next phase of the company’s growth.”
“Having worked with the founding team for many years, I’m excited to see them tackle an unsolved industry problem and automate the entire process from observability through remediation,” said Peter Bell, Chairman at Amity Ventures.
The initial Causely service, for DevOps and SRE users who are building and supporting apps in Kubernetes, is now available in an Early Access program. To learn more, please visit www.causely.io/early-access.
About Causely
Causely, the causal AI company, automates the end-to-end detection, prevention and remediation of critical defects that can cause user and business impact in application environments. Led by veterans from Turbonomic, AWS, SMARTS and other cloud infrastructure companies, Causely is backed by Amity Ventures and 645 Ventures. To learn more, please visit www.causely.io.
About 645 Ventures
645 Ventures is an early-stage venture capital firm that partners with exceptional founders who are building iconic companies. We invest at the Seed and Series A stages and leverage our Voyager software platform to enable our Success team and Connected Network to help founders scale to the growth stage. Select companies that have reached the growth stage include Iterable, Goldbelly, Resident, Eden Health, FiscalNote, Lunchbox, and Squire. 645 has $550m+ in AUM across 5 funds, and is growing fast with backing from leading institutional investors, including university endowments, funds of funds, and pension funds. The firm has offices in New York and SF, and you can learn more at www.645ventures.com.
About Amity Ventures
Amity Ventures is a venture capital firm based in San Francisco, CA. We are a closely knit team with 40+ years of collective experience partnering deeply with entrepreneurs building category-defining technology businesses at the early stage. Amity intentionally invests in a small number of startups per year, and currently has a portfolio of about 25 companies across multiple funds.
Contact
Kelsey Cullen, KCPR
kelsey@kcpr.com
650.438.1063