Author: causely2023

Navigating Kafka and the Challenges of Asynchronous Communication

Example of distributed tracing with sync communication

Welcome back to our series, “One Million Ways to Slow Down Your Application.” Having previously delved into the nuances of Postgres configurations, we now journey into the world of Kafka and asynchronous communication, another critical component of scalable applications.

Kafka 101: An Introduction

Kafka is an open-source stream-processing software platform. Developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. It is designed to handle data streams and provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Top Use Cases for Kafka

Kafka’s versatility allows for different application use cases, including:

  • Real-Time Analytics: Analyzing data in real-time can provide companies with a competitive edge. Kafka allows businesses to process massive streams of data on the fly.
  • Event Sourcing: This is a method of capturing changes to an application state as a series of events which can be processed, stored, and replayed.
  • Log Aggregation: Kafka can consolidate logs from multiple services and applications, ensuring centralized logging and ease of access.
  • Stream Processing: With tools like Kafka Streams and KSQL, Kafka can be used for complex stream processing tasks.

Typical Failures of Kafka

Kafka is resilient, but like any system, it can fail. Some of the most common failures include:

  • Broker Failures: Kafka brokers can fail due to hardware issues, lack of resources or misconfigurations.
  • Zookeeper Outages: Kafka relies on Zookeeper for distributed coordination. If Zookeeper faces issues, Kafka can be adversely impacted.
  • Network Issues: Kafka relies heavily on networking. Network partitions or latencies can cause data delays or loss.
  • Disk Failures: Kafka persists data on disk. Any disk-related issues can impact its performance or cause data loss.

Typical Manifestations of Kafka Failures

Broker Metrics
Brokers are pivotal in the Kafka ecosystem, acting as the central hub for data transfer. Monitoring these metrics can help you catch early signs of failures:

  • Under Replicated Partitions: A higher than usual count can indicate issues with data replication, possibly due to node failures.
  • Offline Partitions Count: If this is non-zero, it signifies that some partitions are not being served by any broker, which is a severe issue.
  • Active Controller Count: There should only ever be one active controller. A deviation from this norm suggests issues.
  • Log Flush Latency: An increase in this metric can indicate disk issues or high I/O wait, affecting Kafka’s performance.
  • Request Handler Average Idle Percent: A decrease can indicate that the broker is getting overwhelmed.

Consumer Metrics
Consumers pull data from brokers. Ensuring they function correctly is vital for any application depending on Kafka:

  • Consumer Lag: Indicates how much data the consumer is behind in reading from Kafka. A consistently increasing lag may denote a slow or stuck consumer.
  • Commit Rate: A drop in the commit rate can suggest that consumers aren’t processing messages as they should.
  • Fetch Rate: A decline in this metric indicates the consumer isn’t fetching data from the broker at the expected rate, potentially pointing to networking or broker issues.
  • Rebalance Rate: Frequent rebalances can negatively affect the throughput of the consumer. Monitoring this can help identify instability in the consumer group.

Producer Metrics
Producers push data into Kafka. Their health directly affects the timeliness and integrity of data in the Kafka ecosystem:

  • Message Send Rate: A sudden drop can denote issues with the producer’s ability to send messages, possibly due to network issues.
  • Record Error Rate: An uptick in errors can signify that messages are not being accepted by brokers, perhaps due to topic misconfigurations or broker overloads.
  • Request Latency: A surge in latency can indicate network delays or issues with the broker handling requests.
  • Byte Rate: A drop can suggest potential issues in the pipeline leading to the producer or within the producer itself.


The Criticality of Causality in Kafka

Understanding causality between failures and how they are manifested in Kafka is vital. Failures, be they from broker disruptions, Zookeeper outages, or network inconsistencies, send ripples across the Kafka ecosystem, impacting various components. For instance, a spike in consumer lag could be traced back to a broker handling under-replicated partitions, and an increase in producer latency might indicate network issues or an overloaded broker.

Furthermore, applications using asynchronous communications are much more difficult to troubleshoot than those using synchronous communications. As seen in the examples below, it’s pretty straightforward to troubleshoot using distributed tracing if the communication is synchronous. But with asynchronous communication, there are gaps in the spans that make it harder to understand what’s happening.

Example of distributed tracing with sync communication

Figure 1: Example of distributed tracing with sync communication


Figure 2: Example of distributed tracing for async communication

Figure 2: Example of distributed tracing for async communication


This isn’t about drawing a straight line from failure to manifestation; it’s about unraveling a complex network of events and repercussions. For every failure that occurs, the developer must first manually determine where the failure happened—was it the Broker? The Zookeeper? The Consumer? Following this, they need to zoom in and figure out the specific problem. Is it a broker misconfiguration or a lack of resources? A misconfigured Zookeeper? Or is the consumer application not consuming messages quickly enough, resulting in disk full?

Software automation that captures causality can help get to the correct answer!


A Broker failure causes Producer failure

Figure 3: A Broker failure causes Producer failure

Signing Off

Delving into Kafka highlights the complexities of asynchronous communication in today’s apps. Just like our previous exploration of Postgres, getting the configuration right and understanding causality are key.

By understanding the role of each component and what could go wrong, developer teams can focus on developing applications instead of troubleshooting what happened in Kafka.

Keep an eye out for more insights as we navigate the diverse challenges of managing resilient applications. Remember, it’s not only about avoiding slowdowns, but also about building a system that excels in any situation.

Related Resources

One million ways to slow down your application response time and throughput

Application: Database Connection Noisy Neighbor causing service and infrastructure symptoms

This blog was originally posted on LinkedIn.

Navigating the Perilous Waters of Misconfigured MaxOpenConnection in Postgres Applications

Welcome to the inaugural post in our series, “One Million Ways to Slow Down Your Application Response Time and Throughput”. In this series, we will delve into the myriad of factors that, if neglected, can throw a wrench in the smooth operation of your applications.

Today, we bring to focus a common yet often under-appreciated aspect related to database configuration and performance tuning in PostgreSQL, affectionately known as Postgres. Although Postgres is renowned for its robustness and flexibility, it isn’t exempt from performance downturns if not properly configured. Our focus today shines on the critical yet frequently mismanaged parameter known as MaxOpenConnection.

Misconfiguration of this parameter can lead to skyrocketing response times and plummeting throughput, thereby negatively influencing your application’s performance and overall user experience. This lesson, as you’ll see, was born from our first hand experience.

How much you learnt from mistakes

How much you learnt from mistakes

The Awakening: From Error to Enlightenment

Our journey into understanding the critical role of the MaxOpenConnection parameter in Postgres performance tuning started with a blunder during the development of our Golang application. We employ Gorm to establish a connection to a Postgres database in our application. However, in the initial stages, we overlooked the importance of setting the maximum number of open connections with SetMaxOpenConns, a lapse that rapidly manifested its consequences.

Our API requests, heavily reliant on database interactions, experienced significant slowdowns. Our application was reduced to handling a scanty three Requests Per Second (RPS), resulting in a bottleneck that severely undermined the user experience.

This dismal performance prompted an extensive review of our code and configurations. The cause? Our connection configuration with the Postgres database. We realized that, by not setting a cap on the number of open connections, we were unwittingly allowing an unlimited number of connections, thereby overwhelming our database and causing significant delays in response times.

Quick to rectify our error, we amended our Golang code to incorporate the SetMaxOpenConns function, limiting the maximum open connections to five. Here’s the modified code snippet:

Code snippet with SetMaxOpenConns

Code snippet with SetMaxOpenConns


The difference was monumental. With the same load test, our application’s performance surged, with our RPS amplifying by a remarkable 100 times. This situation underscored the significance of correctly configuring database connection parameters, specifically the MaxOpenConnection parameter.

The MaxOpenConnection Parameter: A Client-Side Perspective

When discussing connection management in a PostgreSQL context, it’s essential to distinguish between client-side and server-side configurations. While Postgres has a server-side parameter known as max_connections, our focus here lies on the client-side control, specifically within our application written in Golang using the GORM library for database operations.

From the client-side perspective, “MaxOpenConnection” is the maximum number of open connections the database driver can maintain for your application. In Golang’s database/SQL package, this is managed using the SetMaxOpenConns function. This function sets a limit on the maximum number of open connections to the database, curtailing the number of concurrent connections the client can establish.

If left un-configured, the client can attempt to open an unlimited number of connections, leading to significant performance bottlenecks, heightened latency, and reduced throughput in your application. Thus, appropriately managing the maximum number of open connections on the client-side is critical for performance optimization.

The Price of Neglecting SetMaxOpenConns

Overlooking the SetMaxOpenConns parameter can severely degrade Postgres database performance. When this parameter isn’t set, Golang’s database/SQL package doesn’t restrict the number of open connections to the database, allowing the client to open a surplus of connections. While each individual connection may seem lightweight, collectively, they can place a significant strain on the database server, leading to:

  • Resource Exhaustion: Each database connection consumes resources such as memory and CPU. When there are too many connections, the database may exhaust these resources, leaving fewer available for executing actual queries. This can undermine your database’s overall performance.
  • Increased Contention: Too many open connections, all vying for the database’s resources (like locks, memory buffers, etc.), result in increased contention. Each connection might have to wait its turn to access the resources it needs, leading to an overall slowdown.
  • Increased I/O Operations: More open connections equate to more concurrent queries, which can lead to increased disk I/O operations. If the I/O subsystem can’t keep pace, this can slow down database operations.

Best Practices for Setting Max Open Connections to Optimize Postgres Performance

Establishing an optimal number for maximum open connections requires careful balance, heavily dependent on your specific application needs and your database server’s capacity. Here are some best practices to consider when setting this crucial parameter:

  • Connection Pooling: Implementing a connection pool can help maintain a cache of database connections, eliminating the overhead of opening and closing connections for each transaction. The connection pool can be configured to have a maximum number of connections, thus preventing resource exhaustion.
  • Tune Max Connections: The maximum number of connections should be carefully calibrated. It’s influenced by your application’s needs, your database’s capacity, and your system’s resources. Setting the number too high or too low can impede performance. The optimal max connections value strikes a balance between the maximum concurrent requests your application needs to serve and the resource limit your database can handle.
  • Monitor and Optimize: Keep a constant eye on your database performance and resource utilization. If you observe a high rate of connection errors or if your database is using too many resources, you may need to optimize your settings.

Signing Off

Our experience highlights the importance of correct configuration when interfacing your application with a Postgres database, specifically parameters like MaxOpenConns. These parameters are not just trivial settings; they play a crucial role in defining the performance of both your application and the database.

Ignoring these parameters is akin to driving a car without brakes. By comprehending the implications of each setting and configuring them accordingly, you can stave off unnecessary performance bottlenecks and deliver a smoother, faster user experience. It’s not merely about making your application work – it’s about ensuring it functions efficiently and effectively.

To conclude, it’s crucial to understand that there is no universally applicable method to set up database connections. It’s not merely about setting thresholds for monitoring purposes, as this often leads to more disturbance than usefulness. The critical aspect to keep an eye on is potential misuse of the database connection by a client, leading to adverse effects on the database and its other clients. This becomes especially complex when dealing with shared databases, as the “noisy neighbor” phenomenon can exacerbate problems if an application isn’t correctly set up. Each application has distinct needs and behaviors, thus requiring a carefully thought-out, bespoke configuration to guarantee maximum efficiency.


Curious about the potential symptoms caused by a noisy application on a database connection? Take a look at the causality view presented by Causely:

Application: Database Connection Noisy Neighbor causing service and infrastructure symptoms

Application: Database Connection Noisy Neighbor causing service and infrastructure symptoms

According to the causality diagram, the application noisy neighbor of the database connection leads to increased CPU usage in the Postgres container. Consequently, the Postgres container becomes the noisy neighbor of the CPU on the specific Kind node where it runs on. This elevated CPU utilization on the Kind node directly results in higher service latency for clients attempting to access the service provided by the pods residing on the same node. Therefore, addressing each issue individually by merely allocating more resources equates to applying a temporary fix rather than a sustainable solution.

Learn more

DevOps may have cheated death, but do we all need to work for the king of the underworld?

Causality Chain

Sisyphus. Source:

This blog was originally posted on LinkedIn.

How causality can eliminate human troubleshooting

Tasks that are both laborious and futile are described as Sisyphean. In Greek mythology, Sisyphus was the founder and king of Ephyra (now known as Corinth). Hades – the king of the underworld – punished him for cheating death twice by forcing him to roll an immense boulder up a hill only for it to roll back down every time it neared the top, repeating this action for eternity.

The goal of application support engineers is to identify, detect, remediate, and prevent failures or violations of service level objectives (SLOs). DevOps have been pronounced dead by some, but still seem to be tasked with building and running apps at scale. Observability tools provide comprehensive monitoring, proactive alerting, anomaly detection, and maybe even some automation of routine tasks, such as scaling resources. But they leave the Sisyphean heavy lifting job of troubleshooting, incident response and remediation, as well as root cause analysis and continuous improvements during or after an outage, to humans.

Businesses are changing rapidly; application management has to change

Today’s environments are highly dynamic. Businesses must be able to rapidly adjust their operations, scale resources, deliver sophisticated services, facilitate seamless interactions, and adapt quickly to changing market conditions.

The scale and complexity of application environments is expanding continuously. Application architectures are increasingly complex, with organizations relying on a larger number of cloud services from multiple providers. There are more systems to troubleshoot and optimize, and more data points to keep track of. Data is growing exponentially across all technology domains affecting its collection, transport, storage, and analysis. Application management relies on technologies that try to capture this growing complexity and volume, but those technologies are limited by the fact that they’re based on data and models that assume that the future will look a lot like the past. This approach can be effective in relatively static environments where patterns and relationships remain consistent over time. However, in today’s rapidly changing environments, this will fail.

As a result, application support leaders find it increasingly difficult to manage the increasing complexity and growing volume of data in cloud-native technology stacks. Operating dynamic application environments is simply beyond human scale, especially in real time. The continuous growth of data generated by user interactions, cloud instances, and containers requires a shift in mindset and management approaches.

Relationships between applications and infrastructure components are complex and constantly changing

A major reason that relationships and related entities are constantly changing is because of the complicated and dynamic nature of application and infrastructure components. Creating a new container and destroying it takes seconds to minutes each time, and with every change includes changes to tags, labels, and metrics. This demonstrates the sheer volume, cardinality, and complexity of observability datasets.

The complexity and constant change within application environments is why it can take days to figure out what is causing a problem. It’s hard to capture causality in a dataset that’s constantly changing based on new sets of applications, new databases, new infrastructure, new software versions, etc. As soon as you identify one correlation, the landscape has likely already changed.

Correlation is not causation

Correlation is not causation. Source:

Correlation is NOT causation

The most common trap that people fall into is assuming correlation equals causation. Correlation and causation both indicate a relationship exists between two occurrences, but correlation is non-directional, while causation implies direction. In other words, causation concludes that one occurrence was the consequence of another occurrence.

It’s important to clearly distinguish correlation from causation before jumping to any conclusions. Neither pattern identification nor trend identification is causation. Even if you apply correlation on top of an identified trend, you won’t get the root cause. Without causality, you cannot understand the root cause of a set of observations and without the root cause, the problem cannot be resolved or prevented in the future.

Blame the network

Blame the network. Source @ioshints

Don’t assume that all application issues are caused by infrastructure

In application environments, the conflation between correlation and causation often manifests through assumptions that symptoms propagate on a predefined path – or, to be more specific, that all application issues stem from infrastructure limitations or barriers. How many times have you heard that it is always “the network’s fault”?

In a typical microservices environment, application support teams will start getting calls and alerts about various clients experiencing high latency, which will also lead to the respective SLOs being violated. These symptoms can be caused by increased traffic, inefficient algorithms, misconfigured or insufficient resources or noisy neighbors in a shared environment. Identifying the root cause across multiple layers of the stack, typically managed by different application and infrastructure teams, can be incredibly difficult. It requires not just observability data including logs, metrics, time-series anomalies, and topological relationships, but also the causality knowledge to reason if this is an application problem impacting the infrastructure vs. an infrastructure problem impacting the applications, or even applications and microservices impacting each other.

Capture knowledge, not just data

Gathering more data points about every aspect of an application environment will not enable you to learn causality – especially in a highly dynamic application environment. Causation can’t be learned only by observing data or generating more alerts. It can be validated or enhanced as you get data, but you shouldn’t start there.

Think failures/defects, not alerts

Start by thinking about failures/defects instead of the alerts or symptoms that are being observed. Failures require intervention and either recur or currently cannot be resolved. Only when you know the failures you care about should you look at the alerts or symptoms that may be caused by them.

Root cause analysis (RCA) is the problem of inferring failures from an observed set of symptoms. For example, bad choices of algorithms or data structures may cause service latency, high CPU or high memory utilization as observed symptoms and alerts. The root cause of bad choices of algorithms and data structures can be inferred from the observed symptoms.

Causal AI is required to solve the RCA problem

Causal AI is an artificial intelligence system that can explain cause and effect. Unlike predictive AI models that are based on historical data, systems based on causal AI provide insight by identifying the underlying web of causality for a given behavior or event. The concept of causal AI and the limits of machine learning were raised by Judea Pearl, the Turing Award-winning computer scientist and philosopher, in The Book of Why: The New Science of Cause and Effect.

“Machines’ lack of understanding of causal relations is perhaps the biggest roadblock to giving them human-level intelligence.”
– Judea Pearl, The Book of Why

Causal graphs are the best illustration of causal AI implementations. A causal graph is a visual representation that usually shows arrows to indicate causal relationships between different events across multiple entities.

Causality Chain

Database Noisy Neighbor causing service and infrastructure symptoms

In this example, we are observing multiple clients experiencing errors and service latency, as well as neighboring microservices suffering from not getting enough compute resources. Any attempt to tackle the symptoms independently, by for instance increasing CPU limit, or horizontal scaling the impacted service, will not solve the REAL problem.

The best explanation for this combination of observed symptoms is the problem with the application’s interaction with the database. The root cause can be inferred even when not all the symptoms are observed. Instead of troubleshooting individual client errors or infrastructure symptoms, the application support team can focus on the root cause and fix the application.

Capturing this human knowledge in a declarative form allows causal AI to reason about not just the observed symptoms but also the missing symptoms in the context of the causality propagations between application and infrastructure events. You need to have a structured way of capturing the knowledge that already exists in the minds and experiences of application support teams.

Wrapping up

Hopefully this blog helps you to begin to think about causality and how you can capture your own knowledge in causality chains like the one above. Human troubleshooting needs to be relegated to history and replaced with automated causality systems.

This is something we think about a lot at Causely, and would love any feedback or commentary about your own experiences trying to solve these kinds of problems.

Related resources