Author: causely2023

One million ways to slow down your application response time and throughput

This blog was originally posted on LinkedIn.

Navigating the Perilous Waters of Misconfigured MaxOpenConnection in Postgres Applications

Welcome to the inaugural post in our series, “One Million Ways to Slow Down Your Application Response Time and Throughput”. In this series, we will delve into the myriad factors that, if neglected, can throw a wrench in the smooth operation of your applications.

Today, we bring to focus a common yet often under-appreciated aspect related to database configuration and performance tuning in PostgreSQL, affectionately known as Postgres. Although Postgres is renowned for its robustness and flexibility, it isn’t exempt from performance downturns if not properly configured. Our focus today shines on the critical yet frequently mismanaged parameter known as MaxOpenConnection.

Misconfiguration of this parameter can lead to skyrocketing response times and plummeting throughput, thereby negatively influencing your application’s performance and overall user experience. This lesson, as you’ll see, was born from our firsthand experience.

How much you learn from mistakes

The Awakening: From Error to Enlightenment

Our journey into understanding the critical role of the MaxOpenConnection parameter in Postgres performance tuning started with a blunder during the development of our Golang application. We use GORM to establish a connection to a Postgres database in our application. However, in the initial stages, we overlooked the importance of setting the maximum number of open connections with SetMaxOpenConns, a lapse that rapidly manifested its consequences.

Our API requests, heavily reliant on database interactions, experienced significant slowdowns. Our application was reduced to handling a scant three requests per second (RPS), resulting in a bottleneck that severely undermined the user experience.

This dismal performance prompted an extensive review of our code and configurations. The cause? Our connection configuration with the Postgres database. We realized that, by not setting a cap on the number of open connections, we were unwittingly allowing an unlimited number of connections, thereby overwhelming our database and causing significant delays in response times.

Quick to rectify our error, we amended our Golang code to incorporate the SetMaxOpenConns function, limiting the maximum open connections to five. Here’s the modified code snippet:

Code snippet with SetMaxOpenConns
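A minimal sketch of such a configuration, assuming GORM v2 with its Postgres driver; the DSN and the additional idle/lifetime settings are illustrative, not the original values:

```go
package main

import (
	"log"
	"time"

	"gorm.io/driver/postgres"
	"gorm.io/gorm"
)

func main() {
	// Illustrative DSN; substitute your own connection details.
	dsn := "host=localhost user=app password=secret dbname=app port=5432"
	db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{})
	if err != nil {
		log.Fatal(err)
	}

	// GORM wraps a standard *sql.DB connection pool; fetch it to tune it.
	sqlDB, err := db.DB()
	if err != nil {
		log.Fatal(err)
	}

	// The fix: cap the pool at five open connections.
	sqlDB.SetMaxOpenConns(5)

	// Commonly tuned alongside the cap (illustrative values).
	sqlDB.SetMaxIdleConns(5)
	sqlDB.SetConnMaxLifetime(30 * time.Minute)
}
```

Note that SetMaxOpenConns lives on the underlying *sql.DB, not on the GORM object itself, which is easy to miss.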

The difference was monumental. Under the same load test, our application’s performance surged, with RPS increasing roughly 100-fold. This experience underscored the importance of correctly configuring database connection parameters, specifically the MaxOpenConnection parameter.

The MaxOpenConnection Parameter: A Client-Side Perspective

When discussing connection management in a PostgreSQL context, it’s essential to distinguish between client-side and server-side configurations. While Postgres has a server-side parameter known as max_connections, our focus here lies on the client-side control, specifically within our application written in Golang using the GORM library for database operations.

From the client-side perspective, “MaxOpenConnection” is the maximum number of open connections the database driver will maintain for your application. In Go’s database/sql package, this is managed with the SetMaxOpenConns function, which sets a limit on the number of open connections to the database, curtailing the number of concurrent connections the client can establish.

If left unconfigured, the client can attempt to open an unlimited number of connections, leading to significant performance bottlenecks, heightened latency, and reduced throughput in your application. Appropriately managing the maximum number of open connections on the client side is therefore critical for performance optimization.

The Price of Neglecting SetMaxOpenConns

Overlooking the SetMaxOpenConns parameter can severely degrade Postgres database performance. When this parameter isn’t set, Go’s database/sql package doesn’t restrict the number of open connections to the database, allowing the client to open a surplus of connections. While each individual connection may seem lightweight, collectively they can place significant strain on the database server, leading to:

  • Resource Exhaustion: Each database connection consumes resources such as memory and CPU. When there are too many connections, the database may exhaust these resources, leaving fewer available for executing actual queries. This can undermine your database’s overall performance.
  • Increased Contention: Too many open connections, all vying for the database’s resources (like locks, memory buffers, etc.), result in increased contention. Each connection might have to wait its turn to access the resources it needs, leading to an overall slowdown.
  • Increased I/O Operations: More open connections equate to more concurrent queries, which can lead to increased disk I/O operations. If the I/O subsystem can’t keep pace, this can slow down database operations.

Best Practices for Setting Max Open Connections to Optimize Postgres Performance

Establishing an optimal number for maximum open connections requires careful balance, heavily dependent on your specific application needs and your database server’s capacity. Here are some best practices to consider when setting this crucial parameter:

  • Connection Pooling: Implementing a connection pool can help maintain a cache of database connections, eliminating the overhead of opening and closing connections for each transaction. The connection pool can be configured to have a maximum number of connections, thus preventing resource exhaustion.
  • Tune Max Connections: The maximum number of connections should be carefully calibrated. It’s influenced by your application’s needs, your database’s capacity, and your system’s resources. Setting the number too high or too low can impede performance. The optimal max connections value strikes a balance between the maximum concurrent requests your application needs to serve and the resource limit your database can handle.
  • Monitor and Optimize: Keep a constant eye on your database performance and resource utilization. If you observe a high rate of connection errors or if your database is using too many resources, you may need to optimize your settings.

Signing Off

Our experience highlights the importance of correct configuration when interfacing your application with a Postgres database, specifically parameters like MaxOpenConns. These parameters are not just trivial settings; they play a crucial role in defining the performance of both your application and the database.

Ignoring these parameters is akin to driving a car without brakes. By comprehending the implications of each setting and configuring them accordingly, you can stave off unnecessary performance bottlenecks and deliver a smoother, faster user experience. It’s not merely about making your application work – it’s about ensuring it functions efficiently and effectively.

To conclude, it’s crucial to understand that there is no universally applicable method to set up database connections. It’s not merely about setting thresholds for monitoring purposes, as this often leads to more disturbance than usefulness. The critical aspect to keep an eye on is potential misuse of the database connection by a client, leading to adverse effects on the database and its other clients. This becomes especially complex when dealing with shared databases, as the “noisy neighbor” phenomenon can exacerbate problems if an application isn’t correctly set up. Each application has distinct needs and behaviors, thus requiring a carefully thought-out, bespoke configuration to guarantee maximum efficiency.

Bonus

Curious about the potential symptoms caused by a noisy application on a database connection? Take a look at the causality view presented by Causely:

Application: Database Connection Noisy Neighbor causing service and infrastructure symptoms

According to the causality diagram, the application acting as a noisy neighbor on the database connection drives up CPU usage in the Postgres container. The Postgres container, in turn, becomes a noisy neighbor of the CPU on the Kind node on which it runs. This elevated CPU utilization on the Kind node directly results in higher service latency for clients accessing the service provided by pods residing on the same node. Addressing each issue individually by merely allocating more resources is therefore a temporary fix rather than a sustainable solution.

DevOps may have cheated death, but do we all need to work for the king of the underworld?

Sisyphus. Source: https://commons.wikimedia.org/wiki/File:Punishment_sisyph.jpg

This blog was originally posted on LinkedIn.

How causality can eliminate human troubleshooting

Tasks that are both laborious and futile are described as Sisyphean. In Greek mythology, Sisyphus was the founder and king of Ephyra (now known as Corinth). Hades – the king of the underworld – punished him for cheating death twice by forcing him to roll an immense boulder up a hill only for it to roll back down every time it neared the top, repeating this action for eternity.

The goal of application support engineers is to identify, detect, remediate, and prevent failures or violations of service level objectives (SLOs). DevOps has been pronounced dead by some, but still seems to be tasked with building and running apps at scale. Observability tools provide comprehensive monitoring, proactive alerting, anomaly detection, and maybe even some automation of routine tasks, such as scaling resources. But they leave the Sisyphean heavy lifting of troubleshooting, incident response and remediation, and root cause analysis and continuous improvement during or after an outage to humans.

Businesses are changing rapidly; application management has to change

Today’s environments are highly dynamic. Businesses must be able to rapidly adjust their operations, scale resources, deliver sophisticated services, facilitate seamless interactions, and adapt quickly to changing market conditions.

The scale and complexity of application environments is expanding continuously. Application architectures are increasingly complex, with organizations relying on a larger number of cloud services from multiple providers. There are more systems to troubleshoot and optimize, and more data points to keep track of. Data is growing exponentially across all technology domains affecting its collection, transport, storage, and analysis. Application management relies on technologies that try to capture this growing complexity and volume, but those technologies are limited by the fact that they’re based on data and models that assume that the future will look a lot like the past. This approach can be effective in relatively static environments where patterns and relationships remain consistent over time. However, in today’s rapidly changing environments, this will fail.

As a result, application support leaders find it increasingly difficult to manage the increasing complexity and growing volume of data in cloud-native technology stacks. Operating dynamic application environments is simply beyond human scale, especially in real time. The continuous growth of data generated by user interactions, cloud instances, and containers requires a shift in mindset and management approaches.

Relationships between applications and infrastructure components are complex and constantly changing

A major reason these relationships and entities are constantly changing is the dynamic nature of application and infrastructure components. A container can be created and destroyed in seconds to minutes, and every such change brings new tags, labels, and metrics. This illustrates the sheer volume, cardinality, and complexity of observability datasets.

The complexity and constant change within application environments is why it can take days to figure out what is causing a problem. It’s hard to capture causality in a dataset that’s constantly changing based on new sets of applications, new databases, new infrastructure, new software versions, etc. As soon as you identify one correlation, the landscape has likely already changed.

Correlation is not causation. Source: https://twitter.com/OdedRechavi/status/1442759942553968640/photo/1

Correlation is NOT causation

The most common trap that people fall into is assuming correlation equals causation. Correlation and causation both indicate a relationship exists between two occurrences, but correlation is non-directional, while causation implies direction. In other words, causation concludes that one occurrence was the consequence of another occurrence.

It’s important to clearly distinguish correlation from causation before jumping to any conclusions. Neither pattern identification nor trend identification is causation. Even if you apply correlation on top of an identified trend, you won’t get the root cause. Without causality, you cannot understand the root cause of a set of observations and without the root cause, the problem cannot be resolved or prevented in the future.

Blame the network. Source @ioshints

Don’t assume that all application issues are caused by infrastructure

In application environments, the conflation between correlation and causation often manifests through assumptions that symptoms propagate on a predefined path – or, to be more specific, that all application issues stem from infrastructure limitations or barriers. How many times have you heard that it is always “the network’s fault”?

In a typical microservices environment, application support teams will start getting calls and alerts about various clients experiencing high latency, which will also lead to the respective SLOs being violated. These symptoms can be caused by increased traffic, inefficient algorithms, misconfigured or insufficient resources, or noisy neighbors in a shared environment. Identifying the root cause across multiple layers of the stack, typically managed by different application and infrastructure teams, can be incredibly difficult. It requires not just observability data, including logs, metrics, time-series anomalies, and topological relationships, but also the causality knowledge to reason about whether this is an application problem impacting the infrastructure, an infrastructure problem impacting the applications, or applications and microservices impacting each other.

Capture knowledge, not just data

Gathering more data points about every aspect of an application environment will not enable you to learn causality – especially in a highly dynamic application environment. Causation can’t be learned only by observing data or generating more alerts. It can be validated or enhanced as you get data, but you shouldn’t start there.

Think failures/defects, not alerts

Start by thinking about failures/defects instead of the alerts or symptoms that are being observed. Failures require intervention and either recur or currently cannot be resolved. Only when you know the failures you care about should you look at the alerts or symptoms that may be caused by them.

Root cause analysis (RCA) is the problem of inferring failures from an observed set of symptoms. For example, a poor choice of algorithm or data structure may produce service latency, high CPU, or high memory utilization as observed symptoms and alerts; that root cause can then be inferred from the observed symptoms.

Causal AI is required to solve the RCA problem

Causal AI is an artificial intelligence system that can explain cause and effect. Unlike predictive AI models that are based on historical data, systems based on causal AI provide insight by identifying the underlying web of causality for a given behavior or event. The concept of causal AI and the limits of machine learning were raised by Judea Pearl, the Turing Award-winning computer scientist and philosopher, in The Book of Why: The New Science of Cause and Effect.

“Machines’ lack of understanding of causal relations is perhaps the biggest roadblock to giving them human-level intelligence.”
– Judea Pearl, The Book of Why

Causal graphs are the clearest way to illustrate causal AI implementations. A causal graph is a visual representation that uses arrows to indicate causal relationships between events across multiple entities.

Causality Chain

Database Noisy Neighbor causing service and infrastructure symptoms

In this example, we observe multiple clients experiencing errors and service latency, as well as neighboring microservices starved of compute resources. Any attempt to tackle the symptoms independently, for instance by increasing the CPU limit or horizontally scaling the impacted service, will not solve the real problem.

The best explanation for this combination of observed symptoms is the problem with the application’s interaction with the database. The root cause can be inferred even when not all the symptoms are observed. Instead of troubleshooting individual client errors or infrastructure symptoms, the application support team can focus on the root cause and fix the application.

Capturing this human knowledge in a declarative form allows causal AI to reason about not just the observed symptoms but also the missing symptoms in the context of the causality propagations between application and infrastructure events. You need to have a structured way of capturing the knowledge that already exists in the minds and experiences of application support teams.

Wrapping up

Hopefully this blog helps you to begin to think about causality and how you can capture your own knowledge in causality chains like the one above. Human troubleshooting needs to be relegated to history and replaced with automated causality systems.

This is something we think about a lot at Causely, and would love any feedback or commentary about your own experiences trying to solve these kinds of problems.
