Diagnosing performance degradation under adverse circumstances

[This post is a few years old and was never published. Recently, I was reminded about memcached slab imbalance, which in turn reminded me of this post.]

At work, we encountered a sudden and precipitous performance regression on one particular page of a legacy application. It's a Perl web application, running under mod_perl, using ModPerl::RegistryLoader to compile scripts at server startup, and Apache::DBI to provide persistent database connections.

Our users suddenly began complaining about one particular page being "three times slower than normal." Later examination of the Apache logs showed a 20x(!!) slowdown.

Investigating this performance problem was interesting because we didn't have good access to the data we needed, and our technology choices slowed us down or prevented us from collecting it at all. Although we solved the mystery, the experience taught us several important lessons.

Over the course of several days, we examined a number of possibilities for the sudden performance degradation.

Initially, we thought it was related to a release we had deployed the same day the performance issues were first reported. Nobody on the team could think of a relevant change, but I read the entire diff between the two releases to confirm it. We hadn't touched the code, so we quickly started thinking about external factors: the database, caching, server health, and so on.

I was unable to reproduce the performance issue in a development environment, but when I disabled memcached, I saw pathological slowdowns on all pages. This was an interesting result, because we had recently started caching some frequently-accessed files in memcached. That could potentially fill the cache, or cause a slab imbalance, for example. These newly cached items belonged to a separate application -- perhaps we were hitting the limit on the number of connections.

But examination of memcached statistics (what statistics we had available; recent versions of memcached provide much better data) showed that it was performing normally in the production environment. The cache hadn't reached maxbytes, and the server had memory available for allocation. There were no cache evictions, so slab imbalance wasn't to blame, and the number of connections was nowhere near the limit.
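Those by-hand checks can be scripted rather than eyeballed. Here is a minimal sketch, in Python for illustration only: the SAMPLE stats text below is made up, and in practice you would read it from the server's stats command (e.g. "echo stats | nc host 11211").

```python
# Health-check sketch for memcached's "stats" output. SAMPLE is a
# hypothetical payload, not data from our servers.

def parse_stats(text):
    """Parse 'STAT name value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

SAMPLE = """\
STAT curr_connections 12
STAT limit_maxbytes 67108864
STAT bytes 1048576
STAT evictions 0
END"""

stats = parse_stats(SAMPLE)

# The three checks we did by hand: below maxbytes, no evictions
# (so no slab-imbalance pressure), connections well under the limit.
healthy = (
    int(stats["bytes"]) < int(stats["limit_maxbytes"])
    and int(stats["evictions"]) == 0
    and int(stats["curr_connections"]) < 1024  # assumed connection limit
)
```

Wiring something like this into a cron job or monitoring check is cheap, and it turns "we think memcached is fine" into a number you can alert on.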

So, we had two interesting data points:

  1. With memcached, the application performed normally in the development environment.
  2. Without memcached, everything was slow in the development environment -- much like the problem in production, except that in production the slowness was isolated to one particular page.

So, we needed data from production to see what was really happening there, instead of extrapolating from a development environment. We obtained a clone of the server, hooked it up to the production database, and used SSH port forwarding to connect to the production server's memcached. This should behave just like the live system, except with only one user accessing the application.

Lo and behold, I saw performance degradation for just the one page, just like on the live system. That's great -- now we need tracing or profiling data so we can see what's going on beneath the surface.

Well, the combination of technologies we chose made getting DBI_PROFILE enabled surprisingly difficult. After wasting more hours than I care to admit, I gave up and reached for Devel::NYTProf instead. In the end, it turned out to be the more useful tool anyway.

This application uses an in-house legacy ORM-like tool for much of its database access. With memcached disabled entirely, I could see that all of the object constructors were very slow -- this "ORM" relies heavily on caching for performance. I had been pointing to the "ORM" as the culprit, since it clearly performed poorly without a cache -- but it did have a cache in production. That difference was the last piece of the puzzle.

With the new profiling data from the production-like system, I could see that the ORM's object constructors were performing poorly for only a single type, rather than all of them -- and the affected page uses that particular type's constructor. It turns out that constructor loads an entire database table and attempts to store it as a single key in memcached. BINGO!

Memcached has a 1MB per-item limit by default, and upon checking the size of the table, it had just reached 1.1MB. We have a winner! (Or is that a loser?)
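A guard against this failure mode is cheap: measure the serialized size before the set, and fail loudly instead of silently. A sketch, in Python for illustration -- DictCache and cache_set_checked are hypothetical stand-ins, not our ORM's actual API:

```python
import pickle

ITEM_LIMIT = 1024 * 1024  # memcached's default per-item ceiling


class DictCache:
    """Toy stand-in for a memcached client."""
    def __init__(self):
        self.data = {}

    def set(self, key, payload):
        self.data[key] = payload
        return True


def cache_set_checked(cache, key, value):
    """Refuse loudly to cache a value memcached would reject anyway."""
    payload = pickle.dumps(value)
    if len(payload) >= ITEM_LIMIT:
        raise ValueError(
            "%s: %d bytes exceeds the %d-byte item limit"
            % (key, len(payload), ITEM_LIMIT)
        )
    return cache.set(key, payload)


cache = DictCache()

# Roughly what bit us: ~1.1MB of table rows serialized as one value.
rows = ["row-%04d-" % i + "x" * 1100 for i in range(1000)]
try:
    cache_set_checked(cache, "whole_table", rows)
except ValueError:
    rejected = True  # raises: the serialized table is over the limit

ok = cache_set_checked(cache, "one_row", rows[0])  # a single row fits fine
```

An exception here would have surfaced the problem the day the table crossed the limit, instead of weeks later via user complaints.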

We don't need the whole table, because some of those records are inactive. Actually, we don't even need all the active records, because only some (usually zero!) are relevant for the current request. And really, we don't need the whole row, just one column. And we certainly don't need to store all this data as a single item in memcached. This is a whole huge bag of fail.
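The shape of the eventual fix is worth sketching: cache one small item per record, holding just the column we need, rather than one giant item for the whole table. A hypothetical Python sketch -- the dict stands in for memcached, and lookup_status is invented for illustration, not our ORM's interface:

```python
cache = {}  # stand-in for memcached


def lookup_status(record_id, db_fetch):
    """Return just the one column we need, cached per record."""
    key = "table:status:%d" % record_id
    if key not in cache:
        # Stand-in for "SELECT status FROM the_table WHERE id = ?":
        cache[key] = db_fetch(record_id)
    return cache[key]


fake_db = {7: "active", 8: "inactive"}
status = lookup_status(7, fake_db.get)

# Each cached item is now a few bytes, far below memcached's item limit,
# and a request that touches zero relevant records caches nothing at all.
```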

The easiest short-term fix is to upgrade memcached and raise the per-item limit above 1MB: recent versions of memcached allow setting the slab page size, which increases the per-item size limit. And rather than patching the legacy code itself, this gives us a great reason to port it to Catalyst.

Lessons learned

Tracking down this problem was extraordinarily frustrating because of the difficulties in getting valid data. We tried to extrapolate from a development environment for too long before trying to get data from a production-like environment. If we had been able to do that from the beginning, we would have solved the problem in a day instead of a week.

Another massive frustration was trying to get DBI_PROFILE to work with our choice of technologies. mod_perl alone isn't so bad, but ModPerl::RegistryLoader and Apache::DBI are complications that just make life difficult.

Finally, we didn't have good metrics on performance in the live environment. We didn't think anything was wrong until people complained, and we didn't know what was wrong until we pulled the Apache logs and did ad-hoc analysis on the response times. We need real-time performance monitoring.

Old versions of memcached had to be recompiled to increase the slab page size, but recent versions have a -I flag to control it at server startup. Likewise, recent versions of memcached provide the statistics necessary to diagnose slab imbalance and other critical health indicators. For our case, you can monitor cmd_set - total_items -- anything other than zero means sets are failing. Also, make sure you check whether each memcached set actually worked -- programmer "optimism" is really just programmer error. Finally, stability is not always worth flying blind -- consider upgrading.
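Both checks can be sketched concretely -- in Python for illustration, where the stats numbers are a made-up sample and TinyCache is a hypothetical stand-in for a real client library:

```python
def failed_sets(stats):
    """cmd_set counts set commands received; total_items counts items
    actually stored. A growing gap means sets are failing."""
    return int(stats["cmd_set"]) - int(stats["total_items"])


sample = {"cmd_set": "150000", "total_items": "149200"}
gap = failed_sets(sample)  # 800 sets never became items: alert on this


class TinyCache:
    """Toy client whose set() returns False on oversized items."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        if len(value) >= 1024 * 1024:  # mimic the per-item limit
            return False
        self.data[key] = value
        return True


cache = TinyCache()
stored = cache.set("whole_table", "x" * 2_000_000)
# stored is False -- log it; assuming success is how our bug stayed hidden.
```

The server-side gap and the client-side return value catch the same failure from two directions; either one alone would have caught ours.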

Recap

  1. Be sure there is a problem before trying to solve it. If you can't tell this from your normal monitoring, enhance your monitoring.
  2. Be able to collect tracing/profiling data from your real production system if possible, or a production-like system otherwise. Know how to do this ahead of time, and have the process documented. You don't want to waste time figuring this out when you need data.
  3. Loading way more data than you need is probably a very bad idea, even if you hide the problem with exorbitant amounts of caching.
  4. Understand the limits of your tools. Memcached isn't a hard-disk where you can dump as much data as you want (and even hard disks are of finite capacity). Trying to cache an entire database table as a single item is totally wrong.
  5. Technology progresses for a reason. While stability is important, it probably isn't worth flying blind or doing without important features.
  6. Consider your tradeoffs carefully when dealing with legacy code, and take advantage of convincing reasons to dump it when they come along.
Comments
Pingback from SRE Weekly Issue #22 – SRE WEEKLY - May 8, 2016 at 11:18 pm

[…] Diagnosing performance degradation under adverse circumstances An excellent deep dive into a performance issue (which really equals a reliability issue), including some good lessons learned. […]