An alert reader asked what, as a software developer, I thought devs should learn from the fiasco.  It’s a good question!  I don’t know what systems are involved and of course have no insider information, but I think some general lessons are obvious anyway.

1. Program defensively. According to early reporting, the front end and back end were developed by different firms which barely talked to each other.  This is a recipe for a clusterfuck, but it could have been mitigated by error handling.

The thing is, programmers are optimists.  We think things should work, and our natural level of testing is “I ran it one time and it worked.”  Adding error checking uglifies the code, and makes it several times more complicated.  (And the error checks themselves have to be tested!)

But it’s all the more necessary in a complex situation where you’re relying on far-flung components.  You must assume they will fail, and take reasonable steps to recover.

An example: I managed– with difficulty– to create an account on in early October.  And for three weeks I was unable to log in.  Either nothing would happen, or I’d get a 404 page.  The front end just didn’t know what to do if the back end refused a login.

Oops, I’m anthropomorphizing again, as devs do.  More accurately: some idiot dev wrote the login code, which presumably involved  some back end code, and assumed it would work.  He failed to provide a useful error message when it didn’t.

Now, it’s a tricky problem to even know what to do in case something fails!  The user doesn’t care about your internal error.  Still, there are things to do:

  • Tell the user in general terms that something went wrong– even if it’s as simple as “The system is too busy right now”.  Be accurate: if the problem is that an external call timed out, don’t say that the login failed.
  • Tell them what to do: nothing?  try again?  call someone?
  • If you’re relying on an external call, can you retry it in another thread, and provide feedback to the user that you’re retrying it?
  • Consider providing that internal error number, so it can be given to customer service and ultimately to the developers.  Devs hate to hear “It doesn’t work”, because there’s nothing they can do with that.  But it’s their own fault if they didn’t provide an error message that pinpoints exactly what is failing.
  • In a high-volume system like, there will be too many errors to look at individually.  So they should be logged, so general patterns emerge.
  • Modern databases have safeguards against bad data, but it usually takes extra work on someone’s part to set them up.  E.g. a failed transaction may have to be rolled back, or tools need to be provided to find bad data.

2. Stress-test.  Programmers hate these too, partly because we assume that once the code is written we’re done, and partly because they’re just a huge hassle to set up.  Plus they generate gnarly errors that are hard to reproduce and thus hard to fix.

But pretty obviously wasn’t ready for large-scale usage on October 1… which means that the devs didn’t plan for that usage.

3. Provide tools to fix errors.  I called a human being later on– apparently human beings are cheap, I didn’t even have to wait long.  I explained the problem, and the rep (a very nice woman) said that they were using the same tools as the public– the same tools that weren’t working– so she couldn’t fix the problem.  D’oh.

Frederick Brooks, 38 years ago in The Mythical Man-Month, answered the common question of why huge enterprise systems take so long and so many people to write when a couple of hackers in a garage can whip up something in a week.  Partly it’s scale– the garage project is much smaller.  But there are two factors that each add (at least) a tripling of complexity:

  • Generalizing, testing, and documenting– what turns a program that runs more or less reliably for the hackers who made it into a product that millions of people can use.
  • A system composed of multiple parts requires a huge effort to coordinate and error-check the interactions.

Part of that additional work is to make tools to facilitate smooth operation and fix errors.  It’s pretty sad if really has no way of fixing all the database cruft that’s been generated by a month of failed operations.  There need to be tools apart from the main program to go in and clean things up.

(I was actually able to log in today!  Amazing!  So they are fixing things, albeit agonizingly slowly.  We’ll see if I can actually do anything once logged in…)

3. I hope this next lesson isn’t needed, but reports that huge teams are being shipped in from Google worry me: Adding people to a late project makes it later. This is Brooks again.  Managers are always tempted to add in the Mongolian hordes to finish a system that’s late and breaking.  It’s almost always a disaster.

The problem is that the hordes need time to understand what they’re looking at, and that steals time from the only prople who presently understand it well enough to fix it.  With a complex system, it would be generous to assume that an outsider can come up to speed within a month, and in that time they’ll be pestering people with stupid questions and generally wasting the devs’ time.

As a qualifier, I’ve participated in successful rescue operations, on a much smaller scale.  Sometimes you do need an outside set of eyes.  On the other hand, those rescues involved finding that the existing code base, or part of it, was an unrecoverable disaster and rewriting it.  And that takes time, not something that the media or a congressional committee is going to understand.