Why shit breaks.

So yesterday was my one-year anniversary at Formspring.

We built a lot of stuff in the past year and it’s been a hell of a ride.
One of the first features I built was the ability to ask a question to all of your followers. We wanted to get the site ready to be used by high-profile users. You know - the ones that get a lot of followers.
To understand this, picture your Facebook account: you have maybe 400 or 500 friends on there, and the number of friends you can have maxes out at 5,000.
Some of our accounts have over 1 million followers. That’s 200 times what you can get on Facebook. Yep, that’s right.
So now picture that new feature. You can ask a question to 1 million people. Have you ever sent an email to 1M people?
Let’s say you have a million friends (lucky you). Where would you store this information? Have you ever opened a spreadsheet with a million rows? It kind of makes your laptop look like it went from the latest build of Mac OS X to Windows 95.
Now imagine you have 27 million spreadsheets (one per user, of course), and all of them are constantly adding friends and removing them. That’s a lot of editing!
So obviously, all this data is stored on several machines, because
  • a) it doesn’t fit on one machine
  • b) you don’t want to lose that data so you have to replicate it on other machines.
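A toy sketch of those two points in Python. The machine names, the `machines_for` helper and the hashing scheme are all made up for illustration; this is not how Formspring actually places data, just one simple way to spread users across machines and keep more than one copy.

```python
import hashlib

def machines_for(user_id, machines, copies=2):
    """Pick `copies` distinct machines to hold one user's data (hypothetical scheme)."""
    if copies > len(machines):
        raise ValueError("not enough machines for that many copies")
    # Hash the user id so users spread roughly evenly across machines.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(machines)
    # The first machine is the main copy; the next ones hold the replicas.
    return [machines[(start + i) % len(machines)] for i in range(copies)]

machines = ["db1", "db2", "db3", "db4"]  # assumed names
print(machines_for("some_user", machines))
```

If one of the chosen machines dies, the other copy is still there, which is the whole point of b).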
Still following me?
So we basically know that it takes a lot of machines to store that data accurately, but now you need to actually process the message that you want to send to all the people who follow you. And again, how does this work?
Well, it’s being stored in another spreadsheet! That’s right. It’s a lot of spreadsheets. With a lot of information.
Because every time one of these accounts asks a question to their 1 million followers, we need to go through the spreadsheet with the information about their friends, and for each friend, open their spreadsheet and add a new question to it :)
Wait - what - did you just add a line to 1M different spreadsheets?
Yep, that’s right. And so on and so forth, every time people ask questions on the site.
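If you like code better than spreadsheets, here is a minimal sketch of that loop in Python. Plain dictionaries stand in for the "spreadsheets", and the function and variable names are invented for the example; the real system is obviously not a couple of dicts in memory.

```python
def ask_all_followers(asker, question, followers, inboxes):
    """Add the question to every follower's inbox and return the number of writes."""
    writes = 0
    # One write per follower: 1M followers means 1M inboxes touched.
    for follower in followers.get(asker, ()):
        inboxes.setdefault(follower, []).append((asker, question))
        writes += 1
    return writes

followers = {"celebrity": {"ann", "bob", "cho"}}  # who follows whom (toy data)
inboxes = {}                                      # one "spreadsheet" per user
count = ask_all_followers("celebrity", "What's for lunch?", followers, inboxes)
print(count)  # 3
```

The cost of a single question grows with the number of followers, which is why the big accounts are the ones that hurt.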
To give you an idea, the number of spreadsheets edited per minute is on the order of tens of thousands.
In order to accomplish this, we have a lot of machines to host that information, process it, send emails etc. etc.
These machines must be running 24/7. You can never turn them off because, well - if you do, they stop doing their work and people do not get their messages in their inbox :)
So again - for you to understand this, we’re not talking about a couple of machines processing that information, we’re talking about hundreds of them.
Now ask yourself this question - in the past year or so, has your laptop/computer ever acted up? You know, like - slowing down, freezing, losing its connection to the internet. That kind of stuff.
To make this easy, let’s say we have 365 servers running at Formspring. On average, a computer that runs 24/7 acts up once every two years.
The site has been up for over 2 years now, which means that, on average, every single one of these servers has failed at some point.
That’s a server acting up every 2 days.
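The arithmetic behind that number, as a tiny Python sketch. The 365 servers and the "once every two years" rate are the made-up round figures from above, and the function name is just for illustration.

```python
def days_between_failures(servers, days_per_failure_per_server):
    """Average number of days between failures across the whole fleet."""
    # Each server fails once per `days_per_failure_per_server` days on average,
    # so the fleet as a whole sees this many failures per day:
    fleet_failures_per_day = servers / days_per_failure_per_server
    return 1 / fleet_failures_per_day

servers = 365
two_years_in_days = 2 * 365
print(days_between_failures(servers, two_years_in_days))  # 2.0
```

Double the number of servers and the gap between failures is cut in half.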
We currently have 7 engineers on rotation to take care of the site 24/7. No matter what time of the day or night one of these servers goes down, the designated engineer has to fix it with as little impact as possible for the users.
I can tell you that when you’re responsible for the whole infrastructure of a website with 27 Million users, and something breaks, you suddenly feel very tiny, and quite lonely.
I realize this is very, very abstract (no, we don’t actually use spreadsheets to store our data :) ), but I hope it gives people a good idea of what’s going on behind the scenes.
At that scale, the smallest changes can make big differences. But keep in mind that the more you grow, the more likely you are to go down.

