I become rather epic when I have a problem. Some would say I use colorful language; I like to think I just sound very French. And a couple of weeks back, I did sound rather French.
While EC2 (Elastic Compute Cloud) gives us the ability to launch as many machine instances as we could possibly want, I’ve always been a rather cheap and demanding technologist: I have a backup strategy that covers me when needed, but I see no reason to waste energy and money if I can squeeze everything I need from a single stone. And that worked well for a few months. Believe it or not, iSyndica started running on a server so tiny (2 EC2 compute units) that many would not have used it for their own desktop. Yet the system ran beautifully, supporting hundreds of users. The database wasn’t even optimized for I/O performance, and we were using SQL Server Express because it gave us the best “bang for the buck” at the beginning.
That changed almost overnight as our name started spreading: what works well for hundreds of users really doesn’t suffice for thousands. The server started showing some pretty mean seizures. Oh, it worked, mind you, just not well at all, and I found myself babysitting a few too many things. In itself, it was a valuable experience: a lot of performance improvements came out of it, and we still squeezed more out of that same little stone. But then came the time to consider upgrading the hardware, whether that meant rolling out new machines to take some of the load off the main server or simply replacing our main host with a more powerful one. Even though the Virtual Distribution Service was built to be massively parallel, and AWS definitely gives us the ability to spin up new servers quickly, I didn’t want the added complexity of monitoring a server farm, and it didn’t make sense in terms of cost either: for barely more than twice what we were spending on our single server, I could get a machine ten times as powerful and get rid of the problem altogether, for a long time.
So I did.
The beauty of EC2 is the ability to launch a new server from a template, customize it to your needs, and then bundle it into a custom template of your own that you can reuse for redundancy purposes. The startup time of a brand-new Windows instance from the public AMI seemed a bit long to me: a good 20 minutes of sysprep (including a reboot) before I could start using the new server. But from there on, things went so smoothly that I abused my newfound power. The server purred like a happy kitten, and within 3 hours I had my new machine configured with all the settings I needed. Having the ability to create as many EBS volumes as you need was another spoiling factor: I went overboard on database optimization, using multiple spindles to distribute my I/O, even aligning the partitions for optimal performance, and building a system designed to support tens of thousands of users.
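For readers who want the mechanics, the build-out above boils down to a handful of EC2 calls. Here is a minimal sketch that prints the equivalent commands in today’s AWS CLI syntax (the ec2-* tools of that era had different syntax); every ID, size, and name is a hypothetical placeholder, not our real configuration.

```python
# Sketch of the server build-out: bundle a configured instance into a
# reusable private image, then attach extra EBS volumes so database
# I/O can be spread over multiple "spindles". Placeholder values only.

def build_out_commands(instance_id, volume_sizes_gb, az):
    """Return the AWS CLI commands for the image + volumes build-out."""
    cmds = [
        # Snapshot the customized server into a private AMI for reuse.
        f"aws ec2 create-image --instance-id {instance_id} --name server-template",
    ]
    for i, size in enumerate(volume_sizes_gb):
        # One EBS volume per spindle; a real script would capture the
        # VolumeId that create-volume returns and use it below.
        cmds.append(f"aws ec2 create-volume --size {size} --availability-zone {az}")
        # Device names /dev/xvdf, /dev/xvdg, ... for each data volume.
        cmds.append(
            f"aws ec2 attach-volume --volume-id <returned-id> "
            f"--instance-id {instance_id} --device /dev/xvd{chr(ord('f') + i)}"
        )
    return cmds

if __name__ == "__main__":
    for cmd in build_out_commands("i-0123456789abcdef0", [100, 100, 50], "us-east-1a"):
        print(cmd)
```

One volume per spindle is the point of the exercise: separate volumes for data, logs, and tempdb keep their I/O queues independent.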
Meanwhile, the old server kept limping along, doing its job as best it could – and, in all truth, it did a pretty good job considering how hard users pounded on it. Midnight approached on the US East Coast, the time at which I like to perform system upgrades to minimize user impact (even though we truly have a worldwide audience and 60% of our users are in Europe).
The roll-out exceeded all my hopes. Within a two-minute window (yes, yes: 120 seconds!), I had moved the database (by detaching my main EBS volume from the old server and re-attaching it to the new one), re-pointed our Elastic IP address (no DNS change, with the dramatic 30-minute propagation delay that would have come with it), and the site was back online. I could have done this upgrade during a lunch break! The conversion was such a walk in the park that I was in shock for two days. I remember in my youth at VistaPrint – not so long ago, really – rushing to a data center with Chris at night, fiddling with wiring while freezing as much inside as we did outside, where a snowstorm raged, working for hours transferring files and configuring RAID arrays, and still having to go through a two-hour downtime on a good day. It took us years of experience and a lot of money to build an infrastructure that would support a “zero-downtime” release process.
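For the curious, the whole two-minute switch reduces to three EC2 operations. A minimal sketch in today’s AWS CLI syntax, with hypothetical IDs and addresses, and assuming the database service has been stopped before the volume is detached:

```python
# The cutover: move the database volume to the new server, then
# re-point the Elastic IP. All IDs here are placeholder values.

def cutover_commands(volume_id, new_instance, elastic_ip, device="/dev/xvdf"):
    """Return the ordered AWS CLI commands for the server switch."""
    return [
        # 1. Detach the database volume from the old server
        #    (stop the database service first so the volume is clean).
        f"aws ec2 detach-volume --volume-id {volume_id}",
        # 2. Attach the same volume to the new server.
        f"aws ec2 attach-volume --volume-id {volume_id} "
        f"--instance-id {new_instance} --device {device}",
        # 3. Re-point the Elastic IP -- no DNS propagation delay.
        f"aws ec2 associate-address --instance-id {new_instance} "
        f"--public-ip {elastic_ip}",
    ]

if __name__ == "__main__":
    for step in cutover_commands("vol-0abc1234", "i-0new5678", "203.0.113.10"):
        print(step)
```

Step 3 is what makes the window so short: re-associating an Elastic IP takes effect in seconds, whereas a DNS change would leave stragglers hitting the old address until caches expire.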
While I have the experience to do this, more importantly, with the entire AWS platform I have the infrastructure that makes it possible – for just a few bucks a day. I don’t have to worry about RAID array configuration, I can allocate hundreds of gigs of space for my database while paying only pennies until I actually use that much, and changing the network topology comes at the click of a button. iSyndica is a startup not even 6 months old, and it already has a zero-downtime release process, which truly helps support our geometric user growth. I’ve always wanted to bring Chris onboard, because he’s simply the best network engineer I’ve ever worked with and a great friend, but ironically, I have just about no use for his skill set: he would be relegated to watching a monitoring screen and nodding in annoyance at the lack of incidents. I just can’t do that to him!
Our new server, a High-CPU Extra Large EC2 instance (20 EC2 compute units), is boringly “flat”. I have seen thousands of files thrown at it (and swallowed in minutes), and the CPU usage barely spikes. The database, newly optimized (we also use a special SQL Server Enterprise license, courtesy of Microsoft’s BizSpark program), screams, demanding more action. Overall, we’re poised perfectly for our next, rather ambitious, steps, and I sleep like a baby at night, knowing that the day we need to, we can fire up 10 clones of our current primary server and take care of 1 million users – and it will take 30 minutes.
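And that scale-out claim is not much more than a single launch request against the custom AMI. A sketch with placeholder values (c1.xlarge was the API name of the High-CPU Extra Large instance type of that era):

```python
# "Fire up 10 clones" is one run-instances call against the private
# AMI built earlier. AMI ID and instance type are placeholders.

def clone_fleet_command(ami_id, count, instance_type="c1.xlarge"):
    """Return the AWS CLI command that launches `count` clones."""
    return (
        f"aws ec2 run-instances --image-id {ami_id} "
        f"--count {count} --instance-type {instance_type}"
    )

if __name__ == "__main__":
    print(clone_fleet_command("ami-0123abcd", 10))
```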
Look, it’s easy: the only thing that ever felt close to this was sitting on a pile of cash and rolling out 20 Protoss Carriers, 3 at a time, just because I could, to bring a StarCraft game under my dominion.
That… is living spoiled!
Seb Coursol, Co-founder and CTO of iSyndica