AWS Startups Blog
Piling on the Nines with EC2 at Cotap
Guest post by Zack Parker, CTO, Cotap
Cotap is a secure enterprise mobile messaging service built to help people communicate at work. My cofounder Jim Patterson and I have been working on this problem for years, and we’re both really excited by the new opportunities afforded by mobile devices. More and more people have the ability to run complex network-connected software on commodity devices they carry in their pockets. This is transforming the way people work, and we recognized that and wanted to build the best possible software to help make it happen.
Starting from Scratch
We founded Cotap in the spring of 2013. In the beginning, I was doing all the back-end engineering myself. There were some interesting messaging system services on the market, like Firebase, but I knew that we wouldn’t want to outsource the core of our business. We needed control over the behavior of the fundamental features of our app. We were also concerned about what would happen if our provider went out of business. We decided to build our messaging API ourselves. With my background in building past messaging products with Ruby on Rails, it was natural to start there. I built a basic Rails API server and considered my options for hosting.
A managed service like Heroku would be fine for my prototype, but as the scale and complexity of our back end grew, we’d see our costs grow quickly as well. We’d also be constrained in the technology choices we could make. Our API would begin with Rails, but I knew that as we grew we’d be handling certain tasks, like file uploads or proxying requests to third-party APIs, with systems better suited to those tasks than Rails is. We hadn’t even built the team yet, let alone started building these components of our API, and I didn’t want to commit to a path that might constrain our ability to build the best infrastructure we could. The control over hosts that Amazon EC2 offers along with the suite of supporting services made AWS the natural choice for our technology platform.
Proof of Concept
I’d worked on large-scale Rails applications before, but this was my first experience with EC2, and I wanted to get an understanding of the environment. I hand-rolled a Rails host on Ubuntu with Apache, Passenger, and Postgres. It kept the app’s development moving, and I began to put together a checklist of the technical needs we’d have for our production system: monitoring, logging, email deliverability, database backup and failover, an asynchronous job queue, and more. The list kept growing, and soon it was obvious that this would be a full-time job.
Martin Cozzi was one of our first candidates for the role, and his interview was unlike any I’d done before. I explained the app, showed him the checklist I’d been compiling, and he began whiteboarding a design that’s remarkably close to our production infrastructure today. He joined the team almost immediately and began working to implement the architecture he’d designed.
I’ll let Martin take it from here:
Cotap’s Infrastructure Today
One of the first things we had to decide was whether to build on EC2-Classic or Virtual Private Cloud (VPC). At the time, VPC had just been made the default for everybody, and although some companies had already migrated, it is easy when pressed for time to fall back on what you know and go with EC2-Classic.
Before making a decision, we designed a network on paper that consisted mostly of private instances, so that critical data storage and data manipulation would be isolated from public instances. The nature of our business means that we must keep our users’ data private and secure, and letting each host have a public IP just seemed wrong, regardless of how tight the security groups were. So naturally, we went the VPC route.
Designing our VPC did not require as much work as we anticipated, and basic network knowledge coupled with simple logic led us to build a redundant and resilient infrastructure. We decided early on that as a rule, every instance had to be behind an Auto Scaling group, regardless of the number of instances (even if it’s a single instance). That meant that AWS would rotate instances for us when something went wrong with the hardware. We also decided to not use EBS-backed instances, forcing people to design around the fact that they couldn’t snapshot data easily. This led us to a culture where we treat instances “as cattle and not cats.” Every time someone starts remembering an IP address we go ahead and kill that instance. A great benefit of this culture is that we end up with an infrastructure that heals itself without human intervention, making for great nights of sleep.
To achieve this, we also decided that no instance should ever be launched manually, either through the CLI or through the AWS web console. Thanks to AWS CloudFormation, our entire infrastructure is mapped using JSON files and stored under version control using Git. At any given time we can retrieve information about an open port or an instance family and look up the Git commit that introduced it, which gives us context. This means that someone new to the team can go back to the beginning of our infrastructure, read all the Git logs, and understand how we got where we are without having to make assumptions.
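To make the idea concrete, here is a minimal sketch of what one such JSON file might contain: a CloudFormation template that keeps a single instance in an Auto Scaling group. This is an illustration, not one of Cotap's actual templates. The resource names, AMI ID, instance type, and subnet ID are placeholders, and the helper functions are hypothetical. It uses a launch configuration, which fits the era of this post; newer setups would likely use a launch template instead.

```python
import json


def single_instance_group_template(ami_id, instance_type, subnet_ids):
    """Build a CloudFormation template for a one-instance Auto Scaling group."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative single-instance Auto Scaling group",
        "Resources": {
            # Describes how each replacement instance is launched.
            "LaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": ami_id,
                    "InstanceType": instance_type,
                },
            },
            # MinSize == MaxSize == 1: the group never scales, but it
            # replaces the instance if it is terminated or unhealthy.
            "Group": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "LaunchConfigurationName": {"Ref": "LaunchConfig"},
                    "MinSize": "1",
                    "MaxSize": "1",
                    "VPCZoneIdentifier": subnet_ids,
                },
            },
        },
    }


def render(template):
    """Serialize with sorted keys and fixed indentation for stable Git diffs."""
    return json.dumps(template, indent=2, sort_keys=True)


template = single_instance_group_template(
    ami_id="ami-00000000",            # placeholder, not a real AMI
    instance_type="m3.medium",        # placeholder instance type
    subnet_ids=["subnet-00000000"],   # placeholder private subnet
)
template_json = render(template)
```

Writing `template_json` to a file and committing it gives every change to a port, size, or instance family its own Git history entry.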
Smooth Sailing
Because everything is code, engineers feel comfortable taking ownership and modifying the infrastructure. Changing cluster sizes or instance sizes only takes a few minutes thanks to CloudFormation and Stacker, a tool we built to help us manage dependencies between stacks. This also helps us reduce our costs greatly by growing and shrinking the infrastructure very quickly as needed. Last but not least, we can launch our infrastructure in an entirely new region in a matter of hours by simply changing the region endpoint in the configuration files.
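Managing dependencies between stacks largely comes down to launching them in an order where each stack follows the stacks it relies on. The sketch below shows that ordering step with a depth-first topological sort. It is not Stacker's actual implementation; the `launch_order` function and the stack names are hypothetical and chosen only for illustration.

```python
def launch_order(dependencies):
    """Return stack names ordered so each stack follows its dependencies.

    `dependencies` maps a stack name to the list of stacks it needs first.
    Raises ValueError if the dependencies contain a cycle.
    """
    order = []
    state = {}  # stack name -> "visiting" or "done"

    def visit(name, path):
        if state.get(name) == "done":
            return
        if state.get(name) == "visiting":
            raise ValueError("dependency cycle: " + " -> ".join(path + [name]))
        state[name] = "visiting"
        # Sorting makes the result deterministic across runs.
        for dep in sorted(dependencies.get(name, [])):
            visit(dep, path + [name])
        state[name] = "done"
        order.append(name)

    for name in sorted(dependencies):
        visit(name, [])
    return order


# Hypothetical stacks: the network comes first, the API last.
stacks = {
    "vpc": [],
    "database": ["vpc"],
    "cache": ["vpc"],
    "api": ["database", "cache"],
}
print(launch_order(stacks))  # ['vpc', 'cache', 'database', 'api']
```

Reversing the same list gives a safe order for tearing stacks down.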
As AWS rolls out new features, we test whether they can help us reduce the time spent managing the infrastructure. For example, we migrated our Redis 2.6 instances to ElastiCache when it released a 2.8 upgrade. We also took advantage of PostgreSQL on RDS and use it for our analytics database.
Martin will be speaking more about how Cotap’s automated infrastructure helps them to deliver scalable, reliable, secure service on November 4 at the AWS Pop-Up Loft. Be sure to come check it out and learn more!