How to Run a World-class Website with a DevOps team of Two
Guest post by Alexander Döring, Senior Software Engineer at Smallpdf
At last count, Smallpdf, the PDF conversion startup I work for, had roughly 13 million monthly users. As for the number of employees currently running our website? That would be 10, with only two focused on the backend and infrastructure.
You might be curious how we run such a processing-intensive website with such a small DevOps team. Our little secret lies in automation and delegation: we offload as much work as possible to external services. Here’s how we do it.
At Smallpdf, we allow users from all over the world to compress, convert, and edit PDFs online for free. Users don’t need to install any additional software; all they need to do is upload a file. This action automatically sends a task to a queue describing what to do with the document (for example: make the uploaded PDF smaller or turn it into a Word document). A “worker” server then grabs the task and the input file, processes the document, writes the output file, and finally sends a result back into the queue.
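The upload-to-result round trip described above can be sketched with Python’s standard-library `queue` as a stand-in for the real message broker. The field names and the `compress` job are illustrative, not Smallpdf’s actual schema:

```python
import json
import queue

# Stand-ins for the real task and result queues.
tasks = queue.Queue()
results = queue.Queue()

def submit(task_id: str, job: str, input_key: str) -> None:
    """Frontend side: describe what to do with an uploaded document."""
    tasks.put(json.dumps({"id": task_id, "job": job, "input": input_key}))

def work_once() -> None:
    """Worker side: grab one task, process it, send the result back."""
    task = json.loads(tasks.get())
    output_key = task["input"] + ".out"  # pretend we processed the file
    results.put(json.dumps({"id": task["id"], "output": output_key}))

submit("t-1", "compress", "uploads/report.pdf")
work_once()
print(json.loads(results.get()))  # {'id': 't-1', 'output': 'uploads/report.pdf.out'}
```

Because the task message fully describes the work, any idle worker can pick it up — which is what makes the worker fleet easy to grow and shrink.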
Given this setup, one of the keys to Smallpdf’s success is being able to match the number of “worker” servers to the processing load. Using AWS, we’re able to spin up an EC2 instance within minutes, as opposed to having to wait for hours (or even days).
Before deciding to migrate to AWS, we had a few dedicated “file” servers for uploading and downloading files. Maintaining them was complicated and time-consuming, as we had to constantly keep an eye on them: Do they have enough bandwidth? Are they running out of hard disk space? The near-infinite storage of AWS S3 removed these concerns completely and became a key factor in keeping our infrastructure lean and robust. We let AWS experts worry about performance, uptime, and scaling in this specific domain, and use our limited manpower in other areas.
We took a similar approach with our central nervous system, the “queue” servers. Without them, the “workers” get no tasks to process, and users can’t get the results they need. Our “queue” servers were working pretty well, but over time we grew more and more reluctant to make any changes to them for fear of breaking everything. We therefore decided to replace them with SQS for sending tasks and SNS for delivering results.
Getting rid of dependencies and state
We’d been thinking about replacing our queue with SQS for some time. A planned change to the “queue” servers, which would have required moving them to new EC2 instances, put us in quite a dilemma: the migration was more complicated than we thought, as a lot of things were tightly coupled to the “queue” servers. The frontend would send a task to a random queue and expect the result to arrive later at exactly the same server. Replacing them with zero downtime would be very difficult and time-consuming.
In the end, we decided to switch to SQS and SNS instead, and to implement the change in a timeframe comparable to our estimate for the “queue” server replacement. This is how we got rid of our last large dependency between servers. After that, no server would need to know about the existence of another, and all communication would happen through AWS.
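One way to achieve that decoupling is to make each message self-contained, including where the result should be published — so the worker never needs to know which server sent the task. A minimal sketch of that idea (the field names and the `reply_to` topic are assumptions for illustration, not our exact message format):

```python
import json

def make_task(task_id: str, job: str, input_key: str, reply_topic: str) -> str:
    # The message carries everything the worker needs, including where to
    # publish the result -- so no server has to know about any other server.
    return json.dumps({
        "id": task_id,
        "job": job,
        "input": input_key,
        "reply_to": reply_topic,  # would be an SNS topic in the real system
    })

def handle(message: str) -> tuple[str, str]:
    """Worker: process the task, return (topic, payload) to publish."""
    task = json.loads(message)
    result = json.dumps({"id": task["id"], "output": task["input"] + ".out"})
    return task["reply_to"], result

topic, result = handle(make_task("t-1", "compress", "uploads/doc.pdf",
                                 "results-topic"))
```

With a scheme like this, queues and topics become the only shared infrastructure, and AWS operates them for you.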
On top of that, we now use Elastic Load Balancers for all connections from the outside and build our programs into Docker containers with a startup time of less than a second. This setup gives us a lot of resilience and scalability: we can replace servers without any hassle and scale them up automatically (or manually) within minutes. If any process fails, it restarts immediately or a replacement is launched automatically.
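The restart-on-failure behavior the orchestration gives us can be illustrated with a minimal supervisor loop — a sketch of the pattern only, not the actual Docker/ECS machinery:

```python
import time

def supervise(run, max_restarts: int = 3, backoff_s: float = 0.0):
    """Re-launch `run` whenever it fails, the way a container
    orchestrator restarts a crashed worker."""
    restarts = 0
    while True:
        try:
            return run()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise
            time.sleep(backoff_s)  # real systems back off before relaunching

attempts = {"n": 0}

def flaky_worker():
    """Hypothetical worker that crashes twice before succeeding."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated crash")
    return "done"

print(supervise(flaky_worker))  # done
```

Sub-second container startup is what makes this loop cheap: a crashed worker is back in service almost immediately.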
Our traffic and workloads depend heavily on the working hours of users all over the world, and they peak at around 4 pm CET, when Europeans are still in the office and workers in North America are beginning their day. EC2 auto-scaling groups and auto-scaling ECS services let us handle this workload dynamically and try new things. For example, when we were building a new analytics system, Redshift and Kinesis Firehose were the first options we looked at and the ones we went with in the end. Experimentation is a big part of how we continue to optimize our workloads; we’re currently experimenting with Deep Learning and SageMaker, and early indicators for both services look very promising.
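A daily peak like this lends itself to scheduled scaling. The sketch below computes a desired worker count from the hour of day; all the numbers (base fleet of 4, peak of 20, 8-hour ramp) are illustrative assumptions, and in the real system this logic would live in an auto-scaling policy rather than application code:

```python
def desired_workers(hour_cet: int, base: int = 4, peak: int = 20) -> int:
    """Illustrative scheduled scaling: ramp capacity toward the ~16:00 CET
    peak, when Europe and North America overlap, and back down overnight."""
    # Hours from the peak, wrapping around midnight.
    distance = min(abs(hour_cet - 16), 24 - abs(hour_cet - 16))
    scale = max(0.0, 1 - distance / 8)  # linear ramp over 8 hours
    return base + round((peak - base) * scale)

print(desired_workers(16))  # 20 (peak)
print(desired_workers(4))   # 4  (overnight minimum)
```

In practice we let the auto-scaling groups react to load directly, but the shape of the curve is the same: capacity follows the world’s office hours.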