AWS Cloud Operations Blog
Secure, Scalable, and Efficient Instance Management Using Amazon EC2 Run Command
This post was written by Miguel João, Cloud Software Engineer at OutSystems.
The OutSystems low-code development platform allows users to create and deliver high-quality web and mobile apps a lot faster, leveraging all the advantages of visual programming with few of the drawbacks. Of course, providing this high productivity, enterprise-grade Platform-as-a-Service (PaaS) solution can be challenging. For us at OutSystems, those challenges ended up inspiring us to build custom solutions to manage large infrastructures.
We were working on a custom offering that would enable clients to build their own tailored apps, which led us to deploy our own orchestration processes:
- Instead of relying only on common configuration management tools, we had to deploy a custom remote command execution environment. This gave us tight control over every step in the deployment and configuration of the infrastructure.
- We needed to provide an enterprise-grade PaaS solution with secure access, data integrity, and high availability.
- The solution had to scale to meet future demand; in the long run, we're talking about 1M+ instances.
- We had to ensure a path for the solution to evolve without disrupting the customer service.
Sounds complicated, right? Well, it was, especially when you consider that we had to apply our custom environment to an existing underlying infrastructure while keeping the security, isolation, and evolution requirements.
The end result was a leaner and more secure solution. The Amazon EC2 Run Command service improved the stability of our orchestration processes (reducing the error ratio by over 80%) and their performance (10–20 times faster).
Problem: Difficult to manage instance proliferation
We designed the underlying infrastructure that supports our PaaS offer with standardization in mind. However, the need to respond to the specific requirements of our enterprise customers led us to develop an orchestration process that takes advantage of configuration management tools like Chef for the initial, base configuration. We then extended that orchestration to support customization through on-demand remote command execution.
The adoption of these orchestration processes in our cloud services has grown, and they now support all of our paid and free PaaS offers, as well as some of our internal R&D quality assurance needs.
We currently provision EC2 instances and Amazon RDS instances to meet our needs in six different AWS Regions, in an automated fashion, 24/7. Our current infrastructure landscape consists of more than a thousand EC2 instances, and hundreds of RDS instances with many different software versions and software configurations. We have Windows and Linux operating systems, and at least two database flavors: Microsoft SQL Server and Oracle. This graphic shows the management nightmare for which we signed up.
Temporary solution (not scalable): SSH and ESB
Early in the PaaS project, we decided that the infrastructure servers would be controlled through direct remote commands, and we designed the orchestration processes accordingly.
The result was a remote command architecture built on an Enterprise Service Bus (ESB) and secure shell (SSH) connections. The ESB served as a central point of convergence for all remote connections managing the cloud servers, allowing remote commands to be executed over synchronous SSH sessions. The orchestration processes invoked remote command execution via the ESB and expected a callback from the ESB when the command finished executing.
However, this solution rapidly began to show its limitations:
- Performance overhead due to connection handshake and authentication
- Increased error rate due to instability in the network and long-standing connections
- Limited parallelism due to the number of concurrent long-standing remote connections in the ESB (we started to see instability after 40 concurrent remote executions per ESB node)
- Security concerns about allowing SSH inbound traffic on the instances for orchestration purposes, along with the hassle of managing SSH authentication best practices (keys vs. passwords)
With the growth in demand for our cloud services, we had to consider other options immediately. Before we committed to searching for or developing better alternatives that would scale for 1M+ instances, AWS answered our prayers with the EC2 Run Command feature.
Better solution: Scalable, secure remote commands
After a short assessment of Run Command, we realized that it would allow us to greatly improve our efforts and our custom orchestration systems, increasing both reliability and performance. We started changing our orchestration to use this new feature, and replaced the remote command execution engine with Run Command.
Sending the remote command through Run Command removed the SSH connectivity requirement, eliminating our security concerns. The feature also sidesteps network instability, because long-standing connections during command execution are no longer necessary. It's all asynchronous.
The most significant changes in our orchestration were:
- Roll out (re-create) EC2 instances with IAM roles
- Deploy updated EC2Config and SSM Agent services on the EC2 instances
- Implement the post-execution callback, based on the S3 output files, using an AWS Lambda function that runs whenever a new output file is created
- Change our orchestration command execution engine module to invoke Run Command (using SSM API)
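The last step above, invoking Run Command through the SSM API, can be sketched as follows. This is a minimal illustration using boto3-style `SendCommand` parameters; the instance ID, script, bucket name, and key prefix are all hypothetical, not OutSystems' actual values.

```python
# Sketch of an SSM SendCommand invocation for a Linux instance.
# All concrete names below are illustrative assumptions.

def build_send_command_request(instance_ids, script_lines,
                               output_bucket, output_prefix):
    """Build the parameter set for an SSM SendCommand call that runs a
    shell script and uploads its output to S3, where the callback Lambda
    picks it up later."""
    return {
        "InstanceIds": instance_ids,
        "DocumentName": "AWS-RunShellScript",  # built-in SSM document for Linux
        "Parameters": {"commands": script_lines},
        # Run Command writes stdout/stderr here when the command finishes,
        # which is what makes the whole flow asynchronous.
        "OutputS3BucketName": output_bucket,
        "OutputS3KeyPrefix": output_prefix,
    }

request = build_send_command_request(
    ["i-0123456789abcdef0"],            # hypothetical instance ID
    ["systemctl restart my-app"],       # hypothetical remote command
    "orchestration-command-output",     # hypothetical output bucket
    "run-command-logs/",
)
# In a real implementation, this dict would be passed to
# boto3.client("ssm").send_command(**request); the returned CommandId
# identifies the execution until the S3 callback fires.
```

On Windows instances, the same pattern applies with the `AWS-RunPowerShellScript` document instead.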
We were able to evaluate, design, develop, test, and deploy the changes to our orchestration processes in approximately two months. This opened the door to a new growth spurt in our cloud services, mainly because the parallelism limitations were gone.
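The execution-end callback described above, an AWS Lambda function triggered whenever Run Command writes a new output object to S3, might look something like this sketch. The key layout (`<prefix>/<command-id>/<instance-id>/...`) and the `notify_orchestrator` target are illustrative assumptions, not OutSystems' actual code.

```python
# Sketch of a Lambda handler for S3 ObjectCreated events on the
# Run Command output bucket. Key layout is an assumption.

def parse_output_key(key):
    """Split an S3 output key of the assumed form
    <prefix>/<command-id>/<instance-id>/... into its two IDs."""
    parts = key.split("/")
    return parts[1], parts[2]  # (command_id, instance_id)

def notify_orchestrator(command_id, instance_id):
    # Hypothetical callback into the orchestration engine; in practice
    # this could be an HTTPS call or a queue message.
    print(f"command {command_id} finished on {instance_id}")

def lambda_handler(event, context):
    """Handle a batch of S3 ObjectCreated records from the output bucket
    and notify the orchestrator that each execution has ended."""
    handled = []
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        command_id, instance_id = parse_output_key(key)
        notify_orchestrator(command_id, instance_id)
        handled.append((command_id, instance_id))
    return handled
```

Because the notification arrives only after the output file exists, the orchestrator never has to hold a connection open while the command runs.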
Replacing the ESB services: Additional details
We had architected our system to be modular, so replacing the ESB services with the Run Command services seemed like it would be straightforward. But as with any new development, we had to deal with some unexpected challenges. It was no straightforward replacement; after all, where would the fun be in that?
- To use Run Command, each EC2 instance had to have an associated instance role. Unfortunately, at the time, it was not possible to associate an IAM role with a running instance, so we re-created about 80% of our EC2 instances to activate Run Command. Thankfully, it is now possible to attach an IAM role to an existing EC2 instance without re-creating it.
- Run Command requires agents installed on the EC2 instances, but the configuration and execution outputs of these agents differ between Windows and Linux. At the time, Run Command on Windows required the EC2Config service, while Linux required the SSM Agent service. These two applications need different configurations, and the output log file prefixes in the S3 output bucket also differed, so we had to make allowances for these differences as part of the process. Nowadays, the SSM Agent service is available for both Windows and Linux, which simplifies configuration and eases setup across operating systems.
- Managing S3 bucket access across the few hundred AWS accounts that each require specific permissions is practically impossible, so we had to get creative with the S3 bucket management for the Run Command outputs.
With a single bucket, each output log would be owned by the account hosting the EC2 instance where the command ran, and no other user could access it. So we created a secondary S3 bucket and moved the output logs out of the original bucket, fixing the object permissions along the way. This let us keep the output logs secured in a bucket with restricted access, which the instances themselves could no longer reach. In the meantime, newer versions of the SSM Agent support changing the ownership of the output logs in the bucket.
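The log-relocation step can be sketched with boto3-style `copy_object` parameters. The bucket names are illustrative assumptions; the key point is the canned ACL that hands ownership of the copied object to the destination bucket's owner.

```python
# Sketch of relocating one Run Command output log into a restricted
# secondary bucket while fixing its cross-account ownership.
# Bucket names are hypothetical.

def build_copy_request(source_bucket, dest_bucket, key):
    """Build parameters for copying an output log into the restricted
    secondary bucket with corrected object ownership."""
    return {
        "Bucket": dest_bucket,
        "Key": key,
        "CopySource": {"Bucket": source_bucket, "Key": key},
        # Grants the destination bucket's owner full control of the copy,
        # working around the cross-account object ownership problem.
        "ACL": "bucket-owner-full-control",
    }

copy_request = build_copy_request(
    "run-command-raw-output",       # hypothetical original output bucket
    "run-command-secured-logs",     # hypothetical restricted bucket
    "run-command-logs/cmd-1/i-0abc/stdout",
)
# In a real implementation, this dict would be passed to
# boto3.client("s3").copy_object(**copy_request), followed by deleting
# the original object from the source bucket.
```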
After it was all up and running, magic started to happen. Before, with the ESB solution, our average error rate was usually up to 5%, and it increased to over 20% during the second half of 2016. This growing error rate was a reflection of the ESB solution failing to keep up with growing demand. When we started using Run Command, the average error ratio dropped below 1%, regardless of the growth in demand. It seriously made our lives better, as you can see in the stability comparison:
Additionally, the Run Command solution improved the average remote command execution time by an order of magnitude, running 10 to 20 times faster. The solution allowed us to remove the bottlenecks in the ESB and the SSH connections, as well as improve stability by reducing the error rate. Here's the performance comparison so you don't have to take our word for it:
The new Run Command feature responded as advertised. The end result is a faster, leaner, and more robust remote command execution engine that complies with our on-demand custom configuration orchestration requirements.
About the Author
Miguel João joined OutSystems R&D in 2005, and became a Cloud Software Engineer in 2013. Since then, he’s been working on the Platform-as-a-Service offer of the industry-leading, low-code platform for mobile and web application development. Miguel is a technology enthusiast, and he is passionate about automation.
AWS is not responsible for the content or accuracy of this post. The content and opinions in this blog are solely those of the third party author.