AWS Partner Network (APN) Blog
Deploy High Availability Architectures with the Help of APN Consulting Partners
Guest Post: Kamal Arora, AWS GSI Partner Solutions Architect (SA)
When I work with Amazon Web Services (AWS) customers and AWS Partner Network (APN) Partners, I often hear that they need to host a highly available workload that adheres to a strict X-9s availability/uptime target. Some companies aim for better than 3-9s (99.9%) availability, and many of our customers have achieved high levels of availability, and increased their uptime, on AWS. Take a look at some AWS case studies below:
- myThings – 99.999% availability
- Y-cam Solutions – 99.997% availability
- Deputy – 99.99% availability
- EROAD – 99.99% availability
- Marine Desk – 99.99% availability
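To put these figures in perspective, an availability percentage can be converted into the downtime it permits per year. The snippet below is a minimal sketch of that arithmetic. The function name is hypothetical, and it assumes a 365-day year.

```python
# Convert an availability percentage into the downtime it allows per year.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Availability levels taken from the case studies listed above, plus 3-9s.
    for pct in (99.9, 99.99, 99.997, 99.999):
        print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes/year")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of downtime per year, 99.99% about 53 minutes, and 99.999% just over 5 minutes.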
What Factors Impact Service Availability?
Availability or uptime numbers are frequently misunderstood. They usually refer to the availability of an application or service, which depends on many factors beyond the underlying infrastructure components. Service downtime can have many causes, including software upgrades, OS upgrades, file/data corruption, defective application code, user error, and so on. The underlying infrastructure is one component to consider, but it is not the only factor governing service availability numbers.
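One way to see why the service-level number differs from the infrastructure number is to multiply the availabilities of everything the service depends on. The sketch below does that under a simplifying assumption: the factors fail independently, and the service is down whenever any one of them is down. The function name and the sample figures are hypothetical, chosen only for illustration.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Combined availability of factors that must ALL be up at the same time.

    Assumes independent failures; each value is a fraction between 0 and 1.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical figures, not measurements of any real service.
    factors = {
        "infrastructure": 0.9999,
        "application code": 0.999,
        "upgrades and operations": 0.9995,
    }
    combined = serial_availability(factors.values())
    print(f"Combined service availability: {combined:.4%}")
```

With these made-up figures the combined result is roughly 99.84%, which is lower than any individual factor, even though the infrastructure on its own reaches 4-9s.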
AWS Design Recommendations
Below are some AWS-specific design recommendations that you can implement in your AWS architecture:
- Build redundancy at each layer and avoid single points of failure, including:
  - Instance/Function level – Run multiple instances for each function, including components such as NAT, web, app, and DB servers
  - Storage – Take regular backups to durable storage like Amazon Simple Storage Service (Amazon S3)
  - Networking – Have backup AWS Direct Connect (DX)/VPN connections
  - Availability Zone (AZ) level – Use multiple AZs within a region (some of our customers also use multi-region architectures for very high availability and to serve regional user bases)
- Externalize data/state to a common store and keep a replica:
  - Never store session state in the web/app tiers; externalize it to a cache (like Amazon ElastiCache) or a persistent database layer (like Amazon DynamoDB)
  - For Amazon Relational Database Service (Amazon RDS), or a database on Amazon Elastic Compute Cloud (Amazon EC2), use multi-AZ deployments and consider enabling same-region or cross-region Read Replicas. You can also use tools like GoldenGate, Attunity, etc. for same-region or cross-region data synchronization and backups
  - Use features like Amazon DynamoDB Streams for data backups
- Use health checks, monitoring and auto-recovery features:
  - Use Amazon Route 53, Elastic Load Balancing (ELB) and Amazon EC2 instance-level health/status checks. You can also tie these to Auto Scaling, so that a replacement instance is launched when a particular instance goes down. Amazon EC2 now also has an auto recovery feature, which helps in case of underlying host-level failures
  - Enable continuous, detailed Amazon CloudWatch metrics. Together with custom monitoring and alerting, they can help tremendously in detecting and acting on failures in near real time. You can also use the Amazon CloudWatch Logs feature to monitor application logs in real time
- Optimize the application architecture around a micro-services/SOA pattern – Decouple components using services like Amazon Simple Queue Service (Amazon SQS), which makes the architecture resilient to individual service-level failures
- Have graceful failure modes – Serve a static website directly from Amazon S3/Amazon CloudFront as a failover mechanism in case of any issues with the web/application servers that serve dynamic content
- Automate every possible action, from provisioning through updates to tear-down, and from a single instance to a complete stack. Use services/tools like AWS CloudFormation, Chef, Ansible, etc. to help with that process
- Test all possible failure points – It’s very important to test instance, AZ, or even region-level failures and see whether your architecture can withstand them. To ease the process, you can use the Simian Army from Netflix (tools like Chaos Monkey, Chaos Gorilla, Chaos Kong, etc.)
- To validate your operational readiness and to continue adding checks over time, you can also refer to the comprehensive AWS Operational Checklist.
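Several of the recommendations above (redundant instances across AZs, health checks, and a graceful failure mode) combine into one routing decision: send traffic to a healthy server, and fall back to the static site when none is healthy. The sketch below simulates that decision in plain Python. It makes no AWS API calls; in a real deployment, services such as Amazon Route 53 and ELB perform these checks. All names (`choose_target`, `web-az-a`, `static-site`) are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical label for the static copy of the site (e.g. served from S3/CloudFront).
STATIC_FALLBACK = "static-site"


def choose_target(
    backends: Sequence[str],
    is_healthy: Callable[[str], bool],
    fallback: str = STATIC_FALLBACK,
) -> str:
    """Return the first backend that passes its health check, else the fallback."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return fallback


if __name__ == "__main__":
    # Simulated health state for two redundant web servers in different AZs.
    health = {"web-az-a": False, "web-az-b": True}
    backends = ["web-az-a", "web-az-b"]

    def check(name: str) -> bool:
        # Treat unknown backends as unhealthy.
        return health.get(name, False)

    print(choose_target(backends, check))  # the server in AZ b is still healthy

    health["web-az-b"] = False             # simulate losing the second AZ too
    print(choose_target(backends, check))  # falls back to the static site
```

Flipping entries in the `health` dictionary is also a very small-scale version of the failure testing recommended above: each simulated failure should still produce a usable target.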
How Can APN Consulting Partners Help?
As outlined above, we have many recommendations for designing and deploying high availability architectures on AWS. There are many additional considerations when you have to roll up an application/service availability number, which is where our APN Consulting Partners can help with their managed services, migrations, monitoring, and operations-related expertise and offerings. Here are a few details on the offerings from a few of our Premier APN Consulting Partners:
- Accenture’s AWS Migration Framework: Helps you with complex tasks, including migration workload assessment, re-platforming, and optimized steady-state deployments.
- Cognizant’s Cloud360 Platform: Helps you see at a glance the status of any running applications, the number of virtual machines in use, the number of instances deployed, how much of each resource is being consumed and much more.
- Infosys’s Cloud Ecosystem Hub: Helps with the process of deploying and managing enterprise business-critical workloads on AWS.
Another aspect to consider is that you are unlikely to achieve X-9s of availability on day one of your deployment; rather, it is a gradual process as you optimize your deployment and gain more expertise with your operations. Increasing your application availability is an ever-improving, iterative cycle, and our Premier APN Consulting Partners are well equipped for this exercise, given their experience on AWS as well as their solutions and toolsets for architecting and managing your infrastructure and applications.
To conclude, to optimize your architecture and achieve your desired service availability levels, we recommend that you (a) ‘Design for Failure’ at every level of your deployment, (b) assess, automate, and optimize your operations, and (c) iterate continuously. Finally, don’t hesitate to utilize the expertise of our APN Premier Consulting Partners in this process!