Dear DevOps Abby

Certs, structuring clusters, and autoscaling ECS.

Welcome to this week’s edition of “Ask AWS Abby”! This week, I’m tackling three more reader questions. Got a question you want to see answered? Let me know on Twitter with #askawsabby.

Happy building!

AWS Abby


“Dear AWS Abby,

What’s the best way to structure my ECS clusters?”

This comes up a lot, from people in all kinds of situations: weekend projects, startups, and folks running huge production clusters. Here is what I would do: run separate clusters for each environment, i.e., a development cluster, a staging cluster, and a production cluster. I find this works better because a CI/CD job ties into it naturally: a build can push its image to ECR, tag it with the build number, and deploy it to the matching cluster. It also means you don’t pollute your production environment; staging changes stay on staging until they’re tested and ready for production. I’d take this over alternatives like one cluster per service.
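If it helps to see what that pipeline step might look like, here’s a minimal boto3 sketch (the cluster, service, and image names are placeholders, not anything from the question): it takes the task definition a service is currently running, swaps in the freshly built image tag, registers a new revision, and points the service at it.

```python
import boto3

ecs = boto3.client("ecs")

def deploy(cluster, service, image):
    """Register a new task definition revision using `image` and roll the service to it."""
    # Look up the task definition the service is currently running.
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    current = ecs.describe_task_definition(
        taskDefinition=svc["taskDefinition"]
    )["taskDefinition"]

    # Swap in the newly built image (e.g. tagged with the CI build number).
    containers = current["containerDefinitions"]
    containers[0]["image"] = image

    # Register a new revision. A real pipeline would copy over any other
    # task definition fields it relies on (network mode, roles, volumes, ...).
    params = {"family": current["family"], "containerDefinitions": containers}
    if "taskRoleArn" in current:
        params["taskRoleArn"] = current["taskRoleArn"]
    new_td = ecs.register_task_definition(**params)["taskDefinition"]["taskDefinitionArn"]

    # Point the service at the new revision; ECS handles the rolling deployment.
    ecs.update_service(cluster=cluster, service=service, taskDefinition=new_td)

# Hypothetical usage from a CI job: deploy build 123 to the staging cluster.
deploy("staging", "my-app", "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:123")
```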

 

“do you have a best practice/slide deck/blog on handling/injecting https certs into containers?
Thanks!”

I’m assuming for the purposes of this post that terminating SSL connections at the ELB isn’t sufficient. In that case, I’d recommend this blog post by AWS Senior Cloud Infrastructure Architect Anabell St Vincent, which covers using a Network Load Balancer for just this use case.
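The core of that approach is that a Network Load Balancer listener forwards raw TCP straight through to your tasks, so the containers hold the certificates and terminate TLS themselves. Here’s a rough boto3 sketch of that wiring (the subnet, VPC, and resource names are placeholders; getting the certificates into the container is covered in Anabell’s post):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Network Load Balancer: layer 4, so encrypted traffic passes through untouched.
nlb = elbv2.create_load_balancer(
    Name="certs-in-container-nlb",          # placeholder name
    Type="network",
    Scheme="internet-facing",
    Subnets=["subnet-0123456789abcdef0"],   # placeholder subnet
)["LoadBalancers"][0]

# Target group for the ECS tasks (TargetType="ip" suits awsvpc-mode tasks).
tg = elbv2.create_target_group(
    Name="tls-in-task",
    Protocol="TCP",
    Port=443,
    VpcId="vpc-0123456789abcdef0",          # placeholder VPC
    TargetType="ip",
)["TargetGroups"][0]

# TCP listener on 443: no certificate attached here -- the container terminates TLS.
elbv2.create_listener(
    LoadBalancerArn=nlb["LoadBalancerArn"],
    Protocol="TCP",
    Port=443,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
)
```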

 

“What’s the best way to manage autoscaling ECS clusters (other than “don’t do that, use Fargate”) – would love a Cloudwatch event to fire when ECS services can’t be launched due to lack of capacity”

@willthames

So the way I’ve done it in the past is to use CloudWatch alarms to trigger the autoscaling actions themselves, at both the service level and the cluster level. Beyond that, you need custom logic to a) check for the event that says your service can’t be launched because of insufficient resources, and b) either scale up the cluster or rebalance your running tasks to free up the needed capacity. There isn’t currently an out-of-the-box CloudWatch alarm for this, but I’ll pass the suggestion along! Right now, you could either publish a custom metric or use a Lambda function, which I think would be your best bet. You can see an example reference architecture for doing this with Lambda here.
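To make the Lambda option a bit more concrete, here’s a rough sketch of that custom logic, assuming your container instances live in an Auto Scaling group (the cluster and ASG names are placeholders, and matching on the service event message is just one way to detect a placement failure):

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "production"            # placeholder cluster name
ASG_NAME = "production-ecs-asg"   # placeholder Auto Scaling group name

def handler(event, context):
    """Scheduled Lambda: scale the cluster's ASG up if a service reports a placement failure."""
    # a) Look through recent service events for "unable to place a task" messages.
    service_arns = ecs.list_services(cluster=CLUSTER)["serviceArns"]
    if not service_arns:
        return {"scaled": False}
    services = ecs.describe_services(cluster=CLUSTER, services=service_arns)["services"]
    starved = [
        s["serviceName"]
        for s in services
        if any("unable to place a task" in e["message"] for e in s["events"][:5])
    ]
    if not starved:
        return {"scaled": False}

    # b) Add a container instance by bumping the Auto Scaling group's desired capacity.
    asg = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    desired = asg["DesiredCapacity"] + 1
    if desired <= asg["MaxSize"]:
        autoscaling.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired)
    return {"scaled": True, "services": starved}
```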

For what it’s worth, one way to avoid some occurrences of the capacity issue is to use the task placement strategy called “binpack”: it places as many tasks as possible on one instance before placing any on the next. In a production environment, you’d want to use this in conjunction with spreading across Availability Zones. You can also write custom policies. Read more about these in the documentation here.
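As a small illustration, here’s roughly what combining the two strategies looks like in a boto3 service definition (the cluster, service, and task definition names are placeholders): spread distributes tasks across Availability Zones first, and binpack then packs them onto as few instances as possible within each zone.

```python
import boto3

ecs = boto3.client("ecs")

# Spread across Availability Zones first, then binpack on memory within each zone,
# so existing instances fill up before new capacity is needed.
ecs.create_service(
    cluster="production",        # placeholder cluster
    serviceName="my-app",        # placeholder service
    taskDefinition="my-app:42",  # placeholder task definition revision
    desiredCount=6,
    placementStrategy=[
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
        {"type": "binpack", "field": "memory"},
    ],
)
```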