AWS DevOps & Developer Productivity Blog
IT Governance in a Dynamic DevOps Environment
IT Governance in a Dynamic DevOps Environment
Governance involves the alignment of security and operations with productivity to ensure a company achieves its business goals. Customers who are migrating to the cloud might be in various stages of implementing governance. Each stage poses its own challenges. In this blog post, the first in a series, I will discuss a four-step approach to automating governance with AWS services.
Governance and the DevOps Environment
Developers with a DevOps and agile mindset are responsible for building and operating services. They often rely on a central security team to develop and apply policies, seek security reviews and approvals, or implement best practices.
These policies and rules are not strictly enforced by the security team. They are treated as guidelines that developers can follow to get the much-desired flexibility from using AWS. However, due to time constraints or lack of awareness, developers may not always follow best practices and standards. If these best practices and rules were strictly enforced, the security team could become a bottleneck.
For customers migrating to AWS, the automated governance mechanisms described in this post will preserve flexibility for developers while providing controls for the security team.
These are some common challenges in a dynamic development environment:
· Quick or short path to accomplishing tasks like hardcoding credentials in code.
· Cost management (for example, controlling the type of instance launched).
· Knowledge transfer.
· Manual processes.
Steps to Governance
Here is a four-step approach to automating governance:
At initial setup, you want to implement some (1) controls for high-risk actions. After they are in place, you need to (2) monitor your environment to make sure you have configured resources correctly. Monitoring will help you discover issues you want to (3) fix as soon as possible. You’ll also want to regularly produce an (4) audit report that shows everything is compliant.
The example in this post helps illustrate the four-step approach: A central IT team allows its Big Data team to run a test environment of Amazon EMR clusters. The team runs the EMR job with 100 t2.medium instances, but when a team member spins up 100 r3.8xlarge instances to complete the job more quickly, the business incurs an unexpected expense.
The central IT team cares about governance and implements a few measures to prevent this from happening again:
· Control elements: The team uses CloudFormation to restrict the number and type of instances and AWS Identity and Access Management to allow only a certain group to modify the EMR cluster.
· Monitor elements: The team uses tagging, AWS Config, and AWS Trusted Advisor to monitor the instance limit and determine if anyone exceeded the number of allowed instances.
· Fix: The team creates a custom Config rule to terminate instances that are not of the type specified.
· Audit: The team reviews the lifecycle of the EMR instance in AWS Config.
Control
You can prevent mistakes by standardizing configurations (through AWS CloudFormation), restricting configuration options (through AWS Service Catalog), and controlling permissions (through IAM).
AWS CloudFormation helps you control the workflow environment in a single package. In this example, we use a CloudFormation template to restrict the number and type of instances and tagging to control the environment.
For example, the team can prevent the choice of r3.8xlarge instances by using CloudFormation with a fixed instance type and a fixed number of instances (100).
Cloudformation Template Sample
EMR cluster with tag:
{
“Type” : “AWS::EMR::Cluster”,
“Properties” : {
“AdditionalInfo” : JSON object,
“Applications” : [ Applications, … ],
“BootstrapActions” [ Bootstrap Actions, … ],
“Configurations” : [ Configurations, … ],
“Instances” : JobFlowInstancesConfig,
“JobFlowRole” : String,
“LogUri” : String,
“Name” : String,
“ReleaseLabel” : String,
“ServiceRole” : String,
“Tags” : [ Resource Tag, … ],
“VisibleToAllUsers” : Boolean
}
}
EMR cluster JobFlowInstancesConfig InstanceGroupConfig with fixed instance type and number:
{
“BidPrice” : String,
“Configurations” : [ Configuration, … ],
“EbsConfiguration” : EBSConfiguration,
“InstanceCount” : Integer,
“InstanceType” : String,
“Market” : String,
“Name” : String
}
AWS Service Catalog can be used to distribute approved products (servers, databases, websites) in AWS. This gives IT administrators more flexibility in terms of which user can access which products. It also gives them the ability to enforce compliance based on business standards.
AWS IAM is used to control which users can access which AWS services and resources. By using IAM role, you can avoid the use of root credentials in your code to access AWS resources.
In this example, we give the team lead full EMR access, including console and API access (not covered here), and give developers read-only access with no console access. If a developer wants to run the job, the developer just needs PEM files.
IAM Policy
This policy is for the team lead with full EMR access:
{
“Version”: “2012-10-17”,
“Statement”: [
{
“Effect”: “Allow”,
“Action”: [
“cloudwatch:*”,
“cloudformation:CreateStack”,
“cloudformation:DescribeStackEvents”,
“ec2:AuthorizeSecurityGroupIngress”,
“ec2:AuthorizeSecurityGroupEgress”,
“ec2:CancelSpotInstanceRequests”,
“ec2:CreateRoute”,
“ec2:CreateSecurityGroup”,
“ec2:CreateTags”,
“ec2:DeleteRoute”,
“ec2:DeleteTags”,
“ec2:DeleteSecurityGroup”,
“ec2:DescribeAvailabilityZones”,
“ec2:DescribeAccountAttributes”,
“ec2:DescribeInstances”,
“ec2:DescribeKeyPairs”,
“ec2:DescribeRouteTables”,
“ec2:DescribeSecurityGroups”,
“ec2:DescribeSpotInstanceRequests”,
“ec2:DescribeSpotPriceHistory”,
“ec2:DescribeSubnets”,
“ec2:DescribeVpcAttribute”,
“ec2:DescribeVpcs”,
“ec2:DescribeRouteTables”,
“ec2:DescribeNetworkAcls”,
“ec2:CreateVpcEndpoint”,
“ec2:ModifyImageAttribute”,
“ec2:ModifyInstanceAttribute”,
“ec2:RequestSpotInstances”,
“ec2:RevokeSecurityGroupEgress”,
“ec2:RunInstances”,
“ec2:TerminateInstances”,
“elasticmapreduce:*”,
“iam:GetPolicy”,
“iam:GetPolicyVersion”,
“iam:ListRoles”,
“iam:PassRole”,
“kms:List*”,
“s3:*”,
“sdb:*”,
“support:CreateCase”,
“support:DescribeServices”,
“support:DescribeSeverityLevels”
],
“Resource”: “*”
}
]
}
This policy is for developers with read-only access:
{
“Version”: “2012-10-17”,
“Statement”: [
{
“Effect”: “Allow”,
“Action”: [
“elasticmapreduce:Describe*”,
“elasticmapreduce:List*”,
“s3:GetObject”,
“s3:ListAllMyBuckets”,
“s3:ListBucket”,
“sdb:Select”,
“cloudwatch:GetMetricStatistics”
],
“Resource”: “*”
}
]
}
These are IAM managed policies. If you want to change the permissions, you can create your own IAM custom policy.
Monitor
Use logs available from AWS CloudTrail, Amazon Cloudwatch, Amazon VPC, Amazon S3, and Elastic Load Balancing as much as possible. You can use AWS Config, Trusted Advisor, and CloudWatch events and alarms to monitor these logs.
AWS CloudTrail can be used to log API calls in AWS. It helps you fix problems, secure your environment, and produce audit reports. For example, you could use CloudTrail logs to identify who launched those r3.8xlarge instances.
AWS Config can be used to keep track of and act on rules. Config rules check the configuration of your AWS resources for compliance. You’ll also get, at a glance, the compliance status of your environment based on the rules you configured.
Amazon CloudWatch can be used to monitor and alarm on incorrectly configured resources. CloudWatch entities–metrics, alarms, logs, and events–help you monitor your AWS resources. Using metrics (including custom metrics), you can monitor resources and get a dashboard with customizable widgets. Cloudwatch Logs can be used to stream data from AWS-provided logs in addition to your system logs, which is helpful for fixing and auditing.
CloudWatch Events help you take actions on changes. VPC flow, S3, and ELB logs provide you with data to make smarter decisions when fixing problems or optimizing your environment.
AWS Trusted Advisor analyzes your AWS environment and provides best practice recommendations in four categories: cost, performance, security, and fault tolerance. This online resource optimization tool also includes AWS limit warnings.
We will use Trusted Advisor to make sure a limit increase is not going to become bottleneck in launching 100 instances:
Trusted Advisor
Fix
Depending on the violation and your ability to monitor and view the resource configuration, you might want to take action when you find an incorrectly configured resource that will lead to a security violation. It’s important the fix doesn’t result in unwanted consequences and that you maintain an auditable record of the actions you performed.
You can use AWS Lambda to automate everything. When you use Lambda with Amazon Cloudwatch Events to fix issues, you can take action on an instance termination event or the addition of new instance to an Auto Scaling group. You can take an action on any AWS API call by selecting it as source. You can also use AWS Config managed rules and custom rules with remediation. While you are getting informed about the environment based on AWS Config rules, you can use AWS Lambda to take action on top of these rules. This helps in automating the fixes.
AWS Config to Find Running Instance Type
To fix the problem in our use case, you can implement a Config custom rule and trigger (for example, the shutdown of the instances if the instance type is larger than .xlarge or the tearing down of the EMR cluster).
Audit
You’ll want to have a report ready for the auditor at the end of the year or quarter. You can automate your reporting system using AWS Config resources.
You can view AWS resource configurations and history so you can see when the r3.8xlarge instance cluster was launched or which security group was attached. You can even search for deleted or terminated instances.
AWS Config Resources
More Control, Monitor, and Fix Examples
Armando Leite from AWS Professional Services has created a sample governance framework that leverages Cloudwatch Events and AWS Lambda to enforce a set of controls (flows between layers, no OS root access, no remote logins). When a deviation is noted (monitoring), automated action is taken to respond to an event and, if necessary, recover to a known good state (fix).
· Remediate (for example, shut down the instance) through custom Config rules or a CloudWatch event to trigger the workflow.
· Monitor a user’s OS activity and escalation to root access. As events unfold, new Lambda functions dynamically enable more logs and subscribe to log data for further live analysis.
· If the telemetry indicates it’s appropriate, restore the system to a known good state.