AWS Compute Blog

Update: Issue affecting HashiCorp Terraform resource deletions after the VPC Improvements to AWS Lambda

On September 3, 2019, we announced an exciting update that improves the performance, scale, and efficiency of AWS Lambda functions when working with Amazon VPC networks. You can learn more about the improvements in the original blog post. These improvements represent a significant change in how elastic network interfaces (ENIs) are configured to connect to your VPCs. With this new model, we identified an issue where VPC resources, such as subnets, security groups, and VPCs, can fail to be destroyed via HashiCorp Terraform. More information about the issue can be found here. In this post we will help you identify whether this issue affects you and the steps to resolve the issue.

How do I know if I’m affected by this issue?

This issue only affects you if you use HashiCorp Terraform to destroy environments. Versions of Terraform AWS Provider that are v2.30.0 or older are impacted by this issue. With these versions you may encounter errors when destroying environments that contain AWS Lambda functions, VPC subnets, security groups, and Amazon VPCs. Typically, terraform destroy fails with errors similar to the following:

Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 20m0s)

Error deleting security group: DependencyViolation: resource sg-<id> has a dependent object
        	status code: 400, request id: <guid>

Depending on which AWS Regions the VPC improvements are rolled out, you may encounter these errors in some Regions and not others.

How do I resolve this issue if I am affected?

You have two options to resolve this issue. The recommended option is to upgrade your Terraform AWS Provider to v2.31.0 or later. To learn more about upgrading the Provider, visit the Terraform AWS Provider Version 2 Upgrade Guide. You can find information and source code for the latest releases of the AWS Provider on this page. The latest version of the Terraform AWS Provider contains a fix for this issue as well as changes that improve the reliability of the environment destruction process. We highly recommend that you upgrade the Provider version as the preferred option to resolve this issue.

If you are unable to upgrade the Provider version, you can mitigate the issue by making changes to your Terraform configuration. You need to make the following sets of changes to your configuration:

  1. Add an explicit dependency, using a depends_on argument, to the aws_security_group and aws_subnet resources that you use with your Lambda functions. The dependency has to be added on the aws_security_group or aws_subnet and target the aws_iam_policy resource associated with IAM role configured on the Lambda function. See the example below for more details.
  2. Override the delete timeout for all aws_security_group and aws_subnet resources. The timeout should be set to 40 minutes.

The following configuration file shows an example where these changes have been made(scroll to see the full code):

provider "aws" {
    region = "eu-central-1"
}
 
resource "aws_iam_role" "lambda_exec_role" {
  name = "lambda_exec_role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}
 
data "aws_iam_policy" "LambdaVPCAccess" {
  arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}
 
resource "aws_iam_role_policy_attachment" "sto-lambda-vpc-role-policy-attach" {
  role       = "${aws_iam_role.lambda_exec_role.name}"
  policy_arn = "${data.aws_iam_policy.LambdaVPCAccess.arn}"
}
 
resource "aws_security_group" "allow_tls" {
  name        = "allow_tls"
  description = "Allow TLS inbound traffic"
  vpc_id      = "vpc-<id>"
 
  ingress {
    # TLS (change to whatever ports you need)
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    # Please restrict your ingress to only necessary IPs and ports.
    # Opening to 0.0.0.0/0 can lead to security vulnerabilities.
    cidr_blocks = ["0.0.0.0/0"]
  }
 
  egress {
    from_port       = 0
    to_port         = 0
    protocol        = "tcp"
    cidr_blocks     = ["0.0.0.0/0"]
  }
  
  timeouts {
    delete = "40m"
  }
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]  
}
 
resource "aws_subnet" "main" {
  vpc_id     = "vpc-<id>"
  cidr_block = "172.31.68.0/24"

  timeouts {
    delete = "40m"
  }
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]
}
 
resource "aws_lambda_function" "demo_lambda" {
    function_name = "demo_lambda"
    handler = "index.handler"
    runtime = "nodejs10.x"
    filename = "function.zip"
    source_code_hash = "${filebase64sha256("function.zip")}"
    role = "${aws_iam_role.lambda_exec_role.arn}"
    vpc_config {
     subnet_ids         = ["${aws_subnet.main.id}"]
     security_group_ids = ["${aws_security_group.allow_tls.id}"]
  }
}

The key block to note here is the following, which can be seen in both the “allow_tls” security group and “main” subnet resources:

timeouts {
  delete = "40m"
}
depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]

These changes should be made to your Terraform configuration files before destroying your environments for the first time.

Can I delete resources remaining after a failed destroy operation?

Destroying environments without upgrading the provider or making the configuration changes outlined above may result in failures. As a result, you may have ENIs in your account that remain due to a failed destroy operation. These ENIs can be manually deleted a few minutes after the Lambda functions that use them have been deleted (typically within 40 minutes). Once the ENIs have been deleted, you can re-re-run terraform destroy.