Networking & Content Delivery

How to migrate your VPC endpoint service backend targets

Amazon Virtual Private Cloud (VPC) endpoints, powered by AWS PrivateLink, allow you to securely expose your application to consumers on AWS without using public IP space and without worrying about overlapping private IP space. You also don’t have to worry about creating bidirectional network paths using services like AWS Transit Gateway or Amazon VPC Peering.

To act as a VPC endpoint service, your application must be fronted by a Network Load Balancer (NLB). Once an endpoint service is created, you can create endpoints in consumer VPCs to reach your application. PrivateLink-powered endpoints are used by partners as well as AWS managed services, and can scale to thousands of VPCs.
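
If you’re building this setup programmatically, the following minimal boto3 sketch shows how an endpoint service could be created from an existing NLB. The ARN is a placeholder and error handling is omitted:

import boto3

ec2 = boto3.client("ec2")

# Placeholder ARN for the NLB fronting your application.
NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/my-nlb/0123456789abcdef"

# Create the endpoint service. AcceptanceRequired=True means you must
# approve each consumer endpoint connection before traffic can flow.
service = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=[NLB_ARN],
    AcceptanceRequired=True,
)["ServiceConfiguration"]
print(service["ServiceName"])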

At some point in your NLB/endpoint service lifecycle, you may need to replace all of your backend targets at once. This could be because you’re upgrading your application version or you need to move it to another account or VPC. When using VPC endpoint services with NLB, you can’t use traditional migration methods like DNS weighted routing for blue/green deployments, because you must consider not just the NLB but also any endpoints created against the endpoint service it supports. Replacing the entire endpoint service is not a valid option, as it would require re-creating all the existing endpoints.

This post covers the approaches and considerations for replacing backend targets used by a VPC endpoint service.

Overview

The following diagram (figure 1) depicts our starting point architecture. It shows an endpoint service with a single NLB and multiple consumers. Each consumer has an endpoint that connects to the endpoint service. Note that, for simplicity, we only show one endpoint per VPC. However, we recommend deploying multiple endpoints across multiple Availability Zones (AZs) in practice.

Figure 1. Starting point architecture – endpoint service with multiple endpoints

In the following sections, we explore different options for migrating the targets used by an NLB in the endpoint service.

The post looks at two approaches:

  1. Replacing the target group on the existing NLB used by the endpoint service (recommended for migration)
  2. Expanding the endpoint service with a new NLB (not recommended for migration, only for scaling)

1. Replace target group on existing NLB

The recommended approach is to swap the target groups used on an NLB listener. Even though the change operation is simple, you should follow a certain set of steps to minimize the downtime of your service throughout the migration.

Before we go into the steps, let’s understand the health states that a target can go through during the process of being added to a target group (target registration) and being removed from it (target deregistration). Figure 2 depicts the relevant stages of the change process.

Figure 2. Health states of NLB targets

When you create a target group and register targets to it, those targets remain unused until you attach the group to a load balancer listener. Once the target group is attached to the load balancer listener, the targets transition into the initial state. They’re not receiving traffic yet, and the load balancer is going through the initial health checks to validate that the targets can be used. Once all the health checks succeed, a target moves to the healthy state where it can receive traffic.
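
You can observe these states through the DescribeTargetHealth API. As a quick illustration (the target group ARN below is a placeholder), a boto3 call such as the following prints the current state of each target:

import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN; replace with your target group's ARN.
TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/new-tg/0123456789abcdef"

# Each target reports a state such as 'initial', 'healthy', 'draining', or 'unused'.
for desc in elbv2.describe_target_health(TargetGroupArn=TG_ARN)["TargetHealthDescriptions"]:
    print(desc["Target"]["Id"], desc["TargetHealth"]["State"])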

When a target is deregistered, it moves into the draining state. Targets in the draining state continue to handle connections opened prior to the start of the draining state, but they don’t receive any new connections. You can configure how long targets stay in the draining state by setting the deregistration delay, as well as what happens at the end of the draining process (for example, whether connections are actively terminated) before targets move to the unused state.
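
Both the deregistration delay and the end-of-drain connection termination behavior are target group attributes. A minimal boto3 sketch, assuming a placeholder ARN:

import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN for the old target group.
OLD_TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/old-tg/0123456789abcdef"

# Set the draining duration (in seconds, up to 3600) and have the NLB
# reset connections that are still open when the delay expires.
elbv2.modify_target_group_attributes(
    TargetGroupArn=OLD_TG_ARN,
    Attributes=[
        {"Key": "deregistration_delay.timeout_seconds", "Value": "300"},
        {"Key": "deregistration_delay.connection_termination.enabled", "Value": "true"},
    ],
)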

If you detach a target group from all listeners, then the targets move immediately to the unused state, bypassing the draining state. Any active, in-flight connections are still handled by the original targets, but the termination behavior is no longer controlled by the NLB; it’s up to the client or application. If you want the NLB to terminate connections at the end of the draining state, then you must deregister targets before your target group is removed from any listener on the NLB.

Now that we’ve covered the states of a target health check, let’s walk through the steps for target group migration. The goal of this process is to have the new targets appear as healthy before they replace the old ones. At the same time, we want the old targets to remain healthy in case we must roll back.

The following list covers the steps required to migrate between target groups on the same NLB, and a boto3 sketch of the same steps follows the list. This approach focuses on minimizing the impact on any live traffic. The only expected impact is to in-flight connections that last longer than the deregistration delay and must be actively reset by the NLB.

  1. Configure the deregistration delay on the old target group to your preferred value (the maximum is one hour) and set up the desired behavior for the end of the delay (such as connection termination)
  2. Create two temporary listeners on your NLB on any unused port
  3. Attach the new target group to one of the temporary listeners. This starts the registration process for the targets in the new target group, moving them into the initial state.
  4. Attach the old target group to the other temporary listener, as shown in the following diagram (figure 3).

    Figure 3. Interim state before migrating target groups

  5. Wait until all new targets are in the ‘healthy’ state
    • Optional – once targets are healthy, you can test the new target group by connecting to its temporary listener through any of the existing VPC endpoints. For example, if you’re running a web application, you can use ‘curl’ to connect to port 81 on the VPC endpoint in any of the consumer VPCs.
  6. Update the original (old) listener and swap the old target group for the new one.
    • At this point, new connections are sent to the new target group
    • In-flight connections are still handled by the old target group
    • If at any point you must roll back, then update the original listener again to use the old target group. The old target group is still going to be healthy since it’s still attached to the temporary listener.
    • If you stop here, then the NLB does not actively terminate connections and you must do that yourself on the client or application side. Figure 4 shows how the connections are handled by the original listener.

      Figure 4. Connection flow after target group migration on original listener

  7. Once you confirm that traffic is flowing as expected on the new target group, you can deregister all targets on the old target group. The deregistration delay period starts and targets go into the draining state. If configured, the NLB actively terminates connections at the end of the deregistration period.
  8. Finally, delete the temporary listeners from the NLB as shown in figure 5.
Figure 5. Final state post migration.
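
For reference, here is a minimal boto3 sketch of steps 2 through 8. All ARNs are placeholders, ports 81 and 82 are assumed to be unused on the NLB, and the validation in steps 5 and 7 is reduced to a simple polling loop; treat it as an outline rather than production code:

import time
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs; replace with your own resources.
NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/my-nlb/0123456789abcdef"
ORIGINAL_LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:listener/net/my-nlb/0123456789abcdef/1111222233334444"
OLD_TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/old-tg/0123456789abcdef"
NEW_TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/new-tg/fedcba9876543210"

# Steps 2-4: create temporary listeners on unused ports so that both
# target groups are attached to the NLB and health checked.
tmp_new = elbv2.create_listener(
    LoadBalancerArn=NLB_ARN, Protocol="TCP", Port=81,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": NEW_TG_ARN}],
)["Listeners"][0]["ListenerArn"]
tmp_old = elbv2.create_listener(
    LoadBalancerArn=NLB_ARN, Protocol="TCP", Port=82,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": OLD_TG_ARN}],
)["Listeners"][0]["ListenerArn"]

# Step 5: poll until every target in the new target group is healthy.
while True:
    health = elbv2.describe_target_health(TargetGroupArn=NEW_TG_ARN)["TargetHealthDescriptions"]
    if health and all(d["TargetHealth"]["State"] == "healthy" for d in health):
        break
    time.sleep(15)

# Step 6: swap the target group on the original listener. To roll back,
# run the same call again pointing at OLD_TG_ARN.
elbv2.modify_listener(
    ListenerArn=ORIGINAL_LISTENER_ARN,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": NEW_TG_ARN}],
)

# Step 7: after confirming traffic on the new targets, deregister the
# old targets to start the deregistration delay (draining).
old_targets = [{"Id": d["Target"]["Id"]} for d in
               elbv2.describe_target_health(TargetGroupArn=OLD_TG_ARN)["TargetHealthDescriptions"]]
elbv2.deregister_targets(TargetGroupArn=OLD_TG_ARN, Targets=old_targets)

# Step 8: remove the temporary listeners.
elbv2.delete_listener(ListenerArn=tmp_new)
elbv2.delete_listener(ListenerArn=tmp_old)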

Considerations:

  • This is the recommended process for replacing target groups of an NLB that’s part of an endpoint service.
  • For an NLB that is not part of an endpoint service the recommended approach is to create a new NLB and use DNS to switch between the old and the new NLB.
  • This approach works regardless of the target type (instance or IP) or the configuration of other target group attributes.

2. Add new NLB to endpoint service

A VPC endpoint service can be configured to use multiple NLBs. This is typically used to extend the total capacity of an endpoint service.

Once you add a new NLB, any new endpoints that are added to your service could use either the new or old NLB. Note that if the targets of one NLB fail, then the other one does not automatically take over.

Figure 6. Endpoint service with multiple NLBs
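
Adding a second NLB is a single call to ModifyVpcEndpointServiceConfiguration. A boto3 sketch, with placeholder IDs:

import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs; replace with your endpoint service ID and new NLB ARN.
SERVICE_ID = "vpce-svc-0123456789abcdef0"
NEW_NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/new-nlb/abcdef0123456789"

# Attach the additional NLB to the existing endpoint service.
ec2.modify_vpc_endpoint_service_configuration(
    ServiceId=SERVICE_ID,
    AddNetworkLoadBalancerArns=[NEW_NLB_ARN],
)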

You can’t remove an NLB from an endpoint service that has active connections. To remove the old NLB from this setup, you must first ‘reject’ all the endpoint connections that are already in use by the endpoint service. This creates downtime for all the consumers for however long the endpoint connections remain rejected.  

Figure 7. Console example of rejecting an endpoint.

Once all the endpoints are rejected, you can remove the old NLB and then accept the endpoints all over again (you still have visibility of all the requests even after you reject them). Once all the endpoints are accepted, traffic flows to only the new NLB.
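
The reject/remove/accept cycle can be scripted as well. The following boto3 sketch (placeholder IDs, no error handling) illustrates the sequence; remember that consumers lose connectivity from the moment their connection is rejected until it is accepted again:

import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs; replace with your own.
SERVICE_ID = "vpce-svc-0123456789abcdef0"
OLD_NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/old-nlb/0123456789abcdef"

# Find every endpoint connection currently accepted against the service.
conns = ec2.describe_vpc_endpoint_connections(
    Filters=[{"Name": "service-id", "Values": [SERVICE_ID]}]
)["VpcEndpointConnections"]
endpoint_ids = [c["VpcEndpointId"] for c in conns if c["VpcEndpointState"] == "available"]

# Reject all connections (downtime starts here), remove the old NLB,
# then accept the same endpoints again (downtime ends as they return
# to the available state).
ec2.reject_vpc_endpoint_connections(ServiceId=SERVICE_ID, VpcEndpointIds=endpoint_ids)
ec2.modify_vpc_endpoint_service_configuration(
    ServiceId=SERVICE_ID, RemoveNetworkLoadBalancerArns=[OLD_NLB_ARN]
)
ec2.accept_vpc_endpoint_connections(ServiceId=SERVICE_ID, VpcEndpointIds=endpoint_ids)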

Considerations:

  • This option is best for expanding the capacity of your endpoint service, but we do not recommend it for migrating backends, as it comes with the longest downtime. It’s included here for completeness and scenarios where you may want to remove an already existing NLB from an endpoint service.
  • You must programmatically reject all the existing endpoint connections, and then accept them again. Throughout that period, consumers cannot connect to your service.
  • When adding a new NLB to the endpoint service, you don’t have control over which NLB is used by which endpoint connections. 

Conclusion

In this post, we covered the mechanism for migrating between backend targets used by an NLB that is part of an endpoint service. Since the NLB itself can’t be replaced without disruption, the recommended migration path is to switch target groups, and there is a specific order of operations you must follow to minimize any impact to live traffic.

As with any production change, make sure you test out the steps in a development or test environment before you make changes in production. Even though the first approach using target group migration minimizes impact to live traffic, you should still schedule an ample outage window for any production changes.

About the authors

Tom Adamski

Tom is a Principal Solutions Architect specializing in Networking. He has over 15 years of experience building networks and security solutions across various industries – from telcos and ISPs to small startups. He has spent the last 6 years helping AWS customers build their network environments in the AWS Cloud. In his spare time Tom can be found hunting for waves to surf around the California coast.

Luis Felipe Silveira da Silva

Luis Felipe is a Principal Solutions Architect in the ELB Team. He works with a diverse range of load balancing and networking technologies, collaborating with customers and internal teams to design and optimize workloads, along with ensuring successful implementation and adoption of EC2 Networking services.