Containers

Canary delivery with Argo Rollouts and Amazon VPC Lattice for Amazon EKS

Modern application delivery demands agility and reliability, with updates rolled out progressively while minimizing the impact on end users. Progressive delivery strategies, such as canary deployments, allow organizations to release new features by shifting traffic incrementally between old and new versions of a service. Organizations can first release features to a small subset of users, monitor system behavior and performance in real time, and automatically roll back if anomalies are detected. This is particularly valuable in modern microservices environments running on platforms such as Amazon Elastic Kubernetes Service (Amazon EKS), where service meshes and traffic routers provide the necessary infrastructure for fine-grained control over traffic routing.

This post explores an architectural approach to implementing progressive delivery using Amazon VPC Lattice, Amazon CloudWatch Synthetics, and Argo Rollouts. The solution uses VPC Lattice for enhanced traffic control across microservices, CloudWatch Synthetics for real-time health and validation monitoring, and Argo Rollouts for orchestrating canary updates. The content in this post addresses readers who are already familiar with networking constructs on Amazon Web Services (AWS), such as Amazon Virtual Private Cloud (Amazon VPC), CloudWatch Synthetics and Amazon EKS. Instead of defining these services, we focus on their capabilities and integration with VPC Lattice. We also build upon your existing understanding of VPC Lattice concepts and Argo Rollouts. For more background on Amazon VPC Lattice, we recommend that you review the post, Build secure multi-account multi-VPC connectivity for your applications with Amazon VPC Lattice, and the collection of resources in the VPC Lattice Getting started guide.

Solution overview

The architecture integrates multiple AWS services and Kubernetes-native components, providing a comprehensive solution for progressive delivery:

  • Amazon EKS: A fully managed Kubernetes service to host microservices.
  • VPC Lattice: A service networking layer that enables consistent traffic routing, authentication, and observability across services.
    • In this post, we enable traffic routing between versions of services (for example, prodDetail v1 and prodDetail v2) deployed on the same EKS cluster.
    • Uses the AWS Gateway API Controller to manage Kubernetes Custom Resource Definitions, such as HTTPRoute configurations, for weighted traffic distribution. Refer to the post on the AWS Gateway API controller for Amazon VPC Lattice for more details.
    • Although our example uses services in a single cluster, a key advantage of VPC Lattice is its ability to route traffic to services across different clusters or even different VPCs within the same AWS Region, providing flexibility for distributed architectures.
  • Argo Rollouts: A Kubernetes controller for advanced deployment strategies such as canary and blue/green.
    • Dynamically adjusts traffic weights in the HTTPRoute using the Gateway API plugin for traffic routing.
    • Provides rollback capabilities in case of failures during the rollout.
  • CloudWatch Synthetics: Monitors endpoints using configurable test scripts (“canaries”) to validate application behavior during the rollout (for example, URL reachability or response text).
  • AnalysisTemplates in Argo Rollouts:
    • Define validation logic to assess metrics such as success rates and error percentages.
    • Trigger actions such as weight adjustments or rollbacks based on metric evaluations.

Walkthrough: progressive delivery with VPC Lattice and Argo Rollouts

In this section we consider an application running on Amazon EKS, where a new version of a microservice—prodDetail v2—needs to be rolled out with minimal impact to users relying on the stable version v1. To do this, we implement a canary deployment strategy using VPC Lattice, Argo Rollouts, CloudWatch Synthetics, and AnalysisTemplates. Figure 1 shows the architecture diagram.


Figure 1: Architecture diagram

  1. We start with VPC Lattice as the backbone of service-to-service communication. It allows the application to abstract away networking complexity and focus on routing traffic intelligently.
    • Using the Gateway API integration, we define an HTTPRoute resource that routes 100% of traffic to proddetail v1.
    • As the rollout progresses, we gradually shift a percentage of traffic to proddetail v2, starting with 20%, then 40%, and eventually reaching 100%. VPC Lattice makes this traffic shifting seamless and consistent across environments.
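    For illustration, the weighted HTTPRoute that the Gateway API Controller translates into VPC Lattice routing rules could look like the following at the 80/20 step. This is a sketch: the Gateway name latcan-gw and the apiVersion are assumptions, while the route name latcan-app and the stable/canary Service names follow the Rollout configuration used in this post.

```yaml
# Illustrative HTTPRoute managed by the AWS Gateway API Controller.
# The Gateway name "latcan-gw" is an assumption; the route and Service
# names follow the Rollout configuration used in this post.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: latcan-app
  namespace: default
spec:
  parentRefs:
    - name: latcan-gw        # Gateway backed by a VPC Lattice service network
  rules:
    - backendRefs:
        - name: proddetail-stable-service
          kind: Service
          port: 3000
          weight: 80
        - name: proddetail-canary-service
          kind: Service
          port: 3000
          weight: 20
```

    Argo Rollouts rewrites only the weight fields of the two backendRefs as the rollout progresses.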
  2. To manage this rollout, we use Argo Rollouts. It orchestrates the entire deployment, controlling how and when traffic moves to the canary version.
    • Traffic routing plugin: Argo interacts directly with the Gateway API to update the HTTPRoute in VPC Lattice using the Traffic Routing Plugin, adjusting weights as each step in the rollout plan is completed.
    • For example, it begins with 100% of the traffic on v1, then transitions to 80-20%, 60-40%, and so on, until it fully cuts over to v2.
    • Rollback mechanism: If something goes wrong, Argo can immediately roll back all traffic to v1 and clean up any temporary resources.

    Argo Rollouts coordinates with VPC Lattice by doing the following:

    • Registering each version of the service (v1, v2) as a separate target group
    • Using weighted routing policies in the VPC Lattice service network
    • Controlling traffic at the request level (not just at the pod level)

    This creates a clear separation between deployment logic (Argo) and traffic management (VPC Lattice), while still allowing programmatic coordination.

    Example Argo Rollouts canary strategy:

    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: proddetail
      namespace: workshop
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: proddetail
      template:
        metadata:
          labels:
            app: proddetail
        spec:
          containers:
            - name: proddetail
              image: public.ecr.aws/u2g6w7p2/eks-workshop-demo/catalog_detail@sha256:83a708cddd3fae0d71ff6e3
              imagePullPolicy: Always
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
              livenessProbe:
                httpGet:
                  path: /ping
                  port: 3000
                initialDelaySeconds: 0
                periodSeconds: 10
                timeoutSeconds: 1
                failureThreshold: 3
              readinessProbe:
                httpGet:
                  path: /ping
                  port: 3000
                successThreshold: 3
              ports:
                - containerPort: 3000
              env:
                - name: AWS_XRAY_DAEMON_ADDRESS
                  value: xray-service.default:2000
              securityContext:
                runAsNonRoot: true
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
      strategy:
        canary:
          canaryService: proddetail-canary-service
          stableService: proddetail-stable-service
          trafficRouting:
            plugins:
              argoproj-labs/gatewayAPI:
                httpRoute: latcan-app
                namespace: default
          steps:
          - analysis:
              templates:
                - templateName: start-cloudwatch-canary
          - setWeight: 20
          - pause: {duration: 3m}
          - analysis:
              templates:
                - templateName: successrate-80-20
          - setWeight: 40
          - pause: {duration: 3m}
          - analysis:
              templates:
                - templateName: successrate-60-40
          - setWeight: 60
          - pause: {duration: 3m}
          - analysis:
              templates:
                - templateName: successrate-40-60
          - setWeight: 80
          - pause: {duration: 3m}
          - analysis:
              templates:
                - templateName: successrate-20-80
          - setWeight: 100
          - pause: {duration: 3m}
          - analysis:
              templates:
                - templateName: successrate-0-100
          - analysis:
              templates:
                - templateName: stop-cloudwatch-canary
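    The setWeight steps above produce a fixed sequence of stable/canary splits. A quick Python sketch (illustrative only, not part of the rollout tooling) of the progression:

```python
# Canary weights taken from the Rollout's setWeight steps above.
canary_steps = [20, 40, 60, 80, 100]

# Each step yields a (stable, canary) percentage pair that Argo Rollouts
# writes into the HTTPRoute's backendRef weights.
splits = [(100 - w, w) for w in canary_steps]
print(splits)  # [(80, 20), (60, 40), (40, 60), (20, 80), (0, 100)]
```

    Between each pair, the rollout pauses for three minutes and runs an analysis step before moving to the next split.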
  3. However, traffic shifting alone isn’t enough—we need to validate that v2 is healthy before progressing. That’s where CloudWatch Synthetics comes in. CloudWatch Synthetics canaries are used for real-time validation during the rollout:
    • Validation metrics: As part of the rollout process, CloudWatch canaries simulate user behavior by making HTTP requests to the v2 version of the service.
    • For example, the canary might check if a specific product page loads correctly (navigateToUrl) and whether certain expected content appears on the page (verifyText). These checks run automatically during each phase of the rollout.

    Example canary script:

    import asyncio
    from selenium.webdriver.common.by import By
    from aws_synthetics.selenium import synthetics_webdriver as syn_webdriver
    from aws_synthetics.common import synthetics_logger as logger, synthetics_configuration
    TIMEOUT = 10
    
    async def main():
        url = "<Place your NLB DNS>"
        browser = syn_webdriver.Chrome()
    
        # Set synthetics configuration
        synthetics_configuration.set_config({
            "screenshot_on_step_start": True,
            "screenshot_on_step_success": True,
            "screenshot_on_step_failure": True
        })
    
        def navigate_to_page():
            browser.implicitly_wait(TIMEOUT)
            browser.get(url)
    
        await syn_webdriver.execute_step("navigateToUrl", navigate_to_page)
    
        # Execute customer steps
        def customer_actions_1():
            browser.find_element(By.XPATH, "/html/body/table/tbody/tr/td[1]/p[3]/label/mark[2][contains(text(),'XYZ.com')]")
    
        await syn_webdriver.execute_step('verifyText', customer_actions_1)
    
        logger.info("Canary successfully executed.")
    
    async def handler(event, context):
        # user defined log statements using synthetics_logger
        logger.info("Selenium Python workflow canary.")
        return await main()


    Figure 2: CloudWatch Synthetic canaries run steps showing the step name and its status

  4. Integration with AnalysisTemplates: All of this validation is automated using Argo Rollouts’ AnalysisTemplates. These templates define rules that interpret the canary metrics and decide whether the rollout should continue, pause, or roll back.
    • For example, if the metric SuccessPercent from CloudWatch indicates a healthy threshold (for example over 20%), then the rollout proceeds. If the metric dips below, then Argo halts the rollout and reverts traffic to the stable version.
    • Example: metric-based validation, where CloudWatch metrics are evaluated using successCondition and failureCondition logic.
      • Success: len(result[0].Values) > 0 && result[0].Values[-1] >= 20
      • Failure: result[0].Values[-1] < 20

    Example analysis template:

    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: successrate-0-100
      namespace: workshop
    spec:
      metrics:
      - name: navigate-to-url-success
        count: 3
        interval: 1m
        successCondition: "len(result[0].Values) > 0 && result[0].Values[-1] == 100"
        failureLimit: 1  # Fail immediately if navigateToUrl is not 100%
        provider:
          cloudWatch:
            interval: 1m
            metricDataQueries:
            - {
                "id": "navigateToUrlSuccess",
                "metricStat": {
                  "metric": {
                    "namespace": "CloudWatchSynthetics",
                    "metricName": "SuccessPercent",
                    "dimensions": [
                      {"name": "CanaryName", "value": "productdetail_v2_synthcanary"},
                      {"name": "StepName", "value": "navigateToUrl"}
                    ]
                  },
                  "period": 60,
                  "stat": "Average"
                },
                "returnData": true
              }
      - name: verify-text-success
        interval: 1m
        count: 10
        successCondition: "len(result[0].Values) > 0 && result[0].Values[-1] >= 80"
        failureLimit: 10  # Allow up to 10 retries for verifyText
        provider:
          cloudWatch:
            interval: 5m
            metricDataQueries:
            - {
                "id": "verifyTextSuccess",
                "metricStat": {
                  "metric": {
                    "namespace": "CloudWatchSynthetics",
                    "metricName": "SuccessPercent",
                    "dimensions": [
                      {"name": "CanaryName", "value": "productdetail_v2_synthcanary"},
                      {"name": "StepName", "value": "verifyText"}
                    ]
                  },
                  "period": 300,
                  "stat": "Average"
                },
                "returnData": true
              }
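    To make the gating behavior concrete, the following Python sketch mimics how the successCondition expressions above evaluate the CloudWatch query results. This is a simplification for illustration, not Argo Rollouts' actual expression engine; the thresholds mirror the templates in this post.

```python
def evaluate(values, threshold):
    """Mimic: len(result[0].Values) > 0 && result[0].Values[-1] >= threshold.

    `values` is the list of SuccessPercent datapoints returned by the
    CloudWatch metric query; the most recent datapoint is compared
    against the threshold.
    """
    return len(values) > 0 and values[-1] >= threshold

# navigate-to-url-success requires the latest SuccessPercent to be 100
assert evaluate([100.0, 100.0], 100)

# verify-text-success passes at 80 or above
assert evaluate([75.0, 82.5], 80)

# An empty result set never passes, so the rollout does not progress
assert not evaluate([], 80)
```

    When the condition evaluates false more times than the metric's failureLimit allows, Argo Rollouts marks the analysis as failed and aborts the rollout.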

Together, this architecture makes sure that updates to microservices can be deployed gradually and automatically, without manual intervention. Developers gain confidence in pushing new features, knowing that the system is constantly monitoring for failures and is ready to act instantly if something goes wrong.

Integration highlights

The integration of VPC Lattice with the AWS Gateway API plugin allows for seamless coordination with Argo Rollouts, enabling dynamic traffic management during application deployments. As each rollout phase completes, traffic weights in the HTTPRoute are dynamically updated according to the defined strategy, making sure the transition to the new version is safe and gradual.

Argo Rollouts also integrates closely with CloudWatch Synthetics, using canary tests to gather real-time health metrics throughout the deployment process. These metrics are accessed through the CloudWatch provider configured in AnalysisTemplates. This allows each rollout phase to be gated automatically based on actual performance and availability indicators.

In the event of degraded metrics, Argo Rollouts uses AnalysisTemplates to trigger automatic rollbacks, thus maintaining application stability without manual intervention. Throughout the rollout, ReplicaSets are dynamically created and deleted by Argo Rollouts based on the rollout state. Corresponding updates to HTTPRoute weights are synchronized with ReplicaSet lifecycle, making sure that traffic routing always aligns with the current rollout state.


Figure 3: Argo rollout CLI showing canary replica with adjusted weights and analysis run

Conclusion

Combining Amazon VPC Lattice, Argo Rollouts, and Amazon CloudWatch Synthetics allows you to build a production-grade progressive delivery system that is safe, observable, and scalable. This architecture enables the following:

  • Gradual traffic shifting.
  • Real-time validation of new service versions.
  • Automated rollback in case of failures.

This integration is ideal for organizations adopting modern DevOps practices, providing a seamless and reliable path for deploying updates with confidence. To find out more details about VPC Lattice, check the documentation. For a detailed demonstration of this solution in action, you can watch the execution video on the Containers from the Couch YouTube channel. If you have questions about this post, then start a new thread on AWS re:Post or contact AWS Support.


About the authors

Nikit Swaraj is an Enterprise Solutions Architect at AWS, supporting customers in the Commercial sector with a focus on industries such as Manufacturing, Real Estate, and Media. With deep expertise in cloud architecture, containers, and generative AI, Nikit helps customers design secure, scalable, and innovative solutions on AWS. He is particularly passionate about enabling users to accelerate their digital transformation journeys through cloud-native technologies. Beyond his professional role, Nikit pursues his interests in travel and gaming.

Mokshith Kumar is a Sr. Go-To-Market (GTM) Specialist Solutions Architect for Core Networking at AWS, supporting ISV and FSI customers across North America. He plays a key role in developing GTM strategies, leading strategic initiatives, uncovering new opportunities, accelerating the adoption of AWS networking services, and driving impactful customer and partner engagements. He enjoys working directly with customers to solve complex cloud networking challenges and is passionate about helping them modernize their architectures. Outside of work, he’s an avid swimmer and music enthusiast.