Microsoft Workloads on AWS

Modernizing a legacy monolith with Amazon EKS Windows containers

This post is co-written with Danny Teller from Tipalti.

Tipalti is a global payables automation platform processing billions of dollars annually for thousands of customers worldwide. The platform handles complex payment workflows requiring high availability, data integrity, and rapid scaling during peak processing periods.

This post shows how Tipalti modernized a legacy Microsoft .NET Framework monolithic application by migrating from Amazon Elastic Compute Cloud (Amazon EC2) to Amazon Elastic Kubernetes Service (Amazon EKS) with Windows containers. Tipalti achieved a 50% performance improvement and a 60% cost reduction through automated scaling and enhanced observability. If you run legacy .NET Framework applications, this migration path offers a proven blueprint for modernization without the risk and expense of a complete rewrite.

Business challenge

Tipalti built its core payment processing service using Microsoft .NET Framework 4.7 and deployed it to Amazon EC2 instances on Amazon Web Services (AWS). This monolithic application served the company well during its startup phase, but as Tipalti scaled to handle increasing payment volumes, the limitations of this legacy architecture became increasingly problematic.

Tipalti faced three critical challenges:

  1. Scaling required manual and complex provisioning during month-end spikes.
  2. Deployments terminated processes without warning, potentially interrupting in-flight payments.
  3. Debugging proved to be an immense challenge because file-based logging was fragmented across instances and multiple child processes.

Solution overview

Tipalti chose to containerize the existing Microsoft .NET Framework 4.7 monolith and deploy it to Amazon EKS with Windows Server nodes. This approach preserved the existing application code while gaining the operational benefits of container orchestration.

The solution (Figure 1) combines several AWS services. Amazon EKS provides Kubernetes orchestration for container lifecycle management and scaling, and the Amazon Virtual Private Cloud (Amazon VPC) CNI for Windows provides pod networking within the cluster. To handle variable workloads, Kubernetes Event-Driven Autoscaling (KEDA) monitors RabbitMQ queue depths and automatically adjusts pod replicas based on demand. Amazon Elastic Block Store (Amazon EBS) provides persistent storage with optimized throughput and IOPS, reducing Windows node startup time during scale-up events by over 20%.

Figure 1: Architecture comparison showing the legacy Amazon EC2 deployment at the top and the new Amazon EKS deployment at the bottom.

Implementation journey

The migration followed a five-phase approach that validated each component before moving to production.

Phase 1: Setup

Tipalti created a Docker image using Windows Server 2019 Core as the base, then configured an Amazon EKS cluster with Windows nodes and the Amazon VPC CNI for pod networking.

Phase 2: Graceful shutdown implementation

The first major technical challenge occurred during testing:

When Tipalti restarted pods, SIGTERM signals, the standard operating system signal for graceful process termination, failed to propagate from the Windows node to the container. While the application process inside the container terminated, the pod itself remained stuck in a “Terminating” state for extended periods.

Working with AWS Support, Tipalti identified that the containerd version on the Windows nodes did not yet support SIGTERM propagation to Windows containers, a feature that was still maturing in the containerd project at the time. As an interim solution, Tipalti used Kubernetes lifecycle hooks with defined graceful termination periods to manage the shutdown process. Once an updated EKS-optimized Windows AMI with a compatible containerd version became available, Tipalti implemented graceful shutdown logic directly in the application code.
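The post does not show Tipalti's shutdown code, and the real service is a .NET Framework application. Purely as an illustration of the pattern, the following Python sketch sets a stop flag when a termination signal arrives, lets the worker finish the item it has already started, and leaves unstarted items on the queue for another pod. The names `run_worker`, `handle_sigterm`, and the sample payment IDs are hypothetical.

```python
import signal
import threading
from collections import deque

# Set when the platform asks the process to terminate.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the termination request; the worker loop checks the flag."""
    stop_requested.set()


def run_worker(pending, process):
    """Process items until the queue is empty or a stop is requested.

    The stop flag is only checked between items, so an item that has
    started is always finished. Returns the list of completed items;
    anything still in `pending` was never started.
    """
    completed = []
    while pending and not stop_requested.is_set():
        item = pending.popleft()
        process(item)
        completed.append(item)
    return completed


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    queue = deque(["payment-1", "payment-2", "payment-3"])

    def process(item):
        # Hypothetical handler: simulate a stop request arriving mid-run.
        if item == "payment-2":
            stop_requested.set()

    done = run_worker(queue, process)
    print(done, list(queue))
```

In this sketch the second item completes even though the stop request arrives while it is being processed, and the third item stays queued. The pod's termination grace period would need to be longer than the slowest single item for this to hold in practice.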

Phase 3: Logging transformation

The migration to Windows containers exposed a critical challenge: Tipalti’s complex logging infrastructure.

The legacy application was hardcoded to write logs to local files, a pattern that worked on Amazon EC2 but became problematic in the ephemeral world of containers. On Amazon EKS, logs must stream to standard output so that the Coralogix DaemonSet can capture and centralize them.

Tipalti’s team took an iterative approach to solving this. They first implemented Microsoft’s LogMonitor, a wrapper tool that intercepts file and event log writes and redirects them to standard output. This quick win validated the approach and kept the migration from stalling, but the additional process layer introduced performance overhead and complexity.

Rather than accept this compromise, the team refactored the application’s logging configuration to write directly to standard output, eliminating the middleware entirely. This seemingly small change delivered additional benefits: improved application stability, reduced resource consumption, and simplified troubleshooting.
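The source does not name the logging framework or show the configuration change. As a rough analogue in Python (not the .NET configuration Tipalti used), the idea of replacing a file target with a standard-output target can be sketched as follows. The function `build_stdout_logger`, the JSON field names, and the logger name are assumptions made for illustration.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_stdout_logger(name, stream=None):
    """Return a logger that writes only to a stream (stdout by default).

    No file handler is attached, so nothing is written to local disk and
    a node-level log collector can read the container's output directly.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False   # avoid duplicate lines via the root logger
    logger.handlers.clear()    # drop any previously attached handlers
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_stdout_logger("payments")
    log.info("processed batch %s", "example-batch")
```

Emitting one structured line per event is a design choice in this sketch, not something the source states. It tends to make centralized parsing simpler than multi-line text.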

Phase 4: Event-driven autoscaling

With containerization and logging stabilized, Tipalti turned to autoscaling. They deployed Kubernetes Event-Driven Autoscaling (KEDA) to monitor self-hosted RabbitMQ queue depths and automatically adjust pod replicas based on workload demand—scaling from baseline capacity during quiet periods to peak capacity during payment processing surges.
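To make the queue-depth scaling rule concrete, here is a simplified Python sketch of the replica calculation that a queue-length target implies: divide the backlog by a per-pod target, round up, and clamp to the configured bounds. It ignores the tolerance, cooldown, and stabilization behavior of the real autoscaler. The per-pod target of 50 messages and the bounds of 10 and 110 replicas are assumptions for illustration; the source only states that the application scales from 10 pods to over 100.

```python
import math


def desired_replicas(queue_depth, target_per_pod, min_replicas, max_replicas):
    """Estimate the replica count for a given queue backlog.

    queue_depth    -- messages currently waiting in the queue
    target_per_pod -- backlog each pod is expected to handle (assumed value)
    min_replicas   -- baseline capacity kept during quiet periods
    max_replicas   -- ceiling applied during surges
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(queue_depth / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 2600, 10000):
        print(depth, desired_replicas(depth, 50, 10, 110))
```

With these assumed values, an empty queue stays at the 10-pod baseline, a backlog of 2,600 messages asks for 52 pods, and a backlog of 10,000 is capped at 110.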

The team also implemented observability instrumentation to collect detailed application telemetry. This observability layer provided visibility into individual service components and their performance characteristics—eliminating the “black box” debugging challenges that plagued the monolithic EC2 deployment. Engineers could now pinpoint bottlenecks in specific application components rather than troubleshooting an opaque monolith.

Phase 5: Performance optimization

Moving to Windows containers introduced performance challenges. New nodes took 7 minutes to join the cluster, too slow for payment processing surges.

The team discovered a disk I/O bottleneck during booting and another while pulling Tipalti’s 4.7 GB images that initially took 4 minutes.

Doubling Amazon EBS throughput from 125 MB/s to 250 MB/s and IOPS from 3,000 to 6,000, combined with AWS-optimized AMIs that include pre-cached base layers, reduced pull time. The team pushed this further by deploying an internal container image registry.

Combined, these optimizations reduced total scale-up time from 11 minutes to under 7 minutes, a 36% improvement that enables responsive autoscaling during peak periods.
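The 36% figure follows from the two scale-up times quoted above. A short Python check, taking 7 minutes as the upper bound of the new scale-up time (the helper name `percent_reduction` is ours):

```python
def percent_reduction(before, after):
    """Return the reduction from `before` to `after` as a percentage."""
    return (before - after) / before * 100


if __name__ == "__main__":
    # 11 minutes before, at most 7 minutes after (figures from the text).
    print(round(percent_reduction(11, 7), 1))  # prints 36.4
```

Because the new time is "under 7 minutes", 36% is a lower bound on the improvement.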

Troubleshooting production challenges

Running 106 pods across 23 Windows nodes in production revealed two complex networking issues that required deep troubleshooting:

The “zombie pod” regression

Shortly after launching, pods began getting stuck in the Terminating state during deployments, effectively becoming “zombies.” Working with AWS Support, the team identified a race condition in the Windows Host Networking Service (HNS). When a new pod was created at the exact moment another was terminating (common during rolling updates), the HNS registry would corrupt, leaving the old pod with a valid networking endpoint but no running process.

As a temporary mitigation, the team implemented a 3-hour time-to-live (TTL) on Windows nodes to force periodic rotation. AWS subsequently released an updated Windows AMI that resolved the race condition, allowing Tipalti to remove the periodic rotation.

The DNS resolution issue

When pod density exceeded 20 pods per node, containers began crashing with DNS resolution errors. AWS enforces a hard limit of 1,024 packets per second per elastic network interface (ENI) for link-local traffic, and this limit was affecting Tipalti's DNS queries. Packet monitoring showed DNS queries arriving at the host but being dropped by the virtual switch before reaching the containers.

The culprit was UDP Checksum Offload. The physical Network Interface Card (NIC) calculated the checksum, but the virtual switch misinterpreted it as invalid and dropped the packets. The team disabled UDP Checksum Offload on the Elastic Network Adapter (ENA) using a PowerShell script in the node’s user data:

# Identify the Amazon Elastic Network Adapter (ENA) by its interface description
$adapter = Get-NetAdapter | Where-Object { $_.InterfaceDescription -like "*Amazon Elastic*" }

# Disable UDP checksum offload for IPv4 on that adapter to prevent virtual switch drops
Disable-NetAdapterChecksumOffload -Name $adapter.Name -UdpIPv4

This instantly resolved the packet drops, and the DNS errors disappeared.

Results

The migration to Windows containers on Amazon EKS delivered measurable improvements across all three problem areas Tipalti faced.

Cost and performance

  · 60% cost reduction through automated scaling that matches capacity to demand.
  · 50% performance improvement compared to Amazon EC2 instances.
  · Zero data loss during deployments through graceful shutdown implementation.
  · Multiple deployments per day, up from weekly deployments.

Automated scaling

Event-driven autoscaling eliminated manual capacity management. During month-end processing periods, the application now automatically scales from 10 pods to over 100 based on RabbitMQ queue depth, in minutes rather than hours. The operations team no longer monitors capacity or manually provisions instances.

Enhanced observability

Centralized logging and metrics collection transformed the debugging experience. Engineers can immediately identify which pod and child processes are affected when issues occur. Log correlation across the distributed system is automatic, reducing the mean time to resolution from hours to minutes. Custom metrics now expose the health of individual child processes, providing visibility that was impossible with the monolithic Amazon EC2 deployment.

Conclusion

Legacy Microsoft .NET Framework monoliths running on EC2 instances face fundamental challenges with scaling, reliability, and observability. While a complete rewrite to cloud-native microservices is ideal, the time and cost required make this approach impractical for some teams. Using Windows containers with Amazon container services provides a pragmatic middle path that delivers immediate operational benefits without requiring application rewrites.

Tipalti’s migration shows the viability of this approach. By containerizing a Microsoft .NET Framework 4.7 monolith and deploying it to Amazon EKS with Windows Server 2019 nodes, Tipalti achieved automated scaling, graceful shutdown handling, comprehensive observability, and a 50% performance improvement while avoiding a complex refactoring exercise. The migration required addressing Windows-specific challenges around networking and signal handling, but the operational benefits outweighed the effort.

Ready to modernize your legacy .NET applications? Start by reviewing the Amazon EKS documentation to understand Windows container support and best practices. Explore the AWS Windows and .NET on AWS page for migration strategies and architectural guidance specific to Microsoft workloads.

Stay informed: Follow the AWS Containers Blog for the latest updates on Amazon EKS capabilities and the AWS Windows and .NET Developer Blog for .NET-specific modernization patterns and best practices.

Maya Morav Freiman

Maya Morav Freiman is a Technical Account Manager at AWS helping customers maximize value from AWS services and achieve their operational and business objectives. She is part of the AWS Serverless community and has 10 years of experience as a DevOps engineer.

Dror Helper

Dror is a Sr. Microsoft Specialist Solutions Architect at AWS. He has been developing and architecting software for the last 20 years and, when not writing code, has worked with software companies on modernizing their existing legacy code while improving the way they develop software. In his current role, Dror helps organizations migrate and modernize their existing applications.

Danny Teller

Danny is Tipalti's DevOps Architect, an OSS enthusiast, an AWS Community Builder, and a certified Golden Jacket.