IBM & Red Hat on AWS

Scaling telco automation to millions of devices with Managed Red Hat Ansible Automation Platform (AAP) on AWS and Red Hat OpenShift Service on AWS (ROSA)

Managing a nationwide telecommunications network means handling millions of customer endpoints. Manual management and legacy scripting cannot keep up with the demand for regular updates, monitoring, and troubleshooting.

In this blog post, we explore how Managed Red Hat Ansible Automation Platform (AAP) on AWS, with its execution nodes running on Red Hat OpenShift Service on AWS (ROSA), provides an elastic, highly scalable automation solution for managing millions of home router broadband gateways, including modem-router combos and wireless routers.

The Challenge: Visualizing the Customer Service Architecture

To grasp the scope of the challenge, we must first examine the network edge. The customer service infrastructure originates at the Service Provider Data Center. Here, the Optical Line Terminal (OLT) aggregates the fiber connections from millions of individual “home router broadband gateways.” These gateways are a combination of Optical Network Terminal (ONT) and external home routers and/or an all-in-one hub device (which serves as both an ONT and a router, often with integrated wireless backup). This vast network of connections ultimately reaches end-user devices via WiFi and Ethernet. Managing the lifecycle, data collection, monitoring, and troubleshooting for millions of deployed home router broadband gateways is a constant and demanding requirement.

High-Level Automation Core Architecture

The Core Technologies

This architecture is built on two foundational components:

  • Ansible Automation Platform (AAP) Service on AWS: Available as a managed service via the AWS Marketplace, AAP serves as the central automation engine. It has a distributed architecture that strictly separates the control plane from the execution plane. This separation allows for independent scaling of automation capacity, improved security, and faster, more reliable execution across geographically distributed environments. Deploying the AAP managed service through the AWS Marketplace also offers unified billing, allowing organizations to pay through their AWS accounts and apply existing committed cloud spend.
  • Red Hat OpenShift Service on AWS (ROSA): ROSA is a fully managed, turnkey application platform that provides the recommended, expedited route to a production-ready environment. Jointly engineered and supported by Red Hat and AWS, ROSA manages the underlying OpenShift/Kubernetes “plumbing” so your teams can focus on writing Ansible playbooks and innovating, rather than patching infrastructure.

To manage this vast fleet of home router broadband gateways, the architecture relies on Ansible Automation Platform Service on AWS for automation orchestration and on Red Hat OpenShift Service on AWS (ROSA) for scalable, cost-efficient infrastructure that hosts Ansible execution on demand.

At a high level, AAP integrates dynamically with third-party systems, pulling inventory data from external sources (cloud providers, CMDB, etc.) and syncing its automation playbooks from a version control repository such as GitHub.
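As a rough illustration of inventory sourcing, the sketch below turns a handful of CMDB-style records into the JSON layout that Ansible inventory scripts emit (groups with a `hosts` list, plus per-host variables under `_meta.hostvars`). The record fields, serial numbers, group names, and the `gateway_model` variable are all hypothetical; a real deployment would more likely use an existing inventory plugin for its source of truth.

```python
import json

# Hypothetical CMDB export; field names and values are illustrative only.
CMDB_RECORDS = [
    {"serial": "GW-0001", "mgmt_ip": "2001:db8::1", "region": "us-east-1", "model": "all-in-one-hub"},
    {"serial": "GW-0002", "mgmt_ip": "198.51.100.7", "region": "us-west-2", "model": "ont-plus-router"},
    {"serial": "GW-0003", "mgmt_ip": "2001:db8::3", "region": "us-east-1", "model": "ont-plus-router"},
]


def build_inventory(records):
    """Group gateways by region in the JSON layout used by Ansible inventory scripts."""
    inventory = {"_meta": {"hostvars": {}}}
    for rec in records:
        # Assumed naming scheme: one group per AWS Region that serves the gateway.
        group = "region_" + rec["region"].replace("-", "_")
        inventory.setdefault(group, {"hosts": []})["hosts"].append(rec["serial"])
        inventory["_meta"]["hostvars"][rec["serial"]] = {
            "ansible_host": rec["mgmt_ip"],   # IPv4 or IPv6 management address
            "gateway_model": rec["model"],    # hypothetical custom variable
        }
    return inventory


if __name__ == "__main__":
    print(json.dumps(build_inventory(CMDB_RECORDS), indent=2))
```

Grouping by region is only one possible choice; it makes it easy to target a job at the gateways served by a particular ROSA cluster.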

When Ansible jobs are triggered, AAP sends the Ansible workload to execution node container groups in the OpenShift clusters. These pods run the Ansible playbooks, interacting with home router broadband gateways over IPv4 and/or IPv6 to perform configuration changes, data collection, and management tasks.

To ensure full visibility and enable analytics (for example, training your operations LLM), all execution data, logs, and metrics generated by AAP are continuously ingested into a distributed OLAP (Online Analytical Processing) database designed for real-time analytics, and into log aggregation systems such as Grafana and Grafana Loki.
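To make the ingestion step a little more concrete, here is a minimal sketch that wraps a single job event in the JSON body shape accepted by Loki's HTTP push endpoint (a `streams` list whose entries carry a label set and `[timestamp_ns, line]` pairs). The function name, label names, and event fields are assumptions for illustration; the post does not describe how events are actually shipped, and a log forwarder would be the more typical route.

```python
import json
import time


def to_loki_payload(job_id, host, status, message, ts_ns=None):
    """Wrap one job event in the body shape used by Loki's push endpoint."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # The log line itself carries the high-cardinality details (job and host).
    line = json.dumps({"job_id": job_id, "host": host, "msg": message})
    return {
        "streams": [
            {
                # Hypothetical labels, kept few and low-cardinality.
                "stream": {"source": "aap", "status": status},
                # Loki expects the timestamp as a string of nanoseconds.
                "values": [[str(ts_ns), line]],
            }
        ]
    }


if __name__ == "__main__":
    payload = to_loki_payload(42, "GW-0001", "successful", "firmware upgrade complete")
    print(json.dumps(payload, indent=2))
```

Keeping the gateway identifier in the log line rather than in the labels is a deliberate choice here: with millions of devices, a per-host label would create millions of streams.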

Deep Dive: Scalable, Resilient, Multi-Region Architecture on ROSA

Scaling an automation platform to hit millions of endpoints requires robust, distributed infrastructure. By utilizing Red Hat OpenShift Service on AWS (ROSA), organizations get a fully managed environment that handles complex scale.

In this multi-region architecture, ROSA is deployed in each AWS Region (e.g., us-east-1, us-west-1, and us-west-2). Within each Region, the infrastructure is distributed across three Availability Zones (AZ-A, AZ-B, and AZ-C) and is designed to provide high availability for the AAP execution pods running on OpenShift.

The key to this architecture’s efficiency is the separation of automation control and execution.

The Magic: Scaling Execution Environments During Peak Bursts

The true value of running AAP on ROSA becomes apparent during massive “automation bursts,” such as a scheduled firmware upgrade for thousands of home router broadband gateways. AAP uses a split architecture: Automation Controller pods handle incoming API/UI requests, while Execution Node pods (known on Kubernetes as “container groups”) handle the heavy lifting of running the actual Ansible playbooks.

When a massive scheduled job hits, the Horizontal Pod Autoscaler (HPA) detects the CPU and memory spikes, and AAP begins spinning up more execution pods. But what happens if the underlying OpenShift worker nodes (Amazon EC2 instances) run out of capacity?

This is where native scaling mechanisms in ROSA step in to save the day:

  • 09:00 AM: The cluster is operating normally under a quiet load, running just 3 worker nodes.
  • 09:01 AM: A scheduled “Firmware Tuesday” job initiates for 1,000 home router broadband gateways.
  • 09:02 AM: To handle the concurrency (based on configured “fork” settings), AAP requests 200 execution pods.
  • 09:05 AM: Because the existing worker nodes are full, the new pods enter a “Pending” state. The ROSA Cluster Autoscaler detects these pending pods and automatically provisions the additional EC2-backed OpenShift worker nodes in the background. The pods are scheduled as soon as the nodes are ready.
  • 09:30 AM: The firmware upgrades complete, and the 200 container pods are destroyed.
  • 09:45 AM: The Cluster Autoscaler observes that the newly created OpenShift worker nodes are now idle and automatically terminates them, reducing costs on your AWS bill.
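The arithmetic behind this timeline can be sketched in a few lines. The figures taken from the timeline are the 1,000 gateways, the 200 execution pods, and the 3 existing worker nodes; everything else (a 500m CPU request per pod, 3,500m of allocatable CPU per new worker, 1,000m free on each existing worker) is an assumed value chosen purely for illustration, and the function names are hypothetical.

```python
import math


def hosts_per_pod(total_hosts, pods):
    """How many gateways each execution pod handles if work is split evenly."""
    return math.ceil(total_hosts / pods)


def plan_burst(requested_pods, pod_cpu_m, node_allocatable_m,
               existing_nodes, free_cpu_m_per_existing_node):
    """Estimate how many pods go Pending and how many new workers are needed.

    All CPU values are in millicores (1000m = 1 vCPU) to keep the math integral.
    """
    per_new_node = node_allocatable_m // pod_cpu_m
    if per_new_node == 0:
        raise ValueError("a single pod does not fit on one worker node")
    # Pods that fit on the workers that already exist.
    fit_now = existing_nodes * (free_cpu_m_per_existing_node // pod_cpu_m)
    scheduled_now = min(requested_pods, fit_now)
    pending = requested_pods - scheduled_now
    # Each new worker absorbs per_new_node of the Pending pods.
    new_nodes = math.ceil(pending / per_new_node)
    return {"scheduled_now": scheduled_now, "pending": pending, "new_nodes": new_nodes}


if __name__ == "__main__":
    print(hosts_per_pod(1000, 200))
    print(plan_burst(requested_pods=200, pod_cpu_m=500, node_allocatable_m=3500,
                     existing_nodes=3, free_cpu_m_per_existing_node=1000))
```

Under these assumed numbers, each pod handles 5 gateways, 6 pods are scheduled immediately, 194 go Pending, and the autoscaler would need to add 28 workers. Memory requests, which the sketch ignores, can be the tighter constraint in practice.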

Scaling is limited only by two “fences” that you set:

Kubernetes Resource Quotas: You can set a Hard Limit on the namespace where AAP runs (e.g., “This team or AAP cannot use more than 200 CPUs total”). If you hit this, AAP will wait for pods to finish before starting new ones.

AWS Service Quotas: Even if ROSA wants to scale, it is bound by your AWS account limits (e.g., the On-Demand quota that caps how many m5.xlarge instances can run in a region).
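A short sketch shows how the two fences combine into a single ceiling on concurrent execution pods. The 200-CPU namespace quota comes from the example above, and an m5.xlarge has 4 vCPUs; the 64-vCPU account quota, the 500m pod request, and the 3,500m of allocatable CPU per worker are assumptions, as are the function names. It also assumes the account limit is expressed in vCPUs, which is how On-Demand instance quotas are generally stated.

```python
import math


def pods_allowed_by_namespace_quota(quota_cpu_m, pod_cpu_m):
    """Fence 1: pods that fit under the namespace CPU quota (millicores)."""
    return quota_cpu_m // pod_cpu_m


def pods_allowed_by_aws_quota(vcpu_quota, vcpus_per_instance,
                              node_allocatable_m, pod_cpu_m):
    """Fence 2: pods that fit on the most workers the account quota permits."""
    max_instances = vcpu_quota // vcpus_per_instance
    return max_instances * (node_allocatable_m // pod_cpu_m)


def effective_ceiling(quota_cpu_m, pod_cpu_m, vcpu_quota,
                      vcpus_per_instance, node_allocatable_m):
    """The tighter of the two fences decides how many pods can run at once."""
    return min(
        pods_allowed_by_namespace_quota(quota_cpu_m, pod_cpu_m),
        pods_allowed_by_aws_quota(vcpu_quota, vcpus_per_instance,
                                  node_allocatable_m, pod_cpu_m),
    )


def waves(requested_pods, ceiling):
    """Number of sequential batches needed when the ceiling is below the request."""
    return math.ceil(requested_pods / ceiling)


if __name__ == "__main__":
    ceiling = effective_ceiling(quota_cpu_m=200_000, pod_cpu_m=500, vcpu_quota=64,
                                vcpus_per_instance=4, node_allocatable_m=3500)
    print(ceiling, waves(200, ceiling))
```

With these assumed values, the namespace quota would allow 400 pods but the account quota allows only 16 workers, or 112 pods, so a 200-pod burst would complete in two waves rather than one.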

Conclusion: Technology Value, Performance and Cost Efficiency

The combination of Managed Red Hat Ansible Automation Platform (AAP) on AWS and Red Hat OpenShift Service on AWS (ROSA) delivers technology value for large-scale operations. From a technological standpoint, the split control-and-execution architecture is designed to handle large-scale automation workloads, while joint support from Red Hat and AWS removes the operational burden of patching infrastructure.

Most importantly, this architecture drives cost efficiency while delivering the appropriate performance. By using the Cluster Autoscaler in ROSA alongside AAP’s Horizontal Pod Autoscaling, you pay only for the compute capacity you actually use. The system dynamically scales up EC2-backed OpenShift worker nodes to handle massive automation bursts and automatically terminates them shortly after the jobs finish, keeping your AWS bill aligned with actual usage.

Ryan Niksch

Ryan Niksch is a Partner Solutions Architect focusing on application platforms, hybrid application solutions, and modernization. Ryan has worn many hats in his life and has a passion for tinkering and a desire to leave everything he touches a little better than when he found it.

Chad Ferman

Chad is a Senior Principal Product Manager, Ansible, where he brings over 20 years of experience building enterprise automation architecture for anything in IT from Infrastructure to application delivery and lifecycle. Chad previously worked for ExxonMobil, AAFES and Tandy/Radioshack where he helped internal customers architect, deploy, manage and automate their applications, business processes and infrastructure. He resides in The Woodlands, TX (for now) with his wife and cats.

Mayur Shetty

Mayur Shetty is a Senior Solution Architect within Red Hat’s Global Partners and Alliances organization. He has been with Red Hat for four years, where he was also part of the OpenStack Tiger Team. He previously worked as a Senior Solutions Architect at Seagate Technology driving solutions with OpenStack Swift, Ceph, and other Object Storage software. Mayur also led ISV Engineering at IBM creating solutions around Oracle database, and IBM Systems and Storage. He has been in the industry for almost 20 years, and has worked on Sun Cluster software, and the ISV engineering teams at Sun Microsystems.