Containers

How Upstox built Next-Generation trading platform using Amazon EKS, Karpenter, and Spot Instances

This is a guest post by Pranav Kapoor, Head of DevOps at Upstox co-authored with Jayesh Vartak, Solutions Architect at AWS and Jitendra Shihani, Technical Account Manager (TAM) at AWS.

Upstox is India’s largest investech, a multi-unicorn valued at $3.5 billion. It allows you to buy and sell stocks, mutual funds, and derivatives, and is loved and trusted by over 12 million customers. It is backed by Mr. Ratan Tata and Tiger Global and is the official partner for the Tata IPL (Indian Premier League).

Upstox experienced 10x growth during the pandemic, with the number of users increasing from 1 million to over 10 million in 2022. To sustain this exponential growth and prepare for future expansion, Upstox set high standards for running its trading platform. These standards include availability, scalability, security, operational efficiency, and cost optimization.

  • Availability: Enhancing our Service Level Agreement (SLA) from 99.9% to a minimum of 99.99%, ensuring high reliability and minimal downtime.
  • Scalability: Minimizing scaling delays and maintaining exceptional performance during peak traffic periods, including market openings, budget announcements, and significant market news.
  • Security: Implementing additional security measures to strengthen our security posture. Our focus extends to enhancing data privacy and the secure handling of Personally Identifiable Information (PII). Streamlining our auditing and compliance processes to ensure data is stored and processed securely.
  • Operational efficiency: Adopting Infrastructure-as-a-Code (IaaC) to automate the creation of new environments and infrastructure. Additionally, we plan to integrate chaos engineering into our release lifecycle to improve the system resilience.
  • Cost Optimization: Reducing both infrastructure and operational costs without compromising on availability, security, or scalability.

In order to meet these targets, Upstox embarked on the journey of building a Next-Generation platform called “Greenfield”. Upstox chose Amazon Elastic Kubernetes Service (Amazon EKS) as a core compute platform for Greenfield to leverage the benefits of containers, whereas the earlier platform was on Amazon Elastic Compute Cloud (Amazon EC2). In this post, we share the tenets followed to build the Greenfield, its core differentiators, and the outcomes.

Upstox Greenfield philosophy

In order to build the future-proof architecture that can handle further exponential growth in traffic and incorporating flexibility for evolving over time, we used the following tenets:

  • Security: Security is “job zero” at AWS and Upstox. It is the top priority in every aspect of the platform, following the least access privileges, mandatory encryption for data at-rest and in-transit, no SSH keys sharing, mandatory AWS Identity and Access Management (IAM) role based access (no IAM access key and secret access key sharing).
  • Customer Experience: The architecture is focused on availability, performance, scalability, and resiliency to deliver the best customer experience. We set the simple principle that any server belonging to any service can be terminated at any point-in-time, and it should not have an impact on the customer experience.
  • Smart Defaults: Simplicity is key to the long-term success of the platform. Today Kubernetes offers a wide range of capabilities, but if we use all these capabilities then we can quickly make the platform complex. Therefore, we decided to keep the platform simple yet powerful by selecting only a few of the Kubernetes capabilities. Additionally, to keep the platform simple, we followed the principle of building services with smart defaults. The idea is that all the services come with default settings so that novice users or new users can easily run their applications without getting into the complexities of the platform. At the same time, it allows advanced users to customize the defaults as per their respective use cases. For example, the default for the minimum number of pods is two, and the zone topology constraints spread the pods across multiple Availability Zones (AZs) so that by default the application is highly available. To avoid single point of failure, the application teams can’t override the minimum number of pods to less than two and can only override to more than two.

Keeping the platform simple along with smart defaults has enabled all teams, such as development, QA, InfoSec, and Operations, to focus on understanding and leveraging a few yet powerful capabilities and be proficient in them.

  • NoOps: The goal is to eliminate or minimize the manual activities as much as possible. This involves having continuous integration/continuous deployment (CI/CD) pipeline for not only applications but also infrastructure (such as IaaC). Also, the test first approach along with chaos engineering is vital to achieve NoOps. The objective is to empower the team by providing self-service, automated platform, and keeping the human activities to a minimum. For example, if a team wants to deploy the application and InfoSec needs to approve it, then they should keep the human touch-point to only the InfoSec team and eliminate the touch-point with the DevSecOps team. To keep the focus on automating everything, the DevSecOps team is comprised of mostly developers, so that they are spending most of the time automating the platform as opposed to day-to-day operations.
  • Cost optimized: We wanted to build the cost-optimized platform. Along with the core cost optimization levers, such as right sizing, auto-scaling, AMD powered instances, AWS Graviton, Amazon EC2 Spot Instances, savings plan, and reserved instances, the platform also focuses on leveraging services features (such as S3 intelligent tier) and re-architecture of the applications for cost optimization.

Approach

It took about a year to build the Next-Generation trading platform. In the planning phase, we evaluated the applications portfolio for various dimensions, such as criticality of the application, dependencies of the application, complexity of the application, containerization effort, testing effort, current cost, and roadmap. Based on this evaluation, we categorized the applications into five buckets:

  1. Two representative applications for identified capabilities
  2. Two mission-critical applications
  3. Top cost contributing applications
  4. Remaining to-be containerized applications
  5. Applications to-be retained (not to-be containerized)

Then, we migrated the applications in phases as follows,

  • Phase 1 – Two representative applications for identified capabilities: In the first phase, we selected a few representative applications that were good candidates for migration to Amazon EKS. This migration was successfully completed, allowing us to validate critical features, such as gRPC, websockets, and load balancing. It helped the team learn Docker, Kubernetes, and Amazon EKS. The team understood the nuances and became confident in migrating and running containers.
  • Phase 2 – Two mission-critical applications: In the second phase, we focused on the top two mission-critical applications. These applications provide the exchange’s data feed to end-users in real-time and also power the charts and graphs. These applications are critical for end-users to make trading decisions. The team resolved all the challenges in migrating and running these mission-critical applications and learned the specifics of running mission-critical applications on Amazon EKS. This experience streamlined the later phases, as the majority of the use cases for forthcoming applications had already been addressed through the initial two representative and two mission-critical applications.
  • Phase 3 – Top cost contributing applications: In the third phase, we picked up the top cost contributing applications, such as complex applications. There were multiple benefits. Apart from realizing the cost savings early, we increased the agility, resiliency, scalability, and performance of these applications.
  • Phase 4 – Remaining to-be containerized applications: In the fourth phase, we migrated the rest of the applications to Amazon EKS. With the learnings and insights we gained in the earlier phases, we were able to expedite the migration process significantly.

We evaluated the applications for the fifth phase and concluded that containerization did not present the cost benefit ratio due to various factors, such as the efforts involved as opposed to the advantages and future strategic plans for these applications. Therefore, we retained these applications as is.

Greenfield differentiation

Scaling based on Traffic Pattern: Upstox did an in-depth analysis of daily traffic pattens. On a typical day, there’s an exponential spike in traffic during the market’s opening hour, approximately from 9:00 AM to 9:30 AM. Then, there is another surge in the traffic before market closing hour. The traffic varies throughout the trading hours. Whereas, traffic is very less during non-trading hours. This traffic pattern is shown in the following diagram.

Scaling based on Traffic Pattern

Scaling based on Traffic Pattern

However, during a non-typical day, traffic varies based on market news, events such as budget day announcements. Considering these aspects, Upstox Implemented scaling based on multiple dimensions like time-based, request-based, and Memory/CPU based scaling. To further reduce/eliminate the scaling lag, we leveraged the technique “Eliminate Kubernetes node scaling lag with pod priority and over-provisioning”.

 Application Load Balancer pre-warming:

Since there is a sudden spike in the traffic in a short span of time during market opening hours, the Application Load Balancer (ALB) also needs to scale accordingly. The ALB can scale to handle up to double the traffic in the next five minutes. However, the increase in the traffic during market opening hours is much higher than ALB can handle. Therefore, to address this challenge, Upstox has been using ALB pre-warming. With pre-warming, the higher capacity (LSU) is pre-provisioned to handle the sudden spike in the traffic.

Karpenter:

Upstox implemented Kubernetes Karpenter autoscaler instead of cluster autoscaler to scale the Amazon EKS worker nodes in-line with the traffic patterns. We used the following key features of Karpenter:

  • Instance type selection:We used a broad set of EC2 instance types to mitigate the risk of instance unavailability and to optimize the cost.
  • Node-recycling: The Amazon EKS cluster’s worker nodes use Amazon EKS optimized Amazon Linux AMIs to run containers securely and performantly. To make sure that the worker nodes are always up-to-date with the latest security patches and fixes, we adopted a routine cycle of replacing them with the most recent Amazon EKS-optimized Amazon Machine Images (AMIs). Therefore, by replacing the worker node instead of doing in-place OS updates of the existing worker node, we aligned with the best practice of treating the infrastructure as immutable. We used the following NodePool parameter to continuously replace the node with the latest Amazon EKS optimized AMI.
spec.disruption.expireAfter
  • Consolidation: We further optimized the cost by scaling-in the worker nodes in-line with the load. We used the following NodePool parameter to downsize the over-provisioned nodes and to reduce the number of nodes, as the load is decreased and pods are removed.
spec.disruption.consolidationPolicy

Security:

The Greenfield platform is highly secured with a focus on data security, data privacy, access control, and compliance. Upstox leveraged AWS encryption SDK to manage PII data.

Upstox leveraged AWS Systems Manager extensively as follows:

  • Systems Manager – Session Manager: We managed the SSH access using Systems Manager – Session Manager and followed the least access principle using IAM policies and roles. Therefore, we eliminated the sharing of SSH keys.
  • Systems Manager – Patch Manager: We also used the Systems Manager – Patch Manager to periodically apply the patches to the EC2 instances outside of Amazon EKS clusters.
  • Systems Manager – Compliance: We used the Systems Manager – Compliance to generate the audit reports such as patch compliance data.

Additional Cost Optimizations:

To further optimize the cost, we followed these advanced cost optimization techniques:

  1. Right Sizing: As we containerized the applications, based on the load and performance testing, we chose the right size and type of the EC2 instances to get the best performance at the lowest cost.
  2. AMD powered instances for applications: AMD-powered EC2 instances give customers the ability to run general purpose, memory intensive, burstable, compute intensive, and graphics intensive workloads, all at a significant price advantage relative to comparable offerings. By adopting AMD powered Amazon EKS EC2 instances, Upstox achieved an impressive cost reduction of approximately 40-45% in the AWS Mumbai Region.
  3. Spot instances: EC2 Spot instances provide up to 90% cost savings as compared to on-demand instances. Therefore, Upstox migrated Amazon EKS Dev and Staging environments to Spot instances with fallback to on-demand instances if Spot instances were unavailable. This resulted in an overall 70% cost savings for dev and staging environments.
  4. Graviton instances for managed services: AWS Graviton processor offers upto 40% cost performance benefits and most of the managed services can be updated to use Graviton processor. Therefore, we leveraged graviton processor for managed services such as Amazon ElastiCache and Amazon Aurora.
  5. Amazon EBS gp3: Amazon EBS gp3 volumes offer up to 20% lower cost than previous generation gp2 volumes and also allow scaling of the performance independent of storage capacity. Therefore, we migrated all the EBS volumes from gp2 to gp3 to lower the cost by approximately by 18% and to get better performance.
  6. Amazon S3: With exponential growth in the business, the data and cost also increased significantly. Upstox leveraged S3 storage lens to optimize the Amazon Simple Storage Service (Amazon S3) cost as outlined in the case study “Upstox Saves $1 Million Annually Using Amazon S3 Storage Lens

Architectural Roadmap

As part of the architecture evolution, Upstox is considering the following roadmap:

Multi-architecture CPU: Upstox is considering a multi-architecture CPU, x86 and ARM, for all of their workloads by leveraging Graviton instances along with AMD instances. This diversification makes sure of a wider selection of Graviton instances, mitigating the risk of instance shortages.

Auto-scaling for databases: Currently databases are provisioned for peak capacity and there is no auto-scaling. Upstox is planning to implement auto-scaling for databases read-replicas so that the number of read replicas are automatically adjusted based on the load and to further optimize the cost.

Combination of on-demand and Spot instances for production workload: Currently the production environment uses on-demand instances and the plan is to have a combination of on-demand and spot instances. This can further reduce the cost.

Conclusion

In this post, we showed the detailed journey of Upstox in developing a Greenfield platform, a transformative project that has significantly enhanced customer experiences, agility, security, and operational efficiency, all while reducing costs. The Platform has not only shortened the release lifecycle, enabling faster delivery of new use cases, but also managed to lower operational costs despite handling increased volumes. This was achieved without compromising on security, scalability, or performance. The Upstox initiative stands as a testament to how thoughtful innovation and strategic investment in technology can lead to substantial business benefits.