AWS Startups Blog
Revamping the Cure.Fit Cloud: A Kubernetes Story
Guest post by Romil Punetha (Engineer at Cure.Fit) and Vikramaditya M (Engineer at Cure.Fit)
Cure.fit is a mobile app that takes a holistic approach towards health and fitness by bringing together all aspects of a healthy lifestyle on a single platform. Cure.fit offers both online and offline experiences across fitness, nutrition, mental well-being and primary healthcare through its four products: cult.fit, eat.fit, mind.fit and care.fit.
With the aim of making fitness fun and easy, cult.fit gives workouts a whole new meaning with a range of trainer-led group workout classes. eat.fit makes healthy eating easy & affordable. Every eat.fit meal is designed to balance macro and micro nutrients, cooked fresh with the best ingredients and delivered to your doorstep. Addressing the core aspect of mental wellbeing, mind.fit aims to bring about a lifestyle change and focuses on reducing day to day stress and improving overall mental wellbeing through yoga, guided meditation & 1:1 therapy. care.fit is a state-of-the-art chain of medical and diagnostic centres that offers doctor consultations and clinical expertise to provide care for both common illnesses and complex issues.
Over the last 3.5 years we have grown rapidly and scaled our operations to 250+ fitness centers across 2 countries and 15+ cities, 60 eat.fit kitchens across 14 cities, and 8 care.fit centers in Bengaluru, all of which are powered by our in-house tech.
When we started building this tech, like most startups we began small: smaller teams, a smaller scale and a host of typical engineering problems. We started off with small agile teams, each setting up its own infrastructure for its applications for the quickest turnaround. Each team also had its own way of building and deploying. This approach had a couple of very obvious issues:
1. Poor resource utilization: All our applications make heavy use of caching (high memory utilization) and need much less CPU. They typically require less than 1 core of CPU and 8GB of memory. Such a configuration isn’t available, so we had to manually club services together, and re-balance them as usage patterns changed. Moreover, for high availability, we needed to do this across multiple machines, which became cumbersome.
2. Managing permissions: To start off, we allowed most developers to provision their own resources, which worked in the short term, but became hard to audit as we began to scale rapidly (both machines and developers).
Individual teams tried solving the problem of poor resource utilization by manually clubbing compatible services, but this broke isolation, and required constant rework to keep things balanced. Instead of applying short term, band-aid solutions we decided it was time to fix both problems permanently.
Kubernetes to the rescue!
As we looked around for pre-made solutions, Kubernetes caught our eye. Kubernetes (k8s) is the most popular container orchestration software available and is used as a standard across organizations big and small. We evaluated it quickly, and realized it was the perfect tool to solve both our problems.
We chose Amazon Elastic Kubernetes Service (EKS) as the platform to host our k8s cluster, since it offloaded the control plane management for us. It would also allow us to define our entire infrastructure as configuration, ensuring that on-boarding a new application was as easy as creating a basic config file (with only a few mandatory values, and some application specific overrides).
The deployment pipeline
It took us a few weeks to define what our infrastructure should look like:
- An EKS cluster: this is where we host our worker nodes that run our applications
- A Build and deploy pipeline: Jenkins for performing the builds & Spinnaker as the deployment platform
- Logging: via Fluentd
- Monitoring: through Prometheus, Grafana and New Relic
The EKS Cluster
We set up a dev, a stage, and a prod cluster. We used the dev cluster for all the initial testing and experimentation. Only once we had finalized the settings we wanted did we set up the stage and prod clusters.
Amazon provides EKS-optimized AMIs with preconfigured settings such as the maximum number of pods per node. We used these for our worker nodes, along with user data from “90 days of AWS EKS in Production”. This allowed us to reserve CPU and memory for system processes and the kubelet, thereby preventing a faulty container from taking up all the CPU.
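As a rough illustration of what reserving resources on a worker node involves, here is a minimal sketch in the kubelet's configuration-file form. The setup described above passed its settings through instance user data; the fields below are the standard kubelet equivalents, and every value is an illustrative assumption rather than the one used in production.

```yaml
# Sketch of a kubelet configuration; all values are hypothetical.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Held back for OS-level daemons (sshd, systemd, and so on).
systemReserved:
  cpu: 100m
  memory: 256Mi
# Held back for Kubernetes daemons (kubelet, container runtime).
kubeReserved:
  cpu: 100m
  memory: 256Mi
# Start evicting pods when free memory on the node drops below this.
evictionHard:
  memory.available: 200Mi
# Upper bound on the number of pods scheduled onto the node.
maxPods: 29
```

The scheduler only places pods against what remains after these reservations, so a runaway container competes with other pods rather than with the kubelet or the OS.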
The build and deploy pipeline
To build and deploy an application we used:
- Jenkins: We use a Jenkins job, configured via a Jenkinsfile, for (i) creating Docker images using a multi-stage Docker build and (ii) configuring the Kubernetes manifests for the application and pushing them to the chart repository.
- Spinnaker: For deploying the applications
The Build
For all our applications, we build images using a multi-stage Docker build to isolate builds, and use slim images as runner containers (Alpine images are smaller than slim ones, but they have an open DNS issue). This setup helped us reduce the size of the OS layer, enabled better security measures and fast, easy scaling, and lowered network and storage costs.
We established standards for Dockerfiles, Jenkinsfiles, and k8s manifests which are followed by all teams to build & deploy their applications.
A sample Dockerfile:
```
# Stage 1: build the application
FROM node:8.15-jessie-slim as BUILDER
ARG APP_NAME
ARG ENVIRONMENT
ADD . /${APP_NAME}
# Run the build script from the application directory
WORKDIR /${APP_NAME}
RUN mkdir -p /${APP_NAME}-deploy/
RUN deploy/build_k8s.sh /${APP_NAME}-deploy ${ENVIRONMENT}

# Stage 2: slim runtime image containing only the build output
FROM node:8-slim as RUNNER
ARG APP_NAME
ENV destination='/home/ubuntu/deployment'
COPY --from=BUILDER /${APP_NAME}-deploy/ ${destination}
COPY --from=BUILDER /${APP_NAME}-deploy/deploy/${APP_NAME}.supervisor.conf /etc/supervisor.conf
RUN apt-get update && apt-get install supervisor -y
RUN npm install -g typescript
RUN mkdir -p /logs/${APP_NAME}
# supervisord runs in the foreground (-n) and manages the app process
CMD ["/usr/bin/supervisord", "-n", "-c", "/etc/supervisor.conf"]
```
```
appName: demo
service:
  type: ClusterIP
  expose:
    - externalPort: 8080
      internalPort: 8080
      type: external
    - externalPort: 8081
      internalPort: 8081
      type: internal
ingress:
  exposeName: test
  annotations:
    external:
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    internal:
      kubernetes.io/ingress.class: nginx-internal
resources:
  limits:
    cpu: 1000m
    memory: 1024Mi
  requests:
    cpu: 100m
    memory: 100Mi
kind: Deployment
image:
  repository: draft
  tag: dev
probePath: /status
replicaCount: 2
```
A sample values.yaml file
For every application, we need:
a. CPU and Memory requirements
b. Load balancer configuration
c. Scaling criteria
d. High availability of the application
e. Easy deployments/rollbacks
f. Graceful Termination
To fulfill all these requirements, we use Helm as a templating tool. We created a generic template (a Helm chart) whose defaults each application overrides; the resulting chart is pushed to ChartMuseum. This allows us to keep all boilerplate configuration in one place, and update all apps as necessary.
For example, when we wanted to add a pre-stop hook to every application, all we had to do was update the generic template, rather than edit individual charts in each application’s repository. ChartMuseum is hosted on the cluster itself, with S3 serving as persistent storage.
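To make the idea of a generic template concrete, here is a minimal sketch of what a shared Deployment template with a pre-stop hook could look like. The value keys (`appName`, `replicaCount`, `image`, `probePath`, `resources`) come from the sample values.yaml above; the file name, the label scheme, the fixed port, the grace period and the sleep-based hook are all assumptions made for illustration, not the actual chart.

```yaml
# templates/deployment.yaml (hypothetical excerpt of a generic chart)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.appName }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Values.appName }}
  template:
    metadata:
      labels:
        app: {{ .Values.appName }}
    spec:
      # Assumed value; it must be longer than the preStop sleep below.
      terminationGracePeriodSeconds: 30
      containers:
        - name: {{ .Values.appName }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          readinessProbe:
            httpGet:
              path: {{ .Values.probePath }}
              port: 8080  # assumed; a real chart would read this from values
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          lifecycle:
            preStop:
              exec:
                # Assumed hook: wait so traffic drains before SIGTERM arrives.
                command: ["sh", "-c", "sleep 10"]
```

Because the hook lives in the shared template, changing it once and re-releasing the chart updates every application that uses it.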
Our helm template is organized as follows:
- Deployment configuration: The deployment defines 2 containers — the application container, and a Fluentd-CloudWatch sidecar container (explained in detail further below). Configs like update strategy, pod anti-affinity, liveness and readiness probes are defined here.
- Service: These are application specific service configurations
- Ingress: We defined 3 types of ingress manifests: external, internal and VPN. These manifests allow us to share load balancers between applications. Compared to the 80+ load balancers in our previous setup, where each load balancer was dedicated to a single application, we now need as few as 6–10 load balancers in our k8s setup across all environments.
- Service Monitor: This is used to expose business metrics to Prometheus.
- Pod Disruption Budget: This prevents the application’s instance count from falling below a certain threshold. However, it helps only against voluntary termination of instances (for example, a node drain).
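Two of the pieces above can be sketched as plain manifests. The example below shows an external Ingress and a Pod Disruption Budget for a hypothetical `demo` application. The names, namespace, host and `minAvailable` value are assumptions, and the API versions are the current stable ones rather than necessarily those in use when this setup was built.

```yaml
# Every Ingress with the same class is served by the same NGINX ingress
# controller, and therefore by that controller's single load balancer.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-external
  namespace: demo
  annotations:
    # Mirrors the sample values file; newer clusters typically set
    # spec.ingressClassName instead of this annotation.
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  rules:
    - host: demo.example.com  # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo
                port:
                  number: 8080
---
# Keep at least one pod running during voluntary disruptions such as
# node drains. This does not protect against node failures.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo
  namespace: demo
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: demo
```

An internal Ingress would look the same apart from its class (`nginx-internal` in the sample values), which routes it through a separate, internal-facing controller.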
The Deploy
Defining a stable deployment system was one of the hardest tasks. When the focus is on standardising the development and deployment pipeline, distributing kubeconfigs to each developer and having them run `kubectl apply` to deploy an application seems inconsistent. What we needed was a central system that could perform seamless deployments and rollbacks, with authentication and authorisation capabilities.
We tried the following options for our use cases:
- Jenkins-X
- Kubeapps
- Harness.io
- Spinnaker
We rejected the first 3 and went ahead with Spinnaker for the following reasons:
- Jenkins-X created all the deployment revisions in GitHub itself. This redundant step increased the deploy time from 4–5 mins to almost 10 mins.
- Kubeapps had service account token-based authentication and authorization, which added the complexity of maintaining tokens for individuals.
- Harness.io hadn’t released its community edition back then, and its pricing was tied to the number of deployment instances, i.e., pods in Kubernetes terms. Our near-future estimates showed that the costs incurred would outweigh the benefits, making it unfit for the long run.
Spinnaker for the win!
Spinnaker has rich documentation, available here, covering all the capabilities it offers, from managing cloud resources to K8s support and an integrated authentication and authorization mechanism.
Its commands section was sufficient for us to define what we needed. After configuring Spinnaker, we defined pipelines for all the applications. Each pipeline is triggered on completion of the Jenkins job, and the end result is k8s pods running in their respective namespaces.
Spinnaker pipeline
Spinnaker dashboard
Logging
In our previous setup, we had all applications writing logs to different files such as access, debug, info, error, and the CloudWatch agent was consuming those files. We moved to k8s and didn’t want to change much of that.
Approach 1: Mount the file system of the host machine onto the container so that applications can write to the host machine, and have an agent (CloudWatch agent or Fluentd) push to CloudWatch.
Approach 2: Deploy a Fluentd sidecar along with the application, with both containers mounting the same volume specified in the deployment config. This way, logs are written to temporary storage by the application container, and read by the Fluentd container.
We use the second approach in production. We use kubernetes metadata to define our log group and log stream.
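The sidecar approach can be sketched roughly as follows. This is a minimal illustration, not the production manifest: the Fluentd image is a placeholder, the mount path follows the `/logs/${APP_NAME}` directory from the sample Dockerfile, and the environment variable names used to pass pod metadata are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      volumes:
        # Scratch volume shared by both containers; it lives and dies
        # with the pod.
        - name: app-logs
          emptyDir: {}
      containers:
        - name: demo
          image: draft:dev
          volumeMounts:
            - name: app-logs
              mountPath: /logs/demo
        - name: fluentd
          image: example/fluentd-cloudwatch:latest  # placeholder image
          env:
            # Pod metadata exposed via the downward API, which a Fluentd
            # config could use to build log group and stream names.
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: app-logs
              mountPath: /logs/demo
```

The application keeps writing to files exactly as before, which is what makes this approach attractive when you don't want to change existing logging code.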
For applications like NGINX and kube-proxy, which write to stdout and stderr, we deployed a Fluentd DaemonSet that reads the stdout and stderr of the containers, as well as other system logs, and pushes them to CloudWatch. AWS has documented the steps here.
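A node-level collector of this kind usually follows the pattern below. This is a sketch under assumptions: the image and names are placeholders, the container log path assumes a Docker-based runtime, and the service account, RBAC rules and Fluentd configuration that a real deployment needs are omitted.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-cloudwatch  # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd-cloudwatch
  template:
    metadata:
      labels:
        app: fluentd-cloudwatch
    spec:
      containers:
        - name: fluentd
          image: example/fluentd-cloudwatch:latest  # placeholder image
          volumeMounts:
            # The container runtime writes each container's stdout and
            # stderr to files on the node; mounting these host paths
            # lets Fluentd tail them.
            - name: varlog
              mountPath: /var/log
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
```

A DaemonSet schedules one such pod per node, so every container's output is collected without touching the applications themselves.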
Monitoring
We’ve been using a Prometheus and Grafana setup, along with New Relic, to capture all system and application metrics. When applications are migrated to k8s, APM works out of the box because the New Relic agent is already in the codebase.
For k8s-based monitoring, there’s a plethora of dashboards providing all necessary visualization on the CPU, memory and disk metrics of the cluster. Below is an NGINX ingress controller dashboard showing request and error metrics across namespaces. We also have dedicated monitoring for Spinnaker components within the same setup.
In the end…
Onboarding a new application is now as simple as running a script, which creates the necessary resources on AWS and k8s (such as namespaces and an ECR repository) and the Spinnaker pipelines. Not only does config as code enable teams to make reliable changes to their application setup using just git, but it also allows for quicker disaster recovery, since the entire setup can be easily replicated.
● We have already migrated the services with the greatest benefit to k8s (about 50% of our total compute) and are currently evaluating a service mesh for better service discoverability and network visibility. Once this is done, and we nail down our disaster recovery plan for k8s, we will migrate all of our compute to k8s.
● We have already managed to reduce our costs by 60% so far on the services that have been migrated, which is bound to grow as and when we migrate more services (since fixed costs remain the same).
● Deployment time has gone down by almost 75%, even though we’ve introduced alpha deployments to the pipelines. Scaling up applications has also become 85% faster, since there is zero wait time for instances to come up (we typically run with 30% spare capacity).
● Images are always present on the instances, and the containers pass the readiness check in under 1 min. This allows us to handle a traffic surge with almost no impact on latencies.
Find our sample code & configurations here!