
Delivering tree insights at scale at Aerobotics

This post is contributed by Nic Coles, Head of Software Engineering, Aerobotics

Aerobotics is an agri-tech company operating in 18 countries around the world, based out of Cape Town, South Africa. Our mission is to provide intelligent tools to feed the world. We aim to achieve this by providing farmers with actionable data and insights on our platform, Aeroview, so that they can make the necessary interventions at the right time in the growing season. Our predominant data source is aerial drone imagery: capturing visual and multispectral images of trees and fruit in an orchard.

The data

For tree-specific data, we send a drone on an automated data-collection mission on a farm. This mission involves having the drone fly above an orchard in a grid-like pattern, collecting unprocessed, GPS-referenced images of the orchard. Once the mission is complete, the images are uploaded through Aeroview to a bucket on Amazon S3, and our tree data processing engine kicks off.

The Tree Engine is responsible for turning the raw aerial imagery captured by a drone into “tree insights” that growers use to project crop yields and identify unexpected anomalies. It does this through a number of sequential operations, from stitching the imagery into high-resolution, geo-referenced maps to running machine learning models that detect individual tree locations and compute tree health.

Processing tree data at scale is challenging as we have particular constraints and requirements:

  • Datasets (maps) are large (up to 10 GB per high-resolution orchard map).
  • Our customers require high-quality, accurate results, given the cost of decision making.
  • Data processing needs to be completed within a 3-day SLA.
  • The data processing demand (and required capacity) is completely dynamic.

Based on these points, our data processing system needs to optimize for:

  • Scalability to handle dynamic surges in demand.
  • Performance and utilization to reduce overall spend and improve gross profit margins.
  • Availability to meet external SLAs.

The Tree Engine has evolved greatly over the years, ensuring that we continue to meet the needs of our customers and the business.

Our journey to a scalable data processing architecture

Chapter 1: “Pre-cloud” (2014-2015)

Our customers, few and far between, couriered flash disks to our offices. We would run lengthy Python scripts overnight, on our laptops, which processed and prepared the data. Not only was this an operational burden for our customers but our data turnaround time was incredibly slow due to our hardware limitations. In addition to this, all other software development halted because our workstations were fully occupied with data processing.

Chapter 2: AWS EC2 (2015-2018)

The natural first step into the world of cloud computing was AWS and EC2. We used a single EC2 instance per customer upload, which performed all of the required data processing.

Challenges:

  1. Having the entire data pipeline on a single EC2 instance meant that we were paying for unnecessary compute resources when our tasks were not actively running.
  2. The time to process an upload scaled with the size of the upload.
  3. Updating the data processing software on the instance was invasive and time-consuming due to the lack of modern, automated deployment pipelines.

Chapter 3: Spinnaker + Kubernetes (2018-2020)

In 2018, our data processing infrastructure could no longer keep up with demand, and we needed to increase the velocity at which we could improve our system. We broke up our core processing pipeline into a number of Docker containers. We wanted to achieve a few things:

  • Ensure that production and local environments are exactly the same.
  • Improve velocity of development by separating responsibilities of tasks in the pipeline. Making changes to smaller, more isolated components is much easier and lower risk than making changes to a larger, more complex system.
  • Decrease the compute requirements for less complex tasks and in doing so, optimize overall compute efficiency (improving unit economics).

At the time, Kubernetes had just graduated from CNCF incubation. We decided to use Kubernetes with Spinnaker to manage the delivery of these containers. We initially set up this architecture on Google Kubernetes Engine (GKE), then moved across to Amazon Elastic Kubernetes Service (Amazon EKS) when it became available.

Challenges:

  1. Spinnaker was not designed for managing data processing pipelines, but rather for the continuous delivery of API services. A more appropriate alternative would have been a tool like Apache Airflow.
  2. At that scale, Spinnaker required a substantial amount of supporting infrastructure to be reliable enough for our use case (a Redis cluster, a SQL database, and fairly large EC2 instances).
  3. The bulk of our pipeline logic lived inside Spinnaker's interface, so we lacked visibility into changes, and the overall pipeline setup was not easily reproducible.
  4. Kubernetes had a few quirks when using it for data processing:
    1. You could not easily run a sidecar with a batch processing job (issue open since 2016).
    2. The cluster autoscaler moved pods managed by controller objects (a Job in this case) from node to node. This frequently resulted in jobs restarting from the beginning, as most of our application code could not efficiently handle restarts. The alternative was to set a `cluster-autoscaler.kubernetes.io/safe-to-evict=false` annotation on the pod (a sketch of this follows the list), which was very inefficient from an instance-utilization perspective.
  5. There was substantial operational overhead in managing EC2 node groups, EKS clusters, Spinnaker, and other open-source tooling such as Prometheus, Grafana, and Fluentd.
  6. It was difficult to correlate data processing costs with margins, since billing was per EC2 instance rather than per container.
  7. EC2 instances were underutilized, and this expense was scaling linearly with processing demand.

Fargate and EKS

One of our challenges running Kubernetes was that a member of our team had to manage the underlying EC2 instances (node groups) for the data plane. We also found it virtually impossible to keep utilization of these underlying EC2 instances above 50%, which meant that roughly half of our monthly compute spend was paying for idle capacity.

As referenced in this article, we found that unless you were able to achieve greater than 70-80% utilization of your EC2 instances, Fargate was more cost-effective. This is portrayed in the graph below.

At the time, we had very limited exposure to Fargate, so when Fargate for Amazon EKS launched in late 2019, we were eager to roll it out where possible. Fargate for EKS was definitely a step in the right direction, but we needed to make some fundamental changes.

Processing data at scale: the next chapter

At the end of 2019, we went to KubeCon in San Diego. It was eye-opening to see how the open-source community around Kubernetes was booming with technical innovation and supporting projects. Turning back to our internal teams, however, we decided that managing open-source software was not a core competency we wanted to build within our tech team.

We think Kelsey Hightower put it best in this tweet:

For a sustainable data processing architecture, we identified five areas that were important to us and that we would optimize for:

  1. Negligible operational overhead
  2. Reproducible and standardized infrastructure
  3. Low infrastructure costs
  4. Unit tracking, for accounting purposes
  5. Serverless compute, where possible, for utilization and management purposes

AWS Step Functions + ECS/Fargate (2020+)

To address the first area of concern, it became clear that we needed a fully managed service. This ruled out options such as Apache Airflow, Lyft Flyte, Kubeflow, and Spinnaker. In addition, Kubernetes was no longer desirable for our data processing use case.

The natural first place to look was within the AWS ecosystem: AWS Step Functions. Fortunately, Step Functions was able to address these areas in the following ways:

  • Reproducible and standardized infrastructure: AWS Step Functions and ECS are fully supported by infrastructure-as-code tools like Terraform and AWS CloudFormation.
  • Low infrastructure costs: The pay-per-use pricing model is perfect for our on-demand use case; we pay nothing for the orchestration layer when nothing is being processed.
  • Unit tracking, for accounting purposes: You are able to dynamically tag resources, and those tags flow through directly to the AWS bill.
  • Serverless compute, where possible, for utilization and management purposes: You are able to run ECS Fargate tasks, SageMaker jobs, and Lambda functions directly from a state machine. A recent Fargate update made this even more viable, as containers now have 20 GB of ephemeral storage (a sketch of this setup follows below).
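
To make the serverless point more concrete, below is a minimal sketch of what a single pipeline step can look like when wired up this way: an ECS task definition that runs on Fargate, and a Step Functions state machine that runs it synchronously. The resource names are illustrative, and the referenced variables (execution and state machine roles, cluster ARN, subnets) are assumed to be declared elsewhere, much like the variables in the module example later in this post.

resource "aws_ecs_task_definition" "tree_finding" {
  family                   = "tree-finding" // illustrative
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 4096  // 4 vCPU
  memory                   = 30720 // 30 GB
  execution_role_arn       = var.execution_role_arn // assumed to exist

  container_definitions = jsonencode([{
    name  = "tree-finding"
    image = var.ecs_task_container_image_id
  }])
}

resource "aws_sfn_state_machine" "tree_finding" {
  name     = "tree-finding" // illustrative
  role_arn = var.step_function_role_arn // assumed to exist

  definition = jsonencode({
    StartAt = "RunTreeFinding"
    States = {
      RunTreeFinding = {
        Type = "Task"
        // The .sync integration waits for the ECS task to complete
        // before the state machine moves on.
        Resource = "arn:aws:states:::ecs:runTask.sync"
        Parameters = {
          Cluster        = var.ecs_cluster_arn
          TaskDefinition = aws_ecs_task_definition.tree_finding.arn
          LaunchType     = "FARGATE"
          NetworkConfiguration = {
            AwsvpcConfiguration = {
              Subnets = var.subnet_ids
            }
          }
        }
        End = true
      }
    }
  })
}

Resources along these lines are roughly what our per-step data processing Terraform module, shown later in this post, wraps up for each pipeline step, exposing the state machine's ID as an output.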

Reproducible and standardized infrastructure

Terraform was an easy choice as the tool for building, changing, and versioning our infrastructure. It enabled us to apply infrastructure-as-code principles to all of our data processing infrastructure (reproducible). Terraform modules, an essential feature, allowed us to create repeatable chunks of infrastructure where appropriate (standardized).

This would have been equally possible with a tool like AWS CloudFormation, but we found Terraform's approach to modularization slightly more intuitive, and the team was more comfortable with its syntax and flow.

The setup is as follows: each data processing step is packaged in a Terraform module that provisions its compute (an ECS task) and an AWS Step Functions state machine to run it, and these state machines are composed into end-to-end pipelines in a separate Terraform workspace.

Our experience so far

1. Negligible operational overhead
The various backend teams have great autonomy. Introducing new workloads is quick, easy, and has very little impact on existing workloads across the team.

2. Reproducible and standardized infrastructure
Developers can focus on the actual application logic, knowing that everything they need from an infrastructure perspective is programmable through a main.tf file.

module "tree_engine_data_processing_module" { 
  ... 
  application_name = "tree-finding"
  cpu = 4096     // 4 vCPU
  memory = 30720 // 30 GB
  // updated as part of the CI build process
  ecs_task_container_image_id = var.ecs_task_container_image_id
  environment = var.environment
  ...
}


// For consumption by Tree Engine Pipelines workspace
output "step_function_id" {
  value = module.tree_engine_data_processing_module.step_function_id
}
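
The `step_function_id` output above is consumed by the workspace that composes individual steps into full pipelines. Below is a minimal sketch of that consumption, assuming a Terraform Cloud ("remote") backend is used to share state between workspaces; the organization and workspace names are illustrative, and an S3 backend would look similar.

// In the Tree Engine Pipelines workspace.
data "terraform_remote_state" "tree_engine_data_processing" {
  backend = "remote"
  config = {
    organization = "aerobotics" // illustrative
    workspaces = {
      name = "tree-engine-data-processing" // illustrative
    }
  }
}

// The step's state machine can then be referenced when composing pipelines.
locals {
  tree_finding_step_function_id = data.terraform_remote_state.tree_engine_data_processing.outputs.step_function_id
}
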

We now have full visibility into changes to the data processing pipelines, since every change goes through standard pull request review.

3. Low infrastructure costs
The Step Functions pay-per-use pricing model means that we incur a negligible additional cost per data processing task. This is far more cost-effective than the substantial fixed costs of the data processing infrastructure we managed before deploying Step Functions.

4. Unit tracking for accounting purposes
We are tagging all of the ECS tasks we run with unique IDs that match objects in our databases. The intention is to get to a point where we can query the exact cost of processing on an orchard-by-orchard basis. One way to wire this up is sketched below.
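
The excerpt below extends the task state from the state machine sketch earlier in this post. It assumes the orchard identifier arrives in the state machine execution input as `orchard_id`, and that the `Tags` parameter of the ECS `RunTask` call is passed through the Step Functions integration; both are assumptions for illustration, not a description of our exact setup.

// Excerpt of the earlier task state's Parameters block, with a per-run
// cost allocation tag added. Field names are illustrative.
Parameters = {
  Cluster        = var.ecs_cluster_arn
  TaskDefinition = aws_ecs_task_definition.tree_finding.arn
  LaunchType     = "FARGATE"
  Tags = [
    {
      Key       = "orchard_id"   // activate as a cost allocation tag in AWS Billing
      "Value.$" = "$.orchard_id" // taken from the execution input
    }
  ]
}
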

5. Serverless compute for utilization and management purposes
All of our data processing tasks run on Fargate, barring one particular task whose CPU/memory requirements are too large. Our data processing volume varies month to month, but we have seen savings of up to 40% since adopting Fargate.

We firmly believe that adopting a serverless architecture (Step Functions and Fargate) sets us up for success by preventing infrastructure from becoming a bottleneck to scaling. Fargate will continue to grow, support more demanding workloads, and integrate better with other AWS services, and we look forward to the team benefiting greatly from this in the years to come.