Eightfold Uses Amazon EMR to Speed Up ML Workloads by 80% at an 80% Lower Cost


Founded in 2016, Eightfold.ai (Eightfold) uses artificial intelligence in its Talent Intelligence Platform to drive data-based insights for companies on hiring, retaining, and diversifying talent. The platform was seeing data grow 20 percent month over month, yet its machine learning (ML) models took 20 days to train, so every model update took 20 days to reach the platform. Eightfold needed a cost-effective solution that would shorten the training cycle without overburdening its staff.

Having used Amazon Web Services (AWS) for its backend services, Eightfold decided to use Amazon EMR, the industry-leading cloud big data platform for processing vast amounts of data using open-source tools such as Apache Spark. By using Amazon EMR as an engine that can scale up and down to power its ML workloads, Eightfold’s platform can iterate faster and build smarter, more accurate models with a lower total cost of ownership, enabling Eightfold to rapidly innovate so its customers can better manage their workforce.


“The stability, constant innovation, lower costs, and performance of Amazon EMR have been crucial for our success in delivering better results to our customers.”

Varun Kacholia
cofounder and chief technology officer, Eightfold.ai

Seeking a Scalable Engine to Power ML Workloads

Eightfold offers the Talent Intelligence Platform to provide businesses with data-based insights that will help them hire and retain the right people, reach diversity hiring goals, and manage and develop their workforce. Eightfold’s customers span financial services, media and entertainment, technology, consumer packaged goods, insurance, professional services, and other sectors. The company, which launched using AWS, stores and processes data using AWS services, including Amazon Simple Storage Service (Amazon S3), an object storage service that offers industry-leading scalability, data availability, security, and performance. 

Eightfold’s legacy data processing platform, which the company managed and maintained itself, was written in the Python programming language and ran on Amazon Elastic Compute Cloud (Amazon EC2) instances. As the amount of data in the platform—pulled primarily from social media, search engines, subscription databases, and public sources of data such as the government and census—increased 20 percent every month, so did Eightfold’s need for scalable compute capacity to process the data. At the same time, Eightfold wanted to reduce the training time for its ML models, leading to an even greater need for scalable compute capacity. A fundamental part of any ML workflow is experimentation, and Eightfold’s engineering team had to iterate faster to produce results. 
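To make the growth figure concrete, the short sketch below compounds the 20 percent monthly increase quoted above. It assumes the rate stays constant, which the source does not claim beyond "every month"; the function names are illustrative only.

```python
import math


def growth_factor(monthly_rate: float, months: int) -> float:
    """Return how many times larger the data is after compounding monthly growth."""
    return (1.0 + monthly_rate) ** months


def doubling_time_months(monthly_rate: float) -> float:
    """Return the number of months needed for the data volume to double."""
    return math.log(2.0) / math.log(1.0 + monthly_rate)


if __name__ == "__main__":
    # 20 percent per month is the figure quoted in the case study.
    print(f"12-month growth factor: {growth_factor(0.20, 12):.2f}x")
    print(f"Doubling time: {doubling_time_months(0.20):.1f} months")
```

Under that constant-rate assumption, the data volume would roughly double in under four months and grow about ninefold in a year, which is why fixed compute capacity falls behind quickly.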

The company wanted to use Apache Spark, an analytics engine for big data processing, but installing, tuning, scaling, and managing it would consume a majority of the engineering team’s time. Eightfold needed a solution that would enable the in-house team to focus on analysis, not system management and maintenance. Amazon EMR provides the latest stable open-source software releases, so Eightfold wouldn’t have to manage updates and bug fixes, resulting in fewer issues and less effort to maintain the environment. In addition, Amazon EMR runs petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and is more than three times faster than standard Apache Spark. “Amazon EMR offered us the ability to create scalable Apache Spark clusters in minutes instead of the days or months it would have taken us to build, secure, and optimize them on our own,” says Sanjeet Hajarnis, principal machine learning engineer at Eightfold. “All the components that we needed to run our Apache Spark workloads were already factored in.”

Using Amazon EMR to Flexibly Scale at a Reasonable Cost

Eightfold first migrated to a fixed cluster: within a few minutes, it launched an Amazon EMR cluster and used it to convert its binary classifier model, which is trained on hundreds of millions of career trajectories and real-world outcomes, to PySpark, the Python API for Apache Spark. With data stored on the Amazon EMR cluster, the first successful run of the pipeline completed 80 percent faster than it had on the previous system, with little to no tuning. “We ran the model on the Amazon EMR cluster and found out that the first run out of the box completed in 80 percent less time than during the previous runs,” says Kevin Cherian, a senior software engineer at Eightfold. 

The next step was scalability. Eightfold eased into using Amazon EMR, starting with a low cluster footprint to run its daily operations, such as ad hoc querying and extract, transform, load jobs with PySpark. Now, to automatically scale up the nodes in order to train the ML models using data stored in Amazon S3, Eightfold pairs the automatic scaling feature of Amazon EMR with Amazon EC2 Spot Instances. That combination scales the company’s cluster up to eight times on training days and scales down the nodes for performing simpler transformations and ad hoc data exploration on other days—saving Eightfold 80 percent of the costs compared to not taking advantage of Amazon EMR’s automatic scaling and Amazon EC2 Spot Instances. 
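As a rough illustration of pairing automatic scaling with Spot Instances, the sketch below builds the keyword arguments that could be passed to the `run_job_flow` call of the boto3 EMR client. It is not Eightfold's configuration: the cluster name, release label, instance types, node counts, and IAM role names are all assumptions, the sketch uses EMR managed scaling although the case study does not say which scaling mechanism was used, and it reads "up to eight times" as a cluster-size multiplier. The function only returns a dictionary and never contacts AWS; field names should be checked against the current boto3 documentation before use.

```python
def build_cluster_request(base_nodes: int = 4, max_scale: int = 8) -> dict:
    """Build hypothetical keyword arguments for an EMR run_job_flow request.

    The small On-Demand core group covers daily work; managed scaling may add
    Spot task nodes up to base_nodes * max_scale instances for training runs.
    """
    return {
        "Name": "ml-training-sketch",        # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",        # assumed release, not from the source
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": base_nodes},
                # Task nodes hold no HDFS data, so Spot interruptions cost less.
                {"Name": "task-spot", "InstanceRole": "TASK",
                 "Market": "SPOT", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ManagedScalingPolicy": {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": base_nodes,
                "MaximumCapacityUnits": base_nodes * max_scale,
                # Cap On-Demand and core capacity so growth lands on Spot task nodes.
                "MaximumOnDemandCapacityUnits": base_nodes,
                "MaximumCoreCapacityUnits": base_nodes,
            }
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request()
    limits = request["ManagedScalingPolicy"]["ComputeLimits"]
    print(limits["MinimumCapacityUnits"], limits["MaximumCapacityUnits"])
```

With credentials configured, a dictionary like this could be passed as `boto3.client("emr").run_job_flow(**request)`. Keeping the On-Demand cap equal to the baseline means any capacity added on training days comes from Spot Instances, which is one way to obtain the kind of cost profile the paragraph above describes.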

Amazon EMR’s automatic scaling capability also has enabled Eightfold’s engineering team to run experiments on different volumes of data with no changes in code. Using Amazon Machine Images (AMIs), the company can install custom ML libraries at cluster deployment time so its data scientists can focus on building and training models without needing to manage library dependencies. That capability enables the engineering team to quickly operationalize models at scale.

Providing Customers with Better Insights to Find and Retain Talent

By adopting Amazon EMR as its core ML platform and using it for data preprocessing and model training, prediction, and validation, Eightfold has been able to build more accurate ML models 80 percent faster than it could on its previous system and at an 80 percent lower cost. That ability to quickly innovate means that its platform can provide better insights for its customers, ultimately giving them a competitive edge by helping them find and retain talent. For example, one customer, a top-ten communications company in Asia, increased the hiring of women by 19 percent, and another customer, a top-five robotic process automation company, increased the number of positions filled per recruiter per quarter by 50 percent. 

According to Varun Kacholia, cofounder and chief technology officer of Eightfold, “The stability, constant innovation, lower costs, and performance of Amazon EMR have been crucial for our success in delivering better results to our customers.”

To learn more, visit aws.amazon.com/emr.

Figure: Eightfold.ai's Talent Intelligence Platform architecture.

About Eightfold.ai

Founded in 2016, Eightfold.ai offers its Talent Intelligence Platform, which uses artificial intelligence to provide companies with data-driven insights to hire and retain talent, reach diversity goals, and manage and develop their workforce.

Benefits of AWS

  • Completes ML workloads 80% faster
  • Saves 80% on compute costs
  • Scales clusters up and down as many as 8 times daily
  • Runs experiments on different volumes of data with no changes in code
  • Iterates and builds ML models faster
  • Launches an Amazon EMR cluster in a few minutes

AWS Services Used

Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud. Spot Instances are available at up to a 90% discount compared to On-Demand prices.


Amazon EMR

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.


Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.


Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

