“On AWS, accomplishing the eightfold increase in capacity and throughput we needed for this project was just another day at the office.”
Robert Burcham, CTO, Pinsight Media
  • Benefits of AWS

    • Executed 100,000 compute hours in eight days
    • Translated complex workflows easily
    • Optimized resource costs
    • Avoided resource-related cost overruns
    • Achieved eightfold capacity and throughput increase

Pinsight Media is a mobile data and insights company based in Kansas City, Missouri. The company provides businesses with actionable marketing insights and intelligence based on exclusive first-party, network-level mobile data. Pinsight services include customer modeling, analysis of brand affinity, competitive benchmarking, and market analysis.

Each day, Pinsight gathers and processes more than 80 terabytes of anonymized location signals, packet layer data, and other kinds of mobile-carrier signal data. Pinsight houses this data in an on-premises data center and manages it with its internally developed data management platform (DMP).

In addition to managing the company's round-the-clock extract, transform, and load (ETL) processes, the Pinsight DMP is a shared resource for teams across the enterprise. Data scientists and big data developers working with Hadoop, Spark, Hive, and similar tools use the DMP to analyze the anonymized signal data, model it to derive insights, and construct workflows that deliver data products under stringent service-level agreements (SLAs).

This structure served Pinsight well until a customer asked the company to help it analyze a six-month, 900 TB backlog of data. The volume of inputs, the complexity of the needed workflow, and a tight deadline posed significant challenges for the DMP system. “We had eight days to turn around a compute problem that was going to require about 100,000 compute hours to complete,” says Robert Burcham, CTO at Pinsight Media. “Using our DMP, we could produce about three days of output for every one day of compute time, so processing six months of inputs would have missed the deadline and consumed all our on-premises compute resources for months.”
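The arithmetic behind Burcham's statement can be sketched as follows. This is only a back-of-the-envelope check using the figures quoted above; it assumes that "six months" means roughly 180 days of input and that a "compute hour" is one hour on a single compute unit, neither of which the case study states explicitly.

```python
# Figures quoted in the case study.
BACKLOG_DAYS = 180        # assumption: "six months" taken as about 180 days of input
DEADLINE_DAYS = 8         # days available to deliver
COMPUTE_HOURS = 100_000   # estimated total compute hours
ONPREM_RATE = 3           # days of output produced per day of on-premises compute


def days_to_process(backlog_days: float, rate: float) -> float:
    """Wall-clock days needed to clear a backlog at a given output rate."""
    return backlog_days / rate


# Time the on-premises DMP alone would have needed.
onprem_days = days_to_process(BACKLOG_DAYS, ONPREM_RATE)

# Output rate needed to finish inside the deadline, and the implied speedup.
required_rate = BACKLOG_DAYS / DEADLINE_DAYS
speedup_needed = required_rate / ONPREM_RATE

# Average number of compute units that must run in parallel for eight days
# (assumption: one compute hour equals one unit running for one hour).
avg_concurrency = COMPUTE_HOURS / (DEADLINE_DAYS * 24)

print(f"On-premises only: {onprem_days:.0f} days")
print(f"Speedup needed:   {speedup_needed:.1f}x")
print(f"Avg. concurrency: {avg_concurrency:.0f} compute units")
```

Under these assumptions the on-premises system would have needed about 60 days, and the required speedup of 7.5 times is in line with the roughly eightfold capacity increase described in this case study.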

Pinsight needed an easy-to-deploy solution that could provide the compute power necessary to deliver its customer's data product without turning the contract unprofitable.

"We knew the only way to get the short-term scale we needed was to go to the cloud," says Burcham. He explains that Pinsight has years of experience using the Amazon Web Services (AWS) Cloud to analyze big data and build exploratory data science workflows. "Our deep DevOps experience on AWS made us confident that a cloud-based approach would help us get this customer what it wanted."

Pinsight first staged the customer's sanitized dataset from its on-premises data center to AWS using two storage services: Amazon Elastic Block Store (Amazon EBS), which provides persistent block storage volumes for use with Amazon Elastic Compute Cloud (Amazon EC2) instances, and Amazon Simple Storage Service (Amazon S3), object storage that can store and retrieve any amount of data from any source. Pinsight then processed the data in Amazon Elastic MapReduce (Amazon EMR) pipelines and packaged outputs for delivery to the client using a custom workflow management framework built on Amazon EC2.

To keep costs manageable, Pinsight configured its Amazon EMR pipelines to use Amazon EC2 Spot Instances, ephemeral compute resources that can be as much as 90 percent less expensive than standard Amazon EC2 instances. Pinsight also set a maximum price it was willing to pay per instance to avoid resource-related cost overruns.
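As an illustration of how a per-instance price ceiling might be expressed, the sketch below builds an instance group definition in the shape accepted by the `InstanceGroups` field of the Amazon EMR `RunJobFlow` request. The group name, instance type, instance count, and price are hypothetical placeholders; the case study does not say which values or which tooling Pinsight used.

```python
def spot_instance_group(name: str, role: str, instance_type: str,
                        count: int, max_price_usd: float) -> dict:
    """Build an EMR instance group that runs on Spot with a price ceiling.

    The keys follow the InstanceGroups structure of the EMR RunJobFlow
    request. BidPrice is passed as a string, in USD per instance-hour.
    """
    return {
        "Name": name,
        "InstanceRole": role,               # "MASTER", "CORE", or "TASK"
        "Market": "SPOT",                   # request Spot capacity
        "BidPrice": f"{max_price_usd:.3f}",  # maximum price per instance-hour
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


# Hypothetical values, for illustration only.
task_group = spot_instance_group(
    name="task-spot",
    role="TASK",
    instance_type="m5.xlarge",
    count=10,
    max_price_usd=0.08,
)
print(task_group)
```

A dictionary like this would be one entry in the instance group list of a cluster request; the other parameters such a request needs are omitted here.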

“The whole engagement was a classic case of using managed services where it made sense, and running our own where it made sense,” Burcham says.

By using AWS, Pinsight protected its margin and impressed its customer by shipping within the project's tight deadline. The keys to beating that deadline were on-demand scalability and the ease of replicating a critical on-premises workflow on AWS.

“We were banking on the fact that we would be able to not only translate our workflows to AWS, but that the AWS services we were going to use would also be able to scale enough to get the backfill done in time,” says Chris Swanda, a senior DevOps engineer at Pinsight. "Using AWS, we were able to quickly replicate a critical product delivery workflow at scale."

Controlling costs was also crucial to project success, so Pinsight valued the transparent pricing structure and easy-to-use cost tools available on AWS. “AWS provides a clear set of tools so you can see per-unit prices,” Swanda says. “Then it’s just a matter of manipulating spreadsheets to estimate the costs of different solution approaches.”
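The kind of spreadsheet estimate Swanda describes can be sketched in a few lines. The per-hour prices below are invented placeholders, not actual AWS prices or Pinsight's figures; only the 100,000 compute hours comes from the case study, and treating each compute hour as one instance-hour is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Approach:
    """One candidate solution approach and its per-unit price."""
    name: str
    price_per_instance_hour: float  # USD, hypothetical placeholder
    instance_hours: float

    def cost(self) -> float:
        """Estimated total cost: per-unit price times units consumed."""
        return self.price_per_instance_hour * self.instance_hours


def cheapest(approaches: list[Approach]) -> Approach:
    """Return the approach with the lowest estimated total cost."""
    return min(approaches, key=lambda a: a.cost())


# Assumption: one compute hour equals one instance-hour.
approaches = [
    Approach("on-demand", 0.20, 100_000),
    Approach("spot", 0.06, 100_000),
]

for a in approaches:
    print(f"{a.name:10s} ${a.cost():,.2f}")
print("Cheapest:", cheapest(approaches).name)
```

Swapping in real per-unit prices and instance-hour estimates for each approach gives the same comparison a spreadsheet would.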

An approach that relied on Amazon EC2 Spot Instances was important to the project's overall cost picture. "We knew that judicious use of Spot Instances would play a big part in keeping our costs under control, so we used the AWS Spot pricing tool to analyze Spot prices and set our ceilings," says Swanda. “It was rare for us to not get capacity on a cluster, but when we did, Amazon EMR just picked up and carried on with the workflow without problems.”

The easy scalability of AWS has transformed what Burcham and his team can accomplish, both on this project and on others like it. “Ten years ago, the lead time and cost of achieving such a massive compute expansion at scale and on-demand just did not exist,” he says. “On AWS, accomplishing the eightfold increase in capacity and throughput we needed for this project was just another day at the office.”
