St. Louis University uses AWS to make big data accessible for researchers

The research team at St. Louis University’s (SLU) Sinquefield Center for Applied Economic Research (SCAER) required vast quantities of anonymized cell phone data to study the impacts of large-scale social problems like homelessness and access to healthcare. Finding a reliable data supplier was relatively simple. When it came to storing, cleaning, and processing 450 terabytes of data, however, the team ran into challenges. This blog post explains how SCAER worked with Amazon Web Services (AWS) to create a fast, cost-effective solution for managing the growing quantities of data, and how this technology empowered researchers to tackle critical social issues head-on.

A data project grows too large to handle

“The idea to use anonymized cell phone data to help with economic-development use cases came from Michael Podgursky, SCAER’s director. He chose a vendor called Veraset, which coordinates and collects multiple data sets from different sources and sends it out as one large dataset,” said Shruthi Sreenivasa Murthy, assistant director of research computing at SLU. “Originally, our center wanted to buy one year of data to try it out. But as the COVID-19 pandemic hit, we realized that we needed the data to be ongoing and updated.”

SCAER soon moved to receiving quarterly dumps of data, which sharply increased the amount of information they needed to manage and store. As the project progressed, the team ran into several challenges:

Receiving and storing hundreds of terabytes of data. Veraset’s transfer process included sending data from its data warehouse to SLU’s cloud storage, but SCAER had no easy way to receive the vast amount of data. Choosing a place to store, clean, and manage the data became the number one priority, but it wasn’t as simple as choosing an off-the-shelf data solution.
Cleaning the data. Another issue was cleaning the data to remove duplicate data points. Veraset collected data from many different sources, some of which had significant overlap, such as region-based data.
Making the data compatible for researchers. Finally, SLU researchers had their preferred SaaS tools and platforms—and many were not compatible with the Parquet format used by Veraset.
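To make the duplicate-removal challenge concrete, here is a minimal sketch in plain Python. It is not SCAER’s or Ideas2IT’s code: the field names (device_id, timestamp, lat, lon, source) and the sample values are hypothetical, and it assumes two records are duplicates when they agree on device, time, and location regardless of which upstream source supplied them.

```python
from typing import Iterable, Iterator


def drop_duplicate_pings(records: Iterable[dict]) -> Iterator[dict]:
    """Yield each location ping once, keeping the first occurrence.

    Assumption (not from the source): a ping is identified by the
    combination of device, timestamp, and coordinates.
    """
    seen = set()
    for rec in records:
        key = (rec["device_id"], rec["timestamp"], rec["lat"], rec["lon"])
        if key not in seen:
            seen.add(key)
            yield rec


# Hypothetical sample: the first two pings are the same observation
# delivered by two overlapping upstream sources.
pings = [
    {"device_id": "a1", "timestamp": 1600000000, "lat": 38.63, "lon": -90.20, "source": "feed_x"},
    {"device_id": "a1", "timestamp": 1600000000, "lat": 38.63, "lon": -90.20, "source": "feed_y"},
    {"device_id": "b2", "timestamp": 1600000060, "lat": 38.64, "lon": -90.23, "source": "feed_x"},
]
unique_pings = list(drop_duplicate_pings(pings))
```

At hundreds of terabytes, the same idea would have to run on a distributed engine rather than in a single process, since the set of seen keys would not fit in memory.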

“We were trying to understand how we could put this data to use efficiently and quickly, because we didn’t want researchers wasting their time in pre-processing,” said Shruthi. “They could build out use cases and continue their research if we could solve this problem for them.”

Turning to AWS to handle big data in the cloud

Between cleaning the data, converting it from Parquet to the desired data format, and ensuring it was compressed enough to store affordably, the team faced a large job. SCAER knew that the vast storage and compute power of the cloud was needed for the project’s success, and they began to look for solutions.

AWS wasn’t the only cloud provider the research team at SLU considered for managing the data, but it was the one the team felt most comfortable with. “AWS gave us a dedicated solutions architect who would do whatever they could to help us succeed,” said Shruthi. “This was key.”

AWS also stood out in its ability to meet the unique needs of SCAER, an institution inundated with unique research requests from many different SLU departments and teams. The non-transferable nature of these requests required a knowledgeable technology partner to help set up the data solution correctly. AWS took the challenge and ensured the SCAER team never felt alone.

With the help of a tech partner, Ideas2IT, the SCAER team started on an innovative and money-saving path to using their giant data stream in the most effective way possible. Ideas2IT brought the dual benefit of being AWS experts while also understanding the needs of an institution that relied on large volumes of data. Their experience with larger pipelines and streaming real-time data was just what SCAER needed to modernize while keeping their budget under control.

From ideation to fast results

Ideas2IT created a solution with a data-processing cost on AWS of less than $28,000. They also pre-processed more than 4.5 years of data in less than a week. These results were made possible by the following technical best practices:

Eliminating the intermediate data bucket between Veraset and SCAER’s AWS solution, which previously contained thousands of micro files.
Running Apache Spark on Amazon EMR.
Repartitioning the data to improve Amazon Simple Storage Service (Amazon S3) speeds and reduce the cost of each data transfer.
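The reason micro files and repartitioning matter can be shown with a small sketch. The function below is a hypothetical planner, not the Ideas2IT pipeline: it groups many small files into batches near a target size, so that each batch could be rewritten as one larger object. The file names, the 2 MB file size, and the 128 MB target are illustrative assumptions. In Spark, the comparable step is usually done with DataFrame.repartition or coalesce before writing.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group small files into batches of at most roughly target_bytes each.

    Files are taken in name order; a batch is closed as soon as adding
    the next file would push it past the target size.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in sorted(file_sizes.items()):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
# Hypothetical input: 1,000 micro files of 2 MB each.
micro_files = {f"part-{i:05d}.parquet": 2 * MB for i in range(1000)}
batches = plan_compaction(micro_files, target_bytes=128 * MB)
print(len(micro_files), "files ->", len(batches), "batches")
```

Fewer, larger objects mean fewer storage requests per read and less per-file overhead for each Spark task, which is one plausible way repartitioning improves both speed and cost.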

Ideas2IT also created a highly scalable pre-processing workflow that met SCAER’s unique requirements. Data in raw form isn’t useful for researchers, so the solution made data user-friendly and searchable. With the inclusion of geocodes by region and dates, data was primed for a more complete pre-processing, cleaning, and storage method that also reduced duplicate datasets.
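One common way to make data searchable by region and date is to lay it out under partition prefixes, so that a query for one region and one day only touches the matching files. The sketch below illustrates that idea under stated assumptions: the region=/date= prefix layout, the region_code field, and the sample values are hypothetical, and the source does not say which layout SCAER actually uses.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def partition_path(region_code: str, timestamp: int) -> str:
    """Build a Hive-style partition prefix from a region code and a Unix timestamp (UTC)."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"region={region_code}/date={day}/"


def group_by_partition(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Bucket records by the partition prefix they would be written under."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        groups[partition_path(rec["region_code"], rec["timestamp"])].append(rec)
    return dict(groups)


# Hypothetical records; region codes and device IDs are illustrative only.
records = [
    {"device_id": "a1", "region_code": "29510", "timestamp": 1600000000},
    {"device_id": "b2", "region_code": "29510", "timestamp": 1600000060},
    {"device_id": "c3", "region_code": "29189", "timestamp": 1600086400},
]
partitions = group_by_partition(records)
for prefix, rows in sorted(partitions.items()):
    print(prefix, len(rows))
```

With a layout like this, a data manager cutting a dataset for one region and date range can select whole prefixes instead of scanning the full dataset.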

Making researchers’ jobs easier

With the data cleaned and searchable, it was now ready to be shared with researchers—and the process was easier and more secure than ever before. After a brief 30-minute training session to learn the new technology, researchers can request access to specific datasets and receive AWS logins to access the specific Amazon Elastic Compute Cloud (Amazon EC2) instances related to their query needs. “The data manager can just do the data cut and give it to the researcher without any hassle, confusion, or manual workflows,” said Shruthi.

For security purposes, each login comes with its own permissions, which grant access to the geospatial analysis tools needed for the research work while restricting access to other data services. The desktop application offers its own videoconferencing, presentation, and communication tools, while preventing downloading or sharing the data outside of the AWS infrastructure.

“Because we are using distributed processing, the size of the instance is smaller, the time is faster, and the cost is lower,” said Shruthi. “In the past, it took 15 minutes to retrieve that data. Now it’s only 30 seconds.”

Improving research capabilities at SCAER—and beyond

The SCAER team plans to continue their commitment to improving the field of research by sharing what they’ve learned through this process. While their data uses may be unique to their researchers, the challenges they faced are not. Cost, data security, and build time are genuine roadblocks for institutions looking to invest in data technology, and the SCAER leadership believes more researchers should have access.

“We have built a small charge-back model, so other research groups can use our datasets for a small fee,” said Shruthi. “Our intention is to collaborate and let many more people make use of this data.”

Learn more about how higher education institutions around the world are using AWS to support research and teaching, connect the campus community, make data-driven decisions to save money and resources, accelerate research efforts, and more at the AWS Cloud for Higher Education hub.
