Data Lakes and Analytics
The Nasdaq Composite is a stock market index of the common stocks and similar securities listed on the Nasdaq stock market. Demand on Nasdaq's enterprise data warehouse was growing so rapidly that Nasdaq feared its system would soon hit a limit beyond which it could not scale. The Nasdaq team was tasked with redesigning the architecture of its data warehouse to handle rapidly changing service level demands from customers and decided to partner with the AWS Data Lab to accelerate the creation of this solution. Over four days, the Nasdaq team worked with the AWS Data Lab to explore and test various options for improving scalability and decided to separate storage from compute by using Amazon Redshift as a compute engine on top of its data lake. Rather than maintaining a single large Amazon Redshift cluster, the team deployed smaller Amazon Redshift clusters suited to the needs of its different business users. Deployment of this new architecture to production created "infinite" capacity for additional data without manual intervention, increased scalability and parallelism, and resulted in a 75% reduction in Reserved Instance costs.
“I wish that we hadn’t waited so late in the project to take advantage of [the AWS Data Lab]. We came out of that week at AWS Data Lab with answers and a clear path to how we were going to solve the problems that we were facing.” Robert Hunt, VP Software Engineering, Nasdaq.
Learn more about Nasdaq's solution and experience with the AWS Data Lab>>
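The storage/compute split Nasdaq adopted typically works by keeping the data in Amazon S3 and having each right-sized Amazon Redshift cluster attach to it through a Redshift Spectrum external schema. As a minimal sketch of that pattern (all schema, database, and role names here are hypothetical, not Nasdaq's actual configuration):

```python
# Sketch: each business-user cluster runs the same kind of DDL to attach
# to the shared S3-backed Glue Data Catalog, so compute scales (or is
# sized) independently of storage. Names below are illustrative only.

def external_schema_ddl(schema: str, glue_db: str, iam_role: str) -> str:
    """Build the CREATE EXTERNAL SCHEMA statement a cluster would run."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{glue_db}'\n"
        f"IAM_ROLE '{iam_role}';"
    )

ddl = external_schema_ddl(
    schema="market_data",
    glue_db="datalake_db",
    iam_role="arn:aws:iam::123456789012:role/spectrum-role",
)
print(ddl)
```

Because every cluster reads the same catalog and the same S3 objects, adding a cluster for a new business team adds compute without copying any data.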
Allen Institute focuses on accelerating foundational research, developing standards and models, and cultivating new ideas to make a broad, transformational impact on science. One of its research institutes, the Allen Institute for Brain Science, partnered with the AWS Data Lab to rapidly accelerate its journey into data platform modernization. As part of its mission to share massive amounts of data with the public to accelerate advancement in neuroscience, Allen Institute needed to build a solution that could provide researchers around the world with the ability to work with extremely wide datasets (more than 50,000 columns) at scale and with very low latency. In only four days, the Allen Institute team built a working prototype of an end-to-end feature matrix ingestion pipeline using transient Amazon Elastic MapReduce (EMR) clusters and Amazon DynamoDB that dynamically ingests and transforms its wide datasets into consumable, interactive datasets for researchers. The team left the AWS Data Lab with an accelerated plan to bring this solution to production, furthering its commitment to support researchers in the quest for improved health outcomes.
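Storing rows this wide in DynamoDB usually means sharding, since a single item is capped at 400 KB. One plausible way to do it (this is an illustration of the general pattern, not the Allen Institute's actual schema; the key names and chunk size are assumptions) is to split each wide row into chunk items that share the row's partition key:

```python
# Sketch: shard one very wide row into multiple DynamoDB-style items so
# no single item exceeds the service's size limits. All names assumed.

def shard_wide_row(row_id: str, row: dict, chunk_size: int = 1000) -> list:
    """Split one wide row into multiple DynamoDB-style items."""
    cols = sorted(row)
    items = []
    for i in range(0, len(cols), chunk_size):
        chunk = cols[i:i + chunk_size]
        items.append({
            "pk": row_id,                      # partition key: the row
            "sk": f"chunk#{i // chunk_size}",  # sort key: the shard
            "values": {c: row[c] for c in chunk},
        })
    return items

wide_row = {f"feature_{n}": n * 0.5 for n in range(2500)}
items = shard_wide_row("cell_0001", wide_row)
print(len(items))  # 2500 columns / 1000 per chunk -> 3 items
```

A reader can then fetch a whole row with a single-partition query, or just the shards holding the columns of interest, which is what keeps latency low at this width.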
KnowBe4, Inc. provides Security Awareness Training to help companies manage the IT security problems of social engineering, spear phishing, and ransomware attacks. Its training platform revolves around the Risk Score pipeline, which generates an individualized risk score for tens of millions of users daily. KnowBe4 worked with the AWS Data Lab to build a working prototype of a new Risk Score pipeline that reduced total runtime from 7.5 or more hours to 3.5 hours and horizontally scaled every aspect of data retrieval, processing, and training. After the AWS Data Lab, the team used the skills it learned to continue to optimize its pipeline. Five months post-lab, KnowBe4 launched to production with a final runtime of 1.1 hours. In addition to this six-fold reduction in total runtime, KnowBe4's new architecture revealed a four-fold savings in cost.
“What we did in four days would have taken us weeks, maybe months, to achieve some of this refactor of the technical debt we had with our AI pipeline. And at the same time prepare our data handling to scale to 10x what we have today.” Marcio Castilho, Chief Architect Officer, KnowBe4.
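The article does not describe KnowBe4's actual scoring formula, so purely as a hypothetical sketch of the shape of such a pipeline stage: each user's security events are folded into a single bounded risk score. Because each user's score depends only on that user's events, this step is embarrassingly parallel, which is what makes horizontal scaling of the pipeline straightforward.

```python
# Hypothetical illustration only: event names, weights, and the clamped
# 0-100 scale are invented for this sketch, not KnowBe4's formula.

EVENT_WEIGHTS = {
    "clicked_phish": 10.0,
    "entered_credentials": 25.0,
    "reported_phish": -5.0,
    "completed_training": -10.0,
}

def risk_score(events: list) -> float:
    """Sum weighted events and clamp to a 0-100 score (50 = neutral)."""
    raw = sum(EVENT_WEIGHTS.get(e, 0.0) for e in events)
    return max(0.0, min(100.0, 50.0 + raw))

print(risk_score(["clicked_phish", "entered_credentials"]))  # 85.0
print(risk_score(["completed_training", "reported_phish"]))  # 35.0
```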
Sportradar is a global provider of sports data intelligence, serving leagues, news media, consumer platforms, and sports betting operators with deep insights and a suite of strategic solutions to help grow their businesses. It engaged the AWS Data Lab for guidance on developing a modernized, low latency data analytics pipeline and workflow to power real-time statistical models, feature extraction, and inference using machine learning models and real-time dashboards. The Sportradar team left the AWS Data Lab with a clear path forward for real-time sportsbook risk management and real-time fraud detection, as well as a scalable process for deploying and managing additional data pipelines on a global level. It used the AWS Data Lab to help expand the capabilities of its existing cloud-native big data and analytics platform for real-time analytics workloads.
“Using the elasticity and value-added services from AWS, we have managed to analyze a high volume of transactions to produce deep real-time analytics. This gives our traders a crucial edge.” Ben Burdsall, CTO, Sportradar.
Jungle Scout is an all-in-one platform for finding, launching, and selling Amazon products. With the support of the AWS Data Lab, Jungle Scout built the foundation of an Amazon Simple Storage Service (S3) based data lake in only four days, as well as a repeatable pattern for building data pipelines that hydrate the data lake and provide a simple method for joining datasets from different systems into one centralized location. By using S3 as the core of the data lake, Jungle Scout is able to reduce its storage footprint across other databases and remove data silos, ultimately helping the team reduce cost and increase productivity. The solution also makes it simpler to manage multiple versions of product metadata changes, giving Jungle Scout’s data scientists and engineers the flexibility to view data changes several times per day and troubleshoot data faster.
“By leveraging the AWS Data Lab, we were able to launch our analytics solution to production only three months after joining the lab and with only two engineers working full-time on the project. This has resulted in a major shift in how engineers at Jungle Scout build data processing pipelines.” Alex Handley, Principal Architect, Jungle Scout.
Learn more about Jungle Scout's solution>>
Freeman is a leader in brand experience. The Freeman team was tasked with creating a streamlined approach for handling, validating, and joining data that would power visualizations in its custom dashboard service. Freeman partnered with the AWS Data Lab to accelerate the architectural design and prototype build of this solution. In only four days, the Freeman team built a data pipeline prototype for both streaming and batch datasets leveraging Amazon Kinesis and AWS Glue workflows to ingest, curate, and prepare the data. Using Amazon Athena, Amazon Kinesis Data Analytics, and Amazon Elasticsearch Service to query the various curated datasets and Amazon QuickSight and Kibana to visualize the results in easy-to-consume dashboards, the Freeman team left the AWS Data Lab with a clear path forward for enabling end users to gain valuable insights into its data.
“We were able to leverage our existing knowledge and infrastructure within AWS by expanding into new services and features that we hadn't explored before. With the help of the AWS solutions architects that worked side-by-side with us, we were able to greatly accelerate the delivery of our system and set up a foundation that we can build on down the road.” Casey McMullen, Director of Digital Solution Development, Freeman.
TownSq connects neighbors, board members, and management teams to easy, proven, collaborative tools designed to enhance the community living experience. TownSq needed to upgrade its data and analytics capabilities due to exponential client growth. It decided to build a data lake to enable greater insights about business performance, client benchmarking, engagement levels, and success rates on new products and tools. TownSq also wanted to deploy algorithms to highlight unmet client needs, automate key processes, and provide recommendations to mitigate any emerging or detected risks. In four days, the TownSq team achieved its goal of building a functioning data lake and an extract, transform, load (ETL) pipeline capable of processing data from multiple sources, including Amazon DynamoDB and internal MongoDB and ERP systems. Immediately following the lab, the team was able to use the solution to realign its product roadmap to focus on higher return-on-investment opportunities and dramatically increase engagement on newly launched features.
"Working directly with Amazon's architects is a major accelerator, especially in a business driven by speed to market. The AWS Data Lab prepped for us, were in the room to support our build, and we walked out days later with a functioning product. The new products we are launching are game-changing and the added knowledge we have will help us continue to lead the market." Luis Lafer-Sousa, President - US, TownSq.
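An ETL pipeline that pulls from several source systems, as TownSq's does, needs a normalization step that maps each source's record shape onto one common schema before the data lands in the lake. A minimal sketch of that idea (the field names and shapes below are hypothetical, not TownSq's actual schemas):

```python
# Sketch: map DynamoDB-style and MongoDB-style records onto one common
# schema. Field names are invented for illustration.

def from_dynamodb(item: dict) -> dict:
    # DynamoDB items carry type descriptors like {"S": ...} / {"N": ...}
    return {
        "community_id": item["communityId"]["S"],
        "event": item["event"]["S"],
        "count": int(item["count"]["N"]),
    }

def from_mongodb(doc: dict) -> dict:
    return {
        "community_id": str(doc["_id"]),
        "event": doc["event_name"],
        "count": doc.get("count", 0),
    }

records = [
    from_dynamodb({"communityId": {"S": "c1"}, "event": {"S": "login"},
                   "count": {"N": "42"}}),
    from_mongodb({"_id": "c2", "event_name": "login", "count": 7}),
]
print(records)
```

Once every source emits the same shape, downstream analytics (benchmarking, engagement metrics) can be written once against the common schema.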
hc1 offers a suite of cloud-based, high-value care solutions that enable healthcare organizations to transform business and clinical data into the intelligence necessary to deliver on the promise of personalized care, all while eliminating waste. As an aggregator of billions of healthcare records from a number of large diagnostic testing providers, hc1 identified the need to migrate from its existing data warehouse to a scalable data lake on AWS to support its advanced analytics initiatives with AWS Artificial Intelligence (AI) and Machine Learning (ML) services. AWS Data Lab helped hc1 migrate its patient diagnostic testing data warehouse to a data lake architecture by partnering to rebuild its core SQL-based ingestion, cleanup, and patient-matching Extract, Transform, Load (ETL) scripts as AWS Glue ETL jobs. The team also leveraged AWS Glue FindMatches to deduplicate patient test panel records across testing providers. hc1's team left the AWS Data Lab with a well-architected data lake framework for its application’s core data repository. The hc1 team also learned best practices for matching patient information across datasets using AWS AI services, which will ensure patient medical record completeness and accuracy by deduplicating data from different points of care.
"Reliable patient record matching is pivotal in improving patient outcomes and reducing clinical waste. AWS AI services allows us to flexibly update our matching system. We are able to incorporate new sources in less than half the time.” Charles Clarke, SVP of Technology, hc1.
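AWS Glue FindMatches does the deduplication hc1 describes with a trained ML transform; as a purely rule-based illustration of the underlying matching idea, a sketch like the following groups patient records by a normalized key (name plus date of birth). This is not FindMatches itself, and the records are invented:

```python
# Sketch: rule-based record matching across providers. Real systems
# (including FindMatches) use learned similarity, not exact keys.
import re
from collections import defaultdict

def match_key(record: dict) -> str:
    """Normalize name and DOB into a blocking key."""
    name = re.sub(r"[^a-z]", "", record["name"].lower())
    return f"{name}|{record['dob']}"

def dedupe(records: list) -> list:
    groups = defaultdict(list)
    for r in records:
        groups[match_key(r)].append(r)
    # keep one representative per matched group
    return [g[0] for g in groups.values()]

records = [
    {"name": "Jane O'Neil", "dob": "1980-02-01", "provider": "LabA"},
    {"name": "JANE ONEIL",  "dob": "1980-02-01", "provider": "LabB"},
    {"name": "John Smith",  "dob": "1975-07-09", "provider": "LabA"},
]
print(len(dedupe(records)))  # 2 unique patients
```

The advantage of an ML transform over such hand-written rules is exactly the flexibility hc1 cites: new sources with new quirks can be accommodated by labeling examples rather than rewriting normalization logic.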
Automox is an information technology company providing a cloud-native, zero-maintenance solution that modernizes endpoint management for optimized security and business outcomes. Automox is unique in that it combines individual endpoint management modules into an extensible automation framework that can query endpoints, collect insights, and take action automatically, at scale. Automox collaborated with the AWS Data Lab to build a platform for providing enterprise customers with analytics and insights into endpoint management, patching, and vulnerabilities. Automox leveraged the Data Lab to prototype an end-to-end data pipeline with the goal of enabling an analytics API that can be used without knowledge of the structure in the underlying data stores. This included an ingestion service to load endpoint and patch data from its unified data layer, a data lake for multipurpose storage, and a batch processing layer for aggregations and dynamic querying. This reporting and analytics platform will support both internal users and external customers. The team left the AWS Data Lab with a validated prototype for a data processing pipeline that will support Automox's analytics and query requirements, offering scalability and flexibility as its data footprint continues to grow.
"To address our customers' problems, we need to build fast and make the right technology decisions. AWS Data Lab was the right accelerator for us and gives us a wonderful advantage, being able to validate our assumptions and answer our questions with the right expertise” Pascal Borghino, Head of Engineering, Automox.
Since 1882, Dow Jones has been finding new ways to bring information to the world’s top business entities. Dow Jones had several Informix databases to migrate to Amazon Aurora PostgreSQL and engaged the AWS Data Lab to help it test different data migration options and establish a well-architected data migration approach to apply to its 100+ databases. In just a week, Dow Jones emerged with a finalized approach for scripting and automating data migration and code deployment, including how to convert stored procedures, triggers, and tables, setting the stage for future Informix migrations.
Verisk is a leading data analytics provider serving customers in insurance, energy and specialized markets, and financial services. To scale solutions quickly and achieve greater resilience against points of failure, Verisk chose to migrate its legacy database footprint to AWS. Verisk collaborated with AWS Data Lab to receive expert guidance on navigating the design, architectural, and implementation challenges that come with undertaking mass migrations involving complex data types like large objects and geospatial data, large volumes of data, and complex procedures and schemas developed over 20+ years. AWS Data Lab worked with Verisk to architect and prove out a migration path from Verisk's legacy systems to Amazon Aurora PostgreSQL using AWS Database Migration Service and AWS Schema Conversion Tool. In addition to the technical work achieved in the AWS Data Lab, Verisk came away with an increasingly focused migration strategy, a deepened understanding of how to execute migrations to AWS databases, and best practices for database administration and operating PostgreSQL databases in production.
"As a Database Administrator at Verisk working on the data migration, I am miles ahead of where I was prior to working with the AWS Data Lab. I have more confidence in being able to successfully migrate our legacy database to Aurora PostgreSQL and have a better understanding of what products are available to us. I couldn't have asked for a better experience."
3M is an American enterprise company operating in the fields of industry, worker safety, health care, and consumer goods. 3M R&D needed to enhance its machine learning, analytics, and reporting capabilities for more than 10,000 spreadsheets across six different business operations with more than fifty different schemas. With guidance from the AWS Data Lab, 3M developed a minimum viable product (MVP): multiple extract, transform, load (ETL) data pipelines flowing into a data lake in Amazon S3, with Amazon SageMaker notebooks and Amazon QuickSight used to interpret, analyze, and visualize the data for enhanced insights. This solution will allow 3M to work with customers more interactively, enabling immediate response time and higher customer satisfaction with the entire sales and solutioning process.
“I never knew it was possible to organize so much data in a way that would allow me to effectively access and analyze millions of rows of data, where before I was constantly looking for spreadsheets or just asking for another test to be run.” Lead Materials Application Engineer, 3M.
Civitas Learning is a data science company dedicated to helping higher education solve pressing challenges and improve student success outcomes. The company partnered with the AWS Data Lab to architect and integrate key building blocks in machine learning (ML) causal inference in order to create a real-world evidence knowledge base. Civitas Learning implemented an architecture for using notebooks in a production environment and left the AWS Data Lab with a new, repeatable workflow it can use for additional data science tasks.
“AWS assembled a super team to help us architect and integrate key building blocks in ML causal inference so that we could construct a real-world evidence knowledge base. They also made sure that we stayed on course after our Data Lab engagement, which is helping us scale our ML practice with much faster deployment speed. It’s been a great, rewarding experience for us all, and our customers are happier as a result.” David Kil, Chief Data Scientist, Civitas Learning.
PHD Media is a global communications planning and media buying agency network. PHD Media needed to build a lean, performant, and scalable extract, transform, load (ETL) and data storage infrastructure that could support future machine learning workloads. The AWS Data Lab helped PHD Media move its ETL jobs to AWS Glue and rebuild its pipeline into a three-part process: data ingestion, data staging, and data summarization. PHD Media left the AWS Data Lab with a new architecture for its data pipeline that reduces ETL processing time from 21 hours to 75 minutes and is capable of integrating with Amazon SageMaker and BI tools.
“We would not have been able to dedicate the same amount of time to the development, nor been able to resolve our questions and problems as quickly without the AWS Data Lab. Doing the same work outside of the AWS Data Lab would have cost us significantly more in funds and time.” Amar Vyas, Global Data Strategy Director, PHD Global Business.
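The three-part structure PHD Media adopted (ingestion, staging, summarization) can be sketched as three small stages chained together. The record shapes and field names below are hypothetical, for illustration only:

```python
# Sketch: a three-stage pipeline. Each stage has one job, so stages can
# be scaled, scheduled, and debugged independently (as Glue jobs would be).

def ingest(raw_lines: list) -> list:
    """Ingestion: parse raw delimited records."""
    return [dict(zip(("campaign", "clicks"), line.split(",")))
            for line in raw_lines]

def stage(records: list) -> list:
    """Staging: cast types and drop malformed rows."""
    staged = []
    for r in records:
        try:
            staged.append({"campaign": r["campaign"],
                           "clicks": int(r["clicks"])})
        except (KeyError, ValueError):
            continue  # malformed row is filtered out here, not downstream
    return staged

def summarize(records: list) -> dict:
    """Summarization: aggregate clicks per campaign."""
    totals = {}
    for r in records:
        totals[r["campaign"]] = totals.get(r["campaign"], 0) + r["clicks"]
    return totals

raw = ["spring,10", "spring,5", "winter,3", "bad_row"]
print(summarize(stage(ingest(raw))))  # {'spring': 15, 'winter': 3}
```

Separating staging from summarization is what lets the summarized layer feed both BI tools and Amazon SageMaker without re-reading raw data.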