AWS Public Sector Blog

Building a data lake at your university for academic and research success

According to the National Center for Education Statistics, only 60 percent of college students receive a degree within six years. Universities such as Portland State University (PSU) and Oklahoma State University in Oklahoma City (OSU-OKC) are using data lakes for analytics and machine learning to improve academic achievement by helping students reach their educational goals faster.

And now, with the launch of an Education Pricing Program (EPP) for Amazon Simple Storage Service (Amazon S3), education institutions that enter into a storage commitment can build a data lake on storage priced up to 45 percent below public pricing for the S3 One Zone–Infrequent Access and S3 Glacier storage classes in the IAD, PDX, DUB, and CMH Regions. Read on for how institutions use Amazon S3 for data lakes, and contact us to get started.

Using data to tailor your student experience

PSU is using artificial intelligence (AI) and analytics to help students find the most effective pathways to graduation. By tracking the course history of successful graduates and presenting recommendations to current students, PSU can provide guardrails and best practices for a focused journey towards degree completion.

The role of data analytics extends beyond improving student outcomes to include increased student safety and an enhanced student experience. IT administrators assess data from sensors and cameras on campus to continually monitor and refine safety and security practices. They also track traffic and usage data from various student services to fine-tune existing services or launch new student-friendly offerings on campus.

One of the first steps in analyzing existing structured and unstructured data is to build a data lake. A data lake handles massive volumes of heterogeneous data, which multiple users within the organization can then query without jeopardizing production data.

Using a data lake to facilitate research

With cloud computing, researchers can quickly analyze massive data pipelines; store petabytes of data; advance research using transformative technologies like artificial intelligence (AI), machine learning (ML), and big data; and share their results quickly with others around the world. AWS also provides researchers with access to open datasets, funding, and training to accelerate the pace of innovation.

The Allen Institute collaborated with the AWS Open Data Program to make data from the Allen Brain Observatory (which includes nearly 100 TB of neurophysiology data) available to research and academic users in a public Amazon S3 bucket. The initiative helps other scientists who want to use the data in their own research. Moving data between users can be a challenging, expensive process. AWS allows users to bring analysis to the data instead. Interested users can spin up an Amazon Elastic Compute Cloud (Amazon EC2) instance and access an entire dataset in minutes, rather than spending weeks downloading (and duplicating) data locally.

Building a data lake

A data lake is a repository for all your structured and unstructured data. You can store your data as-is, without having to structure it first. You can then run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning (ML), to guide better decisions.

The five typical steps in building a data lake include:

  1. Set up storage
  2. Move data (immediately secure it, with a generic but highly restricted bucket policy) and validate it
  3. Cleanse, prep, and catalog data (the catalog covers operational, technical, and business metadata)
  4. Configure and enforce security and compliance policies
  5. Make data available for analysis
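As an illustration of the "generic but highly restricted bucket policy" in step 2, the sketch below builds a deny-by-default policy document in Python. It is one possible shape, not a policy taken from the institutions described here. The bucket name, account ID, and role name are hypothetical placeholders, and a real policy should be reviewed carefully, since a blanket deny like this also locks out administrators who are not using the allowed role.

```python
import json


def restricted_bucket_policy(bucket: str, allowed_role_arn: str) -> dict:
    """Build a deny-by-default S3 bucket policy document.

    Denies any request that is not sent over TLS, and denies every
    principal other than the single ingest role named in allowed_role_arn.
    """
    # The policy must cover both the bucket itself and the objects in it.
    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                "Sid": "DenyAllExceptIngestRole",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {
                    "StringNotEquals": {"aws:PrincipalArn": allowed_role_arn}
                },
            },
        ],
    }


# Hypothetical names, for illustration only.
policy = restricted_bucket_policy(
    "example-university-data-lake",
    "arn:aws:iam::123456789012:role/ExampleIngestRole",
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The resulting JSON string is the kind of value that could be supplied as the Policy argument of the boto3 S3 client's put_bucket_policy call. Access for analysts would then be opened up deliberately in step 4 rather than by default.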

Staging the data

Determine the best place to store and stage your data. Many organizations use Amazon S3 for their data lake because it is a highly durable, cost-effective object store that supports open formats of data while decoupling storage from compute. It also works with AWS analytics services. Although Amazon S3 provides the foundation of a data lake, you can add other services to tailor the data lake to your business needs.
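One common way to stage data in an object store is to encode the zone, dataset, and date in each object key, so that analytics engines can later prune by partition. The sketch below assumes a three-zone layout (raw, cleansed, curated) and Hive-style year=/month=/day= partitions. Both are illustrative conventions chosen for this example, not something prescribed by Amazon S3 or used by the institutions above.

```python
from datetime import date

# Hypothetical staging zones; rename or extend to suit your institution.
ZONES = ("raw", "cleansed", "curated")


def staged_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Return a Hive-style partitioned object key for one staged file."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone {zone!r}; expected one of {ZONES}")
    return (
        f"{zone}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Example with made-up dataset and file names.
example_key = staged_key("raw", "enrollment", date(2020, 9, 1), "extract.csv")
print(example_key)
```

Keeping untouched source extracts under a separate prefix from cleansed data also makes it straightforward to apply different access and lifecycle rules to each zone.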

For example, Oklahoma State University in Oklahoma City (OSU-OKC), a two-year, technical-focused college, faced challenges with consistent reporting, database management, and analytics. With the help of a data lake in Amazon S3 and Amazon QuickSight for analytics, the institution has a full view of its 60-year enrollment history. Long-term enrollment fluctuations and diversity trend data, coupled with external employment and economic data, have allowed the institution to create targeted strategies to increase student completion and retention. OSU-OKC also brought in publicly available comparative data from 30 of the state’s universities. Using Amazon QuickSight for data insights allows university leadership to make faster, market-based curriculum and operational decisions to meet student, industry training, and critical occupational needs.

A template to build a data lake

AWS Lake Formation can help organizations set up a secure data lake in a matter of days. AWS Lake Formation helps you collect and catalog your data from databases and object storage, move the data into your new Amazon S3 data lake, clean and classify your data using ML algorithms, and secure access to your sensitive data. Users can then access a centralized data catalog, which describes available datasets and their appropriate usage. You can then use these datasets with your choice of analytics and ML services, like Amazon Redshift, Amazon Athena, and Amazon EMR for Apache Spark. AWS Lake Formation builds on the capabilities available in AWS Glue.

Optimizing Amazon S3

Amazon S3 optimization reduces the cost of a data lake by placing objects in the correct S3 storage class based on frequency of access and durability requirements. For example, if an S3 object in the data lake can be easily reproduced from primary sources, AWS recommends S3 One Zone–Infrequent Access. If the data lake usage pattern is unknown or infrequent, S3 Intelligent-Tiering lets you reduce S3 storage cost automatically. In addition, S3 lifecycle policies can further reduce S3 costs by moving your datasets into the longer-term storage classes: Amazon S3 Glacier or S3 Glacier Deep Archive.
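As a rough sketch of such a lifecycle policy, the Python below builds a rule that moves objects under a prefix to One Zone-Infrequent Access, then S3 Glacier, then S3 Glacier Deep Archive as they age. The day thresholds and the raw/ prefix are placeholder assumptions for illustration. The 30-day floor reflects the minimum object age S3 requires before a transition into the Infrequent Access classes.

```python
def lifecycle_rule(
    prefix: str,
    ia_days: int = 30,
    glacier_days: int = 90,
    deep_archive_days: int = 365,
) -> dict:
    """Build one S3 lifecycle rule that tiers objects down as they age."""
    # Transitions must be ordered, and the first one cannot happen
    # before an object is 30 days old.
    if not 30 <= ia_days < glacier_days < deep_archive_days:
        raise ValueError("expected 30 <= ia_days < glacier_days < deep_archive_days")
    return {
        "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "ONEZONE_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


# Hypothetical prefix: tier down raw extracts that can be re-pulled from source.
lifecycle_configuration = {"Rules": [lifecycle_rule("raw/")]}
print(lifecycle_configuration)
```

A dictionary of this shape is what the boto3 S3 client's put_bucket_lifecycle_configuration call accepts as its LifecycleConfiguration argument. For prefixes where access patterns are unknown, a single transition to the INTELLIGENT_TIERING storage class could replace the fixed schedule.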

Beyond data lake to data warehouse

Beyond data lakes, universities also rely on data warehouses built on Amazon Redshift to identify academic achievement. University of Maryland Global Campus (formerly known as University of Maryland University College) runs predictive models to improve student outcomes by identifying students at risk of dropping out. Once identified, these students are directed to the pathways best suited to help them succeed. Administrators use dashboards to track student performance, course details, and enrollment numbers to help determine which courses need to be redesigned.

The ability to adapt affordably and securely, backed by reliable reporting and information, has transformed the way higher education administrators make decisions. More than ever, administrators need that responsiveness to effectively serve their students and their constituents.

Learn more about data lakes, Amazon S3, AWS Lake Formation, Amazon Athena, Amazon SageMaker, AWS Glue, Amazon Redshift, and how to apply for $5,000 in AWS Promotional Credits to get started through Project Resilience.

Nader Nanjiani

Nader Nanjiani is a technology marketing leader with more than 20 years of product marketing experience at Fortune 500 companies. At Amazon Web Services (AWS), Nader leads the solutions marketing of storage, compute, open data, and analytics services into the public sector segment. He has also co-authored two books—one on collaboration and another on e-learning—and secured a patent on a game application. Nader completed his graduate studies at Harvard University and Syracuse University.