AWS Public Sector Blog

Modern data engineering in higher ed: Doing DataOps atop a data lake on AWS


Modern data engineering covers several key components of building a modern data lake. For the most part, databases and data warehouses do not lend themselves well to a DevOps model.

DataOps grew out of the frustration of trying to build scalable, reusable data pipelines in an automated fashion. It applies DevOps principles on top of data lakes to help build automated solutions in a more agile manner. With DataOps, users process data on the data lake and then curate and collect the transformed data for downstream processing.

One reason DevOps has been hard to apply to databases is that testing is difficult to automate on such systems. At the California State University Chancellor's Office (CSUCO), we took a different approach: we keep most of our logic in a programming framework, which lets us build a testable platform.

We applied DataOps in ten steps:

  1. Add data and logic tests: PyTest, CodeCoverage, and AWS CodeBuild (a minimal test sketch follows this list)
  2. Use a version control system: Git and AWS CodeCommit
  3. Branch and merge: CodeCommit, AWS CodePipeline, and AWS Lambda
  4. Use multiple services: CodeBuild, CodePipeline, and AWS CloudFormation
  5. Reuse and containerize: Amazon Elastic Kubernetes Service (Amazon EKS) and Docker
  6. Parameterize your processing: Amazon EMR, Amazon CloudWatch, AWS Step Functions, and Amazon Simple Notification Service (Amazon SNS)
  7. Set up storage: Amazon Simple Storage Service (Amazon S3)
  8. Configure security: AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), Amazon GuardDuty, Amazon Inspector, and AWS Trusted Advisor
  9. Set up data management: AWS Database Migration Service (AWS DMS), AWS Glue Data Catalog, and Amazon Athena
  10. Set up databases/data warehouse: Amazon Aurora PostgreSQL and Amazon Redshift
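
As an illustration of step 1, the sketch below shows the kind of data and logic test we run with PyTest inside CodeBuild. The transform function, column names, and fixture values are hypothetical placeholders rather than our production code; the point is that transformation logic lives in plain Python functions that can be exercised against small in-memory fixtures on every commit.

```python
# test_enrollment_transform.py -- a hypothetical data/logic test.
# The transform, column names, and fixture data are illustrative only.
import pandas as pd
import pytest


def dedupe_latest_enrollment(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent enrollment row per student (example transform)."""
    return (
        df.sort_values("effective_date")
        .drop_duplicates(subset="student_id", keep="last")
        .reset_index(drop=True)
    )


@pytest.fixture
def raw_enrollments() -> pd.DataFrame:
    # Small in-memory fixture standing in for a Parquet extract from Amazon S3.
    return pd.DataFrame(
        {
            "student_id": [1, 1, 2],
            "term": ["2020A", "2020B", "2020B"],
            "effective_date": pd.to_datetime(["2020-01-01", "2020-06-01", "2020-06-01"]),
        }
    )


def test_one_row_per_student(raw_enrollments):
    result = dedupe_latest_enrollment(raw_enrollments)
    assert result["student_id"].is_unique


def test_latest_term_is_kept(raw_enrollments):
    result = dedupe_latest_enrollment(raw_enrollments)
    assert result.loc[result["student_id"] == 1, "term"].item() == "2020B"
```

A CodeBuild project can then run something like pytest --cov on every commit so that the coverage numbers gate the build.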

High-level architecture

We built our data lake using AWS tools and applications. To keep costs down, we follow a dynamic model: most of our services are pay-as-you-go, and with the exception of Tableau, we don't launch services unless we need them. This conforms to the DataOps and DevSecOps model of being agile and lean.

The architecture we follow sources our data from a traditional data warehouse that holds all of our Oracle and PeopleSoft databases. We then load this data into our data lake using AWS DMS. We try to load sources directly as Parquet files; when we cannot, we land them as is and then convert them to Parquet. We built our architecture to scale out and up, so it can accept any kind of data source, streaming or batch, in any format.
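
When AWS DMS cannot land a source as Parquet directly, a small conversion job rewrites the raw extract. The following is a minimal sketch of that fallback using PyArrow and s3fs; the bucket names and key paths are placeholders, not our actual data lake layout.

```python
# Convert a CSV extract landed in S3 into Parquet for the data lake.
# Bucket names, prefixes, and the s3fs dependency are illustrative assumptions.
import pyarrow.csv as pv
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()  # uses the ambient IAM role/credentials

src = "csuco-raw-zone/peoplesoft/enrollments/extract.csv"          # placeholder path
dst = "csuco-curated-zone/peoplesoft/enrollments/extract.parquet"  # placeholder path

# Read the raw extract and write it back as columnar, compressed Parquet.
with fs.open(src, "rb") as f:
    table = pv.read_csv(f)

with fs.open(dst, "wb") as f:
    pq.write_table(table, f, compression="snappy")
```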

Once the data is loaded, we apply a series of continuous integration and continuous delivery (CI/CD) processes: testing, code coverage, builds, orchestrating and configuring infrastructure across multiple environments, running the code that processes the data, and loading the transformed data back into Amazon S3 for further downstream reporting and analysis. For these, we use tools like CodeCommit, CodeBuild, CodePipeline, AWS CodeDeploy, AWS CloudFormation, Step Functions, and Lambda.
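
Step Functions and Lambda handle the orchestration piece of that pipeline. As a rough sketch (the environment variable, state machine, and payload shape are assumptions for illustration), a small Lambda handler can kick off the processing workflow when new data lands or when the pipeline promotes a build:

```python
# Lambda handler that starts a data-processing state machine.
# The STATE_MACHINE_ARN environment variable and the input payload
# are illustrative assumptions, not the exact CSUCO configuration.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    execution = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps(
            {
                # Pass through just enough context for the downstream processing steps.
                "source_prefix": event.get("source_prefix", "raw/"),
                "run_date": event.get("run_date"),
            }
        ),
    )
    return {"executionArn": execution["executionArn"]}
```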

All of our processed data goes into Amazon S3. We apply AWS Glue for schema management atop these transformed datasets so we can query them via Athena. We also load subsets of data into Aurora PostgreSQL and Amazon Redshift for better and faster responses, which helps keep our costs to a minimum. Lastly, we use Tableau to report on our data to our customers and clients.
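
Because the Glue Data Catalog holds the schemas for the transformed datasets, downstream consumers can query them in place through Athena. A minimal sketch of such a query via boto3 follows; the database, table, and results bucket are placeholder names.

```python
# Run an ad hoc Athena query against a Glue-cataloged table in S3.
# Database, table, and output-location names are illustrative placeholders.
import time

import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="SELECT term, COUNT(*) AS students FROM enrollments GROUP BY term",
    QueryExecutionContext={"Database": "csuco_curated"},
    ResultConfiguration={"OutputLocation": "s3://csuco-athena-results/adhoc/"},
)

# Poll until the query finishes, then fetch the result rows.
query_id = query["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])
```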

Containerizing data

We strive to innovate quickly to deliver the most value to our DataOps projects. To achieve this, we used modern applications with CI/CD to automate the entire release process: building and running tests, promoting artifacts to staging, and the final deployment to production. We took advantage of fully managed CI/CD services such as CodeBuild, CodePipeline, and Amazon EKS. By modeling infrastructure as code, we incorporated it into our standard application development lifecycle, executing infrastructure changes in our CI/CD pipeline. We used Kubernetes and AWS together to create a fully managed, continuous delivery pipeline for container-based applications. This approach takes advantage of Kubernetes's open source system to manage our containerized applications and the AWS developer tools to manage our source code, builds, and pipelines.

Production deployment process overview

We check the latest code into CodeCommit, test it, and verify that it passes code coverage during the integration testing cycle. To ensure the highest level of resource and security isolation, we implemented the continuous delivery pipeline using two AWS accounts. The DEV account primarily uses CodeCommit, CodeBuild, CodePipeline, and AWS governance and security resources to orchestrate the continuous delivery process.

When code is ready to move to production, a pull request is submitted for a manager's approval. When the request is approved, the process automatically merges the change into the production branch. All artifacts are packaged and deployed to the PROD account using CodeBuild. CodePipeline then triggers AWS CloudFormation to provision an Amazon EMR cluster in the PROD account, run the Amazon EMR steps, and terminate the cluster when all steps are done. During this stage, AWS CloudFormation creates a change set for the PROD stack and executes it; once the change set executes, the production deployment is complete. This approach eliminates duplication of the delivery pipeline, which is secured using cross-account IAM roles and encryption of artifacts with an AWS KMS key.
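
The EMR cluster in that stack is transient: it exists only for as long as its steps are running. The sketch below shows a boto3 equivalent of what the CloudFormation stack provisions; the release label, instance sizing, log bucket, and step script location are assumptions for illustration.

```python
# Launch a transient EMR cluster that runs its steps and then terminates,
# mirroring what the CloudFormation stack in the pipeline provisions.
# Release label, instance sizing, buckets, and the step script are assumptions.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="dataops-prod-transform",
    ReleaseLabel="emr-6.4.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://csuco-emr-logs/",          # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # The cluster shuts itself down once the last step completes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "run-transforms",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://csuco-artifacts/transform.py"],  # placeholder script
            },
        }
    ],
)

print("Cluster:", response["JobFlowId"])
```

Because KeepJobFlowAliveWhenNoSteps is false, the cluster terminates on its own after the last step, which keeps compute costs tied to actual production runs.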

Conclusion

By moving to this DataOps architecture, we brought our running costs down 60 percent, from tens of thousands of dollars to several thousand dollars. Our on-call maintenance burden dropped dramatically, to around 53 percent of what it was before. By automating all of these processes, we also reduced critical errors from production loads, cutting manual interventions by approximately 61 percent. We are better able to test and optimize our code, and we can run performance load tests in a cleaner and more stable way.

Check out more stories on data lakes and learn more about the cloud for higher education.

Subash D'Souza

Subash D'Souza leads the cloud data group at the California State University (CSU) Chancellor's Office, where he uses his more than ten years of cloud experience to help with CSU's transition to the cloud. He is a data evangelist. He is the founder and organizer of Data Con LA, formerly known as Big Data Day LA, a data conference based in Southern California. He is also the founder of Data 4 Good, a public entity using data to solve social causes. Subash's passion lies in building scalable and performant systems.

Babu Repaka

Babu Repaka is a DataOps engineer at the California State University (CSU) Chancellor's Office. Babu has been working in the industry for more than 19 years, including more than 15 years in enterprise resource planning (ERP) and business intelligence (BI) applications and data warehousing. In his various roles, from developer to BI solution architect and manager, he has directly led teams for over 10 years. Several of his projects have been end-to-end ERP (PeopleSoft), BI, and data warehouse implementations in which he was engaged in assessment, solution conceptualization, requirements analysis, solution design and architecture, configuration, user testing, and deployment. His experience also goes beyond deployment: he led a large ERP (PeopleSoft) data warehouse operations and production support team for a real-time data warehouse.

Maria Fung

Maria Fung is a DataOps engineer at the California State University (CSU) Chancellor's Office. Maria is a lifelong learner with a passion for technology, and has spent the past 20 years specializing in data warehousing and business intelligence solutions.