AWS Database Blog
Cal Poly’s software engineering capstone class builds MySQL capture and replay on AWS
At California Polytechnic State University, the software engineering school’s capstone class was created in support of the university’s “Learn by Doing” philosophy. Students in the class experience over the course of a full school year what it’s like to work on a collaborative industry project. This is Dr. David Janzen’s tenth year teaching this course, and he’s an expert at not only finding projects that work best for a course of this length, but also providing the appropriate amount of difficulty for the course.
This year’s project was called “AWS MyCRT”, which stands for “MySQL Capture and Replay Tool.” The project was supported by the Amazon RDS team. The class of 30 Cal Poly students was divided into teams based on each student’s individual abilities, and the result was five well-rounded teams that each implemented their own version of MyCRT.
The project was shaped through team and customer meetings and requirements elicitation with Amazon RDS team members Brian Welcker, Rosa Thomas, and Sachin Honnudike. From these meetings, one of the teams, Team Titans, created the following background for the project:
On the Amazon Web Services (AWS) platform, customers can take advantage of the Relational Database Service (RDS), Amazon’s managed database offering. There is a wide variety of server hardware to choose from to best suit the customer’s need. However, choosing the right hardware to meet certain performance requirements is difficult to gauge. It can also be very costly to a company if servers are underperforming, or if they are paying too much for what they need. Amazon’s customers have expressed that they would like a way to compare performance across different server setups, and ensure they have the right amount of server resources to meet company demand.
The MyCRT solution to this problem was a tool that can capture workloads and metrics on an RDS MySQL database. The solution then allows users to replay this workload on an alternative database configuration and gather corresponding metrics. The tool also allows users to analyze the results in order to select an ideal MySQL and RDS configuration for their business needs. Each team was able to move forward with the year-long project with help from AWS and Dr. Janzen along the way.
The first eleven weeks of the year focused on software requirements. Here, students learned software requirements elicitation, analysis, and documentation. In teams, students focus on software architectural design and prototyping. They reviewed team process infrastructure and resource estimation to support appropriate levels of quality. In the following ten weeks, the focus moved towards building the product in the class. Teams focused on software development process models, software design, documentation, quality assurance during development, software unity and integration testing while applying CASE tools. They also focused on development environments, security best practices, test tools, continuous integration, and configuration management.
In the final ten weeks, teams focused on deploying the software and learned about software maintenance, deployment, economic issues, version control, defect tracking, and technical support. Each of the five capstone teams implemented a solution to the problem in their own way with a wide variety of tools, technologies, and architectures.
Example MyCRT project
As one example, Lil Bobby Tables’ MyCRT implementation can shed more light on what MyCRT does. The source code is available on GitHub. This solution was built with several use cases in mind, each at a different scope. At a high level, MyCRT is a service installed on an EC2 instance that serves a client (web) application and runs captures and replays on target RDS databases. These captures and replays are independent programs that can also be run from the command line. The entire project is managed with NPM, built with TypeScript, and runs on either NodeJS or in the browser (for a GUI client).
Users can interact with MyCRT at various levels. Most find enough functionality through the GUI client, but there are more options for advanced users. This includes direct interaction with the REST API, or manually launching captures and replays from the command line. Lil Bobby Tables also implemented a concept called environments. The concept allows users to narrow their use of MyCRT to their use case while others can use the same MyCRT installation for unrelated projects. By exposing the MyCRT functionality at these different interfaces, there is flexibility in how MyCRT can be used.
When a capture runs, users provide a source RDS MySQL database on which to capture the workload. The capture starts by turning on the MySQL general log. After it’s running, MyCRT periodically polls the source RDS database for activity and CloudWatch metrics. These metrics include CPU utilization, read/write IOPS, and freeable memory. The data is dumped to a user-specified Amazon S3 location in raw JSON format. The capture continues this cycle until it is told to stop, which can be done manually or set to some duration. At this point, users have access to the captured workload and all of the CloudWatch metrics that are gathered.
When a replay runs, users provide a target RDS MySQL database and a previously captured workload for the replay to repeat. The replay steps through the given workload, and reproduces MySQL queries on the target database. Like the collection of capture metrics, the replay also periodically gathers metrics to dump in S3.
After a user completes both a capture and replay, they have two sets of metrics in S3 for identical workloads, but on different RDS MySQL databases. The simplest way for the user to compare these metrics is through the MyCRT GUI client. Metrics are displayed in a graph like the one here and the user can choose which metrics they want to view. If they want to perform a more detailed analysis on the metrics files, all of the data is stored in raw JSON on S3.Environments are an extra layer of organization. They allow multiple users and teams within the same organization to perform captures and replays without sharing their data with other users. At the same time, environments isolate users to the captures and replays that they care about, thus making the whole process much smoother. Each environment specifies one source database that captures can be created from, and a target database is provided for each replay that is created within the environment.
If you want to learn more about any of the five MyCRT implementations, you can set one up and run it in your AWS environment. All the source code is available with an open-source license. You can download them from the capstone project page on GitHub.
About the Authors
Amy Lewis is a senior at California Polytechnic State University finishing her undergrad in Software Engineering. Next year she will be pursuing her master’s degree in Computer Science in Cal Poly’s blended 4+1 program. This summer she will be interning in Los Angeles on the Flight Systems team at SpaceX. In her free time she loves to read, go to the beach, swim, and backpack.
Cameron Taylor is a graduating senior at California Polytechnic State University in Software Engineering. After graduation, he will be joining the Commercial Software Engineering team at Microsoft in Redmond.
Christiana Ushana is a graduating senior at California Polytechnic State University, San Luis Obispo in Software Engineering. She is currently working as a Frontend Software Developer at Bishop Peak Technology, Inc. In the coming months, she will be working as a Software Engineer for Sapient Razorfish. Christiana volunteers her time at Vista Church in SLO. She spends her free time learning new technical skills, indoor rock climbing, swimming, and hiking.