AWS Big Data Blog
Explore real-world use cases for Amazon CodeWhisperer powered by AWS Glue Studio notebooks
Many customers are interested in boosting productivity in their software development lifecycle by using generative AI. Recently, AWS announced the general availability of Amazon CodeWhisperer, an AI coding companion that uses foundation models under the hood to improve software developer productivity. With Amazon CodeWhisperer, you can quickly accept the top suggestion, view more suggestions, or continue writing your own code. This integration reduces the overall time spent writing data integration and extract, transform, and load (ETL) logic. It also helps beginner-level programmers write their first lines of code. AWS Glue Studio notebooks allow you to author data integration jobs with a web-based serverless notebook interface.
In this post, we discuss real-world use cases for CodeWhisperer powered by AWS Glue Studio notebooks.
Solution overview
For this post, you use the CSV eSports Earnings dataset, available to download via Kaggle. The data is scraped from eSportsEarnings.com, which provides information on earnings of eSports players and teams. The objective is to perform transformations using an AWS Glue Studio notebook with CodeWhisperer recommendations and then write the data back to Amazon Simple Storage Service (Amazon S3) in Parquet file format as well as to Amazon Redshift.
Prerequisites
Our solution has the following prerequisites:
- Set up AWS Glue Studio.
- Configure an AWS Identity and Access Management (IAM) role to interact with CodeWhisperer. Attach the following policy to the IAM role that is attached to the AWS Glue Studio notebook (see the policy sketch after this list):
- Download the CSV eSports Earnings dataset and upload the CSV file highest_earning_players.csv to the S3 folder you will be using in this use case.
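For CodeWhisperer, the required permission is a single action. The following is a minimal sketch of the policy to attach; scope it down further if your security requirements demand it:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CodeWhispererPermissions",
            "Effect": "Allow",
            "Action": ["codewhisperer:GenerateRecommendations"],
            "Resource": "*"
        }
    ]
}
```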
Create an AWS Glue Studio notebook
Let’s get started. Create a new AWS Glue Studio notebook job by completing the following steps:
- On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
- Select Jupyter Notebook and choose Create.
- For Job name, enter CodeWhisperer-s3toJDBC.
A new notebook will be created with the sample cells as shown in the following screenshot.
We use the second cell for now, so you can remove all the other cells.
- In the second cell, update the interactive session configuration by setting the following:
- Worker type to G.1X
- Number of workers to 3
- AWS Glue version to 4.0
- Moreover, import the DynamicFrame module and the current_timestamp function as follows:
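The configured cell should be similar to the following sketch (the %-magics are standard AWS Glue interactive sessions magics, and the session setup lines come from the sample cell):

```python
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import current_timestamp

# Initialize the Glue interactive session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
```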
After you make these changes, the notebook should look like the following screenshot.
Now, let’s ensure CodeWhisperer is working as intended. At the bottom right, you will find the CodeWhisperer option beside the Glue PySpark status, as shown in the following screenshot.
You can choose CodeWhisperer to view the options to use Auto-Suggestions.
Develop your code using CodeWhisperer in an AWS Glue Studio notebook
In this section, we show how to develop an AWS Glue notebook job with Amazon S3 as a data source and a JDBC data source as a target. For our use case, we need to ensure Auto-Suggestions are enabled. Generate recommendations with CodeWhisperer using the following steps:
- Write a comment in natural language (in English) to read CSV files from your S3 bucket:
After you enter the preceding comment and press Enter, the CodeWhisperer button at the bottom of the page will show that it is running to generate the recommendation. The output of the recommendation will appear on the next line, and the code is accepted when you press Tab. You can learn more in User actions.
CodeWhisperer will generate a code snippet that is similar to the following:
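A sketch of that suggestion follows; the bucket and prefix here are placeholders rather than the values CodeWhisperer generated:

```python
# Read CSV files from S3 into a Spark DataFrame
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("s3://<your-bucket>/<prefix>/highest_earning_players.csv")

df.show(5)
```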
Note that you need to update the paths to match the S3 bucket you’re using instead of the CodeWhisperer-generated bucket.
From the preceding code snippet, CodeWhisperer used Spark DataFrames to read the CSV files.
- You can now try some rephrasing to get a suggestion with DynamicFrame functions:
Now CodeWhisperer will generate a code snippet that is close to the following:
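A sketch of that suggestion, again with placeholder paths:

```python
# Read CSV files from S3 into an AWS Glue DynamicFrame
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<your-bucket>/<prefix>/"]},
    format="csv",
    format_options={"withHeader": True},
)
```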
This rephrasing proves that, after some modifications to the comments we wrote, we got the correct recommendation from CodeWhisperer.
- Next, use CodeWhisperer to print the schema of the preceding AWS Glue DynamicFrame by using the following comment:
CodeWhisperer will generate a code snippet that is close to the following:
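For example:

```python
# Print the schema of the above DynamicFrame
dyf.printSchema()
```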
We get the following output.
Now we use CodeWhisperer to create some transformation functions that can manipulate the AWS Glue DynamicFrame read earlier. We start by entering code in a new cell.
- First, test if CodeWhisperer can use the correct AWS Glue context functions like ResolveChoice:
CodeWhisperer has recommended a code snippet similar to the following:
The preceding code snippet doesn’t accurately represent the comment that we entered.
- You can apply sentence paraphrasing and simplifying by providing the following three comments. Each one has a different ask, and we use the withColumn Spark DataFrame method, which is used for casting column types:
CodeWhisperer will pick up the preceding commands and recommend the following code snippet in sequence:
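A sketch of the three comments and the resulting code (the PlayerId column name comes from the dataset):

```python
# Convert the DynamicFrame to a Spark DataFrame
df = dyf.toDF()

# Convert the PlayerId column from string to integer
df = df.withColumn("PlayerId", df["PlayerId"].cast("integer"))

# Print the schema of the DataFrame
df.printSchema()
```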
The following output confirms the PlayerId column is changed from string to integer.
- Apply the same process to the resultant AWS Glue DynamicFrame for the TotalUSDPrize column by casting it from string to long using the withColumn Spark DataFrame function, entering the following comments:
The recommended code snippet is similar to the following:
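For example:

```python
# Convert the TotalUSDPrize column from string to long
df = df.withColumn("TotalUSDPrize", df["TotalUSDPrize"].cast("long"))

# Print the schema of the DataFrame
df.printSchema()
```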
The output schema of the preceding code snippet is as follows.
Now we will try to get a recommended code snippet that calculates the average prize for each player according to their country code.
- To do so, start by getting the count of players per country:
The recommended code snippet is similar to the following:
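For example:

```python
# Get the count of players for each country code
country_code_count_df = df.groupBy("CountryCode").count()
country_code_count_df.show(5)
```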
We get the following output.
- Join the main DataFrame with the country code count DataFrame and then add a new column calculating the average highest prize for each player according to their country code:
The recommended code snippet is similar to the following:
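A sketch, with the variable names carried over from the previous step:

```python
# Join the main DataFrame with the country code count DataFrame
df = df.join(country_code_count_df, on="CountryCode", how="left")
df.printSchema()
```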
The output of the schema now confirms that both DataFrames were correctly joined and the Count column is added to the main DataFrame.
- Get a code recommendation to calculate the average TotalUSDPrize for each country code and add it to a new column:
The recommended code snippet is similar to the following:
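A sketch; we name the result country_code_sum to match the next step, and the output column name is illustrative:

```python
from pyspark.sql.functions import avg

# Calculate the average TotalUSDPrize for each country code
country_code_sum = df.groupBy("CountryCode").agg(
    avg("TotalUSDPrize").alias("AvgPrizePerCountry")
)
country_code_sum.show(5)
```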
The output of the preceding code should look like the following.
- Join the country_code_sum DataFrame with the main DataFrame from earlier and get the average of the prizes per player per country:
The recommended code snippet is similar to the following:
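For example:

```python
# Join the country_code_sum DataFrame with the main DataFrame; after the
# join, AvgPrizePerCountry holds the average prize per player per country
df = df.join(country_code_sum, on="CountryCode", how="left")
df.printSchema()
```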
- The last part in the transformation phase is to sort the data by the highest average prize per player per country:
The recommended code snippet is similar to the following:
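For example:

```python
# Sort the data by the highest average prize per player per country
df = df.orderBy(df["AvgPrizePerCountry"].desc())
df.show(5)
```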
The first five rows will be similar to the following.
For the last step, we write the DynamicFrame to Amazon S3 and to Amazon Redshift.
- Write the DynamicFrame to Amazon S3 with the following code:
The CodeWhisperer recommendation is similar to the following code snippet:
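Judging from the corrections described next, the raw suggestion likely had a shape similar to this sketch (note the empty partitionKeys and the older glueparquet format):

```python
# Convert the DataFrame back to a DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

# Write the DynamicFrame to S3 (as suggested, before corrections)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://<generated-bucket>/output/", "partitionKeys": []},
    format="glueparquet",
)
```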
We need to correct the generated code snippet because it doesn't contain partition keys. As we pointed out, partitionKeys is empty, so we can have another code block suggestion to set partitionKeys and then write the data to the target Amazon S3 location. Also, according to the newest updates related to writing DynamicFrames to Amazon S3 using glueparquet, format = "glueparquet" is no longer used. Instead, you need to use the parquet type with useGlueParquetWriter enabled.
After the updates, our code looks similar to the following:
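A sketch of the corrected write, with placeholder paths:

```python
# Write the DynamicFrame to S3 as Parquet, partitioned by CountryCode,
# with the Glue Parquet writer enabled
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://<your-bucket>/output/",
        "partitionKeys": ["CountryCode"],
    },
    format="parquet",
    format_options={"useGlueParquetWriter": True},
)
```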
Another option here would be to write the files to Amazon Redshift using a JDBC connection.
- First, enter the following command to check whether CodeWhisperer will understand a one-sentence comment and use the correct functions:
The output of the comment is similar to the following code snippet:
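A sketch of that suggestion; the selected column names are illustrative, drawn from the earlier steps:

```python
# Select only the PlayerId, CountryCode, TotalUSDPrize, and
# AvgPrizePerCountry columns to write to Amazon Redshift
selected_df = df.select("PlayerId", "CountryCode", "TotalUSDPrize", "AvgPrizePerCountry")
selected_df.printSchema()
```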
As we can see, CodeWhisperer correctly interpreted the comment by selecting only the specified columns to write to Amazon Redshift.
- Now, use CodeWhisperer to write the DynamicFrame to Amazon Redshift. We use the preactions parameter to run a SQL query to select only certain columns to be written to Amazon Redshift:
The CodeWhisperer recommendation is similar to the following code snippet:
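Based on the issues called out next, the raw suggestion likely resembled the following sketch (note the extraneous format parameter, the guessed URL, and the password parameter):

```python
# Convert the selected columns back to a DynamicFrame
selected_dyf = DynamicFrame.fromDF(selected_df, glueContext, "selected_dyf")

# Write the DynamicFrame to Amazon Redshift (as suggested, before corrections)
glueContext.write_dynamic_frame.from_options(
    frame=selected_dyf,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://<guessed-from-s3-folder>:5439/dev",
        "user": "admin",
        "password": "<password>",
        "dbtable": "public.players",
        "redshiftTmpDir": "s3://<generated-bucket>/temp/",
    },
    format="csv",
)
```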
After checking the preceding code snippet, you can observe that there is a misplaced format parameter, which you can remove. You can also add the aws_iam_role as an input in connection_options. Notice that CodeWhisperer has automatically assumed the Redshift URL to have the same name as the S3 folder that we used. Therefore, you need to change the URL and the S3 temp directory bucket to reflect your own parameters and remove the password parameter. The final code snippet should be similar to the following:
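A sketch of the final write; the endpoint, database, table, role ARN, and preactions SQL are placeholders to replace with your own values:

```python
# Write to Amazon Redshift; the preactions SQL prepares the target table
# before the load
glueContext.write_dynamic_frame.from_options(
    frame=selected_dyf,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://<cluster-endpoint>:5439/<database>",
        "user": "<username>",
        "dbtable": "public.players_prizes",
        "redshiftTmpDir": "s3://<your-bucket>/temp/",
        "aws_iam_role": "arn:aws:iam::<account-id>:role/<redshift-role>",
        "preactions": "DROP TABLE IF EXISTS public.players_prizes;",
    },
)
```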
The following is the complete set of code and comment snippets:
Conclusion
In this post, we demonstrated a real-world use case on how AWS Glue Studio notebook integration with CodeWhisperer helps you build data integration jobs faster. You can start using the AWS Glue Studio notebook with CodeWhisperer to accelerate building your data integration jobs.
To learn more about using AWS Glue Studio notebooks and CodeWhisperer, check out the following video.
About the authors
Ishan Gaur works as a Sr. Big Data Cloud Engineer (ETL) specializing in AWS Glue. He's passionate about helping customers build out scalable distributed ETL workloads and analytics pipelines on AWS.
Omar Elkharbotly is a Glue SME who works as a Big Data Cloud Support Engineer 2 (DIST). He is dedicated to helping customers resolve issues related to their ETL workloads and create scalable data processing and analytics pipelines on AWS.