AWS Big Data Blog
Explore real-world use cases for Amazon CodeWhisperer powered by AWS Glue Studio notebooks
Many customers are interested in boosting productivity in their software development lifecycle by using generative AI. Recently, AWS announced the general availability of Amazon CodeWhisperer, an AI coding companion that uses foundation models under the hood to improve software developer productivity. With Amazon CodeWhisperer, you can quickly accept the top suggestion, view more suggestions, or continue writing your own code. This integration reduces the overall time spent writing data integration and extract, transform, and load (ETL) logic. It also helps beginner-level programmers write their first lines of code. AWS Glue Studio notebooks allow you to author data integration jobs with a web-based serverless notebook interface.
In this post, we discuss real-world use cases for CodeWhisperer powered by AWS Glue Studio notebooks.
Solution overview
For this post, you use the CSV eSports Earnings dataset, available to download via Kaggle. The data is scraped from eSportsEarnings.com, which provides information on earnings of eSports players and teams. The objective is to perform transformations using an AWS Glue Studio notebook with CodeWhisperer recommendations and then write the data back to Amazon Simple Storage Service (Amazon S3) in Parquet file format as well as to Amazon Redshift.
Prerequisites
Our solution has the following prerequisites:
- Set up AWS Glue Studio.
- Configure an AWS Identity and Access Management (IAM) role to interact with CodeWhisperer. Attach the following policy to the IAM role that is attached to the AWS Glue Studio notebook (see the policy sketch after this list):
- Download the CSV eSports Earnings dataset and upload the CSV file highest_earning_players.csv to the S3 folder you will be using in this use case.
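For CodeWhisperer, the required permission is a single action. The following is a minimal sketch of the policy to attach; scope it down further if your security requirements demand it:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CodeWhispererPermissions",
            "Effect": "Allow",
            "Action": ["codewhisperer:GenerateRecommendations"],
            "Resource": "*"
        }
    ]
}
```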
Create an AWS Glue Studio notebook
Let’s get started. Create a new AWS Glue Studio notebook job by completing the following steps:
- On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
- Select Jupyter Notebook and choose Create.
- For Job name, enter CodeWhisperer-s3toJDBC.
A new notebook will be created with the sample cells as shown in the following screenshot.
We use the second cell for now, so you can remove all the other cells.
- In the second cell, update the interactive session configuration by setting the following:
- Worker type to G.1X
- Number of workers to 3
- AWS Glue version to 4.0
- Moreover, import the DynamicFrame module and the current_timestamp function as follows:
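The configured cell should be similar to the following sketch (the %-magics are standard AWS Glue interactive sessions magics, and the session setup lines come from the sample cell):

```python
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import current_timestamp

# Initialize the Glue interactive session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
```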
After you make these changes, the notebook should look like the following screenshot.
Now, let’s ensure CodeWhisperer is working as intended. At the bottom right, you will find the CodeWhisperer option beside the Glue PySpark status, as shown in the following screenshot.
You can choose CodeWhisperer to view the options to use Auto-Suggestions.
Develop your code using CodeWhisperer in an AWS Glue Studio notebook
In this section, we show how to develop an AWS Glue notebook job with Amazon S3 as a data source and a JDBC data source as a target. For our use case, we need to ensure Auto-Suggestions are enabled. Generate recommendations with CodeWhisperer using the following steps:
- Write a comment in natural language (in English) to read CSV files from your S3 bucket:
After you enter the preceding comment and press Enter, the CodeWhisperer button at the bottom of the page will show that it is running to generate the recommendation. The output of the recommendation will appear on the next line, and the code is accepted when you press Tab. You can learn more in User actions.
CodeWhisperer will generate a code snippet that is similar to the following:
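A sketch of that suggestion follows; the bucket and prefix here are placeholders rather than the values CodeWhisperer generated:

```python
# Read CSV files from S3 into a Spark DataFrame
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("s3://<your-bucket>/<prefix>/highest_earning_players.csv")

df.show(5)
```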
Note that you need to update the paths to match the S3 bucket you’re using instead of the CodeWhisperer-generated bucket.
From the preceding code snippet, CodeWhisperer used Spark DataFrames to read the CSV files.
- You can now try some rephrasing to get a suggestion with DynamicFrame functions:
Now CodeWhisperer will generate a code snippet that is close to the following:
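A sketch of that suggestion, again with placeholder paths:

```python
# Read CSV files from S3 into an AWS Glue DynamicFrame
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<your-bucket>/<prefix>/"]},
    format="csv",
    format_options={"withHeader": True},
)
```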
This rephrasing proves that, after some modifications to the comments we wrote, we got the correct recommendation from CodeWhisperer.
- Next, use CodeWhisperer to print the schema of the preceding AWS Glue DynamicFrame by using the following comment:
CodeWhisperer will generate a code snippet that is close to the following:
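For example:

```python
# Print the schema of the above DynamicFrame
dyf.printSchema()
```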
We get the following output.
Now we use CodeWhisperer to create some transformation functions that can manipulate the AWS Glue DynamicFrame read earlier. We start by entering code in a new cell.
- First, test if CodeWhisperer can use the correct AWS Glue context functions like ResolveChoice:
CodeWhisperer has recommended a code snippet similar to the following:
The preceding code snippet doesn’t accurately represent the comment that we entered.
- You can apply sentence paraphrasing and simplifying by providing the following three comments. Each one has a different ask, and we use the withColumn Spark DataFrame method, which is used for casting column types:
CodeWhisperer will pick up the preceding commands and recommend the following code snippet in sequence:
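A sketch of the three comments and the resulting code (the PlayerId column name comes from the dataset):

```python
# Convert the DynamicFrame to a Spark DataFrame
df = dyf.toDF()

# Convert the PlayerId column from string to integer
df = df.withColumn("PlayerId", df["PlayerId"].cast("integer"))

# Print the schema of the DataFrame
df.printSchema()
```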
The following output confirms the PlayerId column is changed from string to integer.
- Apply the same process to the resultant AWS Glue DynamicFrame for the TotalUSDPrize column by casting it from string to long using the withColumn Spark DataFrame function, entering the following comments:
The recommended code snippet is similar to the following:
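For example:

```python
# Convert the TotalUSDPrize column from string to long
df = df.withColumn("TotalUSDPrize", df["TotalUSDPrize"].cast("long"))

# Print the schema of the DataFrame
df.printSchema()
```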
The output schema of the preceding code snippet is as follows.
Now we will try to get a recommended code snippet that calculates the average prize for each player according to their country code.
- To do so, start by getting the count of players per country:
The recommended code snippet is similar to the following:
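For example:

```python
# Get the count of players for each country code
country_code_count_df = df.groupBy("CountryCode").count()
country_code_count_df.show(5)
```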
We get the following output.
- Join the main DataFrame with the country code count DataFrame and then add a new column calculating the average highest prize for each player according to their country code:
The recommended code snippet is similar to the following:
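A sketch, with the variable names carried over from the previous step:

```python
# Join the main DataFrame with the country code count DataFrame
df = df.join(country_code_count_df, on="CountryCode", how="left")
df.printSchema()
```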
The output of the schema now confirms that both DataFrames were correctly joined and the Count column is added to the main DataFrame.
- Get a code recommendation to calculate the average TotalUSDPrize for each country code and add it to a new column:
The recommended code snippet is similar to the following:
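A sketch; we name the result country_code_sum to match the next step, and the output column name is illustrative:

```python
from pyspark.sql.functions import avg

# Calculate the average TotalUSDPrize for each country code
country_code_sum = df.groupBy("CountryCode").agg(
    avg("TotalUSDPrize").alias("AvgPrizePerCountry")
)
country_code_sum.show(5)
```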
The output of the preceding code should look like the following.
- Join the country_code_sum DataFrame with the main DataFrame from earlier and get the average of the prizes per player per country:
The recommended code snippet is similar to the following:
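For example:

```python
# Join the country_code_sum DataFrame with the main DataFrame; after the
# join, AvgPrizePerCountry holds the average prize per player per country
df = df.join(country_code_sum, on="CountryCode", how="left")
df.printSchema()
```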
- The last part in the transformation phase is to sort the data by the highest average prize per player per country:
The recommended code snippet is similar to the following:
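For example:

```python
# Sort the data by the highest average prize per player per country
df = df.orderBy(df["AvgPrizePerCountry"].desc())
df.show(5)
```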
The first five rows will be similar to the following.
For the last step, we write the DynamicFrame to Amazon S3 and to Amazon Redshift.
- Write the DynamicFrame to Amazon S3 with the following code:
The CodeWhisperer recommendation is similar to the following code snippet:
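Judging from the corrections described next, the raw suggestion likely had a shape similar to this sketch (note the empty partitionKeys and the older glueparquet format):

```python
# Convert the DataFrame back to a DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

# Write the DynamicFrame to S3 (as suggested, before corrections)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://<generated-bucket>/output/", "partitionKeys": []},
    format="glueparquet",
)
```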
We need to correct the generated code snippet because it doesn't contain partition keys. As we pointed out, partitionKeys is empty, so we can have another code block suggestion to set partitionKeys and then write the data to the target Amazon S3 location. Also, according to the newest updates related to writing DynamicFrames to Amazon S3 using glueparquet, format = "glueparquet" is no longer used. Instead, you need to use the parquet type with useGlueParquetWriter enabled.
After the updates, our code looks similar to the following:
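A sketch of the corrected write, with placeholder paths:

```python
# Write the DynamicFrame to S3 as Parquet, partitioned by CountryCode,
# with the Glue Parquet writer enabled
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://<your-bucket>/output/",
        "partitionKeys": ["CountryCode"],
    },
    format="parquet",
    format_options={"useGlueParquetWriter": True},
)
```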
Another option here would be to write the files to Amazon Redshift using a JDBC connection.
- First, enter the following command to check whether CodeWhisperer will understand a one-sentence comment and use the correct functions:
The output of the comment is similar to the following code snippet:
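A sketch of that suggestion; the selected column names are illustrative, drawn from the earlier steps:

```python
# Select only the PlayerId, CountryCode, TotalUSDPrize, and
# AvgPrizePerCountry columns to write to Amazon Redshift
selected_df = df.select("PlayerId", "CountryCode", "TotalUSDPrize", "AvgPrizePerCountry")
selected_df.printSchema()
```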
As we can see, CodeWhisperer correctly interpreted the comment by selecting only the specified columns to write to Amazon Redshift.
- Now, use CodeWhisperer to write the DynamicFrame to Amazon Redshift. We use the preactions parameter to run a SQL query to select only certain columns to be written to Amazon Redshift:
The CodeWhisperer recommendation is similar to the following code snippet:
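Based on the issues called out next, the raw suggestion likely resembled the following sketch (note the extraneous format parameter, the guessed URL, and the password parameter):

```python
# Convert the selected columns back to a DynamicFrame
selected_dyf = DynamicFrame.fromDF(selected_df, glueContext, "selected_dyf")

# Write the DynamicFrame to Amazon Redshift (as suggested, before corrections)
glueContext.write_dynamic_frame.from_options(
    frame=selected_dyf,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://<guessed-from-s3-folder>:5439/dev",
        "user": "admin",
        "password": "<password>",
        "dbtable": "public.players",
        "redshiftTmpDir": "s3://<generated-bucket>/temp/",
    },
    format="csv",
)
```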
After checking the preceding code snippet, you can observe that there is a misplaced format parameter, which you can remove. You can also add the aws_iam_role as an input in connection_options. Notice that CodeWhisperer has automatically assumed the Redshift URL to have the same name as the S3 folder that we used. Therefore, you need to change the URL and the S3 temp directory bucket to reflect your own parameters and remove the password parameter. The final code snippet should be similar to the following:
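A sketch of the final write; the endpoint, database, table, role ARN, and preactions SQL are placeholders to replace with your own values:

```python
# Write to Amazon Redshift; the preactions SQL prepares the target table
# before the load
glueContext.write_dynamic_frame.from_options(
    frame=selected_dyf,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://<cluster-endpoint>:5439/<database>",
        "user": "<username>",
        "dbtable": "public.players_prizes",
        "redshiftTmpDir": "s3://<your-bucket>/temp/",
        "aws_iam_role": "arn:aws:iam::<account-id>:role/<redshift-role>",
        "preactions": "DROP TABLE IF EXISTS public.players_prizes;",
    },
)
```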
The following is the complete set of code and comment snippets:
Conclusion
In this post, we demonstrated a real-world use case on how AWS Glue Studio notebook integration with CodeWhisperer helps you build data integration jobs faster. You can start using the AWS Glue Studio notebook with CodeWhisperer to accelerate building your data integration jobs.
To learn more about using AWS Glue Studio notebooks and CodeWhisperer, check out the following video.
About the authors
Ishan Gaur works as a Sr. Big Data Cloud Engineer (ETL) specializing in AWS Glue. He's passionate about helping customers build out scalable distributed ETL workloads and analytics pipelines on AWS.
Omar Elkharbotly is a Glue SME who works as a Big Data Cloud Support Engineer 2 (DIST). He is dedicated to helping customers resolve issues related to their ETL workloads and create scalable data processing and analytics pipelines on AWS.