AWS Big Data Blog

Explore real-world use cases for Amazon CodeWhisperer powered by AWS Glue Studio notebooks

Many customers are interested in boosting productivity in their software development lifecycle by using generative AI. Recently, AWS announced the general availability of Amazon CodeWhisperer, an AI coding companion that uses foundation models under the hood to improve software developer productivity. With Amazon CodeWhisperer, you can quickly accept the top suggestion, view more suggestions, or continue writing your own code. This integration reduces the overall time spent writing data integration and extract, transform, and load (ETL) logic. It also helps beginner-level programmers write their first lines of code. AWS Glue Studio notebooks allow you to author data integration jobs with a web-based serverless notebook interface.

In this post, we discuss real-world use cases for CodeWhisperer powered by AWS Glue Studio notebooks.

Solution overview

For this post, you use the CSV eSports Earnings dataset, available to download via Kaggle. The data is scraped from eSportsEarnings.com, which provides information on earnings of eSports players and teams. The objective is to perform transformations using an AWS Glue Studio notebook with CodeWhisperer recommendations and then write the data back to Amazon Simple Storage Service (Amazon S3) in Parquet file format as well as to Amazon Redshift.

Prerequisites

Our solution has the following prerequisites:

  1. Set up AWS Glue Studio.
  2. Configure an AWS Identity and Access Management (IAM) role to interact with CodeWhisperer. Attach the following policy to the IAM role that is used by the AWS Glue Studio notebook:
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "CodeWhispererPermissions",
            "Effect": "Allow",
            "Action": [
                "codewhisperer:GenerateRecommendations"
            ],
            "Resource": "*"
        }]
    }
  3. Download the CSV eSports Earnings dataset and upload the CSV file highest_earning_players.csv to the S3 folder you will be using in this use case. If you prefer to script these prerequisites, see the sketch after this list.
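
The following is a minimal boto3 sketch for scripting prerequisites 2 and 3; the role name, policy name, bucket, and prefix are placeholders that you need to replace with your own values:

import json
import boto3

# Inline policy that grants the notebook role access to CodeWhisperer recommendations
codewhisperer_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "CodeWhispererPermissions",
        "Effect": "Allow",
        "Action": ["codewhisperer:GenerateRecommendations"],
        "Resource": "*"
    }]
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="<your-glue-notebook-role>",          # placeholder
    PolicyName="CodeWhispererPermissions",
    PolicyDocument=json.dumps(codewhisperer_policy),
)

# Upload the dataset CSV to the S3 folder used in this post
s3 = boto3.client("s3")
s3.upload_file(
    "highest_earning_players.csv",
    "<your-bucket>",                                # placeholder
    "<your-prefix>/highest_earning_players.csv",    # placeholder
)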

Create an AWS Glue Studio notebook

Let’s get started. Create a new AWS Glue Studio notebook job by completing the following steps:

  1. On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
  2. Select Jupyter Notebook and choose Create.
  3. For Job name, enter CodeWhisperer-s3toJDBC.

A new notebook will be created with the sample cells as shown in the following screenshot.

We use the second cell for now, so you can remove all the other cells.

  4. In the second cell, update the interactive session configuration by setting the following:
    1. Worker type to G.1X
    2. Number of workers to 3
    3. AWS Glue version to 4.0
  5. Import the DynamicFrame module and the current_timestamp function as follows:
    from pyspark.sql.functions import current_timestamp
    from awsglue.dynamicframe import DynamicFrame

After you make these changes, the notebook should look like the following screenshot.
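
For reference, after these changes the second cell contains interactive session magics and imports similar to the following sketch (the same configuration appears in the complete code listing at the end of this post):

%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import current_timestamp
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)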

Now, let’s ensure CodeWhisperer is working as intended. At the bottom right, you will find the CodeWhisperer option beside the Glue PySpark status, as shown in the following screenshot.

You can choose CodeWhisperer to view the options to use Auto-Suggestions.

Develop your code using CodeWhisperer in an AWS Glue Studio notebook

In this section, we show how to develop an AWS Glue notebook job with Amazon S3 as the data source and a JDBC data source (Amazon Redshift) as the target. For our use case, we need to ensure Auto-Suggestions are enabled. Get recommendations from CodeWhisperer using the following steps:

  1. Write a comment in natural language (in English) to read CSV files from your S3 bucket:
    # Read CSV files from S3

After you enter the preceding comment and press Enter, the CodeWhisperer button at the bottom of the page will show that it is running to generate the recommendation. The output of the recommendation appears on the next line, and the code is accepted when you press Tab. You can learn more in User actions.

After you enter the preceding comment, CodeWhisperer will generate a code snippet that is similar to the following:

df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3://<bucket>/<path>/highest_earning_players.csv"))

Note that you need to update the paths to match the S3 bucket you’re using instead of the CodeWhisperer-generated bucket.

As the preceding code snippet shows, CodeWhisperer used the Spark DataFrame API to read the CSV file.

  2. You can now try some rephrasing to get a suggestion that uses DynamicFrame functions:
    # Read CSV file from S3 with the header format option using DynamicFrame

Now CodeWhisperer will generate a code snippet that is close to the following:

dyF = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<bucket>/<path>/highest_earning_players.csv"],
        "recurse": True,
    },
    format="csv",
    format_options={
        "withHeader": True,
    },
    transformation_ctx="dyF")

Rephrasing the comment proved that, after a small modification to the wording, we got the recommendation we wanted from CodeWhisperer.

  3. Next, use CodeWhisperer to print the schema of the preceding AWS Glue DynamicFrame by using the following comment:
    # Print the schema of the above DynamicFrame

CodeWhisperer will generate a code snippet that is close to the following:

dyF.printSchema()

We get the following output.

Now we use CodeWhisperer to create some transformation functions that can manipulate the AWS Glue DynamicFrame read earlier. We start by entering code in a new cell.

  4. First, test if CodeWhisperer can use the correct AWS Glue DynamicFrame functions like resolveChoice:
    # Convert the "PlayerId" type from string to integer

CodeWhisperer has recommended a code snippet similar to the following:

dyF = dyF.resolveChoice(specs=[('PlayerId', 'cast:long')])
dyF.printSchema()

The preceding code snippet doesn't accurately reflect the comment we entered: it casts PlayerId to long instead of integer.
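
If you wanted to keep the resolveChoice approach, a manual fix might look like the following sketch (using the cast:int action so the result matches the original comment):

# Cast PlayerId to integer with resolveChoice instead of long
dyF = dyF.resolveChoice(specs=[("PlayerId", "cast:int")])
dyF.printSchema()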

  5. You can instead apply sentence paraphrasing and simplification by providing the following three comments. Each one expresses a different ask, and together they use the withColumn Spark DataFrame method, which casts column types:
    # Convert the DynamicFrame to spark data frame
    # Cast the 'PlayerId' column from string to Integer using WithColumn function
    # Convert the spark frame back to DynamicFrame and print the schema

CodeWhisperer will pick up the preceding comments and recommend the following code snippet in sequence:

df = dyF.toDF()
df = df.withColumn("PlayerId", df["PlayerId"].cast("integer"))
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

The following output confirms the PlayerId column is changed from string to integer.

  6. Apply the same process to the resultant AWS Glue DynamicFrame to cast the TotalUSDPrize column from string to long using the withColumn Spark DataFrame function, by entering the following comments:
    # Convert the dynamicFrame to Spark Frame
    # Cast the "TotalUSDPrize" column from String to long
    # Convert the spark frame back to dynamic frame and print the schema

The recommended code snippet is similar to the following:

df = dyF.toDF()
df = df.withColumn("TotalUSDPrize", df["TotalUSDPrize"].cast("long"))
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

The output schema of the preceding code snippet is as follows.

Now we ask CodeWhisperer to recommend a code snippet that calculates the average prize for each player according to their country code.

  7. To do so, start by getting the count of players per country code:
    # Get the count of each country code

The recommended code snippet is similar to the following:

country_code_count = df.groupBy("CountryCode").count()
country_code_count.show()

We get the following output.

  8. Join the main DataFrame with the country code count DataFrame, and then add a new column calculating the average highest prize for each player according to their country code:
    # Convert the DynamicFrame (dyF) to dataframe (df)
    # Join the dataframe (df) with country_code_count dataframe with respect to CountryCode column
    # Convert the spark frame back to DynamicFrame and print the schema

The recommended code snippet is similar to the following:

df = dyF.toDF()
df = df.join(country_code_count, "CountryCode")
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

The schema output now confirms that both DataFrames were joined correctly and that the count column was added to the main DataFrame.

  9. Get a code recommendation that calculates the sum of the TotalUSDPrize column per country code and renames the resulting column (we use this sum to compute the average in the next step):
    # Get the sum of all the TotalUSDPrize column per countrycode
    # Rename the sum column to be "SumPrizePerCountry" in the newly generated dataframe

The recommended code snippet is similar to the following:

country_code_sum = df.groupBy("CountryCode").sum("TotalUSDPrize")
country_code_sum = country_code_sum.withColumnRenamed("sum(TotalUSDPrize)", "SumPrizePerCountry")
country_code_sum.show()

The output of the preceding code should look like the following.

  10. Join the country_code_sum DataFrame with the main DataFrame from earlier and get the average of the prizes per player per country:
    # Join the above dataframe with the main dataframe with respect to CountryCode
    # Get the average Total prize in USD per player per country and add it to a new column called "AveragePrizePerPlayerPerCountry"

The recommended code snippet is similar to the following:

df = df.join(country_code_sum, "CountryCode")
df = df.withColumn("AveragePrizePerPlayerPerCountry", df["SumPrizePerCountry"] / df["count"])

  11. The last part of the transformation phase is to sort the data by the highest average prize per player per country:
    # sort the above dataframe descendingly according to the highest Average Prize per player country
    # Show the top 5 rows

The recommended code snippet is similar to the following:

df = df.sort(df["AveragePrizePerPlayerPerCountry"].desc())
df.show(5)

The first five rows will be similar to the following.

For the last step, we write the DynamicFrame to Amazon S3 and to Amazon Redshift.

  12. Write the DynamicFrame to Amazon S3 by entering the following comments:
    # Convert the data frame to DynamicFrame
    # Write the DynamicFrame to S3 in glueparquet format

The CodeWhisperer recommendation is similar to the following code snippet:

dyF = DynamicFrame.fromDF(df, glueContext, "dyF")

glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="s3",
    connection_options={
        "path": "s3://<enter your own>/",
        "partitionKeys": [],
    },
    format="glueparquet",
)

We need to correct the generated code snippet: the recommendation leaves partitionKeys empty, so we can either remove it or request another suggestion that sets the partition key before writing to the target Amazon S3 location. Also, according to the latest updates related to writing DynamicFrames to Amazon S3 in Parquet, format = "glueparquet" is no longer used. Instead, you need to use the parquet format with useGlueParquetWriter enabled.

After the updates, our code looks similar to the following:

dyF = DynamicFrame.fromDF(df, glueContext, "dyF")

glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="s3",
    connection_options={
        "path": "s3://<enter your own>/",
    },
    format="parquet",
    format_options={
        "useGlueParquetWriter": True,
    },
)
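
If you do want partitioned output, a possible variant adds partitionKeys back to connection_options; partitioning by CountryCode here is our own choice for illustration, not a CodeWhisperer suggestion:

glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="s3",
    connection_options={
        "path": "s3://<enter your own>/",
        "partitionKeys": ["CountryCode"],   # illustrative partition column
    },
    format="parquet",
    format_options={
        "useGlueParquetWriter": True,
    },
)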

Another option here would be to write the files to Amazon Redshift using a JDBC connection.

  13. First, enter the following comments to check whether CodeWhisperer understands them and uses the correct functions:
    # Write the DynamicFrame to Redshift
    # Select only the following 5 columns in the DynamicFrame to write to redshift: CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry

The output of the comments is similar to the following code snippet:

dyF = dyF.select_fields(["CountryCode", "TotalUSDPrize", "NameFirst", "NameLast", "AveragePrizePerPlayerPerCountry"])
glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://omarglueblog.cjzxbkzmhwqm.us-east-1.redshift.amazonaws.com:5439/omarglueblog",
        "user": "omarglueblog",
        "password": "omarglueblog",
        "dbtable": "highest_earning_players_redshift",
        "driver": "com.amazon.redshift.jdbc42.Driver",
        "preactions": "truncate table highest_earning_players_redshift",
        "redshiftTmpDir": "s3://awsgluetest/tmp/",
        "aws_iam_role": "0000000000000000000000000:role/glue-role",
    },
    format="redshift",
    transformation_ctx="dyF"
)

As we can see, CodeWhisperer correctly interpreted the comment by selecting only the specified columns to write to Amazon Redshift.

  14. Now, use CodeWhisperer to write the DynamicFrame to Amazon Redshift. We use the preactions parameter to run a SQL query that selects only certain columns to be written to Amazon Redshift:
    # Write the resultant DynamicFrame to Redshift 
    # using preaction that selects only the following columns: CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry

The CodeWhisperer recommendation is similar to the following code snippet:

glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "redshift",
    connection_options = {
        "url": "jdbc:redshift://awsgluetest.cjw8y5zdqmhz.us-east-1.redshift.amazonaws.com:5439/dev",
        "user": "awsuser",
        "password": "awsuser",
        "dbtable": "players",
        "preactions": "SELECT CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry FROM highest_earning_player",
        "redshiftTmpDir": "s3://awsgluetest/tmp/"
        },
    format = "glueparquet",
    transformation_ctx = "write_dynamic_frame")

After checking the preceding code snippet, you can observe that the format parameter is misplaced, and you can remove it. You can also add aws_iam_role as an input in connection_options. Notice as well that CodeWhisperer has automatically assumed the Redshift URL to have the same name as the S3 folder that we used. Therefore, you need to change the URL and the S3 temp directory bucket to reflect your own parameters, and remove the password parameter. The final code snippet should be similar to the following:

glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://<enter your own>.cjwjn5pzxbhx.us-east-1.redshift.amazonaws.com:5439/<enter your own>",
        "user": "<enter your own>",
        "dbtable": "<enter your own>",
        "driver": "com.amazon.redshift.jdbc42.Driver",
        "preactions": "SELECT CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry FROM <enter your table>",
        "redshiftTmpDir": "<enter your own>",
        "aws_iam_role": "<enter your own>",
    }
)

The following is the complete code with the comment snippets:

%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import current_timestamp
from awsglue.dynamicframe import DynamicFrame


sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)


# Read CSV files from S3
dyF = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<bucket>/<path>/highest_earning_players.csv"],
        "recurse": True,
    },
    format="csv",
    format_options={
        "withHeader": True,
    },
    transformation_ctx="dyF")
    
# Print the schema of the above DynamicFrame
dyF.printSchema()


# Convert the DynamicFrame to spark data frame
# Cast the 'PlayerId' column from string to Integer using WithColumn function
# Convert the spark frame back to DynamicFrame and print the schema
df = dyF.toDF()
df = df.withColumn("PlayerId", df["PlayerId"].cast("integer"))
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()


# Convert the dynamicFrame to Spark Frame
# Cast the "TotalUSDPrize" column from String to long
# Convert the spark frame back to dynamic frame and print the schema
df = dyF.toDF()
df = df.withColumn("TotalUSDPrize", df["TotalUSDPrize"].cast("long"))
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

# Get the count of each country code
country_code_count = df.groupBy("CountryCode").count()
country_code_count.show()

# Convert the DynamicFrame (dyF) to dataframe (df)
# Join the dataframe (df) with country_code_count dataframe with respect to CountryCode column
# Convert the spark frame back to DynamicFrame and print the schema
df = dyF.toDF()
df = df.join(country_code_count, "CountryCode")
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

# Get the sum of all the TotalUSDPrize column per countrycode
# Rename the sum column to be "SumPrizePerCountry"
country_code_sum = df.groupBy("CountryCode").sum("TotalUSDPrize")
country_code_sum = country_code_sum.withColumnRenamed("sum(TotalUSDPrize)", "SumPrizePerCountry")
country_code_sum.show()

# Join the above dataframe with the main dataframe with respect to CountryCode
# Get the average Total prize in USD per player per country and add it to a new column called "AveragePrizePerPlayerPerCountry"
df = df.join(country_code_sum, "CountryCode")
df = df.withColumn("AveragePrizePerPlayerPerCountry", df["SumPrizePerCountry"] / df["count"])

# sort the above dataframe descendingly according to the highest Average Prize per player country
# Show the top 5 rows
df = df.sort(df["AveragePrizePerPlayerPerCountry"].desc())
df.show(5)

# Convert the data frame to DynamicFrame
# Write the DynamicFrame to S3 in glueparquet format
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")

glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="s3",
    connection_options={
        "path": "s3://<enter your own>/",
    },
    format="parquet",
    format_options={
        "useGlueParquetWriter": True,
    },
)

# Write the resultant DynamicFrame to Redshift 
# using preaction that selects only the following columns: CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry
glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://<enter your own>.cjwjn5pzxbhx.us-east-1.redshift.amazonaws.com:5439/<enter your own>",
        "user": "<enter your own>",
        "dbtable": "<enter your own>",
        "driver": "com.amazon.redshift.jdbc42.Driver",
        "preactions": "SELECT CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry FROM <enter your table>",
        "redshiftTmpDir": "<enter your own>",
        "aws_iam_role": "<enter your own>",
    }
)

Conclusion

In this post, we demonstrated a real-world use case on how AWS Glue Studio notebook integration with CodeWhisperer helps you build data integration jobs faster. You can start using the AWS Glue Studio notebook with CodeWhisperer to accelerate building your data integration jobs.

To learn more about using AWS Glue Studio notebooks and CodeWhisperer, check out the following video.


About the authors

Ishan Gaur works as a Sr. Big Data Cloud Engineer (ETL) specializing in AWS Glue. He's passionate about helping customers build scalable distributed ETL workloads and analytics pipelines on AWS.

Omar Elkharbotly is a Glue SME who works as a Big Data Cloud Support Engineer 2 (DIST). He is dedicated to assisting customers in resolving issues related to their ETL workloads and creating scalable data processing and analytics pipelines on AWS.