AWS Big Data Blog
Add your own libraries and application dependencies to Spark and Hive on Amazon EMR Serverless with custom images
Amazon EMR Serverless allows you to run open-source big data frameworks such as Apache Spark and Apache Hive without managing clusters and servers. Many customers who run Spark and Hive applications want to add their own libraries and dependencies to the application runtime. For example, you may want to add popular open-source extensions to Spark, or add a customized encryption-decryption module that is used by your application.
We are excited to announce a new capability that allows you to customize the runtime image used in EMR Serverless by adding custom libraries that your applications need to use. This feature enables you to do the following:
- Maintain a set of version-controlled libraries that are reused and available across all your EMR Serverless jobs as part of the EMR Serverless runtime
- Add popular extensions to the open-source Spark and Hive frameworks, such as pandas, NumPy, matplotlib, and more, that you want your EMR Serverless application to use
- Use established CI/CD processes to build, test, and deploy your customized extension libraries to the EMR Serverless runtime
- Apply established security processes, such as image scanning, to meet the compliance and governance requirements within your organization
- Use a different version of a runtime component (for example, the JDK or the Python runtime) than the version that is available by default with EMR Serverless
In this post, we demonstrate how to use this new feature.
Solution overview
To use this capability, customize the EMR Serverless base image using Amazon Elastic Container Registry (Amazon ECR), which is a fully managed container registry that makes it easy for your developers to share and deploy container images. Amazon ECR eliminates the need to operate your own container repositories or worry about scaling the underlying infrastructure. After the custom image is pushed to the container registry, specify the custom image while creating your EMR Serverless applications.
The following diagram illustrates the steps involved in using custom images for your EMR Serverless applications.
In the following sections, we demonstrate using custom images with Amazon EMR Serverless to address three common use cases:
- Add popular open-source Python libraries into the EMR Serverless runtime image
- Use a different or newer version of the Java runtime for the EMR Serverless application
- Install a Prometheus agent and customize the Spark runtime to push Spark JMX metrics to Amazon Managed Service for Prometheus, and visualize the metrics in a Grafana dashboard
General prerequisites
The following are the prerequisites for using custom images with EMR Serverless. Complete these steps before proceeding:
- Create an AWS Identity and Access Management (IAM) role with IAM permissions for Amazon EMR Serverless applications, Amazon ECR permissions, and Amazon Simple Storage Service (Amazon S3) permissions for the `aws-bigdata-blog` bucket and any S3 bucket in your account where you will store the application artifacts.
- Install or upgrade to the latest AWS Command Line Interface (AWS CLI) version and install the Docker service on an Amazon Linux 2 based Amazon Elastic Compute Cloud (Amazon EC2) instance. Attach the IAM role from the previous step to this EC2 instance.
- Select a base EMR Serverless image from the following public Amazon ECR repository. Run the following commands on the EC2 instance with Docker installed to verify that you are able to pull the base image from the public repository:
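The base images are published to the public Amazon ECR gallery. A verification sketch could look like the following; the release label `emr-6.9.0` is an assumption, so substitute the release your applications target:

```shell
# Authenticate to the public ECR gallery (this login always uses us-east-1)
aws ecr-public get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin public.ecr.aws

# Pull a base EMR Serverless Spark image to confirm connectivity
docker pull public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest
```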
- Log in to Amazon ECR with the following commands and create a repository called `emr-serverless-ci-examples`, providing your AWS account ID and Region:
- Provide IAM permissions to the EMR Serverless service principal for the Amazon ECR repository:
- On the Amazon ECR console, choose Permissions under Repositories in the navigation pane.
- Choose Edit policy JSON.
- Enter the following JSON and save:
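The following is a sketch of what the repository policy can look like; it grants the EMR Serverless service principal the pull-related actions it needs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EmrServerlessCustomImageSupport",
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:DescribeImages",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}
```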
Make sure that the policy is updated on the Amazon ECR console.
For production workloads, we recommend adding a condition in the Amazon ECR policy to ensure only allowed EMR Serverless applications can get, describe, and download images from this repository. For more information, refer to Allow EMR Serverless to access the custom image repository.
In the next steps, we create and use custom images in our EMR Serverless applications for the three different use cases.
Use case 1: Run data science applications
One of the common applications of Spark on Amazon EMR is the ability to run data science and machine learning (ML) applications at scale. For large datasets, Spark includes SparkML, which offers common ML algorithms that can be used to train models in a distributed fashion. However, you often need to run many iterations of simple classifiers for hyperparameter tuning, ensembles, and multi-class solutions over small-to-medium-sized data (100,000 to 1 million records). Spark is a great engine to run multiple iterations of such classifiers in parallel. In this example, we demonstrate this use case, where we use Spark to run multiple iterations of an XGBoost model to select the best parameters. The ability to include Python dependencies in the EMR Serverless image should make it easy to make the various dependencies (`xgboost`, `sk-dist`, `pandas`, `numpy`, and so on) available for the application.
Prerequisites
The EMR Serverless job runtime IAM role should be given permissions to your S3 bucket where you will be storing your PySpark file and application logs:
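A minimal sketch of such a policy, assuming a placeholder bucket name that you replace with your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AccessToS3Artifacts",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket>",
        "arn:aws:s3:::<your-bucket>/*"
      ]
    }
  ]
}
```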
Create an image to install ML dependencies
We create a custom image from the base EMR Serverless image to install the dependencies required by the SparkML application. Create the following Dockerfile in a new directory named `datascience` on the EC2 instance that runs your Docker service:
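A sketch of the Dockerfile, assuming the `emr-6.9.0` base image; adjust the package list and versions to what your application actually needs:

```dockerfile
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Install the Python libraries the SparkML/XGBoost example depends on
RUN pip3 install boto3 pandas numpy xgboost sk-dist

# EMR Serverless requires the image to end as the hadoop user
USER hadoop:hadoop
```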
Build and push the image to the Amazon ECR repository `emr-serverless-ci-examples`, providing your AWS account ID and Region:
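Assuming the Dockerfile lives in the current directory, the build-and-push steps could look like this (replace the account ID and Region placeholders):

```shell
# Authenticate Docker to your private ECR registry
aws ecr get-login-password --region <region> | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com

# Build the image and push it with a use-case-specific tag
docker build -t <account-id>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:data-science .
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:data-science
```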
Submit your Spark application
Create an EMR Serverless application with the custom image created in the previous step:
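A hedged sketch of the CLI call; the application name and image tag are assumptions carried over from the previous step:

```shell
aws emr-serverless create-application \
  --name data-science-application \
  --release-label emr-6.9.0 \
  --type SPARK \
  --image-configuration '{"imageUri": "<account-id>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:data-science"}'
```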
Make a note of the value of `applicationId` returned by the command.
After the application is created, we’re ready to submit our job. Copy the application file to your S3 bucket:
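For example (the script name is a placeholder for your PySpark file):

```shell
aws s3 cp xgboost_example.py s3://<your-bucket>/<prefix>/xgboost_example.py
```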
Submit the Spark data science job. In the following command, provide the name of the S3 bucket and prefix where you stored your application file. Additionally, provide the `applicationId` value obtained from the `create-application` command and your EMR Serverless job runtime IAM role ARN.
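The submission could be sketched as follows; the entry point path is a placeholder:

```shell
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-runtime-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<your-bucket>/<prefix>/xgboost_example.py"
    }
  }'
```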
After the Spark job succeeds, you can view the best model estimates from our application by viewing the Spark driver's `stdout` logs. Navigate to the Spark History Server, then choose Executors, Driver, Logs, stdout.
Use case 2: Use a custom Java runtime environment
Another use case for custom images is the ability to use a custom Java version for your EMR Serverless applications. For example, if you use Java 11 to compile and package your Java or Scala applications and try to run them directly on EMR Serverless, you may encounter runtime errors because EMR Serverless uses the Java 8 JRE by default. To make the runtime environments of your EMR Serverless applications compatible with your compile environment, you can use the custom images feature to install the Java version you are using to package your applications.
Prerequisites
An EMR Serverless job runtime IAM role should be given permissions to your S3 bucket where you will be storing your application JAR and logs:
Create an image to install a custom Java version
We first create an image that will install a Java 11 runtime environment. Create the following Dockerfile in your EC2 instance inside a new directory named `customjre`:
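A sketch assuming the Amazon Linux 2 based `emr-6.9.0` base image; the Corretto package name is an assumption you should verify against the base image's package repositories:

```dockerfile
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Install the Amazon Corretto 11 JDK (package name may vary by base image)
RUN yum install -y java-11-amazon-corretto

USER hadoop:hadoop
```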
Build and push the image to the Amazon ECR repository `emr-serverless-ci-examples`, providing your AWS account ID and Region:
Submit your Spark application
Create an EMR Serverless application with the custom image created in the previous step:
Copy the application JAR to your S3 bucket:
Submit a Spark Scala job that was compiled with the Java 11 JRE. This job also uses Java APIs that may produce different results for different versions of Java (for example, `java.time.ZoneId`). In the following command, provide the name of the S3 bucket and prefix where you stored your application JAR. Additionally, provide the `applicationId` value obtained from the `create-application` command and your EMR Serverless job runtime role ARN with the IAM permissions mentioned in the prerequisites. Note that in the `sparkSubmitParameters`, we pass a custom Java version for our Spark driver and executor environments to instruct our job to use the Java 11 runtime.
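A sketch of the submission; the main class, JAR path, and the Corretto `JAVA_HOME` path are assumptions to adapt to your build and image:

```shell
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-runtime-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<your-bucket>/<prefix>/java-demo.jar",
      "sparkSubmitParameters": "--class com.example.JavaDemo --conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64 --conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64"
    }
  }'
```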
You can also extend this use case to install and use a custom Python version for your PySpark applications.
Use case 3: Monitor Spark metrics in a single Grafana dashboard
Spark JMX telemetry provides a lot of fine-grained details about every stage of the Spark application, even at the JVM level. These insights can be used to tune and optimize the Spark applications to reduce job runtime and cost. Prometheus is a popular tool used for collecting, querying, and visualizing application and host metrics of several different processes. After the metrics are collected in Prometheus, we can query these metrics or use Grafana to build dashboards and visualize them. In this use case, we use Amazon Managed Prometheus to gather Spark driver and executor metrics from our EMR Serverless Spark application, and we use Grafana to visualize the collected metrics. The following screenshot is an example Grafana dashboard for an EMR Serverless Spark application.
Prerequisites
Complete the following prerequisite steps:
- Create a VPC, private subnet, and security group. The private subnet should have a NAT gateway or VPC S3 endpoint attached. The security group should allow outbound access on HTTPS port 443 and should have a self-referencing inbound rule for all traffic. Both the private subnet and the security group should be associated with the two Amazon Managed Prometheus VPC endpoint interfaces.
- On the Amazon Virtual Private Cloud (Amazon VPC) console, create two interface endpoints: one for Amazon Managed Prometheus and one for the Amazon Managed Prometheus workspace. Associate the VPC, private subnet, and security group with both endpoints. Optionally, provide a name tag for your endpoints and leave everything else as default.
- Create a new workspace on the Amazon Managed Prometheus console.
- Note the ARN and the values for Endpoint – remote write URL and Endpoint – query URL.
- Attach the following policy to your Amazon EMR Serverless job runtime IAM role to provide remote write access to your Prometheus workspace. Replace the ARN copied from the previous step in the `Resource` section of `"Sid": "AccessToPrometheus"`. This role should also have permissions to your S3 bucket where you will be storing your application JAR and logs.
- Create an IAM user or role with permissions to create and query the Amazon Managed Prometheus workspace. We use the same IAM user or role to authenticate in Grafana or query the Prometheus workspace.
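The remote write policy referenced in the first step above might look like the following sketch; replace the `Resource` value with your workspace ARN:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AccessToPrometheus",
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite"
      ],
      "Resource": "arn:aws:aps:<region>:<account-id>:workspace/<workspace-id>"
    }
  ]
}
```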
Create an image to install the Prometheus agent
We create a custom image from the base EMR Serverless image to do the following:
- Update the Spark metrics configuration to use `PrometheusServlet` to publish driver and executor JMX metrics in Prometheus format
- Download and install the Prometheus agent
- Upload the configuration YAML file to instruct the Prometheus agent to send the metrics to the Amazon Managed Prometheus workspace
Create the Prometheus config YAML file to scrape the driver, executor, and application metrics. You can run the following example commands on the EC2 instance.
- Copy the `prometheus.yaml` file from our S3 path:
- Modify `prometheus.yaml` to replace the Region and the value of the `remote_write` URL with the remote write URL obtained from the prerequisites:
- Upload the file to your own S3 bucket:
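The three steps above could be sketched as follows; the source S3 path and the token replaced by `sed` depend on the sample file, so treat both as placeholders:

```shell
# Copy the sample Prometheus config locally
aws s3 cp s3://<sample-bucket>/<prefix>/prometheus.yaml .

# Point remote_write at your Amazon Managed Prometheus workspace
sed -i 's|<AMP_REMOTE_WRITE_URL>|https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write|' prometheus.yaml

# Upload the modified config to your own bucket
aws s3 cp prometheus.yaml s3://<your-bucket>/<prefix>/prometheus.yaml
```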
- Create the following Dockerfile inside a new directory named `prometheus` on the same EC2 instance that runs the Docker service. Provide the S3 path where you uploaded the `prometheus.yaml` file.
- Build the Dockerfile and push to Amazon ECR, providing your AWS account ID and Region:
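A sketch of such a Dockerfile; the Prometheus release, the file locations, and the Spark config path inside the image are assumptions to verify against your base image. The build and push commands mirror those in use case 1, with a different tag (for example, `prometheus`):

```dockerfile
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Download and unpack the Prometheus agent (the version pinned here is an assumption)
RUN yum install -y wget tar gzip && \
    wget -q https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz && \
    tar -xzf prometheus-2.37.0.linux-amd64.tar.gz -C /opt && \
    rm prometheus-2.37.0.linux-amd64.tar.gz

# Fetch the scrape/remote-write config uploaded in the previous step
RUN aws s3 cp s3://<your-bucket>/<prefix>/prometheus.yaml /opt/prometheus-2.37.0.linux-amd64/prometheus.yaml

# Enable PrometheusServlet so Spark exposes driver and executor JMX metrics
# (the metrics.properties location is an assumption for this image)
RUN echo "*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet" >> /etc/spark/conf/metrics.properties && \
    echo "*.sink.prometheusServlet.path=/metrics/prometheus" >> /etc/spark/conf/metrics.properties

USER hadoop:hadoop
```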
Submit the Spark application
After the Docker image has been pushed successfully, you can create the serverless Spark application with the custom image you created. We use the AWS CLI to submit Spark jobs with the custom image on EMR Serverless. Upgrade your AWS CLI to the latest version before running the following commands.
- In the following AWS CLI command, provide your AWS account ID and Region. Additionally, provide the subnet and security group from the prerequisites in the network configuration. In order to successfully push metrics from EMR Serverless to Amazon Managed Prometheus, make sure that you are using the same VPC, subnet, and security group you created based on the prerequisites.
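A sketch of the call, with placeholders for the image URI, subnet, and security group:

```shell
aws emr-serverless create-application \
  --name spark-prometheus-application \
  --release-label emr-6.9.0 \
  --type SPARK \
  --image-configuration '{"imageUri": "<account-id>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:prometheus"}' \
  --network-configuration '{"subnetIds": ["<private-subnet-id>"], "securityGroupIds": ["<security-group-id>"]}'
```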
- Copy the application JAR to your S3 bucket:
- In the following command, provide the name of the S3 bucket and prefix where you stored your application JAR. Additionally, provide the
applicationId
value obtained from thecreate-application
command and your EMR Serverless job runtime IAM role ARN from the prerequisites, with permissions to write to the Amazon Managed Prometheus workspace.
Inside this Spark application, we run a bash script in the image to start the Prometheus process. If you plan to use this image to monitor your own Spark application, add the following lines to your Spark code after initiating the Spark session:
For PySpark applications, you can use the following code:
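A minimal PySpark sketch; the start script path is a placeholder for wherever your custom image stores the Prometheus launcher:

```python
import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prometheus-monitoring-example").getOrCreate()

# Launch the Prometheus agent baked into the custom image as a background
# process so it scrapes the PrometheusServlet endpoints and pushes to AMP
subprocess.Popen(["bash", "/home/hadoop/start-prometheus-agent.sh"])

# ... your Spark application logic follows ...
```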
Query Prometheus metrics and visualize in Grafana
About a minute after the job changes to `Running` status, you can query Prometheus metrics using awscurl.
- Replace the value of `AMP_QUERY_ENDPOINT` with the query URL you noted earlier, and provide the job run ID obtained after submitting the Spark job. Make sure that you're using the credentials of an IAM user or role that has permissions to query the Prometheus workspace before running the commands. The following is example output from the query:
- Install Grafana on your local desktop and configure your AMP workspace as a data source. Grafana is a commonly used platform for visualizing Prometheus metrics.
- Before we start the Grafana server, enable AWS SIGv4 authentication in order to sign queries to AMP with IAM permissions.
- In the same session, start the Grafana server. Note that the Grafana installation path may vary based on your OS configuration; modify the command to start the Grafana server if your installation path is different from `/usr/local/`. Also, make sure that you're using the credentials of an IAM user or role that has permissions to query the Prometheus workspace before running the following commands.
- Log in to Grafana and go to the data sources configuration page (/datasources) to add your AMP workspace as a data source. The URL should be without the `/api/v1/query` at the end. Enable `SigV4 auth`, then choose the appropriate Region and save.
When you explore the saved data source, you can see the metrics from the application we just submitted.
You can now visualize these metrics and create elaborate dashboards in Grafana.
Clean up
When you’re done running the examples, clean up the resources. You can use the following script to delete resources created in EMR Serverless, Amazon Managed Prometheus, and Amazon ECR. Pass the Region and optionally the Amazon Managed Prometheus workspace ID as arguments to the script. Note that this script will not remove EMR Serverless applications in `Running` status.
Conclusion
In this post, you learned how to use custom images with Amazon EMR Serverless to address some common use cases. For more information on how to build custom images or view sample Dockerfiles, see Customizing the EMR Serverless image and Custom Image Samples.
About the Author
Veena Vasudevan is a Senior Partner Solutions Architect and an Amazon EMR specialist at AWS focusing on Big Data and Analytics. She helps customers and partners build highly optimized, scalable, and secure solutions; modernize their architectures; and migrate their Big Data workloads to AWS.