AWS Big Data Blog
Orchestrate big data workflows with Apache Airflow, Genie, and Amazon EMR: Part 2
December 2023: This post was reviewed, and we recommend referring to the following blog posts for more recent and relevant guidance on orchestrating big data workloads:
- Orchestrate big data jobs on on-premises clusters with AWS Step Functions
- Orchestrate Amazon EMR Serverless jobs with AWS Step Functions
- Prepare, transform, and orchestrate your data using AWS Glue DataBrew, AWS Glue ETL, and AWS Step Functions
- Simplify AWS Glue job orchestration and monitoring with Amazon MWAA
Large enterprises running big data ETL workflows on AWS operate at a scale that serves many internal end users and runs thousands of concurrent pipelines. This scale, together with the continuous need to update the platform to keep up with new big data processing frameworks and their latest releases, requires an efficient architecture and organizational structure that both simplifies management of the big data platform and promotes easy access to big data applications.
In Part 1 of this post series, you learned how to use Apache Airflow, Genie, and Amazon EMR to manage big data workflows.
This post guides you through deploying the AWS CloudFormation templates, configuring Genie, and running an example workflow authored in Apache Airflow.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
Solution overview
This solution uses an AWS CloudFormation template to create the necessary resources.
Users access the Apache Airflow Web UI and the Genie Web UI via SSH tunnel to the bastion host.
The Apache Airflow deployment uses Amazon ElastiCache for Redis as a Celery backend, Amazon EFS as a mount point to store DAGs, and Amazon RDS PostgreSQL for database services.
Genie uses Apache Zookeeper for leader election, an Amazon S3 bucket to store configurations (binaries, application dependencies, cluster metadata), and Amazon RDS PostgreSQL for database services. Genie submits jobs to an Amazon EMR cluster.
The architecture in this post is for demo purposes. In a production environment, the Apache Airflow and Genie instances should be part of an Auto Scaling group. For more information, see Deployment in the Genie Reference Guide.
The following diagram illustrates the solution architecture.
Creating and storing admin passwords in AWS Systems Manager Parameter Store
This solution uses AWS Systems Manager Parameter Store to store the passwords used in the configuration scripts. With AWS Systems Manager Parameter Store, you can create secure string parameters, which are parameters that have a plaintext parameter name and an encrypted parameter value. Parameter Store uses AWS KMS to encrypt and decrypt the parameter values of secure string parameters.
Before deploying the AWS CloudFormation templates, create AWS Systems Manager Parameter Store parameters that store the passwords for the RDS master user, the Airflow DB administrator, and the Genie DB administrator. You can do this with the AWS CLI or an AWS SDK.
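For example, here is a minimal boto3 sketch of this step. The parameter names are placeholders; use the names that the CloudFormation templates and configuration scripts expect.

```python
# Sketch: create SecureString parameters for the three admin passwords.
# The parameter names below are placeholders, not the names the templates require.
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")  # adjust the Region as needed

passwords = {
    "/GenieStack/rds/master-password": "REPLACE_WITH_RDS_MASTER_PASSWORD",
    "/GenieStack/rds/airflow-db-password": "REPLACE_WITH_AIRFLOW_DB_PASSWORD",
    "/GenieStack/rds/genie-db-password": "REPLACE_WITH_GENIE_DB_PASSWORD",
}

for name, value in passwords.items():
    ssm.put_parameter(Name=name, Value=value, Type="SecureString", Overwrite=True)
```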
Creating an Amazon S3 Bucket for the solution and uploading the solution artifacts to S3
This solution uses Amazon S3 to store all artifacts used in the solution. Before deploying the AWS CloudFormation templates, create an Amazon S3 bucket and download the artifacts required by the solution from this link.
Unzip the artifacts required by the solution and upload the airflow and genie directories to the Amazon S3 bucket you just created. Keep a record of the Amazon S3 root path because you add it as a parameter to the AWS CloudFormation template later.
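If you prefer to script the upload rather than use the console, the following boto3 sketch walks the unzipped airflow and genie directories and uploads their contents under the same prefixes. The bucket name and local paths are illustrative.

```python
# Sketch: upload the unzipped airflow/ and genie/ directories to the solution bucket.
import os
import boto3

s3 = boto3.client("s3")
bucket = "geniestackbucket"  # replace with the name of the bucket you created

for top_level in ("airflow", "genie"):
    for root, _, files in os.walk(top_level):
        for file_name in files:
            local_path = os.path.join(root, file_name)
            # Keep the directory structure as the S3 key, for example genie/scripts/...
            s3.upload_file(local_path, bucket, local_path.replace(os.sep, "/"))
```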
As an example, the following screenshot uses the root location geniestackbucket.
Use the value of the Amazon S3 bucket you created for the AWS CloudFormation parameters GenieS3BucketLocation and AirflowBucketLocation.
Deploying the AWS CloudFormation stack
To launch the entire solution, choose Launch Stack.
The following table explains the parameters that the template requires. You can accept the default values for any parameters not in the table. For the full list of parameters, see the AWS CloudFormation template.
| Category | Parameter | Value |
| --- | --- | --- |
| Location of the configuration artifacts | GenieS3BucketLocation | The S3 bucket with the Genie artifacts and Genie’s installation scripts. For example: geniestackbucket. |
| Location of the configuration artifacts | AirflowBucketLocation | The S3 bucket with the Airflow artifacts. For example: geniestackbucket. |
| Networking | SSHLocation | The IP address range to SSH to the Genie, Apache Zookeeper, and Apache Airflow EC2 instances. |
| Security | BastionKeyName | An existing EC2 key pair to enable SSH access to the bastion host instance. |
| Security | AirflowKeyName | An existing EC2 key pair to enable SSH access to the Apache Airflow instance. |
| Security | ZKKeyName | An existing EC2 key pair to enable SSH access to the Apache Zookeeper instance. |
| Security | GenieKeyName | An existing EC2 key pair to enable SSH access to the Genie instance. |
| Security | EMRKeyName | An existing Amazon EC2 key pair to enable SSH access to the Amazon EMR cluster. |
| Logging | emrLogUri | The S3 location to store Amazon EMR cluster logs. For example: s3://replace-with-your-bucket-name/emrlogs/ |
Post-deployment steps
To access the Apache Airflow and Genie web interfaces, set up an SSH tunnel and configure a SOCKS proxy for your browser. Complete the following steps:
- On the AWS CloudFormation console, choose the stack you created.
- Choose the Outputs tab.
- Find the public DNS of the bastion host instance. The following screenshot shows the instance this post uses.
- Set up an SSH tunnel to the master node using dynamic port forwarding.
Instead of using the master public DNS name of your cluster and the username hadoop, which the walkthrough references, use the public DNS of the bastion host instance and replace the user hadoop with the user ec2-user.
- Configure the proxy settings to view websites hosted on the master node.
You do not need to modify any of the steps in the walkthrough.
This process configures a SOCKS proxy management tool that allows you to automatically filter URLs based on text patterns and limit the proxy settings to domains that match the form of the Amazon EC2 instance’s public DNS name.
Accessing the Web UI for Apache Airflow and Genie
To access the Web UI for Apache Airflow and Genie, complete the following steps:
- On the CloudFormation console, choose the stack you created.
- Choose the Outputs tab.
- Find the URLs for the Apache Airflow and Genie Web UI. The following screenshot shows the URLs that this post uses.
- Open two tabs in your web browser. You will use the tabs for the Apache Airflow UI and the Genie UI.
- For the FoxyProxy configuration you set up previously, choose the FoxyProxy icon added to the top-right section of your browser and choose Use proxies based on their predefined patterns and priorities. The following screenshot shows the proxy options.
- Enter the URL for the Apache Airflow Web UI and for the Genie Web UI on their respective tabs.
You are now ready to run a workflow in this solution.
Preparing application resources
The first step as a platform admin engineer is to prepare the binaries and configurations of the big data applications that the platform supports. In this post, the Amazon EMR clusters use release 5.26.0. Because Amazon EMR release 5.26.0 has Hadoop 2.8.5 and Spark 2.4.3 installed, those are the applications supported in the big data platform. If you decide to use a different EMR release, prepare binaries and configurations for the versions it includes; the following sections guide you through the steps involved.
To prepare a Genie application resource, create a YAML file with fields that are sent to Genie in a request to create an application resource.
This file defines metadata information about the application, such as the application name, type, version, tags, the location on S3 of the setup script, and the location of the application binaries. For more information, see Create an Application in the Genie REST API Guide.
Tag structure for application resources
This post uses the following tags for application resources:
- type – The application type, such as Spark, Hadoop, Hive, Sqoop, or Presto.
- version – The version of the application, such as 2.8.5 for Hadoop.
The next section shows how the tags are defined in the YAML file for an application resource. You can add an arbitrary number of tags to associate with Genie resources. Genie also maintains its own tags in addition to the ones the platform admins define, which you can see in the ID and name fields of the files.
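As an illustration of this structure, here is a hedged sketch of how an application resource definition for Hadoop 2.8.5 could be built and written out as YAML. The field names follow the description above; the setup-script file name and other values are illustrative, and the files generated by the automation in this post are the authoritative versions.

```python
# Sketch: build an application resource definition and write it as YAML.
# Values are illustrative; the automation in this post generates the real files.
import yaml  # PyYAML

hadoop_application = {
    "name": "hadoop-2.8.5",
    "type": "hadoop",    # surfaces as the type tag
    "version": "2.8.5",  # surfaces as the version tag
    "tags": ["type:hadoop", "version:2.8.5"],
    # Setup script run before the application is used (file name is illustrative).
    "setupFile": "s3://Your_Bucket_Name/genie/applications/hadoop-2.8.5/setup.sh",
    # Application binaries uploaded in the steps described below.
    "dependencies": [
        "s3://Your_Bucket_Name/genie/applications/hadoop-2.8.5/hadoop-2.8.5.tar.gz"
    ],
}

with open("hadoop-2.8.5.yml", "w") as yaml_file:
    yaml.safe_dump(hadoop_application, yaml_file, default_flow_style=False)
```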
Preparing the Hadoop 2.8.5 application resource
This post automates the creation of the YAML file. The file is available directly at s3://Your_Bucket_Name/genie/applications/hadoop-2.8.5/hadoop-2.8.5.yml.
NOTE: The following steps are for reference only, in case you complete this step manually rather than using the automation provided.
The S3 objects referenced by the setupFile and dependencies properties are available in your S3 bucket. For your reference, the steps to prepare these artifacts are as follows:
- Download hadoop-2.8.5.tar.gz from https://archive.apache.org/dist/hadoop/core/hadoop-2.8.5/.
- Upload hadoop-2.8.5.tar.gz to s3://Your_Bucket_Name/genie/applications/hadoop-2.8.5/.
Preparing the Spark 2.4.3 application resource
This post automates the creation of the YAML file. The file is available directly at s3://Your_Bucket_Name/genie/applications/spark-2.4.3/spark-2.4.3.yml.
NOTE: The following steps are for reference only, in case you complete this step manually rather than using the automation provided.
The objects referenced by the setupFile and dependencies properties are available in your S3 bucket. For your reference, the steps to prepare these artifacts are as follows:
- Download spark-2.4.3-bin-hadoop2.7.tgz from https://archive.apache.org/dist/spark/spark-2.4.3/.
- Upload spark-2.4.3-bin-hadoop2.7.tgz to s3://Your_Bucket_Name/genie/applications/spark-2.4.3/.
Because spark-2.4.3-bin-hadoop2.7.tgz uses Hadoop 2.7 and not Hadoop 2.8.5, you need to extract the EMRFS libraries for Hadoop 2.7 from an EMR cluster running Hadoop 2.7 (release 5.11.3). These libraries are already available in your S3 bucket. For reference, the steps to extract the EMRFS libraries are as follows:
- Deploy an EMR cluster with release 5.11.3.
- Run the following command:
Preparing a command resource
The next step as a platform admin engineer is to prepare the Genie commands that the platform supports.
In this post, the workflows use Apache Spark. This section shows the steps to prepare a command resource of type Apache Spark.
To prepare a Genie command resource, create a YAML file with fields that are sent to Genie in a request to create a command resource.
This file defines metadata information about the command, such as the command name, type, version, tags, the location on S3 of the setup script, and the parameters to use during command execution. For more information, see Create a Command in the Genie REST API Guide.
Tag structure for command resources
This post uses the following tag structure for command resources:
- type – The command type, for example, spark-submit.
- version – The version of the command, for example, 2.4.3 for Spark.
The next section shows how the tags are defined in the YAML file for a command resource. Genie also maintains its own tags in addition to the ones the platform admins define, which you can see in the ID and name fields of the files.
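Analogously, here is a hedged sketch of a spark-submit command resource definition. The executable and setup-script values are assumptions; the file generated by the automation in this post is the authoritative version.

```python
# Sketch: build a spark-submit command resource definition and write it as YAML.
# Values are illustrative; the automation in this post generates the real file.
import yaml  # PyYAML

spark_submit_command = {
    "name": "spark-2.4.3_spark-submit",
    "type": "spark-submit",  # surfaces as the type tag
    "version": "2.4.3",      # surfaces as the version tag
    "tags": ["type:spark-submit", "version:2.4.3"],
    # Setup script run before the command executes (file name is illustrative).
    "setupFile": "s3://Your_Bucket_Name/genie/commands/spark-2.4.3_spark-submit/setup.sh",
    "executable": "spark-submit",  # what Genie invokes when the command runs
}

with open("spark-2.4.3_spark-submit.yml", "w") as yaml_file:
    yaml.safe_dump(spark_submit_command, yaml_file, default_flow_style=False)
```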
Preparing the spark-submit command resource
This post automates the creation of the YAML file. The file is available at s3://Your_Bucket_Name/genie/commands/spark-2.4.3_spark-submit/spark-2.4.3_spark-submit.yml.
The object referenced by setupFile is available in your S3 bucket.
Preparing cluster resources
This post also automates the step to prepare cluster resources; it follows a process similar to the one described previously, applied to cluster resources.
During the startup of the Amazon EMR cluster, a custom script creates a YAML file with the metadata details about the cluster and uploads the file to S3. For more information, see Create a Cluster in the Genie REST API Guide.
The script also extracts all Amazon EMR libraries and uploads them to S3. The next section discusses the process of registering clusters with Genie.
The script is available at s3://Your_Bucket_Name/genie/scripts/genie_register_cluster.sh.
Tag structure for cluster resources
This post uses the following tag structure for cluster resources:
- cluster.release – The Amazon EMR release name. For example, emr-5.26.0.
- cluster.id – The Amazon EMR cluster ID. For example, j-xxxxxxxx.
- cluster.name – The Amazon EMR cluster name.
- cluster.role – The role associated with this cluster. For this post, the role is batch. Other possible roles would be ad hoc or Presto, for example.
You can add new tags for a cluster resource or change the values of existing tags by editing s3://Your_Bucket_Name/genie/scripts/genie_register_cluster.sh.
You could also use other combinations of tags, such as a tag to identify the application lifecycle environment or required custom jars.
Genie also maintains its own tags in addition to the ones the platform admins define, which you can see in the ID and name fields of the files. If multiple clusters share the same tag, Genie by default distributes jobs randomly across the clusters associated with that tag. For more information, see Cluster Load Balancing in the Genie Reference Guide.
Registering resources with Genie
Up to this point, all the configuration activities described in the previous sections have been prepared for you.
The following sections show how to register resources with Genie. In these sections, you connect to the bastion host via SSH to run configuration commands.
Registering application resources
To register the application resources you prepared in the previous section, SSH into the bastion host and run the following command:
To see the resource information, navigate to the Genie Web UI and choose the Applications tab. The following screenshot shows two application resources: one for Apache Spark (version 2.4.3) and one for Apache Hadoop (version 2.8.5).
Registering commands and associating commands with applications
The next step is to register the Genie command resources with specific applications. For this post, because spark-submit depends on Apache Hadoop and Apache Spark, associate the spark-submit command with both applications.
The order in which you define the applications in the file genie_register_command_resources_and_associate_applications.py is important. Because Apache Spark depends on Apache Hadoop, the file references Apache Hadoop first and then Apache Spark.
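A hedged sketch of what that ordering might look like is shown below; the variable and identifier names are illustrative, and the actual code lives in genie_register_command_resources_and_associate_applications.py.

```python
# Sketch: applications associated with the spark-submit command, in dependency order.
# Apache Hadoop is listed first because Apache Spark depends on it.
spark_submit_applications = [
    "hadoop-2.8.5",  # referenced first
    "spark-2.4.3",   # referenced second
]
```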
To register the command resources and associate them with the application resources registered in the previous step, SSH into the bastion host and run the following command:
To see the command you registered plus the applications it is linked to, navigate to the Genie Web UI and choose the Commands tab.
The following screenshot shows the command details and the applications it is linked to.
Registering Amazon EMR clusters
As previously mentioned, the Amazon EMR cluster deployed in this solution registers itself with Genie when the cluster starts, via an Amazon EMR step. You can access the script that Amazon EMR clusters use at s3://Your_Bucket_Name/genie/scripts/genie_register_cluster.sh. The script also automates deregistering the cluster from Genie when the cluster terminates.
In the Genie Web UI, choose the Clusters tab. This page shows you the current cluster resources. You can also find the location of the configuration files that were uploaded to the cluster's S3 location during the registration step.
The following screenshot shows the cluster details and the location of configuration files (yarn-site.xml, core-site.xml, mapred-site.xml).
Linking commands to clusters
You have now registered all applications, commands, and clusters, and associated commands with the applications on which they depend. The final step is to link a command to a specific Amazon EMR cluster that is configured to run it.
Complete the following steps:
- SSH into the bastion host.
- Open /tmp/genie_assets/scripts/genie_link_commands_to_clusters.py with your preferred text editor.
- Look for the following lines in the code:
# Change cluster_name below
clusters = [{'cluster_name' : 'j-xxxxxxxx', 'commands' :
['spark-2.4.3_spark-submit']}]
- Replace j-xxxxxxxx in the file with the name of your cluster. To see the name of the cluster, navigate to the Genie Web UI and choose Clusters.
- To link the command to a specific Amazon EMR cluster, run the following command:
The command is now linked to a cluster.
In the Genie Web UI, choose the Commands tab. This page shows you the current command resources. Select spark-2.4.3_spark-submit and see the clusters associated with the command.
The following screenshot shows the command details and the clusters it is linked to.
You have configured Genie with all resources; it can now receive job requests.
Running an Apache Airflow workflow
It is outside the scope of this post to provide a detailed description of the workflow code and dataset. This section provides details of how Apache Airflow submits jobs to Genie via the GenieOperator that this post provides.
The GenieOperator for Apache Airflow
The GenieOperator allows the data engineer to define the combination of tags to identify the commands and the clusters in which the tasks should run.
In the following code example, the cluster tag is emr.cluster.role:batch and the command tags are type:spark-submit and version:2.4.3.
The property command_arguments defines the arguments to the spark-submit command, and dependencies defines the location of the code for the Apache Spark application (PySpark).
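A minimal sketch of such a task definition is shown below. The GenieOperator's exact signature is defined in genie_plugin.py, so treat the import path, the parameter names other than command_arguments and dependencies, and the file names as assumptions.

```python
# Sketch of a GenieOperator task; parameter names and file names are assumptions
# except where described in the text above.
from genie_plugin import GenieOperator  # import path depends on how the plugin is installed

transform_to_parquet_movies = GenieOperator(
    task_id="transform_to_parquet_movies",
    # Tags that select the cluster and the command registered with Genie.
    cluster_tags=["emr.cluster.role:batch"],
    command_tags=["type:spark-submit", "version:2.4.3"],
    # Arguments passed to spark-submit: the PySpark script and its input/output locations.
    command_arguments=(
        "transform_to_parquet.py "
        "s3://Your_Bucket_Name/airflow/demo/input/csv/movies.csv "
        "s3://Your_Bucket_Name/airflow/demo/output/parquet/movies/"
    ),
    # Location of the PySpark application code that Genie stages for the job.
    dependencies="s3://Your_Bucket_Name/airflow/demo/scripts/transform_to_parquet.py",
    # Connection to Genie created during the Airflow setup (may instead come from default_args).
    genie_conn_id="genie_conn_id",
    dag=dag,  # assumes the DAG object is defined elsewhere in the file
)
```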
You can find the code for the GenieOperator in the following location: s3://Your_Bucket_Name/airflow/plugins/genie_plugin.py.
One of the arguments to the DAG is the Genie connection ID (genie_conn_id). This connection was created during the automated setup of the Apache Airflow instance. To see this and other existing connections, complete the following steps:
- In the Apache Airflow Web UI, choose the Admin menu.
- Choose Connections.
The following screenshot shows the connection details.
The Airflow variable s3_location_genie_demo referenced in the DAG was set during the installation process. To see all configured Apache Airflow variables, complete the following steps:
- In the Apache Airflow Web UI, choose the Admin menu.
- Choose Variables.
The following screenshot shows the Variables page.
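Inside the DAG code, that variable is typically read with Airflow's Variable API; a minimal sketch:

```python
# Sketch: read the S3 location configured during installation from an Airflow variable.
from airflow.models import Variable

s3_location = Variable.get("s3_location_genie_demo")
```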
Triggering the workflow
You can now trigger the execution of the movie_lens_transfomer_to_parquet DAG. Complete the following steps:
- In the Apache Airflow Web UI, choose the DAGs tab.
- Next to your DAG, change Off to On.
The following screenshot shows the DAGs page.
For this example DAG, this post uses a small subset of the MovieLens dataset. This dataset is a popular open source dataset that you can use to explore data science algorithms. Each dataset file is a comma-separated values (CSV) file with a single header row. All files are available in your solution S3 bucket under s3://Your_Bucket_Name/airflow/demo/input/csv.
movie_lens_transfomer_to_parquet is a simple workflow that triggers a Spark job that converts input files from CSV to Parquet.
The following screenshot shows a graphical representation of the DAG.
In this example DAG, after transform_to_parquet_movies concludes, you can potentially execute four tasks in parallel. Because the DAG concurrency is set to 3, as seen in the following code example, only three tasks can run at the same time.
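A hedged sketch of a DAG declaration with that concurrency setting is shown below; all arguments other than concurrency are illustrative.

```python
# Sketch: limit this DAG to three concurrently running tasks.
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="movie_lens_transfomer_to_parquet",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,  # triggered manually from the Airflow UI
    concurrency=3,           # at most three tasks of this DAG run at the same time
)
```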
Visiting the Genie job UI
The GenieOperator for Apache Airflow submitted the jobs to Genie. To see job details, in the Genie Web UI, choose the Jobs tab. You can see details such as the jobs submitted, their arguments, the cluster each job runs on, and the job status.
The following screenshot shows the Jobs page.
You can now experiment with this architecture by provisioning a new Amazon EMR cluster, registering it with a new value (for example, production) for the Genie tag emr.cluster.role, linking the cluster to a command resource, and updating the tag combination in the GenieOperator used by some of the tasks in the DAG.
Cleaning up
To avoid incurring future charges, delete the resources and the S3 bucket created for this post.
Conclusion
This post showed how to deploy an AWS CloudFormation template that sets up a demo environment for Genie, Apache Airflow, and Amazon EMR. It also demonstrated how to configure Genie and use the GenieOperator for Apache Airflow.
About the Authors
Francisco Oliveira is a senior big data solutions architect with AWS. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.
Jelez Raditchkov leads the NoSQL AWS Professional Services Practice at AWS. He helps customers realize desired business outcomes by delivering focused guidance in the NoSQL, Graph and Search areas. Previously, he was a Principal Data Lake Architect with AWS Professional Services.
Prasad Alle is a Senior Big Data Consultant with AWS Professional Services. He spends his time leading and building scalable, reliable Big data, Machine learning, Artificial Intelligence and IoT solutions for AWS Enterprise and Strategic customers. His interests extend to various technologies such as Advanced Edge Computing, Machine learning at Edge. In his spare time, he enjoys spending time with his family.
Audit History
Last reviewed in December 2023 by Ragav Anumasa | Sr. Data Architect