AWS Compute Blog

Combining Amazon AppFlow with AWS Step Functions to maximize application integration benefits

This post is written by Ahmad Aboushady, Senior Technical Account Manager and Kamen Sharlandjiev, Senior Specialist Solution Architect, Integration.

In this blog post, you learn how to orchestrate AWS service integrations to reduce the manual steps in your workflow. The example uses AWS Step Functions SDK integration to integrate Amazon AppFlow and AWS Glue catalog without writing custom code. It automatically uses Amazon EventBridge to trigger Step Functions every time a new Amazon AppFlow flow finishes running.

Amazon AppFlow enables customers to transfer data securely between software as a service (SaaS) applications, like Salesforce, SAP, Zendesk, Slack, ServiceNow, and multiple AWS services.

An everyday use case of Amazon AppFlow is creating a customer-360 by integrating marketing, customer support, and sales data. For example, analyze the revenue impact of different marketing channels by synchronizing the revenue data from Salesforce with marketing data from Adobe Marketo.

This involves setting up flows to ingest data from different data sources and SaaS applications to AWS Data Lake based on Amazon S3. It uses AWS Glue to crawl and catalog this data. Customers use this catalog to access data quickly in several ways.

For example, they query the data using Amazon Athena or Amazon QuickSight for visualizations, business intelligence and anomaly detection. You can create those data flows quickly with no code required. However, to complete the next set of requirements, customers often go through multiple manual steps of provisioning and configuring different AWS resources. One such step requires creating AWS Glue crawler and running it with every Amazon AppFlow flow execution.

Step Functions can help us automate this process. This is a low-code workflow orchestration service that offers a visual workflow designer. You can quickly build workflows using the built-in drag-and-drop interface available in the AWS Management Console.

You can follow this blog and build your end-to-end state machine using the Step Functions Workflow Studio, or use the AWS Serverless Application Model (AWS SAM) template to deploy the example. The Step Functions state machine uses SDK integration with other AWS Services, so you don’t need to write any custom integration code.

Overview

The following diagram depicts the workflow with the different states in the state machine. You can group these into three phases: preparation, processing, and configuration.

  • The preparation phase captures all the configuration parameters and collects information about the metadata of the data, ingested by Amazon AppFlow.
  • The processing phase generates the AWS Glue table definition and sets the required parameters based on the destination file type. It iterates through the different columns and adds them as part of the table definition.
  • The last phase provides the Glue Catalog resources by creating or updating an existing AWS Glue table. With each Amazon AppFlow flow execution, the state machine determines if a new Glue table partition is required.

Workflow architecture

Preparation phase

The first state, “SetDatabaseAndContext”, is a pass state where you set the configuration parameters used in later states. Set the AWS Glue database and table name and capture the details of the data flow. You can do this by using the parameters filter to build a new JSON payload using parts of the state input similar to:

"Parameters": {
        "Config": {
          "Database": "<Glue-Database-Name>",
          "TableName.$": "$.detail['flow-name']",
          "detail.$": "$.detail"
        }
}

The following state, “DatabaseExist?” is an AWS SDK integration using a “GetDatabase” call to AWS Glue to ensure that the database exists. Here, the state uses error handling to intercept exception messages from the SDK call. This feature splits the workflow and adds an extra step if needed.

In this case, the SDK call returns an exception if the database does not exist, and the workflow invokes the “CreateDatabase” state. It moves to the “CleanUpError” state to clean up any errors and set the configuration parameters accordingly. Afterwards, with the database in place, the workflow continues to the next state: “DescribeFlow”. This returns the metadata of the Amazon AppFlow flow. Part of this metadata is the list of the object fields, which you must create in the Glue table and partitions.

Here is an error handling state that catches exceptions and routes the flow to execute an extra step:

"Catch": [
  {
    "ErrorEquals": [
      "States.ALL"
    ],
    "Comment": "Create Glue Database",
    "Next": "CreateDatabase",
    "ResultPath": "$.error"
  }
]

In the next state, “DescribeFlow”, you use the AWS SDK integration to get the Amazon AppFlow flow configuration. This uses the Amazon AppFlow “DescribeFlow API call. It moves to “S3AsDestination?”, which is a choice state to check if S3 is a destination for the flow. Amazon AppFlow allows you to bring data into different purpose-built data stores, such as S3, Amazon Redshift, or external SaaS or data warehouse applications. This automation can only continue if the configured destination is S3.

The choice definition is:

"Choices": [
  {
    "Variable": "$.FlowConfig.DestinationFlowConfigList[0].ConnectorType",
    "StringEquals": "S3",
    "Next": "GenerateTableDefinition"
  }
],
"Default": "S3NotDestination"

Processing phase

The following state generates the base AWS Glue table definition based on the destination file type. Then it uses a map state to iterate and transform the Amazon AppFlow schema output into what the AWS Glue Data Catalog expects as input.

Next, add the “GenerateTableDefinition” state and use the parameters filter to build a new JSON payload output. Finally, use the information from the “DescribeFlow” state similar to:

"Parameters": {
  "Config.$": "$.Config",
  "FlowConfig.$": "$.FlowConfig",
  "TableInput": {
    "Description": "Created by AmazonAppFlow",
    "Name.$": "$.Config.TableName",
    "PartitionKeys": [
      {
        "Name": "partition_0",
        "Type": "string"
      }
    ],
    "Retention": 0,
    "Parameters": {
      "compressionType": "none",
      "classification.$": "$.FlowConfig.DestinationFlowConfigList[0].DestinationConnectorProperties['S3'].S3OutputFormatConfig.FileType",
      "typeOfData": "file"
    },
    "StorageDescriptor": {
      "BucketColumns": [],
      "Columns.$": "$.FlowConfig.Tasks[?(@.TaskType == 'Map')]",
      "Compressed": false,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location.$": "States.Format('{}/{}/', $.Config.detail['destination-object'], $.FlowConfig.FlowName)",
      "NumberOfBuckets": -1,
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "SortColumns": [],
      "StoredAsSubDirectories": false
    },
    "TableType": "EXTERNAL_TABLE"
  }
}

The following state, “DestinationFileFormatEvaluator”, is a choice state to change the JSON payload according to the destination file type. Amazon AppFlow supports different file type conversions when S3 is the destination of your data. These formats are CSV, Parquet, and JSON Lines. AWS Glue uses various serialization libraries according to the file type.

You iterate within a map state to transform the AWS Glue table schema and set the column type to a known AWS Glue format. If the file type is unrecognized or does not have an equivalent in Glue, default this field to string. The map state configuration is defined as:

"Iterator": {
        "StartAt": "KnownFIleFormat?",
        "States": {
          "KnownFIleFormat?": {
            "Type": "Choice",
            "Choices": [
              {
                "Or": [
                  {
                    "Variable": "$.TaskProperties.SOURCE_DATA_TYPE",
                    "StringEquals": "boolean"
                  },
                  {
                    "Variable": "$.TaskProperties.SOURCE_DATA_TYPE",
                    "StringEquals": "double"
                  },
                  .
                  .
                  .
                  .
                  {
                    "Variable": "$.TaskProperties.SOURCE_DATA_TYPE",
                    "StringEquals": "timestamp"
                  }
                ],
                "Next": "1:1 mapping"
              }
            ],
            "Default": "Cast to String"
          },
          "1:1 mapping": {
            "Type": "Pass",
            "End": true,
            "Parameters": {
              "Name.$": "$.DestinationField",
              "Type.$": "$.TaskProperties.SOURCE_DATA_TYPE"
            }
          },
          "Cast to String": {
            "Type": "Pass",
            "End": true,
            "Parameters": {
              "Name.$": "$.DestinationField",
              "Type": "string"
            }
          }
        }
      },
"ItemsPath": "$.TableInput.StorageDescriptor.Columns",
"ResultPath": "$.TableInput.StorageDescriptor.Columns",

Configuration phase

The next stage in the workflow is “TableExist?”, which checks if the Glue table exists. If the state machine detects any error because the table does not exist, it moves to the “CreateTable” state. Alternatively, it goes to the “UpdateTable” state.

Both states use the AWS SDK integration to create or update the AWS Glue table definition using the “TableInput” parameter. AWS Glue operates with partitions. Every time you have new data stored in a new S3 prefix, you must update the table and add a new partition showing where the data sits.

You need an extra step to check if Amazon AppFlow has stored the data into a new S3 prefix or an existing one. In the “AddPartition?” State, you must review and determine the next step of your workflow. For example, you must validate that the flow executed successfully and processed data.

A choice state helps with those checks:

"And": [
            {
              "Variable": "$.Config.detail['execution-id']",
              "IsPresent": true
            },
            {
              "Variable": "$.Config.detail['status']",
              "StringEquals": "Execution Successful"
            },
            {
              "Not": {
                "Variable": "$.Config.detail['num-of-records-processed']",
                "StringEquals": "0"
              }
            }
          ]

Amazon AppFlow supports different types of flow execution. With scheduled flows, you can regularly configure Amazon AppFlow to hydrate a data lake by bringing only new data since its last execution. Sometimes, after a successful flow execution, there is no new data to ingest. The workflow concludes and moves to the success state in such cases. However, if there is new data, the state machine continues to the next state, “SingleFileAggregation?”.

Amazon AppFlow supports different file aggregation strategies and allows you to aggregate all ingested records into a single or multiple files. Depending on your flow configuration, it may store your data in a different S3 prefix with each flow execution.

In this state, you check this configuration to decide if you need a new partition for your AWS Glue table.

"Variable": "$.FlowConfig.DestinationFlowConfigList[0].DestinationConnectorProperties.S3.S3OutputFormatConfig.AggregationConfig.AggregationType",
"StringEquals": "SingleFile"

If the data flow aggregates all records into a single file per flow execution, it stores all data into a single S3 prefix. In this case, there is a single partition in your AWS Glue table. You must create that single partition the first time this state machine executes for a specific flow.

Use the AWS SDK integration to get the table partition from the AWS Glue in the “IsPartitionExist?” state. Conclude the workflow and move to the “Success” state if the partition exists. Otherwise, create that single partition in another state, “CreateMainPartition”.

If the flow run does not aggregate files, every flow run generates multiple files into a new S3 prefix. In this case, you add a new partition to the AWS Glue table. A pass state, “ConfigureDestination”, configures the required parameters for the partition creation:

"Parameters": {
        "InputFormat.$": "$.TableInput.StorageDescriptor.InputFormat",
        "OutputFormat.$": "$.TableInput.StorageDescriptor.OutputFormat",
        "Columns.$": "$.TableInput.StorageDescriptor.Columns",
        "Compressed.$": "$.TableInput.StorageDescriptor.Compressed",
        "SerdeInfo.$": "$.TableInput.StorageDescriptor.SerdeInfo",
        "Location.$": "States.Format('{}{}', $.TableInput.StorageDescriptor.Location, $.Config.detail['execution-id'])"
      },
 "ResultPath": "$.TableInput.StorageDescriptor"

Next, move to the “CreateNewPartition” state to use the AWS SDK integration to create a new partition to the Glue table similar to:

"Parameters": {
        "DatabaseName.$": "$.Config.Database",
        "TableName.$": "$.Config.TableName",
        "PartitionInput": {
          "Values.$": "States.Array($.Config.detail['execution-id'])",
          "StorageDescriptor.$": "$.TableInput.StorageDescriptor"
        }
      },
"Resource": "arn:aws:states:::aws-sdk:glue:createPartition"

This concludes the workflow with a “Succeed” state after configuring the AWS Glue table in response to the new Amazon AppFlow flow run.

Conclusion

This blog post explores how to integrate Amazon AppFlow and AWS Glue using Step Functions to automate your business requirements. You can use AWS Lambda to simplify the configuration phase and reduce state transitions or create complex checks, filters, or even data cleansing and preparation.

This approach allows you to tailor the schema conversion to your business requirements. Use this AWS SAM template, to deploy this example. This provides the Step Functions workflow described in this post and the EventBridge rule to trigger the state machine after each Amazon AppFlow flow run. The template also includes all required IAM roles and permissions.

For more serverless learning resources, visit Serverless Land.