AWS Big Data Blog
Amazon Managed Service for Apache Flink application lifecycle management with Terraform
In this post, you’ll learn how to use Terraform to automate and streamline your Apache Flink application lifecycle management on Amazon Managed Service for Apache Flink. We’ll walk you through the complete lifecycle including deployment, updates, scaling, and troubleshooting common issues.
Managing Apache Flink applications through their entire lifecycle from initial deployment to scaling or updating can be complex and error-prone when done manually. Teams often struggle with inconsistent deployments across environments, difficulty tracking configuration changes over time, and complex rollback procedures when issues arise.
Infrastructure as Code (IaC) addresses these challenges by treating infrastructure configuration as code that can be versioned, tested, and automated. While there are different IaC tools available, including AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK), in this post we focus on HashiCorp Terraform to automate the complete lifecycle management of Apache Flink applications on Amazon Managed Service for Apache Flink.
Managed Service for Apache Flink allows you to run Apache Flink jobs at scale without worrying about managing clusters and provisioning resources. You can focus on developing your Apache Flink application using your Integrated Development Environment (IDE) of choice, building and packaging the application using standard build and CI/CD tools. Once your application is packaged and uploaded to Amazon S3, you can deploy and run it with a serverless experience.
While you can control your Managed Service for Apache Flink applications directly using the AWS Console, CLI, or SDKs, Terraform provides key advantages such as version control of your application configuration, consistency across environments, and seamless CI/CD integration. This post builds upon our two-part blog series “Deep dive into the Amazon Managed Service for Apache Flink application lifecycle – Part 1” and “Part 2” that discusses the general lifecycle concepts of Apache Flink applications.
We use the sample code published on the GitHub repository to demonstrate the lifecycle management. Note that this is not a production-ready solution.
Setting up your Terraform environment
Before you can manage your Apache Flink applications with Terraform, you need to set up your execution environment. In this section, we’ll cover how to configure Terraform state management and credential handling. The Terraform AWS provider supports Managed Service for Apache Flink through the aws_kinesisanalyticsv2_application resource (using the legacy name “Kinesis Analytics V2“).
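For orientation, the following is a minimal sketch of that resource for a Flink application whose JAR artifact is stored in Amazon S3. The resource and application names, the referenced IAM role and S3 bucket, and all values are illustrative and are not taken from the sample repository:

resource "aws_kinesisanalyticsv2_application" "flink_app" {
  name                   = "flink-terraform-lifecycle"   # illustrative application name
  runtime_environment    = "FLINK-1_20"
  service_execution_role = aws_iam_role.flink_app.arn    # assumed IAM role defined elsewhere
  start_application      = true

  application_configuration {
    application_code_configuration {
      code_content_type = "ZIPFILE"                      # JAR artifacts are provided as ZIPFILE content
      code_content {
        s3_content_location {
          bucket_arn = aws_s3_bucket.artifacts.arn       # assumed artifact bucket defined elsewhere
          file_key   = "flink-app.jar"
        }
      }
    }

    flink_application_configuration {
      parallelism_configuration {
        configuration_type   = "CUSTOM"
        parallelism          = 1
        parallelism_per_kpu  = 1
        auto_scaling_enabled = false
      }
      monitoring_configuration {
        configuration_type = "CUSTOM"
        log_level          = "ERROR"
        metrics_level      = "APPLICATION"
      }
    }
  }
}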
Terraform state management
Terraform uses a state file to track the resources it manages. Storing the state file in Amazon S3 is a best practice for teams working collaboratively because it provides a centralised, durable, and secure location for tracking infrastructure changes. However, since multiple engineers or CI/CD pipelines may run Terraform simultaneously, state locking is essential to prevent race conditions where concurrent executions could corrupt the state. An S3 backend is commonly used for state storage and locking, ensuring that only one Terraform process can modify the state at a time, thus maintaining infrastructure consistency and avoiding deployment conflicts.
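A backend configuration along the following lines stores the state in S3 and enables locking. The bucket name, key, and region are placeholders:

terraform {
  backend "s3" {
    bucket       = "my-terraform-state-bucket"       # placeholder bucket name
    key          = "flink-project/terraform.tfstate"
    region       = "eu-west-1"
    encrypt      = true
    use_lockfile = true                               # S3-native state locking (Terraform 1.10+); older versions can use a DynamoDB table instead
  }
}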
Passing credentials
To run Terraform inside a Docker container while ensuring that it has access to the necessary AWS credentials and infrastructure code, we follow a structured approach. This process involves exporting AWS credentials, mounting required directories, and executing Terraform commands inside a Docker container. Let’s break this down step by step. Before running Terraform, we need to make sure that our Docker container has access to the required AWS credentials. Since we are using temporary credentials, we generate them using the AWS CLI with the following command:
aws configure export-credentials --profile $AWS_PROFILE --format env-no-export > .env.docker
This command does the following:
- It exports AWS credentials from a specific AWS profile ($AWS_PROFILE).
- The credentials are saved in .env.docker in a format suitable for Docker.
- The --format env-no-export option displays the credentials as non-exported shell variables.

This file (.env.docker) will later be used to pass credentials into the Docker container.
Running Terraform in Docker
Running Terraform inside a Docker container provides a consistent, portable, and isolated environment for managing infrastructure without requiring Terraform to be installed directly on the local machine. This approach ensures that Terraform runs in a controlled environment, reducing dependency conflicts and improving security. To execute Terraform within a Docker container, we use a docker run command that mounts the necessary directories and passes AWS credentials, allowing Terraform to apply infrastructure changes seamlessly.
The Terraform configuration files are stored in a local terraform folder, which is mounted into the container using the -v flag. This allows the containerised Terraform instance to access and modify infrastructure code as if it were running locally.
To run Terraform in Docker, the following command is executed:
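Assembled from the flags explained in the breakdown below, the invocation looks roughly like this (the exact command in the repository may differ slightly):

docker run --env-file .env.docker --rm -it -v ./terraform:/home/flink-project/terraform -v ./build.sh:/home/flink-project/build.sh msf-terraform bash build.sh apply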
Breaking down this command step by step:
- --env-file .env.docker provides the AWS credentials required for Terraform to authenticate.
- --rm -it runs the container interactively and removes it after execution to prevent clutter.
- -v ./terraform:/home/flink-project/terraform mounts the Terraform directory into the container, making the configuration files accessible.
- -v ./build.sh:/home/flink-project/build.sh mounts the build.sh script, which contains the logic to build the JAR file for Flink and execute Terraform commands.
- msf-terraform is the Docker image used, which has Terraform pre-installed.
- bash build.sh apply runs the build.sh script inside the container, passing apply as an argument to trigger the Terraform apply process.
Inside the container, build.sh typically includes commands such as terraform init to initialise the Terraform working directory and terraform apply to apply infrastructure changes. Since the Terraform execution happens entirely within the container, there is no need to install Terraform locally, and the process remains consistent across different systems. This method is particularly beneficial for teams working in collaborative environments, as it standardises Terraform execution and allows for reproducibility across development, staging, and production environments.
Managing application lifecycle with Terraform
In this section, we walk through each phase of the Apache Flink application lifecycle and understand how you can implement these operations using Terraform. While these operations are usually fully automated as part of a CI/CD pipeline, you will execute the individual steps manually from the command line for demonstration purposes. There are many ways to run Terraform depending on your organization’s tooling and infrastructure setup, but for this demonstration, we run Terraform in a container alongside the application build to simplify dependency management. In real-world scenarios, you would typically have separate CI/CD stages for building your application and deploying with Terraform, with distinct configurations for each environment. Since every organization has different CI/CD tooling and approaches, we keep these implementation details out of scope and focus on the core Terraform operations.
For a comprehensive deep dive into Apache Flink application lifecycle operations, refer to our previous two-part blog series.
Create and start a new application
To get started, you want to create your Apache Flink application running on Managed Service for Apache Flink. Execute the following Docker command:
This command will complete the following operations by executing the bash script build.sh:
- Building the Java ARchive (JAR) file from your Apache Flink application
- Uploading the JAR file to S3
- Setting the config variables for your Apache Flink application in terraform/config.tfvars.json
- Creating and deploying the Apache Flink application to Managed Service for Apache Flink using terraform apply
Terraform fully covers this operation. You can check the running Apache Flink application using the AWS CLI or in the Managed Service for Apache Flink console after Terraform completes with Apply complete! Terraform expects the Apache Flink artifact, i.e., the JAR file, to be packaged and copied to S3. This operation is usually part of the CI/CD pipeline and executed before invoking terraform apply. Here, the operation is specified in the build.sh script.
Deploy code change to an application
You have successfully created and started the Flink application. However, you realize that you have to make a change to the Flink application code. Let’s make a code change to the application code in flink/ and see how to build and deploy it. After making the necessary changes, you simply have to run the following Docker command again that builds the JAR file, uploads it to S3 and deploys the Apache Flink application using Terraform:
This phase of the lifecycle is fully supported by Terraform as long as both applications are state compatible, meaning that the operators of the upgraded Apache Flink application are able to restore the state from the snapshot that is taken from the old application version, before Managed Service for Apache Flink stops and deploys the change. For example, removing a stateful operator without enabling the allowNonRestoredState flag or changing an operator’s UID could prevent the new application from restoring from the snapshot. For more information on state compatibility, refer to Upgrading Applications and Flink Versions. For an example of state incompatibility, and strategies for handling state incompatibility, refer to Introducing the new Amazon Kinesis source connector for Apache Flink.
When deploying a code change goes wrong – A problem prevents the application code from being deployed
You also need to be careful with deploying code changes that contain bugs preventing the Apache Flink job from starting. For more information, refer to failure mode (a) – a problem prevents the application code from being deployed under When starting or updating the application goes wrong. For instance, this can be simulated by mistakenly setting the mainClass in flink/pom.xml to com.amazonaws.services.msf.WrongJob. As before, you build the JAR, upload it, and run terraform apply by executing the Docker command from above. However, Terraform now fails to apply the changes and throws an error message, as the Apache Flink application fails to update correctly. Finally, the application status moves to READY.
To remedy the issue, you have to change the value of mainClass back to the original one and deploy the changes to Managed Service for Apache Flink. The Apache Flink application remains in READY status and doesn’t start automatically, as this was its state before applying the fix. Note that Terraform does not try to start the application when you deploy a change. You will have to manually start the Flink application using the AWS CLI or through the Managed Service for Apache Flink console.
As detailed in Part 2 of the companion blog, there is a second failure scenario where the application starts successfully, but the job becomes stuck in a continuous fail-and-restart loop. A code change can also cause this failure mode. We will cover the second error scenario when we cover deploying configuration changes.
Manual rollback application code to previous application code
As part of the lifecycle management of your Apache Flink application, you may need to explicitly roll back to a previous running application version. This is particularly useful when a newly deployed application version with application code changes exhibits unexpected behaviour and you want to explicitly roll back the application. Currently, Terraform does not support explicit rollbacks of your Apache Flink application running in Managed Service for Apache Flink. You will have to resort to the RollbackApplication API through the AWS CLI or the Managed Service for Apache Flink console to revert the application to the previous running version.
When you perform the explicit rollback, Terraform will initially not be aware of the changes. More specifically, the S3 path to the JAR file in the Managed Service for Apache Flink service (see the left part of the image below) differs from the S3 path recorded in the terraform.tfstate file stored in Amazon S3 (see the right part of the image below). Fortunately, Terraform always performs refreshing actions that include reading the current settings from all managed remote objects and updating the Terraform state to match, as part of creating a plan in both the terraform plan and terraform apply commands.
In summary, while you cannot perform a manual rollback using Terraform, Terraform will automatically refresh the state when deploying a change using terraform apply.
Deploy config change to application
You have already made changes to the application code of your Apache Flink application. What about making changes to the config of the application, e.g., changing runtime parameters? Imagine you want to change the application logging level of your running Apache Flink application. To change the logging level from ERROR to INFO, you have to change the value for flink_app_monitoring_metrics_level in the terraform/config.tfvars.json to INFO. To deploy the config changes, you need to run the docker run command again as done in the previous sections. This scenario works as expected and is fully covered by Terraform.
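Under the hood, such a configuration value flows into the monitoring_configuration block of the aws_kinesisanalyticsv2_application resource. The following is a minimal sketch; the variable names are assumptions for illustration and may differ from those in the repository:

monitoring_configuration {
  configuration_type = "CUSTOM"
  log_level          = var.flink_app_monitoring_log_level      # e.g., "ERROR" or "INFO"; assumed variable name
  metrics_level      = var.flink_app_monitoring_metrics_level  # e.g., "APPLICATION"; assumed variable name
}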
What happens when the Apache Flink application deploys successfully but fails and restarts during execution? For more information, please refer to failure mode (b) – the application is started, the job is stuck in a fail-and-restart loop under When starting or updating the application goes wrong. Note that this failure mode can happen when making code changes as well.
When deploying config change goes wrong – The application is started, the job is stuck in a fail-and-restart loop
In the following example, we apply a wrong configuration change preventing the Kinesis connector from initialising correctly, ultimately putting the job in a fail-and-restart loop. To simulate this failure scenario, you’ll need to modify the Kinesis stream configuration by changing the stream name to a non-existent one. This change is made in the terraform/config.tfvars.json file, specifically altering the stream.name value under flink_app_environment_variables. When you deploy with this invalid configuration, the initial deployment will appear successful, showing an Apply Complete! message. The Flink application status will also show as RUNNING. However, the actual behaviour reveals problems. If you check the Flink Dashboard, you’ll see the application is continuously failing and restarting. Also, you will see a warning message about the application requiring attention in the AWS Console.
As detailed in the section Monitoring Apache Flink application operations in the companion blog (part 2), you can monitor the FullRestarts metric to detect the fail-and-restart loop.
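One way to automate that detection with Terraform is a CloudWatch alarm on the metric. The following is a minimal sketch; the alarm name, application name, and threshold are illustrative:

resource "aws_cloudwatch_metric_alarm" "flink_full_restarts" {
  alarm_name          = "flink-terraform-lifecycle-full-restarts"   # illustrative name
  namespace           = "AWS/KinesisAnalytics"
  metric_name         = "fullRestarts"
  dimensions          = { Application = "flink-terraform-lifecycle" }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 0                       # alarm as soon as any full restart is observed
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
}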
Reverting the changes made to the environment variable and deploying the changes will result in Terraform showing the following error message: Failed to take snapshot for the application flink-terraform-lifecycle at this moment. The application is currently experiencing downtime.
You have to force-stop without a snapshot and restart the application with a snapshot to get your Flink application back to a properly functioning state. You should constantly monitor the application state of your Apache Flink application to detect any issues.
Other common operations
Manually scaling the application
Another common operation in the lifecycle of your Apache Flink application is scaling the application up or down by adjusting the parallelism. This operation changes the number of Kinesis Processing Units (KPUs) allocated to your application. Let’s look at two different scaling scenarios and how they are handled by Terraform.
In the first scenario, you want to change the parallelism of your running Apache Flink application within the default parallelism quota. To do this, you need to modify the value for flink_app_parallelism in the terraform/config.tfvars.json file. After updating the parallelism value, you deploy the changes by running the Docker command as done in the previous sections:
This scenario works as expected and is fully covered by Terraform. The application will be updated with the new parallelism setting, and Managed Service for Apache Flink will adjust the allocated KPUs accordingly. Note that there is a default quota of 64 KPUs for a single Managed Service for Apache Flink application, which must be raised proactively via a quota increase request if you need to scale your Managed Service for Apache Flink application beyond 64 KPUs. For more information, refer to Managed Service for Apache Flink quota.
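In the Terraform resource, the parallelism setting maps into the parallelism_configuration block inside flink_application_configuration. A minimal sketch, assuming the variable name from the tfvars file:

parallelism_configuration {
  configuration_type   = "CUSTOM"
  parallelism          = var.flink_app_parallelism  # overall job parallelism
  parallelism_per_kpu  = 1                          # parallelism handled by each KPU
  auto_scaling_enabled = false                      # keep scaling under Terraform's control
}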
Less common change deployments which require special handling
In this section, we analyze some less common change deployment scenarios which require special handling.
Deploy code change that removes an operator
Removing an operator from your Apache Flink application requires special consideration, particularly regarding state management. When you remove an operator, the state from that operator still exists in the latest snapshot, but there’s no longer a corresponding operator to restore it. Let’s take a closer look at this scenario and understand how you can handle it properly.

First, you need to make sure that the parameter AllowNonRestoredState is set to True. This parameter specifies whether the runtime is allowed to skip a state that cannot be mapped to the new program when restoring from a snapshot. Allowing non-restored state is required to successfully update an Apache Flink application when you have dropped an operator. To enable AllowNonRestoredState, set the configuration value flink_app_allow_non_restored_state to true in terraform/config.tfvars.json.

Then, you can go ahead and remove an operator. For example, you can have the sourceStream write directly to the sink connector in flink/src/main/java/com/amazonaws/services/msf/StreamingJob.java. Change code line 146 from windowedStream.sinkTo(sink).uid("kinesis-sink") to sourceStream.sinkTo(sink).uid("kinesis-sink"). Make sure that you have commented out the entire windowedStream code block (lines 103 to 140).
This change will remove the windowed computation and directly connect the source stream to the sink, effectively removing the stateful operation. After removing the operator from your Flink application code, you deploy the changes using the Docker command as previously done. However, the deployment fails with the following error message: Could not execute application. As a result, the Apache Flink application moves to the READY state. To recover from this situation, you need to restart the Apache Flink application using the latest snapshot for the application to successfully start and move to RUNNING status. Importantly, you need to make sure that AllowNonRestoredState is enabled. Otherwise, the application will fail to start as it cannot restore the state for the removed operator.
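In the Terraform resource, the AllowNonRestoredState flag maps into the run_configuration block under application_configuration. A minimal sketch, assuming the variable name used above:

run_configuration {
  application_restore_configuration {
    application_restore_type = "RESTORE_FROM_LATEST_SNAPSHOT"
  }
  flink_run_configuration {
    allow_non_restored_state = var.flink_app_allow_non_restored_state
  }
}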
Deploy change that breaks state compatibility with system rollback enabled
During the lifecycle management of your Apache Flink application, you might encounter scenarios where code changes break state compatibility. This typically happens when you modify stateful operators in ways that prevent them from restoring their state from previous snapshots.
A common example of breaking state compatibility is changing the UID of a stateful operator (such as an aggregation or windowing operator) in your application code. To safeguard against such breaking changes, you can enable the automatic system rollback feature in Managed Service for Apache Flink, as described in the Rollback subsection under Lifecycle of an application in Managed Service for Apache Flink mentioned previously. This feature is disabled by default and can be enabled using the AWS Management Console or by invoking the UpdateApplication API operation. There is currently no way to enable system rollback in Terraform.
Next, let’s demonstrate this by breaking the state compatibility of your Apache Flink application: change the UID of a stateful operator, e.g., the string windowed-avg-price in line 140 of flink/src/main/java/com/amazonaws/services/msf/StreamingJob.java, to windowed-avg-price-v2 and deploy the changes as before. You will encounter the following error:
Error: waiting for Kinesis Analytics v2 Application (flink-terraform-lifecycle) operation (*) success: unexpected state ‘FAILED’, wanted target ‘SUCCESSFUL’. last error: org.apache.flink.runtime.rest.handler.RestHandlerException: Could not execute application.
At this point, Managed Service for Apache Flink automatically rolls back the application to the previous snapshot with the previous JAR file, maintaining your application’s availability because you have enabled the system-rollback capability. Terraform will initially not be aware of the performed rollback. Fortunately, as we have already seen in the subsection Manual rollback application code to previous application code, Terraform will automatically refresh the state when we change the UID back to the previous value and deploy the changes.
In-place upgrade of Apache Flink runtime version
Managed Service for Apache Flink supports in-place upgrades to new Flink runtime versions. See the documentation for more details. Updating the application dependencies and making any required code changes is the responsibility of the user. Once you have updated the code artifact, the service is able to upgrade the runtime of your running application in place, without data loss. Let’s examine how Terraform handles Flink version upgrades.
To upgrade your Apache Flink application from version 1.19.1 to 1.20, you need to:
- Update the Flink dependencies in your flink/pom.xml to version 1.20 (flink.version to 1.20.1 and flink.connector.version to 5.0.0-1.20 in <properties>)
- Update the flink_app_runtime_environment to FLINK-1_20 in terraform/config.tfvars.json
- Build and deploy the changes using the familiar docker run command
Terraform successfully performs an in-place upgrade of your Flink application. You will receive the following message: Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
Operations currently not supported by Terraform
Let’s take a closer look at operations that are currently not supported by Terraform.
Starting or stopping the application without any configuration change
Terraform provides the start_application argument, which indicates whether to start or stop the application. You can set this value using flink_app_start in config.tfvars.json; setting it to false stops your running Apache Flink application. However, this only works if the current configuration value is true. In other words, Terraform only responds to a change in the parameter value, not to the absolute value itself. After Terraform applies this change, your Apache Flink application will stop and its application status will move to READY. Similarly, restarting the application requires changing the flink_app_start value back to true, but this will only take effect if the current configuration value is false. Terraform will then restart your application, moving it back to the RUNNING state.
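At the resource level, this corresponds to the start_application argument of aws_kinesisanalyticsv2_application; a minimal sketch, assuming the variable name from the tfvars file:

start_application = var.flink_app_start  # flipping true -> false stops the job; false -> true starts it again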
In summary, you cannot start or stop your Apache Flink application without making any configuration change in Terraform. You have to use AWS CLI, AWS SDK or AWS Console to start or stop your application.
Restarting application from an older snapshot or no snapshot without any configuration change
Similar to the previous section, Terraform requires an actual configuration change of application_restore_type to trigger a restart with different snapshot settings. Simply reapplying the same configuration values won’t initiate a restart from a different snapshot or no snapshot. You have to use AWS CLI, AWS SDK or AWS Console to restart your application from an older snapshot.
Performing rollback triggered manually or by system-rollback feature
Terraform supports neither performing a manual rollback nor the automatic system rollback. In addition, Terraform will not be aware when such a rollback is taking place, so the state information (for example, the S3 path) will be outdated. However, Terraform automatically performs refreshing actions to read the settings from all managed remote objects and updates the Terraform state to match. Consequently, you can have Terraform refresh its state by successfully running a terraform apply command.
Conclusion
In this post, we demonstrated how to use Terraform to automate the lifecycle management of your Apache Flink applications on Managed Service for Apache Flink. We walked through fundamental operations including creating, updating, and scaling applications, explored how Terraform handles various failure scenarios and examined advanced scenarios such as removing operators and performing in-place runtime upgrades. We also identified operations that are currently not supported by Terraform.
For more information, see Run a Managed Service for Apache Flink application and our two-part blog on Deep dive into the Amazon Managed Service for Apache Flink application lifecycle.