Why is the checkpoint in my Amazon Kinesis Data Analytics application failing?
Last updated: 2022-02-25
The checkpoint or savepoint in my Amazon Kinesis Data Analytics application is failing.
Checkpointing is the method used to implement fault tolerance in Amazon Kinesis Data Analytics for Apache Flink. If your application isn't optimized or properly provisioned, checkpoints might fail.
Some of the major causes of checkpoint failures are the following:
- With the RocksDB state backend, Apache Flink reads files from local storage and writes them to remote persistent storage, that is, Amazon Simple Storage Service (Amazon S3). The performance of the local disk and the upload rate might impact checkpointing and result in checkpoint failures.
- Savepoint and checkpoint states are stored in a service-owned Amazon S3 bucket that's fully managed by AWS. These states are accessed whenever an application fails over. Transient server errors or latency in this S3 bucket might lead to checkpoint failures.
- A process function that you created might communicate with an external resource, such as Amazon DynamoDB, during checkpointing. This can result in checkpoint failures.
- State serialization failures, such as a serializer mismatch with the incoming data, can cause checkpoint failures.
- The number of Kinesis Processing Units (KPUs) provisioned for the application might not be sufficient. To find the allocated KPUs, use the following calculation:
Allocated KPUs for the application = Parallelism / ParallelismPerKPU
- Larger application state sizes might lead to an increase in checkpoint latency. This is because the task manager takes more time to save the checkpoint, which can result in an out-of-memory exception.
- A skewed distribution of state could result in one task manager handling more data than the other task managers. Even if sufficient KPUs (resources) are provisioned, one or more overloaded task managers can cause an out-of-memory exception.
- High cardinality indicates that there is a large number of unique keys in the incoming data. If the job uses the KeyBy operator to partition the incoming data, and the partitioning key has high cardinality, checkpointing might slow down. This might eventually result in checkpoint failures.
- The size of your application state might increase rapidly, causing an increase in checkpoint size and duration. You can monitor these values by using the Amazon CloudWatch metrics lastCheckpointDuration and lastCheckpointSize. For more information, see Application metrics.
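The KPU calculation above is simple division. A minimal sketch, using hypothetical values for Parallelism and ParallelismPerKPU (these come from your application's configuration):

```python
# Hypothetical configuration values for illustration only.
parallelism = 8          # application-level Parallelism setting
parallelism_per_kpu = 2  # ParallelismPerKPU setting

# Allocated KPUs for the application = Parallelism / ParallelismPerKPU
allocated_kpus = parallelism // parallelism_per_kpu
print(allocated_kpus)  # → 4
```

If this number is too low for your state size and throughput, checkpoints can time out before they complete.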
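The skew and cardinality causes above can be illustrated with a small simulation. The sketch below uses a hypothetical stream dominated by one hot key and Python's built-in hash to stand in for keyed partitioning (Flink itself assigns keys to key groups with a murmur hash, but the effect is the same): every record with the same key lands on the same subtask, so one subtask ends up overloaded.

```python
from collections import Counter

# Hypothetical skewed stream: one hot key accounts for 90% of records.
records = ["user-1"] * 900 + [f"user-{i}" for i in range(2, 102)]  # 1,000 records

parallelism = 4

# KeyBy-style assignment: same key -> same subtask, regardless of volume.
loads = Counter(hash(key) % parallelism for key in records)

# The subtask that owns "user-1" receives at least 90% of the records,
# while the others share the remainder.
print(sorted(loads.values(), reverse=True))
```

No amount of extra KPUs fixes this by itself, because the hot key can never be split across subtasks.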
To troubleshoot checkpoint failures, try the following approaches:
- Increase the parallelism of the operator that processes the most data. You can define the parallelism for an individual operator, data source, or data sink by calling the setParallelism() method.
- Tune the values of Parallelism and ParallelismPerKPU for optimum utilization of KPUs. Be sure that automatic scaling isn't turned off for your Amazon Kinesis Data Analytics application. The value of the maxParallelism parameter allows you to scale to a desired number of KPUs. For more information, see Application Scaling in Kinesis Data Analytics for Apache Flink.
- Define a time to live (TTL) on the state to make sure that stale state is periodically cleaned up.
- Optimize the code to allow for a better partitioning strategy. You can use rebalance partitioning to help distribute the data evenly. This method uses round robin for distribution.
- Optimize the code to reduce the window size so that the number of unique keys in each window is reduced.
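The rebalancing suggestion above can be sketched in a few lines. This simulation reuses a hypothetical hot-key stream and assigns records round robin, which is the strategy behind rebalance() in Flink's DataStream API; unlike keyed partitioning, every subtask receives the same share of records:

```python
from collections import Counter
from itertools import cycle

# Hypothetical skewed input: one hot key dominates the stream.
records = ["user-1"] * 900 + [f"user-{i}" for i in range(2, 102)]  # 1,000 records

parallelism = 4
subtasks = cycle(range(parallelism))

# Round-robin assignment: record order, not key, decides the subtask.
loads = Counter(next(subtasks) for _ in records)
print(sorted(loads.values()))  # → [250, 250, 250, 250]
```

Note that round-robin distribution only helps when the downstream operator doesn't need all records for a key on the same subtask; keyed aggregations still require KeyBy.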
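The TTL recommendation above keeps state from growing without bound. In Flink you would typically configure this with StateTtlConfig on a state descriptor; the sketch below is only a language-agnostic illustration of the idea, with hypothetical names, showing entries being dropped once they exceed their TTL:

```python
# Minimal illustration of TTL-based state cleanup (hypothetical class;
# not a Flink API). Timestamps are passed in explicitly for determinism.
class TtlState:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, last_update_time)

    def put(self, key, value, now):
        self.entries[key] = (value, now)

    def cleanup(self, now):
        # Drop every entry whose last update is older than the TTL.
        expired = [k for k, (_, ts) in self.entries.items() if now - ts > self.ttl]
        for k in expired:
            del self.entries[k]

state = TtlState(ttl_seconds=60)
state.put("user-1", 10, now=0)
state.put("user-2", 20, now=50)
state.cleanup(now=100)        # "user-1" was last updated 100s ago and expires
print(sorted(state.entries))  # → ['user-2']
```

Smaller retained state means smaller checkpoints, which directly reduces checkpoint duration.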
Related information: Apache Flink documentation for Checkpoints