Why does the checkpoint in my Amazon Managed Service for Apache Flink application fail?

The checkpoint or savepoint in my Amazon Managed Service for Apache Flink application keeps failing.

Short description

Checkpointing is the method that's used to implement fault tolerance in Amazon Managed Service for Apache Flink. If you don't optimize or properly provision your application, then you might experience checkpoint failures.

Some of the major causes for checkpoint failures are the following:

For RocksDB, Apache Flink reads files from local storage and writes them to remote persistent storage in Amazon Simple Storage Service (Amazon S3). The performance of the local disk and the upload rate might slow checkpointing and result in checkpoint failures.
Savepoint and checkpoint states are stored in a service-owned Amazon S3 bucket that AWS fully manages. These states are accessed whenever an application fails over. Transient server errors or latency in the S3 bucket might lead to checkpoint failures.
A process function that you created might communicate with an external resource, such as Amazon DynamoDB, during checkpointing. This communication might result in checkpoint failures.
State serialization issues, such as a serializer mismatch with the incoming data, can cause checkpoint failures.
The number of Kinesis Processing Units (KPUs) that are provisioned for the application might not be sufficient. To find the allocated KPUs, use the following calculation:
Allocated KPUs for the application = Parallelism / ParallelismPerKPU
Larger application state sizes might lead to an increase in checkpoint latency. The task manager takes more time to save the checkpoint, and this can cause an out-of-memory exception.
A skewed state distribution can result in one task manager handling more data than the other task managers. Even if sufficient KPUs (resources) are provisioned, overloaded task managers can cause an out-of-memory exception.
High cardinality means that there's a large number of unique keys in the incoming data. If the job uses the keyBy operator to partition incoming data and the key has high cardinality, then slow checkpointing might occur. The slow checkpointing might eventually result in checkpoint failures.
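
As a worked example of the KPU calculation above, the following minimal Java sketch divides Parallelism by ParallelismPerKPU. The class and method names are hypothetical. Rounding up when the division isn't exact is an assumption of this sketch, and any additional KPUs that the service might add on top of this number aren't modeled.

```java
class KpuEstimate {
    // Allocated KPUs = Parallelism / ParallelismPerKPU.
    // Assumption: a partial KPU can't be provisioned, so the result is rounded up.
    static int allocatedKpus(int parallelism, int parallelismPerKpu) {
        if (parallelism < 1 || parallelismPerKpu < 1) {
            throw new IllegalArgumentException("both values must be at least 1");
        }
        // Integer ceiling division.
        return (parallelism + parallelismPerKpu - 1) / parallelismPerKpu;
    }

    public static void main(String[] args) {
        // Parallelism 8 with ParallelismPerKPU 2.
        System.out.println(allocatedKpus(8, 2));
        // Parallelism 9 with ParallelismPerKPU 2 (not evenly divisible).
        System.out.println(allocatedKpus(9, 2));
    }
}
```

If the result is lower than what the workload needs, then the application might be under-provisioned for its state size.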

Resolution

To troubleshoot your failing Amazon Managed Service for Apache Flink application, take the following actions:

Use the lastCheckpointDuration and lastCheckpointSize Amazon CloudWatch metrics to monitor your checkpoint size and duration. The size of your application state might increase rapidly and cause an increase in the checkpoint size and duration. For more information, see Application metrics.
Increase the parallelism of the operator that processes more data. You can use the setParallelism() method to define the parallelism for an individual operator, data source, or data sink. For more information, see Parallel execution on the Flink website.
Note: When you increase the operator parallelism, you might affect the total checkpoint time. You can also use unaligned checkpoints and optimize accordingly. For more information, see Checkpointing under backpressure on the Flink website.
Tune the Parallelism and ParallelismPerKPU values for optimum KPU utilization. Make sure that automatic scaling is turned on for your Amazon Managed Service for Apache Flink application. The value of the maxParallelism parameter allows you to scale the number of KPUs. For more information, see Application Scaling in Amazon Managed Service for Apache Flink.
Define a time to live (TTL) on the state so that expired state is periodically cleaned up. For more information, see Class StateTtlConfig.Builder on the Flink website.
Optimize the code for better partitioning. Use rebalance partitioning to help evenly distribute the data. Rebalance partitioning uses a round robin method for distribution. For more information, see Rebalance on the Flink website.
Optimize the code to reduce the window size so that each window holds fewer unique keys.
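
To illustrate why a skewed key distribution overloads one task and how round-robin distribution evens out the load, the following is a minimal, self-contained Java simulation. It doesn't use the Flink API. The class name, method names, and sample keys are hypothetical, and the modulo of hashCode() is only a simplified stand-in for how a keyed partitioner assigns records to parallel subtasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionSkewSketch {
    // Simplified stand-in for key-based partitioning (assumption, not Flink's
    // actual algorithm): every record with the same key lands on the same subtask.
    static int[] partitionByKey(List<String> keys, int parallelism) {
        int[] counts = new int[parallelism];
        for (String key : keys) {
            counts[Math.floorMod(key.hashCode(), parallelism)]++;
        }
        return counts;
    }

    // Round-robin distribution: records are handed to subtasks in turn,
    // regardless of their key.
    static int[] partitionRoundRobin(List<String> keys, int parallelism) {
        int[] counts = new int[parallelism];
        int next = 0;
        for (String ignored : keys) {
            counts[next]++;
            next = (next + 1) % parallelism;
        }
        return counts;
    }

    // Largest number of records that any single subtask received.
    static int max(int[] counts) {
        return Arrays.stream(counts).max().orElse(0);
    }

    public static void main(String[] args) {
        // Hypothetical input: one hot key carries 900 of 1,000 records.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 900; i++) keys.add("hot-device");
        for (int i = 0; i < 100; i++) keys.add("device-" + i);

        System.out.println("by key:      " + Arrays.toString(partitionByKey(keys, 4)));
        System.out.println("round robin: " + Arrays.toString(partitionRoundRobin(keys, 4)));
    }
}
```

Round-robin distribution breaks the guarantee that records with the same key reach the same subtask, so in a real job it's a fit only for operators that don't depend on keyed state.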

Related information

Checkpoints on the Flink website

Implementing Fault Tolerance in Managed Service for Apache Flink

AWS OFFICIAL Updated 3 months ago
