Why does my Amazon Managed Service for Apache Flink application restart?

My Amazon Managed Service for Apache Flink application continues to restart.

Resolution

When a task fails, the Apache Flink application restarts the failed task and other affected tasks to bring the job to a normal state.

The following are some of the causes and troubleshooting steps for this issue.

Code errors

Code errors, such as NullPointerException and type-cast errors (for example, ClassCastException), are raised on the task manager and propagated to the job manager. The application then restarts from the latest checkpoint. To detect restarts caused by unhandled exceptions in the application, check Amazon CloudWatch metrics such as downtime. This metric shows a non-zero value during restart periods. To identify the cause, query your application logs for changes in your application's status from RUNNING to FAILED. For more information, see Analyze errors: Application task-related failures.
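
The log search for RUNNING to FAILED transitions might look like the following minimal Python sketch. It assumes that the state change appears in your log messages with the wording "switched from RUNNING to FAILED"; adjust the pattern if your logs word it differently. The function name and the sample log lines are hypothetical and only illustrate the idea.

```python
import re

# Query text that can be pasted into CloudWatch Logs Insights.
# Assumption: the state change is logged as "switched from RUNNING to FAILED".
INSIGHTS_QUERY = """\
fields @timestamp, @message
| filter @message like /switched from RUNNING to FAILED/
| sort @timestamp desc
| limit 50"""

_TRANSITION = re.compile(r"switched from RUNNING to FAILED")


def find_failed_transitions(log_lines):
    """Return the log lines that record a RUNNING -> FAILED state change."""
    return [line for line in log_lines if _TRANSITION.search(line)]


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = [
        "Source: demo (1/4) switched from DEPLOYING to RUNNING.",
        "Map -> Sink: demo (2/4) switched from RUNNING to FAILED with failure cause: "
        "java.lang.NullPointerException",
    ]
    for line in find_failed_transitions(sample):
        print(line)
```

The same pattern works on exported log files. For live logs, the query text can be run in the CloudWatch Logs Insights console against the application's log group.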

Out-of-memory exceptions

When you get out-of-memory exceptions, the task manager can't send healthy heartbeat signals to the job manager, and the application restarts. In this case, you might see errors in the application logs, such as TimeoutException, FlinkException, or RemoteTransportException.

Check if the application is overloaded because of CPU or memory resource pressure:

Confirm that restarts are occurring: check whether the fullRestarts and downtime CloudWatch metrics show non-zero values.
Check the cpuUtilization and heapMemoryUtilization metrics for unusual spikes.
Check for unhandled exceptions in your application code.
Check for checkpoint and savepoint failures. Monitor the numberOfFailedCheckpoints, lastCheckpointSize, and lastCheckpointDuration CloudWatch metrics for spikes and steady increases.
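
The metric checks above can be partly automated. The following Python sketch builds the query list for the CloudWatch GetMetricData API and flags spikes and steady increases in a series of datapoints. It assumes that the metrics are published in the AWS/KinesisAnalytics namespace with an Application dimension, and that the failed-checkpoint metric is named numberOfFailedCheckpoints. The function names, the spike factor, and the minimum number of points are hypothetical choices, not values from AWS.

```python
import statistics

# Metrics named in the checks above.
METRICS = [
    "fullRestarts",
    "downtime",
    "cpuUtilization",
    "heapMemoryUtilization",
    "numberOfFailedCheckpoints",
    "lastCheckpointSize",
    "lastCheckpointDuration",
]


def build_metric_queries(application_name, period_seconds=60):
    """Build MetricDataQueries entries for CloudWatch GetMetricData."""
    queries = []
    for index, name in enumerate(METRICS):
        queries.append({
            "Id": f"m{index}",  # query IDs must start with a lowercase letter
            "Label": name,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/KinesisAnalytics",  # assumed namespace
                    "MetricName": name,
                    "Dimensions": [
                        {"Name": "Application", "Value": application_name},
                    ],
                },
                "Period": period_seconds,
                "Stat": "Maximum",
            },
        })
    return queries


def is_steadily_increasing(values, min_points=4):
    """True if the series never decreases and ends higher than it starts."""
    if len(values) < min_points:
        return False
    never_drops = all(b >= a for a, b in zip(values, values[1:]))
    return never_drops and values[-1] > values[0]


def has_spike(values, factor=2.0):
    """True if the largest value exceeds `factor` times the median."""
    if not values:
        return False
    median = statistics.median(values)
    return median > 0 and max(values) > factor * median
```

With boto3, the returned list can be passed as the MetricDataQueries argument of the CloudWatch client's get_metric_data call, together with StartTime and EndTime. The two helper functions then run on the returned Values lists.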

To resolve spikes and steady increases, complete the following tasks:

If you turned on debug logs for the application, then the application resource utilization might be high. To reduce the amount of logging, turn on debug logs only temporarily, while you investigate an issue.
Analyze the TaskManager thread dump in the Apache Flink dashboard. For example, you can identify the CPU-intensive processes from the thread dump.
Review the flame graphs in the Apache Flink dashboard. Flink constructs them by sampling stack traces multiple times. To check for blocked calls, use the off-CPU flame graphs. For information about flame graphs, see Flame graphs on the Apache Flink website.

Throttling errors

If a source or sink is under-provisioned, your application might experience throttling errors when it reads from or writes to streaming services, such as Amazon Kinesis Data Streams. Throttling can cause the application to crash. To check the throughput for the source and sink, use CloudWatch metrics such as WriteProvisionedThroughputExceeded and ReadProvisionedThroughputExceeded. To accommodate the data volume, increase the number of shards in your data streams.
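
To get a rough idea of how many shards a provisioned-mode stream needs, you can compare your expected traffic with the per-shard quotas. The following Python sketch assumes the commonly documented quotas of 1 MB per second and 1,000 records per second for writes, and 2 MB per second for reads, per shard. The function name and the 25% headroom default are hypothetical. Check the current Kinesis Data Streams quotas before you rely on the numbers.

```python
import math

# Assumed per-shard quotas for a provisioned-mode stream
# (decimal megabytes, which is slightly conservative).
WRITE_BYTES_PER_SHARD = 1_000_000
WRITE_RECORDS_PER_SHARD = 1_000
READ_BYTES_PER_SHARD = 2_000_000


def estimate_shards(write_bytes_per_sec, records_per_sec,
                    read_bytes_per_sec, headroom=0.25):
    """Estimate the shard count that covers the busiest dimension plus headroom."""
    needed = max(
        write_bytes_per_sec / WRITE_BYTES_PER_SHARD,
        records_per_sec / WRITE_RECORDS_PER_SHARD,
        read_bytes_per_sec / READ_BYTES_PER_SHARD,
    )
    return max(1, math.ceil(needed * (1 + headroom)))


if __name__ == "__main__":
    # Hypothetical workload: 3.5 MB/s written, 2,000 records/s, 7 MB/s read.
    print(estimate_shards(3_500_000, 2_000, 7_000_000))
```

The estimate takes the largest of the three ratios because a stream is throttled as soon as any one of the per-shard limits is exceeded.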

Timeout errors

The FlinkKinesisProducer uses the Kinesis Producer Library (KPL) to put data from a Flink stream into a Kinesis data stream. A timeout error can cause failures in the KPL that might cause the Flink application to restart. In this case, you might see an increase in the buffering time and the number of retries. To keep records from expiring, you can modify the RecordMaxBufferedTime, RecordTtl, and RequestTimeout configurations for the KPL. For more information, see default_config.properties on the GitHub website. Also, monitor important KPL metrics, such as ErrorsByCode, RetriesPerRecord, and UserRecordsPending. When these metrics show that the application restarted, use the filters in CloudWatch Logs Insights to understand the failures that caused the application to restart.
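
As an illustration of tuning these three KPL settings together, the following Python sketch collects the overrides and renders them as key=value lines that could be loaded into the producer configuration. It assumes that all three values are in milliseconds. The function names are hypothetical, and the sanity check (RecordTtl longer than the buffering time plus one request timeout) is a rule of thumb of this sketch, not a documented KPL requirement.

```python
def build_kpl_overrides(record_max_buffered_time_ms, record_ttl_ms,
                        request_timeout_ms):
    """Return KPL overrides as strings, after a basic consistency check."""
    # Rule of thumb (assumption): a record should outlive the buffering delay
    # plus at least one full request attempt, or it can expire before it is sent.
    if record_ttl_ms <= record_max_buffered_time_ms + request_timeout_ms:
        raise ValueError(
            "RecordTtl should exceed RecordMaxBufferedTime + RequestTimeout"
        )
    return {
        "RecordMaxBufferedTime": str(record_max_buffered_time_ms),
        "RecordTtl": str(record_ttl_ms),
        "RequestTimeout": str(request_timeout_ms),
    }


def render_properties(overrides):
    """Render the overrides as Java-style key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())


if __name__ == "__main__":
    # Hypothetical values; choose yours based on observed latency and retries.
    print(render_properties(build_kpl_overrides(500, 60_000, 10_000)))
```

In a Java Flink job, the same keys are typically set on the Properties object that is passed to the FlinkKinesisProducer constructor.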

Note that not all errors lead to an immediate restart of the application. For example, errors in the application code might result in a directed acyclic graph (DAG) workflow error. In this case, the DAG for your application doesn't get created. The application shuts down and doesn't immediately restart. Also, the application doesn't immediately restart when you get an Access Denied error.

If the issue still persists, contact AWS Support and provide the following information:

Application ARN
Information about the source and sink of your application
CloudWatch logs for your application
Time of issue in UTC
Relevant thread dumps from the Apache Flink dashboard

Related information

Application is restarting

AWS OFFICIAL. Updated 2 years ago
