How do I resolve the "Courier fetch: n of m shards failed" error in Kibana on Amazon Elasticsearch Service?

Last updated: 2019-08-23

When I try to load a dashboard in Kibana on my Amazon Elasticsearch Service (Amazon ES) domain, I get an error similar to the following: "Error: Courier Fetch: 5 of 60 shards failed."

Note: There are many possible causes of this error. This article covers some of the common root causes and solutions.

Short Description

When you load a dashboard in Kibana, Kibana sends a search request to the Amazon ES domain. The search request is routed to a cluster node that acts as the coordinating node for the request. The "Courier fetch: n of m shards failed" error happens when the coordinating node fails to complete the fetch phase of the search request. There are two types of issues that commonly cause this error:

  • Persistent issues: mapping conflicts or unassigned shards. If your index pattern spans several indices, and some of those indices have fields with the same name but different mapping types, you might get a courier fetch error. If your cluster is in red status, at least one shard is unassigned. Because Elasticsearch can't fetch documents from unassigned shards, a cluster in red status throws courier fetch errors (see the health-check sketch after this list). If the value of "n" in the error message ("Courier fetch: n of m shards failed") is the same every time, a persistent issue is the likely cause. Retrying the request or provisioning more cluster resources won't resolve persistent issues. Check the application error logs for troubleshooting suggestions.
  • Transient issues: threadpool queue rejections, search timeouts, tripped field data circuit breakers, and so on. These problems happen when the cluster doesn't have enough compute resources. If the errors occur intermittently and the value of "n" in the error message differs each time, a transient issue is the likely cause. You can also monitor Amazon CloudWatch metrics such as CPUUtilization, JVMMemoryPressure, and ThreadpoolSearchRejected to determine whether a transient issue is causing the courier fetch error.
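For example, to check whether an unassigned shard is behind a persistent courier fetch error, you can query the cluster health and shard catalog APIs directly. A minimal sketch; the domain endpoint shown is a placeholder, so substitute your own:

# "red" status means at least one primary shard is unassigned
curl -s "https://my-domain.us-east-1.es.amazonaws.com/_cluster/health?pretty"

# List unassigned shards, along with the reason they are unassigned
curl -s "https://my-domain.us-east-1.es.amazonaws.com/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED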

Resolution

Enable application error logs for the domain. The logs can help you identify the root cause and solution for both transient and persistent issues.
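You can enable the application (error) logs in the console or with the AWS CLI. A sketch with hypothetical domain and log group names; the CloudWatch Logs log group must already exist and have a resource policy that lets the Amazon ES service write to it:

# Publish the domain's application (error) logs to CloudWatch Logs
aws es update-elasticsearch-domain-config \
  --domain-name my-domain \
  --log-publishing-options "ES_APPLICATION_LOGS={CloudWatchLogsLogGroupArn=arn:aws:logs:us-east-1:123456789012:log-group:my-es-application-logs,Enabled=true}"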

Persistent issues

The following is an example of a log entry for a courier fetch error that is caused by a persistent issue.

Note: The log entries don't always look like this—yours might be different.

[2019-07-01T12:54:02,791][DEBUG][o.e.a.s.TransportSearchAction] [ip-xx-xx-xx-xxx] [1909731] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [ip-xx-xx-xx-xx][xx.xx.xx.xx:9300][indices:data/read/search[phase/fetch/id]]
Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. 
Set fielddata=true on [request_departure_date] in order to load fielddata in memory by uninverting the inverted index.
Note that this can however use significant memory. Alternatively use a keyword field instead.

In this example, the issue is caused by the request_departure_date field. The log entry shows that you can resolve the issue by setting fielddata=true in the field's mapping or by querying a keyword field instead.
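For example, you could enable fielddata on that field with a mapping update. A minimal sketch, assuming the index is named my-index and the domain runs Elasticsearch 7.x (older versions need the mapping type in the path); the memory warning in the log entry still applies:

# Enable fielddata on the text field named in the log entry
curl -s -X PUT "https://my-domain.us-east-1.es.amazonaws.com/my-index/_mapping" \
  -H 'Content-Type: application/json' \
  -d '{
  "properties": {
    "request_departure_date": {
      "type": "text",
      "fielddata": true
    }
  }
}'

If the field was dynamically mapped as text with a keyword sub-field, the lighter-weight fix is to aggregate on request_departure_date.keyword and leave fielddata disabled.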

Transient issues

The following is an example of a log entry for a courier fetch error that's caused by a transient issue.

Note: The log entries don't always look like this—yours might be different.

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@26fdeb6f on QueueResizingEsThreadPoolExecutor
[name = __PATH__ queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 2.9ms, adjustment amount = 50,
org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@1968ac53[Running, pool size = 2, active threads = 2, queued tasks = 1015, completed tasks = 96587627]]

In this example, the issue is caused by search threadpool queue rejections. To resolve this problem, scale up your domain by choosing a larger instance type.
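To confirm search threadpool rejections like the one in this log entry, you can watch the search threadpool statistics on each node (a sketch; the endpoint is a placeholder):

# Active threads, queue depth, and cumulative rejections per node for the search threadpool
curl -s "https://my-domain.us-east-1.es.amazonaws.com/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected"

A steadily growing rejected count while queue sits at its capacity (1,000 in the log entry above) indicates that search requests are arriving faster than the nodes can process them.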

Most transient issues can be resolved with one of the following methods:

Provision more compute resources

Scale up to a larger instance type, or scale out by adding data nodes to the domain. More CPU and JVM heap per node reduces threadpool rejections and circuit breaker trips.
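For example, with the AWS CLI you could move the domain to larger instances or add nodes. The instance type and count below are illustrative only; choose values that fit your workload:

# Change the domain's data node instance type and count
aws es update-elasticsearch-domain-config \
  --domain-name my-domain \
  --elasticsearch-cluster-config "InstanceType=r5.large.elasticsearch,InstanceCount=3"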

Reduce the resource utilization for your queries

  • Confirm that you're following best practices for shard and cluster architecture. A poorly designed cluster can't use all of its available resources: some nodes are overloaded while others sit idle, and fetch requests sent to the overloaded nodes can fail or time out.
  • Reduce the scope of your query. For example, if you query over a time frame, reduce the date range, or filter the results by configuring the index pattern in Kibana.
  • Avoid running "select *"-style queries that return every field of every document on large indices. Instead, use filters to query a subset of the index, and search as few fields as possible (see the query sketch after this list).
  • Reindex and reduce the number of shards (see the reindex sketch after this list). The more shards in your Elasticsearch cluster, the more likely you are to get a courier fetch error. Because each shard consumes its own share of cluster resources, a large number of shards puts excessive strain on the cluster. To reduce the number of shards in your cluster, see My Amazon Elasticsearch Service domain has been stuck in the Processing state for a long time.
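The following query sketch illustrates the points above: it filters to a one-day window, returns only two fields, and avoids a match-all over the whole index. The index and field names are hypothetical:

# Filtered, field-limited search instead of a full "select *" over the index
curl -s "https://my-domain.us-east-1.es.amazonaws.com/my-index/_search" \
  -H 'Content-Type: application/json' \
  -d '{
  "_source": ["request_departure_date", "status"],
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1d/d", "lte": "now" } } },
        { "term": { "status": "error" } }
      ]
    }
  }
}'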
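And a reindex sketch for shrinking the shard count: create a destination index with fewer primary shards, then copy the documents into it with the _reindex API. The index names and shard counts are hypothetical:

# Destination index with a single primary shard
curl -s -X PUT "https://my-domain.us-east-1.es.amazonaws.com/my-index-v2" \
  -H 'Content-Type: application/json' \
  -d '{ "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 1 } }'

# Copy documents from the oversharded index into the new one
curl -s -X POST "https://my-domain.us-east-1.es.amazonaws.com/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{ "source": { "index": "my-index" }, "dest": { "index": "my-index-v2" } }'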
