How can I resolve the Amazon SageMaker inference error "upstream timed out (110: Connection timed out) while reading response header from upstream"?

When I deploy an Amazon SageMaker endpoint or run a batch transform job, the connection times out with an error similar to the following:

upstream timed out (110: Connection timed out) while reading response header from upstream, client: 169.xxx.xxx.xxx, server: , request: "POST /invocations HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock/invocations", host: "169.xxx.xxx.xxx:8080"

Short description

This error indicates a problem with the connection between NGINX and the web server. Both of these components run in the model container, whether you use your own container or a prebuilt container. These components aren't directly related to SageMaker hosting or batch transforms. However, when the connection between NGINX and the web server times out, SageMaker can't get inferences from the /invocations endpoint.

To resolve this issue:

  1. Reduce the latency of the algorithm container or increase the container's timeout limit.
  2. Increase the NGINX.conf timeout settings.

Resolution

Reduce the latency of the algorithm container or increase the timeout limit

  • If you're running inference code for hosting services: Your model containers must respond to requests within 60 seconds. The model itself can have a maximum processing time of 60 seconds. If you know that your model needs 50-60 seconds of processing time, then set the SDK socket timeout to 70 seconds. For more information, see How your container should respond to inference requests.
  • If you're running inference code for batch transform: Use ModelClientConfig to configure the InvocationsTimeoutInSeconds and InvocationsMaxRetries parameters.
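
The two options above can be sketched with the AWS SDK for Python (Boto3). This is a minimal sketch, not a definitive implementation: the helper names build_model_client_config and make_runtime_client are hypothetical, the 900-second timeout and single retry are example values, and the validation ranges reflect my reading of the CreateTransformJob API reference, so confirm them before relying on them.

```python
from typing import Any, Dict


def build_model_client_config(timeout_seconds: int, max_retries: int) -> Dict[str, int]:
    """Return a ModelClientConfig dict to pass to CreateTransformJob."""
    # Assumed ranges (verify against the current API reference):
    # InvocationsTimeoutInSeconds 1-3600, InvocationsMaxRetries 0-3.
    if not 1 <= timeout_seconds <= 3600:
        raise ValueError("timeout_seconds must be between 1 and 3600")
    if not 0 <= max_retries <= 3:
        raise ValueError("max_retries must be between 0 and 3")
    return {
        "InvocationsTimeoutInSeconds": timeout_seconds,
        "InvocationsMaxRetries": max_retries,
    }


def make_runtime_client(read_timeout_seconds: int = 70) -> Any:
    """Create a sagemaker-runtime client with a longer socket read timeout.

    For hosting, 70 seconds leaves headroom above the 60-second limit
    that the model container has to respond within.
    """
    # Imported lazily so that the helper above works without boto3 installed.
    import boto3
    from botocore.config import Config

    # max_attempts=0 turns off automatic retries after the initial request.
    config = Config(read_timeout=read_timeout_seconds, retries={"max_attempts": 0})
    return boto3.client("sagemaker-runtime", config=config)


# Example values only: allow each batch transform invocation up to 15 minutes.
model_client_config = build_model_client_config(timeout_seconds=900, max_retries=1)
print(model_client_config)
```

Pass the resulting dictionary as the ModelClientConfig argument of create_transform_job. For a hosted endpoint, call invoke_endpoint on the client that make_runtime_client returns.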

Amazon SageMaker sets environment variables specified in CreateModel and CreateTransformJob on your container. Adjust the following API parameters to reduce the latency of the algorithm container. For example, if the input is splittable, limit the payload size of each request by setting the MaxPayloadInMB field when you create a transform job.

  • MaxPayloadInMB: The maximum size of the payload that is sent to the container. If the container can quickly process a batch transform, then increase this property. If the batch transform takes longer than expected, reduce this property.
  • MaxConcurrentTransforms: The default is 1. Increase this setting if you have more than one NGINX worker.
  • BatchStrategy: To fit as many records as possible in a mini-batch (up to the MaxPayloadInMB limit), set BatchStrategy to MultiRecord and SplitType to Line.
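
As an illustration of how these parameters fit together, the following sketch assembles the keyword arguments for a Boto3 create_transform_job call. The helper name build_transform_job_request, the job and model names, the S3 locations, the text/csv content type, and the payload size are all placeholder assumptions, so replace them with values that match your workload.

```python
from typing import Any, Dict


def build_transform_job_request(
    job_name: str,
    model_name: str,
    input_s3_uri: str,
    output_s3_path: str,
    max_payload_mb: int,
    max_concurrent_transforms: int,
) -> Dict[str, Any]:
    """Assemble keyword arguments for sagemaker.create_transform_job."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # A smaller payload per request shortens each /invocations call.
        "MaxPayloadInMB": max_payload_mb,
        "MaxConcurrentTransforms": max_concurrent_transforms,
        # MultiRecord with SplitType=Line packs as many lines as fit
        # under MaxPayloadInMB into each mini-batch.
        "BatchStrategy": "MultiRecord",
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",  # assumption: line-delimited CSV input
            "SplitType": "Line",
        },
        "TransformOutput": {"S3OutputPath": output_s3_path, "AssembleWith": "Line"},
        "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
    }


# Placeholder names and S3 locations for illustration only.
request = build_transform_job_request(
    job_name="example-transform-job",
    model_name="example-model",
    input_s3_uri="s3://example-bucket/input/",
    output_s3_path="s3://example-bucket/output/",
    max_payload_mb=2,
    max_concurrent_transforms=1,
)

# To submit the job (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_transform_job(**request)
```

If requests still time out, lower max_payload_mb first, because it directly bounds how much data the container must process within a single request.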

If you're using a SageMaker framework container that implements Gunicorn, then pass these properties to the Docker container as environment variables:

  • SAGEMAKER_MODEL_SERVER_TIMEOUT: The timeout for the Gunicorn server. To allow more time for the request to be processed before the connection is closed, increase this value.
  • SAGEMAKER_MODEL_SERVER_WORKERS: The number of workers per CPU.
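
One way to pass these variables is through the Environment map of the container definition in CreateModel. The following is a sketch under stated assumptions: the helper names, the model name, image URI, model artifact location, role ARN, and the chosen timeout and worker values are hypothetical placeholders.

```python
from typing import Any, Dict


def build_server_environment(timeout_seconds: int, workers: int) -> Dict[str, str]:
    """Return the Gunicorn-related environment variables as strings."""
    # Environment values in the SageMaker API must be strings.
    return {
        "SAGEMAKER_MODEL_SERVER_TIMEOUT": str(timeout_seconds),
        "SAGEMAKER_MODEL_SERVER_WORKERS": str(workers),
    }


def build_create_model_request(
    model_name: str,
    image_uri: str,
    model_data_url: str,
    role_arn: str,
    environment: Dict[str, str],
) -> Dict[str, Any]:
    """Assemble keyword arguments for sagemaker.create_model."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
            "Environment": environment,
        },
    }


# Example values only; every identifier below is a placeholder.
environment = build_server_environment(timeout_seconds=300, workers=2)
create_model_request = build_create_model_request(
    model_name="example-model",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
    model_data_url="s3://example-bucket/model/model.tar.gz",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    environment=environment,
)

# To create the model (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_model(**create_model_request)
```

The same dictionary can be supplied as the Environment argument of create_transform_job when you want the setting to apply to a single batch transform job.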

Increase the NGINX.conf timeout settings

If you're using one of Amazon SageMaker's prebuilt Docker containers, then you can't modify the NGINX.conf file. You can modify NGINX.conf only if you're using your own Docker container.

NGINX timeouts can cause failures because Amazon SageMaker closes the connection after the timeout. If your container tries to read from or write to the closed connection, then the request fails. Modify one or more of the following properties to accommodate network overhead.

  • proxy_read_timeout: This is the amount of time that NGINX waits to read a response from the model server after it sends a request. Increase this value to allow more time for the request to be processed before the connection is closed.
  • worker_processes: This is the number of NGINX worker processes that handle inbound connections. In most cases, the value should be equal to or greater than the number of CPU cores. For example, for a two-core instance type such as ml.m5.large, set this property to a minimum of two.
  • worker_connections: This is the maximum number of simultaneous connections for each worker process. It's a best practice to set the starting value for this to 1024.
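
For a custom container, one possible approach is to generate NGINX.conf from a small Python startup script so that the three properties can be tuned without editing the file by hand. The sketch below is illustrative only: the function name render_nginx_conf, the upstream name gunicorn, and the 120-second timeout are assumptions, while the socket path and port are taken from the error message at the top of this article.

```python
def render_nginx_conf(
    proxy_read_timeout_seconds: int,
    worker_processes: int,
    worker_connections: int = 1024,
) -> str:
    """Render a minimal NGINX configuration that proxies to Gunicorn."""
    lines = [
        f"worker_processes {worker_processes};",
        "",
        "events {",
        f"  worker_connections {worker_connections};",
        "}",
        "",
        "http {",
        "  upstream gunicorn {",
        "    server unix:/tmp/gunicorn.sock;",
        "  }",
        "",
        "  server {",
        "    listen 8080;",
        # How long NGINX waits to read a response from the model server.
        f"    proxy_read_timeout {proxy_read_timeout_seconds}s;",
        "",
        "    location ~ ^/(ping|invocations) {",
        "      proxy_pass http://gunicorn;",
        "    }",
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


# Example values only: a two-core instance with a 120-second read timeout.
nginx_conf = render_nginx_conf(proxy_read_timeout_seconds=120, worker_processes=2)
print(nginx_conf)
```

Write the rendered text to the path that your container's NGINX process loads at startup. Keep the Gunicorn timeout at least as long as proxy_read_timeout so that the web server doesn't give up on a request before NGINX does.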

For more information about configuration settings, see Module ngx_http_proxy_module in the NGINX documentation.


AWS OFFICIAL
Updated 2 years ago