How can I resolve SageMaker Python SDK rate exceeded and throttling exceptions?

Last updated: 2021-04-19

How can I resolve throttling errors such as "botocore.exceptions.ClientError: An error occurred (ThrottlingException)" when using the Amazon SageMaker Python SDK?

Short description

API calls to any AWS service can't exceed the maximum allowed API request rate per account and per AWS Region. These API calls might be from an application, the AWS Command Line Interface (AWS CLI), or the AWS Management Console. If the API requests exceed the maximum rate, you receive the “Rate Exceeded” error, and the API calls get throttled.

You might get this error when calling SageMaker APIs because the default retry configuration in Boto3 allows only a limited number of retry attempts. You can override this configuration to increase the number of retry attempts and the timeouts for connecting and reading a response.

You can resolve this error by creating a SageMaker Boto3 client with a custom retry configuration and passing that client to the SageMaker Python SDK session.
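To make the retry behavior more concrete, the following is a minimal sketch of retrying a throttled call with exponential backoff and jitter. It is not the Boto3 implementation. The helper names is_throttling_error and call_with_backoff, the delay values, and the set of error codes are assumptions chosen for illustration. The only thing it relies on is that a botocore ClientError carries the error code under response['Error']['Code'].

```python
import random
import time

# Illustrative set of error codes treated as throttling (an assumption;
# extend it for the services that you call).
THROTTLE_CODES = {"ThrottlingException"}


def is_throttling_error(exc):
    """Return True if exc carries a throttling error code in exc.response."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "") in THROTTLE_CODES


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep):
    """Call func(), retrying throttling errors with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failure.
            if not is_throttling_error(exc) or attempt == max_attempts:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Raising max_attempts in the Boto3 configuration has a similar effect: the client keeps retrying a throttled request for longer before it surfaces the exception to your code.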

Resolution

1.    Create a SageMaker Boto3 client with a custom retry configuration. Example:

import boto3
from botocore.config import Config

# Raise the retry limit and set explicit connect and read timeouts (in seconds).
retry_config = Config(connect_timeout=5, read_timeout=60, retries={'max_attempts': 20})
sm_boto = boto3.client('sagemaker', config=retry_config)
print(sm_boto.meta.config.retries)

2.    Create a SageMaker Python SDK session that uses the Boto3 client from the previous step. Example:

import sagemaker

# Pass the custom client so that SDK calls inherit its retry configuration.
sagemaker_session = sagemaker.Session(sagemaker_client=sm_boto)
region = sagemaker_session.boto_session.region_name
print(sagemaker_session.sagemaker_client.meta.config.retries)

3.    Test a SageMaker API with multiple requests from the SageMaker Python SDK. Example:

import multiprocessing

def worker(training_job_name):
    # Each process issues one DescribeTrainingJob request through the session's client.
    response = sagemaker_session.sagemaker_client.describe_training_job(
        TrainingJobName=training_job_name)
    print(response['TrainingJobName'])

if __name__ == '__main__':
    jobs = []
    training_job_name = 'your-job-name'
    for i in range(10):
        p = multiprocessing.Process(target=worker, args=(training_job_name,))
        jobs.append(p)
        p.start()
    # Wait for all processes to finish before exiting.
    for p in jobs:
        p.join()

4.    Create an instance of the sagemaker.estimator.Estimator class with the sagemaker_session parameter. Example:

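# container, role, and s3_output_location are placeholders for your own values.
# Parameter names follow SageMaker Python SDK v2 (v1 used a train_ prefix,
# for example train_instance_count).
estimator = sagemaker.estimator.Estimator(container,
                                          role,
                                          instance_count=1,
                                          instance_type='ml.c4.4xlarge',
                                          volume_size=30,
                                          max_run=360000,
                                          input_mode='File',
                                          output_path=s3_output_location,
                                          sagemaker_session=sagemaker_session)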

5.    To confirm that the retry configuration resolves the throttling exceptions, launch a training job from the estimator that you created in the previous step:

estimator.fit()
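The concurrency test in step 3 prints results from separate processes, which makes failures easy to miss. As an alternative, the following is a minimal sketch that runs the calls in threads and collects results and exceptions so that you can count how many requests were still throttled. The helper name run_concurrently and its parameters are hypothetical and not part of Boto3 or the SageMaker Python SDK.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(call, args_list, max_workers=10):
    """Run call(arg) for every arg in parallel and return (results, errors)."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(call, arg) for arg in args_list]
        # Iterate in submission order so results line up with args_list.
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:
                # Keep the exception instead of losing it in a worker.
                errors.append(exc)
    return results, errors
```

With the session from step 2, you could pass a function that calls sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=name) as call and a list of job names as args_list. An empty errors list suggests that the retry configuration absorbed any throttling.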
