How can I resolve SageMaker Python SDK rate exceeded and throttling exceptions?
Last updated: 2020-10-12
How can I resolve throttling errors such as "botocore.exceptions.ClientError: An error occurred (ThrottlingException)" when using the Amazon SageMaker Python SDK?
Short description
Create a SageMaker boto3 client with a custom retry configuration, and then pass that client to the SageMaker Python SDK session.
Resolution
1. Create a SageMaker boto3 client with a custom retry configuration. Example:
import boto3
from botocore.config import Config
sm_boto = boto3.client('sagemaker', config=Config(connect_timeout=5, read_timeout=60, retries={'max_attempts': 20}))
print(sm_boto.meta.config.retries)
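To see why a larger retry budget helps with throttling, the behavior can be approximated in plain Python. This is a minimal sketch under stated assumptions, not botocore's implementation: FakeThrottlingError, call_with_retries, and make_flaky_call are hypothetical names, and the backoff formula (exponential with full jitter) is chosen only for illustration. Here total_attempts counts the first call as well as the retries.

```python
import random
import time


class FakeThrottlingError(Exception):
    """Hypothetical stand-in for a throttling error returned by a service."""


def call_with_retries(func, total_attempts=5, base_delay=0.01, max_delay=1.0,
                      sleep=time.sleep):
    """Call func(), retrying on FakeThrottlingError with exponential backoff."""
    for attempt in range(1, total_attempts + 1):
        try:
            return func()
        except FakeThrottlingError:
            if attempt == total_attempts:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


def make_flaky_call(failures):
    """Return a function that raises `failures` times, then succeeds, plus its call counter."""
    state = {"calls": 0}

    def flaky_call():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise FakeThrottlingError("Rate exceeded")
        return "ok"

    return flaky_call, state
```

A call that is throttled three times succeeds on the fourth attempt when total_attempts is 5, but raises the error when total_attempts is 3. Raising the retry limit in step 1 trades longer waits for fewer failed requests in the same way.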
2. Create a SageMaker Python SDK session that uses the boto3 client from the previous step. Example:
import sagemaker
sagemaker_session = sagemaker.Session(sagemaker_client=sm_boto)
region = sagemaker_session.boto_session.region_name
print(sagemaker_session.sagemaker_client.meta.config.retries)
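3. Test the retry configuration by sending multiple concurrent requests to a SageMaker API through the SageMaker Python SDK session. Example:
import multiprocessing

def worker(TrainingJobName):
    # Each process calls DescribeTrainingJob through the session's boto3 client.
    print(sagemaker_session.sagemaker_client
          .describe_training_job(TrainingJobName=TrainingJobName)
          ['TrainingJobName'])

if __name__ == '__main__':
    jobs = []
    TrainingJobName = 'your-job-name'  # replace with the name of an existing training job
    for i in range(10):
        p = multiprocessing.Process(target=worker, args=(TrainingJobName,))
        jobs.append(p)
        p.start()
    # Wait for all worker processes to finish.
    for p in jobs:
        p.join()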
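When you run a test like the one in step 3, it can help to separate throttling failures from other errors. botocore's ClientError exposes the service error code at response['Error']['Code']. The sketch below reads that field by duck typing so that it runs without AWS access. The helper name is_throttling_error, the StubClientError class, and the set of error codes are illustrative assumptions, and the set is not exhaustive.

```python
# Assumed set of error codes that indicate throttling; extend it as needed.
THROTTLING_CODES = {"ThrottlingException", "Throttling", "TooManyRequestsException"}


def is_throttling_error(exc):
    """Return True if exc carries a ClientError-style response with a throttling code."""
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    return code in THROTTLING_CODES


class StubClientError(Exception):
    """Hypothetical stand-in that mimics the shape of botocore's ClientError."""

    def __init__(self, code, message):
        super().__init__(message)
        self.response = {"Error": {"Code": code, "Message": message}}


if __name__ == "__main__":
    for err in (StubClientError("ThrottlingException", "Rate exceeded"),
                StubClientError("ValidationException", "Bad input"),
                ValueError("not a service error")):
        print(type(err).__name__, is_throttling_error(err))
```

In a real worker, the same check could be applied inside an except block around describe_training_job to count how many requests still fail from throttling after the retries are exhausted.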
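4. Create an instance of the sagemaker.estimator.Estimator class, and pass the session from step 2 in the sagemaker_session parameter. Example:
# container, role, and s3_output_location must already be defined for your account.
# The train_* parameter names follow version 1.x of the SageMaker Python SDK.
# In version 2.x, they are instance_count, instance_type, volume_size, and max_run.
estimator = sagemaker.estimator.Estimator(container,
                                          role,
                                          train_instance_count=1,
                                          train_instance_type='ml.c4.4xlarge',
                                          train_volume_size=30,
                                          train_max_run=360000,
                                          input_mode='File',
                                          output_path=s3_output_location,
                                          sagemaker_session=sagemaker_session)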
5. To confirm that the retry configuration resolves the throttling exceptions, launch a training job from the estimator that you created in the previous step:
estimator.fit()