AWS Database Blog

Analyzing the impact of Python version on Amazon DynamoDB scan performance

Amazon DynamoDB is a NoSQL database that allows for a flexible schema, meaning that items in the same table can differ from each other in which attributes they contain.

In an earlier AWS Database Blog post, we looked at the performance impact of the number of attributes per item. Recently, while helping a customer resolve some slow scans in their Python application, we examined different versions of the Python interpreter to see whether upgrading Python improves performance.

This post analyzes data to demonstrate how the Python interpreter version can affect scan performance.

Methodology

We created a DynamoDB table with a simple structure: a primary key consisting of a partition key and a sort key (both strings), plus 24 separate 6-character string attributes (6 x 24 = 144 characters of attribute data per item), each with a 7-character attribute name (“field01”, “field10”, and so on).
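For illustration, one seeded item has roughly the following shape. The key attribute names pk and sk and the random values shown here are examples, not the tool's exact output:

# Illustrative shape of one seeded item; key names and values are examples.
item = {
    "pk": "3f8a1c",       # partition key (string)
    "sk": "9d42b7",       # sort key (string)
    "field01": "aB3xY9",  # 7-character attribute name, 6-character value
    "field02": "Qw8Lm2",
    # ... continues through "field24", for 24 attributes in total
}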

DynamoDB has a 1-MB limit on the amount of data it retrieves in a single request. After seeding the table with 10,000 items of random data, we ran a single scan that retrieved as many items as fit within the 1-MB limit. We then measured how long it took to retrieve and unmarshal those items on the client side.
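Conceptually, each measurement resembles the following sketch. This is an illustration that assumes the seeded table described above already exists; it is not the published tool itself, which adds seeding, rounds, and reporting. The boto3 resource API deserializes the response into plain Python dictionaries, so the elapsed time covers both network retrieval and unmarshaling:

import time
import boto3

# Minimal sketch of one measurement, runnable on both Python 2 and 3.
table = boto3.resource("dynamodb", region_name="us-east-2").Table("dynamodb-speed-test-blog")

start = time.time()
page = table.scan()  # a single request returns at most 1 MB of items
elapsed_ms = (time.time() - start) * 1000

print("Retrieved and unmarshaled {0} items in {1:.1f} ms".format(len(page["Items"]), elapsed_ms))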

Walkthrough

Here are some typical questions when benchmarking scan performance:

  • How does my performance vary if I use one version of Python versus another?
  • How does my performance vary if I have 20 attributes versus merging some of them so that I have only five?
  • How does my performance vary if I rename my attributes from long names like “thisIsAnEnterpriseAppWithLongAttributeNames” to “foo” or even just “x”?

Benchmarking tool

We have made public the small benchmarking tool that we used for these tests. The tool allows you to do the following:

  • Reproduce these tests in your own environment.
  • Provide a custom definition for table structure.

You can specify the name of each attribute in your table, as well as the size of the data in that attribute. The results should allow you to answer questions like those listed earlier.
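For example, the schema used later in this post pairs 24 attribute names of 7 characters each with 6 bytes of string data. Conceptually, that definition looks like the following Python sketch; this is illustrative only and is not the tool's actual schema file format (see the repo for that):

# Hypothetical illustration only -- see the GitHub repo for the tool's actual
# schema file format. Conceptually, a schema maps each attribute name to the
# size of the random string data generated for it.
schema = {"field{0:02d}".format(i): 6 for i in range(1, 25)}  # 24 attributes, 6-byte values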

For more information, see the dynamodb-python-query-speed-test GitHub repo.

Empirical results

The following table summarizes the results.

Python version | Avg. time to scan 1 MB (one page) | Time to scan 50,000 items
2.7.16         | 1,112 ms                          | ~20 seconds
3.7.2          | 618 ms                            | ~12 seconds

With 24 attributes, each with a 7-character attribute name and a 6-character string value, we were able to retrieve and unmarshal 2,819 items within the 1-MB scan response limit. With Python 2.7.16, scanning 1 MB of items across five rounds took an average of 1,112 ms per scan page, and scanning all 50,000 items (10,000 seeded items x 5 rounds) took approximately 20 seconds.

# pyenv local 2.7.16
# python run.py --table dynamodb-speed-test-blog --region us-east-2 --query 10000 --rounds 5 --rcu 100000 --wcu 100000 --schema schemas/24_attributes_7_char_names_6_byte_data.schema
...
GRAND TOTALS:
    Items queried: 50000
    Elapsed time: 19754.6 ms
    Avg. time per item: 0.395 ms

With Python 3.7.2, those same 2,819 items took an average of 618 ms to retrieve and unmarshal, and scanning all 50,000 items took approximately 12 seconds.

# pyenv local 3.7.2
# python run.py --table dynamodb-speed-test-blog --region us-east-2 --query 10000 --rounds 5 --rcu 100000 --wcu 100000 --schema schemas/24_attributes_7_char_names_6_byte_data.schema
...
GRAND TOTALS:
    Items queried: 50000
    Elapsed time: 11875.6 ms
    Avg. time per item: 0.238 ms

Conclusion

The results show that Python 3.7.2 is approximately 40% more efficient than Python 2.7.16 in two areas:

  • Handling the network traffic between the client and DynamoDB
  • The string parsing required to unmarshal the DynamoDB API responses into Python dictionaries (illustrated in the sketch below)
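To make the second point concrete, here is a minimal sketch of what unmarshaling involves, using boto3's public TypeDeserializer. DynamoDB returns every attribute wrapped in a type descriptor, and the client must parse each one into a native Python value; the attribute names and values here are illustrative:

from boto3.dynamodb.types import TypeDeserializer

# One item as it arrives in a scan response (DynamoDB wire format), with each
# attribute wrapped in a type descriptor such as {"S": ...} for strings.
raw_item = {"field01": {"S": "aB3xY9"}, "field02": {"S": "Qw8Lm2"}}

# The per-attribute parsing needed to turn it into a plain Python dict. This
# work repeats for every attribute of every item on every scan page, which is
# why interpreter speed is so visible in scan-heavy workloads.
deserializer = TypeDeserializer()
item = {name: deserializer.deserialize(value) for name, value in raw_item.items()}
print(item)  # {'field01': 'aB3xY9', 'field02': 'Qw8Lm2'}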

You should still consider shortening your attribute names and using the minimum number of attributes required to meet the indexing and query demands of your use case.

However, if you’re using an older version of Python, you should see significant performance gains simply by upgrading to the latest version of Python 3. Upgrading a large code base to Python 3 takes some work; for tools and documentation to help you transition your code, see Porting Python 2 Code to Python 3.

About the Authors

Chad Tindel is a DynamoDB Specialist Solutions Architect based out of New York City. He works with large enterprises to evaluate, design, and deploy DynamoDB-based solutions. Prior to joining Amazon, he held similar roles at Red Hat, Cloudera, MongoDB, and Elastic.

Mat Werber is a Solutions Architect on the AWS Community SA Team and is responsible for providing architectural guidance across the full AWS stack with a deep focus on Serverless, Redshift, DynamoDB, and RDS. He also has a background in IT and financial audit and public accounting.

Daniel Yoder is an LA-based senior NoSQL Specialist Solutions Architect at Amazon Web Services.