What object key naming pattern should I use for Amazon S3?

Last updated: 2020-03-30

I expect my Amazon Simple Storage Service (Amazon S3) bucket to get high request rates. What object key naming pattern should I use to get better performance?

Resolution

Your Amazon S3 bucket can support 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned prefix. To optimize your bucket for high request rates, consider using an object key naming pattern that distributes your objects across multiple prefixes. Each additional prefix enables your bucket to scale to support an additional 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second.

For example, the following objects in awsexamplebucket are all grouped into the prefix folderA:

awsexamplebucket/folderA/object-A1
awsexamplebucket/folderA/object-A2
awsexamplebucket/folderA/object-B1
awsexamplebucket/folderA/object-B2
awsexamplebucket/folderA/object-c1
awsexamplebucket/folderA/object-c2

To distribute these objects across multiple prefixes, you can move a differentiating character from the object name to the front of the key so that it becomes part of the prefix:

awsexamplebucket/Afolder/object-A1
awsexamplebucket/Afolder/object-A2
awsexamplebucket/Bfolder/object-B1
awsexamplebucket/Bfolder/object-B2
awsexamplebucket/Cfolder/object-c1
awsexamplebucket/Cfolder/object-c2
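
As a rough illustration, the following Python sketch computes the renamed keys shown above. The key layout and the helper name distribute_key are assumptions for this example only; they aren't part of any Amazon S3 API.

def distribute_key(original_key: str) -> str:
    """Move the differentiating character of the object name into the prefix."""
    name = original_key.split("/")[-1]               # for example, "object-A1"
    differentiator = name.split("-")[-1][0].upper()  # "A" from "object-A1"
    return f"{differentiator}folder/{name}"          # "Afolder/object-A1"

for key in ["folderA/object-A1", "folderA/object-B2", "folderA/object-c1"]:
    print(key, "->", distribute_key(key))
# folderA/object-A1 -> Afolder/object-A1
# folderA/object-B2 -> Bfolder/object-B2
# folderA/object-c1 -> Cfolder/object-c1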

For more information and naming examples, see Amazon S3 Performance Tips & Tricks.

Data lake applications

Some applications run on data lakes that use date-based naming conventions, as recommended by engines such as Hive, Spark, or Presto. If your data lake application has a very high throughput, the date-based naming convention might need additional tuning.

Important: This level of throughput is uncommon and applies to use cases in which petabytes of data are analyzed by thousands of CPU cores. If you don't require this scale, then you don't need to implement additional tuning.

An object key name that follows the date-based naming convention often looks similar to the following:

awsexamplebucket/HadoopTableName/dt=yyyy-mm-dd/objectname
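
As a minimal sketch, building a key that follows this convention is straightforward string formatting; the table and object names below are placeholders, not values from your application:

from datetime import date

table_name = "HadoopTableName"   # placeholder table name
object_name = "part-00000"       # placeholder object name

# All of a day's objects share the single prefix "HadoopTableName/dt=YYYY-MM-DD/",
# so they also share that prefix's request-rate limit.
key = f"{table_name}/dt={date.today().isoformat()}/{object_name}"
print(key)  # for example, HadoopTableName/dt=2020-03-30/part-00000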

If requests to a table that maps to a single Amazon S3 key name prefix exceed the supported request rate for that prefix, then the application can receive 503 Slow Down errors. To scale to higher request rates, it's a best practice to split a table that maps to a single key name prefix across multiple prefixes by using a natural partition key.

For example, if you have a "users" table that maps to an Amazon S3 prefix, then you could partition the table by the country of the user:

awsexamplebucket/users/US/dt=yyyy-mm-dd
awsexamplebucket/users/CA/dt=yyyy-mm-dd
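
For example, the following boto3 sketch writes an object under such a partitioned key. The bucket, table, and object names are placeholders based on the examples above, and country stands in for whatever natural partition key fits your data:

import boto3
from datetime import date

s3 = boto3.client("s3")

bucket = "awsexamplebucket"   # placeholder bucket name
table = "users"               # placeholder table name
country = "US"                # natural partition key (the user's country)
object_name = "part-00000"    # placeholder object name

# Each country gets its own prefix, so each prefix scales independently
# up to the per-prefix request-rate limits.
key = f"{table}/{country}/dt={date.today().isoformat()}/{object_name}"
s3.put_object(Bucket=bucket, Key=key, Body=b"example data")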

Review your application's requirements for how partitions are added to a table, as well as the naming conventions that the application supports. For example, see Partitioning Data for more information about table partitioning in Amazon Athena. Athena uses Hive for partitioning data.

