How can I back up a DynamoDB table to Amazon S3?
Last updated: 2021-08-14
I want to back up my Amazon DynamoDB table using Amazon Simple Storage Service (Amazon S3).
DynamoDB offers two built-in backup methods:
- On-demand: Create backups when you choose.
- Point-in-time recovery: Turn on automatic and continuous backups.
Both methods are suitable for backing up your tables for disaster recovery. However, you can't use backups created with these methods for data analysis or extract, transform, and load (ETL) jobs. The DynamoDB Export to S3 feature is the easiest way to create backups that you can download locally or use with another AWS service. To customize the backup process, you can use Amazon EMR, AWS Glue, or AWS Data Pipeline.
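As a rough illustration of the two built-in methods, the following sketch uses the AWS SDK for Python (boto3). The helper names and the timestamped naming scheme are hypothetical, and the code assumes that boto3 is installed and that credentials and a default Region are configured.

```python
from datetime import datetime, timezone


def backup_name(table_name: str, now: datetime) -> str:
    """Build a timestamped backup name (hypothetical naming scheme)."""
    return f"{table_name}-{now.astimezone(timezone.utc):%Y%m%dT%H%M%SZ}"


def create_on_demand_backup(table_name: str) -> str:
    """Create an on-demand backup and return its ARN."""
    import boto3  # imported lazily so backup_name() works without boto3

    client = boto3.client("dynamodb")
    response = client.create_backup(
        TableName=table_name,
        BackupName=backup_name(table_name, datetime.now(timezone.utc)),
    )
    return response["BackupDetails"]["BackupArn"]


def enable_point_in_time_recovery(table_name: str) -> None:
    """Turn on continuous backups (point-in-time recovery) for a table."""
    import boto3

    client = boto3.client("dynamodb")
    client.update_continuous_backups(
        TableName=table_name,
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )
```

For example, calling create_on_demand_backup("Orders") would back up a table named Orders (a placeholder name) under a name that ends with the current UTC timestamp.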
DynamoDB Export to S3 feature
Using this feature, you can export data from an Amazon DynamoDB table to an Amazon S3 bucket from any point in time within your point-in-time recovery window. For more information, see Exporting DynamoDB table data to Amazon S3.
For an example of how to use the Export to S3 feature, see Export Amazon DynamoDB table data to your data lake in Amazon S3, no code writing required.
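If you prefer to start an export programmatically, a minimal sketch with boto3 might look like the following. The function names, and the table ARN, bucket, and prefix that you pass in, are placeholders. The sketch assumes that point-in-time recovery is already turned on for the table and that the caller has permission to write to the bucket.

```python
def build_export_request(table_arn, bucket, prefix, export_format="DYNAMODB_JSON"):
    """Assemble the parameters for an export request."""
    if export_format not in ("DYNAMODB_JSON", "ION"):
        raise ValueError("export_format must be DYNAMODB_JSON or ION")
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "S3Prefix": prefix,
        "ExportFormat": export_format,
    }


def start_export(table_arn, bucket, prefix, export_format="DYNAMODB_JSON"):
    """Start an export to Amazon S3 and return the export ARN."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("dynamodb")
    response = client.export_table_to_point_in_time(
        **build_export_request(table_arn, bucket, prefix, export_format)
    )
    return response["ExportDescription"]["ExportArn"]
```

The export runs asynchronously. You can pass the returned ARN to the describe_export call to check its status.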
With the Export to S3 feature, you can use your data in other ways, including the following:
- Perform ETL against the exported data on S3 and import the data back to DynamoDB
- Retain historical snapshots for auditing
- Integrate the data with other services/applications
- Build an S3 data lake from the DynamoDB data and analyze the data with various services, such as Amazon Athena, Amazon Redshift, and Amazon SageMaker
- Run ad-hoc queries on your data from Athena or Amazon EMR without affecting your DynamoDB capacity
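To work with exported data locally, you first need to decode it. The following sketch assumes a DynamoDB JSON export in which each data file is gzip-compressed and holds one {"Item": {...}} object per line. The function names are hypothetical, and binary attribute types (B and BS) are deliberately left out to keep the example short.

```python
import gzip
import json
from decimal import Decimal


def unmarshal(value):
    """Convert one DynamoDB JSON attribute value, such as {"N": "42"}, to Python."""
    (type_tag, data), = value.items()
    if type_tag == "S":
        return data
    if type_tag == "N":
        return Decimal(data)  # Decimal keeps the full numeric precision
    if type_tag == "BOOL":
        return data
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal(element) for element in data]
    if type_tag == "M":
        return {key: unmarshal(element) for key, element in data.items()}
    if type_tag == "SS":
        return set(data)
    if type_tag == "NS":
        return {Decimal(number) for number in data}
    raise ValueError(f"Unsupported attribute type in this sketch: {type_tag}")


def read_export_file(raw_bytes):
    """Yield plain Python items from the bytes of one exported data file."""
    for line in gzip.decompress(raw_bytes).splitlines():
        if line.strip():
            record = json.loads(line)
            yield {key: unmarshal(value) for key, value in record["Item"].items()}
```

You can pass the bytes of a downloaded data file to read_export_file and iterate over the resulting items.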
Keep in mind the following pros and cons when using this feature:
- Pros: This feature allows you to export data across AWS Regions and accounts without building custom applications or writing code. The exports don't affect the read capacity or the availability of your production tables.
- Cons: This feature exports the table data in DynamoDB JSON or Amazon Ion format only. These formats don't meet the data format requirements of the AWS Data Pipeline Import DynamoDB backup data from S3 feature, so you can't use the Data Pipeline templates to import the data back to a DynamoDB table. To re-import the data, you must either create a new template or use AWS Glue, Amazon EMR, or the AWS SDK.
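As one possible way to re-import data with the AWS SDK, the following sketch writes items through the batch writer of the boto3 Table resource. The function names are hypothetical. The sketch assumes that the items are already plain Python values (with numbers as Decimal) and that the target table exists.

```python
def import_items(table, items):
    """Write items through a boto3 Table-like object and return how many were sent."""
    count = 0
    # batch_writer() groups the puts into batches and resends unprocessed items.
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)
            count += 1
    return count


def import_into_table(table_name, items):
    """Look up the target table by name and import the items into it."""
    import boto3  # imported lazily so import_items() can be tried without boto3

    table = boto3.resource("dynamodb").Table(table_name)
    return import_items(table, items)
```

Writes made this way consume write capacity on the target table, so large imports might need capacity planning.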
Amazon EMR
Use Amazon EMR to export your data to an S3 bucket. You can do so using either of the following methods:
- Run Hive/Spark queries against DynamoDB tables using DynamoDBStorageHandler. For more information, see Exporting data from DynamoDB.
- Use the open-source emr-dynamodb-tool to export/import DynamoDB tables.
Keep in mind the following pros and cons when using these methods:
- Pros: If you're an active Amazon EMR user and are comfortable with Hive or Spark, then these methods offer more control than the native Export to S3 function. You can also use existing clusters for this purpose.
- Cons: These methods require you to create and maintain an EMR cluster. If you use DynamoDBStorageHandler, then you must be familiar with Hive or Spark.
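To give a sense of the Hive approach, the following sketch uses Python only to render the Hive statements that map a DynamoDB table through DynamoDBStorageHandler and write its rows to Amazon S3. The function name, table names, columns, and S3 path are all placeholders. You would still run the generated script on an EMR cluster that has Hive installed.

```python
def render_hive_export(hive_table, dynamodb_table, columns, s3_path):
    """Render a Hive script that exports a DynamoDB table to an S3 path.

    columns maps each Hive column name to (Hive type, DynamoDB attribute name).
    """
    column_defs = ", ".join(
        f"{name} {hive_type}" for name, (hive_type, _) in columns.items()
    )
    mapping = ",".join(f"{name}:{attr}" for name, (_, attr) in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        f'TBLPROPERTIES ("dynamodb.table.name" = "{dynamodb_table}",\n'
        f'  "dynamodb.column.mapping" = "{mapping}");\n'
        f"INSERT OVERWRITE DIRECTORY '{s3_path}' SELECT * FROM {hive_table};\n"
    )


script = render_hive_export(
    hive_table="orders_hive",
    dynamodb_table="Orders",
    columns={"order_id": ("string", "OrderId"), "total": ("bigint", "Total")},
    s3_path="s3://my-bucket/orders-export/",
)
print(script)
```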
AWS Glue
Use AWS Glue to copy your table to Amazon S3. For more information, see Using exports with AWS Glue.
- Pros: Because AWS Glue is serverless, you don't need to create and maintain resources. You can write directly back to DynamoDB. You can add custom ETL logic, such as filtering and converting, when exporting data. You can also choose your preferred format from CSV, JSON, Parquet, or ORC. For more information, see Format options for ETL inputs and outputs in AWS Glue.
- Cons: If you choose this option, you must know how to use Spark. You must also maintain the source code for your AWS Glue ETL job. For more information, see "connectionType": "dynamodb".
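As a starting point, an AWS Glue ETL script for this copy might resemble the following sketch. The function names and the read-percentage default are assumptions, the table name and S3 path are placeholders, and run_glue_export works only inside an AWS Glue job, where the awsglue and pyspark libraries are available.

```python
def dynamodb_read_options(table_name, read_percent=0.5):
    """Connection options for reading a DynamoDB table in an AWS Glue job."""
    return {
        "dynamodb.input.tableName": table_name,
        # Fraction of the table's read capacity that the job may consume.
        "dynamodb.throughput.read.percent": str(read_percent),
    }


def run_glue_export(table_name, s3_path, output_format="parquet"):
    """Read the table into a DynamicFrame and write it to Amazon S3."""
    # These libraries are provided by the AWS Glue job environment.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options=dynamodb_read_options(table_name),
    )
    # Custom filtering or conversion logic would go here, before the write.
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": s3_path},
        format=output_format,
    )
```

Reading the table this way consumes read capacity, which is why the sketch caps the job at a fraction of the table's throughput.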
AWS Data Pipeline
Use AWS Data Pipeline to export your table to an S3 bucket in the same account or a different account. For more information, see Import and export DynamoDB data using AWS Data Pipeline.
- Pros: Data Pipeline uses Amazon EMR to create the backup and the scripting is done for you. You don't have to learn Apache Hive or Apache Spark to accomplish this task. The cluster is created and maintained for you.
- Cons: If you use the templates provided, then creating the backups is not as customizable as it is with AWS Glue or Amazon EMR. To create customizable backups to Amazon S3, choose one of the other methods, or create your own template for Data Pipeline.
If none of these options offer the flexibility that you need, then you can use the DynamoDB API to create your own solution.
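One possible shape for a custom solution is a paginated Scan whose results are written to Amazon S3 as newline-delimited JSON. The following is a minimal sketch under several assumptions: the function names, bucket, and key are hypothetical, the table is small enough to hold in memory, and it has no binary attributes (which json.dumps can't serialize). Unlike the Export to S3 feature, a Scan consumes read capacity on the table.

```python
import json


def scan_to_json_lines(dynamodb_client, table_name):
    """Scan the whole table and return its items as newline-delimited JSON."""
    paginator = dynamodb_client.get_paginator("scan")
    lines = []
    # The paginator follows LastEvaluatedKey until the table is exhausted.
    for page in paginator.paginate(TableName=table_name):
        for item in page["Items"]:
            # The low-level client returns items in DynamoDB JSON form.
            lines.append(json.dumps({"Item": item}, sort_keys=True))
    return "\n".join(lines) + "\n" if lines else ""


def backup_table_to_s3(table_name, bucket, key):
    """Scan a table and upload the result as a single S3 object."""
    import boto3  # imported lazily so scan_to_json_lines() can be tried with a stub

    body = scan_to_json_lines(boto3.client("dynamodb"), table_name)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
```

For larger tables, a real implementation would likely stream pages to multiple objects or use a parallel scan instead of building one string in memory.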