AWS Storage Blog

Best practices for data lake protection with AWS Backup

Data lakes, powered by Amazon Simple Storage Service (Amazon S3), provide organizations with the availability, agility, and flexibility required for modern analytics approaches to gain deeper insights. Protecting sensitive or business-critical information stored in these S3 buckets is a high priority for organizations. AWS Backup for Amazon S3 makes it easier to centrally automate the backup and recovery of critical data in your Amazon S3-powered data lake.

In this post, we discuss best practices for using AWS Backup to protect data that resides in your Amazon S3 data lake. We start with a look at the data lake user personas and where AWS Backup for Amazon S3 fits alongside Amazon S3's native capabilities. Next, we discuss how to use AWS Backup to establish separation of duties and secure access to your protected data. Finally, we address the importance of auditability to make sure your critical data is being protected. The information presented in this post highlights people, cost, security, and governance considerations to help you design an AWS Backup based solution for protecting an Amazon S3 data lake.

Architecture

Figure 1 illustrates an AWS Backup architecture for protecting an Amazon S3 data lake. This architecture follows the patterns detailed in Data Protection Reference Architectures with AWS Backup for creating immutable backups with AWS Backup Vault Lock. The data lake specific configurations detailed in the sections that follow are applied to this architecture.


Figure 1: AWS Backup architecture for Amazon S3 data lake

Data lake personas and AWS Backup

The typical data lake environment can involve multiple user personas, such as data lake administrators, data engineers, storage administrators, and backup administrators. Configuration and management of the S3 resources required by the data lake is a responsibility of the storage administrator role. In some organizations, the storage administrator may be a standalone role, while in others these responsibilities fall upon the application owners or other groups. Storage administrators can configure Amazon S3 features to help protect data residing in buckets, such as:

  • S3 Versioning
  • S3 Object Lock
  • S3 Replication
  • Bucket policies and default encryption

The sets of features required for your protection plan must be configured for each S3 bucket that contains critical data. Some features, such as S3 Object Lock, must be accounted for at bucket creation and cannot be applied after the fact.
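
As an illustration, the following boto3 sketch (the bucket name and Region are hypothetical) creates a bucket with S3 Object Lock enabled at creation time, since it cannot be enabled afterward:

import boto3

s3 = boto3.client("s3")

# S3 Object Lock must be enabled when the bucket is created;
# doing so also enables S3 Versioning on the bucket automatically.
s3.create_bucket(
    Bucket="example-raw-data-bucket",  # hypothetical bucket name
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    ObjectLockEnabledForBucket=True,
)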

The backup administrator role is responsible for backups, restores, and the related compliance of your applications across multiple AWS services. Organizations can address separation-of-duties requirements related to the protection of critical data by defining a backup administrator role separate from the core data lake team. Using AWS Backup, backup administrators can create backup plans that define the frequency and retention of backups for AWS resources in a single AWS account or across multiple accounts using the AWS Organizations integration. The data is stored and organized in backup vaults, which have their own access policies, encryption keys, and locking policies separate from what was defined at the object or bucket level in Amazon S3. Using AWS IAM Identity Center, you can map your workforce identities to properly scoped user permissions for the different data lake personas. These AWS Backup features, combined with properly scoped IAM Identity Center roles, help prevent malicious or accidental changes to or deletions of your critical data.

Identify the critical data for backup

Data lakes typically contain data from multiple sources and across various stages of development. For example, raw data from a source system may initially land in one bucket, undergo transformation that lands in a curated bucket, and become part of aggregated data resulting from analytics processing in yet another bucket. Although all of this data may bring value to the business, it may not be critical to protect all of it over the long term. As transformed and aggregated data can be reproduced from the raw data, backing up the buckets containing raw data is often the best option, especially for categories of financial transactions that may be subject to long-term retention. However, for raw data containing personally identifiable information (PII), the preference may be to back up the de-identified data instead. The process of identifying the critical data for backup should account for the business impact of recovery times along with the costs to store the protected data.

Once identified, the S3 buckets housing the critical data can be tagged with a key/value pair schema that indicates the required data protection level. For example, a tag Key of “DataProtectionLevel” and Value of “Critical” would be applied only to the S3 buckets requiring backup. When assigning resources to a backup plan, administrators can then select only the required S3 buckets by using these tag values. The backup frequency and retention specified when creating a backup plan determine when the jobs run and how long the data is stored; these parameters should be driven by the compliance requirements of your organization.
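
As a minimal sketch (the bucket name is hypothetical), applying this tag with boto3 might look like the following; note that put_bucket_tagging replaces the bucket's entire tag set, so include any existing tags as well:

import boto3

s3 = boto3.client("s3")

# Tag the bucket so backup plans can select it by tag value.
s3.put_bucket_tagging(
    Bucket="example-raw-data-bucket",  # hypothetical bucket name
    Tagging={
        "TagSet": [
            {"Key": "DataProtectionLevel", "Value": "Critical"},
        ]
    },
)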

When using AWS Backup for Amazon S3, backup administrators can define backup plans that create continuous and/or periodic backups. Continuous backups are useful for point-in-time restores (PITR). S3 buckets in a data lake can hold thousands of objects, complicating the task of recovering a bucket to a specific point in time using the native Amazon S3 versioning feature alone. However, using continuous backups, AWS Backup lets you restore your Amazon S3 data to a specific date and time within the maximum retention period of up to 35 days. If you require longer-term retention to support internal or regulatory compliance, then periodic backups can be added to your AWS Backup plan rules. The periodic backup rule defines the frequency and retention required for your compliance requirements using the same destination vault as your PITR rule. To further isolate and protect your data, you can use the copy-to-destination capabilities to copy snapshots to a destination vault that can reside in a different AWS account and/or Region.


Figure 2: Defining a cross-account copy destination
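
To make this concrete, the following boto3 sketch (vault names, account IDs, ARNs, and the IAM role are placeholders) defines a plan with a continuous PITR rule, a periodic rule that copies to a vault in another account, and a tag-based resource assignment:

import boto3

backup = boto3.client("backup")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "datalake-s3-protection",  # hypothetical name
        "Rules": [
            {
                # Continuous backups enable point-in-time restore (PITR)
                # for up to 35 days.
                "RuleName": "s3-continuous-pitr",
                "TargetBackupVaultName": "datalake-vault",
                "EnableContinuousBackup": True,
                "Lifecycle": {"DeleteAfterDays": 35},
            },
            {
                # Periodic daily backups retained for longer-term compliance,
                # copied to a vault in a separate account and/or Region.
                "RuleName": "s3-daily-periodic",
                "TargetBackupVaultName": "datalake-vault",
                "ScheduleExpression": "cron(0 5 ? * * *)",  # daily at 05:00 UTC
                "Lifecycle": {"DeleteAfterDays": 365},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": (
                            "arn:aws:backup:us-east-1:111122223333:"
                            "backup-vault:central-copy-vault"  # placeholder ARN
                        ),
                        "Lifecycle": {"DeleteAfterDays": 365},
                    }
                ],
            },
        ],
    }
)

# Assign resources by tag so only the critical buckets are backed up.
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "critical-s3-buckets",
        "IamRoleArn": "arn:aws:iam::111122223333:role/BackupServiceRole",  # placeholder
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "DataProtectionLevel",
                "ConditionValue": "Critical",
            }
        ],
    },
)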

Cost should be considered when defining your data lake protection strategy. As noted earlier, not all data in your data lake has the same level of criticality to your business. AWS Backup rules and resource assignments should be created with a level of specificity that minimizes your data storage and transfer costs. Similarly, configure S3 Lifecycle expiration rules on the versioning-enabled buckets being backed up so that noncurrent object versions do not accumulate indefinitely and inflate storage costs.
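
For example, a lifecycle rule like the following sketch (the retention value is illustrative and should reflect your own recovery requirements) expires noncurrent object versions on a backed-up, versioning-enabled bucket:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-raw-data-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {},  # apply to the whole bucket
                # Expire noncurrent versions after 35 days; choose a value
                # consistent with your recovery requirements.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 35},
            }
        ]
    },
)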

Secure access to protected data

Data protected by AWS Backup is stored and organized using backup vaults. AWS Backup vaults provide three key features to secure access to and prevent deletion of the recovery points generated by the service: access policies, encryption keys, and AWS Backup Vault Lock.

Setting access policies on backup vaults lets you define who can access the vaulted data and who can make changes to the vault configuration. These policies further enforce the separation of duties between the backup administrator and other personas in the data lake account. A policy such as the following one allows only the backup administrator role to make updates or deletes to the backup vault.

{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Deny",
         "Principal": "*",
         "Action": [
            "backup:DeleteBackupVault",
            "backup:DeleteBackupVaultAccessPolicy",
            "backup:DeleteBackupVaultLockConfiguration",
            "backup:DeleteBackupVaultNotifications",
            "backup:DeleteRecoveryPoint",
            "backup:PutBackupVaultAccessPolicy",
            "backup:PutBackupVaultLockConfiguration",
            "backup:PutBackupVaultNotifications",
            "backup:UpdateRecoveryPointLifecycle"
         ],
         "Resource": "*",
         "Condition": {
            "ArnNotEquals": {
               "aws:PrincipalArn": [
                  "arn:aws:iam::{AccountId}:role/{Backup_Administrator_Role}"
               ]
            }
         }
      }
   ]
}
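
Such a policy can be attached with the PutBackupVaultAccessPolicy API; for example, a brief boto3 sketch (the vault name and policy file are placeholders):

import boto3

backup = boto3.client("backup")

# vault-policy.json holds the JSON policy document shown above.
with open("vault-policy.json") as f:
    vault_policy = f.read()

backup.put_backup_vault_access_policy(
    BackupVaultName="datalake-vault",  # placeholder vault name
    Policy=vault_policy,
)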

Some services, such as Amazon S3, support independent encryption of backups with AWS Backup. This means that Amazon S3 backups are encrypted using an AWS Key Management Service (AWS KMS) key associated with the backup vault. The AWS KMS key can either be a customer managed key or the AWS managed key associated with the AWS Backup service. This feature benefits the data lake protection use case in two ways. First, AWS Backup encrypts all backups even if the source S3 buckets are not encrypted. Second, access to and management of this AWS KMS key can be locked down to only backup-related IAM principals, providing additional protection from unintended deletion of the protected data.
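
A brief sketch of creating a vault encrypted with a customer managed key (the vault name and key ARN are placeholders):

import boto3

backup = boto3.client("backup")

# Backups stored in this vault are encrypted with the specified
# customer managed KMS key rather than the AWS managed key.
backup.create_backup_vault(
    BackupVaultName="datalake-vault",  # placeholder vault name
    EncryptionKeyArn=(
        "arn:aws:kms:us-west-2:111122223333:"
        "key/1234abcd-12ab-34cd-56ef-1234567890ab"  # placeholder key ARN
    ),
)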

Finally, AWS Backup Vault Lock adds another layer of defense that protects the recovery points in your backup vaults from inadvertent or malicious delete operations, or from updates that shorten their retention period. Setting this feature to Compliance mode lets you enforce a minimum and/or maximum retention period and prevents privileged users (even the AWS account root user) from deleting recovery points in the backup vault early.
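
As a minimal sketch (retention values are illustrative), locking a vault with boto3 might look like the following; note that once the grace period elapses, a compliance-mode lock cannot be removed, so test carefully:

import boto3

backup = boto3.client("backup")

backup.put_backup_vault_lock_configuration(
    BackupVaultName="datalake-vault",  # placeholder vault name
    MinRetentionDays=365,   # recovery points cannot be deleted earlier
    MaxRetentionDays=3650,  # recovery points cannot be retained longer
    # After this grace period the lock becomes immutable (Compliance mode);
    # omitting ChangeableForDays keeps the lock in governance mode.
    ChangeableForDays=3,
)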

Enable auditability

Ensuring the ongoing operational integrity of your data protection solution is a key component of your strategy. AWS Backup Audit Manager lets you audit and report on the compliance of your data protection policies to help meet business and regulatory compliance requirements. To monitor the backup activity of the Amazon S3 resources you identified earlier, you can create a custom framework in AWS Backup Audit Manager with a targeted set of controls configured. A full list of available controls and their configurable parameters and scope can be found in the AWS Backup Audit Manager controls and remediation guide. For a data lake-only backup environment using periodic backup rules, an AWS Backup Audit Manager framework with the following set of controls is recommended for the AWS account and Region where your data resides:

  • Backup resources are protected by a backup plan
  • Backup plan minimum frequency and minimum retention
  • Vaults prevent manual deletion of recovery points
  • Recovery points are encrypted
  • Minimum retention established for recovery point
  • Backups are protected by AWS Backup Vault Lock
  • Last recovery point was created

Note that if your backup plans call for the use of PITR for your Amazon S3 resources, then you should not include Last recovery point was created in your AWS Backup Audit Manager framework. This control does not evaluate properly, since PITR backups utilize a single recovery point rather than creating a new one for each new backup job.
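
Assuming the programmatic control names match those listed in the AWS Backup Audit Manager controls and remediation guide (verify them before use), a boto3 sketch of such a framework might look like:

import boto3

backup = boto3.client("backup")

# Control names follow the AWS Backup Audit Manager convention;
# confirm each against the controls and remediation guide.
backup.create_framework(
    FrameworkName="datalake_backup_framework",  # hypothetical name
    FrameworkControls=[
        {"ControlName": "BACKUP_RESOURCES_PROTECTED_BY_BACKUP_PLAN"},
        {
            "ControlName": "BACKUP_PLAN_MIN_FREQUENCY_AND_MIN_RETENTION_CHECK",
            # Parameter names mirror the underlying AWS Config managed
            # rule; the values here are illustrative.
            "ControlInputParameters": [
                {"ParameterName": "requiredFrequencyUnit", "ParameterValue": "days"},
                {"ParameterName": "requiredFrequencyValue", "ParameterValue": "1"},
                {"ParameterName": "requiredRetentionDays", "ParameterValue": "365"},
            ],
        },
        {"ControlName": "BACKUP_RECOVERY_POINT_MANUAL_DELETION_DISABLED"},
        {"ControlName": "BACKUP_RECOVERY_POINT_ENCRYPTED"},
        {"ControlName": "BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK"},
        {"ControlName": "BACKUP_RESOURCES_PROTECTED_BY_BACKUP_VAULT_LOCK"},
        # Include the last-recovery-point control only if you are not
        # relying on PITR, per the note above.
        {"ControlName": "BACKUP_LAST_RECOVERY_POINT_CREATED"},
    ],
)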

If additional copies are required per your organization’s compliance standards, then add one or both of the following controls:

  • Cross-account backup copy scheduled
  • Cross-Region backup copy scheduled

Furthermore, an AWS Backup Audit Manager framework should be created in the AWS account and Region where the destination vault for the copies resides. This framework should include the following controls:

  • Vaults prevent manual deletion of recovery points
  • Recovery points are encrypted
  • Minimum retention established for recovery point
  • Backups are protected by AWS Backup Vault Lock

As mentioned above, each control can have associated parameters and scope that define which resources to monitor and how they should be evaluated. Let’s look at a few examples of how to configure the controls above for the data lake use case.

Resources are protected by a backup plan 

This control evaluates whether a defined set of resources is protected by a backup plan. In the case of the data lake environment, you want to make sure that the S3 buckets identified as holding your critical data are evaluated. Configure this control by selecting Tagged resources as the resources to evaluate, providing the tag Key and Value for your protected S3 bucket(s), and specifying S3 as the Resource type.


Figure 3: Resources are protected by a backup plan control
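
Programmatically, the equivalent framework control entry could be sketched as follows (the tag schema is the one assumed earlier; verify the control name against the documentation):

# One entry for the FrameworkControls list sketched earlier, scoped
# to S3 buckets that carry the data protection tag.
protected_by_plan_control = {
    "ControlName": "BACKUP_RESOURCES_PROTECTED_BY_BACKUP_PLAN",
    "ControlScope": {
        "ComplianceResourceTypes": ["S3"],
        "Tags": {"DataProtectionLevel": "Critical"},
    },
}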

Recovery points are encrypted

This control evaluates whether any recovery points are unencrypted. By default, the control applies to all recovery points in the AWS account and Region where the framework is configured. However, by tagging recovery points as part of the AWS Backup plan definition, you can narrow the scope of this control to evaluate only recovery points containing the tag Key and Value that you specify.


Figure 4: Recovery points are encrypted control
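
For example, a backup rule can stamp tags onto the recovery points it creates, and the control can then be narrowed to those tags; a sketch using the tag schema assumed earlier:

# In the backup plan rule, tag every recovery point it creates...
periodic_rule = {
    "RuleName": "s3-daily-periodic",
    "TargetBackupVaultName": "datalake-vault",  # placeholder vault name
    "ScheduleExpression": "cron(0 5 ? * * *)",
    "RecoveryPointTags": {"DataProtectionLevel": "Critical"},
}

# ...then scope the encryption control to those tagged recovery points.
encrypted_control = {
    "ControlName": "BACKUP_RECOVERY_POINT_ENCRYPTED",
    "ControlScope": {"Tags": {"DataProtectionLevel": "Critical"}},
}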

Last recovery point created

This control checks that at least one recovery point was created within a window that you specify. The control helps make sure that your backup plans are both configured and operating in accordance with your compliance requirements. It accepts a parameter for the expected backup frequency, expressed in hours or days. Once again, scope this control to the S3 buckets tagged for backup.


Figure 5: Last recovery point created control
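
Matching the console configuration shown in Figure 5, a sketch of this control entry (parameter names mirror the underlying AWS Config managed rule and should be verified) could be:

last_recovery_point_control = {
    "ControlName": "BACKUP_LAST_RECOVERY_POINT_CREATED",
    # Expect at least one recovery point created every 1 day.
    "ControlInputParameters": [
        {"ParameterName": "recoveryPointAgeUnit", "ParameterValue": "days"},
        {"ParameterName": "recoveryPointAgeValue", "ParameterValue": "1"},
    ],
    "ControlScope": {
        "ComplianceResourceTypes": ["S3"],
        "Tags": {"DataProtectionLevel": "Critical"},
    },
}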

Cleaning up

To avoid accruing costs associated with evaluating this architecture, you must delete the AWS Backup and AWS Backup Audit Manager resources manually. The steps to clean up resources can be found in the AWS Backup Developer Guide. If a vault lock was applied, you must remove the vault lock before proceeding with cleanup of other resources.
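
Assuming the resources were created as sketched above, cleanup might look like the following (recovery points must be deleted before their vault):

import boto3

backup = boto3.client("backup")

# Delete recovery points first; a vault must be empty before deletion.
points = backup.list_recovery_points_by_backup_vault(
    BackupVaultName="datalake-vault"  # placeholder vault name
)
for rp in points["RecoveryPoints"]:
    backup.delete_recovery_point(
        BackupVaultName="datalake-vault",
        RecoveryPointArn=rp["RecoveryPointArn"],
    )

backup.delete_backup_vault(BackupVaultName="datalake-vault")
backup.delete_framework(FrameworkName="datalake_backup_framework")

# Backup selections and plans also need deletion, for example:
#   backup.delete_backup_selection(BackupPlanId=..., SelectionId=...)
#   backup.delete_backup_plan(BackupPlanId=...)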

Conclusion

In this post, I shared key considerations for designing a data protection strategy for Amazon S3 based data lakes using AWS Backup. Organizations can utilize the policy-based plans, fine-grained access controls, and audit capability of AWS Backup to centralize management and protection of critical data stored in Amazon S3. By following the best practices outlined in this post, you can design a secure, cost-optimized, and operationally efficient solution for protecting an Amazon S3 data lake with AWS Backup.

To learn more about data protection, visit AWS Backup technical documentation. If you have feedback about this blog post, submit comments in the comments section. You can also post a new question on AWS Backup in re:Post to get answers from the community. Thank you for reading this post.