AWS Database Blog

Archival solutions for Oracle database workloads in AWS: Part 1

This is a two-part series. In this post, we explain three archival solutions that allow you to archive Oracle data into Amazon Simple Storage Service (Amazon S3). In Part 2 of this series, we explain three archival solutions using native Oracle products and utilities. All of these options allow you to join current Oracle data with archived data.

Year over year, data volumes are growing exponentially, and many customers are using Oracle databases as their system of record. Over time, these databases can store decades of data. As a result, customers need a solution to archive historical data.

Numerous customers require archiving solutions to meet regulatory compliance laws and regulations. These include, but are not limited to, Payment Card Industry Data Security Standard (PCI DSS), Health Insurance Portability and Accountability Act (HIPAA), Federal Information Security Management Act (FISMA), Sarbanes-Oxley Act (SOX), EU’s General Data Protection Regulation (GDPR), Financial Industry Regulatory Authority (FINRA), and The Freedom of Information Act (FOIA). Customers must have a solution that incorporates secure storage, fast retrieval, and full lifecycle management for important data.

Another goal of archiving data is to improve application performance. This is possible whether data is archived within the Oracle database or outside of it. If data is archived outside of the Oracle database, another benefit is the potential to lower costs (compute, storage, backup, and Oracle software licensing).

All archival solutions fall into at least one of the following categories:

  • Retire – In this strategy, you sunset or decommission the Oracle database. Data is migrated to lower-cost storage. You can uninstall or remove the Oracle database after migrating the data.
  • Reduce – The archive data is migrated to lower-cost storage. The current data remains in the Oracle database. Users are able to aggregate data (current data and archived data) using an AWS service or standard SQL tool.
  • Reorganize – Through logical or physical approaches, data continues to be managed in the Oracle database. Users are able to aggregate data (current data and archived data) using a standard SQL tool.

The following table summarizes the solutions we discuss, the part of the series in which each is covered, and the category each solution falls under.

| Solution | Series Part | Retire | Reduce | Reorganize |
| --- | --- | --- | --- | --- |
| Simple Database Archival Solution (SDAS) | One | Yes | No | No |
| AWS Glue (workflows and/or blueprints) | One | Yes | Yes | No |
| AWS Database Migration Service (AWS DMS) | One | Yes | Yes | No |
| Oracle Data Pump Access Driver | Two | No | Yes | No |
| Oracle In-Database Archiving (IDA) | Two | No | No | Yes |
| Oracle Partitioning | Two | No | No | Yes |

Archival solution options

You can implement an Oracle database archival solution in AWS. You can configure these archival solutions with Amazon Relational Database Service (Amazon RDS) for Oracle, Amazon RDS Custom for Oracle, or Oracle on Amazon Elastic Compute Cloud (Amazon EC2). All solutions in this post were tested using Amazon RDS for Oracle Enterprise Edition version 19c. With the exception of Oracle partitioning, you can implement these archival solutions with Oracle Database Standard Edition 2. The archival solutions will run on Oracle database versions with available patching and premier support. This includes database versions 19c and 21c.

You should select a solution based on your existing skill sets, number of database objects to archive, volume of data to archive, Oracle data types used, network configuration, application or front end requirements, and complexity to set up and maintain archiving.

You can archive data with Simple Database Archival Solution (SDAS) or with AWS Glue using workflows and/or blueprints. Additionally, you can create a solution using multiple native AWS services. These native AWS services include, but are not limited to, AWS Database Migration Service (AWS DMS), Amazon S3, Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), Amazon Athena, and Amazon Kinesis.

Additionally, you can combine native AWS services and Oracle solutions to archive Oracle data. An example is using Oracle GoldenGate for Big Data with Amazon S3, which enables real-time Parquet ingestion into Amazon S3 from Oracle databases. Oracle GoldenGate for Big Data can also ingest data in real time into Amazon Kinesis from Oracle databases using the Kinesis Streams Handler. Oracle GoldenGate is a logical replication offering and is not covered in this post.

If the target for the archive data is Amazon S3, many of the archiving solutions use AWS DMS or AWS Glue. For use cases that require a database migration from on-premises sources to AWS or database replication between on-premises sources and sources in AWS, the recommendation is to use AWS DMS. After the data is in AWS, it is recommended that you use AWS Glue to move and transform data.

If the target for the archive data is Amazon S3, you should create a gateway endpoint for Amazon S3 to access Amazon S3 from your Amazon Virtual Private Cloud (Amazon VPC). You will not need to set up an internet gateway or NAT device for the Amazon VPC. However, gateway endpoints do not allow access from on-premises networks, from peered Amazon VPCs in other Regions, or through AWS Transit Gateway.
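
As an illustration, you can create the gateway endpoint with a few lines of Python using boto3. This is a minimal sketch; the VPC ID, route table ID, and Region in the service name are placeholders you would replace with your own values.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a gateway endpoint so traffic to Amazon S3 stays on the AWS network
# (the VPC ID and route table ID below are placeholders)
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```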

Each archival solution has an overview section that provides more information. The solution should be determined by your requirements. The following table outlines three archival solutions for Oracle databases. The column definitions are as follows:

  • Solution – Name of the solution (AWS service or Oracle product to implement Oracle data archiving).
  • Application interface – If an interface (web-based, Java, and more) is provided with the solution, it will be noted in this column.
  • Category – Denotes which approach (retire, reduce, or reorganize) is used.
  • Archive data location – This column specifically lists where the archive data resides.
  • Logical or physical – This column explains if the archive data is logically or physically separated from the source data.
  • Record format – This column identifies the archival record format.
  • Aggregation tool – This is the tool or service that allows you to join current data with archival data.
  • CDC allowed – This column identifies whether change data capture (CDC) is permissible, meaning data can continuously flow from the Oracle database to the archival location in near-real time.
  • Comment – More information on the solution. This column may include limitations or restrictions.
| Solution | Application interface | Category | Archive data location | Logical or physical | Record format | Aggregation tool | CDC allowed | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SDAS | Yes | Retire | Amazon S3 | Physical | Parquet | Amazon Athena | No | Archives the entire Oracle schema |
| AWS Glue (workflows and/or blueprints) | Yes | Retire or reduce | Amazon S3 | Physical | Parquet, ORC, JSON, XML, CSV, Avro | Amazon Athena | No | Flexible to include specific Oracle tables and can use data filtering to restrict rows |
| AWS DMS | Yes | Retire or reduce | Amazon S3 | Physical | CSV or Parquet | Amazon Athena | Yes | Can include specific tables and provides data filtering to restrict rows. Data type transformations cannot be performed. Refer to valid Parquet data types for Amazon S3 targets. |

Archival solution sample use cases

The following tables provide an archival strategy based on multiple factors. Remember, many of the solutions are capable of handling multiple requirements. For example, AWS Glue offers retire and reduce functionality. However, the following AWS Glue use case only illustrates reduce functionality.

| Requirement | Description |
| --- | --- |
| Objective | Migrate an entire database by schema to Amazon S3 for read-only access. This is a one-time load. A fast and simple approach is required. |
| Solution | Simple Database Archival Solution (SDAS) |
| Category | Retire |
| Source | Amazon RDS for Oracle, Amazon RDS Custom for Oracle, Oracle database on Amazon EC2 |
| Target | Parquet format in Amazon S3 |
| Source database future state | Oracle database will be decommissioned |
| Reporting / analytics | Must be able to query archived data in Amazon S3 via reports or dashboards |
| Comments | Data is organized per Oracle schema; data cannot be modified after it’s written to Amazon S3 |

| Requirement | Description |
| --- | --- |
| Objective | Migrate a subset of Oracle tables in a database schema to Amazon S3 for read-only access. This is a one-time load. Additionally, periodically capture changes from the previous load (via microbatches). Orchestration, scheduling, and scalability are important. |
| Solution | AWS Glue |
| Category | Reduce |
| Source | Amazon RDS for Oracle, Amazon RDS Custom for Oracle, Oracle database on Amazon EC2 |
| Target | Parquet format in Amazon S3 (also supports other formats) |
| Source database future state | Same as source |
| Reporting / analytics | Must be able to query current data and archived data via reports or dashboards |
| Comments | Data cannot be modified after it’s written to Amazon S3 |

| Requirement | Description |
| --- | --- |
| Objective | Migrate a subset of Oracle tables in a database schema to Amazon S3 for read-only access. This is a one-time load. Additionally, perform CDC (near real-time updates) to keep the data in sync. |
| Solution | AWS DMS |
| Category | Reduce |
| Source | Amazon RDS for Oracle, Amazon RDS Custom for Oracle, Oracle database on Amazon EC2 |
| Target | Parquet format in Amazon S3 (also supports CSV format) |
| Source database future state | Same as source database |
| Reporting / analytics | Must be able to query current data and archived data via reports or dashboards |
| Comments | There are limitations on using Oracle as a source with AWS DMS; refer to Using an Oracle database as a source for AWS DMS for more details |

Visualization and consumption

When you use Amazon S3 for archival data and use an Oracle database as the source data, a dashboard may be required for consolidated reporting.

You can aggregate data in Amazon S3 and an Oracle database by using Amazon Athena. Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Amazon Athena provides a simplified, flexible way to analyze petabytes of data using standard SQL.

For these use cases, Amazon Athena Federated Query allows you to seamlessly access the data in the Oracle database without having to move the data into Amazon S3. This allows you to create views that present historical data alongside current data.
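
The following sketch shows what such a federated join could look like when submitted through boto3. The catalog, database, table, and bucket names are hypothetical; the Oracle connector must already be deployed and registered as an Athena data source.

```python
import boto3

athena = boto3.client("athena")

# Join current data in Oracle (via the federated connector) with archived data in Amazon S3
# (catalog, schema, table, and bucket names are hypothetical)
query = """
SELECT c.customer_id, c.customer_name, o.order_total
FROM "oracle_federated"."sales"."customers" c
JOIN "awsdatacatalog"."archive_db"."orders" o
  ON c.customer_id = o.customer_id
"""

athena.start_query_execution(
    QueryString=query,
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```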

You can use data visualization tools such as Amazon QuickSight to connect to Amazon Athena and visualize the views. Amazon QuickSight is a cloud-native business intelligence (BI) service that permits visually analyzing data and sharing interactive dashboards with multiple users. The combination of Amazon QuickSight and Amazon Athena allows you to rapidly deploy dashboards to tens of thousands of users, while only paying for actual usage, and not needing to worry about server deployment or management.

The following diagram shows the solution architecture.

Athena Federated Query architecture

For more information on Amazon Athena Federated Query, refer to Query your Oracle database using Athena Federated Query and join with data in your Amazon S3 data lake.

If a retire approach is used and all of the data is stored in Amazon S3, you can use Amazon S3 Select or Amazon Redshift Spectrum to run SQL queries directly on the archived data.
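
For a quick look at a single archived object, the following sketch uses S3 Select through boto3 to run a SQL expression against one Parquet file; the bucket, key, and column names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Query a single archived Parquet object in place (bucket, key, and columns are hypothetical)
response = s3.select_object_content(
    Bucket="my-archive-bucket",
    Key="SALES/ORDERS/part-00000.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.order_total FROM s3object s WHERE s.order_total > 1000",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; print the matching records
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```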

Remove archival data from the source database

If the archival data is stored outside of the Oracle database, it is your responsibility to implement a process to remove the data from the database after the archival data has been validated. Although this post does not go in-depth on the ways to remove archival data from the source database or tables, here are a few high-level processes that you can perform:

  • If a schema was retired, you can drop the schema.
  • You can drop or truncate tables and partitions that hold archived data.
  • You can submit a DELETE statement against an Oracle table or partition (see the sketch below).

Your use case should determine which approach should be used. These processes should be thoroughly tested prior to implementing in the production environment.
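
As one illustration of the DELETE approach, the following sketch removes archived rows in batches using the python-oracledb driver. The connection details, table, column, and retention window are hypothetical, and you would validate the archive before running anything like this.

```python
import oracledb

# Hypothetical connection details, table, and retention window
conn = oracledb.connect(user="app_user", password="***", dsn="mydb.example.com/ORCLPDB1")
cur = conn.cursor()

# Delete rows already archived to Amazon S3 in small batches to limit undo/redo pressure
while True:
    cur.execute(
        "DELETE FROM sales.orders "
        "WHERE order_date < ADD_MONTHS(SYSDATE, -60) AND ROWNUM <= :batch_size",
        batch_size=10000,
    )
    deleted = cur.rowcount
    conn.commit()
    if deleted == 0:
        break

cur.close()
conn.close()
```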

Data modeling

The fundamental objective of data modeling is to only expose data that holds value for the end-user. Denormalization is the process of adding precomputed redundant data to an otherwise normalized relational database to improve read performance of the database for analytics and reporting. Although data modeling itself is highly technical, it nonetheless relies on input from non-technical personnel to properly define business objectives.

All solutions in this post archive data using the current data structure. If denormalization or summarization of data is required prior to archiving, this task should be performed and then the appropriate archival solution should be implemented.

Governance and compliance

Because these archival solutions use Amazon S3 as a target, a high-level overview for governance and compliance is included in this post. By default, Amazon S3 buckets block public access. Additionally, by default, Amazon S3 buckets have encryption enabled with server-side encryption with Amazon S3 managed keys (SSE-S3). When an Amazon S3 bucket is created, you can configure it to store objects using a write-once-read-many (WORM) model to help you prevent objects from being deleted or overwritten for a fixed amount of time or indefinitely. This feature automatically enables Amazon S3 bucket versioning.
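
The following sketch shows one way to enable Object Lock (WORM) with a default retention period when creating the archive bucket with boto3; the bucket name and retention period are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Object Lock must be enabled at bucket creation; this also turns on versioning
# (bucket name and retention period are placeholders; add CreateBucketConfiguration
# with a LocationConstraint if you are not in us-east-1)
s3.create_bucket(Bucket="my-archive-bucket", ObjectLockEnabledForBucket=True)

# Apply a default retention rule so archived objects cannot be overwritten or deleted
s3.put_object_lock_configuration(
    Bucket="my-archive-bucket",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
    },
)
```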

You may need more in-depth governance and compliance. Regulators also require that organizations secure sensitive data. Compliance involves creating and applying data access, protection, and compliance policies. For example, you can restrict access to personally identifiable information (PII) at the table or column level, encrypt the data, and keep audit logs of who is accessing the data.

Traditionally, you can secure data using access control lists on Amazon S3 buckets. You can create and maintain data access, protection, and compliance policies for each AWS analytics service requiring access to the data. For example, if you are running analysis against archival data in Amazon S3 using Amazon Redshift and Amazon Athena, you must set up access control rules for each of these services.

AWS Lake Formation provides a single place to manage access controls for data in the data lake. You can define security policies that restrict access to data at the database, table, column, row, and cell levels. These policies apply to AWS Identity and Access Management (IAM) users and roles, and to users and groups when federating through an external identity provider. The AWS Lake Formation permissions model is implemented as DBMS-style GRANT and REVOKE commands, which are more accessible than traditional IAM policies. You can use fine-grained controls to access data secured by AWS Lake Formation within Amazon Redshift Spectrum, Amazon Athena, AWS Glue ETL (extract, transform, and load), and Amazon EMR for Apache Spark. AWS Lake Formation also provides comprehensive audit logs with AWS CloudTrail to monitor access and show compliance with centrally defined policies. This lets you see which users or roles have attempted to access what data, with which services, and when. You can access audit logs the same way you access other CloudTrail logs using the CloudTrail APIs and AWS console.
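
For example, a column-level grant could look like the following boto3 sketch, which gives an analyst role SELECT access to an archived table while excluding a sensitive column; the role ARN, database, table, and column names are hypothetical.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on the archived table, but hide a sensitive column
# (role ARN, database, table, and column names are hypothetical)
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "archive_db",
            "Name": "orders",
            "ColumnWildcard": {"ExcludedColumnNames": ["customer_ssn"]},
        }
    },
    Permissions=["SELECT"],
)
```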

If AWS DMS is used to archive data to Amazon S3, you can use AWS Secrets Manager, which allows you to manage, retrieve, and rotate AWS DMS endpoint credentials. For more information on this topic, refer to Manage your AWS DMS endpoint credentials with AWS Secrets Manager. AWS DMS also encrypts data at rest. With Oracle sources, you can also use SSL with AWS DMS. Third-party auditors assess the security and compliance of AWS DMS as part of multiple AWS compliance programs.
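
The following sketch illustrates how an AWS DMS Oracle source endpoint could reference a Secrets Manager secret instead of embedding credentials; the secret ID, role ARN, and database name are placeholders.

```python
import boto3

dms = boto3.client("dms")

# Oracle source endpoint that pulls credentials from AWS Secrets Manager
# (secret ID, role ARN, and database name are placeholders)
dms.create_endpoint(
    EndpointIdentifier="oracle-archive-source",
    EndpointType="source",
    EngineName="oracle",
    DatabaseName="ORCL",
    OracleSettings={
        "SecretsManagerAccessRoleArn": "arn:aws:iam::123456789012:role/dms-secrets-access",
        "SecretsManagerSecretId": "prod/oracle/archive-source",
    },
)
```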

Data quality and data redaction

This post does not cover data quality and data redaction techniques in detail. There are many data quality and data redaction products in the industry. However, because these archival solutions store data in Amazon S3, information on native AWS services is listed in this post.

When you archive historical data in Amazon S3 data lakes, the data should adhere to standards. You must identify missing, stale, or bad data before it impacts the business. You need to be able to create, implement, and enforce data quality rules.

Setting up data quality checks can be manual, time consuming, and error prone. It can take days for data engineers to identify and implement data quality rules. Someone must write code to monitor data pipelines, visualize quality scores, and alert when anomalies occur. Some customers adopt commercial data quality solutions; however, these solutions may require time-consuming infrastructure management and can be expensive. You need a simple, cost-effective, and automatic way to manage data quality.

AWS Glue Data Quality reduces these manual quality efforts from days to hours. AWS Glue Data Quality automatically computes statistics, recommends quality rules, monitors, and alerts when it detects that quality has deteriorated. You can apply AWS Glue Data Quality rules to data at rest in datasets and data lakes, and to entire data pipelines where data is in motion. For data pipelines built on AWS Glue Studio, you can apply a transform to evaluate the quality for the entire pipeline. Rules can also be defined to stop the pipeline if quality deteriorates, preventing bad data from landing in the Amazon S3 data lake. AWS Glue Data Quality uses Deequ, an open-source framework built by Amazon, to manage petabyte-scale datasets. Because it is built using open source, AWS Glue Data Quality provides flexibility and portability without lock-in. For more details, refer to the following five-part AWS blog series on AWS Glue Data Quality.
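
As a small illustration, the following boto3 sketch registers a DQDL ruleset against an archived table in the Data Catalog; the rule thresholds, database, and table names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Register a Data Quality ruleset (written in DQDL) against an archived table
# (database, table, and rule thresholds are hypothetical)
glue.create_data_quality_ruleset(
    Name="orders-archive-quality",
    Ruleset='Rules = [ RowCount > 0, IsComplete "order_id", Uniqueness "order_id" > 0.99 ]',
    TargetTable={"DatabaseName": "archive_db", "TableName": "orders"},
)
```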

When a data lake is being designed to host archival data, one of the most important aspects to consider is data privacy. Without it, sensitive information could be accessed by an unauthorized individual, which can affect the reliability of the data service. However, identifying sensitive data inside a data lake could represent a challenge due to the diversity of the data and the volume of data.

In 2022, AWS Glue announced the sensitive data detection and processing feature to help identify and protect sensitive information in a straightforward way using AWS Glue Studio. Sensitive data detection in AWS Glue identifies a variety of PII and other sensitive data like credit card numbers. It helps you take action, such as tracking sensitive data for audit purposes or redacting the sensitive information before writing records into a data lake. AWS Glue Studio’s visual, no-code interface lets you include sensitive data detection as a step in a data integration job. It lets you choose the type of personal information to detect, as well as specify follow-up actions, including redaction and logging. You can also define your own custom detection patterns for your unique needs. For more details and hands-on practice on how to detect and process sensitive data using AWS Glue Studio, refer to Detect and process sensitive data using AWS Glue Studio.

Data quality features are also available in AWS Glue DataBrew, a visual data preparation tool that helps data analysts and data scientists prepare data with an interactive, point-and-click visual interface without writing code. With AWS Glue DataBrew, you can visualize, clean, and normalize terabytes, and even petabytes of data directly from the data lake, data warehouse, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS. For more details on how to build a completely automated, end-to-end event-driven pipeline for data quality validation using AWS Glue DataBrew, refer to Build event-driven data quality pipelines with AWS Glue DataBrew.

Sample dataset

We used an AWS sample dataset for this post. This dataset is provided with SDAS and contains approximately 10 GB of data. Table row counts range from a few records to more than 55 million records. Several indexes exist that are larger than 1.5 GB. The full code is available on the GitHub repo.

Simple Database Archival Solution (SDAS) overview

SDAS allows you to archive relational databases using a serverless approach. SDAS archives data from an Oracle, MySQL, or SQL Server database into a searchable immutable format residing in Amazon S3.

All database tables will be captured for the provided schema. SDAS cannot filter records (all rows are captured). It only provides a one-time load (incremental updates not available). This solution has a simple interface. This solution uses AWS Glue to perform the migration. AWS Glue is serverless and uses a central data catalog that is integrated with other AWS services, such as Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. AWS Glue is the recommended service to move and transform data within AWS.

SDAS will connect to the database, map the schema, perform validation, and transfer data to Amazon S3. This is accomplished by primarily using AWS Step Functions and AWS Glue. The main purpose of SDAS is to provide an out-of-the-box solution that quickly allows you to archive data in the cloud. The following screenshot shows the process for SDAS to connect to an Oracle database.

SDAS add archive screenshot

After you connect to the Oracle database and the data migration is complete, you can review the archive status in the SDAS UI (see the following screenshot). Each table and the associated number of columns for each table will be displayed in the SDAS UI.

SDAS archive screenshot

All tables within the selected Oracle schema will be migrated to Amazon S3. An Amazon S3 folder will be created for each Oracle table (see the following screenshot). Each folder will contain multiple parquet files to store the table data.

SDAS S3 bucket screenshot

SDAS automatically populates the AWS Glue Data Catalog. As a result, you can use Amazon Athena to access the data in Amazon S3. The following screenshot shows the row count for the three largest tables.

SDAS Athena query screenshot

The AWS services reside within your AWS account. These AWS services will incur individual service usage fees. The SDAS architecture is shown in the following diagram.

SDAS architecture diagram

The benefits of SDAS include the following:

  • Ingest and archive a relational database (retrieves all tables within a schema)
  • Perform integrity checks
  • Validate the data on the target after the archiving process has completed
  • Ability to configure WORM (write once, read many)
  • Ability to define a data retention period for the data
  • Detailed information about the status of the data
  • Ability to preview data archived in Amazon S3 (using an SDAS predefined Amazon S3 bucket)

SDAS is a simple and intuitive solution to archive an entire Oracle database schema to Amazon S3 in Parquet format. After the data resides in Amazon S3, you should back up the Oracle database prior to deleting it. You can install SDAS here. This self-guided workshop walks through the installation and core features of SDAS.

AWS Glue overview

AWS Glue is a serverless data integration service that helps to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides the capabilities needed for data integration, so you can start analyzing your data and put it to use quickly. AWS Glue provides visual and code-based interfaces to help with data integration. You can find and access data using the AWS Glue Data Catalog. Data engineers and Extract, Transform, and Load (ETL) developers can visually create, run, and monitor ETL workflows in a few steps in AWS Glue Studio.

This solution can include specific Oracle tables and use data or row filtering. The architecture is shown in the following diagram.

Glue architecture diagram

AWS Glue uses workflows to create and visualize complex ETL activities involving multiple crawlers, jobs, and triggers. As a workflow runs each component, it records execution progress and status. This provides you with an overview of the larger task and the details of each step. The AWS Glue console provides a visual representation of a workflow as a graph. You can create a workflow from an AWS Glue blueprint if you would like to templatize a migration workflow and reuse it for similar use cases. Or you can manually build a workflow one component at a time using the AWS Management Console or the AWS Glue API. For more information, see Overview of workflows in AWS Glue and Overview of blueprints in AWS Glue in the AWS Glue Developer Guide.
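
If you prefer to script the workflow rather than build it in the console, a minimal boto3 sketch could look like the following; the workflow, trigger, and job names are hypothetical, and the referenced job must already exist.

```python
import boto3

glue = boto3.client("glue")

# Create a workflow and an on-demand trigger that starts an existing archive job
# (workflow, trigger, and job names are hypothetical)
glue.create_workflow(Name="oracle-archive-workflow")
glue.create_trigger(
    Name="start-oracle-archive",
    WorkflowName="oracle-archive-workflow",
    Type="ON_DEMAND",
    Actions=[{"JobName": "oracle-to-s3-archive-job"}],
)

# Start the workflow run
glue.start_workflow_run(Name="oracle-archive-workflow")
```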

Because the goal is to archive data from an Oracle database, we built the following workflow, which performs a one-time load of the desired schema into the Amazon S3 data lake.

Glue workflow screenshot

If you want your workflow to perform incremental loads from Oracle tables, you could enable job bookmarks. AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process new data when rerunning on a scheduled interval. For more information, see Tracking Processed Data Using Job Bookmarks in the AWS Glue Developer Guide.
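
The following is a minimal sketch of what a bookmark-aware AWS Glue (PySpark) job script could look like. The catalog database, table, and bucket names are hypothetical, the job must be started with --job-bookmark-option job-bookmark-enable, and each source and sink needs a transformation_ctx so AWS Glue can track state.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Oracle table through a cataloged JDBC connection
# (database and table names are hypothetical)
source = glue_context.create_dynamic_frame.from_catalog(
    database="oracle_source_db",
    table_name="sales_orders",
    transformation_ctx="source",  # bookmark state is keyed on this name
)

# Write the rows to the archive bucket as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-archive-bucket/sales_orders/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # persists the job bookmark so the next run only picks up new data
```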

In addition, the archival dataset is immediately available to query via Amazon Athena upon completion of the ingestion workflow. The following screenshot shows the result of a simple validation query for the three largest tables in the database after they have been ingested to the Amazon S3 data lake and catalogued in the AWS Glue Data Catalog.

Glue Athena screenshot

AWS DMS overview

AWS DMS is a logical replication solution. It supports one-time loads and incremental updates. It supports specifying schemas and tables. Additionally, AWS DMS provides data and row filtering. It has a simple interface and also provides a serverless option. AWS DMS connects to the Oracle database via the Oracle database listener. AWS DMS does not use a centralized data catalog to integrate with other AWS services. Also, AWS DMS has limited data type support. AWS DMS is the recommended service to move data from on-premises sources to AWS.

You can use AWS DMS to archive infrequently accessed data to Amazon S3. AWS DMS is a managed migration and replication service that facilitates the movement and replication of data from various source databases to widely used targets, including commercial databases, open-source databases, and Amazon S3. When using Amazon S3 as the target, AWS DMS enables the extraction of information from supported databases. You can write this information to Amazon S3 in commonly used formats, such as CSV and parquet, making it compatible with most applications to consume this data.

For instructions on setting up an Oracle database as a source and Amazon S3 as a target in AWS DMS, refer to Effectively migrating LOB data to Amazon S3 from Amazon RDS for Oracle with AWS DMS and High-level view of AWS DMS.

Defining an AWS DMS replication task is straightforward, and AWS DMS is flexible to include or exclude schemas and tables. After the AWS DMS replication instance is created and the AWS DMS source and target endpoints are defined, you can create an AWS DMS replication task.
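
A minimal boto3 sketch of that flow could look like the following: an S3 target endpoint that writes Parquet, followed by a full-load task that includes every table in one schema. The bucket name, role ARN, endpoint and replication instance ARNs, and schema name are placeholders.

```python
import json
import boto3

dms = boto3.client("dms")

# Target endpoint that writes Parquet files to the archive bucket
# (bucket name and role ARN are placeholders)
dms.create_endpoint(
    EndpointIdentifier="archive-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-archive-bucket",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        "DataFormat": "parquet",
    },
)

# Full-load task that archives every table in the SALES schema
# (source/target endpoint and replication instance ARNs are placeholders)
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "SALES", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="archive-sales-schema",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",
    TableMappings=json.dumps(table_mappings),
)
```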

The following screenshots demonstrate how you can migrate an entire Oracle database schema to Amazon S3 via an AWS DMS replication task.

DMS migration task screenshot

DMS migration screenshot for tables

When the AWS DMS replication task has completed, the status will change on the AWS DMS console (see the following screenshot). The task will also identify how many rows were loaded for each table.

DMS migration status screenshot

An AWS DMS replication task can also use conditions to migrate data that meets certain criteria. For example, you may only want to archive data that is older than 1 year. AWS DMS can use column filters to handle this scenario.
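
A selection rule with a source filter could look like the following sketch, expressed here as the Python dictionary you would serialize into the task's table mappings (in the same style as the earlier selection rule). The schema, table, column, and date range are hypothetical; check the AWS DMS source-filter documentation for the full list of filter operators.

```python
# Include only ORDERS rows within an archival date range
# (schema, table, column, and dates are hypothetical)
archive_filter_rule = {
    "rule-type": "selection",
    "rule-id": "2",
    "rule-name": "orders-in-archive-range",
    "object-locator": {"schema-name": "SALES", "table-name": "ORDERS"},
    "rule-action": "include",
    "filters": [
        {
            "filter-type": "source",
            "column-name": "ORDER_DATE",
            "filter-conditions": [
                {
                    "filter-operator": "between",
                    "start-value": "2000-01-01",
                    "end-value": "2024-01-01",
                }
            ],
        }
    ],
}
```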

The following screenshot shows the folders that were created in the specified Amazon S3 bucket. Each folder represents a table in the source database. One or more parquet files reside in each folder.

DMS S3 bucket screenshot

After the archive data moves into Amazon S3, you can use Amazon Athena to query the data. However, an AWS Glue crawler must first run to populate the AWS Glue Data Catalog. The following screenshot shows the result of a simple validation query for the three largest tables in the database after they have been ingested into Amazon S3 and catalogued in the Data Catalog.

DMS Athena screenshot
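
Creating and running the AWS Glue crawler mentioned above is a short boto3 call; in this sketch the crawler name, IAM role, catalog database, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Crawl the DMS output prefix and register the tables in the Data Catalog
# (crawler name, role, database, and S3 path are placeholders)
glue.create_crawler(
    Name="dms-archive-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="archive_db",
    Targets={"S3Targets": [{"Path": "s3://my-archive-bucket/SALES/"}]},
)
glue.start_crawler(Name="dms-archive-crawler")
```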

The AWS DMS process flow is shown in the following diagram.

DMS architecture diagram

The benefits of AWS DMS include the following:

  • Fully managed and straightforward to use – AWS DMS is simple to use. There is no need to install drivers or applications. You can start a database migration with a few clicks on the AWS DMS console.
  • No downtime required – AWS DMS connects to the Oracle database via the database listener.
  • Cost-effective – AWS DMS allows you to only pay for database migration capacity used.
  • Reliable – AWS DMS is highly resilient and self-healing. It continually monitors sources, targets, network connectivity, and the replication instance. In the case of an interruption, AWS DMS automatically restarts the process and continues the migration from where it stopped.

Clean up

In the Oracle source database, run the cleanup script provided with the SDAS sample data. Additionally, delete or remove all AWS resources created as part of this post. This includes, but is not limited to, RDS for Oracle instances, AWS DMS replication tasks and replication instances, Amazon S3 buckets, and AWS Glue connections and crawlers.

Conclusion

In this post, we provided three solutions to implement Oracle archiving using Amazon S3 as the archive target. The next post in this series contains three native Oracle solutions to implement Oracle archiving. Archiving Oracle data can improve database performance, decrease database backup times, reduce the Recovery Time Objective (RTO), lower operational expenses, help align with regulatory requirements, and reduce Oracle software licensing requirements. Please provide feedback in the comments section.


About the Authors

Marvin Vinson is an Oracle Database Specialist Solutions Architect for AWS. He works in the Worldwide Public Sector (WWPS) organization. He is focused on migrating Oracle workloads to AWS.

Apostolos Grevenitis is an Analytics Specialist Solutions Architect helping public sector agencies succeed with their analytics workloads in AWS.