AWS Database Blog
Key considerations while migrating bulk operations from Oracle to PostgreSQL
AWS launched Amazon Database Migration Accelerator (Amazon DMA) to accelerate your journey to AWS databases and analytics services and achieve cloud adoption benefits such as cost savings and performance improvements. Amazon DMA has assisted thousands of customers globally to migrate their workloads (database and application) to AWS databases and analytics services. In this post, we share Amazon DMA’s key considerations while migrating bulk data operations from Oracle to Amazon Relational Database Service (Amazon RDS) for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition.
Implement bulk data operations in Oracle
SQL is a declarative database query language, not a programming language. Database procedural languages such as Oracle’s PL/SQL, however, are programming languages, and they embed SQL statements to extend their capabilities, which comes at a price. When Oracle’s procedural language engine (PL/SQL) processes a block of code, it runs the native procedural statements within its own engine, but when it encounters SQL statements, it passes them on to the SQL engine. The SQL engine runs the SQL statements and returns the results to the procedural language engine. This back-and-forth transfer of control between the two engines is called a context switch. Each switch carries some performance overhead, and a large number of context switches can cause drastic performance degradation. To mitigate this concern, Oracle enhances its procedural language with features that pool context switches together. The BULK COLLECT and FORALL statements in PL/SQL are examples of these enhancements. A BULK COLLECT SELECT statement retrieves multiple rows with a single fetch, improving the speed of batch data retrieval.
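The following is a minimal sketch of BULK COLLECT; the orders table and its price filter mirror the use case introduced later in this post and are purely illustrative here:

```sql
DECLARE
  -- Collection type that holds entire rows of the orders table
  TYPE t_orders IS TABLE OF orders%ROWTYPE;
  l_orders t_orders;
BEGIN
  -- A single bulk fetch retrieves every qualifying row into the collection,
  -- replacing a row-by-row fetch loop and its per-row context switches
  SELECT *
    BULK COLLECT INTO l_orders
    FROM orders
   WHERE price > 10000;

  DBMS_OUTPUT.PUT_LINE('Fetched ' || l_orders.COUNT || ' rows');
END;
/
```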
Similarly, PL/SQL offers a FORALL statement that uses collections to change rows in bulk, as opposed to applying the change one row at a time.
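The sketch below bulk collects a set of order IDs and then changes all of the corresponding rows with a single FORALL statement; the table, filter, and discount amount again anticipate the use case described next:

```sql
DECLARE
  TYPE t_ids IS TABLE OF orders.id%TYPE;
  l_ids t_ids;
BEGIN
  -- Fetch the IDs of the rows to change in a single round trip
  SELECT id
    BULK COLLECT INTO l_ids
    FROM orders
   WHERE price > 10000;

  -- Apply the change to every collected row with a single context switch
  FORALL i IN 1 .. l_ids.COUNT
    UPDATE orders
       SET price = price - 1000
     WHERE id = l_ids(i);

  COMMIT;
END;
/
```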
In the rest of this post, we use a sample use case to explain how to conduct bulk data operations. The use case fetches orders from an Orders table containing over 10 million records, applies a discount of $1,000 if the order price is greater than $10,000, and inserts the discounted order into the Discounted_Orders target table. Approximately 95% of the records in the Orders table are priced greater than $10,000.
Orders (primary key ID)

| Name | Type |
| --- | --- |
| ID | NUMBER (8) |
| Name | VARCHAR (100) |
| Price | NUMBER (6,2) |

Discounted_Orders (primary key ID)

| Name | Type |
| --- | --- |
| ID | NUMBER (8) |
| Name | VARCHAR (100) |
| Price | NUMBER (6,2) |
We can implement this use case in Oracle’s PL/SQL using the BULK COLLECT, FETCH, and FORALL statements, as shown in the following code snippet:
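The following is a minimal sketch of such an implementation; the 10,000-row batch size is an illustrative assumption, and whole records are inserted because the cursor’s columns match the Discounted_Orders table exactly:

```sql
DECLARE
  CURSOR c_orders IS
    SELECT id, name, price
      FROM orders
     WHERE price > 10000;

  TYPE t_orders IS TABLE OF c_orders%ROWTYPE;
  l_orders t_orders;
BEGIN
  OPEN c_orders;
  LOOP
    -- Bulk fetch the qualifying orders in batches of 10,000 rows
    FETCH c_orders BULK COLLECT INTO l_orders LIMIT 10000;
    EXIT WHEN l_orders.COUNT = 0;

    -- Apply the $1,000 discount in memory
    FOR i IN 1 .. l_orders.COUNT LOOP
      l_orders(i).price := l_orders(i).price - 1000;
    END LOOP;

    -- Insert the discounted batch with a single context switch per batch
    FORALL i IN 1 .. l_orders.COUNT
      INSERT INTO discounted_orders VALUES l_orders(i);
  END LOOP;
  CLOSE c_orders;

  COMMIT;
END;
/
```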
Implement bulk data operations in PostgreSQL
PostgreSQL’s PL/pgSQL doesn’t have built-in features that are equivalent to Oracle’s PL/SQL features such as BULK COLLECT, FETCH, or FORALL. However, you can choose from multiple approaches to achieve similar functionality. In this section, we compare four such approaches, ranging from combining the procedural language (PL/pgSQL) with declarative SQL statements to using pure SQL. The trade-offs between these approaches lie in their performance implications and in the flexibility they offer to manipulate individual records.
1. LOOP through the results of a SELECT statement
In this approach, we select all the orders with a price greater than $10,000 into a cursor, fetch each row from the cursor in a loop, apply a discount of $1,000, and insert the record into the target table. Here, a context switch occurs for every order with a price greater than $10,000. See the following code:
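This sketch is written as an anonymous DO block; the table and column names follow the Orders and Discounted_Orders definitions above:

```sql
DO $$
DECLARE
  rec RECORD;
BEGIN
  -- Each loop iteration switches between the PL/pgSQL engine and the SQL engine
  FOR rec IN
    SELECT id, name, price
      FROM orders
     WHERE price > 10000
  LOOP
    INSERT INTO discounted_orders (id, name, price)
    VALUES (rec.id, rec.name, rec.price - 1000);
  END LOOP;
END;
$$;
```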
2. Use PostgreSQL functions (ARRAY_AGG and UNNEST)
In this approach, we use the PostgreSQL functions ARRAY_AGG and UNNEST. The ARRAY_AGG function fetches the desired order records in bulk into an array, and the UNNEST function then bulk inserts the fetched records. In this approach, a context switch occurs once when fetching the records into an array and once when inserting the updated records into the target table. See the following code:
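In this sketch, one array is aggregated per column and the arrays are expanded back into rows with a multi-argument UNNEST; the array element types are assumptions that mirror a typical PostgreSQL version of the Orders schema:

```sql
DO $$
DECLARE
  l_ids    integer[];
  l_names  text[];
  l_prices numeric[];
BEGIN
  -- Bulk fetch the qualifying rows into arrays (one context switch)
  SELECT ARRAY_AGG(id), ARRAY_AGG(name), ARRAY_AGG(price)
    INTO l_ids, l_names, l_prices
    FROM orders
   WHERE price > 10000;

  -- Bulk insert the discounted rows by expanding the arrays back into rows
  -- (one more context switch)
  INSERT INTO discounted_orders (id, name, price)
  SELECT t.id, t.name, t.price - 1000
    FROM UNNEST(l_ids, l_names, l_prices) AS t(id, name, price);
END;
$$;
```

Aggregating whole rows into an array of the table’s row type works as well; per-column arrays simply keep the UNNEST call explicit.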
Compared to the previous approach, this approach drastically reduces the number of context switches. The drawback of this approach is that the ARRAY_AGG and UNNEST operations may not provide optimal performance for very large datasets, because the entire result set is materialized in memory as arrays.
3. Use only SQL statements
In this approach, we write the entire code in pure SQL statements and bypass the procedural language engine altogether. This removes the need for context switching between the procedural language engine and the SQL engine, as shown in the following code snippet:
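In this sketch, a single INSERT ... SELECT statement applies the discount and loads the target table entirely within the SQL engine:

```sql
INSERT INTO discounted_orders (id, name, price)
SELECT id, name, price - 1000
  FROM orders
 WHERE price > 10000;
```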
4. Use SQL statements with chunked large batches
In some cases, a pure SQL statement won’t work for large batches of records because of resource constraints (for example, memory limits or result set pagination). In these cases, we can use a LIMIT clause and calculate the offset required to process large batches of records in smaller increments, as shown in the following code snippet:
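In this sketch, a DO block copies the rows in chunks; the 100,000-row batch size is an illustrative assumption, and ORDER BY keeps the offset deterministic:

```sql
DO $$
DECLARE
  batch_size  CONSTANT integer := 100000;  -- illustrative chunk size; tune for the workload
  batch_no    integer := 0;
  rows_copied bigint;
BEGIN
  LOOP
    -- Copy one chunk of qualifying rows per iteration
    INSERT INTO discounted_orders (id, name, price)
    SELECT id, name, price - 1000
      FROM orders
     WHERE price > 10000
     ORDER BY id
     LIMIT batch_size
    OFFSET batch_no * batch_size;

    -- Stop when a chunk comes back empty
    GET DIAGNOSTICS rows_copied = ROW_COUNT;
    EXIT WHEN rows_copied = 0;

    batch_no := batch_no + 1;
  END LOOP;
END;
$$;
```

Because OFFSET still has to read the rows it skips, later chunks cost progressively more; chunking on a key range (for example, WHERE id greater than the last ID copied) is a common alternative when that overhead matters.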
Comparing techniques
We can rank these bulk operation techniques in order from fewest context switches to most:
- Use pure SQL statements (if feasible)
- Use SQL statements with chunked large batches
- Use ARRAY_AGG and UNNEST if the preceding approaches are not feasible
- Cache the results of the SQL statement in a cursor and process the results row by row
Although reducing the number of context switches can help improve query performance, you need to consider additional criteria when evaluating the overall performance of the chosen approach, such as resource constraints, parallelization, and system configurations. To demonstrate this, Amazon DMA conducted a performance benchmark analysis of the preceding techniques using an orders table that contained 10 million records and observed that the fourth option took the least amount of time to run and was more than three times as fast as the first option, which was the slowest. The following table has the details.
| Option | Technique | Relative runtime to Option 4 (smaller is faster) |
| --- | --- | --- |
| 1 | LOOP through the results of a SELECT statement | 3.24X |
| 2 | Use PostgreSQL functions (ARRAY_AGG and UNNEST) | 1.4X |
| 3 | Use only SQL statements | 1.12X |
| 4 | Use SQL statements with chunked large batches | 1X |
Based on the performance benchmark analysis, we ranked the approaches in the following order:
- Use pure SQL statements (if feasible)
- Use SQL statements with chunked large batches (if we need to chunk a large batch into smaller increments)
- Use PostgreSQL functions (ARRAY_AGG and UNNEST, if pure SQL isn’t feasible)
- LOOP through the results of a SELECT statement (as a last resort, caching the results of the SQL statement in a cursor and processing them row by row)
Conclusion
In this post, we shared Amazon DMA’s key considerations for migrating bulk operations from Oracle to Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition.
Amazon DMA offers complimentary advisory services to create migration strategy and implementation plans, and enables your in-house migration team (or Amazon Professional Services or an APN Partner, if involved) to conduct the migration implementation. If you are planning to migrate your workloads to AWS databases and analytics services, email DMA-sales@amazon.com to engage the Amazon DMA team.
In the coming weeks, we will publish additional posts on how to handle large objects in PostgreSQL, how to enforce uniqueness across PostgreSQL partitions, and more. Stay tuned!
About the Authors
Ezat Karimi is a Sr. Solutions Architect at Amazon Web Services (AWS). She is a member of the Database Migration Accelerator (DMA) team, which helps customers with their database and application modernization journey.
Sharath Gopalappa is a Sr. Product Manager Technical at Amazon Web Services (AWS) with a focus on helping organizations modernize their technology investments with AWS Databases & Analytics services.