AWS Database Blog

Key considerations while migrating bulk operations from Oracle to PostgreSQL

AWS launched Amazon Database Migration Accelerator (Amazon DMA) to accelerate your journey to AWS databases and analytics services and achieve cloud adoption benefits such as cost savings and performance improvements. Amazon DMA has assisted thousands of customers globally to migrate their workloads (database and application) to AWS databases and analytics services. In this post, we share Amazon DMA’s key considerations while migrating bulk data operations from Oracle to Amazon Relational Database Service (Amazon RDS) for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition.

Implement bulk data operations in Oracle

SQL is a declarative database query language, not a programming language. Database procedural languages such as Oracle’s PL/SQL are programming languages that embed SQL statements to extend their capabilities, but this comes at a price. When Oracle’s procedural language engine (PL/SQL) processes a block of code, it runs the native procedural statements within its own engine, but when it encounters SQL statements, it passes them on to the SQL engine. The SQL engine runs the SQL statements and returns results to the procedural language engine. This back-and-forth transfer of control between the two engines is called a context switch. Each context switch carries some performance overhead, and performance can degrade drastically when many context switches occur. To mitigate this concern, Oracle enhances its procedural language with features that batch SQL operations and thereby reduce the number of context switches. The BULK COLLECT and FORALL statements in PL/SQL are examples of these enhancements. A BULK COLLECT SELECT statement retrieves multiple rows with a single fetch, improving the speed of batch data retrieval:

SELECT <column_list> BULK COLLECT INTO <bulk_variable> FROM <table_name>;
FETCH <cursor_name> BULK COLLECT INTO <bulk_variable>;

Similarly, PL/SQL offers a FORALL statement that uses collections to change rows in bulk, as opposed to applying the change to one row at a time:

FORALL <loop_variable> IN <lower_range> .. <higher_range>
<DML operation>;

In the rest of this post, we use a sample use case to explain how to conduct bulk data operations. The use case fetches orders from an orders table containing over 10 million records, applies a discount of $1,000 to orders priced greater than $10,000, and inserts each discounted order into the discounted_orders target table. Approximately 95% of the records in the orders table are priced greater than $10,000.

Orders (primary key: ID)

Name      Type
ID        NUMBER(8)
Name      VARCHAR(100)
Price     NUMBER(8,2)

Discounted_Orders (primary key: ID)

Name      Type
ID        NUMBER(8)
Name      VARCHAR(100)
Price     NUMBER(8,2)

We can implement this use case in Oracle’s PL/SQL using the BULK COLLECT, FETCH, and FORALL statements, as shown in the following code snippet:

DECLARE
           -- Cursor that selects only the orders that qualify for the discount
           CURSOR order_cursor IS SELECT * FROM orders WHERE price > 10000;
           TYPE order_tbl_type IS TABLE OF orders%ROWTYPE;
           order_tbl order_tbl_type;

BEGIN
           OPEN order_cursor;
           -- Fetch all qualifying rows in a single bulk operation
           FETCH order_cursor BULK COLLECT INTO order_tbl;

           -- Insert the discounted rows in bulk
           FORALL i IN 1 .. order_tbl.COUNT
           INSERT INTO discounted_orders
           VALUES (order_tbl(i).id, order_tbl(i).name, order_tbl(i).price - 1000);

           CLOSE order_cursor;
           COMMIT;
END;

Implement bulk data operations in PostgreSQL

PostgreSQL’s PL/pgSQL doesn’t have built-in features equivalent to Oracle’s BULK COLLECT, FETCH ... BULK COLLECT, and FORALL statements. However, you can choose from multiple approaches to achieve similar functionality. In this section, we compare four such approaches, ranging from combining the procedural language (PL/pgSQL) with declarative SQL statements to using pure SQL. The trade-offs between these approaches lie in their performance implications and in the flexibility they offer for manipulating individual records.
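
The PL/pgSQL snippets in this section assume PostgreSQL equivalents of the sample tables. The following minimal setup is a sketch; the column types and the sample data generation are assumptions chosen to approximate the Oracle definitions and the use case described earlier:

-- Sketch: PostgreSQL equivalents of the sample tables (column types are assumptions)
CREATE TABLE orders (
    id    NUMERIC(8) PRIMARY KEY,
    name  VARCHAR(100),
    price NUMERIC(8,2)
);

CREATE TABLE discounted_orders (
    id    NUMERIC(8) PRIMARY KEY,
    name  VARCHAR(100),
    price NUMERIC(8,2)
);

-- Optional: generate illustrative test data (row count and price distribution are assumptions)
INSERT INTO orders
SELECT g, 'order_' || g, round((5000 + random() * 15000)::numeric, 2)
FROM generate_series(1, 1000000) AS g;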

1. LOOP through the results of a SELECT statement

In this approach, we select all the orders with a price greater than $10,000 into a cursor, fetch each row from the cursor in a loop, apply a discount of $1,000, and insert the record into the target table. Here, a context switch occurs for every order with a price greater than $10,000. See the following code:

DO
$$
DECLARE
order_rec record;
BEGIN
           -- Loop through the orders with a price > 10000, apply the discount, and insert each record into the target table
           FOR order_rec IN SELECT * FROM orders WHERE price > 10000
           LOOP
                    INSERT INTO discounted_orders VALUES (order_rec.id, order_rec.name, order_rec.price - 1000);
           END LOOP;
END;
$$;
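
If you want to preserve the explicit cursor structure of the original PL/SQL block during migration, the same row-by-row pattern can be written with a bound cursor and FETCH, as in the following sketch (it incurs the same per-row context switching as the FOR loop above):

DO
$$
DECLARE
    order_cursor CURSOR FOR SELECT * FROM orders WHERE price > 10000;
    order_rec orders%ROWTYPE;
BEGIN
    OPEN order_cursor;
    LOOP
        -- Each FETCH retrieves a single row; FOUND becomes false when the cursor is exhausted
        FETCH order_cursor INTO order_rec;
        EXIT WHEN NOT FOUND;
        INSERT INTO discounted_orders VALUES (order_rec.id, order_rec.name, order_rec.price - 1000);
    END LOOP;
    CLOSE order_cursor;
END;
$$;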

2. Use PostgreSQL functions (ARRAY_AGG and UNNEST)

In this approach, we use the PostgreSQL functions ARRAY_AGG and UNNEST. The ARRAY_AGG function fetches the desired order records in bulk into an array, and the UNNEST function is then used to bulk insert the fetched records. With this approach, only two context switches occur: one to fetch the records into the array, and one to insert the discounted records into the target table. See the following code:

DO
$$
DECLARE
tmp_orders orders[]; -- orders is both a table name and a composite type
BEGIN
            -- Fetch the records for orders with a price > 10000 into an array.
            SELECT ARRAY_AGG(orders)
            INTO tmp_orders
            FROM orders
            WHERE price > 10000;

            -- Unnest the array and insert each record into the target table.
            INSERT INTO discounted_orders
            SELECT t.id, t.name, t.price - 1000
            FROM UNNEST(tmp_orders) t;
END;
$$;

Compared to the previous approach, this approach drastically reduces the number of context switches. The drawback is that the entire result set is materialized in memory as an array, so the ARRAY_AGG and UNNEST operations may not provide optimal performance for very large datasets.

3. Use only SQL statements

In this approach, we write the entire code in pure SQL statements and bypass the procedural language engine altogether. This removes the need for context switching between the procedural language engine and the SQL engine, as shown in the following code snippet:

-- Insert into the target table the order records whose price is > 10000
INSERT INTO discounted_orders
SELECT id, name, price - 1000
FROM orders
WHERE price > 10000;

4. Use SQL statements with chunked large batches

In some cases, a single SQL statement won’t work for large batches of records, for reasons such as resource constraints (for example, memory limits or result set pagination). In these cases, we can use a LIMIT clause together with a moving lower bound on the primary key to process a large batch of records in smaller chunks, as shown in the following code snippet:

DO
$$
DECLARE
_lmt INTEGER := 100000;   -- chunk size
_maxID INTEGER := 0;      -- lower bound of the next chunk
BEGIN
          LOOP
              -- Insert the next chunk of qualifying orders, ordered by id
              INSERT INTO discounted_orders
              SELECT id, name, price - 1000 FROM orders
              WHERE id >= _maxID AND price > 10000
              ORDER BY id LIMIT _lmt;
              -- Stop when the last INSERT processed no rows
              EXIT WHEN NOT FOUND;
              -- Advance the lower bound past the highest id inserted so far
              SELECT MAX(id) + 1 INTO _maxID FROM discounted_orders;
          END LOOP;
END;
$$;
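
Whichever approach you choose, a quick sanity check such as the following hypothetical verification query (not part of the original workload) can confirm that every qualifying order made it into the target table:

-- Compare the number of qualifying source rows with the number of migrated rows
SELECT
    (SELECT COUNT(*) FROM orders WHERE price > 10000) AS source_rows,
    (SELECT COUNT(*) FROM discounted_orders) AS target_rows;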

Comparing techniques

We can rank these bulk operation techniques from the fewest context switches to the most:

  • Use pure SQL statements (if feasible)
  • Use SQL statements with chunked large batches
  • Use ARRAY_AGG and UNNEST if the preceding approaches are not feasible
  • Cache the results of the SQL statement in a cursor and process the results row by row

Although reducing the number of context switches can help improve query performance, you need to consider additional criteria when evaluating the overall performance of the chosen approach, such as resource constraints, parallelization, and system configurations. To demonstrate this, Amazon DMA conducted a performance benchmark analysis of the above-mentioned techniques using an orders table that contained 10 million records and observed that the fourth option took the least amount of time to run and was more than three times as fast as the first option, which was the slowest. The following table has the details.

Option  Technique                                         Relative runtime vs. Option 4 (smaller is faster)
1       LOOP through the results of a SELECT statement    3.24X
2       Use PostgreSQL functions (ARRAY_AGG and UNNEST)   1.4X
3       Use only SQL statements                           1.12X
4       Use SQL statements with chunked large batches     1X

Based on the performance benchmark analysis, we ranked the approaches in the following order:

  • Use pure SQL statements (if feasible)
  • Use SQL statements with chunked large batches (if we need to chunk a large batch into smaller increments)
  • Use PostgreSQL functions (ARRAY_AGG and UNNEST, if pure SQL isn’t feasible)
  • LOOP through the results of a SELECT statement (as a last resort, to cache the results of the SQL statement in a cursor and process the results row by row).

Conclusion

In this post, we shared Amazon DMA’s key considerations for migrating bulk data operations from Oracle to Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition.

Amazon DMA offers complimentary advisory services to create migration strategies and implementation plans, and enables your in-house migration team (or Amazon Professional Services or an APN Partner, if involved) to conduct the migration implementation. If you are planning to migrate your workloads to AWS databases and analytics services, email DMA-sales@amazon.com to engage the Amazon DMA team.

In the coming weeks, we will publish additional posts on how to handle large objects in PostgreSQL, how to enforce uniqueness across PostgreSQL partitions, and more. Stay tuned!


About the Authors

Ezat Karimi is a Sr. Solutions Architect at Amazon Web Services (AWS). She is a member of the Database Migration Accelerator (DMA) team, which helps customers with their database and application modernization journey.

Sharath Gopalappa is a Sr. Product Manager Technical at Amazon Web Services (AWS) with a focus on helping organizations modernize their technology investments with AWS Databases & Analytics services.