How to migrate an application from an on-premises Oracle database to Amazon RDS for PostgreSQL
For years, companies have had to set up their own local databases and maintain the hardware themselves. However, as the cloud infrastructure continues to improve, there’s far less need to own and manage your own hardware.
Here at Amazon, we own hundreds (if not thousands) of on-premises databases that over time we have migrated to cloud-based solutions, both relational like Amazon RDS and nonrelational like Amazon DynamoDB. For our migrated databases that end up in relational solutions, we also seek to reduce costs by looking at database technologies that are free to use, but have similar functionality to our Oracle databases: sequences, triggers, partitioning, and more. The closest, technology-wise, is PostgreSQL.
This blog post has two sections. First, we discuss the common Java application and changes on the layer of Hibernate (an object-relational mapping). After that, we discuss some best practices and strategies for managing the application tier discovered when we migrated our Java applications from Oracle to Amazon RDS for PostgreSQL. These suggestions and examples work with databases based both on RDS and Amazon Aurora with PostgreSQL compatibility.
Hibernate and application layer changes and best practices
First, start by analyzing your application. To complete a successful migration, we recommend a deep understanding of the application you’re trying to migrate. Here are some high-level questions to get you started:
- Does your application use an object-relational mapping (ORM), like Hibernate?
- If so, what Hibernate data types do you use in your application?
- Does your application still have some plain SQL that might need to be converted?
Let’s explore some of these questions and why they matter.
Does your application use an ORM?
If your application does use an ORM, then you’re in luck and the application code likely doesn’t have to change very much. Hibernate is one of the most popular ORMs, and following I explain the changes required to migrate an application using Hibernate.
Let’s walk through a standard Hibernate connection setup, shown following, and understand what the purpose of each of these lines is.
Line 2, I’m setting the driver class that Hibernate should try to use when establishing the connection.
Line 3, I set the connection URL. The PostgreSQL driver looks for
jdbc:postgresql or something of similar form to recognize that the PostgreSQL driver should use this URL. The drivers are statically initialized into Java’s DriverManager. When a new connection is requested, the DriverManager spools through each of the drivers to see if it handles that sort of connection URL by looking at prefixes of this sort.
Line 4 and 5 contain the user name and password. The only recommendation I have here is to find a way to securely store your password so that doesn’t appear in plain text in your configuration.
Line 6, I set the dialect for Hibernate to produce queries in.
Line 7 is interesting because PostgreSQL likes to return timestamps to clients in the time zone that each client is running from. If you want all your times in UTC, you need to specify that on the client side. You can do this either by setting up the JVM to run in UTC or explicitly setting the JDBC time zone as shown. Make sure to know whether you’re using timestamp with or without time-zone components in your schema. Oracle dates are timestamps, but PostgreSQL makes a distinction between dates and timestamps in cases where a date is just the day component.
Line 8 I left here as a best practice. It’s good when connecting to a database to create a connection pool so that a set of connections can be kept open and reused by your service. Doing this reduces the latency of opening a connection each time you want to talk to the database and helps your database by preventing it from having to continually establish new connections. The connection pool being used here is C3P0, and you can find additional details about setting it up and configuring it in the Hibernate documentation.
What data types does your application use?
The data types are a bit different between Oracle and PostgreSQL. The first type that we encountered issues with were Booleans. (We were using Hibernate version 3.6 when we encountered this problem.) In Oracle, you use
number(1) and Hibernate appropriately handles the conversion of a
Boolean Java type to a value of
1 in the database. However, when you switch Hibernate over to talk to the PostgreSQL database that was using the
smallint type, with the latter Hibernate throws errors when trying to convert from Numeric to Boolean. However, Hibernate has a type for doing this called
org.hibernate.type.NumericBooleanType that does these conversions for you.
Additionally, you need to be careful when migrating dates to ensure that your client has the correct time zone (discussed preceding) and carefully select what sort of date or timestamp object you are trying to store. On the storage side, Oracle’s
date keeps timestamp information at a resolution of seconds, and PostgreSQL’s
date keeps the date without a time. Oracle’s
timestamp types should both be converted to PostgreSQL
timestamp to preserve the same level of information. However, there are also time zone elements to bear in mind that I don’t go into further here. These differences can affect any raw SQL statements that might exist in your code, and you should analyze these thoroughly when converting an application.
Some additional information about Oracle to PostgreSQL schema conversions can be found in the post Challenges When Migrating from Oracle to PostgreSQL—and How to Overcome Them on the AWS Database Blog.
Does your application have plain SQL to be converted?
If your application doesn’t use an ORM, then your migration is likely to require a lot of modifications to raw SQL statements that aren’t compatible between the two databases. Even if you use an ORM, certain operations might be hard-coded into the application, either for optimization or specific DB functionality reasons.
There is an ANSI SQL standard that typically both databases accept, so writing in this standard might make future migrations easier. But to help with these conversions, you can use the AWS Schema Conversion Tool (AWS SCT), a free tool that’s downloadable from the AWS website. AWS SCT is a general tool that you can use to migrate your source database tables, indexes, stored procedures, sequences, and other objects to your target database.
But, more specific to this blog post, you can use the application conversion feature of AWS SCT to convert any raw SQL statements. AWS SCT will analyze your application code for any embedded SQL statements. Then AWS SCT will convert those SQL statements to the target syntax and save the application files with the converted code.
AWS SCT provides a direct translation between the source and target SQL dialects whenever possible. When no direct translation is possible, AWS SCT supplies additional functionality (called “extension packs”) that mimic the source behavior in the target environment. Otherwise, if AWS SCT cannot convert a given SQL statement, it will flag the offending code so you can perform a manual conversion.
If you need to perform a manual conversion, another great resource is the Oracle Database to Amazon Aurora with PostgreSQL Compatibility Migration Playbook, also available as a free download on the AWS website. The playbook includes many tips on converting from Oracle’s dialect to PostgreSQL, but I also list some common conversions following.
PostgreSQL doesn’t have a SYSDATE function, which we frequently use in our Oracle databases. You can convert this function to CURRENT_TIMESTAMP, which works in both databases.
However, there’s a couple things to consider. First, CURRENT_TIMESTAMP in Oracle returns in the time zone of the client session and SYSDATE returns in the server time zone. Also, in PostgreSQL, CURRENT_TIMESTAMP gives the same result for the whole transaction and SYSDATE follows the system clock and changes for long-running transactions. This difference matters for bulk-loading scenarios where the first row has a different value than the last. In PostgreSQL, CLOCK_TIMESTAMP is closer to SYSDATE. As a general rule of thumb, for most purposes you can do the following conversion.
Oracle and PostgreSQL
PostgreSQL doesn’t have a SYSDATE function, which we frequently use in our example Oracle database. You can convert this function to CURRENT_TIMESTAMP, which works in both databases.
Intervals in Oracle are a bit easier, because you can subtract values from dates to result in a relative date. However, a more generic way to do this for both is to use intervals.
Oracle and PostgreSQL
Sometimes we hard-code specific dates into our Oracle interactions using the to_date function, but it turns out there’s a simpler way that both databases understand.
Oracle & PostgreSQL
Warning: Make sure that you understand what time zone your client is operating in, because the preceding command mixed with a PostgreSQL timestamp (with and without time zone) can result in some funny behavior. PostgreSQL always stores timestamps in the time zone of the server regardless of whether the timestamp has a time zone or not. For Amazon RDS for PostgreSQL, this time zone is always UTC. If the client is logged into a time zone other than UTC, the preceding command can result in the time zone of the client being captured. This capture moves the time stored to something other than what was entered. To address this, the easiest thing to do is to ensure your client is running in UTC.
Substatements in the from clause
Statements inside the from clause need to be aliased inside PostgreSQL, though this was optional in Oracle.
Oracle and PostgreSQL
Even when you aren’t using a table in Oracle, you still must have a from clause, so Oracle introduced the concept of DUAL. In PostgreSQL, you can omit the from clause altogether.
So, for example, if you’re running a simple command to see that you connected to the database and can run a query, this code might be different between Oracle and PostgreSQL.
Null vs. empty strings
Some versions of Oracle treat empty strings the same as null values, so if you query
'IS NULL' on '' it returns true. PostgreSQL has a distinction between an empty string value and a null value, so
'IS NULL' on '' returns false.
In Oracle, a
DATE is a date and time object with resolution down to the second. On the other hand, PostgreSQL treats
DATE as a date with no time information.
Tips for managing the application tier migration
When we take a step back from the Hibernate interactions of our application and look at the application more holistically, we find some other questions to ask.
How do I want to migrate my application?
When looking at your application’s migration strategy, consider the use of the tables and data that your application uses. The simplest migration is using a tool like AWS Database Migration Service (AWS DMS) to replicate the data first in a bulk load and then using change data capture (CDC) to migrate live transactions. More information on the use of CDC can be seen in the post Migrate PostgreSQL Databases and Perform Ongoing Replication with the AWS Database Migration Service on the AWS Database Blog.
Between AWS DMS bulk load and CDC, all the data migration is handled for you. You’re responsible only for the service level migration, flipping the application from one database to the other. However, sometimes your data’s dependencies or uptime requirements might force you to do a dual-write migration. This kind of migration involves writing code in your application to read and write to both stores and manage the transactionality and failure logic yourself. Even with this solution, you might be able to use AWS DMS to do a bulk load of all the old data up to the timestamp where you began dual writing.
Some of the questions following play into whether to take this approach and which option works best for you.
Does my application have exclusive read/write access to the tables it uses?
The first and most blatant example of a data dependency is another service directly accessing your service’s tables to either read or write data. In most cases, I recommend breaking this direct data dependency and having the other service call into your service for those reads and writes. This approach means that one service can be the master of that data.
Otherwise, suppose that only one of the two services is migrated to the other database. In this case, the remaining service sees different data for whatever data replication lag exists between your databases. For some applications this might not be a problem. However, if you need near-immediate consistency you might have to look to a dual-write migration if you can’t break this dependency.
What level of write then read consistency does my application require?
In a similar vein, think about consistency within your application. If you have multiple services connecting to the same database and reading the same tables, do you need consistency across the entire service plane? Suppose that you’re doing a data migration with AWS DMS and flipping the services one by one from one database to the other. In this case, you need to understand what the impact is if one service writes to a table in the new database and any service connected to the old database reads stale data. Until you migrate the service to the new database, the service continues to read stale data unless you sync from your new database back to your old one.
For this reason, you need to order the way in which your applications migrate as to minimize this. You still have a short-term consistency problem if the writer is on the old database and reader on the new database, but at least it’s resolved when AWS DMS migrates those records. A similar problem exists even within a single application if it’s a distributed service, because during your deployment some hosts might point to the old database while some point to the new database. For this reason, you need to think carefully about how you deploy your database flips.
Do I have foreign key constraints that cross services’ tables?
Even if two applications are both reading and writing exclusively to their own tables, a problem can still exist if there are foreign key constraints between the tables of the different services. This problem can happen when a value is written on the old database but has not been replicated to the new database. In this case, a write to the new database referencing that value causes a foreign-key constraint violation. Typically the easiest fix is to drop the foreign key constraints during the migration. However, you might find a good order for your application migration so you don’t have to do this.
What sort of downtime and maintenance is acceptable for my application?
Your acceptable level of downtime and maintenance helps determine what sort of migration might work for your service. If you can afford no downtime and need immediate consistency across all your records, you might be left with the dual-write solution. However, even this solution has its downside, because you need to handle failed writes to one database when the other succeeds. If you can take a small bit of downtime, then ordering how you flip your applications (as discussed preceding) or a deployment of a new configuration might work best for you. Using this approach, you can leave the heavy lifting of the data migration to AWS DMS.
How do I test my application?
Regardless of which method of migration you implement, application compatibility and performance testing are crucial to making a migration like this work. Hibernate hides many incompatibilities for you if you are using it for everything, but there are still a few gotchas that we mention preceding. In our example, we mixed use of Hibernate and some raw SQL statements that resulted in incompatibilities found only during testing.
Besides testing for functionality, we recommend that you test the database performance difference that you can expect to see at run time. To start, you can run execution plans for your most common and resource-intensive queries to understand whether PostgreSQL is taking similar steps to get the results of your query.
Each of these renders an execution plan that you can use to compare the expected way each query runs. In doing so, you can determine if PostgreSQL and Oracle take relatively similar paths to run the query. A common thing to look for in these execution plans is whether the two use similar indices. Similarly, you can run the queries with timing enabled to see the amount of time taken to return the result, as shown following.
At a higher level, you can instrument your application by adding metrics around remote calls. You can observe the latency at your application layer by invoking the various code paths, similar to the functional testing discussed preceding. If you don’t already have an integration testing suite, this is a great opportunity to set one up. It can monitor both regressions in functionality and latency for your application. Setting up an integration testing suite will benefit you for years to come. It helps catch changes that have negative impacts on your application and moves your team closer to continuous deployment.
In this blog post, we discuss the migration of an application from an on-premises Oracle database to an Amazon RDS for PostgreSQL database. This migration not only reduces the infrastructure and operational maintenance by moving to a cloud-based solution, but also reduces the cost of licensing. This is a triple win.
Changing the data store for your application requires careful planning to understand what the right path forward for your application is. It also requires thorough testing to help protect against regressions in functionality or latency. Here at Amazon, we have used our own suite of AWS tools to migrate off of hundreds of on-premises databases to Amazon RDS for PostgreSQL instances, and continue migrating more each day.
About the Author
Ryan Zauber is a senior software development engineer at Amazon.