Getting more with PostgreSQL purpose-built data types
When designing many applications today, developers rightfully think of the end-user first and focus on what the experience will be. How the data is ultimately stored is an implementation detail that comes later. Combined with rapid release cycles, “schema-less” database designs fit well, allowing for flexibility as the application changes. PostgreSQL natively supports this type of design pattern, allowing data to be stored as JSON, XML, key-value, or trees through different data types, and performs well compared to purpose-built databases for that data.
The flexibility of a “schema-less” design does come at a cost in performance and code complexity. Storing the data as JSON requires the metadata to be stored as well, in the form of keys, in the document. This increases the size of the data, affecting performance. Handling a JSON document brings additional challenges as the application code base evolves. Different modules or services may use the same piece of data, and all of them must use it consistently, agreeing on the required keys and the format of the values. This complexity leads many developers back to a more traditional database design, with a schema defined with tables and columns.
When creating those tables, PostgreSQL has a robust set of data types that can help define the data stored in those columns. You have integer types, numeric types, Boolean types, timestamps with and without time zones, and the workhorse type of a VARCHAR, storing strings up to 1 GB in size. Because of its flexibility, VARCHAR is often used akin to a “schema-less” column, where it holds a large string and the application code is responsible for handling and validating its contents. In this post, we dig into several purpose-built data types that handle types of data traditionally stored in base types like a VARCHAR. We demonstrate the advantages of using those specialized types.
A common type of data used in applications is a universally unique identifier (UUID). For distributed systems, these identifiers provide a method to control uniqueness. According to the standard [RFC 4122, ISO/IEC 9834-8:2005], a UUID is a sequence of lower-case hexadecimal digits, in groups separated by hyphens. Specifically, a group of 8 digits followed by three groups of 4 digits followed by a group of 12 digits, for a total of 32 digits representing the 128 bits. To accommodate this length of 32 digits plus 4 hyphens, a 36-character VARCHAR is typically used.
Let’s consider a table that we use to collect leads from an outside source. We store the UUID from the source in the column source_id:
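A minimal sketch of such a table; only the source_id column and its varchar(36) type come from the text, and the other column names are assumptions for illustration:

```sql
CREATE TABLE leads (
    id         bigserial PRIMARY KEY,          -- assumed surrogate key
    source_id  varchar(36),                    -- UUID from the outside source, stored as text
    created_at timestamptz NOT NULL DEFAULT now()
);
```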
This definition protects us from a source_id that is too long, but doesn’t protect from a source_id that is too short. Without handling the format of source_id in the application or creating a CONSTRAINT on the column checking the format, we can’t ensure that the value of source_id is a valid UUID. It can be any string up to 36 characters. For example, see the following code with an invalid value:
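A hypothetical insert illustrating this; the literal value is an assumption:

```sql
-- Accepted by the varchar(36) column even though it is not a valid UUID
INSERT INTO leads (source_id) VALUES ('not-a-valid-uuid');
```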
PostgreSQL has a UUID data type specifically designed for such cases. Switching the data type of source_id to a UUID type guarantees that all values of source_id are valid UUIDs. However, before changing the type, we must delete the invalid source_id that we inserted previously. The following steps delete the row with the invalid UUID and change the data type of the source_id column to UUID:
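These steps might look like the following; the literal in the WHERE clause is an assumed stand-in for the invalid value:

```sql
-- Remove the row holding the invalid value
DELETE FROM leads WHERE source_id = 'not-a-valid-uuid';

-- Convert the column; the USING clause casts the existing text values to UUID
ALTER TABLE leads
    ALTER COLUMN source_id TYPE uuid USING source_id::uuid;
```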
In the following code block, the first two statements attempt to create rows with an invalid UUID. The final statement succeeds with a valid UUID:
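A sketch of those statements; the UUID values themselves are illustrative assumptions:

```sql
-- Fails: not hexadecimal digits in the required groups
INSERT INTO leads (source_id) VALUES ('not-a-valid-uuid');

-- Fails: one digit short of a valid UUID
INSERT INTO leads (source_id) VALUES ('6f043048-54f5-4f89-b76d-62eb873b365');

-- Succeeds: a well-formed UUID
INSERT INTO leads (source_id) VALUES ('6f043048-54f5-4f89-b76d-62eb873b3659');
```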
Although PostgreSQL does offer functions to generate a UUID inside of the database, it’s best to generate them in the application and send them to the database. The process of generating a UUID is a CPU-intensive operation and scaling CPU resources on a database is difficult.
The UUID data type also standardizes the format of a UUID. The input of a UUID can be in upper, lower, or mixed case. It can include or exclude hyphens or braces. Regardless of the input format, the output format always follows the standard. The following statement shows a UUID in three different formats, but the results are identical:
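A sketch of such a statement, using an assumed UUID value; all three expressions produce the same canonical lower-case, hyphenated output:

```sql
SELECT '6F043048-54F5-4F89-B76D-62EB873B3659'::uuid   AS upper_case,
       '{6f043048-54f5-4f89-b76d-62eb873b3659}'::uuid AS braces,
       '6f04304854f54f89b76d62eb873b3659'::uuid       AS no_hyphens;
```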
This standardization is useful for display on a user interface, but its real benefit is in comparisons. With a VARCHAR type, if the two strings aren’t identical, they’re not considered equal. However, with a UUID type, the input values are converted to a standard internal format. This is essential when uniqueness must be guaranteed or when looking up records and the input format may be inconsistent. The following example shows the comparison of a UUID in two different formats. The first query compares them as a VARCHAR, and the second as a UUID:
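Sketched with an assumed UUID value; the first comparison returns false because the strings differ, while the second returns true because both inputs normalize to the same internal value:

```sql
-- As VARCHAR: the strings are not identical, so they are not equal
SELECT '6F043048-54F5-4F89-B76D-62EB873B3659'::varchar
     = '6f043048-54f5-4f89-b76d-62eb873b3659'::varchar AS is_equal;

-- As UUID: both inputs are converted to the standard internal format
SELECT '6F043048-54F5-4F89-B76D-62EB873B3659'::uuid
     = '6f043048-54f5-4f89-b76d-62eb873b3659'::uuid AS is_equal;
```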
Another advantage of using a UUID type instead of a VARCHAR is the performance gain from the storage savings. When source_id is a VARCHAR, the size of the value is 37 bytes: 1 byte for the column header, needed for the variable-length field, and 36 bytes to hold the text of the value. When source_id is a UUID, the size is 16 bytes, a 57% reduction in the size of the value! For a table with millions of rows, the size difference is substantial. This is also reflected in the size of the index on source_id, improving performance. The overall sizes of the tables are smaller, so less work must be done scanning the table or index. Additionally, smaller rows mean that more data can fit in cache.
Let’s look at the size of the leads table and the index on source_id. We then switch the data type back to a VARCHAR so we can compare the difference in sizes. The following query returns the size of the leads table and its index in bytes:
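One way to sketch this query; the index name leads_source_id_idx is an assumption about how the index was named:

```sql
SELECT pg_relation_size('leads')               AS table_bytes,
       pg_relation_size('leads_source_id_idx') AS index_bytes;
```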
With a little over 10 million rows in the leads table, the table is 22% smaller using a UUID type and the index is 47% smaller.
Many applications track network information in tables. This might be the last known IP address of a user login or the MAC address of a sensor collecting temperature data. This network data can be an invaluable source of information about the users of your application. If the application needs to do more than just display an IP address, storing it as VARCHAR requires the application code to understand the finer details of a network address.
PostgreSQL has native network address data types, and there is also an extension (ip4r) that understands network addresses. Let’s consider a table used to track the history of logins. We store the IP address of the user in the column login_addr:
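A minimal sketch of the table; only the login_addr column and its varchar(15) type come from the text, and the other columns are assumptions:

```sql
CREATE TABLE login_history (
    user_id    bigint,                          -- assumed user reference
    login_time timestamptz NOT NULL DEFAULT now(),
    login_addr varchar(15)                      -- dotted-quad IPv4 address stored as text
);
```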
Creating the column as varchar(15) accommodates any IPv4 address, but it doesn’t guarantee that the value is a valid IP address. The following example inserts an invalid IP address into the login_history table:
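Sketched as follows; the specific invalid address is an assumption:

```sql
-- Accepted by the varchar column even though 350 is not a valid octet
INSERT INTO login_history (user_id, login_addr) VALUES (1, '350.1.1.1');
```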
Switching the data type to INET adds validation to the column. The statements in the following code block remove the invalid row, modify the login_addr column’s data type to INET, and attempt to add the invalid row again:
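A sketch of those statements, assuming the invalid value from the earlier example:

```sql
-- Remove the row holding the invalid address
DELETE FROM login_history WHERE login_addr = '350.1.1.1';

-- Convert the column; the USING clause casts the existing text values
ALTER TABLE login_history
    ALTER COLUMN login_addr TYPE inet USING login_addr::inet;

-- Now fails: 350 is not a valid octet, so INET rejects the value
INSERT INTO login_history (user_id, login_addr) VALUES (1, '350.1.1.1');
```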
As an INET, the value is also more efficiently stored in the table. The following query shows the sizes, in bytes, of the login_history table and its index with 2.56 million rows when login_addr is a VARCHAR:
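One way to sketch this query; the index name login_history_login_addr_idx is an assumption:

```sql
SELECT pg_relation_size('login_history')                 AS table_bytes,
       pg_relation_size('login_history_login_addr_idx')  AS index_bytes;
```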
The following query shows the table and index sizes after converting login_addr to an INET. The size of the table is reduced by more than 11% and the size of the index by 21%. This space saving comes from the size of the column value being reduced from 14 bytes as a VARCHAR to 7 bytes as an INET.
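Sketched as the conversion followed by the same size query; the index name is an assumption:

```sql
-- Convert the column to INET
ALTER TABLE login_history
    ALTER COLUMN login_addr TYPE inet USING login_addr::inet;

-- Re-check the sizes
SELECT pg_relation_size('login_history')                 AS table_bytes,
       pg_relation_size('login_history_login_addr_idx')  AS index_bytes;
```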
The INET data type is flexible; it can hold IPv4 and IPv6 values along with netmask information. That flexibility is traded for additional storage space. In our example, the login_addr column only needs to store IPv4 addresses, so the IP4 data type provided by the ip4r extension is more efficient than INET. The following statements show creating the ip4r extension, changing the data type to IP4, and checking the sizes of the table and index. Using an IP4 data type results in a 23% size reduction of the table over a VARCHAR and a 44% size reduction of the index:
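A sketch of those statements; casting through text when converting from INET and the index name are assumptions:

```sql
CREATE EXTENSION ip4r;

-- Convert the column to IP4 (casting via text, since the column was INET)
ALTER TABLE login_history
    ALTER COLUMN login_addr TYPE ip4 USING login_addr::text::ip4;

-- Check the sizes again
SELECT pg_relation_size('login_history')                 AS table_bytes,
       pg_relation_size('login_history_login_addr_idx')  AS index_bytes;
```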
On top of the validation and efficient storage size, the largest benefit of the network address data types is the additional functionality available from having context for the data. With the column being a VARCHAR, the value can be any arbitrary string, but as an IP4, the value has a specific meaning. Fundamental things like operators can use that context to make things easier for the developer. For example, in the following query, we’re looking for all the users that have logged in from a specific subnet. When login_addr is a VARCHAR, BETWEEN doesn’t have the context that login_addr is actually an IP address, so the query returns no rows:
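Sketched as a string comparison; the subnet range literals are assumptions taken from the example that follows:

```sql
-- Comparing as strings: '246.51.73.8' sorts after '246.51.73.255'
-- ('8' > '2' at the first differing character), so no rows are returned
SELECT user_id, login_addr
FROM login_history
WHERE login_addr::text BETWEEN '246.51.73.0' AND '246.51.73.255';
```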
However, if we run the query with login_addr being an IP4, we find a row that is between the two given values. The following query returns a row with an address of 246.51.73.8, which is less than 246.51.73.255 when it has the context of being an IP address. When it’s defined as a VARCHAR, it sorts greater than 246.51.73.255:
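The same query sketched against the IP4 column; the string literals are now interpreted as addresses:

```sql
-- Comparing as IP addresses: 246.51.73.8 falls inside the range
SELECT user_id, login_addr
FROM login_history
WHERE login_addr BETWEEN '246.51.73.0' AND '246.51.73.255';
```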
Many applications store and use location information. Some use it to help customers find the nearest store, whereas others use location information to connect people who are geographically close to one another. Many objects such as store locations can be considered a point. Those points are most often represented with a latitude and a longitude. Points can be stored as two numbers for easy access, but using them requires extracting the data and manipulating them in the application. PostgreSQL does have a POINT data type, but that is for two-dimensional coordinates on a plane, whereas latitude and longitude represent a point on the globe. The extension PostGIS solves this by introducing a few additional data types, including a GEOGRAPHY type. The GEOGRAPHY type assumes that points and shapes are on the globe and not an arbitrary plane or sphere. The functionality of PostGIS is vast and beyond the scope of this post, so we just show the GEOGRAPHY data type.
Let’s consider a table that contains a number of transportation locations like rail stops and airports. The actual location is stored in the geog column, which is a GEOGRAPHY type:
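A minimal sketch of the table; only the geog column and its GEOGRAPHY type come from the text, and the other names are assumptions:

```sql
CREATE EXTENSION postgis;

CREATE TABLE transportation_locations (
    name          varchar(100),            -- assumed display name
    location_type varchar(20),             -- e.g. 'bus stop', 'rail stop', 'airport'
    geog          geography(POINT, 4326)   -- point on the globe, WGS 84
);
```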
PostGIS adds several functions and operators to use with the GEOGRAPHY column. The following query returns the five nearest bus stops from the point at 47.622358 latitude, -122.336522 longitude. The point is a location in Seattle, Washington:
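A sketch of such a query, using the assumed table and column names from the sketch above; note that ST_MakePoint takes longitude first, then latitude:

```sql
SELECT name,
       ST_Distance(geog, ST_MakePoint(-122.336522, 47.622358)::geography) AS meters
FROM transportation_locations
WHERE location_type = 'bus stop'
ORDER BY geog <-> ST_MakePoint(-122.336522, 47.622358)::geography  -- KNN distance operator
LIMIT 5;
```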
Computing a nearest neighbor search requires knowledge of the shapes and how they relate to one another. Performing this type of analysis becomes inefficient outside of the database, without the context of the full dataset. The GEOGRAPHY type adds that context. In addition to the GEOGRAPHY type, PostGIS adds a GEOMETRY type and a RASTER type, along with many functions specific to geospatial data.
With the COVID-19 pandemic, there has been a large increase in the need to rapidly analyze chemical data in the search for vaccines and treatments. One of the leading open-source tools for cheminformatics is RDKit, which adds a molecule data type, MOL, to PostgreSQL. This data type allows scientists to efficiently scan large amounts of highly specialized data.
To explore the MOL data type, and to keep it simple (to avoid stressing my memory of high school chemistry), let’s create a table containing a column for molecules and a column for their common names:
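A minimal sketch of the table; the table and column names are assumptions, and the MOL type comes from the rdkit extension:

```sql
CREATE EXTENSION rdkit;

CREATE TABLE molecules (
    common_name varchar(100),   -- everyday name of the molecule
    structure   mol             -- molecule structure, entered as SMILES
);
```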
The MOL data type follows the simplified molecular-input line-entry system (SMILES), allowing a sophisticated format to identify a molecule’s structure. The following statements create two rows for water and salt in the table:
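Sketched as follows, using the assumed table from above; 'O' is the SMILES notation for water (an oxygen with implicit hydrogens) and '[Na+].[Cl-]' represents table salt:

```sql
INSERT INTO molecules (common_name, structure)
VALUES ('water', 'O'::mol),
       ('salt',  '[Na+].[Cl-]'::mol);
```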
Like other specialized types, the MOL type also enforces rules around valid molecules. For example, the next statement tries to create a row for a mythical element, which results in an error:
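Sketched with a made-up element symbol (both the name and the symbol are assumptions); the cast to MOL fails because the symbol isn’t a real element:

```sql
-- Fails: 'Uo' is not a valid element symbol, so the SMILES cannot be parsed
INSERT INTO molecules (common_name, structure)
VALUES ('unobtainium', 'Uo'::mol);
```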
The RDKit extension adds functions to easily extract information about the molecules. An example of this is the mol_numatoms function, which returns the number of atoms in the molecule:
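Sketched against the assumed table from above:

```sql
SELECT common_name, mol_numatoms(structure) AS num_atoms
FROM molecules;
```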
With the MOL data type understanding the greater meaning of the data it contains, searches more complex than string comparisons are possible. For example, the following query returns all the molecules that contain a carbon monoxide:
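One way to sketch such a substructure search; @> is the RDKit substructure operator, and the SMILES pattern here is an illustrative assumption:

```sql
-- Rows whose molecule contains the given fragment as a substructure
SELECT common_name
FROM molecules
WHERE structure @> 'CO'::mol;
```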
In this post, we explored a few of PostgreSQL’s purpose-built types. All the purpose-built data types we covered in this post are available in Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition. Some, like the UUID type, make some tasks easier and more efficient. Others, like the GEOGRAPHY and MOL types, are essential for some applications. There are many more specialized types that cover domains like international product numbering standards and raster images. PostgreSQL’s extensible architecture will allow new types to be created in the future as PostgreSQL moves into more industries requiring specialized data handling.
About the Author
Jim Mlodgenski is a Principal Database Engineer at AWS. Prior to joining AWS, Jim was CTO at OpenSCG and CEO at StormDB, a Postgres-XC scalable public cloud solution. Jim was chief architect and part of the founding team of EnterpriseDB. Jim is also very active in the PostgreSQL community, co-founding both the New York and Philadelphia PostgreSQL Meetup groups, co-organizing the largest PostgreSQL conferences, and serving on the Board of the PostgreSQL Foundation. Jim also speaks regularly at PostgreSQL and other open source conferences.