AWS Database Blog

Getting more with PostgreSQL purpose-built data types

When designing many applications today, developers rightfully think of the end-user first and focus on what the experience will be. How the data is ultimately stored is an implementation detail that comes later. Combined with rapid release cycles, “schema-less” database designs fit well, allowing for flexibility as the application changes. PostgreSQL natively supports this type of design pattern, allowing data to be stored as JSON, XML, key-value, or trees through different data types, and performs well compared to purpose-built databases for that data.
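As a quick illustration of that pattern, the JSONB type lets an application store and query documents directly (a minimal sketch; the table and keys here are hypothetical):

```sql
-- A schema-less design: each event is a JSONB document
CREATE TABLE events (
  id       bigserial PRIMARY KEY,
  payload  jsonb
);

INSERT INTO events (payload)
VALUES ('{"user": "jsmith", "action": "login", "ip": "10.0.0.1"}');

-- Pull a key straight out of the document as text
SELECT payload->>'action' FROM events WHERE payload->>'user' = 'jsmith';
```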

The flexibility of a “schema-less” design does come at a cost in performance and code complexity. Storing the data as JSON requires the metadata to be stored as well, in the form of keys, in the document. This increases the size of the data, affecting performance. Handling a JSON document poses additional challenges as the application code base evolves. Different modules or services may use the same piece of data, and they must use it consistently, agreeing on the required keys and the format of the values. This complexity leads many developers back to a more traditional database design, with a schema defined as tables and columns.

When creating those tables, PostgreSQL has a robust set of data types that can help define the data stored in those columns. You have integer types, numeric types, Boolean types, timestamps with and without time zones, and the workhorse VARCHAR type, which stores strings up to 1 GB in size. Because of its flexibility, VARCHAR is often used akin to a “schema-less” column, where it holds a large string and the application code is responsible for handling and validating its contents. In this post, we dig into several purpose-built data types that handle kinds of data traditionally stored in base types like VARCHAR, and we demonstrate the advantages of using those specialized types.

UUID

A common type of data used in applications is a universally unique identifier (UUID). For distributed systems, these identifiers provide a method to control uniqueness. According to the standard (RFC 4122, ISO/IEC 9834-8:2005), a UUID is a sequence of lowercase hexadecimal digits in groups separated by hyphens: a group of 8 digits, followed by three groups of 4 digits, followed by a group of 12 digits, for a total of 32 digits representing the 128-bit value. To accommodate this length, a 36-character VARCHAR is typically used.

Let’s consider a table that we use to collect leads from an outside source. We store the UUID from the source in the column source_id:

CREATE TABLE leads (
  id           bigint PRIMARY KEY,
  source_id    varchar(36),
  name         varchar(200),
  created_at   timestamptz
);

CREATE INDEX leads_source_id_idx ON leads (source_id);

This definition protects us from a source_id that is too long, but it doesn’t protect against a source_id that is too short. Without validating the format of source_id in the application, or creating a CONSTRAINT on the column that checks the format, we can’t ensure that the value of source_id is a valid UUID. It can be any string up to 36 characters. For example, the following statements show a value that is too long being rejected, while a truncated, invalid value is accepted:

blog=> INSERT INTO leads VALUES (100, '12df2c4a-eee4-11ea-b376-060eac8eb668-0000', 'John Smith', now());
ERROR:  value too long for type character varying(36)

blog=> INSERT INTO leads VALUES (100, '12df2c4a-eee4-11ea-b376', 'John Smith', now());
INSERT 0 1
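One way to add that missing validation while keeping the VARCHAR column is a CHECK constraint with a regular expression (a sketch; the constraint name and pattern are ours, not part of the schema above):

```sql
-- Reject anything that doesn't match the 8-4-4-4-12 hexadecimal pattern
ALTER TABLE leads ADD CONSTRAINT leads_source_id_format_chk
  CHECK (source_id ~* '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$');
```

Even then, the constraint only validates the shape of the string; a dedicated UUID type also provides canonical formatting and more compact storage.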

PostgreSQL has a UUID data type specifically designed for such cases. Switching the data type of source_id to a UUID type guarantees that all values of source_id are valid UUIDs. However, before changing the type, we must delete the invalid source_id that we inserted previously. The following steps delete the row with the invalid UUID and change the data type of the source_id column to UUID:

blog=> DELETE FROM leads WHERE id = 100;
DELETE 1
blog=> ALTER TABLE leads ALTER COLUMN source_id 
       TYPE uuid USING source_id::uuid;
ALTER TABLE

In the following code block, the first two statements attempt to create rows with an invalid UUID. The final statement succeeds with a valid UUID:

blog=> INSERT INTO leads VALUES 
      (100, '12df2c4a-eee4-11ea-b376-060eac8eb668-0000', 'John Smith', now());                                                                               
ERROR:  invalid input syntax for type uuid: "12df2c4a-eee4-11ea-b376-060eac8eb668-0000"
LINE 1: INSERT INTO leads VALUES (100, '12df2c4a-eee4-11ea-b376-060e...
                                       ^
blog=> INSERT INTO leads VALUES 
      (100, '12df2c4a-eee4-11ea-b376', 'John Smith', now());
ERROR:  invalid input syntax for type uuid: "12df2c4a-eee4-11ea-b376"
LINE 1: INSERT INTO leads VALUES (100, '12df2c4a-eee4-11ea-b376', 'J...
                                       ^
blog=> INSERT INTO leads VALUES 
      (100, '12df2c4a-eee4-11ea-b376-060eac8eb668', 'John Smith', now());                                                                               
INSERT 0 1

Although PostgreSQL does offer functions to generate a UUID inside the database, it’s generally best to generate them in the application and send them to the database. Generating a UUID is a CPU-intensive operation, and CPU resources are usually harder to scale on a database server than in the application tier.
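For cases where in-database generation is still convenient, gen_random_uuid() is built in starting with PostgreSQL 13 (and available from the pgcrypto extension on earlier versions), and the uuid-ossp extension provides uuid_generate_v4():

```sql
-- Generate a version 4 (random) UUID inside the database
SELECT gen_random_uuid();

-- Or let the column default to one when the application doesn't supply a value
ALTER TABLE leads ALTER COLUMN source_id SET DEFAULT gen_random_uuid();
```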

The UUID data type also standardizes the format of a UUID. The input can be uppercase, lowercase, or mixed case, and it can include or exclude hyphens or braces. Regardless of the input format, the output always follows the standard. The following statement shows a UUID in three different formats, but the results are identical:

blog=> \x
Expanded display is on.
blog=> SELECT '12df2c4a-eee4-11ea-b376-060eac8eb668'::uuid, 
              '12DF2C4AEEE411EAB376060EAC8EB668'::uuid, 
              '{12df2c4a-eee411ea-b376-060eac8eb668}'::uuid;
-[ RECORD 1 ]------------------------------
uuid | 12df2c4a-eee4-11ea-b376-060eac8eb668
uuid | 12df2c4a-eee4-11ea-b376-060eac8eb668
uuid | 12df2c4a-eee4-11ea-b376-060eac8eb668

This standardization is useful for display on a user interface, but its real benefit is in comparisons. With a VARCHAR type, if the two strings aren’t identical, they’re not considered equal. However, with a UUID type, the input values are converted to a standard internal format. This is essential when uniqueness must be guaranteed or when looking up records and the input format may be inconsistent. The following example shows the comparison of a UUID in two different formats. The first query compares them as a VARCHAR, and the second as a UUID:

blog=> SELECT '12df2c4a-eee4-11ea-b376-060eac8eb668'::varchar = '12DF2C4AEEE411EAB376060EAC8EB668'::varchar;                                                                          
 ?column? 
----------
 f
(1 row)

blog=> SELECT '12df2c4a-eee4-11ea-b376-060eac8eb668'::uuid = '12DF2C4AEEE411EAB376060EAC8EB668'::uuid;                                                                                
 ?column? 
----------
 t
(1 row)

Another advantage of using a UUID type instead of a VARCHAR is the performance gain from the storage savings. When source_id is a VARCHAR, the size of the value is 37 bytes: 1 byte for the column header, needed for the variable-length field, and 36 bytes to hold the text of the value. When source_id is a UUID, the size is 16 bytes, a 57% reduction in the size of the column! For a table with millions of rows, the size difference is substantial. This is also reflected in the size of the index on source_id, improving performance. Because the table and index are smaller, less work must be done scanning them, and smaller rows also mean that more data can fit in cache.
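You can verify the per-value accounting with the pg_column_size() function; querying a stored row reports the on-disk size of the value, including its header (sizes reported for in-memory literals can differ, because those carry a 4-byte header):

```sql
-- On-disk size, in bytes, of one stored source_id value
SELECT pg_column_size(source_id) FROM leads LIMIT 1;
```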

Let’s look at the size of the leads table and the index on source_id. We then switch the data type back to a VARCHAR so we can compare the difference in sizes. The following query returns the size of the leads table and its index in bytes:

blog=> SELECT pg_relation_size('leads'), 
              pg_relation_size('leads_source_id_idx');                                                                                                  
 pg_relation_size | pg_relation_size 
------------------+------------------
        809918464 |        326049792
(1 row)

blog=> ALTER TABLE leads ALTER COLUMN source_id TYPE varchar(36);                                                                    
ALTER TABLE
blog=> SELECT pg_relation_size('leads'), 
              pg_relation_size('leads_source_id_idx');
 pg_relation_size | pg_relation_size 
------------------+------------------
       1037025280 |        610410496
(1 row)

With a little over 10 million rows in the leads table, the table is 22% smaller using a UUID type and the index is 47% smaller.

Network addresses

Many applications track network information in tables. This might be the last known IP address of a user login or the MAC address of a sensor collecting temperature data. This network data can be an invaluable source of information about the users of your application. If the application needs to do more than just display an IP address, storing it as VARCHAR requires the application code to understand the finer details of a network address.

PostgreSQL has native network address data types, and there is also an extension (IP4R) that understands network addresses. Let’s consider a table used to track the history of logins. We store the IP address of the user in the column login_addr:

CREATE TABLE login_history (
   user_id     bigint,
   status      int,
   login_addr  varchar(15), 
   login_at    timestamptz
 );

CREATE INDEX login_history_addr_idx ON login_history(login_addr, status);

Creating the column as varchar(15) accommodates any IPv4 address, but it doesn’t guarantee that the value is a valid IP address. The following example inserts an invalid IP address into the login_addr column:

blog=> INSERT INTO login_history (user_id, status, login_addr, login_at) 
       VALUES (1, 0, 'ip_address', now());                                                                          
INSERT 0 1

Switching the data type to INET adds validation to the column. The statements in the following code block remove the invalid row, modify the column login_addr data type to INET, and attempt to add the invalid row again:

blog=> DELETE FROM login_history WHERE login_addr = 'ip_address';
DELETE 1
blog=> ALTER TABLE login_history ALTER COLUMN login_addr 
       TYPE inet USING login_addr::inet;
ALTER TABLE
blog=> INSERT INTO login_history (user_id, status, login_addr, login_at) 
       VALUES (1, 0, 'ip_address', now());
ERROR:  invalid input syntax for type inet: "ip_address"
LINE 1: ...r_id, status, login_addr, login_at) VALUES (1, 0, 'ip_addres...

As an INET, the value is stored more efficiently in the table. The following query shows the size, in bytes, of the login_history table and its index with 2.56 million rows when login_addr is a VARCHAR:

blog=> SELECT pg_relation_size('login_history'), 
              pg_relation_size('login_history_addr_idx');
 pg_relation_size | pg_relation_size 
------------------+------------------
        174063616 |        102670336
(1 row)

The following query shows the table and index size after converting the login_addr to an INET. The size of the table is reduced by more than 11% and the size of the index is reduced by 21%. This space saving comes from the size of the column being reduced from 14 bytes as a VARCHAR to 7 bytes as an INET.

blog=> SELECT pg_relation_size('login_history'), 
              pg_relation_size('login_history_addr_idx');
 pg_relation_size | pg_relation_size 
------------------+------------------
        154206208 |         80879616
(1 row)

The INET data type is flexible; it can hold IPv4 and IPv6 values along with netmask information. That flexibility is traded for additional storage space. In our example, the login_addr column only needs to store IPv4 addresses, so the IP4 data type provided by the ip4r extension is more efficient than INET. The following statements show creating the ip4r extension, changing the data type to IP4, and checking the sizes of the table and index. Using an IP4 data type results in a 23% size reduction of the table over a VARCHAR and a 44% size reduction of the index:

blog=> CREATE EXTENSION ip4r; 
CREATE EXTENSION
blog=> ALTER TABLE login_history ALTER COLUMN login_addr 
       TYPE ip4 USING login_addr::ip4;
ALTER TABLE
blog=> SELECT pg_relation_size('login_history'), 
              pg_relation_size('login_history_addr_idx');
 pg_relation_size | pg_relation_size 
------------------+------------------
        133578752 |         57606144
(1 row)

On top of the validation and efficient storage, the largest benefit of the network address data types is the additional functionality that comes from having context about the data. When the column is a VARCHAR, the value can be any arbitrary string, but as an IP4, the value has a specific meaning, and even fundamental operators can use that context to make the developer’s life easier. For example, in the following query, we’re looking for all the users that have logged in from a specific subnet. When login_addr is a VARCHAR, BETWEEN compares the values as strings, without the context that login_addr is actually an IP address, so the query returns no rows:

blog=> SELECT login_addr, user_id, max(login_at)
         FROM login_history
        WHERE login_addr BETWEEN '246.51.73.0' AND '246.51.73.255'
        GROUP BY 1, 2;

 login_addr | user_id | max 
------------+---------+-----
(0 rows)

However, if we run the query with login_addr as an IP4, we find a row between the two given values. The following query returns a row with an address of 246.51.73.8, which is less than 246.51.73.255 when compared as an IP address. As a VARCHAR, it sorts after 246.51.73.255, because the comparison is character by character:

blog=> SELECT login_addr, user_id, max(login_at)
         FROM login_history
        WHERE login_addr BETWEEN '246.51.73.0' AND '246.51.73.255'
        GROUP BY 1, 2;

 login_addr  | user_id |              max              
-------------+---------+-------------------------------
 246.51.73.8 |   28340 | 2020-09-08 21:02:47.678938+00
(1 row)
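The network types also come with purpose-built operators. When login_addr is stored as an INET, for example, the containment operator << expresses the same subnet search directly against a CIDR value, without constructing range endpoints (the ip4r extension offers analogous containment operators for its types):

```sql
-- All logins from the 246.51.73.0/24 subnet
SELECT login_addr, user_id, max(login_at)
  FROM login_history
 WHERE login_addr << inet '246.51.73.0/24'
 GROUP BY 1, 2;
```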

Geospatial

Many applications store and use location information. Some use it to help customers find the nearest store, whereas others use location information to connect people who are geographically close to one another. Many objects, such as store locations, can be considered a point, most often represented with a latitude and a longitude. Points can be stored as two numbers for easy access, but using them that way requires extracting the data and manipulating it in the application. PostgreSQL does have a POINT data type, but it represents two-dimensional coordinates on a plane, whereas latitude and longitude represent a point on the globe. The PostGIS extension solves this by introducing a few additional data types, including a GEOGRAPHY type, which assumes that points and shapes are on the globe and not on an arbitrary plane or sphere. The functionality of PostGIS is vast and beyond the scope of this post, so we just show the GEOGRAPHY data type.

Let’s consider a table that contains a number of transportation locations like rail stops and airports. The actual location is stored in the geog column, which is a GEOGRAPHY type:

CREATE TABLE transport (
  id           bigint PRIMARY KEY,
  name         varchar(200),
  fclass       varchar(28),
  geog         geography
);

PostGIS adds several functions and operators to use with the GEOGRAPHY column. The following query returns the five nearest bus stops from the point at 47.622358 latitude, -122.336522 longitude. The point is a location in Seattle, Washington:

blog=> SELECT name, 
              ST_Distance(geog, ST_SetSRID(
                          ST_MakePoint(-122.336522, 47.622358), 4326)) 
              AS distance
         FROM transport
        WHERE fclass = 'bus_stop'
        ORDER BY 
              geog <-> ST_SetSRID(ST_MakePoint(-122.336522, 47.622358), 4326)
        LIMIT 5;
             name             |   distance    
------------------------------+---------------
 Westlake Ave N & Harrison St |  176.32415736
 Westlake Ave N & Mercer St   |  257.56034608
 Westlake Ave & 9th Ave       |  486.28977358
 Westlake Ave & 9th Ave       |  504.77718073
 Broadway & E Pine St         | 1430.97265263
(5 rows)

Computing a nearest-neighbor search requires knowledge of the shapes and how they relate to one another. Performing this type of analysis outside of the database, without the context of the full dataset, is inefficient; the GEOGRAPHY type adds that context. In addition to the GEOGRAPHY type, PostGIS adds a GEOMETRY type and a RASTER type, along with many functions specific to geospatial data.
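To keep such nearest-neighbor queries fast as the table grows, the geography column is typically indexed with GiST, which the <-> distance ordering operator can use:

```sql
-- Spatial index to accelerate distance filters and ordering
CREATE INDEX transport_geog_idx ON transport USING gist (geog);
```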

Chemistry

With the COVID-19 pandemic, there has been a large increase in the need to rapidly analyze chemical data in the search for vaccines and treatments. One of the leading open-source tools for cheminformatics is RDKit, which adds a molecule data type, MOL, to PostgreSQL. This data type allows scientists to efficiently scan large amounts of highly specialized data.

To explore the MOL data type, and to keep it simple (to avoid stressing my memory of high school chemistry), let’s create a table containing a column for molecules and their corresponding common names:

CREATE EXTENSION rdkit;

CREATE TABLE molecules (
    id             serial PRIMARY KEY,
    common_name    varchar,
    molecule       mol
);

The MOL data type follows the simplified molecular-input line-entry system (SMILES), a compact text format for describing a molecule’s structure. The following statements create two rows in the table, for water and salt:

blog=> INSERT INTO molecules (common_name, molecule) 
       VALUES ('water', '[2H]O'::mol);
INSERT 0 1
blog=> INSERT INTO molecules (common_name, molecule) 
       VALUES ('salt', '[Na+].[Cl-]'::mol);
INSERT 0 1

Like other specialized types, the MOL type enforces rules about what constitutes a valid molecule. For example, the next statement tries to create a row for a mythical element, which results in an error:

blog=> INSERT INTO molecules (common_name, molecule) 
       VALUES ('adamantium', '[X]'::mol);
ERROR:  could not create molecule from SMILES '[X]'
LINE 1: ...les (common_name, molecule) VALUES ('adamantium', '[X]'::mol...

The RDKit extension adds functions to easily extract information about the molecules. An example of this is the mol_numatoms function, which returns the number of atoms in the molecule:

blog=> SELECT mol_numatoms(molecule) 
       FROM molecules WHERE common_name = 'water';
 mol_numatoms 
--------------
            3
(1 row)

Because the MOL data type understands the meaning of the data it contains, searches more complex than string comparisons are possible. For example, the following query uses the substructure operator @> to return all the molecules that contain a carbon-oxygen fragment (the SMILES pattern CO):

blog=> SELECT common_name, mol_formula(molecule) 
       FROM molecules WHERE molecule @> 'CO';
 common_name | mol_formula 
-------------+-------------
 sugar       | C6H12O6
 penicillin  | C16H18N2O4S
 aspirin     | C9H8O4
(3 rows)
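Substructure searches with @> can likewise be accelerated with a GiST index on the molecule column, which the RDKit extension supports:

```sql
-- Index molecules for fast substructure (@>) searches
CREATE INDEX molecules_molecule_idx ON molecules USING gist (molecule);
```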

Summary

In this post, we explored a few of PostgreSQL’s purpose-built types, all of which are available in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL. Some, like the UUID type, make tasks easier and more efficient. Others, like the GEOGRAPHY and MOL types, are essential for certain applications. There are many more specialized types, covering domains like international product numbering standards and raster images, and PostgreSQL’s extensible architecture allows new types to be created as PostgreSQL moves into more industries requiring specialized data handling.


About the Author

Jim Mlodgenski is a Principal Database Engineer at AWS. Prior to joining AWS, Jim was CTO at OpenSCG and CEO at StormDB, a Postgres-XC scalable public cloud solution. Jim was chief architect and part of the founding team of EnterpriseDB. Jim is also very active in the PostgreSQL community, co-founding both the New York and Philadelphia PostgreSQL Meetup groups, co-organizing the largest PostgreSQL conferences, and serving on the board of the PostgreSQL Foundation. Jim also speaks regularly at PostgreSQL and other open-source conferences.