A PartiQL deep dive: Understand the language and bring SQL queries to AWS non-relational database services

It’s data, data everywhere! To turn data into information, and information into insight, we need to understand our data to its full extent and make use of the statistics derived from it. Applying the obtained results successfully while making business decisions remains a challenge for most organizations, even today. This is partly due to the way the data is stored and the way it’s retrieved. It can be highly structured or unstructured, which makes it difficult for organizations to read or understand the data. Therefore, it’s essential to have a language that can read relational or non-relational data.

PartiQL is a SQL-compatible query language that makes it easy to efficiently query data, regardless of where or in what format it’s stored.

As long as the query engine supports PartiQL, it can process structured data from relational databases (both transactional and analytical), semi-structured and nested data in open data formats, and even schema-less data in NoSQL or document databases that allow different attributes for different rows.

PartiQL is easy and familiar for SQL users because it’s compatible with SQL-92. A developer or architect should understand the capabilities of PartiQL when developing applications or interacting with databases that use PartiQL as the query language. PartiQL separates the syntax and semantics of a query from the underlying data source and data format. It enables you to interact with data with or without a regular schema.

In this post, we explain some of the important PartiQL features and basics so you can effectively use the language to explore databases that require PartiQL knowledge. Unlike traditional SQL, the PartiQL query language also meets the needs of NoSQL and non-relational databases.

PartiQL adoption is growing at Amazon

PartiQL is widely used within Amazon and is now available as part of many AWS services, including Amazon S3 Select, Amazon Glacier Select, Amazon Redshift Spectrum, Amazon DynamoDB, and Amazon internal systems. Also, Amazon EMR pushes down PartiQL queries to S3 Select.

To understand the features of PartiQL, let’s look at some of the AWS services you can use with it.

Amazon DynamoDB

DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It’s a fully managed, multi-Region, multi-active, durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications.

You can use PartiQL to select, insert, update, and delete items from DynamoDB. PartiQL for DynamoDB can be run from the DynamoDB console, the AWS Command Line Interface (AWS CLI), and DynamoDB APIs.

PartiQL data model

PartiQL is based on a logical type system: the PartiQL data model. Each PartiQL implementation maps data formats, like JSON, Parquet, and Ion, into a PartiQL dataset that follows the PartiQL data model. PartiQL queries work on the PartiQL dataset abstraction.

The following diagram shows the PartiQL data model basic structure.

A PartiQL database generally contains one or more PartiQL named values. A name is an identifier, such as a table name, that is associated with a PartiQL value.

The type of value can be absent, scalar, tuple, or collection. Further subtyping applies to scalars, tuples, and collections. We often use the term complex value to refer to any non-scalar and non-absent value. Complex values include tuples and collections. A tuple is a set of attribute name-value pairs, in which each name is a string (as in SQL). A tuple in the data model is unordered. A conventional SQL tuple is an ordered tuple because the schema dictates the order of the attributes.

PartiQL has two kinds of absent values: NULL and MISSING. Unlike SQL, in which a query that refers to a non-existent attribute name is expected to fail during compilation, in semi-structured data, you expect a query to operate even if some of the tuples don’t define some of the attributes that the query’s paths mention. Therefore, PartiQL contains the special value MISSING, which is the path result in cases where navigation fails to bind to any information. The distinction between MISSING and NULL enables you to retain the original distinction between a missing attribute and a null-valued attribute. The value MISSING may not appear as an attribute value.

PartiQL’s data model extends SQL to Ion’s type system to cover schema-less and nested data. Unlike SQL, PartiQL allows the possibility of duplicate attribute names, in the interest of compatibility with non-strict JSON or Ion datasets. However, PartiQL doesn’t encourage duplicate attribute names; navigation into tuples with the conventional dot notation is tuned to the assumption that the attribute names are unique. A collection value is either ordered or unordered. Both arrays and bags may contain duplicate elements. An array is ordered (similar to a JSON array or Ion list) and each element is accessible by its ordinal position. Arrays are delimited with brackets ([ ]). Arrays have size, which is not explicitly denoted, but is implied by the number of elements in the array. The first element of an array corresponds to index 0; the last element corresponds to index size minus one. In contrast, a bag is unordered (similar to a SQL table) and its elements can’t be accessed by ordinal position. Bags are denoted with << and >>.

PartiQL datasets look very much like JSON. It adopts the tuple/object and array notation of JSON. However, the PartiQL string literals are denoted by single quotes. Importantly, the scalar types of PartiQL are those of SQL, not just strings, numbers, and Booleans, as in JSON.

The following diagram shows the basics of PartiQL syntax.

PartiQL queries are backward-compatible with SQL-92.

A PartiQL query is either an SFW query (such as SELECT-FROM-WHERE-…,) or an expression query (also called a simple expression) such as a path expression or a function invocation.

Unlike SQL expressions, which are restricted to outputting scalar and null values, PartiQL expressions output arbitrary PartiQL values, and are fully composable within larger SFW queries and expressions. Indeed, PartiQL allows the top-level query to also be an expression query, not just an SFW query as in SQL.

PartiQL may be seen as a functional programming language with composable semantics.

In the following sections, we discuss the SELECT, FROM, and WHERE clauses in detail.

SELECT clause

SQL’s well-known SELECT clause can be used as a mere syntactic sugar over SELECT VALUE, when we consider the top-level query.

Core PartiQL SFW queries have a SELECT VALUE clause (in lieu of SQL’s SELECT clause) that can create outputs that are collections of anything (such as tuples, scalars, arrays, mixed type elements, and more).

SELECT is the special case in which SELECT VALUE produces collections of tuples. Furthermore, when SELECT is used as a subquery, it’s coerced into a scalar or a tuple, in the ways that SQL coerces the results of subqueries.

FROM clause

The formal semantics of a FROM clause describe the collection of binding tuples that is output by the FROM clause. The semantics specify three cases and essentially extend the tuple calculus that underlies the SQL semantics.

In a binding tuple, each PartiQL (sub-)query and PartiQL (sub-)expression q is evaluated within the database environment ρ0, created by the database names, and the variables environment ρ, created by the defined query variables. The pair of these environments (ρ0, ρ) is collectively called the bindings environment. An environment is a binding tuple (x1 : v1, . . . , xn : vn) where each xi is a bind name that is unique and binds to the PartiQL value vi.

The semantics specify the core semantics of a FROM clause with a single FROM item. The term “semantics of the FROM item f” is synonymous to the term “semantics of a FROM clause with the single item f.” In either case, we refer to the specification of the collection of binding tuples B^outFROM that results from the evaluation of “FROM f.”

Then the semantics specify how multiple FROM items combine, according to the core semantics, using the join and outer join operations.

Finally, the semantics specify the syntactic sugar structures that are overlaid over the core semantics. Their primary purpose is SQL compatibility.

WHERE clause

The WHERE clause inputs the bindings that have been produced from the FROM clause and outputs the ones that satisfy its condition. The Boolean predicates follow SQL’s three-valued logic. As explained earlier, PartiQL has two kinds of absent values: NULL and MISSING. As far as the Boolean connectives and IS NULL are concerned, a NULL input and a MISSING input behave identically. For example, MISSING AND TRUE is equivalent to NULL AND TRUE: they both result into NULL.

Similar to SQL, when the expression of the WHERE clause expression evaluates to an absent value or a value that isn’t a Boolean, PartiQL eliminates the corresponding binding.

In the following sections, we discuss more features of PartiQL in detail.

Elaborating the SQL-compatible features like querying the nested data and grouping

A key feature of many modern formats is nested data—attributes whose values may themselves be tables (collections of tuples), arrays of scalars, arrays of arrays, or many other combinations. You can query the nested data and group it using PartiQL.

The following code is the PartiQL ORDER BY syntax:

Similar to SQL, the PartiQL ORDER BY clause syntax is:

ORDER BY (  e1   [ASC|DESC]? [NULLS FIRST|NULLS LAST],
          .
          .
          .
          em[ASC|DESC]? [NULLS FIRST|NULLS LAST]?
        )|PRESERVE

In this code, e1…em is a list of ordering expressions. In PartiQL, an SFW query with ORDER BY outputs an array, whereas an SFW query without ORDER BY outputs a bag.

Similar to SQL’s ORDER BY clause, the NULLS FIRST and NULLS LAST keywords indicate whether NULL and MISSING values are ordered before or after all other values. In PartiQL, NULLS FIRST and NULLS LAST refer to both NULL and MISSING.

PartiQL allows the CURRENT variable to be omitted from ordering expressions. When the CURRENT variable binds tuples, the ordering expressions can refer directly to the attributes of those tuples.

The following conditions must be satisfied for scoping:

A PartiQL path expression ordering expression as it appears in the ORDER BY of a UNION … ORDER BY query, where a is an identifier and s is the potentially empty suffix of the path
The expression as is evaluated in the database environment ρ0 and variables’ environment ρ, which defines variables v1,…,vn and none of them are named a
None of the variables v1,…,vn may bind to a tuple that has an attribute a, then the path expression as resolves to CURRENT.as

The most common and useful way to satisfy the third condition is when UNION … ORDER BY is a top-level query and, therefore, the variables environment ρ is empty.

Use JOINS to unnest the data

Unlike a conventional SQL join that requires an ON condition, the Dataset1 and Dataset2 join condition is implicit in the nesting of the Dataset1 data into the Dataset2 data.

The FROM clause expressions have the same semantics:

l , r  ⇔
l CROSS JOIN r ⇔
l JOIN r ON TRUE ⇔

They combine the bag of bindings produced from the FROM item l with the bag of binding tuples produced by the FROM item r, whereas the expression may use variables defined by l. Again, the term “the semantics of l CROSS JOIN r” is equivalent to the term “the semantics of FROM l CROSS JOIN r.” In both cases, the semantics specify a bag of binding tuples.

In the following sections, we discuss aggregating the data and different ways to group the data.

PartiQL GROUP BY core: Grouping into a group variable

The following GROUP BY clause creates a group:

GROUP BY e1  AS x1,.  .  .,em AS xm GROUP AS g

Each ei is a grouping expression, each xi is a grouping variable, and g is the group variable.

Group attributes and direct use of grouping expressions

The group-by and aggregation of PartiQL is backward-compatible to SQL.

For SQL compatibility, PartiQL allows GROUP BY…,e,…, meaning a grouping expression e that is not associated with a grouping variable x (in core PartiQL, you write e AS x). For SQL compatibility, PartiQL supports using the grouping expression e in HAVING, ORDER BY, and SELECT clauses. The SQL form is syntactic sugar for the core PartiQL.

The following table compares SQL and PartiQL clauses.

SQL

PartiQL

FROM. . .

GROUP BY e,…

HAVING f(e,…)

ORDER BY f′(e),…

SELECT f′′(e),…

FROM. . .

GROUP BY e AS x,…

HAVING f(x,…)

ORDER BY f′(x),…

SELECT f′′(x),…

An expression e that appears in the GROUP BY clause and an expression e′ that appears in SELECT, HAVING, or ORDER BY are considered the same expression if they’re syntactically identical after performing the schema-based rewritings.

SQL’s implicit use of the group variable in SQL aggregate functions

SQL doesn’t have explicit group variables. For SQL compatibility, PartiQL allows the SQL aggregation functions to be fed by expressions that don’t explicitly say that there is iteration over the group variable.

PartiQL doesn’t allow you to write queries that lack a GROUP AS or GROUP ALL clause and have nested aggregate SQL functions.

Equivalence function used by grouping, and grouping of NULL and MISSING

The equivalence function eqg extends SQL’s respective function. In particular, it behaves as follows:

eqg(NULL,NULL) is true, despite NULL=NULL not being true.
For any two non-null values x and y, eqg(x,y) returns the same with x=y. As is the case generally for =, while SQL’s = will error when given incompatible types, the PartiQL= will return false.

PartiQL groups together the NULL and the MISSING grouping expressions, because any grouping expression resulting to MISSING has been coerced into NULL before eqg does comparisons for grouping.

Pivot and unpivot the data

Pivoting turns a collection into a tuple. The PIVOT clause inputs a bag of binding tuples or an array of binding tuples. Semantically, it’s similar to SELECT VALUE, but the latter creates a collection of values, whereas PIVOT constructs a tuple in which each input binding is evaluated to an attribute value pair in the tuple.

The clause PIVOT v AT a inputs a bag or an array of binding tuples and outputs a single tuple, in which each evaluation of v and a generate an attribute in the tuple. The PIVOT query doesn’t return a collection of tuples—it literally returns a tuple value.

The UNPIVOT clause enables ranging over the attribute-value pairs of a tuple. The FROM clause FROM UNPIVOT t AS v AT a normally expects t to be a tuple, with attribute-value pairs a1 :v1,…,an : vn. It doesn’t matter whether the tuple is ordered or unordered. The FROM clause outputs the collection of binding tuples

UNPIVOT turns every instance of the tuple t into a collection. The WHERE filters the collections.

Many queries need to range over and collect the attribute name-value pairs of tuples or the key-value pairs of maps. Unpivoting tuples enables the use of attribute names as if they were data.

Structural types and type-related query syntax and semantics

SQL allows us to avoid writing aliases (variables) when the schema of the tables allows correct de-referencing. PartiQL does the same. A schema isn’t necessary for a PartiQL dataset.

The input data generally conforms to a structural type, also often called a schema. The SQL semantics make extensive use of the structural types in order to assign meaning to queries, which don’t have a meaning in the absence of such structural types. In the interest of SQL compatibility and user convenience, PartiQL also allows structural types to assign meaning to queries that don’t have a meaning otherwise. As of this writing, development is in progress to specify the precise rules that provide SQL compatibility, while keeping the schema optional and the query results stable with respect to schema addition.

Conclusion

In this post, we explained a few PartiQL basic features so developers can use the language to explore databases that require PartiQL knowledge. We also discussed how PartiQL separates the syntax and semantics of a query from the underlying data source and data format.

PartiQL is fully open sourced under the Apache2.0 license. The PartiQL open-source implementation provides an interactive shell (or Read Evaluate Print Loop (REPL)) that allows you to write and evaluate PartiQL queries.

Queries, comments, and suggestions are welcome. For more information, see the following resources:

About the Author

Swapna Chawhan is an Associate Consultant with AWS ProServe LLP. As a database migration specialist, she helps customers migrate their workloads from on premises to the AWS Cloud. She is passionate about learning various languages, such as PostgreSQL, MySQL, and PartiQL, and technologies that can ease the database migration process.

AWS Database Blog