Query 201: Tips and Tricks for Amazon SimpleDB Query
After passing Query 101, you're ready to graduate to building efficient queries. The syllabus for this tutorial includes querying for dates, using negation, and tuning your query performance.
Submitted By: Bruce@AWS
AWS Products Used: Amazon SimpleDB
Created On: February 07, 2008
By Aleksey Martynov, Amazon Web Services
This tutorial provides a number of tips and tricks for building complex and efficient queries for Amazon SimpleDB:
- Lexicographical comparison definition
- Querying for numerical data
- Querying for dates
- Single vs. multi-predicate queries
- Tuning your queries using composite attributes
- Partitioning your dataset in order to improve query performance
Lexicographical Comparison Definition
A string s precedes a string t in lexicographic order if:
- s is a prefix of t, or
- if c and d are respectively the first characters of s and t in which s and t differ, then c precedes d in character order.
Note: For the characters that are alphabetical letters, the character order coincides with the alphabetical order. Digits precede letters, and uppercase letters precede lowercase ones.
Example:
- composer precedes computer
- house precedes household
- Household precedes house
- H2O precedes HOTEL
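These orderings can be checked directly. Python's default string comparison is by Unicode code point, which for ASCII text matches the character order described above (digits before uppercase letters, uppercase before lowercase):

```python
# Python compares strings by code point; for ASCII this matches the
# character order above: digits < uppercase letters < lowercase letters.
words = ["computer", "composer", "household", "house", "Household", "H2O", "HOTEL"]
print(sorted(words))
# ['H2O', 'HOTEL', 'Household', 'composer', 'computer', 'house', 'household']
```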
Querying for Numerical Data
All attribute values in Amazon SimpleDB are stored as UTF-8 strings, and all attribute comparisons are done lexicographically. This requires application designers to massage their numerical data slightly before submitting it to Amazon SimpleDB in order to preserve the expected order of comparison.
An easy way to handle these conversions is to implement them in the communication layer between the application and Amazon SimpleDB. Sample libraries available in the Resource Center include utility classes that aid in conversion of the numerical data types; however, application designers must choose certain parameters, described below, that depend on the qualities of their individual datasets.
There are two main techniques used with the numerical data in order to properly represent it in the string form:
- Negative number offsetting
- Zero-padding
Negative Number Offsets
The first step for representing number ranges is to ensure that every number in the dataset is positive. This can be achieved by choosing an offset larger than the absolute value of the smallest expected negative number in your dataset. For example, if the smallest expected number in your dataset is -12,000, choosing offset = 100,000 may be safe.
After a proper offset is chosen, the conversion is straightforward:
- For PutAttributes, DeleteAttributes, and Query calls, the offset is added to ALL values for that particular numeric attribute
- After a GetAttributes call, the offset is subtracted to produce the original values
Example:
- Original dataset: {654, -12000, 3610, 0, -23}
- Negative number offset: 100,000
- Dataset with offset applied: {100654, 88000, 103610, 100000, 99977}
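The two conversions are a single addition and subtraction; a minimal sketch using the offset from the example:

```python
# Offset chosen to be larger than the absolute value of the smallest
# expected negative number (-12,000 here), per the rule above.
OFFSET = 100_000

def apply_offset(n):
    """Add the offset before PutAttributes, DeleteAttributes, and Query calls."""
    return n + OFFSET

def remove_offset(n):
    """Subtract the offset after a GetAttributes call."""
    return n - OFFSET

original = [654, -12000, 3610, 0, -23]
shifted = [apply_offset(n) for n in original]
print(shifted)  # [100654, 88000, 103610, 100000, 99977]
print([remove_offset(n) for n in shifted])  # round-trips to the original
```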
Zero-Padding
Once all the numbers in the dataset are positive, one more step is necessary to ensure that values are properly represented for lexicographical comparisons. For example, the numbers 2 and 10, if converted directly into the strings "2" and "10", will not compare properly, as "10" comes before "2" in lexicographical order. However, if we zero-pad 2 to be represented as "02", the comparison will execute as expected. But what happens if we decide to add the number 200 to the dataset at a later point? Zero-padding with a single "0" will no longer work. Therefore, application designers should follow these steps for zero-padding:
- Determine the largest possible number in your dataset. Remember, that if you are using offsetting to take care of the negative number conversions, you have to take that into account as well by ensuring that your largest possible number is determined after you have added the offset to all the values.
- Based on the largest possible number, determine the maximum number of digits before the decimal point that you may have in your dataset.
- Convert all the numbers in your dataset to the proper string representation by prepending as many "0" characters as necessary to each number, so that the total number of characters representing the portion of the number before the decimal point matches the maximum number of digits determined in Step 2 above.
Example:
- Original dataset: {14.58, -12536.791, 20071109, 655378.34, -23}
- Negative number offset: 100,000
- Dataset with offset applied: {100014.58, 87463.209, 20171109, 755378.34, 99977}
- Zero-padded dataset representation: {00100014.58, 00087463.209, 20171109, 00755378.34, 00099977}
- Original query expression: attribute > '500'
- Converted query expression: attribute > '00100500'
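A minimal sketch of the full two-step conversion (offset plus zero-padding), using Python's decimal module to avoid floating-point noise in the string representations; the OFFSET and WIDTH values match the example above:

```python
from decimal import Decimal

OFFSET = Decimal(100_000)  # negative number offset from the example
WIDTH = 8                  # max digits before the decimal point, after offsetting

def encode(value):
    """Convert a number to its zero-padded SimpleDB string representation."""
    shifted = Decimal(str(value)) + OFFSET
    whole, _, frac = str(shifted).partition(".")
    return whole.zfill(WIDTH) + ("." + frac if frac else "")

def decode(text):
    """Recover the original number from its string representation."""
    return Decimal(text) - OFFSET

data = [14.58, -12536.791, 20071109, 655378.34, -23]
print([encode(v) for v in data])
# ['00100014.58', '00087463.209', '20171109', '00755378.34', '00099977']
print(encode(500))  # '00100500' -- the converted query boundary
```

Note that query constants must be converted the same way as stored values, which is why the expression attribute > '500' becomes attribute > '00100500'.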
Querying for Dates
Dates can be properly converted to strings for use within Amazon SimpleDB by following the ISO 8601 format. This format defines a representation for dates such that they can be properly compared in lexicographical order.
Below are the formats for representing date-time values with different degrees of granularity. Exactly the components shown here must be present, with exactly this punctuation. Note that the "T" appears literally in the string, to indicate the beginning of the time element, as specified in ISO 8601.
- Year: YYYY
- Year and month: YYYY-MM
- Complete date: YYYY-MM-DD
- Complete date plus hours and minutes: YYYY-MM-DDThh:mmTZD
- Complete date plus hours, minutes, and seconds: YYYY-MM-DDThh:mm:ssTZD
- Complete date plus hours, minutes, seconds, and a decimal fraction of a second: YYYY-MM-DDThh:mm:ss.sTZD
| Date | ISO 8601 format |
|---|---|
| Year 2007 | 2007 |
| January 2008 | 2008-01 |
| January 24th, 2008 | 2008-01-24 |
| 1:15 pm, January 24th, 2008 | 2008-01-24T13:15:00+01:00 |
| 1:15:30.45 pm, January 24th, 2008 | 2008-01-24T13:15:30.45+01:00 |
Example:
`Update Date` > '2008-01-24T22:48:15+01:00'
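For example, Python's datetime.isoformat produces exactly this representation (shown here for the 1:15 pm entry in the table; note that lexicographic order only matches chronological order when all stored values use the same time zone designator):

```python
from datetime import datetime, timedelta, timezone

# A timestamp in the UTC+01:00 zone used by the table above.
tz = timezone(timedelta(hours=1))
ts = datetime(2008, 1, 24, 13, 15, 0, tzinfo=tz)
print(ts.isoformat())  # 2008-01-24T13:15:00+01:00
```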
Single vs. Multi-Predicate Queries
To understand the semantic difference between single-predicate and multi-predicate queries, consider the following two queries:
select * from mydomain where Keyword = 'Book' intersection Keyword = 'Hardcover'
select * from mydomain where Keyword = 'Book' and Keyword = 'Hardcover'
Using the same dataset as in Query 101, the first query returns item '1579124585' and the second query returns nothing.
The reason is that each value is evaluated individually against each predicate. In the second query, because neither value satisfies both comparisons of the single predicate simultaneously, no item is selected.
Because "intersection" is a set operation, "Keyword = 'Book'" and "Keyword = 'Hardcover'" are treated as two independent predicates in the first query. The first predicate selects items 0385333498 and 1579124585, and the second predicate selects item 1579124585. The intersection operator therefore returns 1579124585, which appears in both result sets.
Note: For single-valued-attribute domains, the results of the two queries are the same. However, the second query is likely to perform better than the first.
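The evaluation rules can be simulated in a few lines. The keyword sets below are an assumption based on the Query 101 dataset (both items carry 'Book'; only one also carries 'Hardcover'):

```python
# Multi-valued Keyword attributes, assumed from the Query 101 dataset.
items = {
    "0385333498": {"Book"},
    "1579124585": {"Book", "Hardcover"},
}

def select(pred):
    """A predicate matches an item if ANY single value satisfies it."""
    return {name for name, keywords in items.items()
            if any(pred(v) for v in keywords)}

# Query 1: two independent predicates joined by the intersection operator.
q1 = select(lambda v: v == "Book") & select(lambda v: v == "Hardcover")

# Query 2: one predicate -- a single value must satisfy both comparisons.
q2 = select(lambda v: v == "Book" and v == "Hardcover")

print(q1)  # {'1579124585'}
print(q2)  # set()
```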
Tuning Your Queries Using Composite Attributes
Careful design of attributes can increase the efficiency of query operations in terms of duration and complexity. SimpleDB indexes attributes individually. In some cases, a query contains predicates on more than one attribute, and the combined selectivity of the predicates is significantly higher than the selectivity of each individual predicate. When this happens, the query retrieves a large amount of data and then discards most of it to generate the result, which can degrade performance. If you find your queries using this pattern, you can implement composite attributes to improve their performance.
The following example retrieves many books and many book prices before returning the requested result of books priced under nine dollars.
select * from myDomain where Type = 'Book' and Price < '9'
A composite attribute provides a more efficient way to handle this query. Assuming the attribute Type is a fixed four-character string, a new composite attribute TypePrice allows you to write a single-predicate query:
select * from myDomain where TypePrice > 'Book' and TypePrice < 'Book9'
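Building the composite value is simple string concatenation; because Type is fixed-width, no separator is needed. The helper below is a hypothetical sketch (the price string is assumed to already be in its SimpleDB string form, per the numeric conversion described earlier):

```python
def make_type_price(item_type, price_str):
    """Concatenate a fixed four-character Type with a price string."""
    assert len(item_type) == 4, "Type must be fixed-width for the range to sort correctly"
    return item_type + price_str

composite = make_type_price("Book", "8.99")
print(composite)  # Book8.99

# The composite sorts inside the range used by the single-predicate query:
assert "Book" < composite < "Book9"
```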
Performance for a multi-predicate query can also degrade if it uses an order by clause and the sorted attribute is constrained by a non-selective predicate. A typical example uses is not null. For example, suppose a domain contains user names, billing timestamps, and a variety of other attributes, and you want to get the latest 100 billing times for a user. A typical approach leverages the index on the user_id attribute: retrieve all the records with the user's ID value, filter the ones with valid billing times, then sort the records and keep the top 100. The following example retrieves the latest 100 billing times for a user.
select * from myDomain where user_id = '1234' and bill_time is not null order by bill_time limit 100
However, if the predicate on user_id is not selective (i.e., many items exist in the domain for the user_id value 1234), the SimpleDB query processor may instead scan the index on bill_time to avoid dynamically sorting a very large number of records. With this execution strategy, SimpleDB discards all records not belonging to user_id value 1234 as it scans.
A composite attribute provides a more efficient way to handle this query, too. You can combine the user_id and bill_time values into a composite value, and then query for items with that value. For example, if the user_id attribute is a fixed four-character string, the following query efficiently seeks the billing times for a user by querying only the composite attribute, where the billing time is part of the value:
select * from myDomain where user_id_bill_time like '1234%' order by user_id_bill_time limit 100
Note: If user_id is a variable-length field, consider using a separator when combining it with bill_time in the user_id_bill_time composite attribute. For example, the following attribute assignment uses the vertical bar separator character (|) for a user_id that is six characters long: user_id_bill_time = 123456|1305914378. The following select example gets only the attributes with user_id = 1234 in the composite attribute, and does not get the attributes for the six-character user_id:
select * from myDomain where user_id_bill_time like '1234|%' order by user_id_bill_time limit 100
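A sketch of the separator approach: because | cannot appear in a numeric user_id, the prefix '1234|' matches only the four-character ID and never a longer ID that merely starts with 1234:

```python
def make_key(user_id, bill_time):
    """Join user_id and bill_time with '|' so variable-length IDs stay unambiguous."""
    return f"{user_id}|{bill_time}"

keys = [make_key("1234", 1305914370), make_key("123456", 1305914378)]

# Simulate the like '1234|%' predicate with a prefix test.
matches = [k for k in keys if k.startswith("1234|")]
print(matches)  # ['1234|1305914370'] -- the six-character user_id is excluded
```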
The composite attribute technique is described further in the "Query performance optimization" section at Building for Performance and Reliability with Amazon SimpleDB.
Partitioning Your Dataset
Amazon SimpleDB is designed from the ground up to support highly parallel applications. One feature that aids in that goal is the ability for each subscriber to create multiple domains under their account. Application designers can use this feature to their advantage by partitioning their dataset among multiple domains in order to parallelize their queries and have them operate on smaller individual datasets. A single query may only be executed against a single domain, but developers can aggregate the result sets in their application layer. Below are a few example scenarios where partitioning may be applied.
- The dataset naturally partitions along some dimension. For example, a product catalog may be partitioned into "Book", "CD", and "DVD" domains. It is entirely possible to store all the product data in a single domain, but partitioning may result in better overall performance. The drawback of partitioning is the additional complexity in the application layer.
- The application requires throughput higher than a single domain can provide. In this case, partitioning data among several domains will increase the overall throughput.
- The dataset is large and the queries the application needs to execute are so complex that they hit the query timeout limit. In this case, partitioning reduces the size of each domain against which the queries are executed and may improve the performance of individual queries.
There are several options for partitioning your dataset. You may be able to find a particular dimension in your data that is well suited for partitioning. For example, "Country" may be a good partitioning dimension if you are storing data about different NATO countries. In this case, the application creates a separate domain for each country ("USA", "UK", "Germany", and so on) and executes queries against each domain individually, combining the results in the application layer if necessary.
There are cases where datasets do not naturally present themselves well for partitioning, for example when storing logs, events or web-crawler data. In this case, various hashing algorithms could be used to create a uniform distribution of items among multiple domains. Below are sample steps for partitioning such a dataset into four different domains:
- Determine the hash of an item name using a well-behaved hash function, such as MD5
- Take the last 2 bits of the resulting hash value and place each item in the corresponding domain:
- Last two bits equal to 00 – place item in Domain0
- Last two bits equal to 01 – place item in Domain1
- Last two bits equal to 10 – place item in Domain2
- Last two bits equal to 11 – place item in Domain3
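The steps above can be sketched as follows (MD5 via Python's hashlib; taking the low two bits of the digest's last byte selects one of the four domains):

```python
import hashlib

NUM_DOMAINS = 4  # must be a power of two for the bit mask below

def domain_for(item_name):
    """Pick a domain from the last 2 bits of the item name's MD5 digest."""
    digest = hashlib.md5(item_name.encode("utf-8")).digest()
    bucket = digest[-1] & (NUM_DOMAINS - 1)  # last two bits: 00, 01, 10, or 11
    return f"Domain{bucket}"

# Each item name deterministically maps to one of Domain0..Domain3.
print(domain_for("example-item"))
```

Raising NUM_DOMAINS to the next power of two is all that is needed to spread items across more domains.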
The outlined algorithm distributes items among domains with a uniformity that is directly controlled by the hash function used. An additional advantage of this scheme is the ease with which it can be adjusted to partition your data among a larger number of domains by considering more bits of the hash value (3 bits distribute across 8 domains, 4 bits across 16 domains, and so on).