AWS Compute Blog

Implementing geohashing at scale in serverless web applications

Many web and mobile applications use geospatial data, often with map overlays. This results in dataset queries based upon proximity, for questions such as “How far is the nearest business?” or “How many users are nearby?”. Applications with significant traffic need an efficient way to handle geolocation queries. This blog post explores a simple geohashing solution for serverless applications, and how this can work at scale.

Geohashing is a popular public domain geocode system that converts geographic information into an alphanumeric hash. A geohash is used to identify a rectangular area around a fixed point. The length of the hash determines the precision of the area identified. This allows you to use a hierarchical search where the length of the geohash corresponds to the size of a search area.

This blog post references the Ask Around Me example application, which helps users ask and answer questions in their local geographic area. This post explores a solution suited for the expected volumes in this web application. See part 1 of the blog application series to learn more.

Choosing geohashing precision for your application

One of the most important considerations in using geohashing is understanding the expected distribution of the locations and the size of the search area. Selecting the correct geohash length is important for the efficiency of the search. There is a balance between the number of cells searched and the number of items within each cell.

A geohash corresponds to a grid cell but a typical search corresponds to multiple overlapping cells. In the following diagram, the search radius overlaps nine distinct geohash cells. The query returns all location pins within the cells but only the green pins are relevant to the radial search. The gray pins are outside of the overlapping geohash cells so are immediately discarded in the query:

Discarded pins from a query

If the geohash is too precise, meaning the cell is small, the search radius may include many more cells:

Search radius with too many cells

If the geohash is too coarse, meaning the cell is too large, the query may return too many results for a single cell for the search to be efficient. You may also find a large number of results in those cells are not within the search radius:

Too many results in a single cell

The geohashing algorithm creates a hash that makes it easy to find the neighboring eight cells of any target cell. For optimal performance, you should ideally select a hashing resolution where most queries are resolved by searching only the target cell and its immediate neighbors.

In the canonical implementation of geohash, there are some areas on the globe where physically neighboring cells are not logically close. This can cause errors in these edge cases where there are “seams”. The S2 geometry library solves this problem by using a spherical reference, meaning you can use this approach anywhere in the world. The library has been ported to TypeScript and is available as an npm package called nodes2ts.

Using Amazon DynamoDB for geohashing queries

Amazon DynamoDB is a serverless NoSQL database that offers single-digit millisecond performance at any scale. For an application with moderate load, you can set read and write capacity to allocate a dedicated amount of throughout, or you can set the provisioning to On-Demand. DynamoDB is well suited to key-based queries needing fast, consistent performance.

For web application developers using Node.js or JavaScript, there is an npm package called dynamodb-geo that ports the Java Geo Library for DynamoDB. Both packages are based on the S2 geometry library. This provides a simple interface to use DynamoDB for geospatial data. The library is a wrapper for DynamoDB and maintains an underlying table. After configuring the table and the library, you interact with the data using the library’s API.

For example, to add a new geospatial item:

    // Add a new location to the database
const result = await myGeoTableManager.putPoint({
        RangeKeyValue: { S: 'location-id-1234' }, // unique ID
        GeoPoint: { 
            latitude: 40.6892534,
            longitude: -74.0466891
        },
        PutItemInput: { 
            Item: { 
                country: { S: 'USA' },
                state: { S: 'New York' },
                pointOfInterest: { S: 'Statue of Liberty' }
            }
        }
    }).promise()

The library also provides methods for updating and deleting data points, and supports DynamoDB batch operations. Querying the data via the API can only be done via geo-point requests. You can retrieve a dataset of items based around a rectangular or radial area:

    // Querying 10km (~6.2 miles) around Boston, Massachusetts.

    const result = await myGeoTableManager.queryRadius({
        RadiusInMeter: 10000,
        CenterPoint: {
            latitude: 42.3145186,
            longitude: -71.1103666
        }
    }).promise()

    // Outputs results as an array of DynamoDB.AttributeMaps
    console.log(result)

Moving locations and moving consumers

In Ask Around Me, questions have a fixed latitude and longitude – once a question is asked, the location never changes. The dynamodb-geo library uses the hashKey as a primary key in the underlying DynamoDB table. To update the primary key of a DynamoDB item, you must delete and recreate the entire item.

Many geolocation implementations use static data, such as a list of retail store locations. But if you need to track moving locations, such as the location of mobile users, this is not the best approach. As a result, this library works well for static data (or data where the location rarely changes) but is not suitable where locations change frequently.

In the Ask Around Me app, users receive alerts when new questions appear around their current location. Real-time messaging is implemented using AWS IoT Core, where publish-subscribe topics connect the frontend and backend.

The application retrieves the current latitude and longitude via the browser and uses the S2 geometry library to convert this to a geohash key. It then subscribes to this geohash topic. On the backend, when new questions are saved, these are published to a topic using the geohash key. As a result, users in the same geohash area receive notifications when new questions are asked nearby.

This broadcast pattern allows the application to fan out a single new question in the DynamoDB table to thousands of live users in frontend application. As users move their location, the front-end application can detect if they moved from one geohash cell to another. If so, the frontend unsubscribes from the outdated geohash identifier, and subscribes to the new. This ensures that moving consumers are always listening to questions near their current location. For more information on the broadcast pattern, see the AWS whitepaper, Designing MQTT Topics for AWS IoT Core.

Sorting, aging, and expiring location data

For an application with many writes to the underlying geolocation table, the user may expect to see newer data first. Also, you may need a strategy for removing stale data from the table. Even with a highly performant database like DynamoDB, you must ensure the first page of results return the most relevant items.

The dynamodb-geo library uses the geohash identifier as a primary key, but allows the developer to choose an identifying range key. You need this key to uniquely identify items that have the same geohash. However, you can also use this key to for sorting and paging data that’s returned by the library’s search APIs.

The Ask Around Me app uses a concatenated userId-timestamp pattern as a range key, for example jbeswick-1589202456. This helps in implementing two data access patterns:

  • Finding by user: using the begins_with operator, you can identify questions asked by a specific user. For example, begins_with(‘jbeswick’) returns all the questions for this user.
  • Sorting results by time added: as query results are always sorted by the sort key value, the Unix timestamp ensures that these are returned from oldest to newest. You can reverse this order to return newest first by setting ScanIndexForward to false in the DynamoDB query.

In this application, you might decide that questions more than a year old should be expired from the table. You can have DynamoDB expire items automatically using the time to live (TTL) feature.

To use this, create a custom numeric attribute in the table that contains the expiration time of the item, set as a Unix timestamp. For example, an item expiring on a midnight on January 1, 2023 uses the timestamp 1672531200.

When you enable TTL on a DynamoDB table, you specify which attribute contains the expiration value. A background job checks the TTL values against the current time. For any items found where the TTL timestamp is older than the current time, it expires those items within a 48-hour time window.

Conclusion

This blog post explores how you can solve geolocation queries using geohashing. I discuss how you should decide on the resolution of a geohash for your specific workload.

I explain why DynamoDB is a good fit for many geohashing applications. I cover how you can use the dynamodb-geo library to easily implement location queries in your web applications. Using AWS IoT Core, you can also use MQTT topics to fan out updates to moving subscribers.

Finally, I show how to use the DynamoDB table’s range key to help with sorting data by age and supporting addition access patterns. For application with many writes, you can also automatically expire items using DynamoDB’s TTL feature.

To learn more about how the Ask Around Me application implements geolocation queries, see the blog series.