Querying the Daylight OpenStreetMap Distribution with Amazon Athena

This is guest post written by Jennings Anderson, research scientist at Meta, with Mike Jeffe, a technical business development manager on the Open Data Team at Amazon Web Services (AWS).

In 2020, Meta introduced the Daylight Map Distribution, which combines OpenStreetMap (OSM) data with quality and consistency checks from Daylight mapping partners to create a no-cost, stable, and simple-to-use global map.

This blog post provides a brief overview of OSM and Daylight followed by a step-by-step tutorial using five real-world examples. We combine the powerful query capabilities of Amazon Athena from Amazon Web Services (AWS) with the feature-rich Daylight OSM data to demonstrate a typical OSM data analysis workflow. This walkthrough builds on the foundation for cloud-native OSM datasets on AWS in the blog post, “Querying OpenStreetMap with Amazon Athena.”

This walkthrough is targeted at both new and experienced Athena enthusiasts who are looking for cloud-native approaches that simplify the process of working with OSM at large mapping scales. It is highly relevant for organizations such as the Humanitarian OpenStreetMap Team (HOT), which depends on timely and updated OSM data to support their global humanitarian and disaster response mapping charters, as well as organizations that provide value-added services and maps built from OSM data.

OpenStreetMap data and the Daylight Map Distribution

OpenStreetMap (OSM) is a no-cost, editable map of the world, created and maintained by a global community of contributors. The complete OSM database includes more than one billion geospatial objects, including 500 million buildings, 77 million kilometers of roads and paths, tens of millions points of interest, and more. Collectively, these features are the product of more than 1.8 million individual contributors. OSM data is available on the Registry of Open Data on AWS and the AWS Data Exchange as part of the AWS Open Data Sponsorship Program, which makes high-value, cloud-optimized datasets publicly available on AWS.

OSM data differs from other sources of mapping data because it is a form of volunteered geographic information (VGI), curated by individual volunteers and editors. When users contribute to OSM, their map changes are contained in a changeset consisting of either additions (new feature), modifications (change existing tags, feature location, or feature shape), or deletions.

How the Daylight Map supports OSM queries

Every day, OSM receives millions of contributions from the community. However, not every contribution is compatible with the specific use cases of some companies, which is why the maps team at Meta developed Daylight Map. Updated with the latest changes from OSM each month, Daylight provides the same data used by Facebook maps in an analysis-ready format that includes Parquet files optimized for loading data into Amazon Athena. This allows users to more simply load OSM data into Athena and perform SQL queries against a billion features in the browser, without the need to download or access the files directly.

Working with OpenStreetMap data in geospatial analysis

Users most commonly interact with the OSM data by editing existing features or adding new features to the map. This is commonly performed on openstreetmap.org using the web-based ID editor; more advanced mappers may use desktop editing tools such as Java OpenStreetMap Editor (JOSM).

OSM data is also heavily used in different types of geospatial analysis. In order to perform any type of analysis with OSM data, it must first be read, parsed, and transformed into an analysis-ready format. To support these workstreams, there can be an added effort to extract, transform, and load (ETL) in order to reconstruct downloaded OSM data into more common spatial primitive features like points, lines, and polygons.

A plethora of community-maintained tools exist to work with OSM data at every stage from database exports to processing and ingesting into any number of geographic information system (GIS) tools. The typical OSM data-analysis workflow looks like this:

Figure 1. A typical workflow for performing data analysis of OpenStreetMap data.

Figure 1. A typical workflow for performing data analysis of OpenStreetMap data.

Working with the global OSM dataset at scale requires some combination of cloud infrastructure or extensive amounts of on-premises computing resources. Many users forgo this by downloading smaller subsets or regional extracts and then filter them locally using custom tools that recognize OSM tags and metadata. Then, to transform this data into something that is considered “analysis-ready,” data must be converted into a format that can be imported into a common analytical environment, such as Shapefile, GeoJSON, or WKT; or loaded into a database such as PostgreSQL. Once the data has become spatially enabled, it can be further processed using standard geospatial applications and tools such as QGIS, ArcMap, and Jupyter Notebooks. Today, a variety of plugins and libraries exist to abstract these processes and load OSM files directly into QGIS, ArcMap, R, and Python—simplifying the steps required—but the local processing burden remains the same.

Analysis-ready Daylight Distribution

The analysis-ready Daylight OSM Distribution simplifies this entire pipeline by combining the first three steps (obtaining data, preprocessing, converting to analysis-ready formats) into a single SQL query.

Figure 2. The simplified workflow for performing OSM data analysis when using the analysis-ready Daylight Map Distribution.

Figure 2. The simplified workflow for performing OSM data analysis when using the analysis-ready Daylight Map Distribution.

We use the term analysis-ready here to denote the following:

All OSM geometries are converted to more common geometric data types: points, lines, and (multi)polygons. Therefore, all OSM way and relation elements contain their own sets of coordinates instead of references to other nodes. Moreover, all of Daylight has been combined into a single table so users need not worry about the differences between node, way, or relation data types.
Additional attributes such as bounding box, length, area, and location have also been pre-computed per feature.
The data is ready to be loaded directly into common analytical tools without the need for conversion, especially tools that can work with the data in-situ in the cloud.

The Analysis-Ready Daylight OSM Distribution is stored on Amazon Simple Storage Service (Amazon S3) as Parquet files, which are optimized for use with Athena, both in computation and cost-effectiveness. Users do not incur charges for accessing the data directly. Costs to the user are determined only by the number of bytes that Athena scans while running the query. The columnar storage format of Parquet files provides several optimizations allowing fast query and retrieval of data in a highly cost-efficient manner. As an example, querying all 212M roads and paths in Daylight release v1.23 and summing up their collective length of 80M kilometers scans just 14GB of data (only the tags and linear_meters columns).

The analysis-ready Daylight Distribution schema is as follows:

Column	Type	Description
id	bigint	The OSM id of the object.
version	int	The current version of the feature in OSM. If the version is greater than 1, the feature has been edited since it was first created.
changeset	bigint	The changeset in which the feature was last edited.
created_at	timestamp	When the feature was last updated.
tags	map<string,string>	Key/Value pairs of attributes that describe the OSM feature, such as ‘building=house’ or ‘natural=tree’.
type	string	The type of OSM element, a node, way, or relation.
wkt	string	Well-known-text representation of the feature’s geometry.
min_lon	double
max_lon	double
min_lat	double
max_lat	double
quadkey	string	Zoom level 15 quadkey (Bing Tile) that contains this feature.
linear_meters	double	The length (in meters) of the feature if it is a LineString.
square_meters	double	The area (in square meters) of the feature if it is a polygon.

release	string	The daylight release. Analysis Ready Daylight was first released with Daylight v1.9. Version 1.23 was released in February 2023.

The release column corresponds to the Daylight Release version. Because the data is partitioned by release on Amazon S3, it is imperative to specify a release version (such as WHERE release = 'v1.23') to include only one version of each OSM feature.

We can begin to dive deeper into common workflows once we create the Daylight OSM Features table in Athena.

Getting started with querying OSM data on Amazon Athena

Prerequisites

To run the example queries in this post, you need:

1. An AWS account

2. Create the daylight_osm_features table by running the following query in the Athena query editor in the Athena console:

CREATE EXTERNAL TABLE `daylight_osm_features`(
  `id` bigint, 
  `version` int, 
  `changeset` bigint, 
  `created_at` timestamp, 
  `tags` map<string,string>, 
  `wkt` string, 
  `min_lon` double, 
  `max_lon` double, 
  `min_lat` double, 
  `max_lat` double, 
  `quadkey` string, 
  `linear_meters` double,
  `square_meters` double)
PARTITIONED BY ( 
  `release` string, 
  `type` string)
STORED AS Parquet
LOCATION
  's3://daylight-openstreetmap/parquet/osm_features'

3. Next, load the partitions by running this query in the Athena query editor:

MSCK REPAIR TABLE `daylight_osm_features`

With this, you have access to nearly one billion OSM features complete with geometries, tags, and metadata.

Example query 1: Density of building features in OSM

For our first query example, we want to discover how many buildings are in the daylight features table.

Enter the following query in the Athena query editor:

SELECT COUNT(*)
FROM daylight_osm_features
WHERE tags [ 'building' ] <> 'no'
  AND release = 'v1.23'

This query counts the total number of features in Daylight version 1.23 that have a value for the building key other than ‘no’, which turns out to be 543M buildings (542,868,282). Next, to find where these buildings are, we can use the quadkey instead of the actual building geometries for simplicity. To aggregate at zoom-level 7, use `SUBSTR(quadkey, 1, 7)`. Athena can also convert the quadkey to a polygon representing the boundaries of that tile:

SELECT BING_TILE_POLYGON(BING_TILE(SUBSTR(quadkey,1,7))) as geometry,
	count(*) as building_count
FROM daylight_osm_features
WHERE release = 'v1.23'
	AND tags [ 'building' ] <> 'no'
GROUP BY SUBSTR(quadkey,1,7)

This query returns 3,695 rows with each tile’s geometry as a WKT string (Well-known text) and the number of buildings in that tile in the building_count column.

Loading this file directly into QGIS as a delimited text file, we create this visualization showing that there is (significantly) more buildings mapped in Western Europe than the rest of the world (Figure 3).

Figure 3. Building densities in OpenStreetMap. Note the higher densities in Nepal, Indonesia, and Eastern Africa as well — these regions have been the focus of numerous humanitarian mapping campaigns.

Example query 2: Searching for features with a bounding box

Example query 1 used quadkeys representing map tiles as a convenient unit for spatial aggregation without actually performing a geospatial calculation. We can also use a geographic bounding box in Athena to be more precise and perform actual geospatial operations. In this example, we’ve drawn a polygon around Puerto Rico. The WKT string is: POLYGON((-67.51 17.65,-67.44 18.78,-65.16 18.63,-65.06 17.99,-66.40 17.57,-67.51 17.65)). To search for features within this polygon, Athena can interpret the WKT string with the ST_GeometryFromText function:

SELECT tags [ 'highway' ] AS highway_tag,
	SUM(linear_meters) / 1000 as total_km
FROM daylight_osm_features
WHERE release = 'v1.23'
	AND tags [ 'highway' ] IS NOT NULL
	AND linear_meters > 0
	AND ST_CONTAINS(
			ST_GEOMETRYFROMTEXT(
				'POLYGON((-67.51 17.65,-67.44 18.78,-65.16 18.63,
    -65.06 17.99,-66.40 17.57,-67.51 17.65))'
),
	ST_GEOMETRYFROMTEXT(wkt)
	    )
GROUP BY tags [ 'highway' ]
ORDER BY total_km DESC

In this example query, we group by the highway tag to get the total length of different highway= features in Puerto Rico:

OSM Highway Tag	Total Kilometers
residential	21,932.20
track	5,739.49
tertiary, tertiary_link	5,675.30, 28.71
service	3,595.43
secondary, secondary_link	1777.52, 31.78
unclassified	1539.47
motorway, motorway_link	919.62, 320.22
path	607.40
primary, primary_link	515.05, 33.42

Figure 4. Total lengths of different highway tags in Puerto Rico.

Example query 3: Extracting features in a given bounding box

In our previous queries, we aggregated query results by map tile or feature type. Athena can also be used for data extraction. This next query also uses our Puerto Rico bounding box, but instead of adding up the length of each highway, we receive the highway tag, length, and geometry of each highway feature in Puerto Rico.

SELECT id, tags[ 'highway' ] as highway_tag, linear_meters, wkt AS geometry
FROM daylight_osm_features
WHERE release = 'v1.23'
	AND tags [ 'highway' ] IS NOT NULL
	AND linear_meters > 0
	AND ST_CONTAINS(
			ST_GEOMETRYFROMTEXT(
				'POLYGON((-67.51 17.65,-67.44 18.78,-65.16 18.63,
    -65.06 17.99,-66.40 17.57,-67.51 17.65))'
),
	ST_GEOMETRYFROMTEXT(wkt)

This query returns a CSV with 166K rows. Loading this file into QGIS allows us to produce the following map of all the highways in Puerto Rico:

Example query 4: Incorporating OSM attributes (tags)

One of the most powerful aspects of OSM data is the complex attribute tagging system of unlimited key/value pairs. In this next example, we look specifically at the amenity tag, which can be used to denote points of interest or types of businesses. Restaurants and churches are two of the top five most prevalent amenities mapped in OSM. As an example, we can count the total number of OSM features tagged as either churches (amenity=place_of_worship) or restaurants/cafes (amenity=restaurant and amenity=cafe) in each zoom-level 9 map tile:

SELECT BING_TILE_POLYGON(BING_TILE(substr(quadkey, 1, 9))),
	COUNT_IF(tags ['amenity'] =  'place_of_worship') as churches,
	COUNT_IF(tags ['amenity'] IN ('restaurant', 'café')) as restaurants
FROM daylight_osm_features
WHERE release = 'v1.23'
GROUP BY substr(quadkey, 1,9)

Figure 6. Comparison of the number of restaurants and places of worship per zoom-level 9 map tile. Map data © OpenStreetMap Contributors.

Figure 6. Comparison of the number of restaurants and places of worship per zoom-level 9 map tile. Map data © OpenStreetMap Contributors.

Example query 5: Joining on OSM changesets table

Each time a mapper saves their edits to the map, they create a new changeset that includes metadata with the time, mapper information, and comments describing the nature of the edits. Each week, the latest changesets from OSM are made available as part of the OpenStreetMap open dataset on the Registry of Open Data on AWS (RODA). To create the OSM changesets table from the latest weekly export, we can run this query (discussed in more detail in this blog post) in the Athena query editor:

CREATE EXTERNAL TABLE changesets (
    id BIGINT,
    tags MAP<STRING,STRING>,
    created_at TIMESTAMP,
    open BOOLEAN,
    closed_at TIMESTAMP,
    comments_count BIGINT,
    min_lat DECIMAL(9,7),
    max_lat DECIMAL(9,7),
    min_lon DECIMAL(10,7),
    max_lon DECIMAL(10,7),
    num_changes BIGINT,
    uid BIGINT,
    user STRING
)
STORED AS ORCFILE
LOCATION 's3://osm-pds/changesets/'

Each analysis-ready OSM feature includes the changeset which corresponds to the id column in the changesets table. This next query quantifies both buildings and total kilometers of roads in Daylight that were mapped as part of a Humanitarian OSM Team (HOT) mapping task, determined by the existence of the ‘hotosm’ phrase in the comments, which is automatically added by the HOT tasking manager.

SELECT DATE(changesets.created_at) as _day,
SUM(IF(d.tags [ 'highway' ] IS NOT NULL,
d.linear_meters, 0)) / 1000 AS highway_km,
	SUM(IF(d.tags [ 'building' ] <> 'no', 1, 0)) AS buildings
FROM daylight_osm_features d
	INNER JOIN changesets ON changesets.id = d.changeset
WHERE changesets.tags [ 'comment' ] LIKE '%hotosm%'
	AND changesets.created_at > DATE '2015-01-01'
	AND release = 'v1.23'
GROUP BY date(changesets.created_at)

The query results look like this:

_day	highway_km	buildings
2015-01-01	421.02	2,730
2015-01-02	301.42	5,558
2015-01-03	146.77	3,446

Plotting these values by time yields the following figure:

Figure 7. Buildings and kilometers of roads mapped each day as part of HOT tasks.

Figure 7. Buildings and kilometers of roads mapped each day as part of HOT tasks.

It should be noted that Figure 7 represents the lower bounds of these values. Since our dataset includes only the latest version of the map, there may be features that were first added to the map years ago in a changeset including a HOT-related hashtag, but have been updated in subsequent non-HOT changesets and therefore go uncounted here. However, the overall temporal patterns and relative editing magnitudes can still tell us a lot about humanitarian mapping activities in OSM.

Conclusion

These use cases merely scratch the surface of the types of analysis that can be performed using the parquet Daylight data in Athena. For organizations that depend on timely access of this data, these examples demonstrate how one can reduce the time from analysis to insight. When combined with powerful tools like Amazon Athena, Daylight can be highly cost effective and scalable, allowing users to focus on their analysis rather than their infrastructure and data. Stay tuned for future Daylight Map Distribution releases and be sure to subscribe to any relevant Amazon Simple Notification Service (Amazon SNS) topics listed on the Registry of Open Data on AWS.

Read related stories on the AWS Public Sector Blog:

Subscribe to the AWS Public Sector Blog newsletter to get the latest in AWS tools, solutions, and innovations from the public sector delivered to your inbox, or contact us.

Please take a few minutes to share insights regarding your experience with the AWS Public Sector Blog in this survey, and we’ll use feedback from the survey to create more content aligned with the preferences of our readers.