Visualize data lake address datasets on a map with Amazon Athena and Amazon Location Service geocoding
At Amazon Web Services (AWS), many public sector customers in government, healthcare, and life sciences have data lakes that contain addresses (e.g., 123 Main Street). These customers frequently ask how they can quickly visualize these addresses on a geographic map to get a more intuitive understanding of how these addresses are distributed. Amazon Location Service provides geocoding functionality, which can convert addresses to geographic coordinates (latitude and longitude points). But the challenges are:
- What is the simplest way to use Amazon Location Service’s geocoding functionality in an ad hoc fashion on addresses in a data lake without having to geocode the entire dataset?
- Once the addresses are geocoded, what is the simplest way to quickly visualize the results on a map without setting up a full geographic information system (GIS)?
In this post, learn how to use Amazon Athena and Amazon Location Service to perform ad hoc geocoding on an example dataset* of registered business addresses provided by the State of Oregon, and visualize these geocoded addresses on an Amazon QuickSight map.
Solution overview: How to visualize data lake address datasets on a map with Amazon Athena and Amazon Location Service geocoding
Figure 1. Architecture diagram showing the solution described in this blog post. The main components are an Amazon S3 bucket, Amazon Athena, AWS Glue, AWS Lambda, Amazon Location Service, and Amazon QuickSight.
Figure 1 features the solution architecture, which works as follows:
- A .csv file is obtained from the example dataset from the State of Oregon that contains addresses for registered businesses. This dataset is placed in an Amazon Simple Storage Service (Amazon S3) bucket.
- An AWS Glue database and table are created that define the data format of the .csv file in Amazon S3 to allow it to be queried from Amazon Athena.
- Amazon Athena is used to execute a SQL query against the data in Amazon S3. An Amazon Athena User Defined Function (UDF) is used to execute an AWS Lambda function, which in turn calls Amazon Location Service to geocode address data into geographic points (latitude and longitude pairs). The result of the SQL query is written back to Amazon S3 and an associated AWS Glue table is created.
- The new geocoded address dataset is loaded into Amazon QuickSight and visualized on a map for end users.
To complete this walkthrough, you will need access to an AWS account with sufficient permissions to create these AWS resources, and access to the AWS Management Console.
You will also need to have Amazon QuickSight enabled within your AWS account. If you do not already have QuickSight enabled, you will need to sign up for Amazon QuickSight.
In this walkthrough, you will create AWS resources using AWS CloudFormation, and additional AWS resources using the AWS Management Console.
Request a service quota increase for Amazon Location Service
To enable Amazon Athena to call the Amazon Location Service geocoding API many times in rapid succession as needed for this walkthrough, you need to request a service quota increase on the Amazon Location Service SearchPlaceIndexForText API endpoint. To do this, navigate to the request page in the AWS Management Console. Make sure you are in the desired AWS Region before making the request.
Select the Request quota increase button (Figure 2), which will open a request screen (Figure 3). From this screen, set the Change quota value to 100 and select the Request button. Your request should be granted automatically after a short period of time. Refresh the page to see the status of your service quota increase request.
Figure 2. Service quota request page for the rate of SearchPlaceIndexForText API requests.
Figure 3. How to set the quota value for the Rate of SearchPlaceIndexForText service quota request.
Deploy the Amazon Athena User Defined Function
Amazon Athena UDFs can be used to call AWS Lambda functions when performing SQL queries from Amazon Athena. From an AWS Lambda function, you can call Amazon Location Service to perform geocoding operations. In the aws-samples GitHub repository, there is a sample project that uses the AWS Serverless Application Repository to deploy this AWS Lambda function, along with the required Amazon Location Service place index that enables you to perform geocoding operations.
To deploy this in your own AWS account, first sign in to your AWS Management Console. Make sure you are in your desired AWS Region, then click here to open an AWS Management Console page where you can edit application settings (Figure 4).
Figure 4. Application settings that can be adjusted before deploying the Athena UDF. The figure highlights the required settings change to ReservedConcurrentExecutions.
As highlighted in Figure 4, change the value of ReservedConcurrentExecutions to “4.” Check the box that acknowledges the creation of IAM roles, and select the Deploy button. Setting ReservedConcurrentExecutions to 4 prevents too many concurrent Lambda function executions from exceeding the rate limit that you set up earlier for the Amazon Location Service SearchPlaceIndexForText API.
Deploy the solution CloudFormation template
An AWS CloudFormation template is provided as part of this blog post solution that deploys a number of AWS resources, including:
- An Amazon S3 bucket, which is populated with the example dataset of addresses from the State of Oregon
- An AWS Glue database and table that have the example dataset’s structure predefined, which in turn allows you to start querying it from Amazon Athena with no additional configuration
- An example Amazon Athena SQL query that you will use to perform geocoding operations
To deploy this template into your AWS account, click this link which opens the AWS Management Console and initiates the solution’s CloudFormation template deployment process. Check the I acknowledge that AWS CloudFormation might create IAM resources box (Figure 5); then select the Create stack button.
Figure 5. Shows how to deploy the CloudFormation template to your AWS account that creates the required resources for this solution.
Geocode addresses using Amazon Athena
With the solution deployed, you can now perform some geocoding. Navigate to the Amazon Athena section of the AWS Management Console and open the Query Editor. Change your workgroup to LocationServiceExampleWorkgroup (Figure 6), and choose the Acknowledge button in the resulting screen (Figure 7).
Figure 6. Shows how to change your Amazon Athena workgroup in the AWS Management Console.
Figure 7. Shows how to acknowledge the changes to your Amazon Athena workgroup.
In the Query Editor page, navigate to the Saved queries tab and choose the saved query named OregonGeocodingExample (Figure 8).
Figure 8. Shows how to open an Amazon Athena saved query to perform the subsequent geocoding operation.
Select the Run button (Figure 9) to execute the saved SQL query. If you look at the contents of this SQL query, you will see that it passes address values into a function called “search_place_index_for_text,” which in turn uses AWS Lambda to call Amazon Location Service to geocode these addresses. In this example calculation, you are selecting only the addresses for the 97201 zip code, which corresponds to downtown Portland, Oregon.
Figure 9. Shows how to execute the Amazon Athena SQL query that performs the geocoding operation.
After a few minutes, the query will complete. You should see a new AWS Glue table called “oregon_businesses_geocoded” that has “lat” and “lon” values (Figure 10).
Figure 10. Shows the results of the Amazon Athena query, which is a new AWS Glue table that includes latitude and longitude values from the geocoding operation.
Visualize addresses on a map in Amazon QuickSight
Now that you have performed the required geocoding and have produced a new AWS Glue table, you can visualize the addresses on a map in Amazon QuickSight.
Start by logging into Amazon QuickSight through the AWS Management Console. Once you are logged in, navigate to the “QuickSight access to AWS services” settings page. Under Allow access and autodiscovery for these resources, make sure that both Amazon S3 and Amazon Athena are checked (Figure 11). Then choose the Select S3 buckets link and check the oregon-businesses-* Amazon S3 bucket that was created earlier. Select Save.
Figure 11. Shows the Amazon QuickSight permissions page that allows you to grant permissions for Amazon QuickSight to access data in Amazon S3 via Amazon Athena.
Next you will add a dataset in QuickSight so that we can import the data via Amazon Athena.
Navigate to the Amazon QuickSight Datasets section and select the New dataset button. As seen in Figure 12, select the Athena option, then enter “oregon_businesses” in the Data source name field before selecting the Create data source button.
Figure 12. Shows the Amazon Athena add data source workflow.
On the next screen, Choose your table (Figure 13), select the oregon_businesses database and the oregon_businesses_geocoded table before choosing the Select button.
Figure 13. Shows the Amazon Athena table selection workflow.
On the next screen, Finish dataset creation (Figure 14), leave the defaults and select the Visualize button.
Figure 14. Shows the final step in setting up the Amazon QuickSight dataset creation process.
After a short time, the data will be imported into Amazon QuickSight SPICE, which is an in-memory engine that increases performance. As seen in Figure 15, select the Visual type of Points on map, then drag and drop lat and lon into the Geospatial field well, and experiment with dragging and dropping different values into the Size and Color field wells, such as entity_type or associated_name_type to distinguish between business types by color.
Figure 15. Shows an Amazon QuickSight map visualization with the Oregon business data loaded.
You now have a fully interactive map of registered businesses in downtown Portland, Oregon, rendered on top of a base map that is also provided by Amazon Location Service. You can use this as-is, export it, or publish it as an Amazon QuickSight Dashboard.
If you would like to remove all of the resources created in this walkthrough, perform the following steps:
- Delete the Amazon QuickSight analysis and dataset from the QuickSight user interface.
- If you were not already using Amazon QuickSight and want to terminate your trial subscription, follow these instructions.
- From the Amazon Athena section of the AWS Management Console, delete the “oregon_businesses_geocoded” table that was created by our SQL query as seen in Figure 16.
- Delete all objects from the Amazon S3 bucket “Oregon-businessses-*” that was created by CloudFormation using the instructions in the Amazon QuickSight User Guide.
- Delete the CloudFormation stacks called “serverlessrepo-AmazonLocationUDFs” and “location-service-athena-example.”
Figure 16. Shows how to delete the “oregon_businesses_geocoded” table from the Athena user interface.
Note that Amazon QuickSight can show up to 10,000 points on a map at once. If you experiment with this solution with your own data and have more than 10,000 points to show, Amazon QuickSight will render the first 10,000 points.
If you are attempting to do this walkthrough on a new AWS account, your service quota request for Amazon Location Service may not be automatically accepted, and you may have to work with AWS support to get the service quota adjusted.
Similarly, if you are using a new AWS account, you may have an AWS Lambda “Concurrent Executions” service quota limit that is set to 10. If this is the case, then deploying the Athena UDF solution will fail. If this happens, you may request a service quota increase for AWS Lambda Concurrent Executions with a value of at least 20.
In this blog, we visualized business addresses from the State of Oregon on a map after geocoding them to provide an interactive, intuitive interface to better understand this dataset. We used serverless services on AWS that are simple to set up, are pay-as-you-go, and do not require ongoing maintenance. We used a .csv file of addresses in Amazon S3 as the data source to show how this solution can be applied to the types of data lakes that we see from many of our public sector customers.
If you have your addresses stored in a relational database instead of a data lake, read about a similar solution in this blog post: “Access Amazon Location Service from Amazon Aurora.”
*Data Citation: State of Oregon Active Businesses dataset was accessed on August 1, 2022.
Read more about AWS for research:
- How to partition your geospatial data lake for analysis with Amazon Redshift
- How to deliver performant GIS desktop applications with Amazon AppStream 2.0
- How to set up Galaxy for research on AWS using Amazon Lightsail
- Analyze terabyte-scale geospatial datasets with Dask and Jupyter on AWS
- Accelerating and democratizing research with the AWS Cloud
- Creating access control mechanisms for highly distributed datasets
Subscribe to the AWS Public Sector Blog newsletter to get the latest in AWS tools, solutions, and innovations from the public sector delivered to your inbox, or contact us.
Please take a few minutes to share insights regarding your experience with the AWS Public Sector Blog in this survey, and we’ll use feedback from the survey to create more content aligned with the preferences of our readers.