AWS Blog

New – USASpending.gov on an Amazon RDS Snapshot

My colleague Jed Sundwall runs the AWS Public Datasets program. He wrote the guest post below to tell you about an important new dataset that is available as an Amazon RDS Snapshot. In the post, Jed introduces the dataset and shows you how to create an Amazon RDS DB Instance from the snapshot.

Jeff;


I am very excited to announce that, starting today, the entire public USAspending.gov database is available for anyone to copy via Amazon Relational Database Service (RDS). USAspending.gov data includes data on all spending by the federal government, including contracts, grants, loans, employee salaries, and more. The data is available via a PostgreSQL snapshot, which provides bulk access to the entire USAspending.gov database, and is updated nightly. At this time, the database includes all USAspending.gov for the second quarter of fiscal year 2017, and data going back to the year 2000 will be added over the summer. You can learn more about the database and how to access it on its AWS Public Dataset landing page.

Through the AWS Public Datasets program, we work with AWS customers to experiment with ways that the cloud can make data more accessible to more people. Most of our AWS Public Datasets are made available through Amazon S3 because of its tremendous flexibility and ability to scale to serve any volume of any kind of data files. What’s exciting about the USAspending.gov database is that it provides a great example of how Amazon RDS can be used to share an entire relational database quickly and easily. Typically, sharing a relational database requires extract, transfer, and load (ETL) processes that require redundant storage capacity, time for data transfer, and often scripts to migrate your database schema from one database engine to another. ETL processes can be so intimidating and cumbersome that they’re effectively impossible for many people to carry out.

By making their data available as a public Amazon RDS snapshot, the team at USASPending.gov has made it easy for anyone to get a copy of their entire production database for their own use within minutes. This will be useful for researchers and businesses who want to work with real data about all US Government spending and quickly combine it with their own data or other data resources.

Deploying the USASpending.gov Database Using the AWS Management Console
Let’s go through the steps involved in deploying the database in your AWS account using the AWS Management Console.

  1. Sign in to the AWS Management Console and select the US East (N. Virginia) region in the menu bar.
  2. Open the Amazon RDS Console and choose Snapshots in the navigation pane.
  3. In the filter for the search bar, select All Public Snapshots and search for 515495268755:
  4. Select the snapshot named arn:aws:rds:us-east-1:515495268755:snapshot:usaspending-db.
  5. Select Snapshot Actions -> Restore Snapshot. Select an instance size, and enter the other details, then click on Restore DB Instance.
  6. You will see that a DB Instance is being created from the snapshot, within your AWS account.
  7. After a few minutes, the status of the instance will change to Available.
  8. You can see the endpoint for your database on the main page along with other useful info:

Deploying the USASpending.gov Database Using the AWS CLI
You can also install the AWS Command Line Interface (CLI) and use it to create a DB Instance from the snapshot. Here’s a sample command:

$ aws rds restore-db-instance-from-db-snapshot --db-instance-identifier my-test-db-cli \
  --db-snapshot-identifier arn:aws:rds:us-east-1:515495268755:snapshot:usaspending-db \
  --region us-east-1

This will give you an ARN (Amazon Resource Name) that you can use to reference the DB Instance. For example:

$ aws rds describe-db-instances \
  --db-instance-identifier arn:aws:rds:us-east-1:917192695859:db:my-test-db-cli

This command will display the Endpoint.Address that you use to connect to the database.

Connecting to the DB Instance
After following the AWS Management Console or AWS CLI instructions above, you will have access to the full USAspending.gov database within this Amazon RDS DB instance, and you can connect to it using any PostgreSQL client using the following credentials:

  • Username: root
  • Password: password
  • Database: data_store_api

If you use psql, you can access the database using this command:

$ psql -h my-endpoint.rds.amazonaws.com -U root -d data_store_api

You should change the database password after you log in:

ALTER USER "root" WITH ENCRYPTED PASSWORD '{new password}';

If you can’t connect to your instance but think you should be able to, you may need to check your VPC Security Groups and make sure inbound and outbound traffic on the port (usually 5432) is allowed from your IP address.

Exploring the Data
The USAspending.gov data is very rich, so it will be hard to do it justice in this blog post, but hopefully these queries will give you an idea of what’s possible. To learn about the contents of the database, please review the USAspending.gov Data Dictionary.

The following query will return the total amount of money the government is obligated to pay for contracts awarded by NASA that include “Mars” or “Martian” in the description of the award:

select sum(total_obligation) from awards, subtier_agency 
  where (awards.description like '% MARTIAN %' OR awards.description like '% MARS %') 
  AND subtier_agency.name = 'National Aeronautics and Space Administration';

As I write this, the result I get for this query is $55,411,025.42. Note that the database is updated nightly and will include more historical data in the coming months, so you may get a different result if you run this query.

Now, here’s the same query, but looking for awards with “Jupiter” or “Jovian” in the description:

select sum(total_obligation) from awards, subtier_agency
  where (awards.description like '%JUPITER%' OR awards.description like '%JOVIAN%') 
  AND subtier_agency.name = 'National Aeronautics and Space Administration';

The result I get is $14,766,392.96.

Questions & Comments
I’m looking forward to seeing what people can do with this data. If you have any questions about the data, please create an issue on the USAspending.gov API’s issue tracker on GitHub.

— Jed