AWS Public Sector Blog

Data.world Census Data Now Available as AWS Public Dataset

The American Community Survey (ACS) Public Use Microdata Sample (PUMS) is now available as an AWS Public Dataset. AWS and data.world collaborated to make the data available for analysis in the cloud. Now, anyone can access and analyze PUMS data in the cloud without needing to download and store their own copy.

The ACS is the largest and most up-to-date annual survey conducted by the US Census Bureau, and it details information about the American people and their housing units. It influences $400 billion in annual spending, and local officials, community leaders, and businesses rely on its data to understand the changes taking place in their communities.

By hosting this important data where it can be quickly and easily processed with elastic computing resources, AWS hopes to enable more innovation, more quickly. Learn how to access the data at the ACS PUMS on AWS Public Dataset landing page.

We spoke with Jonathan Ortiz, data scientist and knowledge engineer at data.world, and discussed how people can use and work with the ACS data on AWS to help foster a more informed population.

Q. What are the biggest challenges associated with ACS data usage?

The biggest challenge with ACS data usage is its steep learning curve. The learning curve is a byproduct of a very positive aspect of the ACS: it’s so big. It covers an exhaustive number of attributes about people and housing units for practically any geography you could think of, which is fantastic, but datasets this comprehensive tend to have steep learning curves.

Q. How was ACS data previously made available for use?

There are two ACS data releases: pre-tabulated Summary Files, which are aggregated population estimates by geography; and microdata, which is the non-aggregated, individual record-by-record view of the population. There are countless ways to access the Summary Files, including the Census API and data.world, but there aren’t many options for the microdata, which is what we focused on for this AWS public dataset.

Until now, ACS microdata has been made available via FTP as raw .csv files, and its metadata has been stored in human-readable data dictionaries separate from the data, which required users to constantly refer back and forth between the data and the data dictionary.
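To make that back-and-forth concrete, here is a minimal Python sketch of what working with a coded .csv and a separate data dictionary can look like. The sample rows, the column names, and the hand-transcribed dictionary are assumptions made up for illustration; they are not an excerpt of the actual PUMS files.

```python
import csv
import io

# Hypothetical extract of a raw microdata file: values are stored as codes.
RAW_CSV = """SERIALNO,SEX,AGEP
0001,1,34
0002,2,51
0003,2,27
"""

# Hypothetical slice of a separate data dictionary, transcribed by hand.
DATA_DICTIONARY = {
    "SEX": {"1": "Male", "2": "Female"},
}


def decode_rows(csv_text, dictionary):
    """Replace coded values with labels wherever the dictionary has an entry."""
    decoded = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        decoded.append({
            column: dictionary.get(column, {}).get(value, value)
            for column, value in row.items()
        })
    return decoded


rows = decode_rows(RAW_CSV, DATA_DICTIONARY)
for row in rows:
    print(row)
```

Every coded column needs its own entry in the lookup table before the values become readable, which is the manual step the separate data dictionary imposes.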

Q. Your National Science Foundation (NSF)-funded work transforms the ACS into a graph database – what are the benefits of this approach?

Making the ACS data available as a graph database helps improve usability. Big, raw .csv files can be unwieldy, so putting everything into a secure, queryable database means you don’t have to work with raw files or store anything. Just query what you need!

By using graph data, you can store the metadata along with the data itself. Data.world has removed the data from its silo, so now others can use this as a foundation to link their data to the Census. For example, there are many ACS-powered projects, apps, and analyses using data.world, such as Data For Democracy’s election transparency analysis, which aims to uncover trends and anomalies in US elections.
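As a rough illustration of storing metadata alongside data, the sketch below keeps both as (subject, predicate, object) triples in one plain Python list. The identifiers, the record, and the labels are hypothetical, and this is not data.world's actual schema or a real graph database; it only shows the general idea.

```python
# Each fact is a (subject, predicate, object) triple.
# Data and metadata live in the same store.
TRIPLES = [
    # Data: one hypothetical person record.
    ("person/1", "SEX", "2"),
    ("person/1", "AGEP", "51"),
    # Metadata: what the variables and their codes mean.
    ("SEX", "label", "Sex"),
    ("SEX/2", "label", "Female"),
    ("AGEP", "label", "Age"),
]


def objects(triples, subject, predicate):
    """Return every object stored for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe(triples, subject):
    """Build a readable view of a record using labels found in the same store."""
    described = {}
    for s, variable, value in triples:
        if s != subject:
            continue
        var_label = objects(triples, variable, "label")
        value_label = objects(triples, f"{variable}/{value}", "label")
        key = var_label[0] if var_label else variable
        described[key] = value_label[0] if value_label else value
    return described


summary = describe(TRIPLES, "person/1")
print(summary)
```

Because the labels sit next to the values, a single lookup returns a readable record without consulting a separate document.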

Q. What do you see as the advantage of making this data available on AWS?

AWS cloud computing serves many users, so hosting the data on AWS makes it easier for them to get ACS microdata and use it. It extends the audience of the work we’ve done by orders of magnitude. Developers will now be able to build websites and software applications using the ACS microdata, putting information in front of the American people, which can help foster a more informed public.

Learn how to access the data at the ACS PUMS on AWS Public Dataset landing page.