AWS Public Sector Blog
How the Nonprofit Open Data Collective Came Together to Work on IRS 990 Data in the Cloud
Form 990 is used by the United States Internal Revenue Service (IRS) to gather financial information about nonprofit organizations. In July 2016, the IRS started making electronic IRS 990 filing data available via the AWS Public Datasets Program. By making electronic 990 filing data available in this way, the IRS made it possible for anyone to programmatically access and analyze information about individual nonprofits or the entire nonprofit sector in the United States. Since the data became available, a consortium of nonprofits and researchers has come together as the Nonprofit Open Data Collective to analyze the data and identify ways to make it more useful.
We spoke with Lindsey Struck, Director of Program Development at Charity Navigator, to learn more about the Nonprofit Open Data Collective and what they’ve accomplished.
What makes 990 data so valuable?
The nonprofit tax filing conveys a detailed narrative about business structure, operations, expenses, and personnel. Taken in aggregate and over time, these data allow us to build a clearer picture of the factors that drive success and prosperity in the nonprofit sector. They can also act as a launching point for deeper dives. For example, the 990 asks for a website URL. We can use that to perform text analysis on websites across the sector, potentially allowing for the creation of a taxonomy far more sensitive and scalable than those the sector has used in the past.
Who is in the Nonprofit Open Data Collective?
Members currently include: Aspen Institute, Urban Institute, Charity Navigator, GuideStar, Arizona State University (Jesse Lecy), Citizen Audit, Chronicle of Philanthropy, BoardSource, Public Sector Credit (Marc Joffee), Datalake (Jon Durnford), Classy (Ben Cipollini), Syracuse University (Francisco Santam), Carleton University (Nathan Grasse), and Indiana University Lilly Family School of Philanthropy (Kirsten Gronbjerg).
How did this consortium get started?
There is widespread recognition that open data represents both an opportunity and a challenge. It’s here and it’s not going anywhere, so there is a shared sense that we should seize the opportunity and innovate around it. The original consortium grew out of a conference session involving BoardSource, Aspen Institute, Urban Institute, GuideStar, and Jesse Lecy. Charity Navigator was a relative latecomer. When David Borenstein, our Director of Data Science, heard about it, we shared some of the progress already made, and that work became the basis for the “datathon” that took place in Washington, DC.
When the 990 dataset was first made available, it needed a lot of cleaning to be useful. A consensus emerged early on that it would be in the sector’s mutual interest to share the burden of that work and make the cleaned data a common resource for individual projects. We also came to feel that we would all benefit from more collaborative research, and that has been the driving force behind our continued efforts.
How did having 990 data available in the cloud help the consortium collaborate?
Digitizing the 990 data and making it openly available in the cloud has been critical at multiple stages in this journey to open up nonprofit data. At the outset, and for Charity Navigator’s day-to-day business, having the most up-to-date eFile data available on demand allowed us to build labor-saving production processes around the AWS-hosted data. As we moved into the consortium stage, cloud hosting allowed us to distribute iterations of the cleaned data and to build open-source software that automatically pulls the latest version. The low cost and easy provisioning of AWS lets us maximize our scarce human capital by intensifying our use of compute resources.
What has the consortium accomplished so far?
The IRS released over 25 different electronic formats describing the IRS Form 990, and one of the biggest challenges was to reconcile them. We extracted all of these mappings and we created a provisional concordance (or “Rosetta Stone”) to allow us to translate between them. We are now in the process of using this concordance to generate an easier-to-use dataset for public consumption.
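To make the idea of a concordance concrete, here is a minimal Python sketch of how such a lookup might translate fields from different schema versions into one standardized name. The version labels, XPaths, variable names, and the `standardize` helper are all hypothetical and chosen for illustration; they are not the Collective's actual mappings or code.

```python
# Hypothetical concordance: (schema version, XPath in that version)
# -> standardized variable name. All entries are illustrative only.
CONCORDANCE = {
    ("2012v2.1", "/Return/ReturnData/IRS990/TotalRevenueCurrentYear"): "total_revenue",
    ("2015v2.1", "/Return/ReturnData/IRS990/CYTotalRevenueAmt"): "total_revenue",
    ("2012v2.1", "/Return/ReturnData/IRS990/TotalExpensesCurrentYear"): "total_expenses",
    ("2015v2.1", "/Return/ReturnData/IRS990/CYTotalExpensesAmt"): "total_expenses",
}


def standardize(version, raw_fields):
    """Translate {xpath: value} pairs from one schema version.

    Returns a dict keyed by standardized names, plus a list of XPaths
    that have no entry in the concordance for that version.
    """
    standardized = {}
    unmapped = []
    for xpath, value in raw_fields.items():
        name = CONCORDANCE.get((version, xpath))
        if name is None:
            unmapped.append(xpath)
        else:
            standardized[name] = value
    return standardized, unmapped


# Two filings in different (hypothetical) schema versions end up
# with the same standardized field name.
old_filing, _ = standardize(
    "2012v2.1",
    {"/Return/ReturnData/IRS990/TotalRevenueCurrentYear": "1000"},
)
new_filing, _ = standardize(
    "2015v2.1",
    {"/Return/ReturnData/IRS990/CYTotalRevenueAmt": "1000"},
)
```

Keying the lookup on both the version and the path keeps each format's mapping explicit, and returning the unmapped paths makes gaps in the concordance easy to spot.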
What are the biggest challenges you foresee?
Once the group finishes and posts its work, we will need to continue to provide updates as new Form 990s become available from the IRS. We will need to determine a way to sustain our momentum. Another challenge is that the IRS releases only electronically filed 990s, which account for about 60% of all Form 990s. The remaining 40% of tax forms are paper-filed and are not released as open data. The Aspen Institute’s Nonprofit Data Project has been leading an effort to require all nonprofits to file tax forms electronically, coupled with the release of the data by the IRS. Legislation (the CHARITY Act) has been proposed by leading members of the Senate Finance and House Ways and Means committees.
What’s next for the consortium?
We have a “Validatathon” scheduled for November 1-2, during which we will check the validity of our mappings between the original XML and the standardized structure. We currently have all of the mappings in a columnar format. We plan to use Amazon Athena to make this table queryable, and then have each participant access it through an Amazon Elastic Compute Cloud (Amazon EC2) instance running RStudio. Giving every participant the same standardized environment will allow us to work quickly and avoid any hardware or consistency issues.
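As a rough illustration of what checking a mapping against the original XML could involve, the following Python sketch resolves each mapped path in a sample filing and reports what it finds. This is not the Athena and RStudio workflow described above. The sample XML, the paths, the variable names, and the `check_mappings` helper are hypothetical, and the sample omits the XML namespace that real filings may declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified filing; real filings are far larger and
# may declare an XML namespace, which this sketch leaves out.
SAMPLE_XML = """<Return>
  <ReturnData>
    <IRS990>
      <CYTotalRevenueAmt>1000</CYTotalRevenueAmt>
    </IRS990>
  </ReturnData>
</Return>"""

# Hypothetical mapping rows: standardized name -> path relative to
# the root <Return> element.
MAPPINGS = {
    "total_revenue": "ReturnData/IRS990/CYTotalRevenueAmt",
    "total_expenses": "ReturnData/IRS990/CYTotalExpensesAmt",
}


def check_mappings(xml_text, mappings):
    """Return {standardized name: element text, or None if the path is absent}."""
    root = ET.fromstring(xml_text)
    report = {}
    for name, path in mappings.items():
        node = root.find(path)
        report[name] = node.text if node is not None else None
    return report


report = check_mappings(SAMPLE_XML, MAPPINGS)
for name, value in report.items():
    status = "found" if value is not None else "missing"
    print(f"{name}: {status}")
```

A path that is missing from one filing does not by itself mean the mapping is wrong, since many fields are optional. A path that never resolves across a large sample of filings is a stronger signal that the mapping needs review.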
Beyond those events, several member groups are beginning to create public websites and other tools for dissemination of 990 data. Our group has grown a great deal, and the contributions of each participating organization are deeply interconnected. We are developing a roadmap for public dissemination of work that continues to incentivize a spirit of collaboration.
Learn how to access IRS 990 Filings on AWS.