Q&A: Open990 and How to Access Financial Indicators from Nonprofits
Last year, we shared a post about the Nonprofit Open Data Collective and how a consortium of nonprofits and researchers came together to make the data from the IRS’s 990 filings easier to work with. As a follow-up, we spoke with David Borenstein, one of the driving forces behind the Nonprofit Open Data Collective, about his new project: Open990.
Read on for his Q&A:
What is Open990?
Open990 is a free platform that provides unlimited access to financial indicators from over half a million nonprofits. We use “fuzzy” record linkage to track the compensation of particular employees and the expenses of particular programs from 2010 to the present. We provide over 250 indicators, and users do not need to register for access. We also provide free downloads of data extracts covering over 7,000 unique data fields.
The e-file dataset has a reputation for being difficult to work with. How have you managed to tame it?
We started with techniques developed during the 990 “Datathon.” We join descriptions from the IRS schema files with counts and examples from the actual dataset, and then we manually create mappings. On top of this, we created an internal database to track situations in which field names, paper locations, or XML paths changed over time, and a highly efficient ETL pipeline that lets us regenerate the entire dataset in about 30 minutes.
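The record-linkage idea can be illustrated with a minimal sketch (not Open990’s actual pipeline) using Python’s standard-library `difflib`: officer names are normalized and then matched across filing years by string similarity. The names, the 0.85 threshold, and the data below are all illustrative assumptions.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and sort tokens so that
    'SMITH, JOHN A' and 'John A Smith' normalize identically."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(sorted(cleaned.split()))

def is_same_person(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Treat two officer names as the same person when their normalized
    forms are sufficiently similar (threshold is an arbitrary choice)."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio() >= threshold

# Link compensation records for the "same" officer across two filing years.
filings_2016 = [("SMITH, JOHN A", 120000)]
filings_2017 = [("John A Smith", 125000)]

for name_a, comp_a in filings_2016:
    for name_b, comp_b in filings_2017:
        if is_same_person(name_a, name_b):
            print(f"Matched {name_a!r} -> {name_b!r}: {comp_a} -> {comp_b}")
```

A production system would also weigh titles, organizations, and year adjacency, but the core step is the same: normalize, compare, and link above a similarity threshold.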
What have been the biggest challenges in working with the IRS 990 dataset?
The biggest issue is that the native file format is not amenable to analysis. There are three major format issues: the files are too small and too numerous, the format changes over time with little documentation, and the document structure is complex and varied.
You can deal with the speed issue by consolidating the millions of small XML files into JSON records in a NoSQL database. Keep in mind that the largest filings exceed MongoDB’s 16 MB document size limit.
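As a minimal sketch of that conversion (illustrative only, using Python’s standard library rather than a real MongoDB client), each filing can be parsed into a plain dictionary and serialized to JSON, with an explicit size check standing in for MongoDB’s 16 MB BSON limit. The sample XML and field names here are simplified assumptions; real filings also carry XML namespaces, which are stripped below.

```python
import json
import xml.etree.ElementTree as ET

MONGO_MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's BSON document size limit

def element_to_dict(elem):
    """Recursively convert an XML element into a plain dict/list/str structure."""
    node = dict(elem.attrib)
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        return text if not node else {**node, "#text": text}
    for child in children:
        tag = child.tag.split("}")[-1]  # drop any XML namespace prefix
        node.setdefault(tag, []).append(element_to_dict(child))
    return node

def filing_to_json(xml_string):
    """Convert one e-file filing to a JSON record, flagging oversized documents.
    (JSON byte length is used here as a rough proxy for BSON size.)"""
    record = element_to_dict(ET.fromstring(xml_string))
    doc = json.dumps(record)
    if len(doc.encode("utf-8")) > MONGO_MAX_DOC_BYTES:
        raise ValueError("Filing exceeds MongoDB's 16 MB document limit")
    return doc

sample = '<Return returnVersion="2013v3.0"><ReturnHeader><TaxYr>2013</TaxYr></ReturnHeader></Return>'
print(filing_to_json(sample))
```

Filings that trip the size check have to be split or stored out-of-band (for example, in object storage with a stub record in the database).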
The format changes are harder to deal with. Each filing declares a version attribute on its outermost element. The IRS previously provided schemas for some versions, but they have since been removed from the IRS website. As for dealing with complexity, there’s no way but the old-fashioned way: careful, considered manual curation.
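To illustrate how that version attribute can be put to work (a hypothetical sketch, not Open990’s internal database): the attribute on the outermost element can drive a per-version lookup table of paths, so the same logical field is found even when its location changes between schema versions. The paths and field names below are illustrative assumptions, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from schema version to the path for one logical field.
# Real paths did change across versions; these specific values are illustrative.
TOTAL_REVENUE_PATH = {
    "2012v2.1": "ReturnData/IRS990/TotalRevenue",
    "2013v3.0": "ReturnData/IRS990/CYTotalRevenueAmt",
}

def total_revenue(xml_string):
    root = ET.fromstring(xml_string)
    version = root.get("returnVersion")  # version attribute on the outermost element
    path = TOTAL_REVENUE_PATH.get(version)
    if path is None:
        raise KeyError(f"No field mapping for schema version {version!r}")
    node = root.find(path)
    return None if node is None else node.text

sample = (
    '<Return returnVersion="2013v3.0">'
    '<ReturnData><IRS990><CYTotalRevenueAmt>500000</CYTotalRevenueAmt></IRS990></ReturnData>'
    '</Return>'
)
print(total_revenue(sample))
```

Scaled up to thousands of fields across a dozen schema versions, a table like this is essentially a concordance, which is why maintaining it demands the manual curation described above.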
You are trying to start a business. Why give away the results of your analysis?
We use Open990 to showcase the ways in which natural language processing can enrich 990 data. We use semantic and statistical models to identify people, projects, and similar organizations. We share the people and projects data on Open990, and we use the similar-organizations data in our paid products. These include benchmarks, grantor/grantee leads, and marketing leads based on organizations that have similar missions, programs, and financials.
How has Open990 benefited from being on AWS?
Building semantic models of text is an iterative process, and so is figuring out how to map thousands of distinct XML fields, many of them open text, to hundreds of well-defined data points. Thanks to AWS Cloud Credits for Research, we are able to experiment with powerful machinery and explore what have traditionally been relatively expensive strategies, such as using semantic relationships between words to find common themes across organizations. Because we don’t have to buy any hardware, we can experiment with these techniques without worrying about what to do with the computers when we’re done.
What’s on the horizon for Open990?
We are working on an ontology for the nonprofit sector. When looking for similar organizations in the nonprofit sector, the most common resource is a taxonomy called the National Taxonomy of Exempt Entities (NTEE). The problem with NTEE is that organizations have to fit into a single bucket. But organizations often do work that touches many different issues and themes. The ontology will become a part of the “peer organizations” model that we use for our paid products, and will appear on the free Open990 site in the form of keyword tags.
Do you have any advice for those looking to work with the IRS dataset?
Try to avoid reinventing the wheel. IRSx can extract specific fields from 990 filings (tax year 2013 onward). If you need a larger dataset than IRSx can extract efficiently, or if you need older filings, see if you can find the fields you need in the e-file master concordance. The Nonprofit Open Data Collective maintains a great collection of links to resources for analyzing 990s.
If you’re interested in going deeper into David’s work on the 990 data, don’t miss his blog post “The IRS 990 e-file dataset: getting to the chocolatey center of data deliciousness.”