How to Share Data (Hint: “Thoughtfully”)
This blog post is part of a blog series on “Open Data for Public Good,” a collaboration between the AWS Institute and AWS Open Data aimed at identifying emerging issues around open data and offering best practices for data practitioners. Read the first post here.
Sharing data requires more than just making it available for download or creating an API to access it. In many ways, sharing data is similar to shipping a software product. Just like software, data is made up of digital information; it requires documentation; it will be used by groups of users who may require support; and it may become vital to those users’ work. Another characteristic data shares with software is that it often gets updated over time, as its maintainers learn from their users and adapt to new technologies.
So how should you share data? Thoughtfully. There are technical and non-technical considerations. Below, we explore the non-technical considerations, which require a focus on data governance and community engagement.
For data to be useful, it must come from a reliable source that users trust. If users do not believe that data has been produced or documented with sufficient rigor, they will be less likely to rely on it. The USGS Landsat program provides an important lesson in trust.
Launched in 1972, the Landsat program has been a reliable source of data and imagery of the Earth for decades. This continues to be the case despite an incident in 2003 when one of the instruments on Landsat 7 failed, which caused gaps in data produced by the sensor from that point on. The Landsat team worked to document how users could still use data from Landsat 7 despite the limitations imposed by the instrument failure. This transparency in operations helped maintain the trust of users in data generated by the program. If users understand the value and limitations of data, they will be more likely to use it, even if it’s not perfect. This requires open and clear communication with users.
Jupiter: Hubble’s decades of observations of the planets in the outer Solar System allow astronomers to study their seasonal variations and provide support for NASA’s dedicated suite of spacecraft that visit these celestial bodies. Credit: NASA, ESA, and A. Simon (NASA/GSFC).
If data is not documented, the audience will be limited. There are times when users will do the detective work required to interpret poorly documented data, but a lack of documentation will usually frustrate users to the point that they will not trust the data. Users should be able to understand when data was created, the methodology used to create it, how to interpret the values contained in it, and whether any licenses limit how the data can be used. Documentation should also include a way to contact someone who can answer questions about the data. Ideally, documentation should include tutorials that users can follow to get hands-on experience with the data.
Developers will not put in the effort to create tools or applications based on data if they have no assurance that the data will be available in the future. Ensuring that data will be available on an ongoing basis is especially important when sharing large volumes of data.
The cloud provides infrastructure for storage, low-latency access, and transfer of data. This becomes more important as data volumes grow. When data is shared in the cloud, users can access data quickly and directly from the source, which assures them that they can reliably access a trustworthy copy of the data without the need to duplicate storage in their own account.
In an era where governments and organizations are opening their data up to the public, making sure that data is shared in a deliberate and transparent way is key to establishing trust with users and increasing the utility of data. For information on technical considerations behind sharing data, visit opendata.aws.
A post by Jed Sundwall, Manager, Open Data Program, AWS