Best Practices in Ethical Data-Sharing: An Interview with Natalie Evans Harris

This blog post is part of a blog series on “Open Data for Public Good,” a collaboration between the AWS Institute and AWS Open Data aimed at identifying emerging issues around open data and offering best practices for data practitioners. Read the first post here and the second post here.

The AWS Institute interviewed Natalie Evans Harris, co-founder and CEO of BrightHive and former senior policy advisor to the US Chief Technology Officer in the Obama administration. Harris founded the Data Cabinet, a federal data science community of practice with over 200 active members across more than 40 federal agencies, co-led a cohort of federal, nonprofit, and for-profit organizations to develop data-driven tools through the Opportunity Project, and established the Open Skills Community through the Workforce Data Initiative. She also led an analytics development center for the National Security Agency (NSA) that served as the foundation for NSA’s Enterprise Data Science Development program and became a model for other intelligence community agencies.

Harris shared best practices on data stewardship and responsible and effective use of data.

Why is it important for data to be shared?

Over the last decade, we have seen the power and impact of data in drawing insights, understanding the world around us, and making decisions to improve communities and human lives. When used correctly, data has the power to improve human life: medicine, healthcare, economic opportunity, and more.

Data is behind the decisions we make. Data influences how resources are allocated and how success is measured. We’ve seen limits on our ability to drive outcomes when data is in silos. It becomes difficult to answer questions like: who really utilizes the healthcare system? How do we predict the skills that will be needed to fill future roles?

Data sharing becomes critical because the only way to have a holistic view of what’s happening is by bringing together data that’s currently sitting in separate places and governed under a privacy framework that discourages cross-agency collaboration.

Can you share examples of how open data has addressed real-life problems?

When the Workforce Innovation and Opportunity Act (WIOA) was passed, it required state and local governments to better measure return on investment in training programs by collecting data about training providers that receive grants.

The challenge with data-driven policies such as this one is that measuring outcomes requires bringing together administrative data that sits in separate state agencies. For WIOA, states must bring together education data from their education agency and wage and employment data from their labor agency. Then they have to figure out how to bring in data that doesn’t even sit in government but with training providers: some public, some private, some community colleges, some that have received grants, and others that haven’t. The lack of standard definitions and rules governing data sharing means manual, bureaucratic processes to collect, analyze, and share this data.

To get ahead of these challenges, the Open Skills Community, led by the University of Chicago, worked with experts across government, nonprofits, and the private sector to develop an open data standard for easing the collection, linking, and analysis of these data sources. Tools can be developed on top of this open data standard to automate much of this process and allow individuals to focus on drawing insights from the data. It’s one thing to collect data and a whole other issue to draw insights from it. Having open data standards makes data accessible and usable. Because it’s an open standard, it can be leveraged by anyone and can evolve in a transparent fashion.

How does cloud technology change how we share data?

I read in Forbes that over 2.5 quintillion bytes of data are collected daily. For that collection to lead to meaningful insights, a cloud technology solution is often the best way to collect, store, and protect that data. Where we seem to fall short is with the tools built on top of the technology for accessibility and usability of the data. When organizations invest heavily in cloud infrastructure but fail to invest in the processes, policies, and training that govern the use of the data, moving to the cloud can become a frustrating experience.

What are some best practices you can share so data is shared responsibly and ethically?

The most important best practice I can recommend is to focus on making ethics and responsible data use a part of your organizational culture. There are some things that can be done through technology, but ethics is driven by people, not tools. Make sure that you have a code of ethics, a set of principles that guides your teams in the decisions they make and can be used to build trust within organizations and with the communities you serve.

As government becomes more digitized, consent based on empowering data owners will be critical. A set of principles that encourages questions around how you’re making sure that data is protected, that datasets are large and diverse enough, and that questions are framed adequately keeps ethics at the top of people’s minds throughout the data lifecycle. At BrightHive, we instituted an ethical design checklist. Beyond that, it is also important to make sure that the ethics manifesto filters into your processes.

So what are the ethical implications of sharing data?

Tools help us draw insights and answer questions, but we as individuals have the responsibility to assess the quality of the answers. Tools help link and aggregate data and merge datasets, but biased algorithms can change the result. Tools can’t tell us whether it was a large enough dataset, or a diverse enough dataset to help us get the answers we’re looking for. What we’re starting to realize is that we rely too heavily on the technology. Machine learning and artificial intelligence can’t do the thinking for you. You still need people with subject matter expertise to analyze what the technology has done. Someone should be able to analyze the results and deduce what is happening with the data.

The interaction between technology and people is at the crux of the ethical issue: we need to ask the right questions and understand what conclusions we are drawing from the data and what their implications are.

Who is responsible when data isn’t shared responsibly?

When I hear about a breach, the first thing I want to know is where things went wrong in the process. Not for blame, but for gaps that caused the leak. A sound process with the right checks along the way minimizes your risk of exposing data inappropriately. At the collection process, is there a check? When you get to the analysis, was there a check in place? I look at where in the process there was a failure.

What is a data steward, and what are their responsibilities?

The fun part of the data steward role is in creating, optimizing, and examining the end-to-end life cycle. You’re responsible for caring for data from the minute it enters your environment until it leaves, and for working with your organization to make sure all the steps are in place. It’s a process that includes documenting and understanding the organization’s operating model, as well as educating staff on what it means.

A data steward is the heart of the data. That person can be called a Chief Data Officer (CDO) and can delegate responsibilities to other stewards, but there must be one person who owns the data journey within your organization.

Do you foresee this role changing in the future?

Yes, data roles will evolve in organizations to meet the needs of new regulations. Data stewards will be responsible for making sure compliance is treated not as a checkbox but as a part of the culture of the organization.

Data is rarely a technology problem; rather, it’s a cultural problem. Data stewards have to work with multiple parts of an organization to lead the charge on strategic data use for that organization. That’s not solely a legal or technological job: it takes someone who can bring together all of these different aspects of using data.

How do you measure whether data was used effectively?

The first thing to look at is why data was collected. At the beginning of the project, you want to define the measures of success: not money and mission, but real, tangible community impact. If you can visualize that impact and you know the questions you need to answer to get there, knowing whether you collected and used data effectively becomes easy.

So many organizations today have troves of data sitting around, not archived or used to answer mission-related questions. Why are you collecting the data? If you’re collecting it and not using it, that’s a misuse of data. It’s irresponsible, and it increases your security risk.

MORE INFORMATION

For more on the role of the chief data officer, listen to our podcast with Jed Sundwall, AWS Open Data Global Lead, here. You can learn more about the AWS Open Data program at https://opendata.aws.

A post by Maysam Ali, Content Manager, AWS Institute, AWS

AWS Public Sector Blog