AWS Startups Blog

Data Lakes: How to Weather a Data Superstorm

Guest post by Stephen Campbell, Director, Technology Partnerships at Collibra

Hurricanes are a force of nature, flooding cities and towns with deluges of water that can be impossible to control. When I lived in Texas and Mississippi, every year, my neighbors and I waited with bated breath to see if “the big one” was going to hit and upend our lives. While I was lucky enough to avoid disaster, I knew I had to be prepared to face not only the impact of pounding wind and rain, but almost more critically, the aftermath of any huge natural disaster. In a way, it’s kind of like what businesses are facing with data.

By now, we’ve all seen the statistics. The data universe is doubling in size every two years, which translates into a 50-fold growth from 2010 to 2020. It’s a superstorm of data that isn’t letting up anytime soon. Businesses everywhere are desperately trying to manage the flood of information pouring in from every direction. For many, it’s truly a “water, water everywhere, nor any drop to drink” scenario.

To manage this extreme data growth, organizations are looking to data lakes, such as those provided by AWS, to help them control and consolidate the storm of data. It seems like a reasonable strategy: to catch and contain your data in a single spot. But what many organizations are finding is that simply pouring the data into a data lake and hoping that the business will use it is no longer enough.

As I talk with customers, I find that when it comes to data lakes, there are two scenarios: 1) they’re building a new data lake from scratch, possibly in preparation for the storm, or 2) they’re trying, often desperately, to clean up the data lake (swamp?) in the aftermath of the flood.

If you’re fortunate enough to be building a new lake, then I believe you’re one of the lucky ones. See, you have a blank slate. You can embrace all the right data governance policies and practices right from the start, ensuring that data in your lake is identifiable and has context and meaning. You know its authoritative sources.

You can also identify where the data is coming from. And you can find out if—and how—it is being used by the business. You can prioritize the data coming into the lake based on usage. And you can use a data catalog to ingest data that is easily understood and easily trusted. Your data lake will collect the right data elements and will remain unsullied by poorly described data. You’ll provide the data catalog to your business users, and they will easily find the data sets they need, understand their lineage, meaning, and use, and trust that the data they are using is right. And that increases the odds that your business users will actually use the lake that you’re building. Sounds great, right?

But for those of you facing scenario two, you’re dealing with the aftermath of the storm. You know what I’m talking about. What started out as a calm, peaceful rainy day quickly turned into the dirty, sloppy raging waters left behind by intense wind and downpours. Eventually, the raging waters die down, but it takes a great deal of effort to clean up the mess they leave behind. Far too often, I talk with customers who have a data lake that resembles the mucky aftermath of a storm. Why? Because they failed to put in place governance policies and practices before they dumped data into the lake, and now they are left mopping up the mess.

But there’s good news for those of you facing the dirty data lake scenario as well. You, too, can benefit from a governed data catalog. See, a data catalog will help you understand, and document, what you have in your lake. It will create an inventory of data, including attributes that indicate its quality, its definition, its lineage, and its recommended use. A catalog will also promote collaboration and will help you crowdsource information about the data from the people who are using it. So if Ram in marketing used a data set and found it to be incomplete, he can make a note in the catalog for others to see. And if Ann in accounting found a particular data set to be fit for purpose and highly trustworthy, she can document that as well. Further, by linking your data catalog to your business glossary, you can make it really easy for the business users searching the lake for data to know exactly what the data means and whether or not it’s the right data for the job at hand.
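To make the idea of a catalog entry a little more concrete, here is a minimal sketch in Python. It is only an illustration, not how Collibra or any AWS service models a catalog: the `CatalogEntry` and `Catalog` classes, their field names, and the sample data set names are all hypothetical, chosen to mirror the attributes described above (quality, definition, lineage, recommended use) plus the notes that users like Ram and Ann might leave.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CatalogEntry:
    """One data set in the lake (hypothetical structure, for illustration only)."""
    name: str
    definition: str            # what the data means, e.g. a business glossary term
    lineage: List[str]         # names of the upstream sources it was derived from
    recommended_use: str       # what the data is considered fit for
    quality: str = "unknown"   # free-text quality indicator until someone assesses it
    notes: List[Tuple[str, str]] = field(default_factory=list)  # (author, comment)

    def add_note(self, author: str, comment: str) -> None:
        """Record crowdsourced feedback from someone who used the data set."""
        self.notes.append((author, comment))


class Catalog:
    """A tiny in-memory inventory of catalog entries, keyed by data set name."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def get(self, name: str) -> CatalogEntry:
        return self._entries[name]

    def search(self, term: str) -> List[CatalogEntry]:
        """Return entries whose name or definition contains the term (case-insensitive)."""
        needle = term.lower()
        return [
            entry
            for entry in self._entries.values()
            if needle in entry.name.lower() or needle in entry.definition.lower()
        ]


# Example usage with made-up data sets.
catalog = Catalog()
catalog.register(CatalogEntry(
    name="campaign_responses",
    definition="Customer responses to marketing campaigns",
    lineage=["crm_export"],
    recommended_use="Campaign reporting",
))
catalog.register(CatalogEntry(
    name="general_ledger",
    definition="Posted accounting transactions",
    lineage=["erp_system"],
    recommended_use="Financial reporting",
))

# Ram flags a problem; Ann vouches for a data set she trusts.
catalog.get("campaign_responses").add_note("Ram", "Incomplete for some regions")
catalog.get("general_ledger").add_note("Ann", "Fit for purpose and trustworthy")

for entry in catalog.search("marketing"):
    print(entry.name, entry.quality, entry.notes)
```

A real catalog would add persistence, access control, and links to a business glossary, but even this small structure shows why the notes matter: the next person who searches for a data set sees what earlier users learned about it.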

Facing a data superstorm is becoming a way of life for modern business, just like hurricanes in the South. But with a bit of planning and a governed data catalog, I’m confident that you can weather the storm.

Michelle Kung

Michelle Kung currently works in startup content at AWS and was previously the head of content at Index Ventures. Prior to joining the corporate world, Michelle was a reporter and editor at The Wall Street Journal, the founding Business Editor at the Huffington Post, a correspondent for The Boston Globe, a columnist for Publisher’s Weekly and a writer at Entertainment Weekly.