Data Lakes Are Not Evil — They Just Need To Be Properly Utilized
- Written by Michael Hiskey, Semarchy
- Published in Demanding Views
Imagine a place where all business data could be stored in its source form, waiting to be analyzed on demand. That place is the data lake, and when properly utilized, it can be one of a business’s most valuable assets, saving money and increasing ROI.
So, it’s been surprising to see data lakes labeled “evil.” The anxiety about data lakes turning into data swamps suggests that businesses don’t really know what to do with their data, and haven’t realized that smart solutions have kept up with the changing reality of data management.
To give “big data” a backbone, businesses relied on new open-source software like Hadoop. Efforts like NoSQL tried to make sense of “unstructured” information. The thinking was: If it’s possible to retain 100% of the data in our midst, why shouldn’t we?
To find value, just hire a data scientist to analyze it. And if that didn’t work out, no problem: the business hadn’t broken the bank. They were no longer paying hefty fees per terabyte, as with the data warehouse model, where, in 2011, a single terabyte could cost $25,000, an unsustainable price in the digital world.
Against these realities, the easiest action is to wave the white flag. Some organizations simply chose to opt out of the EU market the day GDPR went live. But painting GDPR as a data bogeyman misses the point completely, and can prevent businesses from seeing it as an opportunity.
In effect, GDPR stipulates that all data is important because it’s human data. This includes seemingly innocuous data, like weblogs. Because we use data for everything, our internet usage reveals our daily wants and needs. A stream of it can reveal who we are as people. GDPR asks organizations to respect their customers’ data no differently than they would respect them in a real-life interaction.
Rather than say, “let’s stop filling our data lakes, the whole idea was dubious from the beginning,” businesses should learn to use data lakes the right way. Doing so means committing to concepts that are less sexy than “big data” but critical to extracting value from the data lake, including master data management (MDM), data governance (DG) and data workflows.
When combined into a single-view-of-customer solution known as the “data hub,” MDM, DG and data workflows allow organizations to keep all their data in a lake and combine it with data that exists beyond their organization, all while knowing whom it belongs to. For example, the American Association of Insurance Services (AAIS), in the highly regulated insurance industry, needs to be able to comb through its data lakes and make the data available in an organized format.
This level of “traceability” makes it easy to “self-audit” data and address it when necessary, which is all too important in a world where GDPR gives EU customers the global right to data “erasure,” “portability” and more.
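To make the idea of traceability concrete, here is a minimal Python sketch of the kind of index a data hub might keep over a lake. It is an illustration under stated assumptions, not a description of any vendor's product: the names `LakeRecord`, `DataHub`, `register`, `export` and `erase` are hypothetical, and a real hub would also handle identity matching, audit logs and deletion in the underlying storage.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LakeRecord:
    """A raw record kept in the lake in its source form (hypothetical shape)."""
    source: str      # where the record came from, e.g. "weblogs" or "crm"
    record_id: str   # identifier within that source
    payload: str     # the raw data itself


@dataclass
class DataHub:
    """Toy index mapping a master customer ID to the lake records about that person."""
    records_by_customer: dict = field(default_factory=lambda: defaultdict(list))

    def register(self, customer_id: str, record: LakeRecord) -> None:
        # Governance step: every record entering the lake is tagged with its owner.
        self.records_by_customer[customer_id].append(record)

    def export(self, customer_id: str) -> list:
        # "Portability": return a copy of everything held about one person.
        return list(self.records_by_customer.get(customer_id, []))

    def erase(self, customer_id: str) -> int:
        # "Erasure": drop all records for one person and report how many were removed.
        return len(self.records_by_customer.pop(customer_id, []))


if __name__ == "__main__":
    hub = DataHub()
    hub.register("cust-1", LakeRecord("weblogs", "w-1", "GET /pricing"))
    hub.register("cust-1", LakeRecord("crm", "a-9", "name=Ana"))
    print(len(hub.export("cust-1")))  # number of records traced to cust-1
    print(hub.erase("cust-1"))        # number of records removed on request
```

The point of the sketch is the design choice rather than the code: ownership is recorded at the moment data enters the lake, so a later erasure or portability request becomes a lookup instead of a search.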
Given this need, it’s dismaying to see arguments to toss out everything we learned in the “big data revolution” of 2012. Instead, we ought to think carefully about whose data we have and add sensible tracking on top, in the form of a data hub. As Gartner says, “data lake” may be a bit of a misnomer, but the concept is here to stay.
Michael Hiskey is the Head of Strategy at Semarchy and a long-time data industry executive. An accomplished writer, speaker and blogger, he spends much of his time thinking about innovations that impact the chief data officer and related business functions in organizations around the world. For 20 years, he has spearheaded projects in marketing, sales, software development and customer success at companies like IBM, MicroStrategy, Kognitio and Trifacta.