THE DATA LAKEHOUSE (updated)

Updated: Feb 7


A few days ago, Databricks published an interesting blog post titled "What is a Data Lakehouse?" Just when we thought things were stabilizing...


Here is a link to the blog post.

Image courtesy of Databricks (from the blog post)


Let's stay in the realm of structured internal data for a moment. The classic Data Warehouse integrated data sources with ETL and created a highly readable data model that people could consume. In many ways, we still need this today. Of course, it was highly expensive to build and would inevitably end up incomplete, mainly for ROI reasons. Then came the idea of the Data Lake, a dumping ground for all data in raw format in a lower-cost storage environment. Store it, and they will come. Of course, you had to organize it a little, but it helped in many cases by making development more flexible and deferring the structuring of the data until, and only if, you were going to consume it. Since not all data gets used, the ROI of data-gathering initiatives went up, because some of that expensive work that was never going to be needed in the first place simply never got done. Cool.
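To make the schema-on-read idea concrete, here is a minimal PySpark sketch. The paths, columns, and schema are my own illustrative assumptions, not anything from the Databricks post: the raw files sit in the lake untouched, and structure is only imposed by the consumer at read time.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Raw events were dumped into the lake as-is; nobody modeled them at write time.
raw_path = "s3://my-lake/raw/orders/"  # hypothetical location

# Structure is declared here, at read time, by the consumer who actually needs it.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ordered_at", TimestampType()),
])

orders = spark.read.schema(order_schema).json(raw_path)
orders.groupBy("customer_id").sum("amount").show()

The point of the sketch is simply that the expensive modeling work is paid for by the consumer who needs it, when they need it, which is exactly why unused data costs so little.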


We can talk about architecture all day, but extremes never quite do the trick, it seems. What was great about Data Warehouses was that the data was connected, harmonized, and clean. While we may not need a ready-to-consume data model all the time, we do need the data to fit some purpose before we can effectively consume it, even raw. We need it harmonized with standard definitions, clean or at least with its quality identified, and some of the critical entities connected. If you have ever tried to integrate and link data sets, you know this is one of the most challenging aspects of data pipelines.


So, in essence, Data Warehousing is too much upfront work, and the Data Lake is not enough. The argument being made is that we need a middle ground.


Of course, the way Databricks describes the features of a Data Lakehouse is congruent with what their platform enables. And it all makes sense. But still, isn't it just combining a Data Lake and one or more Data Warehouses on a single data platform framework? We have been doing that already, and the blog mentions it is what they see "emerging independently across many customers and use cases." So the question still stands: why a new term?


A Technical View


Simon Whiteley, of Advanced Analytics, wrote a blog piece offering a technical perspective on the Databricks post, with a good rundown on whether, technically, we are ready for a unified platform that does it all. Simon is a Microsoft Data Platform MVP, so he goes into a lot of detail with respect to Azure. I can support him in saying that neither AWS nor GCP is ready either.


In the Data Warehouse world of old, with its independent ETL and reporting solution companies, there was eventually consolidation. Companies bought other companies and expanded their offerings. Databricks is great at what it does, but it alone cannot yet provide all the pieces of the technical Data Lakehouse. Check out Domo, which has been going in that all-in-one direction. Yet it can't do what Databricks does. Get it? We still have a way to go - I agree with Simon.


Master's Thesis: A Data Architecture View


I found out that Pedro Javier Gonzales Alonso, at Barcelonatech, mentioned the term Data Lakehouse in his master's thesis back in 2016! I am not sure if he is the one who coined the term, but here are some of the points he makes.

Image from the Thesis of Pedro Javier Gonzales Alonso


Between no schema and a static relational schema, we need a flexible schema. To achieve this, Pedro talks of ontologies, levels of abstraction, and a "global schema." What is interesting with new concepts is that not everyone agrees on the definitions. Databricks includes all data, unstructured too. But Pedro does not:


Image from the Thesis of Pedro Javier Gonzales Alonso
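As a rough illustration of what a "global schema" over a flexible lake could look like (and nothing more than that: the canonical names, aliases, and paths below are made-up examples, not from the thesis), one could maintain a single mapping from each source's column names to a shared vocabulary and apply it at read time:

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("global-schema").getOrCreate()

# One canonical vocabulary, many source-specific spellings (all hypothetical).
GLOBAL_SCHEMA = {
    "customer_id": ["cust_no", "client_id", "customerId"],
    "product_id": ["sku", "prod_code", "productId"],
    "amount": ["amt", "order_value", "total"],
}

def to_global_schema(df: DataFrame) -> DataFrame:
    """Rename any recognized source column to its canonical name."""
    for canonical, aliases in GLOBAL_SCHEMA.items():
        for alias in aliases:
            if alias in df.columns:
                df = df.withColumnRenamed(alias, canonical)
    return df

# Each source lands with its own names; they all leave speaking the same language.
crm = to_global_schema(spark.read.parquet("s3://my-lake/raw/crm/"))
erp = to_global_schema(spark.read.parquet("s3://my-lake/raw/erp/"))

The schemas of the underlying sources stay flexible; only the shared vocabulary is fixed.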


Both Databricks and Pedro wish to simplify and "make some sense" of the data before it gets committed to the Data Lakehouse, without going too far: perform some common work that benefits all use cases, and connect things. That is a great objective, and one that always struck me as a considerable improvement over the "dumping ground" of the Data Lake. Data Catalogs do help here somewhat by providing some context, but the fact remains that in a Data Lake, someone has to do the work of cleaning, harmonizing, and linking.

Image from Dzone.com


If we look back at the Data Lake, and there are multiple versions of this diagram, we already had the "trusted" or "refined" zones, or even the "silver" and "gold" zones of Databricks' Delta Lake. What we put in there is a function of how people and pipelines will need to consume the data; after all, we are simply trying to assist consumption. For example, if everyone needs products reliably linked to clients, why should everyone have to do that same work? Referring to the diagram above, one merely has to clean and harmonize the fields used for linking, develop the logic that connects the entities, and publish and automate the result in the refined zone. Voilà.
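Here is a minimal sketch of that "do the linking once" idea, in PySpark with Delta output. The paths, table layouts, and key-cleaning rules are assumptions for illustration, not anything prescribed by Databricks.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("refined-zone").getOrCreate()

# Raw zone: data landed as-is (hypothetical paths and columns).
orders = spark.read.parquet("s3://my-lake/raw/orders/")
customers = spark.read.parquet("s3://my-lake/raw/customers/")

# Harmonize the fields used for linking: trim, upper-case, drop empty keys.
def clean_key(df, col):
    return df.withColumn(col, F.upper(F.trim(F.col(col)))).filter(F.col(col) != "")

orders = clean_key(orders, "customer_id")
customers = clean_key(customers, "customer_id").dropDuplicates(["customer_id"])

# Do the linking once, for everyone.
linked = orders.join(customers, on="customer_id", how="left")

# Publish to the refined (silver) zone so downstream consumers start here
# instead of repeating the same cleanup.
linked.write.format("delta").mode("overwrite").save("s3://my-lake/refined/orders_with_customers/")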


Because we want more and more people to have access to data, and to provide a single source of truth, some data architecture work can definitely be done to create organization-wide benefits: the 20% of work that gets consumers 80% of the way there. More importantly, because organizations are such models of sharing and communication, this will prevent that 20% of work from being repeated again and again. I know, that never happens.


There is a lot more to this, but my experience has been that if you understand the organization's data and the usage patterns, you can create this enhanced or trusted data layer/area. And it doesn't have to be done all at once.


Do you think the term Data Lakehouse will stick?

©2020 by Modern Data Analytics