First, there was a data warehouse – an information storage architecture that allowed structured data to be archived for specific business intelligence purposes and reporting. The concept of the data warehouse dates back to the 1980s and has served businesses well for several decades – until the dawn of the Big Data era.
This was when businesses began to unlock the value of working with unstructured data – messy, raw information that might come in the form of pictures, videos, or sound recordings. This type of data typically makes up 80 to 90% of the information available to organizations and often holds a phenomenal amount of value – think of the insights contained in years' worth of customer email communications or hours of production line video footage. Unfortunately, it doesn't fit well with the structured and ordered way information is stored in the data warehouse model.
This led to the development of a different type of architecture known as the data lake – where unstructured information is stored in its raw format, ready for whatever uses we may be able to find for it, now or in the future.
The data lake is undoubtedly a hugely powerful and flexible architecture. However, it does have some issues. For a start, as you can imagine, it can get very messy – in fact, I've heard it said that if they aren’t careful, businesses can end up with something that more closely resembles a data swamp!
This can create governance and privacy issues, as well as technical complexity in building systems that can ingest data in a myriad of schemas and formats.
So today, businesses and other organizations that work with datasets that could be considered Big Data have yet another option when it comes to storage architecture. Just as with cloud platforms in general, in data storage we are increasingly hearing about a hybrid architecture, known as the "data lakehouse" approach.
There are no prizes for guessing that the fundamental idea behind this approach is to take the best concepts from both the data warehouse and data lake models and put them together while trying to eliminate the worst concepts of both models!
Just like a data lake, a data lakehouse is built to house both structured and unstructured data. This means that businesses that can benefit from working with unstructured data (which is pretty much any business) only need one data repository rather than requiring both warehouse and lake infrastructure.
Where organizations do use both, data in the warehouse generally feeds BI analytics, while data in the lake is used for data science – which could include artificial intelligence (AI) such as machine learning – and is stored for future, as-yet-undefined use cases.
Data lakehouses enable structure and schema like those used in a data warehouse to be applied to unstructured data of the type that would typically be stored in a data lake. This means that data users can access the information more quickly and start putting it to work. And those data users might be data scientists or, increasingly, workers in any number of other roles who are seeing the benefits of augmenting themselves with analytics capabilities.
These data lakehouses might make use of intelligent metadata layers that act as a sort of "middle man" between the unstructured data and the data user in order to categorize and classify the data. By identifying and extracting features from the data, these layers can effectively give it structure, allowing it to be cataloged and indexed just as if it were nice, tidy structured data.
For example, part of this metadata extraction might include using computer vision or natural language processing algorithms to understand the content of picture, text, or voice files that are dumped as raw, unlabelled data into the lakehouse.
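To make the idea of a metadata layer a little more concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular lakehouse product works: the names MEDIA_TYPES, CatalogEntry, extract_tags, build_catalog and search are all hypothetical, and the "feature extraction" step simply splits file names into tags, standing in for the computer vision or natural language processing models a real system would call.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a coarse media type.
MEDIA_TYPES = {
    ".jpg": "image",
    ".png": "image",
    ".mp4": "video",
    ".wav": "audio",
    ".txt": "text",
}


@dataclass
class CatalogEntry:
    """One row of metadata describing a raw file in the lake."""
    path: str
    media_type: str
    tags: list[str] = field(default_factory=list)


def extract_tags(path: str) -> list[str]:
    """Stand-in for a real vision or NLP model.

    Assumption: tags are derived from the file name only, so that the
    sketch runs without any machine learning dependencies.
    """
    stem = PurePosixPath(path).stem
    return sorted(set(stem.lower().split("_")))


def build_catalog(paths: list[str]) -> list[CatalogEntry]:
    """Classify each raw file and record searchable metadata for it."""
    catalog = []
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        media_type = MEDIA_TYPES.get(suffix, "unknown")
        catalog.append(CatalogEntry(path, media_type, extract_tags(path)))
    return catalog


def search(catalog: list[CatalogEntry], tag: str) -> list[str]:
    """Query the metadata, not the raw files themselves."""
    return [entry.path for entry in catalog if tag in entry.tags]
```

With a catalog like this in place, a user could ask for everything tagged "store" and get back both the video and the image files, without ever opening the raw data. The point is simply that the index, rather than the files, becomes the thing people query.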
Lakehouse enables smart analytics
So who is the data lakehouse architecture for? One key group of users is very likely to be organizations that are looking to take the next step in their analytics journey by graduating from BI to AI. Increasingly, businesses are looking to unstructured data to inform their data-driven operations and decision-making simply because of the richness of the insights that can be extracted from it. Here’s a very simple example – if you count the number of customers coming into your shop each day and store that data as a simple number, those data points are only ever going to tell you one thing.
If you record them coming in on video, however, then as well as the basic number of customers coming in, you can find out all sorts of other information – are your customers male or female? What’s their age range? How do they like to dress? In the future, you might even be able to fit facial analytics technology and tell what mood they are in when they walk through your door!
Yes, you could dump all of that information into a data lake. However, there would be important issues of data governance to address – such as the fact you’re dealing with personal information. A lakehouse architecture could help address this by automating compliance procedures – perhaps even anonymizing data where needed.
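As a rough illustration of what an automated compliance step might look like, the Python sketch below replaces personal fields with keyed hashes as records are ingested. The field list PERSONAL_FIELDS and the function pseudonymize are hypothetical names chosen for this example. Strictly speaking, this technique is pseudonymization rather than full anonymization, because anyone holding the secret key can still link records belonging to the same person.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as personal information.
PERSONAL_FIELDS = {"name", "email"}


def pseudonymize(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with personal fields replaced by keyed hashes.

    The same input and key always give the same token, so records can
    still be joined on the token without exposing the original value.
    """
    cleaned = {}
    for key, value in record.items():
        if key in PERSONAL_FIELDS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[key] = digest.hexdigest()[:16]  # shortened token for readability
        else:
            cleaned[key] = value
    return cleaned
```

Because the original record is left untouched and only a cleaned copy is returned, a step like this could sit between the raw data and the people querying it. Real compliance requirements go well beyond this, of course, and would need to be worked out with whoever is responsible for data protection in the organization.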
Unlike data warehouses, data lakehouses are inexpensive to scale because integrating new data sources is automated – the sources don’t have to be manually made to fit the organization's data formats and schemas. They are also "open," meaning that the data can be queried from anywhere using any tool, rather than being accessible only through applications that can only handle structured data (such as SQL-based tools).
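One way to picture automated integration of a new data source is a schema that grows on its own when new columns appear. The Python sketch below is a simplified assumption of how that could work, with hypothetical function names infer_schema and merge_schemas. Real table formats handle this with far more care, for example around nested types and type widening.

```python
def infer_schema(records: list) -> dict:
    """Build a column -> type-name mapping from a batch of records.

    Assumption: the first value seen for a column decides its type.
    """
    schema = {}
    for record in records:
        for column, value in record.items():
            schema.setdefault(column, type(value).__name__)
    return schema


def merge_schemas(existing: dict, incoming: dict) -> dict:
    """Add new columns from an incoming source without manual remodeling.

    Columns already present must keep their type; a mismatch is reported
    rather than silently coerced.
    """
    merged = dict(existing)
    for column, type_name in incoming.items():
        if column in merged and merged[column] != type_name:
            raise ValueError(
                f"column {column!r} is {merged[column]} but the new source sends {type_name}"
            )
        merged[column] = type_name
    return merged
```

Here, a new feed that adds a "mood" column would simply extend the existing schema, while a feed that sends customer IDs as text instead of numbers would be flagged for someone to look at.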
The data lakehouse approach is one that's likely to become increasingly popular as more organizations begin to understand the value of using unstructured data together with AI and machine learning. In the analytics journey, it’s a step up in maturity from the combined data lake and data warehouse model that until recently has been seen as the only option for organizations that want to continue with legacy BI and analytics workflows while also migrating towards smart, automated data initiatives. With more mainstream data infrastructure vendors (e.g., AWS and Databricks) offering this architecture, and open-source tools like Delta Lake growing in popularity, it’s a term we will hear more and more of in years to come.