Lakes, warehouses, and lakehouses
This article is, of course, about data storage and utilisation in a business. Last week, I wrote about the overall data architecture in an organisation, but this week, I'd like to explore the data layer itself, the different possible architectures, and the advantages and disadvantages of each.
Data warehouses
Data warehouses are the most intuitive form of data storage for many people, so let's start with them.
A data warehouse stores structured data; think of a database or even an Excel spreadsheet to get some idea. Each data item can be considered a 'row', and the attributes or properties associated with that item are the columns. For example, we could have a People table, where a row might represent a person, with columns such as Name, Age, Height, Hometown, etc. Each of these items can be related to other data in other tables; for example, the Hometown column could have the value London, and there would then be a row in a separate Cities table for London with other attributes, such as Population or Country.
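The People/Cities example above can be sketched with a small relational database. This is an illustrative toy (table names, columns, and values are all hypothetical, matching the example rather than any real system), but it shows how rows, columns, and relationships between tables work in practice:

```python
import sqlite3

# In-memory database; table and column names follow the example above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Cities (
        Name TEXT PRIMARY KEY,
        Population INTEGER,
        Country TEXT
    )
""")
conn.execute("""
    CREATE TABLE People (
        Name TEXT,
        Age INTEGER,
        Height REAL,
        Hometown TEXT REFERENCES Cities (Name)  -- links a person to a city
    )
""")
conn.execute("INSERT INTO Cities VALUES ('London', 8800000, 'United Kingdom')")
conn.execute("INSERT INTO People VALUES ('Ada', 36, 1.65, 'London')")

# Follow the Hometown relationship to pull attributes from the Cities table.
row = conn.execute("""
    SELECT p.Name, c.Country
    FROM People p JOIN Cities c ON p.Hometown = c.Name
""").fetchone()
print(row)  # ('Ada', 'United Kingdom')
```

The join at the end is exactly the kind of query a BI or reporting layer relies on: because the schema is known upfront, the tool knows where to look and what to expect.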
This data is straightforward to use, particularly if you're writing code. We know where to look for data about people or cities and what we can expect to find for each. Adding a Business Intelligence (BI) or other reporting layer to these structures to gather business insights is typically straightforward.
The problem is that this data is harder to write, because we have to decide how it will be stored and organised at the point where we add it to the table. If we have a use case in mind for the data, that's probably doable, but for data stored for potential future use, that's a lot of upfront work with no guarantee of delivering value. This is known as 'schema on write', i.e. we have to decide the schema for the data when it's written into the warehouse. Data warehouses are also often limited in the data types they can store: numbers and text work more naturally than images or audio.
To avoid this upfront schema work, there is an alternative: the data lake.
Data lakes
Data lakes take a 'schema on read' approach. All data is stored in its native form without organisation or structure. Think of it like a folder on your computer where you save all your files (we'll discuss data lake structure later). Save everything into that folder, and you know that when you need it in the future, it'll be there and ready to go. Once you need that data, you'll know what for and how much value you'll be able to deliver, which means you can start thinking about how to structure it. Often, this means transforming the data and putting it into a structured form in a data warehouse.
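The 'folder on your computer' analogy can be made concrete. In this hypothetical sketch (file names and payloads are made up), each payload is landed exactly as it arrived, and no parsing decision is made until a use case appears:

```python
import json
from pathlib import Path
from tempfile import mkdtemp

lake = Path(mkdtemp())  # stand-in for the data lake's storage

# Schema on write would force us to design tables now; instead we land
# each payload exactly as it arrived, whatever its shape happens to be.
payloads = [
    '{"name": "Ada", "age": 36}',     # JSON from one source system
    'name,hometown\nGrace,New York',  # CSV from another
]
for i, raw in enumerate(payloads):
    (lake / f"event_{i}.txt").write_text(raw)

# Schema on read: only when a use case appears do we decide how to parse.
record = json.loads((lake / "event_0.txt").read_text())
print(record["name"])  # Ada
```

Note that nothing stops the two files having completely different shapes; the cost of reconciling them is deferred until read time, which is precisely the trade-off the lake makes.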
The clear advantage here is that you have all your data readily available when you need to use it, but you haven't wasted much time getting it into a particularly tidy state. However, there are drawbacks, particularly around maintaining data quality and policing access to the data appropriately. A data lake without appropriate governance ends up as a useless, disorganised, and outdated mess of data and files that happen to be in one place, which people call a Data Swamp.
Data lakehouse
So we come to the lakehouse, an irritating portmanteau of lake and warehouse that aims to offer the best of both worlds.
Data lakehouses typically use the same low-cost object data storage used for data lakes, meaning it's still possible to store everything in the lakehouse and then figure out a use for it later. However, a lakehouse also has metadata layers that offer more structured schemas for accessing the data and governance features. This means you can access the data directly from the lake but with additional functionality, making it feel like a well-organised data warehouse.
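To give a feel for what a metadata layer adds, here is a deliberately toy sketch. The `catalog` structure, function names, and role names are all invented for illustration; real lakehouse table formats such as Delta Lake or Apache Iceberg implement this idea far more robustly, with transaction logs and schema evolution. The point is only that the files stay in cheap lake storage while the metadata supplies structure and governance on top:

```python
import json
from pathlib import Path
from tempfile import mkdtemp

store = Path(mkdtemp())  # same cheap file/object storage as a plain lake
(store / "people_0.json").write_text('[{"name": "Ada", "age": 36}]')

# Toy metadata layer: which files make up a 'table', what schema they
# should conform to, and who is allowed to read them.
catalog = {
    "people": {
        "files": ["people_0.json"],
        "schema": {"name", "age"},
        "readers": {"analytics_team"},
    }
}

def read_table(name, as_role):
    meta = catalog[name]
    if as_role not in meta["readers"]:           # governance check
        raise PermissionError(f"{as_role} may not read {name}")
    rows = []
    for f in meta["files"]:
        for row in json.loads((store / f).read_text()):
            # Enforce the declared schema on the raw lake files.
            assert set(row) == meta["schema"]
            rows.append(row)
    return rows

print(read_table("people", "analytics_team"))  # [{'name': 'Ada', 'age': 36}]
```

Reading through the catalog feels like querying a warehouse table, yet the underlying bytes never left the lake.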
I've redrawn the image below, but it's copied directly from the Databricks website. Note that 'ETL' stands for Extract, Transform, Load and is essentially the process of structuring and organising data I discussed above. The image summarises what we've already covered regarding the data types in each architecture and how that data is used.
Why bother with lakes?
The obvious question at this point is, why bother with a data lake? If you need to do something with the data, you can process it and put it into a data warehouse. If you don't, why not leave it in the source system until you do?
There are at least two good reasons to put all your data into a data lake:
- Data science/ML exploration: This is shown in the image above. While many users will consume more structured data, data scientists can take a more exploratory approach to data to uncover insights. This is only possible if they can easily access all the data.
- Historical reprocessing: Source systems don't necessarily store the long histories you can fall back on when a new use case emerges. It can be frustrating to build an insightful dashboard and then wait six months for the trends to emerge as the data flows in, when this could have been pulled immediately from a data lake holding the historical data.
Structuring the data lake
Finally, I want to discuss how to structure the data within a data lake. The typical approach is to organise this into three layers:
- Raw: a direct copy of the source data in the same structure as the source data. Typically organised within subfolders identifying the time the raw data was taken (e.g. hourly, daily, etc.)
- Curated: this layer achieves two things:
- Data is cleaned: for example, dates are put into a consistent format, and missing values are handled appropriately.
- Data is combined and enriched: the data in this layer conforms to a more consistent data model that integrates information across systems. For example, HR data about a person might be combined with their salary and cost centre from Finance.
- Business: this is the layer where the data is used to address a business problem or answer a question. For example, this might include calculated fields such as employee attrition, which are not directly available from the People data. This layer should expose all the data used in the technology layer so that only straightforward calculations are done in the technology layer itself.
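The three layers above can be sketched end to end. Everything here is hypothetical (the records, field names, and the attrition rule are invented to match the HR/Finance example), but it shows the shape of the pipelines that move data up the stack:

```python
from datetime import date

# Raw layer: direct copies from two source systems, in their native shapes.
raw_hr = [{"id": 1, "name": "Ada", "start": "2021-03-01", "left": ""}]
raw_finance = [{"employee_id": 1, "salary": 50000, "cost_centre": "R&D"}]

def curate(hr, finance):
    """Curated layer: clean (consistent date types, explicit missing
    values) and combine HR and Finance into one record per person."""
    pay = {r["employee_id"]: r for r in finance}
    out = []
    for r in hr:
        f = pay.get(r["id"], {})
        out.append({
            "id": r["id"],
            "name": r["name"],
            "start": date.fromisoformat(r["start"]),
            "left": date.fromisoformat(r["left"]) if r["left"] else None,
            "salary": f.get("salary"),
            "cost_centre": f.get("cost_centre"),
        })
    return out

# Business layer: derived fields the reporting tools need, e.g. a simple
# attrition flag that isn't directly available from the People data.
curated = curate(raw_hr, raw_finance)
business = [{**p, "attrited": p["left"] is not None} for p in curated]
print(business[0]["attrited"])  # False
```

Each step is also a natural place to police quality: `curate` is where a malformed date or an unmatched Finance record would be caught before it reaches the business layer.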
Implicit in this structure is that as we get higher up the stack, the data becomes higher quality and more trustworthy. The data pipelines that move data between these different layers should work to improve and police the quality of data passing through them.
Conclusion
This gives a much clearer idea of how data is structured within the data layer, which was introduced in my last article. I want to add two more articles to this series, one breaking down the technology layer in more detail and the other focusing on the operating model that makes this possible (i.e., who owns what).