Choosing a Data Lake Format: What to Actually Look For

ODSC - Open Data Science
5 min read · Aug 15, 2023

Recently we’ve seen lots of posts about a variety of different file formats for data lakes. There’s Delta Lake, Hudi, Iceberg, and QBeast, to name a few.

It can be tough to keep track of all these data lake formats, let alone figure out why (or if!) we really need such a wide selection and, more importantly, which format is best for a given use case.

The short answer: All those special data lake formats are geared toward making your data directly queryable.

That’s a fine thing to do, but it shouldn’t be the primary purpose of your data lake.

Let’s discuss all this a bit more: how to choose the best data lake format, and at the same time, why you shouldn’t worry about format all that much. There’s something else that we — the engineers at Estuary — think is more important.

And I’m curious to see if you’ll agree.

Direct Querying and Your Data Lake Options

There are lots of excellent tools out there for performing various types of queries.

You’ve got things like Elasticsearch for full-text search, TimescaleDB for time-series data, Pinecone + ChatGPT for asking conversational questions about your data, PostGIS for geospatial data, and many, many more.

There’s a massive number of different systems, strategies, and algorithms out there for indexing and querying data. And there’s a perfectly good reason for that! The world of data is huge. Even within a single small to medium business, it’s common to see wide variety both in the types of data you have and in the ways you want to use it.

So, while tools for directly querying your data lake are impressive and sometimes quite useful, they are at best a nice bonus feature.

No matter how awesome your data lake format is, it can’t beat PostGIS for geospatial queries, or Elasticsearch for full-text search, or … you get the picture. Even in cases where direct queries against the data lake can work, they’re rarely the best tool for the job.

A More Important Data Lake Feature

So, if you’re not worried about direct querying, how do you choose (or design, for that matter) a data lake?

At a high level, my team and I think that a data lake should prioritize integrations over query capabilities.

Rather than trying to build your entire infrastructure around an all-inclusive data storage system that claims to do it all, it’s far more important that your data lake makes it easy to leverage the broader ecosystem of analytical tools.

You can use these tools, in turn, to do the thing they’re great at: asking questions about your data.

How We Came to Believe This…

The reason we at Estuary feel so strongly about data lake features is (you guessed it) that we built a data lake.

For those new around here: our platform, Flow, is in effect a real-time ETL tool, but it’s also a real-time data lake with transactional support. When we built Flow, we didn’t use any of the aforementioned data lake formats.

Instead, we just used newline-delimited JSON. We’ve previously written about why JSON is a good choice, but I wanted to expand on this particular aspect: the prioritization of integration over direct querying. In a nutshell, that’s what makes Flow’s approach different — both in the world of ETL and data lakes.
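If you haven’t worked with newline-delimited JSON before, here’s a minimal Python sketch of the idea: one self-contained JSON document per line. To be clear, this is not Flow’s code. The helper names (`write_ndjson`, `read_ndjson`) and the sample records are made up for illustration.

```python
import io
import json


def write_ndjson(records, stream):
    """Write each record as one compact JSON document per line."""
    for record in records:
        stream.write(json.dumps(record, separators=(",", ":")) + "\n")


def read_ndjson(stream):
    """Yield one parsed record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample records. Documents don't need to share a shape.
events = [
    {"id": 1, "type": "click", "page": "/home"},
    {"id": 2, "type": "purchase", "amount": 19.99},
]

# An in-memory buffer stands in for a file in the lake.
buffer = io.StringIO()
write_ndjson(events, buffer)
buffer.seek(0)
roundtrip = list(read_ndjson(buffer))
```

Because every line stands on its own, just about any language or tool with a JSON parser can read the data one line at a time, which is a big part of what makes the format easy to integrate with.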

We know that no matter how hard we try, we can’t provide query capabilities that suffice for all, or even most, use cases.

Instead, we lean hard into integrations. When you use Flow as your data lake, you can easily materialize data from your lake into a rapidly growing variety of other systems, which are kept up-to-date automatically in real time.

This makes it easy to query your data using whichever tools are best for your scenario.
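To make the idea concrete, here’s a toy Python sketch that loads newline-delimited JSON into SQLite and then queries it there. It’s only an illustration of the general pattern, not how Flow’s materializations are implemented: it runs as a one-off batch rather than continuously, and the `readings` table, its columns, and the sample documents are all assumptions.

```python
import io
import json
import sqlite3

# Hypothetical lake contents: one JSON document per line.
# The third line is a newer version of the document with id 1.
LAKE = io.StringIO(
    '{"id": 1, "city": "Columbus", "temp_c": 21.5}\n'
    '{"id": 2, "city": "Austin", "temp_c": 33.0}\n'
    '{"id": 1, "city": "Columbus", "temp_c": 23.0}\n'
)


def materialize(lines, conn):
    """Upsert each document into a table keyed on id; later lines win."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(id INTEGER PRIMARY KEY, city TEXT, temp_c REAL)"
    )
    for line in lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        # Named placeholders are filled from the document's keys.
        conn.execute(
            "INSERT OR REPLACE INTO readings (id, city, temp_c) "
            "VALUES (:id, :city, :temp_c)",
            doc,
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
materialize(LAKE, conn)

# From here on, the destination system answers the questions.
rows = conn.execute(
    "SELECT id, city, temp_c FROM readings ORDER BY id"
).fetchall()
```

Swap SQLite for PostGIS, Elasticsearch, or a warehouse and the shape stays the same: the lake holds the documents, and the destination does the querying.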

Actually Choosing the Best Data Lake For You

Before you start lighting your torches, I want to clarify that querying your data lake is not bad. Nor can I sit here and say for certain that the integrations-first approach we use in Flow is right for your needs. That’d be kind of presumptuous, not to mention impossible to determine without knowing your situation.

There are lots of reasons why direct querying could be the best approach for your specific scenario. If that’s you, you already know who you are. You already know the types of queries you need to run and your desired outcomes. In your case, selecting from the variety of data lake formats on the market is simply a matter of comparing capabilities and testing your queries.

But if you’re looking to get more value out of your data in general across multiple business domains, improving the performance of direct queries against the data lake is probably not going to buy you a whole lot.

Making it easier to move data into other systems, on the other hand, does make a big difference. It means that you’re free to use the best tool for each scenario. Perhaps just as importantly, it gives you the freedom to try out different systems to figure out what the best tool even is.

For you, my advice for your data lake search is: don’t get too hung up on the data format or query capabilities. Instead, take a closer look at integrations and how you move data into and out of the lake. You’ll end up with better query capabilities, happier users, and a whole lot more flexibility.

Have thoughts on this discussion on choosing a data lake? We’d love to hear them.

Though we almost always have comments turned off on the blog (even a team of mostly engineers can get plagued by comment bots, go figure), our doors are always open on Slack.

Article by Phil Fried, engineer at Estuary

Originally posted on OpenDataScience.com

