
đŸ€ Merging Parquet with chsql + duckdb


If you’re familiar with our blog, you already know about chsql, our DuckDB Community Extension that provides a growing collection of ClickHouse SQL macros and functions for DuckDB users. You can install chsql straight from DuckDB SQL:

INSTALL chsql FROM community;
LOAD chsql;

Merging Parquet files with chsql mergetree

Today we’re introducing a new function: read_parquet_mergetree.

If you're a fan of ClickHouse's MergeTree engine and you're looking to supercharge your data workflow within DuckDB, the new read_parquet_mergetree function from our chsql extension is going to be your new best friend. It’s a memory-efficient, easy-to-use feature that lets you merge multiple Parquet files with sorting capabilities, emulating the best parts of ClickHouse's powerful data merging strategy.

What Does read_parquet_mergetree Do?

This new function does exactly what it says on the tin: it reads and merges multiple Parquet files based on a user-specified primary sort key. Think of it as ClickHouse’s MergeTree engine for DuckDB, but tailored to handle massive datasets without hogging your system's memory.

The result? You get compact, sorted Parquet files that are ready for lightning-fast range queries.

Why Should You Care? 🚀

Here’s the TL;DR:

  • Efficient Merging: Combine data from multiple Parquet files just like how ClickHouse MergeTree tables consolidate data.

  • Sort and Compact: Set a primary sort key to organize your data, optimizing it for fast queries and analysis.

  • Memory Savvy: Perfect for large datasets where memory constraints matter.

  • Wildcard Support: Accepts glob patterns and wildcards, just like read_parquet.
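As a quick illustration of the wildcard support, you can point the function at a glob pattern just as you would with read_parquet. A minimal sketch (the paths and the timestamp sort key here are hypothetical examples):

```sql
INSTALL chsql FROM community;
LOAD chsql;

-- Merge all matching files under a hypothetical date-partitioned layout,
-- sorted by a timestamp column (path and column name are placeholders).
COPY (
    SELECT *
    FROM read_parquet_mergetree(['/data/events/2024-*/*.parquet'], 'timestamp')
) TO 'events_sorted.parquet';
```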

How to Use It

Here's the beauty of read_parquet_mergetree—it fits seamlessly into your DuckDB workflow. The syntax is intuitive for ClickHouse users but also simple for anyone new:

COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key')) 
TO 'sorted.parquet';

This command:

  • Reads all Parquet files in the folder (thanks to wildcard support),

  • Merges them based on the selected primary sort key (some_key),

  • Outputs the sorted and compacted result into a new Parquet file.
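Because the output is sorted by the key, a range predicate on that key can take advantage of Parquet row-group min/max statistics, which DuckDB uses to skip row groups during filter pushdown. A sketch of such a query (the column name some_key follows the example above):

```sql
-- Range query over the sorted output; row groups whose min/max
-- statistics for some_key fall outside the range can be skipped.
SELECT count(*)
FROM read_parquet('sorted.parquet')
WHERE some_key BETWEEN 1000 AND 2000;
```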

Real-World Benchmark: Memory Efficiency

To illustrate how much memory read_parquet_mergetree can save you, consider this benchmark:

COPY (SELECT * FROM read_parquet(['/folder/*.parquet']) ORDER BY some_key) 
TO 'sorted.parquet';

The plain read_parquet approach used all of our system’s RAM (64 GB) to complete the task.

Now, let’s compare that to read_parquet_mergetree:

COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key')) 
TO 'sorted.parquet';

On the same system, the read_parquet_mergetree query peaks at only ~800 MB of RAM (an ~80× reduction), making it perfect for those working with large datasets on resource-constrained systems.

Combined with the httpfs extension, it can merge remote files, too!
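A remote merge might look like the following sketch, assuming the file URLs are reachable over HTTP(S) (the URLs below are placeholders, not real datasets):

```sql
INSTALL httpfs;
LOAD httpfs;

-- Merge remote Parquet files fetched via httpfs; URLs are hypothetical.
COPY (
    SELECT *
    FROM read_parquet_mergetree(
        ['https://example.com/data/part1.parquet',
         'https://example.com/data/part2.parquet'],
        'some_key')
) TO 'sorted.parquet';
```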

Final Thoughts

If you're working with large-scale data and need a tool that can merge and sort Parquet files efficiently, the read_parquet_mergetree function in the chsql extension is a game changer. Whether you're a ClickHouse user or a DuckDB enthusiast, this feature allows you to manage your data with unparalleled efficiency.

Try it out if you need to merge fast, compact, and sorted Parquet files! đŸ€

Join our Community

Got ideas for the chsql extension? Join our team, let’s make this happen!