Merging & Sorting Parquet with chsql + duckdb

If you're familiar with our blog, you already know about chsql, our DuckDB Community Extension providing a growing number of ClickHouse SQL macros and functions for DuckDB users. You can install chsql right from DuckDB SQL:
INSTALL chsql FROM community;
LOAD chsql;
Merging Parquet files with chsql mergetree
Today we're introducing a new function: read_parquet_mergetree
If you're a fan of ClickHouse's MergeTree engine and you're looking to supercharge your data workflow within DuckDB, the new read_parquet_mergetree function from our chsql extension is going to be your new best friend. It's a memory-efficient, easy-to-use feature that lets you merge multiple Parquet files with sorting capabilities, emulating the best parts of ClickHouse's powerful data merging strategy.
What Does read_parquet_mergetree Do?
This new function does exactly what it says on the tin: it reads and merges multiple Parquet files based on a user-specified primary sort key. Think of it as ClickHouse's MergeTree engine for DuckDB, but tailored to handle massive datasets without hogging your system's memory.
The result? You get compact, sorted Parquet files that are ready for lightning-fast range queries.
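For example, once the output is sorted, DuckDB can prune entire row groups using Parquet min/max statistics when you filter on the sort key. A minimal sketch, assuming the merged file contains a numeric some_key column:
SELECT count(*)
FROM read_parquet('sorted.parquet')
WHERE some_key BETWEEN 1000 AND 2000; -- range filters can skip row groups via Parquet statistics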
Why Should You Care?
Here's the TL;DR:
Efficient Merging: Combine data from multiple Parquet files just like how ClickHouse MergeTree tables consolidate data.
Sort and Compact: Set a primary sort key to organize your data, optimizing it for fast queries and analysis.
Memory Savvy: Perfect for large datasets where memory constraints matter.
Wildcard Support: Supports glob patterns and wildcards, just like read_parquet.
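As a quick sketch of the wildcard support, the file list can carry glob patterns; assuming it accepts multiple patterns like read_parquet does (paths and sort key below are placeholders):
SELECT count(*)
FROM read_parquet_mergetree(['/data/2023/*.parquet', '/data/2024/*.parquet'], 'timestamp');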
How to Use It
Here's the beauty of read_parquet_mergetree: it fits seamlessly into your DuckDB workflow. The syntax is intuitive for ClickHouse users but also simple for anyone new:
COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key'))
TO 'sorted.parquet';
This command:
Reads all Parquet files in the folder (thanks to wildcard support),
Merges them based on the selected primary sort key (some_key),
Outputs the sorted and compacted result into a new Parquet file.
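Because the result flows through DuckDB's standard COPY statement, you can also tune the output file; a sketch with ZSTD compression and an explicit row group size (both optional, shown here as one possible configuration):
COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key'))
TO 'sorted.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);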
Real-World Benchmark: Memory Efficiency
To illustrate how much memory read_parquet_mergetree can save you, consider this benchmark:
COPY (SELECT * FROM read_parquet(['/folder/*.parquet']) ORDER BY some_key)
TO 'sorted.parquet';
The read_parquet function uses all of our system's RAM (64GB) to run the task.
Now, let's compare that to read_parquet_mergetree:
COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key'))
TO 'sorted.parquet';
On the same system, the read_parquet_mergetree query uses only ~800MB of RAM (an ~80x reduction), making it perfect for those working with large datasets on resource-constrained systems.
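If you want to reproduce this on your own data, one way to keep the comparison honest is to cap DuckDB's memory budget before running the merge; a minimal sketch (note the limit governs DuckDB's buffer manager, not necessarily every allocation an extension makes):
SET memory_limit = '1GB'; -- cap DuckDB's internal memory usage
COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key'))
TO 'sorted.parquet';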
Used in combination with HTTPFS, it can merge remote files, too!
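Here's a sketch of the remote case, assuming a hypothetical s3://my-bucket path with credentials already configured, and that the function accepts the same remote URLs and globs as read_parquet once httpfs is loaded:
INSTALL httpfs;
LOAD httpfs;
COPY (SELECT * FROM read_parquet_mergetree(['s3://my-bucket/events/*.parquet'], 'some_key'))
TO 'sorted.parquet';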
Final Thoughts
If you're working with large-scale data and need a tool that can merge and sort Parquet files efficiently, the read_parquet_mergetree function in the chsql extension is a game changer. Whether you're a ClickHouse user or a DuckDB enthusiast, this feature allows you to manage your data with unparalleled efficiency.
Try it out whenever you need fast merging into compact, sorted Parquet files!
Join our Community
Got ideas for the chsql extension? Join our team and let's make this happen!