duckplyr
A drop-in replacement for dplyr, powered by DuckDB for speed
duckplyr is a drop-in replacement for dplyr that uses DuckDB as its execution engine to run data manipulation operations faster. It executes existing dplyr code with identical results while automatically leveraging DuckDB’s performance optimizations.
The package handles larger-than-memory datasets by working directly with files on disk or remote URLs without loading everything into memory. It automatically falls back to standard dplyr when DuckDB doesn’t support a specific operation, providing transparent acceleration without requiring code changes. The package can query Parquet, CSV, and JSON files efficiently, including remote files over HTTP, making it practical for analyzing datasets that exceed available RAM.
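The file-based workflow described above can be sketched as follows. This is a minimal illustration, assuming duckplyr 1.0.0 or later is installed. The CSV contents and the names `sales`, `region`, and `amount` are made up for the example. If DuckDB cannot handle one of the steps, duckplyr is expected to fall back to dplyr and return the same result.

```r
library(dplyr)
library(duckplyr)

# Write a small CSV to a temporary file so the example is self-contained.
# The column names and values are hypothetical, chosen for illustration only.
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    region = c("north", "south", "north", "east"),
    amount = c(120, 80, 45, 200)
  ),
  path,
  row.names = FALSE
)

# read_csv_duckdb() points DuckDB at the file rather than loading it into R first.
sales <- read_csv_duckdb(path)

# Ordinary dplyr verbs; collect() brings the (small) result into memory.
totals <- sales |>
  filter(amount > 50) |>
  summarise(total = sum(amount), .by = region) |>
  arrange(region) |>
  collect()

print(totals)
```

The pipeline itself is plain dplyr code. Only the reading step is specific to duckplyr, which is what makes it practical to swap in for an existing script.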
Resources featuring duckplyr
duckplyr: Analyze large data with full dplyr compatibility (Kirill Müller, cynkra)
Speaker(s): Kirill Müller
Abstract:
The duckplyr package is now stable: version 1.0.0 has been published on CRAN. Learn how to use this package to speed up your existing dplyr code with little to no changes, and how to work with larger-than-memory data using a syntax that not only feels like dplyr for data frames, but behaves exactly like it.
Materials: https://github.com/cynkra/posit-conf-2025
posit::conf(2025)
duckplyr: Tight Integration of duckdb with R and the tidyverse - posit::conf(2023)
Presented by Kirill Müller
The duckplyr R package combines the convenience of dplyr with the performance of DuckDB. Better than dbplyr: Data frame in, data frame out, fully compatible with dplyr.
duckdb is a new high-performance analytical database system that works great with R, Python, and other host systems. dplyr is the grammar of data manipulation in the tidyverse, tightly integrated with R, but it works best for small or medium-sized data. duckdb has been designed with large data in mind, but currently you need to formulate your queries in SQL.
The new duckplyr package offers the best of both worlds. It transforms a dplyr pipe into a query object that duckdb can execute, using an optimized query plan. It is better than dbplyr because the interface is “data frames in, data frames out”, and no intermediate SQL code is generated.
The talk presents our results, a bit of the mechanics, and an outlook for this ambitious project.
Materials: https://github.com/duckdblabs/duckplyr/
Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference.
Talk Track: Databases for data science with duckdb and dbt. Session Code: TALK-1100