dbplyr

Database (DBI) backend for dplyr

tidyverse/dbplyr

dbplyr.tidyverse.org

506 stars

186 forks

Tidyverse

dbplyr is the database backend for dplyr that lets you work with remote database tables using dplyr syntax. It automatically translates your R code into SQL, so in most cases you do not need to write SQL queries by hand.

The package evaluates lazily: a query runs only when you explicitly request results (for example with collect()), which avoids pulling unnecessary data from large databases. It builds on the DBI package ecosystem and supports standard dplyr operations such as filtering, grouping, and summarizing on database tables. You can preview the generated SQL with show_query() before execution and work with database tables much as you would with local data frames.
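The lazy workflow described above can be sketched in a few lines of R. This is a minimal illustration, not an official example: it assumes the RSQLite package is installed and uses an in-memory SQLite database with the built-in mtcars data as a stand-in for a real remote database. The table and variable names are illustrative.

```r
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database stands in for a remote database.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

# tbl() creates a lazy reference to the table; no rows are fetched yet.
cars_db <- tbl(con, "mtcars")

# Build the query with ordinary dplyr verbs; nothing is executed here either.
avg_mpg <- cars_db |>
  filter(cyl > 4) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# Print the SQL that dbplyr generated for this pipeline.
show_query(avg_mpg)

# collect() runs the query in the database and returns a local tibble.
result <- collect(avg_mpg)

DBI::dbDisconnect(con)
```

Only the final collect() call moves data into R, so the filtering and aggregation happen inside the database. Against a real server you would replace the dbConnect() call with the driver and credentials for that database.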

Resources featuring dbplyr

Coding vs. thinking programmatically | Samia Baig | Data Science Hangout

ADD THE DATA SCIENCE HANGOUT TO YOUR CALENDAR HERE: https://pos.it/dsh - All are welcome! We’d love to see you!

This week’s guest was Samia Baig, Senior Data Scientist/Data Engineer at Johnson & Johnson Innovative Medicine!

Some topics covered in this week’s Hangout were transitioning from a background in pharmacy and public health to a data career in pharma, distinguishing the responsibilities of data scientists versus analytics engineers, strategies for making data pipelines more robust (and convincing your team that you NEED robust pipelines in the first place), and the value of joining open-source communities like Tidy Tuesday.

Resources mentioned in the video and chat:
- Posit Data Science Lab → https://pos.it/dslab
- Tidy Tuesday GitHub Repository → https://github.com/rfordatascience/tidytuesday
- {dbplyr} → https://dbplyr.tidyverse.org/
- The Missing Semester of Your CS Education → https://missing.csail.mit.edu/

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here: Website: https://www.posit.co Hangout: https://pos.it/dsh The Lab: https://pos.it/dslab LinkedIn: https://www.linkedin.com/company/posit-software Bluesky: https://bsky.app/profile/posit.co

Thanks for hanging out with us!

Timestamps:
00:00 Introduction
03:22 “Do you feel like analytics engineer is a good descriptor for what you do?”
04:44 “How did you get into data from being on a pharmacist’s job path?”
11:55 “What was it like in the move that you made from public health to pharma?”
16:57 “What do you say are distinguishing factors between data science and engineering?”
20:16 “What are the most popular tools that you and your team use, in your job at J&J?”
24:00 “What do you use SQL in?”
27:40 “How would you go about convincing a team of the need for a more robust pipeline?”
31:10 “Can you define robust?”
33:31 “Do you happen to have any specific resources or strategies or examples that might help students or others with that mindset of thinking programmatically?”
37:06 “Are there any non data science skills that are very helpful in your either current or former job?”
40:23 “Is there any kind of community among data scientists across the whole company?”
45:44 “What are your biggest data challenges that you have?”
46:12 “If you had a magic wand, what problem would you solve in that area?”
49:52 “What is a piece of career advice that maybe you wish you could go back in time and give yourself?”

Supporting 100 Data Scientists with a Small Team | Mike Thomson | Data Science Hangout

We were recently joined by Mike Thomson, Data Science Manager at Flatiron Health, to chat about managing open source tools and maintaining R packages, creating reproducible reports for Word and Excel using Quarto, the “hub and spoke” support model for data scientists, and applying R and Posit tools in the Real World Evidence (RWE) oncology space.

In this Hangout, we explore creating reproducible outputs using Quarto for formats like Word and Excel. Flatiron Health uses Quarto because it allows the reproducible publication of analyses to multiple formats simultaneously (like HTML and a downloadable Word document) from the same source code. A specific challenge discussed was outputting formatted analytic tables to Excel, as this is not natively supported by Quarto. Erica Yim, from Mike’s team, detailed how they built an internal R function that uses the flexlsx package along with flextable to easily output pre-existing formatted tables from a Quarto document into an Excel template.

Resources mentioned in the video and Zoom chat:
- flexlsx R package GitHub repository → https://github.com/pteridin/flexlsx
- dbplyr PR for Snowflake translations (contributed to by Flatiron Health) → https://github.com/tidyverse/dbplyr/pull/860

If you didn’t join live, one great discussion you missed from the Zoom chat was about the pain points of exporting data from Quarto to Word or Excel, particularly concerning table formatting and styles. Attendees in the chat strongly highlighted the difficulty of managing table formatting, including issues with table cross-references, headers, and footers. They noted that dealing with styles often requires workarounds, such as creating flextables that match desired Word styles instead of relying on default table styles. Let us know below if you’d like to hear more about this topic!

Timestamps:
00:00 Introduction
02:23 “Can you talk about what Flatiron does and what your teams do?”
03:29 “Could you give us a few examples of the data types or collections that you might be working with?”
05:00 “Do you have longitudinal data?”
07:46 “Are you aware of any computer vision applications in the health care industry from your perspective?”
09:38 “Do you use mixed models or Bayesian MCMC?”
10:56 “How does your team use Quarto?”
16:59 “How do you convince stakeholders of the value of going open source (and handle security concerns)?”
22:56 “Do you allow people to have a certain amount of time to contribute back to open source?”
26:03 “I just want to understand a little bit about your support model for that group.”
29:57 “Do you have any tips for asynchronous working?”
31:02 “Are you like a Jira team or an Asana team for assigning tasks or tickets?”
32:10 “How many people on your platform team support Posit teams?”
34:24 “What does your team use for unstructured document analysis?”
36:24 “How important is domain knowledge in your recruitment?”
40:02 “Where do you store all of this stuff (data storage and databases)?”
42:04 “What is the approximate timeline from the time you do analysis to final deployment of results in the real world?”
44:31 “Is there a process for people getting things approved to use in your environment?”
47:39 “How do you handle the challenge of going back from Word to Quarto source code (after changes are tracked)?”
50:22 “What does a typical workday look like for you?”
51:47 “Is there a piece of career advice that has either really helped you, that you’ve really liked, that you try to give to other people?”

Larry Fenn | Journalism with RStudio, R, and the tidyverse | RStudio (2020)

The Associated Press data team uses R and the Tidyverse as its main tools for data processing and analysis. This talk showcases some of the technology behind the published stories:
- Using dbplyr to work off a hosted database containing 380 million opioid records to identify “pill mills”.
- Using open-sourced AP style templates for R Markdown and ggplot to quickly produce graphics and reports off breaking news.
- Using R Markdown and htmlwidgets to give reporters and editors interactive reports to identify reporting leads.