dplyr

Workflow Demo Live Q&A - December 18th!

This is the Live Q&A session for our Workflow Demo on December 18th to show how to build a Shiny app that reads from and writes to a database - using DuckDB.

The demo will be here: https://youtu.be/6AGroJb4zPM

Sara covers how to:

Set up a connection to DuckDB from a Shiny app
Query the database using R or Python
Use editable tables to enable users to write back to the database
Securely deploy your app to Posit Connect

Resources mentioned in this Q&A:
Connect Authentication documentation: https://docs.posit.co/connect/user/oauth-integrations/
Git backed deployment in Posit Connect: https://posit.co/blog/git-backed-deployment-in-posit-connect/
shinylive: https://posit-dev.github.io/r-shinylive/
Mastering Shiny Best Practices: https://mastering-shiny.org/best-practices.html#best-practices

Blogs:
https://blog.stephenturner.us/p/duckdb-vs-dplyr-vs-base-r
https://outsiderdata.netlify.app/posts/2024-04-10-the-truth-about-tidy-wrappers/benchmark_wrappers.html
Using Posit tools with data in DuckDB, Databricks, and Snowflake: https://posit.co/blog/databases-with-posit/
Creating a Shiny app that interacts with a database: https://posit.co/blog/shiny-with-databases/
Appsilon has a lot of resources on the topic: https://www.appsilon.com/post/r-shiny-duckd

Please note, while the monthly Posit Team Workflow Series is usually the last Wednesday of the month - this will be a week earlier. Cheers to a festive holiday season and a fantastic year ahead for you and yours!

To add future Monthly Workflow Demo events to your calendar → https://www.addevent.com/event/Eg16505674

You can view all the previous workflow demos here → https://www.youtube.com/playlist?list=PL9HYL-VRX0oRsUB5AgNMQuKuHPpNDLBVt

Hadley Wickham - R in Production

R in Production by Hadley Wickham

Visit https://rstats.ai for information on upcoming conferences.

Abstract: In this talk, we delve into the strategic deployment of R in production environments, guided by three core principles to elevate your work from individual exploration to scalable, collaborative data science. The essence of putting R into production lies not just in executing code but in crafting solutions that are robust, repeatable, and collaborative, guided by three key principles:

Not just once: Successful data science projects are not a one-off, but will be run repeatedly for months or years. I’ll discuss some of the challenges of creating R scripts and applications that run repeatedly, handle new data seamlessly, and adapt to evolving analytical requirements without constant manual intervention. This principle ensures your analyses are enduring assets, not throwaway toys.
Not just my computer: The transition from development on your laptop (usually Windows or Mac) to a production environment (usually Linux) introduces a number of challenges. Here, I’ll discuss some strategies for making R code portable, how you can minimise pain when something inevitably goes wrong, and a few unresolved auth challenges that we’re currently working on.
Not just me: R is not just a tool for individual analysts but a platform for collaboration. I’ll cover some of the best practices for writing readable, understandable code, and how you might go about sharing that code with your colleagues. This principle underscores the importance of building R projects that are accessible, editable, and usable by others, fostering a culture of collaboration and knowledge sharing.

By adhering to these principles, we pave the way for R to be a powerful tool not just for individual analyses but as a cornerstone of enterprise-level data science solutions. Join me to explore how to harness the full potential of R in production, creating workflows that are robust, portable, and collaborative.

Bio: Hadley is Chief Scientist at Posit PBC, winner of the 2019 COPSS award, and a member of the R Foundation. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. His work includes packages for data science (like the tidyverse, which includes ggplot2, dplyr, and tidyr) and principled software development (e.g. roxygen2, testthat, and pkgdown). He is also a writer, educator, and speaker promoting the use of R for data science. Learn more on his website: http://hadley.nz

Mastodon: https://fosstodon.org/@hadleywickham

Presented at the 2024 New York R Conference (May 17, 2024). Hosted by Lander Analytics (https://landeranalytics.com).

Hadley Wickham

posit::conf(2023) Workshop: Big Data with Arrow

Register now: http://pos.it/conf
Instructors: Nic Crane and Stephanie Hazlitt
Workshop Duration: 1-Day Workshop

This course is for you if you:
• want to learn how to work with tabular data that is too large to fit in memory using existing R and tidyverse syntax implemented in Arrow
• want to learn about Parquet and other file formats that are powerful alternatives to CSV files
• want to learn how to engineer your tabular data storage for more performant access and analysis with Apache Arrow

Data analysis pipelines with larger-than-memory data are becoming more and more commonplace. In this workshop you will learn how to use Apache Arrow, a multi-language toolbox for working with larger-than-memory tabular data, to create seamless “big” data analysis pipelines with R.

The workshop will focus on using the arrow R package, a mature R interface to Apache Arrow, to process larger-than-memory files and multi-file data sets using familiar dplyr syntax. You’ll learn to create and use interoperable data file formats like Parquet for efficient data storage and access, with data stored both on disk and in the cloud, and also how to exercise fine control over data types to avoid common large data pipeline problems. This workshop will provide a foundation for using Arrow, giving you access to a powerful suite of tools for performant analysis of larger-than-memory data in R.

posit::conf(2023) Workshop: Introduction to Data Science with R and Tidyverse

Register now: http://pos.it/conf
Instructors: Posit Academy Instructors
Workshop Duration: 2-Day Workshop

This course is ideal for:
• those new to R or the Tidyverse
• anyone who has dabbled in R, but now wants a rigorous foundation in up-to-date data science best practices
• SAS and Excel users looking to switch their workflows to R

This is not a standard workshop, but a six-week online apprenticeship that culminates in two in-person days at posit::conf(2023). Begins August 7th, 2023. No knowledge of R required. Visit posit.co/academy to learn more about this uniquely effective learning format.

Here, you will learn the foundations of R and the Tidyverse under the guidance of a Posit Academy mentor and in the company of a close group of fellow learners. You will be expected to complete a weekly curriculum of interactive tutorials, and to attend a weekly presentation meeting with your mentor and fellow students. Topics will include the basics of R, importing data, visualizing data with ggplot2, wrangling data with dplyr and tidyr, working with strings, factors, and date-times, modelling data with base R, and reporting reproducibly with Quarto.

posit::conf(2023) Workshop: Tidy time series and forecasting in R

Register now: http://pos.it/conf
Instructor: Rob J Hyndman
Workshop Duration: 2-Day Workshop

This course is for you if you:
• already use the tidyverse packages in R such as dplyr, tidyr, tibble and ggplot2
• need to analyze large collections of related time series
• would like to learn how to use some tidy tools for time series analysis including visualization, decomposition and forecasting

It is common for organizations to collect huge amounts of data over time, and existing time series analysis tools are not always suitable to handle the scale, frequency and structure of the data collected. In this workshop, we will look at some packages and methods that have been developed to handle the analysis of large collections of time series.

On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to explore time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feasts (along with the tidyverse of course).

Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package, and we will explore the creation of ensemble forecasts and hybrid forecasts. Best practices for evaluating forecast accuracy will also be covered. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related.

ZJ | Easy larger-than-RAM data manipulation with {disk.frame} | RStudio

Learn how to handle 100GBs of data with ease using {disk.frame} - the larger-than-RAM-data manipulation package.

R loads data in its entirety into RAM. However, RAM is a precious resource and often runs out. That’s why most R users will have run into the “cannot allocate vector of size xxB” error at some point.

However, the need to handle larger-than-RAM data doesn’t go away just because RAM isn’t large enough. So many useRs turn to big data tools like Spark for the task. In this talk, I will make the case that {disk.frame} is sufficient and often preferable for manipulating larger-than-RAM data that fit on disk. I will show how you can apply familiar {dplyr}-verbs to manipulate larger-than-RAM data with {disk.frame}.

About ZJ: ZJ is a machine learning developer based in Melbourne, Australia. He regularly contributes to open source projects. He had more than 10 years of experience in banking before joining the tech sector. In his free time, he enjoys playing Go/Baduk/Weiqi.

Michael Chow | Bringing the Tidyverse to Python with Siuba | RStudio

Last January I left my job to spend a year developing siuba, a Python port of dplyr. At its core, this decision was driven by a decade of watching Python and R users produce similar analyses, but in very different ways.

In this talk, I’ll discuss 3 ways siuba enables R users to transfer their hard-earned programming knowledge to Python: (1) leveraging the power of dplyr syntax, (2) options to generate SQL code, and (3) working with the plotnine plotting library.

Looking back, I’ll consider two critical pieces that have helped me develop siuba: using it to livecode TidyTuesday analyses, and building an interactive tutorial for absolute beginners.

About Michael: Michael Chow is a data scientist and learning researcher. He serves as a co-director at Code for Philly. In past lives, he worked on adaptive assessment tools in ed tech, and received a PhD in cognitive psychology from Princeton University.

Michael Chow

Hadley Wickham | Maintaining the house the tidyverse built | RStudio

Hadley will talk about how the tidyverse has evolved since its creation (just five years ago!). You’ll learn about our greatest successes, learn from our biggest failures, and get some hints of what’s coming down the pipeline for the future.

About Hadley: Hadley Wickham is the Chief Scientist at RStudio, a member of the R Foundation, and Adjunct Professor at Stanford University and the University of Auckland. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. You may be familiar with his packages for data science (the tidyverse: including ggplot2, dplyr, tidyr, purrr, and readr) and principled software development (roxygen2, testthat, devtools, pkgdown). Much of the material for the course is drawn from two of his existing books, Advanced R and R Packages, but the course also includes a lot of new material that will eventually become a book called “Tidy tools”

Hadley Wickham

Tyson Barrett | List-columns in data.table | RStudio (2020)

The use of list-columns in data frames and tibbles is well documented (e.g. Bryan, 2018), providing a cognitively efficient way to organize results of complex data (e.g. several statistical models, groupings of text, data summaries, or even graphics) with corresponding data. For example, one can store student information within classrooms, player information within teams, or analyses within groups. This allows the data to be of variable sizes without overly complicating or adding redundancies to the structure of the data. In turn, this can make analyses of the data more reliable. Because of its efficiency and speed, being able to use data.table to work with list-columns would be beneficial in many data contexts (e.g. to reduce memory usage in large data sets). Herein, I demonstrate how one can create list-columns in a data table using the by argument in data.table and purrr::map(). I compare the behavior of the data.table approaches to the dplyr::group_nest() function and tidyr::unnest(), two of the several powerful Tidyverse nesting and unnesting functions. Results using bench::mark() show the speed and efficiency of using data.table to work with list-columns.
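
The nesting idea in this abstract can be pictured without any particular package. The sketch below is a language-neutral illustration in plain Python, not data.table or tidyr code: the `rows` data and the `nest` and `unnest` helpers are hypothetical names invented here. It shows one row per group holding a list of that group's records, with a per-group result stored alongside.

```python
from collections import defaultdict

# Hypothetical flat data: one row per student.
rows = [
    {"classroom": "A", "student": "Ana", "score": 90},
    {"classroom": "A", "student": "Ben", "score": 72},
    {"classroom": "B", "student": "Cy", "score": 85},
]


def nest(rows, by):
    """Collapse rows into one row per group; the 'data' cell holds a list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append({k: v for k, v in row.items() if k != by})
    return [{by: key, "data": members} for key, members in groups.items()]


def unnest(nested, by):
    """Expand each 'data' list back into one flat row per element."""
    return [{by: g[by], **member} for g in nested for member in g["data"]]


nested = nest(rows, "classroom")

# Groups may hold different numbers of rows; a derived result sits beside them.
for g in nested:
    g["mean_score"] = sum(m["score"] for m in g["data"]) / len(g["data"])

flat_again = unnest(nested, "classroom")
```

Here `nested` has one entry per classroom, and `flat_again` recovers the original rows, which is the round trip that nesting and unnesting functions provide.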

Lionel Henry | Interactivity and Programming in the Tidyverse | RStudio (2020)

In Tidyverse grammars such as dplyr you can refer to the columns in your data frames as if they were objects in the workspace. This syntax is optimised for interactivity and is a great fit for data analysis, but it makes it harder to write functions and reuse code. In this talk we present some advances in the tidy eval framework that make it easier to program around Tidyverse pipelines without having to learn a lot of theory.

Lionel Henry

Ian Cook | Bridging the Gap between SQL and R | RStudio (2020)

Ian Cook | January 31, 2020

Like it or not, SQL is the closest thing we have to a universal language for working with structured data. Celebrating its 50th birthday in 2020, SQL today integrates with thousands of applications and has millions of users worldwide. Data analysts using SQL represent a large audience of potential R users motivated to expand their data science skills. But learning R can be frustrating for SQL users. One major frustration is the inability to directly query R data frames with SQL SELECT statements. Eager to use R for tasks that are not possible with SQL (like data visualization and machine learning), these users are dismayed to find that they must first learn an unfamiliar syntax for data manipulation. The popularity of the sqldf package (which automatically exports an R data frame into an embedded database, then runs a SQL query on it) demonstrates this frustration. But now there is a way to directly query an R data frame without moving the data out of R. In this talk, I introduce tidyquery, a new R package that runs SQL queries directly on R data frames. tidyquery is powered by dplyr and by queryparser, a new pure-R, no-dependency SQL query parser.
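
For readers unfamiliar with the export-then-query approach that the abstract attributes to sqldf, here is a minimal sketch in Python using the standard-library sqlite3 module as the embedded database. It is not sqldf or tidyquery code: the `flights` data and the `query_rows` helper are hypothetical, and tidyquery itself avoids this copy step by running the query through dplyr.

```python
import sqlite3

# Hypothetical in-memory table, standing in for an R data frame.
flights = [
    {"carrier": "AA", "dep_delay": 12},
    {"carrier": "AA", "dep_delay": 3},
    {"carrier": "UA", "dep_delay": 20},
]


def query_rows(rows, table, sql):
    """Copy rows into a temporary SQLite database, run sql, return dict rows."""
    columns = list(rows[0])
    con = sqlite3.connect(":memory:")
    try:
        con.row_factory = sqlite3.Row
        col_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        con.execute(f'CREATE TABLE "{table}" ({col_list})')
        # The data is moved out of the program's own structures here,
        # which is the step the abstract says tidyquery avoids.
        con.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [tuple(r[c] for c in columns) for r in rows],
        )
        return [dict(r) for r in con.execute(sql)]
    finally:
        con.close()


result = query_rows(
    flights,
    "flights",
    "SELECT carrier, AVG(dep_delay) AS mean_delay FROM flights "
    "GROUP BY carrier ORDER BY carrier",
)
```

The copy into the embedded database happens on every call, which is convenient for small tables but is extra work for large ones.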

Data Manipulation Tools: dplyr – Pt 3 Intro to the Grammar of Data Manipulation with R

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. Keep your code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

dplyr docs: dplyr.tidyverse.org/reference/

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw

/01:44 Intro and what’s covered Ground Rules
/02:40 What’s a tibble
/04:50 Use View
/05:25 The Pipe operator:
/07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM

/00:48 Goal 1 Making your data suitable for R
/01:40 tidyr “Tidy” Data introduced and motivated
/08:10 tidyr::gather
/12:30 tidyr::spread
/15:23 tidyr::unite
/15:23 tidyr::separate

Pt. 3: Data manipulation tools: dplyr https://youtu.be/Zc_ufg4uW4U

00:40 setup
02:00 dplyr::select
03:40 dplyr::filter
05:05 dplyr::mutate
07:05 dplyr::summarise
08:30 dplyr::arrange
09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
11:45 dplyr::group_by
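
As a rough picture of what the verbs in this chapter list do, the sketch below imitates them on a list of dictionaries in plain Python. It is an analogy only, not dplyr's implementation or API: the sample `rows` and every helper name (`select`, `filter_rows`, `mutate`, `arrange`, `group_summarise`) are hypothetical and written just for this illustration.

```python
from collections import defaultdict

# Hypothetical data: each dict plays the role of one row in a tibble.
rows = [
    {"name": "a", "group": "x", "value": 3},
    {"name": "b", "group": "y", "value": 10},
    {"name": "c", "group": "x", "value": 4},
    {"name": "d", "group": "y", "value": 1},
]


def select(rows, *cols):
    """Keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]


def filter_rows(rows, pred):
    """Keep only the rows for which pred(row) is true."""
    return [r for r in rows if pred(r)]


def mutate(rows, **new_cols):
    """Add columns computed from each existing row."""
    return [{**r, **{k: f(r) for k, f in new_cols.items()}} for r in rows]


def arrange(rows, key):
    """Sort rows by one column, ascending."""
    return sorted(rows, key=lambda r: r[key])


def group_summarise(rows, by, **summaries):
    """Split rows by a column, then reduce each group to a single row."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[by]].append(r)
    return [
        {by: k, **{n: f(g) for n, f in summaries.items()}}
        for k, g in groups.items()
    ]


# Chaining the steps plays the role of the pipe.
kept = filter_rows(rows, lambda r: r["value"] > 2)
doubled = mutate(kept, doubled=lambda r: r["value"] * 2)
picked = select(doubled, "name", "doubled")
summary = arrange(
    group_summarise(doubled, "group", total=lambda g: sum(r["doubled"] for r in g)),
    "total",
)
```

Each helper takes rows and returns new rows, which is what lets the steps be chained one after another.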

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg Combining two datasets together

/00:42 dplyr::bind_cols
/01:27 dplyr::bind_rows
/01:42 Set operations dplyr::union, dplyr::intersect, dplyr::setdiff
/02:15 Joining data: dplyr::left_join, dplyr::inner_join, dplyr::right_join, dplyr::full_join

Cheatsheets: https://www.rstudio.com/resources/cheatsheets/

Documentation:
tidyr docs: tidyr.tidyverse.org/reference/
tidyr vignette: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
dplyr docs: http://dplyr.tidyverse.org/reference/
dplyr one-table vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
dplyr two-table (join operations) vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

Tidy Data and tidyr – Pt 2 Intro to Data Wrangling with R and the Tidyverse

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. Keep your code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

http://tidyr.tidyverse.org/reference/

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw

/01:44 Intro and what’s covered Ground Rules
/02:40 What’s a tibble
/04:50 Use View
/05:25 The Pipe operator:
/07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM

00:48 Goal 1 Making your data suitable for R
01:40 tidyr “Tidy” Data introduced and motivated
08:10 tidyr::gather
12:30 tidyr::spread
15:23 tidyr::unite
15:23 tidyr::separate

Pt. 3: Data manipulation tools: dplyr https://youtu.be/Zc_ufg4uW4U

/00:40 setup
/02:00 dplyr::select
/03:40 dplyr::filter
/05:05 dplyr::mutate
/07:05 dplyr::summarise
/08:30 dplyr::arrange
/09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
/11:45 dplyr::group_by
/15:00 dplyr::group_by

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg Combining two datasets together

/00:42 dplyr::bind_cols
/01:27 dplyr::bind_rows
/01:42 Set operations dplyr::union, dplyr::intersect, dplyr::setdiff
/02:15 Joining data: dplyr::left_join, dplyr::inner_join, dplyr::right_join, dplyr::full_join

Cheatsheets: https://www.rstudio.com/resources/cheatsheets/

Documentation:
tidyr docs: tidyr.tidyverse.org/reference/
tidyr vignette: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
dplyr docs: http://dplyr.tidyverse.org/reference/
dplyr one-table vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
dplyr two-table (join operations) vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

What is data wrangling? Intro, Motivation, Outline, Setup – Pt. 1 Data Wrangling Introduction

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. These videos introduce you to these tools. Keep your R code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw

01:44 Intro and what’s covered Ground Rules
02:40 What’s a tibble
04:50 Use View
05:25 The Pipe operator:
07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM

/00:48 Goal 1 Making your data suitable for R
/01:40 tidyr “Tidy” Data introduced and motivated
/08:15 tidyr::gather
/12:38 tidyr::spread
/15:30 tidyr::unite
/15:30 tidyr::separate

Pt. 3: Data manipulation tools: dplyr https://youtu.be/Zc_ufg4uW4U

/00:40 setup
/02:00 dplyr::select
/03:40 dplyr::filter
/05:05 dplyr::mutate
/07:05 dplyr::summarise
/08:30 dplyr::arrange
/09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
/11:45 dplyr::group_by
/15:00 dplyr::group_by

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg Combining two datasets together

/00:42 dplyr::bind_cols
/01:27 dplyr::bind_rows
/01:42 Set operations dplyr::union, dplyr::intersect, dplyr::setdiff
/02:15 Joining data: dplyr::left_join, dplyr::inner_join, dplyr::right_join, dplyr::full_join

Cheatsheets: https://www.rstudio.com/resources/cheatsheets/

Documentation:
tidyr docs: tidyr.tidyverse.org/reference/
tidyr vignette: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
dplyr docs: http://dplyr.tidyverse.org/reference/
dplyr one-table vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
dplyr two-table (join operations) vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

New York Times, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, by Steve Lohr, Aug. 17, 2014: https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Working with Two Datasets: Binds, Set Operations, and Joins – Pt 4 Intro to Data Manipulation

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. Keep your R code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

dplyr docs: dplyr.tidyverse.org/reference/

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw

/01:44 Intro and what’s covered Ground Rules:
/02:40 What’s a tibble
/04:50 Use View
/05:25 The Pipe operator:
/07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM

/00:48 Goal 1 Making your data suitable for R
/01:40 tidyr “Tidy” Data introduced and motivated
/08:10 tidyr::gather
/12:30 tidyr::spread
/15:23 tidyr::unite
/15:23 tidyr::separate

Pt. 3: Data manipulation tools: dplyr https://youtu.be/Zc_ufg4uW4U

/00:40 setup
/02:00 dplyr::select
/03:40 dplyr::filter
/05:05 dplyr::mutate
/07:05 dplyr::summarise
/08:30 dplyr::arrange
/09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
/11:45 dplyr::group_by

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg Combining two datasets together

00:42 dplyr::bind_cols
01:27 dplyr::bind_rows
01:42 Set operations dplyr::union, dplyr::intersect, dplyr::setdiff
02:15 Joining data: dplyr::left_join, dplyr::inner_join, dplyr::right_join, dplyr::full_join

Cheatsheets: https://www.rstudio.com/resources/cheatsheets/

Documentation:
tidyr docs: tidyr.tidyverse.org/reference/
tidyr vignette: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
dplyr docs: http://dplyr.tidyverse.org/reference/
dplyr one-table vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
dplyr two-table (join operations) vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

Interacting with Databases from Shiny by Bárbara Borges, useR! Brussels 2017

Connecting to an external database from R can be challenging. It gets harder when you need to interact with a database from a live Shiny application. To demystify this process, I’ll do two things. First, I’ll talk about best practices when connecting to a database from Shiny. There are three important packages that help you with this and I’ll weave them into this part of the talk. The DBI package does a great job of standardizing how to establish a connection, execute safe queries using SQL (goodbye SQL injections!) and close the connection. The dplyr package builds on top of this to make it even easier to connect to databases and extract data, since it allows users to query the database using regular dplyr syntax in R (no SQL knowledge necessary). Yet a third package, pool, exists to help you when using databases in Shiny applications, by taking care of connection management, and often resulting in better performance. Second, I’ll demo these concepts in practice by showing how we can connect to a database from Shiny to create a CRUD application. I will show the application running and point out specific parts of the code (which will be publicly available).
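
The "safe queries" and CRUD ideas in this abstract can be sketched in a few lines. The example below uses Python's standard-library sqlite3 module as a stand-in, not DBI, pool, or Shiny: the `items` table and the four helper functions are hypothetical. The point it illustrates is that user input is passed as a bound parameter and never pasted into the SQL string.

```python
import sqlite3

# In-memory database as a stand-in for an external database server.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)")


def create_item(con, name, qty):
    """Create: values are bound to ? placeholders, not formatted into SQL."""
    cur = con.execute("INSERT INTO items (name, qty) VALUES (?, ?)", (name, qty))
    con.commit()
    return cur.lastrowid


def read_items(con, name):
    """Read: the name is compared as data, whatever characters it contains."""
    return con.execute(
        "SELECT id, name, qty FROM items WHERE name = ?", (name,)
    ).fetchall()


def update_qty(con, item_id, qty):
    """Update one row identified by its primary key."""
    con.execute("UPDATE items SET qty = ? WHERE id = ?", (qty, item_id))
    con.commit()


def delete_item(con, item_id):
    """Delete one row identified by its primary key."""
    con.execute("DELETE FROM items WHERE id = ?", (item_id,))
    con.commit()


widget_id = create_item(con, "widget", 5)
gadget_id = create_item(con, "gadget", 2)
update_qty(con, widget_id, 8)

# A classic injection attempt is treated as an ordinary (non-matching) string.
hostile = "widget' OR '1'='1"
matches_for_hostile = read_items(con, hostile)
matches_for_widget = read_items(con, "widget")

delete_item(con, gadget_id)
remaining = con.execute("SELECT COUNT(*) FROM items").fetchone()[0]
con.close()
```

A single long-lived connection is used here for brevity; the abstract's point about pool is that a real multi-user application would hand connection management to a pooling layer instead.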