tidyr

Hadley Wickham - R in Production

Visit https://rstats.ai for information on upcoming conferences.

Abstract: This talk covers the strategic deployment of R in production environments, moving your work from individual exploration to scalable, collaborative data science. Putting R into production is not just about executing code; it is about crafting solutions that are robust, repeatable, and collaborative, guided by three key principles:

Not just once: Successful data science projects are not a one-off, but will be run repeatedly for months or years. I’ll discuss some of the challenges for creating R scripts and applications that run repeatedly, handle new data seamlessly, and adapt to evolving analytical requirements without constant manual intervention. This principle ensures your analyses are enduring assets, not throwaway toys.
Not just my computer: The transition from development on your laptop (usually Windows or Mac) to a production environment (usually Linux) introduces a number of challenges. Here, I’ll discuss some strategies for making R code portable, how you can minimise pain when something inevitably goes wrong, and a few unresolved auth challenges that we’re currently working on.
Not just me: R is not just a tool for individual analysts but a platform for collaboration. I’ll cover some of the best practices for writing readable, understandable code, and how you might go about sharing that code with your colleagues. This principle underscores the importance of building R projects that are accessible, editable, and usable by others, fostering a culture of collaboration and knowledge sharing.

By adhering to these principles, we pave the way for R to be a powerful tool not just for individual analyses but as a cornerstone of enterprise-level data science solutions. Join me to explore how to harness the full potential of R in production, creating workflows that are robust, portable, and collaborative.

Bio: Hadley is Chief Scientist at Posit PBC, winner of the 2019 COPSS award, and a member of the R Foundation. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. His work includes packages for data science (like the tidyverse, which includes ggplot2, dplyr, and tidyr) and principled software development (e.g. roxygen2, testthat, and pkgdown). He is also a writer, educator, and speaker promoting the use of R for data science. Learn more on his website, http://hadley.nz.

Mastodon: https://fosstodon.org/@hadleywickham

Presented at the 2024 New York R Conference (May 17, 2024). Hosted by Lander Analytics (https://landeranalytics.com).

posit::conf(2023) Workshop: Introduction to Data Science with R and Tidyverse

Register now: http://pos.it/conf
Instructors: Posit Academy Instructors
Workshop Duration: 2-Day Workshop

This course is ideal for:
• those new to R or the Tidyverse
• anyone who has dabbled in R, but now wants a rigorous foundation in up-to-date data science best practices
• SAS and Excel users looking to switch their workflows to R

This is not a standard workshop, but a six-week online apprenticeship that culminates in two in-person days at posit::conf(2023). Begins August 7th, 2023. No knowledge of R required. Visit posit.co/academy to learn more about this uniquely effective learning format.

Here, you will learn the foundations of R and the Tidyverse under the guidance of a Posit Academy mentor and in the company of a close group of fellow learners. You will be expected to complete a weekly curriculum of interactive tutorials, and to attend a weekly presentation meeting with your mentor and fellow students. Topics will include the basics of R, importing data, visualizing data with ggplot2, wrangling data with dplyr and tidyr, working with strings, factors, and date-times, modelling data with base R, and reporting reproducibly with Quarto.

posit::conf(2023) Workshop: Tidy time series and forecasting in R

Register now: http://pos.it/conf
Instructor: Rob J Hyndman
Workshop Duration: 2-Day Workshop

This course is for you if you:
• already use the tidyverse packages in R such as dplyr, tidyr, tibble and ggplot2
• need to analyze large collections of related time series
• would like to learn how to use some tidy tools for time series analysis including visualization, decomposition and forecasting

It is common for organizations to collect huge amounts of data over time, and existing time series analysis tools are not always suitable to handle the scale, frequency and structure of the data collected. In this workshop, we will look at some packages and methods that have been developed to handle the analysis of large collections of time series.

On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to explore time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feasts (along with the tidyverse of course).
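As a rough illustration of the tsibble data structure mentioned above, the following sketch builds a small collection of two related daily series. The `sales` data, the `store` key and the `units` column are invented for this example and are not taken from the workshop materials; only tsibble itself is shown, not lubridate or feasts.

```r
library(tsibble)

# Hypothetical data: three days of unit sales for two stores.
sales <- data.frame(
  date  = rep(as.Date("2023-01-01") + 0:2, times = 2),
  store = rep(c("north", "south"), each = 3),
  units = c(5, 7, 6, 12, 11, 14),
  stringsAsFactors = FALSE
)

# A tsibble needs an index (the time variable) and, for a collection
# of series, a key that identifies each individual series.
sales_ts <- as_tsibble(sales, key = store, index = date)

key_vars(sales_ts)   # name(s) of the key column(s)
index_var(sales_ts)  # name of the index column
```

Declaring the key and index up front is what lets the same object hold many related series at once, which is the situation the workshop description is concerned with.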

Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package, and we will explore the creation of ensemble forecasts and hybrid forecasts. Best practices for evaluating forecast accuracy will also be covered. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related.

Hadley Wickham | Maintaining the house the tidyverse built | RStudio

Hadley will talk about how the tidyverse has evolved since its creation (just five years ago!). You’ll learn about our greatest successes, learn from our biggest failures, and get some hints of what’s coming down the pipeline for the future.

About Hadley: Hadley Wickham is the Chief Scientist at RStudio, a member of the R Foundation, and Adjunct Professor at Stanford University and the University of Auckland. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. You may be familiar with his packages for data science (the tidyverse: including ggplot2, dplyr, tidyr, purrr, and readr) and principled software development (roxygen2, testthat, devtools, pkgdown). Much of the material for the course is drawn from two of his existing books, Advanced R and R Packages, but the course also includes a lot of new material that will eventually become a book called “Tidy tools”.

Tyson Barrett | List-columns in data.table | RStudio (2020)

The use of list-columns in data frames and tibbles is well documented (e.g. Bryan, 2018), providing a cognitively efficient way to organize results of complex data (e.g. several statistical models, groupings of text, data summaries, or even graphics) with corresponding data. For example, one can store student information within classrooms, player information within teams, or analyses within groups. This allows the data to be of variable sizes without overly complicating or adding redundancies to the structure of the data. In turn, this can make it easier to analyze the data appropriately and reliably. Because of its efficiency and speed, being able to use data.table to work with list-columns would be beneficial in many data contexts (e.g. to reduce memory usage in large data sets). Herein, I demonstrate how one can create list-columns in a data.table using the by argument in data.table and purrr::map(). I compare the behavior of the data.table approaches to the dplyr::group_nest() function and tidyr::unnest(), two of the several powerful Tidyverse nesting and unnesting functions. Results using bench::mark() show the speed and efficiency of using data.table to work with list-columns.
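The two nesting approaches compared in the abstract might look roughly like the sketch below. It assumes the built-in mtcars data set and groups by `cyl`, neither of which is taken from the talk, and the `n_rows` column is purely illustrative. It does not reproduce the talk's bench::mark() comparison.

```r
library(data.table)
library(dplyr)
library(tidyr)

# data.table: wrap each group's subset (.SD) in a list, one row per group.
dt <- as.data.table(mtcars)
nested_dt <- dt[, .(data = list(.SD)), by = cyl]

# purrr can then map over the list-column; here we count rows per group.
nested_dt[, n_rows := purrr::map_int(data, nrow)]

# Tidyverse: group_nest() builds the same shape as a tibble,
# and unnest() flattens the list-column back out.
nested_tbl <- group_nest(mtcars, cyl)
flat <- unnest(nested_tbl, data)
```

In both cases the result has one row per value of `cyl`, with the remaining columns tucked into the `data` list-column.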

Data Manipulation Tools: dplyr – Pt 3 Intro to the Grammar of Data Manipulation with R

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. Keep your code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

dplyr docs: dplyr.tidyverse.org/reference/

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw

01:44 Intro and what’s covered; ground rules
02:40 What’s a tibble
04:50 Use View
05:25 The pipe operator
07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM

00:48 Goal 1 Making your data suitable for R
01:40 tidyr “Tidy” Data introduced and motivated
08:10 tidyr::gather
12:30 tidyr::spread
15:23 tidyr::unite
15:23 tidyr::separate

Pt. 3: Data manipulation tools: dplyr https://youtu.be/Zc_ufg4uW4U

00:40 setup
02:00 dplyr::select
03:40 dplyr::filter
05:05 dplyr::mutate
07:05 dplyr::summarise
08:30 dplyr::arrange
09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
11:45 dplyr::group_by
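
As a minimal sketch of how the verbs listed for Pt. 3 chain together with the pipe, the example below assumes the built-in mtcars data set, which is not necessarily the data used in the video. The `kpl` column and the weight cut-off are illustrative choices.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, wt) %>%              # keep only the columns we need
  filter(wt < 4) %>%                    # drop the heaviest cars
  mutate(kpl = mpg * 0.4251) %>%        # miles per gallon -> km per litre
  group_by(cyl) %>%                     # one group per cylinder count
  summarise(mean_kpl = mean(kpl), n = n()) %>%
  arrange(desc(mean_kpl))               # most efficient group first

result
```

Each verb takes a data frame and returns a data frame, which is what makes the steps composable in a single pipeline.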

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg
Combining two datasets together

00:42 dplyr::bind_cols
01:27 dplyr::bind_rows
01:42 Set operations: dplyr::union, dplyr::intersect, dplyr::setdiff
02:15 Joining data: dplyr::left_join, dplyr::inner_join, dplyr::right_join, dplyr::full_join

Cheatsheets: https://www.rstudio.com/resources/cheatsheets/

Documentation:
tidyr docs: tidyr.tidyverse.org/reference/
tidyr vignette: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
dplyr docs: http://dplyr.tidyverse.org/reference/
dplyr one-table vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
dplyr two-table (join operations) vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

Tidy Data and tidyr – Pt 2 Intro to Data Wrangling with R and the Tidyverse

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. Keep your code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

http://tidyr.tidyverse.org/reference/

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw

01:44 Intro and what’s covered; ground rules
02:40 What’s a tibble
04:50 Use View
05:25 The pipe operator
07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM

00:48 Goal 1 Making your data suitable for R
01:40 tidyr “Tidy” Data introduced and motivated
08:10 tidyr::gather
12:30 tidyr::spread
15:23 tidyr::unite
15:23 tidyr::separate
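
The four tidyr functions listed for Pt. 2 can be sketched on a tiny table as follows. The `wide` data frame and its column names are invented for illustration and are not the data used in the video. In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(), though they remain available.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# gather(): wide -> long, one row per country-year pair.
long <- gather(wide, key = "year", value = "cases", `2019`, `2020`)

# spread(): long -> wide, undoing the gather.
wide_again <- spread(long, key = year, value = cases)

# unite(): paste two columns into one; separate(): split it back apart.
united <- unite(long, col = "country_year", country, year, sep = "_")
split_again <- separate(united, col = country_year,
                        into = c("country", "year"), sep = "_")
```

The long form has one observation per row, which is the layout that the "tidy data" part of the video motivates.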

Pt. 3: Data manipulation tools: dplyr https://youtu.be/Zc_ufg4uW4U

00:40 setup
02:00 dplyr::select
03:40 dplyr::filter
05:05 dplyr::mutate
07:05 dplyr::summarise
08:30 dplyr::arrange
09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
11:45 dplyr::group_by
15:00 dplyr::group_by

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg
Combining two datasets together

00:42 dplyr::bind_cols
01:27 dplyr::bind_rows
01:42 Set operations: dplyr::union, dplyr::intersect, dplyr::setdiff
02:15 Joining data: dplyr::left_join, dplyr::inner_join, dplyr::right_join, dplyr::full_join

What is data wrangling? Intro, Motivation, Outline, Setup – Pt. 1 Data Wrangling Introduction

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. These videos introduce you to these tools. Keep your R code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw

01:44 Intro and what’s covered; ground rules
02:40 What’s a tibble
04:50 Use View
05:25 The pipe operator
07:20 What do I mean by data wrangling?
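
For readers who want a concrete picture of the tibble, View and pipe topics listed above, here is a small sketch. The `df` object and its columns are made up for this example and are not taken from the video.

```r
library(dplyr)  # re-exports tibble() and the %>% pipe

# A tibble is a data frame with stricter, more predictable behaviour.
# Later columns can refer to earlier ones while it is being built.
df <- tibble(x = 1:3, y = x * 2)

# The pipe passes the left-hand side as the first argument of the
# right-hand side, so these two calls are equivalent.
n_piped  <- df %>% nrow()
n_direct <- nrow(df)

# View() opens a spreadsheet-style viewer; only useful interactively.
if (interactive()) View(df)
```

Reading a pipeline left to right as "take df, then do this" is the habit the later parts of the series build on.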

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM

00:48 Goal 1 Making your data suitable for R
01:40 tidyr “Tidy” Data introduced and motivated
08:15 tidyr::gather
12:38 tidyr::spread
15:30 tidyr::unite
15:30 tidyr::separate

Pt. 3: Data manipulation tools: dplyr https://youtu.be/Zc_ufg4uW4U

00:40 setup
02:00 dplyr::select
03:40 dplyr::filter
05:05 dplyr::mutate
07:05 dplyr::summarise
08:30 dplyr::arrange
09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
11:45 dplyr::group_by
15:00 dplyr::group_by

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg
Combining two datasets together

00:42 dplyr::bind_cols
01:27 dplyr::bind_rows
01:42 Set operations: dplyr::union, dplyr::intersect, dplyr::setdiff
02:15 Joining data: dplyr::left_join, dplyr::inner_join, dplyr::right_join, dplyr::full_join

New York Times, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, by Steve Lohr, Aug. 17, 2014 https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Working with Two Datasets: Binds, Set Operations, and Joins – Pt 4 Intro to Data Manipulation

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. Keep your R code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

dplyr docs: dplyr.tidyverse.org/reference/

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw

01:44 Intro and what’s covered; ground rules
02:40 What’s a tibble
04:50 Use View
05:25 The pipe operator
07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM

00:48 Goal 1 Making your data suitable for R
01:40 tidyr “Tidy” Data introduced and motivated
08:10 tidyr::gather
12:30 tidyr::spread
15:23 tidyr::unite
15:23 tidyr::separate

Pt. 3: Data manipulation tools: dplyr https://youtu.be/Zc_ufg4uW4U

00:40 setup
02:00 dplyr::select
03:40 dplyr::filter
05:05 dplyr::mutate
07:05 dplyr::summarise
08:30 dplyr::arrange
09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
11:45 dplyr::group_by

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg
Combining two datasets together

00:42 dplyr::bind_cols
01:27 dplyr::bind_rows
01:42 Set operations: dplyr::union, dplyr::intersect, dplyr::setdiff
02:15 Joining data: dplyr::left_join, dplyr::inner_join, dplyr::right_join, dplyr::full_join
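
The three families of two-table operations listed for Pt. 4 can be sketched as follows. The data frames `a`, `b`, `extra` and `scores` are invented for this example and are not the data used in the video.

```r
library(dplyr)

# Hypothetical tables with the same columns (for binds and set operations).
a <- data.frame(id = 1:3, x = c("a", "b", "c"), stringsAsFactors = FALSE)
b <- data.frame(id = 3:5, x = c("c", "d", "e"), stringsAsFactors = FALSE)

# Binds: stack rows, or glue columns side by side (row counts must match).
extra   <- data.frame(flag = c(TRUE, FALSE, TRUE))
stacked <- bind_rows(a, b)      # all rows of a followed by all rows of b
widened <- bind_cols(a, extra)  # columns of a plus the flag column

# Set operations compare whole rows.
in_either <- union(a, b)        # distinct rows found in a or b
in_both   <- intersect(a, b)    # rows found in both
only_in_a <- setdiff(a, b)      # rows in a that are not in b

# Joins match rows on a key column.
scores <- data.frame(id = 3:5, score = c(30, 40, 50))
left  <- left_join(a, scores, by = "id")   # every row of a
inner <- inner_join(a, scores, by = "id")  # only ids present in both
right <- right_join(a, scores, by = "id")  # every row of scores
full  <- full_join(a, scores, by = "id")   # every id from either table
```

Binds ignore the contents of the rows, set operations compare entire rows, and joins match on the key only, which is why the row counts of the results differ.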