<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>MLOps and Admin on Posit Open Source</title>
    <link>https://posit-open-source.netlify.app/categories/mlops-and-admin/</link>
    <description>Recent content in MLOps and Admin on Posit Open Source</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Mon, 22 Apr 2024 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://posit-open-source.netlify.app/categories/mlops-and-admin/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>News from the sparkly-verse</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-updates-q1-2024/</link>
      <pubDate>Mon, 22 Apr 2024 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-updates-q1-2024/</guid>
      <dc:creator>Edgar Ruiz</dc:creator><description><![CDATA[<h2 id="highlights">Highlights
</h2>
<p><code>sparklyr</code> and friends have received some important updates over the past few
months. Here are some highlights:</p>
<ul>
<li>
<p><code>spark_apply()</code> now works on Databricks Connect v2</p>
</li>
<li>
<p><code>sparkxgb</code> is coming back to life</p>
</li>
<li>
<p>Support for Spark 2.3 and below has ended</p>
</li>
</ul>
<h2 id="pysparklyr-014">pysparklyr 0.1.4
</h2>
<p><code>spark_apply()</code> now works on Databricks Connect v2. The latest <code>pysparklyr</code>
release uses the <code>rpy2</code> Python library as the backbone of the integration.</p>
<p>Databricks Connect v2 is based on Spark Connect. At this time, it supports
Python user-defined functions (UDFs), but not R user-defined functions.
Using <code>rpy2</code> circumvents this limitation. As shown in the diagram, <code>sparklyr</code>
sends the R code to the locally installed <code>rpy2</code>, which in turn sends it
to Spark. The <code>rpy2</code> installed in the remote Databricks cluster then runs
the R code.</p>
<figure>
<img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-updates-q1-2024/images/r-udfs.png" data-fig-alt="Diagram that shows how sparklyr transmits the R code via the rpy2 python package, and how Spark uses it to run the R code" width="600" alt="R code via rpy2" />
<figcaption aria-hidden="true">R code via rpy2</figcaption>
</figure>
<p>A big advantage of this approach is that <code>rpy2</code> supports Arrow. In fact, it
is the recommended Python library to use when integrating <a href="https://arrow.apache.org/docs/python/integration/python_r.html" target="_blank" rel="noopener">Spark, Arrow and
R</a>.
This means that data exchange between the three environments will be much
faster!</p>
<p>As in the original implementation, schema inference works, and as in the
original implementation, it comes at a performance cost. Unlike the original,
however, this implementation returns a &lsquo;columns&rsquo; specification that you can
pass the next time you run the call.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">spark_apply</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">tbl_mtcars</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">nrow</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">group_by</span> <span class="o">=</span> <span class="s">&#34;am&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; To increase performance, use the following schema:</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; columns = &#34;am double, x long&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; # Source:   table&lt;`sparklyr_tmp_table_b84460ea_b1d3_471b_9cef_b13f339819b6`&gt; [2 x 2]</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; # Database: spark_connection</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt;      am     x</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt;   &lt;dbl&gt; &lt;dbl&gt;</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 1     0    19</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 2     1    13</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>A full article about this new capability is available here:
<a href="https://spark.posit.co/deployment/databricks-connect-udfs.html" target="_blank" rel="noopener">Run R inside Databricks Connect</a>
</p>
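<p>Because the suggested schema is returned as plain text, it can be passed back on the next call to skip inference altogether. A minimal sketch (not from the post), assuming the same <code>tbl_mtcars</code> table and an open Databricks Connect session:</p>

```r
# Sketch only: requires a live Spark / Databricks Connect session
# and the `tbl_mtcars` table used in the example above.
library(sparklyr)

# Supplying the schema suggested by the previous run via `columns`
# lets spark_apply() skip the costly inference step.
spark_apply(
  tbl_mtcars,
  nrow,
  group_by = "am",
  columns = "am double, x long"
)
```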
<h2 id="sparkxgb">sparkxgb
</h2>
<p><code>sparkxgb</code> is an extension of <code>sparklyr</code> that enables integration with
<a href="https://xgboost.readthedocs.io/en/stable/" target="_blank" rel="noopener">XGBoost</a>. The current CRAN release
does not support the latest versions of XGBoost. This limitation recently
prompted a full refresh of <code>sparkxgb</code>. Here is a summary of the improvements,
which are currently in the <a href="https://github.com/rstudio/sparkxgb" target="_blank" rel="noopener">development version of the package</a>:</p>
<ul>
<li>
<p>The <code>xgboost_classifier()</code> and <code>xgboost_regressor()</code> functions no longer
pass the values of two arguments that XGBoost has deprecated and that
cause an error if used. The arguments remain in the R functions for
backwards compatibility, but they now generate an informative error if not left <code>NULL</code>:</p>
<ul>
<li><code>sketch_eps</code> - As of <a href="https://github.com/dmlc/xgboost/blob/59d7b8dc72df7ed942885676964ea0a681d09590/NEWS.md?plain=1#L494" target="_blank" rel="noopener">XGBoost version 1.6.0</a>,
<code>sketch_eps</code> was replaced by <code>max_bins</code></li>
<li><code>timeout_request_workers</code> - Removed in <a href="https://github.com/dmlc/xgboost/blob/59d7b8dc72df7ed942885676964ea0a681d09590/NEWS.md?plain=1#L398" target="_blank" rel="noopener">XGBoost version 1.7.0</a>,
because it was no longer needed once XGBoost added barrier support</li>
</ul>
</li>
<li>
<p>Updates the JVM library used during the Spark session. It now uses <a href="https://central.sonatype.com/artifact/ml.dmlc/xgboost4j-spark_2.12" target="_blank" rel="noopener">xgboost4j-spark
version 2.0.3</a>,
instead of 0.8.1. This gives us access to XGBoost&rsquo;s most recent Spark code.</p>
</li>
<li>
<p>Updates code that used deprecated functions from upstream R dependencies, and
stops using an unmaintained package (<code>forge</code>) as a dependency. This
eliminated all of the warnings that occurred when fitting a model.</p>
</li>
<li>
<p>Major improvements to package testing. Unit tests were updated and expanded,
the way <code>sparkxgb</code> automatically starts and stops the Spark session for testing
was modernized, and the continuous integration tests were restored. This will
ensure the package&rsquo;s health going forward.</p>
</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">remotes</span><span class="o">::</span><span class="nf">install_github</span><span class="p">(</span><span class="s">&#34;rstudio/sparkxgb&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparkxgb</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_tbl</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">iris</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">xgb_model</span> <span class="o">&lt;-</span> <span class="nf">xgboost_classifier</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">iris_tbl</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">Species</span> <span class="o">~</span> <span class="n">.,</span>
</span></span><span class="line"><span class="cl">  <span class="n">num_class</span> <span class="o">=</span> <span class="m">3</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">num_round</span> <span class="o">=</span> <span class="m">50</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">max_depth</span> <span class="o">=</span> <span class="m">4</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">xgb_model</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">ml_predict</span><span class="p">(</span><span class="n">iris_tbl</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">Species</span><span class="p">,</span> <span class="n">predicted_label</span><span class="p">,</span> <span class="nf">starts_with</span><span class="p">(</span><span class="s">&#34;probability_&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">glimpse</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Rows: ??</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Columns: 5</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Database: spark_connection</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ Species                &lt;chr&gt; &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa…</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ predicted_label        &lt;chr&gt; &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa…</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ probability_setosa     &lt;dbl&gt; 0.9971547, 0.9948581, 0.9968392, 0.9968392, 0.9…</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ probability_versicolor &lt;dbl&gt; 0.002097376, 0.003301427, 0.002284616, 0.002284…</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ probability_virginica  &lt;dbl&gt; 0.0007479066, 0.0018403779, 0.0008762418, 0.000…</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="sparklyr-185">sparklyr 1.8.5
</h2>
<p>The new version of <code>sparklyr</code> does not include user-facing improvements, but
internally it has crossed an important milestone: support for Spark version 2.3
and below has effectively ended, and the Scala code needed to support those
versions is no longer part of the package. As per Spark&rsquo;s <a href="https://spark.apache.org/versioning-policy.html" target="_blank" rel="noopener">versioning policy</a>,
Spark 2.3 reached end of life in 2018.</p>
<p>This is part of a larger, ongoing effort to make the immense code base of
<code>sparklyr</code> a little easier to maintain, and hence to reduce the risk of failures.
As part of the same effort, the number of upstream packages that <code>sparklyr</code>
depends on has been reduced. This has been happening across multiple CRAN
releases; in this latest release, <code>tibble</code> and <code>rappdirs</code> are no longer
imported by <code>sparklyr</code>.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-updates-q1-2024/thumbnail.png" length="314789" type="image/png" />
    </item>
    <item>
      <title>Announcing bundle</title>
      <link>https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/</link>
      <pubDate>Fri, 16 Sep 2022 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/</guid>
      <dc:creator>Julia Silge</dc:creator><description><![CDATA[<p>We&rsquo;re thrilled to announce the first release of <a href="https://rstudio.github.io/bundle/" target="_blank" rel="noopener">bundle</a>
. The bundle package provides a consistent interface to capture all information needed to serialize a model, situate that information within a portable object, and restore it for use in new settings.</p>
<p>You can install it from CRAN with:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;bundle&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Let&rsquo;s walk through what bundle does, and when you might need to use it.</p>
<h2 id="saving-things-is-hard">Saving things is hard
</h2>
<p>We often think of a trained model as a self-contained R object. The model exists in memory in R and if we have some new data, the model object can generate predictions on its own:</p>
<img src="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/diagram_01.png" alt="A diagram showing a rectangle, labeled model object, and another rectangle, labeled predictions. The two are connected by an arrow from model object to predictions, with the label predict." width="100%" />
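<p>For instance, a plain <code>lm()</code> fit in base R is fully self-contained (a quick sketch, not from the post):</p>

```r
# A self-contained model object: everything predict() needs lives
# inside the fitted object itself.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Given new data, the object generates predictions on its own --
# no external references required.
preds <- predict(fit, newdata = head(mtcars, 3))
length(preds)
#> [1] 3
```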
<p>In reality, model objects sometimes also make use of <em>references</em> to generate predictions. A reference is a piece of information that a model object refers to that isn&rsquo;t part of the object itself; this could be something like a connection to a server, a file on disk, or an internal function in the package used to train the model. When we call <code>predict()</code>, model objects know where to look to retrieve that information:</p>
<img src="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/diagram_02.png" alt="A diagram showing the same pair of rectangles as before, connected by the arrow labeled predict. This time, though, we introduce two boxes labeled reference. These two boxes are connected to the arrow labeled predict with dotted arrows, to show that, most of the time, we don't need to think about including them in our workflow." width="100%" />
<p>Saving model objects can sometimes disrupt those references. Thus, if we want to train a model, save it, re-load it in a production setting, and generate predictions with it, we may run into issues:</p>
<img src="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/diagram_03.png" alt="A diagram showing the same set of rectangles, representing a prediction problem, as before. This version of the diagram adds two boxes, labeled R Session numbe r one, and R session number two. In R session number two, we have a new rectangle labeled standalone model object. In focus is the arrow from the model object, in R Session number one, to the standalone model object in R session number two." width="100%" />
<p>We need some way to preserve access to those references. This new package provides a consistent interface for <em>bundling</em> model objects with their references so that they can be safely saved and re-loaded in production:</p>
<img src="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/diagram_04.png" alt="A replica of the previous diagram, where the arrow previously connecting the model object in R session one and the standalone model object in R session two is connected by a verb called bundle. The bundle function outputs an object called a bundle." width="100%" />
<h2 id="when-to-bundle-your-model">When to bundle your model
</h2>
<p>Let&rsquo;s walk through building a couple of models using data on <a href="https://modeldata.tidymodels.org/reference/cells.html" target="_blank" rel="noopener">cell body segmentation</a>
.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">data</span><span class="p">(</span><span class="n">cells</span><span class="p">,</span> <span class="n">package</span> <span class="o">=</span> <span class="s">&#34;modeldata&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">cell_split</span> <span class="o">&lt;-</span> <span class="n">cells</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="o">-</span><span class="n">case</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">initial_split</span><span class="p">(</span><span class="n">strata</span> <span class="o">=</span> <span class="n">class</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">cell_train</span> <span class="o">&lt;-</span> <span class="nf">training</span><span class="p">(</span><span class="n">cell_split</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">cell_test</span>  <span class="o">&lt;-</span> <span class="nf">testing</span><span class="p">(</span><span class="n">cell_split</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>First, let&rsquo;s train a logistic regression model:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">glm_fit</span> <span class="o">&lt;-</span> <span class="nf">glm</span><span class="p">(</span><span class="n">class</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">family</span> <span class="o">=</span> <span class="s">&#34;binomial&#34;</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">cell_train</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>If we&rsquo;re satisfied with this model and think it is ready for production, we might want to deploy it somewhere, maybe as a REST API or as a Shiny app. A typical approach would be to:</p>
<ul>
<li>save our model object</li>
<li>start up a new R session</li>
<li>load the model object into the new session</li>
<li>predict on new data with the loaded model object</li>
</ul>
<p>The <a href="https://callr.r-lib.org/" target="_blank" rel="noopener">callr</a>
 package is helpful for demonstrating this kind of situation; it allows us to start up a fresh R session and pass a few objects in.</p>
<p>We&rsquo;ll just make use of two of the arguments to the function <code>r()</code>:</p>
<ul>
<li><code>func</code>: A function that, given a model object and some new data, will generate predictions, and</li>
<li><code>args</code>: A named list, giving the arguments to the above function.</li>
</ul>
<p>Let&rsquo;s save our model object to a temporary file and pass it to a fresh R session for prediction, like if we had deployed the model somewhere.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">callr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">temp_file</span> <span class="o">&lt;-</span> <span class="nf">tempfile</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="nf">saveRDS</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">,</span> <span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">r</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="kr">function</span><span class="p">(</span><span class="n">temp_file</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">model_object</span> <span class="o">&lt;-</span> <span class="nf">readRDS</span><span class="p">(</span><span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">predict</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="n">args</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">temp_file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_data</span> <span class="o">=</span> <span class="nf">head</span><span class="p">(</span><span class="n">cell_test</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>##          1          2          3          4          5          6 
## -4.8706401 -1.8143956  2.3386470 -1.2735249 -0.3586448  2.7865270
</code></pre><p>Nice! 😀</p>
<p>What if instead we wanted to train a neural network using tidymodels, with keras as the modeling engine?</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">cell_rec</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">recipe</span><span class="p">(</span><span class="n">class</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">cell_train</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_YeoJohnson</span><span class="p">(</span><span class="nf">all_numeric_predictors</span><span class="p">())</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_normalize</span><span class="p">(</span><span class="nf">all_numeric_predictors</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">keras_spec</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">mlp</span><span class="p">(</span><span class="n">penalty</span> <span class="o">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">set_mode</span><span class="p">(</span><span class="s">&#34;classification&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">set_engine</span><span class="p">(</span><span class="s">&#34;keras&#34;</span><span class="p">,</span> <span class="n">verbose</span> <span class="o">=</span> <span class="m">0</span><span class="p">)</span> 
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">keras_fit</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">workflow</span><span class="p">(</span><span class="n">cell_rec</span><span class="p">,</span> <span class="n">keras_spec</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">fit</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">cell_train</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Let&rsquo;s try to save this to disk and then reload it in a new session.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">temp_file</span> <span class="o">&lt;-</span> <span class="nf">tempfile</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="nf">saveRDS</span><span class="p">(</span><span class="n">keras_fit</span><span class="p">,</span> <span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">r</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="kr">function</span><span class="p">(</span><span class="n">temp_file</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">library</span><span class="p">(</span><span class="n">workflows</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">model_object</span> <span class="o">&lt;-</span> <span class="nf">readRDS</span><span class="p">(</span><span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">predict</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="n">args</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">temp_file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_data</span> <span class="o">=</span> <span class="nf">head</span><span class="p">(</span><span class="n">cell_test</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>## Error: ! error in callr subprocess
## Caused by error in `do.call(object$predict, args)`:
## ! &#39;what&#39; must be a function or character string
</code></pre><p>Oh no! 😱</p>
<p>It turns out that keras models <a href="https://tensorflow.rstudio.com/guides/keras/serialization_and_saving.html" target="_blank" rel="noopener">need to be saved in a special way</a>
. This is true of a handful of models, like XGBoost, and even some preprocessing steps, like UMAP. These special ways to save objects, like the ones that keras provides, are often referred to as <em>native serialization</em>. Methods for native serialization know which references need to be brought along in order for an object to effectively do its thing in a new environment, but they are different for each model.</p>
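<p>For reference, keras&rsquo;s own native serialization in R looks something like this (a sketch, assuming a raw keras model object rather than a tidymodels workflow):</p>
<pre tabindex="0"><code>library(keras)

# native save: writes the architecture, weights, and optimizer state
save_model_tf(keras_model, &#34;model_dir&#34;)

# native load in a new session
keras_model &lt;- load_model_tf(&#34;model_dir&#34;)
</code></pre>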
<p>The bundle package provides a consistent way to deal with all these kinds of special serialization. It provides two functions, <code>bundle()</code> and <code>unbundle()</code>, that take care of all of the minutiae of preparing to save and load R objects effectively. You <code>bundle()</code> your model before you save it:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">bundle</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">temp_file</span> <span class="o">&lt;-</span> <span class="nf">tempfile</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">keras_bundle</span> <span class="o">&lt;-</span> <span class="nf">bundle</span><span class="p">(</span><span class="n">keras_fit</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">saveRDS</span><span class="p">(</span><span class="n">keras_bundle</span><span class="p">,</span> <span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>And then you <code>unbundle()</code> after you read the object in a new session:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">r</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="kr">function</span><span class="p">(</span><span class="n">temp_file</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">library</span><span class="p">(</span><span class="n">bundle</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">library</span><span class="p">(</span><span class="n">workflows</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">model_bundle</span> <span class="o">&lt;-</span> <span class="nf">readRDS</span><span class="p">(</span><span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">model_object</span> <span class="o">&lt;-</span> <span class="nf">unbundle</span><span class="p">(</span><span class="n">model_bundle</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">predict</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="n">args</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">temp_file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_data</span> <span class="o">=</span> <span class="nf">head</span><span class="p">(</span><span class="n">cell_test</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>## # A tibble: 6 × 1
##   .pred_class
##   &lt;fct&gt;      
## 1 PS         
## 2 PS         
## 3 WS         
## 4 PS         
## 5 PS         
## 6 WS
</code></pre><p>Hooray! 🎉</p>
<p>We have support in bundle for a <a href="https://rstudio.github.io/bundle/reference/" target="_blank" rel="noopener">wide variety</a>
 of models that require (or <em>sometimes</em> require) special handling for serialization, from <a href="https://h2o.ai/" target="_blank" rel="noopener">H2O</a>
 to <a href="https://mlverse.github.io/luz/" target="_blank" rel="noopener">torch luz models</a>
. Soon bundle will be integrated into <a href="https://vetiver.rstudio.com/" target="_blank" rel="noopener">vetiver</a>
, for better and more robust deployment options. If you use a model that needs special serialization and is not yet supported, <a href="https://github.com/rstudio/bundle/issues" target="_blank" rel="noopener">let us know</a>
 in an issue.</p>
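<p>The same two-step pattern applies to any supported model type; for example, for an xgboost fit (a sketch, assuming an existing fitted object named <code>xgb_fit</code>):</p>
<pre tabindex="0"><code>library(bundle)

xgb_bundle &lt;- bundle(xgb_fit)
saveRDS(xgb_bundle, &#34;xgb_bundle.rds&#34;)

# ...then, in a new R session:
xgb_fit2 &lt;- unbundle(readRDS(&#34;xgb_bundle.rds&#34;))
</code></pre>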
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>Thank you so much to everyone who contributed to this first release: <a href="https://github.com/dfalbel" target="_blank" rel="noopener">@dfalbel</a>
, <a href="https://github.com/juliasilge" target="_blank" rel="noopener">@juliasilge</a>
, <a href="https://github.com/qiushiyan" target="_blank" rel="noopener">@qiushiyan</a>
, and <a href="https://github.com/simonpcouch" target="_blank" rel="noopener">@simonpcouch</a>
. I would especially like to highlight Simon&rsquo;s contributions, which have been central to bundle getting off the ground!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/thumbnail-wd.jpg" length="185242" type="image/jpeg" />
    </item>
    <item>
      <title>Announcing vetiver for MLOps in R and Python</title>
      <link>https://posit-open-source.netlify.app/blog/tidyverse/2022/announce-vetiver/</link>
      <pubDate>Thu, 09 Jun 2022 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/tidyverse/2022/announce-vetiver/</guid>
      <dc:creator>Julia Silge</dc:creator><description><![CDATA[
<p>We are thrilled to announce the release of <a href="https://vetiver.rstudio.com/" target="_blank" rel="noopener">vetiver</a>
, a framework for MLOps tasks in R and Python! The goal of vetiver is to provide fluent tooling to <strong>version</strong>, <strong>share</strong>, <strong>deploy</strong>, and <strong>monitor</strong> a trained model. If you like perfume or candles, you may recognize this name; vetiver, also known as the &ldquo;oil of tranquility&rdquo;, is used as a stabilizing ingredient in perfumery to preserve more volatile fragrances.</p>
<p>You can install the released version of vetiver for R from <a href="https://cran.r-project.org/package=vetiver" target="_blank" rel="noopener">CRAN</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;vetiver&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can install the released version of vetiver for Python from <a href="https://pypi.org/project/vetiver/" target="_blank" rel="noopener">PyPI</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">pip</span> <span class="n">install</span> <span class="n">vetiver</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We are sharing more about what vetiver is and how it works over <a href="https://www.rstudio.com/blog/announce-vetiver/" target="_blank" rel="noopener">on the RStudio blog</a>
 so check that out, but we want to share here as well!</p>
<h2 id="train-a-model">Train a model
</h2>
<p>For this example, let’s work with everyone&rsquo;s favorite dataset on fuel efficiency for cars and predict miles per gallon. In R, we can train a decision tree model using a <a href="https://www.tidymodels.org/" target="_blank" rel="noopener">tidymodels</a>
 workflow:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">car_mod</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">    <span class="nf">workflow</span><span class="p">(</span><span class="n">mpg</span> <span class="o">~</span> <span class="n">.,</span> <span class="nf">decision_tree</span><span class="p">(</span><span class="n">mode</span> <span class="o">=</span> <span class="s">&#34;regression&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">fit</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In Python, we can train the same kind of model using <a href="https://scikit-learn.org/" target="_blank" rel="noopener">scikit-learn</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">vetiver.data</span> <span class="kn">import</span> <span class="n">mtcars</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">tree</span>
</span></span><span class="line"><span class="cl"><span class="n">car_mod</span> <span class="o">=</span> <span class="n">tree</span><span class="o">.</span><span class="n">DecisionTreeRegressor</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">mtcars</span><span class="p">[</span><span class="s2">&#34;mpg&#34;</span><span class="p">])</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>For both R and Python, the <code>car_mod</code> object is a fitted model, with parameters estimated using our training data <code>mtcars</code>.</p>
<h2 id="create-a-vetiver-model">Create a vetiver model
</h2>
<p>We can create a <code>vetiver_model()</code> in R or <code>VetiverModel()</code> in Python from the trained model; a vetiver model object collects the information needed to store, version, and deploy a trained model.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">vetiver</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">v</span> <span class="o">&lt;-</span> <span class="nf">vetiver_model</span><span class="p">(</span><span class="n">car_mod</span><span class="p">,</span> <span class="s">&#34;cars_mpg&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">v</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; ── cars_mpg ─ &lt;butchered_workflow&gt; model for deployment </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; A rpart regression modeling workflow using 10 features</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">vetiver</span> <span class="kn">import</span> <span class="n">VetiverModel</span>
</span></span><span class="line"><span class="cl"><span class="n">v</span> <span class="o">=</span> <span class="n">VetiverModel</span><span class="p">(</span><span class="n">car_mod</span><span class="p">,</span> <span class="n">model_name</span> <span class="o">=</span> <span class="s2">&#34;cars_mpg&#34;</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">                 <span class="n">save_ptype</span> <span class="o">=</span> <span class="kc">True</span><span class="p">,</span> <span class="n">ptype_data</span> <span class="o">=</span> <span class="n">mtcars</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">v</span><span class="o">.</span><span class="n">description</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; &#34;Scikit-learn &lt;class &#39;sklearn.tree._classes.DecisionTreeRegressor&#39;&gt; model&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>See our documentation for how to use these deployable model objects, including how to:</p>
<ul>
<li><a href="https://vetiver.rstudio.com/get-started/version.html" target="_blank" rel="noopener">publish and version your model</a>
</li>
<li><a href="https://vetiver.rstudio.com/get-started/deploy.html" target="_blank" rel="noopener">deploy your model as a REST API</a>
</li>
</ul>
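<p>As a quick taste of versioning, a vetiver model can be written to a pins board with <code>vetiver_pin_write()</code> (a sketch; a temporary board is used here only for illustration):</p>
<pre tabindex="0"><code>library(pins)

board &lt;- board_temp(versioned = TRUE)
vetiver_pin_write(board, v)
</code></pre>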
<p>Be sure to also read more <a href="https://www.rstudio.com/blog/announce-vetiver/" target="_blank" rel="noopener">on the RStudio blog</a>
.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>We&rsquo;d like to extend our thanks to all of the contributors who helped make these initial releases of vetiver for R and Python possible!</p>
<ul>
<li>
<p>R package: <a href="https://github.com/cderv" target="_blank" rel="noopener">@cderv</a>
, <a href="https://github.com/ggpinto" target="_blank" rel="noopener">@ggpinto</a>
, <a href="https://github.com/isabelizimm" target="_blank" rel="noopener">@isabelizimm</a>
, <a href="https://github.com/juliasilge" target="_blank" rel="noopener">@juliasilge</a>
, and <a href="https://github.com/mfansler" target="_blank" rel="noopener">@mfansler</a>
</p>
</li>
<li>
<p>Python package: <a href="https://github.com/has2k1" target="_blank" rel="noopener">@has2k1</a>
, and <a href="https://github.com/isabelizimm" target="_blank" rel="noopener">@isabelizimm</a>
</p>
</li>
</ul>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/tidyverse/2022/announce-vetiver/thumbnail-wd.jpg" length="152876" type="image/jpeg" />
    </item>
    <item>
      <title>Integrating Dynamic R and Python Models in Tableau Using plumbertableau</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/</link>
      <pubDate>Mon, 20 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/</guid>
      <dc:creator>Isabella Velásquez</dc:creator><description><![CDATA[<p>RStudio believes that you can attain greater business intelligence with interoperable tools that <a href="https://www.rstudio.com/solutions/interoperability/" target = "_blank">take full advantage of open-source data science</a>. Your organization may rely on Tableau for reporting purposes, but how can you ensure that you&rsquo;re using the full power of your data science team&rsquo;s R and Python models in your dashboards?</p>
<p>With the <a href="https://rstudio.github.io/plumbertableau/index.html" target = "_blank">plumbertableau</a> package (and its corresponding Python package, <a href="https://rstudio.github.io/fastapitableau/" target = "_blank">fastapitableau</a>), you can use functions or models created in R or Python from Tableau through an API. These packages allow you to showcase cutting-edge data science results in your organization’s preferred dashboard tool.</p>
<p>While this post mentions R, anything possible with R and plumbertableau is also doable with Python and fastapitableau.</p>
<h2 id="foster-data-analytics-capabilities-with-plumbertableau">Foster Data Analytics Capabilities With plumbertableau
</h2>
<p>With plumbertableau, you can fully develop your model with code-first data science. The package uses <a href="https://www.rplumber.io/" target = "_blank">plumber</a> to create an API directly from your code. Since your model is fully developed in your data science editor, it can use all the packages and complex calculations it needs.</p>
<p>You can extract the best data science results using R&rsquo;s capabilities as your model will not be constrained by Tableau&rsquo;s environment.</p>
<h2 id="improve-data-quality-with-apis-for-continuous-use">Improve Data Quality With APIs for Continuous Use
</h2>
<p>Seamless integration between analytic platforms prevents issues like using outdated, inaccurate, or incomplete data. Rather than depending on a manual process, data scientists can rely on their data pipelines to ensure data integrity.</p>
<p>With plumbertableau, your tools are integrated through an API. The Tableau dashboard displays results without any intermediate manipulation like copy-and-pasting code or uploading datasets. You can work in confidence knowing your results are synchronized, accurate, and reproducible.</p>
<h2 id="increase-deliverability-by-streamlining-data-pipelines">Increase Deliverability by Streamlining Data Pipelines
</h2>
<p>If your model has many dependencies or versioning requirements, it can be difficult to handle them outside of the development environment. Debugging is even more time-consuming when you need to work in separate environments to figure out what went wrong.</p>
<p>With <a href="https://connect.rstudioservices.com/connect/" target = "_blank">RStudio Connect</a>, you can publish plumbertableau extensions directly from the RStudio IDE. RStudio Connect automatically manages your API&rsquo;s dependent packages and files to recreate an environment that closely mimics your local development environment. And since all your code remains in R, you can use your usual data science techniques to efficiently resolve issues.</p>
<p>Read more on the <a href="https://www.rplumber.io/articles/hosting.html/" target = "_blank">Hosting</a> page of the plumber documentation.</p>
<h2 id="how-to-use-plumbertableau-xgboost-with-dynamic-model-output-example">How to Use plumbertableau: XGBoost with Dynamic Model Output Example
</h2>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif4.gif"
      alt="Showing predictive values in Tableau dashboard" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>In this walkthrough, we will be using data from the <a href="https://data.seattle.gov/" target = "_blank">Seattle Open Data Portal</a> to predict the paid parking occupancy percentage in various areas around the city. We will run an XGBoost model in RStudio, create a plumbertableau extension to embed into Tableau, and visualize and interact with the model in a Tableau dashboard. The code is here for reproducibility purposes; however, it will <strong>require</strong> an RStudio Connect account to complete.</p>
<p>The plumbertableau and fastapitableau packages have wonderful documentation. Be sure to read it for more information on:</p>
<ul>
<li>The anatomy of the extensions</li>
<li>Details on setting up RStudio Connect and Tableau</li>
<li>Other examples to try out in your Tableau dashboards</li>
</ul>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/img2.png"
      alt="Displaying dynamic model output in Tableau steps" 
      loading="lazy"
    >
  </figure></div>
</p>
<h3 id="1-build-the-model">1. Build the model
</h3>
<p>First, we need to build a model. This walkthrough won’t be covering how to create, tune, or validate a model. If you&rsquo;d like to learn more on models and machine learning, check out the <a href="https://www.tidymodels.org/" target = "_blank">tidymodels</a> website and Julia Silge&rsquo;s fantastic <a href="https://juliasilge.com/category/tidymodels/" target = "_blank">screencasts and tutorials</a>.</p>
<p><strong>Load Libraries</strong></p>
<pre tabindex="0"><code>library(tidyverse)
library(RSocrata)
library(lubridate)
library(usemodels)
library(tidymodels)
</code></pre><p><strong>Download and Clean Data</strong></p>
<p>The Seattle Open Data Portal uses <a href="https://www.tylertech.com/products/socrata" target = "_blank">Socrata</a>, a data management tool, for its APIs. We can use the <a href="https://cran.r-project.org/web/packages/RSocrata/index.html" target = "_blank">RSocrata</a> package to download the data.</p>
<pre tabindex="0"><code>parking_data &lt;-
  RSocrata::read.socrata(
    &#34;https://data.seattle.gov/resource/rke9-rsvs.json?$where=sourceelementkey &lt;= 1020&#34;
  )

parking_id &lt;-
  parking_data %&gt;%
  group_by(blockfacename, location.coordinates) %&gt;%
  mutate(id = cur_group_id()) %&gt;%
  ungroup()

parking_clean &lt;-
  parking_id %&gt;%
  mutate(across(c(parkingspacecount, paidoccupancy), as.numeric),
         occupancy_pct = paidoccupancy / parkingspacecount) %&gt;%
  group_by(
    id = id,
    hour = as.numeric(hour(occupancydatetime)),
    month = as.numeric(month(occupancydatetime)),
    dow = as.numeric(wday(occupancydatetime)),
    date = date(occupancydatetime)
  ) %&gt;%
  summarize(occupancy_pct = mean(occupancy_pct, na.rm = TRUE)) %&gt;%
  drop_na() %&gt;%
  ungroup()
</code></pre><p>We will also need information on the city blocks, so let&rsquo;s create that dataset.</p>
<pre tabindex="0"><code>parking_information &lt;-
  parking_id %&gt;%
  mutate(loc = location.coordinates) %&gt;%
  select(id, blockfacename, loc) %&gt;%
  distinct(id, blockfacename, loc) %&gt;%
  unnest_wider(loc, c(&#39;loc1&#39;, &#39;loc2&#39;))
</code></pre><p><strong>Create Training Data</strong></p>
<p>Now, let&rsquo;s create the training set from our original data.</p>
<pre tabindex="0"><code>parking_split &lt;-
  parking_clean %&gt;%
  arrange(date) %&gt;%
  select(-date) %&gt;%
  initial_time_split(prop = 0.75)
</code></pre><p><strong>Train and Tune the Model</strong></p>
<p>Here, we train and tune the model. We select the model with the best RMSE to use in our dashboard.</p>
<pre tabindex="0"><code>xgboost_recipe &lt;-
  recipe(formula = occupancy_pct ~ ., data = parking_clean) %&gt;%
  step_zv(all_predictors())  %&gt;%
  prep()

xgboost_folds &lt;-
  recipes::bake(xgboost_recipe,
                new_data = training(parking_split)) %&gt;%
  rsample::vfold_cv(v = 5)

xgboost_model &lt;-
  boost_tree(
    mode = &#34;regression&#34;,
    trees = 1000,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune()
  ) %&gt;%
  set_engine(&#34;xgboost&#34;, objective = &#34;reg:squarederror&#34;)

xgboost_params &lt;-
  parameters(min_n(),
             tree_depth(),
             learn_rate(),
             loss_reduction())

xgboost_grid &lt;-
  grid_max_entropy(xgboost_params,
                   size = 5)

xgboost_wf &lt;-
  workflows::workflow() %&gt;%
  add_model(xgboost_model) %&gt;%
  add_formula(occupancy_pct ~ .)

xgboost_tuned &lt;- tune::tune_grid(
  object = xgboost_wf,
  resamples = xgboost_folds,
  grid = xgboost_grid,
  metrics = yardstick::metric_set(rmse, rsq, mae),
  control = tune::control_grid(verbose = TRUE)
)

xgboost_best &lt;-
  xgboost_tuned %&gt;%
  tune::select_best(&#34;rmse&#34;)

xgboost_final &lt;-
  xgboost_model %&gt;%
  finalize_model(xgboost_best)
</code></pre><p>We bundle the recipe and fitted model in an object so we can use it later.</p>
<pre tabindex="0"><code>train_processed &lt;-
  bake(xgboost_recipe, new_data = training(parking_split))

prediction_fit &lt;-
  xgboost_final %&gt;%
  fit(formula = occupancy_pct ~ .,
      data    = train_processed)

model_details &lt;- list(model = xgboost_final,
                      recipe = xgboost_recipe,
                      prediction_fit = prediction_fit)
</code></pre><p><strong>Save Objects for the plumbertableau Extension</strong></p>
<p>We&rsquo;ll want to save our data and our model so that we can use them in the extension. If you have an RStudio Connect account, the <a href="https://pins.rstudio.com/" target = "_blank">pins</a> package is a great choice for saving these objects.</p>
<pre tabindex="0"><code>rsc &lt;-
  pins::board_rsconnect(server = Sys.getenv(&#34;CONNECT_SERVER&#34;),
                        key = Sys.getenv(&#34;CONNECT_API_KEY&#34;))

pins::pin_write(
  board = rsc,
  x = model_details,
  name = &#34;seattle_parking_model&#34;,
  description = &#34;Seattle Occupancy Percentage XGBoost Model&#34;,
  type = &#34;rds&#34;
)

pins::pin_write(
  board = rsc,
  x = parking_information,
  name = &#34;seattle_parking_info&#34;,
  description = &#34;Seattle Parking Information&#34;,
  type = &#34;rds&#34;
)
</code></pre><h3 id="2-create-a-plumbertableau-extension">2. Create a plumbertableau Extension
</h3>
<p>Next, we will use our model to create a plumbertableau extension. As noted previously, a plumbertableau extension is a plumber API with some special annotations.</p>
<p>Create an R script called <code>plumber.R</code>. At the top, we list the libraries we&rsquo;ll need.</p>
<pre tabindex="0"><code>library(plumber)
library(pins)
library(tibble)
library(xgboost)
library(lubridate)
library(dplyr)
library(tidyr)
library(tidymodels)
library(plumbertableau)
</code></pre><p>We want to bring in our model details and our data. If you pinned your data, you&rsquo;ll change the name of the pin below.</p>
<pre tabindex="0"><code>rsc &lt;-
  pins::board_rsconnect(
    server = Sys.getenv(&#34;CONNECT_SERVER&#34;),
    key = Sys.getenv(&#34;CONNECT_API_KEY&#34;)
  )

xgboost_model &lt;-
  pins::pin_read(&#34;isabella.velasquez/seattle_parking_model&#34;, board = rsc)
</code></pre><p>Now, we add our <a href="https://www.rplumber.io/articles/annotations.htm" target = "_blank">annotations</a>. Note that we use plumbertableau annotations, which are slightly different from the ones in plumber.</p>
<ul>
<li>We use <code>tableauArg</code> rather than <code>params</code>.</li>
<li>We specify what is returned to Tableau with <code>tableauReturn</code>.</li>
<li>We must use <code>post</code> as the request method for the endpoint.</li>
</ul>
<pre tabindex="0"><code>#* @apiTitle Seattle Parking Occupancy Percentage Prediction API
#* @apiDescription Return the predicted occupancy percentage at various Seattle locations

#* @tableauArg block_id:integer numeric block ID
#* @tableauArg ndays:integer number of days in the future for the prediction

#* @tableauReturn [numeric] Predicted occupancy rate
#* @post /pred
</code></pre><p>Now, we create our function with the arguments <code>block_id</code> and <code>ndays</code>. These will have corresponding arguments in Tableau. The function will output our predicted occupancy percentage, which will be what we visualize and interact with in the dashboard.</p>
<p>This function takes the city block and number of days in the future to give us the predicted occupancy percentage at that time.</p>
<pre tabindex="0"><code>function(block_id, ndays) {
  times &lt;- Sys.time() + lubridate::ddays(ndays)
  
  current_time &lt;-
    tibble::tibble(times = times,
                   id = block_id)
  
  current_prediction  &lt;-
    current_time %&gt;%
    transmute(
      id = id,
      hour = hour(times),
      month = month(times),
      dow = wday(times),
      occupancy_pct = NA
    ) %&gt;%
    bake(xgboost_model$recipe, .)
  
  parking_prediction &lt;-
    xgboost_model$prediction_fit %&gt;%
    predict(new_data = current_prediction)
  
  predictions &lt;-
    parking_prediction$.pred
  
  predictions[[1]]
  
}
</code></pre><p>Finally, we finish off our script with the extension footer needed for plumbertableau extensions.</p>
<pre tabindex="0"><code>#* @plumber
tableau_extension
</code></pre><p>Here is the full <code>plumber.R</code> script:</p>
<pre tabindex="0"><code>library(plumber)
library(pins)
library(tibble)
library(xgboost)
library(lubridate)
library(dplyr)
library(tidyr)
library(tidymodels)
library(plumbertableau)

rsc &lt;-
  pins::board_rsconnect(server = Sys.getenv(&#34;CONNECT_SERVER&#34;),
                        key = Sys.getenv(&#34;CONNECT_API_KEY&#34;))

xgboost_model &lt;-
  pins::pin_read(&#34;isabella.velasquez/seattle_parking_model&#34;, board = rsc)

#* @apiTitle Seattle Parking Occupancy Percentage Prediction API
#* @apiDescription Return the predicted occupancy percentage at various Seattle locations

#* @tableauArg block_id:integer numeric block ID
#* @tableauArg ndays:integer number of days in the future for the prediction

#* @tableauReturn [numeric] Predicted occupancy rate
#* @post /pred

function(block_id, ndays) {
  times &lt;- Sys.time() + lubridate::ddays(ndays)
  
  current_time &lt;-
    tibble::tibble(times = times,
                   id = block_id)
  
  current_prediction  &lt;-
    current_time %&gt;%
    transmute(
      id = id,
      hour = hour(times),
      month = month(times),
      dow = wday(times),
      occupancy_pct = NA
    ) %&gt;%
    bake(xgboost_model$recipe, .)
  
  parking_prediction &lt;-
    xgboost_model$prediction_fit %&gt;%
    predict(new_data = current_prediction)
  
  predictions &lt;-
    parking_prediction$.pred
  
  predictions[[1]]
  
}

#* @plumber
tableau_extension
</code></pre><h3 id="3-host-your-api">3. Host your API
</h3>
<p>We have to host our API so that it can be accessed in Tableau. In our case, we publish it to RStudio Connect.</p>
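<p>If you prefer to publish from the console rather than the IDE, the <code>rsconnect</code> package can deploy the directory containing <code>plumber.R</code>. A sketch, assuming you have already registered your Connect server and API key; the server name and directory below are placeholders:</p>

```r
library(rsconnect)

# One-time setup: register the Connect server and an API user
# (URL, server name, and account are placeholders for your own)
# addServer(url = "https://connect.example.com/__api__", name = "my-connect")
# connectApiUser(account = "me", server = "my-connect",
#                apiKey = Sys.getenv("CONNECT_API_KEY"))

# Deploy the directory containing plumber.R as an API
deployAPI(api = "parking-prediction/", server = "my-connect")
```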
<p>Once hosted, plumbertableau automatically generates a documentation page. Notice that the <code>SCRIPT_*</code> value is not R code. This is a Tableau command that we will use to connect our extension and Tableau.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/img3.png"
      alt="Automatically generated plumbertableau documentation page" 
      loading="lazy"
    >
  </figure></div>
</p>
<caption><center><i>Automatically generated plumbertableau documentation page</i></center></caption>
<h3 id="4-create-a-calculated-field-in-tableau">4. Create a calculated field in Tableau
</h3>
<p>There are a few steps you need to take so that Tableau can use your plumbertableau extension. If you are using RStudio Connect, read the documentation on how to <a href="https://docs.rstudio.com/rsc/integration/tableau/" target = "_blank">configure RStudio Connect as an analytic extension</a>.</p>
<p>Create a new workbook and upload the <code>station_information</code> file. Under Analysis, turn off Aggregate Measures. Drop <code>Lat</code> into Rows and <code>Lon</code> into Columns, which will create a map. Save the workbook.</p>
<p>Make sure your workbook knows to connect to RStudio Connect by going to Analysis &gt; Manage Analytic Extensions Connection &gt; Choose a Connection. Then, select your Connect account.</p>
<p>Drag <code>Id</code> into the &ldquo;Detail&rdquo; mark. Create a parameter called &ldquo;Days in the Future&rdquo;. This parameter tells our model how many days ahead to predict the parking occupancy percentage. Show the parameter on the worksheet.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif1.gif"
      alt="Creating a parameter in Tableau" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>Create a calculated field using the <code>SCRIPT</code> from the plumbertableau documentation page:</p>
<pre tabindex="0"><code>SCRIPT_REAL(&#34;/plumbertableau-xgboost-example/pred&#34;, block_id, ndays) 
</code></pre><p>For each <code>tableauArg</code> we have listed in the extension, we will replace it with its corresponding Tableau value. If you&rsquo;re following along, this means <code>block_id</code> will become <code>ATTR([Id])</code> and <code>ndays</code> will become <code>ATTR([Days in the Future])</code>.</p>
<pre tabindex="0"><code>SCRIPT_REAL(&#34;/plumbertableau-xgboost-example/pred&#34;, ATTR([Id]), ATTR([Days in the Future]))
</code></pre><p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif2.gif"
      alt="Creating a calculated field from a plumbertableau extension" 
      loading="lazy"
    >
  </figure></div>
</p>
<h3 id="5-run-model-and-visualize-results-in-tableau">5. Run model and visualize results in Tableau
</h3>
<p>That&rsquo;s it! Once you embed your extension in Tableau’s calculated fields, you can use your model&rsquo;s results in your Tableau dashboard like any other measure or dimension.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif3.gif"
      alt="Showing predictive results in Tableau" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>We can change the <code>ndays</code> argument to get new predictions from our XGBoost model and display them on our Tableau dashboard.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif5.gif"
      alt="Showing predictive results in Tableau dashboard by changing the number of days in the future" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>You can style your Tableau dashboard and then provide your users something that is not only aesthetically pleasing, but is dynamically calculating predictions based on a model you have created in R.</p>
<h2 id="conclusion">Conclusion
</h2>
<p>With plumbertableau, you can showcase sophisticated model results that are easy to integrate, debug, and reproduce. Your work will be at the forefront of data science while being visualized in Tableau&rsquo;s easy, point-and-click interface.</p>
<h2 id="learn-more">Learn More
</h2>
<p>Watch James Blair showcase plumbertableau in Leveraging R &amp; Python in Tableau with RStudio Connect:</p>
<script src="https://fast.wistia.com/embed/medias/hl37qvfnml.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_hl37qvfnml videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img src="https://fast.wistia.com/embed/medias/hl37qvfnml/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>
<p>More on how RStudio supports interoperability across tools can be found on our <a href="https://www.rstudio.com/solutions/bi-and-data-science/" target = "_blank">BI and Data Science Overview Page</a>.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/thumbnail.png" length="97818" type="image/png" />
    </item>
    <item>
      <title>Sharing Data With the pins Package</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/sharing-data-with-the-pins-package/</link>
      <pubDate>Wed, 15 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/sharing-data-with-the-pins-package/</guid>
      <dc:creator>Katie Masiello</dc:creator>
      <dc:creator>Isabella Velásquez</dc:creator><description><![CDATA[<caption>
Photo by <a href="https://unsplash.com/@universaleye?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Universal Eye</a> on <a href="https://unsplash.com/@ivelasq/likes?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</caption>
<p>Teams often need access to key data to do their work, but have you ever opened your coworker&rsquo;s script to see:</p>
<pre tabindex="0"><code>dat &lt;-  
 read_csv(&#34;C://Users/someone_else/data/dataset.csv&#34;)

more_dat &lt;- 
 read_csv(&#34;S://Path_to_mapped_drive_that_you_dont_have/dataset.csv&#34;)
</code></pre><p>Yikes! How will you get these files? Let&rsquo;s hope you can reach your coworker before they’ve logged off for the day.</p>
<p>How can your code be reproducible if you have to manually change the file paths? <em>Shudder</em>.</p>
<p>What if you need to make edits to the data, will you have to keep copying CSVs and emailing files forever? <em>Double shudder.</em></p>
<p>What if your coworker accidentally forwards your email to someone who is not supposed to have access? <em>Oh no.</em></p>
<p>We can struggle to share data assets easily and safely, relying on emailed files to keep our analyses up to date. This makes it difficult to keep current or know what version of the data we’re using. If you&rsquo;ve ever experienced any of the scenarios above, consider <a href="https://www.rstudio.com/blog/pins-1-0-0/" target = "_blank">pins</a> as a solution that can help you share your data assets.</p>
<h2 id="what-is-a-pin-anyway">What <em>is</em> a pin, anyway?
</h2>
<p>Pins, from the <a href="https://pins.rstudio.com/" target = "_blank">R package of the same name</a>, are a versatile way to publish R objects on a virtual corkboard so you can share them across projects and people.</p>
<p>Good pins are data or assets that are a few hundred megabytes or smaller. You can pin just about any object: data, models, JSON files, feather files from the Arrow package, and more. One of the most frequent use cases is pinning small data sets — often ephemeral data or reference tables that don&rsquo;t quite merit being in a database, but seemingly don&rsquo;t have a good home elsewhere (until now).</p>
<p>Pins get published to a board, which can be an <a href="https://www.rstudio.com/products/connect/" target = "_blank">RStudio Connect</a> server, an AWS S3 bucket or Azure Blob Storage, a shared drive like Dropbox or Sharepoint, or a <a href="https://pins.rstudio.com/reference/index.html#section-boards" target = "_blank">variety of other options</a>. Try it out for yourself — read in this data set we’ve pinned for you on RStudio Connect!</p>
<pre tabindex="0"><code># Install the latest pins from CRAN
install.packages(&#34;pins&#34;)

library(pins)

# Identify the board
board &lt;-
  board_url(c(&#34;penguins&#34; = &#34;https://colorado.rstudio.com/rsc/example_pin/&#34;))

# Read the shared data
board %&gt;%
  pin_read(&#34;penguins&#34;)
</code></pre><p>In short, if you’ve ever wondered where to put an R object that you or your colleague will need to use again, you might just want to pin it.</p>
<h2 id="pins-for-sharing-across-projects-and-teams">Pins for Sharing Across Projects and Teams
</h2>
<p>One of the greatest strengths of pins is how your pin becomes accessible directly from your R scripts <em>and</em> the R scripts of anyone else to whom you’ve given access. Different projects can include code that reads the same pin without creating more copies of the data:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/sharing-data-with-the-pins-package/images/image1.png"
      alt="Three projects using the same pin to download data" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>It&rsquo;s easier (and safer) to share a pin across multiple projects or people than to email files around. Pins respect the access controls of the board. Say you’ve pinned to RStudio Connect: you can control who gets to use the pin, just like any other piece of content.</p>
<h2 id="pins-for-updating-and-versioning">Pins for Updating and Versioning
</h2>
<p>You may be wondering why you’d use pins if you already share a drive with your teammates. But what happens if you need to replace the dataset with a new one? Do you email everybody to let them know? Is it dataFINALv2.csv? Or dataFINALfinal.csv?</p>
<p>The pins package retrieves the newest version of the pin by default. That means pin users never have to worry about getting a stale version of the pin. If you need to update your pin regularly, a scheduled R Markdown on RStudio Connect can handle this task for you, so your pin stays fresh.</p>
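<p>A scheduled refresh can be as small as a single chunk. The sketch below uses a temporary board as a stand-in for a Connect board, and the &ldquo;fresh&rdquo; data is a placeholder for whatever your pipeline actually produces:</p>

```r
library(pins)

# Stand-in for a Connect board; on RStudio Connect this chunk
# would run inside a scheduled R Markdown document
board <- board_temp()

# Placeholder for a real data pull (database query, API call, etc.)
fresh_data <- data.frame(ts = Sys.time(), value = runif(1))

# Writing to the same name keeps the pin fresh for all readers
board %>% pin_write(fresh_data, name = "latest-data", type = "rds")
```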
<p>But you’re not locked into losing old versions of a pin. You can version pins so that writing to an existing pin adds a new copy rather than replacing the existing data.</p>
<p>Here&rsquo;s what versioning looks like using a temporary board:</p>
<pre tabindex="0"><code>library(pins)


board2 &lt;- board_temp(versioned = TRUE)

board2 %&gt;% pin_write(1:5, name = &#34;x&#34;, type = &#34;rds&#34;)
#&gt; Creating new version &#39;20210304T050607Z-ab444&#39;
#&gt; Writing to pin &#39;x&#39;

board2 %&gt;% pin_write(2:6, name = &#34;x&#34;, type = &#34;rds&#34;)
#&gt; Creating new version &#39;20210304T050607Z-a077a&#39;
#&gt; Writing to pin &#39;x&#39;

board2 %&gt;% pin_write(3:7, name = &#34;x&#34;, type = &#34;rds&#34;)
#&gt; Creating new version &#39;20210304T050607Z-0a284&#39;
#&gt; Writing to pin &#39;x&#39;

# see all versions
board2 %&gt;% pin_versions(&#34;x&#34;)
#&gt; # A tibble: 3 × 3
#&gt;   version                created             hash 
#&gt;   &lt;chr&gt;                  &lt;dttm&gt;              &lt;chr&gt;
#&gt; 1 20210304T050607Z-0a284 2021-03-04 05:06:00 0a284
#&gt; 2 20210304T050607Z-a077a 2021-03-04 05:06:00 a077a
#&gt; 3 20210304T050607Z-ab444 2021-03-04 05:06:00 ab444
</code></pre><h2 id="learn-more">Learn More
</h2>
<p>With pins, you and your teammates can know where your important data assets are, how to access them, and whether they are the correct version. You can work with confidence knowing you’re using the right asset, your work is reproducible, and you’re following good practices for data management.</p>
<p>There’s more to explore with pins. We’re excited to share how you can adopt them into your workflow.</p>
<p>Learn more about how and when to use pins:</p>
<ul>
<li><a href="https://pins.rstudio.com/" target = "_blank">The pins package documentation</a></li>
<li><a href="https://docs.rstudio.com/how-to-guides/users/pro-tips/pins/" target = "_blank">RStudio Pro Tips: Creating Efficient Workflows with <code>pins</code> and RStudio Connect</a></li>
</ul>
<p>See pins in action:</p>
<ul>
<li>Pins can pull intensive ETL processes out of your apps, improve performance, and save you the hassle of redeploying whenever the underlying data changes.
<ul>
<li>Watch: <a href="https://www.rstudio.com/resources/rstudioconf-2020/deploying-end-to-end-data-science-with-shiny-plumber-and-pins/" target = "_blank">Deploying End-To-End Data Science with Shiny, Plumber, and Pins</a></li>
</ul>
</li>
<li>Pins can play a key role in MLOps, such as publishing versioned models and monitoring model metrics.
<ul>
<li>Read: <a href="https://www.rstudio.com/blog/model-monitoring-with-r-markdown/" target = "_blank">Model Monitoring with R Markdown, pins, and RStudio Connect</a></li>
</ul>
</li>
</ul>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/rstudio/sharing-data-with-the-pins-package/thumbnail.png" length="392754" type="image/png" />
    </item>
    <item>
      <title>pins 1.0.0</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/pins-1-0-0/</link>
      <pubDate>Mon, 04 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/pins-1-0-0/</guid>
      <dc:creator>Hadley Wickham</dc:creator><description><![CDATA[<sup>
Photo by <a href="https://unsplash.com/@kelsoknight" target="_blank" rel="noopener noreferrer">Kelsey Knight</a> on <a href="https://unsplash.com/">Unsplash</a>
</sup>

<p>I’m delighted to announce that <a href="https://pins.rstudio.com">pins</a> 1.0.0 is now available on CRAN.
The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues.
You can pin objects to a variety of pin boards, including folders (to share on a networked drive or with services like Dropbox), RStudio Connect, Amazon S3, and Azure blob storage.
Pins can be versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes. Our users have found numerous ways to use this ability to fluently share and version data and other objects, such as <a href="https://pins.rstudio.com/dev/articles/rsc.html">automating ETL for a Shiny app</a>.</p>
<p>You can install pins with:</p>
<pre class="r"><code>install.packages(&quot;pins&quot;)</code></pre>
<p>pins 1.0.0 includes a major overhaul of the API.
The legacy API (<code>pin()</code>, <code>pin_get()</code>, <code>board_register()</code>, and friends) will continue to work, but new features will only be implemented with the new API, so we encourage you to switch to the modern API as quickly as possible.
If you’re an existing pins user, you can learn more about the changes and how to update your code in <a href="https://pins.rstudio.com/articles/pins-update.html"><code>vignette("pins-update")</code></a>.</p>
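<p>For a sense of the change, the modern API makes the board an explicit argument to every call. A minimal sketch using a temporary board, with the legacy equivalents shown as comments for comparison:</p>

```r
library(pins)

board <- board_temp()

# Modern API: the board comes first, then the object and its name
board %>% pin_write(head(mtcars), "mtcars-head", type = "rds")
board %>% pin_read("mtcars-head")

# Legacy equivalent (still works, but no longer gaining features):
# board_register_local()
# pin(head(mtcars), "mtcars-head")
# pin_get("mtcars-head")
```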
<div id="basics" class="level2">

<h2>Basics</h2>
<p>To use the pins package, you must first create a pin board.
A good place to start is <code>board_folder()</code>, which stores pins in a directory you specify.
Here I’ll use a special version of <code>board_folder()</code> called <code>board_temp()</code> which creates a temporary board that’s automatically deleted when your R session ends.
This is great for examples, but obviously you shouldn’t use it for real work!</p>
<pre class="r"><code>library(pins)

board &lt;- board_temp()
board
#&gt; Pin board &lt;pins_board_folder&gt;
#&gt; Path: &#39;/tmp/RtmpLu2Bkx/pins-114af466104ab&#39;
#&gt; Cache size: 0</code></pre>
<p>You can “pin” (save) data to a board with <code>pin_write()</code>.
It takes three arguments: the board to pin to, an object, and a name:</p>
<pre class="r"><code>board %&gt;% pin_write(head(mtcars), &quot;mtcars&quot;)
#&gt; Guessing `type = &#39;rds&#39;`
#&gt; Creating new version &#39;20211004T155644Z-f8797&#39;
#&gt; Writing to pin &#39;mtcars&#39;</code></pre>
<p>As you can see, the data is saved as an <code>.rds</code> file by default, but depending on what you’re saving and who else needs to read it, you might use the <code>type</code> argument to instead save it as a <code>csv</code>, <code>json</code>, <code>arrow</code>, or <code>qs</code> file.</p>
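<p>For example, writing a pin as CSV makes it readable from Python or other tools, at the cost of losing R-specific attributes. A quick sketch using a temporary board:</p>

```r
library(pins)

board <- board_temp()

# Explicitly choose CSV instead of the default rds
board %>% pin_write(head(mtcars), "mtcars-csv", type = "csv")
```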
<p>You can later retrieve the pinned data with <code>pin_read()</code>:</p>
<pre class="r"><code>board %&gt;% pin_read(&quot;mtcars&quot;)
#&gt;                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#&gt; Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#&gt; Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#&gt; Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#&gt; Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#&gt; Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#&gt; Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1</code></pre>
</div>
<div id="sharing-pins" class="level2">
<h2>Sharing pins</h2>
<p>A board on your computer is a good place to start, but the real power of pins comes when you use a board that’s shared with multiple people.
To get started, you can use <a href="https://pins.rstudio.com/reference/board_folder.html"><code>board_folder()</code></a> with a directory on a shared drive or in Dropbox, or, if you use <a href="https://www.rstudio.com/products/connect/">RStudio Connect</a>, <a href="https://pins.rstudio.com/reference/board_rsconnect.html"><code>board_rsconnect()</code></a>:</p>
<pre class="r"><code>board &lt;- board_rsconnect()
#&gt; Connecting to RSC 1.9.0.1 at &lt;https://connect.rstudioservices.com&gt;
board %&gt;% pin_write(tidy_sales_data, &quot;sales-summary&quot;, type = &quot;rds&quot;)
#&gt; Writing to pin &#39;hadley/sales-summary&#39;</code></pre>
<p>Then, someone else (or an automated Rmd report) can read and use your pin:</p>
<pre class="r"><code>board &lt;- board_rsconnect()
board %&gt;% pin_read(&quot;hadley/sales-summary&quot;)</code></pre>
<p>You can easily control who gets to access the data using the RStudio Connect permissions pane.</p>
</div>
<div id="other-boards" class="level2">
<h2>Other boards</h2>
<p>As well as <code>board_folder()</code> and <code>board_rsconnect()</code>, pins 1.0.0 provides:</p>
<ul>
<li><p><a href="https://pins.rstudio.com/reference/board_azure.html"><code>board_azure()</code></a>, which uses Azure’s blob storage.</p></li>
<li><p><a href="https://pins.rstudio.com/reference/board_s3.html"><code>board_s3()</code></a>, which uses Amazon’s S3 storage platform.</p></li>
<li><p><a href="https://pins.rstudio.com/reference/board_ms365.html"><code>board_ms365()</code></a>, which uses Microsoft’s OneDrive or SharePoint.
(Thanks to contribution from <a href="https://github.com/hongooi73">Hong Ooi</a>)</p></li>
</ul>
<p>Future versions of the pins package are likely to include other backends as we learn from our users what would be most useful.</p>
</div>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/rstudio/pins-1-0-0/thumbnail.jpg" length="179731" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr.sedona: A sparklyr extension for analyzing geospatial data</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/</link>
      <pubDate>Wed, 07 Jul 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://github.com/apache/incubator-sedona/tree/master/R/sparklyr.sedona" target="_blank" rel="noopener"><code>sparklyr.sedona</code></a>
 is now available
as the <code>sparklyr</code>-based R interface for <a href="https://sedona.apache.org/" target="_blank" rel="noopener">Apache Sedona</a>
.</p>
<p>To install <code>sparklyr.sedona</code> from GitHub using
the <a href="https://cran.r-project.org/web/packages/remotes/index.html" target="_blank" rel="noopener"><code>remotes</code></a>
 package
<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">remotes</span><span class="o">::</span><span class="nf">install_github</span><span class="p">(</span><span class="n">repo</span> <span class="o">=</span> <span class="s">&#34;apache/incubator-sedona&#34;</span><span class="p">,</span> <span class="n">subdir</span> <span class="o">=</span> <span class="s">&#34;R/sparklyr.sedona&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we will provide a quick introduction to <code>sparklyr.sedona</code>, outlining the motivation behind
this <code>sparklyr</code> extension, and presenting some example <code>sparklyr.sedona</code> use cases involving Spark spatial RDDs,
Spark dataframes, and visualizations.</p>
<h2 id="motivation-for-sparklyrsedona">Motivation for <code>sparklyr.sedona</code>
</h2>
<p>A suggestion from the
<a href="https://posit-open-source.netlify.app/blog/ai/2021-02-17-survey/">mlverse survey results</a>
 earlier
this year mentioned the need for up-to-date R interfaces for Spark-based GIS frameworks.
While looking into this suggestion, we learned about
<a href="https://sedona.apache.org/" target="_blank" rel="noopener">Apache Sedona</a>
, a geospatial data system powered by Spark
that is modern, efficient, and easy to use. We also realized that while our friends from the
Spark open-source community had developed a
<a href="https://github.com/harryprince/geospark" target="_blank" rel="noopener"><code>sparklyr</code> extension</a>
 for GeoSpark, the
predecessor of Apache Sedona, there was no similar extension making more recent Sedona
functionalities easily accessible from R yet.
We therefore decided to work on <code>sparklyr.sedona</code>, which aims to bridge the gap between
Sedona and R.</p>
<h2 id="the-lay-of-the-land">The lay of the land<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>
</h2>
<p>We hope you are ready for a quick tour through some of the RDD-based and
Spark-dataframe-based functionalities in <code>sparklyr.sedona</code>, and also, some bedazzling
visualizations derived from geospatial data in Spark.</p>
<p>In Apache Sedona,
<a href="https://sedona.apache.org/api/javadoc/core/org/apache/sedona/core/spatialRDD/SpatialRDD.html" target="_blank" rel="noopener">Spatial Resilient Distributed Datasets</a>
(SRDDs)
are basic building blocks of distributed spatial data encapsulating
&ldquo;vanilla&rdquo; <a href="https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaRDD.html" target="_blank" rel="noopener">RDD</a>
s of
geometrical objects and indexes. SRDDs support low-level operations such as Coordinate Reference System (CRS)
transformations, spatial partitioning, and spatial indexing. For example, with <code>sparklyr.sedona</code>, we can perform SRDD-based operations such as the following:</p>
<ul>
<li>Importing some external data source into an SRDD:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr.sedona</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sedona_git_repo</span> <span class="o">&lt;-</span> <span class="nf">normalizePath</span><span class="p">(</span><span class="s">&#34;~/incubator-sedona&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">data_dir</span> <span class="o">&lt;-</span> <span class="nf">file.path</span><span class="p">(</span><span class="n">sedona_git_repo</span><span class="p">,</span> <span class="s">&#34;core&#34;</span><span class="p">,</span> <span class="s">&#34;src&#34;</span><span class="p">,</span> <span class="s">&#34;test&#34;</span><span class="p">,</span> <span class="s">&#34;resources&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">pt_rdd</span> <span class="o">&lt;-</span> <span class="nf">sedona_read_dsv_to_typed_rdd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">location</span> <span class="o">=</span> <span class="nf">file.path</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s">&#34;arealm.csv&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;point&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Applying spatial partitioning to all data points:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sedona_apply_spatial_partitioner</span><span class="p">(</span><span class="n">pt_rdd</span><span class="p">,</span> <span class="n">partitioner</span> <span class="o">=</span> <span class="s">&#34;kdbtree&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Building spatial index on each partition:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sedona_build_index</span><span class="p">(</span><span class="n">pt_rdd</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;quadtree&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Joining one spatial data set with another using &ldquo;contain&rdquo; or &ldquo;overlap&rdquo; as the join predicate:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">polygon_rdd</span> <span class="o">&lt;-</span> <span class="nf">sedona_read_dsv_to_typed_rdd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">location</span> <span class="o">=</span> <span class="nf">file.path</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s">&#34;primaryroads-polygon.csv&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;polygon&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">pts_per_region_rdd</span> <span class="o">&lt;-</span> <span class="nf">sedona_spatial_join_count_by_key</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">pt_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">polygon_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">join_type</span> <span class="o">=</span> <span class="s">&#34;contain&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">partitioner</span> <span class="o">=</span> <span class="s">&#34;kdbtree&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>It is worth mentioning that <code>sedona_spatial_join()</code> will perform spatial partitioning
and indexing on its inputs using the specified <code>partitioner</code> and <code>index_type</code>
only if the inputs are not already partitioned or indexed as requested.</p>
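<p>When the same inputs feed multiple spatial queries, it can therefore pay off to partition
and index them once up front; a later join requesting the same <code>partitioner</code> can then
reuse that work instead of repeating it. Below is a minimal sketch (assuming the
<code>pt_rdd</code> and <code>polygon_rdd</code> objects from the examples above, with function
names as in the <code>sparklyr.sedona</code> API; treat it as illustrative rather than
definitive):</p>
<pre><code class="language-r"># partition the points once with a k-d B-tree partitioner,
# then index the partitioned result with a quadtree
sedona_apply_spatial_partitioner(pt_rdd, partitioner = "kdbtree")
sedona_build_index(pt_rdd, type = "quadtree")

# this join asks for the same partitioner, so the existing
# partitioning and index can be reused rather than rebuilt
pts_per_region_rdd &lt;- sedona_spatial_join_count_by_key(
  pt_rdd,
  polygon_rdd,
  join_type = "contain",
  partitioner = "kdbtree"
)
</code></pre>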
<p>From the examples above, one can see that SRDDs are great for spatial operations requiring
fine-grained control, e.g., for ensuring a spatial join query is executed as efficiently
as possible with the right types of spatial partitioning and indexing.</p>
<p>Finally, we can try visualizing the join result above, using a choropleth map:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sedona_render_choropleth_map</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">pts_per_region_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_x</span> <span class="o">=</span> <span class="m">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_y</span> <span class="o">=</span> <span class="m">600</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">output_location</span> <span class="o">=</span> <span class="nf">tempfile</span><span class="p">(</span><span class="s">&#34;choropleth-map-&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">boundary</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-126.790180</span><span class="p">,</span> <span class="m">-64.630926</span><span class="p">,</span> <span class="m">24.863836</span><span class="p">,</span> <span class="m">50.000</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">base_color</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">63</span><span class="p">,</span> <span class="m">127</span><span class="p">,</span> <span class="m">255</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>which gives us the following:</p>
<figure>
<img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/images/choropleth-map.png" alt="Example choropleth map output" />
<figcaption aria-hidden="true">Example choropleth map output</figcaption>
</figure>
<p>Wait, something seems amiss. To make the visualization above look nicer, we can
overlay it with the contour of each polygonal region:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">contours</span> <span class="o">&lt;-</span> <span class="nf">sedona_render_scatter_plot</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">polygon_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_x</span> <span class="o">=</span> <span class="m">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_y</span> <span class="o">=</span> <span class="m">600</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">output_location</span> <span class="o">=</span> <span class="nf">tempfile</span><span class="p">(</span><span class="s">&#34;scatter-plot-&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">boundary</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-126.790180</span><span class="p">,</span> <span class="m">-64.630926</span><span class="p">,</span> <span class="m">24.863836</span><span class="p">,</span> <span class="m">50.000</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">base_color</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">255</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">browse</span> <span class="o">=</span> <span class="kc">FALSE</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sedona_render_choropleth_map</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">pts_per_region_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_x</span> <span class="o">=</span> <span class="m">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_y</span> <span class="o">=</span> <span class="m">600</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">output_location</span> <span class="o">=</span> <span class="nf">tempfile</span><span class="p">(</span><span class="s">&#34;choropleth-map-&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">boundary</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-126.790180</span><span class="p">,</span> <span class="m">-64.630926</span><span class="p">,</span> <span class="m">24.863836</span><span class="p">,</span> <span class="m">50.000</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">base_color</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">63</span><span class="p">,</span> <span class="m">127</span><span class="p">,</span> <span class="m">255</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">overlay</span> <span class="o">=</span> <span class="n">contours</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>which gives us the following:</p>
<figure>
<img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/images/choropleth-map-with-overlay.png" alt="Choropleth map with overlay" />
<figcaption aria-hidden="true">Choropleth map with overlay</figcaption>
</figure>
<p>With some low-level spatial operations taken care of using the SRDD API and
the right spatial partitioning and indexing data structures, we can then
import the results from SRDDs to Spark dataframes. When working with spatial
objects within Spark dataframes, we can write high-level, declarative queries
on these objects using <code>dplyr</code> verbs in conjunction with Sedona
<a href="https://sedona.apache.org/api/sql/Function/" target="_blank" rel="noopener">spatial UDFs</a>
. For example
<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
, the following query returns, for each of the <code>8</code> polygons nearest to the
query point, whether that polygon contains the point, together with the
polygon&rsquo;s convex hull.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">&lt;-</span> <span class="n">DBI</span><span class="o">::</span><span class="nf">dbGetQuery</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span> <span class="s">&#34;SELECT ST_GeomFromText(\&#34;POINT(-66.3 18)\&#34;) AS `pt`&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">pt</span> <span class="o">&lt;-</span> <span class="n">tbl</span><span class="o">$</span><span class="n">pt[[1]]</span>
</span></span><span class="line"><span class="cl"><span class="n">knn_rdd</span> <span class="o">&lt;-</span> <span class="nf">sedona_knn_query</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">polygon_rdd</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">pt</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="m">8</span><span class="p">,</span> <span class="n">index_type</span> <span class="o">=</span> <span class="s">&#34;rtree&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">knn_sdf</span> <span class="o">&lt;-</span> <span class="n">knn_rdd</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_register</span><span class="p">()</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">contains_pt</span> <span class="o">=</span> <span class="nf">ST_contains</span><span class="p">(</span><span class="n">geometry</span><span class="p">,</span> <span class="nf">ST_Point</span><span class="p">(</span><span class="m">-66.3</span><span class="p">,</span> <span class="m">18</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="n">convex_hull</span> <span class="o">=</span> <span class="nf">ST_ConvexHull</span><span class="p">(</span><span class="n">geometry</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">knn_sdf</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 3]
  geometry                         contains_pt convex_hull
  &lt;list&gt;                           &lt;lgl&gt;       &lt;list&gt;
1 &lt;POLYGON ((-66.335674 17.986328… TRUE        &lt;POLYGON ((-66.335674 17.986328,…
2 &lt;POLYGON ((-66.335432 17.986626… TRUE        &lt;POLYGON ((-66.335432 17.986626,…
3 &lt;POLYGON ((-66.335432 17.986626… TRUE        &lt;POLYGON ((-66.335432 17.986626,…
4 &lt;POLYGON ((-66.335674 17.986328… TRUE        &lt;POLYGON ((-66.335674 17.986328,…
5 &lt;POLYGON ((-66.242489 17.988637… FALSE       &lt;POLYGON ((-66.242489 17.988637,…
6 &lt;POLYGON ((-66.242489 17.988637… FALSE       &lt;POLYGON ((-66.242489 17.988637,…
7 &lt;POLYGON ((-66.24221 17.988799,… FALSE       &lt;POLYGON ((-66.24221 17.988799, …
8 &lt;POLYGON ((-66.24221 17.988799,… FALSE       &lt;POLYGON ((-66.24221 17.988799, …
</code></pre>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>The author of this blog post would like to thank <a href="https://github.com/jiayuasu" target="_blank" rel="noopener">Jia Yu</a>
,
the creator of Apache Sedona, and <a href="https://github.com/lorenzwalthert" target="_blank" rel="noopener">Lorenz Walthert</a>
 for
their suggestion to contribute <code>sparklyr.sedona</code> to the upstream
<a href="https://github.com/apache/incubator-sedona" target="_blank" rel="noopener">incubator-sedona</a>
 repository. Jia has provided
extensive code-review feedback to ensure <code>sparklyr.sedona</code> complies with coding standards
and best practices of the Apache Sedona project, and has also been very helpful in the
instrumentation of CI workflows verifying <code>sparklyr.sedona</code> works as expected with snapshot
versions of Sedona libraries from development branches.</p>
<p>The author is also grateful for his colleague <a href="https://github.com/skeydan" target="_blank" rel="noopener">Sigrid Keydana</a>

for valuable editorial suggestions on this blog post.</p>
<p>That&rsquo;s all. Thank you for reading!</p>
<p>Photo by <a href="https://unsplash.com/@nasa" target="_blank" rel="noopener">NASA</a>
 on <a href="https://unsplash.com/" target="_blank" rel="noopener">Unsplash</a>
</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><code>sparklyr.sedona</code> had not yet been released to CRAN at the time of writing.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Yes, pun intended&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>This demo requires sparklyr 1.7 or above to generate the required Spark SQL type casts for <code>ST_Point()</code> automatically.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/thumbnail.jpg" length="374380" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.7: New data sources and spark_apply() capabilities, better interfaces for sparklyr extensions, and more!</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.7/</link>
      <pubDate>Tue, 06 Jul 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.7/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>Sparklyr</code></a>
 1.7 is now available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
!</p>
<p>To install <code>sparklyr</code> 1.7 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we wish to present the following highlights from the <code>sparklyr</code> 1.7 release:</p>
<ul>
<li><a href="#image-and-binary-data-sources">Image and binary data sources</a>
</li>
<li><a href="#new-spark_apply-capabilities">New spark_apply() capabilities</a>
</li>
<li><a href="#better-integration-with-sparklyr-extensions">Better integration with sparklyr extensions</a>
</li>
<li><a href="#other-exciting-news">Other exciting news</a>
</li>
</ul>
<h2 id="image-and-binary-data-sources">Image and binary data sources
</h2>
<p>As a unified analytics engine for large-scale data processing, <a href="https://spark.apache.org" target="_blank" rel="noopener">Apache Spark</a>

is well-known for its ability to tackle challenges associated with the volume, velocity, and last but
not least, the variety of big data. Therefore it is hardly surprising to see that &ndash; in response to recent
advances in deep learning frameworks &ndash; Apache Spark has introduced built-in support for
<a href="https://issues.apache.org/jira/browse/SPARK-22666" target="_blank" rel="noopener">image data sources</a>

and <a href="https://issues.apache.org/jira/browse/SPARK-25348" target="_blank" rel="noopener">binary data sources</a>
 (in releases 2.4 and 3.0, respectively).
The corresponding R interfaces for both data sources, namely,
<a href="https://spark.rstudio.com/reference/spark_read_image.html" target="_blank" rel="noopener"><code>spark_read_image()</code></a>
 and
<a href="https://spark.rstudio.com/reference/spark_read_binary.html" target="_blank" rel="noopener"><code>spark_read_binary()</code></a>
, were shipped
recently as part of <code>sparklyr</code> 1.7.</p>
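<p>The binary counterpart works analogously: <code>spark_read_binary()</code> loads each file as
one row containing the file&rsquo;s path, modification time, length, and raw content, which makes
it convenient for feeding arbitrary file formats into downstream ML transformers. A minimal
sketch follows (the directory path is a placeholder, and since binary data sources arrived in
Spark 3.0, the connection must target a Spark 3.x cluster):</p>
<pre><code class="language-r">library(sparklyr)

sc &lt;- spark_connect(master = "local", version = "3.0.0")

# each matching file becomes one row:
# (path, modificationTime, length, content)
binary_sdf &lt;- spark_read_binary(
  sc,
  name = "image_files",
  dir = "/tmp/images",
  path_glob_filter = "*.png"
)
</code></pre>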
<p>The usefulness of data source functionalities such as <code>spark_read_image()</code> is perhaps best illustrated
by a quick demo below, where <code>spark_read_image()</code>, through the standard Apache Spark
<a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/image/ImageSchema.html" target="_blank" rel="noopener"><code>ImageSchema</code></a>
,
helps connect raw image inputs to a sophisticated feature extractor and a classifier, forming a powerful
Spark application for image classifications.</p>
<h3 id="the-demo">The demo
</h3>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.7/images/photo-1571324524859-899fbd151860.jpeg"
      alt="" 
      loading="lazy"
    >
  </figure></div>

Photo by <a href="https://unsplash.com/@danieltuttle" target="_blank" rel="noopener">Daniel Tuttle</a>
 on
<a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" target="_blank" rel="noopener">Unsplash</a>
</p>
<p>In this demo, we shall construct a scalable Spark ML pipeline capable of classifying images of cats and dogs
accurately and efficiently, using <code>spark_read_image()</code> and a pre-trained convolutional neural network
code-named <code>Inception</code> (Szegedy et al. (2015)).</p>
<p>The first step to building such a demo with maximum portability and repeatability is to create a
<a href="https://spark.rstudio.com/extensions/" target="_blank" rel="noopener">sparklyr extension</a>
 that accomplishes the following:</p>
<ul>
<li>Specifying the required Maven dependencies of this demo (namely, the
<a href="https://spark-packages.org/package/databricks/spark-deep-learning" target="_blank" rel="noopener">Spark Deep Learning library</a>

(Databricks, Inc. (2019)), which contains an <code>Inception</code>-V3-based image feature extractor accessible through
the <a href="https://spark.apache.org/docs/latest/ml-pipeline.html#transformers" target="_blank" rel="noopener">Spark ML Transformer interface</a>
)</li>
<li>Bundling with itself two <a href="https://xkcd.com/221" target="_blank" rel="noopener">randomly selected</a>

<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> and disjoint subsets of the
dogs-vs-cats dataset (Elson et al. (2007)) as train and test data, stored in the <code>extdata/{train,test}</code>
subdirectories of the package)</li>
</ul>
<p>A reference implementation of such a <code>sparklyr</code> extension can be found
<a href="https://github.com/mlverse/sparklyr-image-classification-demo" target="_blank" rel="noopener">here</a>.</p>
<p>The second step, of course, is to make use of the above-mentioned <code>sparklyr</code> extension to perform some feature
engineering. We will see very high-level features being extracted intelligently from each cat/dog image based
on what the pre-built <code>Inception</code>-V3 convolutional neural network has already learned from classifying a much
broader collection of images:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr.deeperer</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># NOTE: the correct spark_home path to use depends on the configuration of the</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Spark cluster you are working with.</span>
</span></span><span class="line"><span class="cl"><span class="n">spark_home</span> <span class="o">&lt;-</span> <span class="s">&#34;/usr/lib/spark&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;yarn&#34;</span><span class="p">,</span> <span class="n">spark_home</span> <span class="o">=</span> <span class="n">spark_home</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">data_dir</span> <span class="o">&lt;-</span> <span class="nf">copy_images_to_hdfs</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># extract features from train- and test-data</span>
</span></span><span class="line"><span class="cl"><span class="n">image_data</span> <span class="o">&lt;-</span> <span class="nf">list</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="kr">for</span> <span class="p">(</span><span class="n">x</span> <span class="kr">in</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;train&#34;</span><span class="p">,</span> <span class="s">&#34;test&#34;</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># import</span>
</span></span><span class="line"><span class="cl">  <span class="n">image_data[[x]]</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;dogs&#34;</span><span class="p">,</span> <span class="s">&#34;cats&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">lapply</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="kr">function</span><span class="p">(</span><span class="n">label</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">numeric_label</span> <span class="o">&lt;-</span> <span class="nf">ifelse</span><span class="p">(</span><span class="nf">identical</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="s">&#34;dogs&#34;</span><span class="p">),</span> <span class="m">1L</span><span class="p">,</span> <span class="m">0L</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="nf">spark_read_image</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">          <span class="n">sc</span><span class="p">,</span> <span class="n">dir</span> <span class="o">=</span> <span class="nf">file.path</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">label</span><span class="p">,</span> <span class="n">fsep</span> <span class="o">=</span> <span class="s">&#34;/&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">          <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">label</span> <span class="o">=</span> <span class="n">numeric_label</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">      <span class="nf">do.call</span><span class="p">(</span><span class="n">sdf_bind_rows</span><span class="p">,</span> <span class="n">.)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">dl_featurizer</span> <span class="o">&lt;-</span> <span class="nf">invoke_new</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s">&#34;com.databricks.sparkdl.DeepImageFeaturizer&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nf">random_string</span><span class="p">(</span><span class="s">&#34;dl_featurizer&#34;</span><span class="p">)</span> <span class="c1"># uid</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">invoke</span><span class="p">(</span><span class="s">&#34;setModelName&#34;</span><span class="p">,</span> <span class="s">&#34;InceptionV3&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">invoke</span><span class="p">(</span><span class="s">&#34;setInputCol&#34;</span><span class="p">,</span> <span class="s">&#34;image&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">invoke</span><span class="p">(</span><span class="s">&#34;setOutputCol&#34;</span><span class="p">,</span> <span class="s">&#34;features&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">image_data[[x]]</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">    <span class="n">dl_featurizer</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">invoke</span><span class="p">(</span><span class="s">&#34;transform&#34;</span><span class="p">,</span> <span class="nf">spark_dataframe</span><span class="p">(</span><span class="n">image_data[[x]]</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">sdf_register</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Third step: equipped with features that summarize the content of each image well, we can
build a Spark ML pipeline that recognizes cats and dogs using only logistic regression
<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">label_col</span> <span class="o">&lt;-</span> <span class="s">&#34;label&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">prediction_col</span> <span class="o">&lt;-</span> <span class="s">&#34;prediction&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">pipeline</span> <span class="o">&lt;-</span> <span class="nf">ml_pipeline</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ml_logistic_regression</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">features_col</span> <span class="o">=</span> <span class="s">&#34;features&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">label_col</span> <span class="o">=</span> <span class="n">label_col</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">prediction_col</span> <span class="o">=</span> <span class="n">prediction_col</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">&lt;-</span> <span class="n">pipeline</span> <span class="o">%&gt;%</span> <span class="nf">ml_fit</span><span class="p">(</span><span class="n">image_data</span><span class="o">$</span><span class="n">train</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Finally, we can evaluate the accuracy of this model on the test images:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">predictions</span> <span class="o">&lt;-</span> <span class="n">model</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ml_transform</span><span class="p">(</span><span class="n">image_data</span><span class="o">$</span><span class="n">test</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">compute</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">cat</span><span class="p">(</span><span class="s">&#34;Predictions vs. labels:\n&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">predictions</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="o">!!</span><span class="n">label_col</span><span class="p">,</span> <span class="o">!!</span><span class="n">prediction_col</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="nf">sdf_nrow</span><span class="p">(</span><span class="n">predictions</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">cat</span><span class="p">(</span><span class="s">&#34;\nAccuracy of predictions:\n&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">predictions</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ml_multiclass_classification_evaluator</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">label_col</span> <span class="o">=</span> <span class="n">label_col</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">prediction_col</span> <span class="o">=</span> <span class="n">prediction_col</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">metric_name</span> <span class="o">=</span> <span class="s">&#34;accuracy&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## Predictions vs. labels:
## # Source: spark&lt;?&gt; [?? x 2]
##    label prediction
##    &lt;int&gt;      &lt;dbl&gt;
##  1     1          1
##  2     1          1
##  3     1          1
##  4     1          1
##  5     1          1
##  6     1          1
##  7     1          1
##  8     1          1
##  9     1          1
## 10     1          1
## 11     0          0
## 12     0          0
## 13     0          0
## 14     0          0
## 15     0          0
## 16     0          0
## 17     0          0
## 18     0          0
## 19     0          0
## 20     0          0
##
## Accuracy of predictions:
## [1] 1
</code></pre>
<h2 id="new-spark_apply-capabilities">New <code>spark_apply()</code> capabilities
</h2>
<h3 id="optimizations--custom-serializers">Optimizations &amp; custom serializers
</h3>
<p>Many <code>sparklyr</code> users who have tried to run
<a href="https://spark.rstudio.com/reference/spark_apply.html" target="_blank" rel="noopener"><code>spark_apply()</code></a>
 or
<a href="https://blog.rstudio.com/2020/05/06/sparklyr-1-2/#foreach" target="_blank" rel="noopener"><code>doSpark</code></a>
 to
parallelize R computations among Spark workers have probably encountered some
challenges arising from the serialization of R closures.
In some scenarios, the
serialized size of the R closure can become too large, often due to the size
of the enclosing R environment required by the closure. In other
scenarios, the serialization itself may take too much time, partially offsetting
the performance gain from parallelization. Recently, multiple optimizations went
into <code>sparklyr</code> to address those challenges. One of the optimizations was to
make good use of the
<a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables" target="_blank" rel="noopener">broadcast variable</a>

construct in Apache Spark to reduce the overhead of distributing shared and
immutable task states across all Spark workers. In <code>sparklyr</code> 1.7, there is
also support for custom <code>spark_apply()</code> serializers, which offers more fine-grained
control over the trade-off between speed and compression level of serialization
algorithms. For example, one can specify</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">options</span><span class="p">(</span><span class="n">sparklyr.spark_apply.serializer</span> <span class="o">=</span> <span class="s">&#34;qs&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>which will apply the default options of <code>qs::qserialize()</code> to achieve a high
compression level, or</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">options</span><span class="p">(</span><span class="n">sparklyr.spark_apply.serializer</span> <span class="o">=</span> <span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">qs</span><span class="o">::</span><span class="nf">qserialize</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">preset</span> <span class="o">=</span> <span class="s">&#34;fast&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="nf">options</span><span class="p">(</span><span class="n">sparklyr.spark_apply.deserializer</span> <span class="o">=</span> <span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">qs</span><span class="o">::</span><span class="nf">qdeserialize</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>which will aim for faster serialization speed with less compression.</p>
<h3 id="inferring-dependencies-automatically">Inferring dependencies automatically
</h3>
<p>In <code>sparklyr</code> 1.7, <code>spark_apply()</code> also provides the experimental
<code>auto_deps = TRUE</code> option. With <code>auto_deps</code> enabled, <code>spark_apply()</code> will
examine the R closure being applied, infer the list of required R packages,
and only copy the required R packages and their transitive dependencies
to Spark workers. In many scenarios, the <code>auto_deps = TRUE</code> option is a
significantly better alternative to the default <code>packages = TRUE</code>
behavior, which is to ship everything within <code>.libPaths()</code> to Spark worker
nodes, or the advanced <code>packages = &lt;package config&gt;</code> option, which requires
users to supply the list of required R packages or manually create a
<code>spark_apply()</code> bundle.</p>
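<p>As a minimal sketch (assuming an existing Spark connection and a Spark data
frame <code>sdf</code> with a numeric column <code>x</code>; the closure and names here are purely
illustrative), enabling the option could look like:</p>
<pre tabindex="0"><code class="language-r"># ship only the packages the closure actually needs (here, dplyr and its
# transitive dependencies), instead of everything in .libPaths()
result &lt;- spark_apply(
  sdf,
  function(df) dplyr::mutate(df, y = x * 2),
  auto_deps = TRUE
)
</code></pre>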
<h2 id="better-integration-with-sparklyr-extensions">Better integration with sparklyr extensions
</h2>
<p>Substantial effort went into <code>sparklyr</code> 1.7 to make life easier for <code>sparklyr</code>
extension authors. Experience suggests that two areas of integration with
<code>sparklyr</code> have been a common source of friction for extensions:</p>
<ul>
<li>The <a href="https://github.com/sparklyr/sparklyr/blob/1242adb632c881f0a8dd234898af84a76614f590/R/dplyr_spark_connection.R#L184" target="_blank" rel="noopener"><code>dbplyr</code> SQL translation environment</a>
</li>
<li><a href="https://spark.rstudio.com/extensions/#calling-spark-from-r" target="_blank" rel="noopener">Invocation of Java/Scala functions from R</a>
</li>
</ul>
<p>We will elaborate on recent progress in both areas in the sub-sections below.</p>
<h3 id="customizing-the-dbplyr-sql-translation-environment">Customizing the <code>dbplyr</code> SQL translation environment
</h3>
<p><code>sparklyr</code> extensions can now customize <code>sparklyr</code>&rsquo;s <code>dbplyr</code> SQL translations
through the
<a href="https://spark.rstudio.com/reference/spark_dependency.html" target="_blank" rel="noopener"><code>spark_dependency()</code></a>

specification returned from <code>spark_dependencies()</code> callbacks.
This type of flexibility becomes useful, for instance, in scenarios where a
<code>sparklyr</code> extension needs to insert type casts for inputs to custom Spark
UDFs. We can find a concrete example of this in
<a href="https://github.com/apache/incubator-sedona/tree/master/R/sparklyr.sedona#sparklyrsedona" target="_blank" rel="noopener"><code>sparklyr.sedona</code></a>
,
a <code>sparklyr</code> extension to facilitate geo-spatial analyses using
<a href="https://sedona.apache.org/" target="_blank" rel="noopener">Apache Sedona</a>
. Geo-spatial UDFs supported by Apache
Sedona such as <code>ST_Point()</code> and <code>ST_PolygonFromEnvelope()</code> require all inputs to be
<code>DECIMAL(24, 20)</code> quantities rather than <code>DOUBLE</code>s. Without any customization to
<code>sparklyr</code>&rsquo;s <code>dbplyr</code> SQL variant, the only way for a <code>dplyr</code>
query involving <code>ST_Point()</code> to actually work in <code>sparklyr</code> would be to explicitly
implement any type cast needed by the query using <code>dplyr::sql()</code>, e.g.,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">my_geospatial_sdf</span> <span class="o">&lt;-</span> <span class="n">my_geospatial_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">x</span> <span class="o">=</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">sql</span><span class="p">(</span><span class="s">&#34;CAST(`x` AS DECIMAL(24, 20))&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">y</span> <span class="o">=</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">sql</span><span class="p">(</span><span class="s">&#34;CAST(`y` AS DECIMAL(24, 20))&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">pt</span> <span class="o">=</span> <span class="nf">ST_Point</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div>
<p>This would, to some extent, be antithetical to <code>dplyr</code>&rsquo;s goal of freeing R users from
laboriously spelling out SQL queries. By customizing <code>sparklyr</code>&rsquo;s <code>dbplyr</code> SQL
translations (as implemented
<a href="https://github.com/apache/incubator-sedona/blob/d8c2aae0678b7262660bda68eb0a2048b849e438/R/sparklyr.sedona/R/dependencies.R#L55" target="_blank" rel="noopener">here</a>

and
<a href="https://github.com/apache/incubator-sedona/blob/d8c2aae0678b7262660bda68eb0a2048b849e438/R/sparklyr.sedona/R/dependencies.R#L135" target="_blank" rel="noopener">here</a>

), <code>sparklyr.sedona</code> allows users to simply write</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">my_geospatial_sdf</span> <span class="o">&lt;-</span> <span class="n">my_geospatial_sdf</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">pt</span> <span class="o">=</span> <span class="nf">ST_Point</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>instead, and the required Spark SQL type casts are generated automatically.</p>
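<p>Schematically, such a customization lives in an extension&rsquo;s
<code>spark_dependencies()</code> callback. The sketch below is illustrative only: the
exact shape of the translation specification (the <code>dbplyr_sql_variant</code>
argument and the way each scalar function emits its casts) is an assumption
here and should be checked against the <code>spark_dependency()</code> reference and the
<code>sparklyr.sedona</code> sources linked above.</p>
<pre tabindex="0"><code class="language-r">spark_dependencies &lt;- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    # ... jars, packages, etc. ...
    # hypothetical structure: map each UDF name to a SQL-generating function
    dbplyr_sql_variant = list(
      scalar = list(
        ST_Point = function(x, y) {
          dbplyr::sql(paste0(
            &#34;ST_Point(CAST(&#34;, x, &#34; AS DECIMAL(24, 20)), &#34;,
            &#34;CAST(&#34;, y, &#34; AS DECIMAL(24, 20)))&#34;
          ))
        }
      )
    )
  )
}
</code></pre>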
<h3 id="improved-interface-for-invoking-javascala-functions">Improved interface for invoking Java/Scala functions
</h3>
<p>In <code>sparklyr</code> 1.7, the R interface for Java/Scala invocations saw a number of
improvements.</p>
<p>With previous versions of <code>sparklyr</code>, many <code>sparklyr</code> extension authors would
run into trouble when attempting to invoke Java/Scala functions accepting an
<code>Array[T]</code> as one of their parameters, where <code>T</code> is any type bound more specific
than <code>java.lang.Object</code> / <code>AnyRef</code>. This was because, in the absence of
additional type information, any array of objects passed through <code>sparklyr</code>&rsquo;s
Java/Scala invocation interface would be interpreted as simply an array of
<code>java.lang.Object</code>s.
For this reason, a helper function
<a href="https://spark.rstudio.com/reference/jarray.html" target="_blank" rel="noopener"><code>jarray()</code></a>
 was implemented as
part of <code>sparklyr</code> 1.7 as a way to overcome the aforementioned problem.
For example, executing</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="kc">...</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">arr</span> <span class="o">&lt;-</span> <span class="nf">jarray</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nf">seq</span><span class="p">(</span><span class="m">5</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">lapply</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="nf">invoke_new</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;MyClass&#34;</span><span class="p">,</span> <span class="n">x</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">  <span class="n">element_type</span> <span class="o">=</span> <span class="s">&#34;MyClass&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>will assign to <code>arr</code> a <em>reference</em> to an <code>Array[MyClass]</code> of length 5, rather
than an <code>Array[AnyRef]</code>. <code>arr</code> can then be passed as a
parameter to functions accepting only <code>Array[MyClass]</code>s as inputs. Previously,
some possible workarounds of this <code>sparklyr</code> limitation included changing
function signatures to accept <code>Array[AnyRef]</code>s instead of <code>Array[MyClass]</code>s, or
implementing a &ldquo;wrapped&rdquo; version of each function accepting <code>Array[AnyRef]</code>
inputs and converting them to <code>Array[MyClass]</code> before the actual invocation.
None of these workarounds was an ideal solution to the problem.</p>
<p>A similar hurdle addressed in <code>sparklyr</code> 1.7 involves
function parameters that must be single-precision floating point numbers or
arrays of single-precision floating point numbers.
For those scenarios,
<a href="https://spark.rstudio.com/reference/jfloat.html" target="_blank" rel="noopener"><code>jfloat()</code></a>
 and
<a href="https://spark.rstudio.com/reference/jfloat_array.html" target="_blank" rel="noopener"><code>jfloat_array()</code></a>

are the helper functions that allow numeric quantities in R to be passed to
<code>sparklyr</code>&rsquo;s Java/Scala invocation interface as parameters with desired types.</p>
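<p>For example (a sketch: <code>sc</code> is an existing <code>spark_connection</code>, and the object
and method names below are hypothetical):</p>
<pre tabindex="0"><code class="language-r">x &lt;- jfloat(sc, 1.23)                  # a single-precision Float, not a Double
xs &lt;- jfloat_array(sc, c(1.23, 4.56))  # an Array[Float]

# both can then be passed to methods expecting single-precision parameters:
obj %&gt;% invoke(&#34;setThreshold&#34;, x)
</code></pre>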
<p>In addition, while previous versions of <code>sparklyr</code> failed to serialize
parameters with <code>NaN</code> values correctly, <code>sparklyr</code> 1.7 preserves <code>NaN</code>s as
expected in its Java/Scala invocation interface.</p>
<h2 id="other-exciting-news">Other exciting news
</h2>
<p>There are numerous other new features, enhancements, and bug fixes made to
<code>sparklyr</code> 1.7, all listed in the
<a href="https://github.com/sparklyr/sparklyr/blob/main/NEWS.md#sparklyr-170" target="_blank" rel="noopener">NEWS.md</a>

file of the <code>sparklyr</code> repo and documented in <code>sparklyr</code>&rsquo;s
<a href="https://spark.rstudio.com/reference/" target="_blank" rel="noopener">HTML reference</a>
 pages.
In the interest of brevity, we will not describe all of them in great detail
within this blog post.</p>
<h2 id="acknowledgement">Acknowledgement
</h2>
<p>In chronological order, we would like to thank the following individuals who
have authored or co-authored pull requests that were part of the <code>sparklyr</code> 1.7
release:</p>
<ul>
<li><a href="https://github.com/yitao-li" target="_blank" rel="noopener">@yitao-li</a>
</li>
<li><a href="https://github.com/mzorko" target="_blank" rel="noopener">@mzorko</a>
</li>
<li><a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">@jozefhajnala</a>
</li>
<li><a href="https://github.com/lresende" target="_blank" rel="noopener">@lresende</a>
</li>
</ul>
<p>We&rsquo;re also extremely grateful to everyone who has submitted
feature requests or bug reports, many of which have been tremendously helpful in
shaping <code>sparklyr</code> into what it is today.</p>
<p>Furthermore, the author of this blog post is indebted to
<a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>
 for her awesome editorial suggestions.
Without her insights about good writing and story-telling, expositions like this
one would have been less readable.</p>
<p>If you wish to learn more about <code>sparklyr</code>, we recommend visiting
<a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
,
and also reading some previous <code>sparklyr</code> release posts such as
<a href="https://posit-open-source.netlify.app/blog/ai/2021-03-25-sparklyr-1.6.0-released/">sparklyr 1.6</a>

and
<a href="https://posit-open-source.netlify.app/blog/ai/2020-12-14-sparklyr-1.5.0-released/">sparklyr 1.5</a>
.</p>
<p>That is all. Thanks for reading!</p>
<p>Databricks, Inc. 2019. <em>Deep Learning Pipelines for Apache Spark</em>. V. 1.5.0. Released January 25. <a href="https://spark-packages.org/package/databricks/spark-deep-learning" target="_blank" rel="noopener">https://spark-packages.org/package/databricks/spark-deep-learning</a>
.</p>
<p>Elson, Jeremy, John (JD) Douceur, Jon Howell, and Jared Saul. 2007. &ldquo;Asirra: A CAPTCHA That Exploits Interest-Aligned Manual Image Categorization.&rdquo; <em>Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)</em>, Proceedings of 14th ACM Conference on Computer and Communications Security (CCS) Editions. <a href="https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/" target="_blank" rel="noopener">https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/</a>
.</p>
<p>Szegedy, Christian, Wei Liu, Yangqing Jia, et al. 2015. &ldquo;Going Deeper with Convolutions.&rdquo; <em>Computer Vision and Pattern Recognition (CVPR)</em>. <a href="http://arxiv.org/abs/1409.4842" target="_blank" rel="noopener">http://arxiv.org/abs/1409.4842</a>
.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Fun exercise for our readers: why not experiment with different subsets of cats-vs-dogs images for training
and testing, or even better, replace train and test images with your own images of cats and dogs, and see what
happens?&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Another way to see why it works: in fact the pre-built <code>Inception</code>-based feature
extractor simply applies all transformations <code>Inception</code> would have applied to its input,
except for the last logistic-regression-esque affine transformation plus non-linearity
producing the final categorical output, and <code>Inception</code> is a highly successful
convolutional neural network trained to recognize 1000 categories of animals and objects,
including multiple types of cats and dogs.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.7/thumbnail.png" length="7251" type="image/png" />
    </item>
    <item>
      <title>sparklyr 1.6: weighted quantile summaries, power iteration clustering, spark_write_rds(), and more</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.6/</link>
      <pubDate>Thu, 25 Mar 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.6/</guid>
<dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 1.6 is now available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
!</p>
<p>To install <code>sparklyr</code> 1.6 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we shall highlight the following features and enhancements
from <code>sparklyr</code> 1.6:</p>
<ul>
<li><a href="#weighted-quantile-summaries">Weighted quantile summaries</a>
</li>
<li><a href="#power-iteration-clustering">Power iteration clustering</a>
</li>
<li><a href="#spark_write_rds-collect_from_rds"><code>spark_write_rds()</code> + <code>collect_from_rds()</code></a>
</li>
<li><a href="#dplyr-related-improvements">Dplyr-related improvements</a>
</li>
</ul>
<h2 id="weighted-quantile-summaries">Weighted quantile summaries
</h2>
<p><a href="https://spark.apache.org" target="_blank" rel="noopener">Apache Spark</a>
 is well-known for supporting
approximate algorithms that trade off marginal amounts of accuracy for greater
speed and parallelism.
Such algorithms are particularly beneficial for performing preliminary data
explorations at scale, as they enable users to quickly query certain estimated
statistics within a predefined error margin, while avoiding the high cost of
exact computations.
One example is the Greenwald-Khanna algorithm for on-line computation of quantile
summaries, as described in Greenwald and Khanna (2001).
This algorithm was originally designed for efficient
$\epsilon$-approximation of quantiles within a large dataset <em>without</em> the notion of data
points carrying different weights, and the unweighted version of it has been
implemented as
<a href="https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameStatFunctions.html#approxQuantile%28java.lang.String,%20double%5B%5D,%20double%29" target="_blank" rel="noopener"><code>approxQuantile()</code></a>

since Spark 2.0.
However, the same algorithm can be generalized to handle weighted
inputs, and as <code>sparklyr</code> user <a href="https://github.com/Zhuk66" target="_blank" rel="noopener">@Zhuk66</a>
 mentioned
in <a href="https://github.com/sparklyr/sparklyr/issues/2915" target="_blank" rel="noopener">this issue</a>
, a
<a href="https://github.com/sparklyr/sparklyr/blob/4b6bc6677ecf92787ab3521f364a8d80b973d92f/java/spark-1.5.2/weightedquantilesummaries.scala#L13-L332" target="_blank" rel="noopener">weighted version</a>

of this algorithm makes for a useful <code>sparklyr</code> feature.</p>
<p>To explain what a weighted quantile means, we must first clarify what the
weight of each data point signifies. For example, if we have a sequence of
observations $(1, 1, 1, 1, 0, 2, -1, -1)$, and would like to approximate the
median of all data points, then we have the following two options:</p>
<ul>
<li>
<p>Either run the unweighted version of <code>approxQuantile()</code> in Spark to scan
through all 8 data points</p>
</li>
<li>
<p>Or alternatively, &ldquo;compress&rdquo; the data into 4 tuples of (value, weight):
$(1, 0.5), (0, 0.125), (2, 0.125), (-1, 0.25)$, where the second component of
each tuple represents how often a value occurs relative to the rest of the
observed values, and then find the median by scanning through the 4 tuples
using the weighted version of the Greenwald-Khanna algorithm</p>
</li>
</ul>
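<p>The &ldquo;compression&rdquo; step above amounts to computing relative frequencies, which
we can sanity-check with a few lines of plain R:</p>
<pre tabindex="0"><code class="language-r">obs &lt;- c(1, 1, 1, 1, 0, 2, -1, -1)
# weight of each distinct value = its relative frequency among all observations
weights &lt;- table(obs) / length(obs)
weights[[&#34;1&#34;]]   # 0.5
weights[[&#34;-1&#34;]]  # 0.25
</code></pre>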
<p>We can also run through a contrived example involving the standard normal
distribution to illustrate the power of weighted quantile estimation in
<code>sparklyr</code> 1.6. Suppose we cannot simply run <code>qnorm()</code> in R to evaluate the
<a href="https://en.wikipedia.org/wiki/Normal_distribution#Quantile_function" target="_blank" rel="noopener">quantile function</a>

of the standard normal distribution at $p = 0.25$ and $p = 0.75$, how can
we get a rough idea of the 1st and 3rd quartiles of this distribution?
One way is to sample a large number of data points from this distribution, and
then apply the Greenwald-Khanna algorithm to our unweighted samples, as shown
below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">num_samples</span> <span class="o">&lt;-</span> <span class="m">1e6</span>
</span></span><span class="line"><span class="cl"><span class="n">samples</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">rnorm</span><span class="p">(</span><span class="n">num_samples</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">samples</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="nf">random_string</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_quantile</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">column</span> <span class="o">=</span> <span class="s">&#34;x&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">probabilities</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span> <span class="m">0.75</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">relative.error</span> <span class="o">=</span> <span class="m">0.01</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##        25%        75%
## -0.6629242  0.6874939
</code></pre>
<p>Notice that because we are working with an approximate algorithm, and have specified
<code>relative.error = 0.01</code>, the estimated value of $-0.6629242$ from above
could be anywhere between the 24th and the 26th percentile of all samples.
In fact, it falls in the $25.36896$-th percentile:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pnorm</span><span class="p">(</span><span class="m">-0.6629242</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## [1] 0.2536896
</code></pre>
<p>Now how can we make use of weighted quantile estimation from <code>sparklyr</code> 1.6 to
obtain similar results? Simple! We can sample a large number of $x$ values
uniformly at random from $(-\infty, \infty)$ (or, alternatively, just select a
large number of values evenly spaced within $(-M, M)$, where $M$ is
approximately $\infty$), and assign each $x$ value a weight of
$\displaystyle \frac{1}{\sqrt{2 \pi}}e^{-\frac{x^2}{2}}$, the standard normal
distribution&rsquo;s probability density at $x$. Finally, we run the weighted version
of <code>sdf_quantile()</code> from <code>sparklyr</code> 1.6, as shown below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">num_samples</span> <span class="o">&lt;-</span> <span class="m">1e6</span>
</span></span><span class="line"><span class="cl"><span class="n">M</span> <span class="o">&lt;-</span> <span class="m">1000</span>
</span></span><span class="line"><span class="cl"><span class="n">samples</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">x</span> <span class="o">=</span> <span class="n">M</span> <span class="o">*</span> <span class="nf">seq</span><span class="p">(</span><span class="o">-</span><span class="n">num_samples</span> <span class="o">/</span> <span class="m">2</span> <span class="o">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">num_samples</span> <span class="o">/</span> <span class="m">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">num_samples</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">weight</span> <span class="o">=</span> <span class="nf">dnorm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">samples</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="nf">random_string</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_quantile</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">column</span> <span class="o">=</span> <span class="s">&#34;x&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">weight.column</span> <span class="o">=</span> <span class="s">&#34;weight&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">probabilities</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span> <span class="m">0.75</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">relative.error</span> <span class="o">=</span> <span class="m">0.01</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##    25%    75%
## -0.696  0.662
</code></pre>
<p>Voilà! The estimates are not too far off from the true 25th and 75th percentiles (given the
aforementioned maximum permissible relative error of $0.01$):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pnorm</span><span class="p">(</span><span class="m">-0.696</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## [1] 0.2432144
</code></pre>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pnorm</span><span class="p">(</span><span class="m">0.662</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## [1] 0.7460144
</code></pre>
<h2 id="power-iteration-clustering">Power iteration clustering
</h2>
<p>Power iteration clustering (PIC), a simple and scalable graph clustering method
presented in Lin and Cohen (2010), first finds a low-dimensional embedding of a dataset, using
truncated power iteration on a normalized pairwise-similarity matrix of all data
points, and then uses this embedding as the &ldquo;cluster indicator&rdquo;, an intermediate
representation of the dataset that leads to fast convergence when used as input
to k-means clustering. This process is very well illustrated in figure 1
of Lin and Cohen (2010) (reproduced below)</p>
<img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.6/images/PIC.png" width="612" />
<p>in which the leftmost image is a visualization of a dataset consisting of 3
circles, with points colored red, green, and blue to indicate clustering
results, and the subsequent images show the power iteration process gradually
transforming the original set of points into what appears to be three disjoint line
segments, an intermediate representation that can be rapidly separated into 3
clusters using k-means clustering with $k = 3$.</p>
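<p>In the notation of Lin and Cohen (2010): given an affinity matrix $A$ whose entry $A_{ij}$
measures the similarity between points $i$ and $j$, PIC works with the row-normalized matrix
$W = D^{-1}A$, where $D$ is the diagonal matrix with $D_{ii} = \sum_{j} A_{ij}$, and repeatedly
applies the update $v^{(t+1)} = \frac{W v^{(t)}}{\lVert W v^{(t)} \rVert_1}$. The iteration is
truncated, i.e., stopped early while $v$ still separates the clusters, rather than being run to
convergence on the dominant eigenvector of $W$, which would be uninformative.</p>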
<p>In <code>sparklyr</code> 1.6, <code>ml_power_iteration()</code> was implemented to make the
<a href="http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html" target="_blank" rel="noopener">PIC functionality</a>

in Spark accessible from R. It expects as input a 3-column Spark dataframe that
represents a pairwise-similarity matrix of all data points. Two of
the columns in this dataframe should contain 0-based row and column indices, and
the third column should hold the corresponding similarity measure.
In the example below, we will see a dataset consisting of two circles being
easily separated into two clusters by <code>ml_power_iteration()</code>, with the Gaussian
kernel being used as the similarity measure between any 2 points:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">gen_similarity_matrix</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># Gaussian similarity measure</span>
</span></span><span class="line"><span class="cl">  <span class="n">gaussian_similarity</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">pt1</span><span class="p">,</span> <span class="n">pt2</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">exp</span><span class="p">(</span><span class="o">-</span><span class="nf">sum</span><span class="p">((</span><span class="n">pt2</span> <span class="o">-</span> <span class="n">pt1</span><span class="p">)</span> <span class="n">^</span> <span class="m">2</span><span class="p">)</span> <span class="o">/</span> <span class="m">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># generate evenly distributed points on a circle centered at the origin</span>
</span></span><span class="line"><span class="cl">  <span class="n">gen_circle</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">radius</span><span class="p">,</span> <span class="n">num_pts</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">num_pts</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">      <span class="n">purrr</span><span class="o">::</span><span class="nf">map_dfr</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="kr">function</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="n">theta</span> <span class="o">&lt;-</span> <span class="m">2</span> <span class="o">*</span> <span class="kc">pi</span> <span class="o">*</span> <span class="n">idx</span> <span class="o">/</span> <span class="n">num_pts</span>
</span></span><span class="line"><span class="cl">          <span class="n">radius</span> <span class="o">*</span> <span class="nf">c</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">),</span> <span class="n">y</span> <span class="o">=</span> <span class="nf">sin</span><span class="p">(</span><span class="n">theta</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="p">})</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># generate points on both circles</span>
</span></span><span class="line"><span class="cl">  <span class="n">pts</span> <span class="o">&lt;-</span> <span class="nf">rbind</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">gen_circle</span><span class="p">(</span><span class="n">radius</span> <span class="o">=</span> <span class="m">1</span><span class="p">,</span> <span class="n">num_pts</span> <span class="o">=</span> <span class="m">80</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">gen_circle</span><span class="p">(</span><span class="n">radius</span> <span class="o">=</span> <span class="m">4</span><span class="p">,</span> <span class="n">num_pts</span> <span class="o">=</span> <span class="m">80</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># populate the pairwise similarity matrix (stored as a 3-column dataframe)</span>
</span></span><span class="line"><span class="cl">  <span class="n">similarity_matrix</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">  <span class="kr">for</span> <span class="p">(</span><span class="n">i</span> <span class="kr">in</span> <span class="nf">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="nf">nrow</span><span class="p">(</span><span class="n">pts</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">    <span class="n">similarity_matrix</span> <span class="o">&lt;-</span> <span class="n">similarity_matrix</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">      <span class="nf">rbind</span><span class="p">(</span><span class="nf">seq</span><span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="m">1L</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">        <span class="n">purrr</span><span class="o">::</span><span class="nf">map_dfr</span><span class="p">(</span><span class="o">~</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">          <span class="n">src</span> <span class="o">=</span> <span class="n">i</span> <span class="o">-</span> <span class="m">1L</span><span class="p">,</span> <span class="n">dst</span> <span class="o">=</span> <span class="n">.x</span> <span class="o">-</span> <span class="m">1L</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="n">similarity</span> <span class="o">=</span> <span class="nf">gaussian_similarity</span><span class="p">(</span><span class="n">pts[i</span><span class="p">,</span><span class="n">]</span><span class="p">,</span> <span class="n">pts[.x</span><span class="p">,</span><span class="n">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">))</span>
</span></span><span class="line"><span class="cl">      <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">similarity_matrix</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="nf">gen_similarity_matrix</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="n">clusters</span> <span class="o">&lt;-</span> <span class="nf">ml_power_iteration</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sdf</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="m">2</span><span class="p">,</span> <span class="n">max_iter</span> <span class="o">=</span> <span class="m">10</span><span class="p">,</span> <span class="n">init_mode</span> <span class="o">=</span> <span class="s">&#34;degree&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">src_col</span> <span class="o">=</span> <span class="s">&#34;src&#34;</span><span class="p">,</span> <span class="n">dst_col</span> <span class="o">=</span> <span class="s">&#34;dst&#34;</span><span class="p">,</span> <span class="n">weight_col</span> <span class="o">=</span> <span class="s">&#34;similarity&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">clusters</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="m">160</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # A tibble: 160 x 2
##        id cluster
##     &lt;dbl&gt;   &lt;int&gt;
##   1     0       1
##   2     1       1
##   3     2       1
##   4     3       1
##   5     4       1
##   ...
##   157   156       0
##   158   157       0
##   159   158       0
##   160   159       0
</code></pre>
<p>The output shows points from the two circles being assigned to separate clusters,
as expected, after only a small number of PIC iterations.</p>
<h2 id="spark_write_rds--collect_from_rds"><code>spark_write_rds()</code> + <code>collect_from_rds()</code>
</h2>
<p><code>spark_write_rds()</code> and <code>collect_from_rds()</code> are implemented as a less
memory-consuming alternative to <code>collect()</code>. Unlike <code>collect()</code>, which retrieves all
elements of a Spark dataframe through the Spark driver node, hence potentially
causing slowness or out-of-memory failures when collecting large amounts of data,
<code>spark_write_rds()</code>, when used in conjunction with <code>collect_from_rds()</code>, can
retrieve all partitions of a Spark dataframe directly from Spark workers,
rather than through the Spark driver node.
First, <code>spark_write_rds()</code> will
distribute the tasks of serializing Spark dataframe partitions in RDS version
2 format among Spark workers. Spark workers can then process multiple partitions
in parallel, each handling one partition at a time and persisting the RDS output
directly to disk, rather than sending dataframe partitions to the Spark driver
node. Finally, the RDS outputs can be re-assembled to R dataframes using
<code>collect_from_rds()</code>.</p>
<p>Shown below is an example of <code>spark_write_rds()</code> + <code>collect_from_rds()</code> usage,
where RDS outputs are first saved to HDFS, then downloaded to the local
filesystem with <code>hadoop fs -get</code>, and finally, post-processed with
<code>collect_from_rds()</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">nycflights13</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">num_partitions</span> <span class="o">&lt;-</span> <span class="m">10L</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;yarn&#34;</span><span class="p">,</span> <span class="n">spark_home</span> <span class="o">=</span> <span class="s">&#34;/usr/lib/spark&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">flights_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">flights</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="n">num_partitions</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Spark workers serialize all partitions in RDS format in parallel and write RDS</span>
</span></span><span class="line"><span class="cl"><span class="c1"># outputs to HDFS</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_write_rds</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">flights_sdf</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">dest_uri</span> <span class="o">=</span> <span class="s">&#34;hdfs://&lt;namenode&gt;:8020/flights-part-{partitionId}.rds&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run `hadoop fs -get` to download RDS files from HDFS to local file system</span>
</span></span><span class="line"><span class="cl"><span class="kr">for</span> <span class="p">(</span><span class="n">partition</span> <span class="kr">in</span> <span class="nf">seq</span><span class="p">(</span><span class="n">num_partitions</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="nf">system2</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s">&#34;hadoop&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nf">c</span><span class="p">(</span><span class="s">&#34;fs&#34;</span><span class="p">,</span> <span class="s">&#34;-get&#34;</span><span class="p">,</span> <span class="nf">sprintf</span><span class="p">(</span><span class="s">&#34;hdfs://&lt;namenode&gt;:8020/flights-part-%d.rds&#34;</span><span class="p">,</span> <span class="n">partition</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Post-process RDS outputs</span>
</span></span><span class="line"><span class="cl"><span class="n">partitions</span> <span class="o">&lt;-</span> <span class="p">(</span><span class="nf">seq</span><span class="p">(</span><span class="n">num_partitions</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">lapply</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">partition</span><span class="p">)</span> <span class="nf">collect_from_rds</span><span class="p">(</span><span class="nf">sprintf</span><span class="p">(</span><span class="s">&#34;flights-part-%d.rds&#34;</span><span class="p">,</span> <span class="n">partition</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Optionally, call `rbind()` to combine data from all partitions into a single R dataframe</span>
</span></span><span class="line"><span class="cl"><span class="n">flights_df</span> <span class="o">&lt;-</span> <span class="nf">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span> <span class="n">partitions</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="dplyr-related-improvements">Dplyr-related improvements
</h2>
<p>Similar to other recent <code>sparklyr</code> releases, <code>sparklyr</code> 1.6 comes with a
number of dplyr-related improvements, such as</p>
<ul>
<li>Support for <code>where()</code> predicate within <code>select()</code> and <code>summarize(across(...))</code>
operations on Spark dataframes</li>
<li>Addition of <code>if_all()</code> and <code>if_any()</code> functions</li>
<li>Full compatibility with <code>dbplyr</code> 2.0 backend API</li>
</ul>
<h3 id="selectwhere-and-summarizeacrosswhere"><code>select(where(...))</code> and <code>summarize(across(where(...)))</code>
</h3>
<p>The dplyr <code>where(...)</code> construct is useful for applying a selection or
aggregation function to multiple columns that satisfy some boolean predicate.
For example,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">iris</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="nf">where</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>returns all numeric columns from the <code>iris</code> dataset, and</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">iris</span> <span class="o">%&gt;%</span> <span class="nf">summarize</span><span class="p">(</span><span class="nf">across</span><span class="p">(</span><span class="nf">where</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">),</span> <span class="n">mean</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>computes the average of each numeric column.</p>
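<p>For reference, running the <code>summarize()</code> call above on the built-in <code>iris</code> data frame
prints approximately:</p>
<pre><code>##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333
</code></pre>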
<p>In <code>sparklyr</code> 1.6, both types of operations can be applied to Spark dataframes, e.g.,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">iris</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="nf">random_string</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="nf">where</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">%&gt;%</span> <span class="nf">summarize</span><span class="p">(</span><span class="nf">across</span><span class="p">(</span><span class="nf">where</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">),</span> <span class="n">mean</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="if_all-and-if_any"><code>if_all()</code> and <code>if_any()</code>
</h3>
<p><code>if_all()</code> and <code>if_any()</code> are two convenience functions from <code>dplyr</code> 1.0.4 (see
<a href="https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any" target="_blank" rel="noopener">here</a>
 for more details)
that effectively <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>
combine the results of applying a boolean predicate to a tidy selection of columns
using the logical <code>and</code>/<code>or</code> operators.</p>
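<p>To see their semantics on an ordinary R dataframe first, here is a minimal local sketch (the <code>all_gt2</code>/<code>any_gt5</code> names and the thresholds are ours, chosen to mirror the Spark example that follows):</p>

```r
library(dplyr)

# if_all(): keep rows where EVERY "Petal" column exceeds 2 (logical AND)
all_gt2 <- iris %>% filter(if_all(starts_with("Petal"), ~ .x > 2))

# if_any(): keep rows where AT LEAST ONE "Petal" column exceeds 5 (logical OR)
any_gt5 <- iris %>% filter(if_any(starts_with("Petal"), ~ .x > 5))
```

<p>Every row of <code>all_gt2</code> satisfies both per-column conditions, whereas a row of <code>any_gt5</code> needs to satisfy only one of them.</p>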
<p>Starting from sparklyr 1.6, <code>if_all()</code> and <code>if_any()</code> can also be applied to
Spark dataframes, e.g.,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">iris</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="nf">random_string</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Select all records with Petal.Width &gt; 2 and Petal.Length &gt; 2</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="nf">if_all</span><span class="p">(</span><span class="nf">starts_with</span><span class="p">(</span><span class="s">&#34;Petal&#34;</span><span class="p">),</span> <span class="o">~</span> <span class="n">.x</span> <span class="o">&gt;</span> <span class="m">2</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Select all records with Petal.Width &gt; 5 or Petal.Length &gt; 5</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="nf">if_any</span><span class="p">(</span><span class="nf">starts_with</span><span class="p">(</span><span class="s">&#34;Petal&#34;</span><span class="p">),</span> <span class="o">~</span> <span class="n">.x</span> <span class="o">&gt;</span> <span class="m">5</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="compatibility-with-dbplyr-20-backend-api">Compatibility with <code>dbplyr</code> 2.0 backend API
</h3>
<p><code>sparklyr</code> 1.6 is fully compatible with the newer <code>dbplyr</code> 2.0 backend API (by
implementing all interface changes recommended
<a href="https://dbplyr.tidyverse.org/articles/backend-2.html" target="_blank" rel="noopener">here</a>
), while still
maintaining backward compatibility with the previous edition of <code>dbplyr</code> API, so
that <code>sparklyr</code> users will not be forced to switch to any particular version of
<code>dbplyr</code>.</p>
<p>This change should be mostly invisible to users for now. In fact, the only
discernible difference in behavior will be the following code</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dbplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="nf">dbplyr_edition</span><span class="p">(</span><span class="n">sc</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>outputting</p>
<pre><code>[1] 2
</code></pre>
<p>if <code>sparklyr</code> is working with <code>dbplyr</code> 2.0+, and</p>
<pre><code>[1] 1
</code></pre>
<p>otherwise.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>In chronological order, we would like to thank the following contributors for
making <code>sparklyr</code> 1.6 awesome:</p>
<ul>
<li><a href="https://github.com/yitao-li" target="_blank" rel="noopener">@yitao-li</a>
</li>
<li><a href="https://github.com/pgramme" target="_blank" rel="noopener">@pgramme</a>
</li>
<li><a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
</li>
<li><a href="https://github.com/andrew-christianson" target="_blank" rel="noopener">@andrew-christianson</a>
</li>
<li><a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">@jozefhajnala</a>
</li>
<li><a href="https://github.com/nathaneastwood" target="_blank" rel="noopener">@nathaneastwood</a>
</li>
<li><a href="https://github.com/mzorko" target="_blank" rel="noopener">@mzorko</a>
</li>
</ul>
<p>We would also like to give a big shout-out to the wonderful open-source community
behind <code>sparklyr</code>, whose numerous bug reports and feature suggestions have
been invaluable to this release.</p>
<p>Finally, the author of this blog post also very much appreciates the highly
valuable editorial suggestions from <a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>
.</p>
<p>If you wish to learn more about <code>sparklyr</code>, we recommend checking out
<a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
,
and also some previous <code>sparklyr</code> release posts such as
<a href="https://posit-open-source.netlify.app/blog/ai/2020-12-14-sparklyr-1.5.0-released/">sparklyr 1.5</a>

and <a href="https://posit-open-source.netlify.app/blog/ai/2020-09-30-sparklyr-1.4.0-released/">sparklyr 1.4</a>
.</p>
<p>That is all. Thanks for reading!</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>modulo possible implementation-dependent short-circuit evaluations&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.6/thumbnail.jpg" length="89031" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.5: better dplyr interface, more sdf_* functions, and RDS-based serialization routines</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.5/</link>
      <pubDate>Mon, 14 Dec 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.5/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p>We are thrilled to announce <a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 1.5 is now
available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
!</p>
<p>To install <code>sparklyr</code> 1.5 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we will highlight the following aspects of <code>sparklyr</code> 1.5:</p>
<ul>
<li><a href="#better-dplyr-interface">Better <code>dplyr</code> interface</a>
</li>
<li><a href="#new-additions-to-the-sdf_-family-of-functions">4 useful additions to the <code>sdf_*</code> family of functions</a>
</li>
<li>New <a href="#rds-based-serialization-routines">RDS-based serialization routines</a>
 along with several serialization-related improvements and bug fixes</li>
</ul>
<h2 id="better-dplyr-interface">Better dplyr interface
</h2>
<p>A large fraction of pull requests that went into the <code>sparklyr</code> 1.5 release were focused on making
Spark dataframes work with various <code>dplyr</code> verbs in the same way that R dataframes do.
The full list of <code>dplyr</code>-related bugs and feature requests that were resolved in
<code>sparklyr</code> 1.5 can be found <a href="https://github.com/sparklyr/sparklyr/issues?q=is%3Aissue&#43;is%3Aclosed&#43;label%3Adplyr&#43;milestone%3A1.5.0" target="_blank" rel="noopener">here</a>
.</p>
<p>In this section, we will showcase three new <code>dplyr</code> functionalities that were shipped with <code>sparklyr</code> 1.5.</p>
<h3 id="stratified-sampling">Stratified sampling
</h3>
<p>Stratified sampling on an R dataframe can be accomplished with a combination of <code>dplyr::group_by()</code> followed by
<code>dplyr::sample_n()</code> or <code>dplyr::sample_frac()</code>, where the grouping variables specified in the <code>dplyr::group_by()</code>
step are the ones that define each stratum. For instance, the following query will group <code>mtcars</code> by number
of cylinders and return a weighted random sample of size two from each group, without replacement, and weighted by
the <code>mpg</code> column:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_n</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">2</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # A tibble: 6 x 11
## # Groups:   cyl [3]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
## 2  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
## 3  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
## 4  21       6 160     110  3.9   2.62  16.5     0     1     4     4
## 5  15.5     8 318     150  2.76  3.52  16.9     0     0     3     2
## 6  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2
</code></pre>
<p>Starting from <code>sparklyr</code> 1.5, the same can also be done for Spark dataframes with Spark 3.0 or above, e.g.,:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">mtcars</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_n</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">2</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 11]
# Groups: cyl
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1  21       6 160     110  3.9   2.62  16.5     0     1     4     4
2  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
3  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
5  16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3
6  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2
</code></pre>
<p>or</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_frac</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # Source: spark&lt;?&gt; [?? x 11]
## # Groups: cyl
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1  21       6 160     110  3.9   2.62  16.5     0     1     4     4
## 2  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
## 3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
## 4  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
## 5  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
## 6  15.5     8 318     150  2.76  3.52  16.9     0     0     3     2
## 7  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2
## 8  16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3
</code></pre>
<h3 id="row-sums">Row sums
</h3>
<p>The <code>rowSums()</code> functionality from base R is handy when one needs to sum up
a large number of columns within an R dataframe that would be impractical to
enumerate individually.
For example, here we have a six-column dataframe of random real numbers, where the
<code>partial_sum</code> column in the result contains the sum of columns <code>b</code> through <code>e</code> within
each row:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">ncols</span> <span class="o">&lt;-</span> <span class="m">6</span>
</span></span><span class="line"><span class="cl"><span class="n">nums</span> <span class="o">&lt;-</span> <span class="nf">seq</span><span class="p">(</span><span class="n">ncols</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">lapply</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="nf">runif</span><span class="p">(</span><span class="m">5</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="nf">names</span><span class="p">(</span><span class="n">nums</span><span class="p">)</span> <span class="o">&lt;-</span> <span class="kc">letters</span><span class="n">[1</span><span class="o">:</span><span class="n">ncols]</span>
</span></span><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">as_tibble</span><span class="p">(</span><span class="n">nums</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">partial_sum</span> <span class="o">=</span> <span class="nf">rowSums</span><span class="p">(</span><span class="n">.[2</span><span class="o">:</span><span class="m">5</span><span class="n">]</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # A tibble: 5 x 7
##         a     b     c      d     e      f partial_sum
##     &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;       &lt;dbl&gt;
## 1 0.781   0.801 0.157 0.0293 0.169 0.0978        1.16
## 2 0.696   0.412 0.221 0.941  0.697 0.675         2.27
## 3 0.802   0.410 0.516 0.923  0.190 0.904         2.04
## 4 0.200   0.590 0.755 0.494  0.273 0.807         2.11
## 5 0.00149 0.711 0.286 0.297  0.107 0.425         1.40
</code></pre>
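<p>Note that <code>.[2:5]</code> refers to columns 2 through 5 of the incoming dataframe, i.e., <code>b</code> through <code>e</code>. The same partial sums can be checked with base R alone; the deterministic <code>tbl2</code> below is our own toy data, chosen so the sums are easy to verify by eye:</p>

```r
# Deterministic 6-column frame; partial_sum should be b + c + d + e per row
tbl2 <- data.frame(
  a = c(1, 2), b = c(10, 20), c = c(100, 200),
  d = c(0.5, 0.5), e = c(3, 4), f = c(7, 8)
)
tbl2$partial_sum <- rowSums(tbl2[2:5])

tbl2$partial_sum
# [1] 113.5 224.5
```

<p>Here <code>partial_sum</code> is 113.5 and 224.5, i.e., <code>b + c + d + e</code> for each row, confirming which columns <code>.[2:5]</code> selects.</p>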
<p>Beginning with <code>sparklyr</code> 1.5, the same operation can be performed with Spark dataframes:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">tbl</span><span class="p">,</span> <span class="n">overwrite</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">partial_sum</span> <span class="o">=</span> <span class="nf">rowSums</span><span class="p">(</span><span class="n">.[2</span><span class="o">:</span><span class="m">5</span><span class="n">]</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # Source: spark&lt;?&gt; [?? x 7]
##         a     b     c      d     e      f partial_sum
##     &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;       &lt;dbl&gt;
## 1 0.781   0.801 0.157 0.0293 0.169 0.0978        1.16
## 2 0.696   0.412 0.221 0.941  0.697 0.675         2.27
## 3 0.802   0.410 0.516 0.923  0.190 0.904         2.04
## 4 0.200   0.590 0.755 0.494  0.273 0.807         2.11
## 5 0.00149 0.711 0.286 0.297  0.107 0.425         1.40
</code></pre>
<p>As a bonus from implementing the <code>rowSums</code> feature for Spark dataframes,
<code>sparklyr</code> 1.5 now also offers limited support for the column-subsetting
operator on Spark dataframes.
For example, all code snippets below will return some subset of columns from
the dataframe named <code>sdf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># select columns `b` through `e`</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf[2</span><span class="o">:</span><span class="m">5</span><span class="n">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># select columns `b` and `c`</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span><span class="nf">[c</span><span class="p">(</span><span class="s">&#34;b&#34;</span><span class="p">,</span> <span class="s">&#34;c&#34;</span><span class="p">)</span><span class="n">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># drop the first and third columns and return the rest</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span><span class="nf">[c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span> <span class="m">-3</span><span class="p">)</span><span class="n">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="weighted-mean-summarizer">Weighted-mean summarizer
</h3>
<p>Similar to the two features described above, the <code>weighted.mean()</code> summarizer is another
useful function that has become part of the <code>dplyr</code> interface for Spark dataframes in <code>sparklyr</code> 1.5.
One can see it in action by, for example, comparing the output from the following</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">mtcars</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">summarize</span><span class="p">(</span><span class="n">mpg_wm</span> <span class="o">=</span> <span class="nf">weighted.mean</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="n">wt</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>with output from the equivalent operation on <code>mtcars</code> in R:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">summarize</span><span class="p">(</span><span class="n">mpg_wm</span> <span class="o">=</span> <span class="nf">weighted.mean</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="n">wt</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>both of them should evaluate to the following:</p>
<pre><code>##     cyl mpg_wm
##   &lt;dbl&gt;  &lt;dbl&gt;
## 1     4   25.9
## 2     6   19.6
## 3     8   14.8
</code></pre>
<h2 id="new-additions-to-the-sdf_-family-of-functions">New additions to the <code>sdf_*</code> family of functions
</h2>
<p><code>sparklyr</code> provides a large number of convenience functions for working with Spark dataframes,
and all of them have names starting with the <code>sdf_</code> prefix.</p>
<p>In this section we will briefly mention four new additions
and show some example scenarios in which those functions are useful.</p>
<h3 id="sdf_expand_grid"><code>sdf_expand_grid()</code>
</h3>
<p>As the name suggests, <code>sdf_expand_grid()</code> is simply the Spark equivalent of <code>expand.grid()</code>.
Rather than running <code>expand.grid()</code> in R and importing the resulting R dataframe to Spark, one
can now run <code>sdf_expand_grid()</code>, which accepts both R vectors and Spark dataframes and supports
hints for broadcast hash joins. The example below shows <code>sdf_expand_grid()</code> creating a
100-by-100-by-10-by-10 grid in Spark over 1000 Spark partitions, with broadcast hash join hints
on variables with small cardinalities:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">grid_sdf</span> <span class="o">&lt;-</span> <span class="nf">sdf_expand_grid</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">var1</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">100</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">var2</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">100</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">var3</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">10</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">var4</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">10</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">broadcast_vars</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">var3</span><span class="p">,</span> <span class="n">var4</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">repartition</span> <span class="o">=</span> <span class="m">1000</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">grid_sdf</span> <span class="o">%&gt;%</span> <span class="nf">sdf_nrow</span><span class="p">()</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## [1] 1e+06
</code></pre>
<h3 id="sdf_partition_sizes"><code>sdf_partition_sizes()</code>
</h3>
<p>As <code>sparklyr</code> user <a href="https://github.com/sbottelli" target="_blank" rel="noopener">@sbottelli</a>
 suggested <a href="https://github.com/sparklyr/sparklyr/issues/2791" target="_blank" rel="noopener">here</a>
,
one thing that would be great to have in <code>sparklyr</code> is an efficient way to query partition sizes of a Spark dataframe.
In <code>sparklyr</code> 1.5, <code>sdf_partition_sizes()</code> does exactly that:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">1000</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_partition_sizes</span><span class="p">()</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">(</span><span class="n">row.names</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##  partition_index partition_size
##                0            200
##                1            200
##                2            200
##                3            200
##                4            200
</code></pre>
<h3 id="sdf_unnest_longer-and-sdf_unnest_wider"><code>sdf_unnest_longer()</code> and <code>sdf_unnest_wider()</code>
</h3>
<p><code>sdf_unnest_longer()</code> and <code>sdf_unnest_wider()</code> are the equivalents of
<code>tidyr::unnest_longer()</code> and <code>tidyr::unnest_wider()</code> for Spark dataframes.
<code>sdf_unnest_longer()</code> expands all elements in a struct column into multiple rows, and
<code>sdf_unnest_wider()</code> expands them into multiple columns. As illustrated with an example
dataframe below,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">id</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">3</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">attribute</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;Alice&#34;</span><span class="p">,</span> <span class="n">grade</span> <span class="o">=</span> <span class="s">&#34;A&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">      <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;Bob&#34;</span><span class="p">,</span> <span class="n">grade</span> <span class="o">=</span> <span class="s">&#34;B&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">      <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;Carol&#34;</span><span class="p">,</span> <span class="n">grade</span> <span class="o">=</span> <span class="s">&#34;C&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_unnest_longer</span><span class="p">(</span><span class="n">col</span> <span class="o">=</span> <span class="n">attribute</span><span class="p">,</span> <span class="n">indices_to</span> <span class="o">=</span> <span class="s">&#34;key&#34;</span><span class="p">,</span> <span class="n">values_to</span> <span class="o">=</span> <span class="s">&#34;value&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>evaluates to</p>
<pre><code>## # Source: spark&lt;?&gt; [?? x 3]
##      id value key
##   &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
## 1     1 A     grade
## 2     1 Alice name
## 3     2 B     grade
## 4     2 Bob   name
## 5     3 C     grade
## 6     3 Carol name
</code></pre>
<p>whereas</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_unnest_wider</span><span class="p">(</span><span class="n">col</span> <span class="o">=</span> <span class="n">attribute</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>evaluates to</p>
<pre><code>## # Source: spark&lt;?&gt; [?? x 3]
##      id grade name
##   &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
## 1     1 A     Alice
## 2     2 B     Bob
## 3     3 C     Carol
</code></pre>
<h2 id="rds-based-serialization-routines">RDS-based serialization routines
</h2>
<p>Some readers may be wondering why a brand-new serialization format would need to be implemented in <code>sparklyr</code> at all.
Long story short, the reason is that RDS serialization is a strictly better replacement for its CSV predecessor.
It possesses all the desirable attributes of the CSV format,
while avoiding a number of disadvantages common among text-based data formats.</p>
<p>In this section, we will briefly outline why <code>sparklyr</code> should support at least one serialization format other than <code>arrow</code>,
deep-dive into issues with CSV-based serialization,
and then show how the new RDS-based serialization is free from those issues.</p>
<h3 id="why-arrow-is-not-for-everyone">Why is <code>arrow</code> not for everyone?
</h3>
<p>To transfer data between Spark and R correctly and efficiently, <code>sparklyr</code> must rely on some data serialization
format that is well-supported by both Spark and R.
Unfortunately, not many serialization formats satisfy this requirement,
and among the ones that do are text-based formats such as CSV and JSON,
and binary formats such as Apache Arrow, Protobuf, and, more recently, a small subset of RDS version 2.
Further complicating the matter is the additional consideration that
<code>sparklyr</code> should support at least one serialization format whose implementation can be fully self-contained within the <code>sparklyr</code> code base,
i.e., such serialization should not depend on any external R package or system library,
so that it can accommodate users who want to use <code>sparklyr</code> but who do not necessarily have the required C++ compiler tool chain and
other system dependencies for setting up R packages such as <a href="https://cran.r-project.org/web/packages/arrow/index.html" target="_blank" rel="noopener"><code>arrow</code></a>
 or
<a href="https://cran.r-project.org/web/packages/protolite/index.html" target="_blank" rel="noopener"><code>protolite</code></a>
.
Prior to <code>sparklyr</code> 1.5, CSV-based serialization was the default fallback when the <code>arrow</code> package was not installed or
when the type of data being transported from R to Spark was unsupported by the available version of <code>arrow</code>.</p>
<h3 id="why-is-the-csv-format-not-ideal">Why is the CSV format not ideal?
</h3>
<p>There are at least three reasons to believe the CSV format is not the best choice when it comes to exporting data from R to Spark.</p>
<p>One reason is efficiency. For example, a double-precision floating-point number such as <code>.Machine$double.eps</code> needs to
be expressed as <code>&quot;2.22044604925031e-16&quot;</code> in CSV format to avoid any loss of precision, thus taking up 20 bytes
rather than 8 bytes.</p>
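<p>The arithmetic is easy to check from plain R, with no Spark connection needed: the decimal string quoted above is 20 characters long, versus the fixed 8 bytes per element that doubles occupy in binary form:</p>
<pre><code class="language-r"># the round-trip-safe decimal rendering of .Machine$double.eps
nchar(&#34;2.22044604925031e-16&#34;)
## [1] 20

# versus 8 bytes per element in binary form: one million doubles
# occupy about 8 MB (plus a small vector header)
print(object.size(vector(&#34;double&#34;, 1e6)), units = &#34;MB&#34;)
</code></pre>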
<p>But more important than efficiency are correctness concerns. In an R dataframe, one can store both <code>NA_real_</code> and
<code>NaN</code> in a column of floating point numbers. <code>NA_real_</code> should ideally translate to <code>null</code> within a Spark dataframe, whereas
<code>NaN</code> should continue to be <code>NaN</code> when transported from R to Spark. Unfortunately, <code>NA_real_</code> in R becomes indistinguishable
from <code>NaN</code> once serialized in CSV format, as evident from a quick demo shown below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">original_df</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="kc">NA_real_</span><span class="p">,</span> <span class="kc">NaN</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">original_df</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">is_nan</span> <span class="o">=</span> <span class="nf">is.nan</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##     x is_nan
## 1  NA  FALSE
## 2 NaN   TRUE
</code></pre>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">csv_file</span> <span class="o">&lt;-</span> <span class="s">&#34;/tmp/data.csv&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nf">write.csv</span><span class="p">(</span><span class="n">original_df</span><span class="p">,</span> <span class="n">file</span> <span class="o">=</span> <span class="n">csv_file</span><span class="p">,</span> <span class="n">row.names</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">deserialized_df</span> <span class="o">&lt;-</span> <span class="nf">read.csv</span><span class="p">(</span><span class="n">csv_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">deserialized_df</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">is_nan</span> <span class="o">=</span> <span class="nf">is.nan</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##    x is_nan
## 1 NA  FALSE
## 2 NA  FALSE
</code></pre>
<p>Another correctness issue, closely related to the one above, is that
<code>&quot;NA&quot;</code> and <code>NA</code> within a string column of an R dataframe become indistinguishable
once serialized in CSV format, as correctly pointed out in
<a href="https://github.com/sparklyr/sparklyr/issues/2031" target="_blank" rel="noopener">this Github issue</a>

by <a href="https://github.com/caewok" target="_blank" rel="noopener">@caewok</a>
 and others.</p>
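<p>This collapse is just as easy to reproduce in plain R, mirroring the floating-point demo above (the temporary file path below is illustrative):</p>
<pre><code class="language-r">original_df &lt;- data.frame(s = c(NA_character_, &#34;NA&#34;), stringsAsFactors = FALSE)

csv_file &lt;- tempfile(fileext = &#34;.csv&#34;)
write.csv(original_df, file = csv_file, row.names = FALSE)
deserialized_df &lt;- read.csv(csv_file, stringsAsFactors = FALSE)

# after the CSV round trip, both the missing value and the literal
# string &#34;NA&#34; test as missing
print(is.na(deserialized_df$s))
</code></pre>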
<h3 id="rds-to-the-rescue">RDS to the rescue!
</h3>
<p>RDS format is one of the most widely used binary formats for serializing R objects.
It is described in some detail in chapter 1, section 8 of
<a href="https://cran.r-project.org/doc/manuals/r-patched/R-ints.pdf" target="_blank" rel="noopener">this document</a>
.
Among advantages of the RDS format are efficiency and accuracy: it has a reasonably
efficient implementation in base R, and supports all R data types.</p>
<p>Also worth noting is that when an R dataframe containing only data types
with sensible equivalents in Apache Spark (e.g., <code>RAWSXP</code>, <code>LGLSXP</code>, <code>CHARSXP</code>, <code>REALSXP</code>, etc.)
is saved using RDS version 2,
(e.g., <code>serialize(mtcars, connection = NULL, version = 2L, xdr = TRUE)</code>),
only a tiny subset of the RDS format will be involved in the serialization process,
and implementing deserialization routines in Scala capable of decoding such a restricted
subset of RDS constructs is in fact a reasonably simple and straightforward task
(as shown
<a href="https://github.com/sparklyr/sparklyr/blob/5e27668f16faa4852deae2db14828cfd1614c982/java/spark-1.5.2/rutils.scala#L47" target="_blank" rel="noopener">here</a>

).</p>
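<p>To get a feel for how restricted that subset is, one can produce such a version-2 serialization directly from base R and inspect it:</p>
<pre><code class="language-r"># version-2 RDS bytes in XDR (big-endian) form
bytes &lt;- serialize(mtcars, connection = NULL, version = 2L, xdr = TRUE)

# binary serializations start with the two-byte marker &#34;X\n&#34;
rawToChar(bytes[1:2])
## [1] &#34;X\n&#34;

# and the payload round-trips losslessly within R
identical(unserialize(bytes), mtcars)
## [1] TRUE
</code></pre>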
<p>Last but not least, because RDS is a binary format, it allows <code>NA_character_</code>, <code>&quot;NA&quot;</code>,
<code>NA_real_</code>, and <code>NaN</code> to all be encoded in an unambiguous manner, hence allowing <code>sparklyr</code>
1.5 to avoid all correctness issues detailed above in non-<code>arrow</code> serialization use cases.</p>
<h3 id="other-benefits-of-rds-serialization">Other benefits of RDS serialization
</h3>
<p>In addition to correctness guarantees, RDS format also offers quite a few other advantages.</p>
<p>One advantage is of course performance: for example, importing a non-trivially-sized dataset
such as <code>nycflights13::flights</code> from R to Spark using the RDS format in sparklyr 1.5 is
roughly 40%-50% faster compared to CSV-based serialization in sparklyr 1.4. The
current RDS-based implementation is still nowhere near as fast as <code>arrow</code>-based serialization
though (<code>arrow</code> is about 3-4x faster), so for performance-sensitive tasks involving
heavy serialization, <code>arrow</code> should still be the top choice.</p>
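<p>For completeness: attaching the <code>arrow</code> package is typically all that is needed for <code>sparklyr</code> to pick the faster <code>arrow</code>-based serialization. The sketch below assumes <code>arrow</code> and <code>nycflights13</code> are installed:</p>
<pre><code class="language-r">library(sparklyr)
library(arrow)  # once attached, sparklyr uses arrow-based serialization
                # for copy_to() and collect() where possible

sc &lt;- spark_connect(master = &#34;local&#34;)
flights_sdf &lt;- copy_to(sc, nycflights13::flights, overwrite = TRUE)
</code></pre>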
<p>Another advantage is that with RDS serialization, <code>sparklyr</code> can import R dataframes containing
<code>raw</code> columns directly into binary columns in Spark. Thus, use cases such as the one below
will work in <code>sparklyr</code> 1.5:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">x</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">serialize</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">,</span> <span class="kc">NULL</span><span class="p">),</span> <span class="nf">serialize</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">123456</span><span class="p">,</span> <span class="m">789</span><span class="p">),</span> <span class="kc">NULL</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">tbl</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>While most <code>sparklyr</code> users probably won&rsquo;t find this capability of importing binary columns
to Spark immediately useful in their typical <code>sparklyr::copy_to()</code> or <code>sparklyr::collect()</code>
usages, it does play a crucial role in reducing serialization overheads in the Spark-based
<a href="https://blog.rstudio.com/2020/05/06/sparklyr-1-2/#foreach" target="_blank" rel="noopener"><code>foreach</code></a>
 parallel backend that
was first introduced in <code>sparklyr</code> 1.2.
This is because Spark workers can directly fetch the serialized R closures to be computed
from a binary Spark column instead of extracting those serialized bytes from intermediate
representations such as base64-encoded strings.
Similarly, the R results from executing worker closures will be directly available in RDS
format which can be efficiently deserialized in R, rather than being delivered in other
less efficient formats.</p>
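<p>From the user&rsquo;s perspective nothing changes: the backend is registered the same way as in <code>sparklyr</code> 1.2, and the RDS-based transport applies transparently. A minimal sketch:</p>
<pre><code class="language-r">library(sparklyr)
library(foreach)

sc &lt;- spark_connect(master = &#34;local&#34;)

# register the Spark-based foreach backend introduced in sparklyr 1.2
registerDoSpark(sc)

# each closure below is serialized, shipped to a Spark worker, and
# executed there; sparklyr 1.5 moves those bytes as binary columns
res &lt;- foreach(i = seq(3)) %dopar% i^2
print(res)
</code></pre>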
<h2 id="acknowledgement">Acknowledgement
</h2>
<p>In chronological order, we would like to thank the following contributors for making their pull
requests part of <code>sparklyr</code> 1.5:</p>
<ul>
<li><a href="https://github.com/wkdavis" target="_blank" rel="noopener">@wkdavis</a>
</li>
<li><a href="https://github.com/yitao-li" target="_blank" rel="noopener">@yitao-li</a>
</li>
<li><a href="https://github.com/falaki" target="_blank" rel="noopener">@falaki</a>
</li>
<li><a href="https://github.com/nathaneastwood" target="_blank" rel="noopener">@nathaneastwood</a>
</li>
<li><a href="https://github.com/pgramme" target="_blank" rel="noopener">@pgramme</a>
</li>
</ul>
<p>We are also grateful for the numerous bug reports and feature requests for
<code>sparklyr</code> from a fantastic open-source community.</p>
<p>Finally, the author of this blog post is indebted to
<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
,
<a href="https://github.com/batpigandme" target="_blank" rel="noopener">@batpigandme</a>
,
and <a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>
 for their valuable editorial input.</p>
<p>If you wish to learn more about <code>sparklyr</code>, check out <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
,
<a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, and some of the previous release posts such as
<a href="https://posit-open-source.netlify.app/blog/ai/2020-09-30-sparklyr-1.4.0-released">sparklyr 1.4</a>
 and
<a href="https://blog.rstudio.com/2020/07/16/sparklyr-1-3/" target="_blank" rel="noopener">sparklyr 1.3</a>
.</p>
<p>Thanks for reading!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.5/thumbnail.jpg" length="752491" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.4: Weighted Sampling, Tidyr Verbs, Robust Scaler, RAPIDS, and more</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/</link>
      <pubDate>Wed, 30 Sep 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 1.4 is now available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
! To install <code>sparklyr</code> 1.4 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we will showcase the following much-anticipated new functionalities from the <code>sparklyr</code> 1.4 release:</p>
<ul>
<li><a href="#parallelized-weighted-sampling">Parallelized Weighted Sampling</a>
 with Spark</li>
<li>Support for <a href="#tidyr-verbs">Tidyr Verbs</a>
 on Spark Dataframes</li>
<li><a href="#robust-scaler"><code>ft_robust_scaler</code></a>
 as the R interface for <a href="https://spark.apache.org/docs/3.0.0/api/java/org/apache/spark/ml/feature/RobustScaler.html" target="_blank" rel="noopener">RobustScaler</a>
 from Spark 3.0</li>
<li>Option for enabling <a href="#rapids"><code>RAPIDS</code></a>
 GPU acceleration plugin in <code>spark_connect()</code></li>
<li><a href="#higher-order-functions-and-dplyr-related-improvements">Higher-order functions and <code>dplyr</code>-related improvements</a>
</li>
</ul>
<h2 id="parallelized-weighted-sampling">Parallelized Weighted Sampling
</h2>
<p>Readers familiar with the <code>dplyr::sample_n()</code> and <code>dplyr::sample_frac()</code> functions may have noticed that both support weighted-sampling use cases on R dataframes, e.g.,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_n</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">3</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128      32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
</code></pre>
<p>and</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_frac</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">0.1</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>             mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Merc 450SE  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Fiat X1-9   27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
</code></pre>
<p>will select a random subset of <code>mtcars</code>, using the <code>mpg</code> attribute of each row as its sampling weight. With <code>replace = FALSE</code>, a row is removed from the sampling population once it is selected, whereas with <code>replace = TRUE</code>, each row always stays in the sampling population and can be selected multiple times.</p>
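The difference between the two `replace` modes can be sketched in plain Python (used here only as a runnable illustration; `sparklyr` implements this inside Spark, not like this):

```python
import random

rng = random.Random(1)
rows = list(range(10))
weights = [r + 1.0 for r in rows]  # heavier rows are more likely to be drawn

# replace = TRUE: every draw sees the full population, so duplicates are possible
with_repl = rng.choices(rows, weights=weights, k=5)

# replace = FALSE: a selected row leaves the population before the next draw
pop, wts, without_repl = rows[:], weights[:], []
for _ in range(5):
    i = rng.choices(range(len(pop)), weights=wts, k=1)[0]
    without_repl.append(pop.pop(i))
    wts.pop(i)

print(len(set(without_repl)))  # sampling without replacement yields 5 distinct rows
```

Note how the without-replacement case is inherently sequential: each draw depends on which rows the previous draws removed.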
<p>Now the exact same use cases are supported for Spark dataframes in <code>sparklyr</code> 1.4! For example:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">mtcars</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">4L</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_n</span><span class="p">(</span><span class="n">mtcars_sdf</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">5</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>will return a random subset of size 5 from the Spark dataframe <code>mtcars_sdf</code>.</p>
<p>More importantly, the sampling algorithm implemented in <code>sparklyr</code> 1.4 fits naturally into the MapReduce paradigm: because we split our <code>mtcars</code> data into 4 partitions of <code>mtcars_sdf</code> by specifying <code>repartition = 4L</code>, the algorithm first processes each partition independently and in parallel, selecting a sample set of size up to 5 from each, and then reduces the 4 sample sets into a final sample set of size 5 by choosing the records with the 5 highest sampling priorities among all candidates.</p>
<p>How is such parallelization possible, especially for the sampling without replacement scenario, where the desired result is defined as the outcome of a sequential process? A detailed answer to this question is in <a href="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/">this blog post</a>
, which includes a definition of the problem (in particular, the exact meaning of sampling weights in terms of probabilities), a high-level explanation of the current solution and the motivation behind it, and some mathematical details, all hidden in one link to a PDF file, so that non-math-oriented readers can get the gist of everything else without getting scared away, while math-oriented readers can enjoy working out all the integrals themselves before peeking at the answer.</p>
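The partition-then-merge shape of the algorithm can be illustrated with a small standalone sketch. One classic way to assign each row a mergeable "sampling priority" is the Efraimidis–Spirakis key $u^{1/w}$ (it is an assumption that `sparklyr`'s internal priority function matches this exactly, but the map/reduce structure is the same):

```python
import heapq
import random

def weighted_sample_no_replace(partitions, k, seed=0):
    """Sketch of per-partition weighted sampling without replacement.

    Each (value, weight) row gets priority u ** (1 / w) with u ~ Uniform(0, 1);
    rows with larger weights tend to draw larger priorities.  Each partition
    keeps its top-k rows independently (map step), then the final sample is
    the top-k among all per-partition candidates (reduce step)."""
    rng = random.Random(seed)
    candidates = []
    for part in partitions:
        keyed = [(rng.random() ** (1.0 / w), v) for v, w in part]
        # map step: top-k within one partition, computed independently
        candidates.extend(heapq.nlargest(k, keyed))
    # reduce step: top-k across all partition-level candidates
    return [v for _, v in heapq.nlargest(k, candidates)]

# four "partitions" of (value, weight) rows, mimicking repartition = 4L
parts = [[(i, float(i % 10 + 1)) for i in range(p, 40, 4)] for p in range(4)]
picked = weighted_sample_no_replace(parts, k=5)
print(picked)  # 5 distinct values drawn across all partitions
```

Because the per-row priorities are independent, the per-partition top-k sets can be computed in parallel and merged in any order, which is what makes the approach MapReduce-friendly.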
<h2 id="tidyr-verbs">Tidyr Verbs
</h2>
<p><code>sparklyr</code> 1.4 includes specialized implementations of the following <a href="https://tidyr.tidyverse.org/" target="_blank" rel="noopener"><code>tidyr</code></a>
 verbs that work efficiently with Spark dataframes:</p>
<ul>
<li><a href="https://tidyr.tidyverse.org/reference/fill.html" target="_blank" rel="noopener"><code>tidyr::fill</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/nest.html" target="_blank" rel="noopener"><code>tidyr::nest</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/nest.html" target="_blank" rel="noopener"><code>tidyr::unnest</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html" target="_blank" rel="noopener"><code>tidyr::pivot_wider</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html" target="_blank" rel="noopener"><code>tidyr::pivot_longer</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/separate.html" target="_blank" rel="noopener"><code>tidyr::separate</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/unite.html" target="_blank" rel="noopener"><code>tidyr::unite</code></a>
</li>
</ul>
<p>We can demonstrate how those verbs are useful for tidying data through some examples.</p>
<p>Let&rsquo;s say we are given <code>mtcars_sdf</code>, a Spark dataframe containing all rows from <code>mtcars</code> plus the name of each row:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">&lt;-</span> <span class="nf">cbind</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="nf">data.frame</span><span class="p">(</span><span class="n">model</span> <span class="o">=</span> <span class="nf">rownames</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">  <span class="nf">data.frame</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">row.names</span> <span class="o">=</span> <span class="kc">NULL</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">.,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">4L</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">mtcars_sdf</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 12]
  model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  &lt;chr&gt;        &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
# … with more rows
</code></pre>
<p>and we would like to turn all numeric attributes in <code>mtcars_sdf</code> (in other words, all columns other than the <code>model</code> column) into key-value pairs stored in 2 columns, with the <code>key</code> column storing the name of each attribute and the <code>value</code> column storing each attribute&rsquo;s numeric value. One way to accomplish that with <code>tidyr</code> is by utilizing the <code>tidyr::pivot_longer</code> functionality:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars_kv_sdf</span> <span class="o">&lt;-</span> <span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tidyr</span><span class="o">::</span><span class="nf">pivot_longer</span><span class="p">(</span><span class="n">cols</span> <span class="o">=</span> <span class="o">-</span><span class="n">model</span><span class="p">,</span> <span class="n">names_to</span> <span class="o">=</span> <span class="s">&#34;key&#34;</span><span class="p">,</span> <span class="n">values_to</span> <span class="o">=</span> <span class="s">&#34;value&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">mtcars_kv_sdf</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 3]
  model     key   value
  &lt;chr&gt;     &lt;chr&gt; &lt;dbl&gt;
1 Mazda RX4 am      1
2 Mazda RX4 carb    4
3 Mazda RX4 cyl     6
4 Mazda RX4 disp  160
5 Mazda RX4 drat    3.9
# … with more rows
</code></pre>
<p>To undo the effect of <code>tidyr::pivot_longer</code>, we can apply <code>tidyr::pivot_wider</code> to our <code>mtcars_kv_sdf</code> Spark dataframe, and get back the original data that was present in <code>mtcars_sdf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">&lt;-</span> <span class="n">mtcars_kv_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tidyr</span><span class="o">::</span><span class="nf">pivot_wider</span><span class="p">(</span><span class="n">names_from</span> <span class="o">=</span> <span class="n">key</span><span class="p">,</span> <span class="n">values_from</span> <span class="o">=</span> <span class="n">value</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">tbl</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 12]
  model         carb   cyl  drat    hp   mpg    vs    wt    am  disp  gear  qsec
  &lt;chr&gt;        &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 Mazda RX4        4     6  3.9    110  21       0  2.62     1  160      4  16.5
2 Hornet 4 Dr…     1     6  3.08   110  21.4     1  3.22     0  258      3  19.4
3 Hornet Spor…     2     8  3.15   175  18.7     0  3.44     0  360      3  17.0
4 Merc 280C        4     6  3.92   123  17.8     1  3.44     0  168.     4  18.9
5 Merc 450SLC      3     8  3.07   180  15.2     0  3.78     0  276.     3  18
# … with more rows
</code></pre>
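Conceptually, the two pivots are inverse reshapes. A minimal sketch over a list of row dictionaries (plain Python, for illustration only; the function names mirror the `tidyr` verbs but are hypothetical helpers, not the real implementations):

```python
def pivot_longer(rows, id_col, names_to="key", values_to="value"):
    """Melt every non-id column of each row dict into a (key, value) row."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col != id_col:
                long_rows.append({id_col: row[id_col], names_to: col, values_to: val})
    return long_rows

def pivot_wider(rows, id_col, names_from="key", values_from="value"):
    """Collapse (key, value) rows back into one wide row per id."""
    wide = {}
    for row in rows:
        wide.setdefault(row[id_col], {id_col: row[id_col]})[row[names_from]] = row[values_from]
    return list(wide.values())

cars = [
    {"model": "Mazda RX4", "mpg": 21.0, "cyl": 6},
    {"model": "Datsun 710", "mpg": 22.8, "cyl": 4},
]
long_form = pivot_longer(cars, "model")
print(long_form[0])  # {'model': 'Mazda RX4', 'key': 'mpg', 'value': 21.0}
assert pivot_wider(long_form, "model") == cars  # the round trip is lossless
```

The same round-trip property is what the Spark versions preserve, just computed distributively over partitions.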
<p>Another way to reduce many columns into fewer ones is by using <code>tidyr::nest</code> to move some columns into nested tables. For instance, we can create a nested table <code>perf</code> encapsulating all performance-related attributes from <code>mtcars</code> (namely, <code>hp</code>, <code>mpg</code>, <code>disp</code>, and <code>qsec</code>). However, unlike R dataframes, Spark dataframes do not have the concept of nested tables, and the closest we can get to nested tables is a <code>perf</code> column containing named structs with <code>hp</code>, <code>mpg</code>, <code>disp</code>, and <code>qsec</code> attributes:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars_nested_sdf</span> <span class="o">&lt;-</span> <span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tidyr</span><span class="o">::</span><span class="nf">nest</span><span class="p">(</span><span class="n">perf</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">hp</span><span class="p">,</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">disp</span><span class="p">,</span> <span class="n">qsec</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We can then inspect the type of <code>perf</code> column in <code>mtcars_nested_sdf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sdf_schema</span><span class="p">(</span><span class="n">mtcars_nested_sdf</span><span class="p">)</span><span class="o">$</span><span class="n">perf</span><span class="o">$</span><span class="n">type</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1] &quot;ArrayType(StructType(StructField(hp,DoubleType,true), StructField(mpg,DoubleType,true), StructField(disp,DoubleType,true), StructField(qsec,DoubleType,true)),true)&quot;
</code></pre>
<p>and inspect individual struct elements within <code>perf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">perf</span> <span class="o">&lt;-</span> <span class="n">mtcars_nested_sdf</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">perf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">unlist</span><span class="p">(</span><span class="n">perf[[1]]</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>    hp    mpg   disp   qsec
110.00  21.00 160.00  16.46
</code></pre>
<p>Finally, we can also use <code>tidyr::unnest</code> to undo the effects of <code>tidyr::nest</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars_unnested_sdf</span> <span class="o">&lt;-</span> <span class="n">mtcars_nested_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tidyr</span><span class="o">::</span><span class="nf">unnest</span><span class="p">(</span><span class="n">col</span> <span class="o">=</span> <span class="n">perf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">mtcars_unnested_sdf</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 12]
  model          cyl  drat    wt    vs    am  gear  carb    hp   mpg  disp  qsec
  &lt;chr&gt;        &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 Mazda RX4        6  3.9   2.62     0     1     4     4   110  21    160   16.5
2 Hornet 4 Dr…     6  3.08  3.22     1     0     3     1   110  21.4  258   19.4
3 Duster 360       8  3.21  3.57     0     0     3     4   245  14.3  360   15.8
4 Merc 280         6  3.92  3.44     1     0     4     4   123  19.2  168.  18.3
5 Lincoln Con…     8  3     5.42     0     0     3     4   215  10.4  460   17.8
# … with more rows
</code></pre>
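The nest/unnest round trip can likewise be sketched over row dictionaries, with the nested dict playing the role of the named struct (plain Python for illustration; `nest`/`unnest` here are hypothetical helpers, not the `tidyr` functions):

```python
def nest(rows, nested_col, cols):
    """Move the listed columns of each row dict into one nested dict."""
    out = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in cols}
        kept[nested_col] = {k: row[k] for k in cols}
        out.append(kept)
    return out

def unnest(rows, nested_col):
    """Splice each nested dict back into its parent row."""
    return [{**{k: v for k, v in row.items() if k != nested_col}, **row[nested_col]}
            for row in rows]

cars = [{"model": "Mazda RX4", "hp": 110, "mpg": 21.0, "cyl": 6}]
nested = nest(cars, "perf", ["hp", "mpg"])
print(nested)  # [{'model': 'Mazda RX4', 'cyl': 6, 'perf': {'hp': 110, 'mpg': 21.0}}]
assert unnest(nested, "perf") == cars  # unnest undoes nest
```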
<h2 id="robust-scaler">Robust Scaler
</h2>
<p><a href="https://spark.apache.org/docs/3.0.0/api/java/org/apache/spark/ml/feature/RobustScaler.html" target="_blank" rel="noopener">RobustScaler</a>
 is a new functionality introduced in Spark 3.0 (<a href="https://issues.apache.org/jira/browse/SPARK-28399" target="_blank" rel="noopener">SPARK-28399</a>
). Thanks to a <a href="https://github.com/sparklyr/sparklyr/pull/2254" target="_blank" rel="noopener">pull request</a>
 by <a href="https://github.com/zero323" target="_blank" rel="noopener">@zero323</a>
, an R interface for <code>RobustScaler</code>, namely, the <code>ft_robust_scaler()</code> function, is now part of <code>sparklyr</code>.</p>
<p>It is often observed that many machine learning algorithms perform better on numeric inputs that are standardized. Many of us have learned in stats 101 that given a random variable $X$, we can compute its mean $\mu = E[X]$, standard deviation $\sigma = \sqrt{E[X^2] - (E[X])^2}$, and then obtain a standard score $z = \frac{X - \mu}{\sigma}$, which has a mean of 0 and a standard deviation of 1.</p>
<p>However, notice that both $E[X]$ and $E[X^2]$ above are quantities that can be easily skewed by extreme outliers in $X$, causing distortions in $z$. A particularly bad case would be one where all non-outliers among $X$ are very close to $0$, making $E[X]$ close to $0$, while extreme outliers are all far in the negative direction, dragging down $E[X]$ while skewing $E[X^2]$ upwards.</p>
<p>An alternative way of standardizing $X$ based on its median, 1st quartile, and 3rd quartile values, all of which are robust against outliers, would be the following:</p>
<p>$\displaystyle z = \frac{X - \text{Median}(X)}{\text{P75}(X) - \text{P25}(X)}$</p>
<p>and this is precisely what <a href="https://spark.apache.org/docs/3.0.0/api/java/org/apache/spark/ml/feature/RobustScaler.html" target="_blank" rel="noopener">RobustScaler</a>
 offers.</p>
<p>To see <code>ft_robust_scaler()</code> in action and demonstrate its usefulness, we can go through a contrived example consisting of the following steps:</p>
<ul>
<li>Draw 500 random samples from the standard normal distribution</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sample_values</span> <span class="o">&lt;-</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">500</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">sample_values</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>  [1] -0.626453811  0.183643324 -0.835628612  1.595280802  0.329507772
  [6] -0.820468384  0.487429052  0.738324705  0.575781352 -0.305388387
  ...
</code></pre>
<ul>
<li>Inspect the minimum and maximum values among the $500$ random samples:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">sample_values</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>  [1] -3.008049
</code></pre>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">sample_values</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>  [1] 3.810277
</code></pre>
<ul>
<li>Now create $10$ other values that are extreme outliers compared to the $500$ random samples above. Given that we know all $500$ samples are within the range of $(-4, 4)$, we can choose $-501, -502, \ldots, -509, -510$ as our $10$ outliers:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">outliers</span> <span class="o">&lt;-</span> <span class="m">-500L</span> <span class="o">-</span> <span class="nf">seq</span><span class="p">(</span><span class="m">10</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Copy all $510$ values into a Spark dataframe named <code>sdf</code></li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">value</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">sample_values</span><span class="p">,</span> <span class="n">outliers</span><span class="p">)))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>We can then apply <code>ft_robust_scaler()</code> to obtain the standardized value for each input:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">scaled</span> <span class="o">&lt;-</span> <span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ft_vector_assembler</span><span class="p">(</span><span class="s">&#34;value&#34;</span><span class="p">,</span> <span class="s">&#34;input&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ft_robust_scaler</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="s">&#34;scaled&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">scaled</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">unlist</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Plotting the result shows the non-outlier data points being scaled to values that still more or less form a bell-shaped distribution centered around $0$, as expected, showing that the scaling is robust against the influence of the outliers:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="nf">data.frame</span><span class="p">(</span><span class="n">scaled</span> <span class="o">=</span> <span class="n">scaled</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">scaled</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">xlim</span><span class="p">(</span><span class="m">-7</span><span class="p">,</span> <span class="m">7</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_histogram</span><span class="p">(</span><span class="n">binwidth</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/images/scaled.png" id="id" class="class" style="width:60.0%;height:60.0%" />
<ul>
<li>Finally, we can compare the distribution of the scaled values above with the distribution of z-scores of all input values, and notice how scaling the input with only mean and standard deviation would have caused noticeable skewness &ndash; which the robust scaler has successfully avoided:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">all_values</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">sample_values</span><span class="p">,</span> <span class="n">outliers</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">z_scores</span> <span class="o">&lt;-</span> <span class="p">(</span><span class="n">all_values</span> <span class="o">-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">all_values</span><span class="p">))</span> <span class="o">/</span> <span class="nf">sd</span><span class="p">(</span><span class="n">all_values</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="nf">data.frame</span><span class="p">(</span><span class="n">scaled</span> <span class="o">=</span> <span class="n">z_scores</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">scaled</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">xlim</span><span class="p">(</span><span class="m">-0.05</span><span class="p">,</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_histogram</span><span class="p">(</span><span class="n">binwidth</span> <span class="o">=</span> <span class="m">0.005</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/images/skewed.png" id="id" class="class" style="width:60.0%;height:60.0%" />
<ul>
<li>From the two plots above, one can observe that while both standardization processes produced distributions that were still bell-shaped, the one produced by <code>ft_robust_scaler()</code> is centered around $0$, correctly reflecting the average among all non-outlier values, whereas the z-score distribution is clearly not centered around $0$, as its center has been noticeably shifted by the $10$ outlier values.</li>
</ul>
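<p>The difference between the two scalers can also be reproduced outside of Spark. The following base-R sketch (a made-up illustration of the idea behind <code>ft_robust_scaler()</code>, not <code>sparklyr</code>&rsquo;s actual implementation) centers by the median and scales by the IQR, so that a handful of extreme values barely moves the center:</p>
<pre><code class="language-r">sample_values &lt;- seq(0, 1, length.out = 100)  # hypothetical non-outlier values
outliers &lt;- rep(1000, 10)                     # a few extreme values
all_values &lt;- c(sample_values, outliers)

# z-score: centered by the mean, scaled by the standard deviation
z_scaled &lt;- (all_values - mean(all_values)) / sd(all_values)
# robust scaling: centered by the median, scaled by the IQR
robust_scaled &lt;- (all_values - median(all_values)) / IQR(all_values)

median(robust_scaled[1:100])  # close to 0
median(z_scaled[1:100])       # noticeably below 0
</code></pre>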
<h2 id="rapids">RAPIDS
</h2>
<p>Readers following Apache Spark releases closely have probably noticed the recent addition of <a href="https://rapids.ai/" target="_blank" rel="noopener">RAPIDS</a>
 GPU acceleration support in Spark 3.0. To catch up with this development, <code>sparklyr</code> 1.4 now ships with an option to enable RAPIDS in Spark connections. On a host with RAPIDS-capable hardware (e.g., an Amazon EC2 instance of type &lsquo;p3.2xlarge&rsquo;), one can install <code>sparklyr</code> 1.4 and observe RAPIDS hardware acceleration reflected in Spark SQL physical query plans:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0&#34;</span><span class="p">,</span> <span class="n">packages</span> <span class="o">=</span> <span class="s">&#34;rapids&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">db_explain</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;SELECT 4&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>== Physical Plan ==
*(2) GpuColumnarToRow false
+- GpuProject [4 AS 4#45]
   +- GpuRowToColumnar TargetSize(2147483647)
      +- *(1) Scan OneRowRelation[]
</code></pre>
<h2 id="higher-order-functions-and-dplyr-related-improvements">Higher-Order Functions and <code>dplyr</code>-Related Improvements
</h2>
<p>All newly introduced higher-order functions from Spark 3.0, such as <code>array_sort()</code> with a custom comparator, <code>transform_keys()</code>, <code>transform_values()</code>, and <code>map_zip_with()</code>, are supported by <code>sparklyr</code> 1.4.</p>
<p>In addition, all higher-order functions can now be accessed directly through <code>dplyr</code> rather than their <code>hof_*</code> counterparts in <code>sparklyr</code>. This means, for example, that we can run the following <code>dplyr</code> queries to calculate the square of all array elements in column <code>x</code> of <code>sdf</code>, and then sort them in descending order:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-3</span><span class="p">,</span> <span class="m">-2</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span> <span class="nf">c</span><span class="p">(</span><span class="m">6</span><span class="p">,</span> <span class="m">-7</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">8</span><span class="p">))))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sq_desc</span> <span class="o">&lt;-</span> <span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">transform</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="o">~</span> <span class="n">.x</span> <span class="o">*</span> <span class="n">.x</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">array_sort</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="o">~</span> <span class="nf">as.integer</span><span class="p">(</span><span class="nf">sign</span><span class="p">(</span><span class="n">.y</span> <span class="o">-</span> <span class="n">.x</span><span class="p">))))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">sq_desc</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[[1]]
[1] 25  9  4  1

[[2]]
[1] 64 49 36 25
</code></pre>
<h2 id="acknowledgement">Acknowledgement
</h2>
<p>In chronological order, we would like to thank the following individuals for their contributions to <code>sparklyr</code> 1.4:</p>
<ul>
<li><a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
</li>
<li><a href="https://github.com/nealrichardson" target="_blank" rel="noopener">@nealrichardson</a>
</li>
<li><a href="https://github.com/yitao-li" target="_blank" rel="noopener">@yitao-li</a>
</li>
<li><a href="https://github.com/wkdavis" target="_blank" rel="noopener">@wkdavis</a>
</li>
<li><a href="https://github.com/Loquats" target="_blank" rel="noopener">@Loquats</a>
</li>
<li><a href="https://github.com/zero323" target="_blank" rel="noopener">@zero323</a>
</li>
</ul>
<p>We also appreciate bug reports, feature requests, and other valuable feedback about <code>sparklyr</code> from our awesome open-source community (e.g., the weighted sampling feature in <code>sparklyr</code> 1.4 was largely motivated by this <a href="https://github.com/sparklyr/sparklyr/issues/2592" target="_blank" rel="noopener">GitHub issue</a>
 filed by <a href="https://github.com/ajing" target="_blank" rel="noopener">@ajing</a>
, and some <code>dplyr</code>-related bug fixes in this release were initiated in <a href="https://github.com/sparklyr/sparklyr/issues/2648" target="_blank" rel="noopener">#2648</a>
 and completed with this <a href="https://github.com/sparklyr/sparklyr/pull/2651" target="_blank" rel="noopener">pull request</a>
 by <a href="https://github.com/wkdavis" target="_blank" rel="noopener">@wkdavis</a>
).</p>
<p>Last but not least, the author of this blog post is extremely grateful for fantastic editorial suggestions from <a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
, <a href="https://github.com/batpigandme" target="_blank" rel="noopener">@batpigandme</a>
, and <a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>
.</p>
<p>If you wish to learn more about <code>sparklyr</code>, we recommend checking out <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, and also some of the previous release posts such as <a href="https://blog.rstudio.com/2020/07/16/sparklyr-1-3/" target="_blank" rel="noopener">sparklyr 1.3</a>
 and <a href="https://posit-open-source.netlify.app/blog/ai/2020-04-21-sparklyr-1.2.0-released/">sparklyr 1.2</a>
.</p>
<p>Thanks for reading!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/thumbnail.jpg" length="408042" type="image/jpeg" />
    </item>
    <item>
      <title>Training ImageNet with R</title>
      <link>https://posit-open-source.netlify.app/blog/ai/2020-08-24-training-imagenet-with-r/</link>
      <pubDate>Mon, 24 Aug 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/2020-08-24-training-imagenet-with-r/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<p><a href="http://www.image-net.org/" target="_blank" rel="noopener">ImageNet</a>
 (Deng et al. 2009) is an image database organized according to the <a href="http://wordnet.princeton.edu/" target="_blank" rel="noopener">WordNet</a>
 (Miller 1995) hierarchy which, historically, has been used in computer vision benchmarks and research. However, it was not until AlexNet (Krizhevsky et al. 2012) demonstrated the efficiency of deep learning using convolutional neural networks on GPUs that the computer-vision discipline turned to deep learning to achieve state-of-the-art models that revolutionized their field. Given the importance of ImageNet and AlexNet, this post introduces tools and techniques to consider when training ImageNet and other large-scale datasets with R.</p>
<p>Now, in order to process ImageNet, we will first have to <em>divide and conquer</em>, partitioning the dataset into several manageable subsets. Afterwards, we will train ImageNet using AlexNet across multiple GPUs and compute instances. <a href="#preprocessing-imagenet">Preprocessing ImageNet</a>
 and <a href="#distributed-training">distributed training</a>
 are the two topics that this post will present and discuss, starting with preprocessing ImageNet.</p>
<h2 id="preprocessing-imagenet">Preprocessing ImageNet
</h2>
<p>When dealing with large datasets, even simple tasks like downloading or reading a dataset can be much harder than you would expect. For instance, since ImageNet is roughly 300GB in size, you will need to make sure you have at least 600GB of free space to leave some room for download and decompression. But no worries, you can always borrow computers with huge disk drives from your favorite cloud provider. While you are at it, you should also request compute instances with multiple GPUs, Solid State Drives (SSDs), and a reasonable amount of CPUs and memory. If you want to use the exact configuration we used, take a look at the <a href="https://github.com/mlverse/imagenet" target="_blank" rel="noopener">mlverse/imagenet</a>
 repo, which contains a Docker image and configuration commands required to provision reasonable computing resources for this task. In summary, make sure you have access to sufficient compute resources.</p>
<p>Now that we have resources capable of working with ImageNet, we need to find a place to download ImageNet from. The easiest way is to use a variation of ImageNet used in the <a href="http://www.image-net.org/challenges/LSVRC/" target="_blank" rel="noopener">ImageNet Large Scale Visual Recognition Challenge (ILSVRC)</a>
, which contains a subset of about 250GB of data and can be easily downloaded from many <a href="https://kaggle.com" target="_blank" rel="noopener">Kaggle</a>
 competitions, like the <a href="https://www.kaggle.com/c/imagenet-object-localization-challenge" target="_blank" rel="noopener">ImageNet Object Localization Challenge</a>
.</p>
<p>If you&rsquo;ve read some of our previous posts, you might be already thinking of using the <a href="https://pins.rstudio.com" target="_blank" rel="noopener">pins</a>
 package, which you can use to cache, discover, and share resources from many services, including Kaggle. You can learn more about data retrieval from Kaggle in the <a href="http://pins.rstudio.com/articles/boards-kaggle.html" target="_blank" rel="noopener">Using Kaggle Boards</a>
 article; in the meantime, let&rsquo;s assume you are already familiar with this package.</p>
<p>All we need to do now is register the Kaggle board, retrieve ImageNet as a pin, and decompress this file. Warning: the following code requires you to stare at a progress bar for, potentially, over an hour.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">board_register</span><span class="p">(</span><span class="s">&#34;kaggle&#34;</span><span class="p">,</span> <span class="n">token</span> <span class="o">=</span> <span class="s">&#34;kaggle.json&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;c/imagenet-object-localization-challenge&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;kaggle&#34;</span><span class="p">)</span><span class="n">[1]</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">untar</span><span class="p">(</span><span class="n">exdir</span> <span class="o">=</span> <span class="s">&#34;/localssd/imagenet/&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>If we are going to be training this model over and over using multiple GPUs and even multiple compute instances, we want to make sure we don&rsquo;t waste too much time downloading ImageNet every single time.</p>
<p>The first improvement to consider is getting a faster hard drive. In our case, we locally mounted an array of SSDs into the <code>/localssd</code> path. We then used <code>/localssd</code> to extract ImageNet and configured R&rsquo;s temp path and pins cache to use the SSDs as well. Consult your cloud provider&rsquo;s documentation to configure SSDs, or take a look at <a href="https://github.com/mlverse/imagenet" target="_blank" rel="noopener">mlverse/imagenet</a>
.</p>
<p>Next, a well-known approach we can follow is to partition ImageNet into chunks that can be individually downloaded to perform distributed training later on.</p>
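<p>The chunking step itself can be sketched in base R. As a hypothetical example (the file names below are made up and stand in for real ImageNet paths), <code>split()</code> combined with <code>cut()</code> divides a vector of image paths into roughly equal groups that can later be downloaded independently:</p>
<pre><code class="language-r"># hypothetical file names standing in for ImageNet image paths
paths &lt;- sprintf(&#34;img_%04d.JPEG&#34;, 1:1000)
n_chunks &lt;- 16

# assign each path to one of n_chunks roughly equal, contiguous groups
chunks &lt;- split(paths, cut(seq_along(paths), n_chunks, labels = FALSE))

length(chunks)          # 16
range(lengths(chunks))  # 62 63
</code></pre>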
<p>In addition, it is faster to download ImageNet from a nearby location, ideally from a URL stored within the same data center where our cloud instance is located. For this, we can also use pins to register a board with our cloud provider and then re-upload each partition. Since ImageNet is already partitioned by category, we can easily split ImageNet into multiple zip files and re-upload them to our closest data center as follows. Make sure the storage bucket is created in the same region as your computing instances.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">board_register</span><span class="p">(</span><span class="s">&#34;&lt;board&gt;&#34;</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">bucket</span> <span class="o">=</span> <span class="s">&#34;r-imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">train_path</span> <span class="o">&lt;-</span> <span class="s">&#34;/localssd/imagenet/ILSVRC/Data/CLS-LOC/train/&#34;</span>
</span></span><span class="line"><span class="cl"><span class="kr">for</span> <span class="p">(</span><span class="n">path</span> <span class="kr">in</span> <span class="nf">dir</span><span class="p">(</span><span class="n">train_path</span><span class="p">,</span> <span class="n">full.names</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nf">dir</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">full.names</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">pin</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="nf">basename</span><span class="p">(</span><span class="n">path</span><span class="p">),</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">zip</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We can now retrieve a subset of ImageNet quite efficiently. If you are motivated to do so and have about one gigabyte to spare, feel free to follow along by executing this code. Notice that ImageNet contains <em>lots</em> of JPEG images for each WordNet category.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">board_register</span><span class="p">(</span><span class="s">&#34;https://storage.googleapis.com/r-imagenet/&#34;</span><span class="p">,</span> <span class="s">&#34;imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">categories</span> <span class="o">&lt;-</span> <span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;categories&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin_get</span><span class="p">(</span><span class="n">categories</span><span class="o">$</span><span class="n">id[1]</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">extract</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tibble</span><span class="o">::</span><span class="nf">as_tibble</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># A tibble: 1,300 x 1
   value                                                           
   &lt;chr&gt;                                                           
 1 /localssd/pins/storage/n01440764/n01440764_10026.JPEG
 2 /localssd/pins/storage/n01440764/n01440764_10027.JPEG
 3 /localssd/pins/storage/n01440764/n01440764_10029.JPEG
 4 /localssd/pins/storage/n01440764/n01440764_10040.JPEG
 5 /localssd/pins/storage/n01440764/n01440764_10042.JPEG
 6 /localssd/pins/storage/n01440764/n01440764_10043.JPEG
 7 /localssd/pins/storage/n01440764/n01440764_10048.JPEG
 8 /localssd/pins/storage/n01440764/n01440764_10066.JPEG
 9 /localssd/pins/storage/n01440764/n01440764_10074.JPEG
10 /localssd/pins/storage/n01440764/n01440764_1009.JPEG 
# … with 1,290 more rows
</code></pre>
<p>When doing distributed training over ImageNet, we can now let a single compute instance process a partition of ImageNet with ease. For instance, 1/16 of ImageNet can be retrieved and extracted in under a minute using parallel downloads with the <a href="https://callr.r-lib.org/" target="_blank" rel="noopener">callr</a>
 package:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">categories</span> <span class="o">&lt;-</span> <span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;categories&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">categories</span> <span class="o">&lt;-</span> <span class="n">categories</span><span class="o">$</span><span class="n">id[1</span><span class="o">:</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">categories</span><span class="o">$</span><span class="n">id</span><span class="p">)</span> <span class="o">/</span> <span class="m">16</span><span class="p">)</span><span class="n">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">procs</span> <span class="o">&lt;-</span> <span class="nf">lapply</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span><span class="n">cat</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">callr</span><span class="o">::</span><span class="nf">r_bg</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">cat</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">board_register</span><span class="p">(</span><span class="s">&#34;https://storage.googleapis.com/r-imagenet/&#34;</span><span class="p">,</span> <span class="s">&#34;imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="nf">pin_get</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">extract</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span> <span class="n">args</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">cat</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl"><span class="kr">while</span> <span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="nf">sapply</span><span class="p">(</span><span class="n">procs</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="n">p</span><span class="o">$</span><span class="nf">is_alive</span><span class="p">())))</span> <span class="nf">Sys.sleep</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We can wrap up this partition in a list containing a map of images and categories, which we will later use in our AlexNet model through <a href="https://tensorflow.rstudio.com/guide/tfdatasets/introduction/" target="_blank" rel="noopener">tfdatasets</a>
.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">image</span> <span class="o">=</span> <span class="nf">unlist</span><span class="p">(</span><span class="nf">lapply</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span><span class="n">cat</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">pin_get</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">download</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">})),</span>
</span></span><span class="line"><span class="cl">    <span class="n">category</span> <span class="o">=</span> <span class="nf">unlist</span><span class="p">(</span><span class="nf">lapply</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span><span class="n">cat</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">rep</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="nf">length</span><span class="p">(</span><span class="nf">pin_get</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">download</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">    <span class="p">})),</span>
</span></span><span class="line"><span class="cl">    <span class="n">categories</span> <span class="o">=</span> <span class="n">categories</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Great! We are halfway to training ImageNet. The next section will focus on distributed training using multiple GPUs.</p>
<h2 id="distributed-training">Distributed Training
</h2>
<p>Now that we have broken down ImageNet into manageable parts, we can forget for a second about the size of ImageNet and focus on training a deep learning model for this dataset. However, any model we choose is likely to require a GPU, even for a 1/16 subset of ImageNet. So make sure your GPUs are properly configured by running <code>is_gpu_available()</code>. If you need help getting a GPU configured, the <a href="https://www.youtube.com/watch?v=i5Bjm3jG_d8" target="_blank" rel="noopener">Using GPUs with TensorFlow and Docker</a>
 video can help you get up to speed.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tensorflow</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">tf</span><span class="o">$</span><span class="n">test</span><span class="o">$</span><span class="nf">is_gpu_available</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1] TRUE
</code></pre>
<p>We could now spend time deciding which deep learning model is best suited for ImageNet classification tasks. Instead, for this post, we will go back in time to the glory days of AlexNet and use the <a href="https://github.com/r-tensorflow/alexnet" target="_blank" rel="noopener">r-tensorflow/alexnet</a>
 repo. This repo contains a port of AlexNet to R, but please note that this port has not been tested and is not ready for any real use cases. In fact, we would appreciate PRs to improve it if anyone feels inclined to do so. Regardless, the focus of this post is on workflows and tools, not on achieving state-of-the-art image classification scores. So by all means, feel free to use more appropriate models.</p>
<p>Once we&rsquo;ve chosen a model, we will want to make sure that it trains properly on a subset of ImageNet:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">remotes</span><span class="o">::</span><span class="nf">install_github</span><span class="p">(</span><span class="s">&#34;r-tensorflow/alexnet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">alexnet</span><span class="o">::</span><span class="nf">alexnet_train</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>Epoch 1/2
 103/2269 [&gt;...............] - ETA: 5:52 - loss: 72306.4531 - accuracy: 0.9748
</code></pre>
<p>So far so good! However, this post is about enabling large-scale training across multiple GPUs, so we want to make sure we are using as many of them as we can. Unfortunately, running <code>nvidia-smi</code> will show that only one GPU is currently being used:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">nvidia-smi
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   48C    P0    89W / 149W |  10935MiB / 11441MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:06.0 Off |                    0 |
| N/A   74C    P0    74W / 149W |     71MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
</code></pre>
<p>In order to train across multiple GPUs, we need to define a distributed-processing strategy. If this is a new concept, it might be a good time to take a look at the <a href="https://tensorflow.rstudio.com/tutorials/advanced/distributed/distributed_training_with_keras/" target="_blank" rel="noopener">Distributed Training with Keras</a>
 tutorial and the <a href="https://www.tensorflow.org/guide/distributed_training" target="_blank" rel="noopener">distributed training with TensorFlow</a>
 docs. Or, if you allow us to oversimplify the process, all you have to do is define and compile your model under the right scope. A step-by-step explanation is available in the <a href="https://www.youtube.com/watch?v=DQyLTlD1IBc" target="_blank" rel="noopener">Distributed Deep Learning with TensorFlow and R</a>
 video. In this case, the <code>alexnet</code> model <a href="https://github.com/r-tensorflow/alexnet/blob/57546/R/alexnet_train.R#L92-L94" target="_blank" rel="noopener">already supports</a>
 a strategy parameter, so all we have to do is pass it along.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tensorflow</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">strategy</span> <span class="o">&lt;-</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">MirroredStrategy</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">cross_device_ops</span> <span class="o">=</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">ReductionToOneDevice</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">alexnet</span><span class="o">::</span><span class="nf">alexnet_train</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">,</span> <span class="n">strategy</span> <span class="o">=</span> <span class="n">strategy</span><span class="p">,</span> <span class="n">parallel</span> <span class="o">=</span> <span class="m">6</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Notice also <code>parallel = 6</code>, which configures <code>tfdatasets</code> to make use of multiple CPUs when loading data into our GPUs; see <a href="https://tensorflow.rstudio.com/guide/tfdatasets/introduction/#parallel-mapping" target="_blank" rel="noopener">Parallel Mapping</a>
 for details.</p>
<p>We can now re-run <code>nvidia-smi</code> to validate that all our GPUs are being used:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">nvidia-smi
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   49C    P0    94W / 149W |  10936MiB / 11441MiB |     53%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:06.0 Off |                    0 |
| N/A   76C    P0   114W / 149W |  10936MiB / 11441MiB |     26%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
</code></pre>
<p>The <code>MirroredStrategy</code> can help us scale up to about 8 GPUs per compute instance; however, we are likely to need 16 instances with 8 GPUs each to train ImageNet in a reasonable time (see Jeremy Howard&rsquo;s post on <a href="https://www.fast.ai/2018/08/10/fastai-diu-imagenet/" target="_blank" rel="noopener">Training Imagenet in 18 Minutes</a>
). So where do we go from here?</p>
<p>Welcome to <code>MultiWorkerMirroredStrategy</code>: This strategy can use not only multiple GPUs, but also multiple GPUs across multiple computers. To configure them, all we have to do is define a <code>TF_CONFIG</code> environment variable with the right addresses and run the exact same code in each compute instance.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tensorflow</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">partition</span> <span class="o">&lt;-</span> <span class="m">0</span>
</span></span><span class="line"><span class="cl"><span class="nf">Sys.setenv</span><span class="p">(</span><span class="n">TF_CONFIG</span> <span class="o">=</span> <span class="n">jsonlite</span><span class="o">::</span><span class="nf">toJSON</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">cluster</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">worker</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;10.100.10.100:10090&#34;</span><span class="p">,</span> <span class="s">&#34;10.100.10.101:10090&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">task</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">type</span> <span class="o">=</span> <span class="s">&#39;worker&#39;</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">partition</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">),</span> <span class="n">auto_unbox</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">strategy</span> <span class="o">&lt;-</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">MultiWorkerMirroredStrategy</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">cross_device_ops</span> <span class="o">=</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">ReductionToOneDevice</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">alexnet</span><span class="o">::</span><span class="nf">imagenet_partition</span><span class="p">(</span><span class="n">partition</span> <span class="o">=</span> <span class="n">partition</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">alexnet</span><span class="o">::</span><span class="nf">alexnet_train</span><span class="p">(</span><span class="n">strategy</span> <span class="o">=</span> <span class="n">strategy</span><span class="p">,</span> <span class="n">parallel</span> <span class="o">=</span> <span class="m">6</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Please note that <code>partition</code> must change for each compute instance to uniquely identify it, and that the IP addresses also need to be adjusted. In addition, <code>data</code> should point to a different partition of ImageNet, which we can retrieve with <code>pins</code>, although, for convenience, <code>alexnet</code> contains similar code under <code>alexnet::imagenet_partition()</code>. Other than that, the code that you need to run in each compute instance is exactly the same.</p>
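<p>Concretely, with the two example addresses from the snippet above, the <code>TF_CONFIG</code> value generated on each machine differs only in the task index:</p>
<pre><code># machine with partition = 0
{"cluster":{"worker":["10.100.10.100:10090","10.100.10.101:10090"]},"task":{"type":"worker","index":0}}

# machine with partition = 1
{"cluster":{"worker":["10.100.10.100:10090","10.100.10.101:10090"]},"task":{"type":"worker","index":1}}
</code></pre>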
<p>However, if we were to use 16 machines with 8 GPUs each to train ImageNet, it would be quite time-consuming and error-prone to manually run code in each R session. So instead, we should think of making use of cluster-computing frameworks, like Apache Spark with <a href="https://blog.rstudio.com/2020/01/29/sparklyr-1-1/#barrier-execution" target="_blank" rel="noopener">barrier execution</a>
. If you are new to Spark, there are many resources available at <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
. To learn just about running Spark and TensorFlow together, watch our <a href="https://www.youtube.com/watch?v=Zm20P3ADa14" target="_blank" rel="noopener">Deep Learning with Spark, TensorFlow and R</a>
 video.</p>
<p>Putting it all together, training ImageNet in R with TensorFlow and Spark looks as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="s">&#34;yarn|mesos|etc&#34;</span><span class="p">,</span> <span class="n">config</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="s">&#34;sparklyr.shell.num-executors&#34;</span> <span class="o">=</span> <span class="m">16</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">16</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">16</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">spark_apply</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">barrier</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nf">library</span><span class="p">(</span><span class="n">tensorflow</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">      <span class="nf">Sys.setenv</span><span class="p">(</span><span class="n">TF_CONFIG</span> <span class="o">=</span> <span class="n">jsonlite</span><span class="o">::</span><span class="nf">toJSON</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">cluster</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">          <span class="n">worker</span> <span class="o">=</span> <span class="nf">paste</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="nf">gsub</span><span class="p">(</span><span class="s">&#34;:[0-9]+$&#34;</span><span class="p">,</span> <span class="s">&#34;&#34;</span><span class="p">,</span> <span class="n">barrier</span><span class="o">$</span><span class="n">address</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="m">8000</span> <span class="o">+</span> <span class="nf">seq_along</span><span class="p">(</span><span class="n">barrier</span><span class="o">$</span><span class="n">address</span><span class="p">),</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">&#34;:&#34;</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">        <span class="n">task</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">type</span> <span class="o">=</span> <span class="s">&#39;worker&#39;</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">barrier</span><span class="o">$</span><span class="n">partition</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="p">),</span> <span class="n">auto_unbox</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">      
</span></span><span class="line"><span class="cl">      <span class="kr">if</span> <span class="p">(</span><span class="nf">is.null</span><span class="p">(</span><span class="nf">tf_version</span><span class="p">()))</span> <span class="nf">install_tensorflow</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">      
</span></span><span class="line"><span class="cl">      <span class="n">strategy</span> <span class="o">&lt;-</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">MultiWorkerMirroredStrategy</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">      <span class="n">result</span> <span class="o">&lt;-</span> <span class="n">alexnet</span><span class="o">::</span><span class="nf">imagenet_partition</span><span class="p">(</span><span class="n">partition</span> <span class="o">=</span> <span class="n">barrier</span><span class="o">$</span><span class="n">partition</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">        <span class="n">alexnet</span><span class="o">::</span><span class="nf">alexnet_train</span><span class="p">(</span><span class="n">strategy</span> <span class="o">=</span> <span class="n">strategy</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">=</span> <span class="m">10</span><span class="p">,</span> <span class="n">parallel</span> <span class="o">=</span> <span class="m">6</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      
</span></span><span class="line"><span class="cl">      <span class="n">result</span><span class="o">$</span><span class="n">metrics</span><span class="o">$</span><span class="n">accuracy</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span> <span class="n">barrier</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">accuracy</span> <span class="o">=</span> <span class="s">&#34;numeric&#34;</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We hope this post gave you a reasonable overview of what training on large datasets in R looks like &ndash; thanks for reading along!</p>
<p>Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. &ldquo;Imagenet: A Large-Scale Hierarchical Image Database.&rdquo; <em>2009 IEEE Conference on Computer Vision and Pattern Recognition</em>, 248&ndash;55.</p>
<p>Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. &ldquo;Imagenet Classification with Deep Convolutional Neural Networks.&rdquo; <em>Advances in Neural Information Processing Systems</em>, 1097&ndash;105.</p>
<p>Miller, George A. 1995. &ldquo;WordNet: A Lexical Database for English.&rdquo; <em>Communications of the ACM</em> 38 (11): 39&ndash;41.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/2020-08-24-training-imagenet-with-r/thumbnail.jpg" length="62582" type="image/jpeg" />
    </item>
    <item>
      <title>Parallelized sampling using exponential variates</title>
      <link>https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/</link>
      <pubDate>Wed, 29 Jul 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/</guid>
<dc:creator>Yitao Li</dc:creator><description><![CDATA[
<p>As part of our recent work to support weighted sampling of Spark data frames in <code>sparklyr</code>, we embarked on a journey searching for algorithms that can perform weighted sampling, especially sampling without replacement, in efficient and scalable ways within a distributed cluster-computing framework, such as Apache Spark.</p>
<p>In the interest of brevity, &ldquo;weighted sampling without replacement&rdquo; shall be shortened into <strong>SWoR</strong> for the remainder of this blog post.</p>
<p>In the following sections, we will explain and illustrate what <strong>SWoR</strong> means probability-wise, briefly outline some alternative solutions we have considered but were not completely satisfied with, and then deep-dive into exponential variates, a simple mathematical construct that made the ideal solution for this problem possible.</p>
<p>If you cannot wait to jump into action, there is also a <a href="#examples">section</a>
 in which we showcase example usages of <code>sdf_weighted_sample()</code> in <code>sparklyr</code>. In addition, you can examine the implementation detail of <code>sparklyr::sdf_weighted_sample()</code> in this <a href="https://github.com/sparklyr/sparklyr/pull/2606" target="_blank" rel="noopener">pull request</a>
.</p>
<h2 id="how-it-all-started">How it all started
</h2>
<p>Our journey started from a <a href="https://github.com/sparklyr/sparklyr/issues/2592" target="_blank" rel="noopener">Github issue</a>
 inquiring about the possibility of supporting the equivalent of <code>dplyr::sample_frac(..., weight = &lt;weight_column&gt;)</code> for Spark data frames in <code>sparklyr</code>. For example,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_frac</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">gear</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat X1-9         27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
</code></pre>
<p>will randomly select one-fourth of all rows from an R data frame named &ldquo;mtcars&rdquo; without replacement, using <code>mtcars$gear</code> as weights. We were unable to find any function implementing the weighted versions of <code>dplyr::sample_frac</code> among <a href="https://spark.apache.org/docs/3.0.0/api/sql/index.html" target="_blank" rel="noopener">Spark SQL built-in functions</a>
 in Spark 3.0 or in earlier versions, which means a future version of <code>sparklyr</code> will need to run its own weighted sampling algorithm to support such use cases.</p>
<h2 id="what-exactly-is-swor">What exactly is <strong>SWoR</strong>
</h2>
<p>The purpose of this section is to mathematically describe the probability distribution generated by <strong>SWoR</strong> in terms of $w_1, \dotsc, w_N$, so that readers can clearly see that the exponential-variate based algorithm presented in a subsequent section in fact samples from precisely the same probability distribution. Readers already having a crystal-clear mental picture of what <strong>SWoR</strong> entails should probably skip most of this section. The key take-away here is that, given $N$ rows $r_1, \dotsc, r_N$ with weights $w_1, \dotsc, w_N$ and a desired sample size $n$, the probability of <strong>SWoR</strong> selecting $(r_1, \dotsc, r_n)$ is $\prod\limits_{j = 1}^{n} \left( {w_j} \middle/ {\sum\limits_{k = j}^{N}{w_k}} \right)$.</p>
<p><strong>SWoR</strong> is conceptually equivalent to an $n$-step process of selecting 1 out of $(N - j + 1)$ remaining rows in the $j$-th step for $j \in \{1, \dotsc, n\}$, with each remaining row&rsquo;s likelihood of getting selected being linearly proportional to its weight in any of the steps, i.e.,</p>
<pre><code>samples := {}
population := {r[1], ..., r[N]}

for j = 1 to n
  select r[x] from population with probability
    (w[x] / TotalWeight(population))
  samples := samples + {r[x]}
  population := population - {r[x]}
</code></pre>
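<p>The sequential process above can be sketched directly. Here is a minimal Python illustration (a hypothetical helper for exposition, not part of <code>sparklyr</code>) that performs one weighted draw per step and removes each selected row from the remaining population:</p>
<pre><code class="language-python">import random

def swor(weights, n, rng=random):
    # n-step process: each draw is proportional to the weights of
    # the rows still remaining, i.e., sampling without replacement
    population = list(range(len(weights)))
    remaining = list(weights)
    samples = []
    for _ in range(n):
        # one weighted draw from the rows not selected yet
        pick = rng.choices(population, weights=remaining, k=1)[0]
        samples.append(pick)
        j = population.index(pick)
        population.pop(j)
        remaining.pop(j)
    return samples

swor([1, 2, 3, 4, 5], 3)  # three distinct indices; heavier rows tend to come first
</code></pre>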
<p>Notice the outcome of a <strong>SWoR</strong> process is in fact order-significant, which is why in this post it will always be represented as an ordered tuple of elements.</p>
<p>Intuitively, <strong>SWoR</strong> is analogous to throwing darts at a bunch of tiles. For example, let&rsquo;s say the size of our sample space is 5:</p>
<ul>
<li>
<p>Imagine $r_1, r_2, \dotsc, r_5$ as 5 rectangular tiles laid out contiguously on a wall with widths $w_1, w_2, \dotsc, w_5$, with $r_1$ covering $[0, w_1)$, $r_2$ covering $[w_1, w_1 + w_2)$, &hellip;, and $r_5$ covering $\left[\sum\limits_{j = 1}^{4} w_j, \sum\limits_{j = 1}^{5} w_j\right)$</p>
</li>
<li>
<p>Equate drawing a random sample in each step to throwing a dart uniformly randomly within the interval covered by all tiles that are not hit yet</p>
</li>
<li>
<p>After a tile is hit, it gets taken out and remaining tiles are re-arranged so that they continue to cover a contiguous interval without overlapping</p>
</li>
</ul>
<p>If our sample size is 3, then we shall ask ourselves: what is the probability of the darts hitting $(r_1, r_2, r_3)$ in that order?</p>
<p>In step $j = 1$, the dart will hit $r_1$ with probability $\left. w_1 \middle/ \left(\sum\limits_{k = 1}^{N}w_k\right) \right.$</p>
<p><img src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/d1.jpg" style="width:50.0%;height:50.0%" alt="step 1" /> .</p>
<p>After deleting $r_1$ from the sample space after it&rsquo;s hit, step $j = 2$ will look like this:</p>
<p><img src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/d2.jpg" style="width:48.0%;height:45.0%" alt="step 2" /> ,</p>
<p>and the probability of the dart hitting $r_2$ in step 2 is $\left. w_2 \middle/ \left(\sum\limits_{k = 2}^{N}w_k\right) \right.$ .</p>
<p>Finally, moving on to step $j = 3$, we have:</p>
<p><img src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/d3.jpg" style="width:40.0%;height:30.0%" alt="step 3" /> ,</p>
<p>with the probability of the dart hitting $r_3$ being $\left. w_3 \middle/ \left(\sum\limits_{k = 3}^{N}w_k\right) \right.$.</p>
<p>So, combining all of the above, the overall probability of selecting $(r_1, r_2, r_3)$ is $\prod\limits_{j = 1}^{3} \left( {w_j} \middle/ {\sum\limits_{k = j}^{N}{w_k}} \right)$.</p>
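<p>As a quick numeric check of this formula, take the 5 tiles above with hypothetical weights $w = (1, 2, 3, 4, 5)$; the probability of drawing $(r_1, r_2, r_3)$ in that order works out to $(1/15) \cdot (2/14) \cdot (3/12)$:</p>
<pre><code class="language-python"># probability of SWoR selecting (r1, r2, r3) with weights 1..5
w = [1, 2, 3, 4, 5]
p = 1.0
total = sum(w)            # 15
for j in range(3):
    p = p * w[j] / total  # w_j / sum of w_k for k = j..N
    total = total - w[j]  # row j leaves the population
print(p)                  # 6/2520, about 0.00238
</code></pre>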
<h2 id="naive-approaches-for-implementing-swor">Naive approaches for implementing <strong>SWoR</strong>
</h2>
<p>This section outlines some possible approaches that were briefly under consideration. Because none of these approaches scales well to a large number of rows or a non-trivial number of partitions in a Spark data frame, we decided to avoid all of them in <code>sparklyr</code>.</p>
<h3 id="a-tree-base-approach">A tree-based approach
</h3>
<p>One possible way to accomplish <strong>SWoR</strong> is to have a mutable data structure keeping track of the sample space at each step.</p>
<p>Continuing with the dart-throwing analogy from the previous section, let us say initially, none of the tiles has been taken out yet, and a dart has landed at some point $x \in \left[0, \sum\limits_{k = 1}^{N} w_k\right)$. Which tile did it hit? This can be answered efficiently if we have a binary tree, pictured as the following (or in general, some $b$-ary tree for integer $b \ge 2$)</p>
<figure>
<img src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/tree.jpg" style="width:60.0%;height:40.0%" alt="." />
<figcaption aria-hidden="true">.</figcaption>
</figure>
<p>To find the tile that was hit given the dart&rsquo;s position $x$, we simply need to traverse down the tree, going through the box containing $x$ in each level, incurring an $O(\log(N))$ cost in time complexity for each sample. To take a tile out of the picture, we update the width of the tile to $0$ and propagate this change upwards from the leaf level to the root of the tree, again incurring an $O(\log(N))$ cost in time complexity, making the overall time complexity of selecting $n$ samples $O(n \cdot \log(N))$. This is not great for large data sets, and it is also not parallelizable across multiple partitions of a Spark data frame.</p>
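<p>For the curious, the tile-lookup tree described above can be sketched as a Fenwick (binary indexed) tree. The following illustrative Python version (again, not something <code>sparklyr</code> uses) supports $O(\log(N))$ dart lookups and $O(\log(N))$ tile removals:</p>
<pre><code class="language-python">import random

class FenwickSampler:
    """Binary indexed tree over tile widths."""

    def __init__(self, weights):
        self.n = len(weights)
        self.w = list(weights)
        self.total = float(sum(weights))
        self.tree = [0.0] * (self.n + 1)
        for i, wt in enumerate(weights, start=1):
            self._add(i, wt)

    def _add(self, i, delta):  # O(log N) width update
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def find(self, x):  # O(log N): which tile does a dart at x hit?
        pos, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= x:
                x -= self.tree[nxt]
                pos = nxt
            bit //= 2
        # clamp guards the measure-zero case x == total width
        return min(pos, self.n - 1)

    def sample(self, n, rng=random):
        out = []
        for _ in range(n):
            i = self.find(rng.uniform(0.0, self.total))
            out.append(i)
            self._add(i + 1, -self.w[i])  # take the tile out
            self.total -= self.w[i]
            self.w[i] = 0.0
        return out
</code></pre>
<p>Each of the $n$ draws costs two $O(\log(N))$ tree walks, matching the $O(n \cdot \log(N))$ total mentioned above &ndash; and the shared mutable tree is exactly what makes this approach hard to parallelize.</p>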
<h3 id="rejection-sampling">Rejection sampling
</h3>
<p>Another possible approach is to use rejection sampling. In terms of the previously mentioned dart-throwing analogy, that means not removing any tile that is hit, hence avoiding the performance cost of keeping the sample space up-to-date, but then having to re-throw the dart in each of the subsequent rounds until it lands on a tile that was not hit previously. This approach, just like the previous one, would not be performant, and would not be parallelizable across multiple partitions of a Spark data frame either.</p>
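<p>Rejection sampling is even simpler to sketch. In this illustrative Python snippet (again, not <code>sparklyr</code> code), darts are always thrown at the original, unmodified tile layout, and any tile that was hit before is simply re-thrown:</p>
<pre><code class="language-python">import random

def swor_rejection(weights, n, rng=random):
    # darts always target the full layout; repeats are re-thrown
    chosen, hit = [], set()
    idx = list(range(len(weights)))
    while len(chosen) != n:
        i = rng.choices(idx, weights=weights, k=1)[0]
        if i not in hit:
            hit.add(i)
            chosen.append(i)
    return chosen
</code></pre>
<p>Conditioned on missing the already-hit tiles, each accepted dart lands on a remaining tile with probability proportional to its weight, so the output distribution matches <strong>SWoR</strong> &ndash; but once the selected tiles cover most of the total width, the expected number of re-throws blows up.</p>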
<h1 id="exponential-variates-to-the-rescue">Exponential variates to the rescue
</h1>
<p>A solution that has proven to be much better than either of the naive approaches turns out to be a numerically stable variant of the algorithm described in &ldquo;Weighted Random Sampling&rdquo; (Efraimidis and Spirakis 2016) by Pavlos S. Efraimidis and Paul G. Spirakis.</p>
<p>A version of this sampling algorithm implemented by <code>sparklyr</code> does the following to sample $n$ out of $N$ rows from a Spark data frame $X$:</p>
<ul>
<li>For each row $r_j \in X$, draw a number $u_j$ independently and uniformly at random from $(0, 1)$ and compute the key of $r_j$ as $k_j = \ln(u_j) / w_j$, where $w_j$ is the weight of $r_j$. Perform this calculation in parallel across all partitions of $X$.</li>
<li>Select the $n$ rows with the largest keys and return them as the result. This step is also mostly parallelizable: for each partition of $X$, one can select up to $n$ rows with the largest keys within that partition as candidates; after selecting candidates from all partitions in parallel, simply take the top $n$ rows among all candidates and return them as the $n$ chosen samples.</li>
</ul>
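<p>Stripped of all Spark machinery, the two steps above amount to just a few lines. The sketch below is a hypothetical single-machine illustration in Python, not the actual <code>sparklyr</code> implementation (which is written in Scala and operates on partitions of an RDD):</p>

```python
import math
import random

def weighted_swor(rows, weights, n, seed=0):
    """Exponential-variate SWoR: key each row with ln(u)/w and keep
    the n rows with the largest keys."""
    rng = random.Random(seed)
    keyed = []
    for row, w in zip(rows, weights):
        u = 1.0 - rng.random()       # u uniform in (0, 1], so ln(u) is finite
        keyed.append((math.log(u) / w, row))
    keyed.sort(reverse=True)         # descending by key
    return [row for _, row in keyed[:n]]
```

Rows with larger weights tend to receive larger (less negative) keys, so they are more likely to survive the top-$n$ cut.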
<p>There are at least 4 reasons why this solution is highly appealing and was chosen to be implemented in <code>sparklyr</code>:</p>
<ul>
<li>It is a one-pass algorithm (i.e., it only needs to iterate through all rows of the data frame exactly once).</li>
<li>Its computational overhead is quite low (as selecting top $n$ rows at any stage only requires a bounded priority queue of max size $n$, which costs $O(\log(n))$ per update in time complexity).</li>
<li>More importantly, most of its required computations can be performed in parallel. In fact, the only non-parallelizable step is the very last stage of combining top candidates from all partitions and choosing the top $n$ rows among those candidates. So, it fits very well into the world of Spark / MapReduce, and has drastically better horizontal scalability compared to the naive approaches.</li>
<li>Bonus: It is also suitable for weighted reservoir sampling (i.e., can sample $n$ out of a possibly infinite stream of rows according to their weights such that at any moment the $n$ samples will be a weighted representation of all rows that have been processed so far).</li>
</ul>
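<p>The second and fourth points can be made concrete with a small sketch: a bounded min-heap of size $n$ turns the same keys into a one-pass weighted <em>reservoir</em> sampler over a stream of unknown length. This is a hypothetical Python illustration, not <code>sparklyr</code> code:</p>

```python
import heapq
import math
import random

def weighted_reservoir_sample(stream, n, seed=0):
    """One-pass weighted reservoir sampling: keep a min-heap holding
    the n largest keys ln(u)/w seen so far in the stream."""
    rng = random.Random(seed)
    heap = []  # (key, item) pairs; heap[0] holds the smallest kept key
    for item, w in stream:
        key = math.log(1.0 - rng.random()) / w   # u uniform in (0, 1]
        if len(heap) < n:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            # this item beats the weakest current candidate
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Each heap update costs $O(\log(n))$, and at any moment the heap's contents are a weighted sample of everything processed so far.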
<h2 id="why-does-this-algorithm-work">Why does this algorithm work?
</h2>
<p>As an interesting aside, some readers have probably seen this technique presented in a slightly different form under another name. It is in fact equivalent to a generalized version of the <a href="https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions" target="_blank" rel="noopener">Gumbel-max trick</a>
 which is commonly referred to as the Gumbel-top-k trick. Readers familiar with properties of the Gumbel distribution will no doubt have an easy time convincing themselves that the algorithm above works as expected.</p>
<p>In this section, we will also present a proof of correctness for this algorithm based on elementary properties of <a href="https://en.wikipedia.org/wiki/Probability_density_function" target="_blank" rel="noopener">probability density function</a>
 (shortened as PDF from now on), <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function" target="_blank" rel="noopener">cumulative distribution function</a>
 (shortened as CDF from now on), and basic calculus.</p>
<p>First of all, to make sense of all the $\ln(u_j) / w_j$ calculations in this algorithm, one has to understand <a href="https://en.wikipedia.org/wiki/Inverse_transform_sampling" target="_blank" rel="noopener">inverse transform sampling</a>
. For each $j \in \{1, \dotsc, N\}$, consider the probability distribution defined on $(-\infty, 0)$ with CDF $F_j(x) = e^{w_j \cdot x}$. To pluck a value $y$ out of this distribution, we first sample a value $u_j$ uniformly at random from $(0, 1)$ that determines the percentile of $y$ (i.e., how our $y$ value ranks relative to all possible $y$ values, a.k.a. the &ldquo;overall population&rdquo;, from this distribution), and then apply $F_j^{-1}$ to $u_j$ to find $y$, so, $y = F_j^{-1}(u_j) = \ln(u_j) / w_j$.</p>
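<p>As a quick sanity check of this step (an illustration in Python rather than anything shipped with <code>sparklyr</code>), one can draw many values of $y = \ln(u) / w$ and verify that the empirical CDF agrees with $F(x) = e^{w \cdot x}$:</p>

```python
import math
import random

def inverse_transform_draw(w, rng):
    """Draw y from the distribution on (-inf, 0) with CDF F(x) = exp(w*x),
    via inverse transform sampling: y = F^{-1}(u) = ln(u) / w."""
    u = 1.0 - rng.random()   # u uniform in (0, 1], so ln(u) is finite
    return math.log(u) / w
```

With, say, $w = 2$ and $10^5$ draws, the fraction of draws $\le x$ closely tracks $e^{2x}$ for any fixed $x < 0$.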
<p>Secondly, after defining all the required CDF functions $F_j(x) = e^{w_j \cdot x}$ for $j \in \{1, \dotsc, N\}$, we can also easily derive their corresponding PDF functions $f_j$: </p>
$$f_j(x) = \frac{d F_j(x)}{dx} = w_j e^{w_j \cdot x}$$<p>.</p>
<p>Finally, with a clear understanding of the family of probability distributions involved, one can prove the probability of this algorithm selecting a given sequence of rows $(r_1, \dotsc, r_n)$ is equal to $\prod\limits_{j = 1}^{n} \left( {w_j} \middle/ {\sum\limits_{k = j}^{N}{w_k}} \right)$, identical to the probability previously mentioned in the <a href="#swor">&ldquo;What exactly is <strong>SWoR</strong>&rdquo;</a>
 section, which implies the possible outcomes of this algorithm will follow exactly the same probability distribution as that of a $n$-step <strong>SWoR</strong>.</p>
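<p>Readers who would rather see numerical evidence first can also check the claim empirically. The following sketch (illustrative Python, with all helper names made up for this post) repeatedly runs the key-based algorithm on a tiny sample space and compares the observed frequency of each ordered outcome against the product formula above:</p>

```python
import math
import random
from collections import Counter

def swor_via_keys(weights, n, rng):
    """Order row indices by descending key ln(u)/w and keep the first n."""
    keyed = sorted(
        ((math.log(1.0 - rng.random()) / w, i) for i, w in enumerate(weights)),
        reverse=True,
    )
    return tuple(i for _, i in keyed[:n])

def sequential_prob(weights, seq):
    """Probability of drawing `seq`, in order, in a sequential SWoR process."""
    remaining, p = sum(weights), 1.0
    for i in seq:
        p *= weights[i] / remaining
        remaining -= weights[i]
    return p

rng = random.Random(0)
weights = [1.0, 2.0, 3.0]
trials = 200000
counts = Counter(swor_via_keys(weights, 2, rng) for _ in range(trials))
```

For every ordered pair, the observed frequency lands within Monte Carlo noise of the closed-form probability; for example, the pair $(2, 1)$ should appear with probability $\frac{3}{6} \cdot \frac{2}{3} = \frac{1}{3}$.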
<p>So as not to deprive our dear readers of the pleasure of completing this proof themselves, we have decided not to inline the rest of the proof (which boils down to a calculus exercise) within this blog post, but it is available in <a href="proof.pdf">this file</a>
.</p>
<h1 id="weighted-sampling-with-replacement">Weighted sampling with replacement
</h1>
<p>While all previous sections focused entirely on weighted sampling without replacement, this section will briefly discuss how the exponential-variate approach can also benefit the weighted-sampling-with-replacement use case (which will be shortened as <code>SWR</code> from now on).</p>
<p>Although <code>SWR</code> with sample size $n$ can be carried out by $n$ independent processes each selecting $1$ sample, parallelizing a <code>SWR</code> workload across all partitions of a Spark data frame (let&rsquo;s call it $X$) will still be more performant if the number of partitions is much larger than $n$ and more than $n$ executors are available in a Spark cluster.</p>
<p>An initial solution we had in mind was to run <code>SWR</code> with sample size $n$ in parallel on each partition of $X$, and then re-sample the results based on relative total weights of each partition. Despite sounding deceptively simple when summarized in words, implementing such a solution in practice would be a moderately complicated task. First, one has to apply the <a href="https://en.wikipedia.org/wiki/Alias_method" target="_blank" rel="noopener">alias method</a>
 or similar in order to perform weighted sampling efficiently on each partition of $X$, and on top of that, implementing the re-sampling logic across all partitions correctly and verifying the correctness of such procedure will also require considerable effort.</p>
<p>In comparison, with the help of exponential variates, a <code>SWR</code> carried out as $n$ independent <strong>SWoR</strong> processes each selecting $1$ sample is much simpler to implement, while still being comparable to our initial solution in terms of efficiency and scalability. An example implementation of it (which takes fewer than 60 lines of Scala) is presented in <a href="samplingutils.scala">samplingutils.scala</a>
.</p>
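<p>In miniature, that simpler design looks something like the following (a hypothetical Python illustration; the linked Scala file is the real example implementation):</p>

```python
import math
import random

def swr_via_independent_swor(rows, weights, n, seed=0):
    """SWR realized as n independent SWoR rounds, each selecting 1 sample:
    every round draws fresh keys ln(u)/w, and the row holding the
    largest key wins that round."""
    rng = random.Random(seed)
    picks = []
    for _ in range(n):
        keys = [math.log(1.0 - rng.random()) / w for w in weights]
        winner = max(range(len(rows)), key=keys.__getitem__)
        picks.append(rows[winner])
    return picks
```

Because every round draws fresh keys, the same row may be picked repeatedly, which is exactly the with-replacement semantics, and each round's argmax parallelizes across partitions just like the top-$n$ step of <strong>SWoR</strong>.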
<h1 id="visualization">Visualization
</h1>
<p>How do we know <code>sparklyr::sdf_weighted_sample()</code> is working as expected? While the rigorous answer to this question is presented in full in the <a href="#testing">testing</a>
 section, we thought it would also be useful to first show some histograms that will help readers visualize what that test plan is. Therefore in this section, we will do the following:</p>
<ul>
<li>Run <code>dplyr::slice_sample()</code> multiple times on a small sample space, with each run using a different PRNG seed (the sample size is reduced to $2$ here so that there will be fewer than 100 possible outcomes and visualization will be easier)</li>
<li>Do the same for <code>sdf_weighted_sample()</code></li>
<li>Use histograms to visualize the distribution of sampling outcomes</li>
</ul>
<p>Throughout this section, we will sample $2$ elements out of $\{0, \dotsc, 7\}$ without replacement according to some weights, so, the first step is to set up the following in R:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># `octs` will be our sample space</span>
</span></span><span class="line"><span class="cl"><span class="n">octs</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">x</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">7</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">weight</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">8</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># `octs_sdf` will be our sample space copied into a Spark data frame</span>
</span></span><span class="line"><span class="cl"><span class="n">octs_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">octs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sample_size</span> <span class="o">&lt;-</span> <span class="m">2</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In order to tally up and visualize the sampling outcomes efficiently, we shall map each possible outcome to an octal number (e.g., <code>(6, 7)</code> gets mapped to $6 \cdot 8^0 + 7 \cdot 8^1$) using a helper function <code>to_oct</code> in R:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">to_oct</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="nf">sum</span><span class="p">(</span><span class="m">8</span> <span class="n">^</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">sample_size</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">sample</span><span class="o">$</span><span class="n">x</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We also need to tally up sampling outcomes from <code>dplyr::slice_sample()</code> and <code>sparklyr::sdf_weighted_sample()</code> in 2 separate arrays:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">max_possible_outcome</span> <span class="o">&lt;-</span> <span class="nf">to_oct</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">8</span> <span class="o">-</span> <span class="n">sample_size</span><span class="p">,</span> <span class="m">7</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sdf_weighted_sample_outcomes</span> <span class="o">&lt;-</span> <span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">max_possible_outcome</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dplyr_slice_sample_outcomes</span> <span class="o">&lt;-</span> <span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">max_possible_outcome</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Finally, we can run both <code>dplyr::slice_sample()</code> and <code>sparklyr::sdf_weighted_sample()</code> for an arbitrary number of iterations and compare the tallied outcomes from both:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">num_sampling_iters</span> <span class="o">&lt;-</span> <span class="m">1000</span>  <span class="c1"># actually we will vary this value from 500 to 5000</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kr">for</span> <span class="p">(</span><span class="n">x</span> <span class="kr">in</span> <span class="nf">seq</span><span class="p">(</span><span class="n">num_sampling_iters</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="n">seed</span> <span class="o">&lt;-</span> <span class="n">x</span> <span class="o">*</span> <span class="m">97</span>
</span></span><span class="line"><span class="cl">  <span class="n">sample1</span> <span class="o">&lt;-</span> <span class="n">octs_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">sdf_weighted_sample</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">k</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">,</span> <span class="n">weight_col</span> <span class="o">=</span> <span class="s">&#34;weight&#34;</span><span class="p">,</span> <span class="n">replacement</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">,</span> <span class="n">seed</span> <span class="o">=</span> <span class="n">seed</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">collect</span><span class="p">()</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">to_oct</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">  <span class="n">sdf_weighted_sample_outcomes[[sample1]]</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">      <span class="n">sdf_weighted_sample_outcomes[[sample1]]</span> <span class="o">+</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="nf">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="c1"># set random seed for dplyr::slice_sample()</span>
</span></span><span class="line"><span class="cl">  <span class="n">sample2</span> <span class="o">&lt;-</span> <span class="n">octs</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="n">dplyr</span><span class="o">::</span><span class="nf">slice_sample</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">n</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">,</span> <span class="n">weight_by</span> <span class="o">=</span> <span class="n">weight</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">to_oct</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr_slice_sample_outcomes[[sample2]]</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">      <span class="n">dplyr_slice_sample_outcomes[[sample2]]</span> <span class="o">+</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>After all the hard work above, we can now enjoy plotting the sampling outcomes from <code>dplyr::slice_sample()</code> and those from <code>sparklyr::sdf_weighted_sample()</code> after 500, 1000, and 5000 iterations and observe how the distributions of both start converging after a large number of iterations.</p>
<p>Sampling outcomes after 500, 1000, and 5000 iterations, shown in 3 histograms:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/viz.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>

(you will most probably need to <a href="images/viz.png">view it in a separate tab</a>
 to see everything clearly)</p>
<h1 id="testing">Testing
</h1>
<p>While parallelized sampling based on exponential variates looks fantastic on paper, there are still plenty of potential pitfalls when it comes to translating such an idea into code, and as usual, a good testing plan is necessary to ensure implementation correctness.</p>
<p>For instance, numerical instability issues with floating-point numbers would arise if $\ln(u_j) / w_j$ were replaced by $u_j ^ {1 / w_j}$ in the aforementioned computations.</p>
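<p>This is easy to reproduce in any language with IEEE 754 doubles; here is a tiny Python illustration. For a large weight, keys computed as $u^{1/w}$ collapse into indistinguishable ties near $1.0$, while $\ln(u)/w$ keeps them distinct and correctly ordered:</p>

```python
import math

w = 1e18                 # an extreme weight, to make the effect obvious
u1, u2 = 0.3, 0.7

# Keys computed as u ** (1/w): both round to the same double near 1.0,
# so the two rows can no longer be ranked against each other.
naive1, naive2 = u1 ** (1.0 / w), u2 ** (1.0 / w)

# Keys computed as ln(u) / w remain distinct, with the correct ordering.
stable1, stable2 = math.log(u1) / w, math.log(u2) / w
```

The logarithmic form spreads the keys out over $(-\infty, 0]$ instead of compressing them into a sliver below $1.0$.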
<p>Another more subtle source of error is the usage of PRNG seeds. For example, consider the following:</p>
<pre><code>  def sampleWithoutReplacement(
    rdd: RDD[Row],
    weightColumn: String,
    sampleSize: Int,
    seed: Long
  ): RDD[Row] = {
    val sc = rdd.context
    if (0 == sampleSize) {
      sc.emptyRDD
    } else {
      val random = new Random(seed)
      val mapRDDs = rdd.mapPartitions { iter =&gt;
        for (row &lt;- iter) {
          val weight = row.getAs[Double](weightColumn)
          val key = scala.math.log(random.nextDouble) / weight
          &lt;and then make sampling decision for `row` based on its `key`,
           as described in the previous section&gt;
        }
        ...
      }
      ...
    }
  }
</code></pre>
<p>Even though it might look OK at first glance, the <code>rdd.mapPartitions(...)</code> call above causes the same sequence of pseudorandom numbers to be applied to multiple partitions of the input Spark data frame, which introduces undesired bias (i.e., sampling outcomes from one partition will have a non-trivial correlation with those from another partition, when such correlation should be negligible in a correct implementation).</p>
<p>The code snippet below is an example implementation in which each partition of the input Spark data frame is sampled using a different sequence of pseudorandom numbers:</p>
<pre><code>  def sampleWithoutReplacement(
    rdd: RDD[Row],
    weightColumn: String,
    sampleSize: Int,
    seed: Long
  ): RDD[Row] = {
    val sc = rdd.context
    if (0 == sampleSize) {
      sc.emptyRDD
    } else {
      val mapRDDs = rdd.mapPartitionsWithIndex { (index, iter) =&gt;
        val random = new Random(seed + index)

        for (row &lt;- iter) {
          val weight = row.getAs[Double](weightColumn)
          val key = scala.math.log(random.nextDouble) / weight
          &lt;and then make sampling decision for `row` based on its `key`,
           as described in the previous section&gt;
        }

        ...
      }
    ...
  }
}
</code></pre>
<p>An example test case in which a two-sided Kolmogorov-Smirnov test is used to compare distribution of sampling outcomes from <code>dplyr::slice_sample()</code> with that from <code>sparklyr::sdf_weighted_sample()</code> is shown in <a href="test_plan">this file</a>
. Such tests have proven to be effective in surfacing non-obvious implementation errors such as the ones mentioned above.</p>
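<p>For readers unfamiliar with it, the two-sample Kolmogorov&ndash;Smirnov statistic is simply the largest vertical gap between two empirical CDFs. A small self-contained Python illustration (not the actual test code from <code>sparklyr</code>) follows:</p>

```python
import random

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the empirical CDFs of xs and ys."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    # merge the two sorted samples, tracking both empirical CDFs
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d
```

Two sets of sampling outcomes drawn from the same distribution should yield a small statistic, while outcomes from a biased implementation (such as the shared-PRNG one above) would push it well past the rejection threshold.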
<h1 id="example-usages">Example Usages
</h1>
<p>Please note the <code>sparklyr::sdf_weighted_sample()</code> functionality is not included in any official release of <code>sparklyr</code> yet. We are aiming to ship it as part of <code>sparklyr</code> 1.4 in about 2 to 3 months from now.</p>
<p>In the meantime, you can try it out with the following steps:</p>
<p>First, make sure <code>remotes</code> is installed, and then run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">remotes</span><span class="o">::</span><span class="nf">install_github</span><span class="p">(</span><span class="s">&#34;sparklyr/sparklyr&#34;</span><span class="p">,</span> <span class="n">ref</span> <span class="o">=</span> <span class="s">&#34;master&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>to install <code>sparklyr</code> from source.</p>
<p>Next, create a test data frame with a numeric weight column containing a non-negative weight for each row, and then copy it to Spark (see the code snippet below for an example):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">example_df</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">x</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">100</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">weight</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">50</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">25</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">4</span><span class="p">,</span> <span class="m">10</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">8</span><span class="p">,</span> <span class="m">10</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">16</span><span class="p">,</span> <span class="m">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">example_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">example_df</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">5</span><span class="p">,</span> <span class="n">overwrite</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Finally, run <code>sparklyr::sdf_weighted_sample()</code> on <code>example_sdf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sample_size</span> <span class="o">&lt;-</span> <span class="m">5</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_without_replacement</span> <span class="o">&lt;-</span> <span class="n">example_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_weighted_sample</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">weight_col</span> <span class="o">=</span> <span class="s">&#34;weight&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">k</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">replacement</span> <span class="o">=</span> <span class="kc">FALSE</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_without_replacement</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # Source: spark&lt;?&gt; [?? x 2]
##       x weight
##   &lt;int&gt;  &lt;dbl&gt;
## 1    48      1
## 2    22      1
## 3    78      4
## 4    56      2
## 5   100     16
</code></pre>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">samples_with_replacement</span> <span class="o">&lt;-</span> <span class="n">example_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_weighted_sample</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">weight_col</span> <span class="o">=</span> <span class="s">&#34;weight&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">k</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">replacement</span> <span class="o">=</span> <span class="kc">TRUE</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_with_replacement</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # Source: spark&lt;?&gt; [?? x 2]
##       x weight
##   &lt;int&gt;  &lt;dbl&gt;
## 1    86      8
## 2    97     16
## 3    91      8
## 4   100     16
## 5    65      2
</code></pre>
<h1 id="acknowledgement">Acknowledgement
</h1>
<p>First and foremost, the author wishes to thank <a href="https://github.com/ajing" target="_blank" rel="noopener">@ajing</a>
 for reporting that weighted sampling use cases were not yet properly supported in <code>sparklyr</code> 1.3 and suggesting that they should be part of a future version of <code>sparklyr</code> in this <a href="https://github.com/sparklyr/sparklyr/issues/2592" target="_blank" rel="noopener">GitHub issue</a>
.</p>
<p>Special thanks also go to Javier (<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>) for reviewing the <a href="https://github.com/sparklyr/sparklyr/pull/2606" target="_blank" rel="noopener">implementation</a> of all exponential-variate-based sampling algorithms in <code>sparklyr</code>, and to Mara (<a href="https://github.com/batpigandme" target="_blank" rel="noopener">@batpigandme</a>), Sigrid (<a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>), and Javier (<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>) for their valuable editorial suggestions.</p>
<p>We hope you have enjoyed reading this blog post! If you wish to learn more about <code>sparklyr</code>, we recommend visiting <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>, and some of the previous release posts, such as <a href="https://blog.rstudio.com/2020/07/16/sparklyr-1-3/" target="_blank" rel="noopener">sparklyr 1.3</a> and <a href="https://posit-open-source.netlify.app/blog/ai/2020-04-21-sparklyr-1.2.0-released/">sparklyr 1.2</a>. Your contributions to <code>sparklyr</code> are also more than welcome: please send pull requests <a href="https://github.com/sparklyr/sparklyr/pulls" target="_blank" rel="noopener">here</a> and file bug reports or feature requests <a href="https://github.com/sparklyr/sparklyr" target="_blank" rel="noopener">here</a>.</p>
<p>Thanks for reading!</p>
<p>Efraimidis, Pavlos, and Paul (Pavlos) Spirakis. 2016. &ldquo;Weighted Random Sampling.&rdquo; In <em>Encyclopedia of Algorithms</em>, edited by Ming-Yang Kao. Springer New York. <a href="https://doi.org/10.1007/978-1-4939-2864-4_478" target="_blank" rel="noopener">https://doi.org/10.1007/978-1-4939-2864-4_478</a>
.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/thumbnail.jpg" length="60526" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.3: Higher-order Functions, Avro and Custom Serializers</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.3/</link>
      <pubDate>Thu, 16 Jul 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.3/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 1.3 is now available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
, with the following major new features:</p>
<ul>
<li><a href="#higher-order-functions">Higher-order Functions</a>
 to easily manipulate arrays and structs</li>
<li>Support for Apache <a href="#avro">Avro</a>
, a row-oriented data serialization framework</li>
<li><a href="#custom-serialization">Custom Serialization</a>
 using R functions to read and write any data format</li>
<li><a href="#other-improvements">Other Improvements</a>
 such as compatibility with EMR 6.0 &amp; Spark 3.0, and initial support for Flint time series library</li>
</ul>
<p>To install <code>sparklyr</code> 1.3 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this post, we shall highlight some major new features introduced in sparklyr 1.3, and showcase scenarios where such features come in handy. While a number of enhancements and bug fixes (especially those related to <code>spark_apply()</code>, <a href="https://arrow.apache.org/" target="_blank" rel="noopener">Apache Arrow</a>
, and secondary Spark connections) were also an important part of this release, they will not be the topic of this post, and it will be an easy exercise for the reader to find out more about them from the sparklyr <a href="https://github.com/sparklyr/sparklyr/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h2 id="higher-order-functions">Higher-order Functions
</h2>
<p><a href="https://issues.apache.org/jira/browse/SPARK-19480" target="_blank" rel="noopener">Higher-order functions</a>
 are built-in Spark SQL constructs that allow user-defined lambda expressions to be applied efficiently to complex data types such as arrays and structs. As a quick demo to see why higher-order functions are useful, let&rsquo;s say one day Scrooge McDuck dove into his huge vault of money and found large quantities of pennies, nickels, dimes, and quarters. Having an impeccable taste in data structures, he decided to store the quantities and face values of everything into two Spark SQL array columns:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4.5&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">coins_tbl</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">quantities</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">4000</span><span class="p">,</span> <span class="m">3000</span><span class="p">,</span> <span class="m">2000</span><span class="p">,</span> <span class="m">1000</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="n">values</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="m">25</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Thus declaring his net worth of 4k pennies, 3k nickels, 2k dimes, and 1k quarters. To help Scrooge McDuck calculate the total value of each type of coin in sparklyr 1.3 or above, we can apply <code>hof_zip_with()</code>, the sparklyr equivalent of <a href="https://spark.apache.org/docs/latest/api/sql/index.html#zip_with" target="_blank" rel="noopener">ZIP_WITH</a>
, to the <code>quantities</code> and <code>values</code> columns, combining pairs of elements from the arrays in both columns. As you might have guessed, we also need to specify how to combine those elements, and what better way to accomplish that than a concise one-sided formula <code>~ .x * .y</code> in R, which says we want (quantity * value) for each type of coin? So, we have the following:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">result_tbl</span> <span class="o">&lt;-</span> <span class="n">coins_tbl</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">hof_zip_with</span><span class="p">(</span><span class="o">~</span> <span class="n">.x</span> <span class="o">*</span> <span class="n">.y</span><span class="p">,</span> <span class="n">dest_col</span> <span class="o">=</span> <span class="n">total_values</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="n">total_values</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">result_tbl</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">total_values</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1]  4000 15000 20000 25000
</code></pre>
<p>With the result <code>4000 15000 20000 25000</code> telling us there are in total $40 worth of pennies, $150 worth of nickels, $200 worth of dimes, and $250 worth of quarters, as expected.</p>
<p>Using another sparklyr function named <code>hof_aggregate()</code>, which performs an <a href="https://spark.apache.org/docs/latest/api/sql/index.html#aggregate" target="_blank" rel="noopener">AGGREGATE</a>
 operation in Spark, we can then compute the net worth of Scrooge McDuck based on <code>result_tbl</code>, storing the result in a new column named <code>total</code>. Notice that for this aggregate operation to work, we need to ensure the starting value of the aggregation has a data type (namely, <code>BIGINT</code>) consistent with the data type of <code>total_values</code> (which is <code>ARRAY&lt;BIGINT&gt;</code>), as shown below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">result_tbl</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">zero</span> <span class="o">=</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">sql</span><span class="p">(</span><span class="s">&#34;CAST (0 AS BIGINT)&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">hof_aggregate</span><span class="p">(</span><span class="n">start</span> <span class="o">=</span> <span class="n">zero</span><span class="p">,</span> <span class="o">~</span> <span class="n">.x</span> <span class="o">+</span> <span class="n">.y</span><span class="p">,</span> <span class="n">expr</span> <span class="o">=</span> <span class="n">total_values</span><span class="p">,</span> <span class="n">dest_col</span> <span class="o">=</span> <span class="n">total</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="n">total</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">total</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1] 64000
</code></pre>
<p>So Scrooge McDuck&rsquo;s net worth is $640.</p>
<p>Other higher-order functions supported by Spark SQL so far include <code>transform</code>, <code>filter</code>, and <code>exists</code>, as documented <a href="https://spark.apache.org/docs/latest/api/sql/index.html" target="_blank" rel="noopener">here</a>, and similar to the example above, their counterparts (namely, <code>hof_transform()</code>, <code>hof_filter()</code>, and <code>hof_exists()</code>) all exist in sparklyr 1.3, so that they can be integrated with other <code>dplyr</code> verbs in an idiomatic manner in R.</p>
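<p>As a minimal sketch of one of these counterparts (assuming the same local Spark connection as above, and that <code>hof_filter()</code> accepts <code>expr</code> and <code>dest_col</code> arguments analogous to those of <code>hof_aggregate()</code> shown earlier), we could keep only the array elements matching a predicate:</p>

```r
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.4.5")

coins_tbl <- copy_to(
  sc,
  tibble::tibble(quantities = list(c(4000, 3000, 2000, 1000)))
)

# Within each array, keep only quantities greater than 1500
coins_tbl %>%
  hof_filter(~ .x > 1500, expr = quantities, dest_col = large_quantities) %>%
  dplyr::pull(large_quantities)
```

<p>Here the one-sided formula <code>~ .x &gt; 1500</code> plays the same role as the lambda expression in Spark SQL&rsquo;s <code>FILTER</code>.</p>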
<h2 id="avro">Avro
</h2>
<p>Another highlight of the sparklyr 1.3 release is its built-in support for Avro data sources. Apache Avro is a widely used data serialization protocol that combines the efficiency of a binary data format with the flexibility of JSON schema definitions. To make working with Avro data sources simpler, in sparklyr 1.3, as soon as a Spark connection is instantiated with <code>spark_connect(..., package = &quot;avro&quot;)</code>, sparklyr automatically figures out which version of the <code>spark-avro</code> package to use with that connection, saving sparklyr users the headache of determining the correct version of <code>spark-avro</code> themselves. Similar to how <code>spark_read_csv()</code> and <code>spark_write_csv()</code> work with CSV data, the <code>spark_read_avro()</code> and <code>spark_write_avro()</code> methods were implemented in sparklyr 1.3 to facilitate reading and writing Avro files through an Avro-capable Spark connection, as illustrated in the example below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># The `package = &#34;avro&#34;` option is only supported in Spark 2.4 or higher</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4.5&#34;</span><span class="p">,</span> <span class="n">package</span> <span class="o">=</span> <span class="s">&#34;avro&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">sdf_copy_to</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">a</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="kc">NaN</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="kc">NaN</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-2L</span><span class="p">,</span> <span class="m">0L</span><span class="p">,</span> <span class="m">1L</span><span class="p">,</span> <span class="m">3L</span><span class="p">,</span> <span class="m">2L</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">c</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;a&#34;</span><span class="p">,</span> <span class="s">&#34;b&#34;</span><span class="p">,</span> <span class="s">&#34;c&#34;</span><span class="p">,</span> <span class="s">&#34;&#34;</span><span class="p">,</span> <span class="s">&#34;d&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># This example Avro schema is a JSON string that essentially says all columns</span>
</span></span><span class="line"><span class="cl"><span class="c1"># (&#34;a&#34;, &#34;b&#34;, &#34;c&#34;) of `sdf` are nullable.</span>
</span></span><span class="line"><span class="cl"><span class="n">avro_schema</span> <span class="o">&lt;-</span> <span class="n">jsonlite</span><span class="o">::</span><span class="nf">toJSON</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;record&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;topLevelRecord&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">fields</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;a&#34;</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="s">&#34;double&#34;</span><span class="p">,</span> <span class="s">&#34;null&#34;</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;b&#34;</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="s">&#34;int&#34;</span><span class="p">,</span> <span class="s">&#34;null&#34;</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;c&#34;</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="s">&#34;string&#34;</span><span class="p">,</span> <span class="s">&#34;null&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">),</span> <span class="n">auto_unbox</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># persist the Spark data frame from above in Avro format</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_write_avro</span><span class="p">(</span><span class="n">sdf</span><span class="p">,</span> <span class="s">&#34;/tmp/data.avro&#34;</span><span class="p">,</span> <span class="nf">as.character</span><span class="p">(</span><span class="n">avro_schema</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># and then read the same data frame back</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_read_avro</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;/tmp/data.avro&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;data&gt; [?? x 3]
      a     b c
  &lt;dbl&gt; &lt;int&gt; &lt;chr&gt;
  1     1    -2 &quot;a&quot;
  2   NaN     0 &quot;b&quot;
  3     3     1 &quot;c&quot;
  4     4     3 &quot;&quot;
  5   NaN     2 &quot;d&quot;
</code></pre>
<h2 id="custom-serialization">Custom Serialization
</h2>
<p>In addition to commonly used data serialization formats such as CSV, JSON, Parquet, and Avro, starting from sparklyr 1.3, customized data frame serialization and deserialization procedures implemented in R can also be run on Spark workers via the newly implemented <code>spark_read()</code> and <code>spark_write()</code> methods. We can see both of them in action through the quick example below, where <code>saveRDS()</code> is called from a user-defined writer function to save all rows within a Spark data frame into two RDS files on disk, and <code>readRDS()</code> is called from a user-defined reader function to read the data from the RDS files back into Spark:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">7</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">paths</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;/tmp/file1.RDS&#34;</span><span class="p">,</span> <span class="s">&#34;/tmp/file2.RDS&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">spark_write</span><span class="p">(</span><span class="n">sdf</span><span class="p">,</span> <span class="n">writer</span> <span class="o">=</span> <span class="kr">function</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">path</span><span class="p">)</span> <span class="nf">saveRDS</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">path</span><span class="p">),</span> <span class="n">paths</span> <span class="o">=</span> <span class="n">paths</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_read</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">paths</span><span class="p">,</span> <span class="n">reader</span> <span class="o">=</span> <span class="kr">function</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="nf">readRDS</span><span class="p">(</span><span class="n">path</span><span class="p">),</span> <span class="n">columns</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">id</span> <span class="o">=</span> <span class="s">&#34;integer&#34;</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 1]
     id
  &lt;int&gt;
1     1
2     2
3     3
4     4
5     5
6     6
7     7
</code></pre>
<h2 id="other-improvements">Other Improvements
</h2>
<h3 id="sparklyrflint">sparklyr.flint
</h3>
<p><a href="https://github.com/r-spark/sparklyr.flint" target="_blank" rel="noopener"><code>sparklyr.flint</code></a> is a sparklyr extension that aims to make functionality from the <a href="https://github.com/twosigma/flint" target="_blank" rel="noopener">Flint</a> time-series library easily accessible from R. It is currently under active development. One piece of good news is that, while the original <a href="https://github.com/twosigma/flint" target="_blank" rel="noopener">Flint</a> library was designed to work with Spark 2.x, a slightly modified <a href="https://github.com/yl790/flint" target="_blank" rel="noopener">fork</a> of it works well with Spark 3.0 and within the existing sparklyr extension framework; <code>sparklyr.flint</code> can automatically determine which version of the Flint library to load based on the version of Spark it is connected to. Another bit of good news is that, as previously mentioned, <code>sparklyr.flint</code> is still in an early stage of development, so you can play an active part in shaping its future!</p>
<h3 id="emr-60">EMR 6.0
</h3>
<p>This release also features a small but important change that allows sparklyr to correctly connect to the version of Spark 2.4 that is included in Amazon EMR 6.0.</p>
<p>Previously, sparklyr automatically assumed any Spark 2.x it was connecting to was built with Scala 2.11 and attempted to load any required Scala artifacts built with Scala 2.11 as well. This became problematic when connecting to Spark 2.4 from Amazon EMR 6.0, which is built with Scala 2.12. Starting from sparklyr 1.3, this problem can be fixed by simply specifying <code>scala_version = &quot;2.12&quot;</code> when calling <code>spark_connect()</code> (e.g., <code>spark_connect(master = &quot;yarn-client&quot;, scala_version = &quot;2.12&quot;)</code>).</p>
<h3 id="spark-30">Spark 3.0
</h3>
<p>Last but not least, it is worth mentioning that sparklyr 1.3.0 is known to be fully compatible with the recently released Spark 3.0. We highly recommend upgrading your copy of sparklyr to 1.3.0 if you plan to have Spark 3.0 as part of your data workflow in the future.</p>
<h2 id="acknowledgement">Acknowledgement
</h2>
<p>In chronological order, we want to thank the following individuals for submitting pull requests towards sparklyr 1.3:</p>
<ul>
<li><a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">Jozef Hajnala</a>
</li>
<li><a href="https://github.com/falaki" target="_blank" rel="noopener">Hossein Falaki</a>
</li>
<li><a href="https://github.com/samuelmacedo83" target="_blank" rel="noopener">Samuel Macêdo</a>
</li>
<li><a href="https://github.com/yl790" target="_blank" rel="noopener">Yitao Li</a>
</li>
<li><a href="https://github.com/Loquats" target="_blank" rel="noopener">Andy Zhang</a>
</li>
<li><a href="https://github.com/javierluraschi" target="_blank" rel="noopener">Javier Luraschi</a>
</li>
<li><a href="https://github.com/nealrichardson" target="_blank" rel="noopener">Neal Richardson</a>
</li>
</ul>
<p>We are also grateful for valuable input on the sparklyr 1.3 roadmap, <a href="https://github.com/sparklyr/sparklyr/pull/2434" target="_blank" rel="noopener">#2434</a>
, and <a href="https://github.com/sparklyr/sparklyr/pull/2551" target="_blank" rel="noopener">#2551</a>
 from <a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
, and great spiritual advice on <a href="https://github.com/sparklyr/sparklyr/issues/1773" target="_blank" rel="noopener">#1773</a>
 and <a href="https://github.com/sparklyr/sparklyr/issues/2514" target="_blank" rel="noopener">#2514</a>
 from <a href="https://github.com/mattpollock" target="_blank" rel="noopener">@mattpollock</a>
 and <a href="https://github.com/benmwhite" target="_blank" rel="noopener">@benmwhite</a>
.</p>
<p>Please note that if you believe you are missing from the acknowledgement above, it may be because your contribution was considered part of the next sparklyr release rather than of the current one. We make every effort to ensure all contributors are mentioned in this section. If you believe there is a mistake, please feel free to contact the author of this blog post via e-mail (yitao at rstudio dot com) and request a correction.</p>
<p>If you wish to learn more about <code>sparklyr</code>, we recommend visiting <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, and some of the previous release posts such as <a href="https://posit-open-source.netlify.app/blog/ai/2020-04-21-sparklyr-1.2.0-released/">sparklyr 1.2</a>
 and <a href="https://blog.rstudio.com/2020/01/29/sparklyr-1-1/" target="_blank" rel="noopener">sparklyr 1.1</a>
.</p>
<p>Thanks for reading!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.3/thumbnail.jpg" length="93301" type="image/jpeg" />
    </item>
    <item>
      <title>pins 0.4.0: Versioning</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/pins-0-4-0-versioning/</link>
      <pubDate>Fri, 29 May 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/pins-0-4-0-versioning/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<p>A new version of <code>pins</code> is available on CRAN today, which adds support for <a href="http://pins.rstudio.com/articles/advanced-versions.html" target="_blank" rel="noopener">versioning</a>
 your datasets and <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean Spaces</a>
 boards!</p>
<p>As a quick recap, the pins package allows you to cache, discover and share resources. You can use <code>pins</code> in a wide range of situations, from downloading a dataset from a URL to creating complex automation workflows (learn more at <a href="https://pins.rstudio.com" target="_blank" rel="noopener">pins.rstudio.com</a>
). You can also use <code>pins</code> in combination with TensorFlow and Keras; for instance, use <a href="https://tensorflow.rstudio.com/tools/cloudml" target="_blank" rel="noopener">cloudml</a>
 to train models in cloud GPUs, but rather than manually copying files into the GPU instance, you can store them as pins directly from R.</p>
<p>To install this new version of <code>pins</code> from CRAN, simply run:</p>
<pre tabindex="0"><code>install.packages(&#34;pins&#34;)
</code></pre><p>You can find a detailed list of improvements in the pins <a href="https://github.com/rstudio/pins/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h1 id="versioning">Versioning
</h1>
<p>To illustrate the new versioning functionality, let&rsquo;s start by downloading and caching a remote dataset with pins. For this example, we will download the current weather in London; the data happens to be in JSON format and requires <code>jsonlite</code> to be parsed:</p>
<pre tabindex="0"><code>library(pins)

weather_url &lt;- &#34;https://samples.openweathermap.org/data/2.5/weather?q=London,uk&amp;appid=b6907d289e10d714a6e88b30761fae22&#34;

pin(weather_url, &#34;weather&#34;) %&gt;%
  jsonlite::read_json() %&gt;%
  as.data.frame()
</code></pre><pre tabindex="0"><code>  coord.lon coord.lat weather.id weather.main     weather.description weather.icon
1     -0.13     51.51        300      Drizzle light intensity drizzle          09d
</code></pre><p>One advantage of using <code>pins</code> is that, even if the URL or your internet connection becomes unavailable, the above code will still work.</p>
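<p>Once cached, the pin can also be retrieved by name alone. As a quick sketch (assuming the default local board), <code>pin_get()</code> reads the cached copy rather than touching the network:</p>
<pre tabindex="0"><code>pin_get(&#34;weather&#34;) %&gt;%
  jsonlite::read_json() %&gt;%
  as.data.frame()
</code></pre>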
<p>But back to <code>pins 0.4</code>! The new <code>signature</code> parameter in <code>pin_info()</code> allows you to retrieve the &ldquo;version&rdquo; of this dataset:</p>
<pre tabindex="0"><code>pin_info(&#34;weather&#34;, signature = TRUE)
</code></pre><pre tabindex="0"><code># Source: local&lt;weather&gt; [files]
# Signature: 624cca260666c6f090b93c37fd76878e3a12a79b
# Properties:
#   - path: weather
</code></pre><p>You can then validate the remote dataset has not changed by specifying its signature:</p>
<pre tabindex="0"><code>pin(weather_url, &#34;weather&#34;, signature = &#34;624cca260666c6f090b93c37fd76878e3a12a79b&#34;) %&gt;%
  jsonlite::read_json()
</code></pre><p>If the remote dataset changes, <code>pin()</code> will fail, and you can then take the appropriate steps: either accept the changes by updating the signature, or update your code accordingly. The previous example is useful for detecting version changes, but we might also want to retrieve specific versions even after the dataset changes.</p>
<p><code>pins 0.4</code> allows you to display and retrieve versions from services like GitHub, Kaggle and RStudio Connect. Even in boards that don&rsquo;t support versioning natively, you can opt-in by registering a board with <code>versions = TRUE</code>.</p>
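<p>You can try this opt-in behavior without any cloud account. As a minimal sketch (the board name <code>local_versioned</code> is just an example; arguments follow the pins 0.4 API), registering a local board with <code>versions = TRUE</code> enables the same workflow:</p>
<pre tabindex="0"><code>library(pins)

# opt in to versioning on a board that does not version natively
board_register_local(name = &#34;local_versioned&#34;, versions = TRUE)

# pin twice under the same name; both versions are kept
pin(iris, name = &#34;flowers&#34;, board = &#34;local_versioned&#34;)
pin(head(iris, 10), name = &#34;flowers&#34;, board = &#34;local_versioned&#34;)

pin_versions(&#34;flowers&#34;, board = &#34;local_versioned&#34;)
</code></pre>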
<p>To keep this simple, let&rsquo;s focus on GitHub first. We will register a GitHub board and pin a dataset to it. Notice that you can also specify the <code>commit</code> parameter in GitHub boards as the commit message for this change.</p>
<pre tabindex="0"><code>board_register_github(repo = &#34;javierluraschi/datasets&#34;, branch = &#34;datasets&#34;)

pin(iris, name = &#34;versioned&#34;, board = &#34;github&#34;, commit = &#34;use iris as the main dataset&#34;)
</code></pre><p>Now suppose that a colleague comes along and updates this dataset as well:</p>
<pre tabindex="0"><code>pin(mtcars, name = &#34;versioned&#34;, board = &#34;github&#34;, commit = &#34;slight preference to mtcars&#34;)
</code></pre><p>From now on, your code could be broken or, even worse, produce incorrect results!</p>
<p>However, since GitHub was designed as a version control system and <code>pins 0.4</code> adds support for <code>pin_versions()</code>, we can now explore particular versions of this dataset:</p>
<pre tabindex="0"><code>pin_versions(&#34;versioned&#34;, board = &#34;github&#34;)
</code></pre><pre tabindex="0"><code># A tibble: 2 x 4
  version created              author         message                     
  &lt;chr&gt;   &lt;chr&gt;                &lt;chr&gt;          &lt;chr&gt;                       
1 6e6c320 2020-04-02T21:28:07Z javierluraschi slight preference to mtcars 
2 01f8ddf 2020-04-02T21:27:59Z javierluraschi use iris as the main dataset
</code></pre><p>You can then retrieve the version you are interested in as follows:</p>
<pre tabindex="0"><code>pin_get(&#34;versioned&#34;, version = &#34;01f8ddf&#34;, board = &#34;github&#34;)
</code></pre><pre tabindex="0"><code># A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows
</code></pre><p>You can follow similar steps for <a href="http://pins.rstudio.com/articles/boards-rsconnect.html" target="_blank" rel="noopener">RStudio Connect</a>
 and <a href="http://pins.rstudio.com/articles/boards-kaggle.html" target="_blank" rel="noopener">Kaggle</a>
 boards, even for existing pins! Other boards like <a href="http://pins.rstudio.com/articles/boards-s3.html" target="_blank" rel="noopener">Amazon S3</a>
, <a href="http://pins.rstudio.com/articles/boards-gcloud.html" target="_blank" rel="noopener">Google Cloud</a>
, <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">Digital Ocean</a>
 and <a href="http://pins.rstudio.com/articles/boards-azure.html" target="_blank" rel="noopener">Microsoft Azure</a>
 require you to explicitly enable versioning when registering your boards.</p>
<h1 id="digitalocean">DigitalOcean
</h1>
<p>To try out the new <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean Spaces board</a>
, first you will have to register this board and enable versioning by setting <code>versions</code> to <code>TRUE</code>:</p>
<pre tabindex="0"><code>library(pins)
board_register_dospace(space = &#34;pinstest&#34;,
                       key = &#34;AAAAAAAAAAAAAAAAAAAA&#34;,
                       secret = &#34;ABCABCABCABCABCABCABCABCABCABCABCABCABCA==&#34;,
                       datacenter = &#34;sfo2&#34;,
                       versions = TRUE)
</code></pre><p>You can then use all the functionality pins provides, including versioning:</p>
<pre tabindex="0"><code># create pin and replace content in digitalocean
pin(iris, name = &#34;versioned&#34;, board = &#34;pinstest&#34;)
pin(mtcars, name = &#34;versioned&#34;, board = &#34;pinstest&#34;)

# retrieve versions from digitalocean
pin_versions(name = &#34;versioned&#34;, board = &#34;pinstest&#34;)
</code></pre><pre tabindex="0"><code># A tibble: 2 x 1
  version
  &lt;chr&gt;  
1 c35da04
2 d9034cd
</code></pre><p>Notice that enabling versions in cloud services requires additional storage space for each version of the dataset being stored:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/pins-0-4-0-versioning/images/digitalocean-spaces-pins-versioned.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>To learn more visit the <a href="http://pins.rstudio.com/articles/advanced-versions.html" target="_blank" rel="noopener">Versioning</a>
 and <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean</a>
 articles. To catch up with previous releases:</p>
<ul>
<li><a href="https://blog.rstudio.com/2019/11/28/pins-0-3-0-azure-gcloud-and-s3/" target="_blank" rel="noopener">pins 0.3</a>
: Azure, GCloud and S3</li>
<li><a href="https://blog.rstudio.com/2019/09/09/pin-discover-and-share-resources/" target="_blank" rel="noopener">pins 0.2</a>
: Pin, Discover and Share Resources</li>
</ul>
<p>Thanks for reading along!</p>
]]></description>
    </item>
    <item>
      <title>sparklyr 1.2: Foreach, Spark 3.0 and Databricks Connect</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-2/</link>
      <pubDate>Wed, 06 May 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-2/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p>A new version of <a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 is now available on CRAN! In this <code>sparklyr 1.2</code> release, the following new improvements have emerged into the spotlight:</p>
<ul>
<li>A <code>registerDoSpark()</code> method to create a <a href="#foreach"><code>foreach</code></a>
 parallel backend powered by Spark that enables hundreds of existing R packages to run in Spark.</li>
<li>Support for <a href="#databricks-connect">Databricks Connect</a>
, allowing <code>sparklyr</code> to connect to remote Databricks clusters.</li>
<li>Improved support for Spark <a href="#structures">structures</a>
 when collecting and querying their nested attributes with <code>dplyr</code>.</li>
</ul>
<p>A number of interop issues observed with <code>sparklyr</code> and the Spark 3.0 preview were also addressed recently, in the hope that by the time Spark 3.0 officially graces us with its presence, <code>sparklyr</code> will be fully ready to work with it. Most notably, key features such as <code>spark_submit()</code>, <code>sdf_bind_rows()</code>, and standalone connections are now finally working with the Spark 3.0 preview.</p>
<p>To install <code>sparklyr</code> 1.2 from CRAN, run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The full list of changes is available in the <code>sparklyr</code> <a href="https://github.com/sparklyr/sparklyr/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h2 id="foreach">Foreach
</h2>
<p>The <a href="https://CRAN.R-project.org/package=foreach" target="_blank" rel="noopener"><code>foreach</code></a>
 package provides the <code>%dopar%</code> operator to iterate over elements in a collection in parallel. Using <code>sparklyr</code> 1.2, you can now register Spark as a backend using <code>registerDoSpark()</code> and then easily iterate over R objects using Spark:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">foreach</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">registerDoSpark</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">foreach</span><span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span> <span class="n">.combine</span> <span class="o">=</span> <span class="s">&#39;c&#39;</span><span class="p">)</span> <span class="o">%dopar%</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sqrt</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>[1] 1.000000 1.414214 1.732051
</code></pre><p>Since many R packages are based on <code>foreach</code> to perform parallel computation, we can now make use of all those great packages in Spark as well!</p>
<p>For instance, we can use <a href="https://tidymodels.github.io/parsnip/" target="_blank" rel="noopener"><code>parsnip</code></a>
 and the <a href="https://tidymodels.github.io/tune/" target="_blank" rel="noopener"><code>tune</code></a>
 package with data from <a href="https://CRAN.R-project.org/package=mlbench" target="_blank" rel="noopener"><code>mlbench</code></a>
 to perform hyperparameter tuning in Spark with ease:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tune</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">parsnip</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">mlbench</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">data</span><span class="p">(</span><span class="n">Ionosphere</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">svm_rbf</span><span class="p">(</span><span class="n">cost</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">(),</span> <span class="n">rbf_sigma</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">())</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_mode</span><span class="p">(</span><span class="s">&#34;classification&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_engine</span><span class="p">(</span><span class="s">&#34;kernlab&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">tune_grid</span><span class="p">(</span><span class="n">Class</span> <span class="o">~</span> <span class="n">.,</span>
</span></span><span class="line"><span class="cl">    <span class="n">resamples</span> <span class="o">=</span> <span class="n">rsample</span><span class="o">::</span><span class="nf">bootstraps</span><span class="p">(</span><span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="n">Ionosphere</span><span class="p">,</span> <span class="o">-</span><span class="n">V2</span><span class="p">),</span> <span class="n">times</span> <span class="o">=</span> <span class="m">30</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">control</span> <span class="o">=</span> <span class="nf">control_grid</span><span class="p">(</span><span class="n">verbose</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Bootstrap sampling
# A tibble: 30 x 4
   splits            id          .metrics          .notes
 * &lt;list&gt;            &lt;chr&gt;       &lt;list&gt;            &lt;list&gt;
 1 &lt;split [351/124]&gt; Bootstrap01 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 2 &lt;split [351/126]&gt; Bootstrap02 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 3 &lt;split [351/125]&gt; Bootstrap03 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 4 &lt;split [351/135]&gt; Bootstrap04 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 5 &lt;split [351/127]&gt; Bootstrap05 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 6 &lt;split [351/131]&gt; Bootstrap06 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 7 &lt;split [351/141]&gt; Bootstrap07 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 8 &lt;split [351/123]&gt; Bootstrap08 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 9 &lt;split [351/118]&gt; Bootstrap09 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
10 &lt;split [351/136]&gt; Bootstrap10 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
# … with 20 more rows
</code></pre><p>The Spark connection was already registered, so the code ran in Spark without any additional changes. We can verify that this was the case by navigating to the Spark web interface:</p>
<img src="https://posit-open-source.netlify.app/blog/images/2020-05-06-sparklyr-1-2-spark-backend-foreach-package.png" alt="Spark running foreach package using sparklyr"/>
<h2 id="databricks-connect">Databricks Connect
</h2>
<p><a href="https://docs.databricks.com/dev-tools/databricks-connect.html" target="_blank" rel="noopener">Databricks Connect</a>
 allows you to connect your favorite IDE (like <a href="https://rstudio.com/products/rstudio/download/" target="_blank" rel="noopener">RStudio</a>
!) to a Spark <a href="https://databricks.com/" target="_blank" rel="noopener">Databricks</a>
 cluster.</p>
<p>You will first have to install the <code>databricks-connect</code> Python package as described in our <a href="https://github.com/sparklyr/sparklyr#connecting-through-databricks-connect" target="_blank" rel="noopener">README</a>
 and start a Databricks cluster, but once that&rsquo;s ready, connecting to the remote cluster is as easy as running:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">method</span> <span class="o">=</span> <span class="s">&#34;databricks&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">spark_home</span> <span class="o">=</span> <span class="nf">system2</span><span class="p">(</span><span class="s">&#34;databricks-connect&#34;</span><span class="p">,</span> <span class="s">&#34;get-spark-home&#34;</span><span class="p">,</span> <span class="n">stdout</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><img src="https://posit-open-source.netlify.app/blog/images/2020-05-06-sparklyr-1-2-spark-databricks-connect-rstudio.png" alt="Databricks Connect with RStudio Desktop"/>
<p>That&rsquo;s about it; you are now remotely connected to a Databricks cluster from your local R session.</p>
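<p>From there, the connection behaves like any other <code>sparklyr</code> connection. As a quick smoke test (a hedged sketch; <code>mtcars</code> stands in for your own data):</p>
<pre tabindex="0"><code>library(dplyr)

# copy a small local dataset to the remote cluster and query it
cars_tbl &lt;- copy_to(sc, mtcars, overwrite = TRUE)
cars_tbl %&gt;% count()
</code></pre>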
<h2 id="structures">Structures
</h2>
<p>If you previously used <code>collect()</code> to deserialize structurally complex Spark data frames into their equivalents in R, you have likely noticed that Spark SQL struct columns were only mapped into JSON strings in R, which was non-ideal. You might also have run into the much-dreaded <code>java.lang.IllegalArgumentException: Invalid type list</code> error when using <code>dplyr</code> to query nested attributes from any struct column of a Spark data frame in <code>sparklyr</code>.</p>
<p>Unfortunately, in real-world Spark use cases, data describing entities comprised of sub-entities (e.g., a product catalog of all hardware components of some computers) often needs to be denormalized / shaped in an object-oriented manner in the form of Spark SQL structs to allow efficient read queries. While <code>sparklyr</code> had the limitations mentioned above, users often had to invent their own workarounds when querying Spark struct columns, which explains why there was popular demand for <code>sparklyr</code> to have better support for such use cases.</p>
<p>The good news is that with <code>sparklyr</code> 1.2, those limitations no longer exist when running with Spark 2.4 or above.</p>
<p>As a concrete example, consider the following catalog of computers:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">computers</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">id</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">attributes</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">processor</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">freq</span> <span class="o">=</span> <span class="m">2.4</span><span class="p">,</span> <span class="n">num_cores</span> <span class="o">=</span> <span class="m">256</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">      <span class="n">price</span> <span class="o">=</span> <span class="m">100</span>
</span></span><span class="line"><span class="cl">   <span class="p">),</span>
</span></span><span class="line"><span class="cl">   <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">     <span class="n">processor</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">freq</span> <span class="o">=</span> <span class="m">1.6</span><span class="p">,</span> <span class="n">num_cores</span> <span class="o">=</span> <span class="m">512</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">     <span class="n">price</span> <span class="o">=</span> <span class="m">133</span>
</span></span><span class="line"><span class="cl">   <span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">computers</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">computers</span><span class="p">,</span> <span class="n">overwrite</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>A typical <code>dplyr</code> use case involving <code>computers</code> would be the following:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">high_freq_computers</span> <span class="o">&lt;-</span> <span class="n">computers</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                       <span class="nf">filter</span><span class="p">(</span><span class="n">attributes.processor.freq</span> <span class="o">&gt;=</span> <span class="m">2</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                       <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>As previously mentioned, before <code>sparklyr</code> 1.2, such a query would fail with <code>Error: java.lang.IllegalArgumentException: Invalid type list</code>.</p>
<p>Whereas with <code>sparklyr</code> 1.2, the expected result is returned in the following form:</p>
<pre tabindex="0"><code># A tibble: 1 x 2
     id attributes
  &lt;int&gt; &lt;list&gt;
1     1 &lt;named list [2]&gt;
</code></pre><p>where <code>high_freq_computers$attributes</code> is what we would expect:</p>
<pre tabindex="0"><code>[[1]]
[[1]]$price
[1] 100
[[1]]$processor
[[1]]$processor$freq
[1] 2.4
[[1]]$processor$num_cores
[1] 256
</code></pre><h2 id="and-more">And More!
</h2>
<p>Last but not least, we heard about a number of pain points <code>sparklyr</code> users have run into, and have addressed many of them in this release as well. For example:</p>
<ul>
<li>Date type in R is now correctly serialized into Spark SQL date type by <code>copy_to()</code></li>
<li><code>&lt;spark dataframe&gt; %&gt;% print(n = 20)</code> now actually prints 20 rows as expected instead of 10</li>
<li><code>spark_connect(master = &quot;local&quot;)</code> will emit a more informative error message if it&rsquo;s failing because the loopback interface is not up</li>
</ul>
<p>&hellip; to name just a few. We want to thank the open source community for their continuous feedback on <code>sparklyr</code>, and are looking forward to incorporating more of that feedback to make <code>sparklyr</code> even better in the future.</p>
<p>Finally, in chronological order, we wish to thank the following individuals for contributing to <code>sparklyr</code> 1.2: <a href="https://github.com/zero323" target="_blank" rel="noopener">zero323</a>
, <a href="https://github.com/Loquats" target="_blank" rel="noopener">Andy Zhang</a>
, <a href="https://github.com/yl790" target="_blank" rel="noopener">Yitao Li</a>
,
<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">Javier Luraschi</a>
, <a href="https://github.com/falaki" target="_blank" rel="noopener">Hossein Falaki</a>
, <a href="https://github.com/lu-wang-dl" target="_blank" rel="noopener">Lu Wang</a>
, <a href="https://github.com/samuelmacedo83" target="_blank" rel="noopener">Samuel Macedo</a>
 and <a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">Jozef Hajnala</a>
. Great job everyone!</p>
<p>If you need to catch up on <code>sparklyr</code>, please visit <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, or some of the previous release posts: <a href="https://blog.rstudio.com/2020/01/29/sparklyr-1-1/" target="_blank" rel="noopener">sparklyr 1.1</a>
 and <a href="https://blog.rstudio.com/2019/03/15/sparklyr-1-0/" target="_blank" rel="noopener">sparklyr 1.0</a>
.</p>
<p>Thank you for reading this post.</p>
<p>This post was originally published on <a href="https://blogs.rstudio.com/ai/" target="_blank" rel="noopener">blogs.rstudio.com/ai/</a>
</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-2/thumbnail.jpg" length="3509" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.2: Foreach, Spark 3.0 and Databricks Connect</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/</link>
      <pubDate>Tue, 21 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p>Behold the glory that is <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr</a>
 1.2! In this release, the following new hotnesses have emerged into the spotlight:</p>
<ul>
<li>A <code>registerDoSpark</code> method to create a <a href="#foreach">foreach</a>
 parallel backend powered by Spark that enables hundreds of existing R packages to run in Spark.</li>
<li>Support for <a href="#databricks-connect">Databricks Connect</a>
, allowing <code>sparklyr</code> to connect to remote Databricks clusters.</li>
<li>Improved support for Spark <a href="#structures">structures</a>
 when collecting and querying their nested attributes with <code>dplyr</code>.</li>
</ul>
<p>A number of interop issues observed with <code>sparklyr</code> and the Spark 3.0 preview were also addressed recently, in the hope that by the time Spark 3.0 officially graces us with its presence, <code>sparklyr</code> will be fully ready to work with it. Most notably, key features such as <code>spark_submit</code>, <code>sdf_bind_rows</code>, and standalone connections are now finally working with the Spark 3.0 preview.</p>
<p>To install <code>sparklyr</code> 1.2 from CRAN, run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The full list of changes is available in the sparklyr <a href="https://github.com/sparklyr/sparklyr/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h2 id="foreach">Foreach
</h2>
<p>The <code>foreach</code> package provides the <code>%dopar%</code> operator to iterate over elements in a collection in parallel. Using <code>sparklyr</code> 1.2, you can now register Spark as a backend using <code>registerDoSpark()</code> and then easily iterate over R objects using Spark:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">foreach</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">registerDoSpark</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">foreach</span><span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span> <span class="n">.combine</span> <span class="o">=</span> <span class="s">&#39;c&#39;</span><span class="p">)</span> <span class="o">%dopar%</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sqrt</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1] 1.000000 1.414214 1.732051
</code></pre>
<p>Since many R packages are based on <code>foreach</code> to perform parallel computation, we can now make use of all those great packages in Spark as well!</p>
<p>For instance, we can use <a href="https://tidymodels.github.io/parsnip/" target="_blank" rel="noopener">parsnip</a>
 and the <a href="https://tidymodels.github.io/tune/" target="_blank" rel="noopener">tune</a>
 package with data from <a href="https://CRAN.R-project.org/package=mlbench" target="_blank" rel="noopener">mlbench</a>
 to perform hyperparameter tuning in Spark with ease:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tune</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">parsnip</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">mlbench</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">data</span><span class="p">(</span><span class="n">Ionosphere</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">svm_rbf</span><span class="p">(</span><span class="n">cost</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">(),</span> <span class="n">rbf_sigma</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">())</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_mode</span><span class="p">(</span><span class="s">&#34;classification&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_engine</span><span class="p">(</span><span class="s">&#34;kernlab&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">tune_grid</span><span class="p">(</span><span class="n">Class</span> <span class="o">~</span> <span class="n">.,</span>
</span></span><span class="line"><span class="cl">    <span class="n">resamples</span> <span class="o">=</span> <span class="n">rsample</span><span class="o">::</span><span class="nf">bootstraps</span><span class="p">(</span><span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="n">Ionosphere</span><span class="p">,</span> <span class="o">-</span><span class="n">V2</span><span class="p">),</span> <span class="n">times</span> <span class="o">=</span> <span class="m">30</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">control</span> <span class="o">=</span> <span class="nf">control_grid</span><span class="p">(</span><span class="n">verbose</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Bootstrap sampling
# A tibble: 30 x 4
   splits            id          .metrics          .notes
 * &lt;list&gt;            &lt;chr&gt;       &lt;list&gt;            &lt;list&gt;
 1 &lt;split [351/124]&gt; Bootstrap01 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 2 &lt;split [351/126]&gt; Bootstrap02 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 3 &lt;split [351/125]&gt; Bootstrap03 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 4 &lt;split [351/135]&gt; Bootstrap04 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 5 &lt;split [351/127]&gt; Bootstrap05 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 6 &lt;split [351/131]&gt; Bootstrap06 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 7 &lt;split [351/141]&gt; Bootstrap07 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 8 &lt;split [351/123]&gt; Bootstrap08 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 9 &lt;split [351/118]&gt; Bootstrap09 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
10 &lt;split [351/136]&gt; Bootstrap10 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
# … with 20 more rows
</code></pre>
<p>The Spark connection was already registered, so the code ran in Spark without any additional changes. We can verify this was the case by navigating to the Spark web interface:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/images/spark-backend-foreach-package.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<h2 id="databricks-connect">Databricks Connect
</h2>
<p><a href="https://docs.databricks.com/dev-tools/databricks-connect.html" target="_blank" rel="noopener">Databricks Connect</a>
 allows you to connect your favorite IDE (like <a href="https://rstudio.com/products/rstudio/download/" target="_blank" rel="noopener">RStudio</a>
!) to a Spark <a href="https://databricks.com/" target="_blank" rel="noopener">Databricks</a>
 cluster.</p>
<p>You will first have to install the <code>databricks-connect</code> package as described in our <a href="https://github.com/sparklyr/sparklyr#connecting-through-databricks-connect" target="_blank" rel="noopener">README</a>
 and start a Databricks cluster, but once that&rsquo;s ready, connecting to the remote cluster is as easy as running:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">method</span> <span class="o">=</span> <span class="s">&#34;databricks&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">spark_home</span> <span class="o">=</span> <span class="nf">system2</span><span class="p">(</span><span class="s">&#34;databricks-connect&#34;</span><span class="p">,</span> <span class="s">&#34;get-spark-home&#34;</span><span class="p">,</span> <span class="n">stdout</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/images/spark-databricks-connect-rstudio.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>That&rsquo;s about it: you are now remotely connected to a Databricks cluster from your local R session.</p>
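<p>Once connected, the usual <code>sparklyr</code> workflow applies, except that the heavy lifting happens on the remote cluster. A minimal sketch, reusing the connection <code>sc</code> from above (the table name is illustrative):</p>

```r
library(sparklyr)
library(dplyr)

# Copy a local data frame up to the Databricks cluster...
cars_tbl <- copy_to(sc, mtcars, "mtcars_remote", overwrite = TRUE)

# ...and query it with dplyr; the computation runs on the cluster and
# only the summarised result is collected back into the local R session
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()
```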
<h2 id="structures">Structures
</h2>
<p>If you previously used <code>collect</code> to deserialize structurally complex Spark dataframes into their equivalents in R, you likely noticed that Spark SQL struct columns were mapped to JSON strings in R, which was not ideal. You might also have run into the much-dreaded <code>java.lang.IllegalArgumentException: Invalid type list</code> error when using <code>dplyr</code> to query nested attributes from any struct column of a Spark dataframe in sparklyr.</p>
<p>Unfortunately, in real-world Spark use cases, data describing entities comprising sub-entities (e.g., a product catalog of all hardware components of some computers) often needs to be denormalized / shaped in an object-oriented manner in the form of Spark SQL structs to allow efficient read queries. Given the limitations mentioned above, users often had to invent their own workarounds when querying Spark struct columns, which explains why there was popular demand for sparklyr to better support such use cases.</p>
<p>The good news is that with <code>sparklyr</code> 1.2, those limitations no longer exist when running with Spark 2.4 or above.</p>
<p>As a concrete example, consider the following catalog of computers:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">computers</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">id</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">attributes</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">processor</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">freq</span> <span class="o">=</span> <span class="m">2.4</span><span class="p">,</span> <span class="n">num_cores</span> <span class="o">=</span> <span class="m">256</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">      <span class="n">price</span> <span class="o">=</span> <span class="m">100</span>
</span></span><span class="line"><span class="cl">   <span class="p">),</span>
</span></span><span class="line"><span class="cl">   <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">     <span class="n">processor</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">freq</span> <span class="o">=</span> <span class="m">1.6</span><span class="p">,</span> <span class="n">num_cores</span> <span class="o">=</span> <span class="m">512</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">     <span class="n">price</span> <span class="o">=</span> <span class="m">133</span>
</span></span><span class="line"><span class="cl">   <span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">computers</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">computers</span><span class="p">,</span> <span class="n">overwrite</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>A typical <code>dplyr</code> use case involving <code>computers</code> would be the following:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">high_freq_computers</span> <span class="o">&lt;-</span> <span class="n">computers</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                       <span class="nf">filter</span><span class="p">(</span><span class="n">attributes.processor.freq</span> <span class="o">&gt;=</span> <span class="m">2</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                       <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>As previously mentioned, before <code>sparklyr</code> 1.2, such a query would fail with <code>Error: java.lang.IllegalArgumentException: Invalid type list</code>.</p>
<p>Whereas with <code>sparklyr</code> 1.2, the expected result is returned in the following form:</p>
<pre><code># A tibble: 1 x 2
     id attributes
  &lt;int&gt; &lt;list&gt;
1     1 &lt;named list [2]&gt;
</code></pre>
<p>where <code>high_freq_computers$attributes</code> is what we would expect:</p>
<pre><code>[[1]]
[[1]]$price
[1] 100

[[1]]$processor
[[1]]$processor$freq
[1] 2.4

[[1]]$processor$num_cores
[1] 256
</code></pre>
<h2 id="and-more">And More!
</h2>
<p>Last but not least, we heard about a number of pain points <code>sparklyr</code> users have run into, and have addressed many of them in this release as well. For example:</p>
<ul>
<li>Date type in R is now correctly serialized into Spark SQL date type by <code>copy_to</code></li>
<li><code>&lt;spark dataframe&gt; %&gt;% print(n = 20)</code> now actually prints 20 rows as expected instead of 10</li>
<li><code>spark_connect(master = &quot;local&quot;)</code> will emit a more informative error message if it&rsquo;s failing because the loopback interface is not up</li>
</ul>
<p>&hellip; to name just a few. We want to thank the open source community for their continuous feedback on <code>sparklyr</code>, and are looking forward to incorporating more of that feedback to make <code>sparklyr</code> even better in the future.</p>
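<p>As a quick illustration of the first two fixes above (a sketch assuming an existing connection <code>sc</code>; table names are illustrative):</p>

```r
library(sparklyr)
library(dplyr)

# Date columns are now serialized as Spark SQL dates, not strings
dates_tbl <- copy_to(
  sc,
  data.frame(d = as.Date(c("2020-01-01", "2020-04-21"))),
  "dates_demo",
  overwrite = TRUE
)
sdf_schema(dates_tbl)  # the `d` column should now report DateType

# print(n = ...) on a Spark dataframe now honors the requested row count
copy_to(sc, data.frame(x = 1:50), "print_demo", overwrite = TRUE) %>%
  print(n = 20)
```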
<p>Finally, in chronological order, we wish to thank the following individuals for contributing to <code>sparklyr</code> 1.2: <a href="https://github.com/zero323" target="_blank" rel="noopener">zero323</a>
, <a href="https://github.com/Loquats" target="_blank" rel="noopener">Andy Zhang</a>
, <a href="https://github.com/yl790" target="_blank" rel="noopener">Yitao Li</a>
,
<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">Javier Luraschi</a>
, <a href="https://github.com/falaki" target="_blank" rel="noopener">Hossein Falaki</a>
, <a href="https://github.com/lu-wang-dl" target="_blank" rel="noopener">Lu Wang</a>
, <a href="https://github.com/samuelmacedo83" target="_blank" rel="noopener">Samuel Macedo</a>
 and <a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">Jozef Hajnala</a>
. Great job everyone!</p>
<p>If you need to catch up on <code>sparklyr</code>, please visit <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, or some of the previous release posts: <a href="https://blog.rstudio.com/2020/01/29/sparklyr-1-1/" target="_blank" rel="noopener">sparklyr 1.1</a>
 and <a href="https://blog.rstudio.com/2019/03/15/sparklyr-1-0/" target="_blank" rel="noopener">sparklyr 1.0</a>
.</p>
<p>Thank you for reading this post.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/thumbnail.png" length="25088" type="image/png" />
    </item>
    <item>
      <title>pins 0.4: Versioning</title>
      <link>https://posit-open-source.netlify.app/blog/ai/pins-0.4/</link>
      <pubDate>Mon, 13 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/pins-0.4/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<p>A new version of <code>pins</code> is available on CRAN today, which adds support for <a href="http://pins.rstudio.com/articles/advanced-versions.html" target="_blank" rel="noopener">versioning</a>
 your datasets and <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean Spaces</a>
 boards!</p>
<p>As a quick recap, the pins package allows you to cache, discover and share resources. You can use <code>pins</code> in a wide range of situations, from downloading a dataset from a URL to creating complex automation workflows (learn more at <a href="https://pins.rstudio.com" target="_blank" rel="noopener">pins.rstudio.com</a>
). You can also use <code>pins</code> in combination with TensorFlow and Keras; for instance, use <a href="https://tensorflow.rstudio.com/tools/cloudml" target="_blank" rel="noopener">cloudml</a>
 to train models in cloud GPUs, but rather than manually copying files into the GPU instance, you can store them as pins directly from R.</p>
<p>To install this new version of <code>pins</code> from CRAN, simply run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;pins&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can find a detailed list of improvements in the pins <a href="https://github.com/rstudio/pins/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h1 id="versioning">Versioning
</h1>
<p>To illustrate the new versioning functionality, let&rsquo;s start by downloading and caching a remote dataset with pins. For this example, we will download the weather in London, which happens to be in JSON format and requires <code>jsonlite</code> to parse:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">weather_url</span> <span class="o">&lt;-</span> <span class="s">&#34;https://samples.openweathermap.org/data/2.5/weather?q=London,uk&amp;appid=b6907d289e10d714a6e88b30761fae22&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">weather_url</span><span class="p">,</span> <span class="s">&#34;weather&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">jsonlite</span><span class="o">::</span><span class="nf">read_json</span><span class="p">()</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">as.data.frame</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>  coord.lon coord.lat weather.id weather.main     weather.description weather.icon
1     -0.13     51.51        300      Drizzle light intensity drizzle          09d
</code></pre>
<p>One advantage of using <code>pins</code> is that, even if the URL or your internet connection becomes unavailable, the above code will still work.</p>
<p>But back to <code>pins 0.4</code>! The new <code>signature</code> parameter in <code>pin_info()</code> allows you to retrieve the &ldquo;version&rdquo; of this dataset:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_info</span><span class="p">(</span><span class="s">&#34;weather&#34;</span><span class="p">,</span> <span class="n">signature</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: local&lt;weather&gt; [files]
# Signature: 624cca260666c6f090b93c37fd76878e3a12a79b
# Properties:
#   - path: weather
</code></pre>
<p>You can then validate the remote dataset has not changed by specifying its signature:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">weather_url</span><span class="p">,</span> <span class="s">&#34;weather&#34;</span><span class="p">,</span> <span class="n">signature</span> <span class="o">=</span> <span class="s">&#34;624cca260666c6f090b93c37fd76878e3a12a79b&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">jsonlite</span><span class="o">::</span><span class="nf">read_json</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>If the remote dataset changes, <code>pin()</code> will fail and you can take the appropriate steps to accept the changes by updating the signature or properly updating your code. The previous example is useful as a way of detecting version changes, but we might also want to retrieve specific versions even when the dataset changes.</p>
<p><code>pins 0.4</code> allows you to display and retrieve versions from services like GitHub, Kaggle and RStudio Connect. Even in boards that don&rsquo;t support versioning natively, you can opt-in by registering a board with <code>versions = TRUE</code>.</p>
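<p>For instance, local boards do not version pins by default, but you can opt in when registering one. A minimal sketch (board name is illustrative):</p>

```r
library(pins)

# Register a local board with versioning explicitly enabled
board_register_local(name = "local_versioned", versions = TRUE)

# Pinning twice under the same name now keeps both versions around
pin(iris,   name = "my_data", board = "local_versioned")
pin(mtcars, name = "my_data", board = "local_versioned")

pin_versions("my_data", board = "local_versioned")
```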
<p>To keep this simple, let&rsquo;s focus on GitHub first. We will register a GitHub board and pin a dataset to it. Notice that, for GitHub boards, you can also use the <code>commit</code> parameter to provide the commit message for this change.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">board_register_github</span><span class="p">(</span><span class="n">repo</span> <span class="o">=</span> <span class="s">&#34;javierluraschi/datasets&#34;</span><span class="p">,</span> <span class="n">branch</span> <span class="o">=</span> <span class="s">&#34;datasets&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">iris</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;github&#34;</span><span class="p">,</span> <span class="n">commit</span> <span class="o">=</span> <span class="s">&#34;use iris as the main dataset&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Now suppose that a colleague comes along and updates this dataset as well:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;github&#34;</span><span class="p">,</span> <span class="n">commit</span> <span class="o">=</span> <span class="s">&#34;slight preference to mtcars&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>From now on, your code could be broken or, even worse, produce incorrect results!</p>
<p>However, since GitHub was designed as a version control system and <code>pins 0.4</code> adds support for <code>pin_versions()</code>, we can now explore particular versions of this dataset:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_versions</span><span class="p">(</span><span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;github&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># A tibble: 2 x 4
  version created              author         message                     
  &lt;chr&gt;   &lt;chr&gt;                &lt;chr&gt;          &lt;chr&gt;                       
1 6e6c320 2020-04-02T21:28:07Z javierluraschi slight preference to mtcars 
2 01f8ddf 2020-04-02T21:27:59Z javierluraschi use iris as the main dataset
</code></pre>
<p>You can then retrieve the version you are interested in as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;01f8ddf&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;github&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows
</code></pre>
<p>You can follow similar steps for <a href="http://pins.rstudio.com/articles/boards-rsconnect.html" target="_blank" rel="noopener">RStudio Connect</a>
 and <a href="http://pins.rstudio.com/articles/boards-kaggle.html" target="_blank" rel="noopener">Kaggle</a>
 boards, even for existing pins! Other boards like <a href="http://pins.rstudio.com/articles/boards-s3.html" target="_blank" rel="noopener">Amazon S3</a>
, <a href="http://pins.rstudio.com/articles/boards-gcloud.html" target="_blank" rel="noopener">Google Cloud</a>
, <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">Digital Ocean</a>
 and <a href="http://pins.rstudio.com/articles/boards-azure.html" target="_blank" rel="noopener">Microsoft Azure</a>
 require that you explicitly enable versioning when registering your boards.</p>
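<p>For example, to register an Amazon S3 board with versioning enabled, set <code>versions</code> to <code>TRUE</code> at registration time (the bucket name and credentials below are placeholders):</p>
<pre tabindex="0"><code class="language-r">library(pins)

# Register an S3 board with versioning enabled; replace the
# bucket, key, and secret with your own values.
board_register_s3(bucket   = "pins-example-bucket",
                  key      = Sys.getenv("AWS_ACCESS_KEY_ID"),
                  secret   = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
                  versions = TRUE)
</code></pre>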
<h1 id="digitalocean">DigitalOcean
</h1>
<p>To try out the new <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean Spaces board</a>
, first you will have to register this board and enable versioning by setting <code>versions</code> to <code>TRUE</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">board_register_dospace</span><span class="p">(</span><span class="n">space</span> <span class="o">=</span> <span class="s">&#34;pinstest&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">key</span> <span class="o">=</span> <span class="s">&#34;AAAAAAAAAAAAAAAAAAAA&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">secret</span> <span class="o">=</span> <span class="s">&#34;ABCABCABCABCABCABCABCABCABCABCABCABCABCA==&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">datacenter</span> <span class="o">=</span> <span class="s">&#34;sfo2&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">versions</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can then use all the functionality pins provides, including versioning:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># create pin and replace content in digitalocean</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">iris</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;pinstest&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;pinstest&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># retrieve versions from digitalocean</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin_versions</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;pinstest&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># A tibble: 2 x 1
  version
  &lt;chr&gt;  
1 c35da04
2 d9034cd
</code></pre>
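<p>As with the GitHub board shown earlier, you can then pass one of these hashes to <code>pin_get()</code> to retrieve that specific version:</p>
<pre tabindex="0"><code class="language-r"># Retrieve a specific version of the pin by its hash
pin_get("versioned", version = "c35da04", board = "pinstest")
</code></pre>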
<p>Notice that enabling versioning in cloud services requires additional storage space for each stored version of the dataset:</p>
<img src="https://posit-open-source.netlify.app/blog/ai/pins-0.4/images/digitalocean-spaces-pins-versioned.png" style="width:100.0%" />
<p>To learn more visit the <a href="http://pins.rstudio.com/articles/advanced-versions.html" target="_blank" rel="noopener">Versioning</a>
 and <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean</a>
 articles. To catch up with previous releases:</p>
<ul>
<li><a href="http://pins.rstudio.com/blog/posts/pins-0-3-0/" target="_blank" rel="noopener">pins 0.3</a>
: Azure, GCloud and S3</li>
<li><a href="https://blog.rstudio.com/2019/09/09/pin-discover-and-share-resources/" target="_blank" rel="noopener">pins 0.2</a>
: Pin, Discover and Share Resources</li>
</ul>
<p>Thanks for reading along!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/pins-0.4/thumbnail.jpg" length="51651" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.1: Foundations, Books, Lakes and Barriers</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-1/</link>
      <pubDate>Wed, 29 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-1/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<img src="https://posit-open-source.netlify.app/blog-images/2020-01-29-sparklyr-1-1-linux-foundation-roadmap.png" style="display: none;" alt="Linux Foundation roadmap projects and sparklyr"/>
<p>Today we are excited to share that <a href="https://github.com/sparklyr/sparklyr" target="_blank" rel="noopener">sparklyr</a>
 <code>1.1</code> is now available on <a href="https://CRAN.R-project.org/package=sparklyr" target="_blank" rel="noopener">CRAN</a>
!</p>
<p>In a nutshell, you can use sparklyr to scale datasets across computing clusters running <a href="http://spark.apache.org" target="_blank" rel="noopener">Apache Spark</a>
. For this particular release, we would like to highlight the following new features:</p>
<ul>
<li><strong><a href="#delta-lake">Delta Lake</a>
</strong> enables database-like properties in Spark.</li>
<li><strong><a href="#spark-3-0">Spark 3.0</a>
</strong> preview is now available through sparklyr.</li>
<li><strong><a href="#barrier-execution">Barrier Execution</a>
</strong> paves the way to use Spark with deep learning frameworks.</li>
<li><strong><a href="#qubole">Qubole</a>
</strong> clusters running Spark can be easily used with sparklyr.</li>
</ul>
<p>In addition, new community <strong><a href="#extensions">Extensions</a>
</strong> enable natural language processing and genomics, sparklyr is now being hosted within the <strong><a href="#linux-foundation">Linux Foundation</a>
</strong>, and the <strong><a href="#mastering-spark-with-r">Mastering Spark with R</a>
</strong> book is now available and free-to-use online.</p>
<p>You can install <code>sparklyr 1.1</code> from CRAN as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="delta-lake">Delta Lake
</h2>
<p>The <a href="https://delta.io/" target="_blank" rel="noopener">Delta Lake</a>
 project is an open-source storage layer that brings <a href="https://en.wikipedia.org/wiki/ACID" target="_blank" rel="noopener">ACID transactions</a>
 to Apache Spark. To use Delta Lake, first connect using the new <code>packages</code> parameter set to <code>&quot;delta&quot;</code>.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">,</span> <span class="n">packages</span> <span class="o">=</span> <span class="s">&#34;delta&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>As a simple example, let&rsquo;s write a small data frame to Delta using <code>spark_write_delta()</code>, overwrite it, and then read it back with <code>spark_read_delta()</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">5</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">spark_write_delta</span><span class="p">(</span><span class="n">path</span> <span class="o">=</span> <span class="s">&#34;delta-test&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">3</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">spark_write_delta</span><span class="p">(</span><span class="n">path</span> <span class="o">=</span> <span class="s">&#34;delta-test&#34;</span><span class="p">,</span> <span class="n">mode</span> <span class="o">=</span> <span class="s">&#34;overwrite&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">spark_read_delta</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;delta-test&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: spark&lt;delta1&gt; [?? x 1]
     id
  &lt;int&gt;
1     1
2     2
3     3
</code></pre><p>Now, since Delta tracks all versions of your data, you can easily time travel and retrieve the version we just overwrote:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">spark_read_delta</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;delta-test&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="m">0L</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: spark&lt;delta1&gt; [?? x 1]
     id
  &lt;int&gt;
1     1
2     2
3     3
4     4
5     5
</code></pre><h2 id="spark-30">Spark 3.0
</h2>
<p>To install and try out Spark 3.0 preview, simply run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_install</span><span class="p">(</span><span class="s">&#34;3.0.0-preview&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0-preview&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can then preview upcoming features, like the ability to read binary files. To demonstrate this, we can use <a href="https://blog.rstudio.com/2019/09/09/pin-discover-and-share-resources/" target="_blank" rel="noopener">pins</a>
 to download a 237MB subset of <a href="http://www.image-net.org/" target="_blank" rel="noopener">ImageNet</a>
, and then load the images into Spark:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tiny_imagenet</span> <span class="o">&lt;-</span> <span class="n">pins</span><span class="o">::</span><span class="nf">pin</span><span class="p">(</span><span class="s">&#34;http://cs231n.stanford.edu/tiny-imagenet-200.zip&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_read_source</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="nf">dirname</span><span class="p">(</span><span class="n">tiny_imagenet[1]</span><span class="p">),</span> <span class="n">source</span> <span class="o">=</span> <span class="s">&#34;binaryFile&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: spark&lt;images&gt; [?? x 4]
   path                       modificationTime    length content   
   &lt;chr&gt;                      &lt;dttm&gt;               &lt;dbl&gt; &lt;list&gt;    
 1 file:images/test_2009.JPEG 2020-01-08 20:36:41   3138 &lt;raw [3,138]&gt;
 2 file:images/test_8245.JPEG 2020-01-08 20:36:43   3066 &lt;raw [3,066]&gt;
 3 file:images/test_4186.JPEG 2020-01-08 20:36:42   2998 &lt;raw [2,998]&gt;
 4 file:images/test_403.JPEG  2020-01-08 20:36:39   2980 &lt;raw [2,980]&gt;
 5 file:images/test_8544.JPEG 2020-01-08 20:36:38   2958 &lt;raw [2,958]&gt;
 6 file:images/test_5814.JPEG 2020-01-08 20:36:38   2929 &lt;raw [2,929]&gt;
 7 file:images/test_1063.JPEG 2020-01-08 20:36:41   2920 &lt;raw [2,920]&gt;
 8 file:images/test_1942.JPEG 2020-01-08 20:36:39   2908 &lt;raw [2,908]&gt;
 9 file:images/test_5456.JPEG 2020-01-08 20:36:42   2906 &lt;raw [2,906]&gt;
10 file:images/test_5859.JPEG 2020-01-08 20:36:39   2896 &lt;raw [2,896]&gt;
# … with more rows
</code></pre><p>Note that the <a href="https://spark.apache.org/news/spark-3.0.0-preview.html" target="_blank" rel="noopener">Spark 3.0.0 preview</a>
 is not a stable release in terms of either API or functionality.</p>
<h2 id="barrier-execution">Barrier Execution
</h2>
<p>Barrier execution is a new feature introduced in <a href="https://spark.apache.org/releases/spark-release-2-4-0.html" target="_blank" rel="noopener">Spark 2.4</a>
 which enables deep learning on Apache Spark by adding an all-or-nothing scheduler. This allows Spark not only to process analytic workflows, but also to act as a high-performance computing cluster where other frameworks, like <a href="https://www.openmp.org/" target="_blank" rel="noopener">OpenMP</a>
 or <a href="https://www.tensorflow.org/guide/distributed_training" target="_blank" rel="noopener">TensorFlow Distributed</a>
, can reuse cluster machines and have them directly communicate with each other for a given task.</p>
<p>In general, we don&rsquo;t expect most users to use this feature directly; instead, this is a feature relevant to advanced users interested in creating extensions that support additional modeling frameworks. You can learn more about barrier execution in Reynold Xin&rsquo;s <a href="https://vimeo.com/274267107" target="_blank" rel="noopener">keynote</a>
.</p>
<p>To use barrier execution from R, set the <code>barrier = TRUE</code> parameter in <code>spark_apply()</code> and then use the new barrier-context argument of the R closure (accessed as <code>.y</code> in the formula below) to retrieve the network addresses of the nodes available for this task. A simple example follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">1</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">spark_apply</span><span class="p">(</span><span class="o">~</span> <span class="n">.y</span><span class="o">$</span><span class="n">address</span><span class="p">,</span> <span class="n">barrier</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">address</span> <span class="o">=</span> <span class="s">&#34;character&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># A tibble: 1 x 1
  address        
  &lt;chr&gt;          
1 localhost:50693
</code></pre><h2 id="qubole">Qubole
</h2>
<p><a href="https://www.qubole.com/product/data-platform/" target="_blank" rel="noopener">Qubole</a>
 is a fully self-service multi-cloud data platform based on enterprise-grade data processing engines including Apache Spark.</p>
<p>If you are using Qubole clusters, you can now easily connect to Spark through the new <code>&quot;qubole&quot;</code> connection method:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">method</span> <span class="o">=</span> <span class="s">&#34;qubole&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Once connected, you can use Spark and R as usual. To learn more, visit <a href="https://docs.qubole.com/en/latest/user-guide/engines/spark/rstudio_spark.html" target="_blank" rel="noopener">RStudio for Running Distributed R Jobs</a>
.</p>
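<p>As a quick illustration (generic sparklyr usage, not specific to Qubole), once <code>sc</code> is connected you can copy a local data frame to the cluster and query it with <code>dplyr</code>:</p>
<pre tabindex="0"><code class="language-r">library(dplyr)

# Copy a local data frame into Spark and summarize it remotely
mtcars_tbl &lt;- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %&gt;%
  group_by(cyl) %&gt;%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))
</code></pre>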
<h2 id="extensions">Extensions
</h2>
<p>The new <a href="https://github.com/r-spark" target="_blank" rel="noopener">github.com/r-spark</a>
 repo contains new community extensions. To mention a few, <a href="https://CRAN.R-project.org/package=variantspark" target="_blank" rel="noopener">variantspark</a>
 and <a href="https://CRAN.R-project.org/package=sparkhail" target="_blank" rel="noopener">sparkhail</a>
 are two new extensions for genomic research, and <a href="https://github.com/r-spark/sparknlp" target="_blank" rel="noopener">sparknlp</a>
 adds support for natural language processing.</p>
<p>For those of you with a background in genomics, you can use <code>sparkhail</code> by first installing the extension from CRAN, then connecting to Spark, creating a Hail context, and loading a subset of the <a href="https://www.internationalgenome.org/data/" target="_blank" rel="noopener">1000 Genomes</a>
 dataset using <a href="https://hail.is/" target="_blank" rel="noopener">Hail</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparkhail</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">,</span> <span class="n">config</span> <span class="o">=</span> <span class="nf">hail_config</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="n">hc</span> <span class="o">&lt;-</span> <span class="nf">hail_context</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">hail_data</span> <span class="o">&lt;-</span> <span class="n">pins</span><span class="o">::</span><span class="nf">pin</span><span class="p">(</span><span class="s">&#34;https://github.com/r-spark/sparkhail/blob/master/inst/extdata/1kg.zip?raw=true&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">hail_df</span> <span class="o">&lt;-</span> <span class="nf">hail_read_matrix</span><span class="p">(</span><span class="n">hc</span><span class="p">,</span> <span class="nf">file.path</span><span class="p">(</span><span class="nf">dirname</span><span class="p">(</span><span class="n">hail_data[1]</span><span class="p">),</span> <span class="s">&#34;1kg.mt&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">hail_dataframe</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can then analyze it with packages like <code>dplyr</code>, <code>sparklyr.nested</code>, and <code>dbplot</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sdf_separate_column</span><span class="p">(</span><span class="n">hail_df</span><span class="p">,</span> <span class="s">&#34;alleles&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">group_by</span><span class="p">(</span><span class="n">alleles_1</span><span class="p">,</span> <span class="n">alleles_2</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">tally</span><span class="p">()</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">arrange</span><span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source:     spark&lt;?&gt; [?? x 3]
# Groups:     alleles_1
# Ordered by: -n
   alleles_1 alleles_2     n
   &lt;chr&gt;     &lt;chr&gt;     &lt;dbl&gt;
 1 C         T          2436
 2 G         A          2387
 3 A         G          1944
 4 T         C          1879
 5 C         A           496
 6 G         T           480
 7 T         G           468
 8 A         C           454
 9 C         G           150
10 G         C           112
# … with more rows
</code></pre><p>Notice that these frequencies come in pairs: C/T and G/A are actually the same mutation, just viewed from opposite strands. You can then create a histogram over the DP field (the read depth of the proband) as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sparklyr.nested</span><span class="o">::</span><span class="nf">sdf_select</span><span class="p">(</span><span class="n">hail_df</span><span class="p">,</span> <span class="n">dp</span> <span class="o">=</span> <span class="n">info.DP</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dbplot</span><span class="o">::</span><span class="nf">dbplot_histogram</span><span class="p">(</span><span class="n">dp</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><img src="https://posit-open-source.netlify.app/blog-images/2020-01-29-sparklyr-1-1-hail-histogram-pd.png" alt="Apache Spark, Hail, R, and sparklyr histogram"/>
<p>This code was adapted from Hail&rsquo;s <a href="https://hail.is/docs/0.2/tutorials/01-genome-wide-association-study.html" target="_blank" rel="noopener">Genome-Wide Association Study</a>
 tutorial. You can learn more about this Hail community extension at <a href="https://github.com/r-spark/sparkhail" target="_blank" rel="noopener">r-spark/sparkhail</a>
.</p>
<h2 id="linux-foundation">Linux Foundation
</h2>
<p>The <a href="https://www.linuxfoundation.org" target="_blank" rel="noopener">Linux Foundation</a>
 is home of projects such as <a href="https://www.linuxfoundation.org/projects/linux/" target="_blank" rel="noopener">Linux</a>
, <a href="https://kubernetes.io/" target="_blank" rel="noopener">Kubernetes</a>
, <a href="https://js.foundation/" target="_blank" rel="noopener">Node.js</a>
 and umbrella foundations such as <a href="https://lfai.foundation/" target="_blank" rel="noopener">LF AI</a>
, <a href="https://www.lfedge.org/" target="_blank" rel="noopener">LF Edge</a>
, and <a href="https://www.lfnetworking.org/" target="_blank" rel="noopener">LF Networking</a>
. We are very excited to have sparklyr be hosted as an incubation project within LF AI alongside <a href="https://www.acumos.org/" target="_blank" rel="noopener">Acumos</a>
, <a href="https://lfai.foundation/projects/angel-ml/" target="_blank" rel="noopener">Angel</a>
, <a href="https://lfai.foundation/projects/horovod/" target="_blank" rel="noopener">Horovod</a>
, <a href="https://pyro.ai/" target="_blank" rel="noopener">Pyro</a>
, <a href="https://onnx.ai/" target="_blank" rel="noopener">ONNX</a>
 and several others.</p>
<p>Hosting sparklyr in LF AI within the Linux Foundation provides a neutral entity to hold the project&rsquo;s assets under open governance. Furthermore, we believe hosting with LF AI will also help bring additional talent, ideas, and shared components from other Linux Foundation projects like <a href="https://delta.io" target="_blank" rel="noopener">Delta Lake</a>
, <a href="https://eng.uber.com/horovod/" target="_blank" rel="noopener">Horovod</a>
, <a href="https://onnx.ai" target="_blank" rel="noopener">ONNX</a>
, and so on into sparklyr as part of cross-project and cross-foundation collaboration.</p>
<p>This makes it a great time for you to join the sparklyr community, contribute, and help this project grow. You can learn more about this in <a href="https://sparklyr.org" target="_blank" rel="noopener">sparklyr.org</a>
.</p>
<h2 id="mastering-spark-with-r">Mastering Spark with R
</h2>
<p><a href="https://therinspark.com" target="_blank" rel="noopener">Mastering Spark with R</a>
 is a new book to help you learn and master Apache Spark with R from start to finish. It introduces data analysis with well-known tools like <a href="https://dplyr.tidyverse.org/" target="_blank" rel="noopener">dplyr</a>
, and covers everything else related to processing large-scale datasets, modeling, productionizing pipelines, using extensions, distributing R code, and processing real-time data &ndash; if you are not yet familiar with Spark, this is a great resource to get started!</p>
<p><a href="https://therinspark.com"><img src="/blog-images/2020-01-29-sparklyr-1-1-book-cover.jpg" width="200px" alt="Mastering Spark with R book cover"/></a></p>
<p>This book was published by <a href="http://shop.oreilly.com/product/0636920223764.do" target="_blank" rel="noopener">O&rsquo;Reilly</a>
, is available on <a href="https://www.amazon.com/gp/product/149204637X" target="_blank" rel="noopener">Amazon</a>
, and is also free-to-use <a href="https://therinspark.com/" target="_blank" rel="noopener">online</a>
. We hope you find this book useful and easy to read.</p>
<p>To catch up on previous releases, take a look at the <a href="https://blog.rstudio.com/2019/03/15/sparklyr-1-0/" target="_blank" rel="noopener">sparklyr 1.0</a>
 post or watch various video tutorials in the <a href="https://www.youtube.com/channel/UCAwJMtPx4HgmMXEDTvZBJ4A/playlists" target="_blank" rel="noopener">mlverse</a>
 channel.</p>
<p>Thank you for reading along!</p>
]]></description>
    </item>
    <item>
      <title>pins 0.3.0: Azure, GCloud and S3</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/pins-0-3-0-azure-gcloud-and-s3/</link>
      <pubDate>Thu, 28 Nov 2019 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/pins-0-3-0-azure-gcloud-and-s3/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<p>A new version of <code>pins</code> is available on CRAN! <code>pins 0.3</code> comes with many improvements and the following major features:</p>
<ul>
<li>Retrieve <strong>pin information</strong> with <code>pin_info()</code> including properties particular to each board.</li>
</ul>
<p>You can install this new version from CRAN as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;pins&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In addition, there is a new <a href="https://rstudio.github.io/pins/articles/use-cases.html" target="_blank" rel="noopener">Use Cases</a>
 section in our docs, various improvements (see <a href="https://rstudio.github.io/pins/news/index.html" target="_blank" rel="noopener">NEWS</a>
) and two community extensions being developed to support <a href="https://rstudio.github.io/connections/#pins" target="_blank" rel="noopener">databases</a>
 and <a href="https://gitlab.com/gwmngilfen/nextcloudr" target="_blank" rel="noopener">Nextcloud</a>
 as boards.</p>
<h2 id="cloud-boards">Cloud Boards
</h2>
<p><code>pins 0.3</code> adds support for finding, retrieving, and storing resources in various cloud providers, including <a href="https://azure.microsoft.com/" target="_blank" rel="noopener">Microsoft Azure</a>
, <a href="https://cloud.google.com/" target="_blank" rel="noopener">Google Cloud</a>
 and <a href="https://aws.amazon.com/" target="_blank" rel="noopener">Amazon Web Services</a>
.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/pins-0-3-0-azure-gcloud-and-s3/images/pins-cloud-boards-azure-gcloud-s3.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>To illustrate how they work, let&rsquo;s first try to find the World Bank indicators dataset on <a href="https://www.kaggle.com/" target="_blank" rel="noopener">Kaggle</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin_find</span><span class="p">(</span><span class="s">&#34;indicators&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;kaggle&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># A tibble: 6 x 4
  name                                            description                             type  board 
  &lt;chr&gt;                                           &lt;chr&gt;                                   &lt;chr&gt; &lt;chr&gt; 
1 worldbank/world-development-indicators          World Development Indicators            files kaggle
2 theworldbank/world-development-indicators       World Development Indicators            files kaggle
3 cdc/chronic-disease                             Chronic Disease Indicators              files kaggle
4 bigquery/worldbank-wdi                          World Development Indicators (WDI) Data files kaggle
5 rajanand/key-indicators-of-annual-health-survey Health Analytics                        files kaggle
6 loveall/human-happiness-indicators              Human Happiness Indicators              files kaggle
</code></pre><p>We can then easily download any of these with <code>pin_get()</code>; be aware that this is a 2GB download:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;worldbank/world-development-indicators&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>[1] &#34;/.../worldbank/world-development-indicators/Country.csv&#34;     
[2] &#34;/.../worldbank/world-development-indicators/CountryNotes.csv&#34;
[3] &#34;/.../worldbank/world-development-indicators/database.sqlite&#34; 
[4] &#34;/.../worldbank/world-development-indicators/Footnotes.csv&#34;   
[5] &#34;/.../worldbank/world-development-indicators/hashes.txt&#34;      
[6] &#34;/.../worldbank/world-development-indicators/Indicators.csv&#34;  
[7] &#34;/.../worldbank/world-development-indicators/Series.csv&#34;      
[8] &#34;/.../worldbank/world-development-indicators/SeriesNotes.csv&#34; 
</code></pre><p>The <code>Indicators.csv</code> file contains all the indicators, so let&rsquo;s load it with <a href="https://readr.tidyverse.org/" target="_blank" rel="noopener">readr</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">indicators</span> <span class="o">&lt;-</span> <span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;worldbank/world-development-indicators&#34;</span><span class="p">)</span><span class="n">[6]</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">readr</span><span class="o">::</span><span class="nf">read_csv</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Analysing this dataset would be quite interesting; however, this post focuses on how to share this in S3, Google Cloud or Azure storage. More specifically, we will learn how to publish to an <a href="https://pins.rstudio.com/articles/boards-s3.html" target="_blank" rel="noopener">S3 board</a>
. To publish to other cloud providers, take a look at the <a href="https://pins.rstudio.com/articles/boards-gcloud.html" target="_blank" rel="noopener">Google Cloud</a>
 and <a href="https://pins.rstudio.com/articles/boards-azure.html" target="_blank" rel="noopener">Azure boards</a>
 articles.</p>
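<p>For completeness, here is a rough sketch of what registering those other cloud boards looks like. The board names are arbitrary labels, the bucket, container, and account values are placeholders read from environment variables, and the exact authentication options should be confirmed against the linked articles:</p>

```r
library(pins)

# Sketch: register a Google Cloud Storage board; the bucket name is a
# placeholder read from an environment variable, as described in the
# linked Google Cloud article.
board_register_gcloud(name = "gcloud",
                      bucket = Sys.getenv("GCLOUD_STORAGE_BUCKET"))

# Sketch: register an Azure blob storage board; container, account,
# and key are placeholders stored as environment variables.
board_register_azure(name = "azure",
                     container = Sys.getenv("AZURE_STORAGE_CONTAINER"),
                     account = Sys.getenv("AZURE_STORAGE_ACCOUNT"),
                     key = Sys.getenv("AZURE_STORAGE_KEY"))
```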
<p>As you would expect, the first step is to register the S3 board. When using RStudio, you can use the <a href="https://pins.rstudio.com/articles/pins-rstudio.html" target="_blank" rel="noopener">New Connection</a>
 action to guide you through this process, or you can specify your <code>key</code> and <code>secret</code> as follows. Please refer to the <a href="https://pins.rstudio.com/articles/boards-s3.html" target="_blank" rel="noopener">S3 board</a>
 article to understand how to store your credentials securely.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">board_register_s3</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;rpins&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">bucket</span>  <span class="o">=</span> <span class="s">&#34;rpins&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">key</span> <span class="o">=</span> <span class="s">&#34;VerySecretKey&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">secret</span> <span class="o">=</span> <span class="s">&#34;EvenMoreImportantSecret&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>With the S3 board registered, we can now pin the indicators dataset with <code>pin()</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">indicators</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;worldbank/indicators&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;rpins&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>That&rsquo;s about it! We can now find and retrieve this dataset from S3 with <code>pin_find()</code> and <code>pin_get()</code>, or view the uploaded resources in the S3 management console:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/pins-0-3-0-azure-gcloud-and-s3/images/pins-upload-s3-results.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>To make this even easier for others to consume, we can make this S3 bucket public, which means anyone can connect to this board without having to configure S3, making it possible to retrieve this dataset with a single line of R code!</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">pins</span><span class="o">::</span><span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;worldbank/indicators&#34;</span><span class="p">,</span> <span class="s">&#34;https://rpins.s3.amazonaws.com&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># A tibble: 5,656,458 x 6
   CountryName CountryCode IndicatorName                          IndicatorCode    Year      Value
   &lt;chr&gt;       &lt;chr&gt;       &lt;chr&gt;                                  &lt;chr&gt;           &lt;dbl&gt;      &lt;dbl&gt;
 1 Arab World  ARB         Adolescent fertility rate (births per… SP.ADO.TFRT      1960    1.34e+2
 2 Arab World  ARB         Age dependency ratio (% of working-ag… SP.POP.DPND      1960    8.78e+1
 3 Arab World  ARB         Age dependency ratio, old (% of worki… SP.POP.DPND.OL   1960    6.63e+0
 4 Arab World  ARB         Age dependency ratio, young (% of wor… SP.POP.DPND.YG   1960    8.10e+1
 5 Arab World  ARB         Arms exports (SIPRI trend indicator v… MS.MIL.XPRT.KD   1960    3.00e+6
 6 Arab World  ARB         Arms imports (SIPRI trend indicator v… MS.MIL.MPRT.KD   1960    5.38e+8
 7 Arab World  ARB         Birth rate, crude (per 1,000 people)   SP.DYN.CBRT.IN   1960    4.77e+1
 8 Arab World  ARB         CO2 emissions (kt)                     EN.ATM.CO2E.KT   1960    5.96e+4
 9 Arab World  ARB         CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC   1960    6.44e-1
10 Arab World  ARB         CO2 emissions from gaseous fuel consu… EN.ATM.CO2E.GF…  1960    5.04e+0
# … with 5,656,448 more rows
</code></pre><p>This works because <code>pins 0.3</code> automatically registers URLs as a <a href="https://pins.rstudio.com/articles/boards-websites.html" target="_blank" rel="noopener">website board</a>
 to save you from having to explicitly call <code>board_register_datatxt()</code>.</p>
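<p>If you prefer to be explicit, the equivalent manual registration would look roughly like this; the board name <code>worldbank</code> is just an illustrative label:</p>

```r
library(pins)

# Sketch: explicitly register the public bucket as a website board,
# then retrieve the pin through that named board.
board_register_datatxt(name = "worldbank",
                       url = "https://rpins.s3.amazonaws.com")
pin_get("worldbank/indicators", board = "worldbank")
```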
<p>It&rsquo;s also worth mentioning that <code>pins</code> stores the dataset in a native R format, which requires only 72MB and loads much faster than the original 2GB dataset.</p>
<h2 id="pin-information">Pin Information
</h2>
<p>Boards like <a href="https://pins.rstudio.com/articles/boards-kaggle.html" target="_blank" rel="noopener">Kaggle</a>
 and <a href="https://pins.rstudio.com/articles/boards-rsconnect.html" target="_blank" rel="noopener">RStudio Connect</a>
 store additional information for each pin, which you can now easily retrieve with <code>pin_info()</code>.</p>
<p>For instance, we can retrieve additional properties for the indicators pin from Kaggle as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_info</span><span class="p">(</span><span class="s">&#34;worldbank/world-development-indicators&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;kaggle&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: kaggle&lt;worldbank/world-development-indicators&gt; [files]
# Description: World Development Indicators
# Properties:
#   - id: 23
#   - subtitle: Explore country development indicators from around the world
#   - tags: (ref) business, economics, international relations, business finance...
#   - creatorName: Megan Risdal
#   - creatorUrl: mrisdal
#   - totalBytes: 387054886
#   - url: https://www.kaggle.com/worldbank/world-development-indicators
#   - lastUpdated: 2017-05-01T17:50:44.863Z
#   - downloadCount: 42961
#   - isPrivate: FALSE
#   - isReviewed: TRUE
#   - isFeatured: FALSE
#   - licenseName: World Bank Dataset Terms of Use
#   - ownerName: World Bank
#   - ownerRef: worldbank
#   - kernelCount: 422
#   - topicCount: 7
#   - viewCount: 254379
#   - voteCount: 1121
#   - currentVersionNumber: 2
#   - usabilityRating: 0.7647
#   - extension: zip
</code></pre><p>And from RStudio Connect boards as well:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_info</span><span class="p">(</span><span class="s">&#34;worldnews&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;rsconnect&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: rsconnect&lt;jluraschi/worldnews&gt; [table]
# Properties:
#   - id: 6446
#   - guid: 1b9f04c5-ddd4-43ca-8352-98f6f01a7034
#   - access_type: all
#   - url: https://beta.rstudioconnect.com/content/6446/
#   - vanity_url: FALSE
#   - bundle_id: 16216
#   - app_mode: 4
#   - content_category: pin
#   - has_parameters: FALSE
#   - created_time: 2019-09-30T18:20:21.911777Z
#   - last_deployed_time: 2019-11-18T16:00:16.919478Z
#   - build_status: 2
#   - run_as_current_user: FALSE
#   - owner_first_name: Javier
#   - owner_last_name: Luraschi
#   - owner_username: jluraschi
#   - owner_guid: ac498f34-174c-408f-8089-a9f10c630a37
#   - owner_locked: FALSE
#   - is_scheduled: FALSE
#   - rows: 44
#   - cols: 1
</code></pre><p>To retrieve all the extended information when discovering pins, pass <code>extended = TRUE</code> to <code>pin_find()</code>.</p>
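<p>For example, assuming the Kaggle board is registered, something like the following would return the same search results with all the extended metadata columns included:</p>

```r
library(pins)

# Sketch: search Kaggle and include the extended metadata columns
# (subtitle, totalBytes, downloadCount, and so on) in the result.
pin_find("indicators", board = "kaggle", extended = TRUE)
```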
<p>Thank you for reading this post!</p>
<p>Please refer to <a href="https://rstudio.github.io/pins" target="_blank" rel="noopener">rstudio.github.io/pins</a>
 for detailed documentation and <a href="https://github.com/rstudio/pins/issues/new" target="_blank" rel="noopener">GitHub</a>
 to file issues or feature requests.</p>
]]></description>
    </item>
  </channel>
</rss>
