Machine Learning on Posit Open Source

orbital 0.4.0

Emil Hvitfeldt — Mon, 12 Jan 2026 00:00:00 +0000

We’re over the moon to announce the release of orbital 0.4.0. orbital lets you predict in databases using tidymodels workflows.

You can install it from CRAN with:

install.packages("orbital")

This blog post will cover the highlights, which are post processing support and the new show_query() method.

You can see a full list of changes in the release notes.

Post processing support

The biggest improvement in this version is that orbital() now works for supported tailor methods. See the vignette for a list of all supported post-processors.

Let’s start by fitting a classification model on the penguins data set, using {xgboost} as the engine. We will be showcasing using an adjustment that only works on binary classification and will thus recode species to have levels "Adelie" and "not_Adelie".

penguins$species <- forcats::fct_recode(
 penguins$species,
 not_Adelie = "Chinstrap", not_Adelie = "Gentoo"
)

After we have modified the data, we set up a simple workflow, with a preprocessor using recipes and the model specification using parsnip.

We also set up a post processor using the tailor package. A single adjustment will be done by adding adjust_equivocal_zone(). This applies an equivocal zone to our binary classification model, withholding predictions that are too close to the threshold by labeling them as "[EQ]". Setting the argument value = 0.2 means that any prediction with a predicted probability between 0.3 and 0.7 will be labeled "[EQ]" instead.

rec_spec <- recipe(species ~ ., data = penguins) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_zv(all_predictors())

lr_spec <- boost_tree(tree_depth = 1, trees = 5) |>
  set_mode("classification") |>
  set_engine("xgboost")

tlr_spec <- tailor() |>
  adjust_equivocal_zone(value = 0.2)

wf_spec <- workflow(rec_spec, lr_spec, tlr_spec)
wf_fit <- fit(wf_spec, data = penguins)

With this fitted workflow object, we can call orbital() on it to create an orbital object. Notice that for adjust_equivocal_zone() to work, we need to set type = c("class", "prob") as both are required for the adjust_equivocal_zone() transformation.

orbital_obj <- orbital(wf_fit, type = c("class", "prob"))
orbital_obj
#> 
#> ── orbital Object ───────────────────────────────────────────────────────
#> • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, ...
#> • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201 ...
#> • .pred_class = dplyr::case_when(1 - 1/(1 + exp(dplyr::case_when(b ...
#> • .pred_Adelie = 1 - 1/(1 + exp(dplyr::case_when(bill_length_mm < ...
#> • .pred_not_Adelie = 1 - (1 - 1/(1 + exp(dplyr::case_when(bill_len ...
#> • .pred_class = dplyr::case_when( .pred_Adelie > 0.5 + 0.2 ~ 'Adel ...
#> ─────────────────────────────────────────────────────────────────────────
#> 6 equations in total.

This object contains all the information that is needed to produce predictions, which we can produce with predict().

preds <- predict(orbital_obj, penguins)
preds
#> # A tibble: 344 × 3
#>    .pred_class .pred_Adelie .pred_not_Adelie
#>                              
#>  1 Adelie             0.845            0.155
#>  2 Adelie             0.845            0.155
#>  3 Adelie             0.845            0.155
#>  4 not_Adelie         0.291            0.709
#>  5 Adelie             0.845            0.155
#>  6 Adelie             0.845            0.155
#>  7 Adelie             0.845            0.155
#>  8 Adelie             0.845            0.155
#>  9 Adelie             0.845            0.155
#> 10 Adelie             0.845            0.155
#> # ℹ 334 more rows

The predictions are working; however, the first rows don’t show any evidence that adjust_equivocal_zone() is working. A call to count() reveals that 15 observations land in the equivocal zone.

count(preds, .pred_class)
#> # A tibble: 3 × 2
#>   .pred_class     n
#>          
#> 1 Adelie        144
#> 2 [EQ]           15
#> 3 not_Adelie    185

And we can further verify that they are correct.

filter(preds, .pred_class == '[EQ]')
#> # A tibble: 15 × 3
#>    .pred_class .pred_Adelie .pred_not_Adelie
#>                              
#>  1 [EQ]               0.483            0.517
#>  2 [EQ]               0.483            0.517
#>  3 [EQ]               0.483            0.517
#>  4 [EQ]               0.483            0.517
#>  5 [EQ]               0.483            0.517
#>  6 [EQ]               0.483            0.517
#>  7 [EQ]               0.483            0.517
#>  8 [EQ]               0.348            0.652
#>  9 [EQ]               0.348            0.652
#> 10 [EQ]               0.348            0.652
#> 11 [EQ]               0.348            0.652
#> 12 [EQ]               0.348            0.652
#> 13 [EQ]               0.483            0.517
#> 14 [EQ]               0.483            0.517
#> 15 [EQ]               0.483            0.517
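
The labeling rule can also be checked by hand. The following is a minimal base R sketch of the rule described above, not tailor's own code: label_equivocal() is a hypothetical helper, and the probabilities are the distinct .pred_Adelie values that appear in the output shown in this post.

```r
# Hypothetical helper (not part of tailor or orbital): re-implements the
# equivocal zone rule for a binary problem with a 0.5 threshold.
label_equivocal <- function(prob_event, threshold = 0.5, value = 0.2,
                            levels = c("Adelie", "not_Adelie")) {
  ifelse(
    prob_event > threshold + value, levels[1],
    ifelse(prob_event < threshold - value, levels[2], "[EQ]")
  )
}

# Distinct .pred_Adelie values seen in the printed predictions
probs <- c(0.845, 0.291, 0.483, 0.348)
labels <- label_equivocal(probs)
labels
#> [1] "Adelie"     "not_Adelie" "[EQ]"       "[EQ]"
```

Both 0.483 and 0.348 fall between 0.3 and 0.7, which is consistent with the rows labeled "[EQ]".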

New show_query method

One of the main purposes of orbital is to allow for predictions in databases.

library(DBI)
library(RSQLite)

con_sqlite <- dbConnect(SQLite(), dbname = ":memory:")
penguins_sqlite <- copy_to(con_sqlite, penguins, name = "penguins_table")

Having set up a database, we could have used orbital_sql() to show what the SQL query would look like. However, that output consists of fragments, so it isn’t immediately ready to be pasted into its own file for quick testing.

The show_query() method has been implemented to see exactly what the generated SQL looks like.

show_query(orbital_obj, con_sqlite)
#> CASE WHEN ((`bill_length_mm` IS NULL)) THEN 43.9219298245614 WHEN NOT ((`bill_length_mm` IS NULL)) THEN `bill_length_mm` END AS bill_length_mm
#> CASE WHEN ((`flipper_length_mm` IS NULL)) THEN 201.0 WHEN NOT ((`flipper_length_mm` IS NULL)) THEN `flipper_length_mm` END AS flipper_length_mm
#> CASE
#> WHEN ((1.0 - 1.0 / (1.0 + EXP(((((CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.627138138
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)
#> END + CASE
#> WHEN (`bill_length_mm` < 43.2999992) THEN 0.425288886
#> WHEN ((`bill_length_mm` >= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)
#> END) + CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.380251437
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)
#> END) + CASE
#> WHEN (`bill_length_mm` < 44.4000015) THEN 0.286071777
#> WHEN ((`bill_length_mm` >= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)
#> END) + CASE
#> WHEN (`flipper_length_mm` < 203.0) THEN 0.209298179
#> WHEN ((`flipper_length_mm` >= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)
#> END) + LOG(0.44186047 / (1.0 - 0.44186047))))) > 0.5) THEN 'Adelie'
#> ELSE 'not_Adelie'
#> END AS .pred_class
#> 1.0 - 1.0 / (1.0 + EXP(((((CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.627138138
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)
#> END + CASE
#> WHEN (`bill_length_mm` < 43.2999992) THEN 0.425288886
#> WHEN ((`bill_length_mm` >= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)
#> END) + CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.380251437
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)
#> END) + CASE
#> WHEN (`bill_length_mm` < 44.4000015) THEN 0.286071777
#> WHEN ((`bill_length_mm` >= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)
#> END) + CASE
#> WHEN (`flipper_length_mm` < 203.0) THEN 0.209298179
#> WHEN ((`flipper_length_mm` >= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)
#> END) + LOG(0.44186047 / (1.0 - 0.44186047)))) AS .pred_Adelie
#> 1.0 - (1.0 - 1.0 / (1.0 + EXP(((((CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.627138138
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)
#> END + CASE
#> WHEN (`bill_length_mm` < 43.2999992) THEN 0.425288886
#> WHEN ((`bill_length_mm` >= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)
#> END) + CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.380251437
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)
#> END) + CASE
#> WHEN (`bill_length_mm` < 44.4000015) THEN 0.286071777
#> WHEN ((`bill_length_mm` >= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)
#> END) + CASE
#> WHEN (`flipper_length_mm` < 203.0) THEN 0.209298179
#> WHEN ((`flipper_length_mm` >= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)
#> END) + LOG(0.44186047 / (1.0 - 0.44186047))))) AS .pred_not_Adelie
#> CASE
#> WHEN (`.pred_Adelie` > (0.5 + 0.2)) THEN 'Adelie'
#> WHEN (`.pred_Adelie` < (0.5 - 0.2)) THEN 'not_Adelie'
#> ELSE '[EQ]'
#> END AS .pred_class
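
To see how the query arrives at a probability, the .pred_Adelie expression can be evaluated by hand. This is a sketch in base R with the constants copied from the output above; the input values (bill_length_mm = 39.1, flipper_length_mm = 181) are illustrative and chosen so that every tree takes its first branch. It uses R's natural-log log(), matching how the R expressions in the orbital object are evaluated.

```r
# Illustrative inputs (an assumption for this example)
bill_length_mm <- 39.1
flipper_length_mm <- 181

# Sum of the five depth-1 trees, one ifelse() per CASE expression
tree_sum <-
  ifelse(bill_length_mm < 42.4000015, 0.627138138, -0.449751347) +
  ifelse(bill_length_mm < 43.2999992, 0.425288886, -0.398178101) +
  ifelse(bill_length_mm < 42.4000015, 0.380251437, -0.306771189) +
  ifelse(bill_length_mm < 44.4000015, 0.286071777, -0.330096036) +
  ifelse(flipper_length_mm < 203.0, 0.209298179, -0.348002464)

# Constant offset from the query, then the logistic transformation
base_margin <- log(0.44186047 / (1 - 0.44186047))
pred_adelie <- 1 - 1 / (1 + exp(tree_sum + base_margin))
round(pred_adelie, 3)
#> [1] 0.845
```

The result agrees with the 0.845 that predict() printed earlier for most of the first rows.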

Acknowledgements

A big thank you to all the people who have contributed to orbital since the previous release:

@EmilHvitfeldt, @frankiethull, @jeroenjanssens, and @topepo.

tidymodels & xgboost

Emil Hvitfeldt — Mon, 15 Dec 2025 00:00:00 +0000

The xgboost library has recently gotten a big CRAN release, jumping from version 1.7.11.1 to 3.1.2.1. We at the tidymodels team have been following the development and have done our best to ensure that your experience is unaffected by this release.

In addition to all the new features and improvements that are now available for users relying on CRAN versions of packages, there are also a few breaking changes, specifically between version 1.x and 2.x of the xgboost library. The xgboost team has kindly provided a migration guide for how to update your code if you are upgrading from before version 2.x.

If you are using xgboost purely through tidymodels via functions like parsnip::boost_tree() and embed::step_discretize_xgb(), you should not need to change anything, as we have updated our packages to work with both the new and old versions of xgboost. If you are having any issues, please let us know by filing an issue for the affected package.

We look forward to integrating parsnip more deeply into these new changes, such as support for categorical predictors and quantile regression.

Here are the packages that we’ve updated or helped the maintainers update.

tidypredict 1.0.0

Emil Hvitfeldt — Wed, 10 Dec 2025 00:00:00 +0000

We’re tickled pink to announce the release of version 1.0.0 of tidypredict. The main goal of tidypredict is to enable running predictions inside databases. It reads the model, extracts the components needed to calculate the prediction, and then creates an R formula that can be translated into SQL.

You can install it from CRAN with:

install.packages("tidypredict")

This blog post highlights the most important changes in this release, including faster computations for tree-based models, more efficient tree representations, glmnet model support, and a change in how random forests are handled. You can see a full list of changes in the release notes.

library(tidypredict)

Improved output for random forest models

In previous versions of tidypredict, tidypredict_fit() would return a list of expressions, one for each tree, when applied to random forest models. This didn’t align with what is returned by other types of models. In version 1.0.0, this has been changed to produce a single, combined expression that reflects how predictions should be made.

This is technically a breaking change, but one we believe is worthwhile, as it provides a more consistent output for tidypredict_fit() and hides the technical details about how to combine trees from different packages.
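
To make the difference concrete, here is a small base R sketch of what combining per-tree expressions into one expression can look like. It is not tidypredict's internal code: the two toy trees are made up, and simple averaging is assumed, as in a regression forest.

```r
# Two made-up tree expressions, standing in for the old list output
tree_exprs <- list(
  quote(ifelse(cyl <= 4, 27, 17)),
  quote(ifelse(wt < 3, 25, 16))
)

# Add the trees together, then divide by the number of trees
summed <- Reduce(function(a, b) call("+", a, b), tree_exprs)
combined <- call("/", summed, length(tree_exprs))

# The single combined expression can be evaluated like any other
pred <- eval(combined, list(cyl = 4, wt = 3.5))
pred
#> [1] 21.5
```

With a single expression, the caller no longer needs to know how a given package combines its trees.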

Faster parsing of trees

The parsing of xgboost, partykit, and ranger models should now be substantially faster than before. Examples have been shown to be 10 to 200 times faster. Please note that larger models, more trees, and deeper trees still take some time to parse.

More efficient tree expressions

All trees, whether they are a single tree or part of a collection of trees, such as in boosted trees or random forests, are encoded as case_when() statements by tidypredict. For example, consider the following tree.

model <- partykit::ctree(mpg ~ am + cyl, data = mtcars)
model
#> 
#> Model formula:
#> mpg ~ am + cyl
#> 
#> Fitted party:
#> [1] root
#> |   [2] cyl <= 4: 26.664 (n = 11, err = 203.4)
#> |   [3] cyl > 4
#> |   |   [4] cyl <= 6: 19.743 (n = 7, err = 12.7)
#> |   |   [5] cyl > 6: 15.100 (n = 14, err = 85.2)
#> 
#> Number of inner nodes:    2
#> Number of terminal nodes: 3

Previously, this tree would be turned into the following case_when() statement.

case_when(
  cyl <= 4 ~ 26.6636363636364,
  cyl <= 6 & cyl > 4 ~ 19.7428571428571,
  cyl > 6 & cyl > 4 ~ 15.1
)

With this new update, we have taken advantage of the .default argument whenever possible, which should lead to faster predictions, as we no longer need to calculate redundant conditionals.

tidypredict_fit(model)
#> case_when(cyl <= 4 ~ 26.6636363636364, cyl <= 6 & cyl > 4 ~ 19.7428571428571, 
#>     .default = 15.1)
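
As a quick check, both forms can be evaluated on mtcars with dplyr to confirm that they give the same predictions. This is a sketch that assumes dplyr 1.1.0 or later (where case_when() gained the .default argument); the expressions are copied from the two versions of the tree shown in this post.

```r
library(dplyr)

# Every leaf spelled out as its own condition
explicit <- with(mtcars, case_when(
  cyl <= 4 ~ 26.6636363636364,
  cyl <= 6 & cyl > 4 ~ 19.7428571428571,
  cyl > 6 & cyl > 4 ~ 15.1
))

# The last leaf expressed through .default
with_default <- with(mtcars, case_when(
  cyl <= 4 ~ 26.6636363636364,
  cyl <= 6 & cyl > 4 ~ 19.7428571428571,
  .default = 15.1
))

identical(explicit, with_default)
#> [1] TRUE
```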

Glmnet support

We now support the glmnet package. This package provides generalized linear models with lasso or elasticnet regularization.

The primary restriction when using a glmnet model with tidypredict is that the model must have been fitted with the lambda argument set to a single value.

model <- glmnet::glmnet(mtcars[, -1], mtcars$mpg, lambda = 0.01)

tidypredict_fit(model)
#> 13.0081464696679 + (cyl * -0.0773532164346008) + (disp * 0.00969507138358544) + 
#>     (hp * -0.0192462098902709) + (drat * 0.816753237688302) + 
#>     (wt * -3.41564341709663) + (qsec * 0.758580151032383) + (vs * 
#>     0.277874296242861) + (am * 2.47356523820533) + (gear * 0.645144527527598) + 
#>     (carb * -0.300886812079305)

glmnet() computes a collection of models using many sets of penalty values. This can be very efficient, but for tidypredict, we need to predict with a single penalty. Note how, as we increase the penalty, the extracted expression correctly removes terms with coefficients of 0 instead of leaving them as (disp * 0).

model <- glmnet::glmnet(mtcars[, -1], mtcars$mpg, lambda = 1)

tidypredict_fit(model)
#> 35.3137765116027 + (cyl * -0.871451193824228) + (hp * -0.0101173960249783) + 
#>     (wt * -2.59443677687505)
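
The idea of dropping zero coefficients can be sketched in base R. This is an illustration only, not how tidypredict builds its expressions: the coefficient vector is hand-written, with values rounded from the output above plus an explicit zero for disp.

```r
# Illustrative coefficients (rounded; disp is shrunk to exactly 0)
coefs <- c("(Intercept)" = 35.31, cyl = -0.87, disp = 0, hp = -0.01, wt = -2.59)

# Keep only the predictors with non-zero coefficients
nonzero <- coefs[coefs != 0 & names(coefs) != "(Intercept)"]
terms <- paste0("(", names(nonzero), " * ", nonzero, ")")

# Assemble the formula text and turn it into an R expression
formula_text <- paste(c(coefs[["(Intercept)"]], terms), collapse = " + ")
fit_expr <- parse(text = formula_text)[[1]]
formula_text

# Evaluate the expression for one hypothetical car
pred <- eval(fit_expr, list(cyl = 4, hp = 100, wt = 2))
```

Because disp never enters the expression, it also never has to be read from the database table.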

tidypredict is used as the primary parser for models employed by the orbital package. This means that all the changes seen in this post also take effect when using orbital with tidymodels workflows, such as when using parsnip::linear_reg() with engine = "glmnet".

Acknowledgements

A big thank you to all the folks who helped make this release happen: @EmilHvitfeldt and @jeroenjanssens.

Two New tidymodels Packages

Frances Lin — Sat, 22 Nov 2025 00:00:00 +0000

We’re very chuffed to announce the release of two new modeling packages: filtro and important.

You can install them from CRAN with:

install.packages(c("filtro", "important"))

This blog post will introduce both.

filtro

Feature selection is an important step in building machine learning models that are robust and reliable. By keeping only the most relevant predictors, we can reduce overfitting, improve model performance, and speed up computation.

filtro is a set of low-level tidy tools designed for filter-based supervised feature selection. filtro makes it easy to score, rank, and select features using a wide range of statistical and model-based metrics. The scoring metrics include p-values, correlation, random forest feature importance, information gain, and more.

With filtro, we can quickly rank the variables and select either the top proportion or the top number of features that best contribute to our model. It also supports multi-parameter optimization via desirability functions. filtro is a standalone tool, but it integrates with other packages, allowing it to be used within tidymodels workflows.

Currently, filtro implements a total of six filters. Like other elements of the framework, filtro is also extensible if you want to use a score we haven’t implemented yet. You can read more on how to do this on tidymodels.org.

The available score class objects are:

##  [1] "score_aov_fstat"          "score_aov_pval"          
##  [3] "score_cor_pearson"        "score_cor_spearman"      
##  [5] "score_gain_ratio"         "score_imp_rf"            
##  [7] "score_imp_rf_conditional" "score_imp_rf_oblique"    
##  [9] "score_info_gain"          "score_roc_auc"           
## [11] "score_sym_uncert"         "score_xtab_pval_chisq"   
## [13] "score_xtab_pval_fisher"

Let’s look at an example. Kuhn and Johnson (2013) described a data set where 176 samples were collected from a chemical manufacturing process. The goal is to predict process yield. Predictors are continuous, count, and categorical; some are correlated, and some contain missing values.

Let’s create an initial split of the data (which are in the modeldata package):

library(tidymodels)
library(filtro)

set.seed(1)
yield_split <- initial_split(modeldata::chem_proc_yield)
yield_split

## 
## <132/44/176>

yield_train <- training(yield_split)
yield_test <- testing(yield_split)

We’d like to estimate the strength of the relationship between these 57 predictors and the process yield. We’ll quantify that in two ways. First is the old-fashioned Spearman rank correlation statistic. We can estimate these values and rank them by the absolute value of the correlations. We can also measure their value using a random forest variable importance. One quality of the predictors is that their values are correlated, so there may be some value in using an oblique random forest model. This creates a collection of tree-based models with splits that are linear combinations of the selected predictors.

To estimate the scores, we use the score objects contained in the package along with the fit() method:

yield_rank_res <-
  score_cor_spearman |>
  fit(yield ~ ., data = yield_train)

# The object contains the statistics:
yield_rank_res@results |> 
  arrange(desc(abs(score)))

## # A tibble: 57 × 4
##    name          score outcome predictor      
##                           
##  1 cor_spearman  0.655 yield   man_proc_32    
##  2 cor_spearman -0.537 yield   man_proc_36    
##  3 cor_spearman  0.519 yield   bio_material_03
##  4 cor_spearman  0.502 yield   bio_material_06
##  5 cor_spearman  0.491 yield   man_proc_09    
##  6 cor_spearman  0.478 yield   bio_material_02
##  7 cor_spearman  0.446 yield   man_proc_33    
##  8 cor_spearman  0.421 yield   bio_material_12
##  9 cor_spearman -0.420 yield   man_proc_13    
## 10 cor_spearman  0.412 yield   bio_material_04
## # ℹ 47 more rows
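
For intuition, the same kind of ranking can be reproduced with base R alone. The sketch below is not filtro's implementation; it uses mtcars as a stand-in data set, with mpg playing the role of the outcome, and stats::cor() with method = "spearman".

```r
# Stand-in data: treat mpg as the outcome and everything else as predictors
predictors <- setdiff(names(mtcars), "mpg")

# Spearman rank correlation between each predictor and the outcome
scores <- sapply(
  predictors,
  function(p) cor(mtcars[[p]], mtcars$mpg, method = "spearman")
)

# Rank by absolute value, strongest relationship first
ranked <- sort(abs(scores), decreasing = TRUE)
head(ranked, 3)
```

filtro wraps this pattern in score objects so that different scores share the same fit() interface and result format.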

To score via a random forest model, we only need to switch out the score object:

yield_rf_res <-
  score_imp_rf_oblique |>
  fit(yield ~ ., data = yield_train)

yield_rf_res@results |> 
  arrange(desc(abs(score)))

## # A tibble: 57 × 4
##    name            score outcome predictor      
##                             
##  1 imp_rf_oblique 0.128  yield   man_proc_32    
##  2 imp_rf_oblique 0.0697 yield   man_proc_36    
##  3 imp_rf_oblique 0.0670 yield   man_proc_17    
##  4 imp_rf_oblique 0.0644 yield   man_proc_09    
##  5 imp_rf_oblique 0.0612 yield   man_proc_13    
##  6 imp_rf_oblique 0.0446 yield   bio_material_03
##  7 imp_rf_oblique 0.0315 yield   man_proc_33    
##  8 imp_rf_oblique 0.0263 yield   man_proc_11    
##  9 imp_rf_oblique 0.0263 yield   bio_material_04
## 10 imp_rf_oblique 0.0262 yield   bio_material_06
## # ℹ 47 more rows

We should probably combine the scores and do a joint ranking. To combine the two sets of statistics:

class_score_list <- list(yield_rank_res, yield_rf_res) |>
  bind_scores()

class_score_list

## # A tibble: 57 × 4
##    outcome predictor       cor_spearman imp_rf_oblique
##                                   
##  1 yield   bio_material_01        0.404        0.0178 
##  2 yield   bio_material_02        0.478        0.0190 
##  3 yield   bio_material_03        0.519        0.0446 
##  4 yield   bio_material_04        0.412        0.0263 
##  5 yield   bio_material_05        0.116        0.00639
##  6 yield   bio_material_06        0.502        0.0262 
##  7 yield   bio_material_07       -0.101        0.00151
##  8 yield   bio_material_08        0.369        0.00714
##  9 yield   bio_material_09        0.109        0.0122 
## 10 yield   bio_material_10        0.214        0.00998
## # ℹ 47 more rows

We can accomplish a joint ranking via desirability functions. Here, we set goals for each score (i.e., maximize, minimize, etc.). The algorithm rescales their values and uses a geometric mean for an overall ranking. The desirability2 package has some nice tools for this. Here’s how we do it:

library(desirability2)
class_score_list |>
  show_best_desirability_prop(
    maximize(cor_spearman, low = 0.25, high = 1),
    maximize(imp_rf_oblique, scale = 2)
  ) |> 
  arrange(desc(.d_overall)) |> 
  select(-starts_with(".d_max_"))

## # A tibble: 57 × 5
##    outcome predictor       cor_spearman imp_rf_oblique .d_overall
##                                         
##  1 yield   man_proc_32            0.655         0.128      0.735 
##  2 yield   man_proc_09            0.491         0.0644     0.291 
##  3 yield   bio_material_03        0.519         0.0446     0.217 
##  4 yield   man_proc_33            0.446         0.0315     0.134 
##  5 yield   bio_material_06        0.502         0.0262     0.129 
##  6 yield   bio_material_04        0.412         0.0263     0.104 
##  7 yield   bio_material_02        0.478         0.0190     0.0926
##  8 yield   bio_material_01        0.404         0.0178     0.0719
##  9 yield   bio_material_11        0.381         0.0194     0.0714
## 10 yield   man_proc_12            0.391         0.0183     0.0705
## # ℹ 47 more rows

Using the scale = 2 option puts more weight on the random forest results.
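
The overall value for the top predictor can be reproduced by hand. The following base R sketch uses a hypothetical d_max() helper rather than desirability2 itself, and it assumes that the importance score is rescaled so that the largest observed value maps to a desirability of 1 (man_proc_32 has the largest importance in the table above, so the lower limit does not affect its result).

```r
# Hypothetical helper: linear "larger is better" desirability with a
# scale exponent, clipped to the [0, 1] range
d_max <- function(x, low, high, scale = 1) {
  d <- (x - low) / (high - low)
  d <- pmin(pmax(d, 0), 1)
  d^scale
}

# man_proc_32: Spearman correlation with the limits used above
d_cor <- d_max(0.655, low = 0.25, high = 1)

# man_proc_32: importance equals the assumed upper limit; low = 0 is an
# arbitrary placeholder that does not matter when x equals high
d_imp <- d_max(0.128, low = 0, high = 0.128, scale = 2)

# Overall desirability is the geometric mean of the two
d_overall <- sqrt(d_cor * d_imp)
round(d_overall, 3)
#> [1] 0.735
```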

It is unlikely that users will work with filtro directly; it is much better to incorporate these feature selection tools inside a model workflow (as we will see below).

Now that we’ve looked at filtro, next up is the important package (yes, this is what we named it).

important

The important package does two things. First, it provides yet another tool for calculating random forest-like permutation importance scores. We highly value other packages that perform these same calculations (such as DALEX and vip). Our rationale for creating another package for this is that we’ve developed interfaces for censored regression, including dynamic metrics such as Brier scores or ROC curves that evaluate models at a specific time point. These dynamic methods aren’t available in other packages, and the peculiarities of these metrics make them difficult to incorporate into existing frameworks.

Other niceties about importance scores are that any metric from the yardstick package can be used, and we have optimized parallel processing for the underlying computations. For the latter feature, we support the future and mirai packages for parallel processing.

important also has three recipe steps for supervised feature selection (similar to what Steven Pawley did with his colino package). The steps are:

Let’s look at the last one, which mirrors our analysis above.

library(important)
goals <-
  desirability(
    maximize(cor_spearman, low = 0.25, high = 1),
    maximize(imp_rf_oblique, scale = 2)
  )

yield_rec <-
  recipe(yield ~ ., data = yield_train) |>
  step_impute_knn(all_predictors(), neighbors = 10) |>
  step_predictor_desirability(
    all_predictors(),
    score = goals,
    prop_terms = 1 / 10
  )
yield_rec

## 
## ── Recipe ───────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 57
## 
## ── Operations
## • K-nearest neighbor imputation for: all_predictors()
## • Feature selection via desirability functions (`cor_spearman`
##   and `imp_rf_oblique`) on: all_predictors()

When combined with a specific model, we can tune the number of neighbors as well as the proportion of predictors retained (10% above).

prep() will do the appropriate estimation steps:

trained_rec <- prep(yield_rec)

Which 10% of the predictors were retained? The tidy() method can list the scores and their rankings:

scores <- tidy(trained_rec, number = 2)
scores |>
  arrange(desc(.d_overall)) |>
  select(-starts_with(".d_max_"), -id)

## # A tibble: 57 × 5
##    terms           removed cor_spearman imp_rf_oblique .d_overall
##                                         
##  1 man_proc_32     FALSE          0.655         0.128       0.735
##  2 man_proc_36     FALSE         -0.530         0.0668      0.325
##  3 man_proc_09     FALSE          0.491         0.0673      0.304
##  4 man_proc_13     FALSE         -0.420         0.0725      0.275
##  5 bio_material_03 FALSE          0.519         0.0517      0.249
##  6 bio_material_06 TRUE           0.502         0.0445      0.210
##  7 man_proc_17     TRUE          -0.303         0.0749      0.158
##  8 man_proc_33     TRUE           0.443         0.0374      0.156
##  9 bio_material_02 TRUE           0.478         0.0330      0.151
## 10 bio_material_04 TRUE           0.412         0.0347      0.133
## # ℹ 47 more rows

# What percentage was removed?
mean(scores$removed * 100)

## [1] 91.22807
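
That figure is consistent with the table: five of the 57 predictors are shown with removed = FALSE. A quick arithmetic check in base R, using only the counts visible in the output above:

```r
n_predictors <- 57
n_retained <- 5  # rows with removed = FALSE in the tidy() output

# Percentage of predictors dropped by the selection step
pct_removed <- 100 * (n_predictors - n_retained) / n_predictors
round(pct_removed, 5)
#> [1] 91.22807
```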

Summary

Both filtro and important deliver a feature for tidymodels that has been highly ranked in our user surveys: supervised feature selection. filtro contains the underlying framework, and important provides recipe steps that can be used in a workflow.

Q3 2025 tidymodels digest

Emil Hvitfeldt — Tue, 18 Nov 2025 00:00:00 +0000

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused.

Since our last update we have had some larger releases that you can read about in these posts.

This post will update you on which packages have changed and the improvements you should know about that haven’t been covered in the above posts.

Here’s a list of the packages and their News sections:

Let’s look at a few specific updates.

Quiet linear svm models

When you used to fit a linear SVM model, you would get a message that you were not able to avoid.

library(parsnip)
library(modeldata)

res <- 
  svm_linear(mode = "classification", engine = "kernlab") |> 
  fit(Class ~ ., data = two_class_dat)
#>  Setting default kernel parameters

This message by itself was not that useful, and there was no reasonable way to turn it off. We have silenced this message to hopefully alleviate some of the noise that came from using this method.

library(parsnip)
library(modeldata)
#> 
#> Attaching package: 'modeldata'
#> The following object is masked from 'package:datasets':
#> 
#>     penguins

res <- 
  svm_linear(mode = "classification", engine = "kernlab") |> 
  fit(Class ~ ., data = two_class_dat)
res
#> parsnip model object
#> 
#> Support Vector Machine object of class "ksvm" 
#> 
#> SV type: C-svc  (classification) 
#>  parameter : cost C = 1 
#> 
#> Linear (vanilla) kernel function. 
#> 
#> Number of Support Vectors : 361 
#> 
#> Objective Function Value : -357.1487 
#> Training error : 0.178255 
#> Probability model included.

Fewer numeric overflow issues in brulee

The brulee package has been improved to try to help avoid numeric overflow in the loss functions. The following things have been done to help deal with this type of issue.

Starting values were transitioned to using Gaussian distribution (instead of uniform) with a smaller standard deviation.
The results always contain the initial results to use as a fallback if there is overflow during the first epoch.
brulee_mlp() has two additional parameters, grad_value_clip and grad_norm_clip, that prevent issues.
The warning was changed to “Early stopping occurred at epoch {X} due to numerical overflow of the loss function.”

Additional torch optimizers in brulee

Several additional optimizers have been added: "ADAMw", "Adadelta", "Adagrad", and "RMSprop". Previously, the options were "SGD" and "LBFGS".

Acknowledgements

We want to sincerely thank everyone who contributed to these packages since their previous versions:

dials: @brendad8, @hfrick, @topepo, and @Wander03.
parsnip: @chillerb, @EmilHvitfeldt, @jmgirard, @topepo, and @ZWael.
rsample: @abichat, @hfrick, @mkiang, and @vincentarelbundock.
recipes: @EmilHvitfeldt, @SimonDedman, and @topepo.
probably: @abichat, @ayueme, @dchiu911, @EmilHvitfeldt, @frankiethull, @gaborcsardi, @hfrick, @Jeffrothschild, @jgaeb, @jrwinget, @mark-burdon, @martinhulin, @simonpcouch, @teunbrand, @topepo, @wjakethompson, and @yellowbridge.
brulee: @genec1, @talegari, and @topepo.

tune version 2.0.0

Max Kuhn — Wed, 05 Nov 2025 00:00:00 +0000

We’re very chuffed to announce the release of tune 2.0.0. tune is a package that can be used to resample models and/or optimize their tuning parameters.

You can install it from CRAN with:

install.packages("tune")

This blog post will describe the two major updates to the package: new parallel processing features and postprocessing. You can see a full list of changes in the release notes.

Using future or mirai for parallel processing

Historically, we’ve used the foreach package to run calculations in parallel. Sadly, that package is no longer under active development. We’ve been progressively moving away from it, and as of this version, it is deprecated. In its place, we’ve added functionality for the future and mirai packages.

Previously, you would load a foreach parallel backend package, such as doParallel, doMC, or doFuture, and then register it. For example:

library(doParallel)
cl <- makePSOCKcluster()
registerDoParallel(cl)

Instead, you can use the future package via:

library(future)
plan("multisession")

or the mirai package by using

library(mirai)
daemons(num_cores)

Each of these is configurable to run in various ways, such as on remote servers.

tidymodels.org and the tune pkgdown site have more information to help users switch away from foreach.
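
As a fuller sketch of the future route, the snippet below picks a worker count, starts the workers, and returns to sequential processing afterwards. The rule of leaving one core free is just a common convention assumed here, and num_cores is the same placeholder name used above; with mirai, calling daemons(0) plays the corresponding cleanup role.

```r
library(future)

# Leave one core free; fall back to 1 if the core count is unknown
num_cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)

# Start background R sessions for parallel work
plan(multisession, workers = num_cores)

# ... calls such as tune_grid() or fit_resamples() would go here ...

# Shut the workers down and return to sequential processing
plan(sequential)
```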

Tuning your postprocessor

A postprocessor is an operation that modifies model predictions. For example, if your classifier can separate classes but its probability estimates are not accurate enough, you can add a calibrator operation that can attempt to adjust those probability estimates. Another good example is for binary classifiers, where the default threshold for classifying a prediction as an event can be adjusted based on its corresponding probability estimate.

Currently, we’ve enabled postprocessing using the tailor package. The operations that are currently available are:

adjust_numeric_calibration(): Estimate and apply a calibration model for regression problems.
adjust_numeric_range(): Truncate the range of predictions.
adjust_probability_calibration(): Estimate and apply a calibration model for classification problems.
adjust_probability_threshold(): Convert binary class probabilities to hard class predictions using different thresholds.
adjust_equivocal_zone(): Decline to predict a sample if its strongest class probability is low.
adjust_predictions_custom(): A general mutate()-like adjustment.

If the operations have arguments, these can be tuned in the same way as the preprocessors (e.g., a recipe) or the supervised model. For example, let’s tune the probability threshold for a random forest classifier.

We’ll simulate some data with a class imbalance:

library(tidymodels)

set.seed(296)
sim_data <- sim_classification(2000, intercept = -12)
sim_data |> count(class)

## # A tibble: 2 × 2
##   class       n
##   <fct>   <int>
## 1 class_1   234
## 2 class_2  1766

We’ll resample the data using 10-fold cross-validation:

sim_rs <- vfold_cv(sim_data, strata = class)

We define a tailor object that tags the class probability threshold for optimization:

tlr_spec <- 
  tailor() |> 
  adjust_probability_threshold(threshold = tune())

We also specify a random forest that uses its default tuning parameters:

rf_spec <- rand_forest(mode = "classification")
rf_thrsh_wflow <- workflow(class ~ ., rf_spec, tlr_spec)
rf_thrsh_wflow

## ══ Workflow ════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## Postprocessor: tailor
## 
## ── Preprocessor ────────────────────────────────────────────────────────
## class ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Computational engine: ranger 
## 
## 
## ── Postprocessor ───────────────────────────────────────────────────────
## 
## ── tailor ──────────────────────────────────────────────────────────────
## A binary postprocessor with 1 adjustment:
## 
## • Adjust probability threshold to optimized value.

With a class imbalance, the default 50% threshold yields high specificity but low sensitivity. When we alter the threshold, those numbers will change, and we can select the best trade-off for our application. Let’s tune the workflow:

cls_mtr <- metric_set(brier_class, roc_auc, sensitivity, specificity)

# To run all resamples in parallel:
mirai::daemons(10)

set.seed(985)
rf_thrsh_res <- 
  rf_thrsh_wflow |> 
  tune_grid(
    resamples = sim_rs,
    grid = tibble(threshold = seq(0, 0.6, by = 0.01)),
    metrics = cls_mtr
  )

Let’s visualize the results:

autoplot(rf_thrsh_res) + lims(y = 0:1)

We can see that we can improve sensitivity by reducing the threshold. The rate of decay in specificity is slow compared to the gain in sensitivity until thresholds less than 10% are used. The Brier score is constant over the threshold since it only uses the estimated class probabilities, which are unaffected by the threshold.
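
The pattern described above can be reproduced on a small scale without fitting any model. The sketch below uses simulated probabilities (the data, sample size, and distributions are assumptions, not the results from the tuning run) to show that sensitivity and specificity move with the threshold while a probability-based score such as the Brier score does not.

```r
set.seed(1)
n <- 500

# Simulated truth with a rare event (about 12% of samples)
is_event <- rbinom(n, size = 1, prob = 0.12) == 1

# Simulated probability estimates: events tend to score higher
prob_event <- ifelse(is_event, rbeta(n, 4, 4), rbeta(n, 1, 6))

# The Brier score uses only the probabilities, so it is computed once
brier <- mean((prob_event - as.numeric(is_event))^2)

# Sensitivity and specificity depend on the threshold used for hard classes
thresholds <- c(0.5, 0.3, 0.1)
sweep <- data.frame(
  threshold = thresholds,
  sensitivity = sapply(thresholds, function(thr) mean(prob_event[is_event] >= thr)),
  specificity = sapply(thresholds, function(thr) mean(prob_event[!is_event] < thr)),
  brier = brier
)
sweep
```

Because the hard class predictions are derived from fixed probabilities, only the two threshold-dependent columns change from row to row.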

We’ve taken great pains to avoid redundant calculations. In this example, for each resample, a single random forest model is trained, and then the postprocessing grid is evaluated. This conditional execution strategy is used to fit the fewest possible preprocessors, models, and postprocessors.

For this classification example, recent updates to the desirability2 package can enable you to jointly find the best sensitivity/specificity trade-off using the threshold parameter and model calibration/separation using other parameters.

We’ll add more examples and tutorials to tidymodels.org to showcase what we can do with postprocessing.

What’s next

This has been a race towards posit::conf(2025). Our focus had to be on the two big features for this release, since we taught workshops that use them. There are a few other relatively minor issues to address as the year closes.

One is to swap the package that we currently use for Gaussian Processes in Bayesian optimization from the GPfit package to the GauPro package. The former is not actively supported, and the latter has a few features that we’d love to have. Specifically, better kernel methods for non-numeric tuning parameters (e.g., the type of activation function used in neural networks). Hopefully, we’ll have another planned release before the end of the year.

Another near-future development goal is to have comprehensive integration for quantile regression models. We’ve added a few parsnip engines already and will expand the support in yardstick and tune.

Acknowledgements

We’d like to thank everyone who contributed since the previous version: @3styleJam, @Diyar0D, @EmilHvitfeldt, @hfrick, @MatthieuStigler, @MattJEM, @mthulin, @tjburch, and @topepo.

mall 0.2.0

Edgar Ruiz — Tue, 19 Aug 2025 00:00:00 +0000

mall uses Large Language Models (LLMs) to run Natural Language Processing (NLP) operations against your data. This package is available for both R and Python. Version 0.2.0 has been released to CRAN and PyPI, respectively.

In R, you can install the latest version with:

install.packages("mall")

In Python, with:

pip install mlverse-mall

This release expands the number of LLM providers you can use with mall. Also, in Python it introduces the option to run the NLP operations over string vectors, and in R, it enables support for ‘parallelized’ requests.

It is also very exciting to announce a brand new cheatsheet for this package. It is available in print (PDF) and HTML format!

More LLM providers

The biggest highlight of this release is the ability to use external LLM providers such as OpenAI, Gemini, and Anthropic. Instead of writing an integration for each provider one by one, mall uses specialized integration packages as intermediaries.

In R, mall uses the ellmer package to integrate with a variety of LLM providers . To access the new feature, first create a chat connection, and then pass that connection to llm_use(). Here is an example of connecting and using OpenAI:

install.packages("ellmer")

library(mall)
library(ellmer)

chat <- chat_openai()
#> Using model = "gpt-4.1".

llm_use(chat, .cache = "_my_cache")
#> 
#> ── mall session object 
#> Backend: ellmer
#> LLM session: model:gpt-4.1
#> R session: cache_folder:_my_cache

In Python, mall uses chatlas as the integration point with the LLM. chatlas also integrates with several LLM providers . To use, first instantiate a chatlas chat connection class, and then pass that to the Polars data frame via the .llm.use() function:

pip install chatlas

import mall
from chatlas import ChatOpenAI

chat = ChatOpenAI()

data = mall.MallData
reviews = data.reviews

reviews.llm.use(chat)
#> {'backend': 'chatlas', 'chat': 
#> , '_cache': '_mall_cache'}

Connecting mall to external LLM providers introduces a consideration of cost. Most providers charge for the use of their API, so there is a potential that a large table, with long texts, could be an expensive operation.

Parallel requests (R only)

A new feature introduced in ellmer 0.3.0 makes it possible to submit multiple prompts in parallel, rather than in sequence. This makes it faster, and potentially cheaper, to process a table. If the provider supports this feature, ellmer is able to leverage it via the parallel_chat() function. Gemini and OpenAI support the feature.

In the new release of mall, the integration with ellmer has been specially written to take advantage of parallel chat. The internals have been rewritten to submit the NLP-specific instructions as a system message in order to reduce the size of each prompt. Additionally, the cache system has been retooled to support batched requests.

NLP operations without a table

Since its initial version, mall has provided the ability for R users to perform the NLP operations over a string vector, in other words, without needing a table. Starting with the new release, mall also provides this same functionality in its Python version.

mall can process vectors contained in a list object. To use, initialize a new LLMVec class object with either an Ollama model, or a chatlas Chat object, and then access the same NLP functions as the Polars extension.

# Initialize a Chat object
from chatlas import ChatOllama
chat = ChatOllama(model = "llama3.2")

# Pass it to a new LLMVec
from mall import LLMVec
llm = LLMVec(chat)    

Access the functions via the new LLMVec object, and pass the text to be processed.

llm.sentiment(["I am happy", "I am sad"])
#> ['positive', 'negative']

llm.translate(["Este es el mejor dia!"], "english")
#> ['This is the best day!']

For more information visit the reference page: LLMVec

New cheatsheet

The brand new official cheatsheet is now available from Posit: Natural Language processing using LLMs in R/Python. Its main feature is that one side of the page is dedicated to the R version, and the other side to the Python version.

A web page version is also available on the official cheatsheet site. It takes advantage of the tab feature that lets you select between R and Python explanations and examples.

recipes 1.3.0

Emil Hvitfeldt — Mon, 28 Apr 2025 00:00:00 +0000

We’re thrilled to announce the release of recipes 1.3.0. recipes lets you create a pipeable sequence of feature engineering steps.

You can install it from CRAN with:

install.packages("recipes")

This blog post will walk through some of the highlights of this release, which include changes to how strings_as_factors is specified, the deprecation of step_select(), a new contrasts argument for step_dummy(), and improvements to step_impute_bag().

You can see a full list of changes in the release notes .

Let’s first load the package:

library(recipes)

`strings_as_factors`

By default, recipes converts predictor strings to factors, and the option for that was located in prep(). This caused an issue when you wanted to set strings_as_factors = FALSE for a recipe that is used somewhere else, such as in a workflow.

This is no longer an issue as we have moved the argument to recipe() itself. We are at the same time deprecating the use of strings_as_factors when used in prep() . Here is an example:

library(modeldata)
tate_text
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>                                                        
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

We are loading the modeldata package to get tate_text which has a character column title. If we don’t do anything then it turns into a factor.

recipe(~., data = tate_text) |>
  prep() |>
  bake(tate_text)
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>                                                        
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

But we can set strings_as_factors = FALSE in recipe(), and title no longer turns into a factor.

recipe(~., data = tate_text, strings_as_factors = FALSE) |>
  prep() |>
  bake(tate_text)
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>                                                        
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

This change should also make practical sense, as whether you want to turn strings into factors is something that should be encoded into the recipe itself.

Deprecating `step_select()`

We have started the process of deprecating step_select() . Given the number of issues people are having with the step and the fact that it doesn’t play well with workflows we think this is the right call.

There are two main use cases for step_select(): removing variables and selecting variables. Variables were removed by using - in step_select():

recipe(mpg ~ ., mtcars) |>
  step_select(-starts_with("d")) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 9
#>      cyl    hp    wt  qsec    vs    am  gear  carb   mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     6   110  2.62  16.5     0     1     4     4  21  
#>  2     6   110  2.88  17.0     0     1     4     4  21  
#>  3     4    93  2.32  18.6     1     1     4     1  22.8
#>  4     6   110  3.22  19.4     1     0     3     1  21.4
#>  5     8   175  3.44  17.0     0     0     3     2  18.7
#>  6     6   105  3.46  20.2     1     0     3     1  18.1
#>  7     8   245  3.57  15.8     0     0     3     4  14.3
#>  8     4    62  3.19  20       1     0     4     2  24.4
#>  9     4    95  3.15  22.9     1     0     4     2  22.8
#> 10     6   123  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

These use cases can seamlessly be converted to use step_rm() without the - for the same result.

recipe(mpg ~ ., mtcars) |>
  step_rm(starts_with("d")) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 9
#>      cyl    hp    wt  qsec    vs    am  gear  carb   mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     6   110  2.62  16.5     0     1     4     4  21  
#>  2     6   110  2.88  17.0     0     1     4     4  21  
#>  3     4    93  2.32  18.6     1     1     4     1  22.8
#>  4     6   110  3.22  19.4     1     0     3     1  21.4
#>  5     8   175  3.44  17.0     0     0     3     2  18.7
#>  6     6   105  3.46  20.2     1     0     3     1  18.1
#>  7     8   245  3.57  15.8     0     0     3     4  14.3
#>  8     4    62  3.19  20       1     0     4     2  24.4
#>  9     4    95  3.15  22.9     1     0     4     2  22.8
#> 10     6   123  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

For selecting variables there are two cases. The first is as a tool to select which variables to use in our model. We recommend that you use select() to do that before passing the data into the recipe() . This is especially helpful since recipes are tighter with respect to their input types , so only passing the data you need to use is helpful.

If you need to do the selection after another step takes effect you should still be able to do so, by using step_rm() in the following manner.

step_rm(recipe, all_predictors(), -all_of(<variables that you want to keep>))
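
As a concrete version of the template above, here is a small sketch on mtcars. The choice of hp and wt as the variables to keep, and the normalization step placed before the selection, are assumptions made purely for illustration.

```r
library(recipes)

# Hypothetical choice of predictors to keep
keep_vars <- c("hp", "wt")

kept <- recipe(mpg ~ ., data = mtcars) |>
  # Some earlier step that should run on all predictors first
  step_normalize(all_numeric_predictors()) |>
  # Remove every predictor except the ones we want to keep
  step_rm(all_predictors(), -all_of(keep_vars)) |>
  prep() |>
  bake(new_data = NULL)

names(kept)
```

The outcome is untouched because all_predictors() only selects columns with the predictor role.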

`step_dummy()` contrasts argument

Contrasts such as contr.treatment() and contr.poly() are used in step_dummy() to determine how the steps should translate categorical values into one or more numeric columns. Traditionally the contrasts were set using options() like so:

options(contrasts = c(unordered = "contr.poly", ordered = "contr.poly"))

recipe(~species + island, penguins) |>
  step_dummy(all_nominal_predictors()) |>
  prep() |>
  bake(new_data = penguins)
#> # A tibble: 344 × 4
#>    species_Chinstrap species_Gentoo island_Dream island_Torgersen
#>                                              
#>  1            -0.707          0.408        0.707            0.408
#>  2            -0.707          0.408        0.707            0.408
#>  3            -0.707          0.408        0.707            0.408
#>  4            -0.707          0.408        0.707            0.408
#>  5            -0.707          0.408        0.707            0.408
#>  6            -0.707          0.408        0.707            0.408
#>  7            -0.707          0.408        0.707            0.408
#>  8            -0.707          0.408        0.707            0.408
#>  9            -0.707          0.408        0.707            0.408
#> 10            -0.707          0.408        0.707            0.408
#> # ℹ 334 more rows

The issue with this approach is that it pulls from options() when it needs it instead of storing the information. This means that if you put this recipe in production you will need to set the option in the production environment to match that of the training environment.

To fix this issue, we have given step_dummy() a contrasts argument that works in much the same way. You simply specify the contrast you want, and it will be stored in the object for easy deployment.

recipe(~species + island, penguins) |>
  step_dummy(
    all_nominal_predictors(), contrasts = "contr.poly") |>
  prep() |>
  bake(new_data = penguins)
#> # A tibble: 344 × 4
#>    species_Chinstrap species_Gentoo island_Dream island_Torgersen
#>                                              
#>  1            -0.707          0.408        0.707            0.408
#>  2            -0.707          0.408        0.707            0.408
#>  3            -0.707          0.408        0.707            0.408
#>  4            -0.707          0.408        0.707            0.408
#>  5            -0.707          0.408        0.707            0.408
#>  6            -0.707          0.408        0.707            0.408
#>  7            -0.707          0.408        0.707            0.408
#>  8            -0.707          0.408        0.707            0.408
#>  9            -0.707          0.408        0.707            0.408
#> 10            -0.707          0.408        0.707            0.408
#> # ℹ 334 more rows

If you are using a contrast from an external package, such as hardhat::contr_one_hot(), you will need to load that package with library(hardhat) in the environment you are working in and set contrasts = "contr_one_hot". You will also need to call library(hardhat) in any production environment where you use this recipe.
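
To see what a contrast function actually produces, it can help to print the matrices directly in base R. The sketch below compares treatment and polynomial contrasts for a three-level factor; each row is a factor level and each column is one generated numeric column. The rounded values in the first and third rows of the polynomial matrix are consistent with the -0.707, 0.408, and 0.707 entries in the output above.

```r
# Treatment contrasts: one indicator column per non-reference level
contr.treatment(3)

# Polynomial contrasts: linear (.L) and quadratic (.Q) columns
poly3 <- contr.poly(3)
round(poly3, 3)
```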

tidyselect can be used everywhere

Several steps such as step_pls() and step_impute_bag() require the selection of more than just the affected columns. step_pls() needs you to select an outcome variable and step_impute_bag() needs you to select which variables to impute with, impute_with, if you don’t want to use all predictors. Previously these needed to be strings or use special selectors like imp_vars() . You don’t have to do that anymore. You can now use tidyselect in these arguments too.

recipe(mpg ~ ., mtcars) |>
  step_pls(all_predictors(), outcome = mpg) |>
  prep() |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 3
#>      mpg   PLS1   PLS2
#>        
#>  1  21    0.693  0.895
#>  2  21    0.650  0.654
#>  3  22.8  2.78   0.378
#>  4  21.4  0.210 -0.368
#>  5  18.7 -1.95   0.845
#>  6  18.1  0.137 -0.624
#>  7  14.3 -2.77   0.364
#>  8  24.4  1.81  -1.30 
#>  9  22.8  2.12  -1.95 
#> 10  19.2  0.531 -1.51 
#> # ℹ 22 more rows

Arguments that allow multiple selections now also work with recipes selectors like all_numeric_predictors() and has_role().

recipe(mpg ~ ., mtcars) |>
  step_impute_bag(all_predictors(), impute_with = has_role("predictor")) |>
  prep() |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 11
#>      cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   mpg
#>              
#>  1     6  160    110  3.9   2.62  16.5     0     1     4     4  21  
#>  2     6  160    110  3.9   2.88  17.0     0     1     4     4  21  
#>  3     4  108     93  3.85  2.32  18.6     1     1     4     1  22.8
#>  4     6  258    110  3.08  3.22  19.4     1     0     3     1  21.4
#>  5     8  360    175  3.15  3.44  17.0     0     0     3     2  18.7
#>  6     6  225    105  2.76  3.46  20.2     1     0     3     1  18.1
#>  7     8  360    245  3.21  3.57  15.8     0     0     3     4  14.3
#>  8     4  147.    62  3.69  3.19  20       1     0     4     2  24.4
#>  9     4  141.    95  3.92  3.15  22.9     1     0     4     2  22.8
#> 10     6  168.   123  3.92  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

These changes are backwards compatible, meaning that the old ways still work with minimal warnings.

`step_impute_bag()` now takes up less memory

We have another benefit for users of step_impute_bag() . For each variable it imputes on, it fits a bagged tree model, which is then used to predict with for imputation. It was noticed that these models had a larger memory footprint than was needed. This has been remedied, so now there should be a noticeable decrease in size for recipes with step_impute_bag() .

rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_impute_bag(starts_with("Lot_"), impute_with = all_numeric_predictors()) |>
  prep()

lobstr::obj_size(rec)
#> 20.23 MB

This recipe previously took up over 75 MB and now takes up about 20 MB.

Acknowledgements

Many thanks to all the people who contributed to recipes since the last release!

@chillerb , @dshemetov , @EmilHvitfeldt , @kevbaer , @nhward , @regisely , and @topepo .

rsample 1.3.0

Hannah Frick — Thu, 03 Apr 2025 00:00:00 +0000

We’re thrilled to announce the release of rsample 1.3.0. rsample makes it easy to create resamples for assessing model performance. It is part of the tidymodels framework, a collection of R packages for modeling and machine learning using tidyverse principles.

You can install it from CRAN with:

install.packages("rsample")

This blog post will walk you through the more flexible grouping for calculating bootstrap confidence intervals and highlight the contributions made by participants of the tidyverse developer day.

You can see a full list of changes in the release notes .

library(rsample)

Flexible grouping for bootstrap intervals

Resampling allows you to get an understanding of the variability of an estimate, e.g., a summary statistic of your data. If you want to lean on statistical theory and get confidence intervals for your estimate, you can reach for the bootstrap resampling scheme: calculating your summary statistic on the bootstrap samples enables you to calculate confidence intervals around your point estimate.

rsample contains a family of int_*() functions to calculate bootstrap confidence intervals of different flavors: percentile intervals, “BCa” intervals, and bootstrap-t intervals. If you want to dive into the technical details, Chapter 11 of CASI is a good place to start.
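
For intuition, the percentile flavor can be sketched in a few lines of base R. This is a simplified illustration on simulated data (the sample, the statistic, and the number of resamples are assumptions), not how rsample implements int_pctl().

```r
set.seed(42)

# Hypothetical sample: 200 skewed, delivery-time-like values
x <- rexp(200, rate = 1 / 25)

# Resample with replacement and recompute the statistic each time
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# Percentile interval: take quantiles of the bootstrap distribution
alpha <- 0.1
ci <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))
ci
```

The int_*() functions do the equivalent bookkeeping for you, per term and per grouping variable.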

You can calculate the confidence intervals based on a grouping in your data. However, so far, rsample would only let you provide a single grouping variable. With this release, we are extending this functionality to allow a more flexible grouping.

The motivating application for us was to be able to calculate confidence intervals around multiple model performance metrics, including dynamic metrics for time-to-event models which depend on an evaluation time point. So in this case, the metric is one grouping variable and the evaluation time another. But let’s pull back complexity for an example of how the new rsample functionality works!

We have a dataset with delivery times for orders containing one or more items. We’ll do some data wrangling with it, so we are also loading dplyr.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data(deliveries, package = "modeldata")

deliveries
#> # A tibble: 10,012 × 31
#>    time_to_delivery  hour day   distance item_01 item_02 item_03 item_04 item_05
#>                                    
#>  1             16.1  11.9 Thu       3.15       0       0       2       0       0
#>  2             22.9  19.2 Tue       3.69       0       0       0       0       0
#>  3             30.3  18.4 Fri       2.06       0       0       0       0       1
#>  4             33.4  15.8 Thu       5.97       0       0       0       0       0
#>  5             27.2  19.6 Fri       2.52       0       0       0       1       0
#>  6             19.6  13.0 Sat       3.35       1       0       0       1       0
#>  7             22.1  15.5 Sun       2.46       0       0       1       1       0
#>  8             26.6  17.0 Thu       2.21       0       0       1       0       0
#>  9             30.8  16.7 Fri       2.62       0       0       0       0       0
#> 10             17.4  11.9 Sun       2.75       0       2       1       0       0
#> # ℹ 10,002 more rows
#> # ℹ 22 more variables: item_06 , item_07 , item_08 ,
#> #   item_09 , item_10 , item_11 , item_12 , item_13 ,
#> #   item_14 , item_15 , item_16 , item_17 , item_18 ,
#> #   item_19 , item_20 , item_21 , item_22 , item_23 ,
#> #   item_24 , item_25 , item_26 , item_27

Instead of fitting a whole model here, we are calculating a straightforward summary statistic for how much delivery time increases if an item is included in the order. So the item is one grouping factor. As a second one, we are using whether the order was delivered on a weekday or a weekend. Let’s start by making that weekend indicator and reshaping the data to make it easier to calculate our summary statistic.

Note that the name for the weekend indicator column, .weekend, starts with a dot. That is important as it is the convention to signal to rsample that this is an additional grouping variable.

item_data <- deliveries %>%
  mutate(.weekend = ifelse(day %in% c("Sat", "Sun"), "weekend", "weekday")) %>%
  select(time_to_delivery, .weekend, starts_with("item")) %>%
  tidyr::pivot_longer(starts_with("item"), names_to = "item", values_to = "value")

Next, we are making a small function that calculates the ratio of average delivery times with and without the item included in the order, as an estimate of how much a specific item in an order increases the delivery time.

relative_increase <- function(data) {
  data %>%
    mutate(includes_item = value > 0) %>%
    summarize(
      has = mean(time_to_delivery[includes_item]),
      has_not = mean(time_to_delivery[!includes_item]),
      .by = c(item, .weekend)
    ) %>%
    mutate(estimate = has / has_not) %>%
    select(term = item, .weekend, estimate)
}

We can calculate that on our entire dataset.

relative_increase(item_data)
#> # A tibble: 54 × 3
#>    term    .weekend estimate
#>    <chr>   <chr>       <dbl>
#>  1 item_01 weekday      1.07
#>  2 item_02 weekday      1.02
#>  3 item_03 weekday      1.02
#>  4 item_04 weekday      1.00
#>  5 item_05 weekday      1.00
#>  6 item_06 weekday      1.01
#>  7 item_07 weekday      1.03
#>  8 item_08 weekday      1.01
#>  9 item_09 weekday      1.01
#> 10 item_10 weekday      1.06
#> # ℹ 44 more rows

This is fine, but what we really want here is to get confidence intervals around these estimates!

So let’s make bootstrap samples and calculate our statistic on those.

set.seed(1)
item_bootstrap <- bootstraps(item_data, times = 1000)

item_stats <-
  item_bootstrap %>%
  mutate(stats = purrr::map(splits, ~ analysis(.x) %>% relative_increase()))

Now we have everything we need to calculate the confidence intervals, stashed into the tibbles in the stats column: an estimate, a term (the primary grouping variable), and our additional grouping variable .weekend, starting with a dot. What’s left to do is call one of the int_*() functions and specify which column contains the statistics. Here, we’ll calculate percentile intervals with int_pctl() .

item_ci <- int_pctl(item_stats, statistics = stats, alpha = 0.1)
item_ci
#> # A tibble: 54 × 7
#>    term    .weekend .lower .estimate .upper .alpha .method   
#>    <chr>   <chr>     <dbl>     <dbl>  <dbl>  <dbl> <chr>
#>  1 item_01 weekday   1.05      1.07    1.09    0.1 percentile
#>  2 item_01 weekend   1.04      1.07    1.10    0.1 percentile
#>  3 item_02 weekday   1.00      1.02    1.03    0.1 percentile
#>  4 item_02 weekend   0.996     1.01    1.03    0.1 percentile
#>  5 item_03 weekday   1.01      1.02    1.04    0.1 percentile
#>  6 item_03 weekend   0.970     0.990   1.01    0.1 percentile
#>  7 item_04 weekday   0.989     1.00    1.02    0.1 percentile
#>  8 item_04 weekend   0.998     1.02    1.03    0.1 percentile
#>  9 item_05 weekday   0.987     1.00    1.02    0.1 percentile
#> 10 item_05 weekend   0.982     1.00    1.03    0.1 percentile
#> # ℹ 44 more rows

Tidyverse developer day

At the tidyverse developer day after posit::conf, rsample got a lot of love in the form of contributions by various community members. People improved documentation and examples, moved deprecations along, tightened checks to support good practice, and upgraded errors and warnings, both in style and content. None of these changes are flashy new features, but all of them are essential to rsample working well!

So for example, leave-one-out (LOO) cross-validation is not a great choice of resampling scheme in most situations. From Tidy modeling with R :

For anything but pathologically small samples, LOO is computationally excessive, and it may not have good statistical properties.

It was possible, however, to create implicit LOO samples by using vfold_cv() with the number of folds set to the number of rows in the data. With a dev day contribution, this now errors:

vfold_cv(mtcars, v = nrow(mtcars))
#> Error in `vfold_cv()`:
#> ! Leave-one-out cross-validation is not supported by this function.
#> ✖ You set `v` to `nrow(data)`, which would result in a leave-one-out
#>   cross-validation.
#> ℹ Use `loo_cv()` in this case.

This is to make users pause and consider whether this is a good choice for their dataset. If you require LOO, you can still use loo_cv().
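
For completeness, here is a short sketch of the explicit route on the same mtcars data. It only checks the shape of the result: one resample per row, each holding out a single observation.

```r
library(rsample)

# One resample per row: each split holds out a single observation
loo_splits <- loo_cv(mtcars)
nrow(loo_splits)

# The assessment set of any split contains exactly one row
nrow(assessment(loo_splits$splits[[1]]))
```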

Error messages in general have been a focus of ours across various tidymodels packages, and rsample is no exception. We opened a bunch of issues to tackle all of rsample, and all of them got closed! Some of these changes are purely internal, upgrading manual formatting to let the cli package do the work. While the error message in most cases doesn’t look different, the formatting is now a great deal more consistent.

For some error messages, the additional functionality in cli makes it easy to improve readability. This error message used to be one block of text, now it comes as three bullet points.

permutations(mtcars, everything())
#> Error in `permutations()`:
#> ! You have selected all columns to permute.
#> ℹ This effectively reorders the rows in the original data without changing the
#>   data structure.
#> → Please select fewer columns to permute.
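
An error with this kind of bullet structure can be written with cli roughly as follows. This is a sketch only: the function check_not_all_columns() and its arguments are hypothetical and this is not rsample’s source code.

```r
library(cli)

# Hypothetical checker, not rsample's source code
check_not_all_columns <- function(n_selected, n_total) {
  if (n_selected == n_total) {
    cli_abort(c(
      "You have selected all columns to permute.",
      "i" = "This effectively reorders the rows without changing the data structure.",
      ">" = "Please select fewer columns to permute."
    ))
  }
  invisible(TRUE)
}

# Capture the condition instead of stopping the session
err <- tryCatch(check_not_all_columns(11, 11), error = function(e) e)
conditionMessage(err)
```

The names of the character vector ("i", ">") select the bullet style, so the message text itself stays free of manual formatting.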

Changes like these are super helpful to users and developers alike. A big thank you to all the contributors!

Acknowledgements

Many thanks to all the people who contributed to rsample since the last release!

@agmurray , @brshallo , @ccani007 , @dicook , @Dpananos , @EmilHvitfeldt , @gaborcsardi , @gregor-fausto , @hfrick , @JamesHWade , @jttoivon , @krz , @laurabrianna , @malcolmbarrett , @MatthieuStigler , @msberends , @nmercadeb , @PriKalra , @seb09 , @simonpcouch , @topepo , @ZWael , and @zz77zz .

Improved sparsity support in tidymodels

Emil Hvitfeldt — Wed, 19 Mar 2025 00:00:00 +0000

Photo by Oliver Olah on Unsplash

We’re stoked to announce tidymodels now fully supports sparse data from end to end. We have been working on this for over 5 years . This is an extension of the work we have done previously with blueprints, which would carry the data sparsely some of the way.

You will need recipes 1.2.0, parsnip 1.3.0, and workflows 1.2.0 or later for this to work.

What are sparse data?

The term sparse data refers to a data set containing many zeroes. Sparse data appears in all kinds of fields and can be produced by a number of preprocessing methods. We care about sparse data because of how computers store numbers. A 32-bit integer value takes 4 bytes to store. An array of ten 32-bit integers takes 40 bytes, and so on. This happens because each value is written down, zero or not.

A sparse representation instead stores the locations and values of the non-zero entries. Suppose we have the following vector with 20 entries:

c(0, 0, 1, 0, 3, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

It could be represented sparsely using three components: positions = c(3, 5, 8), values = c(1, 3, 7), and length = 20. Now, we have seven values to represent a vector of 20 elements. Since some modeling tasks contain even sparser data, this type of representation starts to show real benefits in terms of execution time and memory consumption.
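The same idea can be sketched in a few lines of base R. This is only an illustration of the positions/values/length representation, not the sparse vector format that tidymodels uses internally; the object names are made up for the example.

```r
dense <- c(0, 0, 1, 0, 3, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# Sparse components: where the non-zeroes are, what they are, and the length.
positions <- which(dense != 0)
values <- dense[positions]
len <- length(dense)

# Rebuild the dense vector from the three components.
rebuilt <- numeric(len)
rebuilt[positions] <- values

identical(rebuilt, dense)
```

Nothing is lost in the round trip, and only the non-zero entries and their locations need to be stored.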

The tidymodels set of packages has undergone several internal changes to allow it to represent data sparsely internally when it would be beneficial. These changes allow you to fit models that contain sparse data faster and more memory efficiently than before. Moreover, it allows you to fit models previously not possible due to them not fitting in memory.
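To get a feel for the memory difference, here is a small sketch using the Matrix package, comparing a simulated matrix stored sparsely and densely. The dimensions and the 1% density are arbitrary assumptions, and the exact byte counts will vary by platform.

```r
library(Matrix)

set.seed(123)
n_row <- 1000
n_col <- 100

# Simulate a matrix where about 1% of the entries are non-zero.
sparse_mat <- rsparsematrix(n_row, n_col, density = 0.01)
dense_mat <- as.matrix(sparse_mat)

# Same values, two storage formats.
dense_bytes <- as.numeric(object.size(dense_mat))
sparse_bytes <- as.numeric(object.size(sparse_mat))

c(dense = dense_bytes, sparse = sparse_bytes)
```

The dense version stores every one of the 100,000 entries, while the sparse version stores only the non-zero values and their locations.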

Sparse matrix support

The first benefit of these changes is that recipe(), prep(), bake(), fit(), and predict() now accept sparse matrices created using the Matrix package.

The permeability_qsar data set from the modeldata package contains quite a lot of zeroes in the predictors, so we will use it as a demonstration. We start by coercing it into a sparse matrix.

library(tidymodels)
library(Matrix)
permeability_sparse <- as(as.matrix(permeability_qsar), "sparseMatrix")

We can now use this sparse matrix in our code the same way as a dense matrix or data frame:

rec_spec <- recipe(permeability ~ ., data = permeability_sparse) |>
  step_zv(all_predictors())

mod_spec <- boost_tree("regression", "xgboost")

wf_spec <- workflow(rec_spec, mod_spec)

Model training has the usual syntax:

wf_fit <- fit(wf_spec, permeability_sparse)

as does prediction:

predict(wf_fit, permeability_sparse)
#> # A tibble: 165 × 1
#>     .pred
#>     <dbl>
#>  1 10.5  
#>  2  1.50 
#>  3 13.1  
#>  4  1.10 
#>  5  1.25 
#>  6  0.738
#>  7 29.3  
#>  8  2.44 
#>  9 36.3  
#> 10  4.31 
#> # ℹ 155 more rows

Note that only some models/engines work well with sparse data. These are all listed at https://www.tidymodels.org/find/sparse/. If the model doesn’t support sparse data, the data will be coerced into the default non-sparse representation and used as usual.

With a few exceptions, it should work like any other data set. However, this approach has two main limitations. The first is that we are limited to regression tasks since the outcome has to be numeric to be part of the sparse matrix.

The second limitation is that it only works with non-formula methods for parsnip and workflows. This means that you can use a recipe with add_recipe() or select variables directly with add_variables() when using a workflow. And you need to use fit_xy() instead of fit() when using a parsnip object by itself.
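For a parsnip model on its own, the non-formula interface might look like the sketch below. The data are simulated and the object names are made up for illustration; it assumes parsnip, Matrix, and xgboost are installed.

```r
library(parsnip)
library(Matrix)

set.seed(1)
# Simulated sparse predictors; column names identify the predictors.
x <- rsparsematrix(nrow = 200, ncol = 5, density = 0.3)
colnames(x) <- paste0("x", 1:5)

# Numeric outcome, since this route is limited to regression.
y <- drop(as.matrix(x) %*% c(1, -2, 0, 0.5, 0)) + rnorm(200, sd = 0.1)

spec <- boost_tree("regression", "xgboost", trees = 10)

# Non-formula interface: predictors and outcome are passed separately.
xy_fit <- fit_xy(spec, x = x, y = y)
preds <- predict(xy_fit, new_data = x)
```

Here the outcome is passed as a separate vector rather than as a column of the sparse matrix.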

If this is of interest, we also have a post on https://www.tidymodels.org/ about using sparse matrices in tidymodels.

Sparse data from recipes steps

Where this sparsity support really starts to shine is when the recipe itself generates sparse data. The relevant steps come in two flavors: sparsity-creating steps and sparsity-preserving steps. Both are listed here: https://www.tidymodels.org/find/sparse/.

Some steps like step_dummy(), step_indicate_na(), and textrecipes::step_tf() will almost always produce a lot of zeroes. We take advantage of that by generating it sparsely when it is beneficial. If these steps end up producing sparse vectors, we want to make sure the sparsity is preserved. A couple of handfuls of steps, such as step_impute_mean() and step_scale(), have been updated to be able to work efficiently with sparse vectors. Both types of steps are detailed in the above-linked list of compatible methods.
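To see why indicator columns are so sparse, here is a base R sketch that uses model.matrix() as a stand-in for a dummy-variable step. The factor is made up for illustration, and unlike the default behavior of step_dummy(), this version keeps one column per level.

```r
# A categorical predictor with 5 levels, 20 observations each.
group <- factor(rep(c("a", "b", "c", "d", "e"), each = 20))

# One indicator column per level (no intercept).
indicators <- model.matrix(~ group - 1)

# Each row has a single 1, so every other entry in that row is zero.
sparsity <- mean(indicators == 0)
sparsity
#> [1] 0.8
```

With more levels the proportion of zeroes only goes up, which is what makes a sparse representation attractive for these steps.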

What this means in practice: if you use a model/engine that supports sparse data and your recipe produces enough sparse data, the steps switch to a new sparse data format (when appropriate) to store the data as the recipe is being processed. Then, if the model can accept sparse objects, we convert the data from our new sparse format to a standard sparse matrix object. This increases performance when possible and preserves existing performance otherwise.

Below is a simple recipe using the ames data set. step_dummy() is applied to all the categorical predictors, leading to a significant amount of zeroes.

rec_spec <- recipe(Sale_Price ~ ., data = ames) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

mod_spec <- boost_tree("regression", "xgboost")

wf_spec <- workflow(rec_spec, mod_spec)

When we go to fit it now, it takes around 125ms and allocates 37.2MB. Before these changes, it would take around 335ms and allocate 67.5MB.

wf_fit <- fit(wf_spec, ames)

We see similar speedups when we predict, with around 20ms and 25.2MB now, compared to around 60ms and 55.6MB before.

predict(wf_fit, ames)
#> # A tibble: 2,930 × 1
#>      .pred
#>      <dbl>
#>  1 208649.
#>  2 115339.
#>  3 148634.
#>  4 239770.
#>  5 190082.
#>  6 184604.
#>  7 208572.
#>  8 177403 
#>  9 261000.
#> 10 198604.
#> # ℹ 2,920 more rows

These improvements are tightly related to memory allocation, which depends on the sparsity of the data set produced by the recipe. This is why it is hard to say how much benefit you will see. We have seen orders of magnitude of improvement, both in terms of time and memory allocation. We have also been able to fit models where previously the data was too big to fit in memory.

Please see the post on tidymodels.org, which goes into more detail about when you are likely to benefit from this and how to change your recipes and workflows to take full advantage of this new feature.

There is also a post on https://www.tidymodels.org/ that goes into a bit more detail about how to use recipes to produce sparse data.

Q1 2025 tidymodels digest

Max Kuhn — Thu, 27 Feb 2025 00:00:00 +0000

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

We’ve sent a steady stream of tidymodels packages to CRAN recently. We usually release them in batches since many of our packages are tightly coupled with one another. Internally, this process is referred to as the “cascade” of CRAN submissions.

The post will update you on which packages have changed and the major improvements you should know about.

Here’s a list of the packages and their News sections:

Let’s look at a few specific updates.

Improvements in errors and warnings

A group effort was made to improve our error and warning messages across many packages. This started with an internal “upkeep week” (which ended up being 3-4 weeks) and concluded at the Tidy Dev Day in Seattle after posit::conf(2024).

The goal was to use new tools in the cli and rlang packages to make messages more informative than they used to be. For example, using:

tidy(pca_extract_trained, number = 3, type = "variances")

used to result in the error message:

Error in `match.arg()`:
! 'arg' should be one of "coef", "variance"

The new system references the function that you called and not the underlying base R function that actually errored. It also suggests a solution:

Error in `tidy()`:
! `type` must be one of "coef" or "variance", not "variances".
i Did you mean "variance"?
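A message of this shape can be produced with rlang::arg_match(), which validates a string argument against its allowed values and reports the calling function. The function describe_pca() below is a hypothetical stand-in, not the actual tidy() method.

```r
library(rlang)

# Hypothetical function with a `type` argument restricted to two choices.
describe_pca <- function(type = c("coef", "variance")) {
  type <- arg_match(type)
  type
}

# A valid value is returned unchanged.
describe_pca("variance")

# A near-miss produces an informative error; capture its message.
msg <- tryCatch(
  describe_pca("variances"),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

Because arg_match() inspects the function's own default values, the list of choices only has to be written once, in the function signature.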

The rlang package created a set of standalone files that contain high-quality type checkers and related functions. This also improves the information that users get from an error. For example, using an inappropriate formula value in fit(linear_reg(), "boop", mtcars), the old message was:

Error in `fit()`:
! The `formula` argument must be a formula, but it is a .

and now you see:

Error in `fit()`:
! `formula` must be a formula, not the string "boop".

This was a lot of work and we still aren’t finished. Two events helped us get as far as we did.

First, Simon Couch made the chores package (its previous name was “pal”), which enabled us to use AI tools to solve small-scope problems, such as converting old rlang error code to use the new cli syntax . I can’t overstate how much of a speed-up this was for us.

Second, at developer day, many external folks pitched in to make pull requests from a list of issues:

Organizing Tidy Dev Day issues.

I love these sessions for many reasons, but mostly because we meet users and contributors to our packages in person and work with them on specific tasks.

There is a lot more to do here; we have a lot of secondary packages that would benefit from these improvements too.

Quantile regression in parsnip

One big update in parsnip was a new modeling mode of "quantile regression". Daniel McDonald and Ryan Tibshirani provided much of the impetus for this work based on their disease modeling framework.

You can generate quantile predictions by first creating a model specification, which includes the quantiles that you want to predict:

library(tidymodels)
tidymodels_prefer()

ames <- 
  modeldata::ames |> 
  mutate(Sale_Price = log10(Sale_Price)) |> 
  select(Sale_Price, Latitude)

quant_spec <- 
  linear_reg() |> 
  set_engine("quantreg") |> 
  set_mode("quantile regression", quantile_levels = c(0.1, 0.5, 0.9))
quant_spec

## Linear Regression Model Specification (quantile regression)
## 
## Computational engine: quantreg

## Quantile levels: 0.1, 0.5, and 0.9.

We’ll add some spline terms via a recipe and fit the model:

spline_rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_spline_natural(Latitude, deg_free = 10)

quant_fit <- 
  workflow(spline_rec, quant_spec) |> 
  fit(data = ames)

quant_fit

## ══ Workflow [trained] ═════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ───────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_spline_natural()
## 
## ── Model ──────────────────────────────────────────────────────────────
## Call:
## quantreg::rq(formula = ..y ~ ., tau = quantile_levels, data = data)
## 
## Coefficients:
##               tau= 0.1    tau= 0.5    tau= 0.9
## (Intercept) 4.71981123  5.07728741  5.25221335
## Latitude_01 1.22409173  0.70928577  0.79000849
## Latitude_02 0.19561816  0.04937750  0.02832633
## Latitude_03 0.16616065  0.02045910  0.14730573
## Latitude_04 0.30583648  0.08489487  0.15595080
## Latitude_05 0.21663212  0.02016258 -0.01110625
## Latitude_06 0.33541228  0.12005254  0.03006777
## Latitude_07 0.47732205  0.09146728  0.17394021
## Latitude_08 0.24028784  0.30450058  0.26144584
## Latitude_09 0.05840312 -0.14733781 -0.11911843
## Latitude_10 1.52800673  0.95994216  1.21750501
## 
## Degrees of freedom: 2930 total; 2919 residual
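The printed call shows that the engine is quantreg::rq(), with all quantile levels passed at once through tau. As a minimal sketch of what that underlying call looks like outside of tidymodels, here is the same function used directly on mtcars; the formula is an arbitrary choice for illustration.

```r
library(quantreg)

# Fit three conditional quantiles of mpg given wt in a single call.
rq_fit <- rq(mpg ~ wt, tau = c(0.1, 0.5, 0.9), data = mtcars)

# One column of coefficients per quantile level.
rq_coefs <- coef(rq_fit)
dim(rq_coefs)
```

Each column of the coefficient matrix corresponds to one quantile level, which matches the layout of the coefficient table printed by the workflow.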

For prediction, tidymodels always returns a data frame with as many rows as the input data set (here: ames). The result for quantile predictions is a special vctrs class:

quant_pred <- predict(quant_fit, ames) 
quant_pred |> slice(1:4)

## # A tibble: 4 × 1
##   .pred_quantile
##        <qtls(3)>
## 1         [5.33]
## 2         [5.33]
## 3         [5.33]
## 4         [5.31]

class(quant_pred$.pred_quantile)

## [1] "quantile_pred" "vctrs_vctr"    "list"

where the output [5.31] shows the middle quantile.

We can expand the set of quantile predictions so that there are three rows for each source row in ames. There’s also an integer column called .row so that we can merge the data with the source data:

quant_pred$.pred_quantile[1]

## 
## [1] [5.33]
## # Quantile levels: 0.1 0.5 0.9

as_tibble(quant_pred$.pred_quantile[1])

## # A tibble: 3 × 3
##   .pred_quantile .quantile_levels  .row
##            <dbl>            <dbl> <int>
## 1           5.08              0.1     1
## 2           5.33              0.5     1
## 3           5.52              0.9     1

Here are the predicted quantile values:

quant_pred$.pred_quantile |> 
  as_tibble() |> 
  full_join(ames |> add_rowindex(), by = ".row") |> 
  arrange(Latitude) |> 
  ggplot(aes(x = Latitude)) + 
  geom_point(data = ames, aes(y = Sale_Price), alpha = 1 / 5) +
  geom_line(aes(y = .pred_quantile, col = format(.quantile_levels)), 
            show.legend = FALSE, linewidth = 1.5) 

10%, 50%, and 90% quantile predictions.

For now, the new mode does not have many engines. We need to implement some performance statistics in the yardstick package before integrating these models into the whole tidymodels ecosystem.

In other news, we’ve added some additional neural network models based on some improvements in the brulee package. Namely, two-layer networks can be tuned for feed-forward networks on tabular data (using torch).

One other improvement has been simmering for a long time: the ability to exploit sparse data structures better. We’ve improved our fit() interfaces for the few model engines that can use sparsely encoded data. There is much more to come on this in a few months, especially around recipes, so stay tuned.

Finally, we’ve created a set of checklists that can be used when creating new models or engines. These are very helpful, even for us, since there are a lot of minutiae to remember.

Parallelism in tune

This was a small maintenance release mostly related to parallel processing. Up to now, tune facilitated parallelism using the foreach package. That package is mature but not actively developed, so we have been slowly moving toward using the future package(s).

The first step in this journey was to keep using foreach internally (while leaning toward future) and to encourage users to stop invoking the foreach package directly and instead load and use the future package.

We’re now moving folks into the second stage. tune will now raise a warning when:

A parallel backend has been registered with foreach, and
No plan() has been specified with future.

This will allow users to transition their existing code to only future and allow us to update existing documentation and training materials.
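In user code, the transition mostly amounts to replacing the foreach backend registration with a future plan. A minimal sketch, assuming two local workers are appropriate for your machine:

```r
library(future)

# Before: a foreach backend, e.g. doParallel::registerDoParallel(cores = 2)
# After: declare how futures should be resolved.
plan(multisession, workers = 2)

n_workers <- nbrOfWorkers()
n_workers

# ... tuning code such as tune_grid() would run here ...

# Return to sequential processing when done.
plan(sequential)
```

Once a plan is set and no foreach backend is registered, the warning described above no longer applies.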

We anticipate that the third stage, removing foreach entirely, will occur sometime before posit::conf(2025) in September.

Things to look forward to

We are working hard on a few major initiatives that we plan on showing off at posit::conf(2025) .

First is integrated support for sparse data. The emphasis is on “data” because users can use a data frame of sparse vectors or the usual sparse matrix format. This is a big deal because it does not force you to convert non-numeric data into a numeric matrix format. Again, we’ll discuss this more in the future, but you should be able to use sparse data frames in parsnip, recipes, tune, etc.

The second initiative is the longstanding goal of adding postprocessing to tidymodels. Just as you can add a preprocessor to a model workflow, you will be able to add a set of postprocessing adjustments to the predictions your model generates. See our previous post for a sneak peek.

Finally, this year’s summer internship focuses on supervised feature selection methods. We’ll also have releases (and probably another package) for these tools.

These should come to fruition (and CRAN) before or around August 2025.

Acknowledgements

We want to sincerely thank everyone who contributed to these packages since their previous versions:

@AlbertoImg , @asb2111 , @balraadjsings , @bcjaeger , @beansrowning , @BrennanAntone , @cheryldietrich , @chillerb , @conarr5 , @corybrunson , @dajmcdon , @davidrsch , @Edgar-Zamora , @EmilHvitfeldt , @gaborcsardi , @gimholte , @grantmcdermott , @grouptheory , @hfrick , @ilaria-kode , @JamesHWade , @jesusherranz , @jkylearmstrong , @joranE , @joscani , @Joscelinrocha , @josho88 , @joshuagi , @JosiahParry , @jrosell , @jrwinget , @KarlKoe , @kscott-1 , @lilykoff , @lionel- , @LouisMPenrod , @luisDVA , @marcelglueck , @marcozanotti , @martaalcalde , @mattwarkentin , @mihem , @mitchellmanware , @naokiohno , @nhward , @npelikan , @obgeneralao , @owenjonesuob , @pbhogale , @Peter4801 , @pgg1309 , @reisner , @rfsaldanha , @rkb965 , @RobLBaker , @RodDalBen , @SantiagoD999 , @shum461 , @simonpcouch , @szimmer , @talegari , @therealjpetereit , @topepo , @walkerjameschris , and @ZWael

orbital 0.3.0

Emil Hvitfeldt — Mon, 13 Jan 2025 00:00:00 +0000

We’re thrilled to announce the release of orbital 0.3.0. orbital lets you predict in databases using tidymodels workflows.

You can install it from CRAN with:

install.packages("orbital")

This blog post will cover the highlights, which are classification support and the new augment method.

You can see a full list of changes in the release notes .

Classification support

The biggest improvement in this version is that orbital() now works for supported classification models. See the vignette for a list of all supported models.

Let’s start by fitting a classification model on the penguins data set, using {xgboost} as the engine.

rec_spec <- recipe(species ~ ., data = penguins) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_zv(all_predictors())

lr_spec <- boost_tree() |>
  set_mode("classification") |>
  set_engine("xgboost")

wf_spec <- workflow(rec_spec, lr_spec)
wf_fit <- fit(wf_spec, data = penguins)

With this fitted workflow object, we can call orbital() on it to create an orbital object.

orbital_obj <- orbital(wf_fit)
orbital_obj
#> 
#> ── orbital Object ──────────────────────────────────────────────────────────────
#> • island = dplyr::if_else(is.na(island), "unknown", island)
#> • sex = dplyr::if_else(is.na(sex), "unknown", sex)
#> • island_Dream = as.numeric(island == "Dream")
#> • island_Torgersen = as.numeric(island == "Torgersen")
#> • sex_male = as.numeric(sex == "male")
#> • sex_unknown = as.numeric(sex == "unknown")
#> • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, bill_l ...
#> • bill_depth_mm = dplyr::if_else(is.na(bill_depth_mm), 17.15117, bill_dep ...
#> • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201, flipp ...
#> • body_mass_g = dplyr::if_else(is.na(body_mass_g), 4202, body_mass_g)
#> • island_Dream = dplyr::if_else(is.na(island_Dream), 0.3604651, island_Dr ...
#> • island_Torgersen = dplyr::if_else(is.na(island_Torgersen), 0.1511628, i ...
#> • sex_male = dplyr::if_else(is.na(sex_male), 0.4883721, sex_male)
#> • sex_unknown = dplyr::if_else(is.na(sex_unknown), 0.03197674, sex_unknow ...
#> • Adelie = 0 + dplyr::case_when((bill_depth_mm < 15.1 | is.na(bill_depth_ ...
#> • Chinstrap = 0 + dplyr::case_when((island_Dream < 0.5 | is.na(island_Dre ...
#> • Gentoo = 0 + dplyr::case_when((bill_depth_mm < 15.95 | is.na(bill_depth ...
#> • .pred_class = dplyr::case_when(Adelie > Chinstrap & Adelie > Gentoo ~ " ...
#> ────────────────────────────────────────────────────────────────────────────────
#> 18 equations in total.

This object contains all the information that is needed to produce predictions, which we can generate with predict().

predict(orbital_obj, penguins)
#> # A tibble: 344 × 1
#>    .pred_class
#>    <chr>
#>  1 Adelie     
#>  2 Adelie     
#>  3 Adelie     
#>  4 Adelie     
#>  5 Adelie     
#>  6 Adelie     
#>  7 Adelie     
#>  8 Adelie     
#>  9 Adelie     
#> 10 Adelie     
#> # ℹ 334 more rows

The main thing to note here is that the orbital package produces character vectors instead of factors. This is done as a unifying approach since many databases don’t have factor types.

Speaking of databases, you can predict() on an orbital object using tables from databases. Below we create an ephemeral in-memory RSQLite database.

library(DBI)
library(RSQLite)

con_sqlite <- dbConnect(SQLite(), dbname = ":memory:")
penguins_sqlite <- copy_to(con_sqlite, penguins, name = "penguins_table")

And we can predict with it like normal. All the calculations are sent to the database for execution.

predict(orbital_obj, penguins_sqlite)
#> # Source:   SQL [?? x 1]
#> # Database: sqlite 3.47.1 []
#>    .pred_class
#>    <chr>
#>  1 Adelie     
#>  2 Adelie     
#>  3 Adelie     
#>  4 Adelie     
#>  5 Adelie     
#>  6 Adelie     
#>  7 Adelie     
#>  8 Adelie     
#>  9 Adelie     
#> 10 Adelie     
#> # ℹ more rows

This works the same with many types of databases .

Classification is different from regression in part because it comes with multiple prediction types. The above example showed the default which is hard classification. You can set the type of prediction you want with the type argument to orbital. For classification models, possible options are "class" and "prob".

orbital_obj_prob <- orbital(wf_fit, type = c("class", "prob"))
orbital_obj_prob
#> 
#> ── orbital Object ──────────────────────────────────────────────────────────────
#> • island = dplyr::if_else(is.na(island), "unknown", island)
#> • sex = dplyr::if_else(is.na(sex), "unknown", sex)
#> • island_Dream = as.numeric(island == "Dream")
#> • island_Torgersen = as.numeric(island == "Torgersen")
#> • sex_male = as.numeric(sex == "male")
#> • sex_unknown = as.numeric(sex == "unknown")
#> • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, bill_l ...
#> • bill_depth_mm = dplyr::if_else(is.na(bill_depth_mm), 17.15117, bill_dep ...
#> • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201, flipp ...
#> • body_mass_g = dplyr::if_else(is.na(body_mass_g), 4202, body_mass_g)
#> • island_Dream = dplyr::if_else(is.na(island_Dream), 0.3604651, island_Dr ...
#> • island_Torgersen = dplyr::if_else(is.na(island_Torgersen), 0.1511628, i ...
#> • sex_male = dplyr::if_else(is.na(sex_male), 0.4883721, sex_male)
#> • sex_unknown = dplyr::if_else(is.na(sex_unknown), 0.03197674, sex_unknow ...
#> • Adelie = 0 + dplyr::case_when((bill_depth_mm < 15.1 | is.na(bill_depth_ ...
#> • Chinstrap = 0 + dplyr::case_when((island_Dream < 0.5 | is.na(island_Dre ...
#> • Gentoo = 0 + dplyr::case_when((bill_depth_mm < 15.95 | is.na(bill_depth ...
#> • .pred_class = dplyr::case_when(Adelie > Chinstrap & Adelie > Gentoo ~ " ...
#> • norm = exp(Adelie) + exp(Chinstrap) + exp(Gentoo)
#> • .pred_Adelie = exp(Adelie) / norm
#> • .pred_Chinstrap = exp(Chinstrap) / norm
#> • .pred_Gentoo = exp(Gentoo) / norm
#> ────────────────────────────────────────────────────────────────────────────────
#> 22 equations in total.

Notice how we can select both "class" and "prob". The predictions now include both hard and soft class predictions.

predict(orbital_obj_prob, penguins)
#> # A tibble: 344 × 4
#>    .pred_class .pred_Adelie .pred_Chinstrap .pred_Gentoo
#>    <chr>              <dbl>           <dbl>        <dbl>
#>  1 Adelie             0.989         0.00554      0.00560
#>  2 Adelie             0.989         0.00554      0.00560
#>  3 Adelie             0.989         0.00554      0.00560
#>  4 Adelie             0.709         0.0245       0.267  
#>  5 Adelie             0.989         0.00554      0.00560
#>  6 Adelie             0.989         0.00554      0.00560
#>  7 Adelie             0.989         0.00554      0.00560
#>  8 Adelie             0.989         0.00554      0.00560
#>  9 Adelie             0.979         0.00549      0.0158 
#> 10 Adelie             0.980         0.00559      0.0148 
#> # ℹ 334 more rows

That works equally well in databases.

predict(orbital_obj_prob, penguins_sqlite)
#> # Source:   SQL [?? x 4]
#> # Database: sqlite 3.47.1 []
#>    .pred_class .pred_Adelie .pred_Chinstrap .pred_Gentoo
#>    <chr>              <dbl>           <dbl>        <dbl>
#>  1 Adelie             0.989         0.00554      0.00560
#>  2 Adelie             0.989         0.00554      0.00560
#>  3 Adelie             0.989         0.00554      0.00560
#>  4 Adelie             0.709         0.0245       0.267  
#>  5 Adelie             0.989         0.00554      0.00560
#>  6 Adelie             0.989         0.00554      0.00560
#>  7 Adelie             0.989         0.00554      0.00560
#>  8 Adelie             0.989         0.00554      0.00560
#>  9 Adelie             0.979         0.00549      0.0158 
#> 10 Adelie             0.980         0.00559      0.0148 
#> # ℹ more rows
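The norm and .pred_* equations in the printed orbital object amount to a softmax over the per-class scores, while .pred_class picks the class with the largest score. Here is a small base R sketch of that arithmetic; the score values are made up for illustration and do not come from the fitted model.

```r
# Hypothetical raw scores for one penguin, one per class.
scores <- c(Adelie = 2.1, Chinstrap = -1.3, Gentoo = -0.8)

# Soft predictions: exponentiate and normalize so they sum to one.
norm <- sum(exp(scores))
probs <- exp(scores) / norm

# Hard prediction: the class with the largest score.
pred_class <- names(scores)[which.max(scores)]

round(probs, 3)
pred_class
```

Because everything is expressed as plain arithmetic and comparisons, the same calculations can be translated to SQL and run inside the database.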

New augment method

The users of tidymodels have found the augment() function to be a handy tool. This function performs predictions and returns them alongside the original data set.

This release adds augment() support for orbital objects.

augment(orbital_obj, penguins)
#> # A tibble: 344 × 8
#>    .pred_class species island    bill_length_mm bill_depth_mm flipper_length_mm
#>                                                  
#>  1 Adelie      Adelie  Torgersen           39.1          18.7               181
#>  2 Adelie      Adelie  Torgersen           39.5          17.4               186
#>  3 Adelie      Adelie  Torgersen           40.3          18                 195
#>  4 Adelie      Adelie  Torgersen           NA            NA                  NA
#>  5 Adelie      Adelie  Torgersen           36.7          19.3               193
#>  6 Adelie      Adelie  Torgersen           39.3          20.6               190
#>  7 Adelie      Adelie  Torgersen           38.9          17.8               181
#>  8 Adelie      Adelie  Torgersen           39.2          19.6               195
#>  9 Adelie      Adelie  Torgersen           34.1          18.1               193
#> 10 Adelie      Adelie  Torgersen           42            20.2               190
#> # ℹ 334 more rows
#> # ℹ 2 more variables: body_mass_g , sex

The function works for most databases, but for technical reasons doesn’t work with all. It has been confirmed not to work in spark databases or arrow tables.

augment(orbital_obj, penguins_sqlite)
#> # Source:   SQL [?? x 8]
#> # Database: sqlite 3.47.1 []
#>    .pred_class species island    bill_length_mm bill_depth_mm flipper_length_mm
#>                                                  
#>  1 Adelie      Adelie  Torgersen           39.1          18.7               181
#>  2 Adelie      Adelie  Torgersen           39.5          17.4               186
#>  3 Adelie      Adelie  Torgersen           40.3          18                 195
#>  4 Adelie      Adelie  Torgersen           NA            NA                  NA
#>  5 Adelie      Adelie  Torgersen           36.7          19.3               193
#>  6 Adelie      Adelie  Torgersen           39.3          20.6               190
#>  7 Adelie      Adelie  Torgersen           38.9          17.8               181
#>  8 Adelie      Adelie  Torgersen           39.2          19.6               195
#>  9 Adelie      Adelie  Torgersen           34.1          18.1               193
#> 10 Adelie      Adelie  Torgersen           42            20.2               190
#> # ℹ more rows
#> # ℹ 2 more variables: body_mass_g , sex

Acknowledgements

A big thank you to all the people who have contributed to orbital since the previous release:

@EmilHvitfeldt , @joscani , @jrosell , @npelikan , and @szimmer .

Introducing mall for R...and Python

Edgar Ruiz — Wed, 30 Oct 2024 00:00:00 +0000

The beginning

A few months ago, while working on the Databricks with R workshop, I came across some of their custom SQL functions. These particular functions are prefixed with “ai_”, and they run NLP with a simple SQL call:

> SELECT ai_analyze_sentiment('I am happy');
  positive

> SELECT ai_analyze_sentiment('I am sad');
  negative

This was a revelation to me. It showcased a new way to use LLMs in our daily work as analysts. To date, I had primarily employed LLMs for code completion and development tasks. However, this new approach focuses on using LLMs directly against our data instead.

My first reaction was to try and access the custom functions via R. With dbplyr we can access SQL functions in R, and it was great to see them work:

orders |>
  mutate(
    sentiment = ai_analyze_sentiment(o_comment)
  )
#> # Source:   SQL [6 x 2]
#>   o_comment                   sentiment
#>                               
#> 1 ", pending theodolites …    neutral  
#> 2 "uriously special foxes …   neutral  
#> 3 "sleep. courts after the …  neutral  
#> 4 "ess foxes may sleep …      neutral  
#> 5 "ts wake blithely unusual … mixed    
#> 6 "hins sleep. fluffily …     neutral
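This works because dbplyr passes function calls it does not recognize through to the database unchanged. A small sketch of that behavior, using a simulated connection so that no Databricks access is needed:

```r
library(dbplyr)

# Translate an R expression to SQL against a simulated connection.
# dbplyr has no translation for `ai_analyze_sentiment()`, so the call
# is passed through to the generated SQL as written.
query <- translate_sql(
  ai_analyze_sentiment(o_comment),
  con = simulate_dbi()
)
query
```

The function itself is only evaluated once the SQL reaches a database that defines it.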

One downside of this integration is that even though accessible through R, we require a live connection to Databricks in order to utilize an LLM in this manner, thereby limiting the number of people who can benefit from it.

According to their documentation, Databricks is leveraging the Llama 3.1 70B model. While this is a highly effective Large Language Model, its enormous size poses a significant challenge for most users’ machines, making it impractical to run on standard hardware.

Reaching viability

LLM development has been accelerating at a rapid pace. Initially, only online Large Language Models (LLMs) were viable for daily use. This sparked concerns among companies hesitant to share their data externally. Moreover, the cost of using LLMs online can be substantial: per-token charges can add up quickly.

The ideal solution would be to integrate an LLM into our own systems, requiring three essential components:

A model that can fit comfortably in memory
A model that achieves sufficient accuracy for NLP tasks
An intuitive interface between the model and the user’s laptop

Until about a year ago, having all three of these elements was nearly impossible. Models capable of fitting in-memory were either inaccurate or excessively slow. However, recent advancements, such as Llama from Meta and cross-platform interaction engines like Ollama, have made it feasible to deploy these models, offering a promising solution for companies looking to integrate LLMs into their workflows.

The project

This project started as an exploration, driven by my interest in leveraging a “general-purpose” LLM to produce results comparable to those from Databricks AI functions. The primary challenge was determining how much setup and preparation would be required for such a model to deliver reliable and consistent results.

Without access to a design document or open-source code, I relied solely on the LLM’s output as a testing ground. This presented several obstacles, including the numerous options available for fine-tuning the model. Even within prompt engineering, the possibilities are vast. To ensure the model was not too specialized or focused on a specific subject or outcome, I needed to strike a delicate balance between accuracy and generality.

Fortunately, after conducting extensive testing, I discovered that a simple “one-shot” prompt yielded the best results. By “best,” I mean that the answers were both accurate for a given row and consistent across multiple rows. Consistency was crucial, as it meant providing answers that were one of the specified options (positive, negative, or neutral), without any additional explanations.

The following is an example of a prompt that worked reliably against Llama 3.2:

>>> You are a helpful sentiment engine. Return only one of the 
... following answers: positive, negative, neutral. No capitalization. 
... No explanations. The answer is based on the following text: 
... I am happy
positive
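
To make the "one prompt per row, one allowed answer back" idea concrete, here is a minimal base R sketch. It only assembles the prompt text and checks a reply against the allowed labels; no model is called. The helper names `build_sentiment_prompt()` and `validate_label()` are hypothetical and are not functions from mall.

```r
# Hypothetical helpers for illustration only; not part of mall.
allowed_labels <- c("positive", "negative", "neutral")

# Build one prompt per element of `text`, mirroring the one-shot prompt above.
build_sentiment_prompt <- function(text, labels = allowed_labels) {
  paste0(
    "You are a helpful sentiment engine. Return only one of the ",
    "following answers: ", paste(labels, collapse = ", "), ". ",
    "No capitalization. No explanations. ",
    "The answer is based on the following text: ", text
  )
}

# Keep a reply only if, after trimming and lower-casing, it is exactly one
# of the allowed labels; anything else (e.g. an explanation) becomes NA.
validate_label <- function(reply, labels = allowed_labels) {
  reply <- trimws(tolower(reply))
  ifelse(reply %in% labels, reply, NA_character_)
}

prompts <- build_sentiment_prompt(c("I am happy", "I am sad"))
checked <- validate_label(c("positive", "Negative ", "It is positive because..."))
```

Treating any reply outside the allowed set as missing is one simple way to enforce the consistency described above; whether to normalize case first is a design choice assumed here.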

As a side note, my attempts to submit multiple rows at once proved unsuccessful. In fact, I spent a significant amount of time exploring different approaches, such as submitting 2 or 10 rows at a time, formatted as JSON or CSV. The results were often inconsistent, and it didn’t seem to accelerate the process enough to be worth the effort.

Once I became comfortable with the approach, the next step was wrapping the functionality within an R package.

The approach

One of my goals was to make the mall package as “ergonomic” as possible. In other words, I wanted to ensure that using the package in R and Python integrates seamlessly with how data analysts use their preferred language on a daily basis.

For R, this was relatively straightforward. I simply needed to verify that the functions worked well with pipes (%>% and |>) and could be easily incorporated into packages like those in the tidyverse:

reviews |> 
  llm_sentiment(review) |> 
  filter(.sentiment == "positive") |> 
  select(review) 
#>                                                               review
#> 1 This has been the best TV I've ever used. Great screen, and sound.

Python, however, is not a native language for me, so I had to adapt my thinking about data manipulation. Specifically, I learned that in Python, objects (like pandas DataFrames) “contain” transformation functions by design.

This insight led me to investigate whether the pandas API allows for extensions, and fortunately, it does! After exploring the possibilities, I decided to start with Polars, which allowed me to extend its API by creating a new namespace. This simple addition lets users easily access the necessary functions:

>>> import polars as pl
>>> import mall
>>> df = pl.DataFrame(dict(x = ["I am happy", "I am sad"]))
>>> df.llm.sentiment("x")
shape: (2, 2)
┌────────────┬───────────┐
│ x          ┆ sentiment │
│ ---        ┆ ---       │
│ str        ┆ str       │
╞════════════╪═══════════╡
│ I am happy ┆ positive  │
│ I am sad   ┆ negative  │
└────────────┴───────────┘

By keeping all the new functions within the llm namespace, it becomes very easy for users to find and utilize the ones they need.

What’s next

I think it will be easier to know what is to come for mall once the community uses it and provides feedback. I anticipate that adding more LLM back ends will be the main request. The other likely enhancement concerns new models: as updated models become available, the prompts may need to be updated for a given model. I experienced this going from Llama 3.1 to Llama 3.2, when one of the prompts needed a tweak. The package is structured so that future tweaks like that will be additions to the package, not replacements of existing prompts, so as to retain backwards compatibility.

This is the first time I have written an article about the history and structure of a project. This particular effort was so unique, because of the R + Python and LLM aspects of it, that I figured it was worth sharing.

If you wish to learn more about mall, feel free to visit its official site: https://mlverse.github.io/mall/

Postprocessing is coming to tidymodels

Simon Couch — Tue, 08 Oct 2024 00:00:00 +0000

We’re bristling with elation to share about a set of upcoming features for postprocessing with tidymodels. Postprocessors refine predictions outputted from machine learning models to improve predictive performance or better satisfy distributional limitations. The developmental versions of many tidymodels core packages include changes to support postprocessors, and we’re ready to share about our work and hear the community’s thoughts on our progress so far.

Postprocessing support with tidymodels hasn’t yet made it to CRAN, but you can install the needed versions of tidymodels packages with the following code.

pak::pak(
  paste0(
    "tidymodels/",
    c("tune", "workflows", "rsample", "tailor")
  )
)

Now, we load packages with those developmental versions installed.

library(tidymodels)
library(probably)
library(tailor)

Existing tidymodels users might have spotted something funky already; who is this tailor character?

Meet tailor👋

The tailor package introduces tailor objects, which compose iterative adjustments to model predictions. tailor is to postprocessing as recipes is to preprocessing; applying your mental model of recipes to tailor should get you a good bit of the way there.

Tool	Applied to...	Initialize with...	Composes...	Train with...	Predict with...
recipes	Training data	`recipe()`	`step_*()`s	`prep()`	`bake()`
tailor	Model predictions	`tailor()`	`adjust_*()`ments	`fit()`	`predict()`

First, users can initialize a tailor object with tailor().

tailor()
#> 
#> ── tailor ──────────────────────────────────────────────────────────────────────
#> A postprocessor with 0 adjustments.

Tailors compose “adjustments,” analogous to steps from the recipes package.

tailor() %>%
  adjust_probability_threshold(threshold = .7)
#> 
#> ── tailor ──────────────────────────────────────────────────────────────────────
#> A binary postprocessor with 1 adjustment:
#> 
#> • Adjust probability threshold to 0.7.

As an example, we’ll apply this tailor to the two_class_example data made available after loading tidymodels.

head(two_class_example)
#>    truth      Class1       Class2 predicted
#> 1 Class2 0.003589243 0.9964107574    Class2
#> 2 Class1 0.678621054 0.3213789460    Class1
#> 3 Class2 0.110893522 0.8891064779    Class2
#> 4 Class1 0.735161703 0.2648382969    Class1
#> 5 Class2 0.016239960 0.9837600397    Class2
#> 6 Class1 0.999275071 0.0007249286    Class1

This data gives the true value of an outcome variable truth as well as predicted probabilities (Class1 and Class2). The hard class predictions, in predicted, are "Class1" if the probability assigned to "Class1" is above .5, and "Class2" otherwise.

The model predicts "Class1" more often than it does "Class2".

two_class_example %>% count(predicted)
#>   predicted   n
#> 1    Class1 277
#> 2    Class2 223

If we wanted the model to predict "Class2" more often, we could increase the probability threshold assigned to "Class1" above which the hard class prediction will be "Class1". In the tailor package, this adjustment is implemented in adjust_probability_threshold(), which can be situated in a tailor object.

tlr <-
  tailor() %>%
  adjust_probability_threshold(threshold = .7)

tlr
#> 
#> ── tailor ──────────────────────────────────────────────────────────────────────
#> A binary postprocessor with 1 adjustment:
#> 
#> • Adjust probability threshold to 0.7.

Tailors must be fitted before they can predict on new data. For adjustments like adjust_probability_threshold(), there’s no training that actually happens at the fit() step besides recording the name and type of relevant variables. For other adjustments, like numeric calibration with adjust_numeric_calibration(), parameters are actually estimated at the fit() stage and separate data should be used to train the postprocessor and evaluate its performance. More on this in Tailors in context.

In this case, though, we can fit() on the whole dataset. The resulting object is still a tailor, but is now flagged as trained.

tlr_trained <- fit(
  tlr,
  two_class_example,
  outcome = truth,
  estimate = predicted,
  probabilities = c(Class1, Class2)
)

tlr_trained
#> 
#> ── tailor ──────────────────────────────────────────────────────────────────────
#> A binary postprocessor with 1 adjustment:
#> 
#> • Adjust probability threshold to 0.7. [trained]

When used with a model workflow via add_tailor(), the arguments to fit() a tailor will be set automatically. Generally, as in recipes, we recommend that users add tailors to model workflows for training and prediction rather than using them standalone for greater ease of use and to prevent data leakage, but tailors are totally functional by themselves, too.

Now, when passed new data, the trained tailor will determine the outputted class based on whether the probability assigned to the level "Class1" is above .7, resulting in more predictions of "Class2" than before.

predict(tlr_trained, two_class_example) %>% count(predicted)
#> # A tibble: 2 × 2
#>   predicted     n
#>   <fct>     <int>
#> 1 Class1      236
#> 2 Class2      264
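
To see what the threshold change does row by row, here is a small base R sketch that recomputes hard class predictions from the six Class1 probabilities printed by head() earlier. It is an illustration of the idea rather than tailor's implementation; in particular, `hard_class()` is a hypothetical helper, and the use of a strict `>` comparison is an assumption.

```r
# Class1 probabilities from the six rows of two_class_example printed above.
class1_prob <- c(0.003589243, 0.678621054, 0.110893522,
                 0.735161703, 0.016239960, 0.999275071)

# Hypothetical helper: predict "Class1" only when its probability
# exceeds the threshold, and "Class2" otherwise.
hard_class <- function(prob, threshold) {
  factor(ifelse(prob > threshold, "Class1", "Class2"),
         levels = c("Class1", "Class2"))
}

# Tabulate the hard predictions at the default and at the raised threshold.
counts_default <- table(hard_class(class1_prob, threshold = 0.5))
counts_raised <- table(hard_class(class1_prob, threshold = 0.7))
```

The second row, with a Class1 probability of about 0.68, is the one that flips from "Class1" to "Class2" once the threshold moves from .5 to .7.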

Changing the probability threshold is one of many possible adjustments available in tailor.

For probabilities: calibration
For transformation of probabilities to hard class predictions: thresholds, equivocal zones
For numeric outcomes: calibration, range

Support for tailors is now plumbed through workflows (via add_tailor()) and tune, and rsample includes a set of infrastructural changes to prevent data leakage behind the scenes. That said, we haven’t yet implemented support for tuning parameters in tailors, but we plan to implement that before this functionality heads to CRAN.

Tailors in context

As an example, let’s model a study of food delivery times in minutes (i.e., the time from the initial order to receiving the food) for a single restaurant. The deliveries data is available upon loading the tidymodels meta-package.

data(deliveries)

# split into training and testing sets
set.seed(1)
delivery_split <- initial_split(deliveries)
delivery_train <- training(delivery_split)
delivery_test  <- testing(delivery_split)

# resample the training set using 10-fold cross-validation
set.seed(1)
delivery_folds <- vfold_cv(delivery_train)

# print out the training set
delivery_train
#> # A tibble: 7,509 × 31
#>    time_to_delivery  hour day   distance item_01 item_02 item_03 item_04 item_05
#>                                    
#>  1             21.2  16.1 Tue       3.02       0       0       0       0       0
#>  2             17.9  12.4 Sun       3.37       0       0       0       0       0
#>  3             22.4  14.2 Fri       2.59       0       0       0       0       0
#>  4             30.9  19.1 Sat       2.77       0       0       0       0       0
#>  5             30.1  16.5 Fri       2.05       0       0       0       1       0
#>  6             35.3  14.7 Sat       4.57       0       0       2       1       1
#>  7             13.1  11.5 Sat       2.09       0       0       0       0       0
#>  8             18.3  13.4 Tue       2.35       0       2       1       0       0
#>  9             25.2  20.5 Sat       2.43       0       0       0       1       0
#> 10             30.7  16.7 Fri       2.24       0       0       0       1       0
#> # ℹ 7,499 more rows
#> # ℹ 22 more variables: item_06 , item_07 , item_08 ,
#> #   item_09 , item_10 , item_11 , item_12 , item_13 ,
#> #   item_14 , item_15 , item_16 , item_17 , item_18 ,
#> #   item_19 , item_20 , item_21 , item_22 , item_23 ,
#> #   item_24 , item_25 , item_26 , item_27

Let’s deliberately define a regression model that has poor predicted values: a boosted tree with only three ensemble members.

delivery_wflow <-
  workflow() %>%
  add_formula(time_to_delivery ~ .) %>%
  add_model(boost_tree(mode = "regression", trees = 3))

Evaluating against resamples:

set.seed(1)
delivery_res <- 
  fit_resamples(
    delivery_wflow, 
    delivery_folds, 
    control = control_resamples(save_pred = TRUE)
  )

The $R^2$ looks quite strong!

collect_metrics(delivery_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 rmse    standard   9.52     10 0.0533  Preprocessor1_Model1
#> 2 rsq     standard   0.853    10 0.00357 Preprocessor1_Model1

Let’s take a closer look at the predictions, though. How well are they calibrated? We can use the cal_plot_regression() helper from the probably package to put together a quick diagnostic plot.

collect_predictions(delivery_res) %>%
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)

Ooof.

In comes tailor! Numeric calibration can help address the correlated errors here. We can add a tailor to our existing workflow to “bump up” predictions towards their true value.

delivery_wflow_improved <-
  delivery_wflow %>%
  add_tailor(tailor() %>% adjust_numeric_calibration())

The resampling code looks the same from here.

set.seed(1)
delivery_res_improved <- 
  fit_resamples(
    delivery_wflow_improved, 
    delivery_folds, 
    control = control_resamples(save_pred = TRUE)
  )

Checking out the same plot reveals a much better fit!

collect_predictions(delivery_res_improved) %>%
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)

There’s actually some tricky data leakage prevention happening under the hood here. When you add tailors to workflow and fit them with tune, this is all taken care of for you. If you’re interested in using tailors outside of that context, check out this documentation section in add_tailor().
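
For intuition about what a numeric calibration does and why it needs its own data, here is a base R sketch on simulated values. It uses a plain linear recalibration as a stand-in; the simulated data, the 50/50 split, and the choice of lm() are all assumptions for illustration, not a description of what tailor or tune do internally.

```r
set.seed(1)
n <- 2000
truth <- runif(n, 10, 60)
# Simulated predictions that are systematically shrunk toward the middle.
pred <- 20 + 0.4 * truth + rnorm(n, sd = 2)

# Estimate the calibration on one part of the data, evaluate it on another,
# so the calibrator is never scored on rows it was trained on.
calib_rows <- seq_len(n) <= n / 2
calib <- data.frame(truth = truth[calib_rows], pred = pred[calib_rows])
holdout <- data.frame(truth = truth[!calib_rows], pred = pred[!calib_rows])

# Linear recalibration: regress the truth on the raw predictions.
cal_fit <- lm(truth ~ pred, data = calib)
holdout$pred_cal <- predict(cal_fit, newdata = holdout)

rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
rmse_before <- rmse(holdout$truth, holdout$pred)
rmse_after <- rmse(holdout$truth, holdout$pred_cal)
```

Comparing `rmse_before` and `rmse_after` on the held-out rows shows the effect of the adjustment without reusing the calibration data for evaluation.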

What’s to come

We’re excited about how this work is shaping up and would love to hear yall’s thoughts on what we’ve brought together so far. Please do comment on our social media posts about this blog entry or leave issues on the tailor GitHub repository and let us know what you think!

Before these changes head out to CRAN, we’ll also be implementing tuning functionality for postprocessors. You’ll be able to tag arguments like adjust_probability_threshold(threshold) or adjust_probability_calibration(method) with tune() to optimize across several values. Besides that, post-processing with tidymodels should “just work” on the developmental versions of our packages—let us know if you come across anything wonky.

Acknowledgements

Postprocessing support has been a longstanding feature request across many of our repositories; we’re grateful for the community discussions there for shaping this work. Additionally, we thank Ryan Tibshirani and Daniel McDonald for fruitful discussions on how we might scope these features.

recipes 1.1.0

Emil Hvitfeldt — Mon, 08 Jul 2024 00:00:00 +0000

We’re thrilled to announce the release of recipes 1.1.0. recipes lets you create a pipeable sequence of feature engineering steps.

You can install it from CRAN with:

install.packages("recipes")

This blog post will go over some of the bigger changes in this release: improvements in column type checking, support for more data types in recipes, long formulas, and better errors for misspelled argument names.

You can see a full list of changes in the release notes.

Column type checking

A longtime issue in recipes came from the fact that a recipe didn’t keep a prototype (ptype) of the data it was specified with. This could cause unexpected behavior or uninformative error messages if different data was passed to prep() than was used to create the recipe().

Every recipe you create starts with a call to recipe(). In the example below, we create a recipe where x2 starts out as a character vector, but the recipe is prepped with data where x2 is a numeric vector. Previously, this didn’t produce any warnings or errors and silently did something unintended.

data_template <- tibble(
  outcome = rnorm(10), 
  x1 = rnorm(10), 
  x2 = sample(letters, 10, T)
)

rec <- recipe(outcome ~ ., data_template) %>%
  step_bin2factor(all_numeric_predictors())

data_training <- tibble(outcome = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

prep(rec, training = data_training)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x1 | Trained

Now, we get an error detailing how the data is different.

data_template <- tibble(outcome = rnorm(10), x1 = rnorm(10), x2 = sample(letters, 10, T))

rec <- recipe(outcome ~ ., data_template) %>%
  step_bin2factor(all_numeric_predictors())

data_training <- tibble(outcome = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

prep(rec, training = data_training)
#> Error in `prep()`:
#> ✖ The following variable has the wrong class:
#> • `x2` must have class <character>, not <numeric>.

Note that recipes created before version 1.1.0 don’t contain any ptype information, and will not undergo checking. Rerunning the code to create the recipe will add ptype information to the recipe.
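
The idea of a prototype can be sketched in a few lines of base R. This is only an illustration of the concept, not how recipes implements it; `check_against_ptype()` is a hypothetical helper, and comparing just the first class of each column is a simplifying assumption.

```r
# Illustrative only: not the recipes implementation.
data_template <- data.frame(
  outcome = rnorm(10),
  x1 = rnorm(10),
  x2 = sample(letters, 10, TRUE),
  stringsAsFactors = FALSE
)

# A zero-row slice keeps column names and classes without keeping the data.
ptype <- data_template[0, ]

# Hypothetical helper: return the names of columns whose first class
# in `new_data` differs from the class stored in the prototype.
check_against_ptype <- function(new_data, ptype) {
  first_class <- function(col) class(col)[1]
  expected <- vapply(ptype, first_class, character(1))
  actual <- vapply(new_data[names(expected)], first_class, character(1))
  names(expected)[expected != actual]
}

data_training <- data.frame(
  outcome = rnorm(100), x1 = rnorm(100), x2 = rnorm(100)
)
mismatched <- check_against_ptype(data_training, ptype)
```

Because the prototype has no rows, it costs almost nothing to store, yet it is enough to notice that x2 changed from character to numeric between recipe creation and training.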

Input checking in `recipe()`

We have relaxed the requirements of data frames, while making feedback more helpful when something goes wrong.

The data was previously passed through model.frame() inside the recipe, which restricted what could be handled. Previously prohibited input included data frames with list-columns or sf data frames. Both of these are now supported, as long as they are a data.frame object.

data_listcolumn <- tibble(
  y = 1:4,
  x = list(1:3, 4:6, 3:1, 1:10)
)

recipe(y ~ ., data = data_listcolumn)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 1

library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
pathshp <- system.file("shape/nc.shp", package = "sf")
data_sf <- st_read(pathshp, quiet = TRUE)

recipe(AREA ~ ., data = data_sf)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 14

We are excited to see what people can do with these new options.

Another way to tell a recipe which variables should be included and what roles they should have is to use add_role() and update_role(). But if you were not careful, you could end up in situations where the same variable was labeled as both an outcome and a predictor. This now results in an error.

# this used to run without any warning or error
recipe(mtcars) |>
  update_role(everything(), new_role = "predictor") |>
  add_role("mpg", new_role = "outcome")
#> Error in `add_role()`:
#> ! `mpg` cannot get "outcome" role as it already has role "predictor".

This error can be avoided by using update_role() instead of add_role().

recipe(mtcars) |>
  update_role(everything(), new_role = "predictor") |>
  update_role("mpg", new_role = "outcome")
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10

Long formulas in `recipe()`

Related to the changes we saw above, we now fully support very long formulas without hitting a C stack usage error.

data_wide <- matrix(1:10000, ncol = 10000)
data_wide <- as.data.frame(data_wide)
names(data_wide) <- c(paste0("x", 1:10000))

long_formula <- as.formula(paste("~ ", paste(names(data_wide), collapse = " + ")))

recipe(long_formula, data_wide)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> predictor: 10000

Better error for misspelled argument names

If you have used recipes long enough you are very likely to have run into the following error.

recipe(mpg ~ ., data = mtcars) |>
  step_pca(all_numeric_predictors(), number = 4) |>
  prep()
#> Error in `step_pca()`:
#> Caused by error in `prep()`:
#> ! Can't rename variables in this context.

The first time you saw it, it didn’t make much sense. Hopefully, you figured out that step_pca() doesn’t have a number argument, and instead uses num_comp to determine the number of principal components to return. This confusion will be a thing of the past as we now include this improved error message.

recipe(mpg ~ ., data = mtcars) |>
  step_pca(all_numeric_predictors(), number = 4) |>
  prep()
#> Error in `step_pca()`:
#> Caused by error in `prep()` at recipes/R/recipe.R:479:9:
#> ! The following argument was specified but do not exist: `number`.

Quality of life increases in `step_dummy()`

I would imagine that one of the most used steps is step_dummy(). We have improved the errors and warnings it spits out when things go sideways.

If you apply step_dummy() to a variable that contains a lot of levels, it will produce a lot of columns, and the resulting object may not fit in memory. This can lead to the following error.

data_id <- tibble(
  id = as.character(1:100000), 
  x1 = rnorm(100000), 
  x2 = sample(letters, 100000, TRUE)
)

recipe(~ ., data = data_id) |>
  step_dummy(all_nominal_predictors()) |>
  prep()
#> Error: vector memory exhausted (limit reached?)

Instead, you now get a more helpful error message.

data_id <- tibble(
  id = as.character(1:100000), 
  x1 = rnorm(100000), 
  x2 = sample(letters, 100000, TRUE)
)

recipe(~ ., data = data_id) |>
  step_dummy(all_nominal_predictors()) |>
  prep()
#> Error in `step_dummy()`:
#> Caused by error:
#> ! `id` contains too many levels (100000), which would result in a
#>   data.frame too large to fit in memory.

Likewise, you will get helpful warnings if step_dummy() encounters an NA or unseen values.

data_train <- tibble(x = c("a", "b"))
data_unseen <- tibble(x = "c")

rec_spec <- recipe(~., data = data_train) %>%
  step_dummy(x) %>%
  prep()

rec_spec %>%
  bake(data_unseen)
#> Warning: ! There are new levels in `x`: "c".
#> ℹ Consider using step_novel() (`?recipes::step_novel()`) before `step_dummy()`
#>   to handle unseen values.
#> # A tibble: 1 × 1
#>     x_b
#>   
#> 1    NA

data_na <- tibble(x = NA)

rec_spec %>%
  bake(data_na)
#> Warning: ! There are new levels in `x`: NA.
#> ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before
#>   `step_dummy()` to handle missing values.
#> # A tibble: 1 × 1
#>     x_b
#>   
#> 1    NA
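
The NA in x_b above comes from how unseen values behave when coerced to the training levels. The base R sketch below illustrates that mechanism and the kind of recoding the warnings point toward. It is not the recipes implementation; `recode_for_dummies()` is a hypothetical helper, and the labels "new" and "unknown" are chosen to mirror the default new levels of step_novel() and step_unknown().

```r
# Illustrative base R sketch; recipes' own steps handle this internally.
train_levels <- c("a", "b")
new_values <- c("a", "c", NA)

# Values outside the training levels silently become NA when coerced,
# so any dummy column derived from them is NA as well.
as_trained <- factor(new_values, levels = train_levels)

# Hypothetical helper: map missing values to "unknown" and unseen values
# to "new" before coercing, so every value lands in a known level.
recode_for_dummies <- function(x, levels) {
  x[is.na(x)] <- "unknown"
  x[!x %in% c(levels, "unknown")] <- "new"
  factor(x, levels = c(levels, "new", "unknown"))
}

recoded <- recode_for_dummies(new_values, train_levels)
```

With every value assigned to some level, the dummy columns built afterwards contain 0s and 1s rather than NAs.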

Acknowledgements

A big thank you to all the people who have contributed to recipes since the release of v1.0.10:

@brynhum, @DemetriPananos, @diegoperoni, @EmilHvitfeldt, @JiahuaQu, @joranE, @nhward, @olivroy, and @simonpcouch.

Chocolate Chocolate Chip Cookies

preheat oven 350°F

1/3c butter
1/2 + 1/3c sugar

mix until fluffy

1 tsp vanilla
1 egg

mix until combined

1/2c cocoa
1/2 tsp baking soda
1c flour

mix until combined

3/4c chocolate chips

bake for about 8 mins, depending on size! they will crack on top, but still be soft.

bonsai 0.3.0

Simon Couch — Tue, 25 Jun 2024 00:00:00 +0000

We’re brimming with glee to announce the release of bonsai 0.3.0. bonsai is a parsnip extension package for tree-based models, and includes support for random forest and gradient-boosted tree frameworks like partykit and LightGBM. This most recent release of the package introduces support for the "aorsf" engine, which implements accelerated oblique random forests (Jaeger et al. 2022, Jaeger et al. 2024).

You can install it from CRAN with:

install.packages("bonsai")

This blog post will demonstrate a modeling workflow where the benefits of using oblique random forests shine through.

You can see a full list of changes in the release notes.

library(tidymodels)
library(bonsai)
library(plsmod)
library(corrr)

The `meats` data

The modeldata package, loaded automatically with the tidymodels meta-package, includes several example datasets to demonstrate modeling problems. We’ll make use of a dataset called meats in this post. Each row is a measurement of a sample of finely chopped meat.

meats
#> # A tibble: 215 × 103
#>    x_001 x_002 x_003 x_004 x_005 x_006 x_007 x_008 x_009 x_010 x_011 x_012 x_013
#>                
#>  1  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.63  2.63  2.63  2.63  2.64
#>  2  2.83  2.84  2.84  2.85  2.85  2.86  2.86  2.87  2.87  2.88  2.88  2.89  2.90
#>  3  2.58  2.58  2.59  2.59  2.59  2.59  2.59  2.60  2.60  2.60  2.60  2.61  2.61
#>  4  2.82  2.82  2.83  2.83  2.83  2.83  2.83  2.84  2.84  2.84  2.84  2.85  2.85
#>  5  2.79  2.79  2.79  2.79  2.80  2.80  2.80  2.80  2.81  2.81  2.81  2.82  2.82
#>  6  3.01  3.02  3.02  3.03  3.03  3.04  3.04  3.05  3.06  3.06  3.07  3.08  3.09
#>  7  2.99  2.99  3.00  3.01  3.01  3.02  3.02  3.03  3.04  3.04  3.05  3.06  3.07
#>  8  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.54  2.54  2.54  2.54  2.54
#>  9  3.27  3.28  3.29  3.29  3.30  3.31  3.31  3.32  3.33  3.33  3.34  3.35  3.36
#> 10  3.40  3.41  3.41  3.42  3.43  3.43  3.44  3.45  3.46  3.47  3.48  3.48  3.49
#> # ℹ 205 more rows
#> # ℹ 90 more variables: x_014 , x_015 , x_016 , x_017 ,
#> #   x_018 , x_019 , x_020 , x_021 , x_022 ,
#> #   x_023 , x_024 , x_025 , x_026 , x_027 ,
#> #   x_028 , x_029 , x_030 , x_031 , x_032 ,
#> #   x_033 , x_034 , x_035 , x_036 , x_037 ,
#> #   x_038 , x_039 , x_040 , x_041 , x_042 , …

From that dataset’s documentation:

These data are recorded on a Tecator Infratec Food and Feed Analyzer… For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.

We’ll try to predict the protein content, as a percentage, using the absorbance measurements.

Before we take a further look, let’s split up our data. I’ll first select off two other possible outcome variables and, after splitting into training and testing sets, resample the data using 5-fold cross-validation with 2 repeats.

meats <- meats %>% select(-water, -fat)

set.seed(1)
meats_split <- initial_split(meats)
meats_train <- training(meats_split)
meats_test <- testing(meats_split)
meats_folds <- vfold_cv(meats_train, v = 5, repeats = 2)

The tricky parts of this modeling problem are that:

There are few observations to work with (215 total).
The 100 absorbance measurements are highly correlated with one another.

Visualizing that correlation:

meats_train %>%
  correlate() %>%
  autoplot() +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank())
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'

Almost all of these pairwise correlations between predictors are near 1, besides the last variable and every other variable. That last variable with weaker correlation values? It’s the outcome.

Baseline models

There are several existing model implementations in tidymodels that are resilient to highly correlated predictors. The first one I’d probably reach for is an elastic net: an interpolation of the LASSO and Ridge regularized linear regression models. Evaluating that modeling approach against resamples:

# define a regularized linear model
spec_lr <- 
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# try out different penalization approaches
res_lr <- tune_grid(spec_lr, protein ~ ., meats_folds)

show_best(res_lr, metric = "rmse")
#> # A tibble: 5 × 8
#>         penalty mixture .metric .estimator  mean     n std_err .config          
#>                                         
#> 1 0.0000324       0.668 rmse    standard    1.24    10  0.0516 Preprocessor1_Mo…
#> 2 0.00000000524   0.440 rmse    standard    1.25    10  0.0548 Preprocessor1_Mo…
#> 3 0.000000461     0.839 rmse    standard    1.26    10  0.0538 Preprocessor1_Mo…
#> 4 0.00000550      0.965 rmse    standard    1.26    10  0.0540 Preprocessor1_Mo…
#> 5 0.0000000489    0.281 rmse    standard    1.26    10  0.0534 Preprocessor1_Mo…
show_best(res_lr, metric = "rsq")
#> # A tibble: 5 × 8
#>         penalty mixture .metric .estimator  mean     n std_err .config          
#>                                         
#> 1 0.0000324       0.668 rsq     standard   0.849    10  0.0126 Preprocessor1_Mo…
#> 2 0.00000000524   0.440 rsq     standard   0.848    10  0.0128 Preprocessor1_Mo…
#> 3 0.000000461     0.839 rsq     standard   0.846    10  0.0114 Preprocessor1_Mo…
#> 4 0.00000550      0.965 rsq     standard   0.846    10  0.0111 Preprocessor1_Mo…
#> 5 0.0000000489    0.281 rsq     standard   0.846    10  0.0126 Preprocessor1_Mo…

That best RMSE value of 1.24 gives us a baseline to work with, and the best R-squared 0.85 seems like a good start.

Many tree-based model implementations in tidymodels generally handle correlated predictors well. Just to be apples-to-apples with "aorsf", let’s use a different random forest engine to get a better sense for baseline performance:

spec_rf <- 
  rand_forest(mtry = tune(), min_n = tune()) %>%
  # this is the default engine, but for consistency's sake:
  set_engine("ranger") %>%
  set_mode("regression")

res_rf <- tune_grid(spec_rf, protein ~ ., meats_folds)
#> i Creating pre-processing data to finalize unknown parameter: mtry

show_best(res_rf, metric = "rmse")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>                                   
#> 1    96     4 rmse    standard    2.37    10  0.0905 Preprocessor1_Model08
#> 2    41     6 rmse    standard    2.39    10  0.0883 Preprocessor1_Model01
#> 3    88    10 rmse    standard    2.43    10  0.0816 Preprocessor1_Model06
#> 4    79    17 rmse    standard    2.51    10  0.0740 Preprocessor1_Model07
#> 5    27    18 rmse    standard    2.52    10  0.0778 Preprocessor1_Model04
show_best(res_rf, metric = "rsq")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>                                   
#> 1    96     4 rsq     standard   0.424    10  0.0385 Preprocessor1_Model08
#> 2    41     6 rsq     standard   0.409    10  0.0394 Preprocessor1_Model01
#> 3    88    10 rsq     standard   0.387    10  0.0365 Preprocessor1_Model06
#> 4    79    17 rsq     standard   0.353    10  0.0404 Preprocessor1_Model07
#> 5    27    18 rsq     standard   0.346    10  0.0397 Preprocessor1_Model04

Not so hot. Just to show I’m not making a straw man here, I’ll evaluate a few more alternative modeling approaches behind the curtain and print out their best performance metrics:

Gradient boosted tree with LightGBM. Best RMSE: 2.34. Best R-squared: 0.43.
Partial least squares regression. Best RMSE: 1.39. Best R-squared: 0.81.
Support vector machine. Best RMSE: 2.28. Best R-squared: 0.46.

This is a tricky one.

Introducing accelerated oblique random forests

The 0.3.0 release of bonsai introduces support for accelerated oblique random forests via the "aorsf" engine for classification and regression in tidymodels. (Tidy survival modelers might note that we already support "aorsf" for censored regression via the censored parsnip extension package!)

Unlike trees in conventional random forests, which create splits using thresholds based on individual predictors (e.g. x_001 > 3), oblique random forests use linear combinations of predictors to create splits (e.g. 0.5 * x_001 + 2 * x_002 > 7.5) and have been shown to improve predictive performance relative to conventional random forests for a variety of applications (Menze et al. 2011). “Oblique” references the appearance of decision boundaries when a set of splits is plotted; I’ve grabbed a visual from the aorsf README that demonstrates:

In the above, we’d like to separate the purple dots from the orange squares. A tree in a traditional random forest, represented on the left, can only generate splits based on one of X1 or X2 at a time. A tree in an oblique random forest, represented on the right, can consider both X1 and X2 in creating decision boundaries, often resulting in stronger predictive performance.
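
The difference between the two kinds of split can be checked numerically with a toy example in base R. The grid of points, the diagonal class boundary, and the unit coefficients below are all made up for illustration; this is not how aorsf searches for splits.

```r
# A small grid of points whose classes are separated by a diagonal boundary.
grid <- expand.grid(x1 = -10:10, x2 = -10:10)
grid$class <- grid$x1 + grid$x2 > 0

accuracy <- function(predicted) mean(predicted == grid$class)

# Axis-aligned split: try every threshold on each single predictor
# and keep the best accuracy found.
thresholds <- -10:10
acc_axis <- max(
  sapply(thresholds, function(t) accuracy(grid$x1 > t)),
  sapply(thresholds, function(t) accuracy(grid$x2 > t))
)

# Oblique split: one threshold on a linear combination of both predictors.
acc_oblique <- accuracy(1 * grid$x1 + 1 * grid$x2 > 0)
```

On this grid a single oblique split separates the classes perfectly, while no single-predictor threshold can, which is why a conventional tree would need many more splits to approximate the same boundary.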

Where does the “accelerated” come from? Generally, finding optimal oblique splits is computationally more intensive than finding single-predictor splits. The aorsf package uses something called “Newton-Raphson scoring” (the same algorithm under the hood in the survival package) to identify splits based on linear combinations of predictor variables. This approach speeds up that process greatly, resulting in fit times that are comparable to implementations of traditional random forests in R (and hundreds of times faster than existing oblique random forest implementations; Jaeger et al. 2024).

The code to tune this model with the "aorsf" engine is the same as for "ranger", except we switch out the engine argument to set_engine():

spec_aorsf <- 
  rand_forest(
    mtry = tune(),
    min_n = tune()
  ) %>%
  set_engine("aorsf") %>%
  set_mode("regression")

res_aorsf <- tune_grid(spec_aorsf, protein ~ ., meats_folds)
#> i Creating pre-processing data to finalize unknown parameter: mtry

show_best(res_aorsf, metric = "rmse")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1    87    11 rmse    standard   0.786    10  0.0370 Preprocessor1_Model02
#> 2    98     8 rmse    standard   0.789    10  0.0363 Preprocessor1_Model10
#> 3    48     5 rmse    standard   0.793    10  0.0363 Preprocessor1_Model01
#> 4    16    17 rmse    standard   0.803    10  0.0325 Preprocessor1_Model09
#> 5    31    18 rmse    standard   0.813    10  0.0359 Preprocessor1_Model05
show_best(res_aorsf, metric = "rsq")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1    48     5 rsq     standard   0.946    10 0.00446 Preprocessor1_Model01
#> 2    98     8 rsq     standard   0.945    10 0.00482 Preprocessor1_Model10
#> 3    87    11 rsq     standard   0.945    10 0.00484 Preprocessor1_Model02
#> 4    16    17 rsq     standard   0.941    10 0.00370 Preprocessor1_Model09
#> 5    31    18 rsq     standard   0.940    10 0.00547 Preprocessor1_Model05

Holy smokes. The best RMSE from aorsf is 0.79, far lower than the previous best RMSE of 1.24 from the elastic net, and the best R-squared is 0.95, well above the previous best (also from the elastic net) of 0.85.

Especially if your modeling problems involve few samples of many, highly correlated predictors, give the "aorsf" modeling engine a whirl in your workflows and let us know what you think!

References

Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, Jaime L. Speiser, Matthew W. Segar, Ambarish Pandey, Nicholas M. Pajewski. 2024. “Accelerated and Interpretable Oblique Random Survival Forests.” Journal of Computational and Graphical Statistics 33.1: 192-207.

Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, and Nicholas M. Pajewski. 2022. “aorsf: An R package for Supervised Learning Using the Oblique Random Survival Forest.” The Journal of Open Source Software.

Bjoern H. Menze, B. Michael Kelm, Daniel N. Splitthoff, Ullrich Koethe, and Fred A. Hamprecht. 2011. “On Oblique Random Forests.” Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 453–469. Springer.

Acknowledgements

Thank you to @bcjaeger , the aorsf author, for doing most of the work to implement aorsf support in bonsai. Thank you to @hfrick , @joranE , @jrosell , @nipnipj , @p-schaefer , @seb-mueller , and @tcovert for their contributions on the bonsai repository since version 0.2.1.

Introducing Keras 3 for R

Tomasz Kalinowski — Tue, 21 May 2024 00:00:00 +0000

We are thrilled to introduce keras3, the next version of the Keras R package. keras3 is a ground-up rebuild of {keras}, maintaining the beloved features of the original while refining and simplifying the API based on valuable insights gathered over the past few years.

Keras provides a complete toolkit for building deep learning models in R—it’s never been easier to build, train, evaluate, and deploy deep learning models.

Installation

To install Keras 3:

install.packages("keras3")
library(keras3)
install_keras()

What’s new:

Documentation

Great documentation is essential, and we’ve worked hard to make sure that keras3 has excellent documentation, both now, and in the future.

Keras 3 comes with a full refresh of the website: https://keras.posit.co . There, you will find guides, tutorials, reference pages with rendered examples, and a new examples gallery. All the reference pages and guides are also available via R’s built-in help system.

In a fast moving ecosystem like deep learning, creating great documentation and wrappers once is not enough. There also need to be workflows that ensure the documentation is up-to-date with upstream dependencies. To accomplish this, {keras3} includes two new maintainer features that ensure the R documentation and function wrappers will stay up-to-date:

We now take snapshots of the upstream documentation and API surface. With each release, all R documentation is rebased on upstream updates. This workflow ensures that all R documentation (guides, examples, vignettes, and reference pages) and R function signatures stay up-to-date with upstream. This snapshot-and-rebase functionality is implemented in a new standalone R package, {doctether} , which may be useful for R package maintainers needing to keep documentation in parity with dependencies.
All examples and vignettes can now be evaluated and rendered during a package build. This ensures that no stale or broken example code makes it into a release. It also means all user facing example code now additionally serves as an extended suite of snapshot unit and integration tests.

Evaluating code in vignettes and examples is still not permitted according to CRAN restrictions. We work around the CRAN restriction by adding additional package build steps that pre-render examples and vignettes .

Combined, these two features will make it substantially easier for Keras in R to maintain feature parity and up-to-date documentation with the Python API to Keras.

Multi-backend support

Soon after its launch in 2015, Keras featured support for most popular deep learning frameworks: TensorFlow, Theano, MXNet, and CNTK. Over time, the landscape shifted; Theano, MXNet, and CNTK were retired, and TensorFlow surged in popularity. In 2021, three years ago, TensorFlow became the premier and only supported Keras backend. Now, the landscape has shifted again.

Keras 3 brings the return of multi-backend support. Choose a backend by calling:

use_backend("jax") # or "tensorflow", "torch", "numpy"

The default backend continues to be TensorFlow, which is the best choice for most users today; for small-to-medium sized models this is still the fastest backend. However, each backend has different strengths, and being able to switch easily will let you adapt to changes as your project, or the frameworks themselves, evolve.

Today, switching to the Jax backend can, for some model types, bring substantial speed improvements. Jax is also the only backend that has support for a new model parallelism distributed training API. Switching to Torch can be helpful during development, often producing simpler tracebacks while debugging.

Keras 3 also lets you incorporate any pre-existing Torch, Jax, or Flax module as a standard Keras layer by using the appropriate wrapper, letting you build atop existing projects with Keras. For example, train a Torch model using the Keras high-level training API (compile() + fit()), or include a Flax module as a component of a larger Keras model. The new multi-backend support lets you use Keras à la carte.

The ‘Ops’ family

{keras3} introduces a new “Operations” family of functions. The Ops family, currently with over 200 functions, provides a comprehensive suite of operations typically needed when operating on nd-arrays for deep learning. The Ops family supersedes and greatly expands on the former family of backend functions prefixed with k_ in the {keras} package.

The Ops functions let you write backend-agnostic code. They provide a uniform API, regardless of whether you’re working with TensorFlow Tensors, Jax Arrays, Torch Tensors, Keras Symbolic Tensors, NumPy arrays, or R arrays.

The Ops functions:

all start with prefix op_ (e.g., op_stack())
all are pure functions (they produce no side-effects)
all use consistent 1-based indexing, and coerce doubles to integers as needed
all are safe to use with any backend (tensorflow, jax, torch, numpy)
all are safe to use in both eager and graph/jit/tracing modes

The Ops API includes:

The entirety of the NumPy API (numpy.*)
The TensorFlow NN API (tf.nn.*)
Common linear algebra functions (A subset of scipy.linalg.*)
A subfamily of image transformers
A comprehensive set of loss functions
And more!

Ingest tabular data with `layer_feature_space()`

keras3 provides a new set of functions for building models that ingest tabular data: layer_feature_space() and a family of feature transformer functions (prefix, feature_) for building keras models that can work with tabular data, either as inputs to a keras model, or as preprocessing steps in a data loading pipeline (e.g., a tfdatasets::dataset_map()).

See the reference page and an example usage in a full end-to-end example to learn more.

New Subclassing API

The subclassing API has been refined and extended to more Keras types . Define subclasses simply by calling: Layer(), Loss(), Metric(), Callback(), Constraint(), Model(), and LearningRateSchedule(). Defining {R6} proxy classes is no longer necessary.

Additionally the documentation page for each of the subclassing functions now contains a comprehensive listing of all the available attributes and methods for that type. Check out ?Layer to see what’s possible.

Saving and Export

Keras 3 brings a new model serialization and export API. It is now much simpler to save and restore models, and also, to export them for serving.

save_model()/load_model():
A new high-level file format (extension: .keras) for saving and restoring a full model.

The file format is backend-agnostic. This means that you can convert trained models between backends, simply by saving with one backend, and then loading with another. For example, train a model using Jax, and then convert to TensorFlow for export.
export_savedmodel():
Export just the forward pass of a model as a compiled artifact for inference with TF Serving or (soon) Posit Connect . This is the easiest way to deploy a Keras model for efficient and concurrent inference serving, all without any R or Python runtime dependency.
Lower level entry points:
- save_model_weights() / load_model_weights():
  save just the weights as .h5 files.
- save_model_config() / load_model_config():
  save just the model architecture as a json file.
register_keras_serializable():
Register custom objects to enable them to be serialized and deserialized.
serialize_keras_object() / deserialize_keras_object():
Convert any Keras object to an R list of simple types that is safe to convert to JSON or rds.
See the new Serialization and Saving vignette for more details and examples.

New `random` family

A new family of random tensor generators . Like the Ops family, these work with all backends. Additionally, all the RNG-using methods have support for stateless usage when you pass in a seed generator. This enables tracing and compilation by frameworks that have special support for stateless, pure, functions, like Jax. See ?random_seed_generator() for example usage.

Other additions:

New shape() function, one-stop utility for working with tensor shapes in all contexts.
New and improved print(model) and plot(model) method. See some examples of output in the Functional API guide
All new fit() progress bar and live metrics viewer output, including new dark-mode support in the RStudio IDE.
New config family , a curated set of functions for getting and setting Keras global configurations.
All of the other function families have expanded with new members:
- Layers (prefix, layer_)
- Activation functions (prefix, activation_)
- Optimizers (prefix, optimizer_)
- Metrics (prefix, metric_)
- Losses (prefix, loss_)
- Image preprocessing (prefixes image_ and op_image_)
- Applications (prefix, application_)

Migrating from `{keras}` to `{keras3}`

{keras3} supersedes the {keras} package.

If you’re writing new code today, you can start using {keras3} right away.

If you have legacy code that uses {keras}, you are encouraged to update the code for {keras3}. For many high-level API functions, such as layer_dense(), fit(), and keras_model(), minimal to no changes are required. However there is a long tail of small changes that you might need to make when updating code that made use of the lower-level Keras API. Some of those are documented here: https://keras.io/guides/migrating_to_keras_3/ .

If you’re running into issues or have questions about updating, don’t hesitate to ask on https://github.com/rstudio/keras/issues or https://github.com/rstudio/keras/discussions .

The {keras} and {keras3} packages will coexist while the community transitions. During the transition, {keras} will continue to receive patch updates for compatibility with Keras v2, which continues to be published to PyPi under the package name tf-keras. After tf-keras is no longer maintained, the {keras} package will be archived.

Summary

In summary, {keras3} is a robust update to the Keras R package, incorporating new features while preserving the ease of use and functionality of the original. The new multi-backend support, comprehensive suite of Ops functions, refined model serialization API, and updated documentation workflows enable users to easily take advantage of the latest developments in the deep learning community.

Whether you are a seasoned Keras user or just starting your deep learning journey, Keras 3 provides the tools and flexibility to build, train, and deploy models with ease and confidence. As we transition from Keras 2 to Keras 3, we are committed to supporting the community and ensuring a smooth migration. We invite you to explore the new features, check out the updated documentation, and join the conversation on our GitHub discussions page. Welcome to the next chapter of deep learning in R with Keras 3!

Q1 2024 tidymodels digest

Hannah Frick — Wed, 24 Apr 2024 00:00:00 +0000

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused, like these posts from the past couple of months:

Additionally, we have published several related articles on tidymodels.org :

Since our last roundup post , there have been CRAN releases of 21 tidymodels packages. Here are links to their NEWS files:

baguette (1.0.2)
brulee (0.3.0)
butcher (0.3.4)
censored (0.3.0)
dials (1.2.1)
embed (1.1.4)
finetune (1.2.0)
hardhat (1.3.1)
modeldata (1.3.0)
parsnip (1.2.1)
probably (1.0.3)
recipes (1.0.10)
rsample (1.2.1)
shinymodels (0.1.1)
stacks (1.0.4)
tidyclust (0.2.1)
tidymodels (1.2.0)
tune (1.2.0)
workflows (1.1.4)
workflowsets (1.1.0)
yardstick (1.3.1)

We’ll highlight a few especially notable changes below: new prediction options in censored, consistency in augmenting parsnip models and workflows, as well as a new autoplot type for workflow sets.

library(tidymodels)
library(censored)

New prediction options in censored

As part of the framework-wide integration of survival analysis, the parsnip extension package censored has received some love in the form of new prediction options.

Random forests with the "aorsf" engine can now predict survival time, thanks to the new feature in the aorsf package itself. This means that all engines in censored can now predict survival time.

Let’s predict survival time for the first five rows of the lung cancer dataset, survival analysis’ mtcars.

rf_spec <- rand_forest() |>
  set_engine("aorsf") |>
  set_mode("censored regression")

rf_fit <- rf_spec |>
  fit(Surv(time, status) ~ age + sex, data = lung)

lung_5 <- lung[1:5, ]
predict(rf_fit, new_data = lung_5, type = "time")
#> # A tibble: 5 × 1
#>   .pred_time
#>        <dbl>
#> 1       217.
#> 2       240.
#> 3       236.
#> 4       236.
#> 5       254.

Some models allow for predictions based on different values of a tuning parameter without having to refit the model. In parsnip, we refer to this as “the submodel trick.” Some of those models are regularized models fitted with the glmnet engine. In censored, the corresponding multi_predict() method has now gained the prediction types "time" and "raw" in addition to the existing types "survival" and "linear_pred".

Let’s fit a regularized Cox model to illustrate. Note how we set the penalty to a fixed value of 0.1.

cox_fit <- proportional_hazards(penalty = 0.1) |>
  set_engine("glmnet") |>
  set_mode("censored regression") |>
  fit(Surv(time, status) ~ ., data = lung)

Predictions made with predict() use that penalty value of 0.1. With multi_predict() , we can change that value to something different without having to refit. Conveniently, we can predict for multiple penalty values as well.

predict(cox_fit, new_data = lung_5, type = "time")
#> # A tibble: 5 × 1
#>   .pred_time
#>        <dbl>
#> 1        NA 
#> 2       425.
#> 3        NA 
#> 4       350.
#> 5        NA

mpred <- multi_predict(cox_fit, new_data = lung_5, type = "time", 
                       penalty = c(0.01, 0.1)) 
mpred
#> # A tibble: 5 × 1
#>   .pred           
#>   <list>          
#> 1 <tibble [2 × 2]>
#> 2 <tibble [2 × 2]>
#> 3 <tibble [2 × 2]>
#> 4 <tibble [2 × 2]>
#> 5 <tibble [2 × 2]>

The resulting tibble is nested by observation to follow the convention of one row per observation. For each observation, the predictions are stored in a tibble containing the penalty value along with the prediction.

mpred$.pred[[2]]
#> # A tibble: 2 × 2
#>   penalty .pred_time
#>     <dbl>      <dbl>
#> 1    0.01       461.
#> 2    0.1        425.

You can see that the predicted value from predict() matches the predicted value from multi_predict() with a penalty of 0.1.
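The nested layout can be mimicked in base R to see how it relates to a flat table. This is only a sketch: the prediction values below are made-up toy numbers, not output from the Cox model, and the `.row` column is a hypothetical identifier added for illustration.

```r
# Toy values for illustration only; they are not output from the fitted model.
pred_obs1 <- data.frame(penalty = c(0.01, 0.1), .pred_time = c(300, 280))
pred_obs2 <- data.frame(penalty = c(0.01, 0.1), .pred_time = c(410, 390))

# One row per observation, with the per-penalty predictions in a list column.
nested <- data.frame(.row = 1:2)
nested$.pred <- list(pred_obs1, pred_obs2)

# Flatten to one row per combination of observation and penalty value.
flat <- do.call(
  rbind,
  Map(function(row, preds) cbind(.row = row, preds), nested$.row, nested$.pred)
)
flat
```

With tidyr loaded, calling `unnest()` on the `.pred` column of the real `multi_predict()` result should give the same kind of long table.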

Consistent `augment()` for workflows and parsnip models

If you are interested in exploring predictions in relation to predictors, augment() is your extended predict() method: it will augment the inputted dataset with its predictions. For classification, it will add hard class predictions as well as class probabilities. For regression, it will add the numeric prediction. If the outcome variable is part of the dataset, it also calculates residuals. This has already been the case for fitted parsnip models, and the augment() method for workflows will now also calculate residuals.

spec_fit <- fit(linear_reg(), mpg ~ ., mtcars)
wflow_fit <- workflow(mpg ~ ., linear_reg()) %>% fit(mtcars)

augment(spec_fit, mtcars)
#> # A tibble: 32 × 13
#>    .pred  .resid   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
#>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  22.6 -1.60    21       6  160    110  3.9   2.62  16.5     0     1     4
#>  2  22.1 -1.11    21       6  160    110  3.9   2.88  17.0     0     1     4
#>  3  26.3 -3.45    22.8     4  108     93  3.85  2.32  18.6     1     1     4
#>  4  21.2  0.163   21.4     6  258    110  3.08  3.22  19.4     1     0     3
#>  5  17.7  1.01    18.7     8  360    175  3.15  3.44  17.0     0     0     3
#>  6  20.4 -2.28    18.1     6  225    105  2.76  3.46  20.2     1     0     3
#>  7  14.4 -0.0863  14.3     8  360    245  3.21  3.57  15.8     0     0     3
#>  8  22.5  1.90    24.4     4  147.    62  3.69  3.19  20       1     0     4
#>  9  24.4 -1.62    22.8     4  141.    95  3.92  3.15  22.9     1     0     4
#> 10  18.7  0.501   19.2     6  168.   123  3.92  3.44  18.3     1     0     4
#> # ℹ 22 more rows
#> # ℹ 1 more variable: carb <dbl>

augment(wflow_fit, mtcars)
#> # A tibble: 32 × 13
#>    .pred  .resid   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
#>  * <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  22.6 -1.60    21       6  160    110  3.9   2.62  16.5     0     1     4
#>  2  22.1 -1.11    21       6  160    110  3.9   2.88  17.0     0     1     4
#>  3  26.3 -3.45    22.8     4  108     93  3.85  2.32  18.6     1     1     4
#>  4  21.2  0.163   21.4     6  258    110  3.08  3.22  19.4     1     0     3
#>  5  17.7  1.01    18.7     8  360    175  3.15  3.44  17.0     0     0     3
#>  6  20.4 -2.28    18.1     6  225    105  2.76  3.46  20.2     1     0     3
#>  7  14.4 -0.0863  14.3     8  360    245  3.21  3.57  15.8     0     0     3
#>  8  22.5  1.90    24.4     4  147.    62  3.69  3.19  20       1     0     4
#>  9  24.4 -1.62    22.8     4  141.    95  3.92  3.15  22.9     1     0     4
#> 10  18.7  0.501   19.2     6  168.   123  3.92  3.44  18.3     1     0     4
#> # ℹ 22 more rows
#> # ℹ 1 more variable: carb <dbl>

Both methods also add the new columns on the left-hand side of the data frame, rather than the right-hand side. This means that prediction columns are always visible when printed, even for data frames with many columns. As you might expect, the order of the columns is the same for both methods as well.
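For intuition, the columns that augment() adds for a regression model can be approximated by hand in base R. This is a sketch of the idea using `lm()` directly, not the parsnip or workflows implementation; the residual is taken as observed minus predicted, which matches the sign of the `.resid` values printed above.

```r
# Fit the same linear regression with base R.
fit <- lm(mpg ~ ., data = mtcars)

# Predictions and residuals (observed minus predicted).
.pred <- unname(predict(fit, newdata = mtcars))
augmented <- cbind(
  data.frame(.pred = .pred, .resid = mtcars$mpg - .pred),
  mtcars
)

# The new columns sit on the left, ahead of the original variables.
head(augmented[, 1:4], 3)
```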

New autoplot type for workflow sets

Many tidymodels objects have autoplot() methods for quickly getting a sense of the most important aspects of an object. For workflow sets, the method shows the value of the calculated performance metrics, as well as the respective rank of each workflow in the set. Let’s put together a workflow set on the actual mtcars data and take a look at the default autoplot.

mt_rec <- recipe(mpg ~ ., mtcars)
mt_rec2 <- mt_rec |> step_normalize(all_numeric_predictors())
mt_rec3 <- mt_rec |> step_YeoJohnson(all_numeric_predictors())

wflow_set <- workflow_set(
  list(plain = mt_rec, normalize = mt_rec2, yeo_johnson = mt_rec3), 
  list(linear_reg())
)

set.seed(1)
wflow_set_fit <- workflow_map(
  wflow_set, 
  "fit_resamples", 
  resamples = bootstraps(mtcars)
)

autoplot(wflow_set_fit)

This allows you to grasp the metric values and rank of a workflow and lets you distinguish the type of preprocessor and model. In our case, we only have one type of model, and even just one type of preprocessor, a recipe. What we are much more interested in is which recipe corresponds to which rank. The new option of type = "wflow_id" lets us see which values and ranks correspond with which workflow and thus also with which recipe.

autoplot(wflow_set_fit, type = "wflow_id")

This makes it easy to spot that it’s the Yeo-Johnson transformation that makes the difference here!

Acknowledgements

We’d like to thank those in the community that contributed to tidymodels in the last quarter:

baguette: @EmilHvitfeldt , and @topepo .
brulee: @jrosell , and @topepo .
butcher: @juliasilge .
censored: @EmilHvitfeldt , @hfrick , @simonpcouch , and @tripartio .
dials: @hfrick , and @topepo .
embed: @EmilHvitfeldt .
finetune: @hfrick , @jrosell , @mfansler , @simonpcouch , and @topepo .
hardhat: @DavisVaughan , and @simonpcouch .
modeldata: @topepo .
parsnip: @birbritto , @EmilHvitfeldt , @hfrick , @jmunyoon , @marcelglueck , @mattheaphy , @mesdi , @nipnipj , @pgg1309 , @simonpcouch , @topepo , and @wzbillings .
probably: @brshallo , @Jeffrothschild , @jgaeb , @simonpcouch , and @topepo .
recipes: @DemetriPananos , @EmilHvitfeldt , @jdonland , @JiahuaQu , @joranE , @mikemahoney218 , @olivroy , @SantiagoD999 , @simonpcouch , @stufield , and @topepo .
rsample: @EmilHvitfeldt , @hfrick , @mikemahoney218 , @paulcbauer , @StevenWallaert , @topepo , and @ZWael .
shinymodels: @simonpcouch .
stacks: @simonpcouch .
tidyclust: @EmilHvitfeldt , and @katieburak .
tidymodels: @jkylearmstrong , @mine-cetinkaya-rundel , @nikosGeography , @nipnipj , and @topepo .
tune: @AlbertoImg , @EmilHvitfeldt , @hfrick , @joranE , @joshuagi , @lionel- , @marcozanotti , @Peter4801 , @rfsaldanha , @simonpcouch , @topepo , and @walkerjameschris .
workflows: @EmilHvitfeldt , @mesdi , @Milardkh , @simonpcouch , and @topepo .
workflowsets: @hfrick , and @simonpcouch .
yardstick: @asb2111 , @Dpananos , @EduMinsky , @EmilHvitfeldt , @hfrick , and @tripartio .

We’re grateful for all of the tidymodels community, from observers to users to contributors. Happy modeling!

tune 1.2.0

Simon Couch — Thu, 18 Apr 2024 00:00:00 +0000

We’re indubitably amped to announce the release of tune 1.2.0, a package for hyperparameter tuning in the tidymodels framework .

You can install it from CRAN, along with the rest of the core packages in tidymodels, using the tidymodels meta-package:

install.packages("tidymodels")

The 1.2.0 release of tune has introduced support for two major features that we’ve written about on the tidyverse blog already:

While those features got their own blog posts, there are several more features in this release that we thought were worth calling out. This post will highlight improvements to our support for parallel processing, the introduction of support for percentile confidence intervals for performance metrics, and a few other bits and bobs. You can see a full list of changes in the release notes .

library(tidymodels)

Throughout this post, I’ll refer to the example of tuning an XGBoost model to predict the fuel efficiency of various car models. I hear this is already a well-explored modeling problem, but alas:

set.seed(2024)

xgb_res <- 
  tune_grid(
    boost_tree(mode = "regression", mtry = tune(), learn_rate = tune()),
    mpg ~ .,
    bootstraps(mtcars),
    control = control_grid(save_pred = TRUE)
  )

Note that we’ve used the control option save_pred = TRUE to indicate that we want to save the predictions from our resampled models in the tuning results. Both int_pctl() and compute_metrics() below will need those predictions. The metrics for our resampled model look like so:

collect_metrics(xgb_res)
#> # A tibble: 20 × 8
#>    mtry learn_rate .metric .estimator   mean     n std_err .config              
#>   <int>      <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>
#> 1     2    0.00204 rmse    standard   19.7      25  0.262  Preprocessor1_Model01
#> 2     2    0.00204 rsq     standard    0.659    25  0.0314 Preprocessor1_Model01
#> 3     6    0.00859 rmse    standard   18.0      25  0.260  Preprocessor1_Model02
#> 4     6    0.00859 rsq     standard    0.607    25  0.0270 Preprocessor1_Model02
#> 5     3    0.0276  rmse    standard   14.0      25  0.267  Preprocessor1_Model03
#> 6     3    0.0276  rsq     standard    0.710    25  0.0237 Preprocessor1_Model03
#> # ℹ 14 more rows

Modernized support for parallel processing

The tidymodels framework has long supported evaluating models in parallel using the foreach package. This release of tune has introduced support for parallelism using the futureverse framework, and we will begin deprecating our support for foreach in a coming release.

To tune a model in parallel with foreach, a user would load a parallel backend package (usually with a name like library(doBackend) ) and then register it with foreach (with a function call like registerDoBackend()). The tune package would then detect that registered backend and take it from there. For example, the code to distribute the above tuning process across 10 cores with foreach would look like:

library(doMC)
registerDoMC(cores = 10)

set.seed(2024)

xgb_res <- 
  tune_grid(
    boost_tree(mode = "regression", mtry = tune(), learn_rate = tune()),
    mpg ~ .,
    bootstraps(mtcars),
    control = control_grid(save_pred = TRUE)
  )

The code to do so with future is similarly simple. Users first load the future package, and then specify a plan() which dictates how computations will be distributed. For example, the code to distribute the above tuning process across 10 cores with future looks like:

library(future)
plan(multisession, workers = 10)

set.seed(2024)

xgb_res <- 
  tune_grid(
    boost_tree(mode = "regression", mtry = tune(), learn_rate = tune()),
    mpg ~ .,
    bootstraps(mtcars),
    control = control_grid(save_pred = TRUE)
  )

For users, the transition to parallelism with future has several benefits:

The futureverse presently supports a greater number of parallelism technologies and has been more likely to receive implementations for new ones.
Once foreach is fully deprecated, users will be able to use the interactive logger when tuning in parallel.

From our perspective, transitioning our parallelism support to future makes our packages much more maintainable, reducing complexity in random number generation, error handling, and progress reporting.

In an upcoming release of the package, you’ll see a deprecation warning when a foreach parallel backend is registered but no future plan has been specified, so start transitioning your code sooner rather than later!

Percentile confidence intervals

Following up on changes in the most recent rsample release , tune has introduced a method for int_pctl() that calculates percentile confidence intervals for performance metrics. To calculate a 90% confidence interval for the values of each performance metric returned in collect_metrics(), we’d write:

set.seed(2024)

int_pctl(xgb_res, alpha = .1)
#> # A tibble: 20 × 8
#>   .metric .estimator .lower .estimate .upper .config             mtry learn_rate
#>   <chr>   <chr>       <dbl>     <dbl>  <dbl> <chr>              <int>      <dbl>
#> 1 rmse    bootstrap  18.1      19.9   22.0   Preprocessor1_Mod…     2    0.00204
#> 2 rsq     bootstrap   0.570     0.679  0.778 Preprocessor1_Mod…     2    0.00204
#> 3 rmse    bootstrap  16.6      18.3   19.9   Preprocessor1_Mod…     6    0.00859
#> 4 rsq     bootstrap   0.548     0.665  0.765 Preprocessor1_Mod…     6    0.00859
#> 5 rmse    bootstrap  12.5      14.1   15.9   Preprocessor1_Mod…     3    0.0276 
#> 6 rsq     bootstrap   0.622     0.720  0.818 Preprocessor1_Mod…     3    0.0276 
#> # ℹ 14 more rows

Note that the output has the same number of rows as the collect_metrics() output: one for each unique pair of metric and workflow.

This is very helpful for validation sets. Other resampling methods generate replicated performance statistics. We can compute simple interval estimates using the mean and standard error for those. Validation sets produce only one estimate, and these bootstrap methods are probably the best option for obtaining interval estimates.
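The percentile idea itself is simple enough to sketch in base R. The example below is an illustration under assumptions, not the int_pctl() implementation: the `truth` and `estimate` vectors are simulated stand-ins for the saved predictions of a single validation set, and `rmse()` is a small helper defined here.

```r
set.seed(2024)

# Hypothetical validation-set results: observed outcomes and model predictions.
truth <- rnorm(100, mean = 20, sd = 5)
estimate <- truth + rnorm(100, sd = 2)

rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Resample rows with replacement and recompute the metric each time.
boot_rmse <- replicate(2000, {
  idx <- sample(length(truth), replace = TRUE)
  rmse(truth[idx], estimate[idx])
})

# A 90% percentile interval uses the 5th and 95th percentiles (alpha = 0.1).
alpha <- 0.1
interval <- quantile(boot_rmse, probs = c(alpha / 2, 1 - alpha / 2))
interval
```

Only the predictions are resampled here, so no model is refit, which is why saving predictions with save_pred = TRUE is enough.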

Breaking change: relocation of ellipses

We’ve made a breaking change in argument order for several functions in the package (and downstream packages like finetune and workflowsets). Ellipses (…) are now used consistently in the package to require optional arguments to be named. For functions that previously had unused ellipses at the end of the function signature, they have been moved to follow the last argument without a default value, and several other functions that previously did not have ellipses in their signatures gained them. This applies to methods for augment(), collect_predictions(), collect_metrics(), select_best(), show_best(), and conf_mat_resampled().
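The effect of moving the dots can be seen with a small base R function. `show_best_sketch()` is a hypothetical stand-in written for this illustration, and the check with `...length()` is an assumption about how such a guard could be written, not the code used in tune.

```r
# Hypothetical function mirroring the new layout: the dots follow the last
# argument without a default, so `metric` and `n` can only be matched by name.
show_best_sketch <- function(x, ..., metric = NULL, n = 5) {
  if (...length() > 0) {
    stop("Optional arguments must be named, e.g. `metric = \"rmse\"`.")
  }
  list(metric = metric, n = n)
}

# Named optional arguments work as before.
named <- show_best_sketch("results", metric = "rmse", n = 3)

# A positional optional argument is captured by the dots and rejected.
positional <- tryCatch(
  show_best_sketch("results", "rmse"),
  error = function(e) conditionMessage(e)
)

named
positional
```

In practice this means that code passing optional arguments by position may need the argument name added after upgrading.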

Compute new metrics without re-fitting

We’ve also added a new function, compute_metrics() , that allows for calculating metrics that were not used when evaluating against resamples. For example, consider our xgb_res object. Since we didn’t supply any metrics to evaluate, and this model is a regression model, tidymodels selected RMSE and R² as defaults:

collect_metrics(xgb_res)
#> # A tibble: 20 × 8
#>    mtry learn_rate .metric .estimator   mean     n std_err .config              
#>   <int>      <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>
#> 1     2    0.00204 rmse    standard   19.7      25  0.262  Preprocessor1_Model01
#> 2     2    0.00204 rsq     standard    0.659    25  0.0314 Preprocessor1_Model01
#> 3     6    0.00859 rmse    standard   18.0      25  0.260  Preprocessor1_Model02
#> 4     6    0.00859 rsq     standard    0.607    25  0.0270 Preprocessor1_Model02
#> 5     3    0.0276  rmse    standard   14.0      25  0.267  Preprocessor1_Model03
#> 6     3    0.0276  rsq     standard    0.710    25  0.0237 Preprocessor1_Model03
#> # ℹ 14 more rows

In the past, if you wanted to evaluate that workflow against a performance metric that you hadn’t included in your tune_grid() run, you’d need to re-run tune_grid(), fitting models and predicting new values all over again. Now, using the compute_metrics() function, you can use the tune_grid() output you’ve already generated and compute any number of new metrics without having to fit any more models as long as you use the control option save_pred = TRUE when tuning.

So, say I want to additionally calculate Huber Loss and Mean Absolute Percent Error. I just pass those metrics along with the tuning result to compute_metrics(), and the result looks just like collect_metrics() output for the metrics originally calculated:

compute_metrics(xgb_res, metric_set(huber_loss, mape))
#> # A tibble: 20 × 8
#>    mtry learn_rate .metric    .estimator  mean     n std_err .config            
#>   <int>      <dbl> <chr>      <chr>      <dbl> <int>   <dbl> <chr>
#> 1     2    0.00204 huber_loss standard    18.3    25  0.232  Preprocessor1_Mode…
#> 2     2    0.00204 mape       standard    94.4    25  0.0685 Preprocessor1_Mode…
#> 3     6    0.00859 huber_loss standard    16.7    25  0.229  Preprocessor1_Mode…
#> 4     6    0.00859 mape       standard    85.7    25  0.178  Preprocessor1_Mode…
#> 5     3    0.0276  huber_loss standard    12.6    25  0.230  Preprocessor1_Mode…
#> 6     3    0.0276  mape       standard    64.4    25  0.435  Preprocessor1_Mode…
#> # ℹ 14 more rows
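Since compute_metrics() depends on saved predictions, it can help to see the whole round trip in one place. The following is a small sketch, assuming a recent tidymodels installation; it swaps in a decision tree on mtcars so that it runs quickly, and the names `folds`, `tree_spec`, `tree_res`, and `new_metrics` are made up for this illustration rather than taken from the post.

```r
library(tidymodels)

set.seed(1)
folds <- vfold_cv(mtcars, v = 5)

# A small stand-in model with one tuning parameter.
tree_spec <- decision_tree(cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

# save_pred = TRUE keeps the out-of-sample predictions, which is what
# compute_metrics() needs later on.
tree_res <- tune_grid(
  tree_spec,
  mpg ~ .,
  resamples = folds,
  grid = 3,
  control = control_grid(save_pred = TRUE)
)

# New metrics are computed from the saved predictions; no models are refit.
new_metrics <- compute_metrics(tree_res, metric_set(mae, mape))
```

If the tuning run was made without save_pred = TRUE, the predictions are not stored in the result, so this approach is only available for runs that kept them.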

Easily pivot resampled metrics

Finally, the collect_metrics() method for tune results recently gained a new argument, type, indicating the shape of the returned metrics. The default, type = "long", is the same shape as before. The argument value type = "wide" will allot each metric its own column, making it easier to compare metrics across different models.

collect_metrics(xgb_res, type = "wide")
#> # A tibble: 10 × 5
#>    mtry learn_rate .config                rmse   rsq
#>   <int>      <dbl> <chr>                 <dbl> <dbl>
#> 1     2    0.00204 Preprocessor1_Model01  19.7 0.659
#> 2     6    0.00859 Preprocessor1_Model02  18.0 0.607
#> 3     3    0.0276  Preprocessor1_Model03  14.0 0.710
#> 4     2    0.0371  Preprocessor1_Model04  12.3 0.728
#> 5     5    0.00539 Preprocessor1_Model05  18.8 0.595
#> 6     9    0.0110  Preprocessor1_Model06  17.4 0.577
#> # ℹ 4 more rows

Under the hood, this is indeed just a pivot_wider() call. We’ve found that it’s time-consuming and error-prone to programmatically determine identifying columns when pivoting resampled metrics, so we’ve localized and thoroughly tested the code that we use to do so with this feature.
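As a rough illustration of that pivot, here is a simplified sketch using tidyr directly. The data frame is typed in by hand from the first three configurations printed above, and the choice of identifying columns is our assumption for this small case; the code inside tune has to work those columns out programmatically and handles more situations than this.

```r
library(tidyr)
library(tibble)

# Long-format metrics, copied by hand from the printed output above.
long_metrics <- tibble(
  mtry       = c(2, 2, 6, 6, 3, 3),
  learn_rate = c(0.00204, 0.00204, 0.00859, 0.00859, 0.0276, 0.0276),
  .config    = rep(paste0("Preprocessor1_Model0", 1:3), each = 2),
  .metric    = rep(c("rmse", "rsq"), times = 3),
  mean       = c(19.7, 0.659, 18.0, 0.607, 14.0, 0.710)
)

# One row per configuration, one column per metric.
wide_metrics <- pivot_wider(
  long_metrics,
  id_cols = c(mtry, learn_rate, .config),
  names_from = .metric,
  values_from = mean
)
```

The fiddly part is the `id_cols` choice: with more tuning parameters or extra columns, picking the wrong set either duplicates rows or drops information, which is why having this done for you is convenient.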

More love for the Brier score

Tuning and resampling functions use default metrics when the user does not specify a custom metric set. For regression models, these are RMSE and R². For classification, accuracy and the area under the ROC curve were the defaults. We’ve now added the Brier score to the default classification metric list.
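For readers who have not met the Brier score before, here is a small base R sketch of the usual binary definition: the mean squared difference between the predicted probability of the event and the observed outcome. The probabilities and outcomes below are made up for illustration, and this is a hand calculation rather than the code tidymodels runs; in yardstick the metric is available as brier_class().

```r
# Hypothetical predicted probabilities of the event, and observed outcomes
# coded as 1 (event) or 0 (no event).
prob_event <- c(0.9, 0.2, 0.6, 0.3)
observed   <- c(1, 0, 1, 1)

# Binary Brier score: mean squared error of the predicted probabilities.
brier <- mean((prob_event - observed)^2)
brier
#> [1] 0.175
```

Lower values are better, and a score of 0 means every probability matched the outcome exactly. Unlike accuracy, the score penalizes confident predictions that turn out to be wrong.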

Acknowledgements

As always, we’re appreciative of the community contributors who helped make this release happen: @AlbertoImg, @dramanica, @epiheather, @joranE, @jrosell, @jxu, @kbodwin, @kenraywilliams, @KJT-Habitat, @lionel-, @marcozanotti, @MasterLuke84, @mikemahoney218, @PathosEthosLogos, and @Peter4801.

Q4 2023 tidymodels digest

Emil Hvitfeldt — Tue, 09 Jan 2024 00:00:00 +0000

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused, like this post from the past couple of months:

Three ways errors are about to get better in tidymodels

Since our last roundup post, there have been CRAN releases of 7 tidymodels packages. Here are links to their NEWS files:

embed (1.1.3)
modeldb (0.3.0)
recipes (1.0.9)
spatialsample (0.5.1)
stacks (1.0.3)
textrecipes (1.0.6)
tidyposterior (1.0.1)

We’ll highlight a few especially notable changes below: updated warnings when normalizing, and better error messages in recipes.

library(tidymodels)

data("ames", package = "modeldata")

Updated warnings when normalizing

The latest release of recipes features an overhaul of the warnings and error messages to use the cli package. With this, we are starting the project of providing more information signaling when things don’t go well.

The first type of issue we now signal for is when you try to normalize data that contains elements such as NA or Inf. These can sneak in for several reasons, and before this release, it happened silently. Below we are creating a recipe using the ames data set, and before we normalize, we are taking the logarithms of all variables that pertain to square footage.

rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_log(contains("SF")) |>
  step_normalize(all_numeric_predictors()) |>
  prep()
#> Warning: Columns `BsmtFin_SF_1`, `BsmtFin_SF_2`, `Bsmt_Unf_SF`, `Total_Bsmt_SF`,
#> `Second_Flr_SF`, `Wood_Deck_SF`, and `Open_Porch_SF` returned NaN, because
#> variance cannot be calculated and scaling cannot be used. Consider avoiding
#> `Inf` or `-Inf` values and/or setting `na_rm = TRUE` before normalizing.

We now get a warning that something happened, telling us that it encountered Inf or -Inf. Knowing that, we can go back and investigate what went wrong. If we exclude step_normalize() and bake() the recipe, we see that a number of -Inf values appear.

recipe(Sale_Price ~ ., data = ames) |>
  step_log(contains("SF")) |>
  prep() |>
  bake(new_data = NULL, contains("SF")) |>
  glimpse()
#> Rows: 2,930
#> Columns: 8
#> $ BsmtFin_SF_1   0.6931472, 1.7917595, 0.0000000, 0.0000000, 1.0986123, 1…
#> $ BsmtFin_SF_2   -Inf, 4.969813, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf…
#> $ Bsmt_Unf_SF    6.089045, 5.598422, 6.006353, 6.951772, 4.919981, 5.7807…
#> $ Total_Bsmt_SF  6.984716, 6.782192, 7.192182, 7.654443, 6.833032, 6.8308…
#> $ First_Flr_SF   7.412160, 6.797940, 7.192182, 7.654443, 6.833032, 6.8308…
#> $ Second_Flr_SF  -Inf, -Inf, -Inf, -Inf, 6.552508, 6.519147, -Inf, -Inf, …
#> $ Wood_Deck_SF   5.347108, 4.941642, 5.973810, -Inf, 5.356586, 5.886104, …
#> $ Open_Porch_SF  4.127134, -Inf, 3.583519, -Inf, 3.526361, 3.583519, -Inf…

Looking at the bare data set, we notice that the -Inf values all appear where the original value is 0, which makes sense since log(0) is undefined and R returns -Inf for it.

ames |>
  select(contains("SF")) |>
  glimpse()
#> Rows: 2,930
#> Columns: 8
#> $ BsmtFin_SF_1   2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, 3, 4,…
#> $ BsmtFin_SF_2   0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0, 0, …
#> $ Bsmt_Unf_SF    441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994, 763,…
#> $ Total_Bsmt_SF  1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994, …
#> $ First_Flr_SF   1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, 1028,…
#> $ Second_Flr_SF  0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0, 0, 1…
#> $ Wood_Deck_SF   210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 483, 0,…
#> $ Open_Porch_SF  62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0, 54,…

Knowing that it was 0 that caused the problem, we can set an offset to avoid taking log(0).

rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_log(contains("SF"), offset = 0.5) |>
  step_normalize(all_numeric_predictors()) |>
  prep()
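The mechanism behind the warning, and why the offset helps, can be reproduced in a few lines of base R. This is a simplified sketch of the arithmetic rather than what recipes does internally; the three values are taken from the BsmtFin_SF_2 column shown above, and the variable names are ours.

```r
# Three values from BsmtFin_SF_2, including a zero.
sq_ft <- c(0, 144, 1120)

# log(0) is -Inf, so the logged column contains a non-finite value.
logged <- log(sq_ft)

# With -Inf present, the standard deviation is NaN, so scaling cannot work.
spread <- sd(logged)
is.nan(spread)

# Adding an offset before taking logs keeps every value finite.
logged_offset <- log(sq_ft + 0.5)
scaled <- (logged_offset - mean(logged_offset)) / sd(logged_offset)
all(is.finite(scaled))
```

An offset changes the scale slightly for small values, so it is worth checking that the chosen value makes sense for the variable at hand.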

These warnings appear in step_scale(), step_normalize(), step_center() or step_range().

Better error messages in recipes

Another problem that happens a lot when using recipes is accidentally selecting variables that have the wrong types. Previously this caused the following error:

recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(starts_with("Lot_")) |>
  prep()
#> Error in `step_dummy()`:
#> Caused by error in `prep()`:
#> ! All columns selected for the step should be string, factor, or ordered.

In the newest release, it will detail the offending variables and what was wrong with them.

recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(starts_with("Lot_")) |>
  prep()
#> Error in `step_dummy()`:
#> Caused by error in `prep()`:
#> ✖ All columns selected for the step should be factor or ordered.
#> • 1 double variable found: `Lot_Frontage`
#> • 1 integer variable found: `Lot_Area`
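One way to avoid this error is to select columns by type rather than by name. The sketch below is one possible fix, assuming the recipes and modeldata packages are installed; it uses all_nominal_predictors() so that numeric columns such as Lot_Frontage and Lot_Area are never handed to step_dummy(). The object name `baked` is ours.

```r
library(recipes)

data("ames", package = "modeldata")

# Selecting by type means only factor and ordered predictors reach step_dummy().
baked <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# The numeric Lot_ columns are left untouched.
summary(baked[, c("Lot_Frontage", "Lot_Area")])
```

If only the Lot_ variables should be converted, naming the factor columns directly in step_dummy() is another option.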

Coming Attractions

In the next month or so we are planning a cascade of CRAN releases. There is a lot of new functionality coming your way, especially in the tune package.

A number of our packages will (finally) be able to cohesively fit, evaluate, tune, and predict models for event times (a.k.a., survival analysis). If you don’t do this type of work, you might not notice the new capabilities. However, if you do, tidymodels will be able to do a lot more for you.

We’ve also implemented a number of features related to model fairness. These tools allow tidymodels users to identify when machine learning models behave unfairly towards certain groups of people, and will also be included in the upcoming releases of tidymodels packages in Q1.

We’ll highlight a lot of these new capabilities in blog posts here as well as tutorials on tidymodels.org.

So, there’s a lot more coming! We are very excited to have these features officially available and to see what people can do with them.

Acknowledgements

We’d like to thank those in the community that contributed to tidymodels in the last quarter:

embed: @EmilHvitfeldt.
modeldb: @EmilHvitfeldt, @hadley, and @topepo.
recipes: @atusy, @bcadenato, @collinberke, @EmilHvitfeldt, @gfronk, @jkennel, @joeycouse, @jxu, @mastoffel, @matthewgson, @millermc38, @ray-p144, @sebsfox, @simonpcouch, and @topepo.
spatialsample: @mikemahoney218.
stacks: @juliasilge and @simonpcouch.
textrecipes: @EmilHvitfeldt, @jd4ds, and @masurp.
tidyposterior: @topepo.

We’re grateful for all of the tidymodels community, from observers to users to contributors. Happy modeling!
