<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>MLOps and Admin on Posit Open Source</title>
    <link>https://posit-open-source.netlify.app/categories/mlops-and-admin/</link>
    <description>Recent content in MLOps and Admin on Posit Open Source</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Mon, 22 Apr 2024 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://posit-open-source.netlify.app/categories/mlops-and-admin/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>News from the sparkly-verse</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-updates-q1-2024/</link>
      <pubDate>Mon, 22 Apr 2024 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-updates-q1-2024/</guid>
      <dc:creator>Edgar Ruiz</dc:creator><description><![CDATA[<h2 id="highlights">Highlights
</h2>
<p><code>sparklyr</code> and friends have received some important updates over the past few
months. Here are some highlights:</p>
<ul>
<li>
<p><code>spark_apply()</code> now works on Databricks Connect v2</p>
</li>
<li>
<p><code>sparkxgb</code> is coming back to life</p>
</li>
<li>
<p>Support for Spark 2.3 and below has ended</p>
</li>
</ul>
<h2 id="pysparklyr-014">pysparklyr 0.1.4
</h2>
<p><code>spark_apply()</code> now works on Databricks Connect v2. The latest <code>pysparklyr</code>
release uses the <code>rpy2</code> Python library as the backbone of the integration.</p>
<p>Databricks Connect v2 is based on Spark Connect. At this time, it supports
Python user-defined functions (UDFs), but not R user-defined functions.
Using <code>rpy2</code> circumvents this limitation. As shown in the diagram, <code>sparklyr</code>
sends the R code to the locally installed <code>rpy2</code>, which in turn sends it
to Spark. The <code>rpy2</code> installed in the remote Databricks cluster then runs
the R code.</p>
<figure>
<img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-updates-q1-2024/images/r-udfs.png" data-fig-alt="Diagram that shows how sparklyr transmits the R code via the rpy2 python package, and how Spark uses it to run the R code" width="600" alt="R code via rpy2" />
<figcaption aria-hidden="true">R code via rpy2</figcaption>
</figure>
<p>A big advantage of this approach is that <code>rpy2</code> supports Arrow. In fact, it
is the recommended Python library to use when integrating <a href="https://arrow.apache.org/docs/python/integration/python_r.html" target="_blank" rel="noopener">Spark, Arrow and
R</a>.
This means that data exchange between the three environments will be much
faster!</p>
<p>As in the original implementation, schema inference works, and as in the
original implementation, it comes at a performance cost. Unlike the original,
however, this implementation returns a &lsquo;columns&rsquo; specification that you can
pass the next time you run the call.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">spark_apply</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">tbl_mtcars</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">nrow</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">group_by</span> <span class="o">=</span> <span class="s">&#34;am&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; To increase performance, use the following schema:</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; columns = &#34;am double, x long&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; # Source:   table&lt;`sparklyr_tmp_table_b84460ea_b1d3_471b_9cef_b13f339819b6`&gt; [2 x 2]</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; # Database: spark_connection</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt;      am     x</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt;   &lt;dbl&gt; &lt;dbl&gt;</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 1     0    19</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 2     1    13</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>A full article about this new capability is available here:
<a href="https://spark.posit.co/deployment/databricks-connect-udfs.html" target="_blank" rel="noopener">Run R inside Databricks Connect</a>
</p>
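<p>Because the suggested schema is returned as plain text, it can be passed back on the next call to skip inference altogether. A minimal sketch (not from the post), assuming the same <code>tbl_mtcars</code> table and an open Databricks Connect session:</p>

```r
# Sketch only: requires a live Spark / Databricks Connect session
# and the `tbl_mtcars` table used in the example above.
library(sparklyr)

# Supplying the schema suggested by the previous run via `columns`
# lets spark_apply() skip the costly inference step.
spark_apply(
  tbl_mtcars,
  nrow,
  group_by = "am",
  columns = "am double, x long"
)
```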
<h2 id="sparkxgb">sparkxgb
</h2>
<p><code>sparkxgb</code> is an extension of <code>sparklyr</code> that enables integration with
<a href="https://xgboost.readthedocs.io/en/stable/" target="_blank" rel="noopener">XGBoost</a>. The current CRAN release
does not support the latest versions of XGBoost. This limitation recently
prompted a full refresh of <code>sparkxgb</code>. Here is a summary of the improvements,
which are currently in the <a href="https://github.com/rstudio/sparkxgb" target="_blank" rel="noopener">development version of the package</a>:</p>
<ul>
<li>
<p>The <code>xgboost_classifier()</code> and <code>xgboost_regressor()</code> functions no longer
pass the values of two arguments that XGBoost has deprecated and that
cause an error if used. The arguments remain in the R functions for
backwards compatibility, but they now generate an informative error if not left <code>NULL</code>:</p>
<ul>
<li><code>sketch_eps</code> - As of <a href="https://github.com/dmlc/xgboost/blob/59d7b8dc72df7ed942885676964ea0a681d09590/NEWS.md?plain=1#L494" target="_blank" rel="noopener">XGBoost version 1.6.0</a>,
<code>sketch_eps</code> was replaced by <code>max_bins</code></li>
<li><code>timeout_request_workers</code> - Removed in <a href="https://github.com/dmlc/xgboost/blob/59d7b8dc72df7ed942885676964ea0a681d09590/NEWS.md?plain=1#L398" target="_blank" rel="noopener">XGBoost version 1.7.0</a>,
because it was no longer needed once XGBoost added barrier support</li>
</ul>
</li>
<li>
<p>Updates the JVM library used during the Spark session. It now uses <a href="https://central.sonatype.com/artifact/ml.dmlc/xgboost4j-spark_2.12" target="_blank" rel="noopener">xgboost4j-spark
version 2.0.3</a>,
instead of 0.8.1. This gives us access to XGBoost&rsquo;s most recent Spark code.</p>
</li>
<li>
<p>Updates code that used deprecated functions from upstream R dependencies, and
stops using an unmaintained package (<code>forge</code>) as a dependency. This
eliminated all of the warnings that occurred when fitting a model.</p>
</li>
<li>
<p>Major improvements to package testing. Unit tests were updated and expanded,
the way <code>sparkxgb</code> automatically starts and stops the Spark session for testing
was modernized, and the continuous integration tests were restored. This will
ensure the package&rsquo;s health going forward.</p>
</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">remotes</span><span class="o">::</span><span class="nf">install_github</span><span class="p">(</span><span class="s">&#34;rstudio/sparkxgb&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparkxgb</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_tbl</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">iris</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">xgb_model</span> <span class="o">&lt;-</span> <span class="nf">xgboost_classifier</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">iris_tbl</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">Species</span> <span class="o">~</span> <span class="n">.,</span>
</span></span><span class="line"><span class="cl">  <span class="n">num_class</span> <span class="o">=</span> <span class="m">3</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">num_round</span> <span class="o">=</span> <span class="m">50</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">max_depth</span> <span class="o">=</span> <span class="m">4</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">xgb_model</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">ml_predict</span><span class="p">(</span><span class="n">iris_tbl</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">Species</span><span class="p">,</span> <span class="n">predicted_label</span><span class="p">,</span> <span class="nf">starts_with</span><span class="p">(</span><span class="s">&#34;probability_&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">glimpse</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Rows: ??</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Columns: 5</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Database: spark_connection</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ Species                &lt;chr&gt; &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa…</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ predicted_label        &lt;chr&gt; &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa&#34;, &#34;setosa…</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ probability_setosa     &lt;dbl&gt; 0.9971547, 0.9948581, 0.9968392, 0.9968392, 0.9…</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ probability_versicolor &lt;dbl&gt; 0.002097376, 0.003301427, 0.002284616, 0.002284…</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; $ probability_virginica  &lt;dbl&gt; 0.0007479066, 0.0018403779, 0.0008762418, 0.000…</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="sparklyr-185">sparklyr 1.8.5
</h2>
<p>The new version of <code>sparklyr</code> does not include user-facing improvements, but
internally it has crossed an important milestone: support for Spark version 2.3
and below has effectively ended, and the Scala code needed to support those
versions is no longer part of the package. As per Spark&rsquo;s <a href="https://spark.apache.org/versioning-policy.html" target="_blank" rel="noopener">versioning policy</a>,
Spark 2.3 reached end of life in 2018.</p>
<p>This is part of a larger, ongoing effort to make the immense code base of
<code>sparklyr</code> a little easier to maintain, and hence to reduce the risk of failures.
As part of the same effort, the number of upstream packages that <code>sparklyr</code>
depends on has been reduced. This has been happening across multiple CRAN
releases; in this latest release, <code>tibble</code> and <code>rappdirs</code> are no longer
imported by <code>sparklyr</code>.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-updates-q1-2024/thumbnail.png" length="314789" type="image/png" />
    </item>
    <item>
      <title>Announcing bundle</title>
      <link>https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/</link>
      <pubDate>Fri, 16 Sep 2022 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/</guid>
      <dc:creator>Julia Silge</dc:creator><description><![CDATA[<p>We&rsquo;re thrilled to announce the first release of <a href="https://rstudio.github.io/bundle/" target="_blank" rel="noopener">bundle</a>
. The bundle package provides a consistent interface to capture all information needed to serialize a model, situate that information within a portable object, and restore it for use in new settings.</p>
<p>You can install it from CRAN with:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;bundle&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Let&rsquo;s walk through what bundle does, and when you might need to use it.</p>
<h2 id="saving-things-is-hard">Saving things is hard
</h2>
<p>We often think of a trained model as a self-contained R object. The model exists in memory in R and if we have some new data, the model object can generate predictions on its own:</p>
<img src="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/diagram_01.png" alt="A diagram showing a rectangle, labeled model object, and another rectangle, labeled predictions. The two are connected by an arrow from model object to predictions, with the label predict." width="100%" />
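<p>For instance, a plain <code>lm()</code> fit in base R is fully self-contained (a quick sketch, not from the post):</p>

```r
# A self-contained model object: everything predict() needs lives
# inside the fitted object itself.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Given new data, the object generates predictions on its own --
# no external references required.
preds <- predict(fit, newdata = head(mtcars, 3))
length(preds)
#> [1] 3
```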
<p>In reality, model objects sometimes also make use of <em>references</em> to generate predictions. A reference is a piece of information that a model object refers to that isn&rsquo;t part of the object itself; this could be something like a connection to a server, a file on disk, or an internal function in the package used to train the model. When we call <code>predict()</code>, model objects know where to look to retrieve that information:</p>
<img src="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/diagram_02.png" alt="A diagram showing the same pair of rectangles as before, connected by the arrow labeled predict. This time, though, we introduce two boxes labeled reference. These two boxes are connected to the arrow labeled predict with dotted arrows, to show that, most of the time, we don't need to think about including them in our workflow." width="100%" />
<p>Saving model objects can sometimes disrupt those references. Thus, if we want to train a model, save it, re-load it in a production setting, and generate predictions with it, we may run into issues:</p>
<img src="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/diagram_03.png" alt="A diagram showing the same set of rectangles, representing a prediction problem, as before. This version of the diagram adds two boxes, labeled R Session numbe r one, and R session number two. In R session number two, we have a new rectangle labeled standalone model object. In focus is the arrow from the model object, in R Session number one, to the standalone model object in R session number two." width="100%" />
<p>We need some way to preserve access to those references. This new package provides a consistent interface for <em>bundling</em> model objects with their references so that they can be safely saved and re-loaded in production:</p>
<img src="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/diagram_04.png" alt="A replica of the previous diagram, where the arrow previously connecting the model object in R session one and the standalone model object in R session two is connected by a verb called bundle. The bundle function outputs an object called a bundle." width="100%" />
<h2 id="when-to-bundle-your-model">When to bundle your model
</h2>
<p>Let&rsquo;s walk through building a couple of models using data on <a href="https://modeldata.tidymodels.org/reference/cells.html" target="_blank" rel="noopener">cell body segmentation</a>
.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">data</span><span class="p">(</span><span class="n">cells</span><span class="p">,</span> <span class="n">package</span> <span class="o">=</span> <span class="s">&#34;modeldata&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">cell_split</span> <span class="o">&lt;-</span> <span class="n">cells</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="o">-</span><span class="n">case</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">initial_split</span><span class="p">(</span><span class="n">strata</span> <span class="o">=</span> <span class="n">class</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">cell_train</span> <span class="o">&lt;-</span> <span class="nf">training</span><span class="p">(</span><span class="n">cell_split</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">cell_test</span>  <span class="o">&lt;-</span> <span class="nf">testing</span><span class="p">(</span><span class="n">cell_split</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>First, let&rsquo;s train a logistic regression model:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">glm_fit</span> <span class="o">&lt;-</span> <span class="nf">glm</span><span class="p">(</span><span class="n">class</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">family</span> <span class="o">=</span> <span class="s">&#34;binomial&#34;</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">cell_train</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>If we&rsquo;re satisfied with this model and think it is ready for production, we might want to deploy it somewhere, maybe as a REST API or as a Shiny app. A typical approach would be to:</p>
<ul>
<li>save our model object</li>
<li>start up a new R session</li>
<li>load the model object into the new session</li>
<li>predict on new data with the loaded model object</li>
</ul>
<p>The <a href="https://callr.r-lib.org/" target="_blank" rel="noopener">callr</a>
 package is helpful for demonstrating this kind of situation; it allows us to start up a fresh R session and pass a few objects in.</p>
<p>We&rsquo;ll just make use of two of the arguments to the function <code>r()</code>:</p>
<ul>
<li><code>func</code>: A function that, given a model object and some new data, will generate predictions, and</li>
<li><code>args</code>: A named list, giving the arguments to the above function.</li>
</ul>
<p>Let&rsquo;s save our model object to a temporary file and pass it to a fresh R session for prediction, like if we had deployed the model somewhere.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">callr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">temp_file</span> <span class="o">&lt;-</span> <span class="nf">tempfile</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="nf">saveRDS</span><span class="p">(</span><span class="n">glm_fit</span><span class="p">,</span> <span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">r</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="kr">function</span><span class="p">(</span><span class="n">temp_file</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">model_object</span> <span class="o">&lt;-</span> <span class="nf">readRDS</span><span class="p">(</span><span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">predict</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="n">args</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">temp_file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_data</span> <span class="o">=</span> <span class="nf">head</span><span class="p">(</span><span class="n">cell_test</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>##          1          2          3          4          5          6 
## -4.8706401 -1.8143956  2.3386470 -1.2735249 -0.3586448  2.7865270
</code></pre><p>Nice! 😀</p>
<p>What if instead we wanted to train a neural network using tidymodels, with keras as the modeling engine?</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">cell_rec</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">recipe</span><span class="p">(</span><span class="n">class</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">cell_train</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_YeoJohnson</span><span class="p">(</span><span class="nf">all_numeric_predictors</span><span class="p">())</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_normalize</span><span class="p">(</span><span class="nf">all_numeric_predictors</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">keras_spec</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">mlp</span><span class="p">(</span><span class="n">penalty</span> <span class="o">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">set_mode</span><span class="p">(</span><span class="s">&#34;classification&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">set_engine</span><span class="p">(</span><span class="s">&#34;keras&#34;</span><span class="p">,</span> <span class="n">verbose</span> <span class="o">=</span> <span class="m">0</span><span class="p">)</span> 
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">keras_fit</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">workflow</span><span class="p">(</span><span class="n">cell_rec</span><span class="p">,</span> <span class="n">keras_spec</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">fit</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">cell_train</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Let&rsquo;s try to save this to disk and then reload it in a new session.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">temp_file</span> <span class="o">&lt;-</span> <span class="nf">tempfile</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="nf">saveRDS</span><span class="p">(</span><span class="n">keras_fit</span><span class="p">,</span> <span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">r</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="kr">function</span><span class="p">(</span><span class="n">temp_file</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">library</span><span class="p">(</span><span class="n">workflows</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">model_object</span> <span class="o">&lt;-</span> <span class="nf">readRDS</span><span class="p">(</span><span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">predict</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="n">args</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">temp_file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_data</span> <span class="o">=</span> <span class="nf">head</span><span class="p">(</span><span class="n">cell_test</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>## Error: ! error in callr subprocess
## Caused by error in `do.call(object$predict, args)`:
## ! &#39;what&#39; must be a function or character string
</code></pre><p>Oh no! 😱</p>
<p>It turns out that keras models <a href="https://tensorflow.rstudio.com/guides/keras/serialization_and_saving.html" target="_blank" rel="noopener">need to be saved in a special way</a>
. This is true of a handful of models, like XGBoost, and even some preprocessing steps, like UMAP. These special ways to save objects, like the ones that keras provides, are often referred to as <em>native serialization</em>. Methods for native serialization know which references need to be brought along in order for an object to effectively do its thing in a new environment, but they are different for each model.</p>
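<p>For reference, keras&rsquo;s own native serialization in R looks something like this (a sketch, assuming a raw keras model object rather than a tidymodels workflow):</p>
<pre tabindex="0"><code>library(keras)

# native save: writes the architecture, weights, and optimizer state
save_model_tf(keras_model, &#34;model_dir&#34;)

# native load in a new session
keras_model &lt;- load_model_tf(&#34;model_dir&#34;)
</code></pre>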
<p>The bundle package provides a consistent way to deal with all these kinds of special serialization. It provides two functions, <code>bundle()</code> and <code>unbundle()</code>, that take care of all of the minutiae of preparing to save and load R objects effectively. You <code>bundle()</code> your model before you save it:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">bundle</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">temp_file</span> <span class="o">&lt;-</span> <span class="nf">tempfile</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="n">keras_bundle</span> <span class="o">&lt;-</span> <span class="nf">bundle</span><span class="p">(</span><span class="n">keras_fit</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">saveRDS</span><span class="p">(</span><span class="n">keras_bundle</span><span class="p">,</span> <span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>And then you <code>unbundle()</code> after you read the object in a new session:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">r</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="kr">function</span><span class="p">(</span><span class="n">temp_file</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">library</span><span class="p">(</span><span class="n">bundle</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">library</span><span class="p">(</span><span class="n">workflows</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">model_bundle</span> <span class="o">&lt;-</span> <span class="nf">readRDS</span><span class="p">(</span><span class="n">file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">model_object</span> <span class="o">&lt;-</span> <span class="nf">unbundle</span><span class="p">(</span><span class="n">model_bundle</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">predict</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">new_data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="n">args</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">temp_file</span> <span class="o">=</span> <span class="n">temp_file</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">new_data</span> <span class="o">=</span> <span class="nf">head</span><span class="p">(</span><span class="n">cell_test</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>## # A tibble: 6 × 1
##   .pred_class
##   &lt;fct&gt;      
## 1 PS         
## 2 PS         
## 3 WS         
## 4 PS         
## 5 PS         
## 6 WS
</code></pre><p>Hooray! 🎉</p>
<p>We have support in bundle for a <a href="https://rstudio.github.io/bundle/reference/" target="_blank" rel="noopener">wide variety</a>
 of models that require (or <em>sometimes</em> require) special handling for serialization, from <a href="https://h2o.ai/" target="_blank" rel="noopener">H2O</a>
 to <a href="https://mlverse.github.io/luz/" target="_blank" rel="noopener">torch luz models</a>
. Soon bundle will be integrated into <a href="https://vetiver.rstudio.com/" target="_blank" rel="noopener">vetiver</a>
, for better and more robust deployment options. If you use a model that needs special serialization and is not yet supported, <a href="https://github.com/rstudio/bundle/issues" target="_blank" rel="noopener">let us know</a>
 in an issue.</p>
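<p>The same two-step pattern applies to any supported model type; for example, for an xgboost fit (a sketch, assuming an existing fitted object named <code>xgb_fit</code>):</p>
<pre tabindex="0"><code>library(bundle)

xgb_bundle &lt;- bundle(xgb_fit)
saveRDS(xgb_bundle, &#34;xgb_bundle.rds&#34;)

# ...then, in a new R session:
xgb_fit2 &lt;- unbundle(readRDS(&#34;xgb_bundle.rds&#34;))
</code></pre>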
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>Thank you so much to everyone who contributed to this first release: <a href="https://github.com/dfalbel" target="_blank" rel="noopener">@dfalbel</a>
, <a href="https://github.com/juliasilge" target="_blank" rel="noopener">@juliasilge</a>
, <a href="https://github.com/qiushiyan" target="_blank" rel="noopener">@qiushiyan</a>
, and <a href="https://github.com/simonpcouch" target="_blank" rel="noopener">@simonpcouch</a>
. I would especially like to highlight Simon&rsquo;s contributions, which have been central to bundle getting off the ground!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/tidyverse/2022/bundle-0-1-0/thumbnail-wd.jpg" length="185242" type="image/jpeg" />
    </item>
    <item>
      <title>Announcing vetiver for MLOps in R and Python</title>
      <link>https://posit-open-source.netlify.app/blog/tidyverse/2022/announce-vetiver/</link>
      <pubDate>Thu, 09 Jun 2022 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/tidyverse/2022/announce-vetiver/</guid>
      <dc:creator>Julia Silge</dc:creator><description><![CDATA[
<p>We are thrilled to announce the release of <a href="https://vetiver.rstudio.com/" target="_blank" rel="noopener">vetiver</a>
, a framework for MLOps tasks in R and Python! The goal of vetiver is to provide fluent tooling to <strong>version</strong>, <strong>share</strong>, <strong>deploy</strong>, and <strong>monitor</strong> a trained model. If you like perfume or candles, you may recognize this name; vetiver, also known as the &ldquo;oil of tranquility&rdquo;, is used as a stabilizing ingredient in perfumery to preserve more volatile fragrances.</p>
<p>You can install the released version of vetiver for R from <a href="https://cran.r-project.org/package=vetiver" target="_blank" rel="noopener">CRAN</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;vetiver&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can install the released version of vetiver for Python from <a href="https://pypi.org/project/vetiver/" target="_blank" rel="noopener">PyPI</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">pip</span> <span class="n">install</span> <span class="n">vetiver</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We are sharing more about what vetiver is and how it works over <a href="https://www.rstudio.com/blog/announce-vetiver/" target="_blank" rel="noopener">on the RStudio blog</a>
 so check that out, but we want to share here as well!</p>
<h2 id="train-a-model">Train a model
</h2>
<p>For this example, let’s work with everyone&rsquo;s favorite dataset on fuel efficiency for cars and predict miles per gallon. In R, we can train a decision tree model using a <a href="https://www.tidymodels.org/" target="_blank" rel="noopener">tidymodels</a>
 workflow:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">car_mod</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">    <span class="nf">workflow</span><span class="p">(</span><span class="n">mpg</span> <span class="o">~</span> <span class="n">.,</span> <span class="nf">decision_tree</span><span class="p">(</span><span class="n">mode</span> <span class="o">=</span> <span class="s">&#34;regression&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">fit</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In Python, we can train the same kind of model using <a href="https://scikit-learn.org/" target="_blank" rel="noopener">scikit-learn</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">vetiver.data</span> <span class="kn">import</span> <span class="n">mtcars</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">tree</span>
</span></span><span class="line"><span class="cl"><span class="n">car_mod</span> <span class="o">=</span> <span class="n">tree</span><span class="o">.</span><span class="n">DecisionTreeRegressor</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">mtcars</span><span class="p">[</span><span class="s2">&#34;mpg&#34;</span><span class="p">])</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>For both R and Python, the <code>car_mod</code> object is a fitted model, with parameters estimated using our training data <code>mtcars</code>.</p>
<h2 id="create-a-vetiver-model">Create a vetiver model
</h2>
<p>We can create a <code>vetiver_model()</code> in R or <code>VetiverModel()</code> in Python from the trained model; a vetiver model object collects the information needed to store, version, and deploy a trained model.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">vetiver</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">v</span> <span class="o">&lt;-</span> <span class="nf">vetiver_model</span><span class="p">(</span><span class="n">car_mod</span><span class="p">,</span> <span class="s">&#34;cars_mpg&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">v</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; ── cars_mpg ─ &lt;butchered_workflow&gt; model for deployment </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; A rpart regression modeling workflow using 10 features</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">vetiver</span> <span class="kn">import</span> <span class="n">VetiverModel</span>
</span></span><span class="line"><span class="cl"><span class="n">v</span> <span class="o">=</span> <span class="n">VetiverModel</span><span class="p">(</span><span class="n">car_mod</span><span class="p">,</span> <span class="n">model_name</span> <span class="o">=</span> <span class="s2">&#34;cars_mpg&#34;</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">                 <span class="n">save_ptype</span> <span class="o">=</span> <span class="kc">True</span><span class="p">,</span> <span class="n">ptype_data</span> <span class="o">=</span> <span class="n">mtcars</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">v</span><span class="o">.</span><span class="n">description</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; &#34;Scikit-learn &lt;class &#39;sklearn.tree._classes.DecisionTreeRegressor&#39;&gt; model&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>See our documentation for how to use these deployable model objects, including how to:</p>
<ul>
<li><a href="https://vetiver.rstudio.com/get-started/version.html" target="_blank" rel="noopener">publish and version your model</a>
</li>
<li><a href="https://vetiver.rstudio.com/get-started/deploy.html" target="_blank" rel="noopener">deploy your model as a REST API</a>
</li>
</ul>
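<p>As a quick taste of versioning, a vetiver model can be written to a pins board with <code>vetiver_pin_write()</code> (a sketch; a temporary board is used here only for illustration):</p>
<pre tabindex="0"><code>library(pins)

board &lt;- board_temp(versioned = TRUE)
vetiver_pin_write(board, v)
</code></pre>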
<p>Be sure to also read more <a href="https://www.rstudio.com/blog/announce-vetiver/" target="_blank" rel="noopener">on the RStudio blog</a>
.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>We&rsquo;d like to extend our thanks to all of the contributors who helped make these initial releases of vetiver for R and Python possible!</p>
<ul>
<li>
<p>R package: <a href="https://github.com/cderv" target="_blank" rel="noopener">@cderv</a>
, <a href="https://github.com/ggpinto" target="_blank" rel="noopener">@ggpinto</a>
, <a href="https://github.com/isabelizimm" target="_blank" rel="noopener">@isabelizimm</a>
, <a href="https://github.com/juliasilge" target="_blank" rel="noopener">@juliasilge</a>
, and <a href="https://github.com/mfansler" target="_blank" rel="noopener">@mfansler</a>
</p>
</li>
<li>
<p>Python package: <a href="https://github.com/has2k1" target="_blank" rel="noopener">@has2k1</a>
, and <a href="https://github.com/isabelizimm" target="_blank" rel="noopener">@isabelizimm</a>
</p>
</li>
</ul>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/tidyverse/2022/announce-vetiver/thumbnail-wd.jpg" length="152876" type="image/jpeg" />
    </item>
    <item>
      <title>Integrating Dynamic R and Python Models in Tableau Using plumbertableau</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/</link>
      <pubDate>Mon, 20 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/</guid>
      <dc:creator>Isabella Velásquez</dc:creator><description><![CDATA[<p>RStudio believes that you can attain greater business intelligence with interoperable tools that <a href="https://www.rstudio.com/solutions/interoperability/" target = "_blank">take full advantage of open-source data science</a>. Your organization may rely on Tableau for reporting purposes, but how can you ensure that you&rsquo;re using the full power of your data science team&rsquo;s R and Python models in your dashboards?</p>
<p>With the <a href="https://rstudio.github.io/plumbertableau/index.html" target = "_blank">plumbertableau</a> package (and its corresponding Python package, <a href="https://rstudio.github.io/fastapitableau/" target = "_blank">fastapitableau</a>), you can use functions or models created in R or Python from Tableau through an API. These packages allow you to showcase cutting-edge data science results in your organization’s preferred dashboard tool.</p>
<p>While this post mentions R, anything possible with R and plumbertableau is also doable with Python and fastapitableau.</p>
<h2 id="foster-data-analytics-capabilities-with-plumbertableau">Foster Data Analytics Capabilities With plumbertableau
</h2>
<p>With plumbertableau, you can fully develop your model with code-first data science. The package uses <a href="https://www.rplumber.io/" target = "_blank">plumber</a> to create an API directly from your code. Since your model is fully developed in your data science editor, it can use all the packages and complex calculations it needs.</p>
<p>You can extract the best data science results using R&rsquo;s capabilities as your model will not be constrained by Tableau&rsquo;s environment.</p>
<h2 id="improve-data-quality-with-apis-for-continuous-use">Improve Data Quality With APIs for Continuous Use
</h2>
<p>Seamless integration between analytic platforms prevents issues like using outdated, inaccurate, or incomplete data. Rather than depending on a manual process, data scientists can rely on their data pipelines to ensure data integrity.</p>
<p>With plumbertableau, your tools are integrated through an API. The Tableau dashboard displays results without any intermediate manipulation like copy-and-pasting code or uploading datasets. You can work in confidence knowing your results are synchronized, accurate, and reproducible.</p>
<h2 id="increase-deliverability-by-streamlining-data-pipelines">Increase Deliverability by Streamlining Data Pipelines
</h2>
<p>If your model has many dependencies or versioning requirements, it can be difficult to handle them outside of the development environment. Debugging is even more time-consuming when you need to work in separate environments to figure out what went wrong.</p>
<p>With <a href="https://connect.rstudioservices.com/connect/" target = "_blank">RStudio Connect</a>, you can publish plumbertableau extensions directly from the RStudio IDE. RStudio Connect automatically manages your API&rsquo;s dependent packages and files to recreate an environment that closely mimics your local development environment. And since all your code remains in R, you can use your usual data science techniques to efficiently resolve issues.</p>
<p>Read more on the <a href="https://www.rplumber.io/articles/hosting.html/" target = "_blank">Hosting</a> page of the plumber documentation.</p>
<h2 id="how-to-use-plumbertableau-xgboost-with-dynamic-model-output-example">How to Use plumbertableau: XGBoost with Dynamic Model Output Example
</h2>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif4.gif"
      alt="Showing predictive values in Tableau dashboard" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>In this walkthrough, we will be using data from the <a href="https://data.seattle.gov/" target = "_blank">Seattle Open Data Portal</a> to predict the paid parking occupancy percentage in various areas around the city. We will run an XGBoost model in RStudio, create a plumbertableau extension to embed into Tableau, and visualize and interact with the model in a Tableau dashboard. The code is here for reproducibility purposes; however, it will <strong>require</strong> an RStudio Connect account to complete.</p>
<p>The plumbertableau and fastapitableau packages have wonderful documentation. Be sure to read it for more information on:</p>
<ul>
<li>The anatomy of the extensions</li>
<li>Details on setting up RStudio Connect and Tableau</li>
<li>Other examples to try out in your Tableau dashboards</li>
</ul>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/img2.png"
      alt="Displaying dynamic model output in Tableau steps" 
      loading="lazy"
    >
  </figure></div>
</p>
<h3 id="1-build-the-model">1. Build the model
</h3>
<p>First, we need to build a model. This walkthrough won’t be covering how to create, tune, or validate a model. If you&rsquo;d like to learn more on models and machine learning, check out the <a href="https://www.tidymodels.org/" target = "_blank">tidymodels</a> website and Julia Silge&rsquo;s fantastic <a href="https://juliasilge.com/category/tidymodels/" target = "_blank">screencasts and tutorials</a>.</p>
<p><strong>Load Libraries</strong></p>
<pre tabindex="0"><code>library(tidyverse)
library(RSocrata)
library(lubridate)
library(usemodels)
library(tidymodels)
</code></pre><p><strong>Download and Clean Data</strong></p>
<p>The Seattle Open Data Portal uses <a href="https://www.tylertech.com/products/socrata" target = "_blank">Socrata</a>, a data management tool, for its APIs. We can use the <a href="https://cran.r-project.org/web/packages/RSocrata/index.html" target = "_blank">RSocrata</a> package to download the data.</p>
<pre tabindex="0"><code>parking_data &lt;-
  RSocrata::read.socrata(
    &#34;https://data.seattle.gov/resource/rke9-rsvs.json?$where=sourceelementkey &lt;= 1020&#34;
  )

parking_id &lt;-
  parking_data %&gt;%
  group_by(blockfacename, location.coordinates) %&gt;%
  mutate(id = cur_group_id()) %&gt;%
  ungroup()

parking_clean &lt;-
  parking_id %&gt;%
  mutate(across(c(parkingspacecount, paidoccupancy), as.numeric),
         occupancy_pct = paidoccupancy / parkingspacecount) %&gt;%
  group_by(
    id = id,
    hour = as.numeric(hour(occupancydatetime)),
    month = as.numeric(month(occupancydatetime)),
    dow = as.numeric(wday(occupancydatetime)),
    date = date(occupancydatetime)
  ) %&gt;%
  summarize(occupancy_pct = mean(occupancy_pct, na.rm = TRUE)) %&gt;%
  drop_na() %&gt;%
  ungroup()
</code></pre><p>We will also need information on the city blocks, so let&rsquo;s create that dataset.</p>
<pre tabindex="0"><code>parking_information &lt;-
  parking_id %&gt;%
  mutate(loc = location.coordinates) %&gt;%
  select(id, blockfacename, loc) %&gt;%
  distinct(id, blockfacename, loc) %&gt;%
  unnest_wider(loc, c(&#39;loc1&#39;, &#39;loc2&#39;))
</code></pre><p><strong>Create Training Data</strong></p>
<p>Now, let&rsquo;s create the training set from our original data.</p>
<pre tabindex="0"><code>parking_split &lt;-
  parking_clean %&gt;%
  arrange(date) %&gt;%
  select(-date) %&gt;%
  initial_time_split(prop = 0.75)
</code></pre><p><strong>Train and Tune the Model</strong></p>
<p>Here, we train and tune the model. We select the model with the best RMSE to use in our dashboard.</p>
<pre tabindex="0"><code>xgboost_recipe &lt;-
  recipe(formula = occupancy_pct ~ ., data = parking_clean) %&gt;%
  step_zv(all_predictors())  %&gt;%
  prep()

xgboost_folds &lt;-
  recipes::bake(xgboost_recipe,
                new_data = training(parking_split)) %&gt;%
  rsample::vfold_cv(v = 5)

xgboost_model &lt;-
  boost_tree(
    mode = &#34;regression&#34;,
    trees = 1000,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune()
  ) %&gt;%
  set_engine(&#34;xgboost&#34;, objective = &#34;reg:squarederror&#34;)

xgboost_params &lt;-
  parameters(min_n(),
             tree_depth(),
             learn_rate(),
             loss_reduction())

xgboost_grid &lt;-
  grid_max_entropy(xgboost_params,
                   size = 5)

xgboost_wf &lt;-
  workflows::workflow() %&gt;%
  add_model(xgboost_model) %&gt;%
  add_formula(occupancy_pct ~ .)

xgboost_tuned &lt;- tune::tune_grid(
  object = xgboost_wf,
  resamples = xgboost_folds,
  grid = xgboost_grid,
  metrics = yardstick::metric_set(rmse, rsq, mae),
  control = tune::control_grid(verbose = TRUE)
)

xgboost_best &lt;-
  xgboost_tuned %&gt;%
  tune::select_best(&#34;rmse&#34;)

xgboost_final &lt;-
  xgboost_model %&gt;%
  finalize_model(xgboost_best)
</code></pre><p>We bundle the recipe and fitted model in an object so we can use it later.</p>
<pre tabindex="0"><code>train_processed &lt;-
  bake(xgboost_recipe, new_data = training(parking_split))

prediction_fit &lt;-
  xgboost_final %&gt;%
  fit(formula = occupancy_pct ~ .,
      data    = train_processed)

model_details &lt;- list(model = xgboost_final,
                      recipe = xgboost_recipe,
                      prediction_fit = prediction_fit)
</code></pre><p><strong>Save Objects for the plumbertableau Extension</strong></p>
<p>We&rsquo;ll want to save our data and our model so that we can use them in the extension. If you have an RStudio Connect account, the <a href="https://pins.rstudio.com/" target = "_blank">pins</a> package is a great choice for saving these objects.</p>
<pre tabindex="0"><code>rsc &lt;-
  pins::board_rsconnect(server = Sys.getenv(&#34;CONNECT_SERVER&#34;),
                        key = Sys.getenv(&#34;CONNECT_API_KEY&#34;))

pins::pin_write(
  board = rsc,
  x = model_details,
  name = &#34;seattle_parking_model&#34;,
  description = &#34;Seattle Occupancy Percentage XGBoost Model&#34;,
  type = &#34;rds&#34;
)

pins::pin_write(
  board = rsc,
  x = parking_information,
  name = &#34;seattle_parking_info&#34;,
  description = &#34;Seattle Parking Information&#34;,
  type = &#34;rds&#34;
)
</code></pre><h3 id="2-create-a-plumbertableau-extension">2. Create a plumbertableau Extension
</h3>
<p>Next, we will use our model to create a plumbertableau extension. As noted previously, a plumbertableau extension is a plumber API with some special annotations.</p>
<p>Create an R script called <code>plumber.R</code>. At the top, we list the libraries we&rsquo;ll need.</p>
<pre tabindex="0"><code>library(plumber)
library(pins)
library(tibble)
library(xgboost)
library(lubridate)
library(dplyr)
library(tidyr)
library(tidymodels)
library(plumbertableau)
</code></pre><p>We want to bring in our model details and our data. If you pinned your data, you&rsquo;ll change the name of the pin below.</p>
<pre tabindex="0"><code>rsc &lt;-
  pins::board_rsconnect(
    server = Sys.getenv(&#34;CONNECT_SERVER&#34;),
    key = Sys.getenv(&#34;CONNECT_API_KEY&#34;)
  )

xgboost_model &lt;-
  pins::pin_read(&#34;isabella.velasquez/seattle_parking_model&#34;, board = rsc)
</code></pre><p>Now, we add our <a href="https://www.rplumber.io/articles/annotations.htm" target = "_blank">annotations</a>. Note that we use plumbertableau annotations, which are slightly different from the ones in plumber.</p>
<ul>
<li>We use <code>tableauArg</code> rather than <code>params</code>.</li>
<li>We specify what is returned to Tableau with <code>tableauReturn</code>.</li>
<li>We must use <code>post</code> as the request method for the endpoint.</li>
</ul>
<pre tabindex="0"><code>#* @apiTitle Seattle Parking Occupancy Percentage Prediction API
#* @apiDescription Return the predicted occupancy percentage at various Seattle locations

#* @tableauArg block_id:integer numeric block ID
#* @tableauArg ndays:integer number of days in the future for the prediction

#* @tableauReturn [numeric] Predicted occupancy rate
#* @post /pred
</code></pre><p>Now, we create our function with the arguments <code>block_id</code> and <code>ndays</code>. These will have corresponding arguments in Tableau. The function will output our predicted occupancy percentage, which will be what we visualize and interact with in the dashboard.</p>
<p>This function takes the city block and number of days in the future to give us the predicted occupancy percentage at that time.</p>
<pre tabindex="0"><code>function(block_id, ndays) {
  times &lt;- Sys.time() + lubridate::ddays(ndays)
  
  current_time &lt;-
    tibble::tibble(times = times,
                   id = block_id)
  
  current_prediction  &lt;-
    current_time %&gt;%
    transmute(
      id = id,
      hour = hour(times),
      month = month(times),
      dow = wday(times),
      occupancy_pct = NA
    ) %&gt;%
    bake(xgboost_model$recipe, .)
  
  parking_prediction &lt;-
    xgboost_model$prediction_fit %&gt;%
    predict(new_data = current_prediction)
  
  predictions &lt;-
    parking_prediction$.pred
  
  predictions[[1]]
  
}
</code></pre><p>Finally, we finish off our script with the extension footer needed for plumbertableau extensions.</p>
<pre tabindex="0"><code>#* @plumber
tableau_extension
</code></pre><p>Here is the full <code>plumber.R</code> script:</p>
<pre tabindex="0"><code>library(plumber)
library(pins)
library(tibble)
library(xgboost)
library(lubridate)
library(dplyr)
library(tidyr)
library(tidymodels)
library(plumbertableau)

rsc &lt;-
  pins::board_rsconnect(server = Sys.getenv(&#34;CONNECT_SERVER&#34;),
                        key = Sys.getenv(&#34;CONNECT_API_KEY&#34;))

xgboost_model &lt;-
  pins::pin_read(&#34;isabella.velasquez/seattle_parking_model&#34;, board = rsc)

#* @apiTitle Seattle Parking Occupancy Percentage Prediction API
#* @apiDescription Return the predicted occupancy percentage at various Seattle locations

#* @tableauArg block_id:integer numeric block ID
#* @tableauArg ndays:integer number of days in the future for the prediction

#* @tableauReturn [numeric] Predicted occupancy rate
#* @post /pred

function(block_id, ndays) {
  times &lt;- Sys.time() + lubridate::ddays(ndays)
  
  current_time &lt;-
    tibble::tibble(times = times,
                   id = block_id)
  
  current_prediction  &lt;-
    current_time %&gt;%
    transmute(
      id = id,
      hour = hour(times),
      month = month(times),
      dow = wday(times),
      occupancy_pct = NA
    ) %&gt;%
    bake(xgboost_model$recipe, .)
  
  parking_prediction &lt;-
    xgboost_model$prediction_fit %&gt;%
    predict(new_data = current_prediction)
  
  predictions &lt;-
    parking_prediction$.pred
  
  predictions[[1]]
  
}

#* @plumber
tableau_extension
</code></pre><h3 id="3-host-your-api">3. Host your API
</h3>
<p>We have to host our API so that it can be accessed in Tableau. In our case, we publish it to RStudio Connect.</p>
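<p>If you prefer to publish from the console rather than the IDE, the <code>rsconnect</code> package can deploy the directory containing <code>plumber.R</code>. A sketch, assuming you have already registered your Connect server and API key; the server name and directory below are placeholders:</p>

```r
library(rsconnect)

# One-time setup: register the Connect server and an API user
# (URL, server name, and account are placeholders for your own)
# addServer(url = "https://connect.example.com/__api__", name = "my-connect")
# connectApiUser(account = "me", server = "my-connect",
#                apiKey = Sys.getenv("CONNECT_API_KEY"))

# Deploy the directory containing plumber.R as an API
deployAPI(api = "parking-prediction/", server = "my-connect")
```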
<p>Once hosted, plumbertableau automatically generates a documentation page. Notice that the <code>SCRIPT_*</code> value is not R code. This is a Tableau command that we will use to connect our extension and Tableau.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/img3.png"
      alt="Automatically generated plumbertableau documentation page" 
      loading="lazy"
    >
  </figure></div>
</p>
<caption><center><i>Automatically generated plumbertableau documentation page</i></center></caption>
<h3 id="4-create-a-calculated-field-in-tableau">4. Create a calculated field in Tableau
</h3>
<p>There are a few steps you need to take so that Tableau can use your plumbertableau extension. If you are using RStudio Connect, read the documentation on how to <a href="https://docs.rstudio.com/rsc/integration/tableau/" target = "_blank">configure RStudio Connect as an analytic extension</a>.</p>
<p>Create a new workbook and upload the <code>station_information</code> file. Under Analysis, turn off Aggregate Measures. Drop <code>Lat</code> into Rows and <code>Lon</code> into Columns, which will create a map. Save the workbook.</p>
<p>Make sure your workbook knows to connect to RStudio Connect by going to Analysis &gt; Manage Analytic Extensions Connection &gt; Choose a Connection. Then, select your Connect account.</p>
<p>Drag <code>Id</code> into the &ldquo;Detail&rdquo; mark. Create a parameter called &ldquo;Days in the Future&rdquo;. This parameter tells our model how many days ahead to predict the parking occupancy percentage. Show the parameter on the worksheet.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif1.gif"
      alt="Creating a parameter in Tableau" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>Create a calculated field using the <code>SCRIPT</code> from the plumbertableau documentation page:</p>
<pre tabindex="0"><code>SCRIPT_REAL(&#34;/plumbertableau-xgboost-example/pred&#34;, block_id, ndays) 
</code></pre><p>For each <code>tableauArg</code> we have listed in the extension, we will replace it with its corresponding Tableau value. If you&rsquo;re following along, this means <code>block_id</code> will become <code>ATTR([Id])</code> and <code>ndays</code> will become <code>ATTR([Days in the Future])</code>.</p>
<pre tabindex="0"><code>SCRIPT_REAL(&#34;/plumbertableau-xgboost-example/pred&#34;, ATTR([Id]), ATTR([Days in the Future]))
</code></pre><p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif2.gif"
      alt="Creating a calculated field from a plumbertableau extension" 
      loading="lazy"
    >
  </figure></div>
</p>
<h3 id="5-run-model-and-visualize-results-in-tableau">5. Run model and visualize results in Tableau
</h3>
<p>That&rsquo;s it! Once you embed your extension in Tableau’s calculated fields, you can use your model&rsquo;s results in your Tableau dashboard like any other measure or dimension.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif3.gif"
      alt="Showing predictive results in Tableau" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>We can change the <code>ndays</code> argument to get new predictions from our XGBoost model and display them on our Tableau dashboard.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/img/gif5.gif"
      alt="Showing predictive results in Tableau dashboard by changing the number of days in the future" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>You can style your Tableau dashboard and then provide your users something that is not only aesthetically pleasing, but is dynamically calculating predictions based on a model you have created in R.</p>
<h2 id="conclusion">Conclusion
</h2>
<p>With plumbertableau, you can showcase sophisticated model results that are easy to integrate, debug, and reproduce. Your work will be at the forefront of data science while being visualized in Tableau&rsquo;s easy, point-and-click interface.</p>
<h2 id="learn-more">Learn More
</h2>
<p>Watch James Blair showcase plumbertableau in Leveraging R &amp; Python in Tableau with RStudio Connect:</p>
<script src="https://fast.wistia.com/embed/medias/hl37qvfnml.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_hl37qvfnml videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img src="https://fast.wistia.com/embed/medias/hl37qvfnml/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>
<p>More on how RStudio supports interoperability across tools can be found on our <a href="https://www.rstudio.com/solutions/bi-and-data-science/" target = "_blank">BI and Data Science Overview Page</a>.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/rstudio/dynamic-r-and-python-models-in-tableau-using-plumbertableau/thumbnail.png" length="97818" type="image/png" />
    </item>
    <item>
      <title>Sharing Data With the pins Package</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/sharing-data-with-the-pins-package/</link>
      <pubDate>Wed, 15 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/sharing-data-with-the-pins-package/</guid>
      <dc:creator>Katie Masiello</dc:creator>
      <dc:creator>Isabella Velásquez</dc:creator><description><![CDATA[<caption>
Photo by <a href="https://unsplash.com/@universaleye?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Universal Eye</a> on <a href="https://unsplash.com/@ivelasq/likes?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</caption>
<p>Teams often need access to key data to do their work, but have you ever opened your coworker&rsquo;s script to see:</p>
<pre tabindex="0"><code>dat &lt;-  
 read_csv(&#34;C://Users/someone_else/data/dataset.csv&#34;)

more_dat &lt;- 
 read_csv(&#34;S://Path_to_mapped_drive_that_you_dont_have/dataset.csv&#34;)
</code></pre><p>Yikes! How will you get these files? Let&rsquo;s hope you can reach your coworker before they’ve logged off for the day.</p>
<p>How can your code be reproducible if you have to manually change the file paths? <em>Shudder</em>.</p>
<p>What if you need to make edits to the data, will you have to keep copying CSVs and emailing files forever? <em>Double shudder.</em></p>
<p>What if your coworker accidentally forwards your email to someone who is not supposed to have access? <em>Oh no.</em></p>
<p>We can struggle to share data assets easily and safely, relying on emailed files to keep our analyses up to date. This makes it difficult to keep current or know what version of the data we’re using. If you&rsquo;ve ever experienced any of the scenarios above, consider <a href="https://www.rstudio.com/blog/pins-1-0-0/" target = "_blank">pins</a> as a solution that can help you share your data assets.</p>
<h2 id="what-is-a-pin-anyway">What <em>is</em> a pin, anyway?
</h2>
<p>Pins, from the <a href="https://pins.rstudio.com/" target = "_blank">R package of the same name</a>, are a versatile way to publish R objects on a virtual corkboard so you can share them across projects and people.</p>
<p>Good pins are data or assets that are a few hundred megabytes or smaller. You can pin just about any object: data, models, JSON files, feather files from the Arrow package, and more. One of the most frequent use cases is pinning small data sets — often ephemeral data or reference tables that don&rsquo;t quite merit being in a database, but seemingly don&rsquo;t have a good home elsewhere (until now).</p>
<p>Pins get published to a board, which can be an <a href="https://www.rstudio.com/products/connect/" target = "_blank">RStudio Connect</a> server, an AWS S3 bucket or Azure Blob Storage, a shared drive like Dropbox or Sharepoint, or a <a href="https://pins.rstudio.com/reference/index.html#section-boards" target = "_blank">variety of other options</a>. Try it out for yourself — read in this data set we’ve pinned for you on RStudio Connect!</p>
<pre tabindex="0"><code># Install the latest pins from CRAN
install.packages(&#34;pins&#34;)

library(pins)

# Identify the board
board &lt;-
  board_url(c(&#34;penguins&#34; = &#34;https://colorado.rstudio.com/rsc/example_pin/&#34;))

# Read the shared data
board %&gt;%
  pin_read(&#34;penguins&#34;)
</code></pre><p>In short, if you’ve ever wondered where to put an R object that you or your colleague will need to use again, you might just want to pin it.</p>
<h2 id="pins-for-sharing-across-projects-and-teams">Pins for Sharing Across Projects and Teams
</h2>
<p>One of the greatest strengths of pins is how your pin becomes accessible directly from your R scripts <em>and</em> the R scripts of anyone else to whom you’ve given access. Different projects can include code that reads the same pin without creating more copies of the data:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/sharing-data-with-the-pins-package/images/image1.png"
      alt="Three projects using the same pin to download data" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>It&rsquo;s easier (and safer) to share a pin across multiple projects or people than to email files around. Pins respect the access controls of the board. Say you’ve pinned to RStudio Connect: you can control who gets to use the pin, just like any other piece of content.</p>
<h2 id="pins-for-updating-and-versioning">Pins for Updating and Versioning
</h2>
<p>You may be wondering why you’d use pins if you already share a drive with your teammates. But what happens if you need to replace the dataset with a new one? Do you email everybody to let them know? Is it dataFINALv2.csv? Or dataFINALfinal.csv?</p>
<p>The pins package retrieves the newest version of the pin by default. That means pin users never have to worry about getting a stale version of the pin. If you need to update your pin regularly, a scheduled R Markdown on RStudio Connect can handle this task for you, so your pin stays fresh.</p>
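<p>A scheduled refresh can be as small as a single chunk. The sketch below uses a temporary board as a stand-in for a Connect board, and the &ldquo;fresh&rdquo; data is a placeholder for whatever your pipeline actually produces:</p>

```r
library(pins)

# Stand-in for a Connect board; on RStudio Connect this chunk
# would run inside a scheduled R Markdown document
board <- board_temp()

# Placeholder for a real data pull (database query, API call, etc.)
fresh_data <- data.frame(ts = Sys.time(), value = runif(1))

# Writing to the same name keeps the pin fresh for all readers
board %>% pin_write(fresh_data, name = "latest-data", type = "rds")
```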
<p>But you’re not locked into losing old versions of a pin. You can version pins so that writing to an existing pin adds a new copy rather than replacing the existing data.</p>
<p>Here&rsquo;s what versioning looks like using a temporary board:</p>
<pre tabindex="0"><code>library(pins)


board2 &lt;- board_temp(versioned = TRUE)

board2 %&gt;% pin_write(1:5, name = &#34;x&#34;, type = &#34;rds&#34;)
#&gt; Creating new version &#39;20210304T050607Z-ab444&#39;
#&gt; Writing to pin &#39;x&#39;

board2 %&gt;% pin_write(2:6, name = &#34;x&#34;, type = &#34;rds&#34;)
#&gt; Creating new version &#39;20210304T050607Z-a077a&#39;
#&gt; Writing to pin &#39;x&#39;

board2 %&gt;% pin_write(3:7, name = &#34;x&#34;, type = &#34;rds&#34;)
#&gt; Creating new version &#39;20210304T050607Z-0a284&#39;
#&gt; Writing to pin &#39;x&#39;

# see all versions
board2 %&gt;% pin_versions(&#34;x&#34;)
#&gt; # A tibble: 3 × 3
#&gt;   version                created             hash 
#&gt;   &lt;chr&gt;                  &lt;dttm&gt;              &lt;chr&gt;
#&gt; 1 20210304T050607Z-0a284 2021-03-04 05:06:00 0a284
#&gt; 2 20210304T050607Z-a077a 2021-03-04 05:06:00 a077a
#&gt; 3 20210304T050607Z-ab444 2021-03-04 05:06:00 ab444
</code></pre><h2 id="learn-more">Learn More
</h2>
<p>With pins, you and your teammates can know where your important data assets are, how to access them, and whether they are the correct version. You can work with confidence knowing you’re using the right asset, your work is reproducible, and you’re following good practices for data management.</p>
<p>There’s more to explore with pins. We’re excited to share how you can adopt them into your workflow.</p>
<p>Learn more about how and when to use pins:</p>
<ul>
<li><a href="https://pins.rstudio.com/" target = "_blank">The pins package documentation</a></li>
<li><a href="https://docs.rstudio.com/how-to-guides/users/pro-tips/pins/" target = "_blank">RStudio Pro Tips: Creating Efficient Workflows with <code>pins</code> and RStudio Connect</a></li>
</ul>
<p>See pins in action:</p>
<ul>
<li>Pins can pull intensive ETL processes out of your apps, improve performance, and save you the hassle of redeploying whenever the underlying data changes.
<ul>
<li>Watch: <a href="https://www.rstudio.com/resources/rstudioconf-2020/deploying-end-to-end-data-science-with-shiny-plumber-and-pins/" target = "_blank">Deploying End-To-End Data Science with Shiny, Plumber, and Pins</a></li>
</ul>
</li>
<li>Pins can play a key role in MLOps, such as publishing versioned models and monitoring model metrics.
<ul>
<li>Read: <a href="https://www.rstudio.com/blog/model-monitoring-with-r-markdown/" target = "_blank">Model Monitoring with R Markdown, pins, and RStudio Connect</a></li>
</ul>
</li>
</ul>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/rstudio/sharing-data-with-the-pins-package/thumbnail.png" length="392754" type="image/png" />
    </item>
    <item>
      <title>pins 1.0.0</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/pins-1-0-0/</link>
      <pubDate>Mon, 04 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/pins-1-0-0/</guid>
      <dc:creator>Hadley Wickham</dc:creator><description><![CDATA[<sup>
Photo by <a href="https://unsplash.com/@kelsoknight" target="_blank" rel="noopener noreferrer">Kelsey Knight</a> on <a href="https://unsplash.com/">Unsplash</a>
</sup>

<p>I’m delighted to announce that <a href="https://pins.rstudio.com">pins</a> 1.0.0 is now available on CRAN.
The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues.
You can pin objects to a variety of pin boards, including folders (to share on a networked drive or with services like Dropbox), RStudio Connect, Amazon S3, and Azure blob storage.
Pins can be versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes. Our users have found numerous ways to use this ability to fluently share and version data and other objects, such as <a href="https://pins.rstudio.com/dev/articles/rsc.html">automating ETL for a Shiny app</a>.</p>
<p>You can install pins with:</p>
<pre class="r"><code>install.packages(&quot;pins&quot;)</code></pre>
<p>pins 1.0.0 includes a major overhaul of the API.
The legacy API (<code>pin()</code>, <code>pin_get()</code>, <code>board_register()</code>, and friends) will continue to work, but new features will only be implemented with the new API, so we encourage you to switch to the modern API as quickly as possible.
If you’re an existing pins user, you can learn more about the changes and how to update your code in <a href="https://pins.rstudio.com/articles/pins-update.html"><code>vignette("pins-update")</code></a>.</p>
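<p>For a sense of the change, the modern API makes the board an explicit argument to every call. A minimal sketch using a temporary board, with the legacy equivalents shown as comments for comparison:</p>

```r
library(pins)

board <- board_temp()

# Modern API: the board comes first, then the object and its name
board %>% pin_write(head(mtcars), "mtcars-head", type = "rds")
board %>% pin_read("mtcars-head")

# Legacy equivalent (still works, but no longer gaining features):
# board_register_local()
# pin(head(mtcars), "mtcars-head")
# pin_get("mtcars-head")
```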
<div id="basics" class="level2">

<h2>Basics</h2>
<p>To use the pins package, you must first create a pin board.
A good place to start is <code>board_folder()</code>, which stores pins in a directory you specify.
Here I’ll use a special version of <code>board_folder()</code> called <code>board_temp()</code> which creates a temporary board that’s automatically deleted when your R session ends.
This is great for examples, but obviously you shouldn’t use it for real work!</p>
<pre class="r"><code>library(pins)

board &lt;- board_temp()
board
#&gt; Pin board &lt;pins_board_folder&gt;
#&gt; Path: &#39;/tmp/RtmpLu2Bkx/pins-114af466104ab&#39;
#&gt; Cache size: 0</code></pre>
<p>You can “pin” (save) data to a board with <code>pin_write()</code>.
It takes three arguments: the board to pin to, an object, and a name:</p>
<pre class="r"><code>board %&gt;% pin_write(head(mtcars), &quot;mtcars&quot;)
#&gt; Guessing `type = &#39;rds&#39;`
#&gt; Creating new version &#39;20211004T155644Z-f8797&#39;
#&gt; Writing to pin &#39;mtcars&#39;</code></pre>
<p>As you can see, the data is saved as an <code>.rds</code> file by default, but depending on what you’re saving and who else needs to read it, you might use the <code>type</code> argument to instead save it as a <code>csv</code>, <code>json</code>, <code>arrow</code>, or <code>qs</code> file.</p>
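<p>For example, writing a pin as CSV makes it readable from Python or other tools, at the cost of losing R-specific attributes. A quick sketch using a temporary board:</p>

```r
library(pins)

board <- board_temp()

# Explicitly choose CSV instead of the default rds
board %>% pin_write(head(mtcars), "mtcars-csv", type = "csv")
```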
<p>You can later retrieve the pinned data with <code>pin_read()</code>:</p>
<pre class="r"><code>board %&gt;% pin_read(&quot;mtcars&quot;)
#&gt;                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#&gt; Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#&gt; Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#&gt; Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#&gt; Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#&gt; Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#&gt; Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1</code></pre>
</div>
<div id="sharing-pins" class="level2">
<h2>Sharing pins</h2>
<p>A board on your computer is a good place to start, but the real power of pins comes when you use a board that’s shared with multiple people.
To get started, you can use <a href="https://pins.rstudio.com/reference/board_folder.html"><code>board_folder()</code></a> with a directory on a shared drive or in Dropbox, or, if you use <a href="https://www.rstudio.com/products/connect/">RStudio Connect</a>, <a href="https://pins.rstudio.com/reference/board_rsconnect.html"><code>board_rsconnect()</code></a>:</p>
<pre class="r"><code>board &lt;- board_rsconnect()
#&gt; Connecting to RSC 1.9.0.1 at &lt;https://connect.rstudioservices.com&gt;
board %&gt;% pin_write(tidy_sales_data, &quot;sales-summary&quot;, type = &quot;rds&quot;)
#&gt; Writing to pin &#39;hadley/sales-summary&#39;</code></pre>
<p>Then, someone else (or an automated Rmd report) can read and use your pin:</p>
<pre class="r"><code>board &lt;- board_rsconnect()
board %&gt;% pin_read(&quot;hadley/sales-summary&quot;)</code></pre>
<p>You can easily control who gets to access the data using the RStudio Connect permissions pane.</p>
</div>
<div id="other-boards" class="level2">
<h2>Other boards</h2>
<p>As well as <code>board_folder()</code> and <code>board_rsconnect()</code>, pins 1.0.0 provides:</p>
<ul>
<li><p><a href="https://pins.rstudio.com/reference/board_azure.html"><code>board_azure()</code></a>, which uses Azure’s blob storage.</p></li>
<li><p><a href="https://pins.rstudio.com/reference/board_s3.html"><code>board_s3()</code></a>, which uses Amazon’s S3 storage platform.</p></li>
<li><p><a href="https://pins.rstudio.com/reference/board_ms365.html"><code>board_ms365()</code></a>, which uses Microsoft’s OneDrive or SharePoint.
(Thanks to contribution from <a href="https://github.com/hongooi73">Hong Ooi</a>)</p></li>
</ul>
<p>Future versions of the pins package are likely to include other backends as we learn from our users what would be most useful.</p>
</div>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/rstudio/pins-1-0-0/thumbnail.jpg" length="179731" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr.sedona: A sparklyr extension for analyzing geospatial data</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/</link>
      <pubDate>Wed, 07 Jul 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://github.com/apache/incubator-sedona/tree/master/R/sparklyr.sedona" target="_blank" rel="noopener"><code>sparklyr.sedona</code></a>
 is now available
as the <code>sparklyr</code>-based R interface for <a href="https://sedona.apache.org/" target="_blank" rel="noopener">Apache Sedona</a>
.</p>
<p>To install <code>sparklyr.sedona</code> from GitHub using
the <a href="https://cran.r-project.org/web/packages/remotes/index.html" target="_blank" rel="noopener"><code>remotes</code></a>
 package
<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">remotes</span><span class="o">::</span><span class="nf">install_github</span><span class="p">(</span><span class="n">repo</span> <span class="o">=</span> <span class="s">&#34;apache/incubator-sedona&#34;</span><span class="p">,</span> <span class="n">subdir</span> <span class="o">=</span> <span class="s">&#34;R/sparklyr.sedona&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we will provide a quick introduction to <code>sparklyr.sedona</code>, outlining the motivation behind
this <code>sparklyr</code> extension, and presenting some example <code>sparklyr.sedona</code> use cases involving Spark spatial RDDs,
Spark dataframes, and visualizations.</p>
<h2 id="motivation-for-sparklyrsedona">Motivation for <code>sparklyr.sedona</code>
</h2>
<p>A suggestion from the
<a href="https://posit-open-source.netlify.app/blog/ai/2021-02-17-survey/">mlverse survey results</a>
 earlier
this year mentioned the need for up-to-date R interfaces for Spark-based GIS frameworks.
While looking into this suggestion, we learned about
<a href="https://sedona.apache.org/" target="_blank" rel="noopener">Apache Sedona</a>
, a geospatial data system powered by Spark
that is modern, efficient, and easy to use. We also realized that while our friends from the
Spark open-source community had developed a
<a href="https://github.com/harryprince/geospark" target="_blank" rel="noopener"><code>sparklyr</code> extension</a>
 for GeoSpark, the
predecessor of Apache Sedona, there was no similar extension making more recent Sedona
functionalities easily accessible from R yet.
We therefore decided to work on <code>sparklyr.sedona</code>, which aims to bridge the gap between
Sedona and R.</p>
<h2 id="the-lay-of-the-land">The lay of the land<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>
</h2>
<p>We hope you are ready for a quick tour through some of the RDD-based and
Spark-dataframe-based functionalities in <code>sparklyr.sedona</code>, and also, some bedazzling
visualizations derived from geospatial data in Spark.</p>
<p>In Apache Sedona,
<a href="https://sedona.apache.org/api/javadoc/core/org/apache/sedona/core/spatialRDD/SpatialRDD.html" target="_blank" rel="noopener">Spatial Resilient Distributed Datasets</a>
(SRDDs)
are basic building blocks of distributed spatial data encapsulating
&ldquo;vanilla&rdquo; <a href="https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaRDD.html" target="_blank" rel="noopener">RDD</a>
s of
geometrical objects and indexes. SRDDs support low-level operations such as Coordinate Reference System (CRS)
transformations, spatial partitioning, and spatial indexing. For example, with <code>sparklyr.sedona</code>, we can perform SRDD-based operations such as the following:</p>
<ul>
<li>Importing some external data source into an SRDD:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr.sedona</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sedona_git_repo</span> <span class="o">&lt;-</span> <span class="nf">normalizePath</span><span class="p">(</span><span class="s">&#34;~/incubator-sedona&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">data_dir</span> <span class="o">&lt;-</span> <span class="nf">file.path</span><span class="p">(</span><span class="n">sedona_git_repo</span><span class="p">,</span> <span class="s">&#34;core&#34;</span><span class="p">,</span> <span class="s">&#34;src&#34;</span><span class="p">,</span> <span class="s">&#34;test&#34;</span><span class="p">,</span> <span class="s">&#34;resources&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">pt_rdd</span> <span class="o">&lt;-</span> <span class="nf">sedona_read_dsv_to_typed_rdd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">location</span> <span class="o">=</span> <span class="nf">file.path</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s">&#34;arealm.csv&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;point&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Applying spatial partitioning to all data points:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sedona_apply_spatial_partitioner</span><span class="p">(</span><span class="n">pt_rdd</span><span class="p">,</span> <span class="n">partitioner</span> <span class="o">=</span> <span class="s">&#34;kdbtree&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Building spatial index on each partition:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sedona_build_index</span><span class="p">(</span><span class="n">pt_rdd</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;quadtree&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Joining one spatial data set with another using &ldquo;contain&rdquo; or &ldquo;overlap&rdquo; as the join predicate:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">polygon_rdd</span> <span class="o">&lt;-</span> <span class="nf">sedona_read_dsv_to_typed_rdd</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">location</span> <span class="o">=</span> <span class="nf">file.path</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="s">&#34;primaryroads-polygon.csv&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;polygon&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">pts_per_region_rdd</span> <span class="o">&lt;-</span> <span class="nf">sedona_spatial_join_count_by_key</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">pt_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">polygon_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">join_type</span> <span class="o">=</span> <span class="s">&#34;contain&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">partitioner</span> <span class="o">=</span> <span class="s">&#34;kdbtree&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>It is worth mentioning that <code>sedona_spatial_join()</code> will perform spatial partitioning
and indexing on its inputs using the specified <code>partitioner</code> and <code>index_type</code>
only if the inputs are not already partitioned or indexed as requested.</p>
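<p>When the same inputs feed multiple spatial queries, it can therefore pay off to partition
and index them once up front; a later join requesting the same <code>partitioner</code> can then
reuse that work instead of repeating it. Below is a minimal sketch (assuming the
<code>pt_rdd</code> and <code>polygon_rdd</code> objects from the examples above, with function
names as in the <code>sparklyr.sedona</code> API; treat it as illustrative rather than
definitive):</p>
<pre><code class="language-r"># partition the points once with a k-d B-tree partitioner,
# then index the partitioned result with a quadtree
sedona_apply_spatial_partitioner(pt_rdd, partitioner = "kdbtree")
sedona_build_index(pt_rdd, type = "quadtree")

# this join asks for the same partitioner, so the existing
# partitioning and index can be reused rather than rebuilt
pts_per_region_rdd &lt;- sedona_spatial_join_count_by_key(
  pt_rdd,
  polygon_rdd,
  join_type = "contain",
  partitioner = "kdbtree"
)
</code></pre>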
<p>From the examples above, one can see that SRDDs are great for spatial operations requiring
fine-grained control, e.g., for ensuring a spatial join query is executed as efficiently
as possible with the right types of spatial partitioning and indexing.</p>
<p>Finally, we can try visualizing the join result above, using a choropleth map:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sedona_render_choropleth_map</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">pts_per_region_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_x</span> <span class="o">=</span> <span class="m">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_y</span> <span class="o">=</span> <span class="m">600</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">output_location</span> <span class="o">=</span> <span class="nf">tempfile</span><span class="p">(</span><span class="s">&#34;choropleth-map-&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">boundary</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-126.790180</span><span class="p">,</span> <span class="m">-64.630926</span><span class="p">,</span> <span class="m">24.863836</span><span class="p">,</span> <span class="m">50.000</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">base_color</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">63</span><span class="p">,</span> <span class="m">127</span><span class="p">,</span> <span class="m">255</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>which gives us the following:</p>
<figure>
<img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/images/choropleth-map.png" alt="Example choropleth map output" />
<figcaption aria-hidden="true">Example choropleth map output</figcaption>
</figure>
<p>Wait, something seems amiss. To make the visualization above look nicer, we can
overlay it with the contour of each polygonal region:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">contours</span> <span class="o">&lt;-</span> <span class="nf">sedona_render_scatter_plot</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">polygon_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_x</span> <span class="o">=</span> <span class="m">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_y</span> <span class="o">=</span> <span class="m">600</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">output_location</span> <span class="o">=</span> <span class="nf">tempfile</span><span class="p">(</span><span class="s">&#34;scatter-plot-&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">boundary</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-126.790180</span><span class="p">,</span> <span class="m">-64.630926</span><span class="p">,</span> <span class="m">24.863836</span><span class="p">,</span> <span class="m">50.000</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">base_color</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">255</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">browse</span> <span class="o">=</span> <span class="kc">FALSE</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sedona_render_choropleth_map</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">pts_per_region_rdd</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_x</span> <span class="o">=</span> <span class="m">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">resolution_y</span> <span class="o">=</span> <span class="m">600</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">output_location</span> <span class="o">=</span> <span class="nf">tempfile</span><span class="p">(</span><span class="s">&#34;choropleth-map-&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">boundary</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-126.790180</span><span class="p">,</span> <span class="m">-64.630926</span><span class="p">,</span> <span class="m">24.863836</span><span class="p">,</span> <span class="m">50.000</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">base_color</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">63</span><span class="p">,</span> <span class="m">127</span><span class="p">,</span> <span class="m">255</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">overlay</span> <span class="o">=</span> <span class="n">contours</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>which gives us the following:</p>
<figure>
<img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/images/choropleth-map-with-overlay.png" alt="Choropleth map with overlay" />
<figcaption aria-hidden="true">Choropleth map with overlay</figcaption>
</figure>
<p>With some low-level spatial operations taken care of using the SRDD API and
the right spatial partitioning and indexing data structures, we can then
import the results from SRDDs to Spark dataframes. When working with spatial
objects within Spark dataframes, we can write high-level, declarative queries
on these objects using <code>dplyr</code> verbs in conjunction with Sedona
<a href="https://sedona.apache.org/api/sql/Function/" target="_blank" rel="noopener">spatial UDFs</a>
. For example
<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
, the following query returns, for each of the <code>8</code> polygons nearest to the
query point, whether that polygon contains the point, together with the
polygon&rsquo;s convex hull.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">&lt;-</span> <span class="n">DBI</span><span class="o">::</span><span class="nf">dbGetQuery</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span> <span class="s">&#34;SELECT ST_GeomFromText(\&#34;POINT(-66.3 18)\&#34;) AS `pt`&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">pt</span> <span class="o">&lt;-</span> <span class="n">tbl</span><span class="o">$</span><span class="n">pt[[1]]</span>
</span></span><span class="line"><span class="cl"><span class="n">knn_rdd</span> <span class="o">&lt;-</span> <span class="nf">sedona_knn_query</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">polygon_rdd</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">pt</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="m">8</span><span class="p">,</span> <span class="n">index_type</span> <span class="o">=</span> <span class="s">&#34;rtree&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">knn_sdf</span> <span class="o">&lt;-</span> <span class="n">knn_rdd</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_register</span><span class="p">()</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">contains_pt</span> <span class="o">=</span> <span class="nf">ST_contains</span><span class="p">(</span><span class="n">geometry</span><span class="p">,</span> <span class="nf">ST_Point</span><span class="p">(</span><span class="m">-66.3</span><span class="p">,</span> <span class="m">18</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="n">convex_hull</span> <span class="o">=</span> <span class="nf">ST_ConvexHull</span><span class="p">(</span><span class="n">geometry</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">knn_sdf</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 3]
  geometry                         contains_pt convex_hull
  &lt;list&gt;                           &lt;lgl&gt;       &lt;list&gt;
1 &lt;POLYGON ((-66.335674 17.986328… TRUE        &lt;POLYGON ((-66.335674 17.986328,…
2 &lt;POLYGON ((-66.335432 17.986626… TRUE        &lt;POLYGON ((-66.335432 17.986626,…
3 &lt;POLYGON ((-66.335432 17.986626… TRUE        &lt;POLYGON ((-66.335432 17.986626,…
4 &lt;POLYGON ((-66.335674 17.986328… TRUE        &lt;POLYGON ((-66.335674 17.986328,…
5 &lt;POLYGON ((-66.242489 17.988637… FALSE       &lt;POLYGON ((-66.242489 17.988637,…
6 &lt;POLYGON ((-66.242489 17.988637… FALSE       &lt;POLYGON ((-66.242489 17.988637,…
7 &lt;POLYGON ((-66.24221 17.988799,… FALSE       &lt;POLYGON ((-66.24221 17.988799, …
8 &lt;POLYGON ((-66.24221 17.988799,… FALSE       &lt;POLYGON ((-66.24221 17.988799, …
</code></pre>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>The author of this blog post would like to thank <a href="https://github.com/jiayuasu" target="_blank" rel="noopener">Jia Yu</a>
,
the creator of Apache Sedona, and <a href="https://github.com/lorenzwalthert" target="_blank" rel="noopener">Lorenz Walthert</a>
 for
their suggestion to contribute <code>sparklyr.sedona</code> to the upstream
<a href="https://github.com/apache/incubator-sedona" target="_blank" rel="noopener">incubator-sedona</a>
 repository. Jia has provided
extensive code-review feedback to ensure <code>sparklyr.sedona</code> complies with coding standards
and best practices of the Apache Sedona project, and has also been very helpful in the
instrumentation of CI workflows verifying <code>sparklyr.sedona</code> works as expected with snapshot
versions of Sedona libraries from development branches.</p>
<p>The author is also grateful for his colleague <a href="https://github.com/skeydan" target="_blank" rel="noopener">Sigrid Keydana</a>

for valuable editorial suggestions on this blog post.</p>
<p>That&rsquo;s all. Thank you for reading!</p>
<p>Photo by <a href="https://unsplash.com/@nasa" target="_blank" rel="noopener">NASA</a>
 on <a href="https://unsplash.com/" target="_blank" rel="noopener">Unsplash</a>
</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><code>sparklyr.sedona</code> had not yet been released to CRAN at the time of writing.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Yes, pun intended&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>This demo requires sparklyr 1.7 or above to generate the required Spark SQL type casts for <code>ST_Point()</code> automatically.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-sedona/thumbnail.jpg" length="374380" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.7: New data sources and spark_apply() capabilities, better interfaces for sparklyr extensions, and more!</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.7/</link>
      <pubDate>Tue, 06 Jul 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.7/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>Sparklyr</code></a>
 1.7 is now available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
!</p>
<p>To install <code>sparklyr</code> 1.7 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we wish to present the following highlights from the <code>sparklyr</code> 1.7 release:</p>
<ul>
<li><a href="#image-and-binary-data-sources">Image and binary data sources</a>
</li>
<li><a href="#new-spark_apply-capabilities">New spark_apply() capabilities</a>
</li>
<li><a href="#better-integration-with-sparklyr-extensions">Better integration with sparklyr extensions</a>
</li>
<li><a href="#other-exciting-news">Other exciting news</a>
</li>
</ul>
<h2 id="image-and-binary-data-sources">Image and binary data sources
</h2>
<p>As a unified analytics engine for large-scale data processing, <a href="https://spark.apache.org" target="_blank" rel="noopener">Apache Spark</a>

is well-known for its ability to tackle challenges associated with the volume, velocity, and last but
not least, the variety of big data. Therefore it is hardly surprising to see that &ndash; in response to recent
advances in deep learning frameworks &ndash; Apache Spark has introduced built-in support for
<a href="https://issues.apache.org/jira/browse/SPARK-22666" target="_blank" rel="noopener">image data sources</a>

and <a href="https://issues.apache.org/jira/browse/SPARK-25348" target="_blank" rel="noopener">binary data sources</a>
 (in releases 2.4 and 3.0, respectively).
The corresponding R interfaces for both data sources, namely,
<a href="https://spark.rstudio.com/reference/spark_read_image.html" target="_blank" rel="noopener"><code>spark_read_image()</code></a>
 and
<a href="https://spark.rstudio.com/reference/spark_read_binary.html" target="_blank" rel="noopener"><code>spark_read_binary()</code></a>
, were shipped
recently as part of <code>sparklyr</code> 1.7.</p>
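<p>The binary counterpart works analogously: <code>spark_read_binary()</code> loads each file as
one row containing the file&rsquo;s path, modification time, length, and raw content, which makes
it convenient for feeding arbitrary file formats into downstream ML transformers. A minimal
sketch follows (the directory path is a placeholder, and since binary data sources arrived in
Spark 3.0, the connection must target a Spark 3.x cluster):</p>
<pre><code class="language-r">library(sparklyr)

sc &lt;- spark_connect(master = "local", version = "3.0.0")

# each matching file becomes one row:
# (path, modificationTime, length, content)
binary_sdf &lt;- spark_read_binary(
  sc,
  name = "image_files",
  dir = "/tmp/images",
  path_glob_filter = "*.png"
)
</code></pre>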
<p>The usefulness of data source functionalities such as <code>spark_read_image()</code> is perhaps best illustrated
by a quick demo below, where <code>spark_read_image()</code>, through the standard Apache Spark
<a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/image/ImageSchema.html" target="_blank" rel="noopener"><code>ImageSchema</code></a>
,
helps connect raw image inputs to a sophisticated feature extractor and a classifier, forming a powerful
Spark application for image classifications.</p>
<h3 id="the-demo">The demo
</h3>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.7/images/photo-1571324524859-899fbd151860.jpeg"
      alt="" 
      loading="lazy"
    >
  </figure></div>

Photo by <a href="https://unsplash.com/@danieltuttle" target="_blank" rel="noopener">Daniel Tuttle</a>
 on
<a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" target="_blank" rel="noopener">Unsplash</a>
</p>
<p>In this demo, we shall construct a scalable Spark ML pipeline capable of classifying images of cats and dogs
accurately and efficiently, using <code>spark_read_image()</code> and a pre-trained convolutional neural network
code-named <code>Inception</code> (Szegedy et al. (2015)).</p>
<p>The first step to building such a demo with maximum portability and repeatability is to create a
<a href="https://spark.rstudio.com/extensions/" target="_blank" rel="noopener">sparklyr extension</a>
 that accomplishes the following:</p>
<ul>
<li>Specifying the required Maven dependencies of this demo (namely, the
<a href="https://spark-packages.org/package/databricks/spark-deep-learning" target="_blank" rel="noopener">Spark Deep Learning library</a>

(Databricks, Inc. (2019)), which contains an <code>Inception</code>-V3-based image feature extractor accessible through
the <a href="https://spark.apache.org/docs/latest/ml-pipeline.html#transformers" target="_blank" rel="noopener">Spark ML Transformer interface</a>
)</li>
<li>Bundling with itself two <a href="https://xkcd.com/221" target="_blank" rel="noopener">randomly selected</a>

<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> and disjoint subsets of the
dogs-vs-cats dataset (Elson et al. (2007)) as train and test data, stored in the <code>extdata/{train,test}</code>
subdirectories of the package)</li>
</ul>
<p>A reference implementation of such a <code>sparklyr</code> extension can be found
<a href="https://github.com/mlverse/sparklyr-image-classification-demo" target="_blank" rel="noopener">here</a>.</p>
<p>The second step, of course, is to make use of the above-mentioned <code>sparklyr</code> extension to perform some feature
engineering. We will see very high-level features being extracted intelligently from each cat/dog image based
on what the pre-built <code>Inception</code>-V3 convolutional neural network has already learned from classifying a much
broader collection of images:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr.deeperer</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># NOTE: the correct spark_home path to use depends on the configuration of the</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Spark cluster you are working with.</span>
</span></span><span class="line"><span class="cl"><span class="n">spark_home</span> <span class="o">&lt;-</span> <span class="s">&#34;/usr/lib/spark&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;yarn&#34;</span><span class="p">,</span> <span class="n">spark_home</span> <span class="o">=</span> <span class="n">spark_home</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">data_dir</span> <span class="o">&lt;-</span> <span class="nf">copy_images_to_hdfs</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># extract features from train- and test-data</span>
</span></span><span class="line"><span class="cl"><span class="n">image_data</span> <span class="o">&lt;-</span> <span class="nf">list</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="kr">for</span> <span class="p">(</span><span class="n">x</span> <span class="kr">in</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;train&#34;</span><span class="p">,</span> <span class="s">&#34;test&#34;</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># import</span>
</span></span><span class="line"><span class="cl">  <span class="n">image_data[[x]]</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;dogs&#34;</span><span class="p">,</span> <span class="s">&#34;cats&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">lapply</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="kr">function</span><span class="p">(</span><span class="n">label</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">numeric_label</span> <span class="o">&lt;-</span> <span class="nf">ifelse</span><span class="p">(</span><span class="nf">identical</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="s">&#34;dogs&#34;</span><span class="p">),</span> <span class="m">1L</span><span class="p">,</span> <span class="m">0L</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="nf">spark_read_image</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">          <span class="n">sc</span><span class="p">,</span> <span class="n">dir</span> <span class="o">=</span> <span class="nf">file.path</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">label</span><span class="p">,</span> <span class="n">fsep</span> <span class="o">=</span> <span class="s">&#34;/&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">          <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">label</span> <span class="o">=</span> <span class="n">numeric_label</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">      <span class="nf">do.call</span><span class="p">(</span><span class="n">sdf_bind_rows</span><span class="p">,</span> <span class="n">.)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">dl_featurizer</span> <span class="o">&lt;-</span> <span class="nf">invoke_new</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s">&#34;com.databricks.sparkdl.DeepImageFeaturizer&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nf">random_string</span><span class="p">(</span><span class="s">&#34;dl_featurizer&#34;</span><span class="p">)</span> <span class="c1"># uid</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">invoke</span><span class="p">(</span><span class="s">&#34;setModelName&#34;</span><span class="p">,</span> <span class="s">&#34;InceptionV3&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">invoke</span><span class="p">(</span><span class="s">&#34;setInputCol&#34;</span><span class="p">,</span> <span class="s">&#34;image&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">invoke</span><span class="p">(</span><span class="s">&#34;setOutputCol&#34;</span><span class="p">,</span> <span class="s">&#34;features&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">image_data[[x]]</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">    <span class="n">dl_featurizer</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">invoke</span><span class="p">(</span><span class="s">&#34;transform&#34;</span><span class="p">,</span> <span class="nf">spark_dataframe</span><span class="p">(</span><span class="n">image_data[[x]]</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">sdf_register</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Third step: equipped with features that summarize the content of each image well, we can
build a Spark ML pipeline that recognizes cats and dogs using only logistic regression
<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">label_col</span> <span class="o">&lt;-</span> <span class="s">&#34;label&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">prediction_col</span> <span class="o">&lt;-</span> <span class="s">&#34;prediction&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">pipeline</span> <span class="o">&lt;-</span> <span class="nf">ml_pipeline</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ml_logistic_regression</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">features_col</span> <span class="o">=</span> <span class="s">&#34;features&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">label_col</span> <span class="o">=</span> <span class="n">label_col</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">prediction_col</span> <span class="o">=</span> <span class="n">prediction_col</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">&lt;-</span> <span class="n">pipeline</span> <span class="o">%&gt;%</span> <span class="nf">ml_fit</span><span class="p">(</span><span class="n">image_data</span><span class="o">$</span><span class="n">train</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Finally, we can evaluate the accuracy of this model on the test images:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">predictions</span> <span class="o">&lt;-</span> <span class="n">model</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ml_transform</span><span class="p">(</span><span class="n">image_data</span><span class="o">$</span><span class="n">test</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">compute</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">cat</span><span class="p">(</span><span class="s">&#34;Predictions vs. labels:\n&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">predictions</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="o">!!</span><span class="n">label_col</span><span class="p">,</span> <span class="o">!!</span><span class="n">prediction_col</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="nf">sdf_nrow</span><span class="p">(</span><span class="n">predictions</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">cat</span><span class="p">(</span><span class="s">&#34;\nAccuracy of predictions:\n&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">predictions</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ml_multiclass_classification_evaluator</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">label_col</span> <span class="o">=</span> <span class="n">label_col</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">prediction_col</span> <span class="o">=</span> <span class="n">prediction_col</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">metric_name</span> <span class="o">=</span> <span class="s">&#34;accuracy&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## Predictions vs. labels:
## # Source: spark&lt;?&gt; [?? x 2]
##    label prediction
##    &lt;int&gt;      &lt;dbl&gt;
##  1     1          1
##  2     1          1
##  3     1          1
##  4     1          1
##  5     1          1
##  6     1          1
##  7     1          1
##  8     1          1
##  9     1          1
## 10     1          1
## 11     0          0
## 12     0          0
## 13     0          0
## 14     0          0
## 15     0          0
## 16     0          0
## 17     0          0
## 18     0          0
## 19     0          0
## 20     0          0
##
## Accuracy of predictions:
## [1] 1
</code></pre>
<h2 id="new-spark_apply-capabilities">New <code>spark_apply()</code> capabilities
</h2>
<h3 id="optimizations--custom-serializers">Optimizations &amp; custom serializers
</h3>
<p>Many <code>sparklyr</code> users who have tried to run
<a href="https://spark.rstudio.com/reference/spark_apply.html" target="_blank" rel="noopener"><code>spark_apply()</code></a>
 or
<a href="https://blog.rstudio.com/2020/05/06/sparklyr-1-2/#foreach" target="_blank" rel="noopener"><code>doSpark</code></a>
 to
parallelize R computations among Spark workers have probably encountered some
challenges arising from the serialization of R closures.
In some scenarios, the
serialized size of the R closure can become too large, often due to the size
of the enclosing R environment required by the closure. In other
scenarios, the serialization itself may take too much time, partially offsetting
the performance gain from parallelization. Recently, multiple optimizations went
into <code>sparklyr</code> to address those challenges. One of the optimizations was to
make good use of the
<a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables" target="_blank" rel="noopener">broadcast variable</a>

construct in Apache Spark to reduce the overhead of distributing shared and
immutable task states across all Spark workers. In <code>sparklyr</code> 1.7, there is
also support for custom <code>spark_apply()</code> serializers, which offers more fine-grained
control over the trade-off between speed and compression level of serialization
algorithms. For example, one can specify</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">options</span><span class="p">(</span><span class="n">sparklyr.spark_apply.serializer</span> <span class="o">=</span> <span class="s">&#34;qs&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>which will apply the default options of <code>qs::qserialize()</code> to achieve a high
compression level, or</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">options</span><span class="p">(</span><span class="n">sparklyr.spark_apply.serializer</span> <span class="o">=</span> <span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">qs</span><span class="o">::</span><span class="nf">qserialize</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">preset</span> <span class="o">=</span> <span class="s">&#34;fast&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="nf">options</span><span class="p">(</span><span class="n">sparklyr.spark_apply.deserializer</span> <span class="o">=</span> <span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">qs</span><span class="o">::</span><span class="nf">qdeserialize</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>which will aim for faster serialization speed with less compression.</p>
<h3 id="inferring-dependencies-automatically">Inferring dependencies automatically
</h3>
<p>In <code>sparklyr</code> 1.7, <code>spark_apply()</code> also provides the experimental
<code>auto_deps = TRUE</code> option. With <code>auto_deps</code> enabled, <code>spark_apply()</code> will
examine the R closure being applied, infer the list of required R packages,
and only copy the required R packages and their transitive dependencies
to Spark workers. In many scenarios, the <code>auto_deps = TRUE</code> option is a
significantly better alternative to the default <code>packages = TRUE</code>
behavior, which is to ship everything within <code>.libPaths()</code> to Spark worker
nodes, or the advanced <code>packages = &lt;package config&gt;</code> option, which requires
users to supply the list of required R packages or manually create a
<code>spark_apply()</code> bundle.</p>
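<p>As a minimal sketch (assuming an existing Spark connection and a Spark data
frame <code>sdf</code> with a numeric column <code>x</code>; the closure and names here are purely
illustrative), enabling the option could look like:</p>
<pre tabindex="0"><code class="language-r"># ship only the packages the closure actually needs (here, dplyr and its
# transitive dependencies), instead of everything in .libPaths()
result &lt;- spark_apply(
  sdf,
  function(df) dplyr::mutate(df, y = x * 2),
  auto_deps = TRUE
)
</code></pre>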
<h2 id="better-integration-with-sparklyr-extensions">Better integration with sparklyr extensions
</h2>
<p>Substantial effort went into <code>sparklyr</code> 1.7 to make life easier for <code>sparklyr</code>
extension authors. Experience suggests that two areas of integration with
<code>sparklyr</code> have been a common source of friction for extensions:</p>
<ul>
<li>The <a href="https://github.com/sparklyr/sparklyr/blob/1242adb632c881f0a8dd234898af84a76614f590/R/dplyr_spark_connection.R#L184" target="_blank" rel="noopener"><code>dbplyr</code> SQL translation environment</a>
</li>
<li><a href="https://spark.rstudio.com/extensions/#calling-spark-from-r" target="_blank" rel="noopener">Invocation of Java/Scala functions from R</a>
</li>
</ul>
<p>We will elaborate on recent progress in both areas in the sub-sections below.</p>
<h3 id="customizing-the-dbplyr-sql-translation-environment">Customizing the <code>dbplyr</code> SQL translation environment
</h3>
<p><code>sparklyr</code> extensions can now customize <code>sparklyr</code>&rsquo;s <code>dbplyr</code> SQL translations
through the
<a href="https://spark.rstudio.com/reference/spark_dependency.html" target="_blank" rel="noopener"><code>spark_dependency()</code></a>

specification returned from <code>spark_dependencies()</code> callbacks.
This type of flexibility becomes useful, for instance, in scenarios where a
<code>sparklyr</code> extension needs to insert type casts for inputs to custom Spark
UDFs. We can find a concrete example of this in
<a href="https://github.com/apache/incubator-sedona/tree/master/R/sparklyr.sedona#sparklyrsedona" target="_blank" rel="noopener"><code>sparklyr.sedona</code></a>
,
a <code>sparklyr</code> extension to facilitate geo-spatial analyses using
<a href="https://sedona.apache.org/" target="_blank" rel="noopener">Apache Sedona</a>
. Geo-spatial UDFs supported by Apache
Sedona such as <code>ST_Point()</code> and <code>ST_PolygonFromEnvelope()</code> require all inputs to be
<code>DECIMAL(24, 20)</code> quantities rather than <code>DOUBLE</code>s. Without any customization to
<code>sparklyr</code>&rsquo;s <code>dbplyr</code> SQL variant, the only way for a <code>dplyr</code>
query involving <code>ST_Point()</code> to actually work in <code>sparklyr</code> would be to explicitly
implement any type cast needed by the query using <code>dplyr::sql()</code>, e.g.,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">my_geospatial_sdf</span> <span class="o">&lt;-</span> <span class="n">my_geospatial_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">x</span> <span class="o">=</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">sql</span><span class="p">(</span><span class="s">&#34;CAST(`x` AS DECIMAL(24, 20))&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">y</span> <span class="o">=</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">sql</span><span class="p">(</span><span class="s">&#34;CAST(`y` AS DECIMAL(24, 20))&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">pt</span> <span class="o">=</span> <span class="nf">ST_Point</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div>
<p>This would, to some extent, be antithetical to <code>dplyr</code>&rsquo;s goal of freeing R users from
laboriously spelling out SQL queries. By customizing <code>sparklyr</code>&rsquo;s <code>dbplyr</code> SQL
translations (as implemented
<a href="https://github.com/apache/incubator-sedona/blob/d8c2aae0678b7262660bda68eb0a2048b849e438/R/sparklyr.sedona/R/dependencies.R#L55" target="_blank" rel="noopener">here</a>

and
<a href="https://github.com/apache/incubator-sedona/blob/d8c2aae0678b7262660bda68eb0a2048b849e438/R/sparklyr.sedona/R/dependencies.R#L135" target="_blank" rel="noopener">here</a>

), <code>sparklyr.sedona</code> allows users to simply write</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">my_geospatial_sdf</span> <span class="o">&lt;-</span> <span class="n">my_geospatial_sdf</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">pt</span> <span class="o">=</span> <span class="nf">ST_Point</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>instead, and the required Spark SQL type casts are generated automatically.</p>
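<p>Schematically, such a customization lives in an extension&rsquo;s
<code>spark_dependencies()</code> callback. The sketch below is illustrative only: the
exact shape of the translation specification (the <code>dbplyr_sql_variant</code>
argument and the way each scalar function emits its casts) is an assumption
here and should be checked against the <code>spark_dependency()</code> reference and the
<code>sparklyr.sedona</code> sources linked above.</p>
<pre tabindex="0"><code class="language-r">spark_dependencies &lt;- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    # ... jars, packages, etc. ...
    # hypothetical structure: map each UDF name to a SQL-generating function
    dbplyr_sql_variant = list(
      scalar = list(
        ST_Point = function(x, y) {
          dbplyr::sql(paste0(
            &#34;ST_Point(CAST(&#34;, x, &#34; AS DECIMAL(24, 20)), &#34;,
            &#34;CAST(&#34;, y, &#34; AS DECIMAL(24, 20)))&#34;
          ))
        }
      )
    )
  )
}
</code></pre>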
<h3 id="improved-interface-for-invoking-javascala-functions">Improved interface for invoking Java/Scala functions
</h3>
<p>In <code>sparklyr</code> 1.7, the R interface for Java/Scala invocations saw a number of
improvements.</p>
<p>With previous versions of <code>sparklyr</code>, many <code>sparklyr</code> extension authors would
run into trouble when attempting to invoke Java/Scala functions accepting an
<code>Array[T]</code> as one of their parameters, where <code>T</code> is any type bound more specific
than <code>java.lang.Object</code> / <code>AnyRef</code>. This was because, in the absence of
additional type information, any array of objects passed through <code>sparklyr</code>&rsquo;s
Java/Scala invocation interface would be interpreted as simply an array of
<code>java.lang.Object</code>s.
For this reason, a helper function
<a href="https://spark.rstudio.com/reference/jarray.html" target="_blank" rel="noopener"><code>jarray()</code></a>
 was implemented as
part of <code>sparklyr</code> 1.7 as a way to overcome the aforementioned problem.
For example, executing</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="kc">...</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">arr</span> <span class="o">&lt;-</span> <span class="nf">jarray</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nf">seq</span><span class="p">(</span><span class="m">5</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">lapply</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="nf">invoke_new</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;MyClass&#34;</span><span class="p">,</span> <span class="n">x</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">  <span class="n">element_type</span> <span class="o">=</span> <span class="s">&#34;MyClass&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>will assign to <code>arr</code> a <em>reference</em> to an <code>Array[MyClass]</code> of length 5, rather
than an <code>Array[AnyRef]</code>. <code>arr</code> can then be passed as a
parameter to functions accepting only <code>Array[MyClass]</code>s as inputs. Previously,
some possible workarounds of this <code>sparklyr</code> limitation included changing
function signatures to accept <code>Array[AnyRef]</code>s instead of <code>Array[MyClass]</code>s, or
implementing a &ldquo;wrapped&rdquo; version of each function accepting <code>Array[AnyRef]</code>
inputs and converting them to <code>Array[MyClass]</code> before the actual invocation.
None of these workarounds was an ideal solution to the problem.</p>
<p>A similar hurdle addressed in <code>sparklyr</code> 1.7 involves
function parameters that must be single-precision floating point numbers or
arrays of single-precision floating point numbers.
For those scenarios,
<a href="https://spark.rstudio.com/reference/jfloat.html" target="_blank" rel="noopener"><code>jfloat()</code></a>
 and
<a href="https://spark.rstudio.com/reference/jfloat_array.html" target="_blank" rel="noopener"><code>jfloat_array()</code></a>

are the helper functions that allow numeric quantities in R to be passed to
<code>sparklyr</code>&rsquo;s Java/Scala invocation interface as parameters with desired types.</p>
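<p>For example (a sketch: <code>sc</code> is an existing <code>spark_connection</code>, and the object
and method names below are hypothetical):</p>
<pre tabindex="0"><code class="language-r">x &lt;- jfloat(sc, 1.23)                  # a single-precision Float, not a Double
xs &lt;- jfloat_array(sc, c(1.23, 4.56))  # an Array[Float]

# both can then be passed to methods expecting single-precision parameters:
obj %&gt;% invoke(&#34;setThreshold&#34;, x)
</code></pre>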
<p>In addition, while previous versions of <code>sparklyr</code> failed to serialize
parameters with <code>NaN</code> values correctly, <code>sparklyr</code> 1.7 preserves <code>NaN</code>s as
expected in its Java/Scala invocation interface.</p>
<h2 id="other-exciting-news">Other exciting news
</h2>
<p>There are numerous other new features, enhancements, and bug fixes made to
<code>sparklyr</code> 1.7, all listed in the
<a href="https://github.com/sparklyr/sparklyr/blob/main/NEWS.md#sparklyr-170" target="_blank" rel="noopener">NEWS.md</a>

file of the <code>sparklyr</code> repo and documented in <code>sparklyr</code>&rsquo;s
<a href="https://spark.rstudio.com/reference/" target="_blank" rel="noopener">HTML reference</a>
 pages.
In the interest of brevity, we will not describe all of them in great detail
within this blog post.</p>
<h2 id="acknowledgement">Acknowledgement
</h2>
<p>In chronological order, we would like to thank the following individuals who
have authored or co-authored pull requests that were part of the <code>sparklyr</code> 1.7
release:</p>
<ul>
<li><a href="https://github.com/yitao-li" target="_blank" rel="noopener">@yitao-li</a>
</li>
<li><a href="https://github.com/mzorko" target="_blank" rel="noopener">@mzorko</a>
</li>
<li><a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">@jozefhajnala</a>
</li>
<li><a href="https://github.com/lresende" target="_blank" rel="noopener">@lresende</a>
</li>
</ul>
<p>We&rsquo;re also extremely grateful to everyone who has submitted
feature requests or bug reports, many of which have been tremendously helpful in
shaping <code>sparklyr</code> into what it is today.</p>
<p>Furthermore, the author of this blog post is indebted to
<a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>
 for her awesome editorial suggestions.
Without her insights about good writing and story-telling, expositions like this
one would have been less readable.</p>
<p>If you wish to learn more about <code>sparklyr</code>, we recommend visiting
<a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
,
and also reading some previous <code>sparklyr</code> release posts such as
<a href="https://posit-open-source.netlify.app/blog/ai/2021-03-25-sparklyr-1.6.0-released/">sparklyr 1.6</a>

and
<a href="https://posit-open-source.netlify.app/blog/ai/2020-12-14-sparklyr-1.5.0-released/">sparklyr 1.5</a>
.</p>
<p>That is all. Thanks for reading!</p>
<p>Databricks, Inc. 2019. <em>Deep Learning Pipelines for Apache Spark</em>. V. 1.5.0. Released January 25. <a href="https://spark-packages.org/package/databricks/spark-deep-learning" target="_blank" rel="noopener">https://spark-packages.org/package/databricks/spark-deep-learning</a>
.</p>
<p>Elson, Jeremy, John (JD) Douceur, Jon Howell, and Jared Saul. 2007. &ldquo;Asirra: A CAPTCHA That Exploits Interest-Aligned Manual Image Categorization.&rdquo; <em>Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)</em>, Proceedings of 14th ACM Conference on Computer and Communications Security (CCS) Editions. <a href="https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/" target="_blank" rel="noopener">https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/</a>
.</p>
<p>Szegedy, Christian, Wei Liu, Yangqing Jia, et al. 2015. &ldquo;Going Deeper with Convolutions.&rdquo; <em>Computer Vision and Pattern Recognition (CVPR)</em>. <a href="http://arxiv.org/abs/1409.4842" target="_blank" rel="noopener">http://arxiv.org/abs/1409.4842</a>
.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Fun exercise for our readers: why not experiment with different subsets of cats-vs-dogs images for training
and testing, or even better, replace train and test images with your own images of cats and dogs, and see what
happens?&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Another way to see why it works: in fact the pre-built <code>Inception</code>-based feature
extractor simply applies all transformations <code>Inception</code> would have applied to its input,
except for the last logistic-regression-esque affine transformation plus non-linearity
producing the final categorical output, and <code>Inception</code> is a highly successful
convolutional neural network trained to recognize 1000 categories of animals and objects,
including multiple types of cats and dogs.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.7/thumbnail.png" length="7251" type="image/png" />
    </item>
    <item>
      <title>sparklyr 1.6: weighted quantile summaries, power iteration clustering, spark_write_rds(), and more</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.6/</link>
      <pubDate>Thu, 25 Mar 2021 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.6/</guid>
<dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 1.6 is now available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
!</p>
<p>To install <code>sparklyr</code> 1.6 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we shall highlight the following features and enhancements
from <code>sparklyr</code> 1.6:</p>
<ul>
<li><a href="#weighted-quantile-summaries">Weighted quantile summaries</a>
</li>
<li><a href="#power-iteration-clustering">Power iteration clustering</a>
</li>
<li><a href="#spark_write_rds-collect_from_rds"><code>spark_write_rds()</code> + <code>collect_from_rds()</code></a>
</li>
<li><a href="#dplyr-related-improvements">Dplyr-related improvements</a>
</li>
</ul>
<h2 id="weighted-quantile-summaries">Weighted quantile summaries
</h2>
<p><a href="https://spark.apache.org" target="_blank" rel="noopener">Apache Spark</a>
 is well-known for supporting
approximate algorithms that trade off marginal amounts of accuracy for greater
speed and parallelism.
Such algorithms are particularly beneficial for performing preliminary data
explorations at scale, as they enable users to quickly query certain estimated
statistics within a predefined error margin, while avoiding the high cost of
exact computations.
One example is the Greenwald-Khanna algorithm for on-line computation of quantile
summaries, as described in Greenwald and Khanna (2001).
This algorithm was originally designed for efficient
$\epsilon$-approximation of quantiles within a large dataset <em>without</em> the notion of data
points carrying different weights, and the unweighted version of it has been
implemented as
<a href="https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameStatFunctions.html#approxQuantile%28java.lang.String,%20double%5B%5D,%20double%29" target="_blank" rel="noopener"><code>approxQuantile()</code></a>

since Spark 2.0.
However, the same algorithm can be generalized to handle weighted
inputs, and as <code>sparklyr</code> user <a href="https://github.com/Zhuk66" target="_blank" rel="noopener">@Zhuk66</a>
 mentioned
in <a href="https://github.com/sparklyr/sparklyr/issues/2915" target="_blank" rel="noopener">this issue</a>
, a
<a href="https://github.com/sparklyr/sparklyr/blob/4b6bc6677ecf92787ab3521f364a8d80b973d92f/java/spark-1.5.2/weightedquantilesummaries.scala#L13-L332" target="_blank" rel="noopener">weighted version</a>

of this algorithm makes for a useful <code>sparklyr</code> feature.</p>
<p>To explain what a weighted quantile means, we must first clarify what the
weight of each data point signifies. For example, if we have a sequence of
observations $(1, 1, 1, 1, 0, 2, -1, -1)$, and would like to approximate the
median of all data points, then we have the following two options:</p>
<ul>
<li>
<p>Either run the unweighted version of <code>approxQuantile()</code> in Spark to scan
through all 8 data points</p>
</li>
<li>
<p>Or alternatively, &ldquo;compress&rdquo; the data into 4 tuples of (value, weight):
$(1, 0.5), (0, 0.125), (2, 0.125), (-1, 0.25)$, where the second component of
each tuple represents how often a value occurs relative to the rest of the
observed values, and then find the median by scanning through the 4 tuples
using the weighted version of the Greenwald-Khanna algorithm</p>
</li>
</ul>
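<p>The &ldquo;compression&rdquo; step above amounts to computing relative frequencies, which
we can sanity-check with a few lines of plain R:</p>
<pre tabindex="0"><code class="language-r">obs &lt;- c(1, 1, 1, 1, 0, 2, -1, -1)
# weight of each distinct value = its relative frequency among all observations
weights &lt;- table(obs) / length(obs)
weights[[&#34;1&#34;]]   # 0.5
weights[[&#34;-1&#34;]]  # 0.25
</code></pre>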
<p>We can also run through a contrived example involving the standard normal
distribution to illustrate the power of weighted quantile estimation in
<code>sparklyr</code> 1.6. Suppose we cannot simply run <code>qnorm()</code> in R to evaluate the
<a href="https://en.wikipedia.org/wiki/Normal_distribution#Quantile_function" target="_blank" rel="noopener">quantile function</a>

of the standard normal distribution at $p = 0.25$ and $p = 0.75$, how can
we get a rough idea of the 1st and 3rd quartiles of this distribution?
One way is to sample a large number of data points from this distribution, and
then apply the Greenwald-Khanna algorithm to our unweighted samples, as shown
below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">num_samples</span> <span class="o">&lt;-</span> <span class="m">1e6</span>
</span></span><span class="line"><span class="cl"><span class="n">samples</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">rnorm</span><span class="p">(</span><span class="n">num_samples</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">samples</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="nf">random_string</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_quantile</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">column</span> <span class="o">=</span> <span class="s">&#34;x&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">probabilities</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span> <span class="m">0.75</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">relative.error</span> <span class="o">=</span> <span class="m">0.01</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##        25%        75%
## -0.6629242  0.6874939
</code></pre>
<p>Notice that because we are working with an approximate algorithm, and have specified
<code>relative.error = 0.01</code>, the estimated value of $-0.6629242$ from above
could be anywhere between the 24th and the 26th percentile of all samples.
In fact, it falls in the $25.36896$-th percentile:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pnorm</span><span class="p">(</span><span class="m">-0.6629242</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## [1] 0.2536896
</code></pre>
<p>Now how can we make use of weighted quantile estimation from <code>sparklyr</code> 1.6 to
obtain similar results? Simple! We can sample a large number of $x$ values
uniformly at random from $(-\infty, \infty)$ (or, alternatively, just select a
large number of values evenly spaced within $(-M, M)$, where $M$ is
approximately $\infty$), and assign each $x$ value a weight of
$\displaystyle \frac{1}{\sqrt{2 \pi}}e^{-\frac{x^2}{2}}$, the standard normal
distribution&rsquo;s probability density at $x$. Finally, we run the weighted version
of <code>sdf_quantile()</code> from <code>sparklyr</code> 1.6, as shown below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">num_samples</span> <span class="o">&lt;-</span> <span class="m">1e6</span>
</span></span><span class="line"><span class="cl"><span class="n">M</span> <span class="o">&lt;-</span> <span class="m">1000</span>
</span></span><span class="line"><span class="cl"><span class="n">samples</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">x</span> <span class="o">=</span> <span class="n">M</span> <span class="o">*</span> <span class="nf">seq</span><span class="p">(</span><span class="o">-</span><span class="n">num_samples</span> <span class="o">/</span> <span class="m">2</span> <span class="o">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">num_samples</span> <span class="o">/</span> <span class="m">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">num_samples</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">weight</span> <span class="o">=</span> <span class="nf">dnorm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">samples</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="nf">random_string</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_quantile</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">column</span> <span class="o">=</span> <span class="s">&#34;x&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">weight.column</span> <span class="o">=</span> <span class="s">&#34;weight&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">probabilities</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span> <span class="m">0.75</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">relative.error</span> <span class="o">=</span> <span class="m">0.01</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##    25%    75%
## -0.696  0.662
</code></pre>
<p>Voilà! The estimates are not too far off from the true 25th and 75th percentiles (given the
aforementioned maximum permissible relative error of $0.01$):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pnorm</span><span class="p">(</span><span class="m">-0.696</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## [1] 0.2432144
</code></pre>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pnorm</span><span class="p">(</span><span class="m">0.662</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## [1] 0.7460144
</code></pre>
<h2 id="power-iteration-clustering">Power iteration clustering
</h2>
<p>Power iteration clustering (PIC), a simple and scalable graph clustering method
presented in Lin and Cohen (2010), first finds a low-dimensional embedding of a dataset, using
truncated power iteration on a normalized pairwise-similarity matrix of all data
points, and then uses this embedding as the &ldquo;cluster indicator&rdquo;, an intermediate
representation of the dataset that leads to fast convergence when used as input
to k-means clustering. This process is very well illustrated in figure 1
of Lin and Cohen (2010) (reproduced below)</p>
<img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.6/images/PIC.png" width="612" />
<p>in which the leftmost image is a visualization of a dataset consisting of 3
circles, with points colored red, green, and blue to indicate clustering
results, and the subsequent images show the power iteration process gradually
transforming the original set of points into what appears to be three disjoint line
segments, an intermediate representation that can be rapidly separated into 3
clusters using k-means clustering with $k = 3$.</p>
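<p>In the notation of Lin and Cohen (2010): given an affinity matrix $A$ whose entry $A_{ij}$
measures the similarity between points $i$ and $j$, PIC works with the row-normalized matrix
$W = D^{-1}A$, where $D$ is the diagonal matrix with $D_{ii} = \sum_{j} A_{ij}$, and repeatedly
applies the update $v^{(t+1)} = \frac{W v^{(t)}}{\lVert W v^{(t)} \rVert_1}$. The iteration is
truncated, i.e., stopped early while $v$ still separates the clusters, rather than being run to
convergence on the dominant eigenvector of $W$, which would be uninformative.</p>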
<p>In <code>sparklyr</code> 1.6, <code>ml_power_iteration()</code> was implemented to make the
<a href="http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html" target="_blank" rel="noopener">PIC functionality</a>

in Spark accessible from R. It expects as input a 3-column Spark dataframe that
represents a pairwise-similarity matrix of all data points. Two of
the columns in this dataframe should contain 0-based row and column indices, and
the third column should hold the corresponding similarity measure.
In the example below, we will see a dataset consisting of two circles being
easily separated into two clusters by <code>ml_power_iteration()</code>, with the Gaussian
kernel being used as the similarity measure between any 2 points:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">gen_similarity_matrix</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># Gaussian similarity measure</span>
</span></span><span class="line"><span class="cl">  <span class="n">gaussian_similarity</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">pt1</span><span class="p">,</span> <span class="n">pt2</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">exp</span><span class="p">(</span><span class="o">-</span><span class="nf">sum</span><span class="p">((</span><span class="n">pt2</span> <span class="o">-</span> <span class="n">pt1</span><span class="p">)</span> <span class="n">^</span> <span class="m">2</span><span class="p">)</span> <span class="o">/</span> <span class="m">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># generate evenly distributed points on a circle centered at the origin</span>
</span></span><span class="line"><span class="cl">  <span class="n">gen_circle</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">radius</span><span class="p">,</span> <span class="n">num_pts</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">num_pts</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">      <span class="n">purrr</span><span class="o">::</span><span class="nf">map_dfr</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="kr">function</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">          <span class="n">theta</span> <span class="o">&lt;-</span> <span class="m">2</span> <span class="o">*</span> <span class="kc">pi</span> <span class="o">*</span> <span class="n">idx</span> <span class="o">/</span> <span class="n">num_pts</span>
</span></span><span class="line"><span class="cl">          <span class="n">radius</span> <span class="o">*</span> <span class="nf">c</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">),</span> <span class="n">y</span> <span class="o">=</span> <span class="nf">sin</span><span class="p">(</span><span class="n">theta</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="p">})</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># generate points on both circles</span>
</span></span><span class="line"><span class="cl">  <span class="n">pts</span> <span class="o">&lt;-</span> <span class="nf">rbind</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">gen_circle</span><span class="p">(</span><span class="n">radius</span> <span class="o">=</span> <span class="m">1</span><span class="p">,</span> <span class="n">num_pts</span> <span class="o">=</span> <span class="m">80</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">gen_circle</span><span class="p">(</span><span class="n">radius</span> <span class="o">=</span> <span class="m">4</span><span class="p">,</span> <span class="n">num_pts</span> <span class="o">=</span> <span class="m">80</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># populate the pairwise similarity matrix (stored as a 3-column dataframe)</span>
</span></span><span class="line"><span class="cl">  <span class="n">similarity_matrix</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">  <span class="kr">for</span> <span class="p">(</span><span class="n">i</span> <span class="kr">in</span> <span class="nf">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="nf">nrow</span><span class="p">(</span><span class="n">pts</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">    <span class="n">similarity_matrix</span> <span class="o">&lt;-</span> <span class="n">similarity_matrix</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">      <span class="nf">rbind</span><span class="p">(</span><span class="nf">seq</span><span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="m">1L</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">        <span class="n">purrr</span><span class="o">::</span><span class="nf">map_dfr</span><span class="p">(</span><span class="o">~</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">          <span class="n">src</span> <span class="o">=</span> <span class="n">i</span> <span class="o">-</span> <span class="m">1L</span><span class="p">,</span> <span class="n">dst</span> <span class="o">=</span> <span class="n">.x</span> <span class="o">-</span> <span class="m">1L</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="n">similarity</span> <span class="o">=</span> <span class="nf">gaussian_similarity</span><span class="p">(</span><span class="n">pts[i</span><span class="p">,</span><span class="n">]</span><span class="p">,</span> <span class="n">pts[.x</span><span class="p">,</span><span class="n">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="p">))</span>
</span></span><span class="line"><span class="cl">      <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">similarity_matrix</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="nf">gen_similarity_matrix</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="n">clusters</span> <span class="o">&lt;-</span> <span class="nf">ml_power_iteration</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sdf</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="m">2</span><span class="p">,</span> <span class="n">max_iter</span> <span class="o">=</span> <span class="m">10</span><span class="p">,</span> <span class="n">init_mode</span> <span class="o">=</span> <span class="s">&#34;degree&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">src_col</span> <span class="o">=</span> <span class="s">&#34;src&#34;</span><span class="p">,</span> <span class="n">dst_col</span> <span class="o">=</span> <span class="s">&#34;dst&#34;</span><span class="p">,</span> <span class="n">weight_col</span> <span class="o">=</span> <span class="s">&#34;similarity&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">clusters</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="m">160</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # A tibble: 160 x 2
##        id cluster
##     &lt;dbl&gt;   &lt;int&gt;
##   1     0       1
##   2     1       1
##   3     2       1
##   4     3       1
##   5     4       1
##   ...
##   157   156       0
##   158   157       0
##   159   158       0
##   160   159       0
</code></pre>
<p>The output shows points from the two circles being assigned to separate clusters,
as expected, after only a small number of PIC iterations.</p>
<h2 id="spark_write_rds--collect_from_rds"><code>spark_write_rds()</code> + <code>collect_from_rds()</code>
</h2>
<p><code>spark_write_rds()</code> and <code>collect_from_rds()</code> are implemented as a less
memory-consuming alternative to <code>collect()</code>. Unlike <code>collect()</code>, which retrieves all
elements of a Spark dataframe through the Spark driver node, hence potentially
causing slowness or out-of-memory failures when collecting large amounts of data,
<code>spark_write_rds()</code>, when used in conjunction with <code>collect_from_rds()</code>, can
retrieve all partitions of a Spark dataframe directly from Spark workers,
rather than through the Spark driver node.
First, <code>spark_write_rds()</code> will
distribute the tasks of serializing Spark dataframe partitions in RDS version
2 format among Spark workers. Spark workers can then process multiple partitions
in parallel, each handling one partition at a time and persisting the RDS output
directly to disk, rather than sending dataframe partitions to the Spark driver
node. Finally, the RDS outputs can be re-assembled to R dataframes using
<code>collect_from_rds()</code>.</p>
<p>Shown below is an example of <code>spark_write_rds()</code> + <code>collect_from_rds()</code> usage,
where RDS outputs are first saved to HDFS, then downloaded to the local
filesystem with <code>hadoop fs -get</code>, and finally, post-processed with
<code>collect_from_rds()</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">nycflights13</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">num_partitions</span> <span class="o">&lt;-</span> <span class="m">10L</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;yarn&#34;</span><span class="p">,</span> <span class="n">spark_home</span> <span class="o">=</span> <span class="s">&#34;/usr/lib/spark&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">flights_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">flights</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="n">num_partitions</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Spark workers serialize all partitions in RDS format in parallel and write RDS</span>
</span></span><span class="line"><span class="cl"><span class="c1"># outputs to HDFS</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_write_rds</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">flights_sdf</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">dest_uri</span> <span class="o">=</span> <span class="s">&#34;hdfs://&lt;namenode&gt;:8020/flights-part-{partitionId}.rds&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run `hadoop fs -get` to download RDS files from HDFS to local file system</span>
</span></span><span class="line"><span class="cl"><span class="kr">for</span> <span class="p">(</span><span class="n">partition</span> <span class="kr">in</span> <span class="nf">seq</span><span class="p">(</span><span class="n">num_partitions</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="nf">system2</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="s">&#34;hadoop&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nf">c</span><span class="p">(</span><span class="s">&#34;fs&#34;</span><span class="p">,</span> <span class="s">&#34;-get&#34;</span><span class="p">,</span> <span class="nf">sprintf</span><span class="p">(</span><span class="s">&#34;hdfs://&lt;namenode&gt;:8020/flights-part-%d.rds&#34;</span><span class="p">,</span> <span class="n">partition</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Post-process RDS outputs</span>
</span></span><span class="line"><span class="cl"><span class="n">partitions</span> <span class="o">&lt;-</span> <span class="p">(</span><span class="nf">seq</span><span class="p">(</span><span class="n">num_partitions</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">lapply</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">partition</span><span class="p">)</span> <span class="nf">collect_from_rds</span><span class="p">(</span><span class="nf">sprintf</span><span class="p">(</span><span class="s">&#34;flights-part-%d.rds&#34;</span><span class="p">,</span> <span class="n">partition</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Optionally, call `rbind()` to combine data from all partitions into a single R dataframe</span>
</span></span><span class="line"><span class="cl"><span class="n">flights_df</span> <span class="o">&lt;-</span> <span class="nf">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span> <span class="n">partitions</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="dplyr-related-improvements">Dplyr-related improvements
</h2>
<p>Similar to other recent <code>sparklyr</code> releases, <code>sparklyr</code> 1.6 comes with a
number of dplyr-related improvements, such as</p>
<ul>
<li>Support for <code>where()</code> predicate within <code>select()</code> and <code>summarize(across(...))</code>
operations on Spark dataframes</li>
<li>Addition of <code>if_all()</code> and <code>if_any()</code> functions</li>
<li>Full compatibility with <code>dbplyr</code> 2.0 backend API</li>
</ul>
<h3 id="selectwhere-and-summarizeacrosswhere"><code>select(where(...))</code> and <code>summarize(across(where(...)))</code>
</h3>
<p>The dplyr <code>where(...)</code> construct is useful for applying a selection or
aggregation function to multiple columns that satisfy some boolean predicate.
For example,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">iris</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="nf">where</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>returns all numeric columns from the <code>iris</code> dataset, and</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">iris</span> <span class="o">%&gt;%</span> <span class="nf">summarize</span><span class="p">(</span><span class="nf">across</span><span class="p">(</span><span class="nf">where</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">),</span> <span class="n">mean</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>computes the average of each numeric column.</p>
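<p>For reference, running the <code>summarize()</code> call above on the built-in <code>iris</code> data frame
prints approximately:</p>
<pre><code>##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333
</code></pre>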
<p>In <code>sparklyr</code> 1.6, both types of operations can be applied to Spark dataframes, e.g.,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">iris</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="nf">random_string</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">%&gt;%</span> <span class="nf">select</span><span class="p">(</span><span class="nf">where</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">%&gt;%</span> <span class="nf">summarize</span><span class="p">(</span><span class="nf">across</span><span class="p">(</span><span class="nf">where</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">),</span> <span class="n">mean</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="if_all-and-if_any"><code>if_all()</code> and <code>if_any()</code>
</h3>
<p><code>if_all()</code> and <code>if_any()</code> are two convenience functions from <code>dplyr</code> 1.0.4 (see
<a href="https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any" target="_blank" rel="noopener">here</a>
 for more details)
that effectively <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>
combine the results of applying a boolean predicate to a tidy selection of columns
using the logical <code>and</code>/<code>or</code> operators.</p>
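<p>To see their semantics on an ordinary R dataframe first, here is a minimal local sketch (the <code>all_gt2</code>/<code>any_gt5</code> names and the thresholds are ours, chosen to mirror the Spark example that follows):</p>

```r
library(dplyr)

# if_all(): keep rows where EVERY "Petal" column exceeds 2 (logical AND)
all_gt2 <- iris %>% filter(if_all(starts_with("Petal"), ~ .x > 2))

# if_any(): keep rows where AT LEAST ONE "Petal" column exceeds 5 (logical OR)
any_gt5 <- iris %>% filter(if_any(starts_with("Petal"), ~ .x > 5))
```

<p>Every row of <code>all_gt2</code> satisfies both per-column conditions, whereas a row of <code>any_gt5</code> needs to satisfy only one of them.</p>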
<p>Starting from sparklyr 1.6, <code>if_all()</code> and <code>if_any()</code> can also be applied to
Spark dataframes, e.g.,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">iris</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="nf">random_string</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Select all records with Petal.Width &gt; 2 and Petal.Length &gt; 2</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="nf">if_all</span><span class="p">(</span><span class="nf">starts_with</span><span class="p">(</span><span class="s">&#34;Petal&#34;</span><span class="p">),</span> <span class="o">~</span> <span class="n">.x</span> <span class="o">&gt;</span> <span class="m">2</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Select all records with Petal.Width &gt; 5 or Petal.Length &gt; 5</span>
</span></span><span class="line"><span class="cl"><span class="n">iris_sdf</span> <span class="o">%&gt;%</span> <span class="nf">filter</span><span class="p">(</span><span class="nf">if_any</span><span class="p">(</span><span class="nf">starts_with</span><span class="p">(</span><span class="s">&#34;Petal&#34;</span><span class="p">),</span> <span class="o">~</span> <span class="n">.x</span> <span class="o">&gt;</span> <span class="m">5</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="compatibility-with-dbplyr-20-backend-api">Compatibility with <code>dbplyr</code> 2.0 backend API
</h3>
<p><code>sparklyr</code> 1.6 is fully compatible with the newer <code>dbplyr</code> 2.0 backend API (by
implementing all interface changes recommended
<a href="https://dbplyr.tidyverse.org/articles/backend-2.html" target="_blank" rel="noopener">here</a>
), while still
maintaining backward compatibility with the previous edition of <code>dbplyr</code> API, so
that <code>sparklyr</code> users will not be forced to switch to any particular version of
<code>dbplyr</code>.</p>
<p>This change should be mostly invisible to users for now. In fact, the only
discernible difference in behavior will be the following code</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dbplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="nf">dbplyr_edition</span><span class="p">(</span><span class="n">sc</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>outputting</p>
<pre><code>[1] 2
</code></pre>
<p>if <code>sparklyr</code> is working with <code>dbplyr</code> 2.0+, and</p>
<pre><code>[1] 1
</code></pre>
<p>otherwise.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>In chronological order, we would like to thank the following contributors for
making <code>sparklyr</code> 1.6 awesome:</p>
<ul>
<li><a href="https://github.com/yitao-li" target="_blank" rel="noopener">@yitao-li</a>
</li>
<li><a href="https://github.com/pgramme" target="_blank" rel="noopener">@pgramme</a>
</li>
<li><a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
</li>
<li><a href="https://github.com/andrew-christianson" target="_blank" rel="noopener">@andrew-christianson</a>
</li>
<li><a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">@jozefhajnala</a>
</li>
<li><a href="https://github.com/nathaneastwood" target="_blank" rel="noopener">@nathaneastwood</a>
</li>
<li><a href="https://github.com/mzorko" target="_blank" rel="noopener">@mzorko</a>
</li>
</ul>
<p>We would also like to give a big shout-out to the wonderful open-source community
behind <code>sparklyr</code>, whose numerous bug reports and feature suggestions have
been invaluable to this release.</p>
<p>Finally, the author of this blog post also very much appreciates the highly
valuable editorial suggestions from <a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>
.</p>
<p>If you wish to learn more about <code>sparklyr</code>, we recommend checking out
<a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
,
and also some previous <code>sparklyr</code> release posts such as
<a href="https://posit-open-source.netlify.app/blog/ai/2020-12-14-sparklyr-1.5.0-released/">sparklyr 1.5</a>

and <a href="https://posit-open-source.netlify.app/blog/ai/2020-09-30-sparklyr-1.4.0-released/">sparklyr 1.4</a>
.</p>
<p>That is all. Thanks for reading!</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>modulo possible implementation-dependent short-circuit evaluations&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.6/thumbnail.jpg" length="89031" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.5: better dplyr interface, more sdf_* functions, and RDS-based serialization routines</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.5/</link>
      <pubDate>Mon, 14 Dec 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.5/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p>We are thrilled to announce <a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 1.5 is now
available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
!</p>
<p>To install <code>sparklyr</code> 1.5 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we will highlight the following aspects of <code>sparklyr</code> 1.5:</p>
<ul>
<li><a href="#better-dplyr-interface">Better <code>dplyr</code> interface</a>
</li>
<li><a href="#new-additions-to-the-sdf_-family-of-functions">4 useful additions to the <code>sdf_*</code> family of functions</a>
</li>
<li>New <a href="#rds-based-serialization-routines">RDS-based serialization routines</a>
 along with several serialization-related improvements and bug fixes</li>
</ul>
<h2 id="better-dplyr-interface">Better dplyr interface
</h2>
<p>A large fraction of pull requests that went into the <code>sparklyr</code> 1.5 release were focused on making
Spark dataframes work with various <code>dplyr</code> verbs in the same way that R dataframes do.
The full list of <code>dplyr</code>-related bugs and feature requests that were resolved in
<code>sparklyr</code> 1.5 can be found <a href="https://github.com/sparklyr/sparklyr/issues?q=is%3Aissue&#43;is%3Aclosed&#43;label%3Adplyr&#43;milestone%3A1.5.0" target="_blank" rel="noopener">here</a>
.</p>
<p>In this section, we will showcase three new <code>dplyr</code> functionalities that were shipped with <code>sparklyr</code> 1.5.</p>
<h3 id="stratified-sampling">Stratified sampling
</h3>
<p>Stratified sampling on an R dataframe can be accomplished with a combination of <code>dplyr::group_by()</code> followed by
<code>dplyr::sample_n()</code> or <code>dplyr::sample_frac()</code>, where the grouping variables specified in the <code>dplyr::group_by()</code>
step are the ones that define each stratum. For instance, the following query will group <code>mtcars</code> by number
of cylinders and return a weighted random sample of size two from each group, without replacement, and weighted by
the <code>mpg</code> column:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_n</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">2</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # A tibble: 6 x 11
## # Groups:   cyl [3]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
## 2  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
## 3  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
## 4  21       6 160     110  3.9   2.62  16.5     0     1     4     4
## 5  15.5     8 318     150  2.76  3.52  16.9     0     0     3     2
## 6  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2
</code></pre>
<p>Starting from <code>sparklyr</code> 1.5, the same can also be done for Spark dataframes with Spark 3.0 or above, e.g.,:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">mtcars</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_n</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">2</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 11]
# Groups: cyl
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1  21       6 160     110  3.9   2.62  16.5     0     1     4     4
2  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
3  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
5  16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3
6  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2
</code></pre>
<p>or</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_frac</span><span class="p">(</span><span class="n">size</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # Source: spark&lt;?&gt; [?? x 11]
## # Groups: cyl
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1  21       6 160     110  3.9   2.62  16.5     0     1     4     4
## 2  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
## 3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
## 4  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
## 5  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
## 6  15.5     8 318     150  2.76  3.52  16.9     0     0     3     2
## 7  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2
## 8  16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3
</code></pre>
<h3 id="row-sums">Row sums
</h3>
<p>The <code>rowSums()</code> functionality from base R is handy when one needs to sum up
a large number of columns within an R dataframe that would be impractical to
enumerate individually.
For example, here we have a six-column dataframe of random real numbers, where the
<code>partial_sum</code> column in the result contains the sum of columns <code>b</code> through <code>e</code> within
each row:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">ncols</span> <span class="o">&lt;-</span> <span class="m">6</span>
</span></span><span class="line"><span class="cl"><span class="n">nums</span> <span class="o">&lt;-</span> <span class="nf">seq</span><span class="p">(</span><span class="n">ncols</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">lapply</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="nf">runif</span><span class="p">(</span><span class="m">5</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="nf">names</span><span class="p">(</span><span class="n">nums</span><span class="p">)</span> <span class="o">&lt;-</span> <span class="kc">letters</span><span class="n">[1</span><span class="o">:</span><span class="n">ncols]</span>
</span></span><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">as_tibble</span><span class="p">(</span><span class="n">nums</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">partial_sum</span> <span class="o">=</span> <span class="nf">rowSums</span><span class="p">(</span><span class="n">.[2</span><span class="o">:</span><span class="m">5</span><span class="n">]</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # A tibble: 5 x 7
##         a     b     c      d     e      f partial_sum
##     &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;       &lt;dbl&gt;
## 1 0.781   0.801 0.157 0.0293 0.169 0.0978        1.16
## 2 0.696   0.412 0.221 0.941  0.697 0.675         2.27
## 3 0.802   0.410 0.516 0.923  0.190 0.904         2.04
## 4 0.200   0.590 0.755 0.494  0.273 0.807         2.11
## 5 0.00149 0.711 0.286 0.297  0.107 0.425         1.40
</code></pre>
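<p>Note that <code>.[2:5]</code> refers to columns 2 through 5 of the incoming dataframe, i.e., <code>b</code> through <code>e</code>. The same partial sums can be checked with base R alone; the deterministic <code>tbl2</code> below is our own toy data, chosen so the sums are easy to verify by eye:</p>

```r
# Deterministic 6-column frame; partial_sum should be b + c + d + e per row
tbl2 <- data.frame(
  a = c(1, 2), b = c(10, 20), c = c(100, 200),
  d = c(0.5, 0.5), e = c(3, 4), f = c(7, 8)
)
tbl2$partial_sum <- rowSums(tbl2[2:5])

tbl2$partial_sum
# [1] 113.5 224.5
```

<p>Here <code>partial_sum</code> is 113.5 and 224.5, i.e., <code>b + c + d + e</code> for each row, confirming which columns <code>.[2:5]</code> selects.</p>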
<p>Beginning with <code>sparklyr</code> 1.5, the same operation can be performed with Spark dataframes:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">tbl</span><span class="p">,</span> <span class="n">overwrite</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">partial_sum</span> <span class="o">=</span> <span class="nf">rowSums</span><span class="p">(</span><span class="n">.[2</span><span class="o">:</span><span class="m">5</span><span class="n">]</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # Source: spark&lt;?&gt; [?? x 7]
##         a     b     c      d     e      f partial_sum
##     &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;       &lt;dbl&gt;
## 1 0.781   0.801 0.157 0.0293 0.169 0.0978        1.16
## 2 0.696   0.412 0.221 0.941  0.697 0.675         2.27
## 3 0.802   0.410 0.516 0.923  0.190 0.904         2.04
## 4 0.200   0.590 0.755 0.494  0.273 0.807         2.11
## 5 0.00149 0.711 0.286 0.297  0.107 0.425         1.40
</code></pre>
<p>As a bonus from implementing the <code>rowSums</code> feature for Spark dataframes,
<code>sparklyr</code> 1.5 now also offers limited support for the column-subsetting
operator on Spark dataframes.
For example, all code snippets below will return some subset of columns from
the dataframe named <code>sdf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># select columns `b` through `e`</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf[2</span><span class="o">:</span><span class="m">5</span><span class="n">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># select columns `b` and `c`</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span><span class="nf">[c</span><span class="p">(</span><span class="s">&#34;b&#34;</span><span class="p">,</span> <span class="s">&#34;c&#34;</span><span class="p">)</span><span class="n">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># drop the first and third columns and return the rest</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span><span class="nf">[c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span> <span class="m">-3</span><span class="p">)</span><span class="n">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="weighted-mean-summarizer">Weighted-mean summarizer
</h3>
<p>Similar to the two features described above, the <code>weighted.mean()</code> summarizer is another
useful function that has become part of the <code>dplyr</code> interface for Spark dataframes in <code>sparklyr</code> 1.5.
One can see it in action by, for example, comparing the output from the following</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">mtcars</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">summarize</span><span class="p">(</span><span class="n">mpg_wm</span> <span class="o">=</span> <span class="nf">weighted.mean</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="n">wt</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>with output from the equivalent operation on <code>mtcars</code> in R:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">group_by</span><span class="p">(</span><span class="n">cyl</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">summarize</span><span class="p">(</span><span class="n">mpg_wm</span> <span class="o">=</span> <span class="nf">weighted.mean</span><span class="p">(</span><span class="n">mpg</span><span class="p">,</span> <span class="n">wt</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>both of them should evaluate to the following:</p>
<pre><code>##     cyl mpg_wm
##   &lt;dbl&gt;  &lt;dbl&gt;
## 1     4   25.9
## 2     6   19.6
## 3     8   14.8
</code></pre>
<h2 id="new-additions-to-the-sdf_-family-of-functions">New additions to the <code>sdf_*</code> family of functions
</h2>
<p><code>sparklyr</code> provides a large number of convenience functions for working with Spark dataframes,
and all of them have names starting with the <code>sdf_</code> prefix.</p>
<p>In this section we will briefly mention four new additions
and show some example scenarios in which those functions are useful.</p>
<h3 id="sdf_expand_grid"><code>sdf_expand_grid()</code>
</h3>
<p>As the name suggests, <code>sdf_expand_grid()</code> is simply the Spark equivalent of <code>expand.grid()</code>.
Rather than running <code>expand.grid()</code> in R and importing the resulting R dataframe to Spark, one
can now run <code>sdf_expand_grid()</code>, which accepts both R vectors and Spark dataframes and supports
hints for broadcast hash joins. The example below shows <code>sdf_expand_grid()</code> creating a
100-by-100-by-10-by-10 grid in Spark over 1000 Spark partitions, with broadcast hash join hints
on variables with small cardinalities:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">grid_sdf</span> <span class="o">&lt;-</span> <span class="nf">sdf_expand_grid</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">var1</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">100</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">var2</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">100</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">var3</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">10</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">var4</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">10</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">broadcast_vars</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">var3</span><span class="p">,</span> <span class="n">var4</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">repartition</span> <span class="o">=</span> <span class="m">1000</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">grid_sdf</span> <span class="o">%&gt;%</span> <span class="nf">sdf_nrow</span><span class="p">()</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## [1] 1e+06
</code></pre>
<h3 id="sdf_partition_sizes"><code>sdf_partition_sizes()</code>
</h3>
<p>As <code>sparklyr</code> user <a href="https://github.com/sbottelli" target="_blank" rel="noopener">@sbottelli</a>
 suggested <a href="https://github.com/sparklyr/sparklyr/issues/2791" target="_blank" rel="noopener">here</a>
,
one thing that would be great to have in <code>sparklyr</code> is an efficient way to query partition sizes of a Spark dataframe.
In <code>sparklyr</code> 1.5, <code>sdf_partition_sizes()</code> does exactly that:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">1000</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_partition_sizes</span><span class="p">()</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">(</span><span class="n">row.names</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##  partition_index partition_size
##                0            200
##                1            200
##                2            200
##                3            200
##                4            200
</code></pre>
<h3 id="sdf_unnest_longer-and-sdf_unnest_wider"><code>sdf_unnest_longer()</code> and <code>sdf_unnest_wider()</code>
</h3>
<p><code>sdf_unnest_longer()</code> and <code>sdf_unnest_wider()</code> are the equivalents of
<code>tidyr::unnest_longer()</code> and <code>tidyr::unnest_wider()</code> for Spark dataframes.
<code>sdf_unnest_longer()</code> expands all elements in a struct column into multiple rows, and
<code>sdf_unnest_wider()</code> expands them into multiple columns. As illustrated with an example
dataframe below,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">id</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">3</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">attribute</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;Alice&#34;</span><span class="p">,</span> <span class="n">grade</span> <span class="o">=</span> <span class="s">&#34;A&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">      <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;Bob&#34;</span><span class="p">,</span> <span class="n">grade</span> <span class="o">=</span> <span class="s">&#34;B&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">      <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;Carol&#34;</span><span class="p">,</span> <span class="n">grade</span> <span class="o">=</span> <span class="s">&#34;C&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_unnest_longer</span><span class="p">(</span><span class="n">col</span> <span class="o">=</span> <span class="n">attribute</span><span class="p">,</span> <span class="n">indices_to</span> <span class="o">=</span> <span class="s">&#34;key&#34;</span><span class="p">,</span> <span class="n">values_to</span> <span class="o">=</span> <span class="s">&#34;value&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>evaluates to</p>
<pre><code>## # Source: spark&lt;?&gt; [?? x 3]
##      id value key
##   &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
## 1     1 A     grade
## 2     1 Alice name
## 3     2 B     grade
## 4     2 Bob   name
## 5     3 C     grade
## 6     3 Carol name
</code></pre>
<p>whereas</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_unnest_wider</span><span class="p">(</span><span class="n">col</span> <span class="o">=</span> <span class="n">attribute</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>evaluates to</p>
<pre><code>## # Source: spark&lt;?&gt; [?? x 3]
##      id grade name
##   &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
## 1     1 A     Alice
## 2     2 B     Bob
## 3     3 C     Carol
</code></pre>
<h2 id="rds-based-serialization-routines">RDS-based serialization routines
</h2>
<p>Some readers may be wondering why a brand-new serialization format would need to be implemented in <code>sparklyr</code> at all.
Long story short, the reason is that RDS serialization is a strictly better replacement for its CSV predecessor.
It possesses all the desirable attributes of the CSV format,
while avoiding a number of disadvantages common among text-based data formats.</p>
<p>In this section, we will briefly outline why <code>sparklyr</code> should support at least one serialization format other than <code>arrow</code>,
deep-dive into issues with CSV-based serialization,
and then show how the new RDS-based serialization is free from those issues.</p>
<h3 id="why-arrow-is-not-for-everyone">Why is <code>arrow</code> not for everyone?
</h3>
<p>To transfer data between Spark and R correctly and efficiently, <code>sparklyr</code> must rely on some data serialization
format that is well-supported by both Spark and R.
Unfortunately, not many serialization formats satisfy this requirement,
and among the ones that do are text-based formats such as CSV and JSON,
and binary formats such as Apache Arrow, Protobuf, and, more recently, a small subset of RDS version 2.
Further complicating the matter is the additional consideration that
<code>sparklyr</code> should support at least one serialization format whose implementation can be fully self-contained within the <code>sparklyr</code> code base,
i.e., such serialization should not depend on any external R package or system library,
so that it can accommodate users who want to use <code>sparklyr</code> but who do not necessarily have the required C++ compiler tool chain and
other system dependencies for setting up R packages such as <a href="https://cran.r-project.org/web/packages/arrow/index.html" target="_blank" rel="noopener"><code>arrow</code></a>
 or
<a href="https://cran.r-project.org/web/packages/protolite/index.html" target="_blank" rel="noopener"><code>protolite</code></a>
.
Prior to <code>sparklyr</code> 1.5, CSV-based serialization was the default fallback when the <code>arrow</code> package was not installed or
when the type of data being transported from R to Spark was unsupported by the available version of <code>arrow</code>.</p>
<h3 id="why-is-the-csv-format-not-ideal">Why is the CSV format not ideal?
</h3>
<p>There are at least three reasons to believe the CSV format is not the best choice when it comes to exporting data from R to Spark.</p>
<p>One reason is efficiency. For example, a double-precision floating-point number such as <code>.Machine$double.eps</code> needs to
be expressed as <code>&quot;2.22044604925031e-16&quot;</code> in CSV format to avoid any loss of precision, thus taking up 20 bytes
rather than 8 bytes.</p>
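<p>The arithmetic is easy to check from plain R, with no Spark connection needed: the decimal string quoted above is 20 characters long, versus the fixed 8 bytes per element that doubles occupy in binary form:</p>
<pre><code class="language-r"># the round-trip-safe decimal rendering of .Machine$double.eps
nchar(&#34;2.22044604925031e-16&#34;)
## [1] 20

# versus 8 bytes per element in binary form: one million doubles
# occupy about 8 MB (plus a small vector header)
print(object.size(vector(&#34;double&#34;, 1e6)), units = &#34;MB&#34;)
</code></pre>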
<p>But more important than efficiency are correctness concerns. In an R dataframe, one can store both <code>NA_real_</code> and
<code>NaN</code> in a column of floating point numbers. <code>NA_real_</code> should ideally translate to <code>null</code> within a Spark dataframe, whereas
<code>NaN</code> should continue to be <code>NaN</code> when transported from R to Spark. Unfortunately, <code>NA_real_</code> in R becomes indistinguishable
from <code>NaN</code> once serialized in CSV format, as evident from a quick demo shown below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">original_df</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="kc">NA_real_</span><span class="p">,</span> <span class="kc">NaN</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">original_df</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">is_nan</span> <span class="o">=</span> <span class="nf">is.nan</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##     x is_nan
## 1  NA  FALSE
## 2 NaN   TRUE
</code></pre>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">csv_file</span> <span class="o">&lt;-</span> <span class="s">&#34;/tmp/data.csv&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nf">write.csv</span><span class="p">(</span><span class="n">original_df</span><span class="p">,</span> <span class="n">file</span> <span class="o">=</span> <span class="n">csv_file</span><span class="p">,</span> <span class="n">row.names</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">deserialized_df</span> <span class="o">&lt;-</span> <span class="nf">read.csv</span><span class="p">(</span><span class="n">csv_file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">deserialized_df</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">is_nan</span> <span class="o">=</span> <span class="nf">is.nan</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##    x is_nan
## 1 NA  FALSE
## 2 NA  FALSE
</code></pre>
<p>Another correctness issue, closely related to the one above, is that
<code>&quot;NA&quot;</code> and <code>NA</code> within a string column of an R dataframe become indistinguishable
once serialized in CSV format, as correctly pointed out in
<a href="https://github.com/sparklyr/sparklyr/issues/2031" target="_blank" rel="noopener">this Github issue</a>

by <a href="https://github.com/caewok" target="_blank" rel="noopener">@caewok</a>
 and others.</p>
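<p>This collapse is just as easy to reproduce in plain R, mirroring the floating-point demo above (the temporary file path below is illustrative):</p>
<pre><code class="language-r">original_df &lt;- data.frame(s = c(NA_character_, &#34;NA&#34;), stringsAsFactors = FALSE)

csv_file &lt;- tempfile(fileext = &#34;.csv&#34;)
write.csv(original_df, file = csv_file, row.names = FALSE)
deserialized_df &lt;- read.csv(csv_file, stringsAsFactors = FALSE)

# after the CSV round trip, both the missing value and the literal
# string &#34;NA&#34; test as missing
print(is.na(deserialized_df$s))
</code></pre>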
<h3 id="rds-to-the-rescue">RDS to the rescue!
</h3>
<p>RDS format is one of the most widely used binary formats for serializing R objects.
It is described in some detail in chapter 1, section 8 of
<a href="https://cran.r-project.org/doc/manuals/r-patched/R-ints.pdf" target="_blank" rel="noopener">this document</a>
.
Among advantages of the RDS format are efficiency and accuracy: it has a reasonably
efficient implementation in base R, and supports all R data types.</p>
<p>Also worth noting is that when an R dataframe containing only data types
with sensible equivalents in Apache Spark (e.g., <code>RAWSXP</code>, <code>LGLSXP</code>, <code>CHARSXP</code>, <code>REALSXP</code>, etc.)
is saved using RDS version 2,
(e.g., <code>serialize(mtcars, connection = NULL, version = 2L, xdr = TRUE)</code>),
only a tiny subset of the RDS format will be involved in the serialization process,
and implementing deserialization routines in Scala capable of decoding such a restricted
subset of RDS constructs is in fact a reasonably simple and straightforward task
(as shown
<a href="https://github.com/sparklyr/sparklyr/blob/5e27668f16faa4852deae2db14828cfd1614c982/java/spark-1.5.2/rutils.scala#L47" target="_blank" rel="noopener">here</a>

).</p>
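<p>To get a feel for how restricted that subset is, one can produce such a version-2 serialization directly from base R and inspect it:</p>
<pre><code class="language-r"># version-2 RDS bytes in XDR (big-endian) form
bytes &lt;- serialize(mtcars, connection = NULL, version = 2L, xdr = TRUE)

# binary serializations start with the two-byte marker &#34;X\n&#34;
rawToChar(bytes[1:2])
## [1] &#34;X\n&#34;

# and the payload round-trips losslessly within R
identical(unserialize(bytes), mtcars)
## [1] TRUE
</code></pre>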
<p>Last but not least, because RDS is a binary format, it allows <code>NA_character_</code>, <code>&quot;NA&quot;</code>,
<code>NA_real_</code>, and <code>NaN</code> to all be encoded in an unambiguous manner, hence allowing <code>sparklyr</code>
1.5 to avoid all correctness issues detailed above in non-<code>arrow</code> serialization use cases.</p>
<h3 id="other-benefits-of-rds-serialization">Other benefits of RDS serialization
</h3>
<p>In addition to correctness guarantees, RDS format also offers quite a few other advantages.</p>
<p>One advantage is of course performance: for example, importing a non-trivially-sized dataset
such as <code>nycflights13::flights</code> from R to Spark using the RDS format in sparklyr 1.5 is
roughly 40%-50% faster compared to CSV-based serialization in sparklyr 1.4. The
current RDS-based implementation is still nowhere near as fast as <code>arrow</code>-based serialization
though (<code>arrow</code> is about 3-4x faster), so for performance-sensitive tasks involving
heavy serialization, <code>arrow</code> should still be the top choice.</p>
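<p>For completeness: attaching the <code>arrow</code> package is typically all that is needed for <code>sparklyr</code> to pick the faster <code>arrow</code>-based serialization. The sketch below assumes <code>arrow</code> and <code>nycflights13</code> are installed:</p>
<pre><code class="language-r">library(sparklyr)
library(arrow)  # once attached, sparklyr uses arrow-based serialization
                # for copy_to() and collect() where possible

sc &lt;- spark_connect(master = &#34;local&#34;)
flights_sdf &lt;- copy_to(sc, nycflights13::flights, overwrite = TRUE)
</code></pre>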
<p>Another advantage is that with RDS serialization, <code>sparklyr</code> can import R dataframes containing
<code>raw</code> columns directly into binary columns in Spark. Thus, use cases such as the one below
will work in <code>sparklyr</code> 1.5:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">x</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">serialize</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">,</span> <span class="kc">NULL</span><span class="p">),</span> <span class="nf">serialize</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">123456</span><span class="p">,</span> <span class="m">789</span><span class="p">),</span> <span class="kc">NULL</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">tbl</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>While most <code>sparklyr</code> users probably won&rsquo;t find this capability of importing binary columns
to Spark immediately useful in their typical <code>sparklyr::copy_to()</code> or <code>sparklyr::collect()</code>
usages, it does play a crucial role in reducing serialization overheads in the Spark-based
<a href="https://blog.rstudio.com/2020/05/06/sparklyr-1-2/#foreach" target="_blank" rel="noopener"><code>foreach</code></a>
 parallel backend that
was first introduced in <code>sparklyr</code> 1.2.
This is because Spark workers can directly fetch the serialized R closures to be computed
from a binary Spark column instead of extracting those serialized bytes from intermediate
representations such as base64-encoded strings.
Similarly, the R results from executing worker closures will be directly available in RDS
format which can be efficiently deserialized in R, rather than being delivered in other
less efficient formats.</p>
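<p>From the user&rsquo;s perspective nothing changes: the backend is registered the same way as in <code>sparklyr</code> 1.2, and the RDS-based transport applies transparently. A minimal sketch:</p>
<pre><code class="language-r">library(sparklyr)
library(foreach)

sc &lt;- spark_connect(master = &#34;local&#34;)

# register the Spark-based foreach backend introduced in sparklyr 1.2
registerDoSpark(sc)

# each closure below is serialized, shipped to a Spark worker, and
# executed there; sparklyr 1.5 moves those bytes as binary columns
res &lt;- foreach(i = seq(3)) %dopar% i^2
print(res)
</code></pre>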
<h2 id="acknowledgement">Acknowledgement
</h2>
<p>In chronological order, we would like to thank the following contributors for making their pull
requests part of <code>sparklyr</code> 1.5:</p>
<ul>
<li><a href="https://github.com/wkdavis" target="_blank" rel="noopener">@wkdavis</a>
</li>
<li><a href="https://github.com/yitao-li" target="_blank" rel="noopener">@yitao-li</a>
</li>
<li><a href="https://github.com/falaki" target="_blank" rel="noopener">@falaki</a>
</li>
<li><a href="https://github.com/nathaneastwood" target="_blank" rel="noopener">@nathaneastwood</a>
</li>
<li><a href="https://github.com/pgramme" target="_blank" rel="noopener">@pgramme</a>
</li>
</ul>
<p>We are also grateful for the numerous bug reports and feature requests for
<code>sparklyr</code> from a fantastic open-source community.</p>
<p>Finally, the author of this blog post is indebted to
<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
,
<a href="https://github.com/batpigandme" target="_blank" rel="noopener">@batpigandme</a>
,
and <a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>
 for their valuable editorial input.</p>
<p>If you wish to learn more about <code>sparklyr</code>, check out <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
,
<a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, and some of the previous release posts such as
<a href="https://posit-open-source.netlify.app/blog/ai/2020-09-30-sparklyr-1.4.0-released">sparklyr 1.4</a>
 and
<a href="https://blog.rstudio.com/2020/07/16/sparklyr-1-3/" target="_blank" rel="noopener">sparklyr 1.3</a>
.</p>
<p>Thanks for reading!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.5/thumbnail.jpg" length="752491" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.4: Weighted Sampling, Tidyr Verbs, Robust Scaler, RAPIDS, and more</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/</link>
      <pubDate>Wed, 30 Sep 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 1.4 is now available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
! To install <code>sparklyr</code> 1.4 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this blog post, we will showcase the following much-anticipated new functionalities from the <code>sparklyr</code> 1.4 release:</p>
<ul>
<li><a href="#parallelized-weighted-sampling">Parallelized Weighted Sampling</a>
 with Spark</li>
<li>Support for <a href="#tidyr-verbs">Tidyr Verbs</a>
 on Spark Dataframes</li>
<li><a href="#robust-scaler"><code>ft_robust_scaler</code></a>
 as the R interface for <a href="https://spark.apache.org/docs/3.0.0/api/java/org/apache/spark/ml/feature/RobustScaler.html" target="_blank" rel="noopener">RobustScaler</a>
 from Spark 3.0</li>
<li>Option for enabling <a href="#rapids"><code>RAPIDS</code></a>
 GPU acceleration plugin in <code>spark_connect()</code></li>
<li><a href="#higher-order-functions-and-dplyr-related-improvements">Higher-order functions and <code>dplyr</code>-related improvements</a>
</li>
</ul>
<h2 id="parallelized-weighted-sampling">Parallelized Weighted Sampling
</h2>
<p>Readers familiar with the <code>dplyr::sample_n()</code> and <code>dplyr::sample_frac()</code> functions may have noticed that both support weighted-sampling use cases on R dataframes, e.g.,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_n</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">3</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128      32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
</code></pre>
<p>and</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_frac</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">0.1</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>             mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Merc 450SE  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Fiat X1-9   27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
</code></pre>
<p>will select a random subset of <code>mtcars</code>, using the <code>mpg</code> attribute of each row as its sampling weight. With <code>replace = FALSE</code>, a row is removed from the sampling population once it is selected, whereas with <code>replace = TRUE</code>, each row always stays in the sampling population and can be selected multiple times.</p>
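The difference between the two `replace` modes can be sketched in plain Python (used here only as a runnable illustration; `sparklyr` implements this inside Spark, not like this):

```python
import random

rng = random.Random(1)
rows = list(range(10))
weights = [r + 1.0 for r in rows]  # heavier rows are more likely to be drawn

# replace = TRUE: every draw sees the full population, so duplicates are possible
with_repl = rng.choices(rows, weights=weights, k=5)

# replace = FALSE: a selected row leaves the population before the next draw
pop, wts, without_repl = rows[:], weights[:], []
for _ in range(5):
    i = rng.choices(range(len(pop)), weights=wts, k=1)[0]
    without_repl.append(pop.pop(i))
    wts.pop(i)

print(len(set(without_repl)))  # sampling without replacement yields 5 distinct rows
```

Note how the without-replacement case is inherently sequential: each draw depends on which rows the previous draws removed.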
<p>Now the exact same use cases are supported for Spark dataframes in <code>sparklyr</code> 1.4! For example:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">mtcars</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">4L</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_n</span><span class="p">(</span><span class="n">mtcars_sdf</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="m">5</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>will return a random subset of size 5 from the Spark dataframe <code>mtcars_sdf</code>.</p>
<p>More importantly, the sampling algorithm implemented in <code>sparklyr</code> 1.4 fits naturally into the MapReduce paradigm: because we split our <code>mtcars</code> data into 4 partitions of <code>mtcars_sdf</code> by specifying <code>repartition = 4L</code>, the algorithm first processes each partition independently and in parallel, selecting a sample set of size up to 5 from each, and then reduces the 4 sample sets into a final sample set of size 5 by choosing the records with the 5 highest sampling priorities among all candidates.</p>
<p>How is such parallelization possible, especially for the sampling without replacement scenario, where the desired result is defined as the outcome of a sequential process? A detailed answer to this question is in <a href="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/">this blog post</a>
, which includes a definition of the problem (in particular, the exact meaning of sampling weights in terms of probabilities), a high-level explanation of the current solution and the motivation behind it, and some mathematical details, all hidden in one link to a PDF file, so that non-math-oriented readers can get the gist of everything else without getting scared away, while math-oriented readers can enjoy working out all the integrals themselves before peeking at the answer.</p>
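The partition-then-merge shape of the algorithm can be illustrated with a small standalone sketch. One classic way to assign each row a mergeable "sampling priority" is the Efraimidis–Spirakis key $u^{1/w}$ (it is an assumption that `sparklyr`'s internal priority function matches this exactly, but the map/reduce structure is the same):

```python
import heapq
import random

def weighted_sample_no_replace(partitions, k, seed=0):
    """Sketch of per-partition weighted sampling without replacement.

    Each (value, weight) row gets priority u ** (1 / w) with u ~ Uniform(0, 1);
    rows with larger weights tend to draw larger priorities.  Each partition
    keeps its top-k rows independently (map step), then the final sample is
    the top-k among all per-partition candidates (reduce step)."""
    rng = random.Random(seed)
    candidates = []
    for part in partitions:
        keyed = [(rng.random() ** (1.0 / w), v) for v, w in part]
        # map step: top-k within one partition, computed independently
        candidates.extend(heapq.nlargest(k, keyed))
    # reduce step: top-k across all partition-level candidates
    return [v for _, v in heapq.nlargest(k, candidates)]

# four "partitions" of (value, weight) rows, mimicking repartition = 4L
parts = [[(i, float(i % 10 + 1)) for i in range(p, 40, 4)] for p in range(4)]
picked = weighted_sample_no_replace(parts, k=5)
print(picked)  # 5 distinct values drawn across all partitions
```

Because the per-row priorities are independent, the per-partition top-k sets can be computed in parallel and merged in any order, which is what makes the approach MapReduce-friendly.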
<h2 id="tidyr-verbs">Tidyr Verbs
</h2>
<p><code>sparklyr</code> 1.4 includes specialized implementations of the following <a href="https://tidyr.tidyverse.org/" target="_blank" rel="noopener"><code>tidyr</code></a>
 verbs that work efficiently with Spark dataframes:</p>
<ul>
<li><a href="https://tidyr.tidyverse.org/reference/fill.html" target="_blank" rel="noopener"><code>tidyr::fill</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/nest.html" target="_blank" rel="noopener"><code>tidyr::nest</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/nest.html" target="_blank" rel="noopener"><code>tidyr::unnest</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html" target="_blank" rel="noopener"><code>tidyr::pivot_wider</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html" target="_blank" rel="noopener"><code>tidyr::pivot_longer</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/separate.html" target="_blank" rel="noopener"><code>tidyr::separate</code></a>
</li>
<li><a href="https://tidyr.tidyverse.org/reference/unite.html" target="_blank" rel="noopener"><code>tidyr::unite</code></a>
</li>
</ul>
<p>We can demonstrate how those verbs are useful for tidying data through some examples.</p>
<p>Let&rsquo;s say we are given <code>mtcars_sdf</code>, a Spark dataframe containing all rows from <code>mtcars</code> plus the name of each row:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mtcars_sdf</span> <span class="o">&lt;-</span> <span class="nf">cbind</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="nf">data.frame</span><span class="p">(</span><span class="n">model</span> <span class="o">=</span> <span class="nf">rownames</span><span class="p">(</span><span class="n">mtcars</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">  <span class="nf">data.frame</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">row.names</span> <span class="o">=</span> <span class="kc">NULL</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">.,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">4L</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">mtcars_sdf</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 12]
  model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  &lt;chr&gt;        &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
# … with more rows
</code></pre>
<p>and we would like to turn all numeric attributes in <code>mtcars_sdf</code> (in other words, all columns other than the <code>model</code> column) into key-value pairs stored in 2 columns, with the <code>key</code> column storing the name of each attribute and the <code>value</code> column storing each attribute&rsquo;s numeric value. One way to accomplish that with <code>tidyr</code> is by utilizing the <code>tidyr::pivot_longer</code> functionality:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars_kv_sdf</span> <span class="o">&lt;-</span> <span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tidyr</span><span class="o">::</span><span class="nf">pivot_longer</span><span class="p">(</span><span class="n">cols</span> <span class="o">=</span> <span class="o">-</span><span class="n">model</span><span class="p">,</span> <span class="n">names_to</span> <span class="o">=</span> <span class="s">&#34;key&#34;</span><span class="p">,</span> <span class="n">values_to</span> <span class="o">=</span> <span class="s">&#34;value&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">mtcars_kv_sdf</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 3]
  model     key   value
  &lt;chr&gt;     &lt;chr&gt; &lt;dbl&gt;
1 Mazda RX4 am      1
2 Mazda RX4 carb    4
3 Mazda RX4 cyl     6
4 Mazda RX4 disp  160
5 Mazda RX4 drat    3.9
# … with more rows
</code></pre>
<p>To undo the effect of <code>tidyr::pivot_longer</code>, we can apply <code>tidyr::pivot_wider</code> to our <code>mtcars_kv_sdf</code> Spark dataframe, and get back the original data that was present in <code>mtcars_sdf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tbl</span> <span class="o">&lt;-</span> <span class="n">mtcars_kv_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tidyr</span><span class="o">::</span><span class="nf">pivot_wider</span><span class="p">(</span><span class="n">names_from</span> <span class="o">=</span> <span class="n">key</span><span class="p">,</span> <span class="n">values_from</span> <span class="o">=</span> <span class="n">value</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">tbl</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 12]
  model         carb   cyl  drat    hp   mpg    vs    wt    am  disp  gear  qsec
  &lt;chr&gt;        &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 Mazda RX4        4     6  3.9    110  21       0  2.62     1  160      4  16.5
2 Hornet 4 Dr…     1     6  3.08   110  21.4     1  3.22     0  258      3  19.4
3 Hornet Spor…     2     8  3.15   175  18.7     0  3.44     0  360      3  17.0
4 Merc 280C        4     6  3.92   123  17.8     1  3.44     0  168.     4  18.9
5 Merc 450SLC      3     8  3.07   180  15.2     0  3.78     0  276.     3  18
# … with more rows
</code></pre>
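Conceptually, the two pivots are inverse reshapes. A minimal sketch over a list of row dictionaries (plain Python, for illustration only; the function names mirror the `tidyr` verbs but are hypothetical helpers, not the real implementations):

```python
def pivot_longer(rows, id_col, names_to="key", values_to="value"):
    """Melt every non-id column of each row dict into a (key, value) row."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col != id_col:
                long_rows.append({id_col: row[id_col], names_to: col, values_to: val})
    return long_rows

def pivot_wider(rows, id_col, names_from="key", values_from="value"):
    """Collapse (key, value) rows back into one wide row per id."""
    wide = {}
    for row in rows:
        wide.setdefault(row[id_col], {id_col: row[id_col]})[row[names_from]] = row[values_from]
    return list(wide.values())

cars = [
    {"model": "Mazda RX4", "mpg": 21.0, "cyl": 6},
    {"model": "Datsun 710", "mpg": 22.8, "cyl": 4},
]
long_form = pivot_longer(cars, "model")
print(long_form[0])  # {'model': 'Mazda RX4', 'key': 'mpg', 'value': 21.0}
assert pivot_wider(long_form, "model") == cars  # the round trip is lossless
```

The same round-trip property is what the Spark versions preserve, just computed distributively over partitions.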
<p>Another way to reduce many columns into fewer ones is by using <code>tidyr::nest</code> to move some columns into nested tables. For instance, we can create a nested table <code>perf</code> encapsulating all performance-related attributes from <code>mtcars</code> (namely, <code>hp</code>, <code>mpg</code>, <code>disp</code>, and <code>qsec</code>). However, unlike R dataframes, Spark dataframes do not have the concept of nested tables, and the closest we can get to nested tables is a <code>perf</code> column containing named structs with <code>hp</code>, <code>mpg</code>, <code>disp</code>, and <code>qsec</code> attributes:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars_nested_sdf</span> <span class="o">&lt;-</span> <span class="n">mtcars_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tidyr</span><span class="o">::</span><span class="nf">nest</span><span class="p">(</span><span class="n">perf</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">hp</span><span class="p">,</span> <span class="n">mpg</span><span class="p">,</span> <span class="n">disp</span><span class="p">,</span> <span class="n">qsec</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We can then inspect the type of <code>perf</code> column in <code>mtcars_nested_sdf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sdf_schema</span><span class="p">(</span><span class="n">mtcars_nested_sdf</span><span class="p">)</span><span class="o">$</span><span class="n">perf</span><span class="o">$</span><span class="n">type</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1] &quot;ArrayType(StructType(StructField(hp,DoubleType,true), StructField(mpg,DoubleType,true), StructField(disp,DoubleType,true), StructField(qsec,DoubleType,true)),true)&quot;
</code></pre>
<p>and inspect individual struct elements within <code>perf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">perf</span> <span class="o">&lt;-</span> <span class="n">mtcars_nested_sdf</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">perf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">unlist</span><span class="p">(</span><span class="n">perf[[1]]</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>    hp    mpg   disp   qsec
110.00  21.00 160.00  16.46
</code></pre>
<p>Finally, we can also use <code>tidyr::unnest</code> to undo the effects of <code>tidyr::nest</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">mtcars_unnested_sdf</span> <span class="o">&lt;-</span> <span class="n">mtcars_nested_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tidyr</span><span class="o">::</span><span class="nf">unnest</span><span class="p">(</span><span class="n">col</span> <span class="o">=</span> <span class="n">perf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">mtcars_unnested_sdf</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">5</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 12]
  model          cyl  drat    wt    vs    am  gear  carb    hp   mpg  disp  qsec
  &lt;chr&gt;        &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 Mazda RX4        6  3.9   2.62     0     1     4     4   110  21    160   16.5
2 Hornet 4 Dr…     6  3.08  3.22     1     0     3     1   110  21.4  258   19.4
3 Duster 360       8  3.21  3.57     0     0     3     4   245  14.3  360   15.8
4 Merc 280         6  3.92  3.44     1     0     4     4   123  19.2  168.  18.3
5 Lincoln Con…     8  3     5.42     0     0     3     4   215  10.4  460   17.8
# … with more rows
</code></pre>
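The nest/unnest round trip can likewise be sketched over row dictionaries, with the nested dict playing the role of the named struct (plain Python for illustration; `nest`/`unnest` here are hypothetical helpers, not the `tidyr` functions):

```python
def nest(rows, nested_col, cols):
    """Move the listed columns of each row dict into one nested dict."""
    out = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in cols}
        kept[nested_col] = {k: row[k] for k in cols}
        out.append(kept)
    return out

def unnest(rows, nested_col):
    """Splice each nested dict back into its parent row."""
    return [{**{k: v for k, v in row.items() if k != nested_col}, **row[nested_col]}
            for row in rows]

cars = [{"model": "Mazda RX4", "hp": 110, "mpg": 21.0, "cyl": 6}]
nested = nest(cars, "perf", ["hp", "mpg"])
print(nested)  # [{'model': 'Mazda RX4', 'cyl': 6, 'perf': {'hp': 110, 'mpg': 21.0}}]
assert unnest(nested, "perf") == cars  # unnest undoes nest
```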
<h2 id="robust-scaler">Robust Scaler
</h2>
<p><a href="https://spark.apache.org/docs/3.0.0/api/java/org/apache/spark/ml/feature/RobustScaler.html" target="_blank" rel="noopener">RobustScaler</a>
 is a new functionality introduced in Spark 3.0 (<a href="https://issues.apache.org/jira/browse/SPARK-28399" target="_blank" rel="noopener">SPARK-28399</a>
). Thanks to a <a href="https://github.com/sparklyr/sparklyr/pull/2254" target="_blank" rel="noopener">pull request</a>
 by <a href="https://github.com/zero323" target="_blank" rel="noopener">@zero323</a>
, an R interface for <code>RobustScaler</code>, namely, the <code>ft_robust_scaler()</code> function, is now part of <code>sparklyr</code>.</p>
<p>It is often observed that many machine learning algorithms perform better on numeric inputs that are standardized. Many of us have learned in stats 101 that given a random variable $X$, we can compute its mean $\mu = E[X]$, standard deviation $\sigma = \sqrt{E[X^2] - (E[X])^2}$, and then obtain a standard score $z = \frac{X - \mu}{\sigma}$, which has a mean of 0 and a standard deviation of 1.</p>
<p>However, notice that both $E[X]$ and $E[X^2]$ above are quantities that can be easily skewed by extreme outliers in $X$, causing distortions in $z$. A particularly bad case would be one where all non-outliers among $X$ are very close to $0$, making $E[X]$ close to $0$, while extreme outliers are all far in the negative direction, dragging down $E[X]$ while skewing $E[X^2]$ upwards.</p>
<p>An alternative way of standardizing $X$ based on its median, 1st quartile, and 3rd quartile values, all of which are robust against outliers, would be the following:</p>
<p>$\displaystyle z = \frac{X - \text{Median}(X)}{\text{P75}(X) - \text{P25}(X)}$</p>
<p>and this is precisely what <a href="https://spark.apache.org/docs/3.0.0/api/java/org/apache/spark/ml/feature/RobustScaler.html" target="_blank" rel="noopener">RobustScaler</a>
 offers.</p>
<p>To see <code>ft_robust_scaler()</code> in action and demonstrate its usefulness, we can go through a contrived example consisting of the following steps:</p>
<ul>
<li>Draw 500 random samples from the standard normal distribution</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sample_values</span> <span class="o">&lt;-</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">500</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">sample_values</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>  [1] -0.626453811  0.183643324 -0.835628612  1.595280802  0.329507772
  [6] -0.820468384  0.487429052  0.738324705  0.575781352 -0.305388387
  ...
</code></pre>
<ul>
<li>Inspect the minimum and maximum values among the $500$ random samples:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">sample_values</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>  [1] -3.008049
</code></pre>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">sample_values</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>  [1] 3.810277
</code></pre>
<ul>
<li>Now create $10$ other values that are extreme outliers compared to the $500$ random samples above. Given that we know all $500$ samples are within the range of $(-4, 4)$, we can choose $-501, -502, \ldots, -509, -510$ as our $10$ outliers:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">outliers</span> <span class="o">&lt;-</span> <span class="m">-500L</span> <span class="o">-</span> <span class="nf">seq</span><span class="p">(</span><span class="m">10</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Copy all $510$ values into a Spark dataframe named <code>sdf</code></li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">value</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">sample_values</span><span class="p">,</span> <span class="n">outliers</span><span class="p">)))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>We can then apply <code>ft_robust_scaler()</code> to obtain the standardized value for each input:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">scaled</span> <span class="o">&lt;-</span> <span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ft_vector_assembler</span><span class="p">(</span><span class="s">&#34;value&#34;</span><span class="p">,</span> <span class="s">&#34;input&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ft_robust_scaler</span><span class="p">(</span><span class="s">&#34;input&#34;</span><span class="p">,</span> <span class="s">&#34;scaled&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">scaled</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">unlist</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><ul>
<li>Plotting the result shows the non-outlier data points being scaled to values that still more or less form a bell-shaped distribution centered around $0$, as expected, showing that the scaling is robust against the influence of the outliers:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="nf">data.frame</span><span class="p">(</span><span class="n">scaled</span> <span class="o">=</span> <span class="n">scaled</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">scaled</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">xlim</span><span class="p">(</span><span class="m">-7</span><span class="p">,</span> <span class="m">7</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_histogram</span><span class="p">(</span><span class="n">binwidth</span> <span class="o">=</span> <span class="m">0.2</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/images/scaled.png" id="id" class="class" style="width:60.0%;height:60.0%" />
<ul>
<li>Finally, we can compare the distribution of the scaled values above with the distribution of z-scores of all input values, and notice how scaling the input with only mean and standard deviation would have caused noticeable skewness &ndash; which the robust scaler has successfully avoided:</li>
</ul>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">all_values</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">sample_values</span><span class="p">,</span> <span class="n">outliers</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">z_scores</span> <span class="o">&lt;-</span> <span class="p">(</span><span class="n">all_values</span> <span class="o">-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">all_values</span><span class="p">))</span> <span class="o">/</span> <span class="nf">sd</span><span class="p">(</span><span class="n">all_values</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">ggplot</span><span class="p">(</span><span class="nf">data.frame</span><span class="p">(</span><span class="n">scaled</span> <span class="o">=</span> <span class="n">z_scores</span><span class="p">),</span> <span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">scaled</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">xlim</span><span class="p">(</span><span class="m">-0.05</span><span class="p">,</span> <span class="m">0.2</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_histogram</span><span class="p">(</span><span class="n">binwidth</span> <span class="o">=</span> <span class="m">0.005</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><img src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/images/skewed.png" id="id" class="class" style="width:60.0%;height:60.0%" />
<ul>
<li>From the two plots above, one can observe that while both standardization processes produced distributions that were still bell-shaped, the one produced by <code>ft_robust_scaler()</code> is centered around $0$, correctly reflecting the average among all non-outlier values, whereas the z-score distribution is clearly not centered around $0$, as its center has been noticeably shifted by the $10$ outlier values.</li>
</ul>
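<p>The difference between the two scalers can also be reproduced outside of Spark. The following base-R sketch (a made-up illustration of the idea behind <code>ft_robust_scaler()</code>, not <code>sparklyr</code>&rsquo;s actual implementation) centers by the median and scales by the IQR, so that a handful of extreme values barely moves the center:</p>
<pre><code class="language-r">sample_values &lt;- seq(0, 1, length.out = 100)  # hypothetical non-outlier values
outliers &lt;- rep(1000, 10)                     # a few extreme values
all_values &lt;- c(sample_values, outliers)

# z-score: centered by the mean, scaled by the standard deviation
z_scaled &lt;- (all_values - mean(all_values)) / sd(all_values)
# robust scaling: centered by the median, scaled by the IQR
robust_scaled &lt;- (all_values - median(all_values)) / IQR(all_values)

median(robust_scaled[1:100])  # close to 0
median(z_scaled[1:100])       # noticeably below 0
</code></pre>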
<h2 id="rapids">RAPIDS
</h2>
<p>Readers following Apache Spark releases closely have probably noticed the recent addition of <a href="https://rapids.ai/" target="_blank" rel="noopener">RAPIDS</a>
 GPU acceleration support in Spark 3.0. To catch up with this development, <code>sparklyr</code> 1.4 now ships with an option to enable RAPIDS in Spark connections. On a host with RAPIDS-capable hardware (e.g., an Amazon EC2 instance of type &lsquo;p3.2xlarge&rsquo;), one can install <code>sparklyr</code> 1.4 and observe RAPIDS hardware acceleration reflected in Spark SQL physical query plans:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0&#34;</span><span class="p">,</span> <span class="n">packages</span> <span class="o">=</span> <span class="s">&#34;rapids&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">db_explain</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;SELECT 4&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>== Physical Plan ==
*(2) GpuColumnarToRow false
+- GpuProject [4 AS 4#45]
   +- GpuRowToColumnar TargetSize(2147483647)
      +- *(1) Scan OneRowRelation[]
</code></pre>
<h2 id="higher-order-functions-and-dplyr-related-improvements">Higher-Order Functions and <code>dplyr</code>-Related Improvements
</h2>
<p>All newly introduced higher-order functions from Spark 3.0, such as <code>array_sort()</code> with a custom comparator, <code>transform_keys()</code>, <code>transform_values()</code>, and <code>map_zip_with()</code>, are supported by <code>sparklyr</code> 1.4.</p>
<p>In addition, all higher-order functions can now be accessed directly through <code>dplyr</code> rather than their <code>hof_*</code> counterparts in <code>sparklyr</code>. This means, for example, that we can run the following <code>dplyr</code> queries to calculate the square of all array elements in column <code>x</code> of <code>sdf</code>, and then sort them in descending order:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-3</span><span class="p">,</span> <span class="m">-2</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span> <span class="nf">c</span><span class="p">(</span><span class="m">6</span><span class="p">,</span> <span class="m">-7</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">8</span><span class="p">))))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sq_desc</span> <span class="o">&lt;-</span> <span class="n">sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">transform</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="o">~</span> <span class="n">.x</span> <span class="o">*</span> <span class="n">.x</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">array_sort</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="o">~</span> <span class="nf">as.integer</span><span class="p">(</span><span class="nf">sign</span><span class="p">(</span><span class="n">.y</span> <span class="o">-</span> <span class="n">.x</span><span class="p">))))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">print</span><span class="p">(</span><span class="n">sq_desc</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[[1]]
[1] 25  9  4  1

[[2]]
[1] 64 49 36 25
</code></pre>
<h2 id="acknowledgement">Acknowledgement
</h2>
<p>In chronological order, we would like to thank the following individuals for their contributions to <code>sparklyr</code> 1.4:</p>
<ul>
<li><a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
</li>
<li><a href="https://github.com/nealrichardson" target="_blank" rel="noopener">@nealrichardson</a>
</li>
<li><a href="https://github.com/yitao-li" target="_blank" rel="noopener">@yitao-li</a>
</li>
<li><a href="https://github.com/wkdavis" target="_blank" rel="noopener">@wkdavis</a>
</li>
<li><a href="https://github.com/Loquats" target="_blank" rel="noopener">@Loquats</a>
</li>
<li><a href="https://github.com/zero323" target="_blank" rel="noopener">@zero323</a>
</li>
</ul>
<p>We also appreciate bug reports, feature requests, and other valuable feedback about <code>sparklyr</code> from our awesome open-source community (e.g., the weighted sampling feature in <code>sparklyr</code> 1.4 was largely motivated by this <a href="https://github.com/sparklyr/sparklyr/issues/2592" target="_blank" rel="noopener">GitHub issue</a>
 filed by <a href="https://github.com/ajing" target="_blank" rel="noopener">@ajing</a>
, and some <code>dplyr</code>-related bug fixes in this release were initiated in <a href="https://github.com/sparklyr/sparklyr/issues/2648" target="_blank" rel="noopener">#2648</a>
 and completed with this <a href="https://github.com/sparklyr/sparklyr/pull/2651" target="_blank" rel="noopener">pull request</a>
 by <a href="https://github.com/wkdavis" target="_blank" rel="noopener">@wkdavis</a>
).</p>
<p>Last but not least, the author of this blog post is extremely grateful for fantastic editorial suggestions from <a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
, <a href="https://github.com/batpigandme" target="_blank" rel="noopener">@batpigandme</a>
, and <a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>
.</p>
<p>If you wish to learn more about <code>sparklyr</code>, we recommend checking out <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, and also some of the previous release posts such as <a href="https://blog.rstudio.com/2020/07/16/sparklyr-1-3/" target="_blank" rel="noopener">sparklyr 1.3</a>
 and <a href="https://posit-open-source.netlify.app/blog/ai/2020-04-21-sparklyr-1.2.0-released/">sparklyr 1.2</a>
.</p>
<p>Thanks for reading!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.4/thumbnail.jpg" length="408042" type="image/jpeg" />
    </item>
    <item>
      <title>Training ImageNet with R</title>
      <link>https://posit-open-source.netlify.app/blog/ai/2020-08-24-training-imagenet-with-r/</link>
      <pubDate>Mon, 24 Aug 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/2020-08-24-training-imagenet-with-r/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<p><a href="http://www.image-net.org/" target="_blank" rel="noopener">ImageNet</a>
 (Deng et al. 2009) is an image database organized according to the <a href="http://wordnet.princeton.edu/" target="_blank" rel="noopener">WordNet</a>
 (Miller 1995) hierarchy which, historically, has been used in computer vision benchmarks and research. However, it was not until AlexNet (Krizhevsky et al. 2012) demonstrated the efficiency of deep learning using convolutional neural networks on GPUs that the computer-vision discipline turned to deep learning to achieve state-of-the-art models that revolutionized their field. Given the importance of ImageNet and AlexNet, this post introduces tools and techniques to consider when training ImageNet and other large-scale datasets with R.</p>
<p>Now, in order to process ImageNet, we will first have to <em>divide and conquer</em>, partitioning the dataset into several manageable subsets. Afterwards, we will train ImageNet using AlexNet across multiple GPUs and compute instances. <a href="#preprocessing-imagenet">Preprocessing ImageNet</a>
 and <a href="#distributed-training">distributed training</a>
 are the two topics that this post will present and discuss, starting with preprocessing ImageNet.</p>
<h2 id="preprocessing-imagenet">Preprocessing ImageNet
</h2>
<p>When dealing with large datasets, even simple tasks like downloading or reading a dataset can be much harder than you would expect. For instance, since ImageNet is roughly 300GB in size, you will need to make sure you have at least 600GB of free space to leave some room for download and decompression. But no worries, you can always borrow computers with huge disk drives from your favorite cloud provider. While you are at it, you should also request compute instances with multiple GPUs, Solid State Drives (SSDs), and a reasonable amount of CPUs and memory. If you want to use the exact configuration we used, take a look at the <a href="https://github.com/mlverse/imagenet" target="_blank" rel="noopener">mlverse/imagenet</a>
 repo, which contains a Docker image and configuration commands required to provision reasonable computing resources for this task. In summary, make sure you have access to sufficient compute resources.</p>
<p>Now that we have resources capable of working with ImageNet, we need to find a place to download ImageNet from. The easiest way is to use a variation of ImageNet used in the <a href="http://www.image-net.org/challenges/LSVRC/" target="_blank" rel="noopener">ImageNet Large Scale Visual Recognition Challenge (ILSVRC)</a>
, which contains a subset of about 250GB of data and can be easily downloaded from many <a href="https://kaggle.com" target="_blank" rel="noopener">Kaggle</a>
 competitions, like the <a href="https://www.kaggle.com/c/imagenet-object-localization-challenge" target="_blank" rel="noopener">ImageNet Object Localization Challenge</a>
.</p>
<p>If you&rsquo;ve read some of our previous posts, you might be already thinking of using the <a href="https://pins.rstudio.com" target="_blank" rel="noopener">pins</a>
 package, which you can use to cache, discover, and share resources from many services, including Kaggle. You can learn more about data retrieval from Kaggle in the <a href="http://pins.rstudio.com/articles/boards-kaggle.html" target="_blank" rel="noopener">Using Kaggle Boards</a>
 article; in the meantime, let&rsquo;s assume you are already familiar with this package.</p>
<p>All we need to do now is register the Kaggle board, retrieve ImageNet as a pin, and decompress this file. Warning: the following code requires you to stare at a progress bar for, potentially, over an hour.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">board_register</span><span class="p">(</span><span class="s">&#34;kaggle&#34;</span><span class="p">,</span> <span class="n">token</span> <span class="o">=</span> <span class="s">&#34;kaggle.json&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;c/imagenet-object-localization-challenge&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;kaggle&#34;</span><span class="p">)</span><span class="n">[1]</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">untar</span><span class="p">(</span><span class="n">exdir</span> <span class="o">=</span> <span class="s">&#34;/localssd/imagenet/&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>If we are going to be training this model over and over using multiple GPUs and even multiple compute instances, we want to make sure we don&rsquo;t waste too much time downloading ImageNet every single time.</p>
<p>The first improvement to consider is getting a faster hard drive. In our case, we locally mounted an array of SSDs into the <code>/localssd</code> path. We then used <code>/localssd</code> to extract ImageNet and configured R&rsquo;s temp path and pins cache to use the SSDs as well. Consult your cloud provider&rsquo;s documentation to configure SSDs, or take a look at <a href="https://github.com/mlverse/imagenet" target="_blank" rel="noopener">mlverse/imagenet</a>
.</p>
<p>Next, a well-known approach we can follow is to partition ImageNet into chunks that can be individually downloaded to perform distributed training later on.</p>
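<p>The chunking step itself can be sketched in base R. As a hypothetical example (the file names below are made up and stand in for real ImageNet paths), <code>split()</code> combined with <code>cut()</code> divides a vector of image paths into roughly equal groups that can later be downloaded independently:</p>
<pre><code class="language-r"># hypothetical file names standing in for ImageNet image paths
paths &lt;- sprintf(&#34;img_%04d.JPEG&#34;, 1:1000)
n_chunks &lt;- 16

# assign each path to one of n_chunks roughly equal, contiguous groups
chunks &lt;- split(paths, cut(seq_along(paths), n_chunks, labels = FALSE))

length(chunks)          # 16
range(lengths(chunks))  # 62 63
</code></pre>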
<p>In addition, it is faster to download ImageNet from a nearby location, ideally from a URL stored within the same data center where our cloud instance is located. For this, we can also use pins to register a board with our cloud provider and then re-upload each partition. Since ImageNet is already partitioned by category, we can easily split ImageNet into multiple zip files and re-upload them to our closest data center as follows. Make sure the storage bucket is created in the same region as your computing instances.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">board_register</span><span class="p">(</span><span class="s">&#34;&lt;board&gt;&#34;</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">bucket</span> <span class="o">=</span> <span class="s">&#34;r-imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">train_path</span> <span class="o">&lt;-</span> <span class="s">&#34;/localssd/imagenet/ILSVRC/Data/CLS-LOC/train/&#34;</span>
</span></span><span class="line"><span class="cl"><span class="kr">for</span> <span class="p">(</span><span class="n">path</span> <span class="kr">in</span> <span class="nf">dir</span><span class="p">(</span><span class="n">train_path</span><span class="p">,</span> <span class="n">full.names</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nf">dir</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">full.names</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">pin</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="nf">basename</span><span class="p">(</span><span class="n">path</span><span class="p">),</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">zip</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We can now retrieve a subset of ImageNet quite efficiently. If you are motivated to do so and have about one gigabyte to spare, feel free to follow along by executing this code. Notice that ImageNet contains <em>lots</em> of JPEG images for each WordNet category.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">board_register</span><span class="p">(</span><span class="s">&#34;https://storage.googleapis.com/r-imagenet/&#34;</span><span class="p">,</span> <span class="s">&#34;imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">categories</span> <span class="o">&lt;-</span> <span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;categories&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin_get</span><span class="p">(</span><span class="n">categories</span><span class="o">$</span><span class="n">id[1]</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">extract</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">tibble</span><span class="o">::</span><span class="nf">as_tibble</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># A tibble: 1,300 x 1
   value                                                           
   &lt;chr&gt;                                                           
 1 /localssd/pins/storage/n01440764/n01440764_10026.JPEG
 2 /localssd/pins/storage/n01440764/n01440764_10027.JPEG
 3 /localssd/pins/storage/n01440764/n01440764_10029.JPEG
 4 /localssd/pins/storage/n01440764/n01440764_10040.JPEG
 5 /localssd/pins/storage/n01440764/n01440764_10042.JPEG
 6 /localssd/pins/storage/n01440764/n01440764_10043.JPEG
 7 /localssd/pins/storage/n01440764/n01440764_10048.JPEG
 8 /localssd/pins/storage/n01440764/n01440764_10066.JPEG
 9 /localssd/pins/storage/n01440764/n01440764_10074.JPEG
10 /localssd/pins/storage/n01440764/n01440764_1009.JPEG 
# … with 1,290 more rows
</code></pre>
<p>When doing distributed training over ImageNet, we can now let a single compute instance process a partition of ImageNet with ease. For instance, 1/16 of ImageNet can be retrieved and extracted in under a minute using parallel downloads with the <a href="https://callr.r-lib.org/" target="_blank" rel="noopener">callr</a>
 package:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">categories</span> <span class="o">&lt;-</span> <span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;categories&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">categories</span> <span class="o">&lt;-</span> <span class="n">categories</span><span class="o">$</span><span class="n">id[1</span><span class="o">:</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">categories</span><span class="o">$</span><span class="n">id</span><span class="p">)</span> <span class="o">/</span> <span class="m">16</span><span class="p">)</span><span class="n">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">procs</span> <span class="o">&lt;-</span> <span class="nf">lapply</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span><span class="n">cat</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">callr</span><span class="o">::</span><span class="nf">r_bg</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">cat</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nf">board_register</span><span class="p">(</span><span class="s">&#34;https://storage.googleapis.com/r-imagenet/&#34;</span><span class="p">,</span> <span class="s">&#34;imagenet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="nf">pin_get</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">extract</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span> <span class="n">args</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">cat</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl"><span class="kr">while</span> <span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="nf">sapply</span><span class="p">(</span><span class="n">procs</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="n">p</span><span class="o">$</span><span class="nf">is_alive</span><span class="p">())))</span> <span class="nf">Sys.sleep</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We can wrap up this partition in a list containing a map of images and categories, which we will later use in our AlexNet model through <a href="https://tensorflow.rstudio.com/guide/tfdatasets/introduction/" target="_blank" rel="noopener">tfdatasets</a>
.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">image</span> <span class="o">=</span> <span class="nf">unlist</span><span class="p">(</span><span class="nf">lapply</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span><span class="n">cat</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">pin_get</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">download</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">})),</span>
</span></span><span class="line"><span class="cl">    <span class="n">category</span> <span class="o">=</span> <span class="nf">unlist</span><span class="p">(</span><span class="nf">lapply</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span><span class="n">cat</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">rep</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="nf">length</span><span class="p">(</span><span class="nf">pin_get</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;imagenet&#34;</span><span class="p">,</span> <span class="n">download</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">    <span class="p">})),</span>
</span></span><span class="line"><span class="cl">    <span class="n">categories</span> <span class="o">=</span> <span class="n">categories</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Great! We are halfway to training ImageNet. The next section will focus on distributed training using multiple GPUs.</p>
<h2 id="distributed-training">Distributed Training
</h2>
<p>Now that we have broken down ImageNet into manageable parts, we can forget for a second about the size of ImageNet and focus on training a deep learning model for this dataset. However, any model we choose is likely to require a GPU, even for a 1/16 subset of ImageNet. So make sure your GPUs are properly configured by running <code>is_gpu_available()</code>. If you need help getting a GPU configured, the <a href="https://www.youtube.com/watch?v=i5Bjm3jG_d8" target="_blank" rel="noopener">Using GPUs with TensorFlow and Docker</a>
 video can help you get up to speed.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tensorflow</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">tf</span><span class="o">$</span><span class="n">test</span><span class="o">$</span><span class="nf">is_gpu_available</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1] TRUE
</code></pre>
<p>We could now spend time deciding which deep learning model is best suited for ImageNet classification tasks. Instead, for this post, we will go back in time to the glory days of AlexNet and use the <a href="https://github.com/r-tensorflow/alexnet" target="_blank" rel="noopener">r-tensorflow/alexnet</a>
 repo. This repo contains a port of AlexNet to R, but please note that this port has not been tested and is not ready for any real use cases. In fact, we would appreciate PRs to improve it if anyone feels inclined to do so. Regardless, the focus of this post is on workflows and tools, not on achieving state-of-the-art image classification scores. So by all means, feel free to use more appropriate models.</p>
<p>Once we&rsquo;ve chosen a model, we will want to make sure that it trains properly on a subset of ImageNet:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">remotes</span><span class="o">::</span><span class="nf">install_github</span><span class="p">(</span><span class="s">&#34;r-tensorflow/alexnet&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">alexnet</span><span class="o">::</span><span class="nf">alexnet_train</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>Epoch 1/2
 103/2269 [&gt;...............] - ETA: 5:52 - loss: 72306.4531 - accuracy: 0.9748
</code></pre>
<p>So far so good! However, this post is about enabling large-scale training across multiple GPUs, so we want to make sure we are using as many of them as we can. Unfortunately, running <code>nvidia-smi</code> will show that only one GPU is currently being used:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">nvidia-smi
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   48C    P0    89W / 149W |  10935MiB / 11441MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:06.0 Off |                    0 |
| N/A   74C    P0    74W / 149W |     71MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
</code></pre>
<p>In order to train across multiple GPUs, we need to define a distributed-processing strategy. If this is a new concept, it might be a good time to take a look at the <a href="https://tensorflow.rstudio.com/tutorials/advanced/distributed/distributed_training_with_keras/" target="_blank" rel="noopener">Distributed Training with Keras</a>
 tutorial and the <a href="https://www.tensorflow.org/guide/distributed_training" target="_blank" rel="noopener">distributed training with TensorFlow</a>
 docs. Or, if you allow us to oversimplify the process, all you have to do is define and compile your model under the right scope. A step-by-step explanation is available in the <a href="https://www.youtube.com/watch?v=DQyLTlD1IBc" target="_blank" rel="noopener">Distributed Deep Learning with TensorFlow and R</a>
 video. In this case, the <code>alexnet</code> model <a href="https://github.com/r-tensorflow/alexnet/blob/57546/R/alexnet_train.R#L92-L94" target="_blank" rel="noopener">already supports</a>
 a strategy parameter, so all we have to do is pass it along.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tensorflow</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">strategy</span> <span class="o">&lt;-</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">MirroredStrategy</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">cross_device_ops</span> <span class="o">=</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">ReductionToOneDevice</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">alexnet</span><span class="o">::</span><span class="nf">alexnet_train</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">,</span> <span class="n">strategy</span> <span class="o">=</span> <span class="n">strategy</span><span class="p">,</span> <span class="n">parallel</span> <span class="o">=</span> <span class="m">6</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Notice also <code>parallel = 6</code>, which configures <code>tfdatasets</code> to make use of multiple CPUs when loading data into our GPUs; see <a href="https://tensorflow.rstudio.com/guide/tfdatasets/introduction/#parallel-mapping" target="_blank" rel="noopener">Parallel Mapping</a>
 for details.</p>
<p>We can now re-run <code>nvidia-smi</code> to validate that all our GPUs are being used:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">nvidia-smi
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   49C    P0    94W / 149W |  10936MiB / 11441MiB |     53%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:06.0 Off |                    0 |
| N/A   76C    P0   114W / 149W |  10936MiB / 11441MiB |     26%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
</code></pre>
<p>The <code>MirroredStrategy</code> can help us scale up to about 8 GPUs per compute instance; however, we are likely to need 16 instances with 8 GPUs each to train ImageNet in a reasonable time (see Jeremy Howard&rsquo;s post on <a href="https://www.fast.ai/2018/08/10/fastai-diu-imagenet/" target="_blank" rel="noopener">Training Imagenet in 18 Minutes</a>
). So where do we go from here?</p>
<p>Welcome to <code>MultiWorkerMirroredStrategy</code>: This strategy can use not only multiple GPUs, but also multiple GPUs across multiple computers. To configure them, all we have to do is define a <code>TF_CONFIG</code> environment variable with the right addresses and run the exact same code in each compute instance.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tensorflow</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">partition</span> <span class="o">&lt;-</span> <span class="m">0</span>
</span></span><span class="line"><span class="cl"><span class="nf">Sys.setenv</span><span class="p">(</span><span class="n">TF_CONFIG</span> <span class="o">=</span> <span class="n">jsonlite</span><span class="o">::</span><span class="nf">toJSON</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">cluster</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">worker</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;10.100.10.100:10090&#34;</span><span class="p">,</span> <span class="s">&#34;10.100.10.101:10090&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">task</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">type</span> <span class="o">=</span> <span class="s">&#39;worker&#39;</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">partition</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">),</span> <span class="n">auto_unbox</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">strategy</span> <span class="o">&lt;-</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">MultiWorkerMirroredStrategy</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">cross_device_ops</span> <span class="o">=</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">ReductionToOneDevice</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">alexnet</span><span class="o">::</span><span class="nf">imagenet_partition</span><span class="p">(</span><span class="n">partition</span> <span class="o">=</span> <span class="n">partition</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">alexnet</span><span class="o">::</span><span class="nf">alexnet_train</span><span class="p">(</span><span class="n">strategy</span> <span class="o">=</span> <span class="n">strategy</span><span class="p">,</span> <span class="n">parallel</span> <span class="o">=</span> <span class="m">6</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Please note that <code>partition</code> must change for each compute instance to uniquely identify it, and that the IP addresses also need to be adjusted. In addition, <code>data</code> should point to a different partition of ImageNet, which we can retrieve with <code>pins</code>, although, for convenience, <code>alexnet</code> contains similar code under <code>alexnet::imagenet_partition()</code>. Other than that, the code that you need to run in each compute instance is exactly the same.</p>
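<p>Concretely, with the two example addresses from the snippet above, the <code>TF_CONFIG</code> value generated on each machine differs only in the task index:</p>
<pre><code># machine with partition = 0
{"cluster":{"worker":["10.100.10.100:10090","10.100.10.101:10090"]},"task":{"type":"worker","index":0}}

# machine with partition = 1
{"cluster":{"worker":["10.100.10.100:10090","10.100.10.101:10090"]},"task":{"type":"worker","index":1}}
</code></pre>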
<p>However, if we were to use 16 machines with 8 GPUs each to train ImageNet, it would be quite time-consuming and error-prone to manually run code in each R session. So instead, we should think of making use of cluster-computing frameworks, like Apache Spark with <a href="https://blog.rstudio.com/2020/01/29/sparklyr-1-1/#barrier-execution" target="_blank" rel="noopener">barrier execution</a>
. If you are new to Spark, there are many resources available at <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
. To learn just about running Spark and TensorFlow together, watch our <a href="https://www.youtube.com/watch?v=Zm20P3ADa14" target="_blank" rel="noopener">Deep Learning with Spark, TensorFlow and R</a>
 video.</p>
<p>Putting it all together, training ImageNet in R with TensorFlow and Spark looks as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="s">&#34;yarn|mesos|etc&#34;</span><span class="p">,</span> <span class="n">config</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="s">&#34;sparklyr.shell.num-executors&#34;</span> <span class="o">=</span> <span class="m">16</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">16</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">16</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">spark_apply</span><span class="p">(</span><span class="kr">function</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">barrier</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nf">library</span><span class="p">(</span><span class="n">tensorflow</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">      <span class="nf">Sys.setenv</span><span class="p">(</span><span class="n">TF_CONFIG</span> <span class="o">=</span> <span class="n">jsonlite</span><span class="o">::</span><span class="nf">toJSON</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">cluster</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">          <span class="n">worker</span> <span class="o">=</span> <span class="nf">paste</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="nf">gsub</span><span class="p">(</span><span class="s">&#34;:[0-9]+$&#34;</span><span class="p">,</span> <span class="s">&#34;&#34;</span><span class="p">,</span> <span class="n">barrier</span><span class="o">$</span><span class="n">address</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="m">8000</span> <span class="o">+</span> <span class="nf">seq_along</span><span class="p">(</span><span class="n">barrier</span><span class="o">$</span><span class="n">address</span><span class="p">),</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">&#34;:&#34;</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">        <span class="n">task</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">type</span> <span class="o">=</span> <span class="s">&#39;worker&#39;</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">barrier</span><span class="o">$</span><span class="n">partition</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="p">),</span> <span class="n">auto_unbox</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">      
</span></span><span class="line"><span class="cl">      <span class="kr">if</span> <span class="p">(</span><span class="nf">is.null</span><span class="p">(</span><span class="nf">tf_version</span><span class="p">()))</span> <span class="nf">install_tensorflow</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">      
</span></span><span class="line"><span class="cl">      <span class="n">strategy</span> <span class="o">&lt;-</span> <span class="n">tf</span><span class="o">$</span><span class="n">distribute</span><span class="o">$</span><span class="nf">MultiWorkerMirroredStrategy</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">      <span class="n">result</span> <span class="o">&lt;-</span> <span class="n">alexnet</span><span class="o">::</span><span class="nf">imagenet_partition</span><span class="p">(</span><span class="n">partition</span> <span class="o">=</span> <span class="n">barrier</span><span class="o">$</span><span class="n">partition</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">        <span class="n">alexnet</span><span class="o">::</span><span class="nf">alexnet_train</span><span class="p">(</span><span class="n">strategy</span> <span class="o">=</span> <span class="n">strategy</span><span class="p">,</span> <span class="n">epochs</span> <span class="o">=</span> <span class="m">10</span><span class="p">,</span> <span class="n">parallel</span> <span class="o">=</span> <span class="m">6</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      
</span></span><span class="line"><span class="cl">      <span class="n">result</span><span class="o">$</span><span class="n">metrics</span><span class="o">$</span><span class="n">accuracy</span>
</span></span><span class="line"><span class="cl">  <span class="p">},</span> <span class="n">barrier</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">accuracy</span> <span class="o">=</span> <span class="s">&#34;numeric&#34;</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We hope this post gave you a reasonable overview of what training on large datasets in R looks like &ndash; thanks for reading along!</p>
<p>Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. &ldquo;Imagenet: A Large-Scale Hierarchical Image Database.&rdquo; <em>2009 IEEE Conference on Computer Vision and Pattern Recognition</em>, 248&ndash;55.</p>
<p>Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. &ldquo;Imagenet Classification with Deep Convolutional Neural Networks.&rdquo; <em>Advances in Neural Information Processing Systems</em>, 1097&ndash;105.</p>
<p>Miller, George A. 1995. &ldquo;WordNet: A Lexical Database for English.&rdquo; <em>Communications of the ACM</em> 38 (11): 39&ndash;41.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/2020-08-24-training-imagenet-with-r/thumbnail.jpg" length="62582" type="image/jpeg" />
    </item>
    <item>
      <title>Parallelized sampling using exponential variates</title>
      <link>https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/</link>
      <pubDate>Wed, 29 Jul 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/</guid>
<dc:creator>Yitao Li</dc:creator><description><![CDATA[
<p>As part of our recent work to support weighted sampling of Spark data frames in <code>sparklyr</code>, we embarked on a journey searching for algorithms that can perform weighted sampling, especially sampling without replacement, in efficient and scalable ways within a distributed cluster-computing framework, such as Apache Spark.</p>
<p>In the interest of brevity, &ldquo;weighted sampling without replacement&rdquo; shall be shortened into <strong>SWoR</strong> for the remainder of this blog post.</p>
<p>In the following sections, we will explain and illustrate what <strong>SWoR</strong> means probability-wise, briefly outline some alternative solutions we have considered but were not completely satisfied with, and then deep-dive into exponential variates, a simple mathematical construct that made the ideal solution for this problem possible.</p>
<p>If you cannot wait to jump into action, there is also a <a href="#examples">section</a>
 in which we showcase example usages of <code>sdf_weighted_sample()</code> in <code>sparklyr</code>. In addition, you can examine the implementation detail of <code>sparklyr::sdf_weighted_sample()</code> in this <a href="https://github.com/sparklyr/sparklyr/pull/2606" target="_blank" rel="noopener">pull request</a>
.</p>
<h2 id="how-it-all-started">How it all started
</h2>
<p>Our journey started from a <a href="https://github.com/sparklyr/sparklyr/issues/2592" target="_blank" rel="noopener">Github issue</a>
 inquiring about the possibility of supporting the equivalent of <code>dplyr::sample_frac(..., weight = &lt;weight_column&gt;)</code> for Spark data frames in <code>sparklyr</code>. For example,</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">dplyr</span><span class="o">::</span><span class="nf">sample_frac</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">gear</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat X1-9         27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
</code></pre>
<p>will randomly select one-fourth of all rows from an R data frame named &ldquo;mtcars&rdquo; without replacement, using <code>mtcars$gear</code> as weights. We were unable to find any function implementing the weighted versions of <code>dplyr::sample_frac</code> among <a href="https://spark.apache.org/docs/3.0.0/api/sql/index.html" target="_blank" rel="noopener">Spark SQL built-in functions</a>
 in Spark 3.0 or in earlier versions, which means a future version of <code>sparklyr</code> will need to run its own weighted sampling algorithm to support such use cases.</p>
<h2 id="what-exactly-is-swor">What exactly is <strong>SWoR</strong>
</h2>
<p>The purpose of this section is to mathematically describe the probability distribution generated by <strong>SWoR</strong> in terms of $w_1, \dotsc, w_N$, so that readers can clearly see that the exponential-variate based algorithm presented in a subsequent section in fact samples from precisely the same probability distribution. Readers already having a crystal-clear mental picture of what <strong>SWoR</strong> entails should probably skip most of this section. The key take-away here is that, given $N$ rows $r_1, \dotsc, r_N$ with weights $w_1, \dotsc, w_N$ and a desired sample size $n$, the probability of <strong>SWoR</strong> selecting $(r_1, \dotsc, r_n)$ is $\prod\limits_{j = 1}^{n} \left( {w_j} \middle/ {\sum\limits_{k = j}^{N}{w_k}} \right)$.</p>
<p><strong>SWoR</strong> is conceptually equivalent to an $n$-step process of selecting 1 out of $(N - j + 1)$ remaining rows in the $j$-th step for $j \in \{1, \dotsc, n\}$, with each remaining row&rsquo;s likelihood of getting selected being linearly proportional to its weight in any of the steps, i.e.,</p>
<pre><code>samples := {}
population := {r[1], ..., r[N]}

for j = 1 to n
  select r[x] from population with probability
    (w[x] / TotalWeight(population))
  samples := samples + {r[x]}
  population := population - {r[x]}
</code></pre>
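<p>The sequential process above can be sketched directly. Here is a minimal Python illustration (a hypothetical helper for exposition, not part of <code>sparklyr</code>) that performs one weighted draw per step and removes each selected row from the remaining population:</p>
<pre><code class="language-python">import random

def swor(weights, n, rng=random):
    # n-step process: each draw is proportional to the weights of
    # the rows still remaining, i.e., sampling without replacement
    population = list(range(len(weights)))
    remaining = list(weights)
    samples = []
    for _ in range(n):
        # one weighted draw from the rows not selected yet
        pick = rng.choices(population, weights=remaining, k=1)[0]
        samples.append(pick)
        j = population.index(pick)
        population.pop(j)
        remaining.pop(j)
    return samples

swor([1, 2, 3, 4, 5], 3)  # three distinct indices; heavier rows tend to come first
</code></pre>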
<p>Notice the outcome of a <strong>SWoR</strong> process is in fact order-significant, which is why in this post it will always be represented as an ordered tuple of elements.</p>
<p>Intuitively, <strong>SWoR</strong> is analogous to throwing darts at a bunch of tiles. For example, let&rsquo;s say the size of our sample space is 5:</p>
<ul>
<li>
<p>Imagine $r_1, r_2, \dotsc, r_5$ as 5 rectangular tiles laid out contiguously on a wall with widths $w_1, w_2, \dotsc, w_5$, with $r_1$ covering $[0, w_1)$, $r_2$ covering $[w_1, w_1 + w_2)$, &hellip;, and $r_5$ covering $\left[\sum\limits_{j = 1}^{4} w_j, \sum\limits_{j = 1}^{5} w_j\right)$</p>
</li>
<li>
<p>Equate drawing a random sample in each step to throwing a dart uniformly randomly within the interval covered by all tiles that are not hit yet</p>
</li>
<li>
<p>After a tile is hit, it gets taken out and remaining tiles are re-arranged so that they continue to cover a contiguous interval without overlapping</p>
</li>
</ul>
<p>If our sample size is 3, then we shall ask ourselves: what is the probability of the darts hitting $(r_1, r_2, r_3)$ in that order?</p>
<p>In step $j = 1$, the dart will hit $r_1$ with probability $\left. w_1 \middle/ \left(\sum\limits_{k = 1}^{N}w_k\right) \right.$</p>
<p><img src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/d1.jpg" style="width:50.0%;height:50.0%" alt="step 1" /> .</p>
<p>After deleting $r_1$ from the sample space after it&rsquo;s hit, step $j = 2$ will look like this:</p>
<p><img src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/d2.jpg" style="width:48.0%;height:45.0%" alt="step 2" /> ,</p>
<p>and the probability of the dart hitting $r_2$ in step 2 is $\left. w_2 \middle/ \left(\sum\limits_{k = 2}^{N}w_k\right) \right.$ .</p>
<p>Finally, moving on to step $j = 3$, we have:</p>
<p><img src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/d3.jpg" style="width:40.0%;height:30.0%" alt="step 3" /> ,</p>
<p>with the probability of the dart hitting $r_3$ being $\left. w_3 \middle/ \left(\sum\limits_{k = 3}^{N}w_k\right) \right.$.</p>
<p>So, combining all of the above, the overall probability of selecting $(r_1, r_2, r_3)$ is $\prod\limits_{j = 1}^{3} \left( {w_j} \middle/ {\sum\limits_{k = j}^{N}{w_k}} \right)$.</p>
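<p>As a quick numeric check of this formula, take the 5 tiles above with hypothetical weights $w = (1, 2, 3, 4, 5)$; the probability of drawing $(r_1, r_2, r_3)$ in that order works out to $(1/15) \cdot (2/14) \cdot (3/12)$:</p>
<pre><code class="language-python"># probability of SWoR selecting (r1, r2, r3) with weights 1..5
w = [1, 2, 3, 4, 5]
p = 1.0
total = sum(w)            # 15
for j in range(3):
    p = p * w[j] / total  # w_j / sum of w_k for k = j..N
    total = total - w[j]  # row j leaves the population
print(p)                  # 6/2520, about 0.00238
</code></pre>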
<h2 id="naive-approaches-for-implementing-swor">Naive approaches for implementing <strong>SWoR</strong>
</h2>
<p>This section outlines some possible approaches that were briefly under consideration. Because none of these approaches scales well to a large number of rows or a non-trivial number of partitions in a Spark data frame, we decided to avoid all of them in <code>sparklyr</code>.</p>
<h3 id="a-tree-base-approach">A tree-based approach
</h3>
<p>One possible way to accomplish <strong>SWoR</strong> is to have a mutable data structure keeping track of the sample space at each step.</p>
<p>Continuing with the dart-throwing analogy from the previous section, let us say initially, none of the tiles has been taken out yet, and a dart has landed at some point $x \in \left[0, \sum\limits_{k = 1}^{N} w_k\right)$. Which tile did it hit? This can be answered efficiently if we have a binary tree, pictured as the following (or in general, some $b$-ary tree for integer $b \ge 2$)</p>
<figure>
<img src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/tree.jpg" style="width:60.0%;height:40.0%" alt="." />
<figcaption aria-hidden="true">.</figcaption>
</figure>
<p>To find the tile that was hit given the dart&rsquo;s position $x$, we simply need to traverse down the tree, going through the box containing $x$ in each level, incurring an $O(\log(N))$ cost in time complexity for each sample. To take a tile out of the picture, we update the width of the tile to $0$ and propagate this change upwards from the leaf level to the root of the tree, again incurring an $O(\log(N))$ cost in time complexity, making the overall time complexity of selecting $n$ samples $O(n \cdot \log(N))$. This is not great for large data sets, and it is also not parallelizable across multiple partitions of a Spark data frame.</p>
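<p>For the curious, the tile-lookup tree described above can be sketched as a Fenwick (binary indexed) tree. The following illustrative Python version (again, not something <code>sparklyr</code> uses) supports $O(\log(N))$ dart lookups and $O(\log(N))$ tile removals:</p>
<pre><code class="language-python">import random

class FenwickSampler:
    """Binary indexed tree over tile widths."""

    def __init__(self, weights):
        self.n = len(weights)
        self.w = list(weights)
        self.total = float(sum(weights))
        self.tree = [0.0] * (self.n + 1)
        for i, wt in enumerate(weights, start=1):
            self._add(i, wt)

    def _add(self, i, delta):  # O(log N) width update
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def find(self, x):  # O(log N): which tile does a dart at x hit?
        pos, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= x:
                x -= self.tree[nxt]
                pos = nxt
            bit //= 2
        # clamp guards the measure-zero case x == total width
        return min(pos, self.n - 1)

    def sample(self, n, rng=random):
        out = []
        for _ in range(n):
            i = self.find(rng.uniform(0.0, self.total))
            out.append(i)
            self._add(i + 1, -self.w[i])  # take the tile out
            self.total -= self.w[i]
            self.w[i] = 0.0
        return out
</code></pre>
<p>Each of the $n$ draws costs two $O(\log(N))$ tree walks, matching the $O(n \cdot \log(N))$ total mentioned above &ndash; and the shared mutable tree is exactly what makes this approach hard to parallelize.</p>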
<h3 id="rejection-sampling">Rejection sampling
</h3>
<p>Another possible approach is to use rejection sampling. In terms of the previously mentioned dart-throwing analogy, that means not removing any tile that is hit, hence avoiding the performance cost of keeping the sample space up-to-date, but then having to re-throw the dart in each of the subsequent rounds until it lands on a tile that was not hit previously. This approach, just like the previous one, would not be performant, and would not be parallelizable across multiple partitions of a Spark data frame either.</p>
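<p>Rejection sampling is even simpler to sketch. In this illustrative Python snippet (again, not <code>sparklyr</code> code), darts are always thrown at the original, unmodified tile layout, and any tile that was hit before is simply re-thrown:</p>
<pre><code class="language-python">import random

def swor_rejection(weights, n, rng=random):
    # darts always target the full layout; repeats are re-thrown
    chosen, hit = [], set()
    idx = list(range(len(weights)))
    while len(chosen) != n:
        i = rng.choices(idx, weights=weights, k=1)[0]
        if i not in hit:
            hit.add(i)
            chosen.append(i)
    return chosen
</code></pre>
<p>Conditioned on missing the already-hit tiles, each accepted dart lands on a remaining tile with probability proportional to its weight, so the output distribution matches <strong>SWoR</strong> &ndash; but once the selected tiles cover most of the total width, the expected number of re-throws blows up.</p>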
<h1 id="exponential-variates-to-the-rescue">Exponential variates to the rescue
</h1>
<p>A solution that has proven to be much better than either of the naive approaches turns out to be a numerically stable variant of the algorithm described in &ldquo;Weighted Random Sampling&rdquo; (Efraimidis and Spirakis 2016) by Pavlos S. Efraimidis and Paul G. Spirakis.</p>
<p>A version of this sampling algorithm implemented by <code>sparklyr</code> does the following to sample $n$ out of $N$ rows from a Spark data frame $X$:</p>
<ul>
<li>For each row $r_j \in X$, draw a number $u_j$ independently and uniformly at random from $(0, 1)$ and compute the key of $r_j$ as $k_j = \ln(u_j) / w_j$, where $w_j$ is the weight of $r_j$. Perform this calculation in parallel across all partitions of $X$.</li>
<li>Select the $n$ rows with the largest keys and return them as the result. This step is also mostly parallelizable: for each partition of $X$, one can select up to $n$ rows with the largest keys within that partition as candidates; after selecting candidates from all partitions in parallel, simply take the top $n$ rows among all candidates and return them as the $n$ chosen samples.</li>
</ul>
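<p>Stripped of all Spark machinery, the two steps above amount to just a few lines. The sketch below is a hypothetical single-machine illustration in Python, not the actual <code>sparklyr</code> implementation (which is written in Scala and operates on partitions of an RDD):</p>

```python
import math
import random

def weighted_swor(rows, weights, n, seed=0):
    """Exponential-variate SWoR: key each row with ln(u)/w and keep
    the n rows with the largest keys."""
    rng = random.Random(seed)
    keyed = []
    for row, w in zip(rows, weights):
        u = 1.0 - rng.random()       # u uniform in (0, 1], so ln(u) is finite
        keyed.append((math.log(u) / w, row))
    keyed.sort(reverse=True)         # descending by key
    return [row for _, row in keyed[:n]]
```

Rows with larger weights tend to receive larger (less negative) keys, so they are more likely to survive the top-$n$ cut.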
<p>There are at least 4 reasons why this solution is highly appealing and was chosen to be implemented in <code>sparklyr</code>:</p>
<ul>
<li>It is a one-pass algorithm (i.e., it only needs to iterate through all rows of the data frame exactly once).</li>
<li>Its computational overhead is quite low (as selecting top $n$ rows at any stage only requires a bounded priority queue of max size $n$, which costs $O(\log(n))$ per update in time complexity).</li>
<li>More importantly, most of its required computations can be performed in parallel. In fact, the only non-parallelizable step is the very last stage of combining top candidates from all partitions and choosing the top $n$ rows among those candidates. So, it fits very well into the world of Spark / MapReduce, and has drastically better horizontal scalability compared to the naive approaches.</li>
<li>Bonus: It is also suitable for weighted reservoir sampling (i.e., can sample $n$ out of a possibly infinite stream of rows according to their weights such that at any moment the $n$ samples will be a weighted representation of all rows that have been processed so far).</li>
</ul>
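<p>The second and fourth points can be made concrete with a small sketch: a bounded min-heap of size $n$ turns the same keys into a one-pass weighted <em>reservoir</em> sampler over a stream of unknown length. This is a hypothetical Python illustration, not <code>sparklyr</code> code:</p>

```python
import heapq
import math
import random

def weighted_reservoir_sample(stream, n, seed=0):
    """One-pass weighted reservoir sampling: keep a min-heap holding
    the n largest keys ln(u)/w seen so far in the stream."""
    rng = random.Random(seed)
    heap = []  # (key, item) pairs; heap[0] holds the smallest kept key
    for item, w in stream:
        key = math.log(1.0 - rng.random()) / w   # u uniform in (0, 1]
        if len(heap) < n:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            # this item beats the weakest current candidate
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Each heap update costs $O(\log(n))$, and at any moment the heap's contents are a weighted sample of everything processed so far.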
<h2 id="why-does-this-algorithm-work">Why does this algorithm work?
</h2>
<p>As an interesting aside, some readers have probably seen this technique presented in a slightly different form under another name. It is in fact equivalent to a generalized version of the <a href="https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions" target="_blank" rel="noopener">Gumbel-max trick</a>
 which is commonly referred to as the Gumbel-top-k trick. Readers familiar with properties of the Gumbel distribution will no doubt have an easy time convincing themselves that the algorithm above works as expected.</p>
<p>In this section, we will also present a proof of correctness for this algorithm based on elementary properties of <a href="https://en.wikipedia.org/wiki/Probability_density_function" target="_blank" rel="noopener">probability density function</a>
 (shortened as PDF from now on), <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function" target="_blank" rel="noopener">cumulative distribution function</a>
 (shortened as CDF from now on), and basic calculus.</p>
<p>First of all, to make sense of all the $\ln(u_j) / w_j$ calculations in this algorithm, one has to understand <a href="https://en.wikipedia.org/wiki/Inverse_transform_sampling" target="_blank" rel="noopener">inverse transform sampling</a>
. For each $j \in \{1, \dotsc, N\}$, consider the probability distribution defined on $(-\infty, 0)$ with CDF $F_j(x) = e^{w_j \cdot x}$. To pluck a value $y$ out of this distribution, we first sample a value $u_j$ uniformly at random from $(0, 1)$ that determines the percentile of $y$ (i.e., how our $y$ value ranks relative to all possible $y$ values, a.k.a. the &ldquo;overall population&rdquo;, from this distribution), and then apply $F_j^{-1}$ to $u_j$ to find $y$, so, $y = F_j^{-1}(u_j) = \ln(u_j) / w_j$.</p>
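<p>As a quick sanity check of this step (an illustration in Python rather than anything shipped with <code>sparklyr</code>), one can draw many values of $y = \ln(u) / w$ and verify that the empirical CDF agrees with $F(x) = e^{w \cdot x}$:</p>

```python
import math
import random

def inverse_transform_draw(w, rng):
    """Draw y from the distribution on (-inf, 0) with CDF F(x) = exp(w*x),
    via inverse transform sampling: y = F^{-1}(u) = ln(u) / w."""
    u = 1.0 - rng.random()   # u uniform in (0, 1], so ln(u) is finite
    return math.log(u) / w
```

With, say, $w = 2$ and $10^5$ draws, the fraction of draws $\le x$ closely tracks $e^{2x}$ for any fixed $x < 0$.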
<p>Secondly, after defining all the required CDF functions $F_j(x) = e^{w_j \cdot x}$ for $j \in \{1, \dotsc, N\}$, we can also easily derive their corresponding PDF functions $f_j$: </p>
$$f_j(x) = \frac{d F_j(x)}{dx} = w_j e^{w_j \cdot x}$$<p>.</p>
<p>Finally, with a clear understanding of the family of probability distributions involved, one can prove the probability of this algorithm selecting a given sequence of rows $(r_1, \dotsc, r_n)$ is equal to $\prod\limits_{j = 1}^{n} \left( {w_j} \middle/ {\sum\limits_{k = j}^{N}{w_k}} \right)$, identical to the probability previously mentioned in the <a href="#swor">&ldquo;What exactly is <strong>SWoR</strong>&rdquo;</a>
 section, which implies the possible outcomes of this algorithm will follow exactly the same probability distribution as that of a $n$-step <strong>SWoR</strong>.</p>
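<p>Readers who would rather see numerical evidence first can also check the claim empirically. The following sketch (illustrative Python, with all helper names made up for this post) repeatedly runs the key-based algorithm on a tiny sample space and compares the observed frequency of each ordered outcome against the product formula above:</p>

```python
import math
import random
from collections import Counter

def swor_via_keys(weights, n, rng):
    """Order row indices by descending key ln(u)/w and keep the first n."""
    keyed = sorted(
        ((math.log(1.0 - rng.random()) / w, i) for i, w in enumerate(weights)),
        reverse=True,
    )
    return tuple(i for _, i in keyed[:n])

def sequential_prob(weights, seq):
    """Probability of drawing `seq`, in order, in a sequential SWoR process."""
    remaining, p = sum(weights), 1.0
    for i in seq:
        p *= weights[i] / remaining
        remaining -= weights[i]
    return p

rng = random.Random(0)
weights = [1.0, 2.0, 3.0]
trials = 200000
counts = Counter(swor_via_keys(weights, 2, rng) for _ in range(trials))
```

For every ordered pair, the observed frequency lands within Monte Carlo noise of the closed-form probability; for example, the pair $(2, 1)$ should appear with probability $\frac{3}{6} \cdot \frac{2}{3} = \frac{1}{3}$.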
<p>So as not to deprive our dear readers of the pleasure of completing this proof themselves, we have decided not to inline the rest of the proof (which boils down to a calculus exercise) within this blog post, but it is available in <a href="proof.pdf">this file</a>
.</p>
<h1 id="weighted-sampling-with-replacement">Weighted sampling with replacement
</h1>
<p>While all previous sections focused entirely on weighted sampling without replacement, this section will briefly discuss how the exponential-variate approach can also benefit the weighted-sampling-with-replacement use case (which will be shortened as <code>SWR</code> from now on).</p>
<p>Although <code>SWR</code> with sample size $n$ can be carried out by $n$ independent processes each selecting $1$ sample, parallelizing a <code>SWR</code> workload across all partitions of a Spark data frame (let&rsquo;s call it $X$) will still be more performant if the number of partitions is much larger than $n$ and more than $n$ executors are available in a Spark cluster.</p>
<p>An initial solution we had in mind was to run <code>SWR</code> with sample size $n$ in parallel on each partition of $X$, and then re-sample the results based on relative total weights of each partition. Despite sounding deceptively simple when summarized in words, implementing such a solution in practice would be a moderately complicated task. First, one has to apply the <a href="https://en.wikipedia.org/wiki/Alias_method" target="_blank" rel="noopener">alias method</a>
 or similar in order to perform weighted sampling efficiently on each partition of $X$, and on top of that, implementing the re-sampling logic across all partitions correctly and verifying the correctness of such procedure will also require considerable effort.</p>
<p>In comparison, with the help of exponential variates, a <code>SWR</code> carried out as $n$ independent <strong>SWoR</strong> processes each selecting $1$ sample is much simpler to implement, while still being comparable to our initial solution in terms of efficiency and scalability. An example implementation of it (which takes fewer than 60 lines of Scala) is presented in <a href="samplingutils.scala">samplingutils.scala</a>
.</p>
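<p>In miniature, that simpler design looks something like the following (a hypothetical Python illustration; the linked Scala file is the real example implementation):</p>

```python
import math
import random

def swr_via_independent_swor(rows, weights, n, seed=0):
    """SWR realized as n independent SWoR rounds, each selecting 1 sample:
    every round draws fresh keys ln(u)/w, and the row holding the
    largest key wins that round."""
    rng = random.Random(seed)
    picks = []
    for _ in range(n):
        keys = [math.log(1.0 - rng.random()) / w for w in weights]
        winner = max(range(len(rows)), key=keys.__getitem__)
        picks.append(rows[winner])
    return picks
```

Because every round draws fresh keys, the same row may be picked repeatedly, which is exactly the with-replacement semantics, and each round's argmax parallelizes across partitions just like the top-$n$ step of <strong>SWoR</strong>.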
<h1 id="visualization">Visualization
</h1>
<p>How do we know <code>sparklyr::sdf_weighted_sample()</code> is working as expected? While the rigorous answer to this question is presented in full in the <a href="#testing">testing</a>
 section, we thought it would also be useful to first show some histograms that will help readers visualize what that test plan is. Therefore in this section, we will do the following:</p>
<ul>
<li>Run <code>dplyr::slice_sample()</code> multiple times on a small sample space, with each run using a different PRNG seed (the sample size is reduced to $2$ here so that there will be fewer than 100 possible outcomes and visualization will be easier)</li>
<li>Do the same for <code>sdf_weighted_sample()</code></li>
<li>Use histograms to visualize the distribution of sampling outcomes</li>
</ul>
<p>Throughout this section, we will sample $2$ elements out of $\{0, \dotsc, 7\}$ without replacement according to some weights, so, the first step is to set up the following in R:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># `octs` will be our sample space</span>
</span></span><span class="line"><span class="cl"><span class="n">octs</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">x</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">7</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">weight</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">8</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># `octs_sdf` will be our sample space copied into a Spark data frame</span>
</span></span><span class="line"><span class="cl"><span class="n">octs_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">octs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sample_size</span> <span class="o">&lt;-</span> <span class="m">2</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In order to tally up and visualize the sampling outcomes efficiently, we shall map each possible outcome to an octal number (e.g., <code>(6, 7)</code> gets mapped to $6 \cdot 8^0 + 7 \cdot 8^1$) using a helper function <code>to_oct</code> in R:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">to_oct</span> <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span> <span class="nf">sum</span><span class="p">(</span><span class="m">8</span> <span class="n">^</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">sample_size</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">sample</span><span class="o">$</span><span class="n">x</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>We also need to tally up sampling outcomes from <code>dplyr::slice_sample()</code> and <code>sparklyr::sdf_weighted_sample()</code> in 2 separate arrays:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">max_possible_outcome</span> <span class="o">&lt;-</span> <span class="nf">to_oct</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">8</span> <span class="o">-</span> <span class="n">sample_size</span><span class="p">,</span> <span class="m">7</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sdf_weighted_sample_outcomes</span> <span class="o">&lt;-</span> <span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">max_possible_outcome</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">dplyr_slice_sample_outcomes</span> <span class="o">&lt;-</span> <span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">max_possible_outcome</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Finally, we can run both <code>dplyr::slice_sample()</code> and <code>sparklyr::sdf_weighted_sample()</code> for an arbitrary number of iterations and compare the tallied outcomes from both:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">num_sampling_iters</span> <span class="o">&lt;-</span> <span class="m">1000</span>  <span class="c1"># actually we will vary this value from 500 to 5000</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kr">for</span> <span class="p">(</span><span class="n">x</span> <span class="kr">in</span> <span class="nf">seq</span><span class="p">(</span><span class="n">num_sampling_iters</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="n">seed</span> <span class="o">&lt;-</span> <span class="n">x</span> <span class="o">*</span> <span class="m">97</span>
</span></span><span class="line"><span class="cl">  <span class="n">sample1</span> <span class="o">&lt;-</span> <span class="n">octs_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">sdf_weighted_sample</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">k</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">,</span> <span class="n">weight_col</span> <span class="o">=</span> <span class="s">&#34;weight&#34;</span><span class="p">,</span> <span class="n">replacement</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">,</span> <span class="n">seed</span> <span class="o">=</span> <span class="n">seed</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">collect</span><span class="p">()</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">to_oct</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">  <span class="n">sdf_weighted_sample_outcomes[[sample1]]</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">      <span class="n">sdf_weighted_sample_outcomes[[sample1]]</span> <span class="o">+</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="nf">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="c1"># set random seed for dplyr::slice_sample()</span>
</span></span><span class="line"><span class="cl">  <span class="n">sample2</span> <span class="o">&lt;-</span> <span class="n">octs</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="n">dplyr</span><span class="o">::</span><span class="nf">slice_sample</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">n</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">,</span> <span class="n">weight_by</span> <span class="o">=</span> <span class="n">weight</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="kc">FALSE</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">    <span class="nf">to_oct</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr_slice_sample_outcomes[[sample2]]</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">      <span class="n">dplyr_slice_sample_outcomes[[sample2]]</span> <span class="o">+</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>After all the hard work above, we can now enjoy plotting the sampling outcomes from <code>dplyr::slice_sample()</code> and those from <code>sparklyr::sdf_weighted_sample()</code> after 500, 1000, and 5000 iterations and observe how the distributions of both start converging after a large number of iterations.</p>
<p>Sampling outcomes after 500, 1000, and 5000 iterations, shown in 3 histograms:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/images/viz.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>

(you will most probably need to <a href="images/viz.png">view it in a separate tab</a>
 to see everything clearly)</p>
<h1 id="testing">Testing
</h1>
<p>While parallelized sampling based on exponential variates looks fantastic on paper, there are still plenty of potential pitfalls when it comes to translating such an idea into code, and as usual, a good testing plan is necessary to ensure implementation correctness.</p>
<p>For instance, numerical instability issues with floating-point numbers would arise if $\ln(u_j) / w_j$ were replaced by $u_j ^ {1 / w_j}$ in the aforementioned computations.</p>
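<p>This is easy to reproduce in any language with IEEE 754 doubles; here is a tiny Python illustration. For a large weight, keys computed as $u^{1/w}$ collapse into indistinguishable ties near $1.0$, while $\ln(u)/w$ keeps them distinct and correctly ordered:</p>

```python
import math

w = 1e18                 # an extreme weight, to make the effect obvious
u1, u2 = 0.3, 0.7

# Keys computed as u ** (1/w): both round to the same double near 1.0,
# so the two rows can no longer be ranked against each other.
naive1, naive2 = u1 ** (1.0 / w), u2 ** (1.0 / w)

# Keys computed as ln(u) / w remain distinct, with the correct ordering.
stable1, stable2 = math.log(u1) / w, math.log(u2) / w
```

The logarithmic form spreads the keys out over $(-\infty, 0]$ instead of compressing them into a sliver below $1.0$.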
<p>Another more subtle source of error is the usage of PRNG seeds. For example, consider the following:</p>
<pre><code>  def sampleWithoutReplacement(
    rdd: RDD[Row],
    weightColumn: String,
    sampleSize: Int,
    seed: Long
  ): RDD[Row] = {
    val sc = rdd.context
    if (0 == sampleSize) {
      sc.emptyRDD
    } else {
      val random = new Random(seed)
      val mapRDDs = rdd.mapPartitions { iter =&gt;
        for (row &lt;- iter) {
          val weight = row.getAs[Double](weightColumn)
          val key = scala.math.log(random.nextDouble) / weight
          &lt;and then make sampling decision for `row` based on its `key`,
           as described in the previous section&gt;
        }
        ...
      }
      ...
    }
  }
</code></pre>
<p>Even though it might look OK at first glance, the <code>rdd.mapPartitions(...)</code> call above causes the same sequence of pseudorandom numbers to be applied to multiple partitions of the input Spark data frame, which introduces undesired bias (i.e., sampling outcomes from one partition will have a non-trivial correlation with those from another partition, when such correlation should be negligible in a correct implementation).</p>
<p>The code snippet below is an example implementation in which each partition of the input Spark data frame is sampled using a different sequence of pseudorandom numbers:</p>
<pre><code>  def sampleWithoutReplacement(
    rdd: RDD[Row],
    weightColumn: String,
    sampleSize: Int,
    seed: Long
  ): RDD[Row] = {
    val sc = rdd.context
    if (0 == sampleSize) {
      sc.emptyRDD
    } else {
      val mapRDDs = rdd.mapPartitionsWithIndex { (index, iter) =&gt;
        val random = new Random(seed + index)

        for (row &lt;- iter) {
          val weight = row.getAs[Double](weightColumn)
          val key = scala.math.log(random.nextDouble) / weight
          &lt;and then make sampling decision for `row` based on its `key`,
           as described in the previous section&gt;
        }

        ...
      }
    ...
  }
}
</code></pre>
<p>An example test case in which a two-sided Kolmogorov-Smirnov test is used to compare distribution of sampling outcomes from <code>dplyr::slice_sample()</code> with that from <code>sparklyr::sdf_weighted_sample()</code> is shown in <a href="test_plan">this file</a>
. Such tests have proven to be effective in surfacing non-obvious implementation errors such as the ones mentioned above.</p>
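<p>For readers unfamiliar with it, the two-sample Kolmogorov&ndash;Smirnov statistic is simply the largest vertical gap between two empirical CDFs. A small self-contained Python illustration (not the actual test code from <code>sparklyr</code>) follows:</p>

```python
import random

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the empirical CDFs of xs and ys."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    # merge the two sorted samples, tracking both empirical CDFs
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d
```

Two sets of sampling outcomes drawn from the same distribution should yield a small statistic, while outcomes from a biased implementation (such as the shared-PRNG one above) would push it well past the rejection threshold.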
<h1 id="example-usages">Example Usages
</h1>
<p>Please note the <code>sparklyr::sdf_weighted_sample()</code> functionality is not included in any official release of <code>sparklyr</code> yet. We are aiming to ship it as part of <code>sparklyr</code> 1.4 in about 2 to 3 months from now.</p>
<p>In the meantime, you can try it out with the following steps:</p>
<p>First, make sure <code>remotes</code> is installed, and then run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">remotes</span><span class="o">::</span><span class="nf">install_github</span><span class="p">(</span><span class="s">&#34;sparklyr/sparklyr&#34;</span><span class="p">,</span> <span class="n">ref</span> <span class="o">=</span> <span class="s">&#34;master&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>to install <code>sparklyr</code> from source.</p>
<p>Next, create a test data frame with a numeric weight column containing a non-negative weight for each row, and then copy it to Spark (see the code snippet below for an example):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">example_df</span> <span class="o">&lt;-</span> <span class="nf">data.frame</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">x</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">100</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">weight</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">50</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">25</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">4</span><span class="p">,</span> <span class="m">10</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">8</span><span class="p">,</span> <span class="m">10</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">rep</span><span class="p">(</span><span class="m">16</span><span class="p">,</span> <span class="m">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">example_sdf</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">example_df</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">5</span><span class="p">,</span> <span class="n">overwrite</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Finally, run <code>sparklyr::sdf_weighted_sample()</code> on <code>example_sdf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sample_size</span> <span class="o">&lt;-</span> <span class="m">5</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_without_replacement</span> <span class="o">&lt;-</span> <span class="n">example_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_weighted_sample</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">weight_col</span> <span class="o">=</span> <span class="s">&#34;weight&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">k</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">replacement</span> <span class="o">=</span> <span class="kc">FALSE</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_without_replacement</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # Source: spark&lt;?&gt; [?? x 2]
##       x weight
##   &lt;int&gt;  &lt;dbl&gt;
## 1    48      1
## 2    22      1
## 3    78      4
## 4    56      2
## 5   100     16
</code></pre>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">samples_with_replacement</span> <span class="o">&lt;-</span> <span class="n">example_sdf</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sdf_weighted_sample</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">weight_col</span> <span class="o">=</span> <span class="s">&#34;weight&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">k</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">replacement</span> <span class="o">=</span> <span class="kc">TRUE</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">samples_with_replacement</span> <span class="o">%&gt;%</span> <span class="nf">print</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="n">sample_size</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>## # Source: spark&lt;?&gt; [?? x 2]
##       x weight
##   &lt;int&gt;  &lt;dbl&gt;
## 1    86      8
## 2    97     16
## 3    91      8
## 4   100     16
## 5    65      2
</code></pre>
<h1 id="acknowledgement">Acknowledgement
</h1>
<p>First and foremost, the author wishes to thank <a href="https://github.com/ajing" target="_blank" rel="noopener">@ajing</a>
 for reporting that weighted sampling use cases were not yet properly supported in <code>sparklyr</code> 1.3 and suggesting that they should be part of a future version of <code>sparklyr</code> in this <a href="https://github.com/sparklyr/sparklyr/issues/2592" target="_blank" rel="noopener">GitHub issue</a>
.</p>
<p>Special thanks also go to Javier (<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>) for reviewing the <a href="https://github.com/sparklyr/sparklyr/pull/2606" target="_blank" rel="noopener">implementation</a> of all exponential-variate-based sampling algorithms in <code>sparklyr</code>, and to Mara (<a href="https://github.com/batpigandme" target="_blank" rel="noopener">@batpigandme</a>), Sigrid (<a href="https://github.com/skeydan" target="_blank" rel="noopener">@skeydan</a>), and Javier (<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>) for their valuable editorial suggestions.</p>
<p>We hope you have enjoyed reading this blog post! If you wish to learn more about <code>sparklyr</code>, we recommend visiting <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>, and some of the previous release posts, such as <a href="https://blog.rstudio.com/2020/07/16/sparklyr-1-3/" target="_blank" rel="noopener">sparklyr 1.3</a> and <a href="https://posit-open-source.netlify.app/blog/ai/2020-04-21-sparklyr-1.2.0-released/">sparklyr 1.2</a>. Your contributions to <code>sparklyr</code> are also more than welcome: please send pull requests <a href="https://github.com/sparklyr/sparklyr/pulls" target="_blank" rel="noopener">here</a> and file bug reports or feature requests <a href="https://github.com/sparklyr/sparklyr" target="_blank" rel="noopener">here</a>.</p>
<p>Thanks for reading!</p>
<p>Efraimidis, Pavlos, and Paul (Pavlos) Spirakis. 2016. &ldquo;Weighted Random Sampling.&rdquo; In <em>Encyclopedia of Algorithms</em>, edited by Ming-Yang Kao. Springer New York. <a href="https://doi.org/10.1007/978-1-4939-2864-4_478" target="_blank" rel="noopener">https://doi.org/10.1007/978-1-4939-2864-4_478</a>
.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/2020-07-29-parallelized-sampling/thumbnail.jpg" length="60526" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.3: Higher-order Functions, Avro and Custom Serializers</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.3/</link>
      <pubDate>Thu, 16 Jul 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.3/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p><a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 1.3 is now available on <a href="https://cran.r-project.org/web/packages/sparklyr/index.html" target="_blank" rel="noopener">CRAN</a>
, with the following major new features:</p>
<ul>
<li><a href="#higher-order-functions">Higher-order Functions</a>
 to easily manipulate arrays and structs</li>
<li>Support for Apache <a href="#avro">Avro</a>
, a row-oriented data serialization framework</li>
<li><a href="#custom-serialization">Custom Serialization</a>
 using R functions to read and write any data format</li>
<li><a href="#other-improvements">Other Improvements</a>
 such as compatibility with EMR 6.0 &amp; Spark 3.0, and initial support for Flint time series library</li>
</ul>
<p>To install <code>sparklyr</code> 1.3 from CRAN, run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In this post, we shall highlight some major new features introduced in sparklyr 1.3, and showcase scenarios where such features come in handy. While a number of enhancements and bug fixes (especially those related to <code>spark_apply()</code>, <a href="https://arrow.apache.org/" target="_blank" rel="noopener">Apache Arrow</a>
, and secondary Spark connections) were also an important part of this release, they will not be the topic of this post, and it will be an easy exercise for the reader to find out more about them from the sparklyr <a href="https://github.com/sparklyr/sparklyr/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h2 id="higher-order-functions">Higher-order Functions
</h2>
<p><a href="https://issues.apache.org/jira/browse/SPARK-19480" target="_blank" rel="noopener">Higher-order functions</a>
 are built-in Spark SQL constructs that allow user-defined lambda expressions to be applied efficiently to complex data types such as arrays and structs. As a quick demo to see why higher-order functions are useful, let&rsquo;s say one day Scrooge McDuck dove into his huge vault of money and found large quantities of pennies, nickels, dimes, and quarters. Having an impeccable taste in data structures, he decided to store the quantities and face values of everything into two Spark SQL array columns:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4.5&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">coins_tbl</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">quantities</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">4000</span><span class="p">,</span> <span class="m">3000</span><span class="p">,</span> <span class="m">2000</span><span class="p">,</span> <span class="m">1000</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="n">values</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="m">25</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Thus declaring his net worth of 4k pennies, 3k nickels, 2k dimes, and 1k quarters. To help Scrooge McDuck calculate the total value of each type of coin in sparklyr 1.3 or above, we can apply <code>hof_zip_with()</code>, the sparklyr equivalent of <a href="https://spark.apache.org/docs/latest/api/sql/index.html#zip_with" target="_blank" rel="noopener">ZIP_WITH</a>
, to the <code>quantities</code> and <code>values</code> columns, combining pairs of elements from the arrays in both columns. As you might have guessed, we also need to specify how to combine those elements, and what better way to accomplish that than a concise one-sided formula <code>~ .x * .y</code> in R, which says we want (quantity * value) for each type of coin? So, we have the following:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">result_tbl</span> <span class="o">&lt;-</span> <span class="n">coins_tbl</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">hof_zip_with</span><span class="p">(</span><span class="o">~</span> <span class="n">.x</span> <span class="o">*</span> <span class="n">.y</span><span class="p">,</span> <span class="n">dest_col</span> <span class="o">=</span> <span class="n">total_values</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="n">total_values</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">result_tbl</span> <span class="o">%&gt;%</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">total_values</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1]  4000 15000 20000 25000
</code></pre>
<p>With the result <code>4000 15000 20000 25000</code> telling us there are in total $40 worth of pennies, $150 worth of nickels, $200 worth of dimes, and $250 worth of quarters, as expected.</p>
<p>Using another sparklyr function named <code>hof_aggregate()</code>, which performs an <a href="https://spark.apache.org/docs/latest/api/sql/index.html#aggregate" target="_blank" rel="noopener">AGGREGATE</a>
 operation in Spark, we can then compute the net worth of Scrooge McDuck based on <code>result_tbl</code>, storing the result in a new column named <code>total</code>. Notice that for this aggregate operation to work, we need to ensure the starting value of the aggregation has a data type (namely, <code>BIGINT</code>) consistent with the data type of <code>total_values</code> (which is <code>ARRAY&lt;BIGINT&gt;</code>), as shown below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">result_tbl</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">mutate</span><span class="p">(</span><span class="n">zero</span> <span class="o">=</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">sql</span><span class="p">(</span><span class="s">&#34;CAST (0 AS BIGINT)&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">hof_aggregate</span><span class="p">(</span><span class="n">start</span> <span class="o">=</span> <span class="n">zero</span><span class="p">,</span> <span class="o">~</span> <span class="n">.x</span> <span class="o">+</span> <span class="n">.y</span><span class="p">,</span> <span class="n">expr</span> <span class="o">=</span> <span class="n">total_values</span><span class="p">,</span> <span class="n">dest_col</span> <span class="o">=</span> <span class="n">total</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="n">total</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dplyr</span><span class="o">::</span><span class="nf">pull</span><span class="p">(</span><span class="n">total</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1] 64000
</code></pre>
<p>So Scrooge McDuck&rsquo;s net worth is $640.</p>
<p>Other higher-order functions supported by Spark SQL so far include <code>transform</code>, <code>filter</code>, and <code>exists</code>, as documented <a href="https://spark.apache.org/docs/latest/api/sql/index.html" target="_blank" rel="noopener">here</a>, and similar to the example above, their counterparts (namely, <code>hof_transform()</code>, <code>hof_filter()</code>, and <code>hof_exists()</code>) all exist in sparklyr 1.3, so that they can be integrated with other <code>dplyr</code> verbs in an idiomatic manner in R.</p>
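<p>As a minimal sketch of one of these counterparts (assuming the same local Spark connection as above, and that <code>hof_filter()</code> accepts <code>expr</code> and <code>dest_col</code> arguments analogous to those of <code>hof_aggregate()</code> shown earlier), we could keep only the array elements matching a predicate:</p>

```r
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.4.5")

coins_tbl <- copy_to(
  sc,
  tibble::tibble(quantities = list(c(4000, 3000, 2000, 1000)))
)

# Within each array, keep only quantities greater than 1500
coins_tbl %>%
  hof_filter(~ .x > 1500, expr = quantities, dest_col = large_quantities) %>%
  dplyr::pull(large_quantities)
```

<p>Here the one-sided formula <code>~ .x &gt; 1500</code> plays the same role as the lambda expression in Spark SQL&rsquo;s <code>FILTER</code>.</p>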
<h2 id="avro">Avro
</h2>
<p>Another highlight of the sparklyr 1.3 release is its built-in support for Avro data sources. Apache Avro is a widely used data serialization protocol that combines the efficiency of a binary data format with the flexibility of JSON schema definitions. To make working with Avro data sources simpler, in sparklyr 1.3, as soon as a Spark connection is instantiated with <code>spark_connect(..., package = &quot;avro&quot;)</code>, sparklyr automatically figures out which version of the <code>spark-avro</code> package to use with that connection, saving sparklyr users the headache of determining the correct version of <code>spark-avro</code> themselves. Similar to how <code>spark_read_csv()</code> and <code>spark_write_csv()</code> work with CSV data, the <code>spark_read_avro()</code> and <code>spark_write_avro()</code> methods were implemented in sparklyr 1.3 to facilitate reading and writing Avro files through an Avro-capable Spark connection, as illustrated in the example below:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># The `package = &#34;avro&#34;` option is only supported in Spark 2.4 or higher</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4.5&#34;</span><span class="p">,</span> <span class="n">package</span> <span class="o">=</span> <span class="s">&#34;avro&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">sdf_copy_to</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">sc</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">a</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="kc">NaN</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="kc">NaN</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">-2L</span><span class="p">,</span> <span class="m">0L</span><span class="p">,</span> <span class="m">1L</span><span class="p">,</span> <span class="m">3L</span><span class="p">,</span> <span class="m">2L</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">c</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;a&#34;</span><span class="p">,</span> <span class="s">&#34;b&#34;</span><span class="p">,</span> <span class="s">&#34;c&#34;</span><span class="p">,</span> <span class="s">&#34;&#34;</span><span class="p">,</span> <span class="s">&#34;d&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># This example Avro schema is a JSON string that essentially says all columns</span>
</span></span><span class="line"><span class="cl"><span class="c1"># (&#34;a&#34;, &#34;b&#34;, &#34;c&#34;) of `sdf` are nullable.</span>
</span></span><span class="line"><span class="cl"><span class="n">avro_schema</span> <span class="o">&lt;-</span> <span class="n">jsonlite</span><span class="o">::</span><span class="nf">toJSON</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;record&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;topLevelRecord&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">fields</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;a&#34;</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="s">&#34;double&#34;</span><span class="p">,</span> <span class="s">&#34;null&#34;</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;b&#34;</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="s">&#34;int&#34;</span><span class="p">,</span> <span class="s">&#34;null&#34;</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;c&#34;</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="s">&#34;string&#34;</span><span class="p">,</span> <span class="s">&#34;null&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">),</span> <span class="n">auto_unbox</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># persist the Spark data frame from above in Avro format</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_write_avro</span><span class="p">(</span><span class="n">sdf</span><span class="p">,</span> <span class="s">&#34;/tmp/data.avro&#34;</span><span class="p">,</span> <span class="nf">as.character</span><span class="p">(</span><span class="n">avro_schema</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># and then read the same data frame back</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_read_avro</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;/tmp/data.avro&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;data&gt; [?? x 3]
      a     b c
  &lt;dbl&gt; &lt;int&gt; &lt;chr&gt;
  1     1    -2 &quot;a&quot;
  2   NaN     0 &quot;b&quot;
  3     3     1 &quot;c&quot;
  4     4     3 &quot;&quot;
  5   NaN     2 &quot;d&quot;
</code></pre>
<h2 id="custom-serialization">Custom Serialization
</h2>
<p>In addition to commonly used data serialization formats such as CSV, JSON, Parquet, and Avro, starting from sparklyr 1.3, customized data frame serialization and deserialization procedures implemented in R can also be run on Spark workers via the newly implemented <code>spark_read()</code> and <code>spark_write()</code> methods. We can see both of them in action through the quick example below, where <code>saveRDS()</code> is called from a user-defined writer function to save all rows within a Spark data frame into two RDS files on disk, and <code>readRDS()</code> is called from a user-defined reader function to read the data from the RDS files back into Spark:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sdf</span> <span class="o">&lt;-</span> <span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">7</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">paths</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;/tmp/file1.RDS&#34;</span><span class="p">,</span> <span class="s">&#34;/tmp/file2.RDS&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">spark_write</span><span class="p">(</span><span class="n">sdf</span><span class="p">,</span> <span class="n">writer</span> <span class="o">=</span> <span class="kr">function</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">path</span><span class="p">)</span> <span class="nf">saveRDS</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">path</span><span class="p">),</span> <span class="n">paths</span> <span class="o">=</span> <span class="n">paths</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_read</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">paths</span><span class="p">,</span> <span class="n">reader</span> <span class="o">=</span> <span class="kr">function</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="nf">readRDS</span><span class="p">(</span><span class="n">path</span><span class="p">),</span> <span class="n">columns</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">id</span> <span class="o">=</span> <span class="s">&#34;integer&#34;</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: spark&lt;?&gt; [?? x 1]
     id
  &lt;int&gt;
1     1
2     2
3     3
4     4
5     5
6     6
7     7
</code></pre>
<h2 id="other-improvements">Other Improvements
</h2>
<h3 id="sparklyrflint">sparklyr.flint
</h3>
<p><a href="https://github.com/r-spark/sparklyr.flint" target="_blank" rel="noopener"><code>sparklyr.flint</code></a> is a sparklyr extension that aims to make functionality from the <a href="https://github.com/twosigma/flint" target="_blank" rel="noopener">Flint</a> time-series library easily accessible from R. It is currently under active development. One piece of good news is that, while the original <a href="https://github.com/twosigma/flint" target="_blank" rel="noopener">Flint</a> library was designed to work with Spark 2.x, a slightly modified <a href="https://github.com/yl790/flint" target="_blank" rel="noopener">fork</a> of it works well with Spark 3.0 and within the existing sparklyr extension framework; <code>sparklyr.flint</code> can automatically determine which version of the Flint library to load based on the version of Spark it is connected to. Another bit of good news is that, as previously mentioned, <code>sparklyr.flint</code> is still in an early stage of development, so you can play an active part in shaping its future!</p>
<h3 id="emr-60">EMR 6.0
</h3>
<p>This release also features a small but important change that allows sparklyr to correctly connect to the version of Spark 2.4 that is included in Amazon EMR 6.0.</p>
<p>Previously, sparklyr automatically assumed any Spark 2.x it was connecting to was built with Scala 2.11 and attempted to load any required Scala artifacts built with Scala 2.11 as well. This became problematic when connecting to Spark 2.4 from Amazon EMR 6.0, which is built with Scala 2.12. Starting from sparklyr 1.3, this problem can be fixed by simply specifying <code>scala_version = &quot;2.12&quot;</code> when calling <code>spark_connect()</code> (e.g., <code>spark_connect(master = &quot;yarn-client&quot;, scala_version = &quot;2.12&quot;)</code>).</p>
<h3 id="spark-30">Spark 3.0
</h3>
<p>Last but not least, it is worth mentioning that sparklyr 1.3.0 is known to be fully compatible with the recently released Spark 3.0. We highly recommend upgrading your copy of sparklyr to 1.3.0 if you plan to have Spark 3.0 as part of your data workflow in the future.</p>
<h2 id="acknowledgement">Acknowledgement
</h2>
<p>In chronological order, we want to thank the following individuals for submitting pull requests towards sparklyr 1.3:</p>
<ul>
<li><a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">Jozef Hajnala</a>
</li>
<li><a href="https://github.com/falaki" target="_blank" rel="noopener">Hossein Falaki</a>
</li>
<li><a href="https://github.com/samuelmacedo83" target="_blank" rel="noopener">Samuel Macêdo</a>
</li>
<li><a href="https://github.com/yl790" target="_blank" rel="noopener">Yitao Li</a>
</li>
<li><a href="https://github.com/Loquats" target="_blank" rel="noopener">Andy Zhang</a>
</li>
<li><a href="https://github.com/javierluraschi" target="_blank" rel="noopener">Javier Luraschi</a>
</li>
<li><a href="https://github.com/nealrichardson" target="_blank" rel="noopener">Neal Richardson</a>
</li>
</ul>
<p>We are also grateful for valuable input on the sparklyr 1.3 roadmap, <a href="https://github.com/sparklyr/sparklyr/pull/2434" target="_blank" rel="noopener">#2434</a>
, and <a href="https://github.com/sparklyr/sparklyr/pull/2551" target="_blank" rel="noopener">#2551</a>
 from <a href="https://github.com/javierluraschi" target="_blank" rel="noopener">@javierluraschi</a>
, and great spiritual advice on <a href="https://github.com/sparklyr/sparklyr/issues/1773" target="_blank" rel="noopener">#1773</a>
 and <a href="https://github.com/sparklyr/sparklyr/issues/2514" target="_blank" rel="noopener">#2514</a>
 from <a href="https://github.com/mattpollock" target="_blank" rel="noopener">@mattpollock</a>
 and <a href="https://github.com/benmwhite" target="_blank" rel="noopener">@benmwhite</a>
.</p>
<p>Please note that if you believe you are missing from the acknowledgement above, it may be because your contribution was considered part of the next sparklyr release rather than of the current one. We make every effort to ensure all contributors are mentioned in this section. If you believe there is a mistake, please feel free to contact the author of this blog post via e-mail (yitao at rstudio dot com) and request a correction.</p>
<p>If you wish to learn more about <code>sparklyr</code>, we recommend visiting <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, and some of the previous release posts such as <a href="https://posit-open-source.netlify.app/blog/ai/2020-04-21-sparklyr-1.2.0-released/">sparklyr 1.2</a>
 and <a href="https://blog.rstudio.com/2020/01/29/sparklyr-1-1/" target="_blank" rel="noopener">sparklyr 1.1</a>
.</p>
<p>Thanks for reading!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.3/thumbnail.jpg" length="93301" type="image/jpeg" />
    </item>
    <item>
      <title>pins 0.4.0: Versioning</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/pins-0-4-0-versioning/</link>
      <pubDate>Fri, 29 May 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/pins-0-4-0-versioning/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<p>A new version of <code>pins</code> is available on CRAN today, which adds support for <a href="http://pins.rstudio.com/articles/advanced-versions.html" target="_blank" rel="noopener">versioning</a>
 your datasets and <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean Spaces</a>
 boards!</p>
<p>As a quick recap, the pins package allows you to cache, discover and share resources. You can use <code>pins</code> in a wide range of situations, from downloading a dataset from a URL to creating complex automation workflows (learn more at <a href="https://pins.rstudio.com" target="_blank" rel="noopener">pins.rstudio.com</a>
). You can also use <code>pins</code> in combination with TensorFlow and Keras; for instance, use <a href="https://tensorflow.rstudio.com/tools/cloudml" target="_blank" rel="noopener">cloudml</a>
 to train models in cloud GPUs, but rather than manually copying files into the GPU instance, you can store them as pins directly from R.</p>
<p>To install this new version of <code>pins</code> from CRAN, simply run:</p>
<pre tabindex="0"><code>install.packages(&#34;pins&#34;)
</code></pre><p>You can find a detailed list of improvements in the pins <a href="https://github.com/rstudio/pins/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h1 id="versioning">Versioning
</h1>
<p>To illustrate the new versioning functionality, let&rsquo;s start by downloading and caching a remote dataset with pins. For this example, we will download the current weather in London; the data happens to be in JSON format and requires <code>jsonlite</code> to be parsed:</p>
<pre tabindex="0"><code>library(pins)

weather_url &lt;- &#34;https://samples.openweathermap.org/data/2.5/weather?q=London,uk&amp;appid=b6907d289e10d714a6e88b30761fae22&#34;

pin(weather_url, &#34;weather&#34;) %&gt;%
  jsonlite::read_json() %&gt;%
  as.data.frame()
</code></pre><pre tabindex="0"><code>  coord.lon coord.lat weather.id weather.main     weather.description weather.icon
1     -0.13     51.51        300      Drizzle light intensity drizzle          09d
</code></pre><p>One advantage of using <code>pins</code> is that, even if the URL or your internet connection becomes unavailable, the above code will still work.</p>
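<p>Once cached, the pin can also be retrieved by name alone. As a quick sketch (assuming the default local board), <code>pin_get()</code> reads the cached copy rather than touching the network:</p>
<pre tabindex="0"><code>pin_get(&#34;weather&#34;) %&gt;%
  jsonlite::read_json() %&gt;%
  as.data.frame()
</code></pre>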
<p>But back to <code>pins 0.4</code>! The new <code>signature</code> parameter in <code>pin_info()</code> allows you to retrieve the &ldquo;version&rdquo; of this dataset:</p>
<pre tabindex="0"><code>pin_info(&#34;weather&#34;, signature = TRUE)
</code></pre><pre tabindex="0"><code># Source: local&lt;weather&gt; [files]
# Signature: 624cca260666c6f090b93c37fd76878e3a12a79b
# Properties:
#   - path: weather
</code></pre><p>You can then validate the remote dataset has not changed by specifying its signature:</p>
<pre tabindex="0"><code>pin(weather_url, &#34;weather&#34;, signature = &#34;624cca260666c6f090b93c37fd76878e3a12a79b&#34;) %&gt;%
  jsonlite::read_json()
</code></pre><p>If the remote dataset changes, <code>pin()</code> will fail, and you can then take the appropriate steps: either accept the changes by updating the signature, or update your code accordingly. The previous example is useful for detecting version changes, but we might also want to retrieve specific versions even after the dataset changes.</p>
<p><code>pins 0.4</code> allows you to display and retrieve versions from services like GitHub, Kaggle and RStudio Connect. Even in boards that don&rsquo;t support versioning natively, you can opt-in by registering a board with <code>versions = TRUE</code>.</p>
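<p>You can try this opt-in behavior without any cloud account. As a minimal sketch (the board name <code>local_versioned</code> is just an example; arguments follow the pins 0.4 API), registering a local board with <code>versions = TRUE</code> enables the same workflow:</p>
<pre tabindex="0"><code>library(pins)

# opt in to versioning on a board that does not version natively
board_register_local(name = &#34;local_versioned&#34;, versions = TRUE)

# pin twice under the same name; both versions are kept
pin(iris, name = &#34;flowers&#34;, board = &#34;local_versioned&#34;)
pin(head(iris, 10), name = &#34;flowers&#34;, board = &#34;local_versioned&#34;)

pin_versions(&#34;flowers&#34;, board = &#34;local_versioned&#34;)
</code></pre>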
<p>To keep this simple, let&rsquo;s focus on GitHub first. We will register a GitHub board and pin a dataset to it. Notice that you can also specify the <code>commit</code> parameter in GitHub boards as the commit message for this change.</p>
<pre tabindex="0"><code>board_register_github(repo = &#34;javierluraschi/datasets&#34;, branch = &#34;datasets&#34;)

pin(iris, name = &#34;versioned&#34;, board = &#34;github&#34;, commit = &#34;use iris as the main dataset&#34;)
</code></pre><p>Now suppose that a colleague comes along and updates this dataset as well:</p>
<pre tabindex="0"><code>pin(mtcars, name = &#34;versioned&#34;, board = &#34;github&#34;, commit = &#34;slight preference to mtcars&#34;)
</code></pre><p>From now on, your code could be broken or, even worse, produce incorrect results!</p>
<p>However, since GitHub was designed as a version control system and <code>pins 0.4</code> adds support for <code>pin_versions()</code>, we can now explore particular versions of this dataset:</p>
<pre tabindex="0"><code>pin_versions(&#34;versioned&#34;, board = &#34;github&#34;)
</code></pre><pre tabindex="0"><code># A tibble: 2 x 4
  version created              author         message                     
  &lt;chr&gt;   &lt;chr&gt;                &lt;chr&gt;          &lt;chr&gt;                       
1 6e6c320 2020-04-02T21:28:07Z javierluraschi slight preference to mtcars 
2 01f8ddf 2020-04-02T21:27:59Z javierluraschi use iris as the main dataset
</code></pre><p>You can then retrieve the version you are interested in as follows:</p>
<pre tabindex="0"><code>pin_get(&#34;versioned&#34;, version = &#34;01f8ddf&#34;, board = &#34;github&#34;)
</code></pre><pre tabindex="0"><code># A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows
</code></pre><p>You can follow similar steps for <a href="http://pins.rstudio.com/articles/boards-rsconnect.html" target="_blank" rel="noopener">RStudio Connect</a>
 and <a href="http://pins.rstudio.com/articles/boards-kaggle.html" target="_blank" rel="noopener">Kaggle</a>
 boards, even for existing pins! Other boards like <a href="http://pins.rstudio.com/articles/boards-s3.html" target="_blank" rel="noopener">Amazon S3</a>
, <a href="http://pins.rstudio.com/articles/boards-gcloud.html" target="_blank" rel="noopener">Google Cloud</a>
, <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">Digital Ocean</a>
 and <a href="http://pins.rstudio.com/articles/boards-azure.html" target="_blank" rel="noopener">Microsoft Azure</a>
 require you to explicitly enable versioning when registering your boards.</p>
<h1 id="digitalocean">DigitalOcean
</h1>
<p>To try out the new <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean Spaces board</a>
, first you will have to register this board and enable versioning by setting <code>versions</code> to <code>TRUE</code>:</p>
<pre tabindex="0"><code>library(pins)
board_register_dospace(space = &#34;pinstest&#34;,
                       key = &#34;AAAAAAAAAAAAAAAAAAAA&#34;,
                       secret = &#34;ABCABCABCABCABCABCABCABCABCABCABCABCABCA==&#34;,
                       datacenter = &#34;sfo2&#34;,
                       versions = TRUE)
</code></pre><p>You can then use all the functionality pins provides, including versioning:</p>
<pre tabindex="0"><code># create pin and replace content in digitalocean
pin(iris, name = &#34;versioned&#34;, board = &#34;pinstest&#34;)
pin(mtcars, name = &#34;versioned&#34;, board = &#34;pinstest&#34;)

# retrieve versions from digitalocean
pin_versions(name = &#34;versioned&#34;, board = &#34;pinstest&#34;)
</code></pre><pre tabindex="0"><code># A tibble: 2 x 1
  version
  &lt;chr&gt;  
1 c35da04
2 d9034cd
</code></pre><p>Notice that enabling versions in cloud services requires additional storage space for each version of the dataset being stored:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/pins-0-4-0-versioning/images/digitalocean-spaces-pins-versioned.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>To learn more visit the <a href="http://pins.rstudio.com/articles/advanced-versions.html" target="_blank" rel="noopener">Versioning</a>
 and <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean</a>
 articles. To catch up with previous releases:</p>
<ul>
<li><a href="https://blog.rstudio.com/2019/11/28/pins-0-3-0-azure-gcloud-and-s3/" target="_blank" rel="noopener">pins 0.3</a>
: Azure, GCloud and S3</li>
<li><a href="https://blog.rstudio.com/2019/09/09/pin-discover-and-share-resources/" target="_blank" rel="noopener">pins 0.2</a>
: Pin, Discover and Share Resources</li>
</ul>
<p>Thanks for reading along!</p>
]]></description>
    </item>
    <item>
      <title>sparklyr 1.2: Foreach, Spark 3.0 and Databricks Connect</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-2/</link>
      <pubDate>Wed, 06 May 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-2/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p>A new version of <a href="https://sparklyr.ai" target="_blank" rel="noopener"><code>sparklyr</code></a>
 is now available on CRAN! In this <code>sparklyr 1.2</code> release, the following new improvements have emerged into the spotlight:</p>
<ul>
<li>A <code>registerDoSpark()</code> method to create a <a href="#foreach"><code>foreach</code></a>
 parallel backend powered by Spark that enables hundreds of existing R packages to run in Spark.</li>
<li>Support for <a href="#databricks-connect">Databricks Connect</a>
, allowing <code>sparklyr</code> to connect to remote Databricks clusters.</li>
<li>Improved support for Spark <a href="#structures">structures</a>
 when collecting and querying their nested attributes with <code>dplyr</code>.</li>
</ul>
<p>A number of interop issues observed with <code>sparklyr</code> and the Spark 3.0 preview were also addressed recently, in the hope that by the time Spark 3.0 officially graces us with its presence, <code>sparklyr</code> will be fully ready to work with it. Most notably, key features such as <code>spark_submit()</code>, <code>sdf_bind_rows()</code>, and standalone connections are now finally working with the Spark 3.0 preview.</p>
<p>To install <code>sparklyr</code> 1.2 from CRAN, run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The full list of changes is available in the <code>sparklyr</code> <a href="https://github.com/sparklyr/sparklyr/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h2 id="foreach">Foreach
</h2>
<p>The <a href="https://CRAN.R-project.org/package=foreach" target="_blank" rel="noopener"><code>foreach</code></a>
 package provides the <code>%dopar%</code> operator to iterate over elements in a collection in parallel. Using <code>sparklyr</code> 1.2, you can now register Spark as a backend using <code>registerDoSpark()</code> and then easily iterate over R objects using Spark:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">foreach</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">registerDoSpark</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">foreach</span><span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span> <span class="n">.combine</span> <span class="o">=</span> <span class="s">&#39;c&#39;</span><span class="p">)</span> <span class="o">%dopar%</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sqrt</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>[1] 1.000000 1.414214 1.732051
</code></pre><p>Since many R packages are based on <code>foreach</code> to perform parallel computation, we can now make use of all those great packages in Spark as well!</p>
<p>For instance, we can use <a href="https://tidymodels.github.io/parsnip/" target="_blank" rel="noopener"><code>parsnip</code></a>
 and the <a href="https://tidymodels.github.io/tune/" target="_blank" rel="noopener"><code>tune</code></a>
 package with data from <a href="https://CRAN.R-project.org/package=mlbench" target="_blank" rel="noopener"><code>mlbench</code></a>
 to perform hyperparameter tuning in Spark with ease:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tune</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">parsnip</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">mlbench</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">data</span><span class="p">(</span><span class="n">Ionosphere</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">svm_rbf</span><span class="p">(</span><span class="n">cost</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">(),</span> <span class="n">rbf_sigma</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">())</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_mode</span><span class="p">(</span><span class="s">&#34;classification&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_engine</span><span class="p">(</span><span class="s">&#34;kernlab&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">tune_grid</span><span class="p">(</span><span class="n">Class</span> <span class="o">~</span> <span class="n">.,</span>
</span></span><span class="line"><span class="cl">    <span class="n">resamples</span> <span class="o">=</span> <span class="n">rsample</span><span class="o">::</span><span class="nf">bootstraps</span><span class="p">(</span><span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="n">Ionosphere</span><span class="p">,</span> <span class="o">-</span><span class="n">V2</span><span class="p">),</span> <span class="n">times</span> <span class="o">=</span> <span class="m">30</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">control</span> <span class="o">=</span> <span class="nf">control_grid</span><span class="p">(</span><span class="n">verbose</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Bootstrap sampling
# A tibble: 30 x 4
   splits            id          .metrics          .notes
 * &lt;list&gt;            &lt;chr&gt;       &lt;list&gt;            &lt;list&gt;
 1 &lt;split [351/124]&gt; Bootstrap01 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 2 &lt;split [351/126]&gt; Bootstrap02 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 3 &lt;split [351/125]&gt; Bootstrap03 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 4 &lt;split [351/135]&gt; Bootstrap04 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 5 &lt;split [351/127]&gt; Bootstrap05 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 6 &lt;split [351/131]&gt; Bootstrap06 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 7 &lt;split [351/141]&gt; Bootstrap07 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 8 &lt;split [351/123]&gt; Bootstrap08 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 9 &lt;split [351/118]&gt; Bootstrap09 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
10 &lt;split [351/136]&gt; Bootstrap10 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
# … with 20 more rows
</code></pre><p>The Spark connection was already registered, so the code ran in Spark without any additional changes. We can verify that this was the case by navigating to the Spark web interface:</p>
<img src="https://posit-open-source.netlify.app/blog/images/2020-05-06-sparklyr-1-2-spark-backend-foreach-package.png" alt="Spark running foreach package using sparklyr"/>
<h2 id="databricks-connect">Databricks Connect
</h2>
<p><a href="https://docs.databricks.com/dev-tools/databricks-connect.html" target="_blank" rel="noopener">Databricks Connect</a>
 allows you to connect your favorite IDE (like <a href="https://rstudio.com/products/rstudio/download/" target="_blank" rel="noopener">RStudio</a>
!) to a Spark <a href="https://databricks.com/" target="_blank" rel="noopener">Databricks</a>
 cluster.</p>
<p>You will first have to install the <code>databricks-connect</code> Python package as described in our <a href="https://github.com/sparklyr/sparklyr#connecting-through-databricks-connect" target="_blank" rel="noopener">README</a>
 and start a Databricks cluster, but once that&rsquo;s ready, connecting to the remote cluster is as easy as running:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">method</span> <span class="o">=</span> <span class="s">&#34;databricks&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">spark_home</span> <span class="o">=</span> <span class="nf">system2</span><span class="p">(</span><span class="s">&#34;databricks-connect&#34;</span><span class="p">,</span> <span class="s">&#34;get-spark-home&#34;</span><span class="p">,</span> <span class="n">stdout</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><img src="https://posit-open-source.netlify.app/blog/images/2020-05-06-sparklyr-1-2-spark-databricks-connect-rstudio.png" alt="Databricks Connect with RStudio Desktop"/>
<p>That&rsquo;s about it; you are now remotely connected to a Databricks cluster from your local R session.</p>
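<p>From there, the connection behaves like any other <code>sparklyr</code> connection. As a quick smoke test (a hedged sketch; <code>mtcars</code> stands in for your own data):</p>
<pre tabindex="0"><code>library(dplyr)

# copy a small local dataset to the remote cluster and query it
cars_tbl &lt;- copy_to(sc, mtcars, overwrite = TRUE)
cars_tbl %&gt;% count()
</code></pre>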
<h2 id="structures">Structures
</h2>
<p>If you previously used <code>collect()</code> to deserialize structurally complex Spark data frames into their equivalents in R, you have likely noticed that Spark SQL struct columns were only mapped into JSON strings in R, which was non-ideal. You might also have run into the much-dreaded <code>java.lang.IllegalArgumentException: Invalid type list</code> error when using <code>dplyr</code> to query nested attributes from any struct column of a Spark data frame in <code>sparklyr</code>.</p>
<p>Unfortunately, in real-world Spark use cases, data describing entities comprised of sub-entities (e.g., a product catalog of all hardware components of some computers) often needs to be denormalized / shaped in an object-oriented manner in the form of Spark SQL structs to allow efficient read queries. While <code>sparklyr</code> had the limitations mentioned above, users often had to invent their own workarounds when querying Spark struct columns, which explains why there was popular demand for <code>sparklyr</code> to have better support for such use cases.</p>
<p>The good news is that with <code>sparklyr</code> 1.2, those limitations no longer exist when running with Spark 2.4 or above.</p>
<p>As a concrete example, consider the following catalog of computers:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">computers</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">id</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">attributes</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">processor</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">freq</span> <span class="o">=</span> <span class="m">2.4</span><span class="p">,</span> <span class="n">num_cores</span> <span class="o">=</span> <span class="m">256</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">      <span class="n">price</span> <span class="o">=</span> <span class="m">100</span>
</span></span><span class="line"><span class="cl">   <span class="p">),</span>
</span></span><span class="line"><span class="cl">   <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">     <span class="n">processor</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">freq</span> <span class="o">=</span> <span class="m">1.6</span><span class="p">,</span> <span class="n">num_cores</span> <span class="o">=</span> <span class="m">512</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">     <span class="n">price</span> <span class="o">=</span> <span class="m">133</span>
</span></span><span class="line"><span class="cl">   <span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">computers</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">computers</span><span class="p">,</span> <span class="n">overwrite</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>A typical <code>dplyr</code> use case involving <code>computers</code> would be the following:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">high_freq_computers</span> <span class="o">&lt;-</span> <span class="n">computers</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                       <span class="nf">filter</span><span class="p">(</span><span class="n">attributes.processor.freq</span> <span class="o">&gt;=</span> <span class="m">2</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                       <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>As previously mentioned, before <code>sparklyr</code> 1.2, such a query would fail with <code>Error: java.lang.IllegalArgumentException: Invalid type list</code>.</p>
<p>Whereas with <code>sparklyr</code> 1.2, the expected result is returned in the following form:</p>
<pre tabindex="0"><code># A tibble: 1 x 2
     id attributes
  &lt;int&gt; &lt;list&gt;
1     1 &lt;named list [2]&gt;
</code></pre><p>where <code>high_freq_computers$attributes</code> is what we would expect:</p>
<pre tabindex="0"><code>[[1]]
[[1]]$price
[1] 100
[[1]]$processor
[[1]]$processor$freq
[1] 2.4
[[1]]$processor$num_cores
[1] 256
</code></pre><h2 id="and-more">And More!
</h2>
<p>Last but not least, we heard about a number of pain points <code>sparklyr</code> users have run into, and have addressed many of them in this release as well. For example:</p>
<ul>
<li>Date type in R is now correctly serialized into Spark SQL date type by <code>copy_to()</code></li>
<li><code>&lt;spark dataframe&gt; %&gt;% print(n = 20)</code> now actually prints 20 rows as expected instead of 10</li>
<li><code>spark_connect(master = &quot;local&quot;)</code> will emit a more informative error message if it&rsquo;s failing because the loopback interface is not up</li>
</ul>
<p>&hellip; to name just a few. We want to thank the open source community for their continuous feedback on <code>sparklyr</code>, and are looking forward to incorporating more of that feedback to make <code>sparklyr</code> even better in the future.</p>
<p>Finally, in chronological order, we wish to thank the following individuals for contributing to <code>sparklyr</code> 1.2: <a href="https://github.com/zero323" target="_blank" rel="noopener">zero323</a>
, <a href="https://github.com/Loquats" target="_blank" rel="noopener">Andy Zhang</a>
, <a href="https://github.com/yl790" target="_blank" rel="noopener">Yitao Li</a>
,
<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">Javier Luraschi</a>
, <a href="https://github.com/falaki" target="_blank" rel="noopener">Hossein Falaki</a>
, <a href="https://github.com/lu-wang-dl" target="_blank" rel="noopener">Lu Wang</a>
, <a href="https://github.com/samuelmacedo83" target="_blank" rel="noopener">Samuel Macedo</a>
 and <a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">Jozef Hajnala</a>
. Great job everyone!</p>
<p>If you need to catch up on <code>sparklyr</code>, please visit <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, or some of the previous release posts: <a href="https://blog.rstudio.com/2020/01/29/sparklyr-1-1/" target="_blank" rel="noopener">sparklyr 1.1</a>
 and <a href="https://blog.rstudio.com/2019/03/15/sparklyr-1-0/" target="_blank" rel="noopener">sparklyr 1.0</a>
.</p>
<p>Thank you for reading this post.</p>
<p>This post was originally published on <a href="https://blogs.rstudio.com/ai/" target="_blank" rel="noopener">blogs.rstudio.com/ai/</a>
</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-2/thumbnail.jpg" length="3509" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.2: Foreach, Spark 3.0 and Databricks Connect</title>
      <link>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/</link>
      <pubDate>Tue, 21 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/</guid>
      <dc:creator>Yitao Li</dc:creator><description><![CDATA[<p>Behold the glory that is <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr</a>
 1.2! In this release, the following new hotnesses have emerged into the spotlight:</p>
<ul>
<li>A <code>registerDoSpark</code> method to create a <a href="#foreach">foreach</a>
 parallel backend powered by Spark that enables hundreds of existing R packages to run in Spark.</li>
<li>Support for <a href="#databricks-connect">Databricks Connect</a>
, allowing <code>sparklyr</code> to connect to remote Databricks clusters.</li>
<li>Improved support for Spark <a href="#structures">structures</a>
 when collecting and querying their nested attributes with <code>dplyr</code>.</li>
</ul>
<p>A number of interop issues observed with <code>sparklyr</code> and the Spark 3.0 preview were also addressed recently, in the hope that by the time Spark 3.0 officially graces us with its presence, <code>sparklyr</code> will be fully ready to work with it. Most notably, key features such as <code>spark_submit</code>, <code>sdf_bind_rows</code>, and standalone connections are now finally working with the Spark 3.0 preview.</p>
<p>To install <code>sparklyr</code> 1.2 from CRAN, run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The full list of changes is available in the sparklyr <a href="https://github.com/sparklyr/sparklyr/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h2 id="foreach">Foreach
</h2>
<p>The <code>foreach</code> package provides the <code>%dopar%</code> operator to iterate over elements in a collection in parallel. Using <code>sparklyr</code> 1.2, you can now register Spark as a backend using <code>registerDoSpark()</code> and then easily iterate over R objects using Spark:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">foreach</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">registerDoSpark</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">foreach</span><span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span> <span class="n">.combine</span> <span class="o">=</span> <span class="s">&#39;c&#39;</span><span class="p">)</span> <span class="o">%dopar%</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nf">sqrt</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>[1] 1.000000 1.414214 1.732051
</code></pre>
<p>Since many R packages are based on <code>foreach</code> to perform parallel computation, we can now make use of all those great packages in Spark as well!</p>
<p>For instance, we can use <a href="https://tidymodels.github.io/parsnip/" target="_blank" rel="noopener">parsnip</a>
 and the <a href="https://tidymodels.github.io/tune/" target="_blank" rel="noopener">tune</a>
 package with data from <a href="https://CRAN.R-project.org/package=mlbench" target="_blank" rel="noopener">mlbench</a>
 to perform hyperparameter tuning in Spark with ease:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tune</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">parsnip</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">mlbench</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">data</span><span class="p">(</span><span class="n">Ionosphere</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">svm_rbf</span><span class="p">(</span><span class="n">cost</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">(),</span> <span class="n">rbf_sigma</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">())</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_mode</span><span class="p">(</span><span class="s">&#34;classification&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_engine</span><span class="p">(</span><span class="s">&#34;kernlab&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">tune_grid</span><span class="p">(</span><span class="n">Class</span> <span class="o">~</span> <span class="n">.,</span>
</span></span><span class="line"><span class="cl">    <span class="n">resamples</span> <span class="o">=</span> <span class="n">rsample</span><span class="o">::</span><span class="nf">bootstraps</span><span class="p">(</span><span class="n">dplyr</span><span class="o">::</span><span class="nf">select</span><span class="p">(</span><span class="n">Ionosphere</span><span class="p">,</span> <span class="o">-</span><span class="n">V2</span><span class="p">),</span> <span class="n">times</span> <span class="o">=</span> <span class="m">30</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">control</span> <span class="o">=</span> <span class="nf">control_grid</span><span class="p">(</span><span class="n">verbose</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Bootstrap sampling
# A tibble: 30 x 4
   splits            id          .metrics          .notes
 * &lt;list&gt;            &lt;chr&gt;       &lt;list&gt;            &lt;list&gt;
 1 &lt;split [351/124]&gt; Bootstrap01 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 2 &lt;split [351/126]&gt; Bootstrap02 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 3 &lt;split [351/125]&gt; Bootstrap03 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 4 &lt;split [351/135]&gt; Bootstrap04 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 5 &lt;split [351/127]&gt; Bootstrap05 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 6 &lt;split [351/131]&gt; Bootstrap06 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 7 &lt;split [351/141]&gt; Bootstrap07 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 8 &lt;split [351/123]&gt; Bootstrap08 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
 9 &lt;split [351/118]&gt; Bootstrap09 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
10 &lt;split [351/136]&gt; Bootstrap10 &lt;tibble [10 × 5]&gt; &lt;tibble [0 × 1]&gt;
# … with 20 more rows
</code></pre>
<p>The Spark connection was already registered, so the code ran in Spark without any additional changes. We can verify this was the case by navigating to the Spark web interface:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/images/spark-backend-foreach-package.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<h2 id="databricks-connect">Databricks Connect
</h2>
<p><a href="https://docs.databricks.com/dev-tools/databricks-connect.html" target="_blank" rel="noopener">Databricks Connect</a>
 allows you to connect your favorite IDE (like <a href="https://rstudio.com/products/rstudio/download/" target="_blank" rel="noopener">RStudio</a>
!) to a Spark <a href="https://databricks.com/" target="_blank" rel="noopener">Databricks</a>
 cluster.</p>
<p>You will first have to install the <code>databricks-connect</code> package as described in our <a href="https://github.com/sparklyr/sparklyr#connecting-through-databricks-connect" target="_blank" rel="noopener">README</a>
 and start a Databricks cluster, but once that&rsquo;s ready, connecting to the remote cluster is as easy as running:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">method</span> <span class="o">=</span> <span class="s">&#34;databricks&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="n">spark_home</span> <span class="o">=</span> <span class="nf">system2</span><span class="p">(</span><span class="s">&#34;databricks-connect&#34;</span><span class="p">,</span> <span class="s">&#34;get-spark-home&#34;</span><span class="p">,</span> <span class="n">stdout</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/images/spark-databricks-connect-rstudio.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>That&rsquo;s about it: you are now remotely connected to a Databricks cluster from your local R session.</p>
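<p>Once connected, the usual <code>sparklyr</code> workflow applies, except that the heavy lifting happens on the remote cluster. A minimal sketch, reusing the connection <code>sc</code> from above (the table name is illustrative):</p>

```r
library(sparklyr)
library(dplyr)

# Copy a local data frame up to the Databricks cluster...
cars_tbl <- copy_to(sc, mtcars, "mtcars_remote", overwrite = TRUE)

# ...and query it with dplyr; the computation runs on the cluster and
# only the summarised result is collected back into the local R session
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()
```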
<h2 id="structures">Structures
</h2>
<p>If you previously used <code>collect</code> to deserialize structurally complex Spark dataframes into their equivalents in R, you likely noticed that Spark SQL struct columns were mapped to JSON strings in R, which was not ideal. You might also have run into the much-dreaded <code>java.lang.IllegalArgumentException: Invalid type list</code> error when using <code>dplyr</code> to query nested attributes from any struct column of a Spark dataframe in sparklyr.</p>
<p>Unfortunately, in real-world Spark use cases, data describing entities comprising sub-entities (e.g., a product catalog of all hardware components of some computers) often needs to be denormalized / shaped in an object-oriented manner in the form of Spark SQL structs to allow efficient read queries. Given the limitations mentioned above, users often had to invent their own workarounds when querying Spark struct columns, which explains why there was popular demand for sparklyr to better support such use cases.</p>
<p>The good news is that with <code>sparklyr</code> 1.2, those limitations no longer exist when running with Spark 2.4 or above.</p>
<p>As a concrete example, consider the following catalog of computers:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">computers</span> <span class="o">&lt;-</span> <span class="n">tibble</span><span class="o">::</span><span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">id</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">attributes</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">processor</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">freq</span> <span class="o">=</span> <span class="m">2.4</span><span class="p">,</span> <span class="n">num_cores</span> <span class="o">=</span> <span class="m">256</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">      <span class="n">price</span> <span class="o">=</span> <span class="m">100</span>
</span></span><span class="line"><span class="cl">   <span class="p">),</span>
</span></span><span class="line"><span class="cl">   <span class="nf">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">     <span class="n">processor</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">freq</span> <span class="o">=</span> <span class="m">1.6</span><span class="p">,</span> <span class="n">num_cores</span> <span class="o">=</span> <span class="m">512</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">     <span class="n">price</span> <span class="o">=</span> <span class="m">133</span>
</span></span><span class="line"><span class="cl">   <span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">computers</span> <span class="o">&lt;-</span> <span class="nf">copy_to</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="n">computers</span><span class="p">,</span> <span class="n">overwrite</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>A typical <code>dplyr</code> use case involving <code>computers</code> would be the following:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">high_freq_computers</span> <span class="o">&lt;-</span> <span class="n">computers</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                       <span class="nf">filter</span><span class="p">(</span><span class="n">attributes.processor.freq</span> <span class="o">&gt;=</span> <span class="m">2</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">                       <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>As previously mentioned, before <code>sparklyr</code> 1.2, such a query would fail with <code>Error: java.lang.IllegalArgumentException: Invalid type list</code>.</p>
<p>Whereas with <code>sparklyr</code> 1.2, the expected result is returned in the following form:</p>
<pre><code># A tibble: 1 x 2
     id attributes
  &lt;int&gt; &lt;list&gt;
1     1 &lt;named list [2]&gt;
</code></pre>
<p>where <code>high_freq_computers$attributes</code> is what we would expect:</p>
<pre><code>[[1]]
[[1]]$price
[1] 100

[[1]]$processor
[[1]]$processor$freq
[1] 2.4

[[1]]$processor$num_cores
[1] 256
</code></pre>
<h2 id="and-more">And More!
</h2>
<p>Last but not least, we heard about a number of pain points <code>sparklyr</code> users have run into, and have addressed many of them in this release as well. For example:</p>
<ul>
<li>Date type in R is now correctly serialized into Spark SQL date type by <code>copy_to</code></li>
<li><code>&lt;spark dataframe&gt; %&gt;% print(n = 20)</code> now actually prints 20 rows as expected instead of 10</li>
<li><code>spark_connect(master = &quot;local&quot;)</code> will emit a more informative error message if it&rsquo;s failing because the loopback interface is not up</li>
</ul>
<p>&hellip; to name just a few. We want to thank the open source community for their continuous feedback on <code>sparklyr</code>, and are looking forward to incorporating more of that feedback to make <code>sparklyr</code> even better in the future.</p>
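<p>As a quick illustration of the first two fixes above (a sketch assuming an existing connection <code>sc</code>; table names are illustrative):</p>

```r
library(sparklyr)
library(dplyr)

# Date columns are now serialized as Spark SQL dates, not strings
dates_tbl <- copy_to(
  sc,
  data.frame(d = as.Date(c("2020-01-01", "2020-04-21"))),
  "dates_demo",
  overwrite = TRUE
)
sdf_schema(dates_tbl)  # the `d` column should now report DateType

# print(n = ...) on a Spark dataframe now honors the requested row count
copy_to(sc, data.frame(x = 1:50), "print_demo", overwrite = TRUE) %>%
  print(n = 20)
```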
<p>Finally, in chronological order, we wish to thank the following individuals for contributing to <code>sparklyr</code> 1.2: <a href="https://github.com/zero323" target="_blank" rel="noopener">zero323</a>
, <a href="https://github.com/Loquats" target="_blank" rel="noopener">Andy Zhang</a>
, <a href="https://github.com/yl790" target="_blank" rel="noopener">Yitao Li</a>
,
<a href="https://github.com/javierluraschi" target="_blank" rel="noopener">Javier Luraschi</a>
, <a href="https://github.com/falaki" target="_blank" rel="noopener">Hossein Falaki</a>
, <a href="https://github.com/lu-wang-dl" target="_blank" rel="noopener">Lu Wang</a>
, <a href="https://github.com/samuelmacedo83" target="_blank" rel="noopener">Samuel Macedo</a>
 and <a href="https://github.com/jozefhajnala" target="_blank" rel="noopener">Jozef Hajnala</a>
. Great job everyone!</p>
<p>If you need to catch up on <code>sparklyr</code>, please visit <a href="https://sparklyr.ai" target="_blank" rel="noopener">sparklyr.ai</a>
, <a href="https://spark.rstudio.com" target="_blank" rel="noopener">spark.rstudio.com</a>
, or some of the previous release posts: <a href="https://blog.rstudio.com/2020/01/29/sparklyr-1-1/" target="_blank" rel="noopener">sparklyr 1.1</a>
 and <a href="https://blog.rstudio.com/2019/03/15/sparklyr-1-0/" target="_blank" rel="noopener">sparklyr 1.0</a>
.</p>
<p>Thank you for reading this post.</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/sparklyr-1.2/thumbnail.png" length="25088" type="image/png" />
    </item>
    <item>
      <title>pins 0.4: Versioning</title>
      <link>https://posit-open-source.netlify.app/blog/ai/pins-0.4/</link>
      <pubDate>Mon, 13 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/ai/pins-0.4/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<p>A new version of <code>pins</code> is available on CRAN today, which adds support for <a href="http://pins.rstudio.com/articles/advanced-versions.html" target="_blank" rel="noopener">versioning</a>
 your datasets and <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean Spaces</a>
 boards!</p>
<p>As a quick recap, the pins package allows you to cache, discover and share resources. You can use <code>pins</code> in a wide range of situations, from downloading a dataset from a URL to creating complex automation workflows (learn more at <a href="https://pins.rstudio.com" target="_blank" rel="noopener">pins.rstudio.com</a>
). You can also use <code>pins</code> in combination with TensorFlow and Keras; for instance, use <a href="https://tensorflow.rstudio.com/tools/cloudml" target="_blank" rel="noopener">cloudml</a>
 to train models in cloud GPUs, but rather than manually copying files into the GPU instance, you can store them as pins directly from R.</p>
<p>To install this new version of <code>pins</code> from CRAN, simply run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;pins&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can find a detailed list of improvements in the pins <a href="https://github.com/rstudio/pins/blob/master/NEWS.md" target="_blank" rel="noopener">NEWS</a>
 file.</p>
<h1 id="versioning">Versioning
</h1>
<p>To illustrate the new versioning functionality, let&rsquo;s start by downloading and caching a remote dataset with pins. For this example, we will download the weather in London, which happens to be in JSON format and requires <code>jsonlite</code> to parse:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">weather_url</span> <span class="o">&lt;-</span> <span class="s">&#34;https://samples.openweathermap.org/data/2.5/weather?q=London,uk&amp;appid=b6907d289e10d714a6e88b30761fae22&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">weather_url</span><span class="p">,</span> <span class="s">&#34;weather&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">jsonlite</span><span class="o">::</span><span class="nf">read_json</span><span class="p">()</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">as.data.frame</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code>  coord.lon coord.lat weather.id weather.main     weather.description weather.icon
1     -0.13     51.51        300      Drizzle light intensity drizzle          09d
</code></pre>
<p>One advantage of using <code>pins</code> is that, even if the URL or your internet connection becomes unavailable, the above code will still work.</p>
<p>But back to <code>pins 0.4</code>! The new <code>signature</code> parameter in <code>pin_info()</code> allows you to retrieve the &ldquo;version&rdquo; of this dataset:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_info</span><span class="p">(</span><span class="s">&#34;weather&#34;</span><span class="p">,</span> <span class="n">signature</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># Source: local&lt;weather&gt; [files]
# Signature: 624cca260666c6f090b93c37fd76878e3a12a79b
# Properties:
#   - path: weather
</code></pre>
<p>You can then validate the remote dataset has not changed by specifying its signature:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">weather_url</span><span class="p">,</span> <span class="s">&#34;weather&#34;</span><span class="p">,</span> <span class="n">signature</span> <span class="o">=</span> <span class="s">&#34;624cca260666c6f090b93c37fd76878e3a12a79b&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">jsonlite</span><span class="o">::</span><span class="nf">read_json</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>If the remote dataset changes, <code>pin()</code> will fail and you can take the appropriate steps to accept the changes by updating the signature or properly updating your code. The previous example is useful as a way of detecting version changes, but we might also want to retrieve specific versions even when the dataset changes.</p>
<p><code>pins 0.4</code> allows you to display and retrieve versions from services like GitHub, Kaggle and RStudio Connect. Even in boards that don&rsquo;t support versioning natively, you can opt-in by registering a board with <code>versions = TRUE</code>.</p>
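<p>For instance, local boards do not version pins by default, but you can opt in when registering one. A minimal sketch (board name is illustrative):</p>

```r
library(pins)

# Register a local board with versioning explicitly enabled
board_register_local(name = "local_versioned", versions = TRUE)

# Pinning twice under the same name now keeps both versions around
pin(iris,   name = "my_data", board = "local_versioned")
pin(mtcars, name = "my_data", board = "local_versioned")

pin_versions("my_data", board = "local_versioned")
```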
<p>To keep this simple, let&rsquo;s focus on GitHub first. We will register a GitHub board and pin a dataset to it. Notice that, for GitHub boards, you can also use the <code>commit</code> parameter to provide the commit message for this change.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">board_register_github</span><span class="p">(</span><span class="n">repo</span> <span class="o">=</span> <span class="s">&#34;javierluraschi/datasets&#34;</span><span class="p">,</span> <span class="n">branch</span> <span class="o">=</span> <span class="s">&#34;datasets&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">iris</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;github&#34;</span><span class="p">,</span> <span class="n">commit</span> <span class="o">=</span> <span class="s">&#34;use iris as the main dataset&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Now suppose that a colleague comes along and updates this dataset as well:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;github&#34;</span><span class="p">,</span> <span class="n">commit</span> <span class="o">=</span> <span class="s">&#34;slight preference to mtcars&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>From now on, your code could be broken or, even worse, produce incorrect results!</p>
<p>However, since GitHub was designed as a version control system and <code>pins 0.4</code> adds support for <code>pin_versions()</code>, we can now explore particular versions of this dataset:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_versions</span><span class="p">(</span><span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;github&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># A tibble: 2 x 4
  version created              author         message                     
  &lt;chr&gt;   &lt;chr&gt;                &lt;chr&gt;          &lt;chr&gt;                       
1 6e6c320 2020-04-02T21:28:07Z javierluraschi slight preference to mtcars 
2 01f8ddf 2020-04-02T21:27:59Z javierluraschi use iris as the main dataset
</code></pre>
<p>You can then retrieve the version you are interested in as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;01f8ddf&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;github&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt; &lt;fct&gt;  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows
</code></pre>
<p>You can follow similar steps for <a href="http://pins.rstudio.com/articles/boards-rsconnect.html" target="_blank" rel="noopener">RStudio Connect</a>
 and <a href="http://pins.rstudio.com/articles/boards-kaggle.html" target="_blank" rel="noopener">Kaggle</a>
 boards, even for existing pins! Other boards like <a href="http://pins.rstudio.com/articles/boards-s3.html" target="_blank" rel="noopener">Amazon S3</a>
, <a href="http://pins.rstudio.com/articles/boards-gcloud.html" target="_blank" rel="noopener">Google Cloud</a>
, <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">Digital Ocean</a>
 and <a href="http://pins.rstudio.com/articles/boards-azure.html" target="_blank" rel="noopener">Microsoft Azure</a>
 require that you explicitly enable versioning when registering your boards.</p>
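<p>For example, to register an Amazon S3 board with versioning enabled, set <code>versions</code> to <code>TRUE</code> at registration time (the bucket name and credentials below are placeholders):</p>
<pre tabindex="0"><code class="language-r">library(pins)

# Register an S3 board with versioning enabled; replace the
# bucket, key, and secret with your own values.
board_register_s3(bucket   = "pins-example-bucket",
                  key      = Sys.getenv("AWS_ACCESS_KEY_ID"),
                  secret   = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
                  versions = TRUE)
</code></pre>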
<h1 id="digitalocean">DigitalOcean
</h1>
<p>To try out the new <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean Spaces board</a>
, first you will have to register this board and enable versioning by setting <code>versions</code> to <code>TRUE</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">board_register_dospace</span><span class="p">(</span><span class="n">space</span> <span class="o">=</span> <span class="s">&#34;pinstest&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">key</span> <span class="o">=</span> <span class="s">&#34;AAAAAAAAAAAAAAAAAAAA&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">secret</span> <span class="o">=</span> <span class="s">&#34;ABCABCABCABCABCABCABCABCABCABCABCABCABCA==&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">datacenter</span> <span class="o">=</span> <span class="s">&#34;sfo2&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                       <span class="n">versions</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can then use all the functionality pins provides, including versioning:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># create pin and replace content in digitalocean</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">iris</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;pinstest&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;pinstest&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># retrieve versions from digitalocean</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin_versions</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;versioned&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;pinstest&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre><code># A tibble: 2 x 1
  version
  &lt;chr&gt;  
1 c35da04
2 d9034cd
</code></pre>
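<p>As with the GitHub board shown earlier, you can then pass one of these hashes to <code>pin_get()</code> to retrieve that specific version:</p>
<pre tabindex="0"><code class="language-r"># Retrieve a specific version of the pin by its hash
pin_get("versioned", version = "c35da04", board = "pinstest")
</code></pre>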
<p>Notice that enabling versioning in cloud services requires additional storage space for each stored version of the dataset:</p>
<img src="https://posit-open-source.netlify.app/blog/ai/pins-0.4/images/digitalocean-spaces-pins-versioned.png" style="width:100.0%" />
<p>To learn more visit the <a href="http://pins.rstudio.com/articles/advanced-versions.html" target="_blank" rel="noopener">Versioning</a>
 and <a href="http://pins.rstudio.com/articles/boards-dospace.html" target="_blank" rel="noopener">DigitalOcean</a>
 articles. To catch up with previous releases:</p>
<ul>
<li><a href="http://pins.rstudio.com/blog/posts/pins-0-3-0/" target="_blank" rel="noopener">pins 0.3</a>
: Azure, GCloud and S3</li>
<li><a href="https://blog.rstudio.com/2019/09/09/pin-discover-and-share-resources/" target="_blank" rel="noopener">pins 0.2</a>
: Pin, Discover and Share Resources</li>
</ul>
<p>Thanks for reading along!</p>
]]></description>
      <enclosure url="https://posit-open-source.netlify.app/blog/ai/pins-0.4/thumbnail.jpg" length="51651" type="image/jpeg" />
    </item>
    <item>
      <title>sparklyr 1.1: Foundations, Books, Lakes and Barriers</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-1/</link>
      <pubDate>Wed, 29 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/sparklyr-1-1/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<img src="https://posit-open-source.netlify.app/blog-images/2020-01-29-sparklyr-1-1-linux-foundation-roadmap.png" style="display: none;" alt="Linux Foundation roadmap projects and sparklyr"/>
<p>Today we are excited to share that <a href="https://github.com/sparklyr/sparklyr" target="_blank" rel="noopener">sparklyr</a>
 <code>1.1</code> is now available on <a href="https://CRAN.R-project.org/package=sparklyr" target="_blank" rel="noopener">CRAN</a>
!</p>
<p>In a nutshell, you can use sparklyr to scale datasets across computing clusters running <a href="http://spark.apache.org" target="_blank" rel="noopener">Apache Spark</a>
. For this particular release, we would like to highlight the following new features:</p>
<ul>
<li><strong><a href="#delta-lake">Delta Lake</a>
</strong> enables database-like properties in Spark.</li>
<li><strong><a href="#spark-3-0">Spark 3.0</a>
</strong> preview is now available through sparklyr.</li>
<li><strong><a href="#barrier-execution">Barrier Execution</a>
</strong> paves the way to use Spark with deep learning frameworks.</li>
<li><strong><a href="#qubole">Qubole</a>
</strong> clusters running Spark can be easily used with sparklyr.</li>
</ul>
<p>In addition, new community <strong><a href="#extensions">Extensions</a>
</strong> enable natural language processing and genomics, sparklyr is now being hosted within the <strong><a href="#linux-foundation">Linux Foundation</a>
</strong>, and the <strong><a href="#mastering-spark-with-r">Mastering Spark with R</a>
</strong> book is now available and free-to-use online.</p>
<p>You can install <code>sparklyr 1.1</code> from CRAN as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;sparklyr&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="delta-lake">Delta Lake
</h2>
<p>The <a href="https://delta.io/" target="_blank" rel="noopener">Delta Lake</a>
 project is an open-source storage layer that brings <a href="https://en.wikipedia.org/wiki/ACID" target="_blank" rel="noopener">ACID transactions</a>
 to Apache Spark. To use Delta Lake, first connect using the new <code>packages</code> parameter set to <code>&quot;delta&quot;</code>.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">,</span> <span class="n">packages</span> <span class="o">=</span> <span class="s">&#34;delta&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>As a simple example, let&rsquo;s write a small data frame to Delta using <code>spark_write_delta()</code>, overwrite it, and then read it back with <code>spark_read_delta()</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">5</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">spark_write_delta</span><span class="p">(</span><span class="n">path</span> <span class="o">=</span> <span class="s">&#34;delta-test&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">3</span><span class="p">)</span> <span class="o">%&gt;%</span> <span class="nf">spark_write_delta</span><span class="p">(</span><span class="n">path</span> <span class="o">=</span> <span class="s">&#34;delta-test&#34;</span><span class="p">,</span> <span class="n">mode</span> <span class="o">=</span> <span class="s">&#34;overwrite&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">spark_read_delta</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;delta-test&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: spark&lt;delta1&gt; [?? x 1]
     id
  &lt;int&gt;
1     1
2     2
3     3
</code></pre><p>Now, since Delta tracks all versions of your data, you can easily time travel and retrieve the version we just overwrote:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">spark_read_delta</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&#34;delta-test&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="m">0L</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: spark&lt;delta1&gt; [?? x 1]
     id
  &lt;int&gt;
1     1
2     2
3     3
4     4
5     5
</code></pre><h2 id="spark-30">Spark 3.0
</h2>
<p>To install and try out Spark 3.0 preview, simply run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_install</span><span class="p">(</span><span class="s">&#34;3.0.0-preview&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;3.0.0-preview&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can then preview upcoming features, like the ability to read binary files. To demonstrate this, we can use <a href="https://blog.rstudio.com/2019/09/09/pin-discover-and-share-resources/" target="_blank" rel="noopener">pins</a>
 to download a 237MB subset of <a href="http://www.image-net.org/" target="_blank" rel="noopener">ImageNet</a>
, and then load the images into Spark:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tiny_imagenet</span> <span class="o">&lt;-</span> <span class="n">pins</span><span class="o">::</span><span class="nf">pin</span><span class="p">(</span><span class="s">&#34;http://cs231n.stanford.edu/tiny-imagenet-200.zip&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">spark_read_source</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="nf">dirname</span><span class="p">(</span><span class="n">tiny_imagenet[1]</span><span class="p">),</span> <span class="n">source</span> <span class="o">=</span> <span class="s">&#34;binaryFile&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: spark&lt;images&gt; [?? x 4]
   path                       modificationTime    length content   
   &lt;chr&gt;                      &lt;dttm&gt;               &lt;dbl&gt; &lt;list&gt;    
 1 file:images/test_2009.JPEG 2020-01-08 20:36:41   3138 &lt;raw [3,138]&gt;
 2 file:images/test_8245.JPEG 2020-01-08 20:36:43   3066 &lt;raw [3,066]&gt;
 3 file:images/test_4186.JPEG 2020-01-08 20:36:42   2998 &lt;raw [2,998]&gt;
 4 file:images/test_403.JPEG  2020-01-08 20:36:39   2980 &lt;raw [2,980]&gt;
 5 file:images/test_8544.JPEG 2020-01-08 20:36:38   2958 &lt;raw [2,958]&gt;
 6 file:images/test_5814.JPEG 2020-01-08 20:36:38   2929 &lt;raw [2,929]&gt;
 7 file:images/test_1063.JPEG 2020-01-08 20:36:41   2920 &lt;raw [2,920]&gt;
 8 file:images/test_1942.JPEG 2020-01-08 20:36:39   2908 &lt;raw [2,908]&gt;
 9 file:images/test_5456.JPEG 2020-01-08 20:36:42   2906 &lt;raw [2,906]&gt;
10 file:images/test_5859.JPEG 2020-01-08 20:36:39   2896 &lt;raw [2,896]&gt;
# … with more rows
</code></pre><p>Note that the <a href="https://spark.apache.org/news/spark-3.0.0-preview.html" target="_blank" rel="noopener">Spark 3.0.0 preview</a>
 is not a stable release in terms of either API or functionality.</p>
<h2 id="barrier-execution">Barrier Execution
</h2>
<p>Barrier execution is a new feature introduced in <a href="https://spark.apache.org/releases/spark-release-2-4-0.html" target="_blank" rel="noopener">Spark 2.4</a>
 which enables deep learning on Apache Spark by adding an all-or-nothing scheduler. This allows Spark not only to process analytic workflows, but also to act as a high-performance computing cluster where other frameworks, like <a href="https://www.openmp.org/" target="_blank" rel="noopener">OpenMP</a>
 or <a href="https://www.tensorflow.org/guide/distributed_training" target="_blank" rel="noopener">TensorFlow Distributed</a>
, can reuse cluster machines and have them directly communicate with each other for a given task.</p>
<p>In general, we don&rsquo;t expect most users to use this feature directly; instead, this is a feature relevant to advanced users interested in creating extensions that support additional modeling frameworks. You can learn more about barrier execution in Reynold Xin&rsquo;s <a href="https://vimeo.com/274267107" target="_blank" rel="noopener">keynote</a>
.</p>
<p>To use barrier execution from R, set the <code>barrier = TRUE</code> parameter in <code>spark_apply()</code> and then use the new barrier-context argument of the R closure (accessed as <code>.y</code> in the formula below) to retrieve the network addresses of the nodes available for this task. A simple example follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sdf_len</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">repartition</span> <span class="o">=</span> <span class="m">1</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">spark_apply</span><span class="p">(</span><span class="o">~</span> <span class="n">.y</span><span class="o">$</span><span class="n">address</span><span class="p">,</span> <span class="n">barrier</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="n">address</span> <span class="o">=</span> <span class="s">&#34;character&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">collect</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># A tibble: 1 x 1
  address        
  &lt;chr&gt;          
1 localhost:50693
</code></pre><h2 id="qubole">Qubole
</h2>
<p><a href="https://www.qubole.com/product/data-platform/" target="_blank" rel="noopener">Qubole</a>
 is a fully self-service multi-cloud data platform based on enterprise-grade data processing engines including Apache Spark.</p>
<p>If you are using Qubole clusters, you can now easily connect to Spark through the new <code>&quot;qubole&quot;</code> connection method:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">method</span> <span class="o">=</span> <span class="s">&#34;qubole&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Once connected, you can use Spark and R as usual. To learn more, visit <a href="https://docs.qubole.com/en/latest/user-guide/engines/spark/rstudio_spark.html" target="_blank" rel="noopener">RStudio for Running Distributed R Jobs</a>
.</p>
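<p>As a quick illustration (generic sparklyr usage, not specific to Qubole), once <code>sc</code> is connected you can copy a local data frame to the cluster and query it with <code>dplyr</code>:</p>
<pre tabindex="0"><code class="language-r">library(dplyr)

# Copy a local data frame into Spark and summarize it remotely
mtcars_tbl &lt;- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %&gt;%
  group_by(cyl) %&gt;%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))
</code></pre>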
<h2 id="extensions">Extensions
</h2>
<p>The new <a href="https://github.com/r-spark" target="_blank" rel="noopener">github.com/r-spark</a>
 repo contains new community extensions. To mention a few, <a href="https://CRAN.R-project.org/package=variantspark" target="_blank" rel="noopener">variantspark</a>
 and <a href="https://CRAN.R-project.org/package=sparkhail" target="_blank" rel="noopener">sparkhail</a>
 are two new extensions for genomic research, and <a href="https://github.com/r-spark/sparknlp" target="_blank" rel="noopener">sparknlp</a>
 adds support for natural language processing.</p>
<p>For those of you with a background in genomics, you can use <code>sparkhail</code> by first installing the extension from CRAN, then connecting to Spark, creating a Hail context, and loading a subset of the <a href="https://www.internationalgenome.org/data/" target="_blank" rel="noopener">1000 Genomes</a>
 dataset using <a href="https://hail.is/" target="_blank" rel="noopener">Hail</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparklyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">sparkhail</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sc</span> <span class="o">&lt;-</span> <span class="nf">spark_connect</span><span class="p">(</span><span class="n">master</span> <span class="o">=</span> <span class="s">&#34;local&#34;</span><span class="p">,</span> <span class="n">version</span> <span class="o">=</span> <span class="s">&#34;2.4&#34;</span><span class="p">,</span> <span class="n">config</span> <span class="o">=</span> <span class="nf">hail_config</span><span class="p">())</span>
</span></span><span class="line"><span class="cl"><span class="n">hc</span> <span class="o">&lt;-</span> <span class="nf">hail_context</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">hail_data</span> <span class="o">&lt;-</span> <span class="n">pins</span><span class="o">::</span><span class="nf">pin</span><span class="p">(</span><span class="s">&#34;https://github.com/r-spark/sparkhail/blob/master/inst/extdata/1kg.zip?raw=true&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">hail_df</span> <span class="o">&lt;-</span> <span class="nf">hail_read_matrix</span><span class="p">(</span><span class="n">hc</span><span class="p">,</span> <span class="nf">file.path</span><span class="p">(</span><span class="nf">dirname</span><span class="p">(</span><span class="n">hail_data[1]</span><span class="p">),</span> <span class="s">&#34;1kg.mt&#34;</span><span class="p">))</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">hail_dataframe</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>You can then analyze it with packages like <code>dplyr</code>, <code>sparklyr.nested</code>, and <code>dbplot</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">sdf_separate_column</span><span class="p">(</span><span class="n">hail_df</span><span class="p">,</span> <span class="s">&#34;alleles&#34;</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">group_by</span><span class="p">(</span><span class="n">alleles_1</span><span class="p">,</span> <span class="n">alleles_2</span><span class="p">)</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">tally</span><span class="p">()</span> <span class="o">%&gt;%</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">arrange</span><span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source:     spark&lt;?&gt; [?? x 3]
# Groups:     alleles_1
# Ordered by: -n
   alleles_1 alleles_2     n
   &lt;chr&gt;     &lt;chr&gt;     &lt;dbl&gt;
 1 C         T          2436
 2 G         A          2387
 3 A         G          1944
 4 T         C          1879
 5 C         A           496
 6 G         T           480
 7 T         G           468
 8 A         C           454
 9 C         G           150
10 G         C           112
# … with more rows
</code></pre><p>Notice that these frequencies come in pairs: C/T and G/A are actually the same mutation, just viewed from opposite strands. You can then create a histogram over the DP field (the read depth of the proband) as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sparklyr.nested</span><span class="o">::</span><span class="nf">sdf_select</span><span class="p">(</span><span class="n">hail_df</span><span class="p">,</span> <span class="n">dp</span> <span class="o">=</span> <span class="n">info.DP</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">dbplot</span><span class="o">::</span><span class="nf">dbplot_histogram</span><span class="p">(</span><span class="n">dp</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><img src="https://posit-open-source.netlify.app/blog-images/2020-01-29-sparklyr-1-1-hail-histogram-pd.png" alt="Apache Spark, Hail, R, and sparklyr histogram"/>
<p>This code was adapted from Hail&rsquo;s <a href="https://hail.is/docs/0.2/tutorials/01-genome-wide-association-study.html" target="_blank" rel="noopener">Genome-Wide Association Study</a>
 tutorial. You can learn more about this Hail community extension at <a href="https://github.com/r-spark/sparkhail" target="_blank" rel="noopener">r-spark/sparkhail</a>
.</p>
<h2 id="linux-foundation">Linux Foundation
</h2>
<p>The <a href="https://www.linuxfoundation.org" target="_blank" rel="noopener">Linux Foundation</a>
 is home of projects such as <a href="https://www.linuxfoundation.org/projects/linux/" target="_blank" rel="noopener">Linux</a>
, <a href="https://kubernetes.io/" target="_blank" rel="noopener">Kubernetes</a>
, <a href="https://js.foundation/" target="_blank" rel="noopener">Node.js</a>
 and umbrella foundations such as <a href="https://lfai.foundation/" target="_blank" rel="noopener">LF AI</a>
, <a href="https://www.lfedge.org/" target="_blank" rel="noopener">LF Edge</a>
, and <a href="https://www.lfnetworking.org/" target="_blank" rel="noopener">LF Networking</a>
. We are very excited to have sparklyr be hosted as an incubation project within LF AI alongside <a href="https://www.acumos.org/" target="_blank" rel="noopener">Acumos</a>
, <a href="https://lfai.foundation/projects/angel-ml/" target="_blank" rel="noopener">Angel</a>
, <a href="https://lfai.foundation/projects/horovod/" target="_blank" rel="noopener">Horovod</a>
, <a href="https://pyro.ai/" target="_blank" rel="noopener">Pyro</a>
, <a href="https://onnx.ai/" target="_blank" rel="noopener">ONNX</a>
 and several others.</p>
<p>Hosting sparklyr in LF AI within the Linux Foundation provides a neutral entity to hold the project&rsquo;s assets under open governance. Furthermore, we believe hosting with LF AI will also help bring additional talent, ideas, and shared components from other Linux Foundation projects like <a href="https://delta.io" target="_blank" rel="noopener">Delta Lake</a>
, <a href="https://eng.uber.com/horovod/" target="_blank" rel="noopener">Horovod</a>
, <a href="https://onnx.ai" target="_blank" rel="noopener">ONNX</a>
, and so on into sparklyr as part of cross-project and cross-foundation collaboration.</p>
<p>This makes it a great time for you to join the sparklyr community, contribute, and help this project grow. You can learn more about this in <a href="https://sparklyr.org" target="_blank" rel="noopener">sparklyr.org</a>
.</p>
<h2 id="mastering-spark-with-r">Mastering Spark with R
</h2>
<p><a href="https://therinspark.com" target="_blank" rel="noopener">Mastering Spark with R</a>
 is a new book to help you learn and master Apache Spark with R from start to finish. It introduces data analysis with well-known tools like <a href="https://dplyr.tidyverse.org/" target="_blank" rel="noopener">dplyr</a>
, and covers everything else related to processing large-scale datasets, modeling, productionizing pipelines, using extensions, distributing R code, and processing real-time data &ndash; if you are not yet familiar with Spark, this is a great resource to get started!</p>
<p><a href="https://therinspark.com"><img src="/blog-images/2020-01-29-sparklyr-1-1-book-cover.jpg" width="200px" alt="Mastering Spark with R book cover"/></a></p>
<p>This book was published by <a href="http://shop.oreilly.com/product/0636920223764.do" target="_blank" rel="noopener">O&rsquo;Reilly</a>
, is available on <a href="https://www.amazon.com/gp/product/149204637X" target="_blank" rel="noopener">Amazon</a>
, and is also free-to-use <a href="https://therinspark.com/" target="_blank" rel="noopener">online</a>
. We hope you find this book useful and easy to read.</p>
<p>To catch up on previous releases, take a look at the <a href="https://blog.rstudio.com/2019/03/15/sparklyr-1-0/" target="_blank" rel="noopener">sparklyr 1.0</a>
 post or watch various video tutorials in the <a href="https://www.youtube.com/channel/UCAwJMtPx4HgmMXEDTvZBJ4A/playlists" target="_blank" rel="noopener">mlverse</a>
 channel.</p>
<p>Thank you for reading along!</p>
]]></description>
    </item>
    <item>
      <title>pins 0.3.0: Azure, GCloud and S3</title>
      <link>https://posit-open-source.netlify.app/blog/rstudio/pins-0-3-0-azure-gcloud-and-s3/</link>
      <pubDate>Thu, 28 Nov 2019 00:00:00 +0000</pubDate>
      <guid>https://posit-open-source.netlify.app/blog/rstudio/pins-0-3-0-azure-gcloud-and-s3/</guid>
      <dc:creator>Javier Luraschi</dc:creator><description><![CDATA[<p>A new version of <code>pins</code> is available on CRAN! <code>pins 0.3</code> comes with many improvements and the following major features:</p>
<ul>
<li>Retrieve <strong>pin information</strong> with <code>pin_info()</code> including properties particular to each board.</li>
</ul>
<p>You can install this new version from CRAN as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;pins&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In addition, there is a new <a href="https://rstudio.github.io/pins/articles/use-cases.html" target="_blank" rel="noopener">Use Cases</a>
 section in our docs, various improvements (see <a href="https://rstudio.github.io/pins/news/index.html" target="_blank" rel="noopener">NEWS</a>
) and two community extensions being developed to support <a href="https://rstudio.github.io/connections/#pins" target="_blank" rel="noopener">databases</a>
 and <a href="https://gitlab.com/gwmngilfen/nextcloudr" target="_blank" rel="noopener">Nextcloud</a>
 as boards.</p>
<h2 id="cloud-boards">Cloud Boards
</h2>
<p><code>pins 0.3</code> adds support for finding, retrieving, and storing resources in various cloud providers, including <a href="https://azure.microsoft.com/" target="_blank" rel="noopener">Microsoft Azure</a>
, <a href="https://cloud.google.com/" target="_blank" rel="noopener">Google Cloud</a>
 and <a href="https://aws.amazon.com/" target="_blank" rel="noopener">Amazon Web Services</a>
.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/pins-0-3-0-azure-gcloud-and-s3/images/pins-cloud-boards-azure-gcloud-s3.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>To illustrate how they work, let&rsquo;s first try to find the World Bank indicators dataset on <a href="https://www.kaggle.com/" target="_blank" rel="noopener">Kaggle</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">pins</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">pin_find</span><span class="p">(</span><span class="s">&#34;indicators&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;kaggle&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># A tibble: 6 x 4
  name                                            description                             type  board 
  &lt;chr&gt;                                           &lt;chr&gt;                                   &lt;chr&gt; &lt;chr&gt; 
1 worldbank/world-development-indicators          World Development Indicators            files kaggle
2 theworldbank/world-development-indicators       World Development Indicators            files kaggle
3 cdc/chronic-disease                             Chronic Disease Indicators              files kaggle
4 bigquery/worldbank-wdi                          World Development Indicators (WDI) Data files kaggle
5 rajanand/key-indicators-of-annual-health-survey Health Analytics                        files kaggle
6 loveall/human-happiness-indicators              Human Happiness Indicators              files kaggle
</code></pre><p>We can then easily download any of these with <code>pin_get()</code>; be aware that this is a 2GB download:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;worldbank/world-development-indicators&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code>[1] &#34;/.../worldbank/world-development-indicators/Country.csv&#34;     
[2] &#34;/.../worldbank/world-development-indicators/CountryNotes.csv&#34;
[3] &#34;/.../worldbank/world-development-indicators/database.sqlite&#34; 
[4] &#34;/.../worldbank/world-development-indicators/Footnotes.csv&#34;   
[5] &#34;/.../worldbank/world-development-indicators/hashes.txt&#34;      
[6] &#34;/.../worldbank/world-development-indicators/Indicators.csv&#34;  
[7] &#34;/.../worldbank/world-development-indicators/Series.csv&#34;      
[8] &#34;/.../worldbank/world-development-indicators/SeriesNotes.csv&#34; 
</code></pre><p>The <code>Indicators.csv</code> file contains all the indicators, so let&rsquo;s load it with <a href="https://readr.tidyverse.org/" target="_blank" rel="noopener">readr</a>
:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">indicators</span> <span class="o">&lt;-</span> <span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;worldbank/world-development-indicators&#34;</span><span class="p">)</span><span class="n">[6]</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="n">readr</span><span class="o">::</span><span class="nf">read_csv</span><span class="p">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Analysing this dataset would be quite interesting; however, this post focuses on how to share this in S3, Google Cloud or Azure storage. More specifically, we will learn how to publish to an <a href="https://pins.rstudio.com/articles/boards-s3.html" target="_blank" rel="noopener">S3 board</a>
. To publish to other cloud providers, take a look at the <a href="https://pins.rstudio.com/articles/boards-gcloud.html" target="_blank" rel="noopener">Google Cloud</a>
 and <a href="https://pins.rstudio.com/articles/boards-azure.html" target="_blank" rel="noopener">Azure boards</a>
 articles.</p>
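<p>For completeness, here is a rough sketch of what registering those other cloud boards looks like. The board names are arbitrary labels, the bucket, container, and account values are placeholders read from environment variables, and the exact authentication options should be confirmed against the linked articles:</p>

```r
library(pins)

# Sketch: register a Google Cloud Storage board; the bucket name is a
# placeholder read from an environment variable, as described in the
# linked Google Cloud article.
board_register_gcloud(name = "gcloud",
                      bucket = Sys.getenv("GCLOUD_STORAGE_BUCKET"))

# Sketch: register an Azure blob storage board; container, account,
# and key are placeholders stored as environment variables.
board_register_azure(name = "azure",
                     container = Sys.getenv("AZURE_STORAGE_CONTAINER"),
                     account = Sys.getenv("AZURE_STORAGE_ACCOUNT"),
                     key = Sys.getenv("AZURE_STORAGE_KEY"))
```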
<p>As you would expect, the first step is to register the S3 board. When using RStudio, you can use the <a href="https://pins.rstudio.com/articles/pins-rstudio.html" target="_blank" rel="noopener">New Connection</a>
 action to guide you through this process, or you can specify your <code>key</code> and <code>secret</code> as follows. Please refer to the <a href="https://pins.rstudio.com/articles/boards-s3.html" target="_blank" rel="noopener">S3 board</a>
 article to understand how to store your credentials securely.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">board_register_s3</span><span class="p">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">&#34;rpins&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">bucket</span>  <span class="o">=</span> <span class="s">&#34;rpins&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">key</span> <span class="o">=</span> <span class="s">&#34;VerySecretKey&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="n">secret</span> <span class="o">=</span> <span class="s">&#34;EvenMoreImportantSecret&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>With the S3 board registered, we can now pin the indicators dataset with <code>pin()</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin</span><span class="p">(</span><span class="n">indicators</span><span class="p">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">&#34;worldbank/indicators&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;rpins&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>That&rsquo;s about it! We can now find and retrieve this dataset from S3 with <code>pin_find()</code> and <code>pin_get()</code>, or view the uploaded resources in the S3 management console:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://posit-open-source.netlify.app/blog/rstudio/pins-0-3-0-azure-gcloud-and-s3/images/pins-upload-s3-results.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>To make this even easier for others to consume, we can make this S3 bucket public, which means anyone can connect to this board without having to configure S3, making it possible to retrieve this dataset with a single line of R code!</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">pins</span><span class="o">::</span><span class="nf">pin_get</span><span class="p">(</span><span class="s">&#34;worldbank/indicators&#34;</span><span class="p">,</span> <span class="s">&#34;https://rpins.s3.amazonaws.com&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># A tibble: 5,656,458 x 6
   CountryName CountryCode IndicatorName                          IndicatorCode    Year      Value
   &lt;chr&gt;       &lt;chr&gt;       &lt;chr&gt;                                  &lt;chr&gt;           &lt;dbl&gt;      &lt;dbl&gt;
 1 Arab World  ARB         Adolescent fertility rate (births per… SP.ADO.TFRT      1960    1.34e+2
 2 Arab World  ARB         Age dependency ratio (% of working-ag… SP.POP.DPND      1960    8.78e+1
 3 Arab World  ARB         Age dependency ratio, old (% of worki… SP.POP.DPND.OL   1960    6.63e+0
 4 Arab World  ARB         Age dependency ratio, young (% of wor… SP.POP.DPND.YG   1960    8.10e+1
 5 Arab World  ARB         Arms exports (SIPRI trend indicator v… MS.MIL.XPRT.KD   1960    3.00e+6
 6 Arab World  ARB         Arms imports (SIPRI trend indicator v… MS.MIL.MPRT.KD   1960    5.38e+8
 7 Arab World  ARB         Birth rate, crude (per 1,000 people)   SP.DYN.CBRT.IN   1960    4.77e+1
 8 Arab World  ARB         CO2 emissions (kt)                     EN.ATM.CO2E.KT   1960    5.96e+4
 9 Arab World  ARB         CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC   1960    6.44e-1
10 Arab World  ARB         CO2 emissions from gaseous fuel consu… EN.ATM.CO2E.GF…  1960    5.04e+0
# … with 5,656,448 more rows
</code></pre><p>This works because <code>pins 0.3</code> automatically registers URLs as a <a href="https://pins.rstudio.com/articles/boards-websites.html" target="_blank" rel="noopener">website board</a>
 to save you from having to explicitly call <code>board_register_datatxt()</code>.</p>
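<p>If you prefer to be explicit, the equivalent manual registration would look roughly like this; the board name <code>worldbank</code> is just an illustrative label:</p>

```r
library(pins)

# Sketch: explicitly register the public bucket as a website board,
# then retrieve the pin through that named board.
board_register_datatxt(name = "worldbank",
                       url = "https://rpins.s3.amazonaws.com")
pin_get("worldbank/indicators", board = "worldbank")
```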
<p>It&rsquo;s also worth mentioning that <code>pins</code> stores the dataset in a native R format, which requires only 72MB and loads much faster than the original 2GB dataset.</p>
<h2 id="pin-information">Pin Information
</h2>
<p>Boards like <a href="https://pins.rstudio.com/articles/boards-kaggle.html" target="_blank" rel="noopener">Kaggle</a>
 and <a href="https://pins.rstudio.com/articles/boards-rsconnect.html" target="_blank" rel="noopener">RStudio Connect</a>
 store additional information for each pin, which you can now easily retrieve with <code>pin_info()</code>.</p>
<p>For instance, we can retrieve additional properties for the indicators pin from Kaggle as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_info</span><span class="p">(</span><span class="s">&#34;worldbank/world-development-indicators&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;kaggle&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: kaggle&lt;worldbank/world-development-indicators&gt; [files]
# Description: World Development Indicators
# Properties:
#   - id: 23
#   - subtitle: Explore country development indicators from around the world
#   - tags: (ref) business, economics, international relations, business finance...
#   - creatorName: Megan Risdal
#   - creatorUrl: mrisdal
#   - totalBytes: 387054886
#   - url: https://www.kaggle.com/worldbank/world-development-indicators
#   - lastUpdated: 2017-05-01T17:50:44.863Z
#   - downloadCount: 42961
#   - isPrivate: FALSE
#   - isReviewed: TRUE
#   - isFeatured: FALSE
#   - licenseName: World Bank Dataset Terms of Use
#   - ownerName: World Bank
#   - ownerRef: worldbank
#   - kernelCount: 422
#   - topicCount: 7
#   - viewCount: 254379
#   - voteCount: 1121
#   - currentVersionNumber: 2
#   - usabilityRating: 0.7647
#   - extension: zip
</code></pre><p>And from RStudio Connect boards as well:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">pin_info</span><span class="p">(</span><span class="s">&#34;worldnews&#34;</span><span class="p">,</span> <span class="n">board</span> <span class="o">=</span> <span class="s">&#34;rsconnect&#34;</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><pre tabindex="0"><code># Source: rsconnect&lt;jluraschi/worldnews&gt; [table]
# Properties:
#   - id: 6446
#   - guid: 1b9f04c5-ddd4-43ca-8352-98f6f01a7034
#   - access_type: all
#   - url: https://beta.rstudioconnect.com/content/6446/
#   - vanity_url: FALSE
#   - bundle_id: 16216
#   - app_mode: 4
#   - content_category: pin
#   - has_parameters: FALSE
#   - created_time: 2019-09-30T18:20:21.911777Z
#   - last_deployed_time: 2019-11-18T16:00:16.919478Z
#   - build_status: 2
#   - run_as_current_user: FALSE
#   - owner_first_name: Javier
#   - owner_last_name: Luraschi
#   - owner_username: jluraschi
#   - owner_guid: ac498f34-174c-408f-8089-a9f10c630a37
#   - owner_locked: FALSE
#   - is_scheduled: FALSE
#   - rows: 44
#   - cols: 1
</code></pre><p>To retrieve all the extended information when discovering pins, pass <code>extended = TRUE</code> to <code>pin_find()</code>.</p>
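<p>For example, assuming the Kaggle board is registered, something like the following would return the same search results with all the extended metadata columns included:</p>

```r
library(pins)

# Sketch: search Kaggle and include the extended metadata columns
# (subtitle, totalBytes, downloadCount, and so on) in the result.
pin_find("indicators", board = "kaggle", extended = TRUE)
```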
<p>Thank you for reading this post!</p>
<p>Please refer to <a href="https://rstudio.github.io/pins" target="_blank" rel="noopener">rstudio.github.io/pins</a>
 for detailed documentation and <a href="https://github.com/rstudio/pins/issues/new" target="_blank" rel="noopener">GitHub</a>
 to file issues or feature requests.</p>
]]></description>
    </item>
  </channel>
</rss>
