rvest
Simple web scraping for R
rvest is an R package for web scraping that extracts data from HTML web pages. It uses a pipe-friendly syntax inspired by libraries like Beautiful Soup to make common scraping tasks straightforward.
The package provides functions to parse HTML, select elements using CSS selectors or XPath, extract text and attributes, and convert HTML tables directly to data frames. It integrates well with tidyverse workflows and supports both single-element and multi-element extraction. For scraping multiple pages, it works alongside the polite package to respect robots.txt and avoid overwhelming servers.
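As a minimal sketch of the workflow described above (parse, select, extract, convert a table), the example below builds a small page inline with `minimal_html()` so it needs no network access. The page content, the selectors, and the variable names are all hypothetical and chosen only for illustration.

```r
library(rvest)

# Hypothetical inline page (an assumption for illustration, not a real site).
page <- minimal_html('
  <h1>Teams</h1>
  <ul>
    <li><a href="/team/a">Team A</a></li>
    <li><a href="/team/b">Team B</a></li>
  </ul>
  <table>
    <tr><th>team</th><th>wins</th></tr>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>7</td></tr>
  </table>
')

# Single-element extraction with a CSS selector.
title <- page %>% html_element("h1") %>% html_text2()

# Multi-element extraction: the text and href attribute of every link.
links <- page %>% html_elements("li a")
link_text <- html_text2(links)
link_href <- html_attr(links, "href")

# The same links selected with XPath instead of CSS.
links_xpath <- page %>% html_elements(xpath = "//li/a")

# Convert the HTML table to a data frame (returned as a tibble).
wins <- page %>% html_element("table") %>% html_table()
```

For a real page, `read_html()` with a URL would take the place of `minimal_html()`; the selection and extraction calls stay the same.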
Resources featuring rvest
Inspecting websites to find JSON data APIs | Marcos Huerta | Data Science Lab
The Data Science Lab is a live weekly call. Register at pos.it/dslab! Discord invites go out each week on live calls. We’d love to have you!
The Lab is an open, messy space for learning and asking questions. Think of it like pair coding with a friend or two. Learn something new, and share what you know to help others grow.
On this call, Libby Heeren is joined by Marcos Huerta, a Data Science Manager at CarMax, as he walks us through the guts of websites looking for data we can play with. He shows us how to find hidden REST/JSON APIs by using the web inspector in Safari/Firefox and then how to get what’s necessary to pull the same data programmatically in Python or R.
Hosting crew from Posit: Libby Heeren, Isabella Velasquez, Daniel Chen
Marcos’s URLs:
Website: https://marcoshuerta.com
GitHub: https://github.com/astrowonk/
Resources from the hosts and from participants in the Discord chat:
Postman: https://www.postman.com/
Insomnia (open source alternative to Postman): https://insomnia.rest/
Baseball Savant website Marcos is using: https://baseballsavant.mlb.com/gamefeed/?gamePk=777076
Isabella Velasquez’s blog on using the {polite} R package to help scrape Wikipedia: https://ivelasq.rbind.io/blog/politely-scraping/
Festivitas Mac app Marcos used to add the lights to his desktop: https://festivitas.app/
Ted Laderas blog post on parsing JSON in R: https://laderast.github.io/intro_apis_json_cascadia/#/how-does-r-translate-json
New rvest read_html_live() function: https://rvest.tidyverse.org/reference/read_html_live.html
yyjsonr R package: https://github.com/coolbutuseless/yyjsonr
tuber R package: https://github.com/gojiplus/tuber
WikipediaR R package: https://www.quantargo.com/help/r/latest/packages/WikipediaR/1.1/WikipediaR-package
rookiepy Python package: https://pypi.org/project/rookiepy/
Follow Us Here:
Website: https://www.posit.co
The Lab: https://pos.it/dslab
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co
Thanks for learning with us!
Timestamps
00:00 Introduction
03:05 Web scraping vs. API calls
04:12 Server-side rendering vs. client-side JSON
06:12 Warning: Rate limits and business ethics (ahem)
08:39 Demo: Baseball Savant website
08:57 Using browser Developer Tools and the Network tab
12:15 “What is curl?”
13:30 Importing curl into Postman
16:03 Generating Python code from Postman
16:50 “Are there open source alternatives to Postman?”
17:50 Using the generated code in Python/Jupyter
22:28 R packages for JSON (jsonlite, yyjsonr)
25:09 Demo: Massachusetts Lottery website
28:17 Example: scripts Marcos automated with cron jobs
30:17 Handling logins and cookies with rookiepy
32:19 Demo: CNN Election Data
34:26 Inspecting ESPN’s website
36:58 “Can you scrape YouTube?”
38:19 Finding hidden JSON in CardsMania history
45:00 Benefits of API inspection over Beautiful Soup
46:59 New rvest function: read_html_live
50:40 Inspecting LinkedIn and finding GraphQL
53:58 Encouragement on handling API pagination
