A mini web-scraping example

I was recently on a call with a former client, who briefly inquired about the difficulty of a quick web-scraping task.

This person also knew a little bit about R (she had brought it up as a potential tool for the job). I was keen to impress upon her (once again) its immense power for automating mundane tasks.

As she described it, the task seemed simple enough. I kept on saying that it would be "easy," which is a habit I've got to try to get out of.

The task

She chose one academic's profile, at random. She pointed to the table (highlighted in red).

The task was to scrape the "h-index" scores from a whole bunch of Google Scholar profiles. Each staff member's profile URL would be provided.

I demonstrated how the SelectorGadget Chrome extension could isolate the individual HTML node (of the table).

From there, it was just a matter of using the rvest package to bring that data down.

"Leave it with me", I said. "I will write you up a blog post for it".

Here is that blog post.

Demo-ing the process

These are the steps I took to get a working process up and running, using sample data:

1. Find a few more random profiles (I just chose a few people from the same field as our person above), to stand in for my client's actual staff members.

2. Open each of these profile pages in separate tabs and collect all their profile urls using the Copy All Urls Chrome extension.

3. Paste those urls into Excel and copy down each of their names.

4. Copy this table into R, using the datapasta addin operation "paste as tribble". (I use this all the time and even have a shortcut for it, Ctrl+Alt+D.)

5. We are also going to need the tidyverse and rvest packages, so load them up as well.

6. Experiment with a single url to make sure the selected html node is bringing down the right data (and that this matches with the page) in a format we can use.

With the specific html node I selected, the table values are brought down as a vector, so their row-column positions have to be assumed from their order. This is something to be a bit cautious of, as certain edge cases might have values missing, which would mess up the order. In such a situation, we would have to use a "broader" html node that included the column and row names. But, for now, we can run with this.
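
To make step 6 concrete, here is a minimal sketch of pulling the table cells down as a vector. It runs against a small inline stand-in for a profile page rather than a live URL; the markup, the `.gsc_rsb_std` selector and all the numbers are assumptions for illustration, so check them against the real page with SelectorGadget before relying on them.

```r
library(rvest)

# Stand-in for one downloaded profile page (hypothetical markup and numbers).
# For a real profile you would call read_html() on its url instead.
sample_page <- read_html('
  <table id="gsc_rsb_st">
    <tr><th></th><th>All</th><th>Since 2015</th></tr>
    <tr><td>Citations</td><td class="gsc_rsb_std">1200</td><td class="gsc_rsb_std">800</td></tr>
    <tr><td>h-index</td><td class="gsc_rsb_std">18</td><td class="gsc_rsb_std">14</td></tr>
    <tr><td>i10-index</td><td class="gsc_rsb_std">25</td><td class="gsc_rsb_std">20</td></tr>
  </table>')

# Select only the value cells; the row and column labels are NOT included,
# so the result is a flat character vector in reading order.
# (html_elements() is the newer name for html_nodes() in rvest 1.0+.)
raw_values <- sample_page %>%
  html_nodes(".gsc_rsb_std") %>%
  html_text()

raw_values
```

With this stand-in page the vector has six entries, read row by row, which is exactly why a missing cell would silently shift every value after it.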

7. Tidy the downloaded data into a "tibble".

The "since 2015" label could be updated manually with each passing year (assuming this code gets used beyond 2020; I have high hopes).
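
One way the tidying in step 7 might look is sketched below. The helper name `tidy_scores`, the column names and the sample numbers are all hypothetical; the only thing taken from the post is the idea that six values arrive as a flat vector and have to be placed by position.

```r
library(tibble)

# Hypothetical scraped values, in reading order:
# citations (all, recent), h-index (all, recent), i10-index (all, recent)
raw_values <- c("1200", "800", "18", "14", "25", "20")

tidy_scores <- function(values) {
  # Positions are assumed from order, so refuse to guess if a cell is missing
  stopifnot(length(values) == 6)
  m <- matrix(as.integer(values), ncol = 2, byrow = TRUE)
  tibble(
    metric     = c("citations", "h_index", "i10_index"),
    all_time   = m[, 1],
    since_2015 = m[, 2]   # rename this column as the "since" year rolls over
  )
}

scores <- tidy_scores(raw_values)
scores
```

The length check is a cheap guard for the edge case mentioned earlier: if a profile is missing a value, the function stops rather than putting numbers in the wrong rows.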

8. Run the process for all the urls, using the map() function. Then unnest the individual tables.

Neato! This took about 10 seconds to run, for 6 people's data.
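
The map-then-unnest pattern from step 8 can be sketched as follows. To keep it runnable without network access, `scrape_profile` here is a stand-in that returns made-up numbers; a real version would read each url with rvest and tidy the result. The `staff` table, its names and the example.org urls are likewise placeholders.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical staff table, in the shape "paste as tribble" produces
staff <- tribble(
  ~name,      ~url,
  "Person A", "https://example.org/profile-a",
  "Person B", "https://example.org/profile-b"
)

# Stand-in scraper: returns fixed, made-up scores for any url.
# Swap the body for the real download-and-tidy steps.
scrape_profile <- function(url) {
  tibble(
    metric     = c("citations", "h_index", "i10_index"),
    all_time   = c(1200L, 18L, 25L),
    since_2015 = c(800L, 14L, 20L)
  )
}

all_download <- staff %>%
  mutate(scores = map(url, scrape_profile)) %>%  # one nested tibble per person
  unnest(cols = scores)                          # expand to one row per metric

all_download
```

Each person ends up with three rows (one per metric), with their name and url repeated alongside, which is a convenient shape for filtering down to just the h-index later.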

9. Save the data to an Excel file, for easy distribution.

# Write the combined table to a date-stamped Excel workbook
all_download %>% 
  openxlsx::write.xlsx(paste0("ggle_schlr_scp_", lubridate::today(), ".xlsx"), asTable = TRUE)

This is what we get.

Nice one!

The full R script can be found in the gist below:

google scholar h-index scrape.R

All up, this was a very quick process to test out, and good practice for my web-scraping chops.

The rvest package allowed the web-scraping aspect to take place over a mere couple of lines of code. The rest was just wrangling the data into a tidy table.

For now, it is just a matter of my former client using this documentation and the script above to process her actual staff members' profile urls.

Good luck, Sue!

For those looking to learn a bit more about R, I have an introductory primer course on R, available on Udemy. It's "fast-paced but friendly" and will show you plenty of techniques that are incredibly helpful when starting out with R.

Learn Data Wrangling With R
If you’ve always been a bit curious about how R works (specifically, how to do data wrangling with it) this course is for you. I cover a nice example and demonstrate lots of nifty, very useful techniques for turning a collection of messy data sources into something that’s tidy. …

Thanks for reading.