Expanding my Tableau Network on LinkedIn (using R to scrape the Tableau certification directory)

Back in the heady days of January 2019, I took a chance and signed myself up for the Tableau Certified Associate exam.

I'd been using Tableau at my old work (Monash Uni) for a number of years and thought I had a pretty good handle on things. But still, I had this niggling fear of failing and losing my money.

So I overprepared. I learnt all there was to know about LODs, parameters, dashboard actions and sets. Even found a useful practice exam up on Udemy. And, long story short, I was fortunate enough to pass.

I considered my passing a bit of an accomplishment. But it got me curious… Who else out there was willing to risk US$250 (perhaps their business paid?) for a nice tick of approval, saying that they know Tableau pretty darn well?

Actually, all their names (if they elect to make their achievement public) are up on the Tableau certification directory website.

That fourth guy down... what a legend :-)

So, let's say I want to connect (via LinkedIn) with all the other Australian BI developers who have earned this same certification.

Or say I were a recruiter looking for such people to fill Tableau positions...

I would like to go through all these names (on a regular basis).

Okay, I will just download the data. Surely it will give me a nice neat spreadsheet I can look at in Excel...

What's this? No "Data" download option?!

Not to worry, though... Let's try that PDF option.

But I want it to be something like this...

Ah... all is well when data is in a nice, neat Excel table

TLDR:

Here's the R script you'll need:

Tableau Certification Directory - pdf to csv (GitHub Gist)

And the YouTube video where I demonstrate it.


There are actually some R packages designed for extracting tables out of pdfs. I tried two of them: tabulizer and pdftools.

filename <- "Tableau Certification Directory.pdf"

# tabulizer (Java-based) detects and extracts each table it finds in the pdf
tabulizer_import <- tabulizer::extract_tables(filename)

# inspect the first extracted table
tabulizer_import[[1]]

tabulizer took over 5 minutes to analyse the entire pdf, and the results looked something like this:

pdftools imported the data much faster (around a second), generating a large character vector that could be split by a delimiter to produce something like this:

library(tidyverse)  # for the pipe, stringr::str_split() and purrr::pluck()

# pdf_text() returns one long character string per page of the pdf
pdftools_import <- pdftools::pdf_text(filename)

# split the first page on Windows-style line endings, one element per line
pdftools_import %>%
  str_split("\r\n") %>%
  pluck(1)

Obviously, a fair bit of work would be required to neaten up either of these two options.

My suspicion is that these functions are better suited to standard, simpler pdf tables. Basically, those generated by anything other than Tableau.

I ended up using my fallback, which was to simply Ctrl-A and Ctrl-C the pdf's content and paste it into Notepad.

The full data came through in a considerably more usable format.

A nifty little read_csv() later and we’ve got a great starting point.
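
As a minimal sketch of that step (the filename and the lack of a header row are assumptions on my part; the actual call is in the gist):

library(readr)

# read the pasted text back in; each pasted line becomes a row to wrangle
directory_raw <- read_csv("Tableau Certification Directory.txt",
                          col_names = FALSE)  # no header row in the paste

head(directory_raw)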

From here, it was a matter of extracting out and neatening up the component parts (taking special care with those people with multiple certifications).

See the above YouTube video and GitHub gist.
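
For a rough illustration of the idea only, here's a sketch with made-up example lines and assumed column names; the real layout of the paste (and the working code) lives in the gist:

library(dplyr)
library(tidyr)
library(stringr)

# made-up example lines, purely to illustrate the approach
raw <- tibble(line = c(
  "Jane Citizen  Australia  Desktop Certified Associate; Desktop Specialist",
  "John Smith    Australia  Desktop Certified Associate"
))

raw %>%
  # split the fixed-position fields on runs of two or more spaces
  separate(line, into = c("name", "country", "certifications"),
           sep = "\\s{2,}") %>%
  # one row per certification for people holding more than one
  separate_rows(certifications, sep = ";\\s*") %>%
  mutate(across(everything(), str_trim))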


Let's take a look at the data we got

Note how the older (expired) Associate level certifications are not included in the pdf. The Desktop Certified Associate is the obvious mainstay. The Desktop Specialist seems to be a recent, very popular (and profitable) addition.

When we focus on the Australians who have done the Desktop Certified Associate exam...

We see a remarkably steady frequency. At US$250 a pop, this particular exam looks to have generated roughly 150 × US$250 = US$37,500 in revenue over 25 months (≈ US$18,000 per year). And that's excluding all the failed attempts.
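
As a quick back-of-the-envelope illustration of that arithmetic, a sketch like the one below would do it; the column names and rows here are assumptions for demonstration, not the gist's actual output:

library(dplyr)
library(lubridate)

# assumed column names and made-up rows, just to show the counting and arithmetic
certs <- tibble(
  country       = c("Australia", "Australia", "United States"),
  certification = c("Desktop Certified Associate",
                    "Desktop Certified Associate",
                    "Desktop Specialist"),
  cert_date     = as.Date(c("2019-02-15", "2019-03-03", "2019-03-10"))
)

certs %>%
  filter(country == "Australia",
         certification == "Desktop Certified Associate") %>%
  count(month = floor_date(cert_date, "month")) %>%
  summarise(total_passes    = sum(n),
            est_revenue_usd = total_passes * 250)  # US$250 per pass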

From the Excel workbook, let's take a recent name and look them up on LinkedIn.

Because there are already enough bots out there, I attached a personalised “Hey! I use Tableau too” message for each of my connection requests.


While this is not an entirely automated process, it might strike a chord with people who deal with lots of data trapped inside pdfs.

Pulling the data and wrangling it into a usable form certainly wasn't easy in this case, but not every source is going to be as complicated.

Also, if your files don't have to be manually downloaded from a rigid data source, then there'll be even more opportunity to completely automate the process.


Did this post spark any ideas for you?

Fancy working with Julian on achieving them?