Seek.com.au is an Australian jobs site that, if you ask it to, will email you specific job alerts, based on keywords and locations, every day.
Quite a while back, I signed up for the daily alert for Tableau jobs in Melbourne and then proceeded to hardly ever open a single one of these emails.
But I never turned them off. And they just kept coming in, every morning at 7am, to this day.
As a freelance Tableau consultant, these companies now represent a great list of prospective clients.
If I could turn all those emails into a nice neat list (after filtering out all the recruiters, of course), I could probably network my way to the right person and offer my Tableau-related services.
What follows is the process I used to turn these emails into a nice neat Excel table.
If you would like to take a look at my code, I have published a gist.
To do this, I used the gmailr package. I suspect there is very little that this package can't do, when it comes to working with everyone's extensive collection of Gmail-based emails.
I'm not too sure exactly how much code you are going to want to see, as this process was pretty specific to these particular emails.
That being said, I thought I would share my key discoveries.
Authorising gmailr is best covered in the following blogpost. Basically, you are giving your instance of RStudio the capability to do some pretty powerful things with your Gmail account, so consider yourself forewarned.
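As a rough sketch, here is what that authorisation step looks like with current versions of gmailr (>= 1.0, where every function carries a gm_ prefix; the code later in this post uses the older, unprefixed names). The JSON path is a placeholder for your own OAuth client file downloaded from the Google Cloud console.

```r
library(gmailr)

# Point gmailr at your own OAuth client credentials (hypothetical path):
gm_auth_configure(path = "~/secrets/gmailr-oauth-client.json")

# Opens a browser window asking you to grant this session access to your mailbox:
gm_auth()
```

Once this completes, the token is cached locally, so subsequent sessions won't re-prompt you.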
1. Figure out the search term that is going to capture the emails you're interested in (to the exclusion of all others).
search_term <- "from:(email@example.com) new jobs for tableau in Melbourne"
messageIDs <- messages(search = search_term, num_results = 5)
The messages() function brings down only the IDs of the relevant emails, storing them as members of an item in a list object. However, if you have more than 100 matching emails, it bundles each batch of 100 IDs into a separate list item.
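That paged shape is worth seeing concretely. Below is a base-R illustration using a mocked two-page result (the real object comes from messages(); the ids here are invented), flattening the list-of-pages into a single vector of message IDs:

```r
# Mocked version of the paged structure messages() returns:
# a list of pages, each holding up to 100 messages.
fake_pages <- list(
  list(messages = list(list(id = "a1"), list(id = "a2"))),  # first page
  list(messages = list(list(id = "b1")))                    # next page
)

# Pull the "id" field out of every message on every page:
all_ids <- unlist(lapply(fake_pages, function(page)
  vapply(page$messages, `[[`, character(1), "id")))

all_ids  # "a1" "a2" "b1"
```

This is essentially what the modify_depth(1, "id") step in the next section does with purrr.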
2. Actually download the data related to these emails.
my_messages <- tibble(messageIDs) %>%
  mutate(downloaded_data = map(messageIDs,
                               ~.x$messages %>%
                                 modify_depth(1, "id") %>%
                                 as.vector() %>%
                                 map(message))) %>%
  unnest(downloaded_data)
This takes time (~1 second per email), so it's generally a good idea to save the result to an .rds file at this point. It's also a bit redundant to re-download historical emails: you could do one big download for the whole past year and then just append to it going forward, using a smaller, time-limited search.
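A minimal sketch of that caching idea, with made-up file names and toy data standing in for the downloaded tibble (Gmail's search syntax also supports operators like after:2024/01/01, which you could append to the search term for the incremental run):

```r
# First run: save the expensive download once.
cache_file <- tempfile(fileext = ".rds")  # in practice, a fixed path like "my_messages.rds"

first_batch <- data.frame(id = c("a1", "a2"), subject = c("job 1", "job 2"))
saveRDS(first_batch, cache_file)

# Later run: append a smaller, time-limited download,
# de-duplicating on the message id.
new_batch <- data.frame(id = c("a2", "a3"), subject = c("job 2", "job 3"))
combined  <- rbind(readRDS(cache_file), new_batch)
combined  <- combined[!duplicated(combined$id), ]
saveRDS(combined, cache_file)

nrow(combined)  # 3 distinct emails
```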
3. Build out a dataframe with the meta-data.
key_info <- my_messages %>%
  mutate(id      = map_chr(downloaded_data, ~gmailr::id(.x)),
         date    = map_chr(downloaded_data, ~gmailr::date(.x)),
         date    = lubridate::dmy_hms(date) %>% as.Date(),
         subject = map_chr(downloaded_data, ~gmailr::subject(.x)))
4. Extract the body content.
with_body_content <- key_info %>%
  mutate(body_content = map(downloaded_data,
                            ~.x$payload$parts[[1]]$parts[[1]]$body$data %>%
                              RCurl::base64Decode(txt = .) %>%
                              str_split("\r\n|\t") %>%
                              unlist() %>%
                              tibble() %>%
                              rename(text = 1) %>%
                              filter(text != ""))) %>%
  select(-downloaded_data)
As you can see, the body of each email was hidden deep within the list object, and it was also encoded in base64, so it originally looked like an unintelligible wall of characters, which had me completely stumped at first.
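The decoding step is easy to demonstrate in isolation. Here is the same RCurl::base64Decode() call from the pipeline above, applied to a small made-up string of the kind Gmail hands back:

```r
# Gmail delivers the body as base64 text; decode it back to plain characters.
encoded <- "bmV3IGpvYnMgZm9yIHRhYmxlYXU="
decoded <- RCurl::base64Decode(encoded)
decoded  # "new jobs for tableau"
```

After this, the body is ordinary text that str_split() can break into lines.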
5. Process the body content.
The final part of my process involved a whole bunch of str_detect() calls to tease out the various key components of each job advertisement. Refer to the gist for the details.
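To give a flavour of that step without reproducing the gist: each decoded body is a vector of lines, and pattern matching classifies each line as a job title, a company, a location, and so on. The fragment below is entirely invented, and it uses base R's grepl(), which works the same way as stringr::str_detect() with the arguments reversed.

```r
# A made-up fragment of a decoded email body:
lines <- c("Senior Tableau Developer",
           "Acme Analytics Pty Ltd",
           "Melbourne VIC",
           "$120k - $140k")

# Classify lines by pattern (illustrative patterns only):
is_location <- grepl("\\bVIC\\b|\\bMelbourne\\b", lines, perl = TRUE)
is_salary   <- grepl("^\\$", lines, perl = TRUE)

lines[is_location]  # "Melbourne VIC"
lines[is_salary]    # "$120k - $140k"
```

The real emails needed a larger set of patterns, plus some logic to group each run of lines back into a single job record.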
One thing I did notice, however, was that the number of distinct jobs I pulled out of the body sections did not match the number of jobs mentioned in the emails' subject lines. So perhaps there were some hidden compartments somewhere within that complicated list object.
But in the end, we still got a neat list (albeit mostly full of recruiters).
Hopefully, in addition to being interesting, this post has sparked some ideas about how your business's Gmail-based emails, previously out of reach of programmatic retrieval, might now be accessed and utilised.
If you need help with such a task, feel free to give me a call.