In a previous post, I gave an overview of the key general steps in a Machine Learning project. In essence, using R's tidymodels framework, they are:
- Import raw training data
- Pre-process raw data into a usable format
- Split training into training and testing data
- Create a recipe and determine key wrangling "step_"s
- Pick a model
- Create a grid of hyperparameter values
- Create a workflow (training data + recipe + steps + model)
- Create a series of unique analysis / assessment cross-validation folds to run your grid search over
- Run the tune_grid() operation
- Extract the best model configuration and finalise the workflow (ie tell the model which hyperparameters to use).
- Use last_fit() to train the model on the full training split and see what sort of result you get with the testing set. If you want a distribution of the sorts of results you can expect on unseen data, use fit_resamples() as well.
(if you have actual unseen data that you need predictions for...)
- Import the raw "unseen" data and apply the same pre-processing steps as for the training data.
- Train the model with the full, labeled training dataset and make predictions for the unseen data
- Submit these predictions wherever they need to go.
Then iterate, iterate, iterate.
Try out different features, different models, hyper-parameters. See if you can budge your metrics in the right direction.
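The steps above can be sketched in code. This is an illustrative outline only: the column names (`outcome`), the model choice (a ranger random forest), and the tuning grid are my assumptions, not a prescription.

```r
library(tidymodels)

# Illustrative sketch: column names, model and grid are all assumptions
raw_data  <- read.csv("train.csv")   # import raw training data
processed <- na.omit(raw_data)       # stand-in for real pre-processing

data_split <- initial_split(processed, strata = outcome)  # train/test split
train_data <- training(data_split)
test_data  <- testing(data_split)

# recipe and its "step_"s
rec <- recipe(outcome ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors())

# pick a model; flag the hyperparameters you want tuned
spec <- rand_forest(mtry = tune(), trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

# workflow = recipe + model
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

folds <- vfold_cv(train_data, v = 5)                      # CV folds
grid  <- grid_regular(mtry(range = c(2, 8)), levels = 4)  # hyperparameter grid

tuned <- tune_grid(wf, resamples = folds, grid = grid)

# finalise with the best hyperparameters, then check against the test split
final_wf  <- finalize_workflow(wf, select_best(tuned, metric = "roc_auc"))
final_res <- last_fit(final_wf, data_split)
collect_metrics(final_res)
```

Each iteration is then a matter of swapping out the recipe steps, the model spec, or the grid, and rerunning.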
This tidymodels framework, demonstrated (excellently) by Julia Silge in her YouTube videos, really makes the steps distinct and clear.
The question I then faced was: how can I make redeploying this process easier across different projects? Also, some steps take my i3 CPU a fair amount of time to run. Is there any way to avoid redoing certain intensive steps each time (without creating a whole mess of saved rds files)?
Fortunately, the drake package seemed to be a readily available solution to both issues.
The Drake Package
So what is the drake package?
Basically, each of the steps above generates a new R object. The raw_data gets turned into the pre_processed_data, the pre_processed_data goes into the split_data, the split_data into training and testing splits, etc, etc.
Running a regular script, each of the code chunks that transitions one object into the next has to be run, in turn. And this needs to be done every time you restart R.
Drake automatically saves each of these objects in a hidden "caching system" and only refreshes an object when something changes upstream of it.
There are two things that this process enforces (in a good way).
- Defining the key steps of your process by the different objects generated.
- Defining the operations that generate each object (using drake the right way, each new object is generated by a function).
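A minimal sketch of what that looks like. The function names here (import_raw(), pre_process(), make_split()) are hypothetical stand-ins for functions you would write yourself:

```r
library(drake)

# import_raw(), pre_process() and make_split() are hypothetical
# functions you define yourself; each target below is one object
plan <- drake_plan(
  raw_data           = import_raw("data/train.csv"),
  pre_processed_data = pre_process(raw_data),
  split_data         = make_split(pre_processed_data)
)

make(plan)  # builds each target and caches it; reruns only what changed
```

If you later edit pre_process(), only pre_processed_data and split_data get rebuilt; raw_data comes straight from the cache.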
It's much easier to show how you'd make one. To be frank, the first time I saw one, I was a little miffed as well: I don't like adopting an unfamiliar R project structure, and I wanted to convert it back to a regular "plain script".
Okay, how to get started?
- Make a project
- Run drake::use_drake()
This generates four R script files (actually five, but I'm ignoring the "_drake.R" file for now):
- The packages.r file is the most self-explanatory of the bunch. Put all your required library() calls in here.
- The plan.r script is where you identify the key objects of your process and the function calls to create them.
Note the use of "=" rather than "<-" ("gets"), and how it all happens within the drake_plan() function.
And also how (in the example) the fit_model() function can use the earlier-created "data" object as an argument.
- The functions.r file just holds all the functions you are going to use.
- The make.r file brings everything together. Sourcing the make.r file regenerates each of the objects in the plan (on an "as required" basis).
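Roughly, sourcing make.r amounts to something like this (the file paths are use_drake()'s defaults, as I recall):

```r
# make.r: pull the pieces together and (re)build the plan
source("R/packages.R")   # library() calls
source("R/functions.R")  # your object-generating functions
source("R/plan.R")       # the drake_plan() definition

make(plan)               # rebuild only the outdated targets
```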
You can also set it up so that the project's "result" (a csv file, for example) is generated, by adding an output object to the plan, like so:
```r
generic_data_save_obj = write_csv(data, "data.csv")
```
The generic_data_save_obj isn't really something the rest of the plan uses (it's an end point), but it would still be refreshed whenever it needed to be (when we sourced the make.r file).
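If you want drake to watch the output file itself, not just the object, drake provides file_out() for this. A sketch, with a toy stand-in for the upstream data target:

```r
library(drake)
library(readr)

plan <- drake_plan(
  data = tibble::tibble(x = 1:3),  # toy stand-in for the real upstream target
  # file_out() registers data.csv as a tracked output file
  generic_data_save_obj = write_csv(data, file_out("data.csv"))
)

make(plan)
```

With file_out(), deleting or hand-editing data.csv also marks the target as outdated, so the next make(plan) regenerates it.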
You could also access any of these objects independently, from a separate script (within the project), by using readd(data) or loadd(data) to retrieve the object saved to the cache earlier (by the make(plan) step).
To bring all the cached objects into the environment, simply run loadd(). Note that all these trailing "d"s stand for "drake".
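The difference between the two is worth a quick sketch (assuming a cached target called data already exists):

```r
library(drake)

result <- readd(data)  # readd() *returns* the cached object
loadd(data)            # loadd() assigns `data` into the global environment
loadd()                # no arguments: load every cached target
```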
How does it work, in practice?
Okey dokey, we've covered the basics, let's see how it goes with an actual example.
Apologies. This goes on a fair bit.
Many, many moons ago, in my (half) masters of analytics, a buddy and I tackled a Kaggle competition for a group assignment. We crashed and burned in the rankings (as is to be expected from newbie R users), but we turned in a paper that showed what we had tried and got good marks (thanks, Dr. V).
The goal of this challenge is to predict which country a bunch of (American, first-time) AirBnB users will visit. Each submission allows five guesses per person (including "No destination found"), and we get fewer points the lower down our list the actual destination appears.
The assessment metric essentially gives us 1 point for getting it right with the first guess and diminishing returns for subsequent correct guesses. Then we take the average across users.
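For the record, the competition scored submissions with NDCG@5: a correct guess at rank k earns 1/log2(k + 1). A minimal sketch of the per-user score (the function name is mine):

```r
# NDCG@5 for one user: 1/log2(rank + 1) if the true destination appears
# at position `rank` among the five guesses, 0 otherwise
ndcg_at_5 <- function(guesses, truth) {
  rank <- match(truth, guesses[1:5])
  if (is.na(rank)) 0 else 1 / log2(rank + 1)
}

ndcg_at_5(c("US", "FR", "NDF"), "NDF")  # third guess: 1/log2(4) = 0.5
```

So a first-guess hit scores 1, and each later position is worth progressively less, which is why front-loading the most likely destinations matters.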
Just guessing the top five destinations in the training data as everybody's destinations will still get you a pretty good score. The competition winners were only able to improve on this baseline by "a small margin".
With a company as popular as AirBnB, small margins are probably still big money. The application of these predictions is probably some subtle UX improvement, such that the recommended destinations are more finely tuned to each user.
A retail business might use something similar to present new users with a better array of suggested purchases, and thus transition them from "anonymous peruser" to customer more efficiently.
In this case, because the "no information" prediction (ie the top five locations) is still pretty good, and we don't know how improvement in the assessment metric translated into better earnings for AirBnB, it's hard to assign value to all the ML effort that went into this competition. But it's still a fun exercise.
The real value, for me, in re-engaging with this challenge was in setting up a system that significantly reduces the "setting up" overhead for any future similar projects.
Here is my GitHub repo for this project:
This post is going on a little long... feel free to continue to the details over here.