Following on from my previous post introducing drake, in this post I'm hoping to convey the specifics of my process. I'll start with my plan. There's a lot in it.
One of the key data files is a "click log" of events that the users did on the AirBnB website / app. Unzipped, this file expands to ~400MB, so it takes a bit to load and process. In this case, I do a very simple aggregation of total_seconds spent on the app and a count of unique devices used. Then I save it to an RDS file because I don't want to have to do this again anytime soon.
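That aggregation step can be sketched roughly as below. This is illustrative rather than my verbatim code, but the column names (user_id, secs_elapsed, device_type) are the ones in the Kaggle sessions.csv file.

```r
library(dplyr)
library(readr)

sessions <- read_csv("data/sessions.csv")

session_summary <- sessions %>%
  group_by(user_id) %>%
  summarise(
    total_seconds = sum(secs_elapsed, na.rm = TRUE),  # total time on app
    n_devices     = n_distinct(device_type)           # unique devices used
  )

# Cache the result so this slow step doesn't need to be repeated.
saveRDS(session_summary, "data/session_summary.rds")
```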
Note: this is not drake best-practice. But it is a long step and I wanted to save any people downloading the repo the time that it takes to run this step.
One key thing to note is how the file paths are wrapped in file_in(). This allows drake to detect when that file was last changed. Drake won't go to the effort of re-generating objects so long as any changes to the file pre-date the current cached object.
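A minimal sketch of what that looks like inside a drake plan (the target names and the summarise_sessions() helper are hypothetical; train_users_2.csv is the Kaggle training file):

```r
library(drake)

plan <- drake_plan(
  session_summary = summarise_sessions(file_in("data/sessions.csv")),
  train_raw       = readr::read_csv(file_in("data/train_users_2.csv"))
)

# Targets are rebuilt only when their file_in() inputs (or code) have changed.
make(plan)
```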
The training data is neat and tidy, with 1 row per user. To get my head around all these variables, I whipped up a few charts in Tableau.
This is not a piece about best practice in EDA. There are plenty of really cool ways of drilling down into the nitty-gritty of the variables (to potentially find useful insights).
In this case, we're simply aiming to hand over the data as simply as possible, let the algorithm do its work, and then iterate for improvement only after our process is all set up correctly.
As a result we do minimal pre-processing.
The main thing is just to ignore the early stages of our timeline.
Our test data (in blue) is almost certainly more closely related to the months immediately preceding it (rather than going all the way back to the dawn of time, when the users were probably just the founders' immediate friends and family).
Splitting the data
Breaking the labelled training data into training and testing splits is normally a simple step. However, I have added a sub-sampling step here as well, so I have the capacity to reduce the time it takes my models to train.
I also add a select() call to reduce the columns down to only the ones I'll be using (so that down the line my recipe formula is easier).
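Sketched out, the splitting step looks something like this. The sub-sample fraction, split proportion, and column list are assumptions for illustration; country_destination is the actual outcome column in the Kaggle data.

```r
library(rsample)
library(dplyr)

set.seed(123)

model_data <- train_raw %>%
  sample_frac(0.25) %>%                  # sub-sample to speed up training
  select(country_destination, age,       # keep only the columns the
         gender, signup_method,          # recipe will actually use
         first_browser)

data_split  <- initial_split(model_data, prop = 0.8,
                             strata = country_destination)
train_split <- training(data_split)
test_split  <- testing(data_split)
```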
An alternative method of splitting the data is initial_time_split(), which more realistically mirrors the difference between our labelled and unseen (test) data, but I was having issues using it with the last_fit() function, so I reverted to the random splitting of initial_split().
Generate analysis / assessment "folds" from the training split
Nothing too crazy here. Just a method to generate synthetic training / test splits to test all our different model hyper-parameters on. Many different ways of doing this.
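One common way to do it is v-fold cross-validation via rsample (the fold count here is an assumption; the post doesn't pin one down):

```r
library(rsample)

set.seed(123)

# Each fold pairs an "analysis" set (for fitting) with an
# "assessment" set (for scoring that fold's fit).
folds <- vfold_cv(train_split, v = 5, strata = country_destination)
```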
Setup the recipe
Okay, the real meat and potatoes of the process. Here we stipulate the recipe and then all the steps we need to perform upon the data to get it nice and digestible for the model.
The step_mutate() is just binning the values of the age column. The step_unknown() is replacing NA values with a default value. The step_other() is collecting any "levels" in those columns that make up less than 5% of the data into one "other" category. The step_dummy() creates a series of indicator columns for each level of each factor column.
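Put together, the recipe looks roughly like this. The 5% threshold follows the text; the age bin boundaries and the use of the all_nominal_predictors() selector are my assumptions.

```r
library(recipes)

rec <- recipe(country_destination ~ ., data = train_split) %>%
  step_mutate(age = cut(age, breaks = c(0, 25, 35, 50, 65, Inf))) %>%  # bin age
  step_unknown(all_nominal_predictors()) %>%          # NA -> explicit "unknown"
  step_other(all_nominal_predictors(),
             threshold = 0.05) %>%                    # pool rare levels
  step_dummy(all_nominal_predictors())                # indicator columns
```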
This is really the beauty of the recipes package in that it ensures that the operations that are performed on the data prior to training the model are defined only by what's in the training data and not the testing component.
There are many different types of machine learning models out there. However, "xgboost" seems to get the most attention. So we start with that. The only problem is that, as a rookie, there is a whole heap of "hyper-parameters" that we don't know what to do with.
As a result, these have to be "tuned". So when defining the model, we use placeholders. (Eventually (left image) we move towards defining the values, as we learn which seem to perform better.)
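The model spec with tune() placeholders looks something like the following (the exact set of tuned hyper-parameters is an assumption):

```r
library(parsnip)
library(tune)

xgb_spec <- boost_tree(
  trees       = tune(),   # each tune() is a placeholder to be
  tree_depth  = tune(),   # filled in during the grid search
  learn_rate  = tune(),
  min_n       = tune(),
  mtry        = tune(),
  sample_size = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
```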
Setup the configuration grid
With all these parameters to be tuned, we now need a grid of values (model configurations) to try out. This is where I have done something that I think is pretty novel (and nifty).
The grid_latin_hypercube() is just one way of generating a dataframe of a whole heap of different model configurations. The individual functions within, like trees(), simply provide relevant ranges of values for each hyper-parameter.
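A sketch of generating the grid (the grid size of 20 is an assumption):

```r
library(dials)

set.seed(123)

xgb_grid <- grid_latin_hypercube(
  trees(),
  tree_depth(),
  learn_rate(),
  min_n(),
  sample_prop(),                   # sample_size expressed as a proportion
  finalize(mtry(), train_split),   # mtry's upper bound depends on the data
  size = 20
)
```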
The next step, of trying out each of these model configs over the different "folds", takes plenty of time. So I wanted a method of trialling only a few configurations at a time and storing the results, so that I wouldn't have to run them again. That is what this function does.
It looks for previous configurations already run in the "model_config_results.rds" file... but if it is the first time, and that file does not exist, the function still proceeds anyway.
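The idea can be sketched like so. The file name follows the text; the function name and implementation are illustrative.

```r
library(dplyr)

filter_new_configs <- function(grid,
                               results_path = "model_config_results.rds") {
  # First run: nothing has been tried yet, so keep the whole grid.
  if (!file.exists(results_path)) return(grid)

  previous <- readRDS(results_path)

  # Drop any grid rows whose hyper-parameter values have already been run.
  anti_join(grid, previous, by = names(grid))
}
```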
A simple function that collects these key components together into the one object. Not necessary, per se, but still handy.
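In tidymodels terms, this is essentially a workflow object bundling the recipe and the model spec:

```r
library(workflows)

# One object carrying both the pre-processing and the model definition.
xgb_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(xgb_spec)
```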
Tuning / Grid Search
This is the workhorse of the entire process. Each of the grid's configurations is run over each of the folds.
This tune_grid() function is also set up for parallelization, so you can utilize all your CPU cores (I think there are differences in setting this up between Mac and Windows, but the use_parallel() function above works for my PC (and Ubuntu)).
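A sketch of the call, with doParallel shown as one cross-platform way of registering workers (the post's use_parallel() is its own wrapper around something similar):

```r
library(tune)
library(doParallel)

cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

tuned <- tune_grid(
  xgb_wf,
  resamples = folds,
  grid      = xgb_grid,
  control   = control_grid(save_pred = TRUE)  # keep assessment-set predictions
)

stopCluster(cl)
```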
One key thing to point out is how, for each configuration, the predictions for the assessment set are stored in the resulting object. We use these, in a second, to evaluate each configuration, rather than just working off the default performance metrics (like rmse or auc).
Evaluating the grid results
Using the output of the tune_grid(), we now need to explore which configuration did best.
As the assessment metric for this challenge relies upon the top 5 guesses for each person, we need to inspect the actual prediction values. And not only that, we also need to compare it to our "no information" prediction of just choosing the top 5 destinations (from the analysis split) for each user of the assessment split.
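A simplified sketch of that top-5 check: for each user, take the 5 destinations with the highest predicted probability and flag whether the truth is among them. (The actual competition metric is NDCG@5-flavoured; this hit-rate version is just to show the shape of the computation.)

```r
library(dplyr)
library(tidyr)
library(tune)

top5_hits <- collect_predictions(tuned) %>%
  select(-any_of(".pred_class")) %>%            # keep only class probabilities
  pivot_longer(starts_with(".pred_"),
               names_to = "destination", values_to = "prob") %>%
  group_by(.config, .row) %>%
  slice_max(prob, n = 5) %>%                    # top 5 guesses per user
  summarise(
    hit = first(country_destination) %in% sub("^\\.pred_", "", destination),
    .groups = "drop"
  )

# Mean hit rate per configuration, to be compared against the
# "no information" top-5 baseline.
top5_hits %>% group_by(.config) %>% summarise(top5_acc = mean(hit))
```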
The output from a run of 4 configurations, where we were only exploring learn_rate, was the following.
Rather than "accuracy" or "roc_auc", the most relevant performance metric for this challenge is "avg_pct_improvement". This is actually a percentage of metric improvement, based on the distance from the baseline score to the competition winner's score.
Save the evaluation into our collection
A neat little saving operation, where we keep previous configuration results (if we have them) to save ourselves running the same configs multiple times.
Using our best model
40% seemed to be the best result we were able to get with this original set of features and recipe steps, from running across a whole range of model configurations. Sure, there are more techniques for us to try out, such as using downsampling or bringing in the date_account_created column, but let's continue on.
I actually made a separate plan for this "evaluation" phase.
Out of all our configuration results, we import the best one.
We then train it on the full training split and make predictions for the testing split. I then use the following function to determine the pct_improvement.
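That evaluation phase can be sketched as below. pct_improvement() is the post's own helper (referenced but not reimplemented here), and the column used to pick the best config is an assumption.

```r
library(tune)
library(dplyr)

# Import the best configuration from our saved collection.
best_config <- readRDS("model_config_results.rds") %>%
  slice_max(avg_pct_improvement, n = 1)

# Lock the winning hyper-parameters into the workflow, then fit on the
# full training split and predict on the testing split in one step.
final_wf  <- finalize_workflow(xgb_wf, best_config)
final_fit <- last_fit(final_wf, split = data_split)

collect_predictions(final_fit)   # these feed into pct_improvement()
```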
Whatever number we get here is our best point estimate for how our kaggle submission might go. When the competition was live, there was a limited number of submissions available per day, to prevent a "teaching to the test" approach.
This point estimate is the real metric that we have to budge (upward) by returning to the various feature creation, recipe steps, and hyper-parameter tuning to arrive at the ideal trained model to deploy on our unseen test data.
Submit to Kaggle
With our best model, we can then run this whole process with the entirety of the train dataset (ie not the split) and get some predictions for our unseen test data.
The efforts underlying this post weren't about creating an elite-level machine learning model that could have won me money in this competition 4 years ago.
To be frank, I am totally rapt with being even half as good as the winner. I also have some ideas for getting myself slightly closer to their number.
This hobby project was all about setting up the infrastructure: a small-scale ML "pipeline" which, thanks to tidymodels, is incredibly transferable to a plethora of different machine learning challenges. And it's very neatly mapped out and efficiently run thanks to the drake package.
I also like that I can (after deleting the .gitignore file in the drake folder) shift the process over to a "chonky VM" via a git push and pull, so that the intensive tune_grid() steps can be run on a machine with a few more cores than my local one.
Hope this outline will give you a bit of a head start. Thanks for reading.