I felt it was finally time to get a better understanding of ML. I'd really side-stepped it for a long time because of a perception that I would never get to implement it with any projects that came my way. But, "what the hell", I thought, let's learn it and see what comes along.
For most of my time coding in R, I have been a bit like that Steve Ballmer video on YouTube but instead of shouting "Developers!" I'm shouting "Tidyverse!" To me, if we could all just get onto the same page for R's syntax / structure, that would be great. I've obviously sided with the tidyverse, dplyr, pipes gang.
So that's why I am also using Tidymodels. Being a bit of an ML rookie, there may well be reasons I'm not aware of for preferring the "native" interface of a particular model package (please feel free to let me know some examples).
So let's get into it.
```r
# load packages
library(tidyverse)
library(httr)
library(tidytext)
library(tidymodels)
library(textrecipes)
library(vip)

# download the files
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip"
download.file(url, destfile = "download.zip")
unzip("download.zip", exdir = "data")

# prepare the data
prepared_files_data <- dir("data", recursive = TRUE, full.names = TRUE, pattern = "\\.txt$") %>% 
  tibble(file = .) %>% 
  filter(!str_detect(file, "readme.txt")) %>% 
  mutate(data = map(file, ~read_lines(.x))) %>% 
  unnest(data) %>% 
  mutate(sentiment = as.numeric(str_sub(data, -1)),
         sentiment = if_else(sentiment == 1, "good", "bad") %>% as.factor(),
         review = str_sub(data, 1, nchar(data) - 1) %>% trimws(),
         source = word(file, -1, sep = "/"),
         source = word(source, 1, sep = "_"),
         review_id = row_number()) %>% 
  select(-data, -file)
```
What do we got? A whole bunch of reviews of things and their corresponding sentiment (good / bad). Straight away, we've got our target and our predictor. Sweeto! Let's get some ML happening.
```r
# create training and evaluation splits
set.seed(123)
review_split <- initial_split(prepared_files_data, strata = sentiment)
review_train <- training(review_split)
review_test <- testing(review_split)

# set the recipe
review_rec <- review_train %>% 
  recipe(sentiment ~ review, data = .) %>% 
  step_tokenize(review) %>% 
  step_stopwords(review) %>% 
  step_tokenfilter(review, max_tokens = 500) %>% 
  step_tfidf(review) %>% 
  step_normalize(all_predictors())
```
Splitting our data into training and testing sets ("evaluation" might be a better name) is the usual first step.
Then we come to the "recipe" component of tidymodels. The first element is the formula. Then we have a series of steps describing how the columns will be transformed before we train our model.
This is a straight copy and paste from Julia's post... understanding the particulars of these step operations is its own can of worms.
Of particular note, however, is the max_tokens = 500 argument, which dictates that our "corpus" will contain at most 500 words.
The next component is to specify our model, in this case a logistic regression powered by the glmnet package.
```r
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")
```
We have two "hyperparameters". One, mixture, has been set to 1 (maybe it shouldn't be?). The other, penalty, is something we are going to "tune" (i.e. put a whole range of numbers through it and see which comes out best).
In order to test each value of penalty, we need a whole bunch of resamples of our training data. The bootstraps() function produces 25 samples by default.
```r
set.seed(123)
review_folds <- bootstraps(review_train, strata = sentiment)
```
And what values of the penalty parameter are we going to test out?
```r
lambda_grid <- grid_regular(penalty(), levels = 10)
```
The penalty() function defines the range of valid values for this parameter. The grid_regular() function then creates a set of candidate values (10 here, thanks to levels = 10) at regular spacing across that range. There are other grid options besides grid_regular().
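For instance, a sketch of one alternative (not used in this post): grid_random() draws random values from the parameter's range instead of evenly spaced ones.

```r
# draw 10 random penalty values instead of a regular grid
library(dials)
set.seed(123)
random_grid <- grid_random(penalty(), size = 10)
random_grid
```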
So now we just need to repeatedly train and evaluate our model at each of these levels over the bootstrap folds.
This training and evaluation is going to take a fair while. We have 10 levels of penalty and 25 folds to train/test over. So, before we do that, let's get a sense of how long we can expect to wait. We'll pick a specific value for penalty and run the training/testing process once.
```r
# pick a specific penalty value (hint, it's actually the best one we got)
lasso_spec <- logistic_reg(penalty = 0.0088862, mixture = 1) %>% 
  set_engine("glmnet")
```
The next step is to combine this model with the recipe into a workflow object.
```r
lasso_wf <- workflow() %>% 
  add_recipe(review_rec) %>% 
  add_model(lasso_spec)
```
Then we need to:
1. train the model
2. test the model
3. determine the auc
```r
tictoc::tic()
training_fit <- lasso_wf %>% fit(review_train)

predict(training_fit, new_data = review_test, type = "prob") %>% 
  bind_cols(review_test) %>% 
  roc_auc(truth = sentiment, .pred_good)
tictoc::toc()
```
We could also have achieved the same thing via the last_fit() function, which is actually a bit neater (and gives us an accuracy metric as well as roc_auc).
```r
tictoc::tic()
review_final <- lasso_wf %>% last_fit(review_split)
review_final %>% collect_metrics()
tictoc::toc()
```
But 4.46 seconds times 10 levels times 25 folds would equal ~18 minutes. That's good to know before running the full gamut of levels / folds. And knowing it will take a fair while, we can try utilising multiple CPU cores (instead of R's default of just 1). For Macs, the following should work.
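Something like the following is a common approach on macOS / Linux (an assumption on my part, not tested on this machine): registering a multicore backend via the doMC package.

```r
# register a multicore parallel backend (macOS / Linux)
library(doMC)
registerDoMC(cores = 4)
```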
For my 2-core Windows 10 machine, I tried the following (not entirely sure it worked):
```r
library(future)
plan(multiprocess, workers = 2)
```
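As an aside, more recent versions of the future package deprecate the multiprocess strategy; the portable (and Windows-friendly) equivalent would be multisession.

```r
# multisession is the Windows-friendly replacement for multiprocess
library(future)
plan(multisession, workers = 2)
```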
We run the full grid as follows.
```r
lasso_grid <- tune_grid(
  lasso_wf,
  resamples = review_folds,
  grid = lambda_grid,
  metrics = metric_set(roc_auc, ppv, npv),
  control = control_grid(verbose = TRUE)
)
```
In this case, it surprisingly only took ~3 minutes. Evidently, tune_grid() involves some optimisation so that it doesn't have to fit every combination from scratch (glmnet, for one, can fit models for many penalty values in a single pass). We can see the evidence of each sample being run with the different penalty values.
And the resulting performance metrics:
```r
lasso_grid %>% collect_metrics()
```
The select_best("roc_auc") command pulls out the grid combination that scored best in terms of AUC.
```r
best_auc <- lasso_grid %>% select_best("roc_auc")
```
This is just a tibble (in this case with only one column, named the same as our only hyperparameter). We can pass this one-row tibble in to "finalise" the single tune() placeholder in our model specification (within our workflow object).
```r
final_lasso_wf <- finalize_workflow(lasso_wf, best_auc)
```
Or we could have just finalised our model on its own (using finalize_model()) and recreated a workflow.
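For reference, that alternative might look something like this (a sketch, assuming the same objects as above):

```r
# finalize the model spec directly, then rebuild the workflow
final_lasso_spec <- finalize_model(lasso_spec, best_auc)

final_lasso_wf <- workflow() %>% 
  add_recipe(review_rec) %>% 
  add_model(final_lasso_spec)
```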
Now, with the best hyperparameters, we can train our model on the full training dataset (not just the bootstraps) and then run it on our test data.
```r
training_fit <- final_lasso_wf %>% fit(review_train)

predict(training_fit, new_data = review_test, type = "prob") %>% 
  bind_cols(review_test) %>% 
  roc_auc(truth = sentiment, .pred_good)

predict(training_fit, new_data = review_test, type = "class") %>% 
  bind_cols(review_test) %>% 
  conf_mat(sentiment, .pred_class)
```
This is great. The confusion matrix allows us to work out the usual metrics of accuracy, sensitivity and specificity (in addition to roc_auc). Alternatively, we could have utilised our review_split object (generated back at the start) with the last_fit() function, to get a slightly tidier result.
```r
review_final <- final_lasso_wf %>% last_fit(review_split)
```
And then use the following "helper functions" to get the same output as above, but with accuracy in addition to roc_auc.
```r
review_final %>% collect_metrics()

review_final %>% 
  collect_predictions() %>% 
  conf_mat(sentiment, .pred_class)
```
We can also visualise the ROC curve.
```r
review_final %>% 
  collect_predictions() %>% 
  roc_curve(truth = sentiment, .pred_good) %>% 
  autoplot()
```
This is all great, but we only have a point estimate for the sort of accuracy to expect with unseen data (i.e. 75.9%). What would be better is to run this process over all those review folds, to get a distribution for our expected accuracy. This takes a little time as well.
```r
fits_over_folds <- final_lasso_wf %>% fit_resamples(review_folds)
```
Collecting the (aggregated by default) metrics from this table allows us to see that the accuracy is pretty stable across resamples (the standard error is small).
```r
fits_over_folds %>% collect_metrics()
```
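To see the unaggregated per-resample values instead (my understanding of the API, not run here), collect_metrics() takes a summarize argument:

```r
# per-resample metrics instead of the aggregated summary
fits_over_folds %>% collect_metrics(summarize = FALSE)
```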
If we wanted to, I believe it is also possible to store the actual predictions made in each bootstrap, so we could plot all their ROC curves together.
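I haven't run it here, but I believe the pattern would be to ask fit_resamples() to keep its predictions via control_resamples(), then group the ROC curves by resample id:

```r
# keep per-resample predictions (a sketch, not evaluated in this post)
fits_with_preds <- final_lasso_wf %>% 
  fit_resamples(review_folds,
                control = control_resamples(save_pred = TRUE))

# one ROC curve per bootstrap resample, plotted together
fits_with_preds %>% 
  collect_predictions() %>% 
  group_by(id) %>% 
  roc_curve(truth = sentiment, .pred_good) %>% 
  autoplot()
```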
The upshot of this analysis is that we now have a process set up for determining which words drive positive or negative reviews. Note below that the vi() function relies upon the "fit" component of the fitted workflow (training_fit, something we generated a fair bit earlier).
```r
training_fit <- final_lasso_wf %>% fit(review_train)
```
```r
training_fit %>% 
  pull_workflow_fit() %>% 
  vi() %>% 
  group_by(Sign) %>% 
  top_n(20, wt = abs(Importance)) %>% 
  ungroup() %>% 
  mutate(Importance = abs(Importance),
         Variable = word(Variable, -1, sep = "_"),
         Variable = fct_reorder(Variable, Importance)) %>% 
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free_y") +
  labs(y = NULL)
```
Nothing too amazing here. These words would probably receive similar importance scores in any context. But the fact that this is derived from our own training data, that we built the pipeline ourselves, and that we can generate similar outcomes for other datasets, is what makes it cool.
But better than cool is something that is actually useful. So let's put this model to work, with some brand new unseen data.
```r
new_data_to_be_scored <- tibble::tribble(
  ~"review", ~"sentiment",
  "I'm still infatuated with this phone.", "good",
  "Strike 2, who wants to be rushed.", "bad",
  "I enjoyed reading this book to my children when they were little.", "good",
  "We had a group of 70+ when we claimed we would only have 40 and they handled us beautifully.", "good",
  "The story is lame, not interesting and NEVER really explains the sinister origins of the puppets", "bad",
  "Better than you'd expect.", "good",
  "It was a huge awkward 1.5lb piece of cow that was 3/4ths gristle and fat.", "bad",
  "Yes, it's that bad.", "bad",
  "I did not expect this to be so good!", "good",
  "The only redeeming quality of the restaurant was that it was very inexpensive.", "good"
) %>% 
  mutate(sentiment = as.factor(sentiment))
```
Here we've got some brand new data. Let's pretend we don't already have its sentiment. This time, we should utilise all the data we have available for training our model:
```r
full_training_fit <- final_lasso_wf %>% fit(prepared_files_data)

new_data_to_be_scored %>% 
  bind_cols(
    predict(full_training_fit, new_data = new_data_to_be_scored, type = "class"),
    predict(full_training_fit, new_data = new_data_to_be_scored, type = "prob")
  )
```
These results illustrate that NLP accuracy isn't an easy thing to achieve. We got numerous uncertain predictions (around the 50% mark), and these tended to be wrong more often than our more confident predictions.
Obviously, this is not the most advanced model we could possibly build. This is probably the point where experienced ML practitioners start from. People like myself are just happy to have this process documented for future reference and to pick up a few tips and tricks along the way (there are many more models, each requiring their own deeper understanding, as well).
At the risk of making this post a complete epic, I would also like to preempt my next post by talking about what the drake package might be able to offer us in this context.
At the moment this whole process is sitting in one R script. You can find it here.
This is great and convenient for other people to get the step by step understanding of the process I utilised.
But, as I pursue and experiment with more advanced models, I would like to avoid having to repeat all the individual steps each time I boot up: re-downloading the file, for example, or re-running tune_grid().
We could go ahead and split these sections into different files and scatter saveRDS() and readRDS() calls all about the place. But apparently the drake package can assist with all this quite significantly (by generating a series of cache files). So let's take a look at that (in the next post).
Thanks for reading.