NSW Fuel Prices Dashboard: A Take home assignment in R and Tableau

What does it take to make an impressive dashboard? I built this one (below) as a take-home assignment for a job interview. When presenting it, I got the coveted "We've never seen anyone do that with the data before" response from the company's 2IC. All of a sudden, in the back of my mind, a little monkey was pulling the handle of an antique cash register (making that "ca-ching" sound).

Somehow, I managed to convert this csv:

...into this... (it's a bit too big for the embed window on this website; you can find the full version on Tableau Public):

Full disclosure: I have spent a little time re-doing this since the interview (so that it runs more efficiently). But the concept is still the same.

Conception, exploratory, wrangling + augmentation

The challenge statement was "show us your data skills." The data was a single 155 MB csv file (download here). It was your typical hackathon-style setup.

With hackathons, exploring the data is essential, but the experience soon becomes lacklustre if you are unable to come up with a cool enough, practical application for the data.

In this case, I arrived at a clear vision of a "compare the market"-style app, where a user could click on their regular petrol station and see how its prices compared with those in the immediate vicinity.

Having this "imaginary client demand" allowed me to focus my exploratory phase (making it easier to determine which findings were important and which were simply esoteric). Coming up with (what I thought was) a practical use case was a culmination of previous experience, creativity and no small amount of luck.

These were some of my key findings from the exploratory phase:

  • Some station brands did not stock particular fuel types.
  • There's something cyclical (kinda' monthly) happening with the prices. Maybe there is some macro level announcement that is causing these price spikes.
  • Different station brands update their fuel prices at different times of the day (maybe some might even be operating 24hrs and others not).
  • [Most important] There was no unique identifier for each specific station. And sometimes the same address was typed in differently.

My initial vision was something like this:

I now needed to work backwards and wrangle the data into "that certain way" that Tableau likes.

All up, I had about a day and a half available to work on this project. The following are the specific steps I needed to take:

  1. I would need to determine a unique_store_id.

This would involve ironing out all the discrepancies in how the same address could have been typed in.

library(dplyr)
library(stringr)

stations <- stations %>%
  mutate(new_address = trimws(tolower(Address)),
         # standardise common road-type abbreviations (word boundary
         # so e.g. "ford," is not touched)
         new_address = str_replace(new_address, "\\brd,", "road,"),
         new_address = str_replace(new_address, "\\bst,", "street,"),
         new_address = str_replace(new_address, "\\bhwy,", "highway,"))

And it involved some unavoidable "checking by eye", to ensure apparently distinct sites were actually different.

To do this, I utilised a system of (what I call) "cyclic recoding", where an Excel file is generated to take in your determinations (duplicate or not), and this can then be added to and corrected as time goes on.

I demonstrate this process in a previous blog post. My opinion is that any manual data recoding is best done in Excel, as the table filtering and drag-down-to-fill features are so intuitive.
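A minimal sketch of that review loop (the file names and the `group_id` column are my own assumptions, not the original script):

```r
library(dplyr)
library(readr)

# 1. Write the distinct cleaned addresses out for manual review.
stations %>%
  distinct(new_address) %>%
  arrange(new_address) %>%
  write_csv("addresses_to_review.csv")

# 2. After filling in a group_id column in Excel (same id = same
#    physical site), read the determinations back in and join them on.
reviewed <- read_csv("addresses_reviewed.csv")  # columns: new_address, group_id
stations <- stations %>%
  left_join(reviewed, by = "new_address")
```

Addresses still missing a `group_id` after the join are the ones to review on the next pass.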

  2. I would also need the price for each fuel type as separate columns (so I could use a parameter to determine the dashboard user's "selected fuel"). This was the more typical sort of data-wrangling task. You can check out the steps in my R script, which I will post at the end.
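That wide reshaping can be sketched with tidyr (the column names here are assumptions about the source csv, not the original script):

```r
library(dplyr)
library(tidyr)

# One row per station/date, one price column per fuel type,
# e.g. price_E10, price_P95, price_U91.
prices_wide <- prices %>%
  pivot_wider(id_cols      = c(unique_store_id, PriceUpdatedDate),
              names_from   = FuelCode,
              values_from  = Price,
              names_prefix = "price_")
```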

  3. Determine the lat-longs for all service station locations. The ggmap package has a function for this:

mutate(lat_long_data = map(addresses, ~ggmap::geocode(.x, output = "latlona", source = "google")))

The package does require registering a Google Cloud Platform API key, but it came with plenty of free credits.

I did notice that some of the returned lat-longs were outside NSW (probably because I forgot to add "Australia" to the search term sent to the API). But I could identify these by their distance from the median of all points and, because I was pressed for time, I simply removed them.

# distHaversine returns metres; drop anything more than ~1,100 km
# from the median point (i.e. well outside NSW)
mutate(distance_from_median_point = geosphere::distHaversine(cbind(lon, lat), cbind(median(lon), median(lat)))) %>% 
  filter(distance_from_median_point < 1.1E6)

Find my full r script at the gist below:

NSW fuel data wrangle

Tableau Implementation

It was now time to move on to the fun part: bringing the initial vision to reality using Tableau. The relevant prepared csv files are available to download here, if you would like to follow along.

I connected the three prepared data sources, as follows:

These are what the individual files contained:

valid_googled addresses.csv

With these three key tables connected, I made a *.hyper file extract (utilising the "Multiple tables" feature) to increase retrieval speed.

Following on from my interview presentation, I have made these improvements:

  • Aggregating fuel prices from individual hourly updates to an average daily price. This reduced the data 24-fold.
  • Instead of pre-calculating the distances for every possible site combination and importing them into Tableau (~6,000 x 6,000 rows), I utilised a calculation to determine the distance on the fly (demonstrated later). In the interview, my original all-site-combinations method meant there was a ~10 second lag each time a new site was selected. As embarrassing as this was, I wasn't too ashamed to admit that there were still plenty of ways the dashboard could be optimised.
  • Instead of utilising the entire date range, I restricted the import to only the latest couple of months (all I determined would be relevant to the end user).
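The first and third improvements can be sketched in one dplyr pipeline (column names are assumptions carried over from the earlier wrangling, not the original script):

```r
library(dplyr)
library(lubridate)

daily_prices <- prices %>%
  mutate(price_date = as_date(PriceUpdatedDate)) %>%
  # collapse hourly updates to one average price per station/fuel/day
  group_by(unique_store_id, FuelCode, price_date) %>%
  summarise(avg_daily_price = mean(Price), .groups = "drop") %>%
  # keep only the latest two months for the dashboard extract
  filter(price_date >= max(price_date) %m-% months(2))
```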

The essential component that made the whole dashboard possible was the ability to update the selection of a specific station.

I actually didn't know that the "edit parameter action" existed before I started the assignment (I just assumed it would be possible). However, I was very fortunate that it had only just been added in the latest release (version 2019.2), so this had the added benefit of conveying the impression that I had my "finger on the pulse" of the latest Tableau developments.

With the parameter able to be updated, I could then utilise a Fixed LOD to make new columns that housed the selected site's latitude and longitude...

{FIXED : MAX(IF [full_name] = [Selected Store] THEN [lat] END)}
{FIXED : MAX(IF [full_name] = [Selected Store] THEN [lon] END)}

...and from these, determine the distance (in metres) of any other site, utilising a bit of spherical trigonometry (the spherical law of cosines):

6371 * ACOS(
SIN(RADIANS([lat])) * SIN(RADIANS([Selected Lat])) +
COS(RADIANS([lat])) * COS(RADIANS([Selected Lat])) *
COS(RADIANS([Selected Lon]) - RADIANS([lon]))
) * 1000

There was also a bit of a red herring with Tableau's MAKEPOINT and DISTANCE functions, which looked like they would achieve the same result. However, it turned out that the DISTANCE function does not work with a (non-live) hyper extract data source (which I found surprising).
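If you want to sanity-check the calculated field back in R, the geosphere package exposes both formulas; over short distances they agree very closely (a sketch, with Sydney-area coordinates I picked purely for illustration):

```r
library(geosphere)

p1 <- c(151.2093, -33.8688)  # lon, lat: Sydney CBD
p2 <- c(151.1782, -33.8919)  # a nearby point

# Spherical law of cosines (what the Tableau field implements), in metres
distCosine(p1, p2)
# Haversine, as used earlier in the wrangling script, in metres
distHaversine(p1, p2)
```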

Coupled with a radius parameter, we could then determine each site's Relationship with Selected Location... and so, I was well on my way.

The dashboard and hyper file can be downloaded here. It can also be found up on Tableau Public.

I might add some YouTube videos demonstrating the creation of the four key aspects, as some of these required some quite tricky LOD functions:

  • map
  • fuel price over time chart
  • days as cheapest station chart
  • cost per tank comparison chart

So there you have it. A nifty little project, smashed out under a tight timeframe. I have mixed feelings about take-home assignments, but this opportunity was one I found incredibly valuable: not just from a technical perspective, but also for practicing my presentation skills (if you're going to do a take-home assignment, make sure you get the opportunity to present and receive feedback).

Some possible next steps might include:

  • Connecting directly to the official NSW government Fuel Watch data source.
  • Recreating this dashboard in Power BI or R Shiny.

Many thanks for reading.

If you have a similar short, sharp dashboard idea that you want to bring to fruition on a speedy timeline, feel free to get in touch.