Analysing Whisky Rating Data

8/3/2021 11-minute read

TL;DR

Whiskies have a number of different flavours. In this post we use a network graph to see which flavours commonly appear together. We then perform a cluster analysis and plot the clusters on a map to see whether geographic location has an impact on the flavour of a whisky.

Introduction

We are going to analyse some whisky rating data from 86 distilleries in Scotland. The structure of the data set is shown below: an ID column, the distillery name, twelve ratings of different flavour aspects of the whisky, and finally some data about where the distillery is located.

whisky <- read.csv("/Users/jonahthomas/R_projects/personal_blog/content/post/2021-03-05-analysing-whisky-rating-data/whisky.csv")
whisky$Latitude <- as.numeric(whisky$Latitude)
whisky$Longitude <- as.numeric(whisky$Longitude)
str(whisky)
## 'data.frame':    86 obs. of  17 variables:
##  $ RowID     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Distillery: chr  "Aberfeldy" "Aberlour" "AnCnoc" "Ardbeg" ...
##  $ Body      : int  2 3 1 4 2 2 0 2 2 2 ...
##  $ Sweetness : int  2 3 3 1 2 3 2 3 2 3 ...
##  $ Smoky     : int  2 1 2 4 2 1 0 1 1 2 ...
##  $ Medicinal : int  0 0 0 4 0 1 0 0 0 1 ...
##  $ Tobacco   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Honey     : int  2 4 2 0 1 1 1 2 1 0 ...
##  $ Spicy     : int  1 3 0 2 1 1 1 1 0 2 ...
##  $ Winey     : int  2 2 0 0 1 1 0 2 0 0 ...
##  $ Nutty     : int  2 2 2 1 2 0 2 2 2 2 ...
##  $ Malty     : int  2 3 2 2 3 1 2 2 2 1 ...
##  $ Fruity    : int  2 3 3 1 1 1 3 2 2 2 ...
##  $ Floral    : int  2 2 2 0 1 2 3 1 2 1 ...
##  $ Postcode  : chr  "\tPH15 2EB" "\tAB38 9PJ" "\tAB5 5LI" "\tPA42 7EB" ...
##  $ Latitude  : num  286580 326340 352960 141560 355350 ...
##  $ Longitude : num  749680 842570 839320 646220 829140 ...
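
One detail worth flagging in this output: despite their names, the Latitude and Longitude values look like British National Grid eastings and northings in metres (EPSG:27700) rather than degrees, which is why they are reprojected before mapping. As a minimal sketch of that conversion, assuming the sf package is installed (the post itself uses sp), here are two rows copied from the output above; `sample_sites`, `sites_bng`, `sites_wgs84` and `sample_map` are illustrative names, not part of the original analysis.

```r
library(sf)

# Two rows copied from the str() output above (rows 1 and 4).
sample_sites <- data.frame(
  Distillery = c("Aberfeldy", "Ardbeg"),
  Latitude   = c(286580, 141560),  # actually eastings, in metres
  Longitude  = c(749680, 646220)   # actually northings, in metres
)

# Treat the two columns as x/y in the British National Grid (EPSG:27700) ...
sites_bng <- st_as_sf(sample_sites,
                      coords = c("Latitude", "Longitude"),
                      crs = 27700)

# ... and reproject to WGS84 longitude/latitude in degrees (EPSG:4326).
sites_wgs84 <- st_transform(sites_bng, 4326)

# st_coordinates() returns a matrix with columns X (longitude) and Y (latitude).
lon_lat <- st_coordinates(sites_wgs84)
sample_map <- data.frame(
  Distillery = sample_sites$Distillery,
  long = lon_lat[, "X"],
  lat  = lon_lat[, "Y"]
)
```

Both points should land inside Scotland's bounding box. Latitudes outside roughly 54 to 61 degrees north would suggest the two columns were passed in the wrong order.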

As we have some location data, let's start by plotting the distilleries on a map of Scotland.

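library(dplyr)    # %>% and filter()
library(ggplot2)  # map_data() needs the maps package, coord_map() needs mapproj
library(sp)       # coordinates(), proj4string(), CRS(), spTransform()

uk_map <- map_data("world") %>%
  filter(region == "UK")

# Despite their names, the Latitude and Longitude columns hold British National
# Grid eastings and northings (EPSG:27700), so x = Latitude and y = Longitude.
whisky.coord <- data.frame(whisky$Latitude, whisky$Longitude)
coordinates(whisky.coord) <- ~whisky.Latitude + whisky.Longitude
proj4string(whisky.coord) <- CRS("+init=epsg:27700")
whisky.coord <- spTransform(whisky.coord, CRS("+init=epsg:4326"))

# After the transform, the first coordinate is longitude and the second is latitude.
whisky_map <- data.frame(Distillery = whisky$Distillery,
                         long = whisky.coord$whisky.Latitude,
                         lat = whisky.coord$whisky.Longitude)

uk_map %>%
  filter(subregion == "Scotland") %>%
  ggplot() +
  geom_map(map = uk_map,
           aes(x = long, y = lat, map_id = region),
           fill = "white", colour = "black", show.legend = FALSE) +
  coord_map() +
  # A fixed colour belongs outside aes(); inside aes() it is treated as a data value.
  geom_point(data = whisky_map, aes(x = long, y = lat), colour = "red") +
  theme_void()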

Figure 1: Map of Scotland showing the location of the whisky distilleries.

We can see in this map that the distilleries are relatively spread out across Scotland. There appears to be a large number of distilleries on a small island off the west coast of Scotland, along with a cluster towards the north-east. Now that we have visualised this data, let's start to look at the ratings data. To do this we will create a ridge plot to see the distribution of the ratings in each category. To plot this graph, we need to do some data manipulation.
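
The reshaping step can be sketched as follows. This is an assumption-based reconstruction modelled on the per-cluster ridge plot later in this post, not the original code. It uses only the first ten rows of four rating columns, copied from the str() output above, and `whisky_sample`, `ratings_long` and `ridge_plot` are illustrative names.

```r
library(dplyr)     # %>% and mutate()
library(tidyr)     # pivot_longer()
library(forcats)   # fct_reorder()
library(ggplot2)
library(ggridges)  # geom_density_ridges()

# First ten rows of four rating columns, copied from the str() output above.
whisky_sample <- data.frame(
  RowID     = 1:10,
  Body      = c(2, 3, 1, 4, 2, 2, 0, 2, 2, 2),
  Sweetness = c(2, 3, 3, 1, 2, 3, 2, 3, 2, 3),
  Smoky     = c(2, 1, 2, 4, 2, 1, 0, 1, 1, 2),
  Honey     = c(2, 4, 2, 0, 1, 1, 1, 2, 1, 0)
)

# Wide to long: one row per distillery and flavour metric.
ratings_long <- whisky_sample %>%
  pivot_longer(Body:Honey, names_to = "metric", values_to = "value")

# One density ridge per metric, ordered by median rating.
ridge_plot <- ratings_long %>%
  mutate(metric = fct_reorder(metric, value)) %>%
  ggplot(aes(value, metric)) +
  geom_density_ridges() +
  xlim(0, 4)

ridge_plot
```

With the full data set, the same pipeline would start from `whisky` and pivot over `Body:Floral`.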

K-means clustering

To get a better sense of whether there are groups in our data, we can apply a clustering approach. Whilst many clustering methods exist, today we will use k-means cluster analysis. To do this, we first scale our data set and then use the kmeans function that ships with base R. Next, we can use the factoextra package to plot our cluster analysis. For this analysis, we set the value of k to 3.
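
The value k = 3 is taken as given here. One common way to sanity-check such a choice is an elbow plot of the total within-cluster sum of squares against k. The sketch below is not part of the original analysis: it runs on a ten-row, four-column sample copied from the str() output above, so its curve says nothing about the full data, and `ratings_sample`, `scaled_sample`, `candidate_k` and `wss` are illustrative names.

```r
# First ten rows of four rating columns, copied from the str() output above.
ratings_sample <- data.frame(
  Body      = c(2, 3, 1, 4, 2, 2, 0, 2, 2, 2),
  Sweetness = c(2, 3, 3, 1, 2, 3, 2, 3, 2, 3),
  Smoky     = c(2, 1, 2, 4, 2, 1, 0, 1, 1, 2),
  Honey     = c(2, 4, 2, 0, 1, 1, 1, 2, 1, 0)
)

set.seed(42)  # arbitrary seed so the random starts are reproducible
scaled_sample <- scale(ratings_sample)

# Total within-cluster sum of squares for each candidate k.
candidate_k <- 1:4
wss <- sapply(candidate_k, function(k) {
  kmeans(scaled_sample, centers = k, nstart = 25)$tot.withinss
})

# Look for the k after which the curve flattens out.
plot(candidate_k, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

On the full data set, `scaled_sample` would be replaced by the scaled rating columns and a wider range of k could be tried.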

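library(factoextra)  # fviz_cluster()

set.seed(123)  # arbitrary seed: k-means uses random starts, so cluster numbering can change between runs
kmean_data <- scale(whisky[3:14])  # standardise the twelve rating columns
whisky_kmeans <- kmeans(kmean_data, 3, nstart = 25)
whisky <- cbind(whisky, whisky_kmeans$cluster)  # adds a column named `whisky_kmeans$cluster`
fviz_cluster(whisky_kmeans, data = whisky[, 3:14],
             geom = "point",
             ellipse.type = "convex")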

Figure 6: K-means cluster analysis.

From this graph, we can see that our whisky ratings appear to fall into three key clusters. Now that we know there may be three clusters in our data, we can explore how these clusters differ. To do this, we will create another ridge plot, but this time we will use the facet_wrap function in ggplot2 to create a separate plot for each cluster.

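library(tidyr)     # pivot_longer()
library(forcats)   # fct_reorder()
library(ggridges)  # geom_density_ridges()

# Reshape to long format: one row per distillery and flavour metric.
whisky_metrics_kmeans <- whisky %>%
  select(RowID:Floral, `whisky_kmeans$cluster`) %>%
  pivot_longer(Body:Floral, names_to = "metric", values_to = "value")

# One ridge plot per cluster, with metrics ordered by their median rating.
whisky_metrics_kmeans %>%
  mutate(metric = fct_reorder(metric, value)) %>%
  ggplot(aes(value, metric)) +
  geom_density_ridges() +
  xlim(0, 4) +
  facet_wrap(~`whisky_kmeans$cluster`)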

From this visualisation, we can understand how our clusters differ. Cluster three appears to contain whiskies with stronger medicinal and smoky flavour characteristics as well as a strong body. This cluster also appears to exhibit low honey, sweetness and floral flavours. Clusters one and two are slightly harder to distinguish, which is logical considering their boundaries are close on the cluster plot. Both of these clusters show very low medicinal and tobacco flavours. It appears that cluster two whiskies may be slightly sweeter with slightly higher floral flavours, whilst cluster one whiskies have more honey and slightly higher winey flavours.
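
A table of per-cluster mean ratings can make comparisons like these easier to check than reading them off the ridge plots. The sketch below is not part of the original analysis: it uses a ten-row sample copied from the str() output above and two clusters purely for illustration, so its numbers say nothing about the real clusters. `ratings_sample`, `sample_kmeans` and `cluster_means` are illustrative names.

```r
# First ten rows of four rating columns, copied from the str() output above.
ratings_sample <- data.frame(
  Body      = c(2, 3, 1, 4, 2, 2, 0, 2, 2, 2),
  Sweetness = c(2, 3, 3, 1, 2, 3, 2, 3, 2, 3),
  Smoky     = c(2, 1, 2, 4, 2, 1, 0, 1, 1, 2),
  Honey     = c(2, 4, 2, 0, 1, 1, 1, 2, 1, 0)
)

set.seed(42)  # arbitrary seed so the random starts are reproducible
sample_kmeans <- kmeans(scale(ratings_sample), centers = 2, nstart = 25)

# Mean of each unscaled rating within each cluster; one row per cluster.
cluster_means <- aggregate(ratings_sample,
                           by = list(cluster = sample_kmeans$cluster),
                           FUN = mean)
print(cluster_means)
```

On the full data, the equivalent call would be `aggregate(whisky[, 3:14], by = list(cluster = whisky_kmeans$cluster), FUN = mean)`.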

Now that we have a better understanding of how our clusters differ, we can revisit our map of distilleries from earlier. Colouring each distillery by its cluster lets us see whether the geographic location of a distillery has an impact on the flavour of the whisky it produces.

# whisky.coord was built with x = the Latitude column, so after the transform
# whisky.Latitude holds longitude and whisky.Longitude holds latitude.
whisky_map_kmean <- data.frame(Distillery = whisky$Distillery,
                         long = whisky.coord$whisky.Latitude,
                         lat = whisky.coord$whisky.Longitude,
                         cluster = as.factor(whisky_kmeans$cluster))

uk_map %>%
  filter(subregion == "Scotland") %>%
  ggplot() +
  geom_map(map = uk_map,
           aes(x = long, y = lat, map_id = region),
           fill = "white", colour = "black") +
  coord_map() +
  geom_point(data = whisky_map_kmean, aes(x = long, y = lat, color = cluster)) +
  theme_void()

Figure 7: A map of Scotland with the distillery locations overlaid, coloured by distillery cluster.

Overall, this map suggests that the geographic location of a distillery does not have a large impact on the flavour of the whisky it produces. All the distilleries in cluster three appear to be located on the coast, but distilleries from the other clusters are also found along the coast. Location may therefore have a small impact on flavour, but other factors likely play a larger role.

Conclusion

From this data we can see that whiskies tend to share groups of flavour characteristics. Using this information, we generated clusters within the ratings data and explored how these clusters differed in their flavour characteristics. We then plotted the clusters on a map to identify whether geographic location had an impact on whisky flavours.