Nothing like music to get to know yourself

So let’s see what we got

I use these libraries for the following analysis. Nothing like the tidyverse, right?

library(tidyverse)
library(factoextra)

Here is the data table that I gathered. It contains a lot of info, including our usernames, which is why I am not showing it here. Suffice it to say that there is a lot of metadata about each song, as well as an analysis of the songs’ characteristics.

complete_info <- read_tsv('../user_song_information.tsv') %>% 
  filter(User %in% c("Roberto", "Sarah"))

Visual comparison of song characteristics

Now that we have collected the data, we can start to compare the songs that we like.

Song Features

A good initial way to compare the distributions of values is to look at violin plots. These plots are wider where there are more values and narrower where there are fewer. Within each violin plot, there is a smaller, regular boxplot.

numerical_info <- complete_info %>% 
                  select(track.duration_ms, track.popularity, danceability,
                         energy, key, loudness, speechiness, acousticness,
                         instrumentalness, liveness, valence, tempo, track.id,
                         User)

numerical_info <- numerical_info %>% 
                  gather(track.duration_ms, track.popularity, danceability,
                         energy, key, loudness, speechiness, acousticness,
                         instrumentalness, liveness, valence, tempo,
                         key=feature, value=value)

ggplot(numerical_info, aes(x=User, y=value, fill=User)) +
  geom_violin() +
  geom_boxplot(width=0.1, fill="white")  +
  scale_fill_brewer(palette="Dark2") +
  facet_wrap(~feature, scales='free') +
  theme_minimal() +
  theme(legend.position = 'none') +
  labs(title = "Distribution of song feature values",
       y = "Feature value")

Interesting! Well, one thing to consider is that I have ~300 songs while Sarah has 100. Even so, it is remarkable that Sarah’s music taste is so consistent. For me, it is unsurprising that Sarah’s songs have more danceability. Also, speechiness seems to separate our songs quite well. For reference, rap songs have a high level of speechiness, so it seems that Sarah’s songs are more rap-like. Her songs also tend to have a higher tempo. This actually makes a lot of sense: Sarah is a great DJ at a party. When it comes to energy and loudness, our median values are similar; however, her songs’ values are more similar to each other than mine are. I have more songs with lower energy and loudness. This further confirms that Sarah is a party animal.

Other than that, our songs have a similar track.popularity, meaning we’re both cool.

Explicit language

One interesting thing to look at is whether a song contains explicit language. How many of our songs would our mothers disapprove of?

explicit_info <- complete_info %>% 
  group_by(User) %>% 
  count(track.explicit)

exp_fig <- ggplot(explicit_info, aes(x=User, y=n, fill=track.explicit)) +
  geom_bar(position = 'fill', stat='identity') +
  scale_fill_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(y='Proportion of songs',
       title="Proportion of songs that contain explicit language")
exp_fig

It seems that both of us are mostly children of God and listen to good Christian lyrics. However, Sarah has a greater proportion of songs with explicit language.
Ay ay ay. Sarah! Would you show your grandmother your songs? Tss.

Favourite time period

How about what time period we listen to? We can compare the release dates of the albums that contain the songs we listen to. Maybe one of us likes oldies more.

time_info <- complete_info %>% 
  separate(track.album.release_date, c('album.year', 'album.month', 'album.day'), '-') %>% 
  mutate(album.year = as.integer(album.year),
         album.month = as.integer(album.month)) %>% 
  select(track.id, User, album.year, album.month) %>% 
  gather(album.year, album.month, key=timemeasure, value=value)

date_info <- time_info %>% 
  group_by(User, timemeasure) %>% 
  count(value) %>% 
  rename(n_songs = n,
         time_measure = value) 
  

bar <- ggplot(date_info, aes(x=time_measure, y=n_songs, fill=User)) +
  geom_bar(stat="identity") +
  scale_fill_brewer(palette="Dark2") +
  facet_wrap(~timemeasure, scales = 'free') + 
  theme_minimal() +
  labs(x="Month                                                        Year", 
       y='Number of Songs', title = "Number of songs for each month/year") 

bar

Looks like we both mostly listen to modern music. But also, for some reason, January is a month where we both have a lot of songs… (one guess: albums whose release date is only known to the year may be recorded as January 1st, which would inflate that month).

Some Multidimensional Statistics

Now that we have seen the features of our songs, we can try to see if these values truly distinguish my songs from Sarah’s. To do this, I will first try some Principal Component Analysis and K-means clustering.

Are our songs distinguishable from each other?

So, just how different are our songs? One way to answer this question is to perform an analysis known as Principal Component Analysis (PCA). Basically, we can collapse the variables that describe our songs (danceability, valence, speechiness, etc.) into two components such that the variance between our songs is maximized. This will let us know whether the feature values of our songs are different or not.

I think we can limit the analysis to the characteristics of the songs themselves, leaving out external stuff such as popularity.

# Getting shared songs
shared_songs <- complete_info %>% count(track.id) %>% filter(n > 1)
complete_info <- complete_info %>% 
  mutate(User = replace(User, track.id %in% shared_songs$track.id, "Shared")) %>% 
  distinct(track.id, .keep_all=TRUE)


song_numerical <- complete_info %>% 
  select(track.id, danceability, energy, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, duration_ms) %>% 
  distinct(track.id, .keep_all=TRUE) %>% 
  column_to_rownames(var="track.id")

res.pca <- prcomp(song_numerical, scale = TRUE)
scree <- fviz_eig(res.pca)
scree

Well, that’s interesting. Looks like we need a lot of components to explain an important part of the variance. From this, it seems that our songs are not easily separated. Ideally, we would like the first two components to explain about 80% of the variance, but it looks like it’s much less. Let’s take a look at how the features themselves relate to each other.
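To put a number on that, the exact variance proportions can be read off the PCA object. A minimal sketch with simulated data standing in for song_numerical (on the real object, summary(res.pca) works directly):

```r
# Sketch: extract the proportion of variance explained by each PC.
# Simulated data stands in for song_numerical; with the real object
# you would simply call summary(res.pca).
set.seed(1)
fake_songs <- matrix(rnorm(200 * 10), nrow = 200)

res.demo <- prcomp(fake_songs, scale = TRUE)
imp <- summary(res.demo)$importance

# The "Cumulative Proportion" row shows how much variance PC1..PCk cover
cum_two <- imp["Cumulative Proportion", 2]
cum_two
```

If that number is well below 0.8, a 2-D picture of the songs necessarily throws away a lot of information, which is exactly what the scree plot is hinting at.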

fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

This is a graph of variables. Positively correlated variables point in the same direction; negatively correlated variables point in opposite directions. For example, acousticness is anti-correlated with energy. At the same time, energy and loudness are correlated. Makes sense, right? I’m interested in the fact that duration seems to be anti-correlated with danceability. Funny.

Ok, so what does this tell us about our songs? Not much yet, but now we can see how our songs are distributed in this PC space. We should be able to notice whether our songs are clustered together or not.

listener <- as.factor(complete_info$User)

fviz_pca_ind(res.pca,
             col.ind = listener, # color by groups
             palette = c("#1b9e77", "#d95f02","#7570b3"),
             legend.title = "Listener",
             repel = TRUE,
             geom = 'point'
             )

Well, it seems that our songs are not that different after all. But there is an interesting tendency. If you really focus, you can see that Sarah’s songs tend to be more on the right side of the space.

fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "black", # Variables color
                col.ind = listener,  # Individuals color
                palette = c("#1b9e77", "#d95f02","#7570b3"),
                geom = 'point')

AHA! As we observed in the violin plots, Sarah’s songs are mostly described by the danceability, speechiness and valence variables. I have a few songs in that region, but when we look at instrumentalness, acousticness and duration, only I have songs with high values. Also, Sarah’s songs are more similar to each other; they cluster more closely. That is so cool.

Maybe with some more precise algorithms, such as UMAP, we could get more defined clusters.
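Sketching that idea: the uwot package (not loaded above, so an assumption on my part) provides a UMAP implementation that takes the same kind of scaled matrix. Simulated data stands in for the real features:

```r
# Hypothetical sketch: a 2-D UMAP embedding of the scaled feature
# matrix. Assumes the uwot package is installed; simulated data
# stands in for song_numerical.scaled.
library(uwot)

set.seed(42)
fake_scaled <- scale(matrix(rnorm(150 * 10), nrow = 150))

embedding <- umap(fake_scaled, n_neighbors = 15, min_dist = 0.1)
dim(embedding)  # one 2-D coordinate pair per song
```

Whether UMAP would actually find cleaner clusters here is an open question; being non-linear it often does, but it also distorts global distances, so the axes are harder to interpret than PCA components.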

K-means clustering

So there might be another way to distinguish our songs. We can do something known as k-means clustering. Basically, we tell an algorithm to create two groups of songs based on their similarity to each other (Euclidean distance).

We can get an idea of how many clusters (k) to split our songs into with the following plot. Basically, it lets us know how different the songs within a given cluster are. As we increase the number of clusters, the within-cluster sum of squares (WSS) decreases, since the songs in each cluster become more similar.

Normally, we would be looking for an “elbow”, meaning a point where the WSS stops changing a lot. To me, it seems like that point is 3 or 4. But that is far too many clusters for us (I would say). So let’s start with two and see what we get.

song_numerical.scaled <- scale(song_numerical)

fviz_nbclust(song_numerical.scaled, kmeans, method = "wss")
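To demystify what fviz_nbclust is plotting: the curve is just the total within-cluster sum of squares (tot.withinss) that kmeans itself reports, evaluated over a range of k. A sketch with simulated data standing in for song_numerical.scaled:

```r
# Total within-cluster sum of squares (WSS) for k = 1..6, the quantity
# behind the elbow plot. Simulated data stands in for the real
# scaled feature matrix.
set.seed(42)
fake_scaled <- scale(matrix(rnorm(200 * 10), nrow = 200))

wss <- sapply(1:6, function(k) {
  kmeans(fake_scaled, centers = k, nstart = 25)$tot.withinss
})

wss  # decreases as k grows; look for the "elbow"
```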

Let’s do some k-means with k = 2. We can see that we recover what we saw earlier: songs on the positive side of Dim1 are separated from those in cluster 2.

set.seed(42)
k2 <- kmeans(song_numerical.scaled, 2, nstart = 25)

fviz_cluster(k2, data = song_numerical.scaled,
             geom='point', ellipse.type = "norm",
             palette='Dark2')

Ok ok, so that’s interesting. In the PCA space, the clusters do separate a bit. But we already know that the PCA space doesn’t separate our songs, so it’s unlikely that these clusters represent the users. I am curious to see what actually separates them, though. Let’s take a look.

song_clusters2 <- song_numerical %>% 
  mutate(cluster=as.factor(k2$cluster),
         User=complete_info$User) %>% 
  rownames_to_column(var="track.id") %>% 
  gather(danceability, energy, loudness, speechiness, acousticness, instrumentalness,
         liveness, valence, tempo, duration_ms,
         key=feature, value=value)

vp <- ggplot(song_clusters2, aes(x=cluster, y=value, fill=cluster)) +
  geom_violin() +
  geom_boxplot(width=0.1, fill="white")  +
  scale_fill_brewer(palette="Dark2") +
  facet_wrap(~feature, scales='free') +
  theme_minimal() +
  theme(legend.position = 'none')

vp

Now that’s interesting: as expected, there is a notable difference in the energy, danceability, loudness, and acousticness values between the two clusters. That is super cool, and is consistent with what we saw in the PCA analysis. I wonder who has more songs in each cluster?

cluster2_user <- complete_info %>% 
  mutate(cluster=as.factor(k2$cluster)) %>% 
  group_by(User) %>% 
  count(cluster) %>% 
  subset(User != "Shared")


exp_fig <- ggplot(cluster2_user, aes(x=User, y=n, fill=cluster)) +
  geom_bar(position = 'fill', stat='identity') +
  scale_fill_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(y='Proportion of songs')
exp_fig

INCREDIBLE!
Almost all of Sarah’s songs are in cluster 2, which has songs with higher energy, loudness, danceability…the party cluster.

Out of curiosity, what would happen with three clusters?

set.seed(42)
k3 <- kmeans(song_numerical.scaled, 3, nstart = 25)

fviz_cluster(k3, data = song_numerical.scaled,
             geom='point', ellipse.type = "norm",
             palette='Dark2')

song_clusters3 <- song_numerical %>% 
  mutate(cluster=as.factor(k3$cluster),
         User=complete_info$User) %>% 
  rownames_to_column(var="track.id") %>% 
  gather(danceability, energy, loudness, speechiness, acousticness, instrumentalness,
         liveness, valence, tempo, duration_ms,
         key=feature, value=value)

vp <- ggplot(song_clusters3, aes(x=cluster, y=value, fill=cluster)) +
  geom_violin() +
  geom_boxplot(width=0.1, fill="white")  +
  scale_fill_brewer(palette="Dark2") +
  facet_wrap(~feature, scales='free') +
  theme_minimal() +
  theme(legend.position = 'none')

vp

cluster3_user <- complete_info %>% 
  mutate(cluster=as.factor(k3$cluster)) %>% 
  group_by(User) %>% 
  count(cluster) %>% 
  subset(User != "Shared")


exp_fig <- ggplot(cluster3_user, aes(x=User, y=n, fill=cluster)) +
  geom_bar(position = 'fill', stat='identity') +
  scale_fill_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(y='Proportion of songs')
exp_fig

Yeah, the third cluster didn’t really add much.

Conclusions

Sarah is a party animal.

Statistical tests can be used to further support the differences observed in the violin plots. More sophisticated multidimensional statistics can also be used.
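For instance, a Wilcoxon rank-sum test could check whether a single feature, say danceability, really differs between our two collections. A sketch with simulated values standing in for the real columns (the actual call would pull the two User groups out of the data):

```r
# Hypothetical sketch: compare a song feature between two users with a
# Wilcoxon rank-sum (Mann-Whitney) test. Simulated values stand in for
# the real danceability columns.
set.seed(42)
roberto_dance <- rbeta(300, 2, 2)   # ~300 of my songs
sarah_dance   <- rbeta(100, 4, 2)   # Sarah's ~100, shifted higher

test_res <- wilcox.test(roberto_dance, sarah_dance)
test_res$p.value  # a small p-value suggests the distributions differ
```

The rank-sum test is a reasonable default here because the feature distributions in the violin plots are clearly not normal.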

Finally, we can also look forward to further analyses. We can compare our general taste to the songs that we have been listening to most recently, or analyze the characteristics of our top artists and songs. Also, I’m pretty sure that Spotify has some classification of songs, saying whether a song is happy or sad, etc. Gotta figure that out.

Also, if you have any suggestions of things to look at definitely let me know!