Nothing like music to get to know yourself
So let’s see what we got
I use these libraries for the following analysis. Nothing like the tidyverse, right?
library(tidyverse)
library(factoextra)
Here is the data table that I gathered. It contains a lot of info, including our usernames, which is why I am not showing it here. Suffice it to say that there is a lot of metadata about the songs, as well as an analysis of the songs’ characteristics.
complete_info <- read_tsv('../user_song_information.tsv') %>%
  filter(User %in% c("Roberto", "Alejandra"))
Visual comparison of song characteristics
Now that we have collected the data, we can start to compare the songs that we like.
Song Features
A good initial way to compare the distribution of values is to look at violin plots. These plots are wider where there are more values and narrower where there are fewer values. Within each violin plot, there is a smaller, regular boxplot.
# Keep the numerical song features plus the identifiers
numerical_info <- complete_info %>%
  select(track.duration_ms, track.popularity, danceability,
         energy, key, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, track.id,
         User)
# Reshape to long format: one row per song and feature
numerical_info <- numerical_info %>%
  gather(track.duration_ms, track.popularity, danceability,
         energy, key, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, key=feature, value=value)
ggplot(numerical_info, aes(x=User, y=value, fill=User)) +
  geom_violin() +
  geom_boxplot(width=0.1, fill="white") +
  scale_fill_manual(values=c("#7570b3", "#d95f02")) +
  facet_wrap(~feature, scales='free') +
  theme_minimal() +
  theme(legend.position = 'none') +
  labs(title = "Distribution of song feature values",
       x="User", y="Feature Value")
Interesting! Looks like Ale and I are pretty similar. One thing to consider is that we both have ~300 songs; this makes the values truly comparable. For most features, the distribution of values is pretty similar, with only slight differences in the density of values. One clear difference comes in the track.popularity feature. It seems that I like the more popular songs (my hipster heart is really getting crushed now). Also, it seems like my songs are slightly more energetic, with the lower tail of the distribution being higher than hers. However, when it comes to danceability, Ale has a greater density of songs at higher values than I do. Maybe there is also a slight difference in tempo.
Explicit language
One interesting thing to look at is whether the song contains explicit language. How many of our songs would our mothers disapprove of?
explicit_info <- complete_info %>%
  group_by(User) %>%
  count(track.explicit)
exp_fig <- ggplot(explicit_info, aes(x=User, y=n, fill=track.explicit)) +
  geom_bar(position = 'fill', stat='identity') +
  scale_fill_manual(values=c("#7570b3", "#d95f02")) +
  theme_minimal() +
  labs(y='Proportion of songs',
       title="Proportion of songs that contain explicit language")
exp_fig
It seems that both of us are mostly children of God and listen to good Christian lyrics. Stay in school, kids.
Favourite time period
How about what time period we listen to? We can compare the release dates of the albums that contain the songs we listen to. Maybe one of us likes oldies more.
time_info <- complete_info %>%
  separate(track.album.release_date, c('album.year', 'album.month', 'album_day'), '-') %>%
  mutate(album.year = as.integer(album.year),
         album.month = as.integer(album.month)) %>%
  select(track.id, User, album.year, album.month) %>%
  gather(album.year, album.month, key=timemeasure, value=value)
date_info <- time_info %>%
  group_by(User, timemeasure) %>%
  count(value) %>%
  rename(n_songs = n,
         time_measure = value)
bar <- ggplot(date_info, aes(x=time_measure, y=n_songs, fill=User)) +
  geom_bar(stat="identity") +
  scale_fill_manual(values=c("#7570b3", "#d95f02")) +
  facet_wrap(~timemeasure, scales = 'free') +
  theme_minimal() +
  labs(x="Month Year",
       y='Number of Songs', title = "Number of songs for each month/year")
bar
Looks like we both mostly listen to modern music. But also, for some reason January is a month where we both have a lot of songs…
Ale has more songs in the 70s.
Some Multidimensional Statistics
Now that we have seen the features of our songs, we can try to see if these values truly distinguish my songs from Alejandra’s. To do this, I will first try some Principal Component Analysis and K-Means clustering.
Are our songs distinguishable from each other?
So, just how different are our songs? One way to answer this question is to perform an analysis known as Principal Component Analysis (PCA). Basically, we can collapse the variables that describe our songs (danceability, valence, speechiness, etc.) into two components such that the variance between our songs is maximized. This will let us know if the feature values of our songs are different or not.
I think we can limit the analysis to the characteristics of the songs themselves, leaving out external stuff such as popularity.
# Getting shared songs
shared_songs <- complete_info %>% count(track.id) %>% filter(n > 1)
complete_info <- complete_info %>%
  mutate(User = replace(User, track.id %in% shared_songs$track.id, "Shared")) %>%
  distinct(track.id, .keep_all=TRUE)
song_numerical <- complete_info %>%
  select(track.id, danceability, energy, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, duration_ms) %>%
  distinct(track.id, .keep_all=TRUE) %>%
  column_to_rownames(var="track.id")
res.pca <- prcomp(song_numerical, scale = TRUE)
scree <- fviz_eig(res.pca)
scree
Well, that’s interesting. Looks like we need a lot of components to explain an important part of the variance. From this, it seems that our songs are not easily separated. Ideally we would like the first two components to explain about 80% of the variance, but it looks like it’s much less. Let’s take a look at how the features relate to each other.
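To put a number on "how many components do we need", here is a minimal sketch of how the cumulative share of variance could be read off a prcomp result. It uses R's built-in iris measurements as a stand-in for the song features (an assumption for illustration only; toy_features, toy_pca and n_needed are made-up names), so whatever it prints says nothing about our songs.

```r
# Toy stand-in for the song feature table: the built-in iris measurements.
# (Assumption: the real analysis would use song_numerical instead.)
toy_features <- iris[, 1:4]

# PCA on standardized variables, as in the post.
toy_pca <- prcomp(toy_features, scale. = TRUE)

# The variance of each component is its squared standard deviation.
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 80%.
n_needed <- which(cum_var >= 0.8)[1]

print(round(cum_var, 3))
print(n_needed)
```

With the real data, the same calculation should work on res.pca$sdev.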
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
This is a graph of variables. Positively correlated variables point in the same direction. Negatively correlated variables point in opposite directions. For example, acousticness is anti-correlated with energy. At the same time, energy and loudness are correlated. Makes sense, right? I’m interested in the fact that duration seems to be anti-correlated with danceability. Funny.
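The arrows are a visual summary, so one way to double check a relationship is to look at the plain correlation matrix. The sketch below does this on simulated values (an assumption: toy_songs is made-up data that only borrows three feature names, with the relationships built in on purpose), so it shows the mechanics, not a result about our songs.

```r
# Simulated stand-in data (NOT the real songs): loudness is built to rise
# with energy and acousticness to fall with it, plus some random noise.
set.seed(1)
n <- 200
energy <- runif(n)
toy_songs <- data.frame(
  energy = energy,
  loudness = -20 + 15 * energy + rnorm(n, sd = 2),
  acousticness = 1 - energy + rnorm(n, sd = 0.2)
)

# Pairwise Pearson correlations between the features.
cor_matrix <- cor(toy_songs)
print(round(cor_matrix, 2))
```

On the real table, cor(song_numerical) would give the equivalent matrix for all ten features.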
Ok, so what does this tell us about our songs? Not much yet, but now we can see how our songs are distributed in this PC space. We should be able to notice whether our songs are clustered together or not.
listener <- as.factor(complete_info$User)
fviz_pca_ind(res.pca,
             col.ind = listener, # color by groups
             palette = rev(c("#1b9e77", "#d95f02","#7570b3")),
             legend.title = "Listener",
             repel = TRUE,
             geom = 'point'
)
As expected, it seems that our songs are not that different after all. Though it is interesting to note that most of our songs are on the left side of PC1.
fviz_pca_biplot(res.pca, repel = TRUE,
col.var = "black", # Variables color
col.ind = listener, # Individuals color
palette = rev(c("#1b9e77", "#d95f02","#7570b3")),
geom = 'point')
Seems like most of our songs are on the more danceable side of things. That’s unexpected, at least for me. I liked to think I was very moody.
Maybe with a nonlinear dimensionality reduction method, such as UMAP, we could get more defined clusters.
K-means clustering
So there might be another way to distinguish our songs. We can do something known as k-means clustering. Basically, we tell an algorithm to create two groups of songs based on their similarity to each other (Euclidean distance).
We can get an idea of how many clusters (k) to split our songs into with the following plot. Basically, it shows the within-cluster sum of squares (WSS), which tells us how different the songs within each cluster are. As we increase the number of clusters, the WSS decreases, since the songs within each cluster become more similar.
Normally, we would be looking for an “elbow”, meaning a point where the WSS stops changing a lot. To me, it seems like that point is 3 or 4. But that is far too many clusters for us (I would say). So let’s start with two and see what we get.
song_numerical.scaled <- scale(song_numerical)
fviz_nbclust(song_numerical.scaled, kmeans, method = "wss")
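For the curious, the numbers behind an elbow plot like this can also be computed by hand. This is a rough sketch on the built-in iris measurements (an assumption for illustration; toy_scaled, ks and wss are made-up names, and the range of k is arbitrary): it runs kmeans for several values of k and collects the total within-cluster sum of squares from each fit.

```r
# Toy stand-in for song_numerical.scaled: standardized iris measurements.
toy_scaled <- scale(iris[, 1:4])

set.seed(42)
ks <- 1:6

# Total within-cluster sum of squares for each number of clusters.
wss <- sapply(ks, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting wss against ks gives the same kind of curve, and the "elbow" is where adding another cluster stops reducing the WSS by much.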
Let’s do some k-means with two clusters. We can see that we recover what we saw earlier: songs on the positive side of Dim1 are separated from those in cluster 2.
set.seed(42)
k2 <- kmeans(song_numerical.scaled, 2, nstart = 25)
fviz_cluster(k2, data = song_numerical.scaled,
             geom='point', ellipse.type = "norm",
             palette='Dark2')
Ok ok, so that’s interesting. In the PCA space they did separate a bit. We already know that the PCA space doesn’t separate our songs. So it’s unlikely that these clusters represent the users. I am curious to see what actually separates them though. Let’s take a look.
song_clusters2 <- song_numerical %>%
  mutate(cluster=as.factor(k2$cluster),
         User=complete_info$User) %>%
  rownames_to_column(var="track.id") %>%
  gather(danceability, energy, loudness, speechiness, acousticness, instrumentalness,
         liveness, valence, tempo, duration_ms, key=feature, value=value)
vp <- ggplot(song_clusters2, aes(x=cluster, y=value, fill=cluster)) +
  geom_violin() +
  geom_boxplot(width=0.1, fill="white") +
  scale_fill_manual(values=c("#7570b3", "#d95f02")) +
  facet_wrap(~feature, scales='free') +
  theme_minimal() +
  theme(legend.position = 'none')
vp
It seems that the biggest differences come in duration, energy, loudness and acousticness. Basically, it is separating slow songs from more energetic songs. Is there a difference in the number of songs we have in each cluster?
cluster2_user <- complete_info %>%
  mutate(cluster=as.factor(k2$cluster)) %>%
  group_by(User) %>%
  count(cluster) %>%
  subset(User != "Shared")
exp_fig <- ggplot(cluster2_user, aes(x=User, y=n, fill=cluster)) +
  geom_bar(position = 'fill', stat='identity') +
  scale_fill_manual(values=c("#7570b3", "#d95f02")) +
  theme_minimal() +
  labs(y='Proportion of songs')
exp_fig
Interesting! It seems that I have a slightly greater proportion of songs that are in the more energetic cluster. Maybe I am not that moody after all.
Out of curiosity, what would happen with k-means clustering for three clusters?
set.seed(42)
k3 <- kmeans(song_numerical.scaled, 3, nstart = 25)
fviz_cluster(k3, data = song_numerical.scaled,
             geom='point', ellipse.type = "norm",
             palette='Dark2')
song_clusters3 <- song_numerical %>%
  mutate(cluster=as.factor(k3$cluster),
         User=complete_info$User) %>%
  rownames_to_column(var="track.id") %>%
  gather(danceability, energy, loudness, speechiness, acousticness, instrumentalness,
         liveness, valence, tempo, duration_ms, key=feature, value=value)
vp <- ggplot(song_clusters3, aes(x=cluster, y=value, fill=cluster)) +
  geom_violin() +
  geom_boxplot(width=0.1, fill="white") +
  scale_fill_brewer(palette="Dark2") +
  facet_wrap(~feature, scales='free') +
  theme_minimal() +
  theme(legend.position = 'none')
vp
cluster3_user <- complete_info %>%
  mutate(cluster=as.factor(k3$cluster)) %>%
  group_by(User) %>%
  count(cluster) %>%
  subset(User != "Shared")
exp_fig <- ggplot(cluster3_user, aes(x=User, y=n, fill=cluster)) +
  geom_bar(position = 'fill', stat='identity') +
  scale_fill_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(y='Proportion of songs')
exp_fig
It seems to me that cluster 3 captures the slower and more acoustic songs from our collections. As expected, Alejandra has a greater proportion of songs in this category than I do. A true hipster.
Conclusions
Alejandra and I have great music taste.
Also I’m not as edgy as I thought I was.
Statistical tests can be used to further support the differences observed in the violin plots. More sophisticated multidimensional statistics can also be used.
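As one possible starting point, a Wilcoxon rank-sum test compares a single feature between two listeners without assuming the values are normally distributed. The sketch below is only an illustration on simulated numbers (an assumption: toy is made-up data with placeholder listener labels and invented popularity values, and the choice of test is mine, not something already done in this analysis).

```r
# Simulated stand-in data (NOT our real songs): two listeners with
# 300 songs each and made-up popularity scores.
set.seed(42)
toy <- data.frame(
  User = rep(c("A", "B"), each = 300),
  track.popularity = c(rnorm(300, mean = 55, sd = 15),
                       rnorm(300, mean = 45, sd = 15))
)

# Two-sided Wilcoxon rank-sum test: does the feature's distribution
# differ between the two listeners?
test_result <- wilcox.test(track.popularity ~ User, data = toy)
print(test_result$p.value)
```

Since there are a dozen features to compare, the resulting p-values would probably need a multiple-testing correction, for example with p.adjust.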
Finally, we can also look forward to further analyses. We can compare our general taste to the songs that we have been listening to more recently, or analyze the characteristics of our top artists and songs. Also, I’m pretty sure that Spotify has some classification of songs, saying whether the song is happy or sad, etc. Gotta figure that out.
Also, if you have any suggestions of things to look at definitely let me know!