A Sentiment Analysis of Kanye West Lyrics

In anticipation of Kanye’s upcoming album Yandhi, we will delve into the ways his music has changed throughout his career. This analysis focuses on Kanye’s core discography of eight albums, spanning from his debut, The College Dropout, released in 2004, to his most recent lyrical masterpiece, ye. Data is sourced from Spotify using the ‘spotifyr’ package, and the lyrics are those provided by Genius.

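#Packages used throughout this analysis
library(spotifyr)   #get_discography, get_album_tracks, get_track_audio_features
library(dplyr)      #pipes, joins, filter, mutate
library(tidyr)      #spread
library(readxl)     #read_excel
library(lubridate)  #year
library(ggplot2)
library(plotly)     #ggplotly
library(tidytext)   #unnest_tokens, stop_words, get_sentiments, bind_tf_idf
library(reshape2)   #acast
library(wordcloud)  #comparison.cloud
library(radarchart) #chartJSRadar

kanye <- get_discography("Kanye West")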

#Gather tracklist from excluded earlier albums
latereg <- get_album_tracks("5ll74bqtkcXlKE7wwkMq4g")
thecollegedropout <- get_album_tracks("4Uv86qWpGTxf7fU7lG5X6F")

latereg$album_name <- "Late Registration"
latereg$album_release_date <- as.Date("2005-08-30")

thecollegedropout$album_name <- "The College Dropout"
thecollegedropout$album_release_date <- as.Date("2004-02-10")

earlyalbums <- bind_rows(thecollegedropout,latereg)

#Retrieve audio feature information
earlyfeatures <- get_track_audio_features(earlyalbums$id)

#Join on track id
earlycomp <- earlyalbums %>%
  left_join(earlyfeatures, by = "id") %>%
  mutate(track_name = name) %>%
  select(track_name, album_name, danceability, speechiness, instrumentalness, tempo, valence, explicit, album_release_date)

latercomp <- kanye %>%
  mutate(album_release_date = as.Date(album_release_date)) %>%
  select(track_name, album_name, danceability, speechiness, instrumentalness, tempo, valence, explicit, album_release_date)

kanyedisc <- earlycomp %>%
  bind_rows(latercomp) %>%
  distinct(track_name, explicit, .keep_all= TRUE)

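#Attempt to remove duplicates not caught by the previous distinct() call
#NOTE: `!=` against a two-element vector is recycled element-wise, so which rows are dropped depends on row position;
#`!track_name %in% c(...)` would instead drop every row matching either spelling
kanyedisc <- kanyedisc %>%
  filter(track_name != c("Pinocchio Story - Freestyle / Live From Singapore","Pinocchio Story (Freestyle Live From Singapore)"))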

#Create list of tracks to source lyrics for. Spotify's records are incomplete or use unconventional spellings
kanyedisc %>%
  select(album_name,track_name) %>%
  write.csv("kanyetracks.csv")

#Append lyrics
klyrics <- read_excel("C:/Users/Daniel Petterson/Documents/R/spotify/kanyetracks.xlsx")

#standardise 808s and Heartbreak name before performing join
kanyedisc$album_name <- gsub(" \\(Softpak\\)", "", kanyedisc$album_name)

kanyediscly <- kanyedisc %>%
  inner_join(klyrics, by = c("track_name", "album_name")) %>%
  select(track_name, album_name, danceability, speechiness, instrumentalness, tempo, valence, explicit, album_release_date, lyrics)

#Replace line breaks with spaces, then strip bracketed section tags such as [Chorus]
kanyediscly$lyrics <- gsub("\\n", " ", kanyediscly$lyrics)
kanyediscly$lyrics <- gsub("\\[(.*?)\\]", " ", kanyediscly$lyrics)
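
The two clean-up calls above can be checked on a small string. This is a minimal sketch using an invented lyric snippet (not taken from Genius); the final whitespace-collapsing step is an extra that is not part of the original code.

```r
# Toy lyric text; the section tags mimic Genius-style annotations (invented example)
raw <- "[Intro]\nWake up\n[Verse 1: Kanye West]\nGood morning"

# Replace line breaks with spaces, then drop anything inside square brackets
no_breaks <- gsub("\\n", " ", raw)
no_tags   <- gsub("\\[(.*?)\\]", " ", no_breaks)

# Collapse the runs of spaces left behind (an extra step, not in the original code)
clean <- trimws(gsub("\\s+", " ", no_tags))
print(clean)
```

The lazy quantifier `.*?` matters here: it stops at the first closing bracket, so the lyrics between two tags are preserved rather than swallowed.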

Number of Explicit or Clean Tracks Per Album
Album	Explicit	n
Graduation	TRUE	13
Late Registration	TRUE	19
My Beautiful Dark Twisted Fantasy	TRUE	13
The College Dropout	TRUE	19
The Life Of Pablo	TRUE	20
ye	TRUE	7
Yeezus	TRUE	10
808s & Heartbreak	FALSE	12
Graduation	FALSE	13
Late Registration	FALSE	19
The College Dropout	FALSE	19
The Life Of Pablo	FALSE	19

We see from this table that many of Kanye’s albums have both clean and explicit versions. The album 808s & Heartbreak features no explicit songs, whereas My Beautiful Dark Twisted Fantasy (MBDTF), ye, and Yeezus contain only explicit songs. For songs that have both clean and explicit versions, we will use the explicit versions, as these are more likely to accurately depict the sentiment and themes that he wished to convey.
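
One way to make the preference for explicit versions independent of row order is to sort explicit rows first before dropping duplicates. The following is a base R sketch on invented rows, not the code used for this analysis.

```r
# Toy track table (invented rows) with a clean and an explicit version of one song
tracks <- data.frame(
  track_name = c("Song A", "Song A", "Song B"),
  explicit   = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Put explicit rows first, then keep the first row seen for each track name
ordered_tracks <- tracks[order(!tracks$explicit), ]
preferred <- ordered_tracks[!duplicated(ordered_tracks$track_name), ]
print(preferred)
```

Song A keeps its explicit version, while Song B, which only exists as a clean track, is retained as it is.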

Spotify Audio Features

Using the Spotify API service we are able to access the Audio Features of different tracks.

Speechiness

One of the metrics recorded by Spotify is “Speechiness”, an indicator of the prevalence of spoken words in a track. A score above 0.66 suggests the track is probably made up almost entirely of spoken words, a score between 0.33 and 0.66 suggests it may contain both music and speech, and a score below 0.33 suggests minimal speech.
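
To make these thresholds concrete, scores can be binned into three bands. This is a small sketch with invented scores; the band labels are our own, and placing a score of exactly 0.33 or 0.66 in the upper band is an assumption.

```r
# Invented speechiness scores for illustration
speechiness <- c(0.05, 0.40, 0.90)

# Bands follow the thresholds quoted above; how exact boundary values
# (0.33 and 0.66) are assigned is an assumption
band <- cut(
  speechiness,
  breaks = c(0, 0.33, 0.66, 1),
  labels = c("mostly music", "music and speech", "mostly spoken"),
  right = FALSE,
  include.lowest = TRUE
)
print(data.frame(speechiness, band))
```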

#Keeps explicit version where there are also radio edit versions
kanyedistinct <- kanyediscly %>%
  distinct(track_name, .keep_all= TRUE)

#Correct release date for Pinocchio Story
kanyedistinct$album_release_date <- gsub("2008-01-01", "2008-11-24", kanyedistinct$album_release_date)
  
#Set correct order for plot legends
kanyedistinct$album_name <- factor(kanyedistinct$album_name, levels = c("The College Dropout", "Late Registration", "Graduation", "808s & Heartbreak","My Beautiful Dark Twisted Fantasy", "Yeezus", "The Life Of Pablo", "ye"))

kanyedistinct %>%
  filter(track_name != "Frank's Track") %>% #This track has zero values for its audio features
  ggplot(aes(x = year(album_release_date), y = speechiness, color = album_name)) +
  geom_point(alpha = 0.7) +
  geom_boxplot() +
  #Red line joins the median for each release year (`fun` replaces `fun.y` from ggplot2 3.3.0)
  stat_summary(aes(group = 1), fun = median, colour = "red", geom = "line") +
  labs(x = "Time", y = "Speechiness", title = "Speechiness over Time") +
  guides(color = guide_legend(title = "Album")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) #Centre the title

Over time we can see some substantial changes in Speechiness between Kanye’s main albums. His first two albums, The College Dropout and Late Registration, have the largest ranges. This is because these albums feature a variety of themed skits with minimal instrument usage; Skit #4 from Late Registration is a prime example. There appears to be a noticeable downward trend that bottoms out with 808s & Heartbreak and then gradually rebounds. The change in musical styling associated with 808s & Heartbreak is often attributed to emotional and mental challenges in Kanye’s life, particularly the death of his mother, who heavily influenced his career and was the inspiration for the songs “Hey Mama” and “Only One”.

Valence

Valence is a measure from 0 to 1 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). Median valence decreases between Kanye’s debut album and 808s & Heartbreak. Surprisingly, 808s & Heartbreak does not appear to be much more negative than the average. This is likely because many of the backing sounds are rather upbeat even while the lyrical themes are quite the opposite.

kanyedistinct %>%
  filter(track_name != "Frank's Track") %>% #This track has zero values for its audio features
  ggplot(aes(x = year(album_release_date), y = valence, color = album_name)) +
  geom_point(alpha = 0.7) +
  geom_boxplot() +
  #Red line joins the median for each release year (`fun` replaces `fun.y` from ggplot2 3.3.0)
  stat_summary(aes(group = 1), fun = median, colour = "red", geom = "line") +
  labs(x = "Time", y = "Valence", title = "Valence over Time") +
  guides(color = guide_legend(title = "Album")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) #Centre the title

Tempo

There appears to be a marked difference in median tempo between Kanye’s first three albums and his later ones. The range of tempo values is, however, similar across all albums. This is not illustrated well in the box-and-whisker plot because values lying more than a certain multiple of the interquartile range beyond the quartiles are treated as outliers and are not connected by the “whiskers”.

kanyedistinct %>%
  filter(track_name != "Frank's Track") %>% #This track has zero values for its audio features
  ggplot(aes(x = year(album_release_date), y = tempo, color = album_name)) +
  geom_point(alpha = 0.7) +
  geom_boxplot() +
  #Red line joins the median for each release year (`fun` replaces `fun.y` from ggplot2 3.3.0)
  stat_summary(aes(group = 1), fun = median, colour = "red", geom = "line") +
  labs(x = "Time", y = "Tempo", title = "Tempo over Time") +
  guides(color = guide_legend(title = "Album")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) #Centre the title
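
The outlier rule mentioned above can be worked through by hand. The sketch below uses invented tempo values and a multiplier of 1.5, which is the default `coef` in geom_boxplot(); it illustrates the rule and is not a reproduction of the plot.

```r
# Invented tempo values (beats per minute) with one unusually fast track
tempo <- c(80, 85, 90, 92, 95, 100, 170)

# Quartiles and interquartile range
q1  <- quantile(tempo, 0.25, names = FALSE)
q3  <- quantile(tempo, 0.75, names = FALSE)
iqr <- q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are drawn as separate
# dots rather than being reached by the whiskers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- tempo[tempo < lower_fence | tempo > upper_fence]
print(outliers)
```

Here the whiskers would stop at 80 and 100, so the box understates the full range of 80 to 170, which is the effect described for the tempo plot.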

Kanye’s Lyrics

A list of the 20 most common words is not particularly illuminating in this analysis because a lot of these words are filler or “stop” words which convey no sentiment. If we remove these words then we should get a more accurate picture of the themes expressed in Kanye’s lyrics.

kanyelyrics <- kanyedistinct %>%
  select(album_name, track_name, lyrics) %>%
  unnest_tokens(word, lyrics)

kanyelyrics %>%
  filter(nchar(word) >= 3) %>% #Drop tokens shorter than 3 characters, such as "ah" or "oo"
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  ggplot(aes(reorder(word, -n), n)) +
  geom_col(show.legend = FALSE, aes(fill = word)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = -60, hjust = 0, size = 12)) +
  labs(x = "Word", y = "Occurrences", title = "Most common words in Kanye's lyrics")

It may be surprising to some people that the most frequent word in Kanye’s discography is Love but this is likely skewed by the word appearing in the choruses of multiple tracks. Alongside Love we see more stereotypical words used in rap such as Money and Baby/Girl. Some of these words are more ambiguous in their meanings without proper context but we can use the Bing lexicon to assign each word to either a positive or negative category.

kanyelyrics %>%
  filter(nchar(word) >= 3) %>%  #Remove words with fewer than 3 characters
  anti_join(stop_words) %>% #Remove stop words
  count(word, sort = TRUE) %>%
  top_n(20) %>% #select top 20 most common words
  ggplot(aes(reorder(word, -n), n)) +
  geom_col(show.legend = FALSE, aes(fill = word)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = -60, hjust = 0, size = 12)) +
  labs(x = "Word", y = "Occurrences", title = "Most common words in Kanye's lyrics")
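
The effect of removing stop words and very short tokens can be seen on a single line. This is a base R sketch with an invented lyric and a tiny hand-picked stop-word list, which stands in for (and is far smaller than) tidytext's stop_words.

```r
# Invented lyric line and a tiny hand-picked stop-word list (not tidytext's stop_words)
lyric <- "i love the way you love the money and the love"
toy_stop_words <- c("i", "the", "you", "and")

# Split into lower-case word tokens
tokens <- strsplit(tolower(lyric), "[^a-z']+")[[1]]

# Drop stop words and tokens shorter than 3 characters, then count what is left
kept <- tokens[!tokens %in% toy_stop_words & nchar(tokens) >= 3]
word_counts <- sort(table(kept), decreasing = TRUE)
print(word_counts)
```

Without the filter, "the" would tie with "love" as the most common token; with it, only content words remain in the counts.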

Splitting Words into Positive and Negative Sentiment

Here we see the top 125 words from each category, with the size of each word indicating how frequently it occurs. There are certain limitations to this method that we must bear in mind. One is that any word that does not appear in the lexicon is excluded. This is likely more of a problem in rap lyrics than in, say, a politician’s speech, as words are often spelled unconventionally. Words may also be used to express an idea different from their traditional definition, which can be misleading as to the sentiment of the lyrics.

#Word cloud split to positive and negative
kanyelyrics %>%
  inner_join(get_sentiments("bing")) %>% #Reference against Bing lexicon to assign positive or negative sentiments to matching words
  anti_join(stop_words) %>% #Remove unimportant/nonsentimental words 
  count(word, sentiment, sort = TRUE) %>% #Establish a count of sentimental words
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% #Spread sentiment into positive and negative columns
  comparison.cloud(colors = c("#F8766D", "#00BFC4"), max.words = 250) #Generate Wordcloud
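
The limitation described above, that words missing from the lexicon are dropped, follows directly from the inner join. A base R sketch with an invented three-word lexicon standing in for get_sentiments("bing"):

```r
# Invented tokens and a toy lexicon standing in for get_sentiments("bing")
tokens <- data.frame(word = c("love", "pain", "love", "yeezy", "hate"),
                     stringsAsFactors = FALSE)
toy_lexicon <- data.frame(word = c("love", "pain", "hate"),
                          sentiment = c("positive", "negative", "negative"),
                          stringsAsFactors = FALSE)

# Inner join: tokens missing from the lexicon (here "yeezy") are silently dropped
matched <- merge(tokens, toy_lexicon, by = "word")

# Word-by-sentiment counts, similar in shape to the matrix passed to comparison.cloud()
counts <- table(matched$word, matched$sentiment)
print(counts)
```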

How The Love is Spread Around

One issue that can arise when dealing with lyrics is the inclusion of phonetic representations of sounds or non-words. While this won’t be a problem when carrying out an inner join with a sentiment lexicon like Bing, because such tokens would simply be ignored, it would affect the proportion of total words that particular words represent. For that reason I decided to cross-reference the list of tokens (words) with a comprehensive English dictionary which can be found here.

album_words %>%
  anti_join(stop_words) %>%
  filter(word == "love") %>%
  ggplot(aes(factor(album_name, levels = rev(levels(album_name))), freq, fill = album_name)) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous() +
  coord_flip() +
  labs(x = "Album", y = "Frequency of Love", title = "Frequency of usage of the word Love across Kanye's albums")
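
The code that builds album_words is not shown in this post. As an illustration only, a table with the same columns (album_name, word, n, freq) could be derived along the following lines. The tokens and the three-word dictionary are invented, and the name album_words_sketch is hypothetical.

```r
# Invented tokens per album and a toy dictionary standing in for a full English word list
tokens <- data.frame(
  album_name = c("A", "A", "A", "A", "B", "B", "B"),
  word       = c("love", "uh", "love", "money", "love", "nah", "fear"),
  stringsAsFactors = FALSE
)
toy_dictionary <- c("love", "money", "fear")

# Keep only dictionary words so that non-words do not inflate the denominators
real_words <- tokens[tokens$word %in% toy_dictionary, ]

# Count each word per album, then divide by that album's total of kept words
album_words_sketch <- as.data.frame(table(real_words$album_name, real_words$word),
                                    stringsAsFactors = FALSE)
names(album_words_sketch) <- c("album_name", "word", "n")
totals <- tapply(album_words_sketch$n, album_words_sketch$album_name, sum)
album_words_sketch$freq <- album_words_sketch$n /
  as.numeric(totals[album_words_sketch$album_name])
print(album_words_sketch[album_words_sketch$word == "love", ])
```

In this toy data, "love" makes up two thirds of album A's dictionary words and half of album B's; had "uh" and "nah" been kept, both proportions would have been lower.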

The theme of love is not evenly distributed across all of Kanye’s albums. It becomes considerably more prevalent after his first three albums. As a proportion of all words, love is greatest in 808s & Heartbreak and ye, both of which are considered to be albums that he wrote after suffering mental hardship. The majority of the use of the word love in ye is actually related to lyrics about suicide and murder so while Love may be categorised as having positive sentiment, this is not always the case.

Most Frequent Words by Album

Looking at the most frequently used words per album, we see a pattern: quite a few of the words are only listed because they appear in the chorus of a single song. In Stronger, which samples Daft Punk’s Harder, Better, Faster, Stronger, the words stronger, harder and faster all feature in the rather repetitive chorus.

album_words %>%
  filter(nchar(word) >= 3) %>%  #Remove words with fewer than 3 characters
  bind_tf_idf(word, album_name, n) %>%
  group_by(album_name) %>%
  top_n(5, tf_idf) %>%
  ggplot(aes(reorder(word, tf_idf), tf_idf, fill = album_name)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ album_name, scales = "free_y", ncol = 3) +
  labs(x = "Word", y = "tf-idf", title = "Uniquely frequent terms among Kanye's albums") + #The plotted value is tf-idf, not a raw proportion
  theme_minimal()
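
The ranking above relies on tf-idf, which weights a word's share of an album by how few albums it appears in. The sketch below works through the usual definition (term frequency multiplied by the natural log of albums over albums containing the word) on invented counts. We believe this matches what bind_tf_idf() computes, but treat that as an assumption.

```r
# Invented per-album word counts
counts <- data.frame(
  album_name = c("A", "A", "B", "B"),
  word       = c("love", "stronger", "love", "wolves"),
  n          = c(6, 4, 5, 5),
  stringsAsFactors = FALSE
)

# Term frequency: the word's share of the album's counted words
album_totals <- tapply(counts$n, counts$album_name, sum)
counts$tf <- counts$n / as.numeric(album_totals[counts$album_name])

# Inverse document frequency: ln(number of albums / albums containing the word)
n_albums <- length(unique(counts$album_name))
albums_with_word <- tapply(counts$album_name, counts$word,
                           function(a) length(unique(a)))
counts$idf <- log(n_albums / as.numeric(albums_with_word[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

Because "love" appears in both toy albums its idf is zero, so it drops out of the ranking however often it is used, while a chorus word confined to one album scores highly.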

Number of Unique Words Per Song

The graph below uses Plotly, so hovering over a point shows the song it represents and the number of unique words used. We see that 808s & Heartbreak and ye have a similar range, but the most verbally diverse song from 808s, Pinocchio Story, has fewer unique words than the majority of songs featured in ye.

ggplotly(
  ggplot(num_unique, aes(x=album_name, y=unique_words, color = album_name, label = track_name, label2 = unique_words)) + 
  geom_boxplot() +
  geom_point() +
  labs(x="Album", y="Number of Unique Words") +
  scale_x_discrete(limits = rev(levels(num_unique$album_name))) + #Change order to downwards chronological
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none"), #Remove legend applied by color aesthetic 
  tooltip = c("label", "label2")
  )
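
The construction of num_unique is not shown in this post. As a hedged illustration, the number of distinct words per track could be computed as follows; the tokens are invented and the name num_unique_sketch is hypothetical.

```r
# Invented token table: one row per word occurrence
tokens <- data.frame(
  track_name = c("Track 1", "Track 1", "Track 1", "Track 2", "Track 2", "Track 2"),
  word       = c("love", "love", "lockdown", "ghost", "town", "kids"),
  stringsAsFactors = FALSE
)

# Number of distinct words per track
unique_words <- tapply(tokens$word, tokens$track_name,
                       function(w) length(unique(w)))
num_unique_sketch <- data.frame(track_name = names(unique_words),
                                unique_words = as.integer(unique_words),
                                stringsAsFactors = FALSE)
print(num_unique_sketch)
```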

Positivity of Sentiment

There are a few extreme values at either -1 or 1 on the scale. With the exception of Low Lights, all of these have either the fewest or second fewest unique words in their respective albums. The track with the most unique words, No More Parties In LA, has a sentiment score just under zero, indicating an overall neutral sentiment. The red line represents the mean sentiment score by album. Half of the albums in Kanye’s core discography have an average sentiment score above zero, indicating an average positive sentiment in the lyrics, while the others appear to have an overall negative sentiment.

ggplotly(
  kanyelyrics %>%
  group_by(album_name) %>%
  inner_join(get_sentiments("bing")) %>%
  count(track_name, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = (positive - negative)/(positive + negative)) %>%
  ggplot(aes(x=album_name,y = sentiment,label = track_name, label2 = sentiment)) +
  geom_point(size=4.5, aes(color = album_name), alpha=0.3) +
  stat_summary(aes(y = sentiment, group = 1), fun = mean, colour = "red", geom = "line") + #`fun` replaces `fun.y` from ggplot2 3.3.0
  scale_x_discrete(limits = rev(levels(num_unique$album_name))) + #Change order to downwards chronological
  coord_flip() +
  labs(x="Album", y="Sentiment") +
  theme_minimal() +
  theme(legend.position = "none"), #Remove legend applied by color aesthetic 
  tooltip = c("label", "label2")
)
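
The score used above is the difference between positive and negative word counts divided by their sum, so it always lies between -1 and 1. A short sketch with invented counts shows why tracks with very few lexicon matches tend to land at the extremes.

```r
# Net sentiment score: (positive - negative) / (positive + negative)
sentiment_score <- function(positive, negative) {
  (positive - negative) / (positive + negative)
}

# Invented counts of lexicon-matched words for three tracks
many_matches <- sentiment_score(positive = 12, negative = 4) # mostly positive words
balanced     <- sentiment_score(positive = 5,  negative = 5) # equal counts
few_matches  <- sentiment_score(positive = 0,  negative = 2) # only two matched words
print(c(many_matches, balanced, few_matches))
```

A track with only two matched words, both negative, scores exactly -1, which is consistent with the extreme values belonging to tracks with few unique words.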

A More Complex Lexicon

As simply categorising the words as positive or negative may not be the best indication of the sentiment and themes conveyed by the lyrics, we can also use the NRC Emotion lexicon to derive more information from the data. In addition to defining words as positive or negative in a similar fashion to the Bing lexicon, it associates words with one or more emotions such as anger, trust, anticipation or fear.

album_sentiment <- kanyelyrics %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(album_name, sentiment) %>%
  count(album_name, sentiment) %>%
  select(album_name, sentiment, sentiment_album_count = n)

total_sentiment_album <- kanyelyrics %>%
  count(album_name) %>% #Count total number of words in album
  select(album_name, sentiment_total = n)

#album_spider_chart
album_sentiment %>%
  inner_join(total_sentiment_album, by = "album_name") %>%
  mutate(percent = sentiment_album_count / sentiment_total * 100 ) %>% #Number of words per emotion/Total words per album
  select(-sentiment_album_count, -sentiment_total) %>% #Remove unnecessary columns
  spread(album_name, percent) %>%
  chartJSRadar(showToolTipLabel = TRUE,
               main = "NRC Album Spider Chart")

# Click on the album title to include/exclude it from the chart
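
The percentages plotted on the radar chart divide the number of emotion-tagged words by the total number of words in the album. The base R sketch below uses invented tokens and a toy lexicon standing in for get_sentiments("nrc"). Note that a word tagged with two emotions contributes to both.

```r
# Invented tokens for one album and a toy lexicon standing in for get_sentiments("nrc")
album_tokens <- c("love", "fear", "money", "love", "ghost", "the", "and", "fight")
toy_nrc <- data.frame(
  word      = c("love", "love", "fear", "fear", "fight"),
  sentiment = c("joy", "positive", "fear", "negative", "anger"),
  stringsAsFactors = FALSE
)

# One row per (token, emotion) pair; a word tagged with two emotions counts twice
matched <- merge(data.frame(word = album_tokens, stringsAsFactors = FALSE),
                 toy_nrc, by = "word")

# Share of all album tokens (matched or not) associated with each emotion
percent <- table(matched$sentiment) / length(album_tokens) * 100
print(round(percent, 1))
```

Because the denominator includes unmatched tokens and stop words, the percentages do not sum to 100 and are best read relative to one another across albums.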

Compared to the Bing lexicon, we see similar results regarding negative sentiment, with ye, Yeezus and My Beautiful Dark Twisted Fantasy being regarded as the most negative. However, the album with the highest positive sentiment score using NRC is actually ranked as rather negative using Bing. Looking at the values for Kanye’s first three albums compared to his most recent, ye, it is apparent that there is an increase in fear, anger and negative sentiment and a reduction in positive sentiment. Yeezus and, to a lesser extent, My Beautiful Dark Twisted Fantasy follow a similar distribution to ye, while Late Registration and 808s & Heartbreak mimic each other almost perfectly. It could be said that without the inclusion of The Life of Pablo, an album that was seen as somewhat of a break from Kanye’s previous stylings, there is a clear transition to more negative or fear-evoking lyrics.