Music Recommendation
Posted on Thu 30 November 2017 in misc
Music Recommendation - Part 1¶
This is part 1 of a series of articles where I explore music data sets using various similarity metrics and heuristics.
Objectives¶
By the end of this series of articles, I intend to answer the following questions using all the data sets.
- Which bands/songs are the best representatives of the post-punk genre?
- Given a band/song that is tagged post-punk, output a list of the bands/songs that are most similar to it.
- What are the genres that are most similar to post-punk?
- Implement a Boil the Frog clone that charts a seamless path between two disparate bands using a sequence of songs.
Thoughts¶
Genres are human constructs: they are loosely defined, evolve over time, and are perceived differently from person to person. For that reason, I strongly feel that data sets of user tags can give better results for genre-based segregation of songs/bands than data sets containing metadata about intrinsic properties. The obvious drawback is that such a system cannot recommend or classify new songs for which no tag data is available.
Is it possible to recommend/classify songs by genre based strictly on intrinsic properties? Probably yes, for some niche sub-genres as well as for overarching genres like rock, electronic and metal, given the right set of features. A good amount of research has been done using the GTZAN dataset, which provides 30-second samples of 100 typical representatives for each of 10 overarching genres. The image below, provided by /u/monkeasy, was generated from mel spectrograms (no idea what they are) using supervised learning and dimensionality reduction techniques.
I will begin by exploring the LastFM-ArtistTags2007 data set, which has about a million entries. Eventually I intend to combine the results from all the data sets and see how they fare against results from the Spotify / Last.fm / MusicBrainz services.
Some Stats¶
Total Lines: 952810
Unique Artists: 20907
Unique Tags: 100784
Total Tags: 7178442
I'm going to start with elementary techniques and gradually move to more sophisticated ones to tackle this problem. You'll be surprised how good the results from even the most elementary methods can be when the data set is reasonably large.
Methodology¶
- Load the data into a Pandas DataFrame
- Clean!
- Convert the artist tag data to normalized numpy arrays. Each band can be thought of as a vector with genres as the basis
- Explore similarity using elementary similarity metrics and a couple of distance measures I conjured up for this problem
- Explore various metrics from this paper
- Maybe use SVD and t-SNE for dimensionality reduction: each band is restricted to 100 tags, and there are more than a hundred thousand unique tags!
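As a rough idea of what the SVD step in the last item might look like, here is a minimal sketch on a made-up band-by-tag matrix. The matrix values and the choice of `k = 2` are assumptions for illustration only, not numbers from the data set.

```python
import numpy as np

# Hypothetical band-by-tag count matrix: 4 bands (rows), 5 tags (columns).
X = np.array([
    [10., 5., 0., 0., 1.],
    [ 8., 6., 0., 1., 0.],
    [ 0., 0., 9., 7., 0.],
    [ 0., 1., 8., 6., 0.],
])

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of dimensions to keep (an arbitrary choice here)
reduced = U[:, :k] * s[:k]      # each band as a k-dimensional vector
approx = reduced @ Vt[:k, :]    # rank-k approximation of X

print(reduced.shape)
print(np.round(approx, 1))
```

The rows of `reduced` could then be compared with the same similarity metrics as the full tag vectors, at a fraction of the dimensionality.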
Load Data and Filter Bands that are tagged Post Punk¶
import re
import pandas as pd
import numpy as np
dataPath = '/home/harsha/Downloads/data-sets/Lastfm-ArtistTags2007/ArtistTags.dat'
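# Fields in ArtistTags.dat are separated by the literal string '<sep>';
# a multi-character separator needs the python parsing engine
data = pd.read_csv(dataPath, delimiter='<sep>', header=None, engine='python')
# Drop the first column; the remaining columns 1, 2, 3 hold artist, tag and tag count
data = data.iloc[:,1:]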
# Get bands that contain post punk tag
postPunkData = data[data[2].str.contains(r'[pP]ost[ \-]*[pP]unk', na=False)]
postPunkBands = sorted(set(postPunkData[1]))
# Get data of bands that are tagged post punk
postPunkRaw = data[data[1].isin(postPunkBands)]
postPunkRaw = postPunkRaw.fillna('unknown')
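As a quick check of what the filter pattern keeps, the sketch below runs the same regular expression over a handful of made-up tag strings. The sample tags are illustrative and not taken from the data set.

```python
import re

# Same pattern as the filter above: "post" and "punk" separated by any
# number of spaces or hyphens (including none), first letters in either case.
POST_PUNK = re.compile(r'[pP]ost[ \-]*[pP]unk')

# Hypothetical sample tags, for illustration only.
sample_tags = ['post-punk', 'Post Punk', 'postpunk', 'post punk revival',
               'punk', 'post rock', 'POST-PUNK']

# Series.str.contains performs a regex search, so re.search mirrors it.
matched = [tag for tag in sample_tags if POST_PUNK.search(tag)]
print(matched)
```

Note that an all-caps tag such as 'POST-PUNK' is not matched, because only the first letter of each word is allowed to vary in case.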
Cleaning up¶
$\text{post punk} \;=\; \text{post-punk} \;=\; \text{Post -punk}$ etc.
$\text{70s} \;=\; \text{70 s} \;=\; \text{70's}$
# Lower-case the tags and strip surrounding whitespace
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: x.lower().strip())
# Replace runs of hyphens with a single space
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(r'-+',' ',x))
# Replace contiguous space segments with a single space
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(r' +',' ',x))
# Fix the ' s' "'s" issue
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(r' s$', 's',x))
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(r"'s", 's',x))
genres = list(sorted(set(postPunkRaw[2])))
len(genres)
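The same cleaning steps can be collected into a single function and tried on a few strings. This is only a sketch: `normalize_tag` is a hypothetical helper name, and the sample tags are made up.

```python
import re

def normalize_tag(tag):
    """Apply the cleaning steps above to a single tag string."""
    tag = tag.lower().strip()          # lower-case, trim outer whitespace
    tag = re.sub(r'-+', ' ', tag)      # hyphens become spaces
    tag = re.sub(r' +', ' ', tag)      # collapse runs of spaces
    tag = re.sub(r' s$', 's', tag)     # "70 s" -> "70s"
    tag = re.sub(r"'s", 's', tag)      # "70's" -> "70s"
    return tag

# Hypothetical sample tags, for illustration only.
samples = ['Post -punk', 'post-punk', ' post punk ', "70's", '70 s', '70s']
cleaned = [normalize_tag(t) for t in samples]
print(cleaned)
```

All three spellings of each tag collapse to one form, which is what shrinks the list of unique genres.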
$\text{post punk} \;=\; \text{postpunk} \qquad \text{indie rock} \;=\; \text{indierock}$
duplicates = []
compoundGenres = list(filter(lambda x: ' ' in x or '-' in x, genres))
fixedCompounds = [genre.replace('-','').replace(' ','') for genre in compoundGenres]
monoGenres = set(genres) - set(compoundGenres)
for index, genre in enumerate(fixedCompounds):
    if genre in monoGenres:
        duplicates.append((genre, compoundGenres[index]))
monoComposites = [genre[0] for genre in duplicates]
composites = [genre[1] for genre in duplicates if '-' in genre[1]]
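To see the duplicate detection at work, here is a small sketch on a made-up genre list. It uses a dictionary (`merge_map`, a hypothetical name) in place of the paired lists, which is one possible alternative that avoids the repeated list searches.

```python
# Hypothetical cleaned tag list, for illustration only.
genres = ['indie rock', 'indierock', 'new wave', 'post punk', 'postpunk', 'shoegaze']

compound = [g for g in genres if ' ' in g]
single = set(genres) - set(compound)

# Map each run-together spelling to its spaced spelling, if both exist.
merge_map = {}
for g in compound:
    joined = g.replace(' ', '')
    if joined in single:
        merge_map[joined] = g

print(merge_map)

# Apply the mapping and keep the unique genres that remain.
merged = sorted({merge_map.get(g, g) for g in genres})
print(merged)
```

'new wave' is left alone because no 'newwave' spelling exists in the list, so only genuine duplicates are merged.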
Cleaning Functions¶
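def fixGenres(cellValue):
    # Map a run-together spelling (e.g. 'postpunk') to its spaced form
    if cellValue in monoComposites:
        return duplicates[monoComposites.index(cellValue)][1]
    else:
        return cellValue
def replaceArtistNames(cellValue):
    # Band name -> integer id; postPunkBands must be a numpy array when this is called
    return np.where(postPunkBands==cellValue)[0][0]
def replaceGenreNames(cellValue):
    # Tag -> integer id; genres must be a numpy array when this is called
    return np.where(genres==cellValue)[0][0]
def fixEmpty(cellValue):
    if cellValue == '':
        return "unknown"
    else:
        return cellValue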
postPunkRaw[2] = postPunkRaw[2].apply(fixGenres)
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(' s$', 's',x))
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: x.strip())
postPunkRaw[2] = postPunkRaw[2].apply(fixEmpty)
genres = np.array(sorted(set(postPunkRaw[2])))
postPunkBands = np.array(sorted(set(postPunkRaw[1])))
postPunkRaw[1] = postPunkRaw[1].apply(replaceArtistNames)
postPunkRaw[2] = postPunkRaw[2].apply(replaceGenreNames)
len(genres)
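genrArray = np.array(genres)
# One zero vector per band, with one slot per genre
bandarray = [np.zeros(len(genres)) for band in range(len(postPunkBands))]
for bandNumber in range(len(postPunkBands)):
    # Total tag count per genre id for this band
    pq = postPunkRaw[postPunkRaw[1]==bandNumber].groupby(2)[3].sum().sort_index()
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for index, value in pq.items():
        bandarray[bandNumber][index] = value
# Scale each band vector to unit length
normedBandArray = [band / np.linalg.norm(band) for band in bandarray]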
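The loop above filters the DataFrame once per band. As an alternative sketch, assuming the band and genre columns already hold integer ids (as they do after the replacement step), the whole matrix can be filled in one call with `np.add.at`. The triples below are made up for illustration.

```python
import numpy as np

# Hypothetical (band id, genre id, tag count) triples.
band_ids  = np.array([0, 0, 0, 1, 1, 2])
genre_ids = np.array([0, 2, 0, 1, 2, 3])
counts    = np.array([3., 4., 1., 5., 12., 2.])

n_bands, n_genres = 3, 4
matrix = np.zeros((n_bands, n_genres))

# Unbuffered add: repeated (band, genre) pairs accumulate,
# mirroring the groupby(...).sum() in the loop above.
np.add.at(matrix, (band_ids, genre_ids), counts)

# Scale every row to unit Euclidean length.
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
normed = matrix / norms

print(matrix)
print(normed)
```

A band with no tags at all would have a zero norm and produce a division by zero here, so such rows would need to be dropped or handled separately.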
Similarity Metrics¶
Let's start with an elementary metric that even a 10-year-old kid can understand. Ignoring the 'normalized weights' of the various tags/genres given to bands, we can quantify the similarity between two bands by the number of tags they have in common.
Say Band-1 has 5 tags - [post punk, shoegaze, mathrock, alternative, UK 2000s] and Band-2 has 4 tags - [post punk, indie rock, noise, alternative]. The similarity between them is simply the number of common tags, which in this case equals 2 (post punk and alternative).
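The example translates directly to Python sets (tag names are lower-cased here for consistency):

```python
band_1 = {'post punk', 'shoegaze', 'mathrock', 'alternative', 'uk 2000s'}
band_2 = {'post punk', 'indie rock', 'noise', 'alternative'}

# Set intersection keeps only the tags present in both bands.
common = band_1 & band_2
print(len(common), sorted(common))
```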
As trivial as this might seem, it spat out pretty decent results, more than passable in fact.
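def elSimo(artistName):
    try:
        # An unknown name gives an empty match array, so [0][0] raises IndexError
        artistId = np.where(postPunkBands==artistName)[0][0]
        simBands = []
        # Indices of the tags carried by the query band
        xc = np.where(normedBandArray[artistId] != 0)
        for index, band in enumerate(normedBandArray):
            yc = np.where(band != 0)
            # Number of tags shared by the two bands
            common = len(set(xc[0]).intersection(set(yc[0])))
            simBands.append((postPunkBands[index], common))
        return sorted(simBands, key=lambda x: x[1], reverse=True)
    except IndexError:
        return "Band not in Database. Try something else"
# Skip the top entry (normally the query band itself) and show the next 19
elSimo('Franz Ferdinand')[1:20]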
It is easy to see that this trivial solution is only fair if all the bands have an equal number of tags, since a band with many tags has more chances of sharing some of them. We could take care of this by dividing the result by the sum of the number of tags of the two bands that are being compared.
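Applied to the two example bands from earlier, the normalized score can be sketched with plain sets (`overlap_ratio` is a hypothetical helper name):

```python
def overlap_ratio(tags_a, tags_b):
    """Common tags divided by the sum of the two tag counts."""
    return len(tags_a & tags_b) / (len(tags_a) + len(tags_b))

band_1 = {'post punk', 'shoegaze', 'mathrock', 'alternative', 'uk 2000s'}
band_2 = {'post punk', 'indie rock', 'noise', 'alternative'}

print(overlap_ratio(band_1, band_2))  # 2 / (5 + 4)
print(overlap_ratio(band_1, band_1))  # identical tag sets
```

Because the denominator counts every shared tag twice, two bands with identical tags score 0.5 rather than 1, so the scores live in the range 0 to 0.5. This does not affect the ranking.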
def elSimore(artistName):
    try:
        # An unknown name gives an empty match array, so [0][0] raises IndexError
        artistId = np.where(postPunkBands==artistName)[0][0]
        simBands = []
        xc = np.where(normedBandArray[artistId] != 0)
        for index, band in enumerate(normedBandArray):
            yc = np.where(band != 0)
            common = len(set(xc[0]).intersection(set(yc[0])))
            # Shared tags divided by the combined number of tags of both bands
            simBands.append((postPunkBands[index],
                             common/(len(xc[0]) + len(yc[0]))))
        return sorted(simBands, key=lambda x: x[1], reverse=True)
    except IndexError:
        return "Band not in Database. Try something else"
elSimore('Franz Ferdinand')[1:20]
How about we penalize the similarity metric for tags that are present in one band but not the other? I will explore that and a few other metrics in the next article in this series. I'll leave it here with two more metrics, based on cosine similarity and Euclidean distance (the usual suspects), and their results to compare with our elementary metric. I can see a definite improvement in the quality of the results. The modified cosine similarity (the dot product of the normalized band vectors, scaled by the shared-tag ratio used in elSimore) appears to give the best results of the three.
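To make the "modified" part concrete, here is a minimal sketch of that score on made-up vectors: the cosine similarity of two tag-count vectors multiplied by the shared-tag ratio. The function name `modified_cosine` and the vectors are assumptions for illustration.

```python
import numpy as np

def modified_cosine(a, b):
    """Cosine similarity scaled by the shared-tag ratio."""
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)
    tags_a = set(np.flatnonzero(a))   # indices of the tags band a carries
    tags_b = set(np.flatnonzero(b))
    overlap = len(tags_a & tags_b) / (len(tags_a) + len(tags_b))
    return float(np.dot(a_unit, b_unit)) * overlap

# Hypothetical tag-count vectors over four tags.
x = np.array([3., 4., 0., 0.])
y = np.array([4., 3., 0., 0.])   # same tags as x, different weights
z = np.array([3., 4., 0., 5.])   # same weights as x, plus one extra tag

print(modified_cosine(x, y))
print(modified_cosine(x, z))
```

Here `y` scores higher than `z` against `x`: the extra tag on `z` lowers both the cosine term and the overlap ratio.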
def simDot(artistName):
    try:
        # An unknown name gives an empty match array, so [0][0] raises IndexError
        artistId = np.where(postPunkBands==artistName)[0][0]
        simBands = []
        xc = np.where(normedBandArray[artistId] != 0)
        for index, band in enumerate(normedBandArray):
            yc = np.where(band != 0)
            common = len(set(xc[0]).intersection(set(yc[0])))
            # Cosine similarity of the unit vectors, scaled by the shared-tag ratio
            simBands.append((postPunkBands[index],
                             np.dot(normedBandArray[artistId], band)
                             *(common/(len(xc[0]) + len(yc[0])))))
        return sorted(simBands, key=lambda x: x[1], reverse=True)
    except IndexError:
        return "Band not in Database. Try something else"
def simEucDis(artistName):
    try:
        artistId = np.where(postPunkBands==artistName)[0][0]
        simBands = []
        for index, band in enumerate(normedBandArray):
            # Euclidean distance between the unit vectors
            simBands.append((postPunkBands[index],
                             np.linalg.norm(normedBandArray[artistId] - band)))
        # Smaller distance means more similar, so sort in ascending order
        return sorted(simBands, key=lambda x: x[1])
    except IndexError:
        return "Band not in Database. Try something else"
simDot('Franz Ferdinand')[1:20]
simEucDis('Franz Ferdinand')[1:20]
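One thing worth noting about the Euclidean metric: because every band vector has unit length, the squared distance between two bands equals 2 minus twice their dot product. Ranking by ascending distance therefore gives the same order as ranking by descending plain (unmodified) cosine similarity. The sketch below checks this on random unit vectors; the seed and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
vectors = rng.random((6, 8))            # 6 hypothetical bands, 8 tags
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = vectors[0]
cosine = vectors @ query
distance = np.linalg.norm(vectors - query, axis=1)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
identity_holds = np.allclose(distance ** 2, 2 - 2 * cosine)
print(identity_holds)

# Nearest-first by distance versus most-similar-first by cosine.
by_distance = np.argsort(distance)
by_cosine = np.argsort(-cosine)
print(by_distance, by_cosine)
```

Any difference between the simDot and simEucDis rankings therefore comes from the shared-tag ratio that simDot multiplies in, not from the choice between dot product and distance.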