Point Curry Disco

Music Recommendation

Posted on Thu 30 November 2017 in misc

Music Recommendation - Part 1

This is part 1 of a series of articles where I explore music data sets using various similarity metrics and heuristics.

Datasets

Objectives


By the end of this series of articles, I intend to get answers to the following questions by utilizing all the datasets.

  • Which bands/songs are the best representatives of post-punk genre?
  • Given a band/song that is tagged post-punk, output a list of bands/songs that are most similar to that.
  • What are the genres that are most similar to post-punk?
  • Implement a Boil the Frog clone that charts a seamless path between two disparate bands using a sequence of songs

Thoughts

Genres are human constructs: they are loosely defined, evolve over time, and are perceived differently from person to person. For that reason I strongly feel that, for genre-based grouping of songs and bands, data sets built from user tags can give better results than those containing metadata about intrinsic properties. The obvious drawback is that tag data cannot recommend or classify new songs for which no tags exist yet.

Is it possible to recommend or classify songs by genre based strictly on intrinsic properties? Probably yes, for some niche sub-genres as well as for overarching genres like rock, electronic and metal, provided the right set of features is chosen. A good amount of research has been done using the GTZAN dataset, which provides 30-second samples of 100 typical representatives for each of 10 overarching genres. The image below, provided by /u/monkeasy, was generated from mel spectrograms (no idea what they are) using supervised learning and dimensionality reduction techniques.

To begin with, I will explore the LastFM-ArtistTags2007 data set, which has about a million entries. Eventually I intend to combine the results from all the data sets and see how they fare against results from the Spotify, Last.fm and MusicBrainz services.

Some Stats

Total Lines:      952810
Unique Artists:    20907
Unique Tags:      100784
Total Tags:      7178442

I'm going to start with elementary techniques and gradually move to more sophisticated ones. You may be surprised to find that even the most elementary methods can give pretty good results if the data set is reasonably large.

Methodology

  • Load the Data into a Pandas Dataframe
  • Clean!
  • Convert Artist Tag data to normalized numpy arrays. Each band can be thought of as a vector with genres as the basis
  • Explore similarity using elementary similarity metrics and a couple of distance measures I conjured up for this problem.
  • Explore various metrics from this paper
  • Maybe use SVD and t-SNE for dimensionality reduction: each band is restricted to 100 tags, and there are more than a hundred thousand unique tags!

Load Data and Filter Bands that are tagged Post Punk

In [1]:
import re
import pandas as pd
import numpy as np
CPU times: user 260 ms, sys: 8 ms, total: 268 ms
Wall time: 270 ms
In [2]:
dataPath = '/home/harsha/Downloads/data-sets/Lastfm-ArtistTags2007/ArtistTags.dat'
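# Fields in ArtistTags.dat are separated by the literal string "<sep>"
# (assumed here: the delimiter was lost from this listing)
data = pd.read_csv(dataPath, delimiter='<sep>', header=None, engine='python')
# Drop the first column, keeping columns 1 (artist), 2 (tag) and 3 (count)
data = data.iloc[:,1:]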

# Get bands that contain post punk tag
postPunkData = data[data[2].str.contains(r'[pP]ost[ \-]*[pP]unk', na=False)]
postPunkBands = sorted(set(postPunkData[1]))

# Get data of bands that are tagged post punk
postPunkRaw = data[data[1].isin(postPunkBands)]
postPunkRaw = postPunkRaw.fillna('unknown')
CPU times: user 3.99 s, sys: 68 ms, total: 4.06 s
Wall time: 4.06 s
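
To see what the filter above actually catches, here is a small standalone check of the same regular expression against a few tag spellings. The sample tags are made up for illustration and are not taken from the data set. The pattern only tolerates a capital on the first letter of each word, so an all-caps spelling is missed unless the match is made case-insensitive.

```python
import re

# Same pattern as in the filter above
POST_PUNK = re.compile(r'[pP]ost[ \-]*[pP]unk')

# Hypothetical tag spellings, for illustration only
samples = ['post-punk', 'Post Punk', 'postpunk', 'post - punk revival',
           'POST-PUNK', 'pop punk']

matches = [tag for tag in samples if POST_PUNK.search(tag)]
print(matches)

# A case-insensitive variant also catches the all-caps spelling
POST_PUNK_ANYCASE = re.compile(r'post[ \-]*punk', re.IGNORECASE)
matches_anycase = [tag for tag in samples if POST_PUNK_ANYCASE.search(tag)]
print(matches_anycase)
```

In pandas, the `case=False` argument of `str.contains` should give the same case-insensitive behaviour.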

Cleaning up

$post\;punk\;=\;post-punk\;=\;Post\;\;-punk\;etc$

$70s\;=\;70\;s\;=\;70's$

In [3]:
# Lower-case the tags and strip surrounding whitespace
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: x.lower().strip())
# Replace runs of hyphens with a single space (raw strings avoid invalid-escape warnings)
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(r'-+', ' ', x))
# Replace contiguous space segments with a single space
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(r' +', ' ', x))

# Fix the ' s' "'s" issue
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(r' s$', 's', x))
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(r"'s", 's', x))
genres = sorted(set(postPunkRaw[2]))
CPU times: user 952 ms, sys: 0 ns, total: 952 ms
Wall time: 952 ms
In [4]:
len(genres)
Out[4]:
27427
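
As a quick sanity check, the cleaning steps above can be bundled into one function and run on the spellings listed earlier. This is only a sketch: `normalize_tag` is a hypothetical helper name that is not used elsewhere in the notebook, and the inputs are hand-written examples.

```python
import re

def normalize_tag(tag):
    """Apply the same cleaning steps as the cell above to one tag string."""
    tag = tag.lower().strip()          # lower-case, trim the ends
    tag = re.sub(r'-+', ' ', tag)      # hyphens become spaces
    tag = re.sub(r' +', ' ', tag)      # collapse runs of spaces
    tag = re.sub(r' s$', 's', tag)     # "70 s" -> "70s"
    tag = re.sub(r"'s", 's', tag)      # "70's" -> "70s"
    return tag

for raw in ['post punk', 'post-punk', 'Post  -punk', '70s', '70 s', "70's"]:
    print(repr(raw), '->', repr(normalize_tag(raw)))
```

All three post punk spellings collapse to `post punk` and all three decade spellings collapse to `70s`.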

$post\;punk\;=\;postpunk\;\;\;indie\;rock\;=\;indierock$

In [5]:
duplicates = []
compoundGenres = list(filter(lambda x: ' ' in x or '-' in x, genres))
fixedCompounds = [genre.replace('-','').replace(' ','') for genre in compoundGenres]

monoGenres = set(genres) - set(compoundGenres)

for index, genre in enumerate(fixedCompounds):
    if genre in monoGenres:
        duplicates.append((genre, compoundGenres[index]))
        
monoComposites = [genre[0] for genre in duplicates]
composites = [genre[1] for genre in duplicates if '-' in genre[1]]
CPU times: user 28 ms, sys: 4 ms, total: 32 ms
Wall time: 32.4 ms
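
The idea behind the duplicate hunt above is easier to follow on a toy list. The sketch below uses a dictionary instead of the index lookups in the notebook; the tag list and the names `spaced`, `single` and `canonical` are hypothetical and only serve the illustration.

```python
# Hypothetical cleaned tag list, for illustration only
genres = ['indie rock', 'indierock', 'post punk', 'postpunk', 'shoegaze']

spaced = [g for g in genres if ' ' in g]     # multi-word tags
single = set(genres) - set(spaced)           # one-word tags

# Map each squashed spelling to its spaced form when both spellings occur
canonical = {}
for g in spaced:
    squashed = g.replace(' ', '')
    if squashed in single:
        canonical[squashed] = g

print(canonical)

# Rewrite every tag through the mapping, leaving unknown tags untouched
merged = sorted({canonical.get(g, g) for g in genres})
print(merged)
```

Here `indierock` and `postpunk` are folded into their spaced spellings, while `shoegaze` is left alone because it has no multi-word twin.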

Cleaning Functions

In [6]:
def fixGenres(cellValue):
    if cellValue in monoComposites:
        return duplicates[monoComposites.index(cellValue)][1]
    else:
        return cellValue
    
def replaceArtistNames(cellValue):
    return np.where(postPunkBands==cellValue)[0][0]

def replaceGenreNames(cellValue):
    return np.where(genres==cellValue)[0][0]

def fixEmpty(cellValue):
    if cellValue == '':
        return "unknown"
    else:
        return cellValue
In [7]:
postPunkRaw[2] = postPunkRaw[2].apply(fixGenres)
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: re.sub(' s$', 's',x))
postPunkRaw[2] = postPunkRaw[2].apply(lambda x: x.strip())
postPunkRaw[2] = postPunkRaw[2].apply(fixEmpty)
CPU times: user 1.41 s, sys: 0 ns, total: 1.41 s
Wall time: 1.41 s
In [8]:
genres = np.array(sorted(set(postPunkRaw[2])))
postPunkBands = np.array(sorted(set(postPunkRaw[1])))
CPU times: user 64 ms, sys: 0 ns, total: 64 ms
Wall time: 62.4 ms
In [9]:
postPunkRaw[1] = postPunkRaw[1].apply(replaceArtistNames)
postPunkRaw[2] = postPunkRaw[2].apply(replaceGenreNames)
CPU times: user 39 s, sys: 8 ms, total: 39 s
Wall time: 38.9 s
In [30]:
len(genres)
Out[30]:
26901
In [11]:
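genrArray = np.array(genres)
# One zero vector per band, with one slot per genre
bandarray = [np.zeros(len(genres)) for band in range(len(postPunkBands))]
for bandNumber in range(len(postPunkBands)):
    # Total tag count per genre id for this band
    pq = postPunkRaw[postPunkRaw[1]==bandNumber].groupby(2)[3].sum().sort_index()
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent
    for index, value in pq.items():
        bandarray[bandNumber][index] = value
# Scale every band vector to unit length
normedBandArray = [band / np.linalg.norm(band) for band in bandarray]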
CPU times: user 3.54 s, sys: 168 ms, total: 3.71 s
Wall time: 3.03 s
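
What the cell above does for every band can be sketched for a single band in plain Python. The tag counts and the four-tag vocabulary are invented for illustration; the real vectors have one slot for each of the roughly 27,000 cleaned tags.

```python
import math

# Hypothetical tag counts for one band (tag -> raw count), for illustration only
tag_counts = {'post punk': 30, 'indie': 40}
genres = ['alternative', 'indie', 'post punk', 'shoegaze']   # sorted tag vocabulary

# One slot per known tag; tags the band lacks stay at zero
vector = [float(tag_counts.get(g, 0)) for g in genres]

# Divide by the Euclidean (L2) norm so the vector has length 1
norm = math.sqrt(sum(v * v for v in vector))
unit = [v / norm for v in vector]
print(unit)
```

A band whose counts were all zero would need special handling, since its norm would be zero and the division would fail.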

Similarity Metrics

Let's start with an elementary metric that even a 10 year old kid can understand. Ignoring the normalized weights given to a band's tags / genres, we can quantify the similarity between two bands by the number of tags they have in common.

Say Band-1 has 5 tags - [ post punk, shoegaze, mathrock, alternative, UK 2000s ] and Band-2 has 4 tags - [post punk, indie rock, noise, alternative]. The similarity between them is simply the number of common tags, which in this case equals 2 (post punk and alternative).

As trivial as this might seem, it spat out pretty decent results, more than passable in fact.
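
The Band-1 / Band-2 example can be reproduced with Python sets. This is a minimal sketch of the counting idea only; the notebook itself works on the non-zero positions of the band vectors.

```python
band_1 = {'post punk', 'shoegaze', 'mathrock', 'alternative', 'uk 2000s'}
band_2 = {'post punk', 'indie rock', 'noise', 'alternative'}

common = band_1 & band_2        # set intersection
print(sorted(common), len(common))
```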

In [12]:
def elSimo(artistName):
    try:
        artistId = np.where(postPunkBands==artistName)[0][0]
        simBands = []
        xc = np.where(normedBandArray[artistId] != 0)
        for index, band in enumerate(normedBandArray):
            yc = np.where(band != 0)
            common = len(set(xc[0]).intersection(set(yc[0])))
            simBands.append((postPunkBands[index],\
                             common))
        return sorted(simBands, key=lambda x: x[1], reverse=True)
        
    except IndexError:  # np.where(...)[0][0] raises IndexError when the name is not found
        return "Band not in Database. Try something else"
In [77]:
elSimo('Franz Ferdinand')[1:20]
Out[77]:
[('Bloc Party', 71),
 ('The Killers', 70),
 ('Oasis', 69),
 ('The Strokes', 69),
 ('Placebo', 68),
 ('The Futureheads', 68),
 ('Blur', 67),
 ('Kaiser Chiefs', 67),
 ('Interpol', 66),
 ('Muse', 66),
 ('The White Stripes', 66),
 ('David Bowie', 65),
 ('Snow Patrol', 65),
 ('Arctic Monkeys', 64),
 ('Gorillaz', 64),
 ('Radiohead', 64),
 ('Weezer', 64),
 ('Hot Hot Heat', 63),
 ('Incubus', 63)]

This trivial solution only works well if all the bands have roughly the same number of tags, since a band with many tags will share more tags with everyone. We can correct for this by dividing the number of common tags by the sum of the number of tags of the two bands being compared.
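
On the toy bands from the earlier example, the normalized score works out as follows. `overlap_ratio` is a hypothetical helper name used only for this sketch. Note that the ratio is exactly half of the Sørensen-Dice coefficient 2|A∩B| / (|A| + |B|), so two identical tag sets score 0.5 rather than 1.

```python
def overlap_ratio(tags_a, tags_b):
    """Common tags divided by the sum of both tag counts."""
    return len(tags_a & tags_b) / (len(tags_a) + len(tags_b))

band_1 = {'post punk', 'shoegaze', 'mathrock', 'alternative', 'uk 2000s'}
band_2 = {'post punk', 'indie rock', 'noise', 'alternative'}

print(overlap_ratio(band_1, band_2))   # 2 / (5 + 4)
print(overlap_ratio(band_1, band_1))   # identical sets: 5 / (5 + 5)
```

Since the factor of two is constant, the ranking of similar bands would be the same as with the Dice coefficient.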

In [74]:
def elSimore(artistName):
    try:
        artistId = np.where(postPunkBands==artistName)[0][0]
        simBands = []
        xc = np.where(normedBandArray[artistId] != 0)
        for index, band in enumerate(normedBandArray):
            yc = np.where(band != 0)
            common = len(set(xc[0]).intersection(set(yc[0])))
            simBands.append((postPunkBands[index],\
                             common/(len(xc[0]) + len(yc[0]))))
        return sorted(simBands, key=lambda x: x[1], reverse=True)
        
    except IndexError:  # np.where(...)[0][0] raises IndexError when the name is not found
        return "Band not in Database. Try something else"
In [75]:
elSimore('Franz Ferdinand')[1:20]
Out[75]:
[('Bloc Party', 0.37967914438502676),
 ('The Killers', 0.3645833333333333),
 ('Oasis', 0.3631578947368421),
 ('Blur', 0.3602150537634409),
 ('The Futureheads', 0.35978835978835977),
 ('The Strokes', 0.359375),
 ('Kaiser Chiefs', 0.35638297872340424),
 ('Placebo', 0.35233160621761656),
 ('Snow Patrol', 0.3439153439153439),
 ('Interpol', 0.34375),
 ('Muse', 0.34375),
 ('The White Stripes', 0.34196891191709844),
 ('David Bowie', 0.3403141361256545),
 ('Arctic Monkeys', 0.3386243386243386),
 ('Weezer', 0.3368421052631579),
 ('Gorillaz', 0.33507853403141363),
 ('Radiohead', 0.3333333333333333),
 ('The Libertines', 0.3333333333333333),
 ('The Kinks', 0.3298429319371728)]

How about penalizing the similarity metric for tags that are present in one band but not the other? I will explore that and a few other metrics in the next article in this series. I'll leave it here with two more metrics, based on Cosine Similarity and Euclidean Distance (the usual suspects), and their results to compare with our elementary metric. The cosine variant is modified: the dot product of the two normalized band vectors is multiplied by the tag-overlap ratio from the previous metric. For Euclidean distance, smaller values mean more similar bands. There is a definite improvement in the quality of results, and Modified Cosine Similarity appears to give the best results of the three.
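
One thing worth keeping in mind when comparing the two: the band vectors are normalized to unit length, and for unit vectors the squared Euclidean distance equals 2 - 2 × (cosine similarity). Plain Euclidean distance and plain cosine similarity would therefore rank bands identically, so any difference between the simDot and simEucDis results should come from the overlap factor that simDot multiplies in. A minimal check of the identity, using two made-up unit vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical unit-length band vectors, for illustration only
a = [0.6, 0.8, 0.0]
b = [0.0, 0.6, 0.8]

cosine = dot(a, b)                 # both vectors have length 1
distance = euclidean(a, b)

# The last two printed values agree up to floating-point rounding
print(cosine, distance ** 2, 2 - 2 * cosine)
```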

In [81]:
def simDot(artistName):
    try:
        artistId = np.where(postPunkBands==artistName)[0][0]
        simBands = []
        xc = np.where(normedBandArray[artistId] != 0)
        for index, band in enumerate(normedBandArray):
            yc = np.where(band != 0)
            common = len(set(xc[0]).intersection(set(yc[0])))
            simBands.append((postPunkBands[index],\
                             np.dot(normedBandArray[artistId], band)\
                             *(common/(len(xc[0]) + len(yc[0])))))
        return sorted(simBands, key=lambda x: x[1], reverse=True)
        
    except IndexError:  # np.where(...)[0][0] raises IndexError when the name is not found
        return "Band not in Database. Try something else"
In [88]:
def simEucDis(artistName):
    try:
        artistId = np.where(postPunkBands==artistName)[0][0]
        simBands = []
        for index, band in enumerate(normedBandArray):
            # Euclidean distance between the two unit-length band vectors
            simBands.append((postPunkBands[index],\
                             np.linalg.norm(normedBandArray[artistId]- band)))
        # Smaller distance means more similar, so sort ascending
        return sorted(simBands, key=lambda x: x[1])
        
    except IndexError:  # np.where(...)[0][0] raises IndexError when the name is not found
        return "Band not in Database. Try something else"
In [83]:
simDot('Franz Ferdinand')[1:20]
Out[83]:
[('Bloc Party', 0.35290809020672398),
 ('Kaiser Chiefs', 0.34585986294022369),
 ('The Killers', 0.34388177283229471),
 ('The Strokes', 0.34136284484516038),
 ('The Futureheads', 0.33427419644700562),
 ('Snow Patrol', 0.32987660430762833),
 ('Arctic Monkeys', 0.32194894777211541),
 ('Kasabian', 0.31415057615076475),
 ('Kings of Leon', 0.31069252988647111),
 ('The Bravery', 0.3106676110515369),
 ('Interpol', 0.30686743301559943),
 ('Placebo', 0.30406949277302053),
 ('The White Stripes', 0.30357277232200364),
 ('Stereophonics', 0.30267996958875787),
 ('Pixies', 0.29882397087719531),
 ('The Libertines', 0.29823714690508324),
 ('Hot Hot Heat', 0.2972862437054683),
 ('Maxïmo Park', 0.29534348989148862),
 ('Doves', 0.29343210238816736)]
In [89]:
simEucDis('Franz Ferdinand')[1:20]
Out[89]:
[('Gomez', 0.23899794261300678),
 ('Kaiser Chiefs', 0.24301254603310349),
 ('Hard-Fi', 0.25846065346205416),
 ('Kasabian', 0.27183904711180812),
 ('Black Rebel Motorcycle Club', 0.28059227453507657),
 ('Snow Patrol', 0.28572831559735179),
 ('Feeder', 0.28675821716581534),
 ('Elbow', 0.28991566657816936),
 ('Athlete', 0.29055997649377757),
 ('Razorlight', 0.30215799279601285),
 ('The Subways', 0.30353974365140896),
 ('22-20s', 0.30358339415649271),
 ('The Dandy Warhols', 0.30844860785051936),
 ('The Bravery', 0.30969047503804559),
 ('Nine Black Alps', 0.31077352379188788),
 ('The Music', 0.31174767647514168),
 ('Doves', 0.31265077747547387),
 ('Idlewild', 0.31376080584751176),
 ('Arctic Monkeys', 0.31382961495203643)]