Similarity in Music Preferences

Posted on Sat 29 October 2016 in misc

I want to compare my last.fm scrobbles with a friend's and arrive at a metric that quantifies how similar our musical preferences are. I'm looking at just one variable here, the number of times a particular artist is played, and will approach the problem using basic mathematics and Python. Nothing fancy.

The dataset I'm comparing mine with is about 9 times larger: 9000 vs 80000 plays. I used to listen to music primarily on my iPod, so unfortunately all I have to show on last.fm is the music I played on my laptop and about a year of data scrobbled from Rdio. So although the data I have there reflects the kind of music I listen to (post punk, indie, experimental, noise etc.), it does not reflect all the artists I've been listening to over the years. Luckily, after about an hour of frantic searching, I managed to find a backup of my iPod stats for the years 2005-09. (Pats past-self on the back.) Now we are looking at a more respectable 34000 vs 80000.

Naive Solution

Find the artists present in both the datasets.
Taking these n artists as base vectors, our listening habits can then be described as vectors in this n-dimensional space, with the number of times an artist was played as the magnitude of the component along that artist's base vector.
Calculate the cosine of the angle between these two vectors (dot product divided by product of magnitudes). Since play counts are never negative, this gives a number in the range $[0,1]$: 0 when there are no common artists at all (by convention, as the vectors are then empty), and 1 when the ratio of play counts between the two sets is the same constant for every common artist.
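The three steps above can be sketched roughly as follows. This is a minimal illustration written for Python 3, not the script used for the results; the function name `cosine_common` and the play counts are made up, and every play count is assumed to be positive.

```python
import math


def cosine_common(plays_x, plays_y):
    """Cosine of the angle between two play-count vectors,
    restricted to artists present in both dictionaries."""
    common = sorted(set(plays_x) & set(plays_y))
    if not common:
        return 0.0  # convention: no shared artists means zero similarity
    dot = sum(plays_x[a] * plays_y[a] for a in common)
    norm_x = math.sqrt(sum(plays_x[a] ** 2 for a in common))
    norm_y = math.sqrt(sum(plays_y[a] ** 2 for a in common))
    return dot / (norm_x * norm_y)


# Made-up play counts, purely for illustration.
me = {"Pixies": 10, "Wire": 20, "Blur": 5}
friend = {"Pixies": 30, "Wire": 60, "Can": 7}

# The friend plays each common artist exactly 3 times as often,
# so the cosine comes out as 1 (up to floating point rounding).
print(cosine_common(me, friend))
```

Note that artists heard by only one listener ("Blur" and "Can" here) are ignored entirely in this mode.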

Thoughts

Using only the common artists doesn't look like a good idea. Consider this trivial edge case: A and B listen to 1000 bands each, of which 10 are common. If their playing habits were similar for these 10 bands, does that mean their music tastes are similar? Of course not. One thing that comes to mind is to multiply the cosine value by a function that quantifies the number of common artists relative to the total, something like the proportion of common artists to the total number of artists: $\frac{n(X\cap Y)}{n(X\cup Y)}$.
Alternatively, take the union of artists from both data sets as base vectors. If a user hasn't heard a particular band, their component along that base vector would be zero, thereby contributing nothing to the numerator (dot product) while the other user's plays still contribute to the denominator (product of magnitudes), thus decreasing the compatibility value.
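Both ideas can be sketched in a few lines. This is an illustrative Python 3 sketch under my own naming (`jaccard`, `cosine_union`), with made-up play counts; it is not part of the script used for the results.

```python
import math


def jaccard(plays_x, plays_y):
    """Proportion of common artists: intersection size over union size."""
    artists_x, artists_y = set(plays_x), set(plays_y)
    union = artists_x | artists_y
    return len(artists_x & artists_y) / len(union) if union else 0.0


def cosine_union(plays_x, plays_y):
    """Cosine over the union of artists; an unheard artist counts as 0 plays."""
    artists = set(plays_x) | set(plays_y)
    dot = sum(plays_x.get(a, 0) * plays_y.get(a, 0) for a in artists)
    norm_x = math.sqrt(sum(v ** 2 for v in plays_x.values()))
    norm_y = math.sqrt(sum(v ** 2 for v in plays_y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Made-up play counts: identical habits on the two shared artists,
# but the second listener also plays an artist the first never heard.
x = {"Pixies": 3, "Wire": 4}
y = {"Pixies": 3, "Wire": 4, "Can": 12}

print(jaccard(x, y))       # 2 common artists out of 3 in total
print(cosine_union(x, y))  # 25 / (5 * 13), well below 1
```

Restricted to the two common artists, the cosine for this example would be exactly 1; the unshared artist is what pulls the union-based value down.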
Why dot product? How about a custom function, say $\frac{1}{1+d}$ where $d$ is the Euclidean distance between the two vectors? How about other distance metrics, like the Minkowski distance? How about the Pearson correlation score? I could also think of a simpler function where there is no need to bother with the concept of distance at all. Let $a_1, a_2, a_3, \ldots, a_n$ be the union of artists present in both sets, and let $a_{m}^{x}, a_{m}^{y}$ represent the play counts of artist $a_m$ by $X$ and $Y$ respectively. Now consider the function
$$s(x) = \frac{\sum_{i\in X \cap Y } \frac{\min(a_{i}^{x},a_{i}^{y} )}{\max(a_{i}^{x},a_{i}^{y} )}}{n(X \cap Y)}$$
with a range of $[0,1]$. How effective would this be in quantifying what we want? It churns out 1 if and only if the play counts of every common artist are exactly the same for both users, as opposed to the naive approach, which gives 1 whenever the ratio of play counts across the two sets is the same constant for every artist.
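To make the difference from the cosine concrete, here is a small Python 3 sketch of $s(x)$. The name `min_max_score` and the play counts are invented for illustration, and play counts are assumed to be positive so that the maximum is never zero.

```python
def min_max_score(plays_x, plays_y):
    """Mean of the min/max play-count ratios over the common artists."""
    common = set(plays_x) & set(plays_y)
    if not common:
        return 0.0  # convention: nothing in common, nothing to average
    ratios = [
        min(plays_x[a], plays_y[a]) / max(plays_x[a], plays_y[a])
        for a in common
    ]
    return sum(ratios) / len(ratios)


# Made-up play counts: the friend plays every artist 3 times as often.
me = {"Pixies": 10, "Wire": 20}
friend = {"Pixies": 30, "Wire": 60}

print(min_max_score(me, friend))  # every ratio is 1/3, so the mean is 1/3
print(min_max_score(me, me))      # identical counts give 1.0
```

For the same pair of listeners the cosine of the naive approach would be 1, because the play counts are proportional; $s(x)$ only reaches 1 when the counts themselves match.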

Results

(computed in Common Artists Mode, with and without iPod data)

Pearson Correlation Score - 0.314 & 0.114
Dot Product Cosine - 0.416 & 0.2815
$s(x)$ score - 0.296 & 0.2633

That function I created out of thin air appears to be more consistent! And also pretty close to the 29% similarity we seem to have according to last.fm! Nice.

Code

I used this script to download my last.fm data to a text file, then removed the data that isn't relevant to the task at hand (artist, album and song IDs at MusicBrainz). The dataset I downloaded for my user profile was unfortunately missing values (song and album names) in many places, so I had to fix that first. Then I had to clean up my iPod data and merge it with the last.fm data: duplication due to wrong capitalization, spellings, 'The' issues. Fun times.
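The clean-up itself was mostly manual, but part of the de-duplication can be sketched in Python 3. The helpers below (`normalize_artist`, `merge_counts`) are hypothetical and only handle capitalization, stray whitespace and a leading 'The'; spelling variants would still need fixing by hand.

```python
import re
from collections import Counter


def normalize_artist(name):
    """Collapse case, extra whitespace and a leading 'The '."""
    key = re.sub(r"\s+", " ", name.strip()).lower()
    if key.startswith("the "):
        key = key[4:]
    return key


def merge_counts(*play_counts):
    """Sum play counts from several sources under normalized artist keys."""
    merged = Counter()
    for counts in play_counts:
        for artist, plays in counts.items():
            merged[normalize_artist(artist)] += plays
    return dict(merged)


# Made-up counts from two sources with inconsistent artist names.
lastfm = {"The Clash": 3, "the clash": 2}
ipod = {"Clash": 1, "Pixies ": 4}

print(merge_counts(lastfm, ipod))
```

Stripping 'The' is a blunt rule: it would also merge genuinely different artists whose names differ only by that word, so the merged keys are worth a quick look before trusting them.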

from __future__ import division
import sys
import math


def filedata(file):
    # Read every line of the export; 'rw' is not a valid open() mode.
    with open(file, 'r') as handle:
        return handle.readlines()


def lastfmfreq(filedatas):
    artdict = {}
    for line in filedatas:
        first_tab = line.index('\t')
        second_tab = line.index('\t', first_tab + 1)
        art_name = line[first_tab + 1 : second_tab]
        # Count one more play for this artist (starting from 0 if unseen).
        artdict[art_name] = artdict.get(art_name, 0) + 1
    return artdict


def commonartists(artdict1, artdict2):
    comartdict = {}
    for artist in artdict1.keys():
        if artist in artdict2.keys():
            comartdict[artist] = (artdict1[artist], artdict2[artist])
    return comartdict

class Vector(object):
    def __init__(self, components):
        self.components = components


    def dot_product(self, vektor):
        # Cosine of the angle: dot product divided by the product of magnitudes.
        dimension = len(self.components)
        self_size = math.sqrt(sum([(component)**2 for component in self.components]))
        vektor_size = math.sqrt(sum([(component)**2 \
        for component in vektor.components]))
        comp_product = sum([(self.components[i]) * (vektor.components[i]) \
        for i in range(dimension)])
        magnitude_product = self_size * vektor_size
        return comp_product / magnitude_product

    def pearson(self, vektor):
        dimension = len(self.components)
        self_mean = sum(self.components)/dimension
        vektor_mean = sum(vektor.components)/dimension
        self_size = math.sqrt(sum([(component-self_mean)**2 \
        for component in self.components]))
        vektor_size = math.sqrt(sum([(component-vektor_mean)**2 \
        for component in vektor.components]))
        dot_product = sum([(self.components[i]-self_mean)*\
        (vektor.components[i]-vektor_mean) for i in range(dimension)])
        magnitude_product = self_size * vektor_size
        return dot_product/magnitude_product

    def sx(self, vektor):
        dimension = len(self.components)
        numerator = sum([min(self.components[i], vektor.components[i])\
        /max(self.components[i], vektor.components[i])\
        for i in range(dimension)])
        return (numerator/dimension)



if __name__ == "__main__":
    common_artists = commonartists(lastfmfreq(filedata(sys.argv[1])),\
    lastfmfreq(filedata(sys.argv[2]))).items()
    vector1 = Vector([element[1][0] for element in common_artists])
    vector2 = Vector([element[1][1] for element in common_artists])
    # Parenthesized single-argument prints run on both Python 2 and Python 3.
    print(vector1.dot_product(vector2))
    print(vector1.pearson(vector2))
    print(vector1.sx(vector2))

More Thoughts

Data on tracks that were skipped could add an extra dimension (metaphorically, that is) to the users' perception of artists.
Does the similarity function need to be symmetric?

Update: The Tversky index explores the notion of asymmetric similarity. It is not a similarity metric in the traditional sense, since it is asymmetric, but I will dig further into the idea anyway, and maybe write a dedicated post sometime on how it could make sense, with the help of a real-world example I have in mind.
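For reference, the Tversky index on two sets is $\frac{n(X\cap Y)}{n(X\cap Y) + \alpha\, n(X - Y) + \beta\, n(Y - X)}$, which is asymmetric whenever $\alpha \neq \beta$. Below is a minimal Python 3 sketch applied to sets of artists; the function name, the weights and the artist sets are chosen purely for illustration, and play counts are ignored.

```python
def tversky(artists_x, artists_y, alpha, beta):
    """Tversky index of two artist sets with weights alpha and beta."""
    x, y = set(artists_x), set(artists_y)
    common = len(x & y)
    denominator = common + alpha * len(x - y) + beta * len(y - x)
    return common / denominator if denominator else 0.0


# Made-up artist sets: everything in `small` is also in `large`.
small = {"Pixies", "Wire"}
large = {"Pixies", "Wire", "Can", "Blur"}

# With alpha=1, beta=0 only the first argument's unshared artists count.
print(tversky(small, large, 1, 0))  # small is fully covered by large: 1.0
print(tversky(large, small, 1, 0))  # only half of large is shared: 0.5

# alpha = beta = 1 reduces to the symmetric proportion of common artists.
print(tversky(small, large, 1, 1))  # 2 common out of 4 in total: 0.5
```

Read this way, the first call answers "how much of my taste does my friend cover?" and the second asks the same question in the other direction, which is one way an asymmetric measure could make sense here.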

What Next

I need to read the literature on this and learn more about the various similarity metrics. I will also be exploring the humongous user-generated tags data released by last.fm to see if I can come up with anything interesting.

Update: WOW!

Common Artists' Data (sorted by playcount of the artists I've listened to)

For those who are interested

The Clash (404, 2689)
Pixies (834, 1757)
Bloc Party (91, 1341)
Wire (79, 1331)
Deerhoof (561, 1295)
Blur (506, 1280)
The Dead Milkmen (34, 1176)
Joy Division (93, 1149)
Spoon (489, 1035)
Modest Mouse (885, 906)
Gang of Four (17, 794)
The Undertones (20, 725)
Melt-Banana (2, 676)
Franz Ferdinand (46, 666)
New Order (28, 611)
Talking Heads (70, 608)
Minutemen (1187, 587)
Siouxsie and the Banshees (35, 545)
The Dandy Warhols (74, 507)
Interpol (128, 493)
Urinals (22, 492)
The Rakes (126, 443)
Sonic Youth (215, 401)
Pulp (22, 326)
Swell Maps (3, 325)
Primus (13, 309)
The Smiths (278, 308)
Glaxo Babies (4, 306)
The Fall (279, 305)
New York Dolls (14, 305)
Happy Mondays (56, 282)
Arctic Monkeys (79, 273)
The Breeders (262, 272)
The Ex (12, 272)
The Shins (387, 259)
Graham Coxon (5, 252)
Stephen Malkmus (21, 213)
Pavement (1196, 211)
Japanther (38, 210)
Kraftwerk (73, 201)
Daft Punk (62, 200)
Liars (38, 200)
Weezer (60, 189)
Kaiser Chiefs (6, 188)
The Futureheads (9, 181)
Beck (466, 172)
The Fiery Furnaces (29, 170)
The Strokes (256, 155)
Young Knives (2, 152)
Fugazi (28, 135)
Buzzcocks (19, 124)
The Feelies (9, 123)
Ratatat (180, 116)
Public Image Ltd. (5, 110)
The Rapture (24, 108)
Au Pairs (3, 104)
Health (11, 97)
OOIOO (1, 96)
Delta 5 (1, 95)
Two Door Cinema Club (26, 94)
Nirvana (83, 92)
The Yardbirds (21, 88)
Merzbow (2, 86)
Big Black (7, 84)
Tokyo Police Club (19, 84)
The New Pornographers (1775, 83)
The Wombats (3, 82)
Stereophonics (11, 80)
Fountains of Wayne (54, 79)
Liliput (1, 78)
Butthole Surfers (156, 76)
Editors (28, 75)
Bauhaus (13, 73)
Gorillaz (125, 72)
Chicks On Speed (88, 70)
Negativland (40, 68)
Battles (242, 68)
Metronomy (2, 64)
The Kooks (19, 63)
Blonde Redhead (6, 60)
Mystery Jets (16, 56)
Caravan Palace (3, 56)
Marnie Stern (2, 54)
Chuck Berry (60, 53)
Lou Reed (13, 52)
MGMT (215, 50)
Bright Eyes (88, 48)
Stereolab (27, 46)
Tom Waits (153, 44)
XTC (1, 44)
Built to Spill (403, 42)
The Stranglers (57, 42)
Louis XIV (1, 40)
Vampire Weekend (260, 40)
The Mekons (5, 40)
The Naked and Famous (13, 40)
Sex Pistols (12, 38)
Yeah Yeah Yeahs (119, 38)
Flogging Molly (20, 37)
Pere Ubu (34, 37)
Digitalism (3, 36)
Good Shoes (1, 36)
The Spinto Band (180, 34)
Television (8, 33)
The Stone Roses (8, 32)
Klaxons (5, 32)
The Libertines (105, 29)
Neu! (25, 28)
1990s (2, 27)
The Vaselines (109, 26)
Mission of Burma (117, 26)
The View (1, 24)
The Modern Lovers (1, 23)
Tapes 'n Tapes (43, 23)
Jeff Beck (57, 23)
Angry Samoans (5, 22)
Josef K (10, 22)
The Jesus and Mary Chain (119, 21)
Muddy Waters (8, 21)
We Are Scientists (5, 21)
Madness (1, 21)
Poison Girls (21, 20)
The Killers (56, 20)
Parov Stelar (7, 20)
Black Randy & the Metrosquad (1, 19)
The Drums (4, 19)
The Offspring (2, 18)
The Raincoats (6, 18)
Deep Wound (8, 18)
The Velvet Underground (581, 18)
Radiohead (569, 18)
Original Soundtrack, Mulatu Astatqe (1, 17)
LCD Soundsystem (86, 17)
Blurt (5, 16)
Johnny Cash (5, 16)
Wavves (1, 16)
Lesbians On Ecstasy (8, 16)
Cabaret Voltaire (8, 16)
Cansei de Ser Sexy (86, 16)
The Postal Service (49, 15)
Drive Like Jehu (21, 15)
Tilly and the Wall (1, 15)
Rapeman (5, 14)
Sleater-Kinney (127, 14)
Wipers (2, 14)
British Sea Power (1, 14)
A Certain Ratio (31, 14)
The Evolution Control Committee (34, 14)
The Brian Jonestown Massacre (21, 14)
Ludus (1, 13)
This Heat (5, 12)
The Danse Society (4, 12)
Bérurier Noir (2, 12)
The Raveonettes (119, 10)
Boredoms (5, 10)
Dinosaur Jr. (84, 10)
Norah Jones (16, 10)
Ringo Deathstarr (5, 10)
The Whitest Boy Alive (2, 10)
The Kills (4, 10)
Red Hot Chili Peppers (53, 9)
PJ Harvey (174, 9)
The Monochrome Set (7, 8)
Iggy Pop (4, 8)
Pink Floyd (442, 8)
NoMeansNo (6, 8)
Unwound (2, 8)
The Decemberists (103, 7)
The Slits (92, 7)
Suicide (4, 6)
Jefferson Airplane (62, 6)
My Dad Is Dead (1, 6)
Beat Happening (5, 6)
Jello Biafra (2, 6)
Bombay Bicycle Club (28, 6)
Black Flag (13, 6)
The Names (6, 6)
Donovan (1, 6)
My Bloody Valentine (457, 6)
The Durutti Column (2, 5)
Hard-Fi (3, 5)
Nada Surf (21, 5)
Duke Ellington (2, 5)
No Age (7, 4)
The Cure (430, 4)
Violent Femmes (78, 4)
Liquid Liquid (1, 4)
Queen (316, 4)
A Place to Bury Strangers (6, 4)
AC/DC (14, 4)
Death from Above 1979 (15, 4)
Arcade Fire (432, 4)
Chrome (6, 4)
Friendly Fires (14, 4)
Half Japanese (5, 4)
The Vaccines (7, 4)
Kent (8, 3)
KT Tunstall (26, 3)
Neko Case (209, 3)
Kasabian (15, 3)
LOVE PSYCHEDELICO (9, 3)
Kool & The Gang (7, 3)
R.E.M. (192, 2)
The Royal Family And The Poor (6, 2)
The Glitch Mob (11, 2)
Jens Lekman (67, 2)
Nick Cave & The Bad Seeds (176, 2)
Janis Joplin (2, 2)
The Black Keys (55, 2)
Röyksopp (28, 2)
Tortoise (3, 2)
Naked Raygun (4, 2)
Datarock (2, 2)
23 Skidoo (6, 2)
The Jesus Lizard (2, 2)
The Rolling Stones (85, 2)
Le Tigre (2, 2)
Siege (2, 2)
The Kinks (74, 2)
Led Zeppelin (445, 2)
The Staple Singers (9, 2)
The Residents (118, 2)
Flipper (13, 2)
The Smashing Pumpkins (39, 2)
Rogue Wave (5, 2)
The Horrors (11, 2)
fIREHOSE (16, 2)
Washed Out (47, 2)
The Replacements (32, 2)
The Homosexuals (3, 2)
Justice (12, 2)
Sigur Rós (151, 2)
Five Or Six (6, 2)
The Pop Group (83, 2)
Passion Pit (14, 2)
Boris (2, 2)
Barenaked Ladies (2, 2)
Broken Bells (8, 2)
Van Halen (2, 2)
Sly & The Family Stone (3, 2)
Casiotone for the Painfully Alone (7, 2)
The xx (107, 2)
The dB's (7, 2)
Electrelane (46, 2)
Foster the People (48, 2)
SBTRKT (7, 2)
Les Savy Fav (8, 2)
Ladytron (22, 2)
The Centurians (5, 2)
PJ Harvey, PJ Harvey (10, 2)
Dead Kennedys (7, 2)
Frank Sinatra (16, 2)
Sebadoh (7, 2)
!!! (3, 2)
Phoenix (45, 2)
Yo La Tengo (876, 2)
Epic45 (9, 2)
Colin Newman (5, 2)
Elliott Smith (340, 2)
Clap Your Hands Say Yeah (54, 2)
Bonobo (16, 2)
Swirlies (6, 2)
Pretty Lights (10, 2)
Shpongle (1, 2)
Brendan Benson (2, 1)
Shinedown (1, 1)
James Brown (12, 1)
The Servant (1, 1)
Grandaddy (5, 1)
Switchfoot (1, 1)
Matt Costa (1, 1)
Ray Charles (3, 1)
Marvin Gaye (14, 1)
Coldplay (94, 1)
Captain Beefheart & His Magic Band (10, 1)
Neutral Milk Hotel (207, 1)
The Magic Numbers (62, 1)
Can (151, 1)
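If you want to play with the listing above, a small parser is enough to turn it back into play-count vectors. This is a Python 3 sketch under the assumption that every row looks like `Artist (count, count)`; the names `ROW` and `parse_rows` are mine, and only a few rows are included as a sample.

```python
import re

# Artist name, then two integer play counts in parentheses.
ROW = re.compile(r"^(?P<artist>.+) \((?P<first>\d+), (?P<second>\d+)\)$")


def parse_rows(lines):
    """Return (artist, first_count, second_count) tuples for matching lines."""
    rows = []
    for line in lines:
        match = ROW.match(line.strip())
        if match:
            rows.append((
                match.group("artist"),
                int(match.group("first")),
                int(match.group("second")),
            ))
    return rows


# A few rows copied from the listing above.
sample = [
    "The Clash (404, 2689)",
    "Pixies (834, 1757)",
    "Death from Above 1979 (15, 4)",
    "!!! (3, 2)",
]

rows = parse_rows(sample)
first_vector = [first for _, first, _ in rows]
second_vector = [second for _, _, second in rows]
print(rows)
```

The two resulting lists can then be fed to any of the similarity functions discussed earlier.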