Feb 23, 2013

Intro to Mathematics in Similarity Classifiers Part 2


     This article will provide an introduction to the mathematics of similarity classifiers, specifically Jaccard, Extended Jaccard, and Cosine.  We will attempt to introduce the topics in layman's terms and provide examples that readers can play with and watch the results.  While this intro series of articles will go into theory and formulas full of Greek letters, they will first build upon simple explanations and real world examples.  Hopefully in reading this article you can gain an understanding of both the application and the math involved.


The Example

     For the first set of classifiers I'm stealing an example from the great book "Programming Collective Intelligence" by Toby Segaran.  But rather than focus on the code, this article will use the example to simplify and explain the math.

     The premise is that a store like Amazon.com wants to recommend items a customer might like.  In order to make a decision on what to recommend we can take two different approaches.  We could find similar movie buyers, and recommend movies that those similar buyers like.  This is akin to getting movie recommendations from a friend who has similar tastes as you.  Alternatively, we could try to classify how similar movies are to each other, and recommend similar movies to the movies that you already like.  For example, one could argue Batman is similar to Superman because both are superhero movies, so we could recommend Batman to buyers who purchase Superman.

     For brevity this article is going to stick with only talking about the similarity of movie viewers rather than item similarity, but the algorithms are the same.  Click the link below to play with a simple app that calculates the similarity between movie viewers using the mathematics we will discuss in detail.



Classifier:  Jaccard

     The Jaccard Similarity Coefficient is another useful similarity approach.  Wikipedia states the Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:


Jaccard English Translation

     In English, this simply means "The number of characteristics in common, divided by the total number of characteristics".  So while the mathematical formula might look scary to the lay person, I think the simple English description shows this to be a "face-palm" obvious measure of similarity.


Simple Example

     Stealing the example from here, suppose we want to determine the similarity between various kinds of fruits.  We could categorize fruits in terms of spherical shape, crunchy, sweet, and sour.  You might then characterize an apple and banana as:   

     Apple = (1, 1, 1, 1)  since an apple is spherical, crunchy, sweet, and sour

     Banana = (0, 0, 1, 0) since a banana is not spherical, not crunchy, sweet, and not sour

     | A n B | = 1, since Apple and Banana share only 1 characteristics

     | A u B | = 4, since there are a total 4 characteristics

     J(A, B) = 1/4

     Note:  Jaccard often drops properties from the union length (the denominator of the equation) if neither items have the property.  For example, if the Apple were defined as (1, 1, 1, 0), you could choose to ignore the "Sour" property all together since neither fruit has this property.  This would change the length of the union from 4 to 3, making the similarity 1/3 instead of 1/4.  If you ponder this for a second it actually makes sense.  You wouldn't want to penalize the similarity because two objects both don't have a specific property.  That shared absence of a characteristic actually makes them more similar, not less similar.


When to Use

     Jaccard makes sense when data attributes are binary.  E.g. Finding similarity between articles based on keywords would make sense because either a keyword exists or it doesn't.  Trying to find similarity between movie reviewers makes very little sense with this algorithm because the binary attributes are did the reviewer watch the movie or didn't they.  It would ignore their rating as to whether they actually liked the movie or not.


The Code

Classifier:  Extended Jaccard / Tanimoto

     Plain vanilla Jaccard is designed around the similarity of objects with binary characteristics (0 or 1).  This definition can be extended to handle continuous or discrete non-negative features.  If you express Jaccard over a bit vector you will get the following equation:





     Aside:  This equation holds true to Jaccard because when dealing with bit vectors where the value of each dimension is either 0 or 1  (i.e. binary data), this equation becomes the original Jaccard equation from above.  The pieces of the proof are shown below.





Extended Jaccard English Translation

     Again if we translate that equation to English it makes quite a bit of sense. The dot product is simply calculating the intersection, so the numerator in this equation is the same as standard Jaccard.  The denominator is representation for the length of A + length of B minus their intersection (since you wouldn't want to count the values in the intersection twice), which is the definition of union.

     Going back to the movie reviewer example, we can now use Jaccard to calculate similarity, but we will also take into account their actual review scores.  Or if we still wanted to classify fruit as in the original example, we could elaborate our fruit characterization to support "crunchy-ness" that could be values on a scale of 1 to 10, rather than just a binary is or is not crunchy.


When to Use

     The Extended Jaccard Coefficient has been now extended to handle cases where data is discrete or continuous.


The Code

Classifier:  Cosine Similarity

     Cosine Similarity is measure of similarity that is calculated by plotting data as vectors in n-space and calculating the cosine of the angle between those vectors.  If you've read my intro to the math behind Pearson and Euclidean, you've already seen how other measures make use of data as points or vectors in n-space and use that n-space representation to calculate similarity. But in this case we are using the angles as a measure of similarity.


Cosine English Translation

     Before I delve into the math, I want to elaborate on the concept just a bit.  If you understand that an item to be compared can be represented as a vector, where each attribute on the object is an entry in the vector, then we can plot that vector in n-space.  Once the two items are plotted in n-space (see plot below), you can see that the angle between the two vectors is a reasonable approach to compare the vectors.  Vectors with a small angle difference will have smaller differences in the individual vector values (i.e. item attributes).  And the opposite is true.  Vectors with smaller differences in the individual vector values (i.e. item attributes) will have a smaller angle difference between them.  To say that back...  Items that are similar will have a small angle between their vector representation.



The Math

     Using the Euclidean Dot Product Fomula


we can isolate the cosine angle using simple division.


At this point, every piece of the equation has been explained because they show up in Extended Jaccard, except Extended Jaccard uses length of A^2, and this equation uses length of A.


 The Code


     Hopefully this took some of the voodoo out of a couple similarity classifiers.  I am not a statistics professor, so I usually have to spend several hours studying the math, and trying to understand the intent before everything clicks in my mind.  With luck this tutorial can get you to zen-like understanding much quicker.

     As always, contact me (Jeff) at plummeronsoftware@gmail.com with any questions or comments.


Hua Cai said...

ray-ban sunglasses
louis vuitton bags
michael kors outlet online
lululemon uk
louis vuitton outlet
swarovski outlet
louis vuitton outlet stores
michael kors factory outlet
burberry outlet
mulberry outlet
hermes outlet store
ray ban sunglasses
michael kors handbags
cheap oakley sunglasses
hermes outlet
air max 2015
fitflops outlet sale
louis vuitton handbags
ferragamo outlet
tiffany and co
nike tn pas cher
rolex uk
mulberry sale
replica watches
bottega veneta outlet online
cheap michael kors handbags
fitflops clearance
toms shoes
christian louboutin outlet
chaussure louboutin
celine outlet online
toms outlet
cheap jordan shoes
nike blazer pas cher
ferragamo outlet

xumeiqing said...

sac longchamp pliage
coach factory outlet
coach outlet
coach factory outlet
adidas nmd runner
converse shoes
burberry outlet
fitflops sale clearance
nike roshe run women
gucci handbags
burberry outlet
ray ban outlet
cheap oakley sunglasses
ray ban sunglasses
louis vuitton outlet
burberry outlet
asics outlet
yeezy boost 350
reebok uk
ray bans
hermes belt
kate spade outlet
canada goose
holliste sale
cartier watches
yeezy boost 350 black
michael kors outlet
nike trainers
chaussure louboutin
kate spade outlet
christian louboutin shoes
louis vuitton factory outlet
nike free flyknit 4.0
bottega veneta handbags
running shoes
reebok pump

xjd7410@gmail.com said...

discount jordans
air jordans
coach factory outlet
retro 11
louboutin shoes
ray ban sunglasses
kobe 8
rolex watches
giuseppe zanotti
oakley sunglasses
adidas running shoes
tiffany jewelry
adidas superstar trainers
gucci handbags
louis vuitton purses
lebron 12
louis vuitton outlet
burberry handbags
air jordan femme
timberland outlet
michael kors outlet online
michael kors outlet
nike store
coach outlet online
ray ban sunglasses
fit flops
coach factory outlet online
louis vuitton handbags
supra for sale
coach outlet online

dong dong23 said...

christian louboutin shoes
michael kors outlet
chicago bulls jerseys
hermes outlet
ray ban sunglasses
adidas superstar
longchamp outlet
coach factory outlet
fitflops sale clearance
cheap jordans

chen wish said...

Nike Air Max 90 Shoes
Nike Free Flyknit 5.0 Shoes
Nike LunarGlide 8 Shoes
Adidas Yeezy 350 boost Shoes
Nike Mercurial Soccer Boots
Lebron James XIII Basketball Shoes

raybanoutlet001 said...

longchamp bags
ed hardy uk
polo ralph lauren outlet
nike trainers
michael kors handbags
michael kors handbags outlet
michael kors handbags
michael kors outlet
nike blazer low
cheap jordans

raybanoutlet001 said...

yeezy boost 350 white
patriots jerseys
cowboys jerseys
the north face outlet
michael kors outlet
gucci borse
ravens jerseys
converse shoes
hugo boss outlet
air jordan uk

dada24 Xu said...

louis vuitton pas cher
hermes bags
pandora charms sale clearance
cheap jordans for sale
longchamp handbags
moncler outlet online
canada goose
adidas nmd r1
kd shoes

dong dong23 said...

basketball shoes
michael kors outlet online
louis vuitton outlet
adidas yeezy 350
tods shoes
washington wizards jerseys
ugg outlet
canada goose clothing
toms shoes
michael kors outlet online

Meiqing Xu said...

moncler outlet
michael kors outlet
beats by dr dre
coach factory outlet
adidas shoes
longchamp outlet
gucci purses
omega replica watches
beats earbuds
the north face

LCc 03 said...

kobe shoes
kobe shoes
brady jersey
michael kors outlet online
longchamp outlet
air max
longchamp sale
cheap basketball shoes
air jordan shoes