Jan 21, 2013

Introduction to the Math Behind Similarity Classifiers – Part 1

     This article provides an introduction to the world of similarity classifiers in a way that is simple and fun.  We will attempt to introduce the topics in layman's terms and provide examples that readers can play with to watch the results.  While this intro series of articles will go into theory and formulas full of Greek letters, it will first build upon simple explanations and real-world examples.  Hopefully in reading this article you can gain an understanding of both the application and the math involved.


Introduction to Similarity Classifiers

     Similarity-based classifiers estimate the similarity between *things*.  Once we know things are similar, we can begin to group them and make decisions based on that grouping.  Commercial businesses can make more money knowing that people who buy one type of item are very often interested in "similar" items.  Doctors who can successfully treat a specific pathology with specific medications can often successfully treat "similar" patients with those same medications.  And the list goes on and on.
  • Pandora: "Is this song similar to 'Linkin Park' such that we ought to group it with the 'Linkin Park' radio station?"  
  • Google News:  "Is this unknown web article similar to these other top news articles such that we ought to group them together in our news aggregator?"
  • Amazon:  "Is the movie 'Speed 2' similar to the movie 'Cinderella' such that we should recommend 'Cinderella' as a purchase to 'Speed 2' buyers?"
  • Image Organizer: "Is this image similar to these other images in a collection, and thus we think this new image is of the same person?" 

Similarity is Domain Specific and Subjective

     One of the fun aspects of similarity classification is that the approaches vary across domains, and there is no one correct answer.  The approach you follow to determine if a news article is similar to another would very likely not be the same approach you would use to determine if two images are similar.  The news article domain might use the probability of sharing certain key words as a similarity indicator, while for images you might want to look at colors and pixel groupings.

Example:  Finding the similarities between movies and movie buyers

     For the first set of classifiers I'm stealing an example from the great book "Programming Collective Intelligence" by Toby Segaran.  But rather than focus on the code, this article will use the example to simplify and explain the math.
     The premise is that a store like Amazon.com wants to recommend items a customer might like.  In order to make a decision on what to recommend we can take two different approaches.  We could find similar movie buyers, and recommend movies that those similar buyers like.  This is akin to getting movie recommendations from a friend who has similar tastes as you.  Alternatively, we could try to classify how similar movies are to each other, and recommend similar movies to the movies that you already like.  For example, one could argue Batman is similar to Superman because both are superhero movies, so we could recommend Batman to buyers who purchase Superman. 
     For brevity this article is going to stick with only talking about the similarity of movie viewers rather than item similarity, but the algorithms are the same.  Click the link below to play with a simple app that calculates the similarity between movie viewers using the mathematics we will discuss in detail.

Movie Viewer Similarity App

Similarity Classifiers

Euclidean Distance Score:

     The Euclidean distance score is a phenomenally simple way to measure similarity that works reasonably well when data is dense or continuous.  The premise is simple.  Figure out a way to map your data to points on a graph, and use the Pythagorean formula you learned back in high school to determine the distance between those points.  The closer the distance, the more similar the data.
     In the movie viewer example above, we can say two people are similar if their movie ratings are similar.  So if 'Bob' gave Batman and Superman 4-star and 3.5-star ratings, he might be considered similar to 'Larry' who gave Batman and Superman 3-star and 2.5-star ratings.  Or at least he is more similar to 'Larry' than he is to 'Samantha' who gave Batman and Superman 1-star and 1.5-star ratings.


     To calculate the similarity score between two variables, movie viewers in this case, it is just a simple two-step process:
  1. Calculate the distance between the variables using the Euclidean distance formula:  distance = SQRT( (x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2 )
  2. Invert distance to get a result where closer distance equals larger similarity.  Similarity = 1 / (1+distance)
          Batman   Superman   Ironman
Bob       4        3.5        3.5
Larry     3        2.5        2

Distance (Bob, Larry) = SQRT( (4-3)^2 + (3.5-2.5)^2 + (3.5-2)^2 ) = 2.06
Similarity (Bob, Larry) = 1 / (1+2.06) = 0.33

          Batman   Superman   Ironman
Bob       4        3.5        3.5
Samantha  1        1.5        1

Distance (Bob, Samantha) = SQRT( (4-1)^2 + (3.5-1.5)^2 + (3.5-1)^2 ) = 4.39
Similarity (Bob, Samantha) = 1 / (1+4.39) = 0.19

Where it Works and where it fails:
     Euclidean distance works quite well when data is dense or continuous, and well normalized.  I cringed a little when I wrote that, because I can picture those words coming from my statistics professor and meaning very little to everyone in the room.  So let's see if we can make that definition a little simpler to understand.

  • Data is Dense:  Consider the movie viewer domain in a more realistic light.  In the real world there are tens of thousands of movies.  If two viewers agree on Batman and Superman but have no other overlap, meaning they haven't seen any of the other movies the other person has seen, are they really similar?  Using the Euclidean distance score they would be considered similar.  In the real world I suspect two people who share only two movies in common are probably very different, and getting movie recommendations from such a person would yield some rather interesting results.

  • Data is Normalized:  Continuing in the movie critic domain, it is probably fairly obvious that not all people rate movies the same way.  Some critics give every movie they like 5-stars, and rarely rate a movie below 3-stars even if it bored them to sleep.  Others thought 'Transformers' was a fantastic night out, but can't give it more than 3-stars because Michael Bay didn't break any new ground on the human condition.  3-stars doesn't mean the same thing to everybody, and therefore ratings are not normalized across all people.  If we could magically normalize the data and make the translation that 5-stars from 'Bob' is equivalent to 4-stars from 'Larry' and 3-stars from 'Samantha', then the accuracy of this algorithm would improve greatly.

The Code
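The original code for this section came from "Programming Collective Intelligence"; as a stand-in, here is a minimal Python sketch of the Euclidean similarity score described above (function and variable names are illustrative, not taken from the book):

```python
from math import sqrt

def euclidean_similarity(ratings_a, ratings_b):
    """Similarity = 1 / (1 + Euclidean distance) over shared ratings."""
    distance = sqrt(sum((a - b) ** 2 for a, b in zip(ratings_a, ratings_b)))
    return 1 / (1 + distance)

# Ratings for Batman, Superman, Ironman
bob = [4, 3.5, 3.5]
larry = [3, 2.5, 2]
samantha = [1, 1.5, 1]

print(round(euclidean_similarity(bob, larry), 2))     # 0.33
print(round(euclidean_similarity(bob, samantha), 2))  # 0.19
```

Notice Bob scores higher against Larry than against Samantha, matching the worked example above.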


Pearson Correlation Score

     The Pearson Correlation Score as a measure of similarity is a good example of taking knowledge of statistics and applying it in new and interesting ways.  The Pearson Correlation score measures the correlation, or linear dependence, between two variables.  For example, students might want to determine the correlation between exam scores and hours studied before an exam.  Or between exam scores and alcoholic drinks consumed the night before an exam.  Pearson's coefficient will indicate if there is a relation or dependency between these variables.  When one variable goes up (hours studied), does the other variable (exam scores) go up as well?  Or does one go up while the other goes down, as is the likely relationship between exam scores and alcoholic drinks?  Ultimately, does one variable seem to have a relationship with another variable?
     The Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:

ρ(X,Y) = cov(X,Y) / (σX * σY) = E[ (X - μX)(Y - μY) ] / (σX * σY)

     To better explain that definition, let's pull it apart and see if we can translate it into English.  Covariance is a measure of how much two variables change together.  If one variable changes positively, and the other variable also changes positively, then you have a positive covariance.  If one variable changes positively while the other variable changes negatively, the covariance is negative.  Lastly, when a change in one variable has no impact on the other variable, the covariance is 0, implying the variables are independent.
     Using our movie critic example, movie critics 'Bob' and 'Larry' are variables for which we want to measure the covariance in order to determine how similar they are to each other.  We can look at the changes in their movie ratings over the movies 'Batman', 'Superman' and 'Ironman', and see if the values change similarly or contrarily.  If the lines have a similar slope, then we can say the variables 'Bob' and 'Larry' have a higher covariance, signifying these two movie critics may have similar tastes.  If the lines have very different slopes, the covariance will be lower (even negative), indicating these two movie critics may have very different tastes.  Take a look at the slopes between Bob and Larry's movie reviews vs. Bob and Samantha's.  Using just your eyeballs you can probably tell which two reviewers have similar slopes, and therefore will have a higher Pearson Correlation score.
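The covariance intuition above is easy to check numerically.  Here is a small Python sketch (illustrative code, not from the original article) that computes the sample covariance of two critics' rating lists; a positive result means the critics move together, a negative one means they move in opposite directions:

```python
def covariance(xs, ys):
    """Sample covariance: average product of deviations from each mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Ratings for Batman, Superman, Ironman
bob = [4, 3.5, 3.5]
larry = [3, 2.5, 2]
samantha = [1, 1.5, 1]

print(covariance(bob, larry))     # positive: their ratings move together
print(covariance(bob, samantha))  # negative: their ratings move oppositely
```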


     The second half of the Pearson score is to divide the covariance by the product of the standard deviations.  This part of the equation is done to normalize the result over the data.  By normalizing the result, the magnitude of the covariance represents the strength of the linear relationship.

Algorithm / Math:
     The sample-based formula for calculating the Pearson correlation can be defined as follows:

r = ( n*Σ(xy) - Σx*Σy ) / SQRT( (n*Σ(x^2) - (Σx)^2) * (n*Σ(y^2) - (Σy)^2) )

     This formula can be a bit daunting at first, so let's take it step by step to find the Pearson similarity score for Bob and Larry.

          Batman   Superman   Ironman
Bob       4        3.5        3.5
Larry     3        2.5        2


     Next let's calculate the sums and intermediate variables used in the equation (x = Bob's ratings, y = Larry's ratings):

n = 3
Σx = 4 + 3.5 + 3.5 = 11
Σy = 3 + 2.5 + 2 = 7.5
Σ(xy) = (4*3) + (3.5*2.5) + (3.5*2) = 27.75
Σ(x^2) = 16 + 12.25 + 12.25 = 40.5
Σ(y^2) = 9 + 6.25 + 4 = 19.25
     Finally let's build out the rest of the equation.

r = ( 3*27.75 - 11*7.5 ) / SQRT( (3*40.5 - 11^2) * (3*19.25 - 7.5^2) )
  = ( 83.25 - 82.5 ) / SQRT( 0.5 * 1.5 )
  = 0.75 / 0.866
  = 0.866

     In the example above, Bob and Larry have a similarity score of 0.866. 
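The step-by-step calculation above can be sketched in a few lines of Python (the function name and layout are illustrative):

```python
from math import sqrt

def pearson_similarity(xs, ys):
    """Sample Pearson correlation via the sum-based formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator if denominator else 0

bob = [4, 3.5, 3.5]    # Batman, Superman, Ironman
larry = [3, 2.5, 2]
print(round(pearson_similarity(bob, larry), 3))  # 0.866
```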

Where it Works and where it fails:
     Similar to the Euclidean distance score, the Pearson Correlation Score works well when data is dense (see the explanation in the Euclidean section for what that means).  However, this score tends to give better results than Euclidean when the data is not normalized, as is the case with our movie critic example.  The reason is that this calculation normalizes the result and looks at the line slopes, not where those slopes actually sit on the Y-axis.  So this algorithm will handle the case where one movie critic is consistently harsher than another.


     In this brief article we introduced similarity classifiers, and have attempted to take the mystery out of the math for the Euclidean and Pearson Correlation scores.  Both scores are very similar in that they can be thought of as plotting variables in some n-dimensional space and looking at the properties of those plots.  This type of n-space plotting is extremely common in machine learning, and we'll see lots more examples in the upcoming articles. 
    As always, contact me (Jeff) at plummeronsoftware@gmail.com with any questions or comments.



Nice article Jeff.

You might also want to look at Mahalanobis distance. It's typically a more robust distance measure than Euclidean.

I come from a very different field and apply this math to an entirely different (and slightly more complex) scenario (we are trying to monitor condition of railway tracks based on vibration signatures taken on the moving train).

Look forward to reading more from you.

Jeff Plummer said...

Yup, I was going to do Manhattan in another article, and then got sidetracked. Be sure to check out http://www.jeffplummer.net/2013/02/intro-to-mathematics-in-similarity.html for my next post in the series that covers the math behind Jaccard, Extended Jaccard, and Cosine similarity classifiers.

Jeff Plummer said...

PS: Thanks for the comment. Glad you like the article.

ksingh said...

Hi Jeff, great article.
In the numerator it says - (12*4), shouldn't it be (11 * 7.5)

