This article provides an introduction to the world of similarity classifiers in a way that is simple and fun. We will attempt to introduce the topics in laymen's terms and provide examples that readers can play with and watch the results. While this intro series of articles will go into theory and formulas full of Greek letters, they will first build upon simple explanations and real world examples. Hopefully in reading this article you can gain an understanding of both the application and the math involved.

# Introduction to Similarity Classifiers

Similarity-based classifiers estimate the similarity between *things*. Once we know things are similar, we can begin to group them and make decisions based on that grouping. Commercial businesses can make more money knowing that people who buy one type of item, are very often interested in "similar" items. Doctors who can successfully treat a specific pathology with specific medications can often successfully treat "similar" patients with those same medications. And the list goes on and on.- Pandora: "Is this song similar to 'Linkin Park' such that we ought to group it with the ‘Link Park’ radio station?"
- Google News: "Is this unknown web article similar to these other top news articles such that we ought to group them together in our news aggregator?"
- Amazon: "Is the movie 'Speed 2' similar to the movie 'Cinderella' and we should recommend it as a purchase to 'Speed 2' buyers?"
- Image Organizer: "Is this image similar to these other images in a collection, and thus we think this new image is of the same person?"

### Similarity is Domain Specific and Subjective

One of the fun aspects of similarity classification is that the approaches vary in different domains, and there is no true correct answer. The approach you follow to determine if a news article is similar to another, would very likely not be the same approach you would use to determine if two images are similar. The news article domain might use the probability of sharing certain key words as a similarity indicator, while for images you might want to look at colors and pixel groupings.## Example: Finding the similarities between movies and movie buyers

For the first set of classifiers I'm stealing an example from the great book "Programming Collective Intelligence" by Toby Segaran. But rather than focus on the code, this article will use the example to simplify and explain the math.The premise is that a store like Amazon.com wants to recommend items a customer might like. In order to make a decision on what to recommend we can take two different approaches. We could find similar movie buyers, and recommend movies that those similar buyers like. This is akin to getting movie recommendations from a friend who has similar tastes as you. Alternatively, we could try to classify how similar movies are to each other, and recommend similar movies to the movies that you already like. For example, one could argue Batman is similar to Superman because both are superhero movies, so we could recommend Batman to buyers who purchase Superman.

For brevity this article is going to stick with only talking about the similarity of movie viewers rather than item similarity, but the algorithms are the same. Click the link below to play with a simple app that calculates the similarity between movie viewers using the mathematics we will discuss in detail.

###### Movie Viewer Similarity App

## Similarity Classifiers

### Euclidean Distance Score:

The Euclidean distance score is a phenomenally simple way to measure similarity that works reasonably well when data is dense or continuous. The premise is simple. Figure out a way to map your data to points on a graph, and use the Pythagorean formula you learned back in high school to determine the distance between those points. The closer the distance, the more similar the data.In the movie viewer example above, we can say 2 people are similar if their movie ratings are similar. So if ‘Bob’ gave Batman and Superman 4-star and 3.5-star ratings, he might be considered similar to ‘Larry’ who gave Batman and Superman 2.5 and 3-star ratings. Or at least he is more similar to 'Larry' than he is to 'Samantha' who gave Batman and Superman 1 star ratings.

__Algorithm/Math__

To calculate the similarity score between two variables, movie viewers in this case, it is just a simple 2 step process:

- Calculate the distance between variables using the distance formula
- Invert distance to get a result where closer distance equals larger similarity. Similarity = 1 / (1+distance)

Name |
Batman |
Superman |
Ironman |

Bob |
4 | 3.5 | 3.5 |

Larry |
3 | 2.5 | 2 |

Distance (Bob, Larry) = SQRT ( (4-2)^2 + (3.5-2.5)^2 + (3.5-2)^2) = 2.06

Similarity (Bob, Larry) = 1/ (1+2.06) = 0.32

Name | Batman | Superman | Ironman |

Bob | 4 | 3.5 | 3.5 |

Samantha | 1 | 1.5 | 1 |

Distance (Bob, Samantha) = SQRT( (4-1)^2 + (3.5-1.5)^2 + (3.5-1)^2) = 4.39

Similarity (Bob, Samantha) = 1 / (1+4.39) = 0.18

__Where it Works and where it fails__:

Euclidean distance works quite well when data is

*dense*or continuous, and well

*normalized*. I cringed a little when I wrote that because I can picture those words coming from my statistics professor and it having very little meaning to everyone in the room. So lets see if we can make that definition a little more simple to understand.

*Data is Dense:*Consider the movie viewer domain in a more realistic light. In the real world there are tens of thousands of movies. If two viewers agree on Batman and Superman, but have no other overlap, meaning they haven't seen any other movies the other person has seen, are they really similar? Using the Euclidean Distance score they would be considered similar. In the real world I suspect 2 people who only share 2 movies in common are probably very different, and getting movie recommendations from that person would yield some rather interesting results.

*Data is Normalized*: Continuing in the move critic domain, it is probably fairly obvious that not all people rate movies the same way. Some critics give every movie they like 5-stars, and rarely rate a movie below 3-stars even if it bored them to sleep. While others thought 'Transformers' was a fantastic night out, but can't give it more than 3-stars because Michael Bay didn't uncover any new ground on the human condition. 3-stars doesn't mean the same thing to everybody, and therefore ratings are not normalized across all people. If we could magically normalize the data and make the translation that 5-stars to 'Bob' is equivalent to 4-stars from 'Larry' and 3-stars from 'Samantha' then the accuracy of this algorithm would improve greatly.

#### The Code

### Pearson Correlation Score

The Pearson Correlation Score as a measure of similarity is a good example of taking knowledge of statistics and applying it in new an interesting ways. The Pearson Correlation score measures the correlation or linear dependence between two variables. For example, students might want to determine the correlation between exam scores, and hours studied before an exam. Or exam scores and alcoholic drinks consumed the night before an exam. Pearson's coefficient will indicate if there is a relation or dependency between these variables. When one variable goes up (hours studied), does the other variable (exam scores) go up as well? Or up vs down as the likely relationship between exam scores and alcoholic drinks. But ultimately does one variable seem to have a relationship with another variable.The Pearson's correlation coefficient between two variables is defined as the

*covariance*of the two variables divided by the product of their

*standard deviation*.

To better explain that definition, lets pull it apart and see if we can translate it into English. Covariance is the measure of how much two variables change together. If one variable changes positively, and the other variable also changes positively, then you have a positive covariance. If one variable changes positively while the other variable changes negatively, the covariance is negative. Lastly, when a change in one variable has no impact on the other variable, the covariance is 0, implying the variables are independent.

Using our movie critic example, movie critics 'Bob' and 'Larry' are variables for which we want to measure their covariance in order to determine how similar they are to each other. We can look at the changes in their movie recommendations over the movies 'Batman', 'Superman' and 'Ironman', and we can look to see if the values change similarly or contrary. If the lines have a similar slope, then we can say the variables 'Bob' and 'Larry' have a higher covariance signifying these two movie critics may have similar tastes. If the lines have very different slopes, the covariance will be lower (even negative) indicating these two movie critics may have very different tastes. Take a look at the slopes between Bob and Larry’s movie reviews VS Bob and Samantha. Using just your eyeballs you can probably tell which 2 reviewers have similar slopes, and therefore will have a higher Pearson Correlation score.

The second half of the Pearson score is to divide the covariance by the standard deviation. This part of the equation is done to normalize the result over the data. By normalizing the result, the magnitude of the covariance represents the strength of the linear relations.

__Algorithm / Math__:

The sample based formula for calculating the Pearson Correlation can be defined as the following formula:

This formula can be a bit daunting at first, so lets take it step by step to find the Pearson Similarity Score for Bob and Larry.

Name |
Batman |
Superman |
Ironman |

Bob |
4 | 3.5 | 3.5 |

Larry |
3 | 2.5 | 2 |

Next lets calculate the sums and intermediate variables used in the equation.

Finally lets build out the rest of the equation.

In the example above, Bob and Larry have a similarity score of 0.866.

__Where it Works and where it fails__:

Similar to Euclidean Distance Score, the Pearson Correlation Score works well when data is dense (see explanation in Euclidean to see what that means). However, this score tends to give better results than Euclidean when the data is not normalized, as is the case with our movie critic example. The reason is this calculation is normalizing the result and looking at the line slopes, and not where those slopes actually sit on the Y-axis. So this algorithm will handle the case where 1 movie critic is consistently more harsh than another.

#### Closing

In this brief article we introduced similarity classifiers, and have attempted to take the mystery out of the math for the Euclidean and Pearson Correlation scores. Both scores are very similar in that they can be thought of as plotting variables in some n-space plane and looking at the properties of those plots. This type of n-space plotting process is extremely common in machine learning, and we'll see lots more examples in the upcoming articles.As always, contact me (Jeff) at plummeronsoftware@gmail.com with any questions or comments.

## 3 comments:

Nice article Jeff.

You might also want to look at Mahalanobis distance. Its typically a more robust distance measure than Eucledian.

I come from a very different field and apply this math to an entirely different (and slightly more complex) scenario (we are trying to monitor condition of railway tracks based on vibration signatures taken on the moving train).

Look forward to reading more from you.

Yup, I was going to do Manhatten in another article, and then got side tracked. Be sure to checkout http://www.jeffplummer.net/2013/02/intro-to-mathematics-in-similarity.html for my next post in the series that covers the math behind Jaccard, Extended Jaccard, and Cosine similarity classifiers.

PS: Thanks for the comment. Glad you like the article.

Post a Comment