# Intro

This article will provide an introduction to the mathematics of similarity classifiers, specifically Jaccard, Extended Jaccard, and Cosine. We will attempt to introduce the topics in layman's terms and provide examples that readers can play with and watch the results. While this intro series of articles will go into theory and formulas full of Greek letters, they will first build upon simple explanations and real world examples. Hopefully in reading this article you can gain an understanding of both the application and the math involved.

# The Example

For the first set of classifiers I'm stealing an example from the great book "Programming Collective Intelligence" by Toby Segaran. But rather than focus on the code, this article will use the example to simplify and explain the math.

The premise is that a store like Amazon.com wants to recommend items a customer might like. In order to make a decision on what to recommend we can take two different approaches. We could find similar movie buyers, and recommend movies that those similar buyers like. This is akin to getting movie recommendations from a friend who has similar tastes as you. Alternatively, we could try to classify how similar movies are to each other, and recommend similar movies to the movies that you already like. For example, one could argue Batman is similar to Superman because both are superhero movies, so we could recommend Batman to buyers who purchase Superman.

For brevity this article is going to stick with only talking about the similarity of movie viewers rather than item similarity, but the algorithms are the same. Click the link below to play with a simple app that calculates the similarity between movie viewers using the mathematics we will discuss in detail.

# Classifiers

## Classifier: Jaccard

The Jaccard Similarity Coefficient is another useful similarity approach. Wikipedia states the Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

### Jaccard English Translation

In English, this simply means *"The number of characteristics in common, divided by the total number of characteristics"*. So while the mathematical formula might look scary to the lay person, I think the simple English description shows this to be a "face-palm" obvious measure of similarity.

### Simple Example

Stealing the example from here, suppose we want to determine the similarity between various kinds of fruits. We could categorize fruits in terms of spherical shape, crunchy, sweet, and sour. You might then characterize an apple and banana as:

Apple = (1, 1, 1, 1) since an apple is spherical, crunchy, sweet, and sour

Banana = (0, 0, 1, 0) since a banana is not spherical, not crunchy, sweet, and not sour

| A n B | = 1, since Apple and Banana share only 1 characteristics

| A u B | = 4, since there are a total 4 characteristics

J(A, B) = 1/4

Note: Jaccard often drops properties from the union length (the denominator of the equation) if neither items have the property. For example, if the Apple were defined as (1, 1, 1, 0), you could choose to ignore the "Sour" property all together since neither fruit has this property. This would change the length of the union from 4 to 3, making the similarity 1/3 instead of 1/4. If you ponder this for a second it actually makes sense. You wouldn't want to penalize the similarity because two objects both don't have a specific property. That shared absence of a characteristic actually makes them more similar, not less similar.

### When to Use

Jaccard makes sense when data attributes are binary. E.g. Finding similarity between articles based on keywords would make sense because either a keyword exists or it doesn't. Trying to find similarity between movie reviewers makes very little sense with this algorithm because the binary attributes are did the reviewer watch the movie or didn't they. It would ignore their rating as to whether they actually liked the movie or not.

### The Code

## Classifier: Extended Jaccard / Tanimoto

Plain vanilla Jaccard is designed around the similarity of objects with binary characteristics (0 or 1). This definition can be extended to handle continuous or discrete non-negative features. If you express Jaccard over a bit vector you will get the following equation:

where

Aside: This equation holds true to Jaccard because when dealing with bit vectors where the value of each dimension is either 0 or 1 (i.e. binary data), this equation becomes the original Jaccard equation from above. The pieces of the proof are shown below.

### Extended Jaccard English Translation

Again if we translate that equation to English it makes quite a bit of sense. The dot product is simply calculating the intersection, so the numerator in this equation is the same as standard Jaccard. The denominator is representation for the length of A + length of B minus their intersection (since you wouldn't want to count the values in the intersection twice), which is the definition of union.

Going back to the movie reviewer example, we can now use Jaccard to calculate similarity, but we will also take into account their actual review scores. Or if we still wanted to classify fruit as in the original example, we could elaborate our fruit characterization to support "crunchy-ness" that could be values on a scale of 1 to 10, rather than just a binary is or is not crunchy.

### When to Use

The Extended Jaccard Coefficient has been now extended to handle cases where data is discrete or continuous.

### The Code

## Classifier: Cosine Similarity

Cosine Similarity is measure of similarity that is calculated by plotting data as vectors in n-space and calculating the cosine of the angle between those vectors. If you've read my intro to the math behind Pearson and Euclidean, you've already seen how other measures make use of data as points or vectors in n-space and use that n-space representation to calculate similarity. But in this case we are using the angles as a measure of similarity.

## Cosine English Translation

Before I delve into the math, I want to elaborate on the concept just a bit. If you understand that an item to be compared can be represented as a vector, where each attribute on the object is an entry in the vector, then we can plot that vector in n-space. Once the two items are plotted in n-space (see plot below), you can see that the angle between the two vectors is a reasonable approach to compare the vectors. Vectors with a small angle difference will have smaller differences in the individual vector values (i.e. item attributes). And the opposite is true. Vectors with smaller differences in the individual vector values (i.e. item attributes) will have a smaller angle difference between them. To say that back... Items that are similar will have a small angle between their vector representation.

### The Math

Using the Euclidean Dot Product Fomula,

we can isolate the cosine angle using simple division.

At this point, every piece of the equation has been explained because they show up in Extended Jaccard, except Extended Jaccard uses length of A^2, and this equation uses length of A.

### The Code

# Conclusion

Hopefully this took some of the voodoo out of a couple similarity classifiers. I am not a statistics professor, so I usually have to spend several hours studying the math, and trying to understand the intent before everything clicks in my mind. With luck this tutorial can get you to zen-like understanding much quicker.

As always, contact me (Jeff) at plummeronsoftware@gmail.com with any questions or comments.

## 15 comments:

ray-ban sunglasseslouis vuitton bagsmichael kors outlet onlinelululemon uklouis vuitton outletswarovski outletlouis vuitton outlet storesmichael kors factory outletburberry outletmulberry outlethermes outlet storeray ban sunglassesmichael kors handbagscheap oakley sunglasseshermes outletair max 2015fitflops outlet salelouis vuitton handbagsferragamo outlettiffany and conike tn pas cherrolex ukmulberry salereplica watchesbottega veneta outlet onlinecheap michael kors handbagsfitflops clearancetoms shoeschristian louboutin outletchaussure louboutinceline outlet onlinetoms outletcheap jordan shoesnike blazer pas cherferragamo outlet20160528caihuali

20160620meiqing

sac longchamp pliagecoach factory outletcoach outletcoach factory outletadidas nmd runnerconverse shoesburberry outletfitflops sale clearancenike roshe run womenfitflopsgucci handbagsburberry outletray ban outletcheap oakley sunglassesray ban sunglasseslouis vuitton outletburberry outletasics outletyeezy boost 350reebok ukray banshermes beltkate spade outletcanada gooseholliste salecartier watchesyeezy boost 350 blackmichael kors outletnike trainerschaussure louboutinkate spade outletchristian louboutin shoeslouis vuitton factory outletnike free flyknit 4.0bottega veneta handbagsrunning shoesreebok pumpdiscount jordansair jordanscoach factory outletretro 11louboutin shoesray ban sunglasseskobe 8rolex watchesgiuseppe zanottioakley sunglassesadidas running shoestiffany jewelryadidas superstar trainersgucci handbagslouis vuitton purseslebron 12louis vuitton outletburberry handbagsair jordan femmetimberland outletmichael kors outlet onlinemichael kors outletnike storecoach outlet onlineray ban sunglassesfit flopscoach factory outlet onlinelouis vuitton handbagssupra for salecoach outlet onlinechenyingying712

christian louboutin shoesmichael kors outletchicago bulls jerseyshermes outletray ban sunglassesadidas superstarlongchamp outletcoach factory outletfitflops sale clearancecheap jordans201610.6wengdongdong

Nike Air Max 90 ShoesNike Free Flyknit 5.0 ShoesNike LunarGlide 8 ShoesAdidas Yeezy 350 boost ShoesNike Mercurial Soccer BootsLebron James XIII Basketball Shoeslongchamp bagsed hardy ukpolo ralph lauren outletnike trainersmichael kors handbagsmichael kors handbags outletmichael kors handbagsmichael kors outletnike blazer lowcheap jordansyeezy boost 350 whitepatriots jerseyscowboys jerseysthe north face outletmichael kors outletgucci borseravens jerseysconverse shoeshugo boss outletair jordan uklouis vuitton pas cherhermes bagspandora charms sale clearancecheap jordans for salelongchamp handbagsuggsmoncler outlet onlinecanada gooseadidas nmd r1kd shoeszhi20161209

basketball shoesmichael kors outlet onlinelouis vuitton outletadidas yeezy 350tods shoeswashington wizards jerseysugg outletcanada goose clothingtoms shoesmichael kors outlet online201612.27wengdongdong

moncler outletmichael kors outletbeats by dr drecoach factory outletadidas shoeslongchamp outletgucci pursesomega replica watchesbeats earbudsthe north face20170209CAIYAN

kobe shoeskobe shoeslongchampbrady jerseymichael kors outlet onlinelongchamp outletair maxlongchamp salecheap basketball shoesair jordan shoespandora charms outletjordan ukoakley sunglasses outletcoach factory outlet onlineecco outletchristian louboutin outletcheap ray ban sunglassesnike outlet storechristian louboutin pas cherpolo ralph lauren outlet online20170707yuanyuan

versace jeansyeezy shoesoakley sunglasseschargers jerseyscoach outletugg outletchristian louboutin shoescoach outletralph lauren outletcheap ugg bootsultra boostmichael kors handbags outletyeezy shoesyeezy boosttimberland bootsreebok outletyeezy boost 350 v2jordan retronike zoom running shoekyrie 3 shoesqzz0621

ugg outletfitflopsmichael kors outletbulls jerseyspolo outletralph lauren polokate spade outletmichael kors outletg-star jeansfingerlings monkeyPost a Comment