# Finding Document Similarity using Cosine Theorem

7 Dec 2006
Finding Similarity in Docs

In college we learned that, given two points in Euclidean space, we can draw a line from the origin to each point and then find the cosine of the angle between the two lines. In data mining we can use the same technique to measure the similarity of two documents. But how? For example:

i have to go to school.
i have to go to toilet.

The words of the first sentence are i, have, to, go, to, school, and every word has frequency 1 except "to", which has frequency 2.

The words of the second sentence are i, have, to, go, to, toilet, and again every word has frequency 1 except "to", which has frequency 2.

If we treat each distinct word as one dimension of an n-dimensional space, the two sentences become the following points:

1   [i , have , to , go,  school , toilet] = [1,1,2,1,1,0]
2   [i , have , to , go , school , toilet] = [1,1,2,1,0,1]

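The cosine is the dot product of the two vectors divided by the product of their lengths:

$$
\cos\theta = \frac{1\cdot1 + 1\cdot1 + 2\cdot2 + 1\cdot1 + 1\cdot0 + 0\cdot1}{\sqrt{1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 2^2 + 1^2 + 0^2 + 1^2}} = \frac{7}{\sqrt{8}\,\sqrt{8}} = \frac{7}{8} = 0.875
$$

The closer the value is to 1, the more similar the two documents are.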
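The calculation can be sketched in a few lines of Python. This is only an illustration, not the article's original code; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Numerator: dot product of the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: product of the two Euclidean lengths.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a document with no words is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# The two example sentences as word-frequency vectors over
# [i, have, to, go, school, toilet].
first = [1, 1, 2, 1, 1, 0]
second = [1, 1, 2, 1, 0, 1]
print(cosine_similarity(first, second))  # approximately 0.875
```

Because of floating-point rounding in the square roots, the printed value may differ from 0.875 in the last digits.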

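The interesting part of the code is finding the non-existing words, that is, words that appear in one document but not in the other, so that they can be given a frequency of 0.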
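One way to handle words missing from one of the documents is to build a shared vocabulary first and then look up each word's count in both documents. The sketch below assumes simple whitespace tokenization with punctuation stripped; the helper names `tokenize` and `to_vectors` are hypothetical and not taken from the original project.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,;:!?") for w in text.lower().split())
    return [w for w in words if w]

def to_vectors(doc_a, doc_b):
    """Return the shared vocabulary and one frequency vector per document."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))
    # The union of both word sets gives one dimension per distinct word.
    vocabulary = sorted(set(counts_a) | set(counts_b))
    # A Counter returns 0 for a missing key, which covers the
    # "non-existing" words without any special handling.
    vec_a = [counts_a[w] for w in vocabulary]
    vec_b = [counts_b[w] for w in vocabulary]
    return vocabulary, vec_a, vec_b

vocab, v1, v2 = to_vectors("i have to go to school.", "i have to go to toilet.")
print(vocab)  # ['go', 'have', 'i', 'school', 'to', 'toilet']
print(v1)     # [1, 1, 1, 1, 2, 0]
print(v2)     # [1, 1, 1, 0, 2, 1]
```

The vocabulary is sorted here only to make the dimension order deterministic; any fixed order works as long as both vectors use the same one.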


Term frequency normalization scales every word's frequency into the interval [0, 1], so we implement it in our project.
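The article does not say which normalization it uses. A minimal sketch, assuming the common choice of dividing each word's count by the total number of words in the document, could look like this (the function name `term_frequencies` is hypothetical):

```python
from collections import Counter

def term_frequencies(words):
    """Map each word to its count divided by the document's word count."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty document has no term frequencies
    # Assumption: normalize by total word count, so every value is in [0, 1].
    return {word: count / total for word, count in counts.items()}

tf = term_frequencies("i have to go to school".split())
print(tf["to"])      # 2 of the 6 words, about 0.333
print(tf["school"])  # 1 of the 6 words, about 0.167
```

Another widely used variant divides by the count of the most frequent word instead; it also maps every frequency into [0, 1].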
