First Posted 7 Dec 2006

Finding Document Similarity using Cosine Theorem

By | 7 Dec 2006 | Article
Finding Similarity in Docs

 


In college we learned that, given two points in Euclidean space, we can draw a line from the origin to each point and then compute the cosine of the angle between the two lines. In data mining we can use the same technique to measure how similar two documents are. But how? For example, take these two sentences:

i have to go to school.
i have to go to toilet.

The words of the first sentence are i, have, to, go, school. Every word has a frequency of 1 except "to", which occurs twice.

The words of the second sentence are i, have, to, go, toilet. Again, every word has a frequency of 1 except "to", which occurs twice.
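The counting step described above could be sketched as follows. This is only an illustration, not the code from the article's download: the class name `WordCounter`, the method name `CountWords`, the separator list, and the lower-casing are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Hypothetical helper: splits the text on whitespace and basic punctuation
    // and counts how many times each word occurs.
    public static Dictionary<string, double> CountWords(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        var counts = new Dictionary<string, double>();

        foreach (string raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Assumption: words are compared case-insensitively.
            string word = raw.ToLowerInvariant();

            // TryGetValue leaves 'current' at 0 when the word has not been seen yet.
            counts.TryGetValue(word, out double current);
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

Calling `WordCounter.CountWords("i have to go to school.")` yields five entries, with "to" mapped to 2 and every other word mapped to 1. The values are stored as `double` so the table has the same `Dictionary<string, double>` type used by the article's methods.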

If we treat each distinct word as one dimension of an n-dimensional space, the two sentences become the following points:

Sentence 1:   [i, have, to, go, school, toilet] = [1, 1, 2, 1, 1, 0]
Sentence 2:   [i, have, to, go, school, toilet] = [1, 1, 2, 1, 0, 1]
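In general form, the quantity we are after is the cosine of the angle between the two frequency vectors. The symbols A, B, a_i, b_i and n are introduced here only for illustration (they do not appear in the article's code): A and B are the two vectors, a_i and b_i their components, and n the number of distinct words.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

Since word frequencies are never negative, the result lies between 0 (no words in common) and 1 (identical direction).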

cos = (1*1 + 1*1 + 2*2 + 1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2) * sqrt(1^2 + 1^2 + 2^2 + 1^2 + 0^2 + 1^2))
    = 7 / (sqrt(8) * sqrt(8)) = 7/8 = 0.875
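The same calculation can be sketched in C# over two word-frequency tables. This is a minimal sketch rather than the article's own implementation: the names `CosineSimilarity` and `Compute` are hypothetical, and returning 0 for an empty table is a convention chosen here to avoid dividing by zero.

```csharp
using System;
using System.Collections.Generic;

public static class CosineSimilarity
{
    // Hypothetical helper: cosine of the angle between two frequency tables.
    // A word missing from one table is treated as having frequency 0 there.
    public static double Compute(Dictionary<string, double> table1,
                                 Dictionary<string, double> table2)
    {
        double dot = 0, sumSquares1 = 0, sumSquares2 = 0;

        foreach (KeyValuePair<string, double> kv in table1)
        {
            sumSquares1 += kv.Value * kv.Value;

            // Only words present in both tables contribute to the dot product.
            if (table2.TryGetValue(kv.Key, out double other))
                dot += kv.Value * other;
        }

        foreach (double value in table2.Values)
            sumSquares2 += value * value;

        // Assumption: an empty or all-zero table has similarity 0 with anything.
        if (sumSquares1 == 0 || sumSquares2 == 0)
            return 0;

        return dot / (Math.Sqrt(sumSquares1) * Math.Sqrt(sumSquares2));
    }
}
```

For the two example sentences the dot product is 7 and each sum of squares is 8, so the result is 0.875 up to floating-point rounding.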

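The interesting part of the code is handling the words that exist in only one of the two documents. Each such word is added to the other document's table with a frequency of 0, so that both tables end up with the same set of keys: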

```csharp
// Requires: using System.Collections.Generic;
private static void PrepareTwoHashTable(Dictionary<string, double> table1, Dictionary<string, double> table2)
{
    // For table1: every word that table2 lacks is added to table2 with frequency 0.
    foreach (KeyValuePair<string, double> kv in table1)
    {
        if (!table2.ContainsKey(kv.Key))
            table2.Add(kv.Key, 0);
    }

    // For table2: every word that table1 lacks is added to table1 with frequency 0.
    foreach (KeyValuePair<string, double> kv in table2)
    {
        if (!table1.ContainsKey(kv.Key))
            table1.Add(kv.Key, 0);
    }
}
```
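`PrepareTwoHashTable` modifies both dictionaries in place. If the original tables need to stay untouched, one possible alternative is to build the shared vocabulary separately and read each table into an aligned array. The sketch below is an assumption-laden illustration, not part of the article's download: `VectorAligner`, `Vocabulary` and `ToVector` are hypothetical names, and the ordinal sort is used only to get a reproducible dimension order.

```csharp
using System;
using System.Collections.Generic;

public static class VectorAligner
{
    // Hypothetical helper: the sorted union of the words in both tables.
    public static string[] Vocabulary(Dictionary<string, double> table1,
                                      Dictionary<string, double> table2)
    {
        var vocabulary = new SortedSet<string>(StringComparer.Ordinal);
        vocabulary.UnionWith(table1.Keys);
        vocabulary.UnionWith(table2.Keys);

        var result = new string[vocabulary.Count];
        vocabulary.CopyTo(result);
        return result;
    }

    // Hypothetical helper: one frequency per vocabulary word, 0 when the word is absent.
    public static double[] ToVector(string[] vocabulary, Dictionary<string, double> table)
    {
        var vector = new double[vocabulary.Length];
        for (int i = 0; i < vocabulary.Length; i++)
        {
            // TryGetValue writes 0 into vector[i] when the key is missing.
            table.TryGetValue(vocabulary[i], out vector[i]);
        }
        return vector;
    }
}
```

With the two example sentences, the vocabulary comes out as go, have, i, school, to, toilet, and the vectors as [1, 1, 1, 1, 2, 0] and [1, 1, 1, 0, 2, 1]. These are the same points as in the text, just with the dimensions in sorted order.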


 

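The aim of term frequency (TF) is to normalize the word counts: dividing each word's count by the total number of words in the document scales every frequency into the [0, 1] interval. We implement this in our project as well: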

```csharp
// Requires: using System.Collections.Generic;
private static Dictionary<string, double> TfFactorized(Dictionary<string, double> table)
{
    // Total number of words in the document.
    double sum = 0;
    foreach (KeyValuePair<string, double> kv in table)
    {
        sum += kv.Value;
    }

    // Divide every count by the total, so each frequency falls in [0, 1].
    // Note: if the table is empty or all counts are 0, sum is 0 and the division yields NaN.
    Dictionary<string, double> tfTable = new Dictionary<string, double>();
    foreach (KeyValuePair<string, double> kv in table)
    {
        tfTable.Add(kv.Key, kv.Value / sum);
    }
    return tfTable;
}
```
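As a hedged variation on the method above, the same normalization can be written with LINQ and a guard for the case where the counts sum to zero. The names `TermFrequency` and `Normalize` are hypothetical, and returning an unchanged copy for an all-zero table is an assumption made for this sketch, not the article's behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermFrequency
{
    // Hypothetical helper: divides every count by the total of all counts.
    public static Dictionary<string, double> Normalize(Dictionary<string, double> table)
    {
        double sum = table.Values.Sum();

        if (sum == 0)
        {
            // Assumption: an empty or all-zero table is returned as an unchanged copy
            // instead of dividing by zero.
            return new Dictionary<string, double>(table);
        }

        return table.ToDictionary(kv => kv.Key, kv => kv.Value / sum);
    }
}
```

For the first example sentence the counts sum to 6, so "to" becomes 2/6 and every other word becomes 1/6. Because the cosine depends only on the direction of the two vectors, dividing each table by its own positive sum should leave the similarity of a document pair unchanged.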

 


License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

m0nt0y4

Web Developer

Turkey




Last Updated 7 Dec 2006
Article Copyright 2006 by m0nt0y4