
Finding Document Similarity using Cosine Theorem

m0nt0y4, 7 Dec 2006
Finding Similarity in Docs

 


In college we learned that, given two points in Euclidean space, we can draw a line from the origin to each point and compute the cosine of the angle between the two lines. In data mining, the same technique measures how similar two documents are: the closer the cosine is to 1, the more similar the documents. But how? For example, take these two sentences:

i have to go to school.
i have to go to toilet.

The words of the first sentence are i, have, to, go, to, school. Every word has frequency 1 except "to", which has frequency 2.

The words of the second sentence are i, have, to, go, to, toilet. Again, every word has frequency 1 except "to", which has frequency 2.

If we treat each distinct word as one axis of an n-dimensional space, each sentence becomes a point whose coordinates are its word frequencies:

Sentence 1: [i, have, to, go, school, toilet] = [1, 1, 2, 1, 1, 0]
Sentence 2: [i, have, to, go, school, toilet] = [1, 1, 2, 1, 0, 1]
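The counts above can be produced mechanically. The following is a minimal sketch, not the code from the article's download: the class name `WordCounter`, the method name `CountWords`, and the choice of separators are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Hypothetical helper (not from the article): counts how often each word
    // occurs in a text, ignoring case and a few common punctuation marks.
    public static Dictionary<string, double> CountWords(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        var table = new Dictionary<string, double>();

        foreach (string word in text.ToLowerInvariant()
                                    .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double count;
            table.TryGetValue(word, out count); // count is 0 when the word is new
            table[word] = count + 1;
        }
        return table;
    }
}
```

Calling `WordCounter.CountWords("i have to go to school.")` yields five distinct words, with "to" counted twice and every other word counted once.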

cos = (1*1 + 1*1 + 2*2 + 1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2) * sqrt(1^2 + 1^2 + 2^2 + 1^2 + 0^2 + 1^2)) = 7 / (sqrt(8) * sqrt(8)) = 0.875
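For reference, the calculation above is an instance of the general cosine formula for two n-dimensional vectors. The symbols a_i and b_i (the frequency of word i in the first and second document) are notation introduced here, not taken from the article.

```latex
\cos\theta
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

The numerator is the dot product of the two vectors, and the denominator is the product of their lengths.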

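The interesting part of the code is handling the words that exist in only one of the documents. Each such word is added to the other document's table with frequency 0, so that both tables end up with the same set of keys.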

    private static void PrepareTwoHashTable(Dictionary<string, double> table1, Dictionary<string, double> table2)
    {
        // for table1: words missing from table2 are added to table2 with frequency 0
        foreach (KeyValuePair<string, double> kv in table1)
        {
            if (!table2.ContainsKey(kv.Key))
                table2.Add(kv.Key, 0);
        }

        // for table2: words missing from table1 are added to table1 with frequency 0
        foreach (KeyValuePair<string, double> kv in table2)
        {
            if (!table1.ContainsKey(kv.Key))
                table1.Add(kv.Key, 0);
        }
    }
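Once both tables share the same keys, the cosine can be computed in a single pass. The sketch below is one possible way to do this, not necessarily how the article's download does it. The names `CosineSketch` and `CosineSimilarity` are hypothetical, and the method assumes the tables have already been aligned, for example by `PrepareTwoHashTable`.

```csharp
using System;
using System.Collections.Generic;

public static class CosineSketch
{
    // Hypothetical helper: assumes table1 and table2 contain exactly the same keys.
    public static double CosineSimilarity(Dictionary<string, double> table1,
                                          Dictionary<string, double> table2)
    {
        double dot = 0, norm1 = 0, norm2 = 0;

        foreach (KeyValuePair<string, double> kv in table1)
        {
            // Throws KeyNotFoundException if the tables were not aligned first.
            double other = table2[kv.Key];
            dot += kv.Value * other;
            norm1 += kv.Value * kv.Value;
            norm2 += other * other;
        }

        // A vector of all zeros has no direction; return 0 by convention (an assumption).
        if (norm1 == 0 || norm2 == 0)
            return 0;

        return dot / (Math.Sqrt(norm1) * Math.Sqrt(norm2));
    }
}
```

For the two example sentences, with vectors [1, 1, 2, 1, 1, 0] and [1, 1, 2, 1, 0, 1], this returns approximately 0.875.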


 

Term frequency (TF) normalization rescales every word's count into the interval [0, 1] by dividing it by the total number of word occurrences in the document, so we implement it in our project.

    private static Dictionary<string, double> TfFactorized(Dictionary<string, double> table)
    {
        // Total number of word occurrences in the document
        double sum = 0;
        foreach (KeyValuePair<string, double> kv in table)
        {
            sum += kv.Value;
        }

        // Divide each count by the total so that every value falls in [0, 1]
        Dictionary<string, double> tfTable = new Dictionary<string, double>();
        foreach (KeyValuePair<string, double> kv in table)
        {
            tfTable.Add(kv.Key, kv.Value / sum);
        }
        return tfTable;
    }
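As a worked example, the first sentence contains 6 word occurrences in total, so "to" gets a term frequency of 2/6 and every other word in it gets 1/6. The sketch below expresses the same division with LINQ. The names `TfSketch` and `Normalize` are hypothetical, and the guard for an empty table is an assumption that the code above does not make.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TfSketch
{
    // Hypothetical LINQ variant of the same idea: divide each count by the total.
    public static Dictionary<string, double> Normalize(Dictionary<string, double> table)
    {
        double total = table.Values.Sum();

        // Assumption: with no word occurrences there is nothing to scale,
        // so return an unchanged copy instead of dividing by zero.
        if (total == 0)
            return new Dictionary<string, double>(table);

        return table.ToDictionary(kv => kv.Key, kv => kv.Value / total);
    }
}
```

After normalization the values of a non-empty table sum to 1.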

 


License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

m0nt0y4
Web Developer
Turkey


Article Copyright 2006 by m0nt0y4