How to calculate the similarity of two users on Twitter

Question


I am working on a data mining project. My company has given me dummy Twitter data for 6 million customers, and I have been asked to measure the similarity between any two users. Can anyone give me some ideas on how to handle such a large community data set? Thanks in advance.

Problem: I use tweets and hashtags (hashtags are the words highlighted by the user) as the two criteria for measuring the similarity between two users. The number of users is large, and each user may have millions of hashtags and tweets. Can anyone suggest a fast way to calculate the similarity between two users? I have tried TF-IDF, but it seems infeasible at this scale. Does anyone know an algorithm, or have good ideas, that would let me quickly find all the similarities between users?

For example:
user A's hashtags = {cat, bull, cow, chicken, duck}
user B's hashtags = {cat, chicken, cloth}
user C's hashtags = {lenovo, Hp, Sony}

Clearly, C has no relation to A, so calculating their similarity is a waste of time. We could filter out all unrelated users before calculating any similarity; in fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as a criterion to quickly find the group of users potentially similar to A? Is this a good idea, or should we just directly calculate the similarity between A and every other user? Which algorithm would be the fastest for this problem?
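
One common way to do this kind of pre-filtering is an inverted index that maps each hashtag to the users who used it, so that only users sharing at least one hashtag with A are ever looked at. The following is a minimal sketch, assuming the data fits in memory; the names `build_index` and `candidates` are hypothetical, and the data is the example above.

```python
from collections import defaultdict

# Example data from the question (user -> set of hashtags).
user_tags = {
    "A": {"cat", "bull", "cow", "chicken", "duck"},
    "B": {"cat", "chicken", "cloth"},
    "C": {"lenovo", "Hp", "Sony"},
}

def build_index(user_tags):
    """Map each hashtag to the set of users who used it."""
    index = defaultdict(set)
    for user, tags in user_tags.items():
        for tag in tags:
            index[tag].add(user)
    return index

def candidates(user, user_tags, index):
    """Return {other_user: shared hashtag count} for users sharing at least one tag."""
    shared = defaultdict(int)
    for tag in user_tags[user]:
        for other in index[tag]:
            if other != user:
                shared[other] += 1
    return dict(shared)

index = build_index(user_tags)
print(candidates("A", user_tags, index))  # only B shares hashtags with A
```

User C never shows up as a candidate for A, so no similarity has to be computed for that pair. The shared counts collected along the way can feed directly into whichever similarity measure is chosen.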

Posted 27-Dec-12 7:18am

ldaneil


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

**Zoltán Zörgő** · Answer 1 · 2012-12-27T09:26:00

If we want groups, then we are talking about clustering. There are many clustering algorithms.
Most clustering algorithms use distance measures. Of course, the simplest distance measure we can imagine is the Euclidean (Cartesian) one, but in a complex space we have to look for better ones. Since a distance measure involves only two entities, it is not hard to come up with one. But it is hard to find the most suitable one!
In your case a possible measure could be the number of common hashtags, also taking into account how many hashtags each person has. So first of all, you need the distribution of the hashtags assigned to the people. This way you can figure out the weight to assign to the number of tags compared to the number of common tags. Something like this:
- let's say the maximum number of hashtags is 20, and the distribution of this number is linear across the sample
- so the most distant people are those who have 20 tags each, none of them in common
- I think this distance has to be less than the maximum for people who have only 1 tag each, and those tags are different
- the next distance step is people with 1 tag each, which they have in common
- the closest are those who have 20 tags each, all in common
- but the number of tags is less important than the number of common tags, since the tag number is low and linear.
So you should figure out a calculation that assigns a number to this logic.
But you will probably need to test several clustering methods and measures until you are satisfied with the result.
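
The ordering described in the list above can be turned into a number in many ways. The sketch below is only one illustrative choice, not a recommended measure: the function name `tag_distance`, the equal weighting of shared and unshared tags, and the normalisation by `3 * max_tags` are all assumptions made for the example.

```python
MAX_TAGS = 20  # assumed maximum number of hashtags per person, as in the list above

def tag_distance(a, b, max_tags=MAX_TAGS):
    """Illustrative distance in [0, 1] between two hashtag sets.

    Tags held by only one of the two people push them apart;
    every shared tag pulls them closer together.
    Assumes neither set has more than max_tags elements.
    """
    common = len(a & b)
    different = len(a ^ b)  # tags that only one of the two people has
    return (different + (max_tags - common)) / (3 * max_tags)

twenty_a = {f"a{i}" for i in range(20)}
twenty_b = {f"b{i}" for i in range(20)}

far = tag_distance(twenty_a, twenty_b)      # 20 tags each, none in common
one_diff = tag_distance({"x"}, {"y"})       # 1 tag each, different
one_same = tag_distance({"x"}, {"x"})       # 1 tag each, in common
near = tag_distance(twenty_a, twenty_a)     # 20 tags each, all in common
print(far > one_diff > one_same > near)
```

With these assumptions the four cases come out in the order given in the list, from most distant to closest. Different weights would still have to be tried against the real hashtag distribution.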

This might be useful: http://msdn.microsoft.com/en-us/library/ms174879(v=sql.105).aspx

snorkie · Answer 2 · 2012-12-27T08:53:00

ldaneil305

I would probably insert the hashtags into a database. From there you can run queries to find out how similar people are to you.

I would create 3 simple tables as follows:

UserTable
    UserID
    UserName

HashtagTable
    HashtagID
    Hashtag

UserHashtag
    UserID
    HashtagID

You need to decide what the threshold for user similarity is. Is it two similar tags or three?

Try the query below. You'll have to loop through each of your users to find similar users, but it does appear to work. I'm sure there are faster ways...

SQL

SELECT UserID,
       COUNT(HashtagID) AS CommonTags
FROM UserHashtag
WHERE HashtagID IN (
                    SELECT HashtagID
                    FROM UserHashtag
                    WHERE UserID = 1
                   )              -- all the tags that belong to this user
  AND UserID != 1                 -- don't match the current user
GROUP BY UserID                   -- one row per other user
HAVING COUNT(HashtagID) > 2       -- 3 or more matching tags
ORDER BY COUNT(HashtagID) DESC    -- most similar users first
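
To avoid looping over users one at a time, the same three-table layout can be queried with a self-join that counts shared hashtags for every pair at once. Below is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the column types, the primary keys, and the function name `similar_pairs` are assumptions for the example, and the data is the A/B/C example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserTable    (UserID INTEGER PRIMARY KEY, UserName TEXT);
    CREATE TABLE HashtagTable (HashtagID INTEGER PRIMARY KEY, Hashtag TEXT UNIQUE);
    CREATE TABLE UserHashtag  (UserID INTEGER, HashtagID INTEGER,
                               PRIMARY KEY (UserID, HashtagID));
""")

users = {
    "A": ["cat", "bull", "cow", "chicken", "duck"],
    "B": ["cat", "chicken", "cloth"],
    "C": ["lenovo", "Hp", "Sony"],
}
for name, tags in users.items():
    user_id = conn.execute(
        "INSERT INTO UserTable (UserName) VALUES (?)", (name,)).lastrowid
    for tag in tags:
        # Store each distinct hashtag once, then link it to the user.
        conn.execute("INSERT OR IGNORE INTO HashtagTable (Hashtag) VALUES (?)", (tag,))
        tag_id = conn.execute(
            "SELECT HashtagID FROM HashtagTable WHERE Hashtag = ?", (tag,)).fetchone()[0]
        conn.execute("INSERT INTO UserHashtag VALUES (?, ?)", (user_id, tag_id))

def similar_pairs(conn, min_common):
    """Return (user1, user2, shared tag count) for every pair at or above the threshold."""
    return conn.execute("""
        SELECT u1.UserName, u2.UserName, COUNT(*) AS CommonTags
        FROM UserHashtag AS a
        JOIN UserHashtag AS b
          ON a.HashtagID = b.HashtagID AND a.UserID < b.UserID
        JOIN UserTable AS u1 ON u1.UserID = a.UserID
        JOIN UserTable AS u2 ON u2.UserID = b.UserID
        GROUP BY a.UserID, b.UserID
        HAVING COUNT(*) >= ?
        ORDER BY CommonTags DESC
    """, (min_common,)).fetchall()

print(similar_pairs(conn, 2))  # A and B share "cat" and "chicken"
```

The condition `a.UserID < b.UserID` reports each pair only once and skips matching a user with itself. Pairs with no shared hashtag never appear in the join, so unrelated users are filtered out without any extra step. On millions of rows an index on `UserHashtag(HashtagID)` would likely be needed.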

Good luck!

Hogan

ldaneil · Answer 3 · 2013-01-03T21:05:00

Because the number of users is large, the number of hashtags per user can also be very large. How can the hashtag similarity between two users be sorted and computed quickly? Let's say the similarity is simply: (2 * the number of common hashtags) / (the total number of hashtags of A + the total number of hashtags of B). In fact, the main problem here is a sorting problem.
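
The formula above is the Sørensen-Dice coefficient. Once each user's hashtags are sorted, the common ones can be counted in a single merge-style pass over both lists. The sketch below illustrates that idea; the function names are hypothetical, and returning 0.0 for two empty tag lists is an assumption.

```python
def count_common_sorted(a, b):
    """Count items present in both sorted, duplicate-free lists in one merge pass."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common

def dice_similarity(tags_a, tags_b):
    """(2 * common hashtags) / (hashtags of A + hashtags of B)."""
    a, b = sorted(set(tags_a)), sorted(set(tags_b))
    if not a and not b:
        return 0.0  # assumption: two users without hashtags are treated as dissimilar
    return 2 * count_common_sorted(a, b) / (len(a) + len(b))

user_a = ["cat", "bull", "cow", "chicken", "duck"]
user_b = ["cat", "chicken", "cloth"]
user_c = ["lenovo", "Hp", "Sony"]

print(dice_similarity(user_a, user_b))  # 2 * 2 / (5 + 3) = 0.5
print(dice_similarity(user_a, user_c))  # no common hashtags: 0.0
```

If the sorted lists are stored once per user, each comparison costs time proportional to the combined length of the two lists rather than requiring a fresh sort.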
