I am working on a data mining project. My company has given me dummy Twitter data for 6 million customers, and I have been asked to measure the similarity between any two users. Can anyone give me some ideas on how to deal with community data this large? Thanks in advance.

Problem: I use tweets and hashtags (hashtags are the words a user highlights) as the two criteria for measuring the similarity between two users. There are a large number of users, and each user may have millions of hashtags and tweets. Can anyone suggest a fast way to calculate the similarity between two users? I have tried TF-IDF, but it seems infeasible at this scale. Does anyone have an algorithm or idea that would let me quickly find the similarities between all users?

For example:
User A's hashtags = {cat, bull, cow, chicken, duck}
User B's hashtags = {cat, chicken, cloth}
User C's hashtags = {lenovo, Hp, Sony}

Clearly, C has no relation to A, so calculating their similarity is a waste of time. We could filter out all the unrelated users before calculating any similarity; in fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as the criterion to quickly find the group of users potentially similar to A? Is this a good idea, or should we just directly calculate the similarity between A and every other user? Which algorithm would be fastest for this problem?

## Solution 2

If we want groups, then we are talking about clustering. There are many clustering algorithms.
Most clustering algorithms rely on a distance measure. The simplest distance measure we can imagine is the Euclidean one, but in a more complex space we have to look for better ones. Since a distance measure involves only two entities, it is not hard to come up with one. But it is hard to find the most suitable!
In your case, a possible measure could be the number of common hashtags, also taking into account how many hashtags each person has. So first of all, you need the distribution of the number of hashtags per person. This way you can work out how much weight to give the total number of tags compared to the number of common tags. Something like this:
- let's say the maximum number of hashtags per person is 20, and the distribution of this number is linear across the sample
- so the most distant people are two who have 20 tags each, with none of them in common
- I think the distance between two people who have only 1 tag each, and different ones, has to be less than that maximum
- the next step closer is two people with 1 tag each, which they share
- the closest are two people with 20 tags each, all in common
- but the number of tags is less important than the number of common tags, since the tag count is low and linear.
So you should come up with a calculation that turns this logic into a number.
But you will probably need to test several clustering methods and measures before you are satisfied with the result.
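
One calculation that respects the ordering in the list above could be sketched as follows. This is only an illustration, not a recommended measure: the function name `tag_distance`, the parameter `common_weight`, and the scoring rule (each unshared tag adds 1, each shared tag subtracts `common_weight`, then rescale to [0, 1]) are all assumptions made up for this example, as is the cap of 20 tags per person.

```python
def tag_distance(tags_a, tags_b, max_tags=20, common_weight=2.0):
    """Distance in [0, 1] between two users' hashtag sets.

    Hypothetical scoring: every tag that is not shared adds 1, every shared
    tag subtracts `common_weight`, and the result is rescaled to [0, 1].
    Assumes no user has more than `max_tags` distinct tags.
    """
    a, b = set(tags_a), set(tags_b)
    common = len(a & b)
    unshared = len(a) + len(b) - 2 * common
    raw = unshared - common_weight * common

    lowest = -common_weight * max_tags   # max_tags tags each, all in common
    highest = 2 * max_tags               # max_tags tags each, none in common
    return (raw - lowest) / (highest - lowest)
```

With these assumptions, two users with 20 disjoint tags each are at distance 1, two users with the same 20 tags are at distance 0, and the two single-tag cases fall in between in the order described. Raising `common_weight` makes shared tags count for more relative to the total number of tags.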

This might be useful: http://msdn.microsoft.com/en-us/library/ms174879(v=sql.105).aspx

## Solution 1

ldaneil305

I would probably insert the hashtags into a database. From there you can run queries to find out how similar people are to you.

I would create 3 simple tables as follows

UserTable
UserID

HashtagTable
HashtagID
Hashtag

UserHashtag
UserID
HashtagID
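
As a rough sketch, the three tables could be created and filled like this. It uses Python's built-in `sqlite3` module with an in-memory database purely for illustration; the column types, the primary keys, the extra index `idx_userhashtag_tag`, and the helper name `build_database` are assumptions, not part of the layout above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE UserTable (
    UserID INTEGER PRIMARY KEY
);
CREATE TABLE HashtagTable (
    HashtagID INTEGER PRIMARY KEY,
    Hashtag   TEXT NOT NULL UNIQUE
);
CREATE TABLE UserHashtag (
    UserID    INTEGER NOT NULL REFERENCES UserTable(UserID),
    HashtagID INTEGER NOT NULL REFERENCES HashtagTable(HashtagID),
    PRIMARY KEY (UserID, HashtagID)
);
-- Lets the engine look up "who uses this tag" without scanning the table.
CREATE INDEX idx_userhashtag_tag ON UserHashtag (HashtagID, UserID);
"""


def build_database(user_tags):
    """Create an in-memory database from {user_id: iterable of hashtags}."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for user_id, tags in user_tags.items():
        conn.execute("INSERT INTO UserTable (UserID) VALUES (?)", (user_id,))
        for tag in set(tags):
            # Store each distinct hashtag once, then link it to the user.
            conn.execute(
                "INSERT OR IGNORE INTO HashtagTable (Hashtag) VALUES (?)", (tag,)
            )
            (tag_id,) = conn.execute(
                "SELECT HashtagID FROM HashtagTable WHERE Hashtag = ?", (tag,)
            ).fetchone()
            conn.execute(
                "INSERT INTO UserHashtag (UserID, HashtagID) VALUES (?, ?)",
                (user_id, tag_id),
            )
    conn.commit()
    return conn
```

The composite primary key on `UserHashtag` keeps each user/tag pair unique, so counting rows per user counts distinct tags.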

You need to decide what the threshold for user similarity is. Is it two similar tags or three?

Try the query below. You'll have to loop through each of your users to find similar users, but it does appear to work. I'm sure there are faster ways...

```sql
-- Find users who share at least 3 hashtags with user 1.
SELECT UserID,
       COUNT(HashtagID) AS CommonTags
FROM UserHashtag
WHERE HashtagID IN (
        SELECT HashtagID
        FROM UserHashtag
        WHERE UserID = 1
      )                      -- get all the tags that belong to this user
  AND UserID != 1            -- don't match the current user
GROUP BY UserID
HAVING COUNT(HashtagID) > 2  -- for 3 or more matches
ORDER BY COUNT(HashtagID) DESC;
```
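
One possible way to avoid looping over users is to join `UserHashtag` to itself so that every pair sharing tags comes back from a single query. The sketch below runs that through Python's `sqlite3` module; the names `PAIR_QUERY` and `similar_pairs` are made up for this example, and whether the join is actually faster on 6 million users depends on the database engine and its indexes, so treat it as something to measure rather than a guaranteed improvement.

```python
import sqlite3

PAIR_QUERY = """
SELECT a.UserID AS UserA, b.UserID AS UserB, COUNT(*) AS CommonTags
FROM UserHashtag AS a
JOIN UserHashtag AS b
  ON a.HashtagID = b.HashtagID
 AND a.UserID < b.UserID          -- each pair once, never a user with itself
GROUP BY a.UserID, b.UserID
HAVING COUNT(*) >= ?
ORDER BY CommonTags DESC, UserA, UserB
"""


def similar_pairs(conn, min_common=3):
    """Return (UserA, UserB, CommonTags) for pairs sharing >= min_common tags.

    Assumes `conn` is an open sqlite3 connection with a UserHashtag table
    in which each (UserID, HashtagID) pair appears at most once.
    """
    return conn.execute(PAIR_QUERY, (min_common,)).fetchall()
```

Pairs with no tag in common never match the join condition, so unrelated users are skipped without any explicit filtering step.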

Good luck!

Hogan

## Solution 3

Because of the large number of users, the number of hashtags per user can also be very large. How can the hashtag similarity between two users be sorted/computed quickly? Let's say the similarity computation is simply: (2 * the number of common hashtags) / (the total number of hashtags of A + the total number of hashtags of B). In fact, the main problem here is a sorting problem.
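
A minimal sketch of that formula, combined with the "filter out unrelated users first" idea from the question, might look like this. It builds an inverted index from hashtag to users, so common tags are only counted for users who share at least one tag with the target. The function names `build_tag_index` and `dice_similarities` are hypothetical, and the sketch assumes all the tag sets fit in memory.

```python
from collections import defaultdict


def build_tag_index(tag_sets):
    """Inverted index: hashtag -> set of users who used it."""
    users_by_tag = defaultdict(set)
    for user, tags in tag_sets.items():
        for tag in tags:
            users_by_tag[tag].add(user)
    return users_by_tag


def dice_similarities(tag_sets, users_by_tag, target):
    """Return {other_user: similarity} for users sharing >= 1 tag with target.

    similarity = 2 * common / (tags of target + tags of other)
    `tag_sets` maps each user to a set of hashtags.
    """
    # Count common tags only for users listed under one of target's tags.
    common = defaultdict(int)
    for tag in tag_sets[target]:
        for other in users_by_tag[tag]:
            if other != target:
                common[other] += 1

    size = len(tag_sets[target])
    return {
        other: 2 * n / (size + len(tag_sets[other]))
        for other, n in common.items()
    }
```

For the three users in the question, looking up A returns only B with a similarity of 0.5, and C never appears because it shares no tag with A. The index is built once and reused for every target user; ranking the result is then a sort of the returned dictionary by value.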