|
I am working on a data mining project. My company has given me dummy Twitter data for 6 million customers, and I have been assigned to measure the similarity between any two users. Could anyone give me some ideas on how to deal with such a large community data set? Thanks in advance.
Problem: I use tweets and hashtags (hashtags are the words highlighted by the user) as the two criteria for measuring the similarity between two users. The number of users is large, and each user may have millions of hashtags and tweets. Can anyone suggest a way to calculate the similarity between two users quickly? I have tried using TF-IDF to calculate the similarity between two users, but it seems infeasible at this scale. Does anyone know an algorithm, or have good ideas, that would let me quickly find the similarities between all users?
For example:
user A's hashtag = {cat, bull, cow, chicken, duck}
user B's hashtag = {cat, chicken, cloth}
user C's hashtag = {lenovo, Hp, Sony}
Clearly, C has no relation to A, so calculating their similarity is a waste of time. We could filter out all such unrelated users before calculating any similarity. In fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as a criterion to quickly find the group of users potentially similar to A? Is this a good idea, or should we just directly calculate the similarity between A and all other users? Which algorithm would be the fastest and best suited to this problem?
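The filtering idea above could be sketched with an inverted index that maps each hashtag to the users who used it, so that only users sharing at least one hashtag with A are ever compared. This is only a minimal sketch: the Jaccard coefficient is an assumed choice of similarity measure, and the function names (`build_index`, `candidates`, `jaccard`) are made up for illustration.

```python
from collections import defaultdict


def normalise(tags):
    """Lower-case hashtags so that 'Hp' and 'hp' count as the same tag."""
    return {tag.lower() for tag in tags}


def build_index(user_tags):
    """Inverted index: hashtag -> set of users who used that hashtag."""
    index = defaultdict(set)
    for user, tags in user_tags.items():
        for tag in tags:
            index[tag].add(user)
    return index


def candidates(user, user_tags, index):
    """Users sharing at least one hashtag with `user`; everyone else is skipped."""
    found = set()
    for tag in user_tags[user]:
        found |= index.get(tag, set())
    found.discard(user)
    return found


def jaccard(a, b):
    """Jaccard similarity (assumed measure): |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# The example users from the question.
user_tags = {
    "A": normalise({"cat", "bull", "cow", "chicken", "duck"}),
    "B": normalise({"cat", "chicken", "cloth"}),
    "C": normalise({"lenovo", "Hp", "Sony"}),
}
index = build_index(user_tags)
similar_to_a = {
    other: jaccard(user_tags["A"], user_tags[other])
    for other in candidates("A", user_tags, index)
}
```

With the three example users, only B is a candidate for A, and C is never compared at all. For millions of users the index would probably have to live on disk or in a database, and very popular hashtags would return huge candidate sets, so some cap or weighting on common tags is likely needed.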
|
|
|
|
|
Is your company going to give your salary to anyone here for solving this? It's your job after all, not ours.
|
|
|
|
|
No, I am a university student, and I do not get any salary. I just want to discuss this with some coding pros and smart people. I would really appreciate it if someone could give me some ideas. I think the forum is for discussing programming questions, so that we can help each other and improve our programming skills. I hope some capable coders can give me some hints. Thanks.
|
|
|
|
|
You should eliminate trivial words like 'a', 'and', etc.
Then research matching algorithms. I would start with the following Google search string:
algorithms for set matching -string
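As a rough illustration of the first step, trivial words can be dropped before the tweets are compared as sets. This is only a sketch: the stop-word list below is a tiny assumed sample (a real one would be much longer), and `meaningful_words` is a hypothetical helper name.

```python
import re

# Assumed, deliberately tiny stop-word list; extend it for real data.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def meaningful_words(tweet):
    """Split a tweet into lower-case words and drop the trivial ones."""
    words = re.findall(r"[a-z0-9#']+", tweet.lower())
    return {word for word in words if word not in STOP_WORDS}


words_a = meaningful_words("The cat and the duck")
words_b = meaningful_words("A cat is in the barn")
shared = words_a & words_b  # words both tweets have in common
```

The resulting sets can then be fed to whichever set matching algorithm you settle on.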
|
|
|
|
|
Yes, I will definitely have to use strings and arrays to process the data. However, I don't know exactly how to do it yet; the idea is not clear. Thanks very much for your reply.
|
|
|
|
|
Well, you could try to find the similarity, or "document distance", between Twitter users by matching their tweets against each other, much as one would search for plagiarism. Perhaps that might work. You could start by collecting the tweets of a particular Twitter user with some sort of application. If I am not mistaken, Twitter has something like this available. The same comparison can then be carried out between groups of users, which gives a similarity or "document distance" measure across Twitter users.
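To make the "document distance" idea a little more concrete, here is a minimal sketch that treats all of a user's tweets as one document, counts the words, and measures the angle between the two word-count vectors. Using the angle (via cosine similarity) is one common definition of document distance and is an assumption here, as are the function names.

```python
import math
from collections import Counter


def term_counts(tweets):
    """Word frequencies over all of one user's tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_distance(a, b):
    """Angle in radians: 0 means identical word mix, pi/2 means nothing shared."""
    cosine = min(1.0, max(-1.0, cosine_similarity(a, b)))  # guard against rounding
    return math.acos(cosine)


user_a = term_counts(["my cat chased a duck", "the cat sleeps"])
user_b = term_counts(["my cat sleeps all day"])
user_c = term_counts(["new lenovo laptop"])
```

Here `document_distance(user_a, user_c)` comes out as pi/2 because the two users share no words, while `user_a` and `user_b` are closer. The raw counts could be swapped for TF-IDF weights without changing the rest.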
April
Comm100 - Leading Live Chat Software Provider
modified 27-May-14 8:34am.
|
|
|
|
|
Thanks very much for your suggestion. I will do some research on document distance. With such a huge amount of data, a naive approach is definitely infeasible, so I have to find a good idea for how to implement it. The focus of the project is the idea; the coding should be very simple, but if the idea is poor, the whole project becomes useless. I really appreciate your suggestion.
|
|
|
|
|
You're very welcome! It was what initially popped into my head, though I believe there is probably a stronger, more suitable way to carry out such a project given the large amounts of data you will be dealing with.
I find your project quite interesting!
Best of Luck!
With Kind Regards,
April
modified 27-May-14 8:33am.
|
|
|
|
|
Take a look at the Levenshtein distance.
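For reference, a minimal sketch of the standard dynamic-programming version is below. Note that it compares two individual strings, so in this project it is probably most useful for spotting near-identical hashtags (misspellings and the like) rather than for comparing whole users; that use is an assumption on my part.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the row we store as short as possible
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, char_s in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # delete char_s
                current[j - 1] + 1,      # insert char_t
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("hastag", "hashtag")` is 1, so the two spellings could be merged into one tag before any user comparison.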
|
|
|
|