If you want groups, then we are talking about clustering. There are many clustering algorithms to choose from.

Most clustering algorithms rely on a distance measure. The simplest one we can imagine is the Euclidean distance, but in a more complex space we have to look for a better one. Since a distance measure involves only two entities, it is not hard to come up with one. Finding the most suitable one is the hard part!

In your case, a possible measure could be the number of common hashtags, also taking into account how many hashtags each person has in total. So first of all, you need the distribution of the number of hashtags assigned to each person. From that you can decide how much weight to give the total number of tags compared to the number of common tags. Something like this:

- let's say the maximum number of hashtags per person is 20, and this number is distributed evenly across the sample

- then the most distant people are those who have 20 tags each, none of them in common

- I think the distance should be less than the maximum for two people who have only 1 tag each, and those tags are different

- the next step down in distance is two people with 1 tag each, which they share

- the closest are those who have 20 tags each, all in common

- but the total number of tags is less important than the number of common tags, since the tag count is low and evenly distributed.

So you need to work out a formula that turns this logic into a number.
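As a rough illustration, here is one possible formula in Python that respects the ordering listed above. It is only a sketch: the function name `hashtag_distance`, the maximum of 20 tags and the weight of 0.2 are my own assumptions, and you would tune them against your real distribution.

```python
MAX_TAGS = 20           # assumed maximum number of hashtags per person
TAG_COUNT_WEIGHT = 0.2  # assumed weight; below 0.5 so common tags dominate


def hashtag_distance(tags_a, tags_b, max_tags=MAX_TAGS, weight=TAG_COUNT_WEIGHT):
    """Return a distance in [0, 1]; 0 means all max_tags tags are shared."""
    a, b = set(tags_a), set(tags_b)
    if len(a) > max_tags or len(b) > max_tags:
        raise ValueError("a person has more tags than max_tags")
    common = len(a & b)    # tags both people have
    unshared = len(a ^ b)  # tags only one of them has
    # Main term: the fewer common tags, the larger the distance.
    common_term = 1.0 - common / max_tags
    # Minor term: the more unshared tags, the larger the distance.
    unshared_term = unshared / (2 * max_tags)
    return (1.0 - weight) * common_term + weight * unshared_term


def make_tags(prefix, count):
    """Build `count` distinct dummy tags, e.g. a0, a1, ..."""
    return {f"{prefix}{i}" for i in range(count)}


d_far = hashtag_distance(make_tags("a", 20), make_tags("b", 20))  # 20 each, none common
d_one_diff = hashtag_distance({"x"}, {"y"})                       # 1 each, different
d_one_same = hashtag_distance({"x"}, {"x"})                       # 1 each, shared
d_same = hashtag_distance(make_tags("a", 20), make_tags("a", 20)) # 20 each, all common

print(d_far, d_one_diff, d_one_same, d_same)
```

With these assumed values the four cases come out in the order described in the list, from most distant to closest. Changing `weight` shifts how much the total tag count matters relative to the common tags.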

But you will probably need to test several clustering methods and distance measures until you are satisfied with the result.
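To make that kind of experimenting easier, it can help to keep the clustering step independent of the distance function, so either can be swapped. The sketch below is a deliberately simple stand-in, not a recommendation: it uses the Jaccard distance and groups people by cutting single linkage at a fixed threshold (people end up in the same group if a chain of pairs within the threshold connects them). The names `jaccard_distance` and `threshold_clusters`, the sample data and the thresholds are all made up for illustration.

```python
from itertools import combinations


def jaccard_distance(tags_a, tags_b):
    """1 - |intersection| / |union|; 0.0 when both tag sets are empty."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def threshold_clusters(people, distance, threshold):
    """Group people connected by pairs whose distance is <= threshold."""
    names = list(people)
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        # Follow parent links to the group representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for x, y in combinations(names, 2):
        if distance(people[x], people[y]) <= threshold:
            parent[find(x)] = find(y)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    # Sort the groups so the result is deterministic.
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical sample data: person -> set of hashtags.
people = {
    "ann": {"python", "ml", "data"},
    "bob": {"python", "ml", "stats"},
    "cid": {"cooking", "travel"},
    "dee": {"travel", "cooking", "photo"},
}

loose = threshold_clusters(people, jaccard_distance, threshold=0.6)
strict = threshold_clusters(people, jaccard_distance, threshold=0.4)
print(loose)
print(strict)
```

Any function that takes two tag sets and returns a number can be passed as `distance`, so you can compare several measures on the same data and see how the groups change with the threshold.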

This might be useful: http://msdn.microsoft.com/en-us/library/ms174879(v=sql.105).aspx
