I want to run canopy clustering over strings to reduce the number of distance measurements, but I have no idea how to do canopy clustering over a set of strings.
When I searched, I found the Apache Hadoop implementation of text clustering. Its documentation says the input must be a sequence file of vectors, i.e. the data has to be in a vector-readable format.
I have a single column of strings. How can I convert it into a sequence file of vectors in Java, and how can I use Hadoop canopy clustering efficiently?
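To make the question concrete, here is a minimal in-memory sketch of what I understand canopy clustering over strings to mean. It does not use Hadoop at all and does not cover the sequence file conversion. It assumes edit (Levenshtein) distance as the string distance, and the class name `CanopySketch` and the thresholds `t1` (loose) and `t2` (tight) are placeholders I made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Hadoop or any library.
class CanopySketch {

    // Levenshtein (edit) distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Canopy clustering: t1 is the loose threshold, t2 the tight one (t1 > t2 >= 1).
    // Returns a map from each canopy center to the members of that canopy.
    static Map<String, List<String>> canopies(List<String> words, int t1, int t2) {
        if (t2 < 1 || t1 <= t2) {
            throw new IllegalArgumentException("need t1 > t2 >= 1");
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> remaining = new ArrayList<>(words);
        while (!remaining.isEmpty()) {
            // Take the first remaining word as the next canopy center.
            String center = remaining.remove(0);
            List<String> members = new ArrayList<>();
            members.add(center);
            List<String> stillCandidates = new ArrayList<>();
            for (String w : remaining) {
                int d = editDistance(center, w);
                if (d < t1) {
                    members.add(w); // loosely close: joins this canopy
                }
                if (d >= t2) {
                    stillCandidates.add(w); // not tightly close: may join or start other canopies
                }
            }
            remaining = stillCandidates;
            result.put(center, members);
        }
        return result;
    }
}
```

Calling `CanopySketch.canopies(words, 3, 2)` on a list of words groups each word with the centers it is within edit distance 3 of, and words within distance 2 of a center are not considered again as new centers. What I do not know is how to express the same thing through the Hadoop input format.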
Example of one column of words:
Please help. Thanks.