I have a class named WordListBuilder that reads input from a text file in any language. It should generate a .CSV file that stores each word and the frequency of that word, for any language. IndoEuropeanTokenizerFactory is used to create the tokens. A second class, LanguageUtils, has a method that converts lines to words (it is used in WordListBuilder); it is shown after the main class. The tokens do get printed on the console, but the code does not write anything into the CSV file. Any help will be appreciated.

Java
public class WordListBuilder {

private static final Pattern english = Pattern.compile("[a-zA-Z0-9]+");

 public static void main(final String[] args) throws Exception {

     List<String> lines = IOUtils.readLines
                ("E:/Ms Thesis/Implementation Month Task/English Novels/english.txt", 0);
        LanguageUtils languageUtils = new LanguageUtils();
        HashMap<String, Integer> wordFreq = new HashMap<String, Integer>();
        Set<String> used = new HashSet<String>();
        for (String line : lines) {
            if(used.contains(line)){
                continue;
            }else {
                used.add(line);
            }
            line = line.toLowerCase();
            ArrayList<String> words = languageUtils.lineToWords(line, true);
            for (String word : words) {

                int freq = 1;
                if(english.matcher(word).matches()){
                    continue;
                }
                if(word.length() < 2) continue;

                // containsKey checks the map directly; no need to go through keySet()
                if (wordFreq.containsKey(word)) {
                    freq = freq + wordFreq.get(word);
                    System.out.println(word + " " + freq);
                }
                // System.out.println(freq);
                wordFreq.put(word, freq);


            }


        }
        System.out.println("Size: "+wordFreq.size());
        Collection<Integer> freqs = wordFreq.values();
        HashSet<Integer> freqSet = new HashSet<Integer>();
        for(Integer i : freqs){
            freqSet.add(i);
        }
        Comparator<Object>  comparator = Collections.reverseOrder();
        Set<Integer> treeSet = new TreeSet<Integer>(comparator);
        treeSet.addAll(freqSet);

    //Collections.sort(freqSet.toArray(new Integer [freqSet.size()]), comparator);
      StringBuilder sb = new StringBuilder();
        int limit = 1000000;
        for (Integer freq : treeSet) {
            System.out.println(freq);
            for (String word : wordFreq.keySet()) {
                if(!word.matches("\\p{L}*")) continue;
                int freq1 = wordFreq.get(word);
                if (freq == freq1) {
                    limit = limit - 1;
                    // separate word and frequency with a comma so each row is a valid CSV line
                    String out = word + "," + freq + "\n";
                    sb.append(out);
                    if(limit < 1){
                        System.out.println(limit);
                        break;
                    }
                 }
            }
        }
        writeOut("E:/Ms Thesis/Implementation Month Task/English Novels/temp.csv", sb.toString(), true);
       System.out.println(sb.toString());
    }

    public static synchronized void writeOut(String filename, String outString, boolean append) throws IOException {
        FileWriter fstream = new FileWriter(filename, append);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(outString);
        if (append) {
            out.write("\n");
        }
        out.close();
        fstream.close();
        System.out.println("Done writing");
    }
      }

   ........................
Method to convert lines to words (from the LanguageUtils class):


    // Overload: remembers whether the text uses spaces between words, then delegates
    public ArrayList<String> lineToWords(String line, boolean hasSpace) {
        this.hasSpace = hasSpace;
        return lineToWords(line);
    }

    // Line to words conversion
    public ArrayList<String> lineToWords(String string) {
        ArrayList<String> result = new ArrayList<String>();
        if (!hasSpace) {
            // No spaces between words: treat each character as its own token
            String[] result0 = string.split("");
            result = new ArrayList<String>(Arrays.asList(result0));
            return result;
        }
        TokenizerFactory tokenizerFactory = new IndoEuropeanTokenizerFactory();
        Tokenizer tokenizer = tokenizerFactory.tokenizer(string.toCharArray(), 0, string.length());
        // Collect tokens until nextToken() returns null at the end of the string
        while (true) {
            String token = tokenizer.nextToken();
            if (token == null) {
                break;
            }
            result.add(token);
        }
        return result;
    }
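
For comparison, here is a minimal, library-free sketch of the counting and CSV-formatting steps described in the question. It is not the poster's code: the class and method names (`WordFrequencySketch`, `countWords`, `toCsv`) are made up for illustration, plain whitespace splitting stands in for the IndoEuropeanTokenizerFactory tokens, the `english` pattern filter is left out, and no CSV quoting is done, so it assumes words contain no commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordFrequencySketch {

    // Counts how often each token occurs across all lines.
    // Assumption: whitespace splitting replaces the tokenizer used in the post.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> wordFreq = new HashMap<String, Integer>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (word.length() < 2) continue; // same length filter as the post
                wordFreq.merge(word, 1, Integer::sum); // insert 1 or add 1 to the old count
            }
        }
        return wordFreq;
    }

    // Builds one "word,frequency" row per entry, highest frequency first;
    // ties are ordered alphabetically so the output is deterministic.
    static String toCsv(Map<String, Integer> wordFreq) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(wordFreq.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : entries) {
            sb.append(e.getKey()).append(',').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat and the dog", "the dog");
        System.out.print(toCsv(countWords(lines)));
    }
}
```

Sorting the map entries once avoids the nested loop over every distinct frequency and every word, and `Map.merge` replaces the contains/get/put sequence.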
Comments
Richard MacCutchan 15-Oct-14 5:09am    
Have you checked the content of the .csv file? Have you traced the writeOut method to see what it is doing?
Richard MacCutchan 15-Oct-14 5:23am    
I just reproduced and ran your writeOut method and it works fine. Are you sure you are not catching an exception and ignoring it?
Member 11154607 15-Oct-14 15:10pm    
@Richard Yes, there is no exception, but I still get 0 bytes when it creates the .CSV file. It does read the sentences and convert them to words, and the words do get printed on the console. I cannot find the flaw. Does it have something to do with the file format, since it is reading different languages, creating tokens, and later storing them?
Richard MacCutchan 16-Oct-14 3:55am    
This should not make a difference. I notice that when you open the FileWriter as in FileWriter fstream = new FileWriter(filename, append);, you use the append value to tell the system how to open the file. You also use that parameter to decide whether to add a newline character after the text. If you later call writeOut with append set to false then the file will be opened as new, and its contents destroyed. Is it possible that this is what is happening?
Member 11154607 16-Oct-14 9:54am    
Thanks a lot. It helped. It's working now.
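
The effect of the `append` flag discussed in the comments can be seen in a small sketch. This is an illustration only, not the poster's final fix: the class name `AppendModeSketch` and the `demo` helper are hypothetical, the file is a temporary file rather than the path from the question, and `writeOut` is rewritten with try-with-resources so the writer is closed even if the write fails.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class AppendModeSketch {

    // Same idea as writeOut in the question: append=true adds to the end of the
    // file, append=false truncates the file before writing.
    static void writeOut(String filename, String outString, boolean append) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(filename, append))) {
            out.write(outString);
        } // closing the BufferedWriter flushes it and closes the FileWriter
    }

    // Returns the file content after two appends, then after a non-append write.
    static String[] demo() {
        try {
            Path tmp = Files.createTempFile("append-demo", ".csv");
            try {
                String name = tmp.toString();
                writeOut(name, "first\n", true);
                writeOut(name, "second\n", true);  // added after "first"
                String afterAppend = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
                writeOut(name, "third\n", false);  // earlier content is discarded
                String afterOverwrite = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
                return new String[] { afterAppend, afterOverwrite };
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String content : demo()) {
            System.out.println("---");
            System.out.print(content);
        }
    }
}
```

Any later call with `append` set to `false` on the same path wipes what earlier calls wrote, so a final call that passes an empty string that way leaves a 0-byte file.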

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


