I've found some C code that counts word frequency in a text file, but it only works with files of fewer than about 1000 words, and I need to use it with files of more than 40,000 words.
How can I fix it so that it works with big files?

Code:

C
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
   if (argc == 1) {
   printf("The input file name has not been provided\n");
   }
   else if (argc == 2) {
   FILE *f = fopen(argv[1], "rb");
   fseek(f, 0, SEEK_END);
   long fsize = ftell(f);
   fseek(f, 0, SEEK_SET);

   char *str = malloc(fsize + 1);
   fread(str, fsize, 1, f);
   fclose(f);

   str[fsize] = 0;
   int count = 0, c = 0, i, j = 0, k, space = 0;
   char p[1000][512], str1[512], ptr1[1000][512];
   char *ptr;
   for (i = 0;i<strlen(str);i++)
   {
   if ((str[i] == ' ')||(str[i] == ',')||(str[i] == '.'))
   {
   space++;
   }
   }
   for (i = 0, j = 0, k = 0;j < strlen(str);j++)
   {
   if ((str[j] == ' ')||(str[j] == 44)||(str[j] == 46))
   {
   p[i][k] = '\0';
   i++;
   k = 0;
   }
   else
   p[i][k++] = str[j];
   }
   k = 0;
   for (i = 0;i <= space;i++)
   {
   for (j = 0;j <= space;j++)
   {
   if (i == j)
   {
   strcpy(ptr1[k], p[i]);
   k++;
   count++;
   break;
   }
   else
   {
   if (strcmp(ptr1[j], p[i]) != 0)
   continue;
   else
   break;
   }
   }
   }
   for (i = 0;i < count;i++)
   {
   for (j = 0;j <= space;j++)
   {
   if (strcmp(ptr1[i], p[j]) == 0)
   c++;
   }
   printf("%s %d \n", ptr1[i], c);
   c = 0;
   }
   }
   return 0;

}


What I have tried:

I think the problem is something related to: p[1000][512], str1[512], ptr1[1000][512]
Updated 20-Jan-18 17:19pm
Comments
Patrice T 20-Jan-18 21:53pm    
This code is not complete.
MaxTTT 20-Jan-18 22:00pm    
Sorry, bad copy-paste. It is complete now.
Graeme_Grant 20-Jan-18 22:34pm    
Formattingalwaysmakesiteasiertoread. Especially when asking for help.
MaxTTT 20-Jan-18 22:38pm    
I see. Thanks!
PIEBALDconsult 20-Jan-18 23:15pm    
As with anything, don't try to do everything at once; break the task into subtasks.

1 solution

Learn to indent your code properly: indentation shows the structure of the code and makes it easier to read and understand. It also helps you spot structural mistakes.
C
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
  if (argc == 1) {
    printf("The input file name has not been provided\n");
  }
  else if (argc == 2) {
    FILE *f = fopen(argv[1], "rb");
    fseek(f, 0, SEEK_END);
    long fsize = ftell(f);
    fseek(f, 0, SEEK_SET);

    char *str = malloc(fsize + 1);
    fread(str, fsize, 1, f);
    fclose(f);

    str[fsize] = 0;
    int count = 0, c = 0, i, j = 0, k, space = 0;
    char p[1000][512], str1[512], ptr1[1000][512];
    char *ptr;
    for (i = 0;i<strlen(str);i++)
    {
      if ((str[i] == ' ')||(str[i] == ',')||(str[i] == '.'))
      {
        space++;
      }
    }
    for (i = 0, j = 0, k = 0;j < strlen(str);j++)
    {
      if ((str[j] == ' ')||(str[j] == 44)||(str[j] == 46))
      {
        p[i][k] = '\0';
        i++;
        k = 0;
      }
      else
        p[i][k++] = str[j];
    }
    k = 0;
    for (i = 0;i <= space;i++)
    {
      for (j = 0;j <= space;j++)
      {
        if (i == j)
        {
          strcpy(ptr1[k], p[i]);
          k++;
          count++;
          break;
        }
        else
        {
          if (strcmp(ptr1[j], p[i]) != 0)
            continue;
          else
            break;
        }
      }
    }
    for (i = 0;i < count;i++)
    {
      for (j = 0;j <= space;j++)
      {
        if (strcmp(ptr1[i], p[j]) == 0)
          c++;
      }
      printf("%s %d \n", ptr1[i], c);
      c = 0;
    }
  }
  return 0;

}

Professional programmer's editors have this feature, along with others such as parenthesis matching and syntax highlighting.
Notepad++ Home
UltraEdit

Comments in code are also a good idea.

Quote:
I think the problem is something related to: p[1000][512], str1[512], ptr1[1000][512]

There is an easy way to find out: try it and you will see.
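
One way to try it, sketched under assumptions rather than given as a definitive fix: each of the two arrays occupies 1000 × 512 = 512,000 bytes, and the code never checks that the word index stays below 1000, so a file with more separators than that writes past the end of `p`. Simply changing 1000 to 40000 would need about 20 MB per array as local variables, which is likely to exceed the default stack size on many systems. The sketch below sizes the table from the input and allocates it on the heap instead. `count_separators`, `alloc_word_table`, `word_t` and `MAX_WORD_LEN` are illustrative names, not part of the original code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD_LEN 512 /* same per-word limit as p[1000][512] */

/* One row of the word table (illustrative typedef). */
typedef char word_t[MAX_WORD_LEN];

/* Hypothetical helper: count the separators the original code looks for
   (space, comma, full stop), so the table can be sized from the input. */
static size_t count_separators(const char *text)
{
    size_t n = 0;
    assert(text != NULL);
    for (; *text != '\0'; text++) {
        if (*text == ' ' || *text == ',' || *text == '.')
            n++;
    }
    return n;
}

/* Hypothetical helper: allocate `rows` zero-filled rows on the heap.
   Returns NULL on failure; the caller releases the table with free(). */
static word_t *alloc_word_table(size_t rows)
{
    assert(rows > 0);
    return calloc(rows, sizeof(word_t));
}
```

With a table allocated as `word_t *p = alloc_word_table(count_separators(str) + 1);`, the `p[i][k]` indexing in the rest of the program can stay as it is. The result should be checked for `NULL` and released with `free(p)`. The 512-character limit per word is still unchecked in this sketch.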

As far as I understand this code, it is highly inefficient: it is brute force in both run time and memory use.
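
A minimal sketch of one less brute-force approach, which is not the method used in the posted code: split the text in place, keep only a pointer to each word, sort the pointers with `qsort`, and count runs of equal words. Sorting takes on the order of n log n comparisons instead of the nested loops that compare words pairwise, and each word costs one pointer rather than 512 bytes. The names `split_words`, `cmp_words` and `print_frequencies` are hypothetical. Note that `strtok` skips empty tokens between consecutive separators, which differs from the posted code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Same separators as the posted code: space, comma, full stop. */
static const char SEPARATORS[] = " ,.";

/* Split text in place and store a pointer to each word in words[].
   Returns the number of words stored (at most max_words). */
static size_t split_words(char *text, char **words, size_t max_words)
{
    size_t n = 0;
    assert(text != NULL && words != NULL);
    for (char *w = strtok(text, SEPARATORS); w != NULL && n < max_words;
         w = strtok(NULL, SEPARATORS))
        words[n++] = w;
    return n;
}

/* qsort comparator for an array of char pointers. */
static int cmp_words(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Sort the words, then print each distinct word with its count.
   Returns the number of distinct words. */
static size_t print_frequencies(char **words, size_t n)
{
    size_t distinct = 0;
    qsort(words, n, sizeof words[0], cmp_words);
    for (size_t i = 0; i < n; ) {
        size_t run = 1; /* length of the run of equal words starting at i */
        while (i + run < n && strcmp(words[i], words[i + run]) == 0)
            run++;
        printf("%s %zu\n", words[i], run);
        distinct++;
        i += run;
    }
    return distinct;
}
```

The caller supplies the `words` array, for example allocated with `malloc` for one pointer per separator plus one, and passes the buffer read from the file as `text`. The output is in alphabetical order rather than in order of first appearance.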

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)