Click here to Skip to main content
15,893,190 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I have a file,which contains a list of words,which are alphabetically ordered.At first,whenever I wanted to insert a new word in the file,I just simply looped through the lines of the file in linear time.Then I thought:"Since all strings are and will always be alphabetically ordered,can't I search the best position for the new string in a logarithmic time?".


P.S:I know that the function name has no relation,whatsoever so please don't mind and sorry for having some comments in my native language.

What I have tried:

C++
void insertion_sort(FILE** file_ptr,char* new_word) {

	int no_current_word = 0;
	int saved_position;
	char* current_word[SIZE];
	char current_line[SIZE];
	char* new_buffer=(char*)malloc(BUFFER_MAX_SIZE);
	*file_ptr = fopen("Lista cuvinte.txt", "r+");
	while (fgets(current_line, BUFFER_MAX_SIZE, *file_ptr)) {
		current_word[no_current_word] = (char*)malloc(strlen(current_line)+1);
		strcpy(current_word[no_current_word], current_line);
		no_current_word++;
	}
	for (int i = 0; i < no_current_word; i++) {
		if (strcmp(current_word[i], new_word) > 0) {
			saved_position = i;
			break;
		}
	}
	current_word[no_current_word++] = (char*)malloc(10); //la un moment dat clar o sa fac update la aceasta abordare!
	if (strcmp(new_word, current_word[no_current_word - 2]) > 0)
		strcpy(current_word[no_current_word-1], new_word);
	else {
		for (int i = no_current_word - 1; i > saved_position; i--)
			strcpy(current_word[i], current_word[i - 1]);
		strcpy(current_word[saved_position], new_word);
	}
	strcpy(new_buffer, current_word[0]);
	for (int i = 1; i < no_current_word; i++)
		strcat(new_buffer, current_word[i]);
	*file_ptr = fopen("Lista cuvinte.txt", "w");	//va trebui sa resterg continutul atunci cand il voi re-updata
	fprintf(*file_ptr, "%s", new_buffer);
	rewind(*file_ptr);
	printf("%s", new_buffer);
	printf("\n\n");
}
```
Posted
Updated 8-Feb-20 22:42pm
v2

The way I'd do it is to set up an index file: say 26 entries, giving you the file offset to the first word starting with 'A', 'B', 'C', ...

That way, to insert (or locate) a word, you can skip directly to the first word in the group, and the whole operation becomes a load quicker for little overhead.
 
Share this answer
 
You have to read the whole file anyway, so you can't help doing that. And, unless your words are all exactly the same length, you can't really do a binary search on the file, since you don't know how many bytes to jump forward/back to find the next record of interest. Even when you've found it, you've got to read the rest of the file in, write out the new word, then write out the buffer again. You might as well read the entire file into a buffer, walk sequentially through the buffer until you find where you want to put the new word then write the buffer to where you are now, the new word, and then the rest of the buffer. This is still linear in time, but since you're only doing one read and 2 or 3 writes it should be much faster e.g:

C++
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void insert_word(const char *file_name, const char *new_word)
{
    FILE *wlist = fopen(file_name, "r+");

    /* find out how big the file is */
    fseek(wlist, 0, SEEK_END);
    size_t nbytes = ftell(wlist);
    rewind(wlist);

    /* alloc a buffer for file */
    /* we'll be doing string ops, so allow for trailing nul */
    char *buffer = malloc(nbytes+1);

    /* now read the file into the buffer */
    fread(buffer, nbytes, 1, wlist);
    buffer[nbytes] = 0;
    rewind(wlist);

    char *word_begin = buffer;
    char *word_end = strchr(buffer, '\n');
    while( *word_begin ) {
        int cmp = strcoll(word_begin, new_word);
        if(cmp == 0) {
            break; /* word already in file */
        } else if( cmp > 0 ) {
            /* found location to put word */
            /* write the buffer up to this point */
            size_t buffer_to_here = word_begin - buffer;
            fwrite(buffer, buffer_to_here, 1, wlist);

            /* add the new word */
            fprintf(wlist, "%s\n", new_word);

            /* write the rest of the buffer */
            size_t buffer_to_end = nbytes - buffer_to_here;
            fwrite(word_begin, buffer_to_end, 1, wlist);
            break;
        }
        /* if we get here, move to the next word */
        word_begin = word_end + 1; /* after the \n pointed to by word_end */
        word_end = strchr(word_begin, '\n');
    }

    /* got to the end of the file, so word is appended */
    if(*word_begin == '\0') {
        fseek(wlist, 0, SEEK_END);
        fprintf(wlist, "%s\n", new_word);
    }

    /* clean up */
    fclose(wlist);
    free(buffer);
}


I've omitted error checking in the above to reduce clutter. There really should be checks after malloc, fopen, fread, fwrite. I use strcoll() rather than strcmp() since the ordering of the words might be significant via the locale. Don't forget to call locale(LC_COLLATE, "") before calling this. The comparison will terminate at worst at the length of new_word, regardless of the length from word_begin to the end of the buffer, so there's no concern that you might be comparing megabytes of data. I tested this on a RPI2, and was able to add a 'zzzzzzzzz' to the end of a list of approx 500,000 words in a 4.8MB file in about 250 ms.

Note that your solution does not call free() for everything you malloc(). Not to do so means that you'll be leaking memory, and if you're on a system with limited memory, you could run out if called repeatedly. In my solution I pass in the file name of the file to work on. If you want to use an already open FILE pointer, you just need to pass that as FILE *. In that case, you have a requirement that the caller has opened the file with mode "r+".

Update:
actually, we can save a write here:
C
/* found location to put word */
/* write the buffer up to this point */
size_t buffer_to_here = word_begin - buffer;
fwrite(buffer, buffer_to_here, 1, wlist);

We haven't changed the file, so we can just fseek() instead:
C
fseek(wlist, buffer_to_here, SEEK_SET);
That's a minor improvement, but it saves some disk i/o.
 
Share this answer
 
v2
Quote:
Can I insert a new string in a sorted array of string, using binary search algorithm?

It depends.
To do what you want, you need to know where is nth word in the file, the problem is that words are variable length.
So with a flat variable length text file, only sequential read/write is practical.
"a"
"abc"
"ac"
"b"

An alternative is to pad each word with spaces to a certain length, so that nth word position is n * line length. In this case it works.
"a   "
"abc "
"ac  "
"b   "

Any other way require an auxiliary file to list positions of each words in the file.
 
Share this answer
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900