
libToken - Simple Wrapper for Reading and Tokenizing Text Files

arthur_m, 29 Jul 2009
An article on text processing

Introduction

This article will show you how to open and read text files in your programs. It will also present a simple library to hide those details away so you are not bothered with boiler-plate code over and over again.

Background

How many times have I had to parse a text file? Be it for reading some startup options, some kind of resource file, or just to extract data from a web page or some other kind of textual data. I don't remember, but I do remember that every project started with the same mantra: a few lines of code to open the file, move through it line by line, do my stuff, and finally close the file after parsing is done. I used to write my programs mostly in Java, and below is the loop I usually wrote:

import java.io.*;

public class LineReader {

    public static void main(String[] args) {

        // do some checks for args here

        String filename = args[0];

        try {
            BufferedReader br = new BufferedReader(new FileReader(filename));

            String line = br.readLine();

            while (line != null) {

                /*
                 * do something with line here
                 */

                line = br.readLine();
            }

            br.close();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

I have also written quite a few C and C++ projects, and this loop looks like this in C:

#include <stdio.h>

#define LINE_SIZE 1024 /* maximum number of characters read per call */

int main(int argc, char** argv) {

    FILE* fp;
    char line[LINE_SIZE];

    if (argc != 2)
        return 1;

    fp = fopen( argv[1], "r");
    if (!fp)
        return 1;

    while ( fgets( line, LINE_SIZE, fp) ) {
        /*
         * do something with line here
         */
    }

    fclose(fp);

    return 0;
}

C++ is a bit more civilized, and it is really easy to get this going in just a few lines:

#include <iostream>
#include <fstream>
#include <string>

int main(int argc, char** argv) {

    if(argc != 2)
        return 1;

    std::string line;

    std::ifstream ifs(argv[1]);

    if(!ifs.is_open())
        return 1;

    while( std::getline(ifs,line) ){        

        /* 
         * do something with line here
         */

    }
    return 0;
}

Simple things are always simple, regardless of the language used. Maybe that is the reason why I never cared to encapsulate this. But once we keep opening files in several places in a program, it gets boring and tedious, no matter how simple it is. Two days ago I got really disgusted doing it in a parser, pulled myself together, and wrote a little framework to encapsulate this in a simple line reader.

Using the Code

Meet libToken. This library is not meant to be an advanced language parser, to handle regular expressions, or to do any other fancy stuff. There already exist hundreds of libraries that handle those tasks better than anything I could probably ever write on my own. libToken is just a very tiny wrapper for the line parsing pattern, if you can call it a pattern. Maybe it is a line parsing algorithm; it does not really matter what it is called. The important thing is ease of use and deployment. libToken started as a single function, but expanded over the course of one afternoon to an entire four functions. It is written in pure ANSI C, so there is no name mangling and the same compiled code can (hopefully) be used with any compiler from both C and C++ (if you prefer to use it in a shared DLL as I do).

Strictly speaking, it is a simple framework rather than a library. For those who cannot differentiate the two: a framework is generally a prewritten application with pluggable endpoints that connect the framework to your code, while a library is a collection of prewritten functions that you call from your code to do some work. In short, a framework calls your code, whereas you call a library's code.

That was the theory; here is how it works in this case. You specify a callback that is called whenever a new line of text is found. The line is passed to this callback, so this is where you do all your parsing, copying, and whatever else you need to do. The function that invokes the parser takes the name of the file to be parsed and a pointer to the callback that will be invoked for each new line.

Sample code looks like this:

#include <nextline.h>
#include <stdio.h>

void nextLine(const char* line){ /* this is the callback */
    /*
     * do something with line here
     */
}

int main(int argc, char** argv) {
    if(argc != 2)
        return 1;

    ntGetLines(argv[1], nextLine, 0); /* this is the parser invocation */
    return 0;
}

As you can see, there is no special initialization and there are no data structures involved. We just have to pass the name of the file to be opened and our callback. Simple, *huh*? (Don't worry about the last argument; I will explain it next.)

So now we can jump straight into parsing text instead of writing the open/close calls, the error checks, and the line-fetching loop every time we have to do some parsing.
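For readers curious what such a wrapper has to do internally, here is a minimal sketch of a callback-driven line reader. It is not the libToken source: the names `line_callback`, `for_each_line`, `SKETCH_LINE_SIZE`, `count_line`, and `g_line_count` are hypothetical and only illustrate how the open/loop/close boilerplate can be moved into one place.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_LINE_SIZE 1024   /* hypothetical buffer size, not libToken's */

/* Hypothetical callback type: receives one line without its line ending. */
typedef void (*line_callback)(const char *line);

/* Opens `filename`, hands every line to `cb`, and closes the file.
 * Returns 0 on success and -1 if the file cannot be opened.
 * Lines longer than the buffer are delivered in several pieces. */
int for_each_line(const char *filename, line_callback cb)
{
    char line[SKETCH_LINE_SIZE];
    FILE *fp = fopen(filename, "r");

    if (!fp)
        return -1;

    while (fgets(line, sizeof line, fp)) {
        line[strcspn(line, "\r\n")] = '\0';  /* strip the line ending */
        cb(line);
    }

    fclose(fp);
    return 0;
}

/* Example callback: counts lines in a global, because this callback
 * signature carries no user-data pointer. */
int g_line_count = 0;

void count_line(const char *line)
{
    (void)line;
    g_line_count++;
}
```

A caller would then write something like `for_each_line("settings.txt", count_line);` and never touch `fopen` or `fgets` directly.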

The prototype of the line-parsing function looks like this:

        int ntGetLines( const char* filename, ntNextTokenFunc f, ... );

As you can see, it has an ellipsis in its argument list, which means it is a variadic function (just like printf). I think we can simplify the code in our nextLine callback by factoring out some of the usual boiler-plate code that occurs when parsing text files. For this reason, the parsing function also accepts pointers to additional callbacks, called filters.

When a file is parsed, there will be some lines you don't want to do anything with. A typical example is comments in a programming language. Usually you want to skip those and just get to the next line. Instead of cluttering our main parsing callback with a bunch of ifs or a big switch, we can factor those cases out into separate, simpler callbacks.

I call those filters, and they are supposed to be simple functions of, say, a few lines of code. They return a boolean value that determines whether a line is skipped (0) or passed to your callback (non-zero). The prototype:

        int ntFilterFunc(const char* line);

As an example, you might do something like this:

#include <nextline.h>
#include <stdio.h>

void nextLine(const char* line){
    puts(line);
}

int skipComment(const char* line){
    if(line[0] == '#')
        return 0; /* skip this line */
    return 1;     /* pass it on to nextLine */
}

int main(int argc, char** argv) {

    if(argc != 2)
        return 1;

    ntGetLines(argv[1], nextLine, skipComment, 0); /* 0 ends the filter list */
    return 0;
}

If skipComment returns 0, the line will not be passed to your nextLine callback and ntGetLines will continue with the next line from the file.

That is very dumb comment-skipping code, since it expects there to be no white space before the #, but I guess it gives the idea of a filter. You can pass any number of those callbacks to the parser. Whether or not you use any filter functions, be sure to pass 0 at the end of the argument list; otherwise your application will end in runtime hell (seg fault).
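A filter that tolerates leading white space is only slightly longer. The following is a sketch with the same shape as the filter prototype shown earlier (`int f(const char* line)`); the name `skipIndentedComment` is made up for this example.

```c
#include <ctype.h>

/* Filter sketch: returns 0 (skip) when the first non-blank character
 * of the line is '#', and 1 (keep) otherwise. */
int skipIndentedComment(const char *line)
{
    /* The cast keeps isspace() well defined for characters above 127. */
    while (*line != '\0' && isspace((unsigned char)*line))
        line++;

    return *line != '#';
}
```

Lines that contain a `#` after other text are still passed through; handling trailing comments would be a job for the main callback or another filter.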

Finally, this is the prototype for the callback invoked when a new line is found (the nextLine above is one such callback):

        void ntNextTokenFunc(const char* line);
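To make the role of the terminating 0 in the filter list concrete, here is a rough sketch of how a variadic function can walk a list of filter callbacks with `<stdarg.h>`. This is an assumption about the general technique, not the library's actual code; `filter_fn`, `passes_filters`, `reject_hash`, and `reject_empty` are hypothetical names.

```c
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical filter type: returns 0 to reject a line, non-zero to keep it. */
typedef int (*filter_fn)(const char *line);

/* Returns 1 if every filter accepts the line, 0 as soon as one rejects it.
 * The variadic list must end with a null filter pointer; without that
 * terminator the loop would read past the real arguments. */
int passes_filters(const char *line, ...)
{
    va_list ap;
    filter_fn f;
    int keep = 1;

    va_start(ap, line);
    while ((f = va_arg(ap, filter_fn)) != NULL) {
        if (!f(line)) {
            keep = 0;
            break;
        }
    }
    va_end(ap);

    return keep;
}

/* Two tiny example filters. */
int reject_hash(const char *line)  { return line[0] != '#'; }
int reject_empty(const char *line) { return line[0] != '\0'; }
```

A call would look like `passes_filters(line, reject_hash, reject_empty, (filter_fn)0)`. The sketch casts the terminator because a bare `0` is passed as an `int`, and reading it back as a pointer is not portable on platforms where pointers are wider than `int`.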

Tokenizing Files

In some cases we are not interested in parsing lines, but rather in receiving text one word at a time. Those words do not have to be what we usually call words: character sequences delimited by white space and punctuation. Sometimes other symbols act as delimiters, for example mathematical symbols such as parentheses and operators. In programmers' parlance, such words are called tokens, and the process of finding them is called tokenizing the input.

Tokenizing is usually implemented by reading text line by line and then searching for patterns within each line. Since this is also a standard pattern/algorithm, we can encapsulate it in a wrapper. By encapsulating it, we also hide its implementation from the user, and we can make it more efficient by changing the implementation from reading one line at a time to reading the entire text file at once and then returning one token at a time. This is what the ntGetTokens function does. If there is not enough memory in the system to hold the entire file, it falls back to the line reader. In any case, you don't have to care about those details.
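The "read everything at once, otherwise fall back" idea can be sketched as follows. This is one common way to do it in standard C, not necessarily how the library does it; `slurp_file` is a hypothetical helper, and determining the size with `fseek`/`ftell` is an assumption that holds for ordinary disk files.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads a whole file into a freshly allocated, zero-terminated buffer.
 * Returns NULL if the file cannot be opened, its size cannot be
 * determined, or memory runs out; a caller could then fall back to
 * reading line by line. The caller releases the result with free(). */
char *slurp_file(const char *filename)
{
    FILE *fp = fopen(filename, "rb");
    char *buf = NULL;
    long size;
    size_t got;

    if (!fp)
        return NULL;

    if (fseek(fp, 0L, SEEK_END) == 0 &&
        (size = ftell(fp)) >= 0 &&
        fseek(fp, 0L, SEEK_SET) == 0) {
        buf = (char *)malloc((size_t)size + 1);
        if (buf) {
            got = fread(buf, 1, (size_t)size, fp);
            buf[got] = '\0';   /* terminate after the bytes actually read */
        }
    }

    fclose(fp);
    return buf;
}
```

Hiding this behind a tokenizing function means the caller sees the same callback interface whether the file was read in one piece or line by line.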

To tokenize an entire file, call:

        int ntGetTokens( const char* filename, const char* delimiters,
            ntNextTokenFunc f );

If you only want to find white-space-delimited tokens, you might use

        int ntGetWords( const char* filename, ntNextTokenFunc f );

instead. This is the same function as ntGetTokens, but with the delimiters already chosen to be white space (' ', \t, \n, \v, \f, \r), or to be exact, the characters for which the standard isspace() function returns true.

Tokenizing Strings

Lastly, in many cases you have a string and wish to find tokens within that string. For this purpose there is:

        int ntTokenize( const char* string, const char* delimiters, ntNextTokenFunc f );

The library makes its own copy of the string, so the original data is not destroyed. It operates on an internal buffer of size NT_LINESIZE, which is a compile-time constant. The default value is 1024 characters (1 KB), but you may change it if you wish. Another important thing to know is that the function expects zero-terminated strings. If the \0 is missing, the algorithm automatically inserts one at buffer[NT_LINESIZE-1]. This protects against buffer overflows on the stack. It also means the library has to check the number of characters copied, which costs some performance. If you trust your data and know exactly what comes in, you might consider removing those checks. I don't recommend that if you intend to parse any user-generated text files or input.
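The copy-then-tokenize approach described above might look roughly like this. It is a sketch under stated assumptions rather than the library's code: `SKETCH_LINESIZE` stands in for NT_LINESIZE, and `token_callback`, `tokenize_copy`, `record_token`, `g_tokens_seen`, and `g_last_token` are hypothetical names. Input longer than the buffer is simply truncated here.

```c
#include <string.h>

#define SKETCH_LINESIZE 1024   /* stands in for NT_LINESIZE */

typedef void (*token_callback)(const char *token);

/* Copies at most SKETCH_LINESIZE-1 characters of `text` into a local
 * buffer, forces termination, and reports each token to `cb`.
 * Returns the number of tokens found. `text` itself is never modified. */
int tokenize_copy(const char *text, const char *delimiters, token_callback cb)
{
    char buffer[SKETCH_LINESIZE];
    char *p = buffer;
    int count = 0;

    strncpy(buffer, text, SKETCH_LINESIZE - 1);
    buffer[SKETCH_LINESIZE - 1] = '\0';    /* guard against overlong input */

    for (;;) {
        size_t len;

        p += strspn(p, delimiters);         /* skip leading delimiters */
        if (*p == '\0')
            break;

        len = strcspn(p, delimiters);       /* length of this token */
        if (p[len] != '\0') {
            p[len] = '\0';                  /* cut the copy, not the input */
            cb(p);
            p += len + 1;
        } else {
            cb(p);                          /* last token in the buffer */
            p += len;
        }
        count++;
    }
    return count;
}

/* Example callback that remembers the most recent token. */
int g_tokens_seen = 0;
char g_last_token[SKETCH_LINESIZE];

void record_token(const char *token)
{
    g_tokens_seen++;
    strcpy(g_last_token, token);   /* safe: tokens never exceed the buffer */
}
```

Because the token pointer refers to the local buffer, it is only valid while the callback runs, which matches the lifetime rule the library imposes on its callbacks.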

Implementation Details

It is important to note that the library normally uses only stack space. In the cases where it does allocate heap memory, it frees that memory directly after your callback has executed, so don't expect any data you receive in the callback to last after the callback returns. If you need a line (or lines) to be stored beyond the callback, you should copy them yourself. You should never try to free any pointers you got from the library. That's the beauty of working with callbacks: you don't have to care about new/malloc'ed space; it is managed for you.
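Keeping a line beyond the callback therefore means taking your own copy. Below is a small sketch of a callback that does so; `keepLine`, `freeKeptLines`, `g_kept`, `g_kept_count`, and `MAX_KEPT` are hypothetical names, and the storage is global only because the callback signature has no user-data parameter.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_KEPT 16   /* hypothetical capacity for this example */

char *g_kept[MAX_KEPT];
int g_kept_count = 0;

/* Callback with the article's `void f(const char* line)` shape.
 * The pointer is only valid during the call, so take a private copy. */
void keepLine(const char *line)
{
    size_t n;
    char *copy;

    if (g_kept_count >= MAX_KEPT)
        return;                        /* storage full: drop the line */

    n = strlen(line) + 1;
    copy = (char *)malloc(n);
    if (!copy)
        return;                        /* out of memory: drop the line */

    memcpy(copy, line, n);
    g_kept[g_kept_count++] = copy;
}

/* Releases the copies; these belong to the caller, not to the library. */
void freeKeptLines(void)
{
    while (g_kept_count > 0)
        free(g_kept[--g_kept_count]);
}
```

Note that global storage like this gives up thread safety in the caller's code, even if the library itself shares no state.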

Compared to other string parsing libraries, this is a really simple and limited one, but I see simplicity as a feature rather than a limitation. You can use it as a first step in building a more advanced parser. Compared to the standard C function strtok(), all functions in this library are non-destructive and thread safe, since no static data or any other form of shared data is used. Compared to strtok() there is one disadvantage: it is not possible to change delimiters during the tokenizing process. Since strtok() saves state between calls and continues from the last place processed, it can use different delimiters in different invocations. Considering the drawbacks it has because of that shared state, I don't think this is really a killer feature. An advantage the tokenizer in this library could have over strtok() would be the ability to return delimiters, which is easily implemented; I just haven't had time yet. It can't be done with standard strtok() (unless you want to write your own).
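The destructive behavior of strtok() is easy to see side by side with a span-based scan. The two helper functions below are hypothetical and written only for this comparison; the second uses the standard strspn()/strcspn() functions, which is one way to tokenize without writing to the input or keeping hidden state.

```c
#include <string.h>

/* Counts tokens with strtok(). `text` must be writable: strtok()
 * overwrites each delimiter that ends a token with '\0' and keeps
 * its position in hidden static state between calls. */
int count_with_strtok(char *text, const char *delimiters)
{
    int count = 0;
    char *tok = strtok(text, delimiters);

    while (tok != NULL) {
        count++;
        tok = strtok(NULL, delimiters);
    }
    return count;
}

/* Counts tokens by measuring spans with strspn()/strcspn().
 * Nothing is written and no state survives between calls. */
int count_with_spans(const char *text, const char *delimiters)
{
    int count = 0;

    for (;;) {
        text += strspn(text, delimiters);    /* skip delimiters */
        if (*text == '\0')
            break;
        text += strcspn(text, delimiters);   /* skip over one token */
        count++;
    }
    return count;
}
```

After `count_with_strtok` has run on a buffer holding `"x,y,z"`, the buffer reads as just `"x"` when treated as a string, whereas `count_with_spans` leaves it untouched.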

There are only 4 functions in one header and one source file. Since no objects other than primitive data on the stack are used, there is no need for initialization. Furthermore, we have no shared objects here; every invocation of a function uses only its own stack, so function invocations play the role of objects. Simply put, for all you object-oriented purists, there is no need to wrap such a simple library in a class.

I prefer to keep it in a single shared library, but you may also simply add the header and source files directly to your project. The download also includes a Visual Studio Express project file for building the DLL, but no makefile for gcc users (planned for the future). I haven't tested it on Linux or with anything other than VS Express yet.

Observe that a line-by-line approach is sufficient in most cases, unless you allow your tokens to continue on the next line. This library does not attempt to handle cases where tokens are broken over multiple lines.

The NT in the prefix stands for "next token."

Points of Interest

There is a lot of room for improvement in this library. The code so far is just a prototype rather than a finished library. In the future I plan to convert it to wide chars instead of ASCII, and to add an option to return delimiters in the string tokenizer. However, the first task is to do more testing before any new features are added.

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3).

About the Author

arthur_m

Sweden

Last Updated 29 Jul 2009
Article Copyright 2009 by arthur_m