Click here to Skip to main content
Click here to Skip to main content

Counting Lines in a String

By , 18 Jun 2012
 

Introduction

I needed to know, so I wrote a quick, brute-force-and-ignorance (henceforth known as "BF+I") method to do it:

static long LinesCount(string s)
    {
    long count = 0;
    int position = 0;
    while ((position = s.IndexOf('\n', position)) != -1)
        {
        count++;
        position++;         // Skip this occurrence!
        }
    return count;
    }

A quick test, and it works fine.

But then, I thought - isn't that a bit...well...rubbish? Isn't there another way to do it?

And sure, there is: I can think of two off the top of my head: Regex and Linq. So, I knocked up more methods.
And then I got a lot of feedback suggesting other ways, so I have re-worked this tip to reflect those.
A quick-n-dirty test rig to check them:

            string s = File.ReadAllText(@"D:\Temp\MyLargeTextFile.txt");
            long index = LinesCountIndexOf(s);
            long regex = LinesCountRegex(s);
            long linq= LinesCountLinq(s);
            Console.WriteLine("{0}:{1}:{2}", index, regex, linq);
            Stopwatch s0 = new Stopwatch();
... repeated, one per test
            Stopwatch sn = new Stopwatch();
            s0.Start();
            for (int i = 0; i < 100; i++)
                {
                index = LinesCountIndexOf(s);
                }
            s0.Stop();
            sb.AppendFormat("IndexOf     : {0,06}\n", s0.ElapsedMilliseconds);
            tbResults.Lines = sb.ToString().Split('\n');
            sl.Start();
... Identical blocks, different method names.

The input file is 1.64MB, and contains around 23K lines. Please note that this file is read once for the entire test run, and is just used to give a significant amount of data for the tests to run against. A small amount of data would take us below the timer resolution of 1 millisecond and distort the result. This is not a test of how fast I can count lines in a file!

The methods tried so far:

        static long LinesCountIndexOf(string s)
            {
            long count = 0;
            int position = 0;
            while ((position = s.IndexOf('\n', position)) != -1)
                {
                count++;
                position++;         // Skip this occurrence!
                }
            return count;
            }
        static Regex r = new Regex("\n", RegexOptions.Multiline);
        static long LinesCountRegex(string s)
            {
            MatchCollection mc = r.Matches(s);
            return mc.Count;
            }
        static long LinesCountLinq(string s)
            {
            return (from ch in s
                    where ch == '\n'
                    select ch).Count();
            }
        static long LinesCountSplit(string s)
            {
            return (s.Split(new char[] { '\n' })).Length;
            }
 
        static long LinesCount2(string s)
            {
            long count = 0;
            int posMax = s.Length;
 
            for (int position = 0; position < posMax; )
                if (s[position++] == '\n')
                    count++;
 
            return count;
            }
 
        static long LinesCount3(string s)
            {
            long count = 0;
            int posMax = s.Length;
            char[] a = s.ToCharArray();
 
            for (int position = 0; position < posMax; )
                if (a[position++] == '\n')
                    count++;
 
            return count;
            }
 
        static long LinesCount4(string s)
            {
            long count = 0;
            int position = -1;
            while ((position = s.IndexOf('\n', position + 1)) != -1)
                {
                count++;
                }
            return count;
            }
 
        static long LinesCount5(string s)
            {
            long count = 0;
            int posMax = s.Length;
            char[] a = s.ToCharArray();
 
            foreach (char c in a)
                if (c == '\n')
                    count++;
 
            return count;
            }
 
        static long LinesCount6(string s)
            {
            long count = 0;
 
            foreach (char c in s)
                if (c == '\n')
                    count++;
 
            return count;
            }
 
        static long LinesCount7(string s)
            {
            return s.Length - s.Replace("\n", "").Length;
            }
        static long LinesCountAhoCorasick(string s)
            {
            IStringSearchAlgorithm searchAlg = new StringSearch();
            searchAlg.Keywords = new string[] { "\n" };
            StringSearchResult[] results = searchAlg.FindAll(s);
            return results.Length;
            }
        static long LinesCountLinqCount(string s)
            {
            return s.Count(c => (c == '\n'));
            }
        unsafe static long LinesCountUnsafePtr(string s)
            {
            long lineCount = 1;
            fixed (char* pchar = s)
                {
                char* p = pchar;
                for (; *p != '\0'; p++)
                    {
                    if (*p == '\n') lineCount++;
                    }
                }
            return lineCount;
            }

The addition of a StreamReader version, which doesn't really fit, but people seem to be suggesting file based solutions:

            ssr.Start();
            for (int i = 0; i < 100; i++)
                {
                StreamReader sr = new StreamReader(@"D:\Temp\MyLargeTextFile.txt");
                index = LinesCountStream(sr);
                sr.Dispose();
                }
            ssr.Stop();

...

        static long LinesCountStream(StreamReader sr)
            {
            long count = 0;
            while (sr.ReadLine() != null)
                {
                count++;
                }
            return count;
            }

They all return the same number of lines, which is fine. But the timing tests were interesting:

Test    ElapsedMilliseconds
IndexOf     :    167
Index (2)   :    953
Index (3)   :   1026
Index (4)   :    167
Index (5)   :   1211
Index (6)   :   1544
Index (7)   :   1088
Split       :   1103
Regex       :   2517
Linq (Expl) :   3576
Linq Count  :   3531
Stream Read :    976
Unsafe Ptr  :    633

I had expected Regex to be much, much faster! The serious difference between Linq and Regex was a surprise, but I was really surprised when BF+I beat them both hands down! The Split method didn't surprise me by being slower than Index, as it means .NET has to allocate 23K strings each time. To finish it all off, I rewrote the BF+I version as an Extension Method, and made it assume that a string with no newlines was a single line. As an extension method, it was within a millisecond of the original times, as you would expect.

static class ExtensionMethods
    {
    /// <summary>
    /// Returns the number of lines in a string
    /// </summary>
    /// <param name="s"></param>
    /// <returns></returns>
    public static long Lines(this string s)
        {
        long count = 1;
        int position = 0;
        while ((position = s.IndexOf('\n', position)) != -1)
            {
            count++;
            position++;         // Skip this occurrence!
            }
        return count;
        }
    }

BTW: Adding RegexOptions.Compiled to the regular expression brought it down to:

Test    ElapsedMilliseconds
BF+I     170
Split   1089
Regex   2063
Linq    3583

So it is WELL worth adding it to all your Regexes if you can.

Revision History

  • [Edit]Split method suggested by Grasshopper.iics[^] added -OriginalGriff[/edit]
  • [Edit2]I forgot to add the credit for the split method D'Oh! | <img src= " /> - OriginalGriff[/edit2]
  • [Edit3]Deeksha Shenoy[/Edit3]
  • [Edit4]Inclusion of more variants as suggested by: unruled boy[^] (Aho-Corasick, awaiting method) Grasshopper.iics[^] (Split) akemper[^] (Variations on BF+I, Replace, Split) George Swan[^] (LinqCount) Thanks to all these people (and others if I haven't updated this yet)! -OriginalGriff [/Edit4]
  • [Edit5]I forgot to introduce the three letter acronym "BF+I" when I first defined it as "Brute-Force-and-Ignorance". My thanks to fuzz_ball for pointing this out. Sorry for any confusion caused... - OriginalGriff[/Edit5]
  • [Edit6]Jon Bellamy[^] suggested a StreamReader, which I have included despite it being file related, rather than string. Please note that the test is different, as StreamReader does not support Seek or Rewind so has to be instantiated new each time. (Removing the Dispose only took 10 milliseconds off the time) Roman Shero[^] suggested the unsafe version Extra text added to the file read to make it clear why I load a file at all - OriginalGriff[/Edit6]
  • [Edit7]Added download for test rig (it's not pretty!), fixed formatting of code section (don't know what happened to it there), minor spelling mistake fixed, and Revision history reformatted - OriginalGriff[/Edit7]

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

OriginalGriff
CEO
Wales Wales
Member
Born at an early age, he grew older. At the same time, his hair grew longer, and was tied up behind his head.
Has problems spelling the word "the".
Invented the portable cat-flap.
Currently, has not died yet. Or has he?

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralMy vote of 5memberS. M. Ahasan Habib18 Feb '13 - 17:08 
Nice!
GeneralMy vote of 5memberVadimAlk9 Jul '12 - 18:07 
Great research
GeneralCombination of IndexOf an Parallel.ForEach [modified]memberbunnyboy0158 Jul '12 - 10:55 
That one gives even a faster result.
 
private static long LinesCountIndexOfThreaded(string s)
        {
            long total = 0;
            const char newline = '\n';
 
            Parallel.ForEach(Partitioner.Create(0, s.Length, s.Length /(Environment.ProcessorCount * 2)),
            () => 0L, (range, state, count) =>
            {
                int position = range.Item1;
                while ((position = s.IndexOf(newline, position)) != -1 && (position < range.Item2))
                {
                    count++;
                    position++;
                }
                return count;
            }, localTotal => Interlocked.Add(ref total, localTotal));
 
            return total;
        }


modified 8 Jul '12 - 17:09.

AnswerLast minutememberYvesDaoust25 Jun '12 - 0:34 
When benchmarking the Release version, the optimizer does its job, and you get less heretic results.
 
IndexOf : 142
Index (2) : 181
Index (3) : 343
Index (4) : 148
Index (5) : 297
Index (6) : 127
Index (7) : 783
Split : 829
Regex : 2483
Linq (Expl) : 1385
Linq Count : 1417
Stream Read : 3150
Unsafe Ptr : 0
Stream Read2: 2970

QuestionVery sad !memberYvesDaoust24 Jun '12 - 23:53 
Very well done, this is a complete and very creative research.
 
Anyway, the conclusion fills me with sadness because the should-have-been winner, namely LinesCount6, is strongly disfavored.
 
Indeed, in this problem, there is no better way than scanning the whole string and comparing to newlines. Because you need to take a look at every character once and only once: once is enough for counting, and once is required because a partial read of the string will tell you absolutely nothing about the remaining ones. This is clearly an O(N) problem (technically speaking big-theta(N)).
 
The solution is a trivial scan from start to end with a comparison/counting in the body of the loop (in addition, a test for end-of-string is needed): while (s[i]) count+= (s[i] == '\n'); You can't make it simpler, can you ?
 
Well, you could make it slightly faster by avoiding the test in the loop, using a lookup table, filled with all 0's except for the newline: while (s[i]) count+= lookup[s[i]];. This way, you trade a test for a memory read, which may make it faster (mostly applicable to ASCII characters rather than UNICODE, because of the table size).
 
So what makes LineCount6 so slow and what makes other implementations so fast ?
 
The answer to the first question is overhead: apparently, for every string access C# will wrap the string indexing with type conversion, range checking or other time wasters. The penalty is heavy.
 
The answer to the second question is: built-in scans. Whether it is regexpr or character search, C# will invoke built-in library functions that are doing exactly what is efficient: string scanning with comparison or table lookup. And they are fast because they are built-in, native, and avoid undue overhead.
 
So, to come up with efficient solutions, you need to workaround overhead and design complicated solutions based on the available primitives, instead of using the obvious and efficient one.
 
Very sad !
AnswerRe: Very sad !mvpOriginalGriff25 Jun '12 - 0:20 
It is sad, but I was amazed that my first guess - brute force and ignorance - was so much faster than other approaches.
It may be possible to gain a small speed increase by using unsafe code, but I'm not absolutely certain without checking - it is entirely possible that the overhead of fixing the string to use an unsafe pointer on it may slow it down...Laugh | :laugh: I may add that some time. Smile | :)
 
The one thing that leaps out though, is that any solution to any problem that involves creating a number of new objects is likely to be seriously slow...
Ideological Purity is no substitute for being able to stick your thumb down a pipe to stop the water

QuestionRegex comparison to precompiled would be nicememberSteav18 Jun '12 - 1:43 
Regex comparison to precompiled would be nice
AnswerRe: Regex comparison to precompiled would be nicemvpOriginalGriff18 Jun '12 - 2:25 
Um.
From the bottom of the Tip:
BTW: Adding RegexOptions.Compiled to the regular expression brought it down to:
Test    ElapsedMilliseconds
BF+I     170
Split   1089
Regex   2063
Linq    3583
So it is WELL worth adding it to all your Regexes if you can.  
Ideological Purity is no substitute for being able to stick your thumb down a pipe to stop the water

GeneralRe: Thanks.memberMiller427 Feb '12 - 1:25 
Thanks.
GeneralRe: If you're going to save off the Regex and use it more than o...memberAndrew Rissing9 Jan '12 - 4:56 
If you're going to save off the Regex and use it more than once, compiled should always be faster. Otherwise, uncompiled should be used for the best performance.
GeneralRe: In my case, I was writing a massively parallel find routine,...memberJohn Brett9 Jan '12 - 4:26 
In my case, I was writing a massively parallel find routine, operating on multiple regexs on multiple files concurrently. The regexes themselves were user-supplied, but even the most trivial of strings (without any special characters) were many times slower (~x5 from memory).
Hugely unintuitive, and hard to find, therefore worth sharing.
IIRC, Jon Skeet has also blogged about LINQ sometimes having unexpected performance characteristics (either faster or slower than expected).
GeneralReason for my vote of 5 Nice one.membernikhi _singh26 Feb '12 - 17:43 
Reason for my vote of 5
Nice one.
GeneralCan anyone please explain why LinesCount2 is so slow? I thou...memberMiller426 Feb '12 - 6:57 
Can anyone please explain why LinesCount2 is so slow? I thought it should be as fast as the LinesCountIndexOf.
GeneralRe: If you check the implementaiton of the IndexOf() method in R...memberAndreas Gieriet26 Feb '12 - 11:41 
If you check the implementaiton of the IndexOf() method in Reflector, you see that IndexOf() is an internal implementation, not given as IL, thus leading to more room for optimization.
GeneralReason for my vote of 5 Great analysis!memberAndreas Gieriet26 Feb '12 - 3:53 
Reason for my vote of 5
Great analysis!
GeneralWhat does BF+I stand for? It doesn't appear related to the m...memberQwertie23 Jan '12 - 20:58 
What does BF+I stand for? It doesn't appear related to the method name LinesCountIndexOf. Why not just use the same method names consistently throughout your post?
GeneralRe: "brute-force-and-ignorance" - it's in the first line of the ...mvpOriginalGriff23 Jan '12 - 21:35 
"brute-force-and-ignorance" - it's in the first line of the tip.
It isn't a method name, it's a description of the algorithm! Laugh | :laugh:
GeneralReason for my vote of 5 nicely illustratedmemberRavi Gupta from India17 Jan '12 - 8:22 
Reason for my vote of 5
nicely illustrated
GeneralLine terminations are found closer to the end of a string th...memberBob Van Wagner17 Jan '12 - 2:54 
Line terminations are found closer to the end of a string than the beginning, you want to walk backwards and not forward looking for them. In super-high performance systems one could use one register load of a multibyte word and mask each byte for a match on '\n'.
GeneralReason for my vote of 5 Really excellent thorough research a...memberBillWoodruff16 Jan '12 - 22:42 
Reason for my vote of 5
Really excellent thorough research and evaluation of a variety of techniques !
GeneralReason for my vote of 5 Wonderful!memberAlex Essilfie16 Jan '12 - 22:15 
Reason for my vote of 5
Wonderful!
GeneralJust a general writing comment, when you toss in a TLA it's ...memberfuzz_ball16 Jan '12 - 9:46 
Just a general writing comment, when you toss in a TLA it's always best to introduce it (in parens) the first time after it is introduced. Yes, I broke the rule using TLA above, but it was to make a point Smile | :)
 
The TLA I'm talking about is your BF+I. You should have included that at the top when you first mentioned it. Just a general writing suggestion. Nice to the point article!
GeneralRe: :O Sorry - I forgot...fixedmvpOriginalGriff16 Jan '12 - 21:19 
Blush | :O
Sorry - I forgot...fixed
GeneralIt looks like Linq is built for comfort, not speed.memberGeorge Swan11 Jan '12 - 6:00 
It looks like Linq is built for comfort, not speed.
GeneralOK, so how about a less ignorant version that ignores newlin...memberPIEBALDconsult9 Jan '12 - 4:29 
OK, so how about a less ignorant version that ignores newlines within quotes when requested? And what about formfeeds? Don't they start a new line too? Big Grin | :-D

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130516.1 | Last Updated 18 Jun 2012
Article Copyright 2012 by OriginalGriff
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid