Collect and Compare Log Statistics using LogJoin

Yuriy Nelipovich, 22 Oct 2013
The LogJoin tool helps collect unstructured data from text files and join it into a simple table representation for easy analysis.

Introduction

LogJoin is a simple console utility written in C#; it can also serve as a good example of a regex-driven file parser for beginners. In this tip, I'll show how to use it and what it can do.

A few words about its features:

  • It is simple: less code means fewer bugs
  • Text parsing is based on user-defined regular expressions (multiline patterns are supported)
  • Input files are not fully loaded into memory
  • Multiple files can be selected as a single input source; this is useful when logs are split across several files, because there is no need to merge them into one big file (a small sketch illustrating this and the previous point follows the list)
  • It can join any number of sources (limited only by available machine resources)
  • Currently, it supports output in CSV format only
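
As referenced in the list above, here is a minimal sketch of my own (not LogJoin's actual implementation) showing how files matching a mask can be exposed as a single lazy sequence of lines, so the content is streamed rather than loaded into memory in full:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class MultiFileSource
{
    // Lazily enumerates the lines of every file matching the mask, in file-name order.
    // File.ReadLines streams each file, so nothing is ever held in memory in full.
    public static IEnumerable<string> ReadAllLines(string directory, string fileMask)
    {
        return Directory.EnumerateFiles(directory, fileMask)
                        .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
                        .SelectMany(File.ReadLines);
    }
}

With an approach like this, a mask such as log.sender*.txt can be consumed as if it were one big file.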

Background

Sometimes developers or administrators need to collect small pieces of data from huge log files for further analysis. Often the file format does not conform to any standard like CSV or XML; let's say the files are unstructured. This is the first part of the problem. The second is that the data pieces (related to the same entity) may be located in different files, where each file has its own unique format. This simple utility is designed to solve both problems.

Imagine you have log files from four applications, all messaging clients, where the first, second, and third clients send messages to the fourth. Each client writes details about every event to its own log file; for example, when sending a message, the sender application writes a set of message attributes (message ID, title, sender account, the time it was sent, etc.) to the log. Below is an example (the full content is too large, so I show only part of it).

...
 
Message sent from account John. Message ID: 34, time: 2013-01-14 05:29:07; 
title: eu pellentesque; words count: 62
Message body: risus lacus, interdum quis vehicula et, eleifend non nisi. 
Aenean non eros orci. Cum sociis natoque penatibus et magnis dis parturient montes, 
nascetur ridiculus mus. Vestibulum a felis tellus. Donec ornare dictum dolor, 
eu pellentesque elit condimentum vitae. Nunc pellentesque magna nibh, 
fringilla laoreet sapien interdum non. Aenean quis laoreet lacus. 
Praesent et justo ut eros tristique pulvinar at sed urna. Integer
 
Message sent from account John. Message ID: 35, time: 2013-01-14 13:36:04; 
title: tellus Donec; words count: 53
Message body: elit. Aenean sed ipsum ut ipsum venenatis pulvinar at sed nunc. 
In hac habitasse platea dictumst. Suspendisse a lectus sit amet enim sollicitudin volutpat. 
Duis risus lacus, interdum quis vehicula et, eleifend non nisi. Aenean non eros orci. 
Cum sociis natoque penatibus et magnis dis parturient montes, 
nascetur ridiculus mus. Vestibulum a felis
 
...

The data to be extracted from this log are the attributes in each entry header: the account name, message ID, time, title, and words count.

The second and third logs have similar content:

...
 
Message sent from account Mike. Message ID: 17, 
time: 2013-01-07 06:11:24; title: at elit; words count: 52
Message body: dolor sit amet, consectetur adipiscing elit. 
Phasellus sed mi blandit, consectetur nunc eget, mollis nunc. 
Sed dapibus sollicitudin placerat. Curabitur porttitor, sem eu dictum dignissim, 
felis mi suscipit mi, at porta felis turpis at elit. 
Aenean sed ipsum ut ipsum venenatis pulvinar at sed nunc. 
In hac habitasse platea dictumst. Suspendisse a
 
Message sent from account Mike. Message ID: 18, 
time: 2013-01-07 14:31:26; title: Suspendisse a; words count: 39
Message body: sollicitudin placerat. Curabitur porttitor, sem eu dictum dignissim, 
felis mi suscipit mi, at porta felis turpis at elit. 
Aenean sed ipsum ut ipsum venenatis pulvinar at sed nunc. 
In hac habitasse platea dictumst. Suspendisse a lectus sit amet enim
 
...
...
 
Message sent from account Kate. Message ID: 17, time: 2013-01-07 07:36:39; 
title: felis tellus; words count: 58
Message body: hac habitasse platea dictumst. 
Suspendisse a lectus sit amet enim sollicitudin volutpat. 
Duis risus lacus, interdum quis vehicula et, eleifend non nisi. Aenean non eros orci. 
Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. 
Vestibulum a felis tellus. Donec ornare dictum dolor, eu pellentesque elit condimentum vitae. 
Nunc pellentesque magna nibh, fringilla laoreet sapien
 
Message sent from account Kate. Message ID: 18, time: 2013-01-07 18:18:06; 
title: sit amet; words count: 30
Message body: sollicitudin placerat. Curabitur porttitor, sem eu dictum dignissim, 
felis mi suscipit mi, at porta felis turpis at elit. 
Aenean sed ipsum ut ipsum venenatis pulvinar at sed nunc. In hac
 
...

The recipient application's log contains information about incoming messages: the message ID, the sender's account name, the title, the time when the message was checked by the spam filter, a spam flag, etc.

...
 
New message received, ID: 17, sender: John. Processing it.
Title: penatibus et
Current time: 2013-01-07 01:44:14
Is spam: True
 
New message received, ID: 17, sender: Mike. Processing it.
Title: at elit
Current time: 2013-01-07 06:11:56
Is spam: True
 
New message received, ID: 17, sender: Kate. Processing it.
Title: felis tellus
Current time: 2013-01-07 07:39:46
Is spam: False
 
New message received, ID: 18, sender: John. Processing it.
Title: turpis at
Current time: 2013-01-07 09:11:01
Is spam: True
 
New message received, ID: 18, sender: Mike. Processing it.
Title: Suspendisse a
Current time: 2013-01-07 14:32:48
Is spam: False
 
New message received, ID: 18, sender: Kate. Processing it.
Title: sit amet
Current time: 2013-01-07 18:20:55
Is spam: True
 
...

Assume the goal is to compare the time when spam messages were sent with the time when the recipient detected them as spam. In the next section, I'll explain how to achieve this using the LogJoin tool.

The Solution

As I said before, there are two main tasks:

  • Extract values from an unstructured text file (or files) and represent them as a set of records (like a table in a relational database). I call this set of records a Source.
  • Join multiple Sources into a single set of records (i.e., into another Source) using an operation similar to a Left Outer Join in relational databases.

For the first task, regular expressions are used. A regular expression defines the rules that extract the necessary values from the unstructured text and represent them as a set of fields, where each field has a name and a value (for example, messageID=17, messageAuthor=John). A set of fields is called a Record (like a record in a relational table). The GetRecord method of the Source class performs the extraction:

/// <summary>
/// Tries to extract a Record from the given <paramref name="text"/> portion
/// </summary>
/// <param name="text">Text portion that may contain all field values</param>
/// <param name="recordNumber">Number of the text portion</param>
/// <returns>A Record, or null if the text portion does not match</returns>
private Record GetRecord(string text, int recordNumber)
{
    var match = this._recordRegex.Match(text);
    if (!match.Success)
        return null;

    // One key may consist of multiple scalar values; when no key groups are
    // configured, the record number itself is used as the key.
    string[] keyParts = this._keyGroupNames != null
        ? this._keyGroupNames.Select(gn => match.Groups[gn].Value).ToArray()
        : new[] { recordNumber.ToString(CultureInfo.InvariantCulture) };

    if (keyParts.All(string.IsNullOrEmpty))
        throw new ApplicationException(string.Format("Key not found in '{0}' text: {1}", this.Name, text));

    return new Record(string.Join("~", keyParts)/*this is the key of the record*/)
    {
        OtherFields = this.OtherColumnsNames.Select(cn => match.Groups[cn].Value).ToArray()
    };
}
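
To make the named-group idea concrete, here is a minimal standalone sketch (it is not part of LogJoin) that matches a single sender log entry with a pattern mirroring the senderRecord expression from the configuration shown later, and prints the captured fields:

using System;
using System.Text.RegularExpressions;

class NamedGroupDemo
{
    static void Main()
    {
        // A single line from a sender log (format taken from the sample above).
        const string line =
            "Message sent from account Mike. Message ID: 17, time: 2013-01-07 06:11:24; " +
            "title: at elit; words count: 52";

        // Named groups become the fields of a Record.
        var regex = new Regex(
            @"^Message sent from account (?<messageAuthor>\w+)\. " +
            @"Message ID: (?<messageID>\d+), time: (?<timeSent>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}); " +
            @"title: (?<title>[\w ]+); words count: (?<wordsCount>\d+)");

        var match = regex.Match(line);
        if (match.Success)
        {
            // Key = messageAuthor + messageID; the other named groups follow as fields.
            Console.WriteLine("Key: {0}~{1}", match.Groups["messageAuthor"].Value, match.Groups["messageID"].Value);
            Console.WriteLine("timeSent: {0}", match.Groups["timeSent"].Value);
            Console.WriteLine("title: {0}", match.Groups["title"].Value);
            Console.WriteLine("wordsCount: {0}", match.Groups["wordsCount"].Value);
        }
    }
}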

The second task is to join the sources into a result set and export the result to an output file. This logic is implemented in the Join class. The JoinSources method joins records from all sources, and the WriteRecords method dumps the result to the output file.

/// <summary>
/// Joins multiple sources by their keys
/// </summary>
/// <param name="sources">The sources to join</param>
/// <returns>Sequence of joined records</returns>
private static IEnumerable<Record> JoinSources(IEnumerable<Source> sources)
{
    IEnumerable<Record> allRecords = null;
    foreach (var source in sources)
    {
        if (allRecords == null)
        {
            // The first source forms the left side of the join;
            // duplicate keys are resolved by taking the last record.
            allRecords = source.GetAllRecords().GroupBy(_ => _.Key, (key, group) => @group.Last());
        }
        else
        {
            // Every subsequent source is attached with a left outer join:
            // records without a match contribute empty field values.
            Source newSource = source;
            allRecords = allRecords
                .GroupJoin(newSource.GetAllRecords(), _ => _.Key, _ => _.Key,
                           (r, group) => new Record(r.Key)
                           {
                               OtherFields =
                                   r.OtherFields.Concat(
                                       @group.Any()
                                           ? @group.Last().OtherFields
                                           : newSource.OtherColumnsNames.Select(_ => string.Empty)
                                   ).ToArray()
                           });
        }
    }
    return allRecords;
}
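
For readers unfamiliar with GroupJoin, the following toy sketch (my own example, not LogJoin classes) shows why it behaves like a left outer join here: every left-side key survives, and a missing right-side match simply yields an empty group, which JoinSources replaces with empty field values:

using System;
using System.Linq;

class GroupJoinDemo
{
    static void Main()
    {
        // Left side: message keys from the sender log (toy data).
        var sent = new[] { "John~17", "Mike~17", "Kate~17" };

        // Right side: keys from the recipient log; "Kate~17" is missing.
        var received = new[] { "John~17", "Mike~17" };

        // GroupJoin keeps every left record; an empty group means "no match".
        var joined = sent.GroupJoin(received, s => s, r => r,
            (s, group) => new { Key = s, Received = group.Any() ? "yes" : "" });

        foreach (var row in joined)
            Console.WriteLine("{0} -> received: {1}", row.Key, row.Received);
        // Output:
        // John~17 -> received: yes
        // Mike~17 -> received: yes
        // Kate~17 -> received:
    }
}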

/// <summary>
/// Writes records to text writer
/// </summary>
/// <param name="allRecords">Records containing values to write</param>
/// <param name="writer">Output writer</param>
/// <returns>Count of lines written</returns>
private long WriteRecords(IEnumerable<Record> allRecords, TextWriter writer)
{
    var s = Stopwatch.StartNew();
    long count = 0;
    foreach (var record in allRecords)
    {
        var line = string.Join(this._separator, new[] {record.Key}.Concat(record.OtherFields));
        writer.WriteLine(line);
        count++;
        // Report progress every few seconds for long-running exports
        if (s.Elapsed.TotalSeconds > 3)
        {
            Console.WriteLine("{0} records written so far...", count);
            s = Stopwatch.StartNew();
        }
    }
    return count;
}
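
As a quick illustration of what WriteRecords emits, this small sketch (toy values, with a comma separator assumed) shows how a record's key and field values become one CSV line:

using System;
using System.Linq;

class CsvLineDemo
{
    static void Main()
    {
        // A record key followed by its field values, joined the same way WriteRecords does it.
        var key = "John~17";
        var otherFields = new[] { "2013-01-07 01:44:14", "True", "penatibus et" };

        var line = string.Join(",", new[] { key }.Concat(otherFields));
        Console.WriteLine(line);   // John~17,2013-01-07 01:44:14,True,penatibus et
    }
}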

Using the Tool

All parameters are passed to the application through the App.config file, in the <parameters> section. It consists of two main parts: regular expressions and inputs. Details on each are given in the comments below:

  <parameters>
    <!-- The <expressions> element contains a list of
    regular expressions used to parse the content of the input files. -->
    <expressions>
      <!-- An expression is defined in an <expr> element.
           You should specify a unique name for each expression.
           An expression can be multiline, which means it can match
           multiple lines at once
           (see System.Text.RegularExpressions.RegexOptions.Multiline for details).
           When parsing the input text, the expression extracts a set of values,
           and the values are stored in a Record.
           Each new match of the expression produces a new Record.
           To capture the values, the expression should define named groups.
           In the example below, the group (?<messageAuthor>\w+) will capture the author's name.
           Some of the values should form a unique key for the Record,
           see the example below. -->
      <expr name="recipientRecord" multiline="true"><![CDATA[^New message received, ID: (?<messageID>\d+), sender: (?<messageAuthor>\w+)\. Processing it.\r\nTitle: (?<title>[\w ]+)\r\nCurrent time: (?<timeReceived>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\r\nIs spam: (?<isSpam>\w+)\r\n]]></expr>
      
      <expr name="senderRecord" multiline="false"><![CDATA[^Message sent from account (?<messageAuthor>\w+)\. Message ID: (?<messageID>\d+), time: (?<timeSent>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}); title: (?<title>[\w ]+); words count: (?<wordsCount>\d+)]]></expr>
      
    </expressions>
    
    <!-- The <inputs> element contains a list of inputs
    that provide the data for processing -->
    <inputs>
      <!-- An input is a text file (or files)
           containing values for Records. The input should have a unique name
           that plays a role similar to a table name in SQL queries.
           The "lines" attribute defines how many lines of
           text in the input file may correspond to one Record; in other words,
           this is the maximum number of lines that the regular expression can match. -->
      <input name="recipient" lines="5">
        
        <!-- Path to the directory containing the input file (or files) -->
        <dirPath>..\..\sample</dirPath>
    
        <!-- Name of the input file (or file mask for multiple files) -->
        <fileName>log.recipient.txt</fileName>
        
        <!-- Name of the regular expression (defined above) 
        that will be used to parse the input -->
        <regEx>recipientRecord</regEx>
        
        <!-- Set of column names (fields) that each Record contains.
        These names must correspond to the named groups defined in the expression -->
        <columns>timeReceived, messageAuthor, messageID, isSpam, title</columns>
        
        <!-- Set of values that together form a unique key
        for each Record. The order of the key values is important.
             This key is used in the Left Join operation between different inputs; see details below. -->
        <key>messageAuthor, messageID</key>
      </input>
      
      <!-- The second input produces the set of records for the left side of the Join operation -->
      <input name="sender" lines="1">
        
        <dirPath>..\..\sample</dirPath>
        
        <!-- This input consists of a set of text files; they are processed in file-name order -->
        <fileName>log.sender*.txt</fileName>
        
        <regEx>senderRecord</regEx>
        
        <columns>messageAuthor, messageID, timeSent, title, wordsCount</columns>
        
        <!-- The keys of each input should have the same order and the same format,
             because when Records from different inputs are joined,
             the key values are compared as strings. -->
        <key>messageAuthor, messageID</key>
      </input>
    </inputs>
    
    <!-- Path to the output file. The file contains a table in CSV format.
         The file name can contain a string format argument that receives a DateTime value
         (by default, the current date and time) -->
    <output>..\..\sample\result-{0:dd-MM-yyy}.csv</output>
    
    <!-- Delimiter for values in output file -->
    <outputDelimiter>,</outputDelimiter>
  </parameters>
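
The {0:dd-MM-yyy} placeholder in <output> is a standard .NET composite-format argument. Below is a minimal sketch of how such a path can be expanded with the current date (the class and variable names are just for illustration):

using System;
using System.Globalization;

class OutputPathDemo
{
    static void Main()
    {
        // Expanding the configured output path with the current date,
        // as described in the <output> comment above.
        const string configuredPath = @"..\..\sample\result-{0:dd-MM-yyy}.csv";
        string resolvedPath = string.Format(CultureInfo.InvariantCulture, configuredPath, DateTime.Now);
        Console.WriteLine(resolvedPath);   // e.g. ..\..\sample\result-22-10-2013.csv
    }
}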

The result is written to a CSV file. This is a well-known format, so there are many tools that can visualize it and work with the data. For example, you can filter records by the 'recipient.isSpam=true' condition, or calculate message delivery time with the formula 'recipient.timeReceived - sender.timeSent', and so on (of course, only if your tool supports these operations). A sketch of the delivery-time calculation is shown after the result table below.

In our sample, the application produces the following result:

Key        recipient.timeReceived     recipient.messageAuthor   recipient.messageID    recipient.isSpam  recipient.title    sender.messageAuthor   sender.messageID     sender.timeSent       sender.title     sender.wordsCount
John~0     01.01.2013 10:43           John                      0                      FALSE             dignissim felis    John                   0                    01.01.2013 10:40      dignissim felis  29
Mike~0     01.01.2013 14:22           Mike                      0                      FALSE             penatibus et       Mike                   0                    01.01.2013 14:19      penatibus et     8
Kate~0     01.01.2013 18:24           Kate                      0                      TRUE              consectetur nunc   Kate                   0                    01.01.2013 18:23      consectetur nunc 32
John~1     01.01.2013 20:41           John                      1                      TRUE              sed urna           John                   1                    01.01.2013 20:38      sed urna         6
Mike~1     02.01.2013 01:22           Mike                      1                      TRUE              mi at              Mike                   1                    02.01.2013 01:19      mi at            53
Kate~1     02.01.2013 02:03           Kate                      1                      FALSE             Curabitur porttitor Kate                  1                    02.01.2013 02:02      Curabitur porttitor 42
John~2     02.01.2013 06:10           John                      2                      TRUE              pellentesque elit  John                   2                    02.01.2013 06:08      pellentesque elit 12
Mike~2     02.01.2013 10:32           Mike                      2                      FALSE             pellentesque elit  Mike                   2                    02.01.2013 10:31      pellentesque elit 26
Kate~2     02.01.2013 13:58           Kate                      2                      TRUE              lacus interdum     Kate                   2                    02.01.2013 13:56      lacus interdum   7
John~3     02.01.2013 17:46           John                      3                      FALSE             Phasellus sed      John                   3                    02.01.2013 17:45      Phasellus sed    57
Mike~3     02.01.2013 23:02           Mike                      3                      TRUE              Vestibulum a       Mike                   3                    02.01.2013 23:01      Vestibulum a     47
Kate~3     03.01.2013 04:25           Kate                      3                      TRUE              orci Cum           Kate                   3                    03.01.2013 04:23      orci Cum         4
. . .
John~16    06.01.2013 16:39           John                      16                     TRUE              sit amet           John                   16                   06.01.2013 16:38      sit amet         34
Mike~16    06.01.2013 18:57           Mike                      16                     TRUE              vitae Nunc         Mike                   16                   06.01.2013 18:55      vitae Nunc       2
Kate~16    06.01.2013 21:21           Kate                      16                     FALSE             quis vehicula      Kate                   16                   06.01.2013 21:19      quis vehicula    3
John~17    07.01.2013 01:44           John                      17                     TRUE              penatibus et       John                   17                   07.01.2013 01:41      penatibus et     2
Mike~17    07.01.2013 06:11           Mike                      17                     TRUE              at elit            Mike                   17                   07.01.2013 06:11      at elit          52
Kate~17    07.01.2013 07:39           Kate                      17                     FALSE             felis tellus       Kate                   17                   07.01.2013 07:36      felis tellus     58
John~18    07.01.2013 09:11           John                      18                     TRUE              turpis at          John                   18                   07.01.2013 09:09      turpis at        45
Mike~18    07.01.2013 14:32           Mike                      18                     FALSE             Suspendisse a      Mike                   18                   07.01.2013 14:31      Suspendisse a    39
Kate~18    07.01.2013 18:20           Kate                      18                     TRUE              sit amet           Kate                   18                   07.01.2013 18:18      sit amet         30
. . .
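
As an illustration of the delivery-time calculation mentioned above, here is a minimal sketch that reads the produced CSV and prints how long each spam message took to be detected. The file name, the column positions, and the assumption that timestamps appear in the CSV exactly as captured from the logs (yyyy-MM-dd HH:mm:ss) are mine, based on the configuration above:

using System;
using System.Globalization;
using System.IO;

class DeliveryTimeDemo
{
    static void Main()
    {
        // Assumed column layout, matching the result table shown above:
        // 0=Key, 1=recipient.timeReceived, 4=recipient.isSpam, 8=sender.timeSent.
        // Titles in this sample contain no commas, so a plain Split is enough.
        foreach (var line in File.ReadLines(@"..\..\sample\result-22-10-2013.csv"))
        {
            var cells = line.Split(',');
            if (cells.Length < 9)
                continue;

            DateTime received, sent;
            if (DateTime.TryParseExact(cells[1], "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture, DateTimeStyles.None, out received)
                && DateTime.TryParseExact(cells[8], "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture, DateTimeStyles.None, out sent)
                && string.Equals(cells[4], "True", StringComparison.OrdinalIgnoreCase))
            {
                Console.WriteLine("{0}: spam detected {1} after sending", cells[0], received - sent);
            }
        }
    }
}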

Conclusion

If you have any questions, bug reports, or suggestions regarding either this tip or the application, you're welcome to email me at dev.yuriy.n@gmail.com. Also, if you find the application useful, it would be nice to hear about it.

License

This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Yuriy Nelipovich
Software Developer CactusSoft
Belarus
