How to generate full visitor count from an Apache log file

Gabor Szabo (szabgab)

16 Jan 2012

Count how many hits were generated from each IP address and show the top 10 sources.

Introduction

In the previous article, I described how to create a report from an Apache log file showing the number of hits from localhost vs. elsewhere. That script is easily adapted to report on any single IP address vs. the rest of the world: just replace the IP address with another one.

It can also be changed to provide a full visitor count, showing how many hits came from each IP address. From there it is easy to show the top 10 sources, or to filter them in some other way.

Background

As a reminder, in the default format each line of an Apache log file starts like this:

127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] ...
217.1.20.22 - - [10/Apr/2007:10:40:54 +0300] ...

That means that if we take any single line and put it in the $line variable, we can extract the IP address, which is everything before the first space, with the following code:

PERL

my $length = index ($line, " ");
my $ip = substr($line, 0, $length);
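As a quick check, the two lines above can be wrapped in a small helper and tried on a sample line. This is only a sketch: the sub name extract_ip and the sample line are made up for illustration, and the guard for a line without any space is an added assumption, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text before the first space,
# or undef (in scalar context) if the line contains no space at all.
sub extract_ip {
    my ($line) = @_;
    my $length = index($line, " ");
    return if $length < 0;    # index returns -1 when nothing is found
    return substr($line, 0, $length);
}

my $sample = '139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] ...';
my $ip = extract_ip($sample);
print "$ip\n";    # prints 139.12.0.2
```

Without the guard, a line with no space would make index return -1, and substr would then silently return the line minus its last character.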

Using the code

In order to count an arbitrary set of strings, we need a data structure that can map strings to scalar values. In Perl, this data structure is called an "associative array" or, for short, a "hash". In other languages, a similar thing might be called a map, a dictionary, or a look-up table.

A hash is basically an unordered set of key-value pairs, where the keys are unique strings and the values can be any scalar value (number, string, or a reference).

In Perl, a hash is marked with the percent sign (%). So we declare the %count hash to hold the mapping from IP address to number of hits. Most of the code is the same as in the previous example, but instead of incrementing two separate scalars, we increment the elements of the hash using the following construct:

PERL

$count{$ip}++;

When we encounter an IP address for the first time, $count{$ip} does not exist yet. If a value is not there yet, Perl treats it as "undef". When undef is used in a numerical operation such as the ++ auto-increment, it behaves like the number 0, so the result is 1, and the operation also creates the appropriate entry in the hash. The key-value pair automatically springs into existence. This is also called auto-vivification.

As you can see, the hash grows automatically. Perl does all the memory management.
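The behavior described above can be observed with a few made-up addresses. In this sketch the @ips list is sample data invented for the illustration; exists is the built-in that tells whether a key is already in the hash.

```perl
use strict;
use warnings;

my %count;
my @ips = ('127.0.0.1', '127.0.0.1', '139.12.0.2');    # made-up sample data

foreach my $ip (@ips) {
    # exists is false the first time an address is seen
    print "first hit from $ip\n" if not exists $count{$ip};
    $count{$ip}++;    # a missing value counts as 0, and no warning is issued
}

print "127.0.0.1 was seen $count{'127.0.0.1'} times\n";
```

Note that even with "use warnings" enabled, incrementing a missing element with ++ does not trigger an "uninitialized value" warning.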

Once this is done, we'll have a hash in which each key is an IP address and each value is the number of times that IP address appears in the file. The keys function takes a hash as a parameter and returns the list of its keys in no particular order. This code will print all the IP addresses with the corresponding number of hits:

PERL

foreach my $ip (keys %count) {
    print "$ip   $count{$ip}\n";
}

Code

The full script is here:

PERL

#!/usr/bin/perl
use strict;
use warnings;

my $file = shift or die "Usage: $0 FILENAME\n";
open my $fh, '<', $file or die "Could not open '$file': $!";

my %count;

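while (my $line = <$fh>) {
    # The IP address is everything before the first space
    my $length = index ($line, " ");
    next if $length < 0;    # skip lines that contain no space
    my $ip = substr($line, 0, $length);
    $count{$ip}++;
}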

foreach my $ip (keys %count) {
    print "$ip   $count{$ip}\n";
}
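To try the counting loop without a real log file, the same logic can be run against an in-memory filehandle. This is a sketch under a few assumptions: the sub name count_hits is hypothetical, the log lines are the sample lines from the Background section, and the result is returned as a hash reference rather than a plain hash.

```perl
use strict;
use warnings;

# Hypothetical helper: counts hits per IP from any readable filehandle.
sub count_hits {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        my $length = index($line, " ");
        next if $length < 0;    # skip lines without a space
        $count{ substr($line, 0, $length) }++;
    }
    return \%count;
}

# Sample data, read through an in-memory filehandle.
my $log = <<'END';
127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] ...
END

open my $fh, '<', \$log or die "Could not open in-memory file: $!";
my $count = count_hits($fh);
close $fh;

foreach my $ip (sort keys %$count) {
    print "$ip   $count->{$ip}\n";
}
```

Passing the filehandle to a sub keeps the counting logic separate from the question of where the lines come from.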

Points of interest

Of course, it would be nicer to have the output sorted, and this code will do it:

PERL

foreach my $ip (sort keys %count) {
    print "$ip   $count{$ip}\n";
}

But this sorts the IP addresses as strings, character by character, according to the ASCII table, so for example 139.12.0.2 comes before 27.0.0.1. That is probably not very interesting.
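If ordering by address is what you actually want, one possible approach is to compare the addresses octet by octet instead of as plain strings. The sketch below assumes every key is a well-formed IPv4 address (it is not suitable for host names or IPv6); the sub name ip_key and the sample counts are made up for the illustration.

```perl
use strict;
use warnings;

# Made-up sample counts.
my %count = (
    '217.1.20.22' => 1,
    '127.0.0.1'   => 2,
    '27.0.0.1'    => 5,
    '139.12.0.2'  => 1,
);

# Hypothetical helper: pack the four octets into 4 bytes. Comparing
# these byte strings with cmp orders the addresses numerically,
# octet by octet.
sub ip_key {
    my ($ip) = @_;
    return pack('C4', split /\./, $ip);
}

my @sorted = sort { ip_key($a) cmp ip_key($b) } keys %count;
print "$_   $count{$_}\n" for @sorted;
# Order: 27.0.0.1, 127.0.0.1, 139.12.0.2, 217.1.20.22
```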

A better sorting might be this:

PERL

foreach my $ip (reverse sort { $count{$a} <=> $count{$b} } keys %count) {
    print "$ip   $count{$ip}\n";
}

Here we sort the keys according to their corresponding values and then reverse the order to get the IPs with the largest numbers first. This is the sorting expression on its own; let's take it apart:

PERL

reverse sort { $count{$a} <=> $count{$b} } keys %count

You can sort any list of strings.

PERL

sort @strings;

By default, sort compares the values pairwise as strings, based on the ASCII table.

You can also sort them using any other condition, for example by the length of the strings:

PERL

sort { length($a) <=> length($b) } @strings;

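The sort() function of Perl will take any two values it wants to compare, put them in the two variables $a and $b, and evaluate the block. The block has to return a negative number, zero, or a positive number, depending on whether $a should come before, together with, or after $b; this is exactly what the <=> operator returns for numbers. Based on the result, sort will either keep the order of the two values or swap them.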
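A small, self-contained illustration of the two sort calls above, using made-up strings of different lengths so the result is unambiguous:

```perl
use strict;
use warnings;

my @strings = ('pear', 'fig', 'banana', 'apple');    # made-up sample data

# Default: string comparison
my @by_ascii = sort @strings;

# Custom block: numeric comparison of the lengths
my @by_length = sort { length($a) <=> length($b) } @strings;

print "@by_ascii\n";     # apple banana fig pear
print "@by_length\n";    # fig pear apple banana

# <=> returns -1, 0 or 1
printf "%d %d %d\n", 3 <=> 5, 5 <=> 5, 7 <=> 5;    # -1 0 1
```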

PERL

sort { $count{$a} <=> $count{$b} } keys %count

This code works the same way, but it sorts the keys of the hash, and when comparing two keys, the block compares the values that belong to them. The result will be in increasing order, so if we would like to display the IP with the biggest number of hits first, we need to reverse the result:

PERL

reverse sort { $count{$a} <=> $count{$b} } keys %count
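As an alternative to reverse, the same decreasing order can be obtained by swapping $a and $b inside the block. The sketch below also adds a tie-breaker, which is an addition for illustration and not part of the original script: addresses with the same number of hits are ordered as strings. The sample counts are made up.

```perl
use strict;
use warnings;

# Made-up sample counts.
my %count = (
    '127.0.0.1'   => 2,
    '139.12.0.2'  => 1,
    '217.1.20.22' => 1,
);

# Swapping $a and $b gives decreasing order directly; the "or" part
# is only evaluated when the counts are equal (<=> returned 0).
my @by_hits = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

print "$_   $count{$_}\n" for @by_hits;
# Order: 127.0.0.1, 139.12.0.2, 217.1.20.22
```

The tie-breaker makes the output the same on every run, which the order returned by keys alone does not guarantee.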

In the last example, we do the same, but when displaying the results we use a helper variable to limit the output to the top two IP addresses. Set $top to 10 to get the top 10 sources.

PERL

my $top = 2;
foreach my $ip (reverse sort { $count{$a} <=> $count{$b} } keys %count) {
    print "$ip   $count{$ip}\n";
    $top--;
    if ($top <= 0) {
        last;
    }
}
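Another way to limit the output, shown here only as a sketch, is to sort first and then take an array slice of the first $top elements. The sample counts and the @sorted and @top_ips variable names are made up for the illustration.

```perl
use strict;
use warnings;

# Made-up sample counts.
my %count = (
    '127.0.0.1'   => 2,
    '139.12.0.2'  => 3,
    '217.1.20.22' => 1,
);

my $top = 2;

# Most hits first; ties ordered by address as a string.
my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

# Do not ask for more elements than there are.
$top = @sorted if $top > @sorted;

my @top_ips = @sorted[0 .. $top - 1];

print "$_   $count{$_}\n" for @top_ips;
# Prints 139.12.0.2 (3 hits), then 127.0.0.1 (2 hits)
```

The clamp on $top avoids undefined elements in the slice when the log contains fewer distinct addresses than requested.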

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Gabor Szabo (szabgab)

Instructor / Trainer Self Employed

Israel

I started programming on a Casio calculator running some mini-BASIC in 1982, while I was in high school in Budapest, Hungary.

Since then I have switched hardware, programming language, and country of residence.

Today I live in Israel and provide Perl training all over the world, both by traveling to clients and online as a video course under the "Perl Maven" brand. I also run a weekly newsletter about Perl called "Perl Weekly" and initiated the writing of an IDE for Perl in Perl called "Padre, the Perl IDE".
