
How to generate full visitor count from an Apache log file

Gabor Szabo (szabgab), 16 Jan 2012 CPOL
Count how many hits were generated from each IP address and show the top 10 sources.

Introduction

In the previous article, I described how to create a report from an Apache log file for the number of hits from localhost vs. elsewhere. That script can be easily changed to provide a report for any single IP address vs. the rest of the world just by replacing the IP address with another address.

It can also be changed to produce a full visitor count, showing how many hits came from each IP address. From there it is easy to show the top 10 sources, or to filter them in some other way.

Background

Just to recall, in the default format each line of an Apache log file starts like this:

 127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
 127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
 139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] ...
 217.1.20.22 - - [10/Apr/2007:10:40:54 +0300] ...

That means that if we take any single line and put it in the $line variable, we can extract the IP address (everything before the first space) with the following code:

my $length = index ($line, " ");
my $ip = substr($line, 0, $length);
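As a rough sketch, the same two calls can be wrapped in a small helper. The name extract_ip and the guard for lines without any space are assumptions added for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return the text before
# the first space, or nothing (undef) when the line contains no space.
sub extract_ip {
    my ($line) = @_;
    my $length = index($line, " ");    # position of the first space, -1 if none
    return if $length < 0;
    return substr($line, 0, $length);
}

my $sample = '139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] ...';
my $ip = extract_ip($sample);
print "$ip\n";    # prints 139.12.0.2
```

Without the guard, index returns -1 for a line that has no space, and substr($line, 0, -1) would then return the whole line minus its last character, which is probably not what we want to count.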

Using the code

In order to count an arbitrary set of strings, we need a data structure that can map strings to scalar values. In Perl, this data structure is called an "associative array" or, in short, a "hash". In other languages, a similar thing might be called a map, a dictionary, or a look-up table.

A hash is basically an unordered set of key-value pairs, where the keys are unique strings and the values can be any scalar value (number, string, or a reference).

In Perl, a hash is marked with the percent sign (%). So we declare the %count hash to hold the IP to "number of hits" mapping. Most of the code is the same as in the previous example, but instead of increasing two separate scalars, we increase the elements of the hash using the following construct:

$count{$ip}++;

When we encounter an IP address for the first time, $count{$ip} does not exist yet, and reading a missing hash element yields "undef". When undef is used in a numerical operation such as the ++ auto-increment, it is treated as the number 0 (and ++ does this without emitting a warning). The increment turns it into 1 and the operation also creates the appropriate entry in the hash, so the key-value pair automatically springs into existence. This is often loosely called auto-vivification, although strictly speaking that term describes Perl creating nested data structures on demand when an undefined value is dereferenced.
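A minimal sketch of that behaviour, using a single hard-coded address purely for illustration:

```perl
use strict;
use warnings;

my %count;
my $ip = '127.0.0.1';

# Before the first increment the key is not in the hash at all.
print exists($count{$ip}) ? "present\n" : "absent\n";    # absent

$count{$ip}++;    # undef is treated as 0, so the value becomes 1
$count{$ip}++;    # now 2

print exists($count{$ip}) ? "present\n" : "absent\n";    # present
print "$ip   $count{$ip}\n";                             # 127.0.0.1   2
```

Note that no warning is printed even though "use warnings" is on: incrementing an undefined value with ++ is one of the operations Perl deliberately lets through quietly.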

As you can see, the hash grows automatically. Perl does all the memory management.

Once this is done, we'll have a hash in which each key is an IP address and each value is the number of times that IP address appears in the file. The keys function gets a hash as a parameter and returns the unordered list of keys of the hash. This code will print all the IP addresses with the corresponding number of hits:

foreach my $ip (keys %count) {
    print "$ip   $count{$ip}\n";
}

Code

The full script is here:

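#!/usr/bin/perl
use strict;
use warnings;

# The log file name is the first command-line argument.
my $file = shift or die "Usage: $0 FILENAME\n";
open my $fh, '<', $file or die "Could not open '$file': $!";

my %count;    # maps IP address => number of hits

while (my $line = <$fh>) {
    # The IP address is everything before the first space.
    my $length = index($line, " ");
    next if $length < 0;    # skip lines that contain no space (e.g. blank lines)
    my $ip = substr($line, 0, $length);
    $count{$ip}++;
}
close $fh;

# keys returns the IP addresses in no particular order.
foreach my $ip (keys %count) {
    print "$ip   $count{$ip}\n";
}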
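To show what the output looks like, here is a sketch that runs the same counting loop over the four sample lines from the Background section. Reading from an in-memory string instead of a real log file, and sorting the keys before printing, are assumptions made only so that the example is self-contained and its output is repeatable.

```perl
use strict;
use warnings;

# The four sample lines from the Background section, held in a string so the
# example needs no log file on disk.
my $log = <<'END_LOG';
127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] ...
217.1.20.22 - - [10/Apr/2007:10:40:54 +0300] ...
END_LOG

# Perl can open a reference to a scalar as if it were a file.
open my $fh, '<', \$log or die "Could not open in-memory log: $!";

my %count;
while (my $line = <$fh>) {
    my $length = index($line, " ");
    my $ip = substr($line, 0, $length);
    $count{$ip}++;
}
close $fh;

# Sorted by key only so that the output order is repeatable.
foreach my $ip (sort keys %count) {
    print "$ip   $count{$ip}\n";
}
# Output:
# 127.0.0.1   2
# 139.12.0.2   1
# 217.1.20.22   1
```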

Points of interest

Of course it would be nicer to have them sorted and this code will do it:

foreach my $ip (sort keys %count) {
    print "$ip   $count{$ip}\n";
}

But this sorts the IP addresses as strings, character by character based on the ASCII table, so for example 10.0.0.1 comes before 9.0.0.1. Probably not very interesting.
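If a numeric ordering of the addresses themselves is ever wanted, one possible approach is sketched below. It assumes well-formed IPv4 addresses only; the helper name ip_key and the sample addresses are made up for illustration.

```perl
use strict;
use warnings;

my @ips = ('9.0.0.1', '10.0.0.1', '127.0.0.1', '10.0.0.20', '10.0.0.3');

# Default sort compares strings character by character.
my @as_strings = sort @ips;

# Hypothetical helper: pack the four octets into a 4-byte string, so that a
# plain string comparison of the packed values matches numeric IP order.
sub ip_key {
    my ($ip) = @_;
    return pack('C4', split /\./, $ip);
}

my @as_numbers = sort { ip_key($a) cmp ip_key($b) } @ips;

print "string order:  @as_strings\n";
print "numeric order: @as_numbers\n";
# Output:
# string order:  10.0.0.1 10.0.0.20 10.0.0.3 127.0.0.1 9.0.0.1
# numeric order: 9.0.0.1 10.0.0.1 10.0.0.3 10.0.0.20 127.0.0.1
```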

A better sorting might be this:

foreach my $ip (reverse sort { $count{$a} <=> $count{$b} } keys %count) {
    print "$ip   $count{$ip}\n";
}

Here we sort the keys according to the corresponding values and then reverse the order to get the IPs with the largest numbers first. Let's take the expression apart:

reverse sort { $count{$a} <=> $count{$b} } keys %count

You can sort any list of strings.

sort @strings;

By default, sort compares the values as strings, based on the ASCII table.

You can also sort them using any other condition. E.g., the length of the strings:

sort { length($a) <=> length($b) } @strings;

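The sort() function of Perl takes any two values it wants to compare, puts them in the two variables $a and $b, and evaluates the block. The block must return a negative number, zero, or a positive number to say whether $a belongs before $b, the two are equal, or $a belongs after $b; the <=> operator returns exactly that for numbers.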
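To see the comparison block in isolation, here is a small sketch with made-up strings of distinct lengths. It first prints what the two comparison operators return, then sorts by length.

```perl
use strict;
use warnings;

# <=> compares numbers, cmp compares strings; both return -1, 0, or 1.
print 2 <=> 10, "\n";        # -1, because 2 is numerically smaller than 10
print '2' cmp '10', "\n";    # 1, because the string "2" sorts after "10"

# Made-up sample data; the lengths are all different, so there are no ties.
my @strings = ('pear', 'fig', 'banana', 'clementine');

my @by_length = sort { length($a) <=> length($b) } @strings;
print "@by_length\n";        # fig pear banana clementine
```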

sort { $count{$a} <=> $count{$b} } keys %count

This code does the same, but it sorts the keys of the hash, and when comparing two keys the block compares the values stored under them. The result is in increasing order of hits, so if we would like to display the IP addresses with the most hits first, we need to reverse the result:

reverse sort { $count{$a} <=> $count{$b} } keys %count
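An alternative worth knowing, sketched here with made-up counts, is to swap $a and $b inside the block so that sort produces decreasing order directly and no reverse is needed. The extra "or $a cmp $b" is an assumption added for illustration: it breaks ties by address so that IPs with the same number of hits always come out in the same order.

```perl
use strict;
use warnings;

# Made-up counts, for illustration only.
my %count = (
    '127.0.0.1'   => 2,
    '139.12.0.2'  => 1,
    '217.1.20.22' => 1,
    '10.0.0.7'    => 5,
);

# $b before $a gives decreasing order of hits; when the hit counts are equal,
# <=> returns 0 and the string comparison of the addresses decides.
my @by_hits = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

foreach my $ip (@by_hits) {
    print "$ip   $count{$ip}\n";
}
# Output:
# 10.0.0.7   5
# 127.0.0.1   2
# 139.12.0.2   1
# 217.1.20.22   1
```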

In the last example we do the same, but while displaying we use a helper variable to limit the output to the top two IP addresses. Set $top to 10 to get the top 10 sources.

my $top = 2;
foreach my $ip (reverse sort { $count{$a} <=> $count{$b} } keys %count) {
    print "$ip   $count{$ip}\n";
    $top--;
    if ($top <= 0) {
        last;
    }
}
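The same limit can also be expressed with an array slice instead of a counter. This is only a sketch with made-up counts; it sorts in decreasing order by swapping $a and $b in the block (with addresses as a tie-breaker) and guards against asking for more entries than the hash holds.

```perl
use strict;
use warnings;

# Made-up counts, for illustration only.
my %count = (
    '127.0.0.1'   => 2,
    '139.12.0.2'  => 1,
    '217.1.20.22' => 1,
    '10.0.0.7'    => 5,
);
my $top = 2;

# Decreasing order of hits, ties broken by address.
my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

# Index of the last element to keep: $top - 1, or the end of the list
# when there are fewer than $top addresses.
my $last = $top < @sorted ? $top - 1 : $#sorted;
my @top_ips = @sorted[0 .. $last];

print "$_   $count{$_}\n" for @top_ips;
# Output:
# 10.0.0.7   5
# 127.0.0.1   2
```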

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Gabor Szabo (szabgab)
Instructor / Trainer, Self Employed
Israel
I started programming on a Casio calculator running some mini-BASIC in 1982 while I was in high school in Budapest, Hungary.

Since then I have switched hardware, programming language, and country of residence.

Today I live in Israel and provide Perl training all over the world, both by traveling to clients and online as a video course under the "Perl Maven" brand. I also run a weekly newsletter about Perl called "Perl Weekly" and initiated the writing of an IDE for Perl in Perl called "Padre, the Perl IDE".

Comments and Discussions

 
My vote of 3 (Jonathan [Darka], 25-Jan-12 4:29)
You really should show the output of the script and maybe show it can be customized/used to make it useful.
Suggestion: Warning about Apache log format (Peter_in_2780, 16-Jan-12 13:50)
Re: Warning about Apache log format (Gabor Szabo (szabgab), 16-Jan-12 20:15)
Re: Warning about Apache log format (Peter_in_2780, 16-Jan-12 20:44)

Last Updated 17 Jan 2012
Article Copyright 2012 by Gabor Szabo (szabgab)