Azure WebState

9 Jun 2013, CPOL
Crawling tons of (individual) web information and creating statistics using Windows Azure.

Please note

This article is an entry in our Windows Azure Developer Challenge. Articles in this sub-section are not required to be full articles so care should be taken when voting.

Azure WebState

The project is available at http://azurewebstate.azurewebsites.net/.

Introduction

Windows Azure is a cloud computing platform and infrastructure. It provides both platform-as-a-service (PaaS) and infrastructure-as-a-service (IaaS) models and supports many different programming languages (C#, C++, Java, JavaScript, Python, ...), tools (Visual Studio, Command Line, Git, Eclipse, ...) and frameworks (.NET, Zend, Node.js, ...), as well as different operating systems (Windows Server, SUSE, Ubuntu, OpenLogic, ...) for virtual machines. There are several reasons to pick Windows Azure instead of classical web hosting. One reason is certainly the wide geographic distribution of the data centers. The CDN nodes are currently located in 24 countries.

In this contribution to the "Windows Azure Developer Challenge" contest I will present all the steps required to develop a fully fledged cloud-based application that uses the scaling and load-balancing features of the Windows Azure platform. We will see how easy (or hard? hopefully not!) it is to set up a configuration that uses several key features of Microsoft's cloud platform:

  • Hosting a website and deploying it with integrated source control in the form of Git
  • Using a (SQL) database for storing relational data
  • Installing third-party software for different needs, like MongoDB for document storage
  • Setting up a virtual machine as a worker, with scaling capabilities to increase the number of workers on request

Before we can go into the exact details of my idea (and the implementation), we should have a look at my Azure account.

My Azure Account

The rules for this contest read: If you don't register you will not be eligible for the competition. Please ensure you sign up for your trial using this link so we can tell who's signed up.

That being said it is pretty obvious that one has to register. Following the given link we end up on the page windowsazure.com/en-us/pricing/free-trial (and some affiliate network parameter). The trial account would give us the following abilities for 3 months (for free):

  • virtual machines & cloud services / 750 compute hours per month
  • SQL server / 750 hours of web, standard, or enterprise
  • web sites / 10 web sites
  • mobile services / 10 mobile services
  • relational database / 1 SQL database
  • SQL reporting / 100 hours per month
  • storage / 70 GB with 50,000,000 storage transactions
  • backup / 20 GB
  • data transfer / unlimited inbound & 25 GB outbound
  • media services encoding / 50 GB (input & output combined)
  • cdn / 20 GB outbound with 500,000 transactions
  • cache / 128 MB
  • service bus / 1,500 relay hours and 500,000 messages

That's pretty cool stuff! The 750 compute hours per month are slightly more than a full 31-day month of raw computing power (31 days × 24 hours = 744 hours). This is enough to have one virtual machine running all the time (actually doing some work, not being idle or powered off). We also get 10 web sites for free and one SQL server running the whole month. The storage and the CDN traffic allowance are also sufficient for a quite powerful setup in the cloud.

Having logged in with my Microsoft account (formerly known as Microsoft Passport, Live ID or Windows Live Account), I was offered an upgrade. Being a Microsoft MVP for Visual C# has the positive side effect of having an MSDN and TechNet subscription. This also gives me a Windows Azure MSDN - Visual Studio Ultimate subscription on Windows Azure. This package has the following properties:

  • virtual machines & cloud services / 1500 compute hours per month
  • SQL server / 1500 hours of web, standard, or enterprise
  • web sites / 10 web sites
  • mobile services / 10 mobile services
  • relational database / 5 SQL databases
  • SQL reporting / 100 hours per month
  • storage / 90 GB with 100,000,000 storage transactions
  • backup / 40 GB
  • data transfer / unlimited inbound & 40 GB outbound
  • media services encoding / 100 GB (input & output combined)
  • cdn / 35 GB outbound with 2,000,000 transactions
  • cache / 128 MB
  • service bus / 3,000 relay hours and 1,000,000 messages

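Compared with the trial, everything except the number of web sites and mobile services, the SQL reporting hours and the cache size has been increased. So I get more or less twice the computing power of the free trial, which is not bad. Let's go on to discuss my idea and the possible features of its upcoming implementation.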

The main idea

My project carries the name Azure WebState and represents an Azure-based web statistics creator / data crawler. What does that mean? Over the past month I've built a fully functional HTML5 and CSS3 parser. The project is about to be released (as open source), with a CodeProject article to follow. I tried to implement the full DOM (DOM Level 3, and partly DOM Level 4), which means that once an HTML document has been parsed, one is able to query elements with QuerySelector() or QuerySelectorAll(). Of course methods like GetElementById() and others are implemented as well.

How is this library useful for this project? Let's understand the big picture, before we go into details:

The scheme of Azure WebState

What I am trying to build is an MVC 4 webpage that works mostly with the Web API. Of course there is a visible front-end, which uses part of the publicly available API and part of the API that is only available internally. The API can be used for various things:

  • Getting information (statistics) for a certain webpage
  • Getting information of a publicly available statistic view
  • Getting information on a restricted statistic view
  • Searching within a crawl list

A crawl list is a list of URLs on which the statistics are based. The page will come with a pre-defined list of about 100-500 of the most popular webpages (including Amazon, Bing, CodeProject, Facebook, Google, Netflix, StackOverflow, Twitter, Wikipedia, YouTube ...), however, users can register on the page (e.g. to get an API key) and set up their own crawl list (which could be based on the pre-defined list, but does not have to be).

The requirement of crawling pages, parsing them and creating statistics upon their data is also reflected in the database architecture. Instead of just using a relational (SQL based) database, this project will actually use a SQL and a NoSQL database. This is the relation between the two:

The relation between the relational and the NoSQL DB

While the relational database will store all relational data (like users and their crawl lists (one to many), crawl lists and their corresponding views (one to many), crawl lists and their entries (one to many), users and their settings (one to one) etc.), the NoSQL database will provide a kind of document storage.

We pick MongoDB for various reasons. A good reason is the availability of MongoDB on Windows Azure. Another reason is that MongoDB is based on JSON / BSON, with an in-built JavaScript API. This means that we are able to just return some queries directly from MongoDB to the client as raw JSON data.

The reason for picking a NoSQL database is explained quite quickly. We will have a (text) blob representing the HTML page for each crawl entry (maybe even more than one, if the history of a document is saved as well), and most probably other (text) blobs as well, since there could be zero to many CSS stylesheets attached to one document. So this is already not a fixed layout. The next reason is that the number of statistic entries might grow over time. In the beginning only the official statistics are gathered for one entry, however, one user could pick the same entry and request other statistics to be gathered as well. Therefore, all in all, we have to be able to easily expand the data on a per-entry basis. This is hard to do in a relational database (there are ways, but they are just not very efficient).

What's the purpose of the project? Crawling the web and creating statistics about it. How many elements are on a webpage? What's the average size of a webpage? What are the request times and what is the average parsing time? All this data will be saved and made available.

There will be tons of statistics on the webpage (available for everyone) and everyone will be able to create an account (over OpenID) and create / publish his own crawl-list(s) with statistic views.

What kinds of statistics will be covered? This is actually very open. Any statistic based on the HTML and CSS content of a webpage can be covered. Every user can set up other statistics to be determined. The pre-defined statistics include:

  • Number of elements (HTML)
  • Number of rules (CSS)
  • Response time (HTML)
  • Parsing time (HTML, CSS)
  • Percentages of elements (div, p, ...)
  • Percentages of style rules
  • Maximum nesting level in HTML
  • Ratio of information (text) to document size (HTML)
  • Number of images (HTML)
  • Number of links (HTML)
  • Number of different colors (CSS)
  • and many many more ...
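
A few of these statistics can be illustrated with a small sketch. This is not the project's code: the real crawler parses arbitrary HTML with its own parser, while the sketch below assumes well-formed (XHTML-like) markup so that System.Xml.Linq from the .NET base library can stand in for it. The class PageStatistics and all of its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of Azure WebState. It assumes well-formed
// markup so that XDocument can stand in for a real HTML parser.
public static class PageStatistics
{
    // Number of elements in the document (the root element included).
    public static int CountElements(XDocument doc)
    {
        return doc.Descendants().Count();
    }

    // Number of elements with a given tag name, e.g. "img" or "a".
    public static int CountByTag(XDocument doc, string tagName)
    {
        return doc.Descendants().Count(e =>
            string.Equals(e.Name.LocalName, tagName, StringComparison.OrdinalIgnoreCase));
    }

    // Maximum nesting level; the root element counts as level 1.
    public static int MaxLevel(XDocument doc)
    {
        return doc.Root == null ? 0 : Depth(doc.Root);
    }

    private static int Depth(XElement element)
    {
        int deepestChild = 0;
        foreach (XElement child in element.Elements())
            deepestChild = Math.Max(deepestChild, Depth(child));
        return 1 + deepestChild;
    }

    // Share of each tag name among all elements, in percent.
    public static Dictionary<string, double> TagPercentages(XDocument doc)
    {
        int total = CountElements(doc);
        return doc.Descendants()
            .GroupBy(e => e.Name.LocalName.ToLowerInvariant())
            .ToDictionary(g => g.Key, g => 100.0 * g.Count() / total);
    }
}
```

For the document `<html><body><div><p>a</p><p>b</p></div><img src='x.png'/></body></html>` this reports 6 elements, a maximum level of 4 and one image.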

There will also be statistics that go across all entries, like the percentage of CSS class names (it could be that a certain name is found a lot more often than others) or the most common media queries.

In theory (even though this is highly unlikely to be implemented during the contest) I could also extend the database with a tag directory, which enables searching the crawled content.

How will I attack the challenge? I will start with a front-end that shows a webpage and already contains everything required. In the next step I create the SQL-based relational database and wire up the webpage to it. Then it's time to set up a primary worker along with the MongoDB database. The primary worker will handle the union crawl list (the unification of all crawl lists, with distinct entries of course) and distribute the work among other workers (load balancing and scalability).
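
The idea of a union crawl list that is dealt out to several workers can be sketched in a few lines of LINQ. This is only an illustration under simple assumptions: URLs are compared as plain strings (no normalization) and the work is split round-robin. CrawlPlanner and its methods are hypothetical names, not taken from the project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the primary worker's planning step.
public static class CrawlPlanner
{
    // Merges all crawl lists into one list without duplicate URLs.
    // URLs are compared as plain strings, which is a simplification.
    public static List<string> Union(IEnumerable<IEnumerable<string>> crawlLists)
    {
        return crawlLists
            .SelectMany(list => list)
            .Distinct()
            .ToList();
    }

    // Deals the entries out to the workers in round-robin order.
    public static List<List<string>> Distribute(IList<string> entries, int workerCount)
    {
        if (workerCount < 1)
            throw new ArgumentOutOfRangeException("workerCount");

        var batches = new List<List<string>>();
        for (int i = 0; i < workerCount; i++)
            batches.Add(new List<string>());

        for (int i = 0; i < entries.Count; i++)
            batches[i % workerCount].Add(entries[i]);

        return batches;
    }
}
```

Round-robin keeps the batches equal in size up to one entry. A real scheduler would probably also take into account how long each page takes to download and parse.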

In the final step I will polish the API and create a mobile access experience that allows viewing the statistics offline and enables further abilities like notifications and more.

The challenges

In this section I am going to discuss how I experienced (and hopefully mastered) the various challenges. I will present code, screenshots and helpful resources that I've found on my way to the cloud.

First Challenge: Getting Started

This was an easy one, since I just had to follow the link (given above or on the challenge page) and upgrade to my MSDN Azure subscription. Everything went smoothly and my account was active within 2 minutes.

How Azure might benefit or change the way I do things today

Windows Azure makes me independent of constraints like a fixed hardware or software setup (if I need more computation power - I get it; if I need to run Linux for this life-saving tool - I power up a Linux VM). Azure provides the memory and computing power for scalable data-driven applications like WebState.

Second Challenge: Build a website

There are multiple ways to write and deploy webpages on Windows Azure. One of the best ways is to use ASP.NET MVC. On the one hand we can write the webpage with one of the most advanced and comfortable languages, C#; on the other hand we get the best tooling available in the form of Microsoft's Visual Studio.

I decided to go for a Single-Page Application with ASP.NET MVC 4. There are multiple reasons for picking this:

  • We get a lot of features that we would like to use anyway already integrated. Less work!
  • A part of the page makes heavy use of the API, as does the ToDo example (boilerplate) that comes with this project template.
  • OAuth is already included and the provided code-first database models are quite close to our target.
  • Knockout.js is included to make MVVM with binding on the API-driven elements easy.
  • The Web API is already included, with an area dedicated to showing help for using the API.

All in all if we go for the Single-Page Application project template we get a lot of benefits, which dramatically boost our development speed in this case.

Picking ASP.NET Project

Using OAuth

The first thing I had to do was to reconfigure some of the default settings. I started with the AuthConfig.cs in the App_Start folder. The class defined in this file is used at startup to do some of the OAuth configuration. My code looks like the following:

public static void RegisterAuth()
{
    OAuthWebSecurity.RegisterGoogleClient();

    OAuthWebSecurity.RegisterMicrosoftClient(
        clientId: /* ... */,
        clientSecret: /* ... */);

    OAuthWebSecurity.RegisterTwitterClient(
        consumerKey: /* ... */,
        consumerSecret: /* ... */);

    OAuthWebSecurity.RegisterYahooClient();
}

In order to get those codes I had to register the webpage on the developer services of Microsoft and Twitter. Luckily there was a document available at the ASP.NET webpage, that had direct links for those services.

Doing the registration at the Twitter developer homepage looked like the following:

Twitter Developer services

On the Microsoft homepage the procedure was quite similar, however, less obtrusive in my opinion. Here my input resulted in the following output from the webpage:

Microsoft Developer services

Now that everything was set up for doing OAuth I was ready to touch the provided models. This part of the competition does not yet involve the database (and we are still missing the VM, so no worker is available yet to produce statistic data), however, we can still do the whole relational mapping in code-first. This will be deployed using a Microsoft SQL express database without us caring much about it.

Codefirst with Entity Framework

As already said - in this part of the competition we do not care yet about real statistics, MongoDB (that will be part of the next challenge, along with Microsoft SQL) or crawling the data in one or many worker instances.

Let's have a look at the models prepared for some of the statistic / view work.

The codefirst approach for (part of) the statistics

Basically every user can have multiple views. Each view could have its own unique API key assigned, or share a key with other views. The API key is required only for external (i.e. API) access - if a user is logged in he can always access his views (even restricted ones).

Each view has multiple statistic items, i.e. data that describes what kind of statistics to get from the given crawl list. This data will be described in an SQL-like language. There is much more behind this concept, however, I will explain part of it in the next section when we introduce MongoDB and in the fourth section on the worker / VM. Here I just want to point out that the crawl entries are also present in MongoDB, where the statistic fields for each crawl item are stored. This is basically the union of all statistic fields for a given crawl item. We will see that MongoDB is a perfect fit for the resulting kind of data.

Each user can also manage crawl lists. He could create new crawl lists or use existing (public or his own private) ones. He could also create new crawl lists based on existing (public or his own private) ones.
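
To make the relations described above more tangible, here is a rough sketch of how such code-first style models could look. All class and property names are illustrative guesses (the real models are only shown as a diagram), and the CanAccess rule is just one possible reading of the API key description. The classes are plain C# objects, so the sketch compiles without a reference to the Entity Framework.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical models; names and properties are assumptions for illustration.
public class User
{
    public User()
    {
        CrawlLists = new List<CrawlList>();
        Views = new List<View>();
    }

    public int Id { get; set; }
    public string Name { get; set; }
    // One user has many crawl lists and many views.
    public virtual ICollection<CrawlList> CrawlLists { get; set; }
    public virtual ICollection<View> Views { get; set; }
}

public class CrawlList
{
    public CrawlList() { Entries = new List<CrawlEntry>(); }

    public int Id { get; set; }
    public string Name { get; set; }
    public bool IsPublic { get; set; }
    public virtual User Owner { get; set; }
    // One crawl list has many entries (URLs).
    public virtual ICollection<CrawlEntry> Entries { get; set; }
}

public class CrawlEntry
{
    public int Id { get; set; }
    public string Url { get; set; }
}

public class StatisticItem
{
    public int Id { get; set; }
    // The statistic description in the SQL-like language.
    public string Query { get; set; }
}

public class View
{
    public View() { Items = new List<StatisticItem>(); }

    public int Id { get; set; }
    public string ApiKey { get; set; }
    public virtual User Owner { get; set; }
    public virtual CrawlList CrawlList { get; set; }
    public virtual ICollection<StatisticItem> Items { get; set; }

    // The logged-in owner may always read the view; everybody else
    // needs the matching API key (assumed rule, see lead-in).
    public bool CanAccess(User loggedInUser, string apiKey)
    {
        if (loggedInUser != null && Owner != null && loggedInUser.Id == Owner.Id)
            return true;
        return !string.IsNullOrEmpty(ApiKey) && ApiKey == apiKey;
    }
}
```

The integer Id properties and the virtual collection properties follow the usual code-first conventions for keys and navigation properties.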

Since there is no worker yet (and no document store in the form of MongoDB) and all data depends highly on the worker, the most dynamic part of the webpage will be left disabled for the moment.

Deploying on Windows Azure

After logging in to the Windows Azure Management center we just have to click on New at the bottom of the screen. Now we can select the website entry and go for a quick creation. Entering the URL is all we need to do before the actual webpage is set up:

Create a new webpage

The setup process might take a few seconds. While the webpage is being created a loading animation is shown. After the webpage is created we can go back to Visual Studio and deploy our application.

The webpage is being created

For doing this efficiently we download a generated publish profile (from the Windows Azure Management center) and import it into Visual Studio. We could also publish the web application from FTP directly by setting up deployment credentials in the portal and pushing the application to Windows Azure from any FTP client, however, considering that we already use Visual Studio, why shouldn't we do it the easy way?

Finally we have everything in place! We can right click on the project in Visual Studio, select publish and choose to import the downloaded publishing profile in this dialog:

Publishing the page

It is very important that everything, i.e. also the generated XML file for the web API documentation, is included in the project. If the file is just placed in the (right) directory, it won't be published. Only files that are included in the project will be published. This is, of course, also true for content files like images and others.

Easter egg!

I will not give away the real easter egg (which is not that hard to find), but I want to announce a little easter egg that I've built in. If you open the source code of the webpage you will see a comment that shows some ASCII art, which is ... CodeProject's Bob (you guessed it)!

As far as responsive design goes: Right now the whole webpage has been created desktop-first. This is a statistics homepage meant for professional use, not for merely consuming data. The last challenge will transform the public statistics into something quite usable for mobile devices. Here is where stylesheet extensions and manipulations, as well as some features of ASP.NET MVC 4 (like user-agent detection), will shine.

Helpful resources

Of course there exist some helpful webpages that provide one or the other interesting tip regarding webpages with ASP.NET MVC (4), deploying webpages on Azure or others. I found the following resources quite helpful:

Third Challenge: Using SQL on Azure

I already created a SQL database in the last challenge - just to support the login possibility on the webpage. In this section I will extend the database, perform additional configuration and install MongoDB.

Let's start by installing MongoDB. MongoDB will be the document store for the whole solution. If it is still unclear why the project uses MongoDB then the following list of arguments will probably help:

  • High availability built in
  • Data replication and durability built in
  • We can scale up on demand
  • We can scale out on demand
  • Easily survives reboots of instances
  • Integration with Azure diagnostics

For Windows Azure we are using the standard MongoDB binaries. The code for these binaries is open source. When a MongoDB worker starts we have to perform the following steps:

  • We have to register change notifiers
  • We should mount the storage (blob)
  • We have to start the MongoDB service (mongod)
  • We have to run the cloud command

Of course we also need to perform some steps for stopping the service:

  • We have to step down
  • We should stop the mongod service
  • Finally we have to unmount the blobs

Challenges can be found in various areas. For instance debugging is not that easy. Also the IP potentially changes on reboot, since we do not have a fixed assigned machine (which is also the advantage of cloud computing). Keeping several sets of configurations in sync is also not that easy.

Luckily there is a good emulator that works great. Here cloud storage is emulated as locally mounted drives. When deploying MongoDB we have to perform the following steps:

  • Create storage account (get a key)
  • Create the service
  • Specify storage account (key) in solution

Most of this work is nowadays automated. The only decision we have to make is whether we want to deploy to platform-as-a-service or infrastructure-as-a-service. With IaaS we have the choice of installing it on a Windows VM or a Linux VM. With PaaS we do not care at all!

For using the Windows Azure command line utility, as well as for installing a VM that runs MongoDB, we will need a file with our publishing settings. This can be obtained from the Windows Azure webpage (windows.azure.com/download/publishprofile.aspx).

Download publish settings

Obviously (from a programmer's perspective) the choice between IaaS and PaaS does not matter much. We go for the VM-based (IaaS) solution, since there is a great (but, as we will see, probably outdated) installer tool available and we can (later on) adjust the OS to our needs. The upcoming OS adjustments are an important point, because they will allow us to have a much more direct connection to the running service. We will be able to clone complete VMs instead of messing around with multiple configurations. The installer is a command line utility (PowerShell script) that has to be run as an administrator.

MongoDB installer Azure

In order to run the script we also need to loosen the restriction on executing (PowerShell) scripts. By default script execution is restricted. By using the Set-ExecutionPolicy command we can set it to Unrestricted (which allows all scripts to run, i.e. no certificate or explicit permission required) for the duration of the installation. The current value can be obtained by using the Get-ExecutionPolicy command.

However, where there is light there is also shadow. The problem is that the installer depends on node.js and the (supplied) JavaScript file(s). Obviously the authors of the installer did not care about pinning correct versions, and neither did the authors of npm in general. Even though the concept of versions is incorporated, usually the dependencies are just downloaded in their latest versions. This is a huge problem, since the following statements cannot be executed any more:

var azure = require(input['lib'].value + '/azure');
var cli = require(input['lib'].value + '/cli/cli');
var utils = require(input['lib'].value + '/cli/utils');
var blobUtils = require(input['lib'].value + '/cli/blobUtils');

Therefore the script cannot execute and fails at this point:

MongoDB Azure Install Problem

The problem is that the azure package, on which the azure-cli package depends, changed a lot. In the end I searched for the desired equivalents and just (out of laziness) copied full paths into the require arguments:

var azure = require('.../azure-cli/node_modules/azure/lib/azure');
var cli = require('.../azure-cli/lib/cli');
var utils = require('.../azure-cli/lib/util/utils');
var blobUtils = require('.../azure-cli/lib/util/blobUtils');

With the change the script is now working as expected and we finally get to the next step!

MongoDB Azure Install Fixed

For this install we do not use replica sets. When the page grows there is a good chance that replica sets will come in handy. They provide fast modification access while having even faster, load-balanced read access. Also the system is much more robust and less prone to failures with data loss. The following scheme will be followed when dealing with replica sets.

MongoDB replica set

Before we look at how our web app can interact with the just created MongoDB instance, we should have a look at creating (a real) SQL database on Windows Azure. The process of adding a database itself is quite straightforward.

We start by logging into the Windows Azure management webpage. Then we just click on New and select Data Services, SQL Database. Now we could import a previously exported SQL database, create a database with custom options or quick create a database with the default options.

Windows Azure Create SQL

Usually picking the quick create option is sufficient. In our case we just need a persistent store for user data, which is one of the cases where a standard SQL database is a quite good fit. Our data fits perfectly in a pre-defined scheme and using the Entity Framework ORM we do not have to care much about SQL administration.

However, besides the classical way of using SQL Management Studio or similar, we can also administer our database from the Windows Azure webpage. A Silverlight plugin has been created, which allows us to do all the necessary management. When we click on Manage we will first be asked to create a firewall rule for our current IP. We can do that safely, but should remove this rule again at a later point in time.

Creating this firewall rule might take a few minutes. After the rule has been created we can log into the SQL management area.

Windows Azure SQL Management

Here we can create, edit or remove stored procedures, tables and views. Most of the tasks of a database administrator can be done with this Silverlight plugin. Here is a quick look at the tables of our database after publishing the webpage:

Windows Azure SQL Tables

All in all everything is set up by using the publish agent in Visual Studio. All we have to do is enter the connection string for our freshly created database and test the connection. Then this connection is automatically used for deployment. Required tables will be auto-generated and everything will be set up according to the rules detected by the Entity Framework.

In the next challenge we will then set up the worker and wire up the communication between MongoDB and our worker. Our worker will also have to communicate with the SQL database, which will also be discussed.

The current state of the web application is that users can register, log-on (or off), change their password or associated accounts. Data is already presented in a dummy-form. In the next stage we will do most of the work, which allows users to create their own crawl lists and views. We will also integrate the worker, which is the corner-stone of our application.

Fourth Challenge: Virtual Machines

In principle the last challenge also set up a VM (i.e. there is a VM already running at the moment). However, the challenge is more than just setting up a virtual machine. So in these paragraphs I will go into the details of what a VM is, how we can benefit from creating one and how we can create one. The last paragraphs will then be dedicated to configuring the system, administering it and installing our worker.

But not so fast! One thing that will also be discussed in detail is how the worker is actually written and what the worker is doing. After all, the worker is probably the most central piece of the whole application, since it creates the data that feeds the web application. Hence, following the discussion on how to set up and use a VM, we will go into the details of the worker application.

So let's dive right into virtual machines. Everyone already starts a kind of virtual machine when starting a browser (therefore you are currently already running one). A modern browser allows us to run webpages (sometimes also known as web applications) by supplying them with a set of APIs that offer threads, storage, graphics, network and more (everything that an OS offers us). If we think of the browser as an operating system, then we are running a virtual machine with it, since we know that only one operating system can run directly on the hardware at a time.

This implies that any other operating system is only virtual. What other operating systems see is a kind of virtual machine (not the real machine), since the machine is abstracted / modified from the real hardware (but also limited by it). So what exactly is a virtual machine? It is an abstraction layer that fakes a machine such that an arbitrary operating system could boot within an existing operating system.

We already see that this abstraction is somehow expensive. After all the whole cost must be paid somewhere. Every call from the system running in the VM to a memory address has to be mapped to the real memory address. Every call to system resources like graphic cards, USB ports and others is now indirect. On the other side there are several really cool benefits:

  • The system is very easy to sandbox.
  • System resources can be controlled quite easily.
  • System resources that appear like physical hardware may not exist or be a combination of several components.
  • The whole system can be duplicated very easily.
  • The system can be supervised and modified with less effort.

Windows Azure stands for two important things. On the one hand it is a synonym for Microsoft's outstanding infrastructure, with (huge) computing centers all around the world. On the other hand it is the name of the underlying operating system, which is specialized in managing the available computing power, load-balancing it and hosting virtual machines. Most computers in these computing centers run Windows Azure and can therefore host highly optimized virtual machines, which are as close to the real hardware as possible. However, they still have all the benefits of virtual machines.

This allows us to attach (virtual) hard drives in the form of storage disks, which can exceed the capacity of any single physical drive. The trick is that we access a bunch of drives in Microsoft's computing center at once without knowing it.

There are several ways to create a new VM to run in Windows Azure. The simplest way is to use the web interface to create one. The next image shows how this could be done.

Azure Virtual Machine Create

Another (more advanced) possibility would be to use the command line utility. The following snippet creates a new VM called my-vm-name, which uses a standard image called MSFT__Windows-Server-2008-R2-SP1.11-29-2011 with the username username:

azure vm create my-vm-name MSFT__Windows-Server-2008-R2-SP1.11-29-2011 username --location "West US" -r

Everything could be managed by the command line utility. This gives us also the option of uploading our own virtual machine (specified in the vhd file format). The advantage is that any VM could be duplicated. Therefore we could create a suitable configuration, test it on our own premises, upload it and then scale it up to quite a lot of instances.

The following snippet creates a new VM image called mytestimage from the file Sample.vhd:

azure vm image create mytestimage ./Sample.vhd -o windows -l "West US"

Coming back to our created VM we might first connect to it directly over the remote desktop protocol (RDP). We do not even need to open the remote desktop program or something similar, since Windows Azure already contains a direct link to a *.rdp file, which will contain the required configuration for us. Opening this usually yields the following warning:

Remote Certificate Missing

This warning could be turned off by installing the required certificates. For the moment we can ignore it. By just continuing with connecting to our virtual machine we will eventually be able to log on to our system, provided we enter the right credentials for the installed administrator account.

The next image shows the screen that can be captured directly after having successfully logged on to our own VM running on Windows Azure.

Remote Desktop Connection View

Now it's time to talk about the worker application. The application is a simple console program. There is really not much to say about the reasons for picking a console program. In fact it could be a service without any input or output, but having at least some information on screen can never be bad. The application will be deployed by copy / paste of a release folder. We can use the clipboard copy mechanism that is provided by the Windows RDP client.

The program itself is nearly as simple as the following code snippet:

static void Main(string[] args)
{
    //Everything will run in a task, hence the possibility for cancellation
    cts = new CancellationTokenSource();

    /* Evaluation of arguments */

    //The log function prints something on screen and logs it in the DB
    Log("Worker started.");

    //Connect to MongoDB using the official 10gen C# driver
    client = new MongoClient("mongodb://localhost");
    server = client.GetServer();
    db = server.GetDatabase("ds");

    //Obviously something is wrong
    if (server.State == MongoServerState.Disconnected)
    {
        Log("Could not connect to MongoDB.");
        //Ah well, there are plenty of options but I like this one most
        Environment.Exit(0);
    }

    Log("Successfully connected to MongoDB instance.");

    //This runs the hot (running) task
    var worker = Crawler.Run(db, cts.Token);

    //Just a little console app
    while (true)
    {
        Console.Write(">>> ");
        string cmd = Console.ReadLine();

        /* Command pattern */
    }

    //Make sure we closed the task
    cts.Cancel();
    Log("Worker ended.");   
}

Basically the main function just connects to the MongoDB instance and starts the crawler as a Task. One of the advantages of creating a console application is the possibility to interact with it in a quite simple and "natural" way - over the command line.
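The command handling inside the loop is elided in the snippet above. A minimal sketch of how such a dispatch could look is given below. The class name CommandLoop and the commands "exit" and "cancel" are assumptions made for illustration only; they are not taken from the original worker.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical command table for the worker console. The command names
// are illustrative and not taken from the original program.
public sealed class CommandLoop
{
    private readonly Dictionary<string, Action> commands;
    private bool running = true;

    public CommandLoop(CancellationTokenSource cts)
    {
        // Command names are matched case-insensitively
        commands = new Dictionary<string, Action>(StringComparer.OrdinalIgnoreCase)
        {
            // "exit" leaves the input loop; the caller cancels the task afterwards
            { "exit", () => running = false },
            // "cancel" requests cancellation of the crawler but keeps the console open
            { "cancel", () => cts.Cancel() }
        };
    }

    public bool IsRunning
    {
        get { return running; }
    }

    // Returns true if the command was known and has been executed
    public bool Execute(string input)
    {
        Action action;
        var key = (input ?? string.Empty).Trim();

        if (!commands.TryGetValue(key, out action))
            return false;

        action();
        return true;
    }
}
```

With such a class the endless while (true) loop could become while (loop.IsRunning), so that the cancellation and the final log call after the loop are actually reached.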

Before we go into the kernel of the crawler we need to take a look at the most important library for the whole project: AngleSharp. We get the library in the same way as all other libraries: over NuGet. AngleSharp is still far from finished, but its current state is sufficient for this project.

AngleSharp

The kernel of the whole crawler is executed by calling the static Run method. This is basically a big loop over all entries. This loop is wrapped in another loop, so that the process continues indefinitely. In principle we could also let the process idle after finishing the big loop until a certain condition is met. Such a condition could be that the big loop is only processed once per day, i.e. that the starting day is different from the current day.
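The once-per-day condition could be sketched as follows. This is not part of the worker shown in this article; the class name CrawlSchedule, both method names and the one minute polling interval are assumptions chosen for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original worker
public static class CrawlSchedule
{
    // True when a new pass over all entries should start, i.e. when the
    // calendar day of the last start differs from the current day
    public static bool IsNewPassDue(DateTime lastStart, DateTime now)
    {
        return lastStart.Date != now.Date;
    }

    // Idles until a new pass is due. The polling interval is an arbitrary
    // choice. Cancelling the token ends the wait with an exception that
    // derives from OperationCanceledException.
    public static async Task WaitUntilDue(DateTime lastStart, CancellationToken cancel)
    {
        while (!IsNewPassDue(lastStart, DateTime.Now))
            await Task.Delay(TimeSpan.FromMinutes(1), cancel);
    }
}
```

The outer loop would remember DateTime.Now before starting the big loop and await WaitUntilDue with that value once the loop has finished.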

Let's have a look at the Run method.

public class Crawler
{
    public static async Task Run(MongoDatabase db, CancellationToken cancel = new CancellationToken())
    {
        //Flag to break
        var continuation = true;
        //Don't consume too many (consecutive) exceptions
        var consecutivecrashes = 0;
        //Initialize a new crawler
        var crawler = new Crawler(db, cancel);
        Program.Log("Crawler initialized.");

        //Permanent crawling
        do
        {
            //Get all entries
            var entries = db.GetCollection<CrawlEntry>("entries").FindAll();

            //And crawl each of them
            foreach (var entry in entries)
            {
                try
                {
                    //Alright
                    await crawler.DoWork(entry);
                    //Apparently no crash - therefore reset
                    consecutivecrashes = 0;
                }
                catch (OperationCanceledException)
                {
                    //Cancelled - let's stop.
                    continuation = false;
                    break;
                }
                catch (Exception ex)
                {
                    //Ouch! Log it and increment consecutive crashes
                    consecutivecrashes++;
                    Program.Log("Crawler crashed with " + ex.Message + ".");

                    //We already reached the maximum number of allowed crashes
                    if (consecutivecrashes == MAX_CRASHES)
                    {
                        continuation = false;
                        Program.Log("Crawler faced too many (" + consecutivecrashes.ToString() + ") consecutive crashes.");
                        break;
                    }

                    continue;
                }
            }
        }
        while (continuation);

        Program.Log("Crawler ended.");
    }

    /* Crawler Instance */
}

Nothing too spectacular here. The method creates a new instance of the crawler class and calls the asynchronous DoWork method for each entry. DoWork relies on the MongoDB database instance, some other class members and a static variable. The static variable is marked ThreadStatic so that multiple kernels can run without interfering with each other.

//ThreadStatic only has an effect on static fields
[ThreadStatic]
static Stopwatch timer;

async Task DoWork(CrawlEntry entry)
{
    //Init timer if not done for this thread
    if(timer == null)
        timer = new Stopwatch();

    cancel.ThrowIfCancellationRequested();

    //Get response time for the request
    timer.Start();
    var result = await http.GetAsync(entry.Url);
    var source = await result.Content.ReadAsStreamAsync();
    timer.Stop();

    cancel.ThrowIfCancellationRequested();
    var response = timer.Elapsed;

    //Parse document
    timer.Restart();
    var document = DocumentBuilder.Html(source);
    timer.Stop();
    
    //Save the time that has been required for parsing the document
    var htmlParser = timer.Elapsed;
    cancel.ThrowIfCancellationRequested();

    //Get the stylesheets' content
    var stylesheet = await GetStylesheet(document);

    cancel.ThrowIfCancellationRequested();

    //Parse the stylesheet
    timer.Restart();
    var styles = CssParser.ParseStyleSheet(stylesheet);
    timer.Stop();

    var cssParser = timer.Elapsed;
    cancel.ThrowIfCancellationRequested();

    //Get all elements in a flat list
    var elements = document.QuerySelectorAll("*");

    //Get the (original) html text
    var content = await result.Content.ReadAsStringAsync();

    cancel.ThrowIfCancellationRequested();

    //Build the entity
    var entity = new DocumentEntry
    {
        SqlId = entry.SqlId,
        Url = entry.Url,
        Content = content,
        Created = DateTime.Now,
        Statistics = new BsonDocument(),
        Nodes = new BsonDocument(),
        HtmlParseTime = htmlParser.TotalMilliseconds,
        CssParseTime = cssParser.TotalMilliseconds,
        ResponseTime = response.TotalMilliseconds
    };

    //Perform the custom evaluation
    EvaluateNodes(entity.Nodes, elements);
    EvaluateStatistics(entity.Statistics, document, styles);

    timer.Reset();
    //Add to the corresponding MongoDB collection
    AddToCollection(entity);
}
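One caveat about the timing code above: in a console application there is usually no synchronization context, so the code after an await may resume on a different thread pool thread, where a ThreadStatic field can hold another value or none at all. A possible alternative is a Stopwatch held in a local variable. The following is only a sketch under that assumption; the helper name Timed.Measure is hypothetical and not part of the worker.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original worker
public static class Timed
{
    // Awaits the given operation and reports how long it took. The
    // Stopwatch is a local variable, so it is stored with the async state
    // machine and does not depend on the thread that resumes after the await.
    public static async Task<Tuple<T, TimeSpan>> Measure<T>(Func<Task<T>> operation)
    {
        var watch = Stopwatch.StartNew();
        var value = await operation();
        watch.Stop();
        return Tuple.Create(value, watch.Elapsed);
    }
}
```

In DoWork the request could then be wrapped as await Timed.Measure(() => http.GetAsync(entry.Url)), with Item1 holding the response and Item2 the elapsed time.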

The EvaluateNodes and EvaluateStatistics methods are not discussed in detail for now. Both evaluate the generated DOM and stylesheet. They use a DSL to perform the custom evaluations that can be entered by any registered user.

The output of the worker program is shown in the next image.

Azure WebState worker

The worker uses a small console trick to print new lines of information without interfering with the current user input. To achieve this, the Log method calls the MoveBufferArea method. The following (simplified) version shifts the buffer area by just one line; sometimes, however, more than one line is required to fit the new message in.

public static void Log(string msg)
{
    var left = Console.CursorLeft;
    var top = Console.CursorTop;

    Console.SetCursorPosition(0, top);
    Console.MoveBufferArea(0, top, Console.BufferWidth, 1, 0, ++top);

    var time = DateTime.Now;

    //The message is passed as an argument, since braces in msg would break the format string
    Console.WriteLine("[ {0:00}:{1:00} / {2:00}.{3:00}.{4:00} ] {5}", 
        time.Hour, time.Minute, time.Day, time.Month, time.Year - 2000, msg);
    Console.SetCursorPosition(left, top);
}
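For messages that do not fit into a single line, the number of rows to shift could be estimated from the length of the printed text and the buffer width. The sketch below is an assumption about how this might be done, not the code of the actual worker. The name LogLayout.RowsNeeded is hypothetical, and the exact cursor behavior when a line ends in the last column is not considered.

```csharp
using System;

// Hypothetical helper, not part of the original worker
public static class LogLayout
{
    // Number of console rows a text of the given length occupies when it
    // is wrapped at bufferWidth columns. An empty text still takes one row.
    public static int RowsNeeded(int textLength, int bufferWidth)
    {
        if (bufferWidth <= 0)
            throw new ArgumentOutOfRangeException("bufferWidth");

        if (textLength <= 0)
            return 1;

        // Integer ceiling of textLength / bufferWidth
        return (textLength + bufferWidth - 1) / bufferWidth;
    }
}
```

The Log method could then move the input line by RowsNeeded(text.Length, Console.BufferWidth) rows instead of a fixed single row, where text is the complete line including the time stamp.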

This concludes the discussion of the worker. In the next section we will continue to work on the webpage, which will then finally allow users to create their own crawl lists and set up their own statistics.

Fifth Challenge: Mobile access

This section is about to come.

Points of Interest

I have been highly interested in the Windows Azure platform for a long time. This contest is finally my chance to experiment a bit and get to know it better. I love that Scott Guthrie manages this team, since he's not only a great speaker, but also passionate about technology and very keen on creating amazing products. I recommend anyone who is interested in ASP.NET (history) or current Windows Azure happenings to check out the official blog at weblogs.asp.net/scottgu/.

History

  • v1.0.0 | Initial Release | 27.04.2013
  • v1.1.0 | Second challenge completed | 12.05.2013
  • v1.1.1 | Link added on top of the article | 17.05.2013
  • v1.2.0 | Third challenge completed | 26.05.2013
  • v1.3.0 | Fourth challenge completed | 09.06.2013

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Florian Rappl
Chief Technology Officer
Germany
Florian is from Regensburg, Germany. He started his programming career with Perl. After programming C/C++ for some years he discovered his favorite programming language C#. He did work at Siemens as a programmer until he decided to study Physics. During his studies he worked as an IT consultant for various companies.
 
Florian is also giving lectures in C#, HTML5 with CSS3 and JavaScript, and other topics. Having graduated from University with a Master's degree in theoretical physics he is currently busy doing his PhD in the field of High Performance Computing.

Last Updated 9 Jun 2013
Article Copyright 2013 by Florian Rappl
Everything else Copyright © CodeProject, 1999-2014