AJAX web scraping and interaction with digg

6 Sep 2006
Scraping sites with JavaScript-generated content to create a simple news viewer.

Sample Image - ajaxwebscrapingfordigg.jpg

Introduction

Interacting with sites that use JavaScript to generate content has, until now, been either very complex or almost impossible. This tutorial demonstrates how to use the WebRobot v1.1 component to interact with the social bookmarking site digg, which relies heavily on JavaScript both to generate the displayed content and to handle interaction with it. The completed application is available for download, as is a free trial version of the WebRobot v1.1 component, with a separate download for users of the .NET Framework 2.0.

First, we will create our instance of the WebRobot component, and enable AJAX mode:

VB
Private wrobot As New foxtrot.xray.WebRobot

Private Sub Form1_Load(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles MyBase.Load
    wrobot.AJAX = True
End Sub

Private Sub Form1_Closing(ByVal sender As Object, _
        ByVal e As System.ComponentModel.CancelEventArgs) _
        Handles MyBase.Closing
    wrobot.Dispose()
End Sub

We created our instance of the WebRobot, enabled AJAX mode, and then, in the Closing event of our form, called the Dispose method to release all resources. Now, we will log in to digg:

VB
'Load the main digg page
wrobot.LoadPage("http://digg.com")
'Get the login form
Dim loginform As foxtrot.xray.Form = wrobot.GetFormByContainsAction("login")
'Username field
Dim userfield As foxtrot.xray.Input.Text = loginform.Fields(0)
'Password field
Dim pswdfield As foxtrot.xray.Input.Password = loginform.Fields(1)
'Submit button
Dim sbmtfield As foxtrot.xray.Input.Submit = loginform.Fields(3)
userfield.Value = username
pswdfield.Value = password
'Simulate a click on the submit button
sbmtfield.Click()

After loading the main page and filling out the login form, we clicked the submit button. We could have used the WebRobot's SubmitForm method, but since this page may use JavaScript for form and button events, it is safer to simulate a click so that any attached code gets interpreted. The Click method blocks until all actions are performed and any necessary page navigation is complete.

Now, we can start parsing the main page content to detect all the news items displayed. The WebRobot v1.1 component has an Element object and a FindElements method that allow sifting through the page. The Element object also exposes a Click method, which lets you click on the elements you find after parsing. Let's look for news items:

VB
Dim newsitems As New System.Collections.ArrayList
'Get the list of DIV elements on the web page
Dim elements() As foxtrot.xray.Element = wrobot.FindElements("div")
For Each item As foxtrot.xray.Element In elements
    'Remove the CR and LF characters at the start of the element that the 
    'digg html source contains
    Dim text As String = item.Text.TrimStart(vbCrLf.ToCharArray()).ToLower
    'Look for DIVs of news-summary class
    If (text.IndexOf("<div class=news-summary") = 0) Then
        newsitems.Add(item)
    End If
Next

Now, we have the DIVs containing our news items. Note the use of the Text property of the elements to search for the class of the DIV.
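
The prefix test above depends on the exact markup returned by the Text property (unquoted attribute value, class listed first). As a minimal sketch, assuming Text holds the element's markup as in the snippet above, a regular expression can tolerate optional quotes and other attributes placed before class. HasClass is a hypothetical helper written for this illustration; it is not part of the WebRobot component and uses only the .NET Regex class.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module ClassMatchSketch
    'Hypothetical helper: returns True when markup starts with an opening
    '<tagName ...> tag whose class attribute is className (or lists it first).
    'Leading whitespace and quotes around the value are tolerated.
    Function HasClass(ByVal markup As String, ByVal tagName As String, _
            ByVal className As String) As Boolean
        Dim pattern As String = "^\s*<" & Regex.Escape(tagName) & _
            "\b[^>]*\sclass\s*=\s*[""']?" & Regex.Escape(className) & _
            "[""']?(\s|>)"
        Return Regex.IsMatch(markup, pattern, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        'Unquoted value, upper-case tag
        Console.WriteLine(HasClass("<DIV class=news-summary id=x>", _
            "div", "news-summary"))
        'Quoted value, another attribute first, leading CR/LF
        Console.WriteLine(HasClass(vbCrLf & "<div id=a class=""news-summary"">", _
            "div", "news-summary"))
        'Different class
        Console.WriteLine(HasClass("<div class=news-body>", _
            "div", "news-summary"))
    End Sub
End Module
```

Inside the loop, a call such as HasClass(item.Text, "div", "news-summary") could stand in for the IndexOf test; whether that is needed depends on how consistent the markup returned by the component is.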

Now that we have our list of DIVs, we will parse the content from them:

VB
For Each newsitem As foxtrot.xray.Element In newsitems
    'Object to store parsed article info
    Dim artinfo As New ArticleInfo
    'Get the H3s in the item, to look for the title
    Dim titledata() As foxtrot.xray.Element = newsitem.FindElements("H3")
    'The first H3 contains the title, now find the A HREF containing
    'the news link
    Dim urldata() As foxtrot.xray.Element = titledata(0).FindElements("A")
    'The first A HREF found contains the news link
    Dim ahref As String = urldata(0).Text
    'Regular expression to get the URL and the title of the story
    Dim parser As New _
        System.Text.RegularExpressions.Regex("href=""(.*)"".*>(.*)</", _ 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
        System.Text.RegularExpressions.RegexOptions.Singleline)
    'Store the URL and title
    artinfo.URL = parser.Matches(ahref).Item(0).Groups.Item(1).Value
    artinfo.Title = parser.Matches(ahref).Item(0).Groups.Item(2).Value
    'More parsing code follows
    .
    .
    .
Next
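
One caveat about the pattern above: (.*) is greedy, so if the anchor carries further quoted attributes after href, the first group can run past the URL's closing quote. The sketch below is one possible alternative, not the article's method. It uses negated character classes so that each group stops at the first delimiter, and it checks Match.Success before reading the groups. TryExtractLink is a hypothetical name, and the sample markup is invented for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LinkParseSketch
    'Hypothetical helper: extracts the href value and the link text from the
    'markup of a single anchor element. Returns False when nothing matches.
    Function TryExtractLink(ByVal anchorMarkup As String, _
            ByRef url As String, ByRef title As String) As Boolean
        'href="([^"]*)" stops at the first closing quote,
        '[^>]*> skips any remaining attributes,
        '([^<]*) captures the text up to the next tag
        Dim parser As New Regex("href=""([^""]*)""[^>]*>([^<]*)<", _
            RegexOptions.IgnoreCase)
        Dim m As Match = parser.Match(anchorMarkup)
        If Not m.Success Then
            Return False
        End If
        url = m.Groups(1).Value
        title = m.Groups(2).Value.Trim()
        Return True
    End Function

    Sub Main()
        Dim url As String = Nothing
        Dim title As String = Nothing
        Dim sample As String = "<A class=""t"" href=""http://example.com/story""" & _
            " target=""_blank"">A story</A>"
        If TryExtractLink(sample, url, title) Then
            Console.WriteLine(url & " | " & title)
        Else
            Console.WriteLine("No link found")
        End If
    End Sub
End Module
```

This version assumes a double-quoted href and a title without nested tags; a title wrapped in further markup would need a different capture for the second group.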

We found the URL and title of the story by searching within the DIV. Now, we will find the number of diggs, the digg this! link, and the digg discussion link for each news item:

VB
'The number of diggs is contained in a STRONG element. Find the one
'with an id that starts with diggs-strong-
Dim digginfo() As foxtrot.xray.Element = newsitem.FindElements("strong")
For Each item As foxtrot.xray.Element In digginfo
    Dim text As String = item.Text.TrimStart(vbCrLf.ToCharArray()).ToLower
    If (text.IndexOf("<strong id=diggs-strong-") = 0) Then
        parser = New System.Text.RegularExpressions.Regex(">(.*)</", _ 
          System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _ 
          System.Text.RegularExpressions.RegexOptions.Singleline)
        'Store the diggs count
        artinfo.Diggs = _
           Integer.Parse(parser.Matches(text).Item(0).Groups.Item(1).Value)
    End If
Next
'The digg this! link and the digg discussion links are stored in A HREFs
urldata = newsitem.FindElements("A")
For Each item As foxtrot.xray.Element In urldata
    If (item.Text.IndexOf("digg it") > -1) Then
        'If item contains digg it, it's the digg this! link.
        'If the user has already dugg the item, this link will
        'not be present. If present, we will store the Element
        'object to simulate a click
        artinfo.DiggLink = item
    ElseIf (item.Text.IndexOf("class=more") > -1) Then
        'If the A HREF class is more, then this is the digg discussion link
        parser = New System.Text.RegularExpressions.Regex("href=""(.*)"".*>(.*)</", _
          System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _ 
          System.Text.RegularExpressions.RegexOptions.Singleline)
        artinfo.DiggMore = parser.Matches(item.Text).Item(0).Groups.Item(1).Value
    End If
Next
'Create a new item for the main article ListView
Dim litem As New ListViewItem(artinfo.Title)
'Store the article info in the articlelist HashTable
articlelist(litem) = artinfo
ListView1.Items.Add(litem)
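
Integer.Parse throws an exception if the text between the tags is not a plain number, which would abort the whole parsing loop. A more defensive variant might look like the sketch below. It assumes the .NET Framework 2.0 or later, where Integer.TryParse is available; ParseDiggCount is a hypothetical helper and the sample markup is only modeled on the id prefix used above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module DiggCountSketch
    'Hypothetical helper: reads the number between the tags of markup such as
    '<strong id=diggs-strong-1>123</strong>. Returns 0 when no number is found
    'or when the digits do not fit in an Integer.
    Function ParseDiggCount(ByVal markup As String) As Integer
        'Digits between ">" and "<", with optional surrounding whitespace
        Dim m As Match = Regex.Match(markup, ">\s*(\d+)\s*<")
        Dim count As Integer = 0
        If m.Success AndAlso Integer.TryParse(m.Groups(1).Value, count) Then
            Return count
        End If
        Return 0
    End Function

    Sub Main()
        Console.WriteLine(ParseDiggCount("<strong id=diggs-strong-1>123</strong>"))
        Console.WriteLine(ParseDiggCount("<strong id=diggs-strong-2>n/a</strong>"))
    End Sub
End Module
```

Returning 0 on failure is a design choice made for this sketch; an application that needs to distinguish "no diggs" from "could not parse" could return a Boolean and pass the count back through a ByRef parameter instead.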

We have populated our form with the article info. Now, we add code that opens the story link in a web browser when the user double-clicks an item:

VB
Private Sub ListView1_DoubleClick(ByVal sender As Object, _ 
        ByVal e As System.EventArgs) Handles ListView1.DoubleClick
    'Are there any selected items?
    If (ListView1.SelectedItems.Count > 0) Then
        'Get the article info related to the selected item
        Dim item As ListViewItem = ListView1.SelectedItems(0)
        Dim artinfo As ArticleInfo = articlelist(item)
        'Launch a new web browser instance with the URL
        System.Diagnostics.Process.Start(artinfo.URL)
    End If
End Sub
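
The snippets use an ArticleInfo class and an articlelist Hashtable without showing their declarations. The sketch below is a guess at their shape, inferred only from how the code uses them, not the author's actual definitions. DiggLink is declared As Object here so that the sketch compiles without the WebRobot component; in the article it holds a foxtrot.xray.Element.

```vb
Imports System
Imports System.Collections

'Assumed shape of ArticleInfo, inferred from the fields the snippets access
Public Class ArticleInfo
    Public URL As String        'Link to the story
    Public Title As String      'Story title
    Public Diggs As Integer     'Digg count
    Public DiggLink As Object   'Element to click; Nothing if already dugg
    Public DiggMore As String   'Link to the digg discussion
End Class

Module ArticleStoreSketch
    'Maps each list item to its ArticleInfo. Any object can serve as the key;
    'the article uses the ListViewItem itself.
    Private articlelist As New Hashtable

    Sub Main()
        Dim key As New Object   'Stands in for a ListViewItem
        Dim info As New ArticleInfo
        info.Title = "Example story"
        info.URL = "http://example.com/story"
        articlelist(key) = info

        'Hashtable returns Object, so cast back to ArticleInfo explicitly
        Dim found As ArticleInfo = DirectCast(articlelist(key), ArticleInfo)
        Console.WriteLine(found.Title & " -> " & found.URL)
    End Sub
End Module
```

The handlers in the article assign articlelist(item) directly to an ArticleInfo variable, which relies on Option Strict being off; the explicit DirectCast shown here also compiles with Option Strict On.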

Now, we create a context menu, to be displayed whenever the user right-clicks on an article. This context menu will show the number of diggs (in MenuItem1), enable the user to digg the story (in MenuItem2), and launch a browser instance with the digg discussion (in MenuItem3). First, we will add code to update the digg count and whether the news item has been dugg or not:

VB
Private Sub ListView1_Click(ByVal sender As Object, _
        ByVal e As System.EventArgs) Handles ListView1.Click
    'Is there a selected item?
    If (ListView1.SelectedItems.Count > 0) Then
        'Enable the context menu
        ListView1.ContextMenu = ContextMenu1        
        'Get the article info related to the selected item
        Dim item As ListViewItem = ListView1.SelectedItems(0)
        Dim artinfo As ArticleInfo = articlelist(item)
        'Update digg count
        MenuItem1.Text = artinfo.Diggs.ToString & " Diggs"
        'Can we digg this item?
        If (artinfo.DiggLink Is Nothing) Then
            'Item already dugg
            MenuItem2.Text = "Dugg!"
            MenuItem2.Enabled = False
        Else
            'We can digg this item
            MenuItem2.Text = "Digg this!"
            MenuItem2.Enabled = True
        End If
    Else
        'Disable the context menu
        ListView1.ContextMenu = Nothing
    End If
End Sub

Now, we can add code to digg a news item:

VB
Private Sub MenuItem2_Click(ByVal sender As System.Object, _ 
        ByVal e As System.EventArgs) Handles MenuItem2.Click
    'Is there a selected item?
    If (ListView1.SelectedItems.Count > 0) Then
        ListView1.ContextMenu = ContextMenu1
        'Get the article info related to the selected item
        Dim item As ListViewItem = ListView1.SelectedItems(0)
        Dim artinfo As ArticleInfo = articlelist(item)
        'Are we sure we can digg this item?
        If Not (artinfo.DiggLink Is Nothing) Then
            'Simulate a click on the digg this! link, which
            'contains JavaScript code, but no valid HREF
            artinfo.DiggLink.Click()
            'Clear this item so that we cannot try to digg it
            'again, update digg count, and update the user
            'interface
            artinfo.DiggLink = Nothing
            artinfo.Diggs += 1
            MenuItem2.Text = "Dugg!"
            MenuItem2.Enabled = False
            MenuItem1.Text = artinfo.Diggs.ToString & " Diggs"
        End If
    End If
End Sub

Finally, we add code to load a browser window with the digg discussion link:

VB
Private Sub MenuItem3_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles MenuItem3.Click
    'Are there any selected items?
    If (ListView1.SelectedItems.Count > 0) Then
        'Get the article info related to the selected item
        Dim item As ListViewItem = ListView1.SelectedItems(0)
        Dim artinfo As ArticleInfo = articlelist(item)
        'Launch a new web browser instance with the digg discussion
        System.Diagnostics.Process.Start(artinfo.DiggMore)
    End If
End Sub

We have interacted with digg, simulating a real user clicking on links. Short of captchas, it is hard for a web application to tell that a real user is not at the helm.

For more information on the WebRobot v1.1 component, visit http://foxtrot-xray.com/main/prod/dev/web_robot. You may also download the full documentation at http://foxtrot-xray.com/main/prod/dev/documentation.chm/view.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
