Introduction
In the previous article, we saw a simple VB application that retrieves the HTML page at a given URL. In this article, we will build a small web crawler that lists all the links on the page at a given URL and lets the user drill down into any of them.
- Setting up the Visual Basic Environment with required Components and Libraries:
- Open Visual Basic and create a new project (use Standard EXE).
- Select Project -> References from the main menu and add the following Microsoft library:
- Microsoft HTML Object Library
- Add Microsoft Windows Common Controls to the toolbox as follows. Select Project -> Components from the main menu. The Components window will open. With the Controls tab selected, scroll down and check the box next to the component:
- Microsoft Windows Common Controls 6.0
- Set up the UI for the Crawler
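- Add a label, a text box for the URL (txtURL), two command buttons (cmdStart and cmdGet), a listbox (lstLinks), and a treeview control (trvLinks). These are the control names used in the code below: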
- Add the code for the Crawler:
- When the Start button is clicked, populate the listbox with all the links under the given URL:
Private Sub cmdStart_Click()
    getLinks txtURL.Text, 1
End Sub
- The getLinks function populates either the listbox or the treeview, depending on its second parameter. Here the parameter is 1, so it populates the listbox with all the links under the URL:
Private Sub getLinks(strURL As String, iParentChild As Integer, _
        Optional iParentNo As Integer)
    Dim objLink As HTMLLinkElement
    Dim objMSHTML As New MSHTML.HTMLDocument
    Dim objDoc As MSHTML.HTMLDocument
    Dim objNode As Node

    ' Load the page passed in strURL, which is not always the URL in the text box
    Set objDoc = objMSHTML.createDocumentFromUrl(strURL, vbNullString)
    MousePointer = vbHourglass

    ' Wait until the document has finished loading
    While objDoc.readyState <> "complete"
        DoEvents
    Wend

    For Each objLink In objDoc.links
        If iParentChild = 1 Then
            ' Mode 1: add the link to the listbox
            lstLinks.AddItem objLink
        ElseIf iParentChild = 2 Then
            ' Mode 2: add the link as a child of tree node number iParentNo
            Set objNode = trvLinks.Nodes.Add(iParentNo, tvwChild)
            objNode.Text = objLink
        End If
    Next
    MousePointer = vbDefault
End Sub
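- The wait loop in getLinks never ends if a page does not reach the "complete" state. One possible guard is sketched below. WaitForDocument and its lTimeoutSecs parameter are hypothetical names introduced for illustration and are not part of the original code:

```vb
' Hypothetical helper (not in the original article): waits for a document
' to finish loading, but gives up after lTimeoutSecs seconds.
' Returns True if the document loaded, False if the wait timed out.
Private Function WaitForDocument(objDoc As MSHTML.HTMLDocument, _
        ByVal lTimeoutSecs As Long) As Boolean
    Dim sngStart As Single
    Dim sngElapsed As Single

    sngStart = Timer                      ' seconds elapsed since midnight
    Do While objDoc.readyState <> "complete"
        DoEvents                          ' keep the form responsive
        sngElapsed = Timer - sngStart
        ' Timer restarts at 0 at midnight, so correct a negative difference
        If sngElapsed < 0 Then sngElapsed = sngElapsed + 86400
        If sngElapsed > lTimeoutSecs Then Exit Function   ' returns False
    Loop
    WaitForDocument = True
End Function
```

Inside getLinks, the While ... Wend loop could then be replaced by a call such as If Not WaitForDocument(objDoc, 30) Then ..., where the 30-second limit is an arbitrary choice and the handling of a timed-out page is left to the application.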
- To go one level deeper, the user can select one or more links in the listbox and press the Get Inner Links button:
Private Sub cmdGet_Click()
    Dim iCount As Integer
    If lstLinks.SelCount = 0 Then
        MsgBox "Please Select a Link"
        Exit Sub
    Else
        iCount = 0
        While iCount <= lstLinks.ListCount - 1
            If lstLinks.Selected(iCount) Then
                ' Add the selected link as a new root node, then fill in its children
                trvLinks.Nodes.Add , , , lstLinks.List(iCount)
                getLinks lstLinks.List(iCount), 2, trvLinks.Nodes.Count
                ' Removing the item shifts the next one into position iCount,
                ' so the counter is not advanced here
                lstLinks.RemoveItem iCount
                lstLinks.Refresh
            Else
                iCount = iCount + 1
            End If
        Wend
    End If
End Sub
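- The handler above advances its counter only when no item was removed, because removing an item shifts the remaining items up by one position. A common alternative, sketched here as a variant and not as the article's own method, is to walk the list from the last item to the first so that a removal never affects an index that is still to be visited. It assumes the same controls (lstLinks, trvLinks) and the getLinks procedure shown earlier:

```vb
' Alternative sketch of cmdGet_Click: iterate from the bottom of the list up.
Private Sub cmdGet_Click()
    Dim i As Integer

    If lstLinks.SelCount = 0 Then
        MsgBox "Please Select a Link"
        Exit Sub
    End If

    ' The loop bounds are evaluated once; items below index i are unaffected
    ' when item i is removed.
    For i = lstLinks.ListCount - 1 To 0 Step -1
        If lstLinks.Selected(i) Then
            trvLinks.Nodes.Add , , , lstLinks.List(i)
            getLinks lstLinks.List(i), 2, trvLinks.Nodes.Count
            lstLinks.RemoveItem i
        End If
    Next i
End Sub
```

One trade-off of this variant is that the selected links are added to the treeview in reverse list order.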
- All the inner links are added to the treeview. To drill down further, the user can double-click any of those URLs in the treeview:
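Private Sub trvLinks_DblClick()
    ' SelectedItem is Nothing when no node is selected (e.g. the tree is empty)
    If trvLinks.SelectedItem Is Nothing Then Exit Sub
    getLinks trvLinks.SelectedItem.Text, 2, trvLinks.SelectedItem.Index
End Sub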
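- Double-clicking the same node more than once fetches its page again and adds the same child links a second time. A possible variant of the handler, offered as a sketch only, skips any node that already has children by checking the Node object's Children property:

```vb
' Variant sketch of trvLinks_DblClick: crawl a node only once.
Private Sub trvLinks_DblClick()
    ' Nothing to do when no node is selected
    If trvLinks.SelectedItem Is Nothing Then Exit Sub

    ' Children holds the number of child nodes; a node that already has
    ' children has been crawled before, so leave it alone
    If trvLinks.SelectedItem.Children > 0 Then Exit Sub

    getLinks trvLinks.SelectedItem.Text, 2, trvLinks.SelectedItem.Index
End Sub
```

A page that contains no links leaves its node without children, so with this check such a page is still fetched again on every double-click.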
- Then finally the screen would look something like this:
History
- 12th August, 2002: Initial post