
Small Web Agents using VB - Part II

By Gopi Subramanian, 12 Aug 2002
 

Introduction

In the previous article, we saw a simple VB application that retrieves the HTML page at a given URL. In this article, we will build a small web crawler that follows the links found on the page at a given URL.

  1. Setting up the Visual Basic Environment with required Components and Libraries:
    • Open Visual Basic and create a new project (use Standard EXE).
    • Select Project -> References from the main menu and add the following Microsoft Libraries:
      • Microsoft HTML Object Library
    • Add Microsoft Windows Common Controls to the toolbox as follows. Select Project -> Components from the main menu. The Components window will open. With the Controls tab selected, scroll down and tick the check box next to this component:
      • Microsoft Windows Common Controls 6.x
  2. Set up the UI for the Crawler
    • Add a label, a text box for the URL (txtURL), two button controls (cmdStart and cmdGet), a listbox (lstLinks), and a treeview control (trvLinks) as below:

      [Image: layout of the crawler form]

  3. Add the code for the Crawler:
    • When the Start button is clicked, populate the list box with all the links under the given URL:
      Private Sub cmdStart_Click()
          ' 1 = populate lstLinks with all the parent links
          ' in the requested URL
          getLinks txtURL.Text, 1
      End Sub
    • The getLinks procedure populates either the listbox or the treeview, depending on its second parameter (1 for the listbox, 2 for the treeview). Here the parameter is 1, so it populates the listbox with all the links under the URL:
      Private Sub getLinks(strURL As String, iParentChild As Integer, _
              Optional iParentNo As Integer)
          Dim objLink As HTMLLinkElement
          Dim objMSHTML As New MSHTML.HTMLDocument
          Dim objDoc As MSHTML.HTMLDocument
          Dim objNode As Node

          ' Load the URL passed in (not the text box), so that drilling
          ' down from the listbox or treeview fetches the selected page
          Set objDoc = objMSHTML.createDocumentFromUrl(strURL, vbNullString)

          MousePointer = vbHourglass
          ' Wait until the document has finished loading
          While objDoc.readyState <> "complete"
              DoEvents
          Wend

          ' Get all links
          For Each objLink In objDoc.links
              If iParentChild = 1 Then
                  ' Parent links go into the listbox
                  lstLinks.AddItem objLink
              ElseIf iParentChild = 2 Then
                  ' Inner links become children of treeview node iParentNo
                  Set objNode = trvLinks.Nodes.Add(iParentNo, tvwChild)
                  objNode.Text = objLink
              End If
          Next
          MousePointer = vbDefault
      End Sub
    • To go further down some of the links, the user can select them in the listbox and press the Get Inner Links button:
      Private Sub cmdGet_Click()
          Dim iCount As Integer

          If lstLinks.SelCount = 0 Then
              MsgBox "Please Select a Link"
              Exit Sub
          End If

          iCount = 0
          While iCount <= lstLinks.ListCount - 1
              If lstLinks.Selected(iCount) Then
                  ' Add the link as a root node, then hang its inner links under it
                  trvLinks.Nodes.Add , , , lstLinks.List(iCount)
                  getLinks lstLinks.List(iCount), 2, trvLinks.Nodes.Count
                  ' Removing the item shifts the rest up, so iCount is not advanced
                  lstLinks.RemoveItem iCount
                  lstLinks.Refresh
              Else
                  iCount = iCount + 1
              End If
          Wend
      End Sub
    • All the inner links are populated inside the treeview. To drill down further, the user can double-click one of those URLs in the treeview:
      Private Sub trvLinks_DblClick()
          ' Nothing to do if no node is selected
          If trvLinks.SelectedItem Is Nothing Then Exit Sub
          getLinks trvLinks.SelectedItem.Text, 2, trvLinks.SelectedItem.Index
      End Sub
    • Then finally the screen would look something like this:

      [Image: the crawler form after a run]
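
The wait loop in getLinks spins until readyState becomes "complete", so a page that never finishes loading would keep it running indefinitely. The following is a minimal sketch of a bounded wait. The helper name WaitForDocument and the 30-second default are assumptions made for illustration; neither is part of the original code.

```vb
' Hypothetical helper (not in the original article): waits for the
' document to finish loading, but gives up after sngTimeoutSecs seconds.
' Returns True if the document reached "complete", False on timeout.
Private Function WaitForDocument(objDoc As MSHTML.HTMLDocument, _
        Optional sngTimeoutSecs As Single = 30) As Boolean
    Dim sngStart As Single
    Dim sngElapsed As Single

    sngStart = Timer    ' seconds elapsed since midnight
    Do While objDoc.readyState <> "complete"
        DoEvents
        sngElapsed = Timer - sngStart
        ' Timer wraps to 0 at midnight; correct a negative difference
        If sngElapsed < 0 Then sngElapsed = sngElapsed + 86400
        ' Timed out: leave without setting the result, so it stays False
        If sngElapsed > sngTimeoutSecs Then Exit Function
    Loop
    WaitForDocument = True
End Function
```

One way to use it would be to call WaitForDocument(objDoc) in place of the While/Wend loop in getLinks and, when it returns False, restore the mouse pointer and leave the procedure without reading objDoc.links.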

History

  • 12th August, 2002: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Gopi Subramanian
Web Developer
India


Comments and Discussions

General | My vote of 5 | member | manoj kumar choubey | 28 Mar '12 - 0:07
Nice
General | My vote of 5 | member | viral.sharma | 13 Feb '11 - 18:32
Much easier and faster than the Inet control or the WebBrowser control.
Question | Gopi - Are you interested? Search Engine Expert: Building Web 3.0 Social Networking People Search Engine, Do You Want To Be The King Of The Internet? | member | rapiddata | 30 Dec '06 - 4:12
http://newyork.craigslist.org/mnh/eng/255305939.html
 
Search Engine Expert: Building Web 3.0 Social Networking People Search Engine, Do You Want To Be The King Of The Internet?
 
Company Description:
PeoplePeople.com: The Superior MySpace Alternative .....
PeoplePeople.com is combining proprietary Natural Language Extraction, Artificial Intelligence Algorithms and Information Integration logic to build a Social Networking Search Engine.
Using Natural Language Extraction tools, our programs are able to read English sentences and understand what they mean. PeoplePeople.com then extracts relevant pieces of information about people, such as the companies they work for and their job titles or a social networking page like a person's page on MySpace.
Artificial Intelligence Algorithms allow our computers to analyze a Web site and extract information based on an understanding of how the Web site is constructed. PeoplePeople.com can deduce that a specific paragraph describes a company, or a social networking page like a person's page on MySpace.
 
Position Purpose:
This person will work with the PeoplePeople.com Search Technology Team to develop the core search engine and web crawlers. This individual will be the search engineer on the design and implementation of a large scale crawling, processing and serving system. Tasks include implementing search algorithms, data mining, improving relevancy or search results, managing terabytes of data and scaling algorithms to work on very large data sets, and serving search results using a large network of Windows 2003 Servers.
 
Accountabilities:
This position is an integral part of PeoplePeople.com's core technology team involving the design, development and implementation of PeoplePeople.com's search engine: the crawling, indexing and ranking of billions of documents on the Internet. As such, this person will be expected to make a significant contribution to this effort by designing innovative technical solutions to this significant challenge.
Requirements/Qualities:
*Must have experience in building a search engine crawler and indexer
*Configuring crawlers and indexing content
*Must have the desire and commitment to build a leading-edge search technology
*Must have extensive programming experience in Microsoft C#.NET and SQL Server.
*Must have experience with search engine relevance and information retrieval techniques
*Must have a minimum of 5 years experience in software development in either an academic or corporate environment
*Must be able to communicate and work with both technical and non-technical people
 
This position is open to telecommuting, consulting or full time work.
Send your resume to searchengine@GeeWHIZConsulting.com, attention Leo Loiacono.
 
Best,
Leo
201-923-9595
General | createDocumentFromUrl | member | fra_mimi | 10 Sep '06 - 22:04
Hi, I have a problem in Visual Basic 2005 Express Edition with:
 
Dim objMSHTML As New MSHTML.HTMLDocument
Dim objDocument As New MSHTML.HTMLDocument
objDocument = objMSHTML.createDocumentFromUrl(txtURL.Text, vbNullString)
 
The error message that I get is "AccessViolationException was unhandled".

 
Fra_Mimi

General | The double-click code is not working | member | legoman55 | 1 Apr '06 - 17:08
When I double-click on a treeview item, isn't it supposed to list the links available on that site?
 
It is just copying all the items from the listbox to the item that you double-clicked on. It is not going back to the web for the new links.
 
There is also some commented-out code in there. Did you check this code before you posted it?
 
Thanks.
Question | Is there a part I for this article? | member | legoman55 | 27 Mar '06 - 20:07
The title implies that there is a part I but I can't find it.
 
Please let me know.
 
Thank you.
General | Programmatically open the links | suss | Aman Bhandari | 28 Feb '05 - 10:05
I want each of the links specified in a webpage to open programmatically, and to browse through their contents from VB. Does anybody have an idea how one can implement this?
 
Aman
General | Re: Programmatically open the links | member | Mamta Suri | 15 Jul '05 - 5:03
hi aman
i was browsing through your posts. i think u wont mind helping me. can u mail me ur email address mine is mamsonu_2005 AT yahoo DOT com
thanx
General | createDocumentFromUrl | member | Jack Clift | 27 Feb '05 - 21:10
I am trying to write an application that uses createDocumentFromUrl to retrieve data from the web. The particular site I am retrieving causes the following IE message:
 
This page has an unspecified potential security flaw. Would you like to continue?
 
I am retrieving several hundred pages, so I need to suppress this message programmatically. How can this be done?
 
Note that when using the same url in IE directly (rather than from VB), I do
not get any error / warning messages.
 
Thanks
 
JC

General | Re: createDocumentFromUrl | suss | Anonymous | 23 Jun '05 - 21:58
hi
General | Re: createDocumentFromUrl | member | tienpv | 6 Nov '05 - 15:40
Please help me. I get the same createDocumentFromUrl error after adding the Microsoft HTML Object Library component: I can't find createDocumentFromUrl, so I can't continue my work.
Thank you very much to anybody who can help me.
Regards

 
phan van tien
General | using activex dll in asp page | suss | R Keller | 24 Jul '04 - 19:26
I used your code to open a document in an activex dll in visual basic 6. I compiled the dll, and i called it from an asp page (I have IE version 6.0) . The code stays in the loop of DoEvents forever.
How can I resolve this problem?
General | More features | suss | cadessi | 23 Jan '04 - 4:00
So how could i delete a link while performing the for each loop?.
I know how to retrieve values from attributes in each link, but how could i replace these values, again while performing the loop?.
Question | How to get the Plain text from the link | member | dsdon10 | 23 Oct '03 - 14:15
How do you get the plain text from each of the links objects and place those in the treeview instead of the link itself?
 
Thanks,
 
Don
General | Runtime Error! | member | vtk | 26 May '03 - 18:39
Hello,
 
I get the error message below when I run the above code. I am attaching the error and the code here. Please help me get past this problem.

I get this error only for the google.com site. To the best of my knowledge, Google sets the focus to the text field in the body onload event.

How can I disable all the events in the document? I am using this component in ASP. My requirement is to get the content of the web page, extract information from it, and store it in our database.

I can also get the content using the WinINET control, but I want the HTML after the client-side script has executed.

If there is any alternative solution, please help me.
 
Waiting for your valuable reply.....
V.Thandava Krishna.
 

======================================================
 
Option Explicit

Public Function GetHTMLCode(URL As String, strSearch As String) As String

On Error Resume Next

Dim doc1 As New HTMLDocument
Dim doc2 As HTMLDocument

Set doc2 = doc1.createDocumentFromUrl("http://www.google.com/", "null")

Do Until doc2.readyState = "complete"
DoEvents
Loop

GetHTMLCode = doc2.documentElement.outerHTML

End Function
 
============================================================
 

============================================================
A run time error has occurred.
Do u want to debug?

Line: 8
Error: Can't move the focus to the controls because it is invisible, not enabled, or of a type that does not accept the focus.
============================================================
 

 

 
V.Thandava Krishna.
Application Programmer
General | Re: Runtime Error! | member | satyaprakashrathore | 24 Jun '08 - 21:39
I am getting the same type of error in this code:
 
If Left$(.ItemURL(i), 1) = "h" Then
Set objDocument = objMSHTML.createDocumentFromUrl(clsInet.ItemURL(i), vbNullString)
While objDocument.ReadyState <> "complete"
DoEvents

Wend
Dim tempstr
' 'tempstr = objDocument.documentElement.outerText
' 'Text1.text = tempstr
' On Error Resume Next
strHtmlContet = objDocument.documentElement.outerHTML
'Text1.text = objDocument.documentElement.outerHTML
lstText.SetFocus
 

 
This is probably 'coz of
Set objDocument = objMSHTML.createDocumentFromUrl(clsInet.ItemURL(i), vbNullString)
 
but i don't find how to give a focus to it
 
---------------------------
Error
---------------------------
A Runtime Error has occurred.
Do you wish to Debug?
 
Line: 1139
Error: Can't move focus to the control because it is invisible, not enabled, or of a type that does not accept the focus.
---------------------------
Yes No
---------------------------
 
satya

General | VB.net | suss | Anonymous | 5 Nov '02 - 4:05
I am trying to implement this code in a VB.NET solution. I get the following error when I try to use the HTMLLinkElement:
 
"An unhandled exception of type 'System.NullReferenceException' occurred in Test.exe
Additional information: Object reference not set to an instance of an object."
 
Here is my code
 
Dim oLink As mshtml.HTMLLinkElement
Dim oMSHTML As New mshtml.HTMLDocument()
Dim oDom As New mshtml.HTMLDocument()
 
oMSHTML.open()
oDom = oMSHTML.createDocumentFromUrl(sPath, vbNullString)
 
Do Until oDom.readyState = "complete"
System.Windows.Forms.Application.DoEvents()
Loop
 
For Each oLink In oDom.links
MsgBox(oLink.innerText) 'THIS IS THE LINE THAT GENERATEs THE ERROR
Next
 
oMSHTML.close()
 
Any help would be appreciated
General | Re: VB.net | suss | Adi Scale Arumugam | 13 Oct '04 - 20:19
Hiya,

The variable "oLink" is mistyped.
It should have been

Dim objLink As mshtml.HTMLAnchorElementClass

and then it works fine, at least for me.
 
-Adi Scale Arumugam

General | Re: VB.net | member | itsmuthu | 29 Oct '09 - 23:21
Hi.. i tried this code but getting an error as
Attempted to read or write protected memory. This is often an indication that other memory is corrupt. in VB.NET.
 
Can u help me out to solve this...
General | Web components instead of | suss | Anonymous | 13 Aug '02 - 7:34

I like the idea of web agents, but instead of crawling for unidentified HTML resources such as images, hrefs, and so on, wouldn't it be more useful to pick out the real components of web pages such as navbars, ad blocks, main content, ... ?
I know that's hard, because one has to reverse engineer the semantics of any arbitrary HTML code, but I believe this would be a blast! :cool:

General | Re: Web components instead of | member | Gopi Subramanian | 15 Aug '02 - 19:21
Once you have the document object with you, it's really very easy to scan through any page. But before doing it you should know how the page is formatted. Making this generic would be a pain, as different pages are laid out differently. :)
General | Getting errors using VB 6 | member | Ammar | 12 Aug '02 - 23:50
I got an error on opening the form which was written in log file as:
"Line 13: Class MSComctlLib.TreeView of control trvLinks was not a loaded control class."
 
Then after I managed to put a different treeview from MSComCtl.ocx file, I got compile error (User-defined type not defined) on first line of function getLinks "objLink As HTMLLinkElement"
 
How to fix this?
 
Ammar

 
There is a difference in knowing the path and walking the path.
General | Re: Getting errors using VB 6 | member | Gopi Subramanian | 13 Aug '02 - 0:07
Regarding your error on objLink As HTMLLinkElement, check whether you have added Microsoft HTML Object Library to your references.

Thanks
Gopi :)
General | Re: Getting errors using VB 6 | member | Ammar | 13 Aug '02 - 0:13
Ooops. :)
Thanks.

Now I get the error (Object doesn't support this method or property) on line:
Set objDoc = objMSHTML.createDocumentFromUrl(txtURL.Text, vbNullString)

I checked. createDocumentFromUrl doesn't exist in the objMSHTML list box.
 
Ammar

 
There is a difference in knowing the path and walking the path.
General | Re: Getting errors using VB 6 | member | Gopi Subramanian | 13 Aug '02 - 0:17
Hey,
Check if u hve IE 5, if not install it..microsoft have a strange way of behaving

Last Updated 13 Aug 2002
Article Copyright 2002 by Gopi Subramanian
Everything else Copyright © CodeProject, 1999-2013