Web Scraping (Problems & Solutions)

dpalash, 5 Nov 2013
This article gives you a basic idea of web scraping, along with a few problems you may run into while working and their solutions.

Introduction

Web scraping is considered an efficient, programmatic way to grab data from different web sources. Scraping is done on web pages: it is a simple technique for collecting the information you need from other web pages into your own database.

Things to Consider:

  1. HTML structure.
  2. Proper tagging.

1. HTML Structure:

Our first consideration for web scraping is the HTML structure. For scraping, the HTML of the content needs to be structured. Without properly structured HTML, scraping becomes messy, time-consuming, and error-prone. If the content is well structured, scraping is an excellent way to collect data.

2. Proper Tagging:

The HTML tags of the content need to be properly marked up: the elements you want to read should carry an id or a class. If the content relies only on inline HTML with no identifiers, there is nothing stable to select on, and scraping becomes a mess. If the elements carry an id or a class name that we can use, scraping is a good option.

Uses of Web Scraping

  1. Online price comparison
  2. Contact scraping
  3. Weather data monitoring
  4. Website change detection
  5. Research
  6. Web mashup
  7. Web data integration
  8. Telephone number collection
  9. Address collection
  10. Country/city/state name collection

In this article, I will discuss a few useful techniques of web scraping using HtmlAgilityPack. The most surprising feature of HTML Agility Pack is that it now supports LINQ. This means you can write the usual LINQ query to get your result. If you need more information about HTML Agility Pack, you can visit its documentation at CodePlex.

Okay, so let’s begin now.

Problem Statement-1 

Suppose we have an HTML page that contains a div with the id divAchors, and that div wraps several anchor tags. From that HTML, we want to extract only the links of the anchor tags.

Solution-1

Step 1: Process the raw content (that is, the HTML). Load the complete HTML source and convert it to a string. Through an HTTP web request and response, we get the entire HTML from the given link. Then a StreamReader reads the content to the end, which gives us the HTML source code as a string.
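Step 1 can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the article's original .NET code. The function name get_source_code mirrors the getSourceCode() method mentioned in this article, while the User-Agent value, the timeout, and the fallback to UTF-8 are my own assumptions.

```python
from urllib.request import Request, urlopen


def get_source_code(url: str, timeout: float = 10.0) -> str:
    """Request `url`, read the response to the end, and return it as a string."""
    # The User-Agent value is an illustrative assumption; some sites reject
    # requests that do not send one.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read()  # read the whole body, like ReadToEnd()
        # Use the charset announced by the server; assume UTF-8 if none is given.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling get_source_code("http://example.com/") would return the page markup as one string, ready to be handed to a parser in Step 2.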

Step 2: Return the string and convert it into an HTML document type.

The getSourceCode() method in the WorkerClass class loads the complete HTML of the page and returns it as a string. The returned string is then converted to an HtmlDocument and returned. With the HTML document ready, we can run a LINQ query against it to get our desired result.

Here primaryDivId is a Boolean variable that is true if the document contains a div with the id divAchors. anchorsHref holds the collection of the anchors' links, and anchorsInnerText holds the collection of the anchors' inner text.
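The same result can be sketched in Python with the standard html.parser module. This is a stand-in for the idea, not the HtmlAgilityPack query itself. The AnchorCollector class and the sample markup are hypothetical; the variable names follow the ones described above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href and inner text of every <a> inside the div with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0            # > 0 while we are inside the target div
        self.in_anchor = False
        self.found_container = False
        self.hrefs = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1   # nested div inside the target div
            elif attrs.get("id") == self.container_id:
                self.div_depth = 1
                self.found_container = True
        elif tag == "a" and self.div_depth > 0:
            self.in_anchor = True
            self.hrefs.append(attrs.get("href") or "")
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.texts[-1] += data    # also picks up text of nested tags like <b>


# Hypothetical sample markup; the id divAchors is the one named in this article.
SAMPLE_HTML = """
<div id="divAchors">
  <a href="http://example.com/one">First link</a>
  <p><a href="http://example.com/two">Second <b>link</b></a></p>
</div>
<a href="http://example.com/outside">Outside</a>
"""

collector = AnchorCollector("divAchors")
collector.feed(SAMPLE_HTML)
collector.close()

primary_div_id = collector.found_container
anchors_href = collector.hrefs
anchors_inner_text = [text.strip() for text in collector.texts]
```

The anchor outside the div is skipped because the collector only records anchors while it is inside the target div.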

Problem Statement-2 

Suppose we need to download images. In the HTML, the img tags sit inside a div with the id divImage, possibly nested a few levels deep.

Solution-2

To download all the images and also get their alternative (alt) text, we select every img tag inside the div and read its source and alt attributes.

The //img step in the SelectNodes query says that the div with the id divImage may contain img tags. If the query finds any image tag within the scope of this div, it fetches the tag's source and alt text. Note that it does not matter where the image tag resides: even if it is nested a few div levels down, this query fetches them all.
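To show the "any depth" behaviour, here is a minimal Python sketch using the standard html.parser module. It only imitates what the //img query does and is not the HtmlAgilityPack call. The ImageCollector class and the sample markup are hypothetical; the list names follow imageSrc and imageInnerText from this article.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects src and alt of every <img> nested at any depth in a given div."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0        # > 0 while we are inside the target div
        self.sources = []
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif attrs.get("id") == self.container_id:
                self.div_depth = 1
        elif tag == "img" and self.div_depth > 0:
            self.sources.append(attrs.get("src") or "")
            self.alt_texts.append(attrs.get("alt") or "")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


# Hypothetical sample markup; the id divImage is the one named in this article.
SAMPLE_HTML = """
<div id="divImage">
  <img src="images/one.png" alt="First image">
  <div><div><img src="images/two.png" alt="Second image" /></div></div>
</div>
<img src="images/outside.png" alt="Not collected">
"""

collector = ImageCollector("divImage")
collector.feed(SAMPLE_HTML)
collector.close()

image_src = collector.sources
image_inner_text = collector.alt_texts
```

The second image is two div levels deeper than the first, and both are collected; the image outside divImage is not.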

With this query, we get the collection of the image source links in the imageSrc list and their alt text in the imageInnerText list. Using a foreach loop, we can then download and save the images in our desired folder.
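The download loop might look like the sketch below, again in Python with the standard library rather than the article's C#. The function name save_images, the timeout, and the fallback file name are assumptions. Relative src values are resolved against the page address before downloading.

```python
import os
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


def save_images(page_url, sources, folder):
    """Download every image in `sources` into `folder` and return the saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for src in sources:
        # A src such as "images/one.png" is relative to the page it came from.
        absolute = urljoin(page_url, src)
        # Use the last path segment as the file name; "image" is a fallback.
        # Images that share a file name overwrite each other in this sketch.
        name = os.path.basename(urlsplit(absolute).path) or "image"
        target = os.path.join(folder, name)
        with urlopen(absolute, timeout=10) as response, open(target, "wb") as out:
            out.write(response.read())
        saved.append(target)
    return saved
```

For example, save_images("http://example.com/gallery/", image_src, "downloads") would fetch each collected source and write it under the downloads folder.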

Problem Statement-3

Suppose we need to find the inner text of a div by its class name. The HTML for this problem contains one or more divs that carry that class.

Solution-3

The solution selects the divs with the given class name and reads their inner text. The innerText string gives you the full, uncut text, whereas innerTextList gives you the inner texts as a list.
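A minimal Python sketch of this idea follows, using the standard html.parser module instead of HtmlAgilityPack. The class name demoText, the ClassTextCollector class, and the sample markup are hypothetical; inner_text and inner_text_list correspond to innerText and innerTextList.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collects the text inside every <div> that carries a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.div_depth = 0        # > 0 while we are inside a matching div
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.div_depth > 0:
            self.div_depth += 1   # nested div, its text belongs to the match
            return
        # A class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.div_depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

    def handle_data(self, data):
        if self.div_depth > 0:
            self.texts[-1] += data


# Hypothetical sample markup and class name.
SAMPLE_HTML = """
<div class="demoText">Hello <b>scraped</b> world</div>
<div class="other">Ignored</div>
<div class="demoText highlight">Second block</div>
"""

collector = ClassTextCollector("demoText")
collector.feed(SAMPLE_HTML)
collector.close()

# One entry per matching div, with whitespace normalized.
inner_text_list = [" ".join(text.split()) for text in collector.texts]
# The same content as one uncut string.
inner_text = " ".join(inner_text_list)
```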

Problem Statement-4 

Suppose we have a problem similar to the one above, with a slight change: the class name toggles between two classes, and I am not sure which class name will be present when the page renders.

Here the class toggles between demoText1 and demoText2.

Solution-4

The solution is similar to Solution-3, with an extra or (|) condition in the query. You can also use an and (&) condition if you need to.
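The "either class" condition can be sketched in Python with xml.etree.ElementTree from the standard library. This is not the HtmlAgilityPack query: ElementTree's limited XPath has no union operator, so the sketch filters on a set of class names instead, and it assumes the markup is well-formed XML, which real pages often are not. The function name and the two sample renderings are hypothetical.

```python
import xml.etree.ElementTree as ET


def inner_text_for_classes(markup, class_names):
    """Return the text of every <div> whose class matches any of `class_names`."""
    root = ET.fromstring(markup)          # assumes well-formed markup
    wanted = set(class_names)
    results = []
    for div in root.iter("div"):          # visits divs in document order
        classes = (div.get("class") or "").split()
        if wanted.intersection(classes):  # the "or" condition: either class matches
            # itertext() yields the text of the div and of everything nested in it.
            # A matching div nested in another matching div is listed twice.
            results.append(" ".join("".join(div.itertext()).split()))
    return results


# The same page rendered with either class name (hypothetical markup).
RENDERED_A = '<body><div class="demoText1">Same <i>content</i></div></body>'
RENDERED_B = '<body><div class="demoText2">Same <i>content</i></div></body>'

text_a = inner_text_for_classes(RENDERED_A, ["demoText1", "demoText2"])
text_b = inner_text_for_classes(RENDERED_B, ["demoText1", "demoText2"])
```

Whichever class the page happens to render with, the same call returns the same inner text.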

These are the problems I have faced recently in my work, and this is how I solved them. I think these solutions will help you with your own problems, because they cover a lot of common web scraping tasks. If you encounter more problems, please let me know and I will try to solve them. Thanks for reading. Happy coding. :)

References 

  1. Wikipedia
  2. CodePlex

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

dpalash
Software Developer, Desme Bangladesh
Bangladesh
I completed my undergraduate degree in Computer Science & Engineering at Khulna University of Engineering. I am now working as a Software Engineer at desme-Bangladesh on ASP.NET. I really love this technology and would like to build my career on it.

Last Updated 5 Nov 2013
Article Copyright 2013 by dpalash