Web Scraping (Problems & Solutions)

5 Nov 2013 | CPOL | 4 min read
This article introduces the basic idea of web scraping and walks through a few common problems, with their solutions, that come up in practice.

Introduction

Web scraping is considered one of the most efficient programmatic ways to grab data from different web sources. It works on web pages: it is a simple technique for collecting the information you need from other sites' pages into your own database.

Things to Consider:

  1. HTML structure.
  2. Proper tagging.

1. HTML Structure:

Our first consideration for web scraping is the HTML structure. The content we want to scrape needs to sit in structured HTML. Without properly structured HTML, scraping becomes messy, time-consuming, and error-prone. If the content is well structured, scraping is an excellent way to collect data.

2. Proper Tagging:

The HTML tags around the content need to be properly formed, and they need an id or a class. We need some identifier to locate the data, so if the content has only inline markup with no id or class name to select on, scraping becomes messy. If the markup does provide ids or class names we can use, scraping is a good option.

Uses of Web Scraping

  1. Online price comparison
  2. Contact scraping  
  3. Weather data monitoring
  4. Website change detection
  5. Research
  6. Web mashup
  7. Web data integration
  8. Telephone number collection
  9. Address collection
  10. Country/city/state name collection

In this article, I will discuss a few useful web scraping techniques using HtmlAgilityPack. A particularly useful feature of HTML Agility Pack is that it supports LINQ, which means you can write ordinary LINQ queries to get your results. For more information about HTML Agility Pack, see its documentation on CodePlex.

Okay, so let’s begin now.

Problem Statement-1 

Suppose we have the following HTML. From the HTML shown below, we want to extract only the links (the href values) of the anchor tags.

Image 1

Solution-1

Step 1: Process the raw content (that is, the HTML). Load the complete HTML source and convert it to a string. Using an HTTP web request and response, we get the entire HTML from the given link. A StreamReader then reads the response to the end, which gives us the HTML source as a string. The code for this procedure follows.

Image 2
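The original listing for this step is an image, so here is a minimal sketch of the same idea in Python using only the standard library (the article's own code is C#). The function name `get_source_code` mirrors the getSourceCode() method mentioned in this article; the `timeout` parameter and the UTF-8 fallback are my assumptions, not part of the original.

```python
import urllib.request


def get_source_code(url: str, timeout: float = 10.0) -> str:
    """Fetch the page at `url` and return its complete HTML source as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # read the response stream to the end
        # Use the charset declared in the response headers if there is one;
        # falling back to UTF-8 is an assumption made for this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling `get_source_code("https://example.com/")` would return that page's markup as one string, ready to be handed to a parser in the next step.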

Step 2: Take the returned string and convert it to an HTML document object.

Image 3

Image 4

In the above code, the getSourceCode() method of the WorkerClass class loads the complete HTML from the given link and returns it as a string. The returned string is then converted to an HtmlDocument and returned. The images above show that we now have the HTML document ready, so we can run a LINQ query against it to get the result we want.

Image 5

Here primaryDivId is a Boolean variable that is true if a div with the id divAchors is found. anchorsHref holds the collection of the anchors' links, and anchorsInnerText holds the collection of the anchors' inner text.
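Since the query itself is only visible as an image, the following is a rough Python equivalent using the standard library's `html.parser`, not HtmlAgilityPack. The class name `AnchorCollector` and its attribute names are hypothetical; they loosely mirror primaryDivId, anchorsHref, and anchorsInnerText from the explanation above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href and inner text of every <a> inside <div id=target_id>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found_div = False       # plays the role of primaryDivId
        self.div_depth = 0           # greater than 0 while inside the target div
        self.in_anchor = False
        self.anchors_href = []
        self.anchors_inner_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside the target div
            elif attrs.get("id") == self.target_id:
                self.found_div = True
                self.div_depth = 1
        elif tag == "a" and self.div_depth > 0:
            self.in_anchor = True
            self.anchors_href.append(attrs.get("href") or "")
            self.anchors_inner_text.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the anchor.
        if self.in_anchor:
            self.anchors_inner_text[-1] += data
```

To use it, create `AnchorCollector("divAchors")`, call `feed(html)` with the page source and then `close()`, and read the two lists. Anchors outside the target div are ignored.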

Problem Statement-2 

Suppose we need to download images. The HTML format may be like the following:

Image 6

Solution-2

To download all the images and also to get their alternative information text, we need to do the following:

Image 7

The //img expression passed to SelectNodes looks for img tags inside the div with the id divImage. If it finds any image tag within the scope of this div, it fetches its source and alternative text. Note that it does not matter where the image tag sits: even if it is nested several div levels deep, this query will fetch it.

Image 8

From the above code, we will be able to get the collection of the image source links in the imageSrc list and their alt text in the imageInnerText list. Using a foreach loop, we can download and save the images in our desired folder.
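As a hedged illustration of the same flow (the original C# is shown only as images), here is a Python sketch using the standard library. `ImageCollector` and `download_images` are hypothetical names; `image_src` and `image_alt` stand in for the imageSrc and imageInnerText lists, and the file-naming rule is my own assumption.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageCollector(HTMLParser):
    """Collect src and alt of every <img> nested at any depth in <div id=target_id>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.div_depth = 0  # greater than 0 while inside the target div
        self.image_src = []
        self.image_alt = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif attrs.get("id") == self.target_id:
                self.div_depth = 1
        elif tag == "img" and self.div_depth > 0:
            self.image_src.append(attrs.get("src") or "")
            self.image_alt.append(attrs.get("alt") or "")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def download_images(page_url, sources, folder):
    """Download each image into `folder` and return the saved file paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, src in enumerate(sources):
        absolute = urljoin(page_url, src)  # resolve relative image links
        # Assumed naming rule: last path segment, or a numbered fallback.
        name = os.path.basename(urlparse(absolute).path) or f"image_{index}"
        target = os.path.join(folder, name)
        urllib.request.urlretrieve(absolute, target)
        saved.append(target)
    return saved
```

After feeding the page source to `ImageCollector("divImage")`, passing its `image_src` list to `download_images` along with the page URL and a folder name performs the loop described above.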

Problem Statement-3

Suppose we need to find the inner text of a div with its class name. The HTML for this problem may look like the following:

Image 9

Solution-3

Here is the solution for this problem statement:

Image 10

The innerText string gives you the full text as a single uncut string, whereas innerTextList gives you the inner text as a list of separate entries.
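A small Python sketch of the same idea follows, again using only the standard library rather than HtmlAgilityPack. The names `ClassTextCollector` and `inner_text_by_class` are hypothetical, and joining the pieces with a single space is my assumption about how the full string is built.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the inner text of every <div> whose class list contains class_name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.div_depth = 0  # greater than 0 while inside a matching div
        self.inner_text_list = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.div_depth > 0:
            self.div_depth += 1  # nested div inside a matching div
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.div_depth = 1
            self.inner_text_list.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

    def handle_data(self, data):
        if self.div_depth > 0:
            self.inner_text_list[-1] += data


def inner_text_by_class(html, class_name):
    """Return (inner_text, inner_text_list) for divs carrying class_name."""
    collector = ClassTextCollector(class_name)
    collector.feed(html)
    collector.close()
    texts = [text.strip() for text in collector.inner_text_list]
    return " ".join(texts), texts
```

The first returned value corresponds to the single innerText string and the second to innerTextList.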

Problem Statement-4 

Suppose we have a problem similar to the previous one, with a slight change: the class name toggles between two classes, and we cannot be sure which class name will be present when the page renders. The HTML for this problem statement may look like the following:

Image 11

Image 12

Here the class toggles between demoText1 and demoText2.

Solution-4

Here is the solution for the above problem statement:

Image 13

The solution is similar to Solution-3, with an extra or (|) condition in the query. You can also use an and (&) condition if you need to.
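To make the or/and distinction concrete, here is a hedged Python sketch (standard library only, not the article's C# query). `AnyClassTextCollector`, `texts_for_classes`, and the `match_all` flag are hypothetical names introduced for illustration: with `match_all=False` a div matches if it has either class, and with `match_all=True` it must have both.

```python
from html.parser import HTMLParser


class AnyClassTextCollector(HTMLParser):
    """Collect inner text of <div> elements matched against several class names."""

    def __init__(self, class_names, match_all=False):
        super().__init__()
        self.class_names = set(class_names)
        self.match_all = match_all
        self.div_depth = 0  # greater than 0 while inside a matching div
        self.texts = []

    def _matches(self, classes):
        if self.match_all:
            return self.class_names <= classes    # "and": every name present
        return bool(self.class_names & classes)   # "or": at least one name present

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.div_depth > 0:
            self.div_depth += 1
        elif self._matches(set((dict(attrs).get("class") or "").split())):
            self.div_depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

    def handle_data(self, data):
        if self.div_depth > 0:
            self.texts[-1] += data


def texts_for_classes(html, class_names, match_all=False):
    """Return the stripped inner text of each matching div, in document order."""
    collector = AnyClassTextCollector(class_names, match_all)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.texts]
```

For the toggling case, `texts_for_classes(html, ["demoText1", "demoText2"])` picks up the div whichever of the two classes it happens to have when the page is fetched.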

These are the problems I have faced recently in my work, and this is how I solved them. I hope these solutions help with your own problems, since they cover several common web scraping tasks. If you run into more problems, please let me know and I will try to help. Thanks for reading. Happy coding. :)

References 

  1. Wikipedia
  2. CodePlex

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer, Cefalo
Bangladesh

Hi, I am Palash Debnath. I have been working with Windows technologies since 2008. I started with ASP.NET, then moved to Windows Forms, and later worked on Windows 8 and Windows 10 app development. I now work with Microsoft Azure. I completed my undergraduate degree in Computer Science & Engineering at Khulna University of Engineering. I currently work as a Senior Software Engineer at Cefalo.
