Web Scraping (Problems & Solutions)

dpalash, 5 Nov 2013, CPOL

This article gives you a basic idea of web scraping, along with a few problems you may run into in practice and their solutions.

Introduction

Web scraping is considered an efficient, programmatic way to grab data from different web sources. It works on web pages: it is a simple technique for collecting the information you need from other sites' pages into your own database.

Things to Consider:

  1. HTML structure.
  2. Proper tagging.

1. HTML Structure:

The first consideration for web scraping is the HTML structure. Scraping needs the content's HTML to be well structured. Without properly structured HTML, scraping becomes messy, time-consuming, and error-prone. If the content is well structured, scraping is an excellent way to collect data.

2. Proper Tagging:

The content's HTML tags need to be properly formatted and carry an id or a class. If the elements have only inline markup and no identifiers, scraping will be a mess, because we need some identifier to fetch the data. The proper way is for the elements to have an id or a class name that we can query. If the content's HTML provides this, scraping is a good option.

Uses of Web Scraping

  1. Online price comparison
  2. Contact scraping  
  3. Weather data monitoring
  4. Website change detection
  5. Research
  6. Web mashups
  7. Web data integration
  8. Telephone number collection
  9. Address collection
  10. Country/city/state name collection

In this article, I will discuss a few useful web scraping techniques using HtmlAgilityPack. A particularly useful feature of HTML Agility Pack is that it now supports LINQ, which means you can write ordinary LINQ queries to get your results. If you need more information about HTML Agility Pack, you can visit its documentation at CodePlex.

Okay, so let’s begin now.

Problem Statement-1 

Suppose we have an HTML page that contains a div with the id divAchors, and inside it a number of anchor tags. From this HTML, we want to extract only the links (the href values) of the anchor tags.

Solution-1

Step 1: Process the raw content (that is, the HTML). Load the complete HTML source and convert it to a string. Using an HTTP web request and response, we fetch the entire HTML from the given link. A StreamReader then reads the content to the end, which gives us the HTML source as a string.
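Step 1 can be sketched roughly as follows. This is a reconstruction under assumptions, not the author's original code: the names WorkerClass and getSourceCode come from the article, but the method body, the readToEnd helper, and the UTF-8 encoding are assumptions made for illustration.

```csharp
using System.IO;
using System.Net;
using System.Text;

public static class WorkerClass
{
    // Fetches the page at the given URL and returns its HTML as a string.
    public static string getSourceCode(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            return readToEnd(stream);
        }
    }

    // Hypothetical helper: reads a stream to the end as text.
    // Assumes UTF-8; a real page may declare a different encoding.
    public static string readToEnd(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

On recent .NET versions, HttpWebRequest is marked obsolete and HttpClient is the usual replacement; the request/response/StreamReader flow is kept here because it is the one the article describes.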

Step 2: Return the converted string and convert it again, this time to an HTML document type.

The getSourceCode() method in the WorkerClass class loads the complete HTML of the page and returns it as a string. The returned string is then converted to an HtmlDocument and returned. With the HTML document ready, we can run a LINQ query against it to get the desired result.
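A minimal sketch of Step 2, assuming the HtmlAgilityPack NuGet package is referenced. The class name DocumentLoader and the method name toHtmlDocument are hypothetical; HtmlDocument and LoadHtml are part of HTML Agility Pack.

```csharp
using HtmlAgilityPack;

public static class DocumentLoader
{
    // Parses an HTML string into an HtmlDocument that can be queried
    // with LINQ or XPath.
    public static HtmlDocument toHtmlDocument(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document;
    }
}
```

HTML Agility Pack is tolerant of malformed markup, so this step normally succeeds even when the source page is not strictly valid HTML.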

In the query, primaryDivId is a Boolean variable that is true if a div with the id divAchors is found. anchorsHref holds the collection of the anchors' links, and anchorsInnerText holds the collection of the anchors' inner text.
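The LINQ query described above might look something like the sketch below. The variable names primaryDivId, anchorsHref, and anchorsInnerText and the id divAchors come from the article; the class AnchorScraper, the method tryGetAnchors, and the sample markup are hypothetical and stand in for the original code and HTML, which are not reproduced here.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class AnchorScraper
{
    // Hypothetical sample markup, not the article's original HTML.
    public const string SampleHtml =
        "<div id='divAchors'>" +
        "<a href='http://example.com/one'>One</a>" +
        "<div><a href='http://example.com/two'>Two</a></div>" +
        "</div>" +
        "<a href='http://example.com/outside'>Outside</a>";

    // Returns true if a div with id "divAchors" exists, and fills the two
    // lists with the href and inner text of every anchor inside it.
    public static bool tryGetAnchors(string html,
        out List<string> anchorsHref, out List<string> anchorsInnerText)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var primaryDivs = document.DocumentNode
            .Descendants("div")
            .Where(div => div.GetAttributeValue("id", "") == "divAchors")
            .ToList();
        bool primaryDivId = primaryDivs.Any();

        // Only anchors that actually carry an href are kept.
        var anchors = primaryDivs
            .SelectMany(div => div.Descendants("a"))
            .Where(a => a.Attributes["href"] != null)
            .ToList();

        anchorsHref = anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
        anchorsInnerText = anchors.Select(a => a.InnerText.Trim()).ToList();
        return primaryDivId;
    }
}
```

With the sample markup, the anchor outside the div is skipped, while the anchor nested one div deeper is still picked up because Descendants walks the whole subtree.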

Problem Statement-2 

Suppose we need to download images from a page where the img tags sit inside a div with the id divImage.

Solution-2

To download all the images and also get their alternative (alt) text, we select the img tags and read their attributes.

The //img in the SelectNodes query means that img tags are searched for anywhere inside the div with the id divImage. If any img tag is found within the scope of this div, its source and alt text are fetched. Note that it does not matter where the img tag sits: even if it is nested several div levels deep, this query will fetch them all.

This gives us the collection of image source links in the imageSrc list and their alt text in the imageInnerText list. Using a foreach loop, we can then download and save the images in our desired folder.
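A sketch of this approach is shown below. The list names imageSrc and imageInnerText and the id divImage come from the article; the class ImageScraper, its method names, the sample markup, and the use of WebClient for the download are assumptions. The download step also assumes every src is an absolute URL.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;

public static class ImageScraper
{
    // Hypothetical sample markup, not the article's original HTML.
    public const string SampleHtml =
        "<div id='divImage'>" +
        "<img src='http://example.com/a.png' alt='First' />" +
        "<div><div><img src='http://example.com/b.png' alt='Second' /></div></div>" +
        "</div>" +
        "<img src='http://example.com/outside.png' alt='Outside' />";

    // Collects src and alt of every img anywhere inside div#divImage.
    public static void extract(string html,
        List<string> imageSrc, List<string> imageInnerText)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var images = document.DocumentNode
            .SelectNodes("//div[@id='divImage']//img");
        if (images == null) return; // SelectNodes yields null when nothing matches

        foreach (var img in images)
        {
            imageSrc.Add(img.GetAttributeValue("src", ""));
            imageInnerText.Add(img.GetAttributeValue("alt", ""));
        }
    }

    // Saves each image into the given folder, named after the URL's last segment.
    // Assumes absolute URLs; relative src values would need resolving first.
    public static void downloadAll(IEnumerable<string> imageSrc, string folder)
    {
        Directory.CreateDirectory(folder);
        using (var client = new WebClient())
        {
            foreach (var src in imageSrc)
            {
                var fileName = Path.GetFileName(new Uri(src).LocalPath);
                client.DownloadFile(src, Path.Combine(folder, fileName));
            }
        }
    }
}
```

The null check matters in practice: an unguarded foreach over the result of SelectNodes throws when the page contains no matching image.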

Problem Statement-3

Suppose we need to find the inner text of a div by its class name.

Solution-3

The solution is to select the div by its class name and read its inner text.

The innerText string gives you the full, uncut text as a single string, whereas innerTextList gives you the inner text as a list of separate items.
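One way to sketch this is shown below. The names innerText and innerTextList come from the article; the class InnerTextScraper, the method names, the class name demoText, and the sample markup are hypothetical. Joining the pieces with a space to form the single string is also an assumption about what "full length" means here.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class InnerTextScraper
{
    // Hypothetical sample markup, not the article's original HTML.
    public const string SampleHtml =
        "<div class='demoText'>First block</div>" +
        "<div class='other'>Ignored</div>" +
        "<div class='demoText'>Second block</div>";

    // Returns the inner text of every div whose class attribute equals className.
    // className is assumed to contain no quote characters.
    public static List<string> getInnerTextList(string html, string className)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var nodes = document.DocumentNode
            .SelectNodes("//div[@class='" + className + "']");
        if (nodes == null) return new List<string>();

        return nodes.Select(node => node.InnerText.Trim()).ToList();
    }

    // Returns the same text as one string, pieces separated by a space.
    public static string getInnerText(string html, string className)
    {
        return string.Join(" ", getInnerTextList(html, className));
    }
}
```

Note that @class='demoText' matches only when the class attribute is exactly that value; a div with several classes would need a contains() test instead.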

Problem Statement-4 

Suppose we have a problem similar to the previous one, with a slight change: the class name toggles between two classes, and we cannot be sure which class name will be present when the page renders.

Here the class toggles between demoText1 and demoText2.

Solution-4

The solution is similar to Solution-3, with an extra or (|) condition in the query. You can also use an and (&) condition if you need to.
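A sketch of the two-class variant, assuming the query is written as XPath. The class names demoText1 and demoText2 come from the article; the class ToggledClassScraper, the method name, and the sample markup are hypothetical. In XPath, | is the union operator that combines two node sets, while the keywords and / or combine conditions inside a single predicate.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ToggledClassScraper
{
    // Hypothetical sample markup: only one of the two classes is present.
    public const string SampleHtml =
        "<div class='header'>Ignored</div>" +
        "<div class='demoText2'>Toggled text</div>";

    // Returns the inner text of every div whose class is demoText1 or demoText2.
    public static List<string> getInnerTextList(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Union of two node sets: matches whichever class the page rendered.
        var nodes = document.DocumentNode.SelectNodes(
            "//div[@class='demoText1'] | //div[@class='demoText2']");
        if (nodes == null) return new List<string>();

        return nodes.Select(node => node.InnerText.Trim()).ToList();
    }
}
```

An equivalent single-predicate form would be //div[@class='demoText1' or @class='demoText2'].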

These are the problems I have faced recently in my work, and this is how I solved them. I think these solutions will help you with your own problems, because they cover many common web scraping situations. If you encounter more problems, please let me know and I will try to solve them. Thanks for reading. Happy coding. :)

References 

  1. Wikipedia
  2. CodePlex

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

dpalash
Software Developer, Cefalo
Bangladesh

Hi,

I am Palash Debnath. I have been working with Windows technologies since 2008. I started with ASP.NET, then moved to Windows Forms, and for the last year I have been working on Windows 8 app development. My future plan is to work on Windows 10 app development as well. I completed my undergraduate degree in Computer Science & Engineering at Khulna University of Engineering. I now work as a Software Engineer at Cefalo on Windows 8 app development.

Article Copyright 2013 by dpalash