Hi all,

I developed a web scraper (in C#) that needs to make thousands of requests per run.

The problem is that the website's server will block my IP after a number of requests.

Questions:

1- How can I prevent being blocked?
2- How can I know when the website's server will block my IP? I mean, how do I find out my limit, whether it is a certain amount of traffic or a certain number of requests?

Thanks.
Updated 10-May-11 1:38am

There's no reliable way to do this. The one option would be to limit the scraping to a very slow rate, which kinda nullifies the very purpose of scraping.
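For what it's worth, slowing the scraper down could look roughly like the sketch below. It is only an illustration: `ThrottledFetcher`, `FetchAllAsync` and the `fetch` callback are made-up names, and the right delay depends entirely on the site, so treat any number you plug in as a guess.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ThrottledFetcher
{
    // Fetches the URLs one at a time, pausing for `delay` after each request.
    // `fetch` is a hypothetical callback that downloads a single page.
    public static async Task<List<string>> FetchAllAsync(
        IEnumerable<Uri> urls,
        Func<Uri, Task<string>> fetch,
        TimeSpan delay)
    {
        var pages = new List<string>();
        foreach (var url in urls)
        {
            pages.Add(await fetch(url));
            await Task.Delay(delay); // deliberate pause before the next request
        }
        return pages;
    }
}
```

With an `HttpClient` instance named `client`, a call might be `await ThrottledFetcher.FetchAllAsync(urls, u => client.GetStringAsync(u), TimeSpan.FromSeconds(5))`, where the five seconds is an arbitrary placeholder.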

Alternatively, spread the scraping out over multiple domains. For example, pick 100 domains, get 1 page from domain-1, then the next from domain-2, and so on till domain-100, then get the 2nd page from domain-1, then from domain-2, and so on. The trick here is that this slows your scraping down to 1/100 of its former speed from each server's perspective, but you don't actually lose any overall scraping speed because you are scraping from multiple sites. Makes sense?
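The interleaving described above could be sketched like this. It assumes you already have the full list of URLs up front; `RoundRobinScheduler` and `Interleave` are names invented for the example, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RoundRobinScheduler
{
    // Groups the URLs by host and yields one URL per host on each pass,
    // so consecutive requests to the same host are spread as far apart as possible.
    public static IEnumerable<Uri> Interleave(IEnumerable<Uri> urls)
    {
        // One FIFO queue per host, in order of first appearance.
        var queues = urls
            .GroupBy(u => u.Host, StringComparer.OrdinalIgnoreCase)
            .Select(g => new Queue<Uri>(g))
            .ToList();

        while (queues.Count > 0)
        {
            foreach (var queue in queues)
            {
                yield return queue.Dequeue();
            }
            // Drop hosts that have no pages left before the next pass.
            queues.RemoveAll(q => q.Count == 0);
        }
    }
}
```

Given two pages on a.example followed by one page on b.example, this yields the first a.example page, then the b.example page, then the second a.example page. Note that once most domains run dry, the remaining ones get hit back to back again, so a real scraper would probably still want a per-host delay on top of this.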
 
Comments
BobJanova 10-May-11 9:58am    
This is a good idea. The OP needs to realise that his scraper is essentially a low level DoS tool and modify it accordingly. Spreading his efforts over multiple servers would do that quite effectively.

1) That's very easy. Don't do anything that might lead to being considered a threat. Bombarding a server with 'thousands of requests each time' could be considered to be hostile.

2) Now that's a good idea. Each server should post information on how much abuse its owner will tolerate. Seriously, it's very much like when you misbehave at somebody's house. When the owner grabs you by the collar and shows you the door, then you know how far you could go.
 
You won't get blocked if you are able to do it my way.

Use an ADSL modem like Airties, have your server use that internet connection, and send the modem a reset command on a schedule. With a dynamic-IP connection, each reset will typically get you a new address.

That works like a charm. :)
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


