Automating Internet Browsing With Robots

Posted 21 Jan 2009
How to reverse engineer a website to build a robot to automate browsing.


Every now and then I am driven nuts by websites that make me click through tens of pages to do relatively simple things I need to do on a regular basis, or that insist I view their content interactively on their web page rather than letting me download it for viewing later.

An example of this is YouTube: the number of downloaders available on the Internet is testament to the demand for grabbing videos so they can be watched offline, either on the PC or on iPods and other more convenient devices.

While it would be simple to just create such a program and upload it here, I thought it would be more useful if I walked you through the process of creating such a program, as it really is a combination of activities, only one of which is coding.

Before I get into this: it is inevitable that the program accompanying this article will work as-is for only a short period of time. The code has a built-in dependency on YouTube continuing to work exactly as it does now (Jan 2009). This is, of course, unrealistic, but I don't intend to keep revisiting and updating this article to keep it working. When it breaks, I leave it as an exercise for the reader to use the methods below to work out what changes are needed to get it working again.

What will remain useful is the article's set of utility functions and coding patterns for quickly implementing these sorts of robots.


Before we commence, I want to point out that I have written this article on a machine that uses Internet Explorer. It can quite possibly, even probably, all be done with Opera, Firefox, etc. as well, but as I don't use those browsers, I have not documented how you would use them instead of IE.

Step 1 - Capture A Sample

Before we open up our coding tools, we have some work to do with our browser. We need to reverse engineer how the web site in question interacts with our browser.

The tool I use for this is a free browser plug-in named Fiddler. You will need to download and install this tool on your machine. For a full description of what this is and how it works, please see the tool's website. Using IE, Fiddler will just work. Using other browsers, you may need to spend some time reading the website to work out how to get it to work.

Our first step is to observe what happens when we watch a video on YouTube. To do this, we follow these steps ...

  1. Open up your browser.
  2. Find the option for clearing the browser cache, and do so. An empty cache makes it easier to see what is going on. In IE 7, go to the Tools item on the command bar and select the Internet Options menu item; on the General tab, click the Delete button in the Browsing history section, then click the Delete files button in the dialog box that appears.
  3. Start up Fiddler (in IE 7, you can find it under Tools/Fiddler2 in the command bar). (StartFiddler.JPG)
  4. Clear any capture data when Fiddler starts, using the Edit/Remove/All Sessions menu item. (RemoveSessions.JPG)
  5. Navigate to YouTube.
  6. Find a video you are interested in.
  7. Right-click the video thumbnail, select Copy Shortcut from the context menu, and save this shortcut in a text document. This URL is our starting URL. (CopyShortcut.JPG)
  8. Watch the video in its entirety ... choosing a short video helps here.
  9. Stop Fiddler capturing data using the File/Capture Traffic menu item. (StopCapture.JPG)
  10. Select all the captured traffic using the Edit/Select All menu item.
  11. Save the traffic to a file using the File/Save/Session(s)/In ArchiveZip menu item for future reference. (SaveInArchive.JPG)

Step 2 - Investigation

The first thing that can be observed, by right-clicking on the video in the browser, is that YouTube uses Adobe Flash to play videos.


Flash videos are typically either streamed or downloaded .FLV files. A quick look in the browser cache shows that we do indeed have a fairly large FLV file, which is our video. This is the file we want our tool to download.


Our next step is to go back to Fiddler and find the request which downloaded the .FLV file, and more specifically, take a close look at the URL. This is the URL we need to access:


Now, we go to the start of the Fiddler log and try to find our starting URL. This should be the same as the URL we found when we copied the shortcut behind the video thumbnail. This is our starting point. We just have to work out how to get from here to the .FLV file's URL.


Looking at the requests between the original shortcut and the FLV file, we see another interesting request.


This is interesting because it seems to be referring to the same page, but on a different server with different (and simpler) parameters.

Before we go on, I need to introduce you to Fiddler's right-hand panel. This panel lets you inspect the web request made and the response received. If you click on the URL above and select the Inspectors tab (red circle), you can see the request data (pink circle) and the response data (green circle). Spend some time familiarising yourself with this panel, as we will be using it heavily in the next few steps.


Drawing your attention to the Raw tab in the response, we can see this request resulted in an HTTP/1.1 status code of 303. This code means that, in response to this request, the website has told your browser it is looking in the wrong place and has supplied an alternative location to look. You can see this location six lines down, prefixed by Location:, and fortunately for us, it is that long, complicated target URL.
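The 303 mechanics can be demonstrated in isolation. Below is a minimal sketch that pulls the Location header out of a raw response, much as Fiddler displays it; the response text and URL here are made up for illustration, not captured from YouTube:

```csharp
using System;

static class RedirectDemo
{
    // Scan the raw header lines for the Location: header;
    // returns null when the response contains no redirect target.
    public static string FindLocation(string rawResponse)
    {
        foreach (string line in rawResponse.Split(new[] { "\r\n" }, StringSplitOptions.None))
        {
            if (line.StartsWith("Location: ", StringComparison.OrdinalIgnoreCase))
                return line.Substring("Location: ".Length);
        }
        return null;
    }

    static void Main()
    {
        // A trimmed, hypothetical 303 response of the kind Fiddler shows.
        string raw = "HTTP/1.1 303 See Other\r\n" +
                     "Location: http://example.com/get_video?video_id=abc&t=xyz\r\n" +
                     "\r\n";
        Console.WriteLine(FindLocation(raw)); // http://example.com/get_video?video_id=abc&t=xyz
    }
}
```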

To test this, we can take the shorter:

URL and put it into our browser address bar, and a few seconds later, the file download box appears. A bit of "try it and see" experimenting with this URL further shows that it can be shortened to:

and it will still work fine. This simpler URL is now our new target.

Our next step is to work out how to get the components of this URL. Breaking it down, we have a number of elements:

  • - This is the server to contact. This appears to be the same as our original shortcut, so we should be able to get it from there.
  • get_video? - This seems fairly static, so we can hardcode this.
  • video_id=zgDp4CW_ZI0& - This was the v parameter in the original shortcut, so once again, we should be able to get it from there.
  • t=OEgsToPDskL4WlMPA0C8xypYnj1Q6IIE - Last one, and what do you know, we are stuck.
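The first two pieces we can already pull out of the shortcut. As a sketch, using the same lazy-regex approach the article's code takes (the shortcut URL shape here is an assumption based on the v parameter described above):

```csharp
using System;
using System.Text.RegularExpressions;

static class ShortcutParser
{
    // Everything up to, but excluding, "watch" is the server part.
    public static string Server(string shortcut)
    {
        return Regex.Match(shortcut, "^(?'val'.*?)watch").Groups["val"].Value;
    }

    // Everything after "v=" is the video id.
    public static string VideoId(string shortcut)
    {
        return Regex.Match(shortcut, "v=(?'val'.*?)$").Groups["val"].Value;
    }

    static void Main()
    {
        // Hypothetical shortcut of the shape discussed above.
        string shortcut = "http://www.youtube.com/watch?v=zgDp4CW_ZI0";
        Console.WriteLine(Server(shortcut));  // http://www.youtube.com/
        Console.WriteLine(VideoId(shortcut)); // zgDp4CW_ZI0
    }
}
```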

To find the source for our t parameter, we need to go back to Fiddler and look at the response to the original shortcut.


Looking closely at the raw response, we see it is an HTML file which we can search.


Fortunately, this string is in the file on the following line:

var fullscreenUrl = '/watch_fullscreen?fs=1&creator=Hughsnews&
 kSN0urx0C&t=OEgsToPDskL4WlMPA0C8xypYnj1Q6IIE&title=Airplane Fever!';
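Given that line, the t value sits between "&t=" and the next "&", so a lazy regex with a named capture group digs it out. A minimal sketch, run against the line shown above:

```csharp
using System;
using System.Text.RegularExpressions;

static class TParameter
{
    // Grab the value between "&t=" and the next "&" in the page text.
    public static string Extract(string page)
    {
        return Regex.Match(page, "&t=(?'val'.*?)&").Groups["val"].Value;
    }

    static void Main()
    {
        string line = "var fullscreenUrl = '/watch_fullscreen?fs=1&creator=Hughsnews&" +
                      "kSN0urx0C&t=OEgsToPDskL4WlMPA0C8xypYnj1Q6IIE&title=Airplane Fever!';";
        Console.WriteLine(Extract(line)); // OEgsToPDskL4WlMPA0C8xypYnj1Q6IIE
    }
}
```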

So now, we have everything we need to build a robot. Our new robot needs to follow these steps:

  1. Accept the shortcut.
  2. Extract the server name and the v parameter.
  3. Retrieve the webpage the shortcut refers to.
  4. Search the returned page for the line containing our t parameter.
  5. Build the URL and request it.
  6. When we get the redirection, follow it.
  7. Save the FLV file returned.

The Code

The code attached to the article implements the robot steps identified above. I don't plan to discuss it here, as it is in the source files and clearly documented.

Of more interest is the RobotUtility.cs file as it contains a set of small but useful functions which can be reused in any similar robot. The functions here are not exhaustive, and if you plan to build robots for other websites, I am sure you will want to extend it with other functions.

The functions and their use can be summarised as follows:

string GetPage(Uri url)

This function gets the text returned by a URL in a string.

string ExtractValue(string tosearch, string regex)

This function is useful for extracting substrings from a string, but the regex string should follow some strict rules:

  • It must have a capture group named val.
  • It must find what you are looking for in the first match.
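To make those rules concrete, here is a minimal sketch of how an ExtractValue-style helper might be implemented; this is an illustration of the contract, not a copy of the article's actual source:

```csharp
using System;
using System.Text.RegularExpressions;

static class ExtractDemo
{
    // Returns the text captured by the group named "val" in the
    // first match, or an empty string when there is no match.
    public static string ExtractValue(string tosearch, string regex)
    {
        Match m = Regex.Match(tosearch, regex);
        return m.Success ? m.Groups["val"].Value : string.Empty;
    }

    static void Main()
    {
        // The 'val' group marks which part of the match we want back.
        Console.WriteLine(ExtractValue("v=abc123", "v=(?'val'.*?)$")); // abc123
    }
}
```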

Uri GetRedirect(Uri url)

This function requests the URL, and returns the new URL that this one redirects to.

void GetFile(Uri url, string file)

This function requests the URL, and saves the resulting data in the named file.
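For reference, here is one way these four helpers could be built on top of System.Net. This is a hedged sketch written under the assumption that the real RobotUtility versions behave as described above; note that GetRedirect must switch off automatic redirect following so the 303 response can be inspected rather than silently followed:

```csharp
using System;
using System.Net;

static class RobotHelpers
{
    // Download the page body as a string.
    public static string GetPage(Uri url)
    {
        using (WebClient client = new WebClient())
            return client.DownloadString(url);
    }

    // Request the URL without following redirects, and return the
    // Location the server sends back (or the original URL if none).
    public static Uri GetRedirect(Uri url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = false;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            string location = response.Headers["Location"];
            // Resolve a relative Location against the original URL.
            return location == null ? url : new Uri(url, location);
        }
    }

    // Download the URL's content straight to a file on disk.
    public static void GetFile(Uri url, string file)
    {
        using (WebClient client = new WebClient())
            client.DownloadFile(url, file);
    }
}
```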

Putting it all Together

Using these functions, it is possible to implement the robot steps described above as follows:

// Accept the shortcut
string shortcut = tbURL.Text;

// Extract the server name and v parameter.
string server = RobotUtility.Utility.ExtractValue(shortcut, "^(?'val'.*?)watch");
// grab everything up to but excluding 'watch'

string v = RobotUtility.Utility.ExtractValue(shortcut, "v=(?'val'.*?)$");
// grab everything after but excluding 'v='

// Retrieve the webpage the shortcut refers to
string page = RobotUtility.Utility.GetPage(new Uri(shortcut));

// Search the returned page for our t parameter
string t = RobotUtility.Utility.ExtractValue(page, "&t=(?'val'.*?)&");
// grab everything after but excluding '&t=' up to the next '&'

// Build the URL and request it.
string flvurl = server + "get_video?video_id=" + v + "&t=" + t;

Uri redirect = RobotUtility.Utility.GetRedirect(new Uri(flvurl)); 
// as it happens this step is not necessary as the GetFile 
// actually will handle the redirect but this is consistent with the article.

// When we get the redirection follow it
// Save the FLV file returned
string file = tbDestination.Text + "\\" + v + ".flv";
RobotUtility.Utility.GetFile(redirect, file);


When you need to do something in bulk, or you want to evade an overly restrictive website, a robot can save you huge amounts of time and frustration.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Article Copyright 2009 by Keithsw