I am writing a C# program that needs to pull the addresses of selected files from a web page and then download those files. The site in question is http://www.un.org/depts/dhl/resguide/r1.htm (and various similar pages). The problem is that the links do not point directly to the files: if you follow one in a browser, it redirects you first to a temporary page and then to the file itself. If I request the link's address directly, rather than clicking it on the page, I get an error page instead of the file.

Any ideas on how I can reach the actual file address through my program?
Comments
William Winner 21-Oct-10 17:04pm    
By the way, if my answer provided you with enough information to move on, then make sure to mark it as a solution so others know your question has been solved.

If it didn't, post a comment to my answer with questions or why it doesn't work for you.

Ok, so I was able to get it to work. I created a simple form that has a ListBox and a button on it. When the form loads, it goes to the http://www.un.org/depts/dhl/resguide/r1.htm page and pulls out all of the links. Then, when you select a link (assuming you select one of the PDF links) and click the button, it goes through the whole process of redirecting, acquiring cookies, and then writing the file out to a temporary file. Here's the code:

C#
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using System.Net;
using System.IO;

namespace UNDocs
{
    public partial class Form1 : Form
    {
        private const string StartingPage = @"http://www.un.org/depts/dhl/resguide/r1.htm";
        private const string CookieOriginator = @"http://daccess-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234";

        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            //Load all of the links in the Listbox
            string html = GetHTML(StartingPage);

            MatchCollection matches = GetLinks(html);

            foreach (Match match in matches)
            {
                string value = match.Groups["link"].Value;

                listBox1.Items.Add(value);
            }
        }

        private void button1_Click(object sender, EventArgs e)
        {
            //Nothing to download unless a link is selected
            if (listBox1.SelectedItem == null)
            {
                return;
            }

            string selectedLink = listBox1.SelectedItem.ToString();

            //Get the temporary page to redirect to
            string tempPage = GetURLBase(selectedLink) +
                              GetPageToRedirectTo(selectedLink, StartingPage);

            //Get the cookies to use
            CookieContainer cookies = GetCookies(CookieOriginator, tempPage);

            //Follow the temporary page's redirect to get the address of the actual file
            string finalPage = GetPageToRedirectTo(tempPage);

            //Get the byte array representing the pdf file
            byte[] pdf = GetBytesFromHTTP(finalPage, cookies);

            //write the file to disk
            WriteFile(@"D:\temp.pdf", pdf);
        }

        public MatchCollection GetLinks(string s)
        {
            Regex regex = new Regex("href=\"(?<link>.*?)\"", RegexOptions.Multiline);
            return regex.Matches(s);
        }

        public string GetHTML(string url)
        {
            return GetHTML(url, "");
        }

        public string GetHTML(string url, string Referer)
        {
            return GetHTML(url, Referer, new CookieContainer());
        }
     
        public string GetHTML(string url, string Referer, CookieContainer cookies)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;
            myRequest.CookieContainer = cookies;

            string pageSource = "";

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    pageSource = reader.ReadToEnd();
                }
            }

            return pageSource;
        }

        public byte[] GetBytesFromHTTP(string url, CookieContainer cookies)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.CookieContainer = cookies;
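            //Ask for compressed content and let the framework decompress it,
            //otherwise a gzip response would be written to disk still compressed
            myRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;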

            byte[] result = null;
            byte[] buffer = new byte[4096];

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (Stream responseStream = response.GetResponseStream())
                {
                    using (MemoryStream memoryStream = new MemoryStream())
                    {
                        int count = 0;

                        do
                        {
                            count = responseStream.Read(buffer, 0, buffer.Length);
                            memoryStream.Write(buffer, 0, count);
                        } while (count != 0);

                        result = memoryStream.ToArray();
                    }
                }
            }

            return result;
        }

        private string GetURLBase(string url)
        {
            Regex regex = new Regex("(?<base>http://.*?)/");
            return regex.Match(url).Groups["base"].Value;
        }

        private string GetPageToRedirectTo(string url)
        {
            return GetPageToRedirectTo(url, "");
        }

        private string GetPageToRedirectTo(string url, string Referer)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;

            string pageSource = "";

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    pageSource = reader.ReadToEnd();
                }
            }

            string urlToRedirectTo = "";

            //Get URL in Meta Tag (case-insensitive, stopping at the first closing quote)
            Regex regex = new Regex("<META.*?URL=(?<URL>[^\"]*)\"", RegexOptions.IgnoreCase);
            urlToRedirectTo = regex.Match(pageSource).Groups["URL"].Value;

            return urlToRedirectTo;
        }

        private CookieContainer GetCookies(string url, string Referer)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;
            myRequest.CookieContainer = new CookieContainer();
            myRequest.GetResponse().Close();

            return myRequest.CookieContainer;
        }

        public void WriteFile(string FileName, byte[] fileContents)
        {
            //Creates the file, or overwrites it if it already exists
            File.WriteAllBytes(FileName, fileContents);
        }
    }
}


(that was fun to figure out!)
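As a side note on GetURLBase above: it only matches http:// addresses. A minimal sketch of the same idea using System.Uri is below. The class and method names (UrlSketch, GetUrlBase, Resolve) are made up for illustration, and the /TMP/1234567.html path in the usage line is a hypothetical redirect target, not something taken from the site.

```csharp
using System;

public static class UrlSketch
{
    // Returns scheme + host (+ port, if any), e.g. "http://daccess-ods.un.org".
    public static string GetUrlBase(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    // Resolves a redirect target (relative or absolute) against the page it came from.
    public static string Resolve(string pageUrl, string target)
    {
        return new Uri(new Uri(pageUrl), target).AbsoluteUri;
    }
}
```

For example, UrlSketch.Resolve(link, "/TMP/1234567.html") would do the work of GetURLBase(link) + redirectPath in one call, and it should also cope with a redirect target that is already an absolute URL.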
 
Comments
m4tthew_uk 22-Oct-10 6:23am    
Thanks a lot for the time you spent on this. I am at work right now so can't test it but it looks great and I will be checking it out when I get home tonight.
That's a very strange web page. I know that it is setting cookies, because after I go to it, my browser has cookies set. However, when I try to use InternetGetCookie(...), it doesn't return anything. And when I try using HttpWebRequest to get the cookies, the same thing happens.

And they are session cookies that are created dynamically, so you can't just use the values in the cookies that are passed to the browser.

It would appear that it uses a Google Analytics script to write the cookies that you would need. So, I'm not sure how you could go about doing it...

[Update]
Ok, so I guess it's just using GA to do some tracking and those cookies are not necessary.

It is setting its own cookie, which I can track with HttpFox. But I can't understand why I get two different sets of HTML when following the link. If I follow the link in Firefox, the first thing that comes up is a blank page that just redirects. But in code, it comes up with a page that says "not authorised".

AHA!!!! I figured it out. You have to set the HttpWebRequest.Referer property equal to the referring page. That gives you the page with the redirect.
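Roughly, setting the referer looks like the sketch below. RefererSketch and CreateRequest are names made up for illustration; the method only builds the request, it does not send it.

```csharp
using System;
using System.Net;

public static class RefererSketch
{
    // Builds (but does not send) a request that carries the Referer header.
    public static HttpWebRequest CreateRequest(string url, string referer)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // Without this the server answers with the "not authorised" page
        request.Referer = referer;

        return request;
    }
}
```

You would pass the document link as url and the page the link was found on (here http://www.un.org/depts/dhl/resguide/r1.htm) as referer, then call GetResponse() on the result.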

The redirected page has two different links in it. One is the page to be redirected to, the other adds a frame to the page with a source. That frame is what sets the cookie. Without that cookie, the page that you are being redirected to never loads.
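One way to carry that cookie along is to share a single CookieContainer between the request for the frame's source and the request for the final page. This is only a sketch of that idea: CookieJarSketch and its members are hypothetical names, and I am assuming the server sets the cookie with an ordinary Set-Cookie response header.

```csharp
using System;
using System.Net;

public static class CookieJarSketch
{
    // One container shared by every request in the chain
    public static readonly CookieContainer Jar = new CookieContainer();

    // Builds (but does not send) a request that reads from and writes to the shared jar.
    public static HttpWebRequest CreateRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // Cookies set by this request's response are stored in Jar,
        // and cookies already in Jar for this host are sent with the request
        request.CookieContainer = Jar;

        return request;
    }

    // Shows which cookies would be sent to the given address.
    public static string CookiesFor(string url)
    {
        return Jar.GetCookieHeader(new Uri(url));
    }
}
```

Requesting the frame's source through CreateRequest first, and the document afterwards, should mean the second request carries whatever cookie the first response set. CookiesFor is just there to check what is in the jar for a given address.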

I'm curious now if I can write something that will actually go all the way through...
 
You could download the temporary html, and parse it to get the redirected url.
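A rough sketch of the parsing step, assuming the temporary page redirects with a meta refresh tag and may also contain a frame. RedirectParser and both method names are made up for illustration, and a regular expression is only a quick approximation of real HTML parsing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RedirectParser
{
    // Returns the URL from <meta http-equiv="refresh" content="0; URL=...">, or "" if none.
    public static string GetMetaRefreshUrl(string html)
    {
        Regex regex = new Regex(
            @"<meta[^>]*http-equiv\s*=\s*[""']?refresh[""']?[^>]*url\s*=\s*(?<url>[^""'>\s]+)",
            RegexOptions.IgnoreCase);
        return regex.Match(html).Groups["url"].Value;
    }

    // Returns the src attribute of the first <frame> or <iframe>, or "" if none.
    public static string GetFrameSource(string html)
    {
        Regex regex = new Regex(
            @"<i?frame[^>]*\ssrc\s*=\s*[""'](?<src>[^""']+)[""']",
            RegexOptions.IgnoreCase);
        return regex.Match(html).Groups["src"].Value;
    }
}
```

Both methods take the page source as a string, so they can be tried on HTML saved from the browser before wiring them up to a web request.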
 
 
Comments
William Winner 21-Oct-10 11:32am    
The only problem is that the links on the page he gave us don't go to a temporary URL. If you pull out the HTML of the actual links, I don't even see how they're doing the redirecting. Even the temporary URL doesn't provide a link to get to the document...
Nish Nishant 21-Oct-10 11:39am    
Interestingly none of those links even work for me. I get a not-authorized page.
William Winner 21-Oct-10 11:45am    
It seems to need us to send a cookie, but I can't even get it to download the original cookie...
m4tthew_uk 21-Oct-10 12:00pm    
That's the problem. The links in the HTML look like this: http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/RES/103(I)&Lang=E&Area=RESOLUTION If I follow the link in any way other than clicking it on the page, I get "not authorised" too. I am not sure what is happening: I can see that it redirects me from the link in the HTML to a temporary (seemingly blank) page and then to the file, but in code I can't get past the page at the above link.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


