Downloading files without a direct address from a C# program

Question

I am writing a C# program that needs to pull the addresses of selected files from a webpage and then download those files. The website in question is http://www.un.org/depts/dhl/resguide/r1.htm (and various similar pages). The problem is that the links do not point directly to the files: if you follow one in a browser, it redirects you first to a temporary page and then to the file itself. If I request the link address directly, rather than clicking the link on the webpage, I am not directed to the file but get an error page instead.

Any ideas on how I can reach the actual file address through my program?

Posted 21-Oct-10 5:15am

m4tthew_uk

Comments

William Winner 21-Oct-10 17:04pm

By the way, if my answer provided you with enough information to move on, then make sure to mark it as a solution so others know your question has been solved.

If it didn't, post a comment to my answer with questions or why it doesn't work for you.

3 solutions

Solution 2

That's a very strange webpage. I know that it is setting cookies, because after I go to it, my browser has cookies set. However, when I try to use InternetGetCookie(...), it doesn't return anything. And when I try using HttpWebRequest to get the cookies, the same thing happens.

And they are session cookies that are created dynamically, so you can't just reuse the values in the cookies that are passed to the browser.

It would appear that it uses a Google Analytics script to write the cookies that you would need. So, I'm not sure how you could go about doing it...

[Update]
Ok, so I guess it's just using GA to do some tracking and those cookies are not necessary.

It is setting its own cookie, which I can track with HttpFox. But I can't understand why I get two different sets of HTML when following the link. If I follow the link in Firefox, the first thing that comes up is a blank page that just redirects. But in code, it comes up with a page that says "not authorised".

AHA!!!! I figured it out. You have to set the HttpWebRequest.Referer property equal to the referring page. That gives you the page with the redirect.
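As a rough illustration of the Referer point, here is a minimal sketch that builds a request carrying that header without sending it. The class and method names (RefererSketch, CreateRequest) are made up for this example; the two URLs are the ones quoted elsewhere in this thread.

```csharp
using System;
using System.Net;

public static class RefererSketch
{
    // Builds (but does not send) a request that names the referring page.
    // CreateRequest is a hypothetical helper, not part of any library.
    public static HttpWebRequest CreateRequest(string url, string referer)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Referer = referer;                       // sent as the "Referer" header
        request.CookieContainer = new CookieContainer(); // collects any cookies the server sets
        return request;
    }

    public static void Main()
    {
        HttpWebRequest request = CreateRequest(
            "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/RES/103(I)&Lang=E&Area=RESOLUTION",
            "http://www.un.org/depts/dhl/resguide/r1.htm");

        // Show the header that would accompany the request.
        Console.WriteLine(request.Headers["Referer"]);
    }
}
```

Calling GetResponse() on the returned request would then fetch the page; whether the server still behaves this way is not something the sketch can promise.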

The redirected page has two different links in it. One is the page to be redirected to, the other adds a frame to the page with a source. That frame is what sets the cookie. Without that cookie, the page that you are being redirected to never loads.
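To make the "two links" idea concrete, here is a small sketch that pulls both addresses out of a page like that. The sample HTML, the host names, and the helper names are all invented for illustration; the real page's markup may differ, so the patterns would probably need adjusting.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RedirectPageSketch
{
    // Returns the URL named in a <meta ... URL=...> refresh tag, or "" if none is found.
    public static string GetMetaRefreshUrl(string html)
    {
        Match m = Regex.Match(html, "<meta[^>]*?url=(?<url>[^\"'>\\s]+)", RegexOptions.IgnoreCase);
        return m.Groups["url"].Value;
    }

    // Returns the src of the first <frame> or <iframe>, or "" if none is found.
    public static string GetFrameSource(string html)
    {
        Match m = Regex.Match(html, "<i?frame[^>]*?src\\s*=\\s*[\"'](?<src>[^\"']+)[\"']", RegexOptions.IgnoreCase);
        return m.Groups["src"].Value;
    }

    public static void Main()
    {
        // Hypothetical stand-in for the intermediate page described above.
        string sample =
            "<html><head><META HTTP-EQUIV=\"refresh\" CONTENT=\"1; URL=/tmp/1234567.html\"></head>" +
            "<body><iframe src=\"http://cookies.example.invalid/login\"></iframe></body></html>";

        Console.WriteLine(GetMetaRefreshUrl(sample)); // the page to be redirected to
        Console.WriteLine(GetFrameSource(sample));    // the frame that sets the cookie
    }
}
```

The frame address would be requested first, with a CookieContainer attached, so that the cookie exists before the redirect target is fetched.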

I'm curious now if I can write something that will actually go all the way through...

Posted 21-Oct-10 6:34am

William Winner

Updated 21-Oct-10 7:00am

Solution 1

You could download the temporary html, and parse it to get the redirected url.

Posted 21-Oct-10 5:26am

Nish Nishant

Comments

William Winner 21-Oct-10 11:32am

The only problem is that the links on the page he gave us don't go to a temporary URL. If you pull out the HTML of the actual links, I don't even see how they're doing the re-directing. Even the temporary url doesn't provide a link to get to the document...

Nish Nishant 21-Oct-10 11:39am

Interestingly none of those links even work for me. I get a not-authorized page.

William Winner 21-Oct-10 11:45am

It seems to need us to send a cookie, but I can't even get it to download the original cookie...

m4tthew_uk 21-Oct-10 12:00pm

That's the problem: the links in the HTML look like http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/RES/103(I)&Lang=E&Area=RESOLUTION and I get "not authorised" too if I follow the link in any way other than clicking on it. I am not sure what is happening. I can see it redirecting me from the link in the HTML to a temp (seemingly blank) page and then to the file, but I can't get past the page with the above link.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

William Winner · Accepted Answer · 2010-10-21T09:20:00

Ok, so I was able to get it to work. I created a simple form that has a ListBox on it. When the form loads, it goes to the http://www.un.org/depts/dhl/resguide/r1.htm page and pulls out all of the links. Then, when you select a link (assuming you select one of the pdf links) and click the button, it goes through the whole process of redirecting, acquiring cookies, and then outputting the file to a temporary file. Here's the code:

C#

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using System.Net;
using System.IO;

namespace UNDocs
{
    public partial class Form1 : Form
    {
        private const string StartingPage = @"http://www.un.org/depts/dhl/resguide/r1.htm";
        private const string CookieOriginator = @"http://daccess-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234";

        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            //Load all of the links in the Listbox
            string html = GetHTML(StartingPage);

            MatchCollection matches = GetLinks(html);

            foreach (Match match in matches)
            {
                string value = match.Groups["link"].Value;

                listBox1.Items.Add(value);
            }
        }

        private void button1_Click(object sender, EventArgs e)
        {
            //Nothing to download until a link is selected in the list
            if (listBox1.SelectedItem == null)
            {
                return;
            }

            string link = listBox1.SelectedItem.ToString();

            //Get the temporary page to redirect to
            string tempPage = GetURLBase(link) + GetPageToRedirectTo(link, StartingPage);

            //Get the cookies to use
            CookieContainer cookies = GetCookies(CookieOriginator, tempPage);

            //This is the page with the link to the actual page
            string finalPage = GetPageToRedirectTo(tempPage);

            //Get the byte array representing the pdf file
            byte[] pdf = GetBytesFromHTTP(finalPage, cookies);

            //write the file to disk
            WriteFile(@"D:\temp.pdf", pdf);
        }

        public MatchCollection GetLinks(string s)
        {
            Regex regex = new Regex("href=\"(?<link>.*?)\"", RegexOptions.Multiline);
            return regex.Matches(s);
        }

        public string GetHTML(string url)
        {
            return GetHTML(url, "");
        }

        public string GetHTML(string url, string Referer)
        {
            return GetHTML(url, Referer, new CookieContainer());
        }
     
        public string GetHTML(string url, string Referer, CookieContainer cookies)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;
            myRequest.CookieContainer = cookies;

            string pageSource = "";

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    pageSource = reader.ReadToEnd();
                }
            }

            return pageSource;
        }

        public byte[] GetBytesFromHTTP(string url, CookieContainer cookies)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.CookieContainer = cookies;
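            //Advertise gzip/deflate and let the framework decompress the response,
            //so the bytes written to disk are the file itself and not a compressed stream
            myRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;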

            byte[] result = null;
            byte[] buffer = new byte[4096];

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (Stream responseStream = response.GetResponseStream())
                {
                    using (MemoryStream memoryStream = new MemoryStream())
                    {
                        int count = 0;

                        do
                        {
                            count = responseStream.Read(buffer, 0, buffer.Length);
                            memoryStream.Write(buffer, 0, count);
                        } while (count != 0);

                        result = memoryStream.ToArray();
                    }
                }
            }

            return result;
        }

        private string GetURLBase(string url)
        {
            //Scheme and host only, e.g. "http://host"; also accepts https links
            Regex regex = new Regex("(?<base>https?://.*?)/");
            return regex.Match(url).Groups["base"].Value;
        }

        private string GetPageToRedirectTo(string url)
        {
            return GetPageToRedirectTo(url, "");
        }

        private string GetPageToRedirectTo(string url, string Referer)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;

            string pageSource = "";

            using (HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    pageSource = reader.ReadToEnd();
                }
            }

            string urlToRedirectTo = "";

            //Get URL in Meta Tag
            Regex regex = new Regex("<META.*URL=(?<URL>.*)\"");
            urlToRedirectTo = regex.Match(pageSource).Groups["URL"].Value;

            return urlToRedirectTo;
        }

        private CookieContainer GetCookies(string url, string Referer)
        {
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            myRequest.Referer = Referer;
            myRequest.CookieContainer = new CookieContainer();
            myRequest.GetResponse().Close();

            return myRequest.CookieContainer;
        }

        public void WriteFile(string FileName, byte[] fileContents)
        {
            //Creates the file, or overwrites it if it already exists
            File.WriteAllBytes(FileName, fileContents);
        }
    }
}

(that was fun to figure out!)
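The listing above joins GetURLBase(...) with the redirect target by string concatenation, which assumes the target is a root-relative path. As an alternative sketch, not something the original code does, System.Uri can combine the two and also copes with a target that is already absolute. The class name and the /TMP/... path are made up for the example.

```csharp
using System;

public static class UrlResolveSketch
{
    // Resolves a redirect target (relative or absolute) against the page it came from.
    public static string Resolve(string pageUrl, string target)
    {
        return new Uri(new Uri(pageUrl), target).AbsoluteUri;
    }

    public static void Main()
    {
        string page = "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/RES/103(I)&Lang=E&Area=RESOLUTION";

        // Root-relative target (hypothetical path): keeps scheme and host, replaces path and query.
        Console.WriteLine(Resolve(page, "/TMP/1234567.html"));

        // Absolute target: returned as-is, the page URL is ignored.
        Console.WriteLine(Resolve(page, "http://example.invalid/doc.pdf"));
    }
}
```

With this, button1_Click could call Resolve(link, GetPageToRedirectTo(link, StartingPage)) in place of the concatenation.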