See more: C# HTML LINQ
Hi people,

I'm building a news provider, and I need to get the HTML code of a website, save it, and find text in it with a LINQ expression.
I hope some of you can help me with this hard task.

I'm using this code to get the webpage source:

// Requires: using System.IO; using System.Net; using System.Text;
public static string code(string Url)
{
    HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
    myRequest.Method = "GET";

    // The using blocks close the response and the reader even if an exception is thrown
    using (WebResponse myResponse = myRequest.GetResponse())
    using (StreamReader sr = new StreamReader(myResponse.GetResponseStream(), Encoding.UTF8))
    {
        return sr.ReadToEnd();
    }
}


Now I want to find text inside a div in the webpage source, but I don't know how to do it.
Posted 1-Jun-11 0:49am
Jpmcm13379
Edited 1-Jun-11 4:41am
v2
Comments
digital man 1-Jun-11 6:04am
   
What have you already tried? What research have you carried out on your own? No one will provide a complete solution: we will answer specific questions about a specific problem having seen that you have already tried, hard, to find the answer for yourself.
ali yeganeh 19-Jan-13 8:51am
   
StreamReader.ReadToEnd is very slow.
A browser loads the whole website, images and all, faster.
ali yeganeh 19-Jan-13 13:17pm
   
As I said, this call takes 10 seconds:
string result = sr.ReadToEnd();
I have to read 3 pages, and that takes almost 30 seconds.

Solution 2

To get the HTML code from a website, you can use code like this:

// Requires: using System.IO; using System.Net; using System.Text;
string urlAddress = "http://google.com";
string data = null; // declared outside the blocks so it can be used afterwards

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
  if (response.StatusCode == HttpStatusCode.OK)
  {
    // Use the charset from the response header when the server sends one
    Encoding encoding = string.IsNullOrEmpty(response.CharacterSet)
      ? Encoding.UTF8
      : Encoding.GetEncoding(response.CharacterSet);
    using (StreamReader readStream = new StreamReader(response.GetResponseStream(), encoding))
    {
      data = readStream.ReadToEnd();
    }
  }
}

This gives you the HTML code returned by the website. However, finding text in it with LINQ is not that easy.
A regular expression may look like the better tool, but regular expressions do not play well with HTML either.
Even better is to get your news from an RSS feed.
Comments
SAKryukov 1-Jun-11 15:22pm
   
Good code (however, OP is almost there) and advice, my 5.

Please see what I added to complement your solution.
--SA
the_h 19-Aug-13 14:36pm
   
Thanks this is very helpful for me :)
Kim Togo 2-Jun-11 9:30am
   
Thanks SA
Kim Togo 20-Aug-13 2:17am
   
Glad you can use it :-)

Solution 4

In addition to what Kim suggested, I would advise some further steps.

If you use an RSS feed, chances are it is well-formed XML. Parse the XML in one of the following ways and locate the elements you need:

  1. Use the System.Xml.XmlDocument class. It implements the DOM interface; this is the easiest way, and good enough if the document is not too big.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx.
  2. Use the class System.Xml.XmlTextReader; this is the fastest way of reading, especially if you need to skip some data.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.aspx.
  3. Use the class System.Xml.Linq.XDocument; this is the most adequate way, similar to that of XmlDocument, and it supports LINQ to XML programming.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx, http://msdn.microsoft.com/en-us/library/bb387063.aspx.
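To illustrate the third option, here is a minimal sketch of querying an RSS document with LINQ to XML. The class name RssSketch, the method FindTitles, and the sample feed text are made up for this example; a real feed would first be downloaded as a string, and the sketch assumes a plain RSS 2.0 layout with un-namespaced item and title elements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssSketch
{
    // Returns the titles of all <item> elements whose title contains the keyword
    // (case-insensitive). Assumes un-namespaced RSS 2.0 element names.
    public static List<string> FindTitles(string rssXml, string keyword)
    {
        XDocument doc = XDocument.Parse(rssXml);
        return doc.Descendants("item")
                  // The cast yields null when an item has no <title> element
                  .Select(item => (string)item.Element("title"))
                  .Where(title => title != null &&
                                  title.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                  .ToList();
    }

    public static void Main()
    {
        // Hypothetical feed content, for illustration only
        string sample =
            "<rss version=\"2.0\"><channel><title>Example feed</title>" +
            "<item><title>LINQ to XML released</title><link>http://example.com/1</link></item>" +
            "<item><title>Weather report</title><link>http://example.com/2</link></item>" +
            "</channel></rss>";

        foreach (string title in FindTitles(sample, "linq"))
        {
            Console.WriteLine(title);
        }
    }
}
```

XDocument.Parse throws an XmlException if the text is not well-formed XML, so in practice the call would likely be wrapped in a try/catch.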

If for some reason you need to find something in HTML that is not well-formed XML (which would be a pity), try an HTML parser.

For example, check the Majestic-12 Open Source HTML parser: http://www.majestic12.co.uk/projects/html_parser.php.
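For the opposite case, where the page does happen to be well-formed XML (XHTML, for instance), the text of a div can be found with LINQ to XML directly. This is only a sketch under that assumption: DivSketch, GetDivText, and the sample markup are hypothetical names and data, and most real-world HTML will make XDocument.Parse throw, which is when an HTML parser is needed.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DivSketch
{
    // Returns the text inside the first <div> with the given id, or null if none exists.
    // Matching on LocalName ignores any XHTML namespace the document may declare.
    public static string GetDivText(string xhtml, string divId)
    {
        XDocument doc = XDocument.Parse(xhtml); // throws XmlException on malformed input
        XElement div = doc.Descendants()
            .FirstOrDefault(e => e.Name.LocalName == "div" &&
                                 (string)e.Attribute("id") == divId);

        // Value concatenates the text of the element and all its descendants
        return div == null ? null : div.Value;
    }

    public static void Main()
    {
        // Hypothetical well-formed page, for illustration only
        string page =
            "<html><body>" +
            "<div id=\"news\">Breaking: <b>example</b> headline</div>" +
            "<div id=\"ads\">Buy now</div>" +
            "</body></html>";

        Console.WriteLine(GetDivText(page, "news"));
    }
}
```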

—SA
Comments
Kim Togo 2-Jun-11 9:31am
   
Good advice SA. My 5.
SAKryukov 2-Jun-11 13:12pm
   
Thank you, Kim.
--SA

Solution 9

See my WebResourceProvider framework, which was designed to do exactly the task you're doing.

/ravi

Solution 7

If for some reason you need to find something in HTML that is not well-formed XML (which would be a pity), try an HTML parser.

For example, check the Majestic-12 Open Source HTML parser.

Solution 6

string htmlContent = new System.Net.WebClient().DownloadString(theUrl);

or you could use

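System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadStringCompleted += (s, e) =>
{
    // e.Result throws if the download failed, so check e.Error first
    if (e.Error == null)
    {
        string htmlContent = e.Result;
    }
};
// DownloadStringAsync takes a Uri, not a string
wc.DownloadStringAsync(new System.Uri(theUrl));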


http://g4ac.co.za/WebClient

Solution 8

Hi all, friends.
For my final university project I need to make a search engine.
I do not know how to get the title of a web page into a single string variable. Please help me; it is important.
Comments
CHill60 16-Mar-14 14:48pm
   
If you have a question of your own then please post it. Relatively few people will see your post as you have posted it as a solution to an already answered (and old) question

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Web02 | 2.8.160426.1 | Last Updated 17 Nov 2015
Copyright © CodeProject, 1999-2016