Hi people,

I'm building a news provider, and I need to get the HTML code from a website, save it, and find text in it with a LINQ expression.
I hope some of you can help me with this hard task.

I'm using this code to get the web page source:

C#
// Requires: using System.IO; using System.Net; using System.Text;
public static string code(string Url)
{
    HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
    myRequest.Method = "GET";
    // The using blocks close the response and the reader even if reading throws
    using (WebResponse myResponse = myRequest.GetResponse())
    using (StreamReader sr = new StreamReader(myResponse.GetResponseStream(), Encoding.UTF8))
    {
        return sr.ReadToEnd();
    }
}



Now I want to find text in a div of the web page source, but I don't know how to do it.
Updated 1-Jun-11 3:41am
Comments
R. Giskard Reventlov 1-Jun-11 6:04am    
What have you already tried? What research have you carried out on your own? No one will provide a complete solution: we will answer specific questions about a specific problem having seen that you have already tried, hard, to find the answer for yourself.
ali yeganeh 19-Jan-13 8:51am    
StreamReader.ReadToEnd is very slow.
A browser loads the whole website, images included, faster.
ali yeganeh 19-Jan-13 13:17pm    
I mean that this statement takes 10 seconds:
string result = sr.ReadToEnd();
I have to read 3 pages, and that takes almost 30 seconds.

To get the HTML code from a website, you can use code like this.

C#
// Requires: using System.IO; using System.Net; using System.Text;
string urlAddress = "http://google.com";
string data = null; // declared outside the blocks so it can be used afterwards

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
  if (response.StatusCode == HttpStatusCode.OK)
  {
    // Use the charset reported by the server; otherwise fall back to UTF-8
    Encoding encoding = string.IsNullOrEmpty(response.CharacterSet)
      ? Encoding.UTF8
      : Encoding.GetEncoding(response.CharacterSet);
    using (StreamReader readStream = new StreamReader(response.GetResponseStream(), encoding))
    {
      data = readStream.ReadToEnd();
    }
  }
}


This will give you the HTML code returned by the website. But finding text via LINQ is not that easy.
A regular expression might seem like the better tool, but regular expressions do not play well with HTML code.
Even better is to get your news from an RSS feed.
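
If the downloaded page happens to be well-formed XHTML, LINQ to XML can query it directly. The following is only a sketch under that assumption: the class name DivSearch, the method GetDivText, the id "headline" and the sample markup are all made up for illustration, and ordinary real-world HTML (unclosed tags, entities such as &nbsp;) will usually make XDocument.Parse throw.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DivSearch
{
    // Returns the text of the first <div> with the given id, or null if there is none.
    // Assumes the input is well-formed XML; XDocument.Parse throws otherwise.
    public static string GetDivText(string xhtml, string id)
    {
        XDocument doc = XDocument.Parse(xhtml);
        // Compare LocalName so the query also works when the XHTML namespace is declared.
        XElement div = doc.Descendants()
            .FirstOrDefault(e => e.Name.LocalName == "div" && (string)e.Attribute("id") == id);
        return div == null ? null : div.Value.Trim();
    }

    public static void Main()
    {
        // Hypothetical page content; in practice this would be the downloaded string.
        string sample =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<div id=\"headline\">Breaking: <b>news</b> today</div>" +
            "<div id=\"other\">Something else</div>" +
            "</body></html>";

        Console.WriteLine(GetDivText(sample, "headline"));
    }
}
```

XElement.Value concatenates the text of all nested elements, so inline markup inside the div does not get in the way of a text search.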
 
Comments
Sergey Alexandrovich Kryukov 1-Jun-11 15:22pm    
Good code (however, OP is almost there) and advice, my 5.

Please see what I added to complement your solution.
--SA
the_h 19-Aug-13 14:36pm    
Thanks this is very helpful for me :)
Kim Togo 2-Jun-11 9:30am    
Thanks SA
Kim Togo 20-Aug-13 2:17am    
Glad you can use it :-)
In addition to what Kim suggested, I would advise some further steps.

If you use an RSS feed, chances are it is well-formed XML. Parse the XML in one of the following ways and locate the elements you need:


  1. Use the System.Xml.XmlDocument class. It implements the DOM interface; this way is the easiest and is good enough if the document is not too big.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx.
  2. Use the class System.Xml.XmlTextReader; this is the fastest way of reading, especially if you need to skip some data.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.aspx.
  3. Use the class System.Xml.Linq.XDocument; this is the most adequate way, similar to XmlDocument but supporting LINQ to XML programming.
    See http://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.aspx, http://msdn.microsoft.com/en-us/library/bb387063.aspx.
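
As an illustration of the third option, here is a minimal sketch of a LINQ to XML query over an RSS 2.0 feed. It assumes the feed has already been downloaded into a string and uses the standard RSS layout (item elements with a title child, no XML namespace); the class name RssSearch, the method FindTitles and the sample feed are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssSearch
{
    // Returns the titles of all <item> elements whose title contains the keyword
    // (case-insensitive). Items without a <title> are treated as having an empty title.
    public static List<string> FindTitles(string rssXml, string keyword)
    {
        XDocument doc = XDocument.Parse(rssXml);
        return doc.Descendants("item")
                  .Select(item => (string)item.Element("title") ?? "")
                  .Where(title => title.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                  .ToList();
    }

    public static void Main()
    {
        // Hypothetical feed content; in practice this would be the downloaded string.
        string sample =
            "<rss version=\"2.0\"><channel><title>Sample news</title>" +
            "<item><title>LINQ to XML basics</title><link>http://example.com/1</link></item>" +
            "<item><title>Weather report</title><link>http://example.com/2</link></item>" +
            "</channel></rss>";

        foreach (string title in FindTitles(sample, "linq"))
            Console.WriteLine(title);
    }
}
```

The same query shape works for other fields: select item.Element("link") or item.Element("description") instead of the title.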


If for some reason you need to find something in HTML that is not well-formed XML (which would be a pity), try an HTML parser.

For example, check the Majestic-12 open source HTML parser: http://www.majestic12.co.uk/projects/html_parser.php.

—SA
 
Comments
Kim Togo 2-Jun-11 9:31am    
Good advice SA. My 5.
Sergey Alexandrovich Kryukov 2-Jun-11 13:12pm    
Thank you, Kim.
--SA
C#
string htmlContent = new System.Net.WebClient().DownloadString(theUrl);


or you could use

C#
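System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadStringCompleted += (s, e) =>
{
    // e.Result throws if the download failed, so check e.Error first
    if (e.Error == null)
    {
        string htmlContent = e.Result;
    }
};
// DownloadStringAsync accepts only a Uri, so wrap the URL (assuming theUrl is a string)
wc.DownloadStringAsync(new System.Uri(theUrl));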



http://g4ac.co.za/WebClient
 
See my WebResourceProvider framework, which was designed to do the exact task you're doing.

/ravi
 
 
 
Hi all, friends.
For my final university project I need to make a search engine.
I do not know how to get the title of a website's page into a single string variable. Please help me; I think it is important.
 
 
Comments
CHill60 16-Mar-14 14:48pm    
If you have a question of your own then please post it. Relatively few people will see your post as you have posted it as a solution to an already answered (and old) question

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


