I'm having trouble parsing HTML text into a structure.
I want to parse the HTML below into this structure:
C#
struct result
{
  public   int code;
  public string sub;
  public string grade;
}

The resulting assignments should look like this:
C#
result.code = 176;
result.sub = "CHEMISTRY";
result.grade = "A-";
XML
<TR>
<TD bgColor=#fefefe align=middle><STRONG>176</STRONG></TD>
<TD bgColor=#fafafa width="70%" align=left><STRONG>CHEMISTRY</STRONG></TD>
<TD bgColor=#fefefe align=middle><STRONG>A- </STRONG></TD>
</TR>

Thanks to all.

Updated:
What am I trying to do?
I'm just trying to download all the results from a website and save them in a local database. That could be about 12k results for my district alone, not the whole country. I'm very close to finishing with my own code, but if anyone on CP knows a proper or simpler way, that would be a great help. I've already got as far as the GPA; now the subjects need to be parsed.
Please help me out.


HTML
HSC 2010 Result Publication

Roll No.  124450 
Registration No.  719662 
Academic Session  2008-09 
Name  NASRIN LIPI  
Father's Name  MD. MOAZZAM HOSSAIN  
Institute Name  REBATI MOHAN UCHCHA MADHYAMIK BIDYALYA, SIDDHIRGONJ  
Center Name  NARAYANGANJ - 4, GOVT. ADAMJEENAGAR M. W. COLLEGE  
Student Group  SCIENCE  
Student Type  REGULAR  
Result  PASSED 
GPA  5.00 

Subject-wise Grade/ Mark Sheet
Code  Subject  Grade/ Marks  
107 ENGLISH A  
174 PHYSICS A+  
176 CHEMISTRY A+  
178 BIOLOGY A+  
127 MATHEMATICS A+
Updated 22-Mar-11 3:12am
Comments
OriginalGriff 22-Mar-11 4:14am    
What have you tried?
What trouble are you having?
Edit your question and give us better information!
Аslam Iqbal 22-Mar-11 8:52am    
Take a look at the updated question to see what I'm trying.

I know it's not quite fair to answer my own question, but I'm going to do it anyway.
First:
Download all the results from the web.
The query string is:
C#
string url = "http://www.educationboardresults.gov.bd/arch/result.php?roll=" + roll + "&board=dhaka&exam=HSC&year=" + year;

Call it from a loop:
for (i = startfrom; i <= endat; i++)
{
    HSC_WEB hw = new HSC_WEB(i, year); // downloads on a separate thread
    hw.MyEvent += new HSC_WEB.ProgressDelegate(hw_Completed); // fires when the download finishes
    TotalReq += 1;
................
}
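
The HSC_WEB class is not shown in this answer, so the following is only a minimal synchronous sketch of what one download might look like. The names ResultDownloader, BuildUrl and TryDownload are hypothetical; only the URL pattern comes from the answer.

```csharp
using System;
using System.Net;

public static class ResultDownloader
{
    // Builds the query URL shown above; board and exam are fixed in the answer,
    // so they are fixed here as well.
    public static string BuildUrl(int roll, int year)
    {
        return "http://www.educationboardresults.gov.bd/arch/result.php?roll=" + roll
             + "&board=dhaka&exam=HSC&year=" + year;
    }

    // Downloads one result page synchronously. Returns null when the request
    // fails, so that one bad response does not abort a batch of thousands of rolls.
    public static string TryDownload(int roll, int year)
    {
        try
        {
            using (WebClient client = new WebClient())
            {
                return client.DownloadString(BuildUrl(roll, year));
            }
        }
        catch (WebException)
        {
            return null;
        }
    }
}
```

Calling TryDownload from a worker thread per roll number would roughly correspond to the loop above, with the null check taking the place of an error event.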


Second: Parse the HTML text into plain text using a WebBrowser control. That removes all tags and comments.
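
If hosting a WebBrowser control is not convenient, a rough stand-in for this step is to strip the markup with regular expressions. This is a swapped-in technique, not what the answer itself does, and it assumes simple pages like the sample (no script or style blocks). HtmlText and ToPlainText are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Converts simple HTML into a single line of plain text.
    public static string ToPlainText(string html)
    {
        // Remove comments first, because they may contain '>' characters.
        string s = Regex.Replace(html, @"<!--.*?-->", " ", RegexOptions.Singleline);
        // Replace every remaining tag with a space so that cell texts stay separated.
        s = Regex.Replace(s, @"<[^>]+>", " ");
        // Turn entities such as &amp; back into characters.
        s = WebUtility.HtmlDecode(s);
        // Collapse runs of whitespace (including line breaks) into single spaces.
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```

For the table row in the question this returns "176 CHEMISTRY A-". Note that all line breaks are collapsed, so any later step has to search for labels rather than rely on line positions.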
Third: With that plain text (a sample is given in my question), the following class finds everything I wanted.

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace HSC_RES_Downloader
{
    class stringsearch
    {
        public stringsearch( string  BaseString)
        {
            this.BaseString = BaseString;
        }
        public string BaseString;
        public string getyear()
        {
            return getstring(BaseString, "HSC", "", "Result");
        }
        public string getroll()
        {
            return getstring(BaseString, "Roll No.", "", "Registration");
        }
        public string getregno()
        {
            return getstring(BaseString, "Registration No.", "", "Academic Session");
        }
        public string getSession()
        {
            return getstring(BaseString, "Academic Session", "", "Name");
        }
        public string getname()
        {
            return getstring(BaseString, "Academic Session", "Name", "Father's Name");
        }
        public string getfname()
        {
            return getstring(BaseString, "Father's Name", "", "Institute");
        }
        public string getInstitutenane()
        {
            return getstring(BaseString, "Institute Name", "", "Center");
        }
        public string getCenter()
        {
            return getstring(BaseString, "Center Name", "", "Student Group");
        }
        public string getGroup()
        {
            return getstring(BaseString, "Student Group", "", "Student Type");
        }
        public string getsType()
        {
            return getstring(BaseString, "Student Type", "", "Result");
        }
        public string getResult()
        {
            return getstring(BaseString, "Student Type", "Result", "GPA");
        }
        public string getGPA()
        {
            return getstring(BaseString, "GPA", "", "Subject-wise");
        }

        public List<List<string>> subjectsgpa()
        { 
            List<string> sublist;
            
            string substr="BENGALI,ENGLISH,SECRETARIAL MANAGEMENT,COMMERCIAL GEOGRAPHY,"+
                    "STATISTICS,COMPUTER STUDIES,AGRICULTURE STUDIES," +
                    "PRINCIPLE OF BUSINESS,ACCOUNTING,"+
                    "PHYSICS,CHEMISTRY,MATHEMATICS,BIOLOGY,"+
                    "SOCIAL WELFARE,ISLAMIC HISTORY,ISLAMIC STUDIES,CIVICS";
            sublist = substr.Split(',').ToList();          
            
            List<List<string>> subgpas = new List<List<string>>();
            int ps1 = BaseString.IndexOf("Code Subject Grade/ Marks");
            if (ps1 < 0)
                return subgpas; // header not found: no mark sheet in this text
            substr = BaseString.Substring(ps1 + "Code Subject Grade/ Marks".Length);
            string gpa = string.Empty;
            foreach (string SubName in sublist)
            {
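                ps1 = substr.IndexOf(SubName);
                if (ps1 >= 0) // IndexOf returns -1 when not found; 0 is a valid position
                {
                    List<string> subgpa = new List<string>();
                    // The grade is the first whitespace-delimited token after the subject
                    // name, so "A+" and "A-" keep their sign.
                    string rest = substr.Substring(ps1 + SubName.Length).TrimStart();
                    gpa = rest.Split(new char[] { ' ', '\t', '\r', '\n' })[0];
                    subgpa.Add(gpa);
                    subgpa.Add(SubName);
                    subgpas.Add(subgpa);
                }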
            }
            return  subgpas ;
        }

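        public string getstring(string basestr, string str1, string str2, string endstring)
        {
            // Finds str1, then str2 after it, and returns the trimmed text between the
            // end of str2 and the next occurrence of endstring.
            // Returns an empty string (instead of throwing) when any marker is missing.
            int ps1 = basestr.IndexOf(str1);
            if (ps1 < 0) return string.Empty;
            int ps2 = basestr.IndexOf(str2, ps1 + str1.Length);
            if (ps2 < 0) return string.Empty;
            int ps3 = basestr.IndexOf(endstring, ps2 + str2.Length);
            if (ps3 < 0) return string.Empty;
            return basestr.Substring(ps2 + str2.Length, ps3 - ps2 - str2.Length).Trim();
        }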

    }

}
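
One possible modification: the hard-coded subject list in subjectsgpa() could be avoided by matching each "code subject grade" triple with a regular expression. The sketch below assumes three-digit subject codes and letter grades A to F with an optional sign, as in the sample text; numeric marks would need a different grade pattern. SubjectParser is a hypothetical name, and the struct mirrors the one in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Mirrors the struct from the question.
public struct result
{
    public int code;
    public string sub;
    public string grade;
}

public static class SubjectParser
{
    // Group 1: three-digit code, group 2: subject name, group 3: letter grade.
    private static readonly Regex RowPattern =
        new Regex(@"\b(\d{3})\s+(.+?)\s+([A-F][+-]?)(?=\s|$)");

    public static List<result> Parse(string plainText)
    {
        List<result> results = new List<result>();
        const string header = "Grade/ Marks";
        int start = plainText.IndexOf(header, StringComparison.Ordinal);
        if (start < 0)
            return results; // no mark sheet in this text

        // Only look at the text after the table header, so that the roll and
        // registration numbers are never mistaken for subject codes.
        string table = plainText.Substring(start + header.Length);
        foreach (Match m in RowPattern.Matches(table))
        {
            result r = new result();
            r.code = int.Parse(m.Groups[1].Value);
            r.sub = m.Groups[2].Value.Trim();
            r.grade = m.Groups[3].Value;
            results.Add(r);
        }
        return results;
    }
}
```

Because the pattern uses \s+ between the fields, it works whether the rows are on separate lines or flattened into one line with single spaces.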


I still have a lot of things to do. Any improvements would be appreciated.
 
Assuming HTML is valid XML, use System.Xml.XmlReader.

—SA
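
A minimal sketch of this suggestion, assuming the row really is well-formed XML (all attribute values quoted). XmlRowReader and its method names are hypothetical. With unquoted attributes such as bgColor=#fefefe, as in the question's markup, XmlReader throws an XmlException, which the second method turns into a false return value.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlRowReader
{
    // Returns the trimmed content of every text node, in document order.
    // Throws XmlException when the markup is not well-formed XML.
    public static List<string> ReadCellTexts(string xml)
    {
        List<string> texts = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                // Whitespace between elements is reported as a separate node
                // type, so only real cell text ends up in the list.
                if (reader.NodeType == XmlNodeType.Text)
                    texts.Add(reader.Value.Trim());
            }
        }
        return texts;
    }

    // Same as above, but reports malformed markup through the return value.
    public static bool TryReadCellTexts(string xml, out List<string> texts)
    {
        try
        {
            texts = ReadCellTexts(xml);
            return true;
        }
        catch (XmlException)
        {
            texts = null;
            return false;
        }
    }
}
```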
 
Comments
Аslam Iqbal 22-Mar-11 14:55pm    
It raises an error for this tag:
<TD bgColor=#fefefe align=middle>
Sergey Alexandrovich Kryukov 22-Mar-11 17:54pm    
Because the tag is not well-formed; needs quotation marks: bgColor="#fefefe" align="middle".
I'm sorry to say, parsing stuff like that is a lot of pain.
--SA
Sergey Alexandrovich Kryukov 22-Mar-11 17:55pm    
Maybe HtmlAgilityPack will save you from this trouble.
(Based on that, I'll up-vote Pavel's answer.)
--SA
Аslam Iqbal 22-Mar-11 17:57pm    
I'm almost done! I'll post it in a little while.
Each table row can be parsed pretty easily.
(I am assuming that no <td> block is ever empty)
It's not foolproof: it should only be used with a list of <tr> items, and it hasn't been debugged (I haven't got VS here).

C#
// Split on both spellings: String.Split is case-sensitive and the sample markup uses <TR>.
string[] rows = HTML.Split(new string[] { "<tr>", "</tr>", "<TR>", "</TR>" },
                           StringSplitOptions.RemoveEmptyEntries);
List<result> results = new List<result>();
foreach (string row in rows)
{
  //Declaring a few temporary variables.
  string code = string.Empty;
  string sub = string.Empty;
  string grade = string.Empty;
  bool inTag = false;

  for (int i = 0; i < row.Length; i++)
  {
    if (row[i] == '<')
      inTag = true;
    else if (row[i] == '>')
      inTag = false;
    else if (!inTag && !char.IsWhiteSpace(row[i])) //inTag is true while you're between the < and > characters.
    {
      //Get the text from i up to the next occurrence of '<' (or the end of the row).
      int next = row.IndexOf('<', i);
      if (next < 0)
        next = row.Length;
      string text = row.Substring(i, next - i).Trim();

      if (code.Length == 0) //is 'code' already defined?
        code = text;
      else if (sub.Length == 0) //is 'sub' already defined?
        sub = text;
      else if (grade.Length == 0) //is 'grade' already defined?
        grade = text;

      i = next - 1; //prevent doubles: the loop's i++ then lands on the '<'
    }
  }

  int parsedCode;
  if (int.TryParse(code, out parsedCode)) //Last checkup: skips rows without a numeric code
  {
    result r = new result(); //the struct in the question has no constructor
    r.code = parsedCode;
    r.sub = sub;
    r.grade = grade;
    results.Add(r);
  }
}
 
Comments
Sandeep Mewara 22-Mar-11 7:10am    
Good effort. 5!
Аslam Iqbal 22-Mar-11 8:58am    
Thanks. But it's not so easy.
Sergey Alexandrovich Kryukov 22-Mar-11 17:57pm    
Programming is not for people who are afraid of difficulties :-)
So far, Pavel's answer may be the best for you.
--SA
Also try HtmlAgilityPack, which is very useful for HTML parsing and processing.
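
A minimal sketch of how that might look for the rows in the question. It assumes the HtmlAgilityPack NuGet package is referenced and that each result row has exactly three cells with a numeric code in the first one; HapRowParser and ParseRows are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the .NET base class library

public static class HapRowParser
{
    // Returns { code, subject, grade } for every three-cell row whose first
    // cell is a number. Unquoted attributes and upper-case tags are tolerated.
    public static List<string[]> ParseRows(string html)
    {
        List<string[]> rows = new List<string[]>();
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes = doc.DocumentNode.SelectNodes("//tr");
        if (trNodes == null)
            return rows;

        foreach (HtmlNode tr in trNodes)
        {
            HtmlNodeCollection tdNodes = tr.SelectNodes("./td");
            if (tdNodes == null || tdNodes.Count != 3)
                continue;

            string code = HtmlEntity.DeEntitize(tdNodes[0].InnerText).Trim();
            int ignored;
            if (!int.TryParse(code, out ignored))
                continue; // header rows and other non-result rows

            rows.Add(new string[]
            {
                code,
                HtmlEntity.DeEntitize(tdNodes[1].InnerText).Trim(),
                HtmlEntity.DeEntitize(tdNodes[2].InnerText).Trim()
            });
        }
        return rows;
    }
}
```

Each returned array can then be copied into the result struct from the question, with int.Parse applied to the first element.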
 
Comments
#realJSOP 22-Mar-11 8:55am    
I agree. Why reinvent the wheel on this?
Аslam Iqbal 22-Mar-11 9:14am    
Ah, did you see the updated part of my question? Thanks.
Sergey Alexandrovich Kryukov 22-Mar-11 17:56pm    
Pavel, I'm up-voting your answer with my 5 now that it turns out the file is not well-formed XML. This kind of HTML is such a pain...
--SA
Sergey Alexandrovich Kryukov 22-Mar-11 17:58pm    
Aslam, I suggest you formally accept this post as the answer.
--SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)