I'm having trouble parsing this HTML text into a structure.
I want to parse the HTML below into this structure:
C#
struct result
{
  public   int code;
  public string sub;
  public string grade;
}

The assignment will be like this:
C#
result.code=176
result.sub="CHEMISTRY"
result.grade="A-"
XML
<TR>
<TD bgColor=#fefefe align=middle><STRONG>176</STRONG></TD>
<TD bgColor=#fafafa width="70%" align=left><STRONG>CHEMISTRY</STRONG></TD>
<TD bgColor=#fefefe align=middle><STRONG>A- </STRONG></TD>
</TR>

Thanks to all.

Updated:
What am I trying to do?
I'm just trying to download all the results from a website and save them in a local database. There could be about 12k results for my district alone, not the whole country. I'm very close to finishing this with my own code, but if CP can show me a proper or simpler way, that would be a great help. I've already parsed everything up to the GPA; the subjects are what still need parsing.
Do me a favor.


HTML
HSC 2010 Result Publication

Roll No.  124450 
Registration No.  719662 
Academic Session  2008-09 
Name  NASRIN LIPI  
Father's Name  MD. MOAZZAM HOSSAIN  
Institute Name  REBATI MOHAN UCHCHA MADHYAMIK BIDYALYA, SIDDHIRGONJ  
Center Name  NARAYANGANJ - 4, GOVT. ADAMJEENAGAR M. W. COLLEGE  
Student Group  SCIENCE  
Student Type  REGULAR  
Result  PASSED 
GPA  5.00 

Subject-wise Grade/ Mark Sheet
Code  Subject  Grade/ Marks  
107 ENGLISH A  
174 PHYSICS A+  
176 CHEMISTRY A+  
178 BIOLOGY A+  
127 MATHEMATICS A+
Updated 22-Mar-11 3:12am
Comments
OriginalGriff 22-Mar-11 4:14am    
What have you tried?
What trouble are you having?
Edit your question and give us better information!
Аslam Iqbal 22-Mar-11 8:52am    
take a look and see what I'm trying.

Each table row can be parsed pretty easily.
(I am assuming that no <td> block is ever empty)
It's not foolproof, it should only be used on a list of <tr> items, and it's not debugged (I haven't got VS here).

C#
string[] rows = HTML.Split(new string[] { "<TR>", "</TR>", "<tr>", "</tr>" }, StringSplitOptions.RemoveEmptyEntries); //the string[] overload requires a StringSplitOptions argument
List<result> results = new List<result>();
foreach (string row in rows)
{
  //Declaring a few temporary variables.
  string code = string.Empty;
  string sub = string.Empty;
  string grade = string.Empty;
  bool inTag = false;

  for (int i = 0; i < row.Length; i++)
  {
    if (row[i] == '<')
      inTag = true;
    else if (row[i] == '>')
      inTag = false;
    else if (!inTag) //inTag is true when you're between the < and > characters.
    {
      int end = row.IndexOf('<', i); //next occurrence of <
      if (end < 0) end = row.Length; //no more tags in this row
      string text = row.Substring(i, end - i).Trim(); //text from i up to the next <
      i = end - 1; //the loop's i++ then lands on the <, so inTag gets set
      if (text.Length == 0) continue; //only whitespace between tags

      if (code.Length == 0) //is 'code' already defined?
        code = text;
      else if (sub.Length == 0) //is 'sub' already defined?
        sub = text;
      else if (grade.Length == 0) //is 'grade' already defined?
        grade = text;
    }
  }
  int parsedCode;
  if (int.TryParse(code, out parsedCode)) //Last checkup: skips empty rows and header rows
     results.Add(new result { code = parsedCode, sub = sub, grade = grade });
}
 
Comments
Sandeep Mewara 22-Mar-11 7:10am    
Good effort. 5!
Аslam Iqbal 22-Mar-11 8:58am    
Thanks. But it's not so easy.
Sergey Alexandrovich Kryukov 22-Mar-11 17:57pm    
Programming is not for people who are afraid of difficulties :-)
So far, Pavel's answer may be the best for you.
--SA
I know it's not fair to answer my own question, but I'm going to do that.
First:
Download all the results from the web.
The URL with its query string is:
C#
string url = "http://www.educationboardresults.gov.bd/arch/result.php?roll=" + roll + "&board=dhaka&exam=HSC&year=" + year;

Call it from a loop:
for (i = startfrom; i <= endat; i++)
{
                HSC_WEB hw = new HSC_WEB(i, year); // downloads on a separate thread
                hw.MyEvent += new HSC_WEB.ProgressDelegate(hw_Completed); // fires when the download finishes
                TotalReq += 1;
................
}


Second: Parse the HTML text to plain text using a WebBrowser control. It removes all tags and comments.
Third: With that plain text (sample text is given in my question), the following class finds everything I wanted.

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace HSC_RES_Downloader
{
    class stringsearch
    {
        public stringsearch( string  BaseString)
        {
            this.BaseString = BaseString;
        }
        public string BaseString;
        public string getyear()
        {
            return getstring(BaseString, "HSC", "", "Result");
        }
        public string getroll()
        {
            return getstring(BaseString, "Roll No.", "", "Registration");
        }
        public string getregno()
        {
            return getstring(BaseString, "Registration No.", "", "Academic Session");
        }
        public string getSession()
        {
            return getstring(BaseString, "Academic Session", "", "Name");
        }
        public string getname()
        {
            return getstring(BaseString, "Academic Session", "Name", "Father's Name");
        }
        public string getfname()
        {
            return getstring(BaseString, "Father's Name", "", "Institute");
        }
        public string getInstitutenane()
        {
            return getstring(BaseString, "Institute Name", "", "Center");
        }
        public string getCenter()
        {
            return getstring(BaseString, "Center Name", "", "Student Group");
        }
        public string getGroup()
        {
            return getstring(BaseString, "Student Group", "", "Student Type");
        }
        public string getsType()
        {
            return getstring(BaseString, "Student Type", "", "Result");
        }
        public string getResult()
        {
            return getstring(BaseString, "Student Type", "Result", "GPA");
        }
        public string getGPA()
        {
            return getstring(BaseString, "GPA", "", "Subject-wise");
        }

        public List<List<string>> subjectsgpa()
        { 
            List<string> sublist;
            
            string substr="BENGALI,ENGLISH,SECRETARIAL MANAGEMENT,COMMERCIAL GEOGRAPHY,"+
                    "STATISTICS,COMPUTER STUDIES,AGRICULTURE STUDIES," +
                    "PRINCIPLE OF BUSINESS,ACCOUNTING,"+
                    "PHYSICS,CHEMISTRY,MATHEMATICS,BIOLOGY,"+
                    "SOCIAL WELFARE,ISLAMIC HISTORY,ISLAMIC STUDIES,CIVICS";
            sublist = substr.Split(',').ToList();          
            
            int ps1 = 0;
            ps1 = BaseString.IndexOf("Code Subject Grade/ Marks");
            substr = BaseString.Substring(ps1 + "Code Subject Grade/ Marks".Length);
            string gpa=string.Empty ;
            List<List<string>> subgpas = new List<List<string>>();
            foreach (string SubName in sublist)
            {
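                ps1 = substr.IndexOf(SubName);
                if (ps1 >= 0) // IndexOf returns -1 when the subject is absent; 0 is a valid match
                {
                    List<string> subgpa = new List<string>();
                    // Take the whole token after the subject name, so "A+" and "A-" keep their sign.
                    string rest = substr.Substring(ps1 + SubName.Length).TrimStart();
                    int end = rest.IndexOfAny(new char[] { ' ', '\t', '\r', '\n' });
                    gpa = end < 0 ? rest : rest.Substring(0, end);
                    subgpa.Add(gpa.Trim());
                    subgpa.Add( SubName);
                    subgpas.Add(subgpa);
                }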
            }
            return  subgpas ;
        }

        public string getstring(string basestr, string str1, string str2, string endstring)
        {
            int ps1 = 0, ps2 = 0, ps3 = 0;
            ps1 = basestr.IndexOf(str1);
            ps2 = basestr.IndexOf(str2, ps1 + str1.Length);
            ps3 = basestr.IndexOf(endstring, ps2);
            string ss = basestr.Substring(ps2 + str2.Length, ps3 - ps2 - str2.Length);
            ss = ss.Trim();
            return ss;
        }

    }

}


I still have a lot of things to do. Any modification will be appreciated.
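
One possible modification, offered as a sketch rather than a drop-in replacement: getstring throws an ArgumentOutOfRangeException when the end marker is missing from a page, because IndexOf returns -1 and the computed length goes negative. The helper below is hypothetical (the names FieldExtractor and Between are not part of the posted class) and assumes the same plain-text layout as the sample in the question. It returns null instead of throwing when either marker is absent.

```csharp
using System;

static class FieldExtractor
{
    // Hypothetical helper: returns the trimmed text between startMarker and
    // endMarker, or null when either marker cannot be found.
    public static string Between(string text, string startMarker, string endMarker)
    {
        int start = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null;          // start marker missing
        start += startMarker.Length;         // skip past the marker itself

        int end = text.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return null;            // end marker missing

        return text.Substring(start, end - start).Trim();
    }
}
```

With the sample page text, FieldExtractor.Between(page, "Roll No.", "Registration") would yield the roll number, and a page without a GPA line would yield null for the GPA lookup, so a download loop could log and skip it instead of crashing.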
 
 
Also try HtmlAgilityPack, which is very useful for HTML parsing and processing.
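
A minimal sketch of how that might look for the subject rows in the question, assuming the HtmlAgilityPack package is referenced in the project. The class name SubjectRowParser is made up for illustration; the struct is the one from the question. The parser tolerates unquoted attributes such as bgColor=#fefefe, and it exposes element names in lower case, so the XPath uses tr and td even though the page writes TR and TD.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, not part of the .NET Framework

struct result
{
    public int code;
    public string sub;
    public string grade;
}

static class SubjectRowParser // hypothetical name, for illustration only
{
    public static List<result> Parse(string html)
    {
        var results = new List<result>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return results;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 3) continue;

            // Rows whose first cell is not a number (headers, other tables) are skipped.
            int code;
            if (!int.TryParse(cells[0].InnerText.Trim(), out code)) continue;

            results.Add(new result
            {
                code = code,
                sub = HtmlEntity.DeEntitize(cells[1].InnerText).Trim(),
                grade = HtmlEntity.DeEntitize(cells[2].InnerText).Trim()
            });
        }
        return results;
    }
}
```

Because the numeric check on the first cell is the only filter, this assumes the subject table is the only place on the page where a three-cell row starts with a number.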
 
 
Comments
#realJSOP 22-Mar-11 8:55am    
I agree. Why reinvent the wheel on this?
Аslam Iqbal 22-Mar-11 9:14am    
Ah, did you see the updated part of my question? Thanks.
Sergey Alexandrovich Kryukov 22-Mar-11 17:56pm    
Pavel, I'm up-voting your answer with my 5 after it turned out that the file is not well-formed as XML. This is such a pain, this kind of HTML...
--SA
Sergey Alexandrovich Kryukov 22-Mar-11 17:58pm    
Aslam, I suggest you formally accept this post as the answer.
--SA
Assuming HTML is valid XML, use System.Xml.XmlReader.

—SA
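
For reference, a small sketch of the XmlReader approach under that same assumption. It only works when the markup is well-formed XML, which means every attribute value must be quoted; the class and method names here are illustrative, not from the thread. It collects the text nodes of one row in document order.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class XmlRowReader // hypothetical name, for illustration only
{
    // Returns the trimmed text nodes found in a well-formed XML fragment.
    // Throws XmlException on markup such as <TD bgColor=#fefefe>.
    public static List<string> ReadCells(string xml)
    {
        var cells = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                // Whitespace between elements is reported as a separate node
                // type, so only real cell text reaches this branch.
                if (reader.NodeType == XmlNodeType.Text)
                    cells.Add(reader.Value.Trim());
            }
        }
        return cells;
    }
}
```

Fed a row whose attributes are quoted, such as bgColor="#fefefe" align="middle", this yields the code, subject, and grade strings in order.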
 
 
Comments
Аslam Iqbal 22-Mar-11 14:55pm    
It raises error for this tag:
<TD bgColor=#fefefe align=middle>
Sergey Alexandrovich Kryukov 22-Mar-11 17:54pm    
Because the tag is not well-formed; needs quotation marks: bgColor="#fefefe" align="middle".
I'm sorry to say, parsing stuff like that is a lot of pain.
--SA
Sergey Alexandrovich Kryukov 22-Mar-11 17:55pm    
Maybe HtmlAgilityPack will save you from this trouble.
(Based on that, I'll up-vote Pavel's answer.)
--SA
Аslam Iqbal 22-Mar-11 17:57pm    
I'm almost done! I will submit it after a while.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


