Hi All,

I have a project that reads an email (HTML format) and extracts certain information from it, such as reference numbers and amounts.
Once I have retrieved the email, I store it in a char buffer.
The email contains all the HTML tags; see the sample below.

How can I extract the text content without the HTML tags?
For example, given:
<HTML>Hello World</HTML>
I want to extract the 'Hello World' part.

I thought of checking each character and discarding everything between the angle brackets '<' and '>', which would leave only the remaining data.

Is this the most efficient method? We expect high volumes of email.

Thanks in advance.
_____

XML
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head>
<body lang=EN-US link=blue vlink=purple>
<div class=WordSection1><p class=MsoNormal>
<o:p>&nbsp;</o:p></p>
<div align=center>
<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=720 style='width:540.0pt'>
<tr style='height:129.75pt'>
<td style='padding:0cm 0cm 0cm 0cm;height:129.75pt'>
<p class=MsoNormal>
<img width=720 height=173 id="_x0000_i1026" src="cid:image001.jpg@01CB8683.F336E550" alt="Standard Bank"><o:p></o:p></p></td>
</tr><tr><td width=718 style='width:538.5pt;background:#2E77BA;padding:0cm .75pt 0cm .75pt'>
<div align=center><table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=705 style='width:528.75pt'><tr>
<td style='background:white;padding:7.5pt 7.5pt 7.5pt 7.5pt'><p><b>
<span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:navy'><br></span></b>
<strong>
<span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:navy'>Business Online deposit received</span>
</strong><o:p></o:p></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Dear </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;preferredName&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><br>
<br>A deposit has been received for your Standard Bank account number </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;ACC NO&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>.<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>
<o:p>&nbsp;</o:p></span></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>The details are as follows:<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p>
<table class=MsoNormalTable border=0 cellspacing=1 cellpadding=0 width="95%" style='width:95.42%;background:#3D5378'>
<tr><td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Currency<o:p></o:p></span>
</b></p></td><td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Amount<o:p></o:p>
</span></b></p></td><td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Value Date</span></b><o:p></o:p></p>
</td><td width="26%" valign=top style='width:26.52%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>
<b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Reference<o:p></o:p></span></b></p></td>
<td width="24%" style='width:24.56%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Message ID</span></b><o:p></o:p></p></td>
</tr><tr style='height:12.1pt'>
<td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal align=right style='text-align:right'><span style='font-family:"Arial","sans-serif"'>R<o:p>
</o:p></span></p></td>
<td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal><span style='font-family:"Arial","sans-serif"'>2860.00<o:p></o:p></span></p></td>
<td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'><p class=MsoNormal>


Updated 19-Nov-10 4:33am

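This does exactly what you suggest. Note that std::istream_iterator<char> skips whitespace, which would strip the spaces from your sample data, so the version below reads through std::istreambuf_iterator<char> instead.
C++
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>

// Returns true for every character that belongs to a tag, including '<' and '>'.
// The nesting count lives in a static, so this is neither reentrant nor thread-safe.
bool InTag(char c)
{
    static int bracket = 0;
    switch (c)
    {
    case '<':
        ++bracket;
        break;
    case '>':
        if (bracket > 0)   // ignore a stray '>' instead of going negative
            --bracket;
        return true;
    }
    return bracket > 0;
}

int main()
{
    std::ifstream in("Test.htm");
    std::ostringstream oss;
    // istreambuf_iterator reads raw characters, so whitespace is preserved.
    std::remove_copy_if(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>(),
                        std::ostream_iterator<char>(oss), InTag);
    std::cout << oss.str() << std::endl;
    return 0;
}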

cheers,
AR
 
 
One of the fastest and easiest approaches is to implement exactly what you propose, as directly as possible. The rule you defined is to stop copying when a '<' is encountered and to resume after a '>', so that is what you should do :)

Simply scan through the string: when you encounter a '<', stop copying, and when you encounter a '>', start copying again.

The pseudo code would be something like this:
while not end of string {
   while not curchar == '<' and not end of string {
      copy character
      move to next character
   }
   move to next character /* skip '<' character */
   while not curchar == '>' and not end of string {
      move to next character
   }
   move to next character /* skip '>' character */
}
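
The pseudocode above could be sketched in C++ roughly as follows. StripTags is a hypothetical name, not something from the thread, and the sketch assumes the whole email already sits in a std::string. It makes a single pass and allocates once, which is about as cheap as this approach gets. It does not decode entities such as &nbsp; or &lt;, and it does not treat HTML comments specially.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption, not code from the thread):
// copies every character that lies outside a <...> pair.
std::string StripTags(const std::string& html)
{
    std::string text;
    text.reserve(html.size());     // the result can never be longer than the input
    bool inTag = false;
    for (std::size_t i = 0; i < html.size(); ++i)
    {
        const char c = html[i];
        if (c == '<')
            inTag = true;          // stop copying
        else if (c == '>')
            inTag = false;         // resume copying after this character
        else if (!inTag)
            text += c;             // ordinary text outside any tag
    }
    return text;
}
```

Calling StripTags("<HTML>Hello World</HTML>") yields "Hello World". A stray '>' outside a tag is dropped as well, which matches the rule described above.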


Good luck!
 
 
Thanks for the suggestions. I implemented the following:

C++
void CBank::Extract(CString HTML)
{
    char *Buffer;
    int BufferSize = 0;
    char *Temp;
    int StartPos = 0;
    int EndPos = 0;
    int TempSize = 0;
    BufferSize = HTML.GetLength();
    // Zero-terminated copy, so Buffer[i+1] below never reads past the allocation.
    Buffer = new char[BufferSize + 1];
    memset(Buffer, 0, BufferSize + 1);
    memcpy(Buffer, HTML.GetBuffer(), BufferSize);
    for (int i = 0; i < BufferSize; i++)
    {
        if (Buffer[i] == '<')
        {
            // Skip ahead to the '>' that closes this tag.
            i++;
            for (int k = i; k < BufferSize; k++)
            {
                if (Buffer[k] == '>')
                {
                    i = k;
                    break;
                }
            }
        }
        if ((Buffer[i] == '>') && (Buffer[i+1] != '<') && (Buffer[i+1] != 0x0d)) // 0x0d = carriage return
        {
            StartPos = 0;
            EndPos = 0;
            i++; // Buffer[i] was '>', so step to the first character after it
            StartPos = i;
            for (int j = i; j < BufferSize; j++)
            {
                if (Buffer[j] == '<') // Found the start of the next tag
                {
                    i = j;
                    EndPos = j;
                    TempSize = EndPos - StartPos;
                    Temp = new char[TempSize + 1];
                    memset(Temp, 0, TempSize + 1);
                    memcpy(Temp, &Buffer[StartPos], TempSize);
                    // Temp now holds the text between two tags; use it here, before it is freed.
                    delete []Temp;
                    break;
                }
            }
        }
    }
    if (Buffer)
    {
        delete []Buffer;
        Buffer = NULL;
    }
}
 
 
I have used these functions for XML parsing for several years with very impressive performance. Perhaps they will help you also:

BOOL XMLGetNodeValue(LPCTSTR cpXML, LPCTSTR cpTagName, LPTSTR cpBuf)
{
  BOOL bRetval = FALSE;
  LPTSTR cpTagStartB = NULL;
  LPTSTR cpTagStartE = NULL;
  LPTSTR cpTagEndB = NULL;
  *cpBuf = 0;
  if (XMLFindNode(cpXML,cpTagName,&cpTagStartB,&cpTagStartE,&cpTagEndB))
    {
      if (cpTagStartE != cpTagEndB && cpTagEndB)
        {
          int iLen = (cpTagEndB-cpTagStartE)-1;
          memcpy(cpBuf,cpTagStartE+1,iLen*sizeof(TCHAR));
          *(cpBuf+iLen) = 0;
        }
      bRetval = TRUE;
    }
  return bRetval;
}

BOOL XMLFindNode(LPCTSTR cpXML, LPCTSTR cpTagName, TCHAR **cpTagStartB, TCHAR **cpTagStartE, TCHAR **cpTagEndB)
{
  BOOL bRetval = FALSE;
  TCHAR caTag[512];
  _stprintf(caTag,_T("<%s"),cpTagName);
  int iLen = _tcslen(caTag);
  LPTSTR cpStart = _tcsstr(cpXML,caTag);
  if (cpStart)
    {
      if (*(cpStart+iLen) == ' ' || *(cpStart+iLen) == '>')
        {
          if (cpTagStartB)
            *cpTagStartB = cpStart;
          LPTSTR cpEnd = _tcschr(cpStart, '>');
          if (cpTagStartE)
            *cpTagStartE = cpEnd;
          if (cpEnd && *(cpEnd-1) == '/' && cpTagEndB)
            *cpTagEndB = cpEnd;  // single tag, inline closing mark
          else
            {
              if (cpEnd && cpTagEndB)
                {
                  _stprintf(caTag,_T("</%s>"),cpTagName);
                  *cpTagEndB = _tcsstr(cpEnd,caTag);
                }
            }
          bRetval = TRUE;
        }
      else
        return XMLFindNode(cpStart+iLen,cpTagName,cpTagStartB,cpTagStartE,cpTagEndB);
    }
  return bRetval;
}

BOOL XMLFindNodeAttribute(LPCTSTR cpXML, LPCTSTR cpTagName, LPCTSTR cpAttribName, LPTSTR cpAttribValue, TCHAR **cpTagStart /*=NULL*/)
{
  BOOL bRetval = FALSE;
  LPTSTR cpTagStartB = NULL;
  LPTSTR cpTagStartE = NULL;
  bRetval = XMLFindNode(cpXML,cpTagName, &cpTagStartB,&cpTagStartE,NULL);  // closing-tag position not needed here
  if (bRetval)
    {
      TCHAR caBuf[512];
      _stprintf(caBuf,_T("%s="),cpAttribName);
      LPTSTR p = _tcsstr(cpTagStartB+_tcslen(cpTagName),caBuf);
      if (p && p < cpTagStartE)
        {
          p += _tcslen(caBuf);
          if (*p == '\"')
            p++;
          int iCount = 0;
          LPTSTR cpAttribEnd = p;
          while (*cpAttribEnd != '\"' && cpAttribEnd < cpTagStartE)
            {
              *(cpAttribValue+iCount) = *cpAttribEnd;
              cpAttribEnd++;
              iCount++;
            }
          *(cpAttribValue+iCount) = 0;
        }
      if (cpTagStart)
        *cpTagStart = cpTagStartB;
    }
  return bRetval;
}
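
For readers who are not building against the Windows TCHAR routines, the idea behind XMLGetNodeValue can be sketched with std::string. This is only an illustration: GetNodeValue and its parameters are hypothetical names, not part of the functions above. Like the original, it returns true as soon as the opening tag is found and does not handle nested elements of the same name. Unlike the original, it also accepts a '/' directly after the tag name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical portable analogue of XMLGetNodeValue (names are assumptions).
// Returns true if <tag ...> is found; `value` receives the text up to </tag>.
bool GetNodeValue(const std::string& xml, const std::string& tag, std::string& value)
{
    value.clear();
    const std::string open = "<" + tag;
    std::size_t start = xml.find(open);
    while (start != std::string::npos)
    {
        const std::size_t after = start + open.size();
        // Accept only an exact name match: "<b" must not match "<body".
        if (after < xml.size() &&
            (xml[after] == ' ' || xml[after] == '>' || xml[after] == '/'))
            break;
        start = xml.find(open, after);
    }
    if (start == std::string::npos)
        return false;

    const std::size_t openEnd = xml.find('>', start);
    if (openEnd == std::string::npos)
        return false;
    if (xml[openEnd - 1] == '/')
        return true;                       // self-closing element, empty value

    const std::size_t close = xml.find("</" + tag + ">", openEnd);
    if (close == std::string::npos)
        return true;                       // opening tag found, no closing tag
    value = xml.substr(openEnd + 1, close - openEnd - 1);
    return true;
}
```

For the input "<body><b>Amount</b></body>" and the tag "b", the function skips "<body", matches "<b>", and stores "Amount" in value.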
 
 
Use regular expressions for finding and extracting.

There are implementations for C++.
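
As one possible illustration, here is a sketch using std::regex from the C++11 standard library. That choice is an assumption; the answer does not say which implementation it has in mind. StripTagsRegex and FindAmount are hypothetical names. The tag pattern is deliberately simple and would misfire on a '>' inside an attribute value, and regular expressions are likely to be slower than a hand-written scan at high volume.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes anything that looks like <...>.
std::string StripTagsRegex(const std::string& html)
{
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}

// Hypothetical helper: returns the first decimal amount with two
// fractional digits (such as 2860.00), or an empty string if none is found.
std::string FindAmount(const std::string& text)
{
    static const std::regex amount(R"(\d+\.\d{2})");
    std::smatch m;
    return std::regex_search(text, m, amount) ? m.str() : std::string();
}
```

Applied to a cell from the sample email, FindAmount(StripTagsRegex("<td><p>2860.00<o:p></o:p></p></td>")) returns "2860.00".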

Cheers
Uwe
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


