See more: C++
Hi All,
 
I have a project to read an email (HTML format) and extract certain information from it, such as reference numbers, amounts, etc.
Once I have retrieved the email, I store it in a char buffer.
The email contains all the HTML tags, etc. See below.
 
I would like to know how I can extract the HTML data and not the HTML tags.
E.g.:
<HTML>Hello World</HTML>
I want to extract the 'Hello World' part.
 
I thought of comparing each character and, if a character lies between the angle brackets '<' and '>', discarding it, so that I would be left with all the other data.
 
Is this the most efficient method? We expect high volumes of emails.
 
Thanks in advance.
_____
 
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head>
<body lang=EN-US link=blue vlink=purple>
<div class=WordSection1><p class=MsoNormal>
<o:p>&nbsp;</o:p></p>
<div align=center>
<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=720 style='width:540.0pt'>
<tr style='height:129.75pt'>
<td style='padding:0cm 0cm 0cm 0cm;height:129.75pt'>
<p class=MsoNormal>
<img width=720 height=173 id="_x0000_i1026" src="cid:image001.jpg@01CB8683.F336E550" alt="Standard Bank"><o:p></o:p></p></td>
</tr><tr><td width=718 style='width:538.5pt;background:#2E77BA;padding:0cm .75pt 0cm .75pt'>
<div align=center><table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=705 style='width:528.75pt'><tr>
<td style='background:white;padding:7.5pt 7.5pt 7.5pt 7.5pt'><p><b>
<span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:navy'><br></span></b>
<strong>
<span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:navy'>Business Online deposit received</span>
</strong><o:p></o:p></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Dear </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;preferredName&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><br>
<br>A deposit has been received for your Standard Bank account number </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;ACC NO&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>.<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>
<o:p>&nbsp;</o:p></span></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>The details are as follows:<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p>
<table class=MsoNormalTable border=0 cellspacing=1 cellpadding=0 width="95%" style='width:95.42%;background:#3D5378'>
<tr><td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Currency<o:p></o:p></span>
</b></p></td><td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Amount<o:p></o:p>
</span></b></p></td><td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Value Date</span></b><o:p></o:p></p>
</td><td width="26%" valign=top style='width:26.52%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>
<b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Reference<o:p></o:p></span></b></p></td>
<td width="24%" style='width:24.56%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Message ID</span></b><o:p></o:p></p></td>
</tr><tr style='height:12.1pt'>
<td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal align=right style='text-align:right'><span style='font-family:"Arial","sans-serif"'>R<o:p>
</o:p></span></p></td>
<td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal><span style='font-family:"Arial","sans-serif"'>2860.00<o:p></o:p></span></p></td>
<td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'><p class=MsoNormal>
 
[edit]tried to fix the formatting, but something seems amiss[/edit]
[edit2] fixed the formatting [/edit2]
Posted 19-Nov-10 3:47am
Edited 19-Nov-10 5:33am
v5

Solution 3

This does exactly what you suggest, but see the space problem with your sample data :(
bool InTag(char c)
{
    static int bracket = 0;
    switch (c)
    {
    case '<':
        ++bracket;
        break;
    case '>':
        --bracket;
        return true;
    }
    return bracket > 0;
}
 
#include <fstream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <iostream>
int main()
{
    std::ifstream in("Test.htm");
    std::ostringstream oss;
    std::remove_copy_if(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::ostream_iterator<char>(oss), InTag);
    std::cout << oss.str() << std::endl;
    return 0;
}
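 
One likely cause of the space problem mentioned above is that std::istream_iterator<char> reads with operator>>, which skips whitespace by default. Below is a minimal sketch of a variant that keeps spaces by reading through std::istreambuf_iterator instead. The helper name StripTagsKeepSpaces is hypothetical, and the tag state is a local variable rather than a static so the function can be called more than once.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: copies everything outside <...> and keeps whitespace.
std::string StripTagsKeepSpaces(std::istream& in)
{
    std::string out;
    int depth = 0;  // greater than 0 while we are inside a tag
    // istreambuf_iterator reads raw characters, so spaces and newlines survive.
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it)
    {
        const char c = *it;
        if (c == '<')
            ++depth;
        else if (c == '>')
        {
            if (depth > 0)
                --depth;    // end of a tag; a stray '>' is simply dropped
        }
        else if (depth == 0)
            out += c;       // ordinary text outside any tag
    }
    return out;
}
```

Passing a std::ifstream opened on Test.htm (or a std::istringstream for a quick check) should give the text with its spaces intact. Entities such as &nbsp; are left as they are.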
cheers,
AR

Solution 2

One of the fastest and easiest approaches is to implement it in a very straightforward way, if what you propose is what you need. The rules you defined are to stop when a '<' is encountered and to start when a '>' is encountered, so that is what you should do :)
 
Simply scan through the string: when you encounter a '<', stop copying, and when you encounter a '>', start copying again.
 
The pseudo code would be something like this:
while not end of string {
   while not curchar == '<' and not end of string {
      copy character
      move to next character
    }
   move to next character /* skip '<' character */
   while not curchar == '>' and not end of string {
     move to next character
   }
   move to next character /* skip '>' character */
}
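 
The pseudo code above could be sketched in C++ roughly as follows. This is only one way to write it, assuming the email is held in a std::string; the function name StripTags is hypothetical.

```cpp
#include <string>

// Hypothetical helper: follows the copy/skip loop from the pseudo code.
std::string StripTags(const std::string& html)
{
    std::string text;
    text.reserve(html.size());  // the result is never longer than the input
    std::string::size_type pos = 0;
    while (pos < html.size())
    {
        // Copy everything up to the next '<' (or to the end of the string).
        std::string::size_type lt = html.find('<', pos);
        if (lt == std::string::npos)
            lt = html.size();
        text.append(html, pos, lt - pos);
        if (lt == html.size())
            break;
        // Skip to the closing '>'; an unterminated tag swallows the rest.
        std::string::size_type gt = html.find('>', lt + 1);
        if (gt == std::string::npos)
            break;
        pos = gt + 1;
    }
    return text;
}
```

Each character is looked at once, so the cost grows linearly with the size of the email. Called on the example from the question, StripTags("<HTML>Hello World</HTML>") returns "Hello World".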
 
Good luck!

Solution 4

Thanks for the suggestions, I implemented the following.
 

void CBank::Extract(CString HTML)
{
      char      *Buffer;
      int      BufferSize = 0;
      char      *Temp;
      int      StartPos = 0;
      int      EndPos   = 0;
      int      TempSize = 0;
      BufferSize = HTML.GetLength();
      Buffer = new char[BufferSize + 1];
      memset(Buffer,0,BufferSize + 1);
      memcpy(Buffer,HTML.GetBuffer(),BufferSize);
      for (int i=0;i<BufferSize;i++)
      {
            if (Buffer[i] == '<')
            {
                  i++;
                  for (int k = i; k < BufferSize; k++)
                  {
                        if(Buffer[k] == '>')
                        {
                              i = k;
                              break;
                        }
                  }
            }
            if ((Buffer[i] == '>') && (Buffer[i+1] != '<') && (Buffer[i+1] != 0x0d))//Carriage Return
            {
                  StartPos      = 0;
                  EndPos         = 0;
                  i++;            //Buffer[i] == '>' so Buffer[i++] != '>'
                  StartPos      = i;
                  for (int j = i; j < BufferSize; j++)
                  {
                        if (Buffer[j] == '<')//Found the start of a tag
                        {
                              i               = j;
                              EndPos         = j;
                              TempSize      = EndPos - StartPos;
                              Temp = new char[TempSize + 1];
                              memset(Temp,0,TempSize + 1);
                              memcpy(Temp,&Buffer[StartPos],TempSize);
                              delete []Temp;
                              break;
                        }
                  }
            }
      }
      if (Buffer)
      {
            delete []Buffer;
            Buffer = NULL;
      }
}

Solution 5

I have used these functions for XML parsing for several years with very impressive performance. Perhaps they will help you also:
 
BOOL XMLGetNodeValue(LPCTSTR cpXML, LPCTSTR cpTagName, LPTSTR cpBuf)
{
  BOOL bRetval = FALSE;
  LPTSTR cpTagStartB = NULL;
  LPTSTR cpTagStartE = NULL;
  LPTSTR cpTagEndB = NULL;
  *cpBuf = 0;
  if (XMLFindNode(cpXML,cpTagName,&cpTagStartB,&cpTagStartE,&cpTagEndB))
    {
      if (cpTagStartE != cpTagEndB && cpTagEndB)
        {
          int iLen = (cpTagEndB-cpTagStartE)-1;
          memcpy(cpBuf,cpTagStartE+1,iLen*sizeof(TCHAR));
          *(cpBuf+iLen) = 0;
        }
      bRetval = TRUE;
    }
  return bRetval;
}
 
BOOL XMLFindNode(LPCTSTR cpXML, LPCTSTR cpTagName, TCHAR **cpTagStartB, TCHAR **cpTagStartE, TCHAR **cpTagEndB)
{
  BOOL bRetval = FALSE;
  TCHAR caTag[512];
  _stprintf(caTag,_T("<%s"),cpTagName);
  int iLen = _tcslen(caTag);
  LPTSTR cpStart = _tcsstr(cpXML,caTag);
  if (cpStart)
    {
      if (*(cpStart+iLen) == ' ' || *(cpStart+iLen) == '>')
        {
          if (cpTagStartB)
            *cpTagStartB = cpStart;
          LPTSTR cpEnd = _tcschr(cpStart, '>');
          if (cpTagStartE)
            *cpTagStartE = cpEnd;
          if (cpEnd && *(cpEnd-1) == '/' && cpTagEndB)
            *cpTagEndB = cpEnd;  // single tag, inline closing mark
          else
            {
              if (cpEnd && cpTagEndB)
                {
                  _stprintf(caTag,_T("</%s>"),cpTagName);
                  *cpTagEndB = _tcsstr(cpEnd,caTag);
                }
            }
          bRetval = TRUE;
        }
      else
        return XMLFindNode(cpStart+iLen,cpTagName,cpTagStartB,cpTagStartE,cpTagEndB);
    }
  return bRetval;
}
 
BOOL XMLFindNodeAttribute(LPCTSTR cpXML, LPCTSTR cpTagName, LPCTSTR cpAttribName, LPTSTR cpAttribValue, TCHAR **cpTagStart /*=NULL*/)
{
  BOOL bRetval = FALSE;
  LPTSTR cpTagStartB = NULL;
  LPTSTR cpTagStartE = NULL;
  bRetval = XMLFindNode(cpXML,cpTagName, &cpTagStartB,&cpTagStartE);
  if (bRetval)
    {
      TCHAR caBuf[512];
      _stprintf(caBuf,_T("%s="),cpAttribName);
      LPTSTR p = _tcsstr(cpTagStartB+_tcslen(cpTagName),caBuf);
      if (p && p < cpTagStartE)
        {
          p += _tcslen(caBuf);
          if (*p == '\"')
            p++;
          int iCount = 0;
          LPTSTR cpAttribEnd = p;
          while (*cpAttribEnd != '\"' && cpAttribEnd < cpTagStartE)
            {
              *(cpAttribValue+iCount) = *cpAttribEnd;
              cpAttribEnd++;
              iCount++;
            }
          *(cpAttribValue+iCount) = 0;
        }
      if (cpTagStart)
        *cpTagStart = cpTagStartB;
    }
  return bRetval;
}

Solution 1

Use Regular Expressions for finding and extracting.
 
There are implementations for C++.
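 
As a rough illustration of the regular-expression route, here is a small sketch that assumes a compiler with the C++11 <regex> header; the answer above does not say which implementation was meant. The helper names StripTagsRegex and FirstAmount, and the two patterns, are assumptions made for this example only.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes anything that looks like <...>.
std::string StripTagsRegex(const std::string& html)
{
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}

// Hypothetical helper: returns the first number with two decimals, or "".
std::string FirstAmount(const std::string& text)
{
    static const std::regex amount("[0-9]+\\.[0-9]{2}");
    std::smatch m;
    return std::regex_search(text, m, amount) ? m.str() : std::string();
}
```

For example, FirstAmount(StripTagsRegex(html)) would pick out a value such as 2860.00 from the amount cell in the sample email. A regular expression is convenient for pulling out fields, but for high volumes it may well be slower than a single hand-written pass, so it is worth measuring.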
 
Cheers
Uwe

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
