See more: C++
Hi All,

I have a project to read an email (HTML format) and extract certain information from it, such as reference numbers, amounts, etc.
Once I have retrieved the email, I store it in a char buffer.
The email contains all the HTML tags etc. See below.

I would like to know how I can extract the HTML data and not the HTML tags.
E.g.:
<HTML>Hello World</HTML>
I want to extract the 'Hello World' part.

I thought of comparing each character: if a character lies between the angle brackets '<' and '>', I discard it, so that I am left with all the other data.

Is this the most efficient method? We expect high volumes of emails.

Thanks in advance.
_____

<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head>
<body lang=EN-US link=blue vlink=purple>
<div class=WordSection1><p class=MsoNormal>
<o:p>&nbsp;</o:p></p>
<div align=center>
<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=720 style='width:540.0pt'>
<tr style='height:129.75pt'>
<td style='padding:0cm 0cm 0cm 0cm;height:129.75pt'>
<p class=MsoNormal>
<img width=720 height=173 id="_x0000_i1026" src="cid:image001.jpg@01CB8683.F336E550" alt="Standard Bank"><o:p></o:p></p></td>
</tr><tr><td width=718 style='width:538.5pt;background:#2E77BA;padding:0cm .75pt 0cm .75pt'>
<div align=center><table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=705 style='width:528.75pt'><tr>
<td style='background:white;padding:7.5pt 7.5pt 7.5pt 7.5pt'><p><b>
<span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:navy'><br></span></b>
<strong>
<span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:navy'>Business Online deposit received</span>
</strong><o:p></o:p></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Dear </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;preferredName&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><br>
<br>A deposit has been received for your Standard Bank account number </span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#4F81BD'>&lt;&lt;ACC NO&gt;&gt;</span>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>.<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>
<o:p>&nbsp;</o:p></span></p><p class=MsoNormal>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>The details are as follows:<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p>
<table class=MsoNormalTable border=0 cellspacing=1 cellpadding=0 width="95%" style='width:95.42%;background:#3D5378'>
<tr><td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Currency<o:p></o:p></span>
</b></p></td><td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Amount<o:p></o:p>
</span></b></p></td><td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Value Date</span></b><o:p></o:p></p>
</td><td width="26%" valign=top style='width:26.52%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal>
<b><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Reference<o:p></o:p></span></b></p></td>
<td width="24%" style='width:24.56%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt'><p class=MsoNormal><b>
<span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Message ID</span></b><o:p></o:p></p></td>
</tr><tr style='height:12.1pt'>
<td width="10%" valign=top style='width:10.46%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal align=right style='text-align:right'><span style='font-family:"Arial","sans-serif"'>R<o:p>
</o:p></span></p></td>
<td width="21%" valign=top style='width:21.84%;background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'>
<p class=MsoNormal><span style='font-family:"Arial","sans-serif"'>2860.00<o:p></o:p></span></p></td>
<td style='background:white;padding:1.5pt 1.5pt 1.5pt 1.5pt;height:12.1pt'><p class=MsoNormal>

Posted 19-Nov-10 3:47am
JustH1.1K
Edited 19-Nov-10 5:33am
v5

Solution 3

This does exactly what you suggest, but see the space problem with your sample data :(
bool InTag(char c)
{
    static int bracket = 0;
    switch (c)
    {
    case '<':
        ++bracket;
        break;
    case '>':
        --bracket;
        return true;
    }
    return bracket > 0;
}
 
#include <fstream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <iostream>
int main()
{
    std::ifstream in("Test.htm");
    std::ostringstream oss;
    std::remove_copy_if(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::ostream_iterator<char>(oss), InTag);
    std::cout << oss.str() << std::endl;
    return 0;
}
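
A minimal sketch, assuming the space problem above comes from std::istream_iterator<char> skipping whitespace while it reads (formatted extraction does that by default). Reading through std::istreambuf_iterator<char> instead keeps spaces and newlines. The name StripTags and the in-memory overload are illustrative and not part of the code above.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Copies every character that is outside <...> into the result,
// keeping spaces and newlines exactly as they appear in the input.
std::string StripTags(std::istream& in)
{
    std::string text;
    int depth = 0; // greater than zero while we are inside a tag
    // istreambuf_iterator reads raw characters and never skips whitespace.
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it)
    {
        const char c = *it;
        if (c == '<')
            ++depth;
        else if (c == '>')
        {
            if (depth > 0)
                --depth; // a stray '>' outside any tag is simply dropped
        }
        else if (depth == 0)
            text += c;
    }
    return text;
}

// Convenience overload for data already held in memory (e.g. the email buffer).
std::string StripTags(const std::string& html)
{
    std::istringstream in(html);
    return StripTags(in);
}
```

With the sample email this should leave the words separated as in the source, although entities such as &nbsp; and &lt; are still passed through undecoded.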
cheers,
AR

Solution 2

One of the fastest and easiest ways is to implement it in a very straightforward manner, if what you propose is really what you need. The rules you defined are to stop when a '<' is encountered and to start when a '>' is encountered, so that is what you should do :)

Simply scan through the string: when you encounter a '<' you stop copying, and when you encounter a '>' it is time to start copying again.

The pseudo code would be something like this:
while not end of string {
   while not end of string and not curchar == '<' {
      copy character
      move to next character
   }
   if not end of string {
      move to next character /* skip '<' character */
   }
   while not end of string and not curchar == '>' {
      move to next character
   }
   if not end of string {
      move to next character /* skip '>' character */
   }
}
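
The pseudo code above could be written in C++ roughly as follows. This is only a sketch: it assumes the email is already in a std::string, and the function name ExtractText is made up for illustration. It uses std::string::find to jump between brackets instead of testing one character at a time, which is the same scan expressed with library calls.

```cpp
#include <string>

// Returns the text of `html` with everything between '<' and '>' removed.
// ExtractText is a hypothetical helper name, not taken from the thread.
std::string ExtractText(const std::string& html)
{
    std::string text;
    text.reserve(html.size()); // the output is never longer than the input
    std::string::size_type pos = 0;
    while (pos < html.size())
    {
        // Copy characters up to the next '<' (or to the end of the string).
        const std::string::size_type open = html.find('<', pos);
        if (open == std::string::npos)
        {
            text.append(html, pos, std::string::npos);
            break;
        }
        text.append(html, pos, open - pos);

        // Skip ahead to the closing '>' and continue right after it.
        const std::string::size_type close = html.find('>', open + 1);
        if (close == std::string::npos)
            break; // unterminated tag: drop the rest of the input
        pos = close + 1;
    }
    return text;
}
```

Each character is visited a bounded number of times, so the cost grows linearly with the size of the email.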

Good luck!

Solution 4

Thanks for the suggestions. I implemented the following.


void CBank::Extract(CString HTML)
{
    char *Buffer;
    int   BufferSize = 0;
    char *Temp;
    int   StartPos = 0;
    int   EndPos   = 0;
    int   TempSize = 0;
    BufferSize = HTML.GetLength();
    Buffer = new char[BufferSize + 1];
    memset(Buffer,0,BufferSize + 1);
    memcpy(Buffer,HTML.GetBuffer(),BufferSize);
    for (int i=0;i<BufferSize;i++)
    {
        if (Buffer[i] == '<')
        {
            i++;
            for (int k = i; k < BufferSize; k++)
            {
                if(Buffer[k] == '>')
                {
                    i = k;
                    break;
                }
            }
        }
        if ((Buffer[i] == '>') && (Buffer[i+1] != '<') && (Buffer[i+1] != 0x0d))//Carriage Return
        {
            StartPos = 0;
            EndPos   = 0;
            i++;            //Step past '>' to the first character of the text
            StartPos = i;
            for (int j = i; j < BufferSize; j++)
            {
                if (Buffer[j] == '<')//Found the start of a tag
                {
                    i        = j;
                    EndPos   = j;
                    TempSize = EndPos - StartPos;
                    Temp = new char[TempSize + 1];
                    memset(Temp,0,TempSize + 1);
                    memcpy(Temp,&Buffer[StartPos],TempSize);
                    //Temp now holds the text found between two tags
                    delete []Temp;
                    break;
                }
            }
        }
    }
    if (Buffer)
    {
        delete []Buffer;
        Buffer = NULL;
    }
}

Solution 5

I have used these functions for XML parsing for several years with very impressive performance. Perhaps they will help you also:

BOOL XMLGetNodeValue(LPCTSTR cpXML, LPCTSTR cpTagName, LPTSTR cpBuf)
{
  BOOL bRetval = FALSE;
  LPTSTR cpTagStartB = NULL;
  LPTSTR cpTagStartE = NULL;
  LPTSTR cpTagEndB = NULL;
  *cpBuf = 0;
  if (XMLFindNode(cpXML,cpTagName,&cpTagStartB,&cpTagStartE,&cpTagEndB))
    {
      if (cpTagStartE != cpTagEndB && cpTagEndB)
        {
          int iLen = (cpTagEndB-cpTagStartE)-1;
          memcpy(cpBuf,cpTagStartE+1,iLen*sizeof(TCHAR));
          *(cpBuf+iLen) = 0;
        }
      bRetval = TRUE;
    }
  return bRetval;
}
 
BOOL XMLFindNode(LPCTSTR cpXML, LPCTSTR cpTagName, TCHAR **cpTagStartB, TCHAR **cpTagStartE, TCHAR **cpTagEndB)
{
  BOOL bRetval = FALSE;
  TCHAR caTag[512];
  _stprintf(caTag,_T("<%s"),cpTagName);
  int iLen = _tcslen(caTag);
  LPTSTR cpStart = _tcsstr(cpXML,caTag);
  if (cpStart)
    {
      if (*(cpStart+iLen) == ' ' || *(cpStart+iLen) == '>')
        {
          if (cpTagStartB)
            *cpTagStartB = cpStart;
          LPTSTR cpEnd = _tcschr(cpStart, '>');
          if (cpTagStartE)
            *cpTagStartE = cpEnd;
          if (cpEnd && *(cpEnd-1) == '/' && cpTagEndB)
            *cpTagEndB = cpEnd;  // single tag, inline closing mark
          else
            {
              if (cpEnd && cpTagEndB)
                {
                  _stprintf(caTag,_T("</%s>"),cpTagName);
                  *cpTagEndB = _tcsstr(cpEnd,caTag);
                }
            }
          bRetval = TRUE;
        }
      else
        return XMLFindNode(cpStart+iLen,cpTagName,cpTagStartB,cpTagStartE,cpTagEndB);
    }
  return bRetval;
}
 
BOOL XMLFindNodeAttribute(LPCTSTR cpXML, LPCTSTR cpTagName, LPCTSTR cpAttribName, LPTSTR cpAttribValue, TCHAR **cpTagStart /*=NULL*/)
{
  BOOL bRetval = FALSE;
  LPTSTR cpTagStartB = NULL;
  LPTSTR cpTagStartE = NULL;
  bRetval = XMLFindNode(cpXML,cpTagName,&cpTagStartB,&cpTagStartE,NULL); // end-tag position is not needed here
  if (bRetval)
    {
      TCHAR caBuf[512];
      _stprintf(caBuf,_T("%s="),cpAttribName);
      LPTSTR p = _tcsstr(cpTagStartB+_tcslen(cpTagName),caBuf);
      if (p && p < cpTagStartE)
        {
          p += _tcslen(caBuf);
          if (*p == '\"')
            p++;
          int iCount = 0;
          LPTSTR cpAttribEnd = p;
          while (*cpAttribEnd != '\"' && cpAttribEnd < cpTagStartE)
            {
              *(cpAttribValue+iCount) = *cpAttribEnd;
              cpAttribEnd++;
              iCount++;
            }
          *(cpAttribValue+iCount) = 0;
        }
      if (cpTagStart)
        *cpTagStart = cpTagStartB;
    }
  return bRetval;
}
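
For readers without the Windows TCHAR headers, the idea behind XMLGetNodeValue can be sketched portably with std::string. This is not the author's code: GetNodeValue is a hypothetical name, and unlike the original it reports failure when the closing tag is missing. Like the original, it takes the first matching tag and does not handle nested tags of the same name.

```cpp
#include <string>

// Hypothetical portable analogue of XMLGetNodeValue: finds the first <tag ...>
// element in `xml` and stores the text up to the matching </tag> in `value`.
bool GetNodeValue(const std::string& xml, const std::string& tag, std::string& value)
{
    const std::string open = "<" + tag;
    std::string::size_type start = xml.find(open);
    while (start != std::string::npos)
    {
        const std::string::size_type after = start + open.size();
        // Accept only a whole tag name: "<td" must be followed by ' ' or '>'.
        if (after < xml.size() && (xml[after] == ' ' || xml[after] == '>'))
            break;
        start = xml.find(open, after);
    }
    if (start == std::string::npos)
        return false;

    const std::string::size_type openEnd = xml.find('>', start);
    if (openEnd == std::string::npos)
        return false;
    value.clear();
    if (xml[openEnd - 1] == '/')
        return true; // single tag with inline closing mark: empty value

    const std::string::size_type close = xml.find("</" + tag + ">", openEnd);
    if (close == std::string::npos)
        return false;
    value.assign(xml, openEnd + 1, close - openEnd - 1);
    return true;
}
```

A call such as GetNodeValue(email, "strong", heading) would then fetch the contents of the first strong element, markup included if that element has children.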

Solution 1

Use Regular Expressions for finding and extracting.

There are implementations for C++.
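
As a rough illustration of the regular-expression route, here is a sketch that assumes a C++11 compiler with the standard <regex> header (the linked implementations may well be other libraries, such as Boost.Regex, whose interface is similar). The function name StripTagsRegex is made up for this example.

```cpp
#include <regex>
#include <string>

// Removes anything that looks like <...> from the input.
// Assumption: no '>' characters appear inside attribute values.
std::string StripTagsRegex(const std::string& html)
{
    // '<', then any run of characters other than '>', then '>'.
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}
```

A pattern like this is fine for simple, machine-generated markup. For high volumes it is worth measuring against a plain character scan, since a general regex engine may be slower than a hand-written loop.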

Cheers
Uwe
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
