Removing HTML from the text in ASP
By Konstantin Vasserman
Exploring the options of removing HTML tags from the text in ASP.
There are a number of reasons why you as a developer might want to remove HTML tags from text. The most common situation is when you are going to display text on a web page and that text was submitted by an unknown user or came from some other source that you have no control over. You have no idea what the text contains: it could include a damaging script or HTML formatting that will completely mess up the look of your site. It could be that you just don't want any HTML tags in the text because of some application restriction. You might want to limit the use of HTML to a few simple text formatting tags, but prevent users from adding links and inserting images. Whether you have a good reason for that or you just want HTML out of your text because you are a member of the "HTML Hatred Club" - you have to find a way to get those tags out of the text. This article looks at the options you have when it comes to removing HTML tags from text in ASP.
The first and probably the easiest solution is to disable the HTML tags in the text without removing them. You can do this with the Replace() function. For example, if you want to disable all the SCRIPT tags you could do this:
strText = Replace(strText, "<script", "&lt;script", 1, -1, 1)
or, to make sure that all HTML tags are disabled:
strText = Replace(strText, "<", "&lt;")
No opening brackets - no valid HTML tags - no problem. Right?
It is a good (quick) security measure to prevent people from embedding damaging client-side scripts within the text they submit, but it is hardly a user-friendly feature.
The problem with this approach is that all the HTML tags are now displayed along with the rest of the text, which makes it very hard to read. It's kind of like displaying the HTML source to the user - not a very nice thing to do.
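To see what "disabled" text looks like to a reader, here is a minimal JavaScript sketch of the same idea. The helper name escapeHtml is hypothetical, and it goes a little further than the Replace() call above by also escaping "&", ">" and the double quote. In ASP itself the built-in Server.HTMLEncode method does a comparable job.

```javascript
// Hypothetical helper (not from the article): neutralise markup by escaping
// the characters HTML treats specially. "&" is escaped first so that the
// entities produced by the later replacements are not escaped a second time.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

const submitted = 'Hi <b>there</b><script>alert("x")</script>';

// The browser will now show the tags literally instead of interpreting them.
console.log(escapeHtml(submitted));
```

The output is safe to display, but every tag the user typed is still visible in it, which is exactly the readability problem described above.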
How do we make HTML tags disappear from the text? Well, we can just remove them. We can take everything between the opening bracket "<" and the closing bracket ">" of each HTML tag and remove it. It sounds easy ...
Well, it is easier said than done, especially in VBScript. :-)
People who code in Perl or JavaScript can actually tell you that it is a piece of cake. They are absolutely right. For example, a JavaScript function that removes everything between the brackets could look like this:
function RemoveHTML( strText )
{
    var regEx = /<[^>]*>/g;
    return strText.replace(regEx, "");
}
For those of you who don't know what "/<[^>]*>/g" means - it's called a Regular Expression. "Regular expressions are patterns used to match character combinations in strings." You can learn more about them by following this link: http://developer.netscape.com/docs/manuals/js/client/jsguide/regexp.htm.
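For readers new to regular expressions, the pattern can be read piece by piece. The following JavaScript sketch uses a made-up sample string and hypothetical variable names; it only illustrates what the pattern matches.

```javascript
// The pattern from the article, piece by piece:
//   <       a literal opening bracket
//   [^>]*   zero or more characters that are NOT a closing bracket
//   >       a literal closing bracket
//   g       flag: find every match, not just the first one
const tagPattern = /<[^>]*>/g;

const sample = '<p class="x">Hello, <b>world</b>!</p>';

// Every substring the pattern matches, in order of appearance.
const matches = sample.match(tagPattern);
console.log(matches);

// Replacing each match with an empty string leaves only the text.
console.log(sample.replace(tagPattern, ""));
```

Here the pattern finds four tags, and removing them leaves "Hello, world!".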
Back in the VBScript world, those of us who run Scripting Engine 5.0 or later (you can check your version by calling the ScriptEngineMajorVersion and ScriptEngineMinorVersion functions) can use the RegExp object as well. The RemoveHTML function could look like this:
Function RemoveHTML( strText )
    Dim RegEx
    Set RegEx = New RegExp
    RegEx.Pattern = "<[^>]*>"
    RegEx.Global = True
    RemoveHTML = RegEx.Replace(strText, "")
End Function
It doesn't look too complicated, does it? Provided that you know how to build those patterns ... ;-)
For the rest of the VBScript people (who have an older Scripting Engine or don't want to mess with Regular Expressions), writing your own little parser is the way to go. Below is an example of such a function. My friend Chris Coursey and I used this function in one of our projects a couple of years ago:
Function RemoveHTML( strText )
    Dim nPos1
    Dim nPos2
    nPos1 = InStr(strText, "<")
    Do While nPos1 > 0
        nPos2 = InStr(nPos1 + 1, strText, ">")
        If nPos2 > 0 Then
            strText = Left(strText, nPos1 - 1) & Mid(strText, nPos2 + 1)
        Else
            Exit Do
        End If
        nPos1 = InStr(strText, "<")
    Loop
    RemoveHTML = strText
End Function
While all of the above solutions work and do exactly what they were meant to do (remove everything between the brackets), there are at least a couple of problems with this approach:
First of all, because these functions only take the bracket characters into account, any brackets within the body of the text that were never meant to be HTML tags will be removed. They will be removed together with any text that happens to be within those brackets. In other words, any attempt by a user to include "<" or ">" characters in the text might cause these functions to produce unpredictable and at times very ugly results.
On the other hand, these functions remove all the HTML tags unconditionally. You cannot control which tags are removed and which are kept untouched. That is a problem when you want to let your users enter some harmless HTML tags like "<b>" and "<i>", but remove the other tags.
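Both problems are easy to reproduce. The JavaScript sketch below uses the same bracket pattern as the functions above; removeHtml and the sample strings are stand-ins made up for this illustration.

```javascript
// Same technique as above: drop everything between "<" and ">".
function removeHtml(text) {
  return text.replace(/<[^>]*>/g, "");
}

// Problem 1: brackets that were never meant to be tags swallow real text.
const comparison = "if a < b and c > d then stop";
console.log(removeHtml(comparison)); // "if a  d then stop"

// Problem 2: harmless formatting is removed along with everything else.
const formatted = "<b>bold</b> and <a href='x'>link</a>";
console.log(removeHtml(formatted)); // "bold and link"
```

In the first case the comparison "b and c" silently vanishes; in the second the bold tags are lost even though only the link was unwanted.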
The only way to overcome both of the previously discussed problems is to make your code aware of the specific HTML tags that you want removed. I am currently unaware of any third-party ASP components that would do the job for you, but they might very well be out there. I did, however, attempt to write one myself based on the MSHTML library, and I've seen that somebody has used Internet Explorer's Application object to strip HTML tags. Both of these solutions seemed to work, but with the IE solution you will most likely take a huge performance hit, and neither of them seems to be a very safe thing to do according to the MSKB:
"It may be desirable to parse HTML files inside a Web server process in response to a browser page request. However, the WebBrowser control, DHTML Editing Control, MSHTML, and other Internet Explorer components may not function properly in an Active Server Pages (ASP) page or other application run in a Web server application." (http://support.microsoft.com/support/kb/articles/Q244/0/85.ASP?LN=EN-US&SD=gn&FR=0)
In other words - think twice before using any IE components on the server side.
Having explored all of the above options, I took on the challenge of writing an ASP function in VBScript that would be intelligent enough to remove only known HTML tags and at the same time would give the developer the ability to control which tags to remove. The following is the result of this attempt.
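A few words about the function: TAGLIST holds the names of the tags to be removed; anything in brackets whose name is not in this list is left untouched, so you can keep a tag simply by deleting it from the list. BLOCKTAGLIST holds the tags (SCRIPT and STYLE, for example) that are removed together with everything up to their closing tag. Tag names are matched case-insensitively.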
Usage of the function is simple:
strPlainText = RemoveHTML(strTextWithHTML)
And here is the function:
Function RemoveHTML( strText )
    ' Tags that will be removed. Edit this list to control which tags are stripped.
    Dim TAGLIST
    TAGLIST = ";!--;!DOCTYPE;A;ACRONYM;ADDRESS;APPLET;AREA;B;BASE;BASEFONT;" & _
              "BGSOUND;BIG;BLOCKQUOTE;BODY;BR;BUTTON;CAPTION;CENTER;CITE;CODE;" & _
              "COL;COLGROUP;COMMENT;DD;DEL;DFN;DIR;DIV;DL;DT;EM;EMBED;FIELDSET;" & _
              "FONT;FORM;FRAME;FRAMESET;HEAD;H1;H2;H3;H4;H5;H6;HR;HTML;I;IFRAME;IMG;" & _
              "INPUT;INS;ISINDEX;KBD;LABEL;LAYER;LEGEND;LI;LINK;LISTING;MAP;MARQUEE;" & _
              "MENU;META;NOBR;NOFRAMES;NOSCRIPT;OBJECT;OL;OPTION;P;PARAM;PLAINTEXT;" & _
              "PRE;Q;S;SAMP;SCRIPT;SELECT;SMALL;SPAN;STRIKE;STRONG;STYLE;SUB;SUP;" & _
              "TABLE;TBODY;TD;TEXTAREA;TFOOT;TH;THEAD;TITLE;TR;TT;U;UL;VAR;WBR;XMP;"

    ' Tags that are removed together with everything up to their closing tag.
    Const BLOCKTAGLIST = ";APPLET;EMBED;FRAMESET;HEAD;NOFRAMES;NOSCRIPT;OBJECT;SCRIPT;STYLE;"

    Dim nPos1
    Dim nPos2
    Dim nPos3
    Dim strResult
    Dim strTagName
    Dim bRemove
    Dim bSearchForBlock

    nPos1 = InStr(strText, "<")
    Do While nPos1 > 0
        nPos2 = InStr(nPos1 + 1, strText, ">")
        If nPos2 > 0 Then
            ' Extract the tag name: the text after "<" up to the first space.
            strTagName = Mid(strText, nPos1 + 1, nPos2 - nPos1 - 1)
            strTagName = Replace(Replace(strTagName, vbCr, " "), vbLf, " ")

            nPos3 = InStr(strTagName, " ")
            If nPos3 > 0 Then
                strTagName = Left(strTagName, nPos3 - 1)
            End If

            ' A closing tag never starts a block search.
            If Left(strTagName, 1) = "/" Then
                strTagName = Mid(strTagName, 2)
                bSearchForBlock = False
            Else
                bSearchForBlock = True
            End If

            If InStr(1, TAGLIST, ";" & strTagName & ";", vbTextCompare) > 0 Then
                bRemove = True
                If bSearchForBlock Then
                    If InStr(1, BLOCKTAGLIST, ";" & strTagName & ";", vbTextCompare) > 0 Then
                        ' Remove up to the end of the closing tag,
                        ' or to the end of the text if there is none.
                        nPos2 = Len(strText)
                        nPos3 = InStr(nPos1 + 1, strText, "</" & strTagName, vbTextCompare)
                        If nPos3 > 0 Then
                            nPos3 = InStr(nPos3 + 1, strText, ">")
                        End If

                        If nPos3 > 0 Then
                            nPos2 = nPos3
                        End If
                    End If
                End If
            Else
                bRemove = False
            End If

            If bRemove Then
                ' Keep the text before the tag and skip the tag itself.
                strResult = strResult & Left(strText, nPos1 - 1)
                strText = Mid(strText, nPos2 + 1)
            Else
                ' Not a known tag: keep the "<" and continue after it.
                strResult = strResult & Left(strText, nPos1)
                strText = Mid(strText, nPos1 + 1)
            End If
        Else
            ' No closing bracket: keep the rest of the text as is.
            strResult = strResult & strText
            strText = ""
        End If

        nPos1 = InStr(strText, "<")
    Loop
    strResult = strResult & strText

    RemoveHTML = strResult
End Function
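For experimenting with the idea outside ASP, here is a simplified JavaScript sketch of the same tag-aware approach. It is not a line-for-line port of the function above: the names REMOVE_TAGS, BLOCK_TAGS and removeSelectedHtml are hypothetical, and the tag lists are deliberately short so that "<b>" and "<i>" survive.

```javascript
// Hypothetical, shortened lists (lowercase); extend them as needed.
const REMOVE_TAGS = new Set(["a", "img", "iframe", "script", "style"]);
// Tags whose content is removed together with the tag itself.
const BLOCK_TAGS = new Set(["script", "style"]);

function removeSelectedHtml(text) {
  let result = "";
  let rest = text;
  let open = rest.indexOf("<");
  while (open !== -1) {
    const close = rest.indexOf(">", open + 1);
    if (close === -1) break; // no closing bracket: keep the rest as is

    // Tag name: text after "<" up to the first whitespace, minus a leading "/".
    let name = rest.slice(open + 1, close).split(/\s/)[0];
    const isClosing = name.startsWith("/");
    if (isClosing) name = name.slice(1);
    name = name.toLowerCase();

    if (REMOVE_TAGS.has(name)) {
      let cut = close + 1; // index just past the part being removed
      if (!isClosing && BLOCK_TAGS.has(name)) {
        // Skip to the end of the closing tag, or to the end of the text.
        const tail = rest.slice(cut);
        const endTag = tail.search(new RegExp("</" + name, "i"));
        const endClose = endTag === -1 ? -1 : tail.indexOf(">", endTag);
        cut = endClose === -1 ? rest.length : cut + endClose + 1;
      }
      result += rest.slice(0, open);
      rest = rest.slice(cut);
    } else {
      // Not a listed tag: keep the "<" and continue scanning after it.
      result += rest.slice(0, open + 1);
      rest = rest.slice(open + 1);
    }
    open = rest.indexOf("<");
  }
  return result + rest;
}

const input = 'Hi <b>bold</b> <a href="x">link</a><script>alert(1)</script> 1 < 2';
console.log(removeSelectedHtml(input)); // "Hi <b>bold</b> link 1 < 2"
```

With these lists the bold tags and the stray "<" are kept, the link tags are dropped, and the script block disappears together with its content.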
Last Updated: 27 Sep 2000. Editor: Chris Maunder.
Copyright 2000 by Konstantin Vasserman. Everything else Copyright © CodeProject, 1999-2009.