
Wrap your HTML parser to exclude scripting

QUIETTA, 25 Dec 2003
Processing complex HTML pages requires excluding certain sections or content.

Introduction

Most parser-enabled Internet applications need to exclude scripts. This wrapper keeps script elements out of the parser's tests and guards against possible script tainting. After the file is read, it is split into an array for line-by-line processing. If you are trying to suppress anomalies caused by IE, clear line 2 of a saved document (Elements(1) in the zero-based array) to keep IE from reasserting the original document object model. It WILL do that on fresh documents. Clearing the line forces it to create a new model. Note that this is done in preparation for subsequent browser navigation, NOT for this parsing session.

    ' Elements, i, in_script, s, Page and myscriptsfolder are assumed
    ' to be declared elsewhere; s holds the full text of the saved page.
    Dim loc As Long, z As Long
    Elements = Split(s, vbCrLf)

    ' Clear line 2 of the saved document (the array is zero-based).
    Elements(1) = ""
    in_script = False

    For i = 2 To UBound(Elements)
        z = 1
        If in_script = False Then
            loc = InStr(z, UCase(Elements(i)), "<SCRIPT ", vbBinaryCompare)
            If loc > 0 Then
                ' Point references to the saved "_files/" folder at the scripts folder.
                Elements(i) = Replace(Elements(i), _
                    Page & "_files/", myscriptsfolder)
                If InStr(z, UCase(Elements(i)), _
                    "</SCRIPT>", vbBinaryCompare) > 0 Then
                    ' The script opens and closes on this line.
                    in_script = False
                    z = loc + 8   ' resume searching just past "<SCRIPT "
                Else
                    ' The script continues on the following lines.
                    in_script = True
                End If
            End If
                      
            '/////////////////////////////////////////////////

            ' ADD MORE PARSER METHODS HERE

            ' Base tag method: calls InsertBaseElement when the
            ' document has no BASE element yet.
            loc = InStr(z, UCase(Elements(i)), "<HEAD>", vbBinaryCompare)
            If loc > 0 Then
                If objDocument.getElementsByTagName("BASE").length = 0 Then
                    Elements(i) = InsertBaseElement(Elements(i), loc)
                Else
                    ' As published: s is also the name of the source
                    ' text that was split into Elements above.
                    Elements(i) = Replace(Elements(i), s, ARCRoot)
                End If
            End If

            '/////////////////////////////////////////////////

            DoEvents

            '/////////////////////////////////////////////////

            ' This code can be modified to suit special requirements.
            ' It is useful for chopping off a document with dynamic
            ' footer content written by script methods. (Coders may be
            ' trying to make it difficult to get a clean archive
            ' document from their service.) This code attempts to
            ' clean up the non-compliant HTML footer. Note that it also
            ' fires for a script that opens and closes on one line.
           
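            loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
            If loc > 0 Then
                in_script = False
                ' Guard against running past the end of the array.
                If i < UBound(Elements) Then
                    i = i + 1
                    Elements(i) = "</BODY></HTML>"
                    i = i + 1
                    ' Blank every remaining line, including the last.
                    Do While i <= UBound(Elements)
                        Elements(i) = ""
                        i = i + 1
                    Loop
                End If
            End If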
        Else
            'in_script = true so look for endtag
            loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
            If loc > 0 Then
                in_script = False
            End If
        End If

    Next
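
The listing above calls InsertBaseElement but does not show it. A minimal sketch of what such a method might look like follows. The body, and the assumption that ARCRoot holds the base URL of the archive, are guesses for illustration and not the author's implementation.

```vb
Option Explicit

' Assumption: ARCRoot holds the base URL to write into the BASE element.
' The article uses the name but never shows its declaration or value.
Private ARCRoot As String

' Hypothetical body for the InsertBaseElement call in the listing.
' lineText is the line that contains "<HEAD>"; headPos is the 1-based
' position of that tag as returned by InStr.
Public Function InsertBaseElement(ByVal lineText As String, _
                                  ByVal headPos As Long) As String
    Dim insertAt As Long

    ' First character after the 6-character "<HEAD>" tag.
    insertAt = headPos + Len("<HEAD>")

    ' Re-assemble the line with a BASE element right after <HEAD>.
    InsertBaseElement = Left$(lineText, insertAt - 1) & _
                        "<BASE href=""" & ARCRoot & """>" & _
                        Mid$(lineText, insertAt)
End Function
```

The function only builds a new string, so the caller still has to assign the result back to Elements(i), as the listing does.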

Using the Code

Insert your own methods to replace links or image tags, insert a table or footer, and so on. Leaving this wrapper intact protects the script sections and keeps your parser methods from misbehaving. I added the z value so that the parser can process HTML that follows a </SCRIPT> tag on the same line (as can happen with NYTimes pages).
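
One way to write such a method so that it respects z is sketched below. ReplaceFrom is a hypothetical helper, not part of the original wrapper. It works around the fact that VB6's Replace function, when given a start argument, returns only the part of the string that begins at that position, so the untouched left part has to be re-attached by hand.

```vb
' Hypothetical helper, not part of the original wrapper.
' Replaces findText with newText only from position startPos (1-based)
' onward and leaves everything before startPos untouched.
Public Function ReplaceFrom(ByVal lineText As String, _
                            ByVal startPos As Long, _
                            ByVal findText As String, _
                            ByVal newText As String) As String
    If startPos < 1 Then startPos = 1

    If startPos > Len(lineText) Then
        ' Nothing lies at or after startPos, so return the line unchanged.
        ReplaceFrom = lineText
    Else
        ' Keep the left part as is and substitute only in the remainder.
        ReplaceFrom = Left$(lineText, startPos - 1) & _
                      Replace(Mid$(lineText, startPos), findText, newText)
    End If
End Function
```

Inside the wrapper, a call such as Elements(i) = ReplaceFrom(Elements(i), z, "old/", "new/") would then leave everything before z alone. The folder names in that call are placeholders.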

I need to apologize for not providing a working demonstration. It's difficult to put out a useful demonstration at this time without disclosing too much about the BOWSER parse method. The wrapper is used in my BOWSER demonstration.

Interesting Points

It seems that providers are using complex structures to prevent commercial-quality archiving of their content. I have no problem handling the content of the average HTML website, but the NYT, with its dynamic content insertions, plays havoc with the parser, using techniques that cause it to either skip content or otherwise misbehave. Presently I'm adding code to process HTML found after the </SCRIPT> tags, in what is like a cat-and-mouse game. The more sophisticated the parser becomes, the easier it will be to break.
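
A minimal sketch of that step, assuming a hypothetical helper named AfterScriptEnd (not in the original wrapper): it returns the position just past the first </SCRIPT> tag on a line, so that z can be moved there and later parser methods see only the HTML that follows the script.

```vb
' Hypothetical helper, not part of the original wrapper.
' Returns the 1-based position of the first character after the first
' "</SCRIPT>" found at or after startPos, or 0 when there is none.
Public Function AfterScriptEnd(ByVal lineText As String, _
                               ByVal startPos As Long) As Long
    Const END_TAG As String = "</SCRIPT>"
    Dim hit As Long

    If startPos < 1 Then startPos = 1

    ' Upper-case a copy so the search is case-insensitive, as in the wrapper.
    hit = InStr(startPos, UCase$(lineText), END_TAG, vbBinaryCompare)
    If hit > 0 Then
        AfterScriptEnd = hit + Len(END_TAG)
    Else
        AfterScriptEnd = 0
    End If
End Function
```

When the closing tag is the last thing on the line, the result is one past the end of the string, so a caller would need to check it against Len(Elements(i)) before searching from that position.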

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

QUIETTA

United States
Hubris is like armor. He is afraid to take on a Coder project for what it may do to him, both financially and to his health.
 
WEB Bowser is a hack that makes it easy to acquire internet content for archival. It's also possible to PROXY it, serving it to the internet as well. Given this capability, people will change the internet again.
 
I envision a programmable server that will acquire content for proxy. Users will then add proxy content to their own, making it available almost as if it were their own. We are already "information collectors." It's the opinions that are getting really hard to find.
 
Last Updated 26 Dec 2003
Article Copyright 2003 by QUIETTA