
Convert any URL to a MHTML archive using native .NET code

By wumpus1, 3 Apr 2005
 

Sample Image - MhtBuilder.gif

Introduction

If you've ever used the File | Save As... menu in Internet Explorer, you might have noticed a few interesting options IE provides under the Save As Type drop-down box:

Screenshot - Internet Explorer Save As menu

The options provided are:

  • Web Page, complete
  • Web Archive, single file
  • Web Page, HTML only
  • Text File

Most of these are self-explanatory, with the exception of the Web Archive (MHTML) format. What's neat about this format is that it bundles the web page and all of its references into a single compact .MHT file. It's a lot easier to distribute a single self-contained file than it is to distribute an HTML file with a subfolder full of the image/CSS/Flash/XML files referenced by that HTML file. In our case, we were generating HTML reports and we needed to check these reports into a document management system which expects a single file. The MHTML (*.mht) format solves this problem beautifully!

This project contains the MhtBuilder class, a 100% .NET managed code solution which can auto-generate an MHT file from a target URL in one line of code. As a bonus, it will also generate all the other formats listed above. And it's completely free, unlike some commercial solutions you might find out there.

Background

I know people assume the worst of Microsoft, but the MHTML format is actually based on RFC 2557, "MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)": a web archive is a compliant multipart MIME message. So it's an actual Internet standard! Web Archive, a.k.a. MHTML, is a remarkably simple plain text format which looks a lot like (and is in fact almost identical to) an email. Here's the header of the MHT file you are viewing at the top of the page:

Screenshot - Mht file header

To generate an MHTML file, we simply merge together all of the files referenced in the HTML. In the screenshot above, the red line marks the first content block; there will be one content block for each file. We need to follow a few rules, though:

  • Use Quoted-Printable encoding for the text formats.
  • Use Base64 encoding for the binary formats.
  • Make sure the Content-Location has the correct absolute URL for each reference.
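
As a rough illustration of these three rules, here is a minimal sketch (not the MhtBuilder source) that assembles a two-part archive by hand. The module name, the helpers ToQuotedPrintable, BuildTextPart and BuildBinaryPart, the boundary string, and the example.com URLs are all made up for this example. It assumes UTF-8 text and VB 2008 or later.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module MhtPartSketch

    ' Quoted-Printable (hypothetical helper): printable ASCII passes through,
    ' everything else becomes =XX, and soft breaks keep lines within 76 chars.
    Function ToQuotedPrintable(ByVal text As String) As String
        ' Normalize line endings to LF; each LF is written back out as CRLF.
        Dim normalized As String = text.Replace(vbCrLf, vbLf).Replace(vbCr, vbLf)
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(normalized)
        Dim sb As New StringBuilder()
        Dim lineLength As Integer = 0

        For i As Integer = 0 To bytes.Length - 1
            Dim b As Byte = bytes(i)
            If b = 10 Then
                sb.Append(vbCrLf)
                lineLength = 0
                Continue For
            End If

            ' Spaces and tabs must be escaped when they end a line.
            Dim atLineEnd As Boolean = (i = bytes.Length - 1) OrElse (bytes(i + 1) = 10)
            Dim isBlank As Boolean = (b = 32 OrElse b = 9)
            Dim token As String
            If (b >= 33 AndAlso b <= 126 AndAlso b <> 61) OrElse (isBlank AndAlso Not atLineEnd) Then
                token = ChrW(b).ToString()
            Else
                token = "=" & b.ToString("X2")   ' "=" itself becomes =3D
            End If

            ' Soft line break: "=" followed by CRLF.
            If lineLength + token.Length > 75 Then
                sb.Append("=" & vbCrLf)
                lineLength = 0
            End If
            sb.Append(token)
            lineLength += token.Length
        Next
        Return sb.ToString()
    End Function

    ' Text content block: Quoted-Printable body, absolute Content-Location.
    Function BuildTextPart(ByVal boundary As String, ByVal location As String, _
                           ByVal html As String) As String
        Return "--" & boundary & vbCrLf & _
               "Content-Type: text/html; charset=""utf-8""" & vbCrLf & _
               "Content-Transfer-Encoding: quoted-printable" & vbCrLf & _
               "Content-Location: " & location & vbCrLf & vbCrLf & _
               ToQuotedPrintable(html) & vbCrLf
    End Function

    ' Binary content block: Base64 body wrapped at 76 characters per line.
    Function BuildBinaryPart(ByVal boundary As String, ByVal contentType As String, _
                             ByVal location As String, ByVal data As Byte()) As String
        Return "--" & boundary & vbCrLf & _
               "Content-Type: " & contentType & vbCrLf & _
               "Content-Transfer-Encoding: base64" & vbCrLf & _
               "Content-Location: " & location & vbCrLf & vbCrLf & _
               Convert.ToBase64String(data, Base64FormattingOptions.InsertLineBreaks) & vbCrLf
    End Function

    Sub Main()
        Dim boundary As String = "----=_NextPart_000_00"
        Dim gifHeader As Byte() = New Byte() {&H47, &H49, &H46, &H38, &H39, &H61}

        Dim archive As New StringBuilder()
        archive.Append("MIME-Version: 1.0" & vbCrLf)
        archive.Append("Content-Type: multipart/related; type=""text/html""; boundary=""" & _
                       boundary & """" & vbCrLf & vbCrLf)
        archive.Append(BuildTextPart(boundary, "http://example.com/", _
                                     "<img src=""http://example.com/logo.gif"">"))
        archive.Append(BuildBinaryPart(boundary, "image/gif", _
                                       "http://example.com/logo.gif", gifHeader))
        ' The final boundary carries a trailing "--" to close the message.
        archive.Append("--" & boundary & "--" & vbCrLf)

        Console.WriteLine(archive.ToString())
    End Sub
End Module
```

Note how the HTML part comes out with `src=3D"..."`: the equals sign is the Quoted-Printable escape character, so it has to be escaped itself.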

Not all websites will tolerate being packaged into an MHTML file. This version of Mht.Builder supports frames and IFrames, but watch out for pages that include lots of complicated JavaScript. You'll want to use the .StripScripts option on sites like that.

Using Mht.Builder

MhtBuilder comes with a complete demo app:

Screenshot - Mht demo application

Try it out on your favorite website. By default, the files are generated in the \bin folder of the solution. Just click the View button to launch them. Bear in mind that for the Web Archive and complete tabs, all the content from the target web page must be downloaded to the \bin folder, so it might take a little while! Although I don't provide any feedback events yet, I do emit a lot of progress information via Debug.Write, so switch to the debug output tab to see what's happening in real time.

There are four tabs here, just like the four options IE provides in its Save As Type options. In MhtBuilder, these are the four methods being called, in the order they appear on the tabs:

Public Sub SavePageComplete(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePageArchive(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePage(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePageText(ByVal outputFilePath As String, Optional url As String)

As of Windows XP Service Pack 2, IE applies security restrictions to HTML files opened from disk. To avoid this, we need to add the "Mark of the Web" to the file so IE knows what URL it came from, and can thus assign an appropriate security zone to the HTML. That's what the blnAddMark parameter is for; it causes the HTML file to be tagged with this single line at the top:

<!-- saved from url=(0027)http://www.codeproject.com/ -->
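
The number in parentheses is the length of the URL in characters, zero-padded to four digits. A minimal sketch of generating the mark might look like this (BuildMarkOfTheWeb is a hypothetical helper name, not a member of MhtBuilder):

```vbnet
Imports System

Module MarkOfTheWebSketch

    ' Builds the "Mark of the Web" comment for a given source URL.
    ' {0:D4} pads the URL length to four digits, e.g. 27 -> 0027.
    Function BuildMarkOfTheWeb(ByVal url As String) As String
        Return String.Format("<!-- saved from url=({0:D4}){1} -->", url.Length, url)
    End Function

    Sub Main()
        Console.WriteLine(BuildMarkOfTheWeb("http://www.codeproject.com/"))
        ' prints: <!-- saved from url=(0027)http://www.codeproject.com/ -->
    End Sub
End Module
```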

The other thing we need to do when saving these files is fix up the URLs. Any relative URLs such as:

<img src="/images/standard/logo225x72.gif">

must be converted to absolute URLs like so:

<img src="http://www.codeproject.com/images/standard/logo225x72.gif">
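
For the resolution step itself, the framework's System.Uri class can combine a page URL with a relative reference. The sketch below only illustrates that step; ToAbsoluteUrl is a hypothetical helper, and this is not necessarily how MhtBuilder resolves URLs internally.

```vbnet
Imports System

Module AbsoluteUrlSketch

    ' Resolves a reference found in the HTML against the URL of the page it came from.
    ' References that are already absolute are returned unchanged.
    Function ToAbsoluteUrl(ByVal pageUrl As String, ByVal reference As String) As String
        Dim baseUri As New Uri(pageUrl)
        Return New Uri(baseUri, reference).AbsoluteUri
    End Function

    Sub Main()
        Dim page As String = "http://www.codeproject.com/"
        Console.WriteLine(ToAbsoluteUrl(page, "/images/standard/logo225x72.gif"))
        ' prints: http://www.codeproject.com/images/standard/logo225x72.gif
    End Sub
End Module
```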

We do this using regular expressions, which give us a NameValueCollection of all the references we need to fix. We then loop through each reference and perform the fixup on the HTML string.

Private Function ExternalHtmlFiles() As Specialized.NameValueCollection
  If Not _ExternalFileCollection Is Nothing Then
    Return _ExternalFileCollection
  End If
  
  _ExternalFileCollection = New Specialized.NameValueCollection
  Dim r As Regex
  Dim html As String = Me.ToString
  
  Debug.WriteLine("Resolving all external HTML references from URL:")
  Debug.WriteLine("    " & Me.Url)
  
  '-- src='filename.ext' ; background="filename.ext"
  '-- note that we have to test 3 times to catch all quote styles: '', "", and none
  '-- (the unquoted case also stops at ">" so the tag's closing bracket is not captured)
  r = New Regex( _
    "(\ssrc|\sbackground)\s*=\s*((?<Key>'(?<Value>[^']+)')|" & _
    "(?<Key>""(?<Value>[^""]+)"")|(?<Key>(?<Value>[^ \n\r\f>]+)))", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  
  '-- @import "style.css" or @import url(style.css)
  r = New Regex( _
    "(@import\s|\S+-image:|background:)\s*?(url)*\s*?(?<Key>" & _
    "[""'(]{1,2}(?<Value>[^""')]+)[""')]{1,2})", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  
  '-- <link rel=stylesheet href="style.css">
  r = New Regex( _
    "<link[^>]+?href\s*=\s*(?<Key>" & _
    "('|"")*(?<Value>[^'"">]+)('|"")*)", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  
  '-- <iframe src="mypage.htm"> or <frame src="mypage.aspx">
  r = New Regex( _
    "<i*frame[^>]+?src\s*=\s*(?<Key>" & _
    "['""]{0,1}(?<Value>[^'""\\>]+)['""]{0,1})", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  
  Return _ExternalFileCollection
End Function
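
To see what the named groups capture, here is a small standalone sketch using a pattern modeled on the first regular expression in ExternalHtmlFiles (the unquoted alternative in this sketch stops at `>` so a tag's closing bracket is never captured). The sample HTML is invented. AddMatchesToCollection is not shown in the article, so the sketch just prints each match instead of guessing at that helper.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReferenceScanSketch

    Sub Main()
        ' Invented sample markup with one single-quoted and one double-quoted reference.
        Dim html As String = _
            "<body background='bg.png'><img src=""/images/logo.gif""></body>"

        Dim r As New Regex( _
            "(\ssrc|\sbackground)\s*=\s*((?<Key>'(?<Value>[^']+)')|" & _
            "(?<Key>""(?<Value>[^""]+)"")|(?<Key>(?<Value>[^ \n\r\f>]+)))", _
            RegexOptions.IgnoreCase Or RegexOptions.Multiline)

        ' Key is the text exactly as it appears in the HTML (quotes included),
        ' Value is the bare reference that needs to be resolved and downloaded.
        For Each m As Match In r.Matches(html)
            Console.WriteLine("Key = {0}   Value = {1}", _
                              m.Groups("Key").Value, m.Groups("Value").Value)
        Next
    End Sub
End Module
```

Capturing both forms is convenient: the Value can be turned into an absolute URL, while the Key gives an exact substring of the original HTML to replace.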

We use a similar technique to get a list of all the files we need to download, which are then downloaded via my WebClientEx class. Why use that instead of the built-in Net.WebClient? Good question! Because Net.WebClient doesn't support HTTP compression. My class, on the other hand, does:

Private Function Decompress(ByVal b() As Byte, _
      ByVal CompressionType As HttpContentEncoding) As Byte()

  Dim s As Stream
  Select Case CompressionType
    Case HttpContentEncoding.Deflate
      s = New Zip.Compression.Streams.InflaterInputStream(New MemoryStream(b), _
          New Zip.Compression.Inflater(True))
    Case HttpContentEncoding.Gzip
      s = New GZip.GZipInputStream(New MemoryStream(b))
    Case Else
      Return b
  End Select
  
  Dim ms As New MemoryStream
  Const chunkSize As Integer = 2048
  
  Dim sizeRead As Integer
  Dim unzipBytes(chunkSize - 1) As Byte   '-- VB array bounds are inclusive, so this is exactly chunkSize bytes
  While True
    sizeRead = s.Read(unzipBytes, 0, chunkSize)
    If sizeRead > 0 Then
      ms.Write(unzipBytes, 0, sizeRead)
    Else
      Exit While
    End If
  End While
  s.Close()
  
  Return ms.ToArray
End Function

HTTP compression is a no-brainer: standard GZIP compression, courtesy of the SharpZipLib library, can increase your effective bandwidth by roughly 75 percent on compressible content such as HTML.
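
On .NET 2.0 and later, the same read-in-chunks pattern also works with the framework's own System.IO.Compression streams. The following is a sketch of that alternative, not the code MhtBuilder ships with: it swaps SharpZipLib for GZipStream, and the module and function names are made up for the example. It round-trips a sample string so it can run on its own.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module DecompressSketch

    ' Compresses a byte array with GZIP (only needed here to produce test input).
    Function GzipCompress(ByVal data As Byte()) As Byte()
        Dim target As New MemoryStream()
        Using gz As New GZipStream(target, CompressionMode.Compress)
            gz.Write(data, 0, data.Length)
        End Using
        ' ToArray still works after the GZipStream has closed the MemoryStream.
        Return target.ToArray()
    End Function

    ' Decompresses a GZIP byte array by reading fixed-size chunks until the stream ends.
    Function GzipDecompress(ByVal data As Byte()) As Byte()
        Dim target As New MemoryStream()
        Using source As New GZipStream(New MemoryStream(data), CompressionMode.Decompress)
            Dim buffer(2047) As Byte
            Dim bytesRead As Integer = source.Read(buffer, 0, buffer.Length)
            While bytesRead > 0
                target.Write(buffer, 0, bytesRead)
                bytesRead = source.Read(buffer, 0, buffer.Length)
            End While
        End Using
        Return target.ToArray()
    End Function

    Sub Main()
        ' Repetitive markup, similar in spirit to a real HTML page.
        Dim sb As New StringBuilder()
        For i As Integer = 1 To 200
            sb.Append("<tr><td>row</td><td>value</td></tr>")
        Next
        Dim original As Byte() = Encoding.UTF8.GetBytes(sb.ToString())

        Dim packed As Byte() = GzipCompress(original)
        Dim unpacked As Byte() = GzipDecompress(packed)

        Console.WriteLine("Original bytes:   " & original.Length)
        Console.WriteLine("Compressed bytes: " & packed.Length)
        Console.WriteLine("Round trip OK:    " & _
            (Encoding.UTF8.GetString(unpacked) = sb.ToString()))
    End Sub
End Module
```

If you are making the request through HttpWebRequest on .NET 2.0 or later, setting its AutomaticDecompression property to DecompressionMethods.GZip Or DecompressionMethods.Deflate may remove the need for manual decompression altogether.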

Conclusion

Creating MHTML files isn't hard, but there are lots of little gotchas when dealing with HTML, regular expressions, and HTTP downloads. I tried to document all the difficult bits in the source code. I've also tested MhtBuilder on dozens of different websites so far with excellent results.

There are many more details and comments in the source code provided at the top of the article, so check it out. Please don't hesitate to provide feedback, good or bad! I hope you enjoyed this article. If you did, you may also like my other articles.

History

  • Sunday, September 12, 2004 - Published.
  • Monday, March 28, 2005 - Version 2.0
    • Completely rewritten!
    • Autodetection of content encoding (e.g., international web pages), tested against multi-language websites.
    • Now correctly decompresses both types of HTTP compression.
    • Supports completely in-memory operation for server-side use, or on-disk storage for client use.
    • Now works on web pages with frames and IFrames, using recursive retrieval.
    • HTTP authentication and HTTP Proxy support.
    • Allows configuration of browser ID string to retrieve browser-specific content.
    • Basic cookie support (needs enhancement and testing).
    • Much improved regular expressions used for parsing HTML.
    • Extensive use of VB.NET 2005 style XML comments throughout.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

wumpus1
Web Developer
United States
My name is Jeff Atwood. I live in Berkeley, CA with my wife, two cats, and far more computers than I care to mention. My first computer was the Texas Instruments TI-99/4a. I've been a Microsoft Windows developer since 1992, primarily in VB. I am particularly interested in best practices and human factors in software development, as represented in my recommended developer reading list. I also have a coding and human factors related blog at www.codinghorror.com.



Last Updated 4 Apr 2005
Article Copyright 2004 by wumpus1
Everything else Copyright © CodeProject, 1999-2013