Posted 12 Sep 2004

Convert any URL to a MHTML archive using native .NET code

3 Apr 2005
A native .NET class for saving URLs: text-only, HTML page, HTML archive, or HTML complete.

Sample Image - MhtBuilder.gif


If you've ever used the File | Save As... menu in Internet Explorer, you might have noticed a few interesting options IE provides under the Save As Type drop-down box:

Screenshot - Internet Explorer Save As menu

The options provided are:

  • Web Page, complete
  • Web Archive, single file
  • Web Page, HTML only
  • Text File

Most of these are self-explanatory, with the exception of the Web Archive (MHTML) format. What's neat about this format is that it bundles the web page and all of its references into a single compact .MHT file. It's a lot easier to distribute a single self-contained file than it is to distribute an HTML file with a subfolder full of the image/CSS/Flash/XML files it references. In our case, we were generating HTML reports and we needed to check these reports into a document management system which expects a single file. The MHTML (*.mht) format solves this problem beautifully!

This project contains the MhtBuilder class, a 100% .NET managed code solution which can auto-generate an MHT file from a target URL in one line of code. As a bonus, it will also generate all the other formats listed above. And it's completely free, unlike some commercial solutions you might find out there.


I know people assume the worst of Microsoft, but the MHTML format is actually based on RFC 2557, "MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)". So it's an actual Internet standard! Web Archive, a.k.a. MHTML, is a remarkably simple plain text format which looks a lot like (and is in fact almost identical to) an email message. Here's the header of the MHT file you are viewing at the top of the page:

Screenshot - Mht file header

To generate an MHTML file, we simply merge together all of the files referenced in the HTML. The red line marks the first content block; there will be one content block for each file. We need to follow a few rules, though:

  • Use Quoted-Printable encoding for the text formats.
  • Use Base64 encoding for the binary formats.
  • Make sure the Content-Location has the correct absolute URL for each reference.
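The three rules above can be sketched outside of .NET as well. The following is a minimal Python illustration, not the MhtBuilder implementation: the function names `make_part` and `build_mhtml`, the `resources` tuple layout, and the example URLs are all assumptions made for this sketch. It builds a multipart/related message with one content block per file, Quoted-Printable for text, Base64 for binary, and an absolute Content-Location on each block.

```python
import base64
import quopri
from email.generator import Generator
from email.mime.multipart import MIMEMultipart
from email.mime.nonmultipart import MIMENonMultipart
from io import StringIO


def make_part(url, content_type, data, is_text):
    """Wrap one downloaded file (raw bytes) as a MIME content block."""
    maintype, subtype = content_type.split("/", 1)
    if is_text:
        # Rule 1: text formats use Quoted-Printable encoding.
        part = MIMENonMultipart(maintype, subtype, charset="utf-8")
        body = quopri.encodestring(data).decode("ascii")
        part["Content-Transfer-Encoding"] = "quoted-printable"
    else:
        # Rule 2: binary formats use Base64 encoding.
        part = MIMENonMultipart(maintype, subtype)
        body = base64.encodebytes(data).decode("ascii")
        part["Content-Transfer-Encoding"] = "base64"
    # Rule 3: every block carries the absolute URL it was fetched from.
    part["Content-Location"] = url
    part.set_payload(body)
    return part


def build_mhtml(page_url, html, resources, title="Saved page"):
    """resources: iterable of (absolute_url, content_type, bytes, is_text)."""
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = title
    archive.attach(make_part(page_url, "text/html", html.encode("utf-8"), True))
    for url, content_type, data, is_text in resources:
        archive.attach(make_part(url, content_type, data, is_text))
    out = StringIO()
    # mangle_from_=False keeps body lines starting with "From " untouched.
    Generator(out, mangle_from_=False).flatten(archive)
    return out.getvalue()
```

A call such as `build_mhtml("http://example.com/", html, [("http://example.com/a.gif", "image/gif", gif_bytes, False)])` returns the archive text, which could then be written to a .mht file.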

Not all websites will tolerate being packaged into an MHTML file. This version of Mht.Builder supports frames and IFrames, but watch out for pages that include lots of complicated JavaScript. You'll want to use the .StripScripts option on sites like that.

Using Mht.Builder

MhtBuilder comes with a complete demo app:

Screenshot - Mht demo application

Try it out on your favorite website. By default, the files are generated in the \bin folder of the solution; just click the View button to launch them. Bear in mind that for the Web Archive and Complete tabs, all the content from the target web page must be downloaded to the \bin folder, so it might take a little while! Although I don't provide any feedback events yet, I do emit a lot of progress feedback via Debug.Write, so switch to the debug output tab to see what's happening in real time.

There are four tabs here, just like the four options IE provides in its Save As Type options. In MhtBuilder, these are the four methods being called, in the order they appear on the tabs:

Public Sub SavePageComplete(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePageArchive(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePage(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePageText(ByVal outputFilePath As String, Optional url As String)

As of Windows XP Service Pack 2, HTML files opened from disk result in security blocks. In order to avoid this, we need to add the "Mark of the Web" to the file so IE knows what URL it came from, and can thus assign an appropriate security zone to the HTML. That's what the blnAddMark parameter is for; it causes the HTML file to be tagged with this single line at the top:

<!-- saved from url=(0027) -->
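As a rough illustration of how such a tag is put together: the number in parentheses is the character count of the URL that follows it, zero-padded to four digits. The Python sketch below is not MhtBuilder code; the function name `add_mark_of_the_web` and the example URL are assumptions for illustration.

```python
def add_mark_of_the_web(html, url):
    """Prepend a Mark of the Web comment for `url` to an HTML string."""
    # The count in parentheses is the length of the URL, padded to 4 digits.
    mark = "<!-- saved from url=({:04d}){} -->".format(len(url), url)
    return mark + "\r\n" + html


page = add_mark_of_the_web("<html></html>", "http://www.example.com/")
print(page.splitlines()[0])
```

For the 23-character URL used here, the first line of the output is `<!-- saved from url=(0023)http://www.example.com/ -->`.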

The other thing we need to do when saving these files is fix up the URLs. Any relative URLs such as:

<img src="/images/standard/logo225x72.gif">

must be converted to absolute URLs like so:

<img src="">

We do this using regular expressions, which give us a NameValueCollection of all the references we need to fix. We loop through each reference and perform the fixup on the HTML string.

Private Function ExternalHtmlFiles() As Specialized.NameValueCollection
  If Not _ExternalFileCollection Is Nothing Then
    Return _ExternalFileCollection
  End If
  _ExternalFileCollection = New Specialized.NameValueCollection
  Dim r As Regex
  Dim html As String = Me.ToString
  Debug.WriteLine("Resolving all external HTML references from URL:")
  Debug.WriteLine("    " & Me.Url)
  '-- src='filename.ext' ; background="filename.ext"
  '-- note that we have to test 3 times to catch all quote styles: '', "", and none
  r = New Regex( _
    "(\ssrc|\sbackground)\s*=\s*((?<Key>'(?<Value>[^']+)')|" & _
    "(?<Key>""(?<Value>[^""]+)"")|(?<Key>(?<Value>[^ \n\r\f]+)))", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  '-- @import "style.css" or @import url(style.css)
  r = New Regex( _
    "(@import\s|\S+-image:|background:)\s*?(url)*\s*?(?<Key>" & _
    "[""'(]{1,2}(?<Value>[^""')]+)[""')]{1,2})", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  '-- <link rel=stylesheet href="style.css">
  r = New Regex( _
    "<link[^>]+?href\s*=\s*(?<Key>" & _
    "('|"")*(?<Value>[^'"">]+)('|"")*)", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  '-- <iframe src="mypage.htm"> or <frame src="mypage.aspx">
  r = New Regex( _
    "<i*frame[^>]+?src\s*=\s*(?<Key>" & _
    "['""]{0,1}(?<Value>[^'""\\>]+)['""]{0,1})", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  Return _ExternalFileCollection
End Function
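To make the fixup step concrete, here is a simplified Python sketch of the same idea. It is an assumption-laden illustration rather than a port of the function above: it handles only the src and background attributes, it rewrites the HTML in a single pass instead of first collecting the references, and the names `SRC_PATTERN` and `absolutize` plus the example base URL are made up for this sketch.

```python
import re
from urllib.parse import urljoin

# Matches src= or background= with '', "" or no quotes around the value.
SRC_PATTERN = re.compile(
    r"""(\s(?:src|background)\s*=\s*)(?:'([^']+)'|"([^"]+)"|([^\s>]+))""",
    re.IGNORECASE,
)


def absolutize(html, base_url):
    """Rewrite relative src/background references as absolute URLs."""
    def fix(match):
        # Exactly one of the three value groups is set, depending on quoting.
        value = match.group(2) or match.group(3) or match.group(4)
        return '{}"{}"'.format(match.group(1), urljoin(base_url, value))
    return SRC_PATTERN.sub(fix, html)


base = "http://www.example.com/articles/page.html"
print(absolutize('<img src="/images/standard/logo225x72.gif">', base))
```

`urljoin` resolves root-relative, folder-relative and already-absolute references against the page URL, so no separate case analysis is needed in this sketch.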

We use a similar technique to get a list of all the files we need to download, which are then downloaded via my WebClientEx class. Why use that instead of the built-in Net.WebClient? Good question! Because Net.WebClient doesn't support HTTP compression. My class, on the other hand, does:

Private Function Decompress(ByVal b() As Byte, _
      ByVal CompressionType As HttpContentEncoding) As Byte()

  Dim s As Stream
  Select Case CompressionType
    Case HttpContentEncoding.Deflate
      s = New Zip.Compression.Streams.InflaterInputStream(New MemoryStream(b), _
          New Zip.Compression.Inflater(True))
    Case HttpContentEncoding.Gzip
      s = New GZip.GZipInputStream(New MemoryStream(b))
    Case Else
      Return b
  End Select
  Dim ms As New MemoryStream
  Const chunkSize As Integer = 2048
  Dim sizeRead As Integer
  Dim unzipBytes(chunkSize) As Byte
  '-- keep reading chunks until the stream is exhausted
  While True
    sizeRead = s.Read(unzipBytes, 0, chunkSize)
    If sizeRead > 0 Then
      ms.Write(unzipBytes, 0, sizeRead)
    Else
      Exit While
    End If
  End While
  Return ms.ToArray
End Function

HTTP compression is a no-brainer: standard GZIP compression, courtesy of the SharpZipLib library, can increase your effective bandwidth by roughly 75 percent on text content such as HTML.
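For readers who want to experiment with the two HTTP compression types without SharpZipLib, here is a minimal Python sketch using only the standard library. The function name `decompress` and the fallback behaviour are assumptions for this illustration. The "deflate" branch first tries a zlib-wrapped stream and then falls back to a raw, headerless stream, which is roughly what the Inflater(True) call above selects.

```python
import gzip
import zlib


def decompress(body, content_encoding):
    """Undo the Content-Encoding applied to an HTTP response body (bytes)."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # Some servers send a zlib-wrapped stream (with a 2-byte header).
            return zlib.decompress(body)
        except zlib.error:
            # Others send a raw deflate stream with no header at all.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No compression, or an encoding this sketch does not handle.
    return body


sample = b"<p>hello, world</p>" * 200
print(len(sample), len(gzip.compress(sample)))
```

The final line prints the original and compressed sizes, which gives a feel for how much repetitive HTML shrinks under GZIP.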


Creating MHTML files isn't hard, but there are lots of little gotchas when dealing with HTML, regular expressions, and HTTP downloads. I tried to document all the difficult bits in the source code. I've also tested MhtBuilder on dozens of different websites so far with excellent results.

There are many more details and comments in the source code provided at the top of the article, so check it out. Please don't hesitate to provide feedback, good or bad! I hope you enjoyed this article. If you did, you may also like my other articles.


  • Sunday, September 12, 2004 - Published.
  • Monday, March 28, 2005 - Version 2.0
    • Completely rewritten!
    • Autodetection of content encoding (e.g., international web pages), tested against multi-language websites.
    • Now correctly decompresses both types of HTTP compression.
    • Supports completely in-memory operation for server-side use, or on-disk storage for client use.
    • Now works on web pages with frames and IFrames, using recursive retrieval.
    • HTTP authentication and HTTP Proxy support.
    • Allows configuration of browser ID string to retrieve browser-specific content.
    • Basic cookie support (needs enhancement and testing).
    • Much improved regular expressions used for parsing HTML.
    • Extensive use of VB.NET 2005 style XML comments throughout.


This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



About the Author

Web Developer
United States
My name is Jeff Atwood. I live in Berkeley, CA with my wife, two cats, and far more computers than I care to mention. My first computer was the Texas Instruments TI-99/4a. I've been a Microsoft Windows developer since 1992; primarily in VB. I am particularly interested in best practices and human factors in software development, as represented in my recommended developer reading list. I also have a coding and human factors related blog at

Comments and Discussions

Local HTML/Image files support (continued)
rsegijn, 16-Jun-06 1:20
This works for me (please post any improvements):

1. In WebClientEx I added two functions, ContentTypeFromExtension and IsBinaryFromExtension:

Private Function ContentTypeFromExtension(ByVal UrlExt As String) As String
Select Case UrlExt.ToLower
Case ".htm", ".html"
Return "text/html"
Case ".css"
Return "text/css"
Case ".gif"
Return "image/gif"
Case ".jpg", ".jpeg", ".jpe"
Return "image/jpeg"
Case ".bmp"
Return "image/bmp"
Case ".tif", ".tiff"
Return "image/tiff"
Case ".png"
Return "image/x-png"
Case ".xbm"
Return "image/x-xbitmap"
Case ".xpm"
Return "image/x-xpixmap"
Case ".xwd"
Return "image/x-xwindowdump"
Case ".djv", ".djvu"
Return "image/vnd.djvu"
Case ".js"
Return "text/javascript"
Case ".xml", ".xsl"
Return "text/xml"
Case ".xht", ".xhtml"
Return "application/xhtml+xml"
Case ".txt", ".asc"
Return "text/plain"
Case ".rtf"
Return "text/rtf"
Case ".rtx"
Return "text/richtext"
Case ".sgm", ".sgml"
Return "text/sgml"
Case ".avi"
Return "video/x-msvideo"
Case ".mpe", ".mpeg", ".mpg"
Return "video/mpeg"
Case ".wmv"
Return "video/x-ms-wmv"
Case ".mov", ".qt"
Return "video/quicktime"
Case ".movie"
Return "video/x-sgi-movie"
Case ".mxu"
Return "video/vnd.mpegurl"
Case ".ram", ".rm"
Return "audio/x-pn-realaudio"
Case ".ra"
Return "audio/x-realaudio"
Case ".mp2", ".mp3", ".mpga"
Return "audio/mpeg"
Case ".mid", ".midi"
Return "audio/midi"
Case ".wav"
Return "audio/x-wav"
Case ".aif", ".aifc", ".aiff"
Return "audio/x-aiff"
Case ".doc"
Return "application/msword"
Case ".xls"
Return "application/"
Case ".ppt"
Return "application/"
Case ".flash", ".swf"
Return "application/x-shockwave-flash"
Case ".ipx"
Return "application/x-ipix"
Case ".pdf"
Return "application/pdf"
Case ".zip"
Return "application/zip"
Case ".bin", ".class", ".dll", ".dms", ".exe", ".lha", ".lzh", ".so"
Return "application/octet-stream"
Case ".dcr", ".dir", ".dxr"
Return "application/x-director"
Case Else
Return "text/html"
End Select
End Function
Private Function IsBinaryFromExtension(ByVal UrlExt As String) As Boolean
Select Case UrlExt.ToLower
Case ".htm", ".html", ".css", ".js", ".xml", ".xsl", ".xht", ".xhtml", ".txt", ".asc", ".rtf", ".rtx", ".sgm", ".sgml"
Return False
Case Else
Return True
End Select
End Function

2. Changed Sub GetUrlData:

''' <summary>
''' returns a collection of bytes from a Url
''' </summary>
''' <param name="Url">URL to retrieve</param>
Public Sub GetUrlData(ByVal Url As String, ByVal ifModifiedSince As DateTime)
Dim UrlExt As String
Dim wreq As WebRequest = DirectCast(WebRequest.Create(Url), WebRequest)

UrlExt = Path.GetExtension(Url)
'-- do we need to use a proxy to get to the web?
If _ProxyUrl <> "" Then
Dim wp As New WebProxy(_ProxyUrl)
If _ProxyAuthenticationRequired Then
If _ProxyUser <> "" And _ProxyPassword <> "" Then
wp.Credentials = New NetworkCredential(_ProxyUser, _ProxyPassword)
Else
wp.Credentials = CredentialCache.DefaultCredentials
End If
wreq.Proxy = wp
End If
End If

'-- does the target website require credentials?
If _AuthenticationRequired Then
If _AuthenticationUser <> "" And _AuthenticationPassword <> "" Then
wreq.Credentials = New NetworkCredential(_AuthenticationUser, _AuthenticationPassword)
Else
wreq.Credentials = CredentialCache.DefaultCredentials
End If
End If

wreq.Method = "GET"
wreq.Timeout = _RequestTimeoutMilliseconds
wreq.Headers.Add("Accept-Encoding", _AcceptedEncodings)

'-- sometimes we need to transfer cookies to another URL;
'-- this keeps them around in the object
If KeepCookies Then
If _PersistedCookies Is Nothing Then
_PersistedCookies = New CookieContainer
End If
End If

'-- download the target URL into a byte array
Dim wresp As WebResponse = DirectCast(wreq.GetResponse, WebResponse)

'-- convert response stream to byte array
Dim ebr As New ExtendedBinaryReader(wresp.GetResponseStream)
_ResponseBytes = ebr.ReadToEnd()

'-- determine if body bytes are compressed, and if so,
'-- decompress the bytes
Dim ContentEncoding As HttpContentEncoding
If wresp.Headers.Item("Content-Encoding") Is Nothing Then
ContentEncoding = HttpContentEncoding.None
Else
Select Case wresp.Headers.Item("Content-Encoding").ToLower
Case "gzip"
ContentEncoding = HttpContentEncoding.Gzip
Case "deflate"
ContentEncoding = HttpContentEncoding.Deflate
Case Else
ContentEncoding = HttpContentEncoding.Unknown
End Select
_ResponseBytes = Decompress(_ResponseBytes, ContentEncoding)
End If

'-- sometimes URL is indeterminate, eg, ""
'-- in that case the folder and file resolution MUST be done on
'-- the server, and returned to the client as ContentLocation
_ContentLocation = wresp.Headers("Content-Location")
If _ContentLocation Is Nothing Then
_ContentLocation = ""
End If

'-- if we have string content, determine encoding type
'-- (must cast to prevent Nothing)
_DetectedContentType = wresp.Headers("Content-Type")
If _DetectedContentType Is Nothing Then
_DetectedContentType = ""
_DetectedContentType = ContentTypeFromExtension(UrlExt)
End If
If IsBinaryFromExtension(UrlExt) Then
_DetectedEncoding = Nothing
Else
If _ForcedEncoding Is Nothing Then
_DetectedEncoding = DetectEncoding(_DetectedContentType, _ResponseBytes)
End If
End If

End Sub

3. Maybe not necessary, because I added it before creating the functions in 1)
In External.vb:

3a. After Private _ContentType As String I added:
Private _ContentTypeBefore As String

3b. In Public Property URL() I added:
_ContentTypeBefore = ""
_ContentType = ""

3c. I changed the line:
_ContentType = _Builder.WebClient.ResponseContentType
into:
_ContentTypeBefore = _Builder.WebClient.ResponseContentType
If _ContentTypeBefore = "application/octet-stream" Then
_ContentTypeBefore = "text/html"
End If
_ContentType = _ContentTypeBefore

3d. Because I don't know sh*t about regex constructions I changed Private Sub SetUrl into:
Private Sub SetUrl(ByVal url As String, ByVal validate As Boolean)
If validate Then
_Url = ResolveUrl(url)
Else
_Url = url
End If
'-- http://mywebsite
_UrlRoot = Regex.Match(url, "http://[^/'""]+", RegexOptions.IgnoreCase).ToString
If _UrlRoot = "" Then
_UrlRoot = Regex.Match(url, "file:///[^/'""]+", RegexOptions.IgnoreCase).ToString
End If
If _UrlRoot = "" Then
_UrlRoot = Regex.Match(url, "file:///[^\\'""]+", RegexOptions.IgnoreCase).ToString
End If
'-- http://mywebsite/myfolder
If _Url.LastIndexOf("/") > 8 Then
_UrlFolder = _Url.Substring(0, _Url.LastIndexOf("/"))
Else
_UrlFolder = _UrlRoot
End If
End Sub
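For comparison, the root/folder split that SetUrl performs with regular expressions can also be done with a URL parser. The Python sketch below is only an illustration under that assumption; `url_root_and_folder` is a hypothetical helper, and for file:/// URLs it treats the root as just "file://" rather than including the drive letter as the regex above does.

```python
from urllib.parse import urlsplit


def url_root_and_folder(url):
    """Return (root, folder) for a URL, e.g. http://site and http://site/dir."""
    parts = urlsplit(url)
    root = "{}://{}".format(parts.scheme, parts.netloc)
    path = parts.path
    # Drop the last path segment (the file name); an empty path has no folder.
    folder_path = path[: path.rfind("/")] if "/" in path else ""
    return root, root + folder_path


print(url_root_and_folder("http://mywebsite/myfolder/page.htm"))
print(url_root_and_folder("file:///C:/reports/index.htm"))
```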

3e. In Private Sub AddMatchesToCollection I added:
Dim urlRegex2 As New Regex("^files*:///\w+", RegexOptions.IgnoreCase)
and changed:
If Not urlRegex.IsMatch(value) Then
into:
If Not urlRegex.IsMatch(value) And Not urlRegex2.IsMatch(value) Then

I don't know what problems will arise by changing Sub GetUrlData.
- WebRequest/WebResponse instead of HttpWebRequest/HttpWebResponse
- Left out:
wreq.UserAgent = _HttpUserAgent
wreq.IfModifiedSince = ifModifiedSince
wreq.CookieContainer = _PersistedCookies

Hope the above works for you too.



Last Updated 4 Apr 2005
Article Copyright 2004 by wumpus1