Introduction
If you've ever used the File | Save As... menu in Internet Explorer, you might have noticed a few interesting options IE provides under the Save As Type drop-down box:

The options provided are:
- Web Page, complete
- Web Archive, single file
- Web Page, HTML only
- Text File
Most of these are self-explanatory, with the exception of the Web Archive (MHTML) format. What's neat about this format is that it bundles the web page and all of its references into a single compact .MHT file. It's a lot easier to distribute a single self-contained file than it is to distribute an HTML file with a subfolder full of the image/CSS/Flash/XML files that it references. In our case, we were generating HTML reports and we needed to check these reports into a document management system which expects a single file. The MHTML (*.mht) format solves this problem beautifully!
This project contains the MhtBuilder class, a 100% .NET managed code solution which can auto-generate an MHT file from a target URL, in one line of code. As a bonus, it will also generate all the other formats listed above. And it's completely free, unlike some commercial solutions you might find out there.
Background
I know people assume the worst of Microsoft, but the MHTML format is actually based on RFC 2557, "MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)". So it's an actual Internet standard! Web Archive, a.k.a. MHTML, is a remarkably simple plain text format which looks a lot like (and is in fact almost identical to) an email. Here's the header of the MHT file you are viewing at the top of the page:

To generate an MHTML file, we simply merge together all of the files referenced in the HTML. The red line marks the first content block; there will be one content block for each file. We need to follow a few rules, though:
- Use Quoted-Printable encoding for the text formats.
- Use Base64 encoding for the binary formats.
- Make sure the Content-Location has the correct absolute URL for each reference.
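The three rules above can be sketched in a few lines. What follows is a minimal illustration in Python, not MhtBuilder's actual code: the function name build_mhtml, the boundary string, and the shape of the resources list are all assumptions made for the example. It writes one content block per file, encodes text parts as Quoted-Printable and everything else as Base64, and stamps each block with its absolute Content-Location.

```python
import base64
import quopri


def build_mhtml(page_url, html, resources, boundary="----=_sketch_boundary_0001"):
    """Return an MHTML document as a string (illustrative sketch only).

    resources is a list of (absolute_url, content_type, data_bytes) tuples.
    """
    lines = [
        "MIME-Version: 1.0",
        'Content-Type: multipart/related; type="text/html"; boundary="%s"' % boundary,
        "",
    ]
    # The page itself is the first content block; its references follow.
    blocks = [(page_url, "text/html; charset=utf-8", html.encode("utf-8"))]
    blocks.extend(resources)
    for url, content_type, data in blocks:
        if content_type.startswith("text/"):
            # Rule 1: Quoted-Printable for the text formats.
            transfer_encoding = "quoted-printable"
            body = quopri.encodestring(data)
        else:
            # Rule 2: Base64 for the binary formats.
            transfer_encoding = "base64"
            body = base64.encodebytes(data)
        lines += [
            "--" + boundary,
            "Content-Type: " + content_type,
            "Content-Transfer-Encoding: " + transfer_encoding,
            # Rule 3: the absolute URL of this reference.
            "Content-Location: " + url,
            "",
        ]
        lines += body.decode("ascii").splitlines()
        lines.append("")
    # The closing boundary carries a trailing "--" to end the archive.
    lines.append("--" + boundary + "--")
    return "\r\n".join(lines) + "\r\n"
```

Because the result is an ordinary multipart MIME message, any standard email parser can be used to check that every part decodes back to the original bytes.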
Not all websites will tolerate being packaged into an MHTML file. This version of Mht.Builder supports frames and IFrames, but watch out for pages that include lots of complicated JavaScript. You'll want to use the .StripScripts option on sites like that.
Using Mht.Builder
MhtBuilder comes with a complete demo app:

Try it out on your favorite website. By default, the files are generated in the \bin folder of the solution. Just click the View button to launch them. Bear in mind that for the Web Archive and complete tabs, all the content from the target web page must be downloaded to the \bin folder, so it might take a little while! Although I don't provide any feedback events yet, I do emit a lot of progress feedback via Debug.Write, so switch to the debug output tab to see what's happening in real time.
There are four tabs here, just like the four options IE provides in its Save As Type drop-down. In MhtBuilder, these are the four methods being called, in the order they appear on the tabs:
Public Sub SavePageComplete(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePageArchive(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePage(ByVal outputFilePath As String, Optional url As String)
Public Sub SavePageText(ByVal outputFilePath As String, Optional url As String)
As of Windows XP Service Pack 2, HTML files opened from disk trigger security blocks. To avoid this, we need to add the "Mark of the Web" to the file so IE knows what URL it came from, and can thus assign an appropriate security zone to the HTML. That's what the blnAddMark parameter is for; it causes the HTML file to be tagged with a single comment line at the top, of the form:
<!-- saved from url=(NNNN)URL -->
where URL is the address the page came from and NNNN is the length of that URL in characters, zero-padded to four digits.
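As a rough illustration of the Mark of the Web, here is a small Python sketch (not the MhtBuilder code; the function names mark_of_the_web and add_mark are made up for this example). It builds the comment from the URL and its zero-padded length and prepends it to the saved HTML.

```python
def mark_of_the_web(url):
    """Build the comment line IE reads to choose a security zone."""
    # The number in parentheses is the URL length, padded to four digits.
    return "<!-- saved from url=(%04d)%s -->" % (len(url), url)


def add_mark(html, url):
    """Prepend the mark so that it is the first line of the saved file."""
    return mark_of_the_web(url) + "\r\n" + html
```

For example, mark_of_the_web("about:internet") gives the generic Internet-zone mark, since "about:internet" is 14 characters long.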
The other thing we need to do when saving these files is fix up the URLs. Any relative URLs such as:
<img src="/images/standard/logo225x72.gif">
must be converted to absolute URLs like so:
<img src="http://www.codeproject.com/images/standard/logo225x72.gif">
We do this using regular expressions, which get us a NameValueCollection of all the references we need to fix. We loop through each reference and perform the fixup on the HTML string.
Private Function ExternalHtmlFiles() As Specialized.NameValueCollection
    '-- return the cached collection if we have already built it
    If Not _ExternalFileCollection Is Nothing Then
        Return _ExternalFileCollection
    End If
    _ExternalFileCollection = New Specialized.NameValueCollection
    Dim r As Regex
    Dim html As String = Me.ToString
    Debug.WriteLine("Resolving all external HTML references from URL:")
    Debug.WriteLine(" " & Me.Url)
    '-- src= and background= attributes (single-quoted, double-quoted, or unquoted)
    r = New Regex( _
        "(\ssrc|\sbackground)\s*=\s*((?<Key>'(?<Value>[^']+)')|" & _
        "(?<Key>""(?<Value>[^""]+)"")|(?<Key>(?<Value>[^ \n\r\f]+)))", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    AddMatchesToCollection(html, r, _ExternalFileCollection)
    '-- CSS references: @import, *-image: and background:
    r = New Regex( _
        "(@import\s|\S+-image:|background:)\s*?(url)*\s*?(?<Key>" & _
        "[""'(]{1,2}(?<Value>[^""')]+)[""')]{1,2})", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    AddMatchesToCollection(html, r, _ExternalFileCollection)
    '-- <link href=...> tags
    r = New Regex( _
        "<link[^>]+?href\s*=\s*(?<Key>" & _
        "('|"")*(?<Value>[^'"">]+)('|"")*)", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    AddMatchesToCollection(html, r, _ExternalFileCollection)
    '-- <frame src=...> and <iframe src=...> tags
    r = New Regex( _
        "<i*frame[^>]+?src\s*=\s*(?<Key>" & _
        "['""]{0,1}(?<Value>[^'""\\>]+)['""]{0,1})", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    AddMatchesToCollection(html, r, _ExternalFileCollection)
    Return _ExternalFileCollection
End Function
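To make the fixup step itself concrete, here is a simplified Python sketch of converting relative references to absolute ones. It is an assumption-laden illustration rather than a port of the VB code: the function name absolutize is invented, the pattern only handles quoted src, background and href attributes, and the resolution is delegated to the standard library's urljoin, which handles both root-relative ("/images/x.gif") and folder-relative ("x.gif") references.

```python
import re
from urllib.parse import urljoin

# Quoted src=, background= and href= attributes only; a deliberately
# narrower pattern than the ones in ExternalHtmlFiles().
_REF = re.compile(r"""(\s(?:src|background|href)\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE)


def absolutize(html, base_url):
    """Rewrite relative references in html as absolute URLs."""
    def fix(match):
        prefix, quote, ref = match.group(1), match.group(2), match.group(3)
        # urljoin leaves references that are already absolute unchanged.
        return prefix + quote + urljoin(base_url, ref) + quote
    return _REF.sub(fix, html)
```

Resolving against the full page URL, rather than concatenating strings, avoids having to treat root-relative and folder-relative references as separate cases.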
We use a similar technique to get a list of all the files we need to download, which are then downloaded via my WebClientEx class. Why use that instead of the built-in Net.WebClient? Good question! Because Net.WebClient doesn't support HTTP compression. My class, on the other hand, does:
Private Function Decompress(ByVal b() As Byte, _
    ByVal CompressionType As HttpContentEncoding) As Byte()
    Dim s As Stream
    '-- pick the SharpZipLib stream that matches the Content-Encoding
    Select Case CompressionType
        Case HttpContentEncoding.Deflate
            s = New Zip.Compression.Streams.InflaterInputStream(New MemoryStream(b), _
                New Zip.Compression.Inflater(True))
        Case HttpContentEncoding.Gzip
            s = New GZip.GZipInputStream(New MemoryStream(b))
        Case Else
            '-- not compressed; return the bytes untouched
            Return b
    End Select
    '-- copy the decompressed stream into memory, one chunk at a time
    Dim ms As New MemoryStream
    Const chunkSize As Integer = 2048
    Dim sizeRead As Integer
    Dim unzipBytes(chunkSize) As Byte
    While True
        sizeRead = s.Read(unzipBytes, 0, chunkSize)
        If sizeRead > 0 Then
            ms.Write(unzipBytes, 0, sizeRead)
        Else
            Exit While
        End If
    End While
    s.Close()
    Return ms.ToArray
End Function
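For comparison, the same dispatch on Content-Encoding can be sketched in Python using only the standard library. This is an illustration, not part of MhtBuilder: the function name decompress is hypothetical, and the "deflate" branch assumes a raw deflate stream with no zlib header, which is how I read the Inflater(True) call in the VB code.

```python
import gzip
import zlib


def decompress(body, content_encoding):
    """Undo HTTP compression based on the Content-Encoding header value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # Negative wbits means a raw deflate stream without a zlib header.
        return zlib.decompress(body, -zlib.MAX_WBITS)
    # Unknown or missing encoding: hand the bytes back untouched.
    return body
```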
HTTP compression is a no-brainer: it increases your effective bandwidth by roughly 75 percent using standard GZIP compression, courtesy of the SharpZipLib library.
Conclusion
Creating MHTML files isn't hard, but there are lots of little gotchas when dealing with HTML, regular expressions, and HTTP downloads. I tried to document all the difficult bits in the source code. I've also tested MhtBuilder on dozens of different websites so far with excellent results.
There are many more details and comments in the source code provided at the top of the article, so check it out. Please don't hesitate to provide feedback, good or bad! I hope you enjoyed this article. If you did, you may also like my other articles.
History
- Sunday, September 12, 2004 - Published.
- Monday, March 28, 2005 - Version 2.0
  - Completely rewritten!
  - Autodetection of content encoding (e.g., international web pages), tested against multi-language websites.
  - Now correctly decompresses both types of HTTP compression.
  - Supports completely in-memory operation for server-side use, or on-disk storage for client use.
  - Now works on web pages with frames and IFrames, using recursive retrieval.
  - HTTP authentication and HTTP Proxy support.
  - Allows configuration of browser ID string to retrieve browser-specific content.
  - Basic cookie support (needs enhancement and testing).
  - Much improved regular expressions used for parsing HTML.
  - Extensive use of VB.NET 2005 style XML comments throughout.
Comments
Anyone providing this code as an extension for Firefox 2?
I have been searching at length for a way to get this function into Firefox. Does anyone know of a reliable add-on for Firefox?

Can anyone help me get this working with Internet Explorer 7.0?
It doesn't throw an exception, and it saves the *.mht file, but the MHT file will not open in IE7.
If I go to the same website and "Save As" from Internet Explorer, the MHT file opens OK.
I even tried the program against "http://www.codinghorror.com/blog/" and again it saved an MHT file that will not open in IE7.
Probably something simple, but I can't figure it out...

We are using this code to convert a lot of our ASP reports into MHTML files, each of which serves as a snapshot of that week. The problem we are facing is that the generated MHTML file still only references the images and CSS on our production site, even after archiving. If the main site is down or we are offline, the archived reports do not display properly. Can any of you suggest a tweak to this code to make it work?
Baladitya Ganty

This app works wonderfully with http:// based requests. However, if I try to use file:///... based requests, I get an invalid cast exception in the WebClientEx.vb class on line 343:
Dim wreq As HttpWebRequest = DirectCast(WebRequest.Create(Url), HttpWebRequest)
Does anyone have a workaround? I'm not a guru with HttpWebRequest.
I'm basically writing an application that dumps some information with charts to an HTML file, and I would like to have it converted to .mht format for easy distribution.

Hi,
This is sort of an aside, but I figure that people who look at this page probably have a great familiarity with MHTML, and I need it to solve my problem. I have a webpage which contains a base64 string encoding a .png file. I also know the dimensions of the file etc. But the page will not know the image's URL.
I want to use this image as the background for one of the elements in my page. In Firefox/Safari/Opera, I can just use the "data: URI", i.e.
element.style.backgroundImage = "url(data:image/png;base64," + base64String + ")";
Unfortunately, Internet Explorer does not support the data: URI. But I figure that IE must have this functionality, because it would be ridiculous if it didn't. And it looks to me like MHTML is the most likely way that one can get this done with IE.
Does anyone know if this is possible, and if so, could you please provide a short code snippet explaining how?
Thanks.
P1000

The MHT generated looks fine (it's not all mangled up), but it won't load into IE7.
Any thoughts? When you load the page, it's blank. The source in the browser is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD> <META http-equiv=Content-Type content="text/html; charset=windows-1252"></HEAD> <BODY></BODY></HTML>

...Looking at your blog, Jeff, I found this post from Kyle, who was talking about a fix to open the MHT in Word. I applied the fix, and hey presto.
Cheers, Moose
---------------------
Here's what Kyle said to do:
I've made two code changes to allow for the file to be opened in Word 2003. This made it work for me anyway.
Kyle
In builder.vb starting on line 474 change the procedure to the following:
Private Sub AppendMhtBoundary(Optional ByVal bEndOfFile As Boolean = False)
    AppendMhtLine()
    If bEndOfFile = False Then
        AppendMhtLine("--" & _MimeBoundaryTag)
    Else
        AppendMhtLine("--" & _MimeBoundaryTag & "--")
    End If
End Sub
In builder.vb on line 438, change the procedure call to: AppendMhtBoundary(True)
Kyle on August 15, 2005 05:30 PM

Greetings,
Is it possible to embed Java applets in MHTML files? I tried adding the applet tag to your code and making it process the '.class' files, but it doesn't seem to work. It only displays a blank page.
Thanks.

The application is not working with IIS on Windows 2003 for a localhost application. It works fine with external websites and with IIS 5.0.
For example, www.google.com is fine, but localhost:8080\websitetest\test.aspx does not give the correct result.

I have a list of URLs. How can I process them as a batch?
It seems that the whole process doesn't raise a "Finish" event.

It's like the genie granted me one wish and this was the result. It does exactly what I wanted and it's already in .Net format. Superb!

Hi,
I am using your code to generate .Mht files, but for some sites the generated .Mht files open in IE as text files. Can you please tell me why this happens? Is there a browser setting involved?
Please help me out.
Thanks
Govind

I've got the same problem.
Whereas http://www.codinghorror.com/blog/ was saved successfully to MHT, http://www.codeplex.com does not work correctly.
As Govind already said, only the text/HTML code is displayed.
Any ideas?
Thx, Christian

OK, problem solved!
It occurs when the subject of the MHT archive is too long or includes line breaks.
To solve the problem, only one line of code needs to change:
Builder.vb, line 449: AppendMhtLine("Subject: " & ef.HtmlTitle)
Greetings, Christian

Hi, thank you very much for providing this well-crafted code. I need to save HTML as "Web page complete", "Web page archive", and "Web page as PDF" for an application that backs up blogs. Currently I am doing it using your code. What I am interested in is showing the download progress as the web page is being saved. So I was thinking of combining the functions provided in MHT builder with the extended web browser control found at http://www.codeproject.com/csharp/ExtendedWebBrowser.asp.
I would request your help/guidance/indication/criticism on it.
S M Mahbub Murshed

Hello, great job!! I've tried your library and I found it very useful: my best compliments. But I've found an error saving a particular web page: the URL is "http://www.flcgil.it/notizie/news/2006/dicembre/firmato_il_contratto_di_lavoro_dell_enea_si_inizia_a_parlare_seriamente_dei_precari".
When I save the page, it gives me an exception: "System.IO.PathTooLongException: The path is too long after being fully qualified. Make sure path is less than 260 characters." But the URL is 131 characters! The exception is thrown when saving the page as MHT (with the method SavePageArchive, setting the file storage on disk as temporary or permanent) and as Web page complete (with SavePageComplete). I suppose that during the save, the library writes temporary HTML files whose file name, added to the URL, exceeds 260 characters. If I'm right, the only solution is to give new, shorter names to these temporary files.
Has anybody noticed this bug?
Thanks for your help!
Renato

I think I've fixed this bug. I've noticed two reasons for the failure in saving my page (http://www.flcgil.it/notizie/news/2006/dicembre/firmato_il_contratto_di_lavoro_dell_enea_si_inizia_a_parlare_seriamente_dei_precari):
1) This page probably references itself, so the recursive download of the externally referenced files never ends. Yes, there is a property AllowRecursiveFileRetrieval that I can set to false to avoid this, but I want to be sure to download all necessary files. My idea is to permit the recursion only to a certain depth; on reaching that limit, I assume that the file is self-referencing and stop the recursive download. I've made the following changes in ExternalFile.vb:
Public Sub DownloadExternalFiles(ByVal st As Builder.FileStorage, ByVal level As Integer, Optional ByVal recursive As Boolean = False)
    'test to avoid infinite recursion
    level += 1
    If level > 4 Then
        recursive = False
    End If
    DownloadExternalFiles(st, Me.ExternalFilesFolder, level, recursive)
End Sub
Private Sub DownloadExternalFiles(ByVal st As Builder.FileStorage, ByVal targetFolder As String, ByVal level As Integer, ByVal recursive As Boolean)
    Dim FileCollection As Specialized.NameValueCollection = ExternalHtmlFiles()
    If Not FileCollection.HasKeys Then Return
    Debug.WriteLine("Downloading all external files collected from URL:")
    Debug.WriteLine(" " & Url)
    For Each Key As String In FileCollection.AllKeys
        DownloadExternalFile(FileCollection.Item(Key), st, targetFolder, level, recursive)
    Next
End Sub
Private Sub DownloadExternalFile(ByVal url As String, ByVal st As Builder.FileStorage, _
    ByVal targetFolder As String, ByVal level As Integer, Optional ByVal recursive As Boolean = False)
    '... not changed
    If isNew Then
        '-- add this (possibly) downloaded file to our shared collection
        _Builder.WebFiles.Add(wf.UrlUnmodified, wf)
        '-- if this is an HTML file, it has dependencies of its own;
        '-- download them into a subfolder
        If (wf.IsHtml Or wf.IsCss) And recursive Then
            wf.DownloadExternalFiles(st, level, recursive)
        End If
    End If
End Sub
In the file Builder.vb, in the functions SavePageComplete, GetPageArchive and SavePageArchive, when I call the method DownloadExternalFiles I initialize the depth of the recursion to zero:
_HtmlFile.DownloadExternalFiles(st, 0, _AllowRecursion)
2) When creating a new file name, it shouldn't be too long. This can happen especially if the title of the HTML page is used as the file name. So I've modified the function MakeValidFilename in ExternalFile.vb:
Private Function MakeValidFilename(ByVal s As String, Optional ByVal enforceLength As Boolean = False) As String
    If enforceLength Then
    End If
    '-- replace any invalid filesystem chars, plus leading/trailing/doublespaces
    Dim name As String
    name = Regex.Replace(Regex.Replace(s, "[\/\\\:\*\?\""""\<\>\|]|^\s+|\s+$", ""), "\s{2,}", " ")
    'enforce the maximum length to 25 characters
    If name.Length > 25 Then
        Dim extension As String
        extension = Path.GetExtension(name)
        name = name.Substring(0, 25 - extension.Length) & extension
    End If
    Return name
End Function
(Maybe the optional parameter enforceLength was added to do something similar.) There is also a function MakeValidFilename in the file Builder.vb, but I can't see where it is called, so I haven't modified it.
With these changes, I can save my web page. Has anybody done something similar? Is there something I've missed?
Renato

Thanks for this nice article and good class library. My question is this: if I publish a presentation I created (three slides, for example) to HTM format from Microsoft PowerPoint and try to save it using your library, then each time I click any link in the HTM presentation, mht.dll saves only the first slide.
I think the problem is that the browser URL does not change even when I click a link in the HTM presentation; the first file (called fram.htm) contains only the first slide's URL, so mht.dll detects only that slide page.
I expect the solution would be to save MHT files from the browser cache (like File | Save As in the browser).
I hope you can help me with this problem.
Thanks and Regards
Hammad

The reason for the following code is not fully clear:
If Me.IsCss Then
    _DownloadedBytes = _TextEncoding.GetBytes(ProcessHtml(Me.ToString))
End If
This looks like a mistake. Maybe it should be:
If Me.IsCss Then
    _DownloadedBytes = _TextEncoding.GetBytes(ProcessCss(Me.ToString))
End If

First of all, thank you for the wonderful article and project. My question is not directly related to this project, but I am posting it here in the hope of getting some help. I am trying to write a proxy server to share an internet connection. I know there are several small utilities available for this purpose, but I wanted to do it myself so I can enhance it as I need. There is a nice project (SSLProxy) at GotDotNet with source code, but it is all in C# and I feel much more comfortable using VB.NET. Also, that project is pretty big to be converted to VB.NET. I wrote a small class using sockets, but it is not stable and sometimes it misses a chunk of the stream, especially when more than 2 connections are active. Any help or suggestion is much appreciated.
syedhashmi@gmail.com
-- modified at 19:37 Saturday 22nd July, 2006

Before my contribution, I should say that I very much enjoyed reading through the code!
I think I have found and fixed two small bugs in the WebFile class.
Bug #1: The code wrongly assumes that the URL and <Base HREF=..> are identical. To fix it, I made three changes:
1) I added a private member to the class:
Private _BaseUrlFolder As String
2) _BaseUrlFolder is set in the ProcessHtml() method:
If BaseUrlFolder <> "" Then
    If BaseUrlFolder.EndsWith("/") Then
        _BaseUrlFolder = BaseUrlFolder.Substring(0, BaseUrlFolder.Length - 1)
    Else
        _BaseUrlFolder = BaseUrlFolder
    End If
End If
3) _BaseUrlFolder is used in the ConvertRelativeToAbsoluteRefs() method:
'-- href="/anything" to href="http://www.web.com/anything"
r = New Regex(urlPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib}=${delim1}" & _BaseUrlFolder & "/${url}${delim2}")
'-- href="anything" to href="http://www.web.com/folder/anything"
r = New Regex(urlPattern.Replace("/", ""), _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib}=${delim1}" & _BaseUrlFolder & "/${url}${delim2}")
'-- @import(/anything) to @import url(http://www.web.com/anything)
r = New Regex(cssPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib} url(" & _BaseUrlFolder & "/${url})")
'-- @import(anything) to @import url(http://www.web.com/folder/anything)
r = New Regex(cssPattern.Replace("/", ""), _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib} url(" & _BaseUrlFolder & "/${url})")
Bug #2: In the ProcessHtml method, removal of the <base href=... > tag should be case insensitive and multiline. Code follows:
'-- remove the <base href=''> tag if present; causes problems when viewing locally.
Dim r As New Regex("<base[^>]*?>", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "")
r = Nothing

Generally correct but...
'-- href="/anything" to href="http://www.web.com/anything"
r = New Regex(urlPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib}=${delim1}" & _BaseUrlFolder & "/${url}${delim2}")
'-- @import(/anything) to @import url(http://www.web.com/anything)
r = New Regex(cssPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib} url(" & _BaseUrlFolder & "/${url})")
These two replace root-based relative URLs, so they should use something like _BaseUrlRoot (obtained the same way you describe) instead of _BaseUrlFolder.