
Convert any URL to a MHTML archive using native .NET code

By Jeff Atwood, 3 Apr 2005
 

Sample Image - MhtBuilder.gif

Introduction

If you've ever used the File | Save As... menu in Internet Explorer, you might have noticed a few interesting options IE provides under the Save As Type drop-down box:

Screenshot - Internet Explorer Save As menu

The options provided are:

  • Web Page, complete
  • Web Archive, single file
  • Web Page, HTML only
  • Text File

Most of these are self-explanatory, with the exception of the Web Archive (MHTML) format. What's neat about this format is that it bundles the web page and all of its references into a single compact .MHT file. It's a lot easier to distribute a single self-contained file than it is to distribute an HTML file with a subfolder full of image/CSS/Flash/XML files referenced by that HTML file. In our case, we were generating HTML reports and we needed to check these reports into a document management system which expects a single file. The MHTML (*.mht) format solves this problem beautifully!

This project contains the MhtBuilder class, a 100% .NET managed code solution which can auto-generate a MHT file from a target URL in one line of code. As a bonus, it will also generate all the other formats listed above. And it's completely free, unlike some commercial solutions you might find out there.

Background

I know people assume the worst of Microsoft, but the MHTML format is actually based on RFC 2557, "MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)". So it's an actual Internet standard! Web Archive, a.k.a. MHTML, is a remarkably simple plain text format which looks a lot like (and is in fact almost identical to) an email message. Here's the header of the MHT file you are viewing at the top of the page:

Screenshot - Mht file header

To generate a MHTML file, we simply merge together all of the files referenced in the HTML. The red line marks the first content block; there will be one content block for each file. We need to follow a few rules, though:

  • Use Quoted-Printable encoding for the text formats.
  • Use Base64 encoding for the binary formats.
  • Make sure the Content-Location has the correct absolute URL for each reference.
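The three rules above can be sketched in a few lines. This is a minimal illustration, not the MhtBuilder source: the module name, the function names, and the boundary string are all made up for the example, and the Quoted-Printable encoder is simplified (it does not special-case trailing whitespace on a line).

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Public Module MhtSketch

    '-- simplified Quoted-Printable encoder (hypothetical helper)
    Public Function ToQuotedPrintable(ByVal text As String) As String
        Dim sb As New StringBuilder
        Dim lineLength As Integer = 0
        For Each b As Byte In Encoding.UTF8.GetBytes(text)
            If b = 13 OrElse b = 10 Then
                '-- pass hard line breaks through and reset the counter
                sb.Append(Chr(b))
                lineLength = 0
                Continue For
            End If
            Dim token As String
            If b = 61 OrElse b < 32 OrElse b > 126 Then
                '-- "=" and anything outside printable ASCII becomes =XX
                token = "=" & b.ToString("X2")
            Else
                token = Chr(b).ToString()
            End If
            If lineLength + token.Length > 75 Then
                '-- soft line break keeps encoded lines within 76 characters
                sb.Append("=" & vbCrLf)
                lineLength = 0
            End If
            sb.Append(token)
            lineLength += token.Length
        Next
        Return sb.ToString()
    End Function

    '-- builds a two-part archive: one HTML page plus one GIF image
    Public Function BuildMht(ByVal pageUrl As String, ByVal html As String, _
                             ByVal imageUrl As String, ByVal imageBytes() As Byte) As String
        '-- arbitrary boundary chosen for this sketch
        Const boundary As String = "--=_NextPart_Sketch"
        Dim sb As New StringBuilder
        sb.Append("MIME-Version: 1.0" & vbCrLf)
        sb.Append("Content-Type: multipart/related; type=""text/html""; boundary=""" & boundary & """" & vbCrLf)
        sb.Append(vbCrLf)

        '-- text part: Quoted-Printable, absolute URL in Content-Location
        sb.Append("--" & boundary & vbCrLf)
        sb.Append("Content-Type: text/html; charset=""utf-8""" & vbCrLf)
        sb.Append("Content-Transfer-Encoding: quoted-printable" & vbCrLf)
        sb.Append("Content-Location: " & pageUrl & vbCrLf & vbCrLf)
        sb.Append(ToQuotedPrintable(html) & vbCrLf & vbCrLf)

        '-- binary part: Base64, absolute URL in Content-Location
        sb.Append("--" & boundary & vbCrLf)
        sb.Append("Content-Type: image/gif" & vbCrLf)
        sb.Append("Content-Transfer-Encoding: base64" & vbCrLf)
        sb.Append("Content-Location: " & imageUrl & vbCrLf & vbCrLf)
        sb.Append(Convert.ToBase64String(imageBytes, Base64FormattingOptions.InsertLineBreaks) & vbCrLf & vbCrLf)

        '-- the closing delimiter carries an extra trailing "--"
        sb.Append("--" & boundary & "--" & vbCrLf)
        Return sb.ToString()
    End Function
End Module
```

A real archive has one content block per referenced file, and each block's Content-Type has to match the file, so treat the hard-coded image/gif here as a placeholder.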

Not all websites will tolerate being packaged into a MHTML file. This version of Mht.Builder supports frames and IFrames, but watch out for pages that include lots of complicated JavaScript. You'll want to use the .StripScripts option on sites like that.

Using Mht.Builder

MhtBuilder comes with a complete demo app:

Screenshot - Mht demo application

Try it out on your favorite website. The files will be generated by default in the \bin folder of the solution. Just click the View button to launch them. Bear in mind that for the Web Archive and complete tabs, all the content from the target web page must be downloaded to the \bin folder, so it might take a little while! Although I don't provide any feedback events yet, I do emit a lot of progress feedback via Debug.Write, so switch to the debug output tab to see what's happening in real time.

There are four tabs here, just like the four options IE provides in its Save As Type options. In MhtBuilder, these are the four methods being called, in the order they appear on the tabs:

Public Sub SavePageComplete(ByVal outputFilePath As String, Optional ByVal url As String = "")
Public Sub SavePageArchive(ByVal outputFilePath As String, Optional ByVal url As String = "")
Public Sub SavePage(ByVal outputFilePath As String, Optional ByVal url As String = "")
Public Sub SavePageText(ByVal outputFilePath As String, Optional ByVal url As String = "")

As of Windows XP Service Pack 2, HTML files opened from disk result in security blocks. In order to avoid this, we need to add the "Mark of the Web" to the file so IE knows what URL it came from, and can thus assign an appropriate security zone to the HTML. That's what the blnAddMark parameter is for; it causes the HTML file to be tagged with this single line at the top:

<!-- saved from url=(0027)http://www.codeproject.com/ -->
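The number in parentheses is the length of the URL, padded to four digits, so the mark can be generated for any page. Here is a minimal sketch; the module and function names are hypothetical and are not taken from the MhtBuilder source.

```vbnet
Imports System

Public Module MarkOfTheWebSketch

    '-- builds the "Mark of the Web" comment for a given source URL;
    '-- the four-digit number is the character count of the URL
    Public Function MarkOfTheWeb(ByVal url As String) As String
        Return "<!-- saved from url=(" & url.Length.ToString("0000") & ")" & url & " -->"
    End Function
End Module
```

Calling MarkOfTheWeb("http://www.codeproject.com/") reproduces the line shown above, since that URL is 27 characters long.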

The other thing we need to do when saving these files is fix up the URLs. Any relative URLs such as:

<img src="/images/standard/logo225x72.gif">

must be converted to absolute URLs like so:

<img src="http://www.codeproject.com/images/standard/logo225x72.gif">

We do this using regular expressions, which get us a NameValueCollection of all the references we need to fix. We loop through each reference and perform the fixup on the HTML string.

Private Function ExternalHtmlFiles() As Specialized.NameValueCollection
  If Not _ExternalFileCollection Is Nothing Then
    Return _ExternalFileCollection
  End If
  
  _ExternalFileCollection = New Specialized.NameValueCollection
  Dim r As Regex
  Dim html As String = Me.ToString
  
  Debug.WriteLine("Resolving all external HTML references from URL:")
  Debug.WriteLine("    " & Me.Url)
  
  '-- src='filename.ext' ; background="filename.ext"
  '-- note that we have to test 3 times to catch all quote styles: '', "", and none
  r = New Regex( _
    "(\ssrc|\sbackground)\s*=\s*((?<Key>'(?<Value>[^']+)')|" & _
    "(?<Key>""(?<Value>[^""]+)"")|(?<Key>(?<Value>[^ \n\r\f]+)))", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  
  '-- @import "style.css" or @import url(style.css)
  r = New Regex( _
    "(@import\s|\S+-image:|background:)\s*?(url)*\s*?(?<Key>" & _
    "[""'(]{1,2}(?<Value>[^""')]+)[""')]{1,2})", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  
  '-- <link rel=stylesheet href="style.css">
  r = New Regex( _
    "<link[^>]+?href\s*=\s*(?<Key>" & _
    "('|"")*(?<Value>[^'"">]+)('|"")*)", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  
  '-- <iframe src="mypage.htm"> or <frame src="mypage.aspx">
  r = New Regex( _
    "<i*frame[^>]+?src\s*=\s*(?<Key>" & _
    "['""]{0,1}(?<Value>[^'""\\>]+)['""]{0,1})", _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  AddMatchesToCollection(html, r, _ExternalFileCollection)
  
  Return _ExternalFileCollection
End Function
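To make the collect-then-fix-up idea concrete, here is a deliberately simplified sketch. It is not the project's AddMatchesToCollection: it handles only double-quoted src and href attributes, it resolves each reference with System.Uri rather than with the regular-expression replacement the project uses, and every name in it is hypothetical.

```vbnet
Imports System
Imports System.Collections.Specialized
Imports System.Text.RegularExpressions

Public Module UrlFixupSketch

    '-- maps each reference found in the HTML (key) to its absolute URL (value)
    Public Function ResolveReferences(ByVal html As String, ByVal pageUrl As String) As NameValueCollection
        Dim result As New NameValueCollection
        Dim baseUri As New Uri(pageUrl)
        '-- simplified pattern: double-quoted src/href attributes only
        Dim r As New Regex("\s(?:src|href)\s*=\s*""(?<Value>[^""]+)""", RegexOptions.IgnoreCase)
        For Each m As Match In r.Matches(html)
            Dim reference As String = m.Groups("Value").Value
            Dim absolute As Uri = Nothing
            '-- skip duplicates and anything that cannot be resolved
            If result(reference) Is Nothing AndAlso Uri.TryCreate(baseUri, reference, absolute) Then
                result.Add(reference, absolute.AbsoluteUri)
            End If
        Next
        Return result
    End Function

    '-- rewrites every collected reference in the HTML to its absolute form
    Public Function FixupHtml(ByVal html As String, ByVal pageUrl As String) As String
        Dim refs As NameValueCollection = ResolveReferences(html, pageUrl)
        For Each key As String In refs.AllKeys
            html = html.Replace("""" & key & """", """" & refs(key) & """")
        Next
        Return html
    End Function
End Module
```

With the page URL http://www.codeproject.com/, the logo example from earlier resolves to the absolute form shown there. Real pages also need the single-quoted, unquoted, and CSS cases that the patterns above cover.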

We use a similar technique to get a list of all the files we need to download, which are then downloaded via my WebClientEx class. Why use that instead of the built-in Net.WebClient? Good question! Because Net.WebClient doesn't support HTTP compression. My class, on the other hand, does:

Private Function Decompress(ByVal b() As Byte, _
      ByVal CompressionType As HttpContentEncoding) As Byte()

  Dim s As Stream
  Select Case CompressionType
    Case HttpContentEncoding.Deflate
      s = New Zip.Compression.Streams.InflaterInputStream(New MemoryStream(b), _
          New Zip.Compression.Inflater(True))
    Case HttpContentEncoding.Gzip
      s = New GZip.GZipInputStream(New MemoryStream(b))
    Case Else
      Return b
  End Select
  
  Dim ms As New MemoryStream
  Const chunkSize As Integer = 2048
  
  Dim sizeRead As Integer
  Dim unzipBytes(chunkSize) As Byte
  While True
    sizeRead = s.Read(unzipBytes, 0, chunkSize)
    If sizeRead > 0 Then
      ms.Write(unzipBytes, 0, sizeRead)
    Else
      Exit While
    End If
  End While
  s.Close()
  
  Return ms.ToArray
End Function

HTTP compression is a no-brainer: it can increase your effective bandwidth by around 75 percent on text-heavy pages by using standard GZIP compression, courtesy of the SharpZipLib library.
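For readers on .NET 2.0 or later, the framework's own System.IO.Compression.GZipStream can do the same job without an external library. The sketch below is an alternative to the SharpZipLib-based Decompress function above, not a copy of it; the module and function names are made up, and GzipCompress exists only so the round trip can be tried locally.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression

Public Module GzipSketch

    '-- compresses a byte array with GZIP (used here only to produce test input)
    Public Function GzipCompress(ByVal data() As Byte) As Byte()
        Dim ms As New MemoryStream
        Using gz As New GZipStream(ms, CompressionMode.Compress)
            gz.Write(data, 0, data.Length)
        End Using
        '-- ToArray still works after the stream has been closed
        Return ms.ToArray()
    End Function

    '-- decompresses a GZIP-encoded response body in 2 KB chunks
    Public Function GzipDecompress(ByVal data() As Byte) As Byte()
        Dim output As New MemoryStream
        Using gz As New GZipStream(New MemoryStream(data), CompressionMode.Decompress)
            Dim buffer(2047) As Byte
            Dim sizeRead As Integer = gz.Read(buffer, 0, buffer.Length)
            While sizeRead > 0
                output.Write(buffer, 0, sizeRead)
                sizeRead = gz.Read(buffer, 0, buffer.Length)
            End While
        End Using
        Return output.ToArray()
    End Function
End Module
```

On those framework versions, HttpWebRequest also has an AutomaticDecompression property that accepts DecompressionMethods.GZip Or DecompressionMethods.Deflate, which may remove the need for manual decompression altogether.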

Conclusion

Creating MHTML files isn't hard, but there are lots of little gotchas when dealing with HTML, regular expressions, and HTTP downloads. I tried to document all the difficult bits in the source code. I've also tested MhtBuilder on dozens of different websites so far with excellent results.

There are many more details and comments in the source code provided at the top of the article, so check it out. Please don't hesitate to provide feedback, good or bad! I hope you enjoyed this article. If you did, you may also like my other articles.

History

  • Sunday, September 12, 2004 - Published.
  • Monday, March 28, 2005 - Version 2.0
    • Completely rewritten!
    • Autodetection of content encoding (e.g., international web pages), tested against multi-language websites.
    • Now correctly decompresses both types of HTTP compression.
    • Supports completely in-memory operation for server-side use, or on-disk storage for client use.
    • Now works on web pages with frames and IFrames, using recursive retrieval.
    • HTTP authentication and HTTP Proxy support.
    • Allows configuration of browser ID string to retrieve browser-specific content.
    • Basic cookie support (needs enhancement and testing).
    • Much improved regular expressions used for parsing HTML.
    • Extensive use of VB.NET 2005 style XML comments throughout.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

wumpus1
Web Developer
United States
Member
My name is Jeff Atwood. I live in Berkeley, CA with my wife, two cats, and far more computers than I care to mention. My first computer was the Texas Instruments TI-99/4a. I've been a Microsoft Windows developer since 1992; primarily in VB. I am particularly interested in best practices and human factors in software development, as represented in my recommended developer reading list. I also have a coding and human factors related blog at www.codinghorror.com.


Comments and Discussions

 
Question: Making it work with IE9 (NichUK, 23 Jan '13 - 7:37)
Just spent a couple of hours trying to work out why IE9 wouldn't read the MHT files produced with this project...
 
Turns out to be that it expects the last boundary message to have an additional "--" on the end, so:
 
----=_NextPart_000_00--
 
rather than just
 
----=_NextPart_000_00
 
Might help someone!
Question: Can we mention this in the credits under CPOL license? (albert.dc, 6 Feb '12 - 22:31)
We want to mention your tool in our acknowledgement file but don't know specifically what license type to put.
Question: [FIX] Certain background URLs are wrong [modified] (albert.dc, 30 Sep '11 - 9:07)
...in my case, I have a php content file that gets included in the main php file and the relative background URLs inside <style> tags in the php content file ended up looking like this
url(http://www.foo.com/ )images/bar.png' )
Notice the missing apostrophe at the beginning and the stray " )" sandwiched inside the URL.
This prevents it from downloading the image and also removes the image when rendered.
 
To fix, replace lines 388 to 391 in DownloadHtmlFile(String) in Builder.vb with this
If Not _HtmlFile.WasDownloaded Then
	Throw New Exception("unable to download '" & Me.Url & "': " & _
		_HtmlFile.DownloadException.Message, _HtmlFile.DownloadException)
Else
	Dim html As String = _HtmlFile.ToString
	Dim urlFixRegex As Regex = New Regex("(background:\surl\()(http://.+?)(\s\))", RegexOptions.IgnoreCase Or RegexOptions.Singleline)
	html = urlFixRegex.Replace(html, "$1'$2")
	_HtmlFile.DownloadedBytes = _HtmlFile.TextEncoding.GetBytes(html)
End If


modified 30 Sep '11 - 15:23.

Question: Usage Restrictions? (marchesb, 17 Aug '11 - 5:45)
Jeff,
 
Outstanding work! Your article and source don't appear to have any usage terms associated with them. Can you please clarify what limitations you require for its usage?
 
Thanks!
General: My vote of 5 (Bob@work, 27 Jul '11 - 6:31)
Terrific post, great app, and lots of helpful comments. That's a 5!
Question: Alt to MHT builder-- no dotNET required (Member 8074200, 11 Jul '11 - 0:21)
I have been here before in the past but I never downloaded this program, cause I didn't want to install dotNet again.. as of yet. so I went elsewhere. The visit here led me to finding another MHT editor that didn't need dotNET and in doing so, I found a bug in the program, and it lead to the me playing around with Windows some more as I was already doing.
This led me to play around with Outlook and some MHT and EML files and...
 

People email me and I will tell you how to use OUTLOOK Express as a MHT editor and how to convert EML to MHT's without any added software. Yes, without any added software. WINDOWS already comes with an MHT editor.
 
If you read the Article above, the KEY words are the MIME technology. Not only can you use OUTLOOK Express as an MHT editor, you can convert your HTML pages to MHT and Convert your EML to MHT.

NO SOFTWARE TO DOWNLOAD. In addition, I will tell you how to Open an EML in Internet Explorer;no software to download either.
 
It's cool to download stuff like this that is free, however, all you need is someone with an bit of engineering know-how and is nice enough like me to point out something that is not discussed on the web as I have never come across it being discussed about what I do with OUTLOOK express.
 
I was going to post it here, but I going to post it on my blogspot instead. So like I said, just ask..
 
Netverse
Answer: Re: THIS IS A SPAM, PLEASE REMOVE IT (Aram Azhari, 5 Nov '11 - 20:08)
General: embedding flash file in mht (aj8622, 10 Jun '11 - 2:12)
Does this project embed the SWF file in the MHT too?
General: My vote of 5 (johnson_han, 28 Jul '10 - 19:38)
Useful!
General: Proposed way to support file:///... based requests (marchesb, 12 May '10 - 8:38)
This solution is awesome, just what I was looking for! I noticed in earlier messages that file:///... based requests were originally supported, but that support was dropped. For anyone still interested, I was able to add that support (at least for my purposes) as follows:
 
1. In ExternalFile.vb, updated the urlPattern var in ConvertRelativeToAbsoluteRefs method to include "file:":
Dim urlPattern As String = _
"(?<attrib>\shref|\ssrc|\sbackground)\s*?=\s*?" & _
"(?<delim1>[""'\\]{0,2})(?!\s*\+|#|http:|ftp:|mailto:|javascript:|file:)" & _
"/(?<url>[^""'>\\]+)(?<delim2>[""'\\]{0,2})"
 
2. In ExternalFile.vb, added a new urlFileRegex var in AddMatchesToCollection method:
...
Dim urlFileRegex As New Regex("^file*:///\w+", RegexOptions.IgnoreCase)
...
If (Not urlRegex.IsMatch(value)) AndAlso (Not urlFileRegex.IsMatch(value)) Then
...
 
3. In WebClientEx.vb added these vars at top of the class:
Private _IsFileWebRequest As Boolean = False
Private _LoadedInitHTML As Boolean = False
 
4. In WebClientEx.vb added this prop:
Public Property IsFileWebRequest() As Boolean
Get
Return _IsFileWebRequest
End Get
Set(ByVal Value As Boolean)
_IsFileWebRequest = Value
End Set
End Property
 
5. In WebClientEx.vb, modified GetUrlData method to be as follows:
Public Sub GetUrlData(ByVal Url As String, ByVal ifModifiedSince As DateTime)
'-- a.) Added If/Then check and handling for _IsFileWebRequest = True:
Dim wreq As System.Net.WebRequest
If _IsFileWebRequest = False Then
wreq = DirectCast(WebRequest.Create(Url), HttpWebRequest)
Else
wreq = DirectCast(WebRequest.Create(Url), FileWebRequest)
End If
 
'-- do we need to use a proxy to get to the web?
If _ProxyUrl <> "" Then
Dim wp As New WebProxy(_ProxyUrl)
If _ProxyAuthenticationRequired Then
If _ProxyUser <> "" And _ProxyPassword <> "" Then
wp.Credentials = New NetworkCredential(_ProxyUser, _ProxyPassword)
Else
wp.Credentials = CredentialCache.DefaultCredentials
End If
wreq.Proxy = wp
End If
End If
 
'-- does the target website require credentials?
If _AuthenticationRequired Then
If _AuthenticationUser <> "" And _AuthenticationPassword <> "" Then
wreq.Credentials = New NetworkCredential(_AuthenticationUser, _AuthenticationPassword)
Else
wreq.Credentials = CredentialCache.DefaultCredentials
End If
End If
 
wreq.Method = "GET"
wreq.Timeout = _RequestTimeoutMilliseconds
 
'-- b.) Added If/Then check and type cast handling for _IsFileWebRequest = False:
If _IsFileWebRequest = False Then
CType(wreq, HttpWebRequest).UserAgent = _HttpUserAgent
End If
 
wreq.Headers.Add("Accept-Encoding", _AcceptedEncodings)
 
'-- c.) Added If/Then check and type cast handling for _IsFileWebRequest = False:
If _IsFileWebRequest = False Then
'-- note that, if present, this will trigger a 304 exception
'-- if the URL being retrieved is not newer than the specified
'-- date/time
If ifModifiedSince <> DateTime.MinValue Then
CType(wreq, HttpWebRequest).IfModifiedSince = ifModifiedSince
End If
End If
 
'-- d.) Added If/Then check and type cast handling for _IsFileWebRequest = False:
If _IsFileWebRequest = False Then
'-- sometimes we need to transfer cookies to another URL;
'-- this keeps them around in the object
If KeepCookies Then
If _PersistedCookies Is Nothing Then
_PersistedCookies = New CookieContainer
End If
CType(wreq, HttpWebRequest).CookieContainer = _PersistedCookies
End If
End If
 
'-- download the target URL into a byte array
'-- e.) Added If/Then check and handling for _IsFileWebRequest = True:
Dim wresp As System.Net.WebResponse
If _IsFileWebRequest = False Then
wresp = DirectCast(wreq.GetResponse, HttpWebResponse)
Else
wresp = DirectCast(wreq.GetResponse, FileWebResponse)
End If
 
'-- convert response stream to byte array
Dim ebr As New ExtendedBinaryReader(wresp.GetResponseStream)
_ResponseBytes = ebr.ReadToEnd()
 
'-- determine if body bytes are compressed, and if so,
'-- decompress the bytes
Dim ContentEncoding As HttpContentEncoding
If wresp.Headers.Item("Content-Encoding") Is Nothing Then
ContentEncoding = HttpContentEncoding.None
Else
Select Case wresp.Headers.Item("Content-Encoding").ToLower
Case "gzip"
ContentEncoding = HttpContentEncoding.Gzip
Case "deflate"
ContentEncoding = HttpContentEncoding.Deflate
Case Else
ContentEncoding = HttpContentEncoding.Unknown
End Select
_ResponseBytes = Decompress(_ResponseBytes, ContentEncoding)
End If
 
'-- sometimes URL is indeterminate, eg, "http://website.com/myfolder"
'-- in that case the folder and file resolution MUST be done on
'-- the server, and returned to the client as ContentLocation
_ContentLocation = wresp.Headers("Content-Location")
If _ContentLocation Is Nothing Then
_ContentLocation = ""
End If
 
'-- if we have string content, determine encoding type
'-- (must cast to prevent Nothing)
_DetectedContentType = wresp.Headers("Content-Type")
If _DetectedContentType Is Nothing Then
_DetectedContentType = ""
End If
 
'-- f.) Added If/Then check and handling for _IsFileWebRequest = True (wanted to force to text/html for my purposes, but only for the initial HTML text content):
If _IsFileWebRequest = True AndAlso _LoadedInitHTML = False Then
_DetectedContentType = "text/html;charset=UTF-8"
End If
 
If Me.ResponseIsBinary Then
_DetectedEncoding = Nothing
Else
If _ForcedEncoding Is Nothing Then
_DetectedEncoding = DetectEncoding(_DetectedContentType, _ResponseBytes)
End If
End If
 
'-- g.) Added If/Then check and handling for _IsFileWebRequest = True (wanted to force to utf-8 for my purposes, but only for the initial HTML text content):
If _IsFileWebRequest = True AndAlso _LoadedInitHTML = False Then
_DetectedEncoding = System.Text.Encoding.GetEncoding("utf-8")
End If
 
'-- h.) Added use of _LoadedInitHTML:
_LoadedInitHTML = True
 
End Sub
 

6. In Builder.vb, added IsFileWebRequest arg to SavePageArchive method's signature:
Public Function SavePageArchive(ByVal outputFilePath As String, ByVal st As FileStorage, _
Optional ByVal url As String = "", _
Optional ByVal IsFileWebRequest As Boolean = False) As String
 
7. Added the following line at top of SavePageArchive method:
WebClient.IsFileWebRequest = IsFileWebRequest
 

... and that did it for me.
Question: How to render this website ? (Battosaiii, 25 Jan '10 - 5:41)
Hi all,
 
First of all this is a really good project and very useful. The project works well for a lot of webpages.
I would like to save, for example, the following webpage http://www.mapunderwriting.co.uk/ but the quality is bad. When I inspect the source code, a stylesheet is missing. If I add the missing stylesheet manually (taken from the file saved by IE), it does not solve my issue. What could be the issue here?
 
Thanks
Question: rendering error using mht builder (william saylor, 30 Dec '09 - 9:17)
My question is: is there a way to catch an HTTP error when the URL is opened and the MHT is created? I recently had a problem where an error occurs and the file is still created and displays the error, because it is a valid URL. If it's possible for this check to be put in place, please let me know.
 
Thanks,
Question: File download in https (Gargi K, 3 Apr '09 - 1:02)
Hi.. firstly, thanks for the code. It's exactly what we wanted. However we have one problem. Our dev and QA environments work under HTTP and the code works fine for that, but when the site runs under HTTPS (in the UAT/LIVE environment), it throws the following error:
 
The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel
 
What I am doing is providing an HTML URL for a file which gets created dynamically in the application folder (e.g. 'https://servername/Forms/PrintFiles/HTMLPage.html', where 'servername' is the server where the site is hosted)
 
I will be thankful for any suggestions provided.
Question: Re: File download in https (Greg Hauptmann, 6 Oct '09 - 9:58)
Question: Re: File download in https (m.daveiga, 3 May '10 - 3:49)
General: Good article (Donsw, 15 Mar '09 - 16:34)
I was looking for something like this. All I found were the commercial ones. I am currently saving the HTML only. Although it works, it would be nice to add the graphics.
 
cheers,
Donsw
My Recent Article : Organizational Structure within a Company for PMPs

General: IE (BenAA, 2 Sep '08 - 6:20)
In IE, when I save as MHTML, relative files are not embedded. If I have an HTML page with relative files, save it as an MHTML, then remove the local files, I see no images in the MHTML. Am I doing something wrong? Is this how this code behaves? (Since it is supposed to mimic IE).
Question: MS Word Open the mht error (yanglz99, 9 Jun '08 - 16:39)
When I open the MHT generated by this code with MS Word, it shows an error saying it is not a correct MHT file. What's the problem?
Answer: Re: MS Word Open the mht error [modified] (gg6731 May '10 - 17:37)
General: Going the other way (urbane.tiger, 23 May '08 - 15:37)
Anyone know of something that will transform MHTML into HTML - I downloaded something from Softpedia that claimed to do it - but it doesn't produce any output!
 
TUT
 
If you up your bandwidth from slow DSL to fast DSL, make sure your shields are robust, you'll probably be visiting places you've not been before.

Question: Anyone providing this code as an extension for Firefox 2? (alternety, 3 Jan '08 - 8:43)
Anyone providing this code as an extension for Firefox 2?
 
I have been searching at length for a way to get this function into Firefox. Anyone know of a reliable add on for Firefox?
Question: MSIE7.0, VS2005, Vista Home Premium Not Working (JustALark, 6 Dec '07 - 7:17)
Can anyone help me get this working with Internet Explorer 7.0?
 
It doesn't throw an exception and saves the *.mht file.
The MHT file will not open in IE7.
 
If I go to the same website and "Save As" from Internet Explorer the MHT file opens OK.
 
I even tried the program against "http://www.codinghorror.com/blog/" and again it saves a MHT file that will not open in IE7.
 
Probably something simple but I can't figure it out...
Question: Images and CSS files are being referenced to the website and not being encoded in the single file (Baladitya Ganty, 20 Jul '07 - 5:42)
We are using this piece of code to convert a lot of our ASP reports to MHTML files, each of which serves as an image of that week.
The problem we are facing is that the generated MHTML file just adds references to all the images and CSS on our production site, even after archiving. If the main site is down or we are offline, these archived reports do not display properly. Can any of you suggest a tweak to this code to make it work?
 
Baladitya Ganty
Question: Trying to get html files on my hard drive to be converted... (itskyb, 16 Jul '07 - 9:04)
This app works wonderfully with http:// based requests. However, if I try to use file:///... based requests, I get an invalid cast exception with the WebClientEx.vb class on line 343:
 
Dim wreq As HttpWebRequest = DirectCast(WebRequest.Create(Url), HttpWebRequest)
 
Anyone have a work around? I'm not a guru with the HttpWebRequest.
 
I'm basically writing an application that dumps some information with charts to a html file and I would like to have it converted to .mht format for easy distribution.
General: Interesting problem (p1000, 5 Jul '07 - 6:48)
Hi,
 
This is sort of an aside, but I figure that people who look at this page probably have a great familiarity with MHTML, and I need it to solve my problem. I have a webpage which contains a base64 string encoding a .png file. I also know the dimensions of the file etc. But the page will not know the image's URL.
 
I want to use this image as the background for one of the elements in my page. In Firefox/Safari/Opera, I can just use the "data: URI", i.e.
 
element.style.backgroundImage = "url(data:image/png;base64," + base64String + ")";
 
Unfortunately, Internet Explorer does not support the data: URI. But I figure that IE must have this functionality, because it would be ridiculous if it didn't. And it looks to me like MHTML is the most likely way that one can get this done with IE.
 
Does anyone know if this is possible, and if so, could you please provide a short code snippet explaining how?
 
Thanks.
 
P1000
General: Re: Interesting problem (gordon byers, 10 Jun '09 - 4:05)
Question: Has anyone got this working on vista / vs2005 / ie7 ? (Mootah, 21 Jun '07 - 9:43)
 
The mht generated looks fine (it's not all mungled up), but it won't load into ie7.
 
Any thoughts? When you load the page, it's blank.
the source in the browser is:
 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=windows-1252"></HEAD>
<BODY></BODY></HTML>
Answer: Fixed it....well, to be honest, Kyle fixed it (Mootah, 21 Jun '07 - 10:34)
General: Re: Fixed it....well, to be honest, Kyle fixed it (robalexclark, 5 Nov '08 - 6:00)
Question: Java applets (vitorg, 20 Jun '07 - 1:41)
Greetings,
 
Is it possible to embed java applets in mhtml files?
I tried adding the applet tag to your code and making it process the '.class' files but it doesn't seem to work. It only displays a blank page.
 
Thanks.
General: IIS6.0 Windows 2003 (RajeshCR, 4 Jun '07 - 20:25)
The application is not working with IIS 6.0 and Windows 2003 for a localhost application. It works fine with external websites and IIS 5.0.
 
For example www.google.com is fine and localhost:8080\websitetest\test.aspx is not giving the correct result
Question: How to make it to Batch Process? (Elven Wong, 28 Mar '07 - 22:32)
I have a list of URLs. How can I process them as a batch?
 
Seems that the whole process won't raise a "Finish" event..
General: Awesome! Just what I've been looking for. (saab340b, 27 Mar '07 - 5:39)
It's like the genie granted me one wish and this was the result. It does exactly what I wanted and it's already in .Net format.
Superb!
Question: why the .Mht files are showing as text files? (govindaraj.perumal, 15 Mar '07 - 20:46)
Hi,
 
I am using your code to generate .Mht files. but at some sites, the generated .Mht files are opening in IE as text file. Can you please tell me why does it happen? is there any browser settings?
 
please help me out?
 
Thanks
 
Govind
Answer: Re: why the .Mht files are showing as text files? (ChristianW, 7 Jan '08 - 23:01)
General: Re: why the .Mht files are showing as text files? (ChristianW, 7 Jan '08 - 23:21)
General: Download progress (udvranto, 3 Feb '07 - 11:15)
Hi,
Thank you very much for providing this well crafted code. I need to save html as "Web page complete", "Web page archive", "Web page as PDF" for an application to backup blogs. Currently I am doing it using your code . What I am interested in is to show the download progress as the web page is being saved. So I was thinking of combining the functions provided in MHT builder into the extended web browser control found at http://www.codeproject.com/csharp/ExtendedWebBrowser.asp.
 
I would request your help/guidance/indication/criticism on it.

 
S M Mahbub Murshed
Question: Trouble saving long url (but not so long! Less than 260 characters, I mean) [modified] (La raza, 22 Dec '06 - 5:07)
Hello,
great job!! I've tried your library and I found it very useful: my best compliments.
But I've found an error saving a particular web page: the url is "http://www.flcgil.it/notizie/news/2006/dicembre/firmato_il_contratto_di_lavoro_dell_enea_si_inizia_a_parlare_seriamente_dei_precari".
 
When I save the page, it gives me an exception: "System.IO.PathTooLongException: The path is too long after being fully qualified. Make sure path is less than 260 characters." But the url is 131 characters!
The exception is thrown saving the page in mht (with the method SavePageArchive, setting the file storage on disk as temporary or permanent) and in Web page complete (with SavePageComplete).
I suppose that during the saving, the library saves temporary HTML files where the name of the file, added to the URL, exceeds 260 characters. If I'm right, the only solution is to give these temporary files new, shorter names.
 
Has anybody noticed this bug?
 
Thanks for your help!
 
Renato
Answer: Re: Trouble saving long url (but not so long! Less than 260 characters, I mean) (La raza, 28 Dec '06 - 5:00)
Question: Browser reads www.century21.com but this program doesn't , why ? (dlwells, 10 Dec '06 - 14:55)
Why ?
Question: How i can save the htm output from MS PowerPoint to mhtml format (Mohammad Hammad, 28 Nov '06 - 8:42)
Thanks for this nice article and good class library. My question is this: I publish a presentation (3 slides, for example) to HTM format from Microsoft PowerPoint and try to save it using your library, but whichever link I click in the HTM presentation, mht.dll saves only the first slide.
 
I think the problem is that the URL in the browser does not change even if I click a link in the HTM presentation. The first file (called fram.htm) contains only the URL of the first slide, so mht.dll detects only that slide page.
 
I expect the solution would be to save the MHT files from the browser cache (like File | Save As in the browser).
 
I hope you can help me with this problem.
 
Thanks and Regards
 
Hammad

Question: WebFile.Download() code question (hev, 8 Nov '06 - 2:32)
The reason for the following code is not fully clear:

If Me.IsCss Then
_DownloadedBytes = _TextEncoding.GetBytes(ProcessHtml(Me.ToString))
End If

 
It seems to be a typo.
Maybe it should be

If Me.IsCss Then
_DownloadedBytes = _TextEncoding.GetBytes(ProcessCss(Me.ToString))
End If

General: awesome (Tim Kohler, 18 Aug '06 - 9:27)
This is truly great work. Thanks a lot!
Question: Proxy Server help [modified] (Syed Javed, 22 Jul '06 - 13:35)
First of all thank you for the wonderful article and project. My question is not directly related to this project but I am posting it here in the hope of getting some help. I am trying to write a proxy server to share an internet connection. I know there are several small utilities available for this purpose but I wanted to do it myself so I can enhance it as I need. There is a nice project (SSLProxy) at GotDotNet with source code but it is all in C# and I feel much more comfortable using VB.NET. Also that project is pretty big to be converted to VB.NET. I wrote a small class using SOCKETS but it is not stable and sometimes it misses a chunk of the stream, especially when more than 2 connections are active.
Any help or suggestion is much appreciated.
 
syedhashmi@gmail.com

 
 
-- modified at 19:37 Saturday 22nd July, 2006
General: A contribution (Yehuda A, 15 Jul '06 - 10:39)
Before my contribution, I should state that I very much enjoyed reading through the code!
 
I think I have found and fixed two small bugs in the WebFile class.
 
Bug #1
The code wrongly assumes that the URL and <Base HREF=..> are identical. To fix it, I made three changes:
1) I added a private member to the class:
Private _BaseUrlFolder As String
 
2) _BaseUrlFolder is set in the ProcessHtml() method:

If BaseUrlFolder <> "" Then
    If BaseUrlFolder.EndsWith("/") Then
        _BaseUrlFolder = BaseUrlFolder.Substring(0, BaseUrlFolder.Length - 1)
    Else
        _BaseUrlFolder = BaseUrlFolder
    End If
End If

3) _BaseUrlFolder is used in the ConvertRelativeToAbsoluteRefs() method:

'-- href="/anything" to href="http://www.web.com/anything"
r = New Regex(urlPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib}=${delim1}" & _BaseUrlFolder & "/${url}${delim2}")

'-- href="anything" to href="http://www.web.com/folder/anything"
r = New Regex(urlPattern.Replace("/", ""), _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib}=${delim1}" & _BaseUrlFolder & "/${url}${delim2}")

'-- @import(/anything) to @import url(http://www.web.com/anything)
r = New Regex(cssPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib} url(" & _BaseUrlFolder & "/${url})")

'-- @import(anything) to @import url(http://www.web.com/folder/anything)
r = New Regex(cssPattern.Replace("/", ""), _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "${attrib} url(" & _BaseUrlFolder & "/${url})")

Bug #2
In the ProcessHtml method, removal of <base href=... > tag should be case insensitive and multiline. Code follows:

'-- remove the <base href=''> tag if present; causes problems when viewing locally.
Dim r As New Regex("<base[^>]*?>", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
html = r.Replace(html, "")
r = Nothing
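
For anyone who wants to try the Bug #2 fix outside the WebFile class, here is a minimal stand-alone sketch. The names BaseTagDemo and RemoveBaseTag are made up for illustration and are not part of MhtBuilder. As far as the regex goes, RegexOptions.IgnoreCase is the option that lets the pattern match <BASE ...> and <Base ...>; RegexOptions.Multiline only changes how ^ and $ behave, and because [^>] already matches line breaks, a tag that wraps across lines is removed either way.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module BaseTagDemo

    ' Hypothetical helper (not part of MhtBuilder): removes every <base ...> tag.
    Function RemoveBaseTag(ByVal html As String) As String
        ' IgnoreCase handles <BASE>, <Base> and <base>; [^>] also matches
        ' newlines, so a tag split over several lines is matched as well.
        Dim r As New Regex("<base[^>]*?>", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
        Return r.Replace(html, "")
    End Function

    Sub Main()
        ' Upper-case tag that wraps onto a second line.
        Dim html As String = "<HEAD><BASE" & vbCrLf & " HREF=""http://www.web.com/folder/""></HEAD>"
        ' Prints the markup with the <BASE ...> tag stripped out.
        Console.WriteLine(RemoveBaseTag(html))
    End Sub

End Module
```

One caveat with this pattern: it also matches <basefont ...> tags, since they start with the same characters. If that matters for your pages, the pattern would need a word boundary after "base".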

General: Re: A contribution (hev, 8 Nov '06 - 6:15)
Question: How can I Convert MHTML to HTML (xfary, 3 Jul '06 - 15:56)
How can I convert MHTML to HTML using C# or another language?
Question: how can i do multiple html files at once (cnrock, 27 Jun '06 - 17:07)
I really want to know how to process multiple HTML files at once. Is Me.url an array?
I entered "http://www.codeproject.com (ENTER) http://www.google.com" into the Target URL box, but it shows "unable to download 'http://www.codeproject.com/%0D%0Ahttp:/www.google.com': The remote server returned an error: (400) Bad Request."
 
What can I do?
 

Sorry, I know little about VB.NET and my English is terrible.
Please help me.
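
The %0D%0A in the error message is a URL-encoded line break, which suggests that both addresses were passed along as one single URL. One possible approach, sketched here under the assumption that the text box content arrives as one string, is to split it on line breaks and handle each URL separately. UrlListDemo and SplitUrls are hypothetical names, and the per-URL save call is left as a comment because it depends on how you call MhtBuilder. The sketch uses generics and StringSplitOptions, so it assumes .NET 2.0 or later.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module UrlListDemo

    ' Hypothetical helper: turns multi-line input into a list of trimmed URLs.
    Function SplitUrls(ByVal text As String) As List(Of String)
        Dim urls As New List(Of String)
        Dim separators() As Char = {ControlChars.Cr, ControlChars.Lf}
        ' RemoveEmptyEntries drops the empty piece between CR and LF.
        For Each line As String In text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
            Dim url As String = line.Trim()
            If url.Length > 0 Then urls.Add(url)
        Next
        Return urls
    End Function

    Sub Main()
        Dim text As String = "http://www.codeproject.com" & vbCrLf & "http://www.google.com"
        For Each url As String In SplitUrls(text)
            ' Call the MHT builder once per URL here.
            Console.WriteLine(url)
        Next
    End Sub

End Module
```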
General: Is this able to do multiple html files at once (sdejager, 27 Jun '06 - 13:56)
I was hoping to point this fantastic program at a URL and have it convert each and every HTML file on that site into MHT. It seems to take the index page only and then stop. It also saves the file under the page title rather than the actual file name...
 
I really like how this works. If it is able to do an entire site and name the files using the actual file name rather than the title, could you let me know?
 
Perfect!
 
Sean of the Naki
General: Local HTML/Image files support (continued) (rsegijn, 16 Jun '06 - 0:20)
This works for me (please post any improvements):
 
1. In WebClient.ex I added 2 functions ContentTypeFromExtension and IsBinaryFromExtension:
 
Private Function ContentTypeFromExtension(ByVal UrlExt As String) As String
    Select Case UrlExt.ToLower
        Case ".htm", ".html"
            Return "text/html"
        Case ".css"
            Return "text/css"
        Case ".gif"
            Return "image/gif"
        Case ".jpg", ".jpeg", ".jpe"
            Return "image/jpeg"
        Case ".bmp"
            Return "image/bmp"
        Case ".tif", ".tiff"
            Return "image/tiff"
        Case ".png"
            Return "image/x-png"
        Case ".xbm"
            Return "image/x-xbitmap"
        Case ".xpm"
            Return "image/x-xpixmap"
        Case ".xwd"
            Return "image/x-xwindowdump"
        Case ".djv", ".djvu"
            Return "image/vnd.djvu"
        Case ".js"
            Return "text/javascript"
        Case ".xml", ".xsl"
            Return "text/xml"
        Case ".xht", ".xhtml"
            Return "application/xhtml+xml"
        Case ".txt", ".asc"
            Return "text/plain"
        Case ".rtf"
            Return "text/rtf"
        Case ".rtx"
            Return "text/richtext"
        Case ".sgm", ".sgml"
            Return "text/sgml"
        Case ".avi"
            Return "video/x-msvideo"
        Case ".mpe", ".mpeg", ".mpg"
            Return "video/mpeg"
        Case ".wmv"
            Return "video/x-ms-wmv"
        Case ".mov", ".qt"
            Return "video/quicktime"
        Case ".movie"
            Return "video/x-sgi-movie"
        Case ".mxu"
            Return "video/vnd.mpegurl"
        Case ".ram", ".rm"
            Return "audio/x-pn-realaudio"
        Case ".ra"
            Return "audio/x-realaudio"
        Case ".mp2", ".mp3", ".mpga"
            Return "audio/mpeg"
        Case ".mid", ".midi"
            Return "audio/midi"
        Case ".wav"
            Return "audio/x-wav"
        Case ".aif", ".aifc", ".aiff"
            Return "audio/x-aiff"
        Case ".doc"
            Return "application/msword"
        Case ".xls"
            Return "application/vnd.ms-excel"
        Case ".ppt"
            Return "application/vnd.ms-powerpoint"
        Case ".flash", ".swf"
            Return "application/x-shockwave-flash"
        Case ".ipx"
            Return "application/x-ipix"
        Case ".pdf"
            Return "application/pdf"
        Case ".zip"
            Return "application/zip"
        Case ".bin", ".class", ".dll", ".dms", ".exe", ".lha", ".lzh", ".so"
            Return "application/octet-stream"
        Case ".dcr", ".dir", ".dxr"
            Return "application/x-director"
        Case Else
            Return "text/html"
    End Select
End Function

Private Function IsBinaryFromExtension(ByVal UrlExt As String) As Boolean
    Select Case UrlExt.ToLower
        Case ".htm", ".html", ".css", ".js", ".xml", ".xsl", ".xht", ".xhtml", ".txt", ".asc", ".rtf", ".rtx", ".sgm", ".sgml"
            Return False
        Case Else
            Return True
    End Select
End Function
 
2. Changed Sub GetUrlData:
 
''' <summary>
''' returns a collection of bytes from a Url
''' </summary>
''' <param name="Url">URL to retrieve</param>
Public Sub GetUrlData(ByVal Url As String, ByVal ifModifiedSince As DateTime)
    Dim UrlExt As String
    Dim wreq As WebRequest = DirectCast(WebRequest.Create(Url), WebRequest)

    UrlExt = Path.GetExtension(Url)
    '-- do we need to use a proxy to get to the web?
    If _ProxyUrl <> "" Then
        Dim wp As New WebProxy(_ProxyUrl)
        If _ProxyAuthenticationRequired Then
            If _ProxyUser <> "" And _ProxyPassword <> "" Then
                wp.Credentials = New NetworkCredential(_ProxyUser, _ProxyPassword)
            Else
                wp.Credentials = CredentialCache.DefaultCredentials
            End If
            wreq.Proxy = wp
        End If
    End If

    '-- does the target website require credentials?
    If _AuthenticationRequired Then
        If _AuthenticationUser <> "" And _AuthenticationPassword <> "" Then
            wreq.Credentials = New NetworkCredential(_AuthenticationUser, _AuthenticationPassword)
        Else
            wreq.Credentials = CredentialCache.DefaultCredentials
        End If
    End If

    wreq.Method = "GET"
    wreq.Timeout = _RequestTimeoutMilliseconds
    wreq.Headers.Add("Accept-Encoding", _AcceptedEncodings)

    '-- sometimes we need to transfer cookies to another URL;
    '-- this keeps them around in the object
    If KeepCookies Then
        If _PersistedCookies Is Nothing Then
            _PersistedCookies = New CookieContainer
        End If
    End If

    '-- download the target URL into a byte array
    Dim wresp As WebResponse = DirectCast(wreq.GetResponse, WebResponse)

    '-- convert response stream to byte array
    Dim ebr As New ExtendedBinaryReader(wresp.GetResponseStream)
    _ResponseBytes = ebr.ReadToEnd()

    '-- determine if body bytes are compressed, and if so,
    '-- decompress the bytes
    Dim ContentEncoding As HttpContentEncoding
    If wresp.Headers.Item("Content-Encoding") Is Nothing Then
        ContentEncoding = HttpContentEncoding.None
    Else
        Select Case wresp.Headers.Item("Content-Encoding").ToLower
            Case "gzip"
                ContentEncoding = HttpContentEncoding.Gzip
            Case "deflate"
                ContentEncoding = HttpContentEncoding.Deflate
            Case Else
                ContentEncoding = HttpContentEncoding.Unknown
        End Select
        _ResponseBytes = Decompress(_ResponseBytes, ContentEncoding)
    End If

    '-- sometimes URL is indeterminate, eg, "http://website.com/myfolder"
    '-- in that case the folder and file resolution MUST be done on
    '-- the server, and returned to the client as ContentLocation
    _ContentLocation = wresp.Headers("Content-Location")
    If _ContentLocation Is Nothing Then
        _ContentLocation = ""
    End If

    '-- if we have string content, determine encoding type
    '-- (must cast to prevent Nothing)
    _DetectedContentType = wresp.Headers("Content-Type")
    If _DetectedContentType Is Nothing Then
        _DetectedContentType = ""
    Else
        _DetectedContentType = ContentTypeFromExtension(UrlExt)
    End If
    If IsBinaryFromExtension(UrlExt) Then
        _DetectedEncoding = Nothing
    Else
        If _ForcedEncoding Is Nothing Then
            _DetectedEncoding = DetectEncoding(_DetectedContentType, _ResponseBytes)
        End If
    End If

End Sub
 
3. This step may not be necessary, because I added it before creating the functions in 1).
In External.vb:
 
3a. After Private _ContentType As String I added:
Private _ContentTypeBefore As String
 
3b. In Public Property URL() I added:
_ContentTypeBefore = ""
after
_ContentType = ""
 
3c. I changed the line:
_ContentType = _Builder.WebClient.ResponseContentType
into
_ContentTypeBefore = _Builder.WebClient.ResponseContentType
If _ContentTypeBefore = "application/octet-stream" Then
    _ContentTypeBefore = "text/html"
End If
_ContentType = _ContentTypeBefore
 
3d. Because I don't know sh*t about regex constructions I changed Private Sub SetUrl into:
Private Sub SetUrl(ByVal url As String, ByVal validate As Boolean)
    If validate Then
        _Url = ResolveUrl(url)
    Else
        _Url = url
    End If
    '-- http://mywebsite
    _UrlRoot = Regex.Match(url, "http://[^/'""]+", RegexOptions.IgnoreCase).ToString
    If _UrlRoot = "" Then
        _UrlRoot = Regex.Match(url, "file:///[^/'""]+", RegexOptions.IgnoreCase).ToString
    End If
    If _UrlRoot = "" Then
        _UrlRoot = Regex.Match(url, "file:///[^\\'""]+", RegexOptions.IgnoreCase).ToString
    End If
    '-- http://mywebsite/myfolder
    If _Url.LastIndexOf("/") > 8 Then
        _UrlFolder = _Url.Substring(0, _Url.LastIndexOf("/"))
    Else
        _UrlFolder = _UrlRoot
    End If
End Sub
 
3e. In Private Sub AddMatchesToCollection I added:
Dim urlRegex2 As New Regex("^files*:///\w+", RegexOptions.IgnoreCase)
and changed:
If Not urlRegex.IsMatch(value) Then
into:
If Not urlRegex.IsMatch(value) And Not urlRegex2.IsMatch(value) Then
 

Note:
I don't know what problems might arise from these changes to Sub GetUrlData:
- It uses WebRequest/WebResponse instead of HttpWebRequest/HttpWebResponse.
- It leaves out:
wreq.UserAgent = _HttpUserAgent
wreq.IfModifiedSince = ifModifiedSince
wreq.CookieContainer = _PersistedCookies
 
Hope the above works for you too.
 
Bye,
 
Ron

Last Updated 4 Apr 2005
Article Copyright 2004 by wumpus1
Everything else Copyright © CodeProject, 1999-2013