
Converting RTF to HTML in VB.NET the Easy Way

By Hanleyk1, 14 Jan 2010
 

Introduction

This article will explain an easy, robust way to convert rich text to HTML using VB.NET and Microsoft Office Automation.

Background

This all started out because I needed to take the contents of a RichTextBox in an application I had developed and insert it into the body of an email. We're a Microsoft shop all around, so I could depend on Outlook 2007 to be the email client for all users, and I assumed (poorly) that I would be able to insert rich text into an Outlook email with little or no problem. Silly me.

Once I figured out that Outlook did not support rich text, even though it was using Word as its editor, I set about trying to convert my RTF to HTML, and I assumed (again) that there must be some simple, straightforward way to do it without parsing all the RTF and accounting for each and every formatting tag myself. An exhaustive search of the internet turned up several third-party apps; some of them were free, most of them parsed the RTF and seemed to be a little incomplete, and none of them really fit the bill when it came to simplicity.

I started fooling around with Office automation, thinking that if Microsoft didn't supply direct access to their RTF-to-HTML conversion process, perhaps they would supply indirect access. Sure enough, after fiddling around with Word for a while, I was able to figure out how to use Word as a translator and convert RTF directly to HTML in one short function. So here, for the assistance of all the other wage slaves out there struggling with a similar problem, is how I did it. Nothing earth-shattering here, but a very handy function to have in your back pocket.

Using the Code

Basically, just throw this function into your VB.NET project. You'll need to include a reference to the Microsoft Word 12.0 Object Library (COM object). Other Word libraries may do just as well, but this is how I've used it.

Public Function sRTF_To_HTML(ByVal sRTF As String) As String
    'Declare a Word Application object and a Word WdSaveOptions object.
    'MyWord starts as Nothing so the Finally block can test it safely.
    Dim MyWord As Microsoft.Office.Interop.Word.Application = Nothing
    Dim oDoNotSaveChanges As Object = _
         Microsoft.Office.Interop.Word.WdSaveOptions.wdDoNotSaveChanges
    'Declare two strings to handle the data
    Dim sReturnString As String = ""
    Dim sConvertedString As String = ""
    Try
        'Instantiate the Word application,
        'set visible to false and create a document
        MyWord = CType(CreateObject("Word.application"), _
             Microsoft.Office.Interop.Word.Application)
        MyWord.Visible = False
        MyWord.Documents.Add()
        'Create a DataObject to hold the Rich Text
        'and copy it to the clipboard
        '(the clipboard needs an STA thread, such as a WinForms UI thread)
        Dim doRTF As New System.Windows.Forms.DataObject
        doRTF.SetData(System.Windows.Forms.DataFormats.Rtf, sRTF)
        Clipboard.SetDataObject(doRTF)
        'Paste the contents of the clipboard to the empty,
        'hidden Word Document
        MyWord.Windows(1).Selection.Paste()
        '…then, select the entire contents of the document
        'and copy back to the clipboard
        MyWord.Windows(1).Selection.WholeStory()
        MyWord.Windows(1).Selection.Copy()
        'Now retrieve the HTML property of the DataObject
        'stored on the clipboard
        sConvertedString = _
             CStr(Clipboard.GetData(System.Windows.Forms.DataFormats.Html))
        'Remove some leading text that shows up in some instances
        '(like when you insert it into an email in Outlook).
        'Skip the trim if no <html tag is found, so Substring cannot fail.
        Dim iHtmlStart As Integer = sConvertedString.IndexOf("<html")
        If iHtmlStart >= 0 Then
            sConvertedString = sConvertedString.Substring(iHtmlStart)
        End If
        'Also remove the stray "Â" characters that somehow end up in there
        sConvertedString = sConvertedString.Replace("Â", "")
        '…and you're done.
        sReturnString = sConvertedString
    Catch ex As Exception
        MsgBox("Error converting Rich Text to HTML")
    Finally
        'Close the hidden Word instance whether or not the conversion worked
        If Not MyWord Is Nothing Then
            MyWord.Quit(oDoNotSaveChanges)
            MyWord = Nothing
        End If
    End Try
    Return sReturnString
End Function

'
'That does it. If you need to insert your HTML into an
'Outlook mail message (as I did), here's how to do it using the function above.
'MyRichTextBox is a placeholder name for your own RichTextBox control.
'
Dim myotl As Microsoft.Office.Interop.Outlook.Application
Dim myMItem As Microsoft.Office.Interop.Outlook.MailItem
Dim sConvertedString As String = sRTF_To_HTML(MyRichTextBox.Rtf)
myotl = CType(CreateObject("Outlook.application"), _
    Microsoft.Office.Interop.Outlook.Application)
myMItem = CType(myotl.CreateItem( _
    Microsoft.Office.Interop.Outlook.OlItemType.olMailItem), _
    Microsoft.Office.Interop.Outlook.MailItem)
myMItem.Subject = _
    "This email was converted from rich text to HTML using a simple function in VB.net"
myMItem.Display(False)
myMItem.BodyFormat = Microsoft.Office.Interop.Outlook.OlBodyFormat.olFormatHTML
myMItem.HTMLBody = sConvertedString
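
One caveat with the clipboard approach: System.Windows.Forms.Clipboard only works on a thread running in single-threaded apartment (STA) mode. A WinForms UI thread already qualifies, but a background thread does not by default. The following is a minimal sketch, not part of the original function, of running the conversion on a dedicated STA thread. The class name StaConversion and its members are hypothetical, and it assumes .NET 3.5 or later for Func(Of String, String).

```vbnet
Imports System.Threading

'Hypothetical helper class (an illustration, not the author's code).
'It runs a String-to-String converter on its own STA thread.
Public Class StaConversion
    Private ReadOnly _rtf As String
    Private ReadOnly _converter As Func(Of String, String)
    Private _html As String = ""

    Public Sub New(ByVal sRTF As String, ByVal converter As Func(Of String, String))
        _rtf = sRTF
        _converter = converter
    End Sub

    'Executed on the worker thread
    Private Sub DoWork()
        _html = _converter(_rtf)
    End Sub

    'Starts an STA thread, waits for it to finish and returns the result
    Public Function Run() As String
        Dim worker As New Thread(AddressOf DoWork)
        'The apartment state must be set before the thread starts
        worker.SetApartmentState(ApartmentState.STA)
        worker.Start()
        worker.Join()
        Return _html
    End Function
End Class
```

To use it, create the helper with `Dim conv As New StaConversion(sRTF, AddressOf sRTF_To_HTML)` and then call `conv.Run()`. Note that the conversion still overwrites whatever the user had on the clipboard.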

Points of Interest

One word of warning: the HTML produced by this conversion process is very verbose. It produces many lines of HTML for even very basic formatting, but it has performed error-free conversion on thousands of pages of data thus far here where I work.
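
If only the converted body is needed, one way to cut the output down is to keep just the part between the `<!--StartFragment-->` and `<!--EndFragment-->` markers that the HTML clipboard format uses to delimit the copied content. The sketch below assumes Word's clipboard output contains both markers; the module and function names are hypothetical. Be aware that anything defined outside the fragment (for example, style rules in the document head) is discarded, so some formatting may be lost.

```vbnet
Module ClipboardHtmlHelpers
    'Hypothetical helper: returns only the text between the
    'StartFragment/EndFragment markers of clipboard HTML.
    'If either marker is missing, the input is returned unchanged.
    Public Function ExtractHtmlFragment(ByVal sClipboardHtml As String) As String
        Const START_MARKER As String = "<!--StartFragment-->"
        Const END_MARKER As String = "<!--EndFragment-->"
        Dim iStart As Integer = _
            sClipboardHtml.IndexOf(START_MARKER, StringComparison.Ordinal)
        Dim iEnd As Integer = _
            sClipboardHtml.LastIndexOf(END_MARKER, StringComparison.Ordinal)
        If iStart < 0 OrElse iEnd < iStart Then
            Return sClipboardHtml
        End If
        'Skip past the start marker itself
        iStart += START_MARKER.Length
        Return sClipboardHtml.Substring(iStart, iEnd - iStart)
    End Function
End Module
```

The result is an HTML fragment rather than a complete document, which is usually acceptable for an email body but may not be for a standalone file.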

I am still surprised that Microsoft does not simply have RTF to HTML conversion functionality readily available in its development libraries. It seems like a logical and intuitive function to provide. Still, at least, there's a workaround.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Hanleyk1
Software Developer
United States
Hanley Loller. Ex-professional kayaker went back to school at 30 to learn computer programming. Earned my BS in computer science from East Tennessee State University in 2001. Worked for a couple of different companies before landing in the Office of Computing and Information Technology at the Kentucky State Legislature where I mostly write applications using SQL and VB.net. I love my job, but it's still not as good as kayaking for a living.


Comments and Discussions

 
My vote of 5 (Stijn Courtheyn, 5 Aug '11 - 3:32)
It was very helpful and exactly what I needed.
Re: My vote of 5 (Hanleyk1, 12 Aug '11 - 7:51)
Thanks, that's always good to hear.
use of code (SIFNOk14 Feb '11 - 21:30)
Hi, I'm new to VB.NET and this is the simplest method of conversion. Props to you, Hanley! =D

But how is this implemented? =S I want the HTML to be pasted into a separate RichTextBox.
 
Please could someone help =)

Re: use of code (Hanleyk1, 16 Feb '11 - 5:22)
Assign the HTML string to the "text" property of whatever text box you wish it to show up in. However, I doubt the results will be exactly what you are looking for. You probably want to just put the results in a regular text box.
Removing some of the Verbose HTML (PaulNash, 19 Jan '11 - 0:40)
Hi,
 
I have a quick trick to remove some of the unnecessary html. Note, I am using Word 2010 but I expect this may help with earlier versions also.
 
I noticed that the HTML contains large blocks of comments, i.e. "<!-- sometext -->". I have therefore used a regular expression to remove the unnecessary comments. Here is the code:
 
Imports System.Text.RegularExpressions
....
'Also remove the stray "Â" characters that somehow end up in there
sConvertedString = sConvertedString.Replace("Â", "")
' NEW LINE
sConvertedString = Regex.Replace(sConvertedString, "\<![ \r\n\t]*(--([^\-]|[\r\n]|-[^\-])*--[ \r\n\t]*)\>", "")
' END OF NEW LINE
'…and you're done.
 
Regards,
 
Paul Nash
Re: Removing some of the Verbose HTML (Hanleyk1, 31 Jan '11 - 6:46)
Nice addition. I may wrap up several improvements in a newer version shortly. I'll list credit for contributors of course.
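
For readers who find the pattern in the post above hard to follow, a shorter variant of the same comment-stripping idea is sketched here. It is an alternative rather than Paul's code, and the module and function names are hypothetical. It removes everything from each `<!--` to the next `-->`, which includes the Office conditional comments of the form `<!--[if ...]> ... <![endif]-->` and the clipboard's StartFragment/EndFragment markers, so check the result against your own documents.

```vbnet
Imports System.Text.RegularExpressions

Module HtmlCommentHelpers
    'Hypothetical helper: removes every HTML comment from the string.
    'Singleline lets "." match line breaks, and ".*?" is lazy, so each
    'match stops at the first "-->" that follows its "<!--".
    Public Function StripHtmlComments(ByVal sHtml As String) As String
        Return Regex.Replace(sHtml, "<!--.*?-->", "", RegexOptions.Singleline)
    End Function
End Module
```

Markup that Word writes without the leading `<!--`, such as `<![if !supportLists]>`, is not matched by this pattern and stays in the output.
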
Problem with Hyperlinks (PaulNash, 19 Jan '11 - 0:32)
Hi,
 
I second many of the comments here, it is a nice solution.
 
However, I cannot get hyperlinks to be transferred from a RichTextBox into the generated HTML (they appear as plain text). Is this possible, or am I missing something?
 
Regards,
 
Paul Nash
Re: Problem with Hyperlinks (Hanleyk1, 31 Jan '11 - 6:47)
Not sure, I don't have the same problem. Hyperlinks show up as hyperlinks. May be a settings problem depending on your end product. Where is the final HTML ending up?
Great code. (problem with images) (Croody, 14 Sep '10 - 4:26)
Hi,
 
Thank you for the great code. It really helped me.

When I insert images into the RichTextBox they show up all right, but they do not appear in the recipient's mailbox.
 
Does anybody know why or what can be done to correct this?
 
Thank you in advance.
 
croody
Re: Great code. (problem with images) (Hanleyk1, 14 Sep '10 - 10:41)
Sorry, I couldn't say for sure since I'm allowing Word to do the heavy lifting here. Some limitations aren't unexpected. I'll give it some thought and see if anything comes to mind. If you find a good solution, please post it back here.
 
Thanks,
 
Hanley
Re: Great code. (problem with images) (Hanleyk1, 29 Sep '10 - 10:31)
I've been pretty busy, but just recently had a chance to look at this while I was working on a related problem with my app. As near as I can tell, I'm not able to replicate the problem you're having. I didn't test extensively with different images though.
 
Are you having this problem with all images or just some particular ones? Are there any other parameters that may be affecting this outcome?
Re: Great code. (problem with images) (Stijn Courtheyn, 5 Aug '11 - 3:12)
You will need to check the converted string for <img.

You need to get the file, add it as an attachment, and replace the src with cid:filename.

This is my code:
'Walk through every <img> tag, attach the file it points to,
'and rewrite its src to reference the attachment by cid.
Dim iPos As Integer = html.IndexOf("<img ")
While iPos >= 0
   Dim iPosSrc As Integer = html.IndexOf("src=""", iPos)
   If iPosSrc < 0 Then Exit While
   Dim iPathStart As Integer = iPosSrc + "src=""".Length
   Dim iPosSrcEnd As Integer = html.IndexOf("""", iPathStart)
   If iPosSrcEnd < 0 Then Exit While
   'Path of the image file that the conversion wrote to disk
   Dim strImg As String = _
      html.Substring(iPathStart, iPosSrcEnd - iPathStart).Replace("file:///", "")
   'Attach the file itself (olByValue is the attachment type for files)
   myMItem.Attachments.Add(strImg, _
      Microsoft.Office.Interop.Outlook.OlAttachmentType.olByValue, 0)
   Dim strCid As String = "cid:" & System.IO.Path.GetFileName(strImg)
   'Swap the file path for the cid reference, keeping the closing quote
   html = html.Substring(0, iPathStart) & strCid & html.Substring(iPosSrcEnd)
   'Search on from the end of the rewritten value, since the
   'length of the string has changed
   iPos = html.IndexOf("<img ", iPathStart + strCid.Length)
End While

Re: Great code. (problem with images) (Scotchy, 5 Aug '11 - 10:00)
While this will work for HTML that contains the img tag, the HTML generated from RTF that contains images doesn't include the img tags. I think Word might be trying to convert the images but doesn't know what to do with them. The output after this method was called using RTF from an Outlook MailItem is:
 
<p class=MsoNormal style='mso-layout-grid-align:none;text-autospace:none'><span
style='mso-spacerun:yes'>             </span><span
style='mso-spacerun:yes'> </span><span style='mso-spacerun:yes'> </span><o:p></o:p></p>
 
There should be exactly two images there. The MailItem contains two attachments, both Device Independent Bitmaps. I'm not sure if these are in the actual RTF stream or the MailItem.
Re: Great code. (problem with images) (Hanleyk1, 12 Aug '11 - 7:39)
I'm definitely getting a different result than Scotchy. The HTML created from rich text in my app definitely contains image tags.

When I debug and intercept the HTML string before it is handed over to Outlook, it contains image tags and it stores the images in a temp folder under "local settings", something like this: "C:\Documents and Settings\Username\Local Settings\Temp\msohtmlclip1\01\clip_image002.gif". It shows up in the body of the email just fine, although the body of the email has an image resizing/positioning tool built into it and the actual image data is more complex than simply the gif file I am listing here. I doubt that just attaching this gif file would give you the results you are looking for. If you want the image to show up properly, you need the automated process to handle it for you.
 
The process of html conversion in the example I am looking at creates the following series of six files that combined seem to contain the image data between them.
 
clip_colorschememapping.xml 1 KB
clip_image001.wmz 368 KB
clip_image002.gif 18 KB
clip_image003.wmz 1 KB
clip_image004.gif 1 KB
clip_themedata.thmx 4 KB
 

I would suggest looking for a configuration solution to this problem rather than trying to code around it at a low(er) level. What version of Visual Studio, Word and Outlook are you using? I'm currently using VS 2008 and Outlook 2007 although we were using VS and Outlook 2003 when this code was written. I'm also referencing the "Microsoft Word 12.0 Object Library" and running in a Windows XP environment with Office 2007 Installed. Are there any significant deviations to any of these configurations that might be changing the outcome?
Re: Great code. (problem with images) (Hanleyk1, 12 Aug '11 - 7:56)
Reading your reply closer, I see the difference. You are looking at the HTML after it has been inserted into Outlook. I am certain Outlook fishes out the img tags. When the initial conversion is done however (by either Word or by the Clipboard), the HTML is more traditional and does contain img tags.
Bullets or indent problem, not sure (bigbro_1985, 17 Aug '10 - 1:21)
Hi,
First of all, awesome code! :thumbsup:

I’m generating an emailed report using your code and in my RTF I'm using bullets and indents. I'm not sure what is causing this but for some reason I’m getting the following two characters "ï‚" just before each of the bullets.

Any ideas why this is happening? :confused: Or should I just remove it like you did with the "Â" character?
 
Thanks in advance
 
Marco
Re: Bullets or indent problem, not sure (Hanleyk1, 23 Aug '10 - 10:18)
Hard to say. I haven't run into that in particular, but bullets are likely to be problematic because they aren't usually a standard ASCII character. If the only problem you're having is the one you describe, I'd probably do a search and replace where you search for the "i," and bullet characters together and replace with just the bullet. This would help you avoid accidentally removing a valid character string.
 
You may need to do some additional coding to determine the character value of the bullet, and I would not assume that the "i," characters are exactly what they appear. I would probably parse the output to retrieve the raw integer character values and then use those when doing your "find and replace" operation. Of course, that's just my coding style.
 
Good luck, and sorry about the late reply. I don't check this email address that often.
 
HanleyK1
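
The suggestion above, retrieving the raw integer character values before doing a find and replace, can be sketched as follows. This is only an illustration; the module and function names are hypothetical. It lists the UTF-16 code of every character in a string so that the stray characters in front of a bullet can be identified exactly rather than by how they happen to be displayed.

```vbnet
Imports System.Text

Module CharInspection
    'Hypothetical helper: returns one line per character of sText,
    'each showing that character's UTF-16 code in hexadecimal.
    Public Function DescribeChars(ByVal sText As String) As String
        Dim sb As New StringBuilder()
        For Each c As Char In sText
            sb.AppendLine(String.Format("U+{0:X4}", AscW(c)))
        Next
        Return sb.ToString()
    End Function
End Module
```

Run it on a short slice of the converted HTML around one bullet. Once the values are known, ChrW(value) builds the exact search string to pass to Replace.
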
My vote of 5 (bigbro_1985, 17 Aug '10 - 1:14)
Excellent
Re: My vote of 5 (Hanleyk1, 23 Aug '10 - 10:20)
Thanks. Are you sure it's not too much like VBScript? :laugh:
Thanks, your code helped (krashcontrol, 4 Apr '10 - 0:42)
Hi
 
I was trying to convert from HTML to RTF/PlainText and was using the code from OutlookCode.com, but it wasn't quite working. After looking at your code, I was able to come up with a function that seems to be working. So here is my code to share. Thanks for the help/inspiration.
 
Private Function GetHTMLBodyAsText(ByVal sourceItem As Object) As String

    Dim objDoc As Word.Document
    Dim objSel As Word.Selection
    Dim sConvertedString As String = ""

    On Error Resume Next
    ' get a Word.Selection from the source item
    objDoc = sourceItem.GetInspector.WordEditor
    If Not objDoc Is Nothing Then
        objSel = objDoc.Windows(1).Selection
        objSel.WholeStory()
        objSel.Copy()

        objSel.PasteAndFormat(WdPasteDataType.wdPasteRTF)
        objSel.WholeStory()
        objSel.Copy()
        sConvertedString = Clipboard.GetData(System.Windows.Forms.DataFormats.Text)
    Else
        MsgBox("Could not get Word.Document for " & _
            sourceItem.Subject)
    End If
    objDoc = Nothing
    objSel = Nothing

    Return sConvertedString

End Function

Re: Thanks, your code helped (Hanleyk1, 8 Apr '10 - 9:36)
Thanks, good to know that it helped. Also nice that you built on it and were kind enough to post your code. I guess it's the logical next step to convert HTML back to Rich Text and/or text. I'm still somewhat dismayed that Microsoft hasn't simply included access to this functionality in their libraries, but at least there's a back door.
My vote of 1 (JeffBall, 18 Jan '10 - 16:48)
More akin to VBS and not really VB.NET
Re: My vote of 1 (Hanleyk1, 20 Jan '10 - 2:55)
Sorry, I'm not following you. This is in fact a VB.NET project created in Visual Studio 2008. (Actually, it's just a function removed from a larger project for inclusion in other people's projects, but still VB.NET.)

If you're having trouble using it in your project, it could be that you are using an earlier version of Visual Studio or that you haven't got the right references in your project. Tell me what kind of difficulty you're having implementing it and I'll see if I can address it.

Last Updated 14 Jan 2010
Article Copyright 2010 by Hanleyk1