HTML as a Source for a DOCX File

mbielski

Jul 5, 2010

CPOL

Take HTML or plain text input from an application or website form submission and produce a valid WordML (.docx) file.

Download source - 11.1 KB

Background

Every so often, I have a need to take the text that a user submits from a website form and save it outside of a database. While storing text files is always an option, sometimes a little more elegance is needed. I needed to have the ability to save HTML from a rich content editor as a Word document, somewhat like we used to be able to do with the pre-07 versions of Word (stick a couple of meta tags into the HTML file, and Word was fooled into thinking it was a Word HTML file). With the new DOCX format, that just isn't possible. What resulted from this search and subsequent tinkering has proved very useful (to me, anyway) for on-the-fly Word documents from website form submissions or database field contents that need to be available to the visitor in a usable format.

A Caveat

While this is a quick, simple way to create a docx file, it should by no stretch of the imagination be considered anything more than that. You should never read the contents of a docx file back into a textarea tag on a website (or into a textbox in an application) for editing, or try to change the HTML embedded in an existing file. Not only does that go against the standards, but it's just bad coding practice to do things that way. Any editing of an existing docx file should be done by a program that is designed to do so from the start, such as Microsoft Word 2007.

A Bit About the DOCX Format

DOCX files are typically Microsoft Word files. The format was introduced with Microsoft Word 2007. A .docx file is a ZIP archive that contains a set of XML files (documents), and the compression keeps the size down. If you create a document that contains exactly the same content and save it in both the old and new formats, you'll see a big difference in file size. Previous versions of Word need Microsoft's Office Compatibility Pack installed in order to open docx files. Also, this format (Office Open XML) isn't the same as the OpenDocument standard.
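Because a docx file is a ZIP-based package of XML files, its contents can be listed with the same System.IO.Packaging API that the rest of this article uses. The following is a minimal sketch, not part of the downloadable source: the class name DocxInspector, the method ListParts, and the path "example.docx" are placeholders chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging; // WindowsBase on .NET Framework

public static class DocxInspector
{
    // Returns one "partUri (contentType)" string for every part in the package.
    public static List<string> ListParts(string docxPath)
    {
        List<string> parts = new List<string>();
        using (Package package = Package.Open(docxPath, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                parts.Add(part.Uri + " (" + part.ContentType + ")");
            }
        }
        return parts;
    }

    public static void Main(string[] args)
    {
        // "example.docx" is a placeholder; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "example.docx";
        foreach (string line in ListParts(path))
        {
            Console.WriteLine(line);
        }
    }
}
```

Run against a file saved by Word, this should show entries such as /word/document.xml along with their content types.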

A Note About Schemas

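You may notice that docx files reference URIs at schemas.openxmlformats.org. It's a very common thing, so don't worry about it. These URIs are XML namespace names: they act as identifiers, and nothing is downloaded from them when the file is opened. What matters is that they match the strings defined by the standard exactly, because that is how Word and other programs recognize the elements. You can't substitute a URI on your own server, since no other program would know what your namespace means. With the names defined at this central location, everyone agrees on the same, known identifiers.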

Using the Code

Setting Up

We need to set up the project correctly, and to do that, we need to have System.IO.Packaging among our using directives (the code in this article also relies on System.IO, System.Text, and System.Xml):

using System.IO.Packaging;

Without this, the project won't compile. You'll also need to add a reference to the WindowsBase assembly. If WindowsBase doesn't come up in your list, you can usually find it at c:\Program Files\Reference Assemblies\Microsoft\Framework\v3.0\WindowsBase.dll.

Creating the File

We need to create the base for the .docx file. This is done in the SaveDOCX function.

private static void SaveDOCX(string fileName, string BodyText, bool IncludeHTML)
{
    string WordprocessingML =
    "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    XmlDocument xmlStartPart = new XmlDocument();
    XmlElement tagDocument = xmlStartPart.CreateElement("w:document", WordprocessingML);
    xmlStartPart.AppendChild(tagDocument);
    XmlElement tagBody = xmlStartPart.CreateElement("w:body", WordprocessingML);
    tagDocument.AppendChild(tagBody);

The most important thing here is the correct namespace URI. I found a lot of code while researching this that used "http://schemas.openxmlformats.org/wordprocessingml/2006/3/main". That namespace belonged to the beta release of the format. The final version, which is what the code above references, is what has to be used. By including it when we create the document and body elements, we ensure that we are getting the right kind of file in the end.

The rest of this code deals with building up the foundation of the final document. Of particular note is the nesting of tags for the WordprocessingML document (the "start part"):

w:document contains ...
    w:body

This nesting is crucial to creating a valid file. It defines the structure of the file. If it is out of order, the file isn't valid, and probably won't open or be readable. XML documents are made up of one or more XML elements. What is an element? An element consists of a start tag such as <w:body>, a matching end tag such as </w:body>, and everything in between. In this case, the document element contains a body element. The elements are created and added to each other in the order of the nesting. Elements can contain other elements, or they can contain data. This is how XML documents are built.
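To see the nesting for yourself, the skeleton can be built in isolation and printed. This is a small sketch under the same assumptions as the code above; the class name NestingDemo and the method BuildSkeleton are illustrative only and do not appear in the downloadable source.

```csharp
using System;
using System.Xml;

public static class NestingDemo
{
    const string WordprocessingML =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds w:document containing a single, still empty, w:body.
    public static XmlDocument BuildSkeleton()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement tagDocument = doc.CreateElement("w:document", WordprocessingML);
        doc.AppendChild(tagDocument);
        XmlElement tagBody = doc.CreateElement("w:body", WordprocessingML);
        tagDocument.AppendChild(tagBody);
        return doc;
    }

    public static void Main()
    {
        XmlDocument doc = BuildSkeleton();
        // Shows the serialized markup, including the xmlns:w declaration on the root.
        Console.WriteLine(doc.OuterXml);
        // The first child of the root element is the body.
        Console.WriteLine(doc.DocumentElement.FirstChild.Name); // prints "w:body"
    }
}
```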

Handling HTML

First, let's handle HTML. Since HTML is already a formatted block of text (made up of elements, much like XML), we shouldn't have to convert it ourselves. By taking advantage of the w:altChunk element, we can place valid HTML into a separate file inside the package and reference that file from the main document of the yet-to-be-created .docx file. Word then imports the HTML when it opens the document. Here's how I set up the altChunk tag and the required references:

string relationshipNamespace =
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

XmlElement tagAltChunk = xmlStartPart.CreateElement("w:altChunk", WordprocessingML);
XmlAttribute RelID = tagAltChunk.Attributes.Append
    (xmlStartPart.CreateAttribute("r:id", relationshipNamespace));
RelID.Value = "rId2";
tagBody.AppendChild(tagAltChunk);

The relationships are important to include, as are the relationship IDs. Without them, the file won't work. A relationship tells Word which file in the package an ID points to and what kind of content that file holds. We've used the WordprocessingML namespace again when creating an XML element (the altChunk element), and that will be the standard all the way through this project. We also gave the element an attribute, which lives in the relationships namespace. Attributes appear within the start tag of an element, and always look like name="value". The value of an attribute appears within the quotes, and in this case, we have assigned a value of "rId2". This has to match the ID of the relationship that we create later for the HTML file. If you aren't going to assign a value, in most cases, you can skip adding the attribute. The final line of code in this block attaches the new element to the body via the AppendChild method. This achieves the nesting that the document needs.
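The effect of the attribute code is easier to see when the element is read back. The sketch below repeats the altChunk setup inside a hypothetical helper (AltChunkDemo.AddAltChunk is a name made up for this example) and then looks the attribute up by its local name and namespace.

```csharp
using System;
using System.Xml;

public static class AltChunkDemo
{
    const string WordprocessingML =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    const string RelationshipNamespace =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Appends <w:altChunk r:id="..."/> to the given body element and returns it.
    public static XmlElement AddAltChunk(XmlElement body, string relationshipId)
    {
        XmlDocument owner = body.OwnerDocument;
        XmlElement tagAltChunk = owner.CreateElement("w:altChunk", WordprocessingML);
        XmlAttribute relId = owner.CreateAttribute("r:id", RelationshipNamespace);
        relId.Value = relationshipId;
        tagAltChunk.Attributes.Append(relId);
        body.AppendChild(tagAltChunk);
        return tagAltChunk;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement body = doc.CreateElement("w:body", WordprocessingML);
        doc.AppendChild(body);

        XmlElement altChunk = AddAltChunk(body, "rId2");
        // Look the attribute up by local name plus namespace, not by prefix.
        Console.WriteLine(altChunk.GetAttribute("id", RelationshipNamespace)); // prints "rId2"
    }
}
```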

The Importance of Valid HTML

As we've been seeing, XML documents have to contain the right formatting and nesting of elements in order to be valid. An XML document that isn't well formed may not open, and may not expose its data in a way that can be accessed. Something similar is true for HTML. If it isn't formatted and nested properly, the page won't appear as it is supposed to. If you expect your DOCX file to open and be usable, you have to start with valid HTML. Most rich content editors that are available today for browsers and applications generate valid HTML, so their output usually needs no extra attention. If you are generating the HTML by hand, I strongly suggest that you validate it through the W3C HTML Validator.

Handling Plain Text

Plain text is handled in much the same way that HTML is, but it is nested deeper in the document and doesn't need the altChunk element. We still need to create elements and append them to other elements, though.

XmlElement tagParagraph = xmlStartPart.CreateElement("w:p", WordprocessingML);
tagBody.AppendChild(tagParagraph);
XmlElement tagRun = xmlStartPart.CreateElement("w:r", WordprocessingML);
tagParagraph.AppendChild(tagRun);
XmlElement tagText = xmlStartPart.CreateElement("w:t", WordprocessingML);
tagRun.AppendChild(tagText);

XmlNode nodeText = xmlStartPart.CreateNode(XmlNodeType.Text, "w:t", WordprocessingML);
nodeText.Value = BodyText;
tagText.AppendChild(nodeText);

We can see more nesting of tags for the WordprocessingML document (the "start part"):

w:document contains ...
    w:body, which contains ...
        w:p (paragraph), which contains ...
            w:r (run), which contains ...
                w:t (text), which contains ...
                    the text itself (a text node)

In this nesting, we've created a paragraph element (w:p) and appended it to the body element, a run element (w:r) and appended it to the paragraph element, and a text element (w:t) that is appended to the run element. Finally, a text node holding BodyText is appended to the text element. This achieves our nesting of elements and builds the structure. We are also introduced to XmlNode. XmlNode is the base class that represents any single node within an XML document (an element, an attribute, a piece of text, and so on). XmlElement and XmlDocument both derive from it: XmlElement represents an element specifically, and XmlDocument represents the document as a whole.
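The paragraph, run, and text nesting can be wrapped in a small helper so that it is reusable. The sketch below is an assumption about how one might do that, not the article's code: ParagraphDemo, AppendParagraph, and AppendLines are hypothetical names, and splitting the input into one paragraph per line is a choice made purely for illustration.

```csharp
using System;
using System.Xml;

public static class ParagraphDemo
{
    const string WordprocessingML =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Appends w:p > w:r > w:t > (text node) to the body and returns the paragraph.
    public static XmlElement AppendParagraph(XmlElement body, string text)
    {
        XmlDocument doc = body.OwnerDocument;
        XmlElement tagParagraph = doc.CreateElement("w:p", WordprocessingML);
        XmlElement tagRun = doc.CreateElement("w:r", WordprocessingML);
        XmlElement tagText = doc.CreateElement("w:t", WordprocessingML);
        tagText.AppendChild(doc.CreateTextNode(text));
        tagRun.AppendChild(tagText);
        tagParagraph.AppendChild(tagRun);
        body.AppendChild(tagParagraph);
        return tagParagraph;
    }

    // Illustrative choice: one paragraph per input line.
    public static void AppendLines(XmlElement body, string text)
    {
        string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        foreach (string line in lines)
        {
            AppendParagraph(body, line);
        }
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement body = doc.CreateElement("w:body", WordprocessingML);
        doc.AppendChild(body);

        AppendLines(body, "First line\nSecond line");
        Console.WriteLine(body.ChildNodes.Count);    // prints 2
        Console.WriteLine(body.FirstChild.InnerText); // prints "First line"
    }
}
```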

The Importance of Clean Text

Before you put plain text into this, you must be absolutely sure that it does not contain any HTML markup. If it does, you'll have a file that has the HTML tags plainly visible and very likely confusing to the user. The common rule is: Never Trust Data from a User! Whether you are taking input from a textarea tag on a website or from an application, you really should consider stripping out any and all HTML tags. I prefer to do this before I send the text to this tool, where I am free to send it back to the user if I want, but if you wish, you can build it into your own version.
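As a rough illustration of that stripping step, here is a deliberately naive sketch. TextCleaner and StripHtml are hypothetical names, and a regular expression like this is no substitute for a real HTML sanitizer; it only removes anything that looks like a tag and then decodes entities such as &amp;amp;.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Naive approach: drop anything between '<' and '>', then decode HTML entities.
    public static string StripHtml(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }
        string withoutTags = Regex.Replace(input, "<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        Console.WriteLine(StripHtml("<b>Hello</b> &amp; welcome")); // prints "Hello & welcome"
    }
}
```

Any "<" or "&" that remains in the cleaned text is harmless in the plain-text path, because XmlDocument escapes those characters when it writes the text node.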

Creating the File

This is a two-step process. We'll start by creating the main document:

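// pkgOutputDoc is the Package for the output file. Its creation is not shown in this
// excerpt; assume something like Package.Open(fileName, FileMode.Create, FileAccess.ReadWrite).
Uri docuri = new Uri("/word/document.xml", UriKind.Relative);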
PackagePart docpartDocumentXML = pkgOutputDoc.CreatePart(docuri,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml");
StreamWriter streamStartPart = new StreamWriter(docpartDocumentXML.GetStream
                (FileMode.Create, FileAccess.Write));
xmlStartPart.Save(streamStartPart);
streamStartPart.Close();
pkgOutputDoc.Flush();

pkgOutputDoc.CreateRelationship(docuri, TargetMode.Internal,
  "http://schemas.openxmlformats.org/officeDocument/" + 
  "2006/relationships/officeDocument",
  "rId1");
pkgOutputDoc.Flush();

We start by creating the main document: we build an address for a file called document.xml within the package (/word/document.xml), create that file as a part of the package, and write the XML that we built in xmlStartPart into it. Part of this process involves using StreamWriter to capture the output of xmlStartPart.Save() and send it to the part. Once we have that done, we can close StreamWriter and commit the output to the package via the Flush() method. Now that we have done that, we need yet another relationship. This one tells the opening program that document.xml is the main document of the package. With the relationship added, we can again commit the output to the package with the Flush() method.

The Flush() method is worth a closer look. As your program builds the package, what it writes to the parts and relationships is buffered. By calling the Flush() method, we tell the package to take what it has built up so far and commit it to the file, rather than leaving everything to be written when the package is closed. If you don't call Flush(), the package holds on to that content until Close() is called, and adds to it as we continue to build the file. By itself, a small file created like this is not an issue. However, if you put this on a very active website that allows the creation of very large documents, many requests holding large buffers at the same time can slow the server down. It all adds up, so we release things where and when we can.

With all of that done, we'll tie it all together by telling the XML document where to get its data from, close the SaveDOCX function, and send it off:

    Uri uriBase = new Uri("/word/document.xml", UriKind.Relative);
    PackagePart partDocumentXML = pkgOutputDoc.GetPart(uriBase);

    Uri uri = new Uri("/word/websiteinput.html", UriKind.Relative);

    // Wrap the submitted text in a minimal HTML 4.0 Transitional document.
    string html = string.Concat("<!DOCTYPE HTML PUBLIC " +
      "\"-//W3C//DTD HTML 4.0 Transitional//EN\"><html>" +
      "<head><title></title></head><body>",
    BodyText, "</body></html>");
    byte[] Origem = Encoding.UTF8.GetBytes(html);
    PackagePart altChunkpart = pkgOutputDoc.CreatePart(uri, "text/html");
    using (Stream targetStream = altChunkpart.GetStream())
    {
        targetStream.Write(Origem, 0, Origem.Length);
    }
    Uri relativeAltUri = PackUriHelper.GetRelativeUri(uriBase, uri);

    partDocumentXML.CreateRelationship(relativeAltUri, TargetMode.Internal,
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk",
        "rId2");

    pkgOutputDoc.Close();
}

We start by referencing the document.xml file that we created just a moment ago, and then move on to creating the actual HTML file to hold our contents. We use this same file whether we have plain text or HTML being processed, because we have defined this file as where the text within the document will be, and HTML handles plain text just fine. With that done, we now have to convert the HTML string into an array of bytes that can then be written to the file. The UTF-8 encoding is a well-accepted standard, so using that won't present us any problems later. After creating the last part of the package, the part that holds the actual data, we write the whole byte array to its stream in a single Write call. The use of the using block here gives us a built-in cleanup of the stream. What this means is that as the execution of the code passes out of the using block, Dispose() is called on the Stream automatically (even if an exception is thrown), which closes the stream and releases its resources.
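The cleanup behaviour of the using block can be observed directly. The sketch below uses a MemoryStream in place of the package part's stream so that it can run on its own; UsingDemo and WriteAll are names invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class UsingDemo
{
    // Writes the UTF-8 bytes of the text to a stream inside a using block and
    // reports whether the stream is still writable once the block has exited.
    public static byte[] WriteAll(string text, out bool writableAfterUsing)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        MemoryStream target = new MemoryStream();
        using (target)
        {
            // One call writes the whole array.
            target.Write(bytes, 0, bytes.Length);
        }
        // Dispose() has run by now, so the stream reports that it is closed.
        writableAfterUsing = target.CanWrite;
        // MemoryStream still allows ToArray() after it has been closed.
        return target.ToArray();
    }

    public static void Main()
    {
        bool stillWritable;
        byte[] written = WriteAll("<p>Hi</p>", out stillWritable);
        Console.WriteLine(written.Length); // prints 9
        Console.WriteLine(stillWritable);  // prints False
    }
}
```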

Our last step is to create the final relationship in this file, which tells the program where to find the information for the altChunk processing. Its ID, "rId2", matches the r:id attribute on the altChunk element. We then close the package, which saves the file to wherever you have instructed and closes it.
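Putting the HTML path together, the whole sequence can be sketched in one place. This is a condensed illustration rather than the downloadable source: MinimalDocx and Save are hypothetical names, the package is opened with Package.Open (which the excerpts above do not show), and using blocks stand in for the explicit Flush() and Close() calls.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // WindowsBase on .NET Framework
using System.Text;
using System.Xml;

public static class MinimalDocx
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    const string R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    public static void Save(string fileName, string bodyHtml)
    {
        // Main document: w:document > w:body > w:altChunk r:id="rId2"
        XmlDocument xml = new XmlDocument();
        XmlElement document = xml.CreateElement("w:document", W);
        xml.AppendChild(document);
        XmlElement body = xml.CreateElement("w:body", W);
        document.AppendChild(body);
        XmlElement altChunk = xml.CreateElement("w:altChunk", W);
        XmlAttribute relId = xml.CreateAttribute("r:id", R);
        relId.Value = "rId2";
        altChunk.Attributes.Append(relId);
        body.AppendChild(altChunk);

        Uri docUri = new Uri("/word/document.xml", UriKind.Relative);
        Uri htmlUri = new Uri("/word/websiteinput.html", UriKind.Relative);

        using (Package package = Package.Open(fileName, FileMode.Create, FileAccess.ReadWrite))
        {
            PackagePart docPart = package.CreatePart(docUri,
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml");
            using (Stream docStream = docPart.GetStream(FileMode.Create, FileAccess.Write))
            {
                xml.Save(docStream);
            }
            // Package-level relationship: identifies the main document part.
            package.CreateRelationship(docUri, TargetMode.Internal, R + "/officeDocument", "rId1");

            string html = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">" +
                "<html><head><title></title></head><body>" + bodyHtml + "</body></html>";
            byte[] bytes = Encoding.UTF8.GetBytes(html);
            PackagePart htmlPart = package.CreatePart(htmlUri, "text/html");
            using (Stream htmlStream = htmlPart.GetStream())
            {
                htmlStream.Write(bytes, 0, bytes.Length);
            }
            // Part-level relationship: rId2 points the altChunk at the HTML part.
            docPart.CreateRelationship(PackUriHelper.GetRelativeUri(docUri, htmlUri),
                TargetMode.Internal, R + "/aFChunk", "rId2");
        } // Disposing the package saves and closes the file.
    }

    public static void Main()
    {
        // The output location is a placeholder chosen for this example.
        string path = Path.Combine(Path.GetTempPath(), "sample-output.docx");
        Save(path, "<p>Hello from a <b>form</b> submission.</p>");
        Console.WriteLine("Wrote " + path);
    }
}
```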

Usage

Compile the project into a DLL and place it in the Bin folder of your website (or add it to your project), add the necessary references, and then use it as follows:

NoInkSoftware.HTMLtoDOCX NewFile = new NoInkSoftware.HTMLtoDOCX();
NewFile.CreateFileFromHTML(MyHTMLSource, MyDestination);

Or:

NoInkSoftware.HTMLtoDOCX NewFile = new NoInkSoftware.HTMLtoDOCX();
NewFile.CreateFileFromText(MyTextSource, MyDestination);

Credit Where Credit Is Due

I need to give credit where it is due. My first searches for a solution took me to some work by Doug Mahugh, found here (http://openxmldeveloper.org/archive/2006/07/20/388.aspx) and here (http://blogs.msdn.com/dmahugh/archive/2006/06/27/649007.aspx). Armed with that information, I next began searching CodeProject for something similar that would do what I needed. At the time of publication, this article by Paulo Vaz (http://www.codeproject.com/KB/aspnet/HTMLtoWordML.aspx) is the only thing that even partially addresses what I was looking to do. However, I didn't want to have a template sitting there on my website, so some further searching in the standards (found on the main page as several downloadable files at http://openxmldeveloper.org/) led me to the ability to put the altChunk into the file directly.
