
How to interleave two HTML files into one .DOCX file with C#, the HTMLAgilityPack, and the DOCX Library

B. Clay Shannon, 18 Feb 2014
Using HTMLAgilityPack and the DOCX Library with C#, create a .DOCX file from two HTML files

Why Combine Two Files into One?

First of all, you may wonder why one would want to take two files and merge them into one. In my case, it's to help me learn Spanish. I take the English version and the Spanish version of the same (HTML) document and create a .DOCX file that contains alternating English and Spanish paragraphs. I find this the easiest way to learn a new language - it's how I learned German half my life ago: with an English magazine or book in one hand, and its German counterpart in the other hand (while simultaneously listening to the German audio).

To use this tip, you will need to install two packages from NuGet: the HTMLAgilityPack and the DOCX Library.

You can supply your own files, in whichever two languages you want (presumably one will be your mother tongue and the other the language you want to learn). If you know of no source for free publications like this, you can go to the Publications tab here, where there are dozens of publications in literally hundreds of languages. You can download these publications in several formats, depending on the exact publication; usually PDF and one or two other formats, such as EPUB and MOBI. Audio files are available for many of these publications, too.

What You Get and How to Get It

What you end up with is a document that is not exactly pristine or beautifully formatted, but it does contain the information you need for this purpose. Here's an example (I cleaned up the formatting a little to make it look better):

Here is the code I used; it's not "normalized" (it's basically a big mess -- a big block of code in one method), but my excuse for that is twofold: it's a relatively simple one-trick pony, and it's a utility for my personal use, not really meant to be a programming showpiece. Anyway, without further palabra, here's the code:

        
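// Namespaces the method relies on. "Novacode" is the DocX namespace in the releases
// available when this tip was written (an assumption; check the version you install).
using System;
using System.Collections.Generic;
using System.Drawing;        // FontFamily
using System.Web;            // HttpUtility (needs a reference to System.Web)
using System.Windows.Forms;  // MessageBox
using HtmlAgilityPack;
using Novacode;              // DocX, Paragraph

// This method belongs inside the Form class; sourceHTMLFilename and targetHTMLFilename
// are presumably fields of that form holding the paths of the two HTML files.
private void ParseHTMLFilesAndSaveAsDOCX()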
{
    const string BOLD_MARKER = "~";
    const string HEADING_MARKER = "^";
    List<string> sourceText = new List<string>();
    List<string> targetText = new List<string>();
    HtmlAgilityPack.HtmlDocument htmlDocSource = new HtmlAgilityPack.HtmlDocument();
    HtmlAgilityPack.HtmlDocument htmlDocTarget = new HtmlAgilityPack.HtmlDocument();

    // There are various options, set as needed
    htmlDocSource.OptionFixNestedTags = true;
    htmlDocTarget.OptionFixNestedTags = true;

    htmlDocSource.Load(sourceHTMLFilename);
    htmlDocTarget.Load(targetHTMLFilename);

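    // Populate the generic list of strings with the source text lines
    if (htmlDocSource.DocumentNode != null)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so fall back to an empty sequence to keep the foreach from throwing.
        IEnumerable<HtmlAgilityPack.HtmlNode> pSourceNodes =
            htmlDocSource.DocumentNode.SelectNodes("//text()")
            ?? System.Linq.Enumerable.Empty<HtmlAgilityPack.HtmlNode>();
        string sourcePar;
        foreach (HtmlNode sText in pSourceNodes)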
        {
            if (!string.IsNullOrWhiteSpace(sText.InnerText))
            {
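                string formattingMarker = string.Empty;
                // Caution: sText is a text node, so its OuterHtml is normally just the text itself.
                // If the markers are never applied, try testing sText.ParentNode.OuterHtml instead
                // (the same applies to the target loop below).
                if (sText.OuterHtml.Contains("FONT SIZE=4"))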
                {
                    formattingMarker = BOLD_MARKER;
                }
                else if (sText.OuterHtml.Contains("FONT SIZE=5"))
                {
                    formattingMarker = HEADING_MARKER;
                }
                sourcePar = string.Format("{0}{1}", formattingMarker, sText.InnerText);
                sourceText.Add(HttpUtility.HtmlDecode(sourcePar));                    
            }
        }
    }

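    // Populate the generic list of strings with the target text lines
    if (htmlDocTarget.DocumentNode != null)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so fall back to an empty sequence to keep the foreach from throwing.
        IEnumerable<HtmlAgilityPack.HtmlNode> pTargetNodes =
            htmlDocTarget.DocumentNode.SelectNodes("//text()")
            ?? System.Linq.Enumerable.Empty<HtmlAgilityPack.HtmlNode>();
        string targetPar;
        foreach (HtmlNode tText in pTargetNodes)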
        {
            if (!string.IsNullOrWhiteSpace(tText.InnerText))
            {
                string formattingMarker = string.Empty;
                if (tText.OuterHtml.Contains("FONT SIZE=4"))
                {
                    formattingMarker = BOLD_MARKER;
                }
                else if (tText.OuterHtml.Contains("FONT SIZE=5"))
                {
                    formattingMarker = HEADING_MARKER;
                }
                targetPar = string.Format("{0}{1}", formattingMarker, tText.InnerText);
                targetText.Add(HttpUtility.HtmlDecode(targetPar));
            }
        }
    }

    // Alternate through the two generic lists, writing each source paragraph
    // as regular text and each target paragraph in bold.
    int sourceLineCount = sourceText.Count;
    int targetLineCount = targetText.Count;
    int higherCount = Math.Max(sourceLineCount, targetLineCount);
    string sourceParagraph = string.Empty;
    string targetParagraph = string.Empty;

    // Write it out
    string docxFilename = string.Format("{0}.docx", textBoxDOCXFile2BCre8ed.Text.Trim());
    using (DocX document = DocX.Create(docxFilename))
    {
        for (int i = 0; i < higherCount; i++)
        {
            // Reset on every pass; otherwise, once the shorter list runs out,
            // its last paragraph would be written again on each remaining pass.
            sourceParagraph = (i < sourceLineCount) ? sourceText[i] : string.Empty;
            targetParagraph = (i < targetLineCount) ? targetText[i] : string.Empty;

            if (!string.IsNullOrWhiteSpace(sourceParagraph))
            {
                Paragraph pSource = document.InsertParagraph();
                if (sourceParagraph.Contains(BOLD_MARKER))
                {
                    sourceParagraph = sourceParagraph.Replace(BOLD_MARKER, "");
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(13).Bold();
                }
                else if (sourceParagraph.Contains(HEADING_MARKER))
                {
                    sourceParagraph = sourceParagraph.Replace(HEADING_MARKER, "");
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(16);
                }
                else
                {
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(11);
                }
                Paragraph pSpacer = document.InsertParagraph();
                pSpacer.Append(Environment.NewLine);
            }
            if (!string.IsNullOrWhiteSpace(targetParagraph))
            {
                Paragraph pTarget = document.InsertParagraph();
                if (targetParagraph.Contains(BOLD_MARKER))
                {
                    targetParagraph = targetParagraph.Replace(BOLD_MARKER, "");
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(13).Bold();
                }
                else if (targetParagraph.Contains(HEADING_MARKER))
                {
                    targetParagraph = targetParagraph.Replace(HEADING_MARKER, "");
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(16).Bold();
                }
                else
                {
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(11).Bold();
                }
                Paragraph pTargetSpacer = document.InsertParagraph();
                pTargetSpacer.Append(Environment.NewLine);
            }
        }
        document.Save();
    }
    MessageBox.Show("done!");
}
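
The alternation at the heart of the method above can be looked at in isolation. The following is a minimal sketch, not the code from the download: ParagraphInterleaver and Interleave are hypothetical names, and it works on plain strings so that it can be tried without HTMLAgilityPack or the DOCX Library. It assumes that when one file has fewer paragraphs than the other, the remaining paragraphs of the longer file should simply follow one another.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphInterleaver
{
    // Hypothetical helper: alternates entries from two lists, source first.
    // Once one list is exhausted, only the other list contributes entries.
    public static List<string> Interleave(IList<string> source, IList<string> target)
    {
        var result = new List<string>();
        int higherCount = Math.Max(source.Count, target.Count);

        for (int i = 0; i < higherCount; i++)
        {
            if (i < source.Count)
            {
                result.Add(source[i]);
            }
            if (i < target.Count)
            {
                result.Add(target[i]);
            }
        }
        return result;
    }
}
```

Called with the source list { "Hello", "Goodbye" } and the target list { "Hola" }, this returns "Hello", "Hola", "Goodbye". Writing each entry to the document would then be a single loop over the result, although the source/target distinction (needed for the two fonts) would have to be carried along in some way.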

I have uploaded the source code, too. Feel free to clean it up/refactor it - please post it back here to Code Project if you do, though. From the source you can see which controls you need to add to the form and what to name them.
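
As one possible starting point for such a refactoring, the marker handling that appears once for the source text and once for the target text could be pulled into a single helper. This is only a sketch under assumptions: LineStyle, MarkerParser, and Parse are hypothetical names, and unlike the code above (which uses Contains and Replace), it only looks for a marker at the very start of the line, so a "~" or "^" inside ordinary text is left alone.

```csharp
using System;

// Hypothetical names, introduced only for this illustration.
public enum LineStyle { Normal, Bold, Heading }

public static class MarkerParser
{
    private const string BoldMarker = "~";
    private const string HeadingMarker = "^";

    // Returns the style implied by a leading marker and passes back,
    // through 'text', the line with that marker removed.
    public static LineStyle Parse(string line, out string text)
    {
        if (line.StartsWith(BoldMarker, StringComparison.Ordinal))
        {
            text = line.Substring(BoldMarker.Length);
            return LineStyle.Bold;
        }
        if (line.StartsWith(HeadingMarker, StringComparison.Ordinal))
        {
            text = line.Substring(HeadingMarker.Length);
            return LineStyle.Heading;
        }
        text = line;
        return LineStyle.Normal;
    }
}
```

The caller would then choose the font size (and, for the source text, whether to bold) from the returned LineStyle, so the same branch logic no longer has to be written out twice.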

I did not tell you how to get the files from EPUB (or whatever format you download) to HTML; that is sort of an "exercise left to the reader," but in my case I use AVS Document Converter to convert EPUB files to DOCX, then I manually save those as HTML files before running my utility against them. You may find a better way; my experimentation did not, as converting directly to HTML created a rather malformed file, and saving a PDF as text also produced a very ugly text file.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

B. Clay Shannon
Publisher "Found in the Translation"
United States
I am the entire team at "Found in the Translation," which produces multilingual books (paperback and Kindle versions) such as "Don Quixote: In Spanish and English, Paragraph-by-Paragraph" among many others (English paired with not only Spanish, but also French and even Finnish). You can see all those books here: http://jsfiddle.net/clayshannon/pRgQL/75/
 
Personal web sites that I have created can be seen at http://usamaporama.azurewebsites.net and http://bigsurgarrapata.azurewebsites.net/ and http://www.awardwinnersonly.com
 
Peripatetic and picaresque, I have lived in eight states; specifically, besides my native California (where I was born and where I now again reside) in chronological order: New York, Montana, Alaska, Oklahoma, Wisconsin, Idaho, and Missouri.
 
I am also a writer of both fiction (for which I use a nom de plume, "Blackbird Crow Raven", as a nod to my Native American heritage - I am "½ Cowboy, ½ Indian") and nonfiction: http://www.lulu.com/spotlight/blackbirdcraven

Last Updated 18 Feb 2014
Article Copyright 2014 by B. Clay Shannon
Everything else Copyright © CodeProject, 1999-2014