Posted 17 Feb 2014

How to interleave two HTML files into one .DOCX file with C#, the HTMLAgilityPack, and the DOCX Library

B. Clay Shannon, 18 Feb 2014, CPOL
Using HTMLAgilityPack and the DOCX Library with C#, create a .DOCX file from two HTML files

Why Combine Two Files into One?

First of all, you may wonder why one would want to take two files and merge them into one. In my case, it's to help me learn Spanish. I take the English version and the Spanish version of the same (HTML) document and create a .DOCX file that contains alternating English and Spanish paragraphs. I find this the easiest way to learn a new language - it's how I learned German half my life ago: with an English magazine or book in one hand, and its German counterpart in the other hand (while simultaneously listening to the German audio).

To use this tip, you will need to install the HTMLAgilityPack and the DOCX Library through NuGet.

You can supply your own files, in whichever two languages you want (presumably one will be your mother tongue, and the other the language you want to learn). If you know of no source for free publications like this, you can go to the Publications tab here, where there are dozens of publications in hundreds of languages. You can download these publications in several formats, depending on the exact publication; usually PDF and one or two other formats, such as EPUB and MOBI. Audio files are also available for many of these publications.

What You Get and How to Get It

What you end up with is a document that is not exactly pristine or beautifully formatted, but it does contain the information you need for this purpose. Here's an example (I cleaned up the formatting a little to make it look better):

Here is the code I used; it's not "normalized" (it's basically a big mess -- a big block of code in one method), but my excuse for that is twofold: it's a relatively simple one-trick pony, and it's a utility for my personal use, not really meant to be a programming showpiece. Anyway, without further palabra, here's the code:

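// Namespaces this method relies on: System, System.Collections.Generic, System.Drawing (FontFamily),
// System.Web (HttpUtility; needs a reference to System.Web), System.Windows.Forms, HtmlAgilityPack,
// and Novacode (the DocX namespace at the time of writing).
// sourceHTMLFilename, targetHTMLFilename, and textBoxDOCXFile2BCre8ed are members of the form.
private void ParseHTMLFilesAndSaveAsDOCX()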
{
    const string BOLD_MARKER = "~";
    const string HEADING_MARKER = "^";
    List<string> sourceText = new List<string>();
    List<string> targetText = new List<string>();
    HtmlAgilityPack.HtmlDocument htmlDocSource = new HtmlAgilityPack.HtmlDocument();
    HtmlAgilityPack.HtmlDocument htmlDocTarget = new HtmlAgilityPack.HtmlDocument();

    // There are various options, set as needed
    htmlDocSource.OptionFixNestedTags = true;
    htmlDocTarget.OptionFixNestedTags = true;

    htmlDocSource.Load(sourceHTMLFilename);
    htmlDocTarget.Load(targetHTMLFilename);

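    // Populate the generic list of strings with the source text lines
    if (htmlDocSource.DocumentNode != null)
    {
        IEnumerable<HtmlAgilityPack.HtmlNode> pSourceNodes = htmlDocSource.DocumentNode.SelectNodes("//text()");
        // SelectNodes returns null, not an empty collection, when nothing matches
        if (pSourceNodes == null) pSourceNodes = new List<HtmlAgilityPack.HtmlNode>();
        string sourcePar;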
        foreach (HtmlNode sText in pSourceNodes)
        {
            if (!string.IsNullOrWhiteSpace(sText.InnerText))
            {
                string formattingMarker = string.Empty;
                if (sText.OuterHtml.Contains("FONT SIZE=4"))
                {
                    formattingMarker = BOLD_MARKER;
                }
                else if (sText.OuterHtml.Contains("FONT SIZE=5"))
                {
                    formattingMarker = HEADING_MARKER;
                }
                sourcePar = string.Format("{0}{1}", formattingMarker, sText.InnerText);
                sourceText.Add(HttpUtility.HtmlDecode(sourcePar));                    
            }
        }
    }

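    // Populate the generic list of strings with the target text lines
    if (htmlDocTarget.DocumentNode != null)
    {
        IEnumerable<HtmlAgilityPack.HtmlNode> pTargetNodes = htmlDocTarget.DocumentNode.SelectNodes("//text()");
        // SelectNodes returns null, not an empty collection, when nothing matches
        if (pTargetNodes == null) pTargetNodes = new List<HtmlAgilityPack.HtmlNode>();
        string targetPar;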
        foreach (HtmlNode tText in pTargetNodes)
        {
            if (!string.IsNullOrWhiteSpace(tText.InnerText))
            {
                string formattingMarker = string.Empty;
                if (tText.OuterHtml.Contains("FONT SIZE=4"))
                {
                    formattingMarker = BOLD_MARKER;
                }
                else if (tText.OuterHtml.Contains("FONT SIZE=5"))
                {
                    formattingMarker = HEADING_MARKER;
                }
                targetPar = string.Format("{0}{1}", formattingMarker, tText.InnerText);
                targetText.Add(HttpUtility.HtmlDecode(targetPar));
            }
        }
    }

    // Alternate through the two generic lists, writing to a doc file that will write the source
    // as regular text and the target bolded.
    int sourceLineCount = sourceText.Count;
    int targetLineCount = targetText.Count;
    int higherCount = Math.Max(sourceLineCount, targetLineCount);
    string sourceParagraph = string.Empty;
    string targetParagraph = string.Empty;

    // Write it out
    string docxFilename = string.Format("{0}.docx", textBoxDOCXFile2BCre8ed.Text.Trim());
    using (DocX document = DocX.Create(docxFilename))
    {
        for (int i = 0; i < higherCount; i++)
        {
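            // Reset both paragraphs on every pass; otherwise, once the shorter list runs out,
            // its last paragraph would be written again on each remaining pass.
            sourceParagraph = (i < sourceLineCount) ? (sourceText[i] ?? string.Empty) : string.Empty;
            targetParagraph = (i < targetLineCount) ? (targetText[i] ?? string.Empty) : string.Empty;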

            if (!string.IsNullOrWhiteSpace(sourceParagraph))
            {
                Paragraph pSource = document.InsertParagraph();
                // The marker is always prepended, so test and strip only the leading character;
                // a "~" or "^" elsewhere in the text is left alone.
                if (sourceParagraph.StartsWith(BOLD_MARKER, StringComparison.Ordinal))
                {
                    sourceParagraph = sourceParagraph.Substring(BOLD_MARKER.Length);
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(13).Bold();
                }
                else if (sourceParagraph.StartsWith(HEADING_MARKER, StringComparison.Ordinal))
                {
                    sourceParagraph = sourceParagraph.Substring(HEADING_MARKER.Length);
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(16);
                }
                else
                {
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(11);
                }
                Paragraph pSpacer = document.InsertParagraph();
                pSpacer.Append(Environment.NewLine);
            }
            if (!string.IsNullOrWhiteSpace(targetParagraph))
            {
                Paragraph pTarget = document.InsertParagraph();
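                // The marker is always prepended, so test and strip only the leading character;
                // a "~" or "^" elsewhere in the text is left alone.
                if (targetParagraph.StartsWith(BOLD_MARKER, StringComparison.Ordinal))
                {
                    targetParagraph = targetParagraph.Substring(BOLD_MARKER.Length);
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(13).Bold();
                }
                else if (targetParagraph.StartsWith(HEADING_MARKER, StringComparison.Ordinal))
                {
                    targetParagraph = targetParagraph.Substring(HEADING_MARKER.Length);
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(16).Bold();
                }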
                else
                {
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(11).Bold();
                }
                Paragraph pTargetSpacer = document.InsertParagraph();
                pTargetSpacer.Append(Environment.NewLine);
            }
        }
        document.Save();
    }
    MessageBox.Show("done!");
}
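
The method above tags each extracted line by prepending a one-character marker ("~" for bold, "^" for heading) and later decides the formatting from that marker. As a minimal sketch of how that step could be isolated, the helper below splits a tagged line into a style and its clean text. The names `ParagraphStyle` and `MarkerParser` are hypothetical and not part of the original utility; only the two marker characters come from the code above.

```csharp
using System;

// Hypothetical names for illustration; only the "~" and "^" markers come from the article's code.
public enum ParagraphStyle { Normal, Bold, Heading }

public static class MarkerParser
{
    private const string BoldMarker = "~";
    private const string HeadingMarker = "^";

    // Returns the style implied by a leading marker and, via 'text',
    // the line with that single leading marker removed.
    public static ParagraphStyle Parse(string line, out string text)
    {
        line = line ?? string.Empty;

        if (line.StartsWith(BoldMarker, StringComparison.Ordinal))
        {
            text = line.Substring(BoldMarker.Length);
            return ParagraphStyle.Bold;
        }
        if (line.StartsWith(HeadingMarker, StringComparison.Ordinal))
        {
            text = line.Substring(HeadingMarker.Length);
            return ParagraphStyle.Heading;
        }

        // No leading marker: the text is returned unchanged, even if it contains "~" or "^" later on.
        text = line;
        return ParagraphStyle.Normal;
    }
}
```

With something like this in place, the writing loop would only need to map each style to a font size and bold setting, rather than inspecting the string itself.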

I have uploaded the source code, too. Feel free to clean it up/refactor it - please post it back here to Code Project if you do, though. From the source you can see which controls you need to add to the form and what to name them.
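
As one possible starting point for such a refactoring, the alternation of the two lists can be pulled out of the DocX-writing code so that it can be checked on its own. This is only a sketch under that assumption; `ParagraphInterleaver` and `Interleave` are hypothetical names, and the method mirrors the intent of the loop above (source paragraph first, then its target counterpart, skipping blank entries) without touching any document library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the uploaded source.
public static class ParagraphInterleaver
{
    // Alternates source and target paragraphs by index. When one list is
    // longer, its remaining paragraphs are appended without repeating
    // anything from the shorter list. Blank entries are skipped.
    public static List<string> Interleave(IList<string> source, IList<string> target)
    {
        var result = new List<string>();
        int count = Math.Max(source.Count, target.Count);

        for (int i = 0; i < count; i++)
        {
            if (i < source.Count && !string.IsNullOrWhiteSpace(source[i]))
            {
                result.Add(source[i]);
            }
            if (i < target.Count && !string.IsNullOrWhiteSpace(target[i]))
            {
                result.Add(target[i]);
            }
        }
        return result;
    }
}
```

For example, interleaving "one", "two", "three" with "uno", "dos" gives "one", "uno", "two", "dos", "three". The form's method could then loop over the result and hand each entry to the DocX paragraph code.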

I did not tell you how to get the files from EPUB (or whatever format you download) to HTML; that is sort of an "exercise left to the reader," but in my case I use AVS Document Converter to convert EPUB files to DOCX, then I manually save those as HTML files before running my utility against them. You may find a better way; my experimentation did not, as converting directly to HTML created a rather malformed file, and saving a PDF as text also produced a very "ugly" text file.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

B. Clay Shannon
Founder Across Time & Space
United States
I am in the process of morphing from a software developer into a portrayer of Mark Twain. My monologue (or one-man play, entitled "The Adventures of Mark Twain: As Told By Himself" and set in 1896) features Twain giving an overview of his life up till then. The performance includes the relating of interesting experiences and humorous anecdotes from Twain's boyhood and youth, his time as a riverboat pilot, his wild and woolly adventures in the Territory of Nevada and California, and experiences as a writer and world traveler, including recollections of meetings with many of the famous and powerful of the 19th century - royalty, business magnates, fellow authors, as well as intimate glimpses into his home life (his parents, siblings, wife, and children).

Peripatetic and picaresque, I have lived in eight states; specifically, besides my native California (where I was born and where I now again reside) in chronological order: New York, Montana, Alaska, Oklahoma, Wisconsin, Idaho, and Missouri.

I am also a writer of both fiction (for which I use a nom de plume, "Blackbird Crow Raven", as a nod to my Native American heritage - I am "½ Cowboy, ½ Indian") and nonfiction, including a two-volume social and cultural history of the U.S. which covers important events from 1620-2006: http://www.lulu.com/spotlight/blackbirdcraven

Last Updated 18 Feb 2014
Article Copyright 2014 by B. Clay Shannon