Click here to Skip to main content
Click here to Skip to main content

A Simple Parser That Converts HTML from OneNote to Markdown

, 14 Jan 2013 CPOL
Rate this:
Please Sign up or sign in to vote.
OneNote2Markdown converts the html generated from OneNote to Markdown format, which can then be translated to a cleaner normalized html by any online Markdown parser later.

Introduction

This article introduces OneNote2Markdown, a parser I made that converts the html file generated from OneNote (by sending to Word and save as html) to Markdown format, which can then be translated to a cleaner html by any online Markdown parser later.

  • Written in F#, the tool works with OneNote 2010 and Word 2010. It handles normal paragraphs, headings, links, lists, inlined code and code blocks only.
  • The tool reads from "input.html" and writes to "output.txt".
  • The source code of the latest version can be viewed at Bitbucket. It requires HtmlAgilityPack to compile.
  • The example pack contains this article in docx, html & Markdown formats, which would give a basic demonstration on how the tool works.

Background

I tend to take notes in OneNote. When I first time try to submit an article which is composed in OneNote, it is really a hassle to adapt the content to the template in Code Project manually. So I decided to make a parser which would automate most of the formatting work for me.

Implementation Overview

Preparations

  • An Active Pattern to pattern match if a text node has a certain ancestor such as <b> or <i>.

    let (|HasAncestor|) tag (node: HtmlNode) =
        node.Ancestors(tag) |> Seq.isEmpty |> not
    
  • A function to dig up a certain CSS property that a text node inherits from style attribute.

    let getPartialStyle cssProperty (node: HtmlNode) =
        let predicate node =
            // "property1:value1;property2:value2"
            let myMatch = Regex.Match(getStyle node, sprintf "%s:(.+?)(;|$)" cssProperty)
            if myMatch.Success then
                Some myMatch.Groups.[1].Value
            else None
        // Gets the value for the closest cssProperty.
        node.Ancestors("span") |> Seq.tryPick predicate
    
  • A function to get a certain CSS property of a node from style attribute.

    let getPartialStyleSelf cssProperty (node: HtmlNode) =
        let myMatch = Regex.Match(getStyle node, sprintf "%s:(.+?)(;|$)" cssProperty)
        if myMatch.Success then
            Some myMatch.Groups.[1].Value
        else
            None
    

Headings

  • Determines the heading type of a paragraph by checking its font-size & color CSS property as well as if it has <b> or <i> ancestor.

    match font, color, node with
    | Some "16.0pt", Some "#17365D", (HasAncestor "b" true)   -> H 1
    | Some "13.0pt", Some "#366092", (HasAncestor "b" true)   -> H 2
    | Some "11.0pt", Some "#366092", (HasAncestor "b" true)  & (HasAncestor "i" false) -> H 3
    | Some "11.0pt", Some "#366092", (HasAncestor "b" true)  & (HasAncestor "i" true)  -> H 4
    | Some "11.0pt", Some "#366092", (HasAncestor "b" false) & (HasAncestor "i" false) -> H 5
    | Some "11.0pt", Some "#366092", (HasAncestor "b" false) & (HasAncestor "i" true)  -> H 6
    | _ -> Normal
    
  • Uses ## heading ## syntax so that Markdown parser doesn't eat the last # contained in the heading.

    let headIt n text =
        String.Format("{1} {0} {1}", text, (String (Array.create n '#')))
    

Code

  • Any text whose font is Consolas is considered as code, otherwise not.

    match getPartialStyle "font-family" textNode with
    | Some "Consolas" -> varIt text
    | _ -> text
    
  • Simplifies Markdown syntax by combining several inlined code pieces into one if they are separated by white-spaces (e.g. a b -> a b). Preserves the leading spaces so as to protect indentations and blank lines with code blocks (anything inside is non-trivial and will not be removed later, e.g. a -> a). However, the limitation exists that the text itself cannot contain `.

    let simplifyVar (text: string) =
        Regex.Replace(text, @"(?<=.)`(\s*)`", "$1")
    
  • Differentiates code blocks from inlined code.

    let tryGetPureCode (text: string) =
        let myMatch = (Regex.Match(text, @"^`([^`]*)`$"))
        if myMatch.Success then
            Some (myMatch.Result "$1")
        else
            None
    

Lists

  • Distinguishes between ordered lists and unordered lists by the symbol. Lists without symbols are considered as normal paragraphs without indentation.

    let listIt x text =
        match x with
        | "o" | "·" -> sprintf "*  %s" text
        | _         -> sprintf "1. %s" text
    
  • Gets the indentation by margin-left:54.0pt CSS property.

    let getIndent (node: HtmlNode) =
        let getMargin (x: string) =
            let unit = 27 // each level is 27
            let array = x.Split '.'
            let (success, x) = Int32.TryParse array.[0]
            if success then x / unit
            else failwith "indent parse error!"
        match getPartialStyleSelf "margin-left" node with
        | Some x -> getMargin x
        | None -> 0
    

Links

  • Checks if a piece of text contains the link by looking for <a> in its ancestors.

    match textNode with
    | (HasAncestor "a" true) ->
        let ancestor_a = textNode.Ancestors("a") |> Seq.head
        linkIt text (ancestor_a.GetAttributeValue("href", "none"))
    | _ -> text
    

Finalization 

  • Gets the correct indentations and paragraph spacing for the whole content.

    /// Assumes in OneNote there are no spaces in front of a code block (indent by tabs).
    /// Assumes in OneNote the internal indentations of a code block are either all tabs or all spaces, never mixed.
    let review paragraphs =
        // indentOffset is used for nesting indentations.
        // If a benchmark line with indentation a, actually indents x, we set indentOffset = a - x.
        // So any line with indentation b, does actually indent b - indentOffset = b - a + x.
        let mutable listIndentOffset = 0
        let mutable codeIndentOffset = 0
        let oldCopy = paragraphs |> Seq.toArray
        let newCopy = Array.zeroCreate oldCopy.Length
        // Looks at the current paragraph and the previous paragraph.
        // I don't care about the first paragraph as it will be the title.
        // Uses "\r\n" so that Notepad reads correctly.
        for i in 1 .. oldCopy.Length - 1 do
            match oldCopy.[i - 1], oldCopy.[i] with
            | (Code _ | Listing _) , (Heading text | Basic text) ->
                // Code block / list block ends, prepends and appends new lines, and resets both indentOffsets.
                newCopy.[i] <- sprintf "\r\n%s\r\n" text
                listIndentOffset <- 0
                codeIndentOffset <- 0
    
            | (Heading _ | Basic _), (Heading text | Basic text) ->
                // Appends a new line.
                newCopy.[i] <- sprintf "%s\r\n" text
    
            | Code (_, a)          , Code (text, b)              ->
                // Don't add a new line in between code blocks.
                newCopy.[i] <- indentIt (b - codeIndentOffset) text
    
            | (Heading _ | Basic _), Code (text, b)              ->
                // Code block starts, cache codeIndentOffset
                // Indents 1 level only as Heading or Basic indents none.
                newCopy.[i] <- indentIt 1 text
                codeIndentOffset <- b - 1
    
            | Listing (_, a)       , Code (text, b)              ->
                // Code block within a list requires 1 additional level on top of the list indentation.
                // Code block starts, cache codeIndentOffset.
                // Prepends a new line.
                newCopy.[i] <- sprintf "\r\n%s" (indentIt (b - listIndentOffset + 1) text)
                codeIndentOffset <- listIndentOffset - 1
    
            | Listing (_, a)       , Listing (text, b)           ->
                // Don't add  a new line in between list blocks.
                newCopy.[i] <- indentIt (b - listIndentOffset) text
    
            | Code (_, a)          , Listing (text, b)           ->
                // Code block ends, reset codeIndentOffset.
                // Prepends a new line.
                codeIndentOffset <- 0
                newCopy.[i] <- sprintf "\r\n%s" (indentIt (b - listIndentOffset) text)
    
            | (Heading _ | Basic _), Listing (text, b)           ->
                // List block starts, cache listIndentOffset
                listIndentOffset <- b
                newCopy.[i] <- text
        newCopy
    

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

colinfang

United Kingdom United Kingdom
No Biography provided

Comments and Discussions

 
GeneralMy vote of 1 Pinprofessionalsaramgsilva9-Aug-14 12:23 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web04 | 2.8.141223.1 | Last Updated 14 Jan 2013
Article Copyright 2013 by colinfang
Everything else Copyright © CodeProject, 1999-2014
Layout: fixed | fluid