How to export a blogspot blog to HTML/GitHub with C#

25 Aug 2014, LGPL3, 2 min read
A LINQ-to-XML example

I finally did it: I bought LINQPad's Code Completion so I could write C# scripts easily. Now, sure, I could have used C# script for free, but... wait, why didn't I do that?

Anyway, in my previous post I explained in detail how to set up a blog on GitHub. Now it's time to convert my old Blogger blog to the shiny new GitHub format.

To export your blog from Blogger, log into Blogger, go to your blog control panel, go to the Settings | Other tab, and click "Export Blog".

You get an XML file that is basically in Atom format. It's hard to understand because it doesn't have any line breaks (apart from the line breaks in your blog template, which just serve to distract you). It's a <feed> root element containing a bunch of <entry> elements, some of which contain your posts and others of which contain metadata.
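
To get a feel for that structure before running the full export, a small sketch like the following lists only the titles of the entries that are posts. The class name `FeedInspector`, its members, and the sample feed are all made up for illustration; the sample is hand-written and much simpler than real Blogger output, and the only thing taken from the export format is that post entries have a category whose term contains "#post".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FeedInspector
{
    static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";

    // True if any <category> child of the entry has a term containing "#post".
    public static bool IsPost(XElement entry)
    {
        return entry.Elements(Atom + "category")
            .Any(c => ((string)c.Attribute("term") ?? "").Contains("#post"));
    }

    // Returns the titles of all post entries in the feed, in document order.
    public static List<string> PostTitles(XDocument doc)
    {
        return doc.Root.Elements(Atom + "entry")
            .Where(IsPost)
            .Select(e => (string)e.Element(Atom + "title") ?? "")
            .ToList();
    }

    // A tiny hand-written feed (NOT real Blogger output):
    // one metadata entry and one post that also carries a label.
    public const string SampleFeed = @"<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry><category term='http://example.com/kind#settings'/><title>Some setting</title></entry>
  <entry><category term='C#'/><category term='http://example.com/kind#post'/><title>Hello, no one</title></entry>
</feed>";
}
```

Calling `FeedInspector.PostTitles(XDocument.Parse(FeedInspector.SampleFeed))` returns a list with the single title "Hello, no one"; the settings entry is skipped.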

Here's the code I came up with to export to GitHub. Just paste this into LINQPad or whatever, change the filepath to point to your xml file, and run it!

C#
void Main()
{
    string filepath = @"C:\Downloads\Blog.xml";
    string text = File.ReadAllText(filepath);
    XDocument doc = XDocument.Parse(text);

    // Use XNamespaces to deal with those pesky "xmlns" attributes.
    // The underscore represents the default namespace.
    var _ = XNamespace.Get("http://www.w3.org/2005/Atom");
    var app = XNamespace.Get("http://purl.org/atom/app#");

    var posts = doc.Root.Elements(_+"entry")
        // An <entry> is either a post, or some bit of metadata no one cares about.
        // Keep only entries that have a child like <category term="...#post"/>.
        // Check every <category>, not just the first, and tolerate a missing term attribute.
        .Where(entry => entry.Elements(_+"category")
            .Any(cat => ((string)cat.Attribute("term") ?? "").Contains("#post")))
        // Exclude any entries with an <app:draft> element except <app:draft>no</app:draft>
        .Where(entry => !entry.Descendants(app+"draft").Any(draft => draft.Value != "no"));

    var outfolder = Path.Combine(Path.GetDirectoryName(filepath), Path.GetFileNameWithoutExtension(filepath));
    Directory.CreateDirectory(outfolder);

    foreach (var entry in posts)
    {
        // Extract data from XML
        DateTime published = DateTime.Parse(entry.Element(_+"published").Value);
        DateTime updated = DateTime.Parse(entry.Element(_+"updated").Value);
        string title = entry.Element(_+"title").Value;
        string content = entry.Element(_+"content").Value;
        // Casting an XAttribute to string yields null when the attribute is missing,
        // so the ?? fallbacks below actually take effect.
        string type = (string)entry.Element(_+"content").Attribute("type") ?? "html";
        // The post's original URL is the href of its <link rel="alternate">, if any.
        string originalLink = entry.Elements(_+"link")
            .Where(e => (string)e.Attribute("rel") == "alternate")
            .Select(e => (string)e.Attribute("href"))
            .FirstOrDefault() ?? "";
        string outFileName = string.Format("{0:yyyy-MM-dd}-{1}.{2}", published,
               Path.GetFileNameWithoutExtension(originalLink), type);
        var outPath = Path.Combine(outfolder, outFileName);

        if (content.Count(c => c == '\n') <= 3)
            content = AddLineBreaks(content); // optional

        // Write output file (partial HTML for Jekyll)
        using (StreamWriter output = File.CreateText(outPath)) {
            output.WriteLine("---");
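            // Escape backslashes and double quotes so the title stays valid
            // inside a double-quoted YAML string.
            output.WriteLine("title: \"{0}\"", title.Replace("\\", "\\\\").Replace("\"", "\\\""));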
            output.WriteLine("layout: post");
            output.WriteLine("# Pulled from Blogger. Last updated there on: {0:yyyy-MM-dd}", updated);
            output.WriteLine("---");
            if (originalLink != "")
                output.WriteLine("<small><p><i>This post was imported from "+
                 "<a href='{0}'>blogspot</a>.</i></p></small>", originalLink);
            output.WriteLine("{% raw %}"); // Disable Jekyll/Liquid
            output.Write(content);
            output.WriteLine("{% endraw %}");
        }
    }
}

It will create a folder named after the xml file, and inside that folder it will create an html file for each post, like this:

2007-09-03-hello-no-one.html
2011-07-05-why-wpf-sucks.html
2012-06-07-smart-tabs.html
2013-05-28-onward.html

These filenames are in the correct format for Jekyll, so if you're moving to GitHub, just move all these files to your /_posts folder, commit, and you're done! If you want "proper" HTML files, modify the code above to produce proper code like <html><head>...</head>... instead of Jekyll front-matter.
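
As a rough sketch of that modification, a helper along these lines could wrap each post in a minimal standalone page. `HtmlPage.Wrap` is a hypothetical name, and the page layout (a `<title>`, an `<h1>` and nothing else) is just an assumption about what "proper" HTML might mean for you. The title is HTML-encoded, while the post content is inserted as-is because it is already HTML.

```csharp
using System;
using System.Net;

public static class HtmlPage
{
    // Wraps a post's HTML fragment in a minimal standalone HTML5 document.
    // The title is HTML-encoded; the content is assumed to be HTML already.
    public static string Wrap(string title, string content)
    {
        string safeTitle = WebUtility.HtmlEncode(title);
        return "<!DOCTYPE html>\n" +
               "<html>\n<head>\n<meta charset=\"utf-8\">\n" +
               "<title>" + safeTitle + "</title>\n" +
               "</head>\n<body>\n" +
               "<h1>" + safeTitle + "</h1>\n" +
               content + "\n" +
               "</body>\n</html>\n";
    }
}
```

In the export loop, the front-matter `WriteLine` calls could then be swapped for a single `output.Write(HtmlPage.Wrap(title, content));`.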

Oh, by the way, Blogger's exported HTML contains no line breaks at all in your posts. So I wrote this little method to add some line breaks at appropriate spots:

C#
string AddLineBreaks(string content)
{
    var sb = new StringBuilder(content.Length + 100);

    bool pre = false, fail;
    for (UString rest = content; !rest.IsEmpty;) {
        if (rest.StartsWith("<pre")) pre = true;
        if (rest.StartsWith("</pre")) pre = false;
        bool s;
        if ((s = rest.StartsWith("<br />")) || rest.StartsWith("<br/>")) {
            sb.Append(pre ? "\n" : "<br/>\n");
            rest = rest.Substring(s ? 6 : 5);
            continue;
        }
        if (rest.StartsWith("<li>") || rest.StartsWith("<p>") || rest.StartsWith("<tr>") || rest.StartsWith("<pre>") || rest.StartsWith("<blockquote>") || rest.StartsWith("<img"))
            sb.Append('\n');
        if (rest.StartsWith("</ul>") || rest.StartsWith("</ol>") || rest.StartsWith("</blockquote>"))
            sb.Append('\n');
        char c = (char)rest.PopFront(out fail);
        if (!fail) sb.Append(c);
    }

    return sb.ToString();
}

This relies on UString, a kind of string slice from my Loyc.Essentials.dll library. If you want to use this function, download LoycCore from NuGet.
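
If you'd rather not add a dependency, the same idea can probably be expressed with a plain string and an index. The following is only a sketch of that approach, not the version I actually ran; the `LineBreaker` class and the `At` helper are names invented here, and the tag list is copied from the method above.

```csharp
using System;
using System.Text;

public static class LineBreaker
{
    // Tags that get a newline inserted in front of them.
    static readonly string[] BreakBefore = {
        "<li>", "<p>", "<tr>", "<pre>", "<blockquote>", "<img",
        "</ul>", "</ol>", "</blockquote>"
    };

    // True if `s` contains `prefix` starting at index `i` (ordinal comparison).
    static bool At(string s, int i, string prefix)
    {
        return string.CompareOrdinal(s, i, prefix, 0, prefix.Length) == 0;
    }

    public static string AddLineBreaks(string content)
    {
        var sb = new StringBuilder(content.Length + 100);
        bool pre = false;
        int i = 0;
        while (i < content.Length) {
            if (At(content, i, "<pre")) pre = true;
            if (At(content, i, "</pre")) pre = false;

            // Normalize <br /> and <br/>: inside <pre>, a bare newline is enough.
            int brLength = At(content, i, "<br />") ? 6 : At(content, i, "<br/>") ? 5 : 0;
            if (brLength > 0) {
                sb.Append(pre ? "\n" : "<br/>\n");
                i += brLength;
                continue;
            }
            foreach (string tag in BreakBefore) {
                if (At(content, i, tag)) { sb.Append('\n'); break; }
            }
            sb.Append(content[i++]);
        }
        return sb.ToString();
    }
}
```

For example, `LineBreaker.AddLineBreaks("<p>a<br />b</p><p>c</p>")` returns `"\n<p>a<br/>\nb</p>\n<p>c</p>"`.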

The code was good enough for me, and hopefully it will be good enough for you... but I don't know if images work (certainly Blogger doesn't include images in the export file).

Note: by default, Blogger has an "implicit line breaks" feature that converts newlines to <br/> for you. I turned that feature off because it often screws up the formatting of nontrivial posts; if you left it on, I'm not sure whether the HTML that Blogger gives you in the XML file includes those auto-inserted <br/>s.

License

This article, along with any associated source code and files, is licensed under The GNU Lesser General Public License (LGPLv3)


Written By
Software Developer
Canada
Since I started programming when I was 11, I wrote the SNES emulator "SNEqr", the FastNav mapping component, the Enhanced C# programming language (in progress), the parser generator LLLPG, and LES, a syntax to help you start building programming languages, DSLs or build systems.

My overall focus is on the Language of your choice (Loyc) initiative, which is about investigating ways to improve interoperability between programming languages and putting more power in the hands of developers. I'm also seeking employment.
