Click here to Skip to main content
11,415,734 members (85,588 online)
Click here to Skip to main content

Data URI Image Extractor

, 5 Jul 2011 BSD
Rate this:
Please Sign up or sign in to vote.
This short article will show an easy way to extract HTML data URI images and convert the HTML to use external images.

Introduction

In HTML, it is actually possible to embed raw image data directly into HTML so that separate image files are not needed. This can speed up HTTP transfers in many cases, but there are compatibility issues especially with Internet Explorer up to and including version 8. This short article will show an easy way to extract these images and convert the HTML to use external images.

Data URI

Normally images are included using this syntax:

<img src="Image1.png">

The data URI syntax however allows images to be directly embedded, reducing the number of HTTP requests and also allowing for saving on disk as a single file. While this article deals with only the img tag, this method can be applied to other tags as well. Here is an example of data URI usage:

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==">

More information on data URIs is available here:

SeaMonkey 2.1

Most editors do not use the data URI syntax. However, starting with SeaMonkey 2.1 Composer (Mozilla HTML Editor), images which are dragged and dropped are imported using this syntax. This is quite a bad change in my opinion, especially since it is not obvious and because it is a change in behavior from 2.0. In my case, I made a large HTML file with over 50 images before I discovered it was not linking them, but instead embedding them.

Utilities

Amazingly, there are quite a few online utilities to convert images to the data URI format, but none that I could find that could do the reverse. Because I did not want to hand-edit my document, I wrote a quick utility to extract the images to disk and change the HTML to use external images. This allows the document to be loaded by any standard browser including Internet Explorer 8.

About the Source Code

The source code is quite targeted to my specific need. It has a lot of limitations. I have published it however so that it is available as a foundation for you to expand should you have the same need.

  • The parsing is very basic, but works fine with SeaMonkey output.
  • It only supports PNG format currently.
  • There is no exception handling.
  • The code has not been optimized in any way.

Usage

ImageExtract is a console application and accepts one parameter. The parameter is the HTML file for input. The images will be output in the same directory, and the new HTML file will have a -new suffix. So if the input is index.html, the output HTML will be index-new.html.

Source Code

I have made the project available for download, but it is quite simple. It is a C# .NET Console application. For easy viewing, here is the class:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace ImageExtract {
  class Program {
    // NOTE - This program is rough and dirty - I designed
    // it to accomplish and urgent task. I have not built in 
    // normal error handling etc.
    //
    // It also has not been optimized at all
    // and certainly is not very efficient.
    //
    // It also assumes all images are png files.
    static void Main(string[] aArgs) {
      string xSrcPathname = aArgs[0];
      string xPath = Path.GetDirectoryName(xSrcPathname);
      string xDestPathname = Path.Combine(xPath, 
             Path.GetFileNameWithoutExtension(xSrcPathname) + "-New.html");
      int xImgIdx = 0;

      Console.WriteLine("Processing " + Path.GetFileName(xSrcPathname));
      string xSrc = File.ReadAllText(xSrcPathname);
      var xDest = new StringBuilder();

      string xStart = @"data:image/png;base64,";
      string xB64;
      int x = 0;
      int y = 0;
      int z = 0;
      do {
        x = xSrc.IndexOf(xStart, z);
        if (x == -1) {
          break;
        }
        // Write out preceding HTML
        xDest.Append(xSrc.Substring(z, x - z));

        // Get the Base64 string
        y = xSrc.IndexOf('"', x + 1);
        xB64 = xSrc.Substring(x + xStart.Length, y - x - xStart.Length);
        // Convert the Base64 string to binary data
        byte[] xImgData = System.Convert.FromBase64String(xB64);

        string xImgName;
        // Get Image name and replace it in the HTML
        // We don't want to overwrite images that might already exist on disk,
        // so cycle till we find a non used name
        do {
          xImgIdx++;
          xImgName = "Image" + xImgIdx.ToString("0000") + ".png";
        } while (File.Exists(Path.Combine(xPath, xImgName)));

        Console.WriteLine("Extracting " + xImgName);

        // Write image name into HTML
        xDest.Append(xImgName);
        // Write the binary data to disk
        File.WriteAllBytes(Path.Combine(xPath, xImgName), xImgData);

        z = y;
      } while (true);
      // Write out remaining HTML
      xDest.Append(xSrc.Substring(z));

      // Write out result
      File.WriteAllText(xDestPathname, xDest.ToString());
      Console.WriteLine("Output to " + Path.GetFileName(xDestPathname));
    }
  }
}

History

  • 5th July, 2011: Initial version

License

This article, along with any associated source code and files, is licensed under The BSD License

Share

About the Author

Chad Z. Hower, a.k.a. Kudzu
"Programming is an art form that fights back"
www.KudzuWorld.com

Formerly the Regional Developer Adviser (DPE) for Microsoft Middle East and Africa, he was responsible for 85 countries spanning 4 continents and 10 time zones. Now Chad is a Microsoft MVP.

Chad is the chair of several popular open source projects including Indy and Cosmos (C# Open Source Managed Operating System).

Chad is the author of the book Indy in Depth and has contributed to several other books on network communications and general programming.

Chad has lived in Canada, Cyprus, Switzerland, France, Jordan, Russia, Turkey, and the United States. Chad has visited more than 60 countries, visiting most of them several times.

Comments and Discussions

 
QuestionPaste URI into browser address bar for quick result Pin
Rich Trefz10-Jul-13 3:53
memberRich Trefz10-Jul-13 3:53 
AnswerRe: Paste URI into browser address bar for quick result Pin
Chad Z. Hower aka Kudzu10-Jul-13 3:56
memberChad Z. Hower aka Kudzu10-Jul-13 3:56 
QuestionUpdated to handle JPEG Pin
Chad Z. Hower aka Kudzu6-Jul-11 17:47
memberChad Z. Hower aka Kudzu6-Jul-11 17:47 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web03 | 2.8.150427.4 | Last Updated 5 Jul 2011
Article Copyright 2011 by Chad Z. Hower aka Kudzu
Everything else Copyright © CodeProject, 1999-2015
Layout: fixed | fluid