
Introduction to XPS - Part 1 of n (of too many)

2 Aug 2008
XPS is a fixed document format derived from XAML. Learn how to use it to produce the documents you want.

Introduction

XPS (XML Paper Specification) is a fixed page format specification that is a useful alternative to PDF. Just as PDF is a 'cut-down' version of PostScript, XPS is a reduced schema version of XAML specifically for fixed page layout. With XPS being XML based, it should be a great format for generating your own documents. Unfortunately, there seems to be little available describing this format in a way that's useful for actual implementation when you want to do just that. I'm hoping to help fill in some of those gaps with a (short) series of articles.

My introduction to XPS began with mocking up documents using Word and then "printing" them using the XPS Printer driver provided by .NET v3, afterwards examining the XPS documents to learn how they are structured and how to manipulate them. Apparently, if you have MS Office 2007 and get an optional update, you can also do a "Save As" to produce an XPS document.

I found that those XPS files produced by Word and the "XPS Printer" often included a large number of unnecessary artifacts (especially if it's a file you've edited several times, changed the fonts, etc.). This particular tool cleans out a large number of those artifacts, eliminates some duplicates, and does a few other tweaks that help to reduce the overall size of the XPS file, although in most cases, only by a few KB. Stepping through what it does also serves as a useful introduction to XPS files.

If you're planning on doing your own XPS output, then mocking up your intended format and using this tool to clean up the result is a really handy way to start.

Originally, I had pursued XPS purely as a proof of concept for a billing system. However, when it became apparent that commercial systems for PDF / PostScript production were going to be in the "insanely expensive" price bracket, my "proof of concept" became the actual production system.

This particular part of the project came out of the necessity of cleaning up marketing materials ready for their inclusion into the customer's bill. This CodeProject article is derived from that work.

XPS Internal Structure

The OOXML organisation used by XPS files (strictly speaking, the Open Packaging Conventions that form part of the OOXML specification) includes a large number of cross references between different parts (files) and within the individual files. I won't go into whether OOXML is a good or a bad thing; there's already enough argument about that. However, just to add to the confusion, the OOXML "spec" has been slightly tweaked for XPS.

In the case of XPS files, the internal structure can be thought of as having three tiers (please note that this is not the official explanation, but it works for me). At the root, there's the XPS file itself. Next, there are the individual documents carried within that. Finally, there are the individual pages for each document. At each of these tiers may be held references to other parts and also resources of various types. All of this is a gross over-simplification, of course, but you get the idea.

Many of the parts (files) within the OOXML structure can be given different names, rather than the ones used by the "XPS Printer" or that are shown in the sample files, so long as all of the cross references line up.

Within each file, the various parts generally don't have to be in any specific order. Order only becomes important for specific issues to do with the layout of pages. Otherwise, just do whatever is most convenient for processing.

After "printing" a sample XPS file yourself (and renaming it to a .zip file), you would probably see a structure similar to the following:

At each tier within the XPS document, you can find three different folders, although they don't have to be present at each tier:

Folder Name   Description
_rels         Contains files that describe the relationships the files at this tier have with other parts within the XPS file.
Metadata      Holds metadata files related to this tier. For instance, thumbnail images of the document or the PrintTicket files.
Resources     Contains the resources (e.g., fonts and images) used by this tier of the XPS file.
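
As a rough illustration, the same layout can be inspected without renaming the file to .zip. The sketch below assumes the System.IO.Compression classes from later versions of .NET (4.5 onwards) rather than the ICSharpCode library used by the tool in this article; XpsPartLister and ListParts are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class XpsPartLister
{
    // Returns the name of every entry in the package, with a leading "/"
    // so that the names look like the references used inside XPS files.
    public static List<string> ListParts(Stream xpsStream)
    {
        using (var archive = new ZipArchive(xpsStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries
                .Select(entry => "/" + entry.FullName)
                .OrderBy(name => name, StringComparer.OrdinalIgnoreCase)
                .ToList();
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: XpsPartLister <file.xps>");
            return;
        }

        using (FileStream file = File.OpenRead(args[0]))
        {
            foreach (string part in ListParts(file))
            {
                Console.WriteLine(part);
            }
        }
    }
}
```

Sorting the names simply keeps the entries from each folder together, which makes the three tiers easier to spot in the output.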

Root Tier (XPS File)

At the root tier, there will be two files:

[Content_Types].xml           Enumerates the different types of files, specifically the file extensions contained within this XPS document.
FixedDocumentSequence.fdseq   Lists the actual documents contained within the XPS file, in effect pointing to the next tier in the hierarchy.

[Content_Types].xml would normally look something like this:

<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Default ContentType="application/vnd.openxmlformats-package.relationships+xml"
        Extension="rels" />
    <Default ContentType="application/vnd.ms-package.xps-fixeddocumentsequence+xml"
        Extension="fdseq" />
    <Default ContentType="application/vnd.ms-package.xps-fixeddocument+xml"
        Extension="fdoc" />
    <Default ContentType="application/vnd.ms-printing.printticket+xml"
        Extension="xml" />
    <Default ContentType="image/jpeg" Extension="JPG" />
    <Default ContentType="application/vnd.ms-package.xps-fixedpage+xml"
        Extension="fpage" />
    <Default ContentType="application/vnd.ms-package.obfuscated-opentype"
        Extension="odttf" />
</Types>

Note the schema namespace declaration in the root Types element and the "rels" extension declaration; these are specific to OOXML. Next come the "fdseq", "fdoc", and "fpage" extensions, which all declare parts of the XPS structure. Then there's "odttf" for obfuscated OpenType font files; more on these in another article. Unfortunately, the generic "xml" is used as the extension for the metadata PrintTicket files. Finally, there's "JPG" for the image files; you will often see "PNG" and others as well (they don't appear in this sample), depending on what's sitting in your original source document. You can assume "JPG" is always going to be present, because the metadata thumbnail image generated by the XPS printer driver is always a small JPEG image.
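
If you want to check these declarations from code, something along the following lines should do. This is only a sketch: ContentTypesReader and ReadDefaults are hypothetical names, it uses LINQ to XML, it only looks at the Default elements described above, and it assumes extensions should be compared without regard to case.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class ContentTypesReader
{
    private static readonly XNamespace ContentTypesNs =
        "http://schemas.openxmlformats.org/package/2006/content-types";

    // Maps each declared file extension to its content type.
    // Assumption: "JPG" and "jpg" are treated as the same extension.
    public static Dictionary<string, string> ReadDefaults(string contentTypesXml)
    {
        XDocument doc = XDocument.Parse(contentTypesXml);
        var map = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (XElement entry in doc.Root.Elements(ContentTypesNs + "Default"))
        {
            string extension = (string)entry.Attribute("Extension");
            string contentType = (string)entry.Attribute("ContentType");
            if (extension != null && contentType != null)
            {
                map[extension] = contentType;
            }
        }

        return map;
    }

    public static void Main()
    {
        // A cut-down version of the sample shown above
        string xml =
            "<Types xmlns='http://schemas.openxmlformats.org/package/2006/content-types'>" +
            "<Default ContentType='application/vnd.ms-package.xps-fixedpage+xml' Extension='fpage' />" +
            "<Default ContentType='image/jpeg' Extension='JPG' />" +
            "</Types>";

        foreach (KeyValuePair<string, string> pair in ReadDefaults(xml))
        {
            Console.WriteLine("{0} -> {1}", pair.Key, pair.Value);
        }
    }
}
```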

FixedDocumentSequence.fdseq is normally very simple. Not just its name, but also its extension tells us that it is a fixed document sequence file. For an XPS printer driver generated document, it should always look like this:

<FixedDocumentSequence xmlns="http://schemas.microsoft.com/xps/2005/06">
    <DocumentReference Source="/Documents/1/FixedDocument.fdoc" />
</FixedDocumentSequence>

Now we know where to go to find the first part of our document. However, we should have a look within the _rels folder first.

The first file is called .rels; in effect, this is the relationships file for the package root, the level at which the [Content_Types].xml file lives. It would normally look like this:

<?xml version="1.0" encoding="utf-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Target="/FixedDocumentSequence.fdseq" Id="R0"
 Type="http://schemas.microsoft.com/xps/2005/06/fixedrepresentation"/>
<Relationship Target="/Documents/1/Metadata/Page1_Thumbnail.JPG" Id="R1"
 Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail"/>
</Relationships>

You can see that it identifies the FixedDocumentSequence.fdseq file in the root tier and assigns it an arbitrary ID of R0. It also identifies the metadata thumbnail image which will be the thumbnail image for the entire XPS file itself.

Also, in the _rels folder is FixedDocumentSequence.fdseq.rels - it should be fairly obvious what this is the relationships file for:

<?xml version="1.0" encoding="utf-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Target="/Metadata/Job_PT.xml" Id="R0"
    Type="http://schemas.microsoft.com/xps/2005/06/printticket"/>
</Relationships>

Here, the only relationship described is to the metadata PrintTicket file. PrintTickets will also be described in another article. This file will often be the only file in the root tier Metadata folder.

Document Tier

Also in the root there will be the Documents folder. This folder will contain the actual document within the XPS file. When using the .NET v3 XPS Printer Driver, this document (in its own subfolder) is always named "1", although the document can actually have any name. Normally, resources such as fonts and images used within the document will be contained at this tier under Resources.

Under the "1" folder will be FixedDocument.fdoc referred to in FixedDocumentSequence.fdseq above. This file lists out the pages in the order they are to be displayed or printed.
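
To make the chain from one tier to the next concrete, here is a small sketch that pulls the Source references out of a FixedDocumentSequence and a FixedDocument. The contents of FixedDocument.fdoc aren't reproduced in this article, so the PageContent element and the "Pages/1.fpage" name in the sample are assumptions based on the XPS schema; XpsTierWalker, ReadSources, and Resolve are hypothetical names, and the placeholder host in Resolve exists only so that System.Uri can resolve relative references.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XpsTierWalker
{
    private static readonly XNamespace Xps = "http://schemas.microsoft.com/xps/2005/06";

    // Returns the Source attribute of each <elementName> child of the root element
    public static List<string> ReadSources(string xml, string elementName)
    {
        return XDocument.Parse(xml).Root
            .Elements(Xps + elementName)
            .Select(element => (string)element.Attribute("Source"))
            .Where(source => !string.IsNullOrEmpty(source))
            .ToList();
    }

    // Resolves a Source value against the part that contains it, so that
    // "Pages/1.fpage" inside "/Documents/1/FixedDocument.fdoc" becomes
    // "/Documents/1/Pages/1.fpage". Unusual characters come back URI-escaped.
    public static string Resolve(string containingPart, string source)
    {
        var baseUri = new Uri("http://xps.invalid" + containingPart);
        return new Uri(baseUri, source).AbsolutePath;
    }

    public static void Main()
    {
        string fdseq =
            "<FixedDocumentSequence xmlns='http://schemas.microsoft.com/xps/2005/06'>" +
            "<DocumentReference Source='/Documents/1/FixedDocument.fdoc' />" +
            "</FixedDocumentSequence>";

        // Illustrative only: not taken from a real file
        string fdoc =
            "<FixedDocument xmlns='http://schemas.microsoft.com/xps/2005/06'>" +
            "<PageContent Source='Pages/1.fpage' />" +
            "</FixedDocument>";

        foreach (string document in ReadSources(fdseq, "DocumentReference"))
        {
            Console.WriteLine("Document: " + document);
            foreach (string page in ReadSources(fdoc, "PageContent"))
            {
                Console.WriteLine("  Page: " + Resolve(document, page));
            }
        }
    }
}
```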

Page Tier

Finally, each document subfolder will contain a Pages subfolder, and each Pages subfolder has the individual page files. There will also be another _rels folder at this level containing a .rels file corresponding to each .fpage file.

If you open up each page file, you'll see quite plainly how XPS is a restricted subset of XAML with all the Path and Glyphs elements. Don't be surprised though to see the different parts of the page layout seemingly scattered about within the file. As long as there are no z-axis issues (i.e., one element must appear behind another), the XPS Printer Driver pumps out the various elements of the page in the order that suits it.

<FixedPage Width="816" Height="1056" 
    xmlns="http://schemas.microsoft.com/xps/2005/06" xml:lang="und">
    <Glyphs Fill="#ff000000" 
        FontUri="/Documents/1/Resources/Fonts/87850AD7-9FD8-4CF2-9ED3-D635DE0AC70C.odttf" 
        FontRenderingEmSize="22.5173" StyleSimulations="None" 
        OriginX="105.6" OriginY="106.88" 
        Indices="44;81;87;85,42;82,52;71,55;88,55;70,45;87,34;76,27;82,51;81" 
        UnicodeString="Introduction" />
    <Path Data="F1 M 105.6,109.28 L 228.16,109.28 228.16,111.52 105.6,111.52 z" 
        Fill="#ff000000"/>
    <Glyphs Fill="#ff000000" 
        FontUri="/Documents/1/Resources/Fonts/87850AD7-9FD8-4CF2-9ED3-D635DE0AC70C.odttf" 
        FontRenderingEmSize="22.5173" StyleSimulations="None" 
        OriginX="228.16" OriginY="106.88" 
        Indices="3" UnicodeString=" " />
    <Glyphs Fill="#ff000000" 
        FontUri="/Documents/1/Resources/Fonts/BCA29EFB-F86B-4B42-A6B7-754D68DD5A3A.odttf" 
        FontRenderingEmSize="15.0115" StyleSimulations="None"  
        OriginX="105.6" OriginY="132.96"  
        Indices="59,71;51;54;3,34;11,34;59,71;48,91;47,57; ... ;82,48;3" 
        UnicodeString="XPS (XML Paper Specification) is a ... useful alternative to " />
    ...

FixedPage is the root element for all pages. There can be a lot of other elements contained within a FixedPage element, but the XPS printer driver typically leaves us with just Path (graphics) and Glyphs (text) elements.

When it comes to the actual output of Glyphs, the Indices are used in preference to the UnicodeString. I've occasionally found that this has led to some interesting output. The Indices attribute is a list of all the glyphs to be used. If it is present, then there must be a corresponding character in the UnicodeString for each Indices entry. Each entry in the list comprises a glyph ID, optionally followed by a comma and an AdvanceWidth; the entries are separated by semi-colons. There is actually a lot more that could be present in Indices, but this is about the limit of what you'll see being pumped out by the XPS printer driver. If you want Justified, Centered, or Right aligned text, then the Indices attribute is essential; take it out and you end up with simple Left aligned text with no special tricks. There is, however, a special trick to outputting Right aligned text without having to delve into the font files, which I'll cover in another article.
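
The simple form of the Indices attribute described above is easy enough to pick apart. The following is a minimal sketch, not a full parser: GlyphEntry and IndicesParser are hypothetical names, and anything beyond a glyph ID with an optional AdvanceWidth (the extra fields mentioned above) is ignored, or in the case of unexpected text in the glyph ID, will make int.Parse throw.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class GlyphEntry
{
    public int? GlyphIndex;       // null when the entry leaves the glyph ID out
    public double? AdvanceWidth;  // null when no AdvanceWidth is given
}

public static class IndicesParser
{
    // Splits an Indices value such as "44;81;87;85,42" into its entries
    public static List<GlyphEntry> Parse(string indices)
    {
        var entries = new List<GlyphEntry>();
        if (string.IsNullOrEmpty(indices))
        {
            return entries;
        }

        foreach (string raw in indices.Split(';'))
        {
            string[] fields = raw.Split(',');
            var entry = new GlyphEntry();

            if (fields[0].Length > 0)
            {
                entry.GlyphIndex = int.Parse(fields[0], CultureInfo.InvariantCulture);
            }
            if (fields.Length > 1 && fields[1].Length > 0)
            {
                entry.AdvanceWidth = double.Parse(fields[1], CultureInfo.InvariantCulture);
            }

            entries.Add(entry);
        }

        return entries;
    }

    public static void Main()
    {
        // The Indices for "Introduction" from the page extract above
        string indices = "44;81;87;85,42;82,52;71,55;88,55;70,45;87,34;76,27;82,51;81";

        foreach (GlyphEntry entry in Parse(indices))
        {
            Console.WriteLine("glyph {0}, advance {1}",
                entry.GlyphIndex,
                entry.AdvanceWidth.HasValue ? entry.AdvanceWidth.Value.ToString(CultureInfo.InvariantCulture) : "(default)");
        }
    }
}
```

Parsing with the invariant culture matters here, since the attribute always uses a full stop as the decimal separator regardless of the machine's regional settings.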

In the above extract, you can see some of the redundant artifacts that can be "cleaned" out. Within the Data attribute of the Path element, the spaces after the "M" and "L" are not needed, nor is the space before the terminating "z". The Glyphs element that has a UnicodeString of " " is completely unnecessary, and the trailing space at the end of the UnicodeString (and Indices) attribute in the next Glyphs element can also be eliminated. These may not seem like much, but a heavily edited Word document will tend to have a large number of such artifacts that end up in the corresponding XPS; get rid of these, and you can quite often get rid of some of the embedded font files as well, resulting in a massive reduction in file size.
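
The Path clean-up just described boils down to a handful of string replacements. The tool itself does this work in XSLT; the C# below is only an equivalent sketch for illustration, with PathDataCleaner and Clean being made-up names, and it covers only the "M", "L", and "z" cases mentioned here.

```csharp
using System;

public static class PathDataCleaner
{
    // Removes the whitespace that isn't needed in a Path Data attribute:
    // runs of spaces, the space after "M" and "L", and the space before "z".
    public static string Clean(string pathData)
    {
        string result = pathData;

        // Collapse runs of spaces down to a single space first
        while (result.Contains("  "))
        {
            result = result.Replace("  ", " ");
        }

        return result
            .Replace(" M ", " M")
            .Replace(" L ", " L")
            .Replace(" z", "z");
    }

    public static void Main()
    {
        // The Data attribute from the page extract above
        string data = "F1 M 105.6,109.28 L 228.16,109.28 228.16,111.52 105.6,111.52 z";
        Console.WriteLine(Clean(data));
    }
}
```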

Other redundant artifacts can be identified by comparing all of the files within the XPS file, looking for duplicates and keeping only one copy of any that are found. Later, the files that refer to the duplicate copies can have those references altered to point to the original.

Speaking of the obfuscated font files, these are really extracts from the full font file of only the characters needed for your document. This can get interesting when you want to programmatically output some XPS (without using the .NET XPS methods) and find some of your characters have mysteriously disappeared.

Using the Code

This is a simple console application designed to be executed from your command line. Pass it the name of the XPS file you want cleaned. It will describe the steps it's going through as it progresses, and then finally, leave you with an output file with "-clean" appended to the filename.

Please read "Other Stuff" at the bottom of this article as you will need to get the ICSharpCode zip library to make this all work and I haven't put its DLL into the Zip.

It would be trivial to convert this simple application to a service or DLL.

This code should really be thought of as an XML pipeline, and in fact much of its operation could be changed to pass the constituent documents through as streams from one step to the next rather than using the intermediate files as I have here. However, I've structured it this way so that you can comment out the code that deletes the intermediate files and then go in and have a look inside them.

Also, having cleaned out a lot of the unnecessary artifacts, the resulting parts that make up the "cleaned" version of the files tend to make more sense.

How it Works

First of all, the application loads up the four XSLTs that do most of the actual work.

// Load up the Cleanup XSLT
XslCompiledTransform cleanupXSLT = new XslCompiledTransform();
cleanupXSLT.Load("Resources\\XPSCleaner.xsl");
Console.WriteLine("Cleanup XSLT Loaded.");

// Load up the Resource Relationships XSLT
XslCompiledTransform relsXSLT = new XslCompiledTransform();
relsXSLT.Load("Resources\\XPSRels.xsl");
Console.WriteLine("Resource Relationships XSLT Loaded.");

// Load up the Resource Relationship Listing XSLT
XslCompiledTransform relRefsXSLT = new XslCompiledTransform();
relRefsXSLT.Load("Resources\\XPSRelRefs.xsl");
Console.WriteLine("Resource Relationship Listing XSLT Loaded.");

// Load up the References XSLT
XslCompiledTransform referencesXSLT = new XslCompiledTransform();
referencesXSLT.Load("Resources\\XPSReferences.xsl");
Console.WriteLine("References XSLT Loaded.");

Next, the original XPS file is opened up and each file is compared with every other file of the same type and size in an effort to identify duplicates. These duplicates will be dumped as the cleaned version of the XPS is built up, and any references to them in other files will also be altered. This code isn't that elegant, but it does the job.

// Duplicate files will be dropped and references to them
// altered to point to the 'original' 
foreach (ZipEntry ze1 in zf)
{
    string ze1NewName = ze1.Name.Replace("Documents/1/", "Documents/2/");
    // Skip this file if we've already identified it as a duplicate
    if (dupFiles.ContainsKey(ze1NewName))
        continue;

    // Go back through the list to identify any duplicates
    foreach (ZipEntry ze2 in zf)
    {
        // Ready the stream for the 'original' file
        using (Stream zs1 = zf.GetInputStream(ze1))
        {
            string ze2NewName = ze2.Name.Replace("Documents/1/", 
                                                 "Documents/2/");

            // Skip this file if it happens to be the same one
            // or is not the same type (extension)
            // or are of differing file sizes
            if (ze1NewName == ze2NewName ||
                Path.GetExtension(ze1NewName) != Path.GetExtension(ze2NewName) ||
                ze1.Size != ze2.Size)
                continue;

            bool isEqual = true;

            // Ready some small buffers for the comparison
            byte[] buffer1 = new byte[4096];
            byte[] buffer2 = new byte[4096];
            int sourceBytes1;
            int sourceBytes2;

            // Now open up the two files and check if they are the same
            using (Stream zs2 = zf.GetInputStream(ze2))
            {
                // Using a fixed size buffer here makes no noticeable difference 
                // for performance but keeps a lid on memory usage.
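                do
                {
                    sourceBytes1 = zs1.Read(buffer1, 0, buffer1.Length);
                    sourceBytes2 = zs2.Read(buffer2, 0, buffer2.Length);

                    // If filesize can be relied on (and both streams hand
                    // back the same number of bytes for each read)
                    // this test should never fire
                    if (sourceBytes1 != sourceBytes2)
                    {
                        isEqual = false;
                        break;
                    }

                    // Compare only the bytes actually read this time round;
                    // anything beyond that is left over from an earlier read
                    for (int i = 0; i < sourceBytes1; i++)
                    {
                        if (buffer1[i] != buffer2[i])
                        {
                            isEqual = false;
                            break;
                        }
                    }

                } while (sourceBytes1 > 0 && isEqual);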
            } 

            if (isEqual)
            {
                // This file must be identified as a duplicate
                dupFiles.Add(ze2NewName, ze1NewName);
            }
        }
    }
}
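
One thing to watch with a comparison like this is that Stream.Read is allowed to return fewer bytes than were asked for, and two decompression streams won't necessarily return the same amount on each call. A more defensive variation, sketched here with the hypothetical names StreamComparer, ReadFully, and StreamsEqual, keeps reading until each buffer is full (or the stream ends) before comparing.

```csharp
using System;
using System.IO;

public static class StreamComparer
{
    // Reads until the buffer is full or the stream ends;
    // returns the number of bytes placed in the buffer.
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
            {
                break; // end of stream
            }
            total += read;
        }
        return total;
    }

    // True when both streams yield exactly the same bytes
    public static bool StreamsEqual(Stream first, Stream second)
    {
        byte[] buffer1 = new byte[4096];
        byte[] buffer2 = new byte[4096];

        while (true)
        {
            int count1 = ReadFully(first, buffer1);
            int count2 = ReadFully(second, buffer2);

            if (count1 != count2)
            {
                return false; // one stream is longer than the other
            }
            if (count1 == 0)
            {
                return true;  // both streams ended without a difference
            }

            for (int i = 0; i < count1; i++)
            {
                if (buffer1[i] != buffer2[i])
                {
                    return false;
                }
            }
        }
    }

    public static void Main()
    {
        using (var a = new MemoryStream(new byte[] { 1, 2, 3 }))
        using (var b = new MemoryStream(new byte[] { 1, 2, 3 }))
        {
            Console.WriteLine(StreamsEqual(a, b));
        }
    }
}
```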

Then, the actual cleaning phase begins with each file in the XPS that's not some kind of resource or metadata file processed in turn, being put into the output XPS file once it's been worked on. One common change that's applied is to 'move' all the files and references from document '1' to document '2'. Doing this sort of thing makes it a lot easier to merge one XPS file, produced by the XPS printer driver, with another later on.

The page files (.fpage) are passed through the cleanup XSLT to remove the redundant references and do some of the other tweaks; their corresponding .rels files are also regenerated from this 'cleaned' page file. This file in turn is processed to build up a list of resources and metadata actually used.

// Clean up the .fpage file itself
// Also alters references to 'duplicate' files 
string entryFileName = CopyAndCleanFile(baseFileName, processingFileName, 
                                        cleanupXSLT, zf, ze, s);

// Determine the temporary file names we'll use
string relsFileName = baseFileName + "Rels." + 
                      Path.GetFileName(processingFileName);
string processingRelsFileName = 
    processingFileName.Replace("Documents/1/Pages/", 
    "Documents/2/Pages/_rels/") + ".rels";

// Generate the Associated .rels file (removing any redundant references)
relsXSLT.Transform(entryFileName, relsFileName);
Console.WriteLine("{0} has been generated.", processingRelsFileName);

// Delete the cleaned file
File.Delete(entryFileName);

// Do a search and replace for each of the 'duplicate' file references
ReplaceReferencesToDuplicates(relsFileName, dupFiles);

// Add the generated rels entry to the new zip
AddZipEntry(processingRelsFileName, s, relsFileName);

// Identify the actual resources needed
relRefsList = IdentifyRels(relRefsList, relRefsXSLT, relsFileName);
Console.WriteLine("{0} Resources have been listed.", 
                  processingRelsFileName);

// Delete the rels file
File.Delete(relsFileName);

Below is the XSLT that does most of this cleanup work on the page file itself. The existing XPS methods in .NET 3 are focused around the simple generation of XPS output. To actually manipulate it requires switching to something like XSLT.

Just a note, these XSLTs are specifically set up to accommodate a Microsoft XSLT quirk that dates back at least to MSXML 3. Within each template, each element being created must have the correct namespace declared (unless it's being created inside another element), which will be discarded by the MS XSLT processor when it realises it doesn't need it. If you don't have a namespace declaration, the MS XSLT processor will insert an empty namespace declaration (xmlns="") in your element, which really tends to screw things up quite nicely.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://schemas.microsoft.com/xps/2005/06"
    exclude-result-prefixes="x">

    <xsl:output indent="yes" method="xml" 
       encoding="utf-8" omit-xml-declaration="yes"/>

    <xsl:template match="/">
        <!-- Work our way through every element -->
        <xsl:apply-templates select="*"/>
    </xsl:template>

    <xsl:template match="x:Glyphs">
        <!-- Include Glyphs that aren't all whitespace -->
        <xsl:if test="string-length(normalize-space(@UnicodeString)) &gt; 0">
            <Glyphs xmlns="http://schemas.microsoft.com/xps/2005/06">
                <xsl:apply-templates select="@*"/>
            </Glyphs>
        </xsl:if>
    </xsl:template>

    <xsl:template match="*">
        <!--  General processing for all other elements -->
        <xsl:element name="{name(.)}" 
            namespace="http://schemas.microsoft.com/xps/2005/06">
            <xsl:apply-templates select="@*"/>
            <xsl:choose>
                <xsl:when test="count(*) &gt; 0">
                    <xsl:apply-templates select="*"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@Data[name(..) = 'Path']">
        <!-- Clean up the Data attribute of Path elements -->
        <xsl:attribute name="Data">
            <xsl:call-template name="CleanPath">
                <xsl:with-param name="pathData" select="."/>
            </xsl:call-template>
        </xsl:attribute>
    </xsl:template>

    <xsl:template match="@UnicodeString">
        <!-- Clean up the UnicodeString attribute -->
        <xsl:attribute name="UnicodeString">
            <xsl:call-template name="CleanUnicodeString">
                <xsl:with-param name="unicodeString" select="."/>
            </xsl:call-template>
        </xsl:attribute>
    </xsl:template>

    <xsl:template match="@Indices">
        <!-- Clean up the Indices attribute, 
        removing indices for redundant whitespace from the end -->
        <!-- The source string is first of all reversed,
        then redundant indices removed from what is now the 'front' of the string,
        then it's all reversed back again -->
        <xsl:attribute name="Indices">
            <xsl:call-template name="StringReverse">
                <xsl:with-param name="string">
                    <xsl:call-template name="CleanIndices">
                        <xsl:with-param name="indices">
                            <xsl:call-template name="StringReverse">
                                <xsl:with-param name="string" select="."/>
                            </xsl:call-template>
                        </xsl:with-param>
                    </xsl:call-template>
                </xsl:with-param>
            </xsl:call-template>
        </xsl:attribute>
    </xsl:template>

    <xsl:template match="@*">
        <!-- General processing for all other attributes -->
        <xsl:attribute name="{name(.)}">
            <xsl:choose>
                <xsl:when test="starts-with(., '/Documents/1/Resources/Fonts/')">
                    <!-- Move the fonts down to the XPS root -->
                    <xsl:value-of select="substring-after(., '/Documents/1')"/>
                </xsl:when>
                <xsl:when test="starts-with(., '/Documents/1')">
                    <!-- Move everything else to document number 2 - because we can -->
                    <xsl:value-of select="concat('/Documents/2', 
                        substring-after(., '/Documents/1'))"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:attribute>
    </xsl:template>

    <xsl:template name="CleanPath">
        <!-- Clean path data eliminating redundant whitespace -->
        <xsl:param name="pathData" select="''"/>

        <xsl:choose>
            <xsl:when test="contains($pathData, '  ')">
                <xsl:call-template name="CleanPath">
                    <xsl:with-param name="pathData"
                        select="concat(substring-before($pathData, '  '), ' ', 
                            substring-after($pathData, '  '))"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:when test="contains($pathData, ' M ')">
                <xsl:call-template name="CleanPath">
                    <xsl:with-param name="pathData"
                        select="concat(substring-before($pathData, ' M '), ' M', 
                            substring-after($pathData, ' M '))"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:when test="contains($pathData, ' L ')">
                <xsl:call-template name="CleanPath">
                    <xsl:with-param name="pathData"
                        select="concat(substring-before($pathData, ' L '), ' L', 
                            substring-after($pathData, ' L '))"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:when test="contains($pathData, ' z')">
                <xsl:call-template name="CleanPath">
                    <xsl:with-param name="pathData"
                        select="concat(substring-before($pathData, ' z'), 'z', 
                            substring-after($pathData, ' z'))"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="$pathData"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

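    <xsl:template name="CleanUnicodeString">
        <!-- Clean unicode string removing redundant whitespace from the end -->
        <xsl:param name="unicodeString" select="''"/>

        <xsl:choose>
            <xsl:when test="substring($unicodeString, string-length($unicodeString), 1) = ' '">
                <!-- Strip one trailing space, then check again -->
                <xsl:call-template name="CleanUnicodeString">
                    <xsl:with-param name="unicodeString" select="substring($unicodeString, 1, 
                        string-length($unicodeString) - 1)"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="$unicodeString"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>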

    <xsl:template name="CleanIndices">
        <!-- Clean indices removing redundant whitespace references
        from the end (reversed to be at the beginning) -->
        <xsl:param name="indices" select="''"/>

        <xsl:choose>
            <xsl:when test="starts-with($indices, '3;')">
                <!-- Strip off simple spaces -->
                <xsl:call-template name="CleanIndices">
                    <xsl:with-param name="indices" 
                        select="substring-after($indices, '3;')"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:when test="contains($indices, ',') and
                        not(contains(substring-before($indices, ','), ';')) and
                        starts-with(substring-after($indices, ','), '3;') ">
                <!-- Strip off spaces with a size -->
                <xsl:call-template name="CleanIndices">
                    <xsl:with-param name="indices" 
                        select="substring-after($indices, '3;')"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="$indices"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <xsl:template name="StringReverse">
        <!-- Take any given string and reverse it -->
        <xsl:param name="string"/>

        <xsl:variable name="len" select="string-length($string)"/>

        <xsl:choose>
            <xsl:when test="$len &lt; 2">
                <xsl:value-of select="$string"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:call-template name="StringReverse">
                    <xsl:with-param name="string" 
                        select="substring($string, $len div 2 + 1, $len div 2)"/>
                </xsl:call-template>
                <xsl:call-template name="StringReverse">
                    <xsl:with-param name="string" 
                        select="substring($string, 1, $len div 2)"/>
                </xsl:call-template>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>

The above XSLT is primarily focussed around identifying redundant whitespace and eliminating that. What this occasionally leads to is a situation where a particular font file is no longer needed, and it's this situation where we can really reduce the size of the XPS file.

I could have added a call to the XSLT document() function to include the list of duplicate files (formatted in XML) and use them in the processing. However, this requires making further changes to how the compiled XSLT is loaded, because document() is treated as a potential security risk and is disabled by default, and also substantial changes to the XSLT itself for it to identify the references to the 'duplicates' and replace them with a reference to the 'original'. I opted for a simpler solution, from a coding perspective, to just do a search and replace, line by line, on the output from the above XSLT.

The next XSLT to be run regenerates the .rels file for us from the 'cleaned' fpage file, in effect throwing away the references to now redundant resources and/or metadata.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://schemas.microsoft.com/xps/2005/06"
    exclude-result-prefixes="x">

    <xsl:output indent="yes" method="xml" 
       encoding="utf-8" omit-xml-declaration="yes"/>

    <xsl:key name="resourceKey" match="//@*[starts-with(., '/Resources/Fonts/') 
        or starts-with(., '/Documents/2/Resources/Images/') 
        or starts-with(., '/Documents/2/Metadata/')]" use="."/>

    <xsl:template match="/">
        <Relationships 
            xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
            <!-- Work our way through every unique resource attribute 
                using the Muenchian method -->
            <xsl:apply-templates 
                select="//@*[contains(., '/Resources/') or contains(., '/Metadata/')]
                    [generate-id() = generate-id(key('resourceKey', .))]"/>
            <!-- Add in a reference for the printticket 
                as this won't be found in the source page files -->
            <Relationship Type="http://schemas.microsoft.com/xps/2005/06/printticket" 
                Target="/Documents/2/Metadata/Page1_PT.xml">
                <xsl:attribute name="Id">
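                    <!-- Count every matching attribute and add one, so this Id
                        can never clash with the R1..Rn Ids generated below -->
                    <xsl:value-of 
                        select="concat('R', count(//@*[starts-with(., '/Resources/Fonts/') 
                            or starts-with(., '/Documents/2/Resources/Images/') 
                            or starts-with(., '/Documents/2/Metadata/')]) + 1)"/>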
                </xsl:attribute>
            </Relationship>
        </Relationships>
    </xsl:template>

    <xsl:template match="@*">
        <!-- List out the resource identifier -->
        <Relationship Type="http://schemas.microsoft.com/xps/2005/06/required-resource" 
            xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
            <xsl:attribute name="Target">
                <xsl:value-of select="."/>
            </xsl:attribute>
            <xsl:attribute name="Id">
                <xsl:value-of select="concat('R', position())"/>
            </xsl:attribute>
        </Relationship>
    </xsl:template>
</xsl:stylesheet>

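Well, it wouldn't be a real project involving XSLT unless the Muenchian method made an appearance now, would it? The key indexes every resource attribute by its value, and the generate-id() comparison keeps only the first attribute found for each value, so this XSLT produces only one Relationship element for each unique resource.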
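To see the de-duplication at work, here is a minimal sketch that runs a cut-down version of the stylesheet through XslCompiledTransform. The class name MuenchianDemo, the sample page, and the font file names are made up for illustration. The stylesheet keeps only the font prefix and emits text rather than Relationship elements.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class MuenchianDemo
{
    // Cut-down stylesheet: the same key + generate-id() test as above,
    // but limited to font references and emitting plain text.
    const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:key name='resourceKey'
      match='@*[starts-with(., ""/Resources/Fonts/"")]' use='.'/>
  <xsl:template match='/'>
    <xsl:for-each select='//@*[starts-with(., ""/Resources/Fonts/"")]
        [generate-id() = generate-id(key(""resourceKey"", .))]'>
      <xsl:value-of select='concat(""R"", position(), ""="", ., "";"")'/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical page: two Glyphs elements share font A, one uses font B.
    public const string SamplePage =
@"<FixedPage xmlns='http://schemas.microsoft.com/xps/2005/06'>
  <Glyphs FontUri='/Resources/Fonts/A.odttf' UnicodeString='Hello'/>
  <Glyphs FontUri='/Resources/Fonts/B.odttf' UnicodeString='World'/>
  <Glyphs FontUri='/Resources/Fonts/A.odttf' UnicodeString='Again'/>
</FixedPage>";

    // Runs the stylesheet over the page and returns its text output.
    public static string ListUniqueFonts(string pageXml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        var output = new StringWriter();
        using (var pageReader = XmlReader.Create(new StringReader(pageXml)))
        {
            xslt.Transform(pageReader, null, output);
        }
        return output.ToString();
    }

    public static void Main()
    {
        // Font A is listed once even though two Glyphs elements refer to it.
        Console.WriteLine(ListUniqueFonts(SamplePage));
    }
}
```

Because position() is evaluated against the already filtered node list, the Ids run R1, R2, ... with no gaps, which is what the full stylesheet relies on.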

Another XSLT works with all the other files that need their references to various resources adjusted because we're moving everything from "1" to "2".

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:x="http://schemas.microsoft.com/xps/2005/06" 
    xmlns:r="http://schemas.openxmlformats.org/package/2006/relationships" 
    exclude-result-prefixes="x r">

    <xsl:output indent="yes" method="xml" 
       encoding="utf-8" omit-xml-declaration="yes"/>

    <xsl:template match="/">
        <xsl:apply-templates select="*"/>
    </xsl:template>

    <xsl:template match="r:Relationships">
        <!-- Processing for the 'primary' elements of page related .rels files -->
        <!-- Actually this particular template should never be invoked,
            it's here as an 'insurance' policy against 'maintenance' -->
        <Relationships 
        xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="*[not(contains(@Target, '/Fonts/'))]"/>
        </Relationships>
    </xsl:template>

    <xsl:template match="r:Relationship">
        <!-- Processing for the 'primary' elements of other .rels files -->
        <Relationship 
        xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
            <xsl:apply-templates select="@*"/>
        </Relationship>
    </xsl:template>

    <xsl:template match="x:FixedDocument|x:FixedPage|x:FixedDocumentSequence">
        <!-- Processing for the 'primary' elements for other than .rels files -->
        <xsl:element name="{name(.)}" 
            namespace="http://schemas.microsoft.com/xps/2005/06">
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="*"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="*">
        <!-- Processing for all other elements -->
        <xsl:element name="{name(.)}" 
            namespace="http://schemas.microsoft.com/xps/2005/06">
            <xsl:apply-templates select="@*"/>
            <xsl:choose>
                <xsl:when test="count(*) &gt; 0">
                    <!-- If there are sub-elements process these -->
                    <xsl:apply-templates select="*"/>
                </xsl:when>
                <xsl:otherwise>
                    <!-- If there are no sub-elements 
                    then just take the contents of this element -->
                    <xsl:value-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*">
        <!-- Processing for all attributes -->
        <xsl:attribute name="{name(.)}">
            <xsl:choose>
                <xsl:when test="starts-with(., '/Documents/1/Resources/Fonts/')">
                    <!-- Alter font references to point to the 'root' resources folder -->
                    <xsl:value-of select="substring-after(., '/Documents/1')"/>
                </xsl:when>
                <xsl:when test="starts-with(., '/Documents/1')">
                    <!-- Alter all other document references to point to document '2' -->
                    <xsl:value-of select="concat('/Documents/2', 
                        substring-after(., '/Documents/1'))"/>
                </xsl:when>
                <xsl:otherwise>
                    <!-- Leave all other references alone -->
                    <xsl:value-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:attribute>
    </xsl:template>
</xsl:stylesheet>
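The attribute template above boils down to three cases. As a rough sketch, the same mapping can be written in C#, which is handy for checking what a given reference should look like after the transform. ReferenceRewriter and RewriteReference are hypothetical names and are not part of the project.

```csharp
using System;

public static class ReferenceRewriter
{
    const string OldDoc = "/Documents/1";
    const string NewDoc = "/Documents/2";

    // Mirrors the xsl:choose in the attribute template.
    public static string RewriteReference(string value)
    {
        // Fonts are lifted out of the document folder into the 'root' resources folder
        if (value.StartsWith(OldDoc + "/Resources/Fonts/", StringComparison.Ordinal))
            return value.Substring(OldDoc.Length);

        // All other document references move from document '1' to document '2'.
        // Like the XSLT, this is a plain prefix test, so it would also match
        // a path such as /Documents/10/...
        if (value.StartsWith(OldDoc, StringComparison.Ordinal))
            return NewDoc + value.Substring(OldDoc.Length);

        // Everything else is left alone
        return value;
    }

    public static void Main()
    {
        Console.WriteLine(RewriteReference("/Documents/1/Resources/Fonts/A.odttf"));
        Console.WriteLine(RewriteReference("/Documents/1/Pages/1.fpage"));
        Console.WriteLine(RewriteReference("/FixedDocumentSequence.fdseq"));
    }
}
```

The order of the tests matters: the font check has to come first, otherwise the more general document '1' rule would catch font paths as well.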

The final XSLT actually produces text output. This one is designed to read all of the .rels files (those for each page, and the other one in the 'root' _rels folder, plus any others) and simply generate a listing that we can process to determine what resources and metadata files we really need.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://schemas.openxmlformats.org/package/2006/relationships"
    exclude-result-prefixes="r">

    <xsl:output indent="yes" method="text" 
               encoding="utf-8" omit-xml-declaration="yes"/>

    <xsl:template match="/">
        <!-- List all the relationship 'targets', i.e. the resource and metadata files -->
        <xsl:for-each select="r:Relationships/r:Relationship/@Target">
            <xsl:value-of select="."/>
            <xsl:value-of select="' '"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

The output from this last XSLT is the only one we don't pump out to a temporary file. Instead, it is streamed into a StringBuilder, which is later processed into a list.
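A minimal sketch of that step, assuming the transform is run with XslCompiledTransform and the text lands in a StringWriter wrapped around the StringBuilder. RelsTargetLister and its members are hypothetical names, the sample .rels content is made up, and dropping repeated targets while building the list is my assumption rather than something the project is known to do. Splitting on whitespace assumes the targets themselves contain no spaces.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class RelsTargetLister
{
    // The listing stylesheet from above, reduced to its essentials.
    const string ListingXslt = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:r='http://schemas.openxmlformats.org/package/2006/relationships'>
  <xsl:output method='text'/>
  <xsl:template match='/'>
    <xsl:for-each select='r:Relationships/r:Relationship/@Target'>
      <xsl:value-of select='.'/>
      <xsl:value-of select='"" ""'/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical .rels content for one page.
    public const string SampleRels =
@"<Relationships xmlns='http://schemas.openxmlformats.org/package/2006/relationships'>
  <Relationship Id='R1' Target='/Resources/Fonts/A.odttf'
      Type='http://schemas.microsoft.com/xps/2005/06/required-resource'/>
  <Relationship Id='R2' Target='/Documents/2/Metadata/Page1_PT.xml'
      Type='http://schemas.microsoft.com/xps/2005/06/printticket'/>
</Relationships>";

    public static XslCompiledTransform LoadListingTransform()
    {
        var xslt = new XslCompiledTransform();
        using (var reader = XmlReader.Create(new StringReader(ListingXslt)))
        {
            xslt.Load(reader);
        }
        return xslt;
    }

    // Appends the text output for one .rels file to the shared StringBuilder.
    public static void AppendTargets(XslCompiledTransform xslt, string relsXml, StringBuilder sink)
    {
        using (var reader = XmlReader.Create(new StringReader(relsXml)))
        using (var writer = new StringWriter(sink))
        {
            xslt.Transform(reader, null, writer);
        }
    }

    // Splits the accumulated text into a list, keeping the first copy of each target.
    public static List<string> ToTargetList(StringBuilder sink)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var targets = new List<string>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var target in sink.ToString().Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (seen.Add(target))
                targets.Add(target);
        }
        return targets;
    }

    public static void Main()
    {
        var xslt = LoadListingTransform();
        var sink = new StringBuilder();
        AppendTargets(xslt, SampleRels, sink);   // one call per .rels file
        foreach (var target in ToTargetList(sink))
            Console.WriteLine(target);
    }
}
```

Collecting everything in one StringBuilder means each .rels file is transformed independently, and the list is only built once all of them have been read.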

That then brings up stage 3 of processing the original XPS file. In the third pass, the files we worked with in the second stage are skipped (their processed output is already in the new XPS file); instead, it picks up all the resource and metadata files and, using the above list, puts them into the right places in the new XPS file. Any 'duplicate' files are tossed (ignored), and finally, any remaining files are copied across as they are.

// If this is a 'duplicate' then we just dump it
// As you can see this code fragment is from inside a loop
if (dupFiles.ContainsKey(processingFileName))
    continue;

if ((processingFileName.StartsWith("Documents/1/Pages") 
    && processingFileName.EndsWith(".fpage")) ||
    (processingFileName.EndsWith(".fpage.rels")))
{
    // Skip these - we've already processed them
}
else if (processingFileName.StartsWith("Documents/1/Resources/Fonts"))
{
    #region Resource files that require 'moving' to the 'root' Resources folder
    string newFileName = processingFileName.Replace("Documents/1/", "");
    if (relsFileNames.Contains(newFileName))
    {
        Console.WriteLine("XPS file entry '{0}' moving to {1}", 
            processingFileName, newFileName);
        CopyZipEntry(ze.Name.Replace("Documents/1/", ""), s, zf, ze);
    }
    #endregion
}
else if (processingFileName.StartsWith("Documents/1/") || 
    processingFileName.Contains("_rels/") || 
    processingFileName.EndsWith(".fdseq"))
{
    // Identify the files that were cleaned up
    bool bTransformRequired = (processingFileName.EndsWith(".rels") || 
        processingFileName.EndsWith(".fdoc") || 
        processingFileName.EndsWith(".fdseq"));

    if (!bTransformRequired)
    {
        #region Files that only require 'moving' to document '2'
        string newFileName = 
          processingFileName.Replace("Documents/1/", "Documents/2/");
        if (relsFileNames.Contains(newFileName))
        {
            Console.WriteLine("XPS file entry '{0}' moving to {1}", 
            processingFileName, newFileName);
            CopyZipEntry(ze.Name.Replace("Documents/1/", 
                         "Documents/2/"), s, zf, ze);
        }
        #endregion
    }
}
else
{
    #region Files we just put in the same place in the new zip
    Console.WriteLine("XPS file entry '{0}' transferred as is", processingFileName);
    CopyZipEntry(ze.Name, s, zf, ze);
    #endregion
}

With all the files moved into their new places (and redundant/duplicate ones silently dropped), the new XPS is closed and the program has finished 'cleaning up' the original XPS.

Other Stuff

This application uses the ICSharpCode SharpZipLib library to do its zip file packing and unpacking (http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx). It's not included in the project, so you'll need to download it separately.

I also used Stylus Studio for the XSLT coding (http://www.stylusstudio.com), although the XSLT in this particular application is pretty trivial.

I also strongly recommend reading the official XPS spec from Microsoft (http://www.microsoft.com/whdc/xps/downloads.mspx) and obtaining the sample XPS documents (http://www.microsoft.com/whdc/XPS/XpsSamples.mspx) from which I gained more than a few insights. Also consult the official team blog (http://blogs.msdn.com/xps/default.aspx) and Feng Yuan's blog (http://blogs.msdn.com/fyuan/default.aspx).

I'd also recommend:

I'll try not to duplicate too much of the work of all these people.

Along with this, I also recommend getting a copy of the IsXPS.exe test tool, which you'll find in the Windows Driver Kit (WDK).

Other Parts

History

  • 2008-04-19: First version completed.
  • 2008-04-21: Added some more recommended reading.
  • 2008-04-28: Added an additional processing stage to identify duplicate files and eliminate them.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Lee Humphries
Founder md8n
Australia
If it ain't broke - that can be arranged.

Last Updated 3 Aug 2008
Article Copyright 2008 by Lee Humphries