Update – February 2012
One of my perennial wishlist items for this application was to add page view
support. The application as it was in version 1.01 had the functionality I needed, but
it was cumbersome in some ways. You
needed to know exactly which pages you wanted to save or delete, because you
couldn’t see the pages "live". Recently,
I found another article here on Code Project entitled "Show
thumbnails of XPS documents" by Pravesh Soni [you can find his blog here :
http://pravesh.me/]. I believed I could adapt what he had written
as a proof-of-concept in C# to give my application the
functionality I had always wanted but didn’t know how to build. I took pieces of his code, converted
the parts I needed to VB, and tidied up some of the display functionality. The
application is now more user-friendly than ever before. It is still not perfect (I
did the displays in a fairly rough fashion), but it is definitely an improvement.
When a page is selected in the list box, a small thumbnail is shown between
the list box and the other controls (see first picture above). If that thumbnail is clicked, a larger
version is shown (as shown in the second picture above). If the larger version is clicked, it
disappears and the other controls become visible again. If an XPS file has only a
single page, the application states that fact (there’s nothing else for it
to do with a one-page document) and also allows a picture of that page
to be displayed. The sizes of all the
displays were chosen essentially at random, based on what fit on the
application’s display window. As a
result, the full picture of a page from the XPS file is usually not shown. Sizings could
conceivably be tweaked further to present a better display. However, having *some* kind of picture of
each page should be extremely helpful in the page selection process for the
other operations available.
Truth be told, with the far greater support for PDF now available in Windows
7 (and Windows Vista, which if you read further was my original reason for
developing the app), I don’t use XPS all that much anymore, although I still do
from time-to-time. I don’t think the
format ever gained all that much traction.
Yet I am a little surprised that I still haven’t found any
freeware tools to perform this kind of file manipulation. That said, several
additional commercial products that perform these manipulations have surfaced (aside from NiXPS,
which existed even back in 2008 when I originally wrote this). Among them: Split
XPS Merge from XPSDev ($24.90) and XPS Split and Merge from
CAD-KAS ($29). Obviously, my app here is
nowhere near the level of these professional apps [and I still never did figure
out how to merge files successfully].
It has also been noted in a comment from SoAna in
late 2010 that MS Word-created XPS files are not handled correctly by my
application. When I dissected those
XPS files, I found that they are structured differently from those created by
the Microsoft XPS Document Writer, so the logic to process them
would have to be updated accordingly. I
worked on it for a while but never solved it, and the project got buried. Maybe I will get back to it someday.
I have not made any substantive changes to the remainder of the article
below, except to line out the "Future Enhancement" wishlist
item for thumbnail support and add a newly known shortcoming concerning Microsoft
Word-created XPS documents. I changed
none of the XPS processing code or split/re-ordering functionality, either as
described here or in the project itself.
The difference between v1.01 and v2.01 is the addition of the image
viewing code and its incorporation into the project. The background details of the image creation code
are better covered in the above-referenced article by Pravesh Soni here on Code Project. I have tweaked his code to fit the needs here,
but the concepts came from his article.
Introduction
There are (to the best of my knowledge) no freeware tools which allow page
manipulation of XPS files — the PDF-like Microsoft open format XML Paper
Specification. Simple tasks like removing pages or changing the order of the
existing pages in the displayed file can be performed by hand, but the process
takes some time and some know-how, and can be frustrating if not done precisely.
For reasons I'll describe in "Background", XPS is my most
commonly-used output format for file printing, and I have long wanted to have
the ability to perform these kinds of simple page manipulations quickly and
easily. For a LONG time, I have searched for tools to do these simple tasks
(similar tools do exist for the comparable PDF file type...), but I have not
found a tool to manipulate XPS files in this way. Now, finally, standing on the
shoulders of others who have walked before me, I offer up a version of just
such a simple tool, as well as the utility package you can use to perform
these operations yourself if you're not crazy about my particular GUI stylings...
There are many good detailed articles on the internals of the XPS structure,
so by and large I will avoid most of that discussion here. For example, Lee Humphries has a couple of good XPS
articles here on Code Project, and in one of those, he references a blog with a
lot of good XPS-related information. The first of this series of articles is
"Dissecting XPS, part 1 - The basics".
Background
There are numerous freeware PDF-writing solutions out there, and by now I
presume most, if not all, support Vista. However, I was an early Vista adopter,
and back in the early days of Vista, none of the PDF writing tools were
compatible with the fledgling OS. Previously, I had extensively used the tool PrimoPDF [PrimoPdf.com]
to do my document printing, but there was a significant delay following Vista's
release before they were able to bring a compatible solution to bear. So, as I
wanted/needed to continue printing even though I had moved to the new OS, I
gravitated to the only PDF replacement which WAS available... the default
Microsoft XPS Document Writer already incorporated in Vista. Naturally, as the
only solution, the XML Paper Specification (XPS) format became pretty
intriguing at the time.
My long-time major complaint with this solution, however, was that while
numerous free PDF-editing tools exist, none did for the relatively-new XPS
format. Typically most of the printing for which I had used PDF in the past
(and thus began using XPS when PDF became unavailable) was to print webpages — pages I wanted to print and keep, but not
necessarily print to hard-copy paper. And while there were other edits I'd have
been interested in making from time-to-time, the majority of my desired
manipulation was to remove pages at the end of the webpage printouts that were
of no value — page footers, ads at the bottom of the webpage, etc., that bled
over and made the resulting file larger than necessary with irrelevant data. I
was frustrated by the inability to trim down excess pages at the end of my
printed files — pages with no real data, just spillover garbage.
Every once in a while I would go online and research it anew, but no tools
for XPS ever appeared. I would back-burner the query away, and come back months
later to search again, only to find repeatedly that still no tools had been
created. To me, this was a major drawback of the XPS format, but by this time
I'd gotten used to XPS, and I actually kind of liked the format. Also, amidst
frequent machine rebuilds, it IS nice to have the software right there,
available right away, already resident on the machine without having to
re-install printing software too. I now find that even when I do re-install PrimoPDF (it has since become Vista-compatible), I pretty
much stick with XPS as my default printing format.
Yet, time and again, the inability to manipulate these files as I wanted (as
I could with PDF) has been a frustration.
Truth be told, there actually is a full-featured commercial XPS-editing
application available: NiXPS [NiXPS.com]. I'm sure it is a mature
application, as it has been around for several years now. It also carries a
fairly hefty price tag (~$400, I believe). In addition, the things I imagine
you can do with it are way beyond the scope of my personal needs. If I actually
needed to do anything official or important with these files, maybe I could
find a way to justify it... but as a hobbyist, there's just not that kind of
need. I figured someone would put out a freeware toolset to do the kinds of
minor page manipulations I was interested in at some point, but it has just
never happened.
Development
One positive attribute of XPS is that it's an open format. The spec is
freely available, and you can actually work with the internals of the format
without being too much of a rocket scientist. The XPS file is essentially a
zipped container holding a host of other files which describe the content to
be displayed, laid out by document and by page.
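To get a feel for that container structure, a minimal sketch along the following lines lists every part in an XPS file. It uses the standard System.IO.Packaging classes (which need a reference to WindowsBase); the module and routine names are made up for illustration and are not part of the project described here.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging   ' requires a reference to WindowsBase

Module XpsPartLister
    ' Prints the URI and content type of every part stored in the XPS container.
    ' ListParts is a hypothetical helper name, not something from the project.
    Sub ListParts(ByVal xpsPath As String)
        Using pkg As Package = Package.Open(xpsPath, FileMode.Open, FileAccess.Read)
            For Each part As PackagePart In pkg.GetParts()
                Console.WriteLine("{0}  [{1}]", part.Uri, part.ContentType)
            Next
        End Using
    End Sub

    Sub Main(ByVal args As String())
        If args.Length = 1 Then
            ListParts(args(0))
        Else
            Console.WriteLine("Usage: XpsPartLister <file.xps>")
        End If
    End Sub
End Module
```

Running this against a file produced by the XPS Document Writer should show the FixedDocument, the page (FPAGE) parts, and the font and image resources they use.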
Researching my questions again recently, I came across a number of helpful
articles. As I read up on the format, I did some by-hand manipulation of the
XPS file internals and eventually met with success... I was able to remove
pages from a file and still have it open successfully in the standard XPS
viewer. I also found some additional threads on the MSDN forums related to
doing this manipulation programmatically. And yet, despite there being a
handful of articles regarding XPS on the MSDN forums, no one had apparently
followed through and built an application to do these manipulations (at least
not done so and made it publicly available). This time, armed with this code
which looked promising, I was adequately motivated and confident enough to
actually forge ahead on my own.
Following my successful experiments at manual manipulation of the files, I
did some additional research into the actual XPS spec document, and starting
with the excellent code sample referenced below in "Using the Code",
I attempted to build an application to do the tricky work programmatically. I
created a C# DLL library using the imported code as a base, built a rough VB
GUI, and linked the DLL in to a VB project. And after some tweaking, I got it
to work. I had successfully removed pages from an XPS file programmatically,
just as I had by hand.
I was somewhat disappointed though, as the file size difference between the
original and the edited XPS files was minimal. As it turns out, this is because
the only things removed at this point are the XML file representing
the page content and the one describing that page's relationships;
basically, two relatively small XML text files per page. Since all
content is zipped up when the XPS file is assembled, this slim file size
saving ends up appearing even smaller.
In the post where I found the code on which I originally based my app,
"Jo0815" had described one of the unimplemented portions of the code
provided: the code removed only the page file(s) themselves; it left all
resources (fonts/images) associated with those pages in place. It is possible
that some or all of these resources are no longer used anywhere else in the XPS
file, and thus could be safely removed as well.
As I was troubleshooting this process, I found that I was changing the
utility routine again and again. That required going back and forth between the
different languages and projects (C# for the utility, VB for the GUI) each time
and re-linking the new DLL... this proved to be a real pain. So, I converted
the processing utility to VB so ALL the raw code would be enclosed in the same
VB project. After successfully converting the XPS parsing routine to VB, I
worked on the unimplemented algorithm to handle the resources as well. Again,
the time previously spent hand-manipulating the internals of the XPS files
helped as I struggled to create a "good" XPS file with the edited set
of resources. Once I got the resource removal algorithm working, I built a much
fuller GUI, as can be seen in the screenshot of the final application
pictured above.
In addition to expanding the capabilities to either include OR remove
a specific set of pages, I also added the ability to re-write the order of the
pages in the FixedDocument element, so that the output file now contains the selected pages IN THE ORDER
designated by the user (via the GUI in my application, or in the order of
the pages passed into the utility routine, if you prefer to use a different
interface).
I did observe that even after removing the resources no longer referenced by
any remaining pages, typically the file size of the output XPS file still
remains fairly high. Some rudimentary and incomplete analysis indicates that
the fonts used by the format are very expensive size-wise. In my experience, it
appears that a majority of the size of the XPS file comes from the various
fonts embedded in the file. I would like to investigate this further at a later
time... specifically comparing the internals of the XPS files and their
relative sizes when created via two different means — for example: printing out
a 6-page webpage and then deleting pages 5 & 6 via the application
presented here, compared to the XPS file created by the original writer when
directly printing out only pages 1-4 of the web page. The displayed content of
both files would be the same [aside from the
displayed page header saying "Page x of 6" vs. "Page x of
4"], but how do the files themselves compare? I
would imagine the XPS Writer has some built-in optimization that makes direct
creation the better choice, but again I'm just speculating.
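One rough way to check where the bytes go is to total the part sizes by file extension. The sketch below is an assumption-laden illustration, not code from the project: the module and function names are invented, and it reports uncompressed sizes read through System.IO.Packaging, so the totals will not match the zipped size on disk.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.IO.Packaging   ' requires a reference to WindowsBase

Module XpsSizeReport
    ' Counts the bytes in a stream by reading it to the end.
    Function CountBytes(ByVal s As Stream) As Long
        Dim buffer(8191) As Byte
        Dim total As Long = 0
        Dim bytesRead As Integer = s.Read(buffer, 0, buffer.Length)
        While bytesRead > 0
            total += bytesRead
            bytesRead = s.Read(buffer, 0, buffer.Length)
        End While
        Return total
    End Function

    ' Sums the uncompressed size of all parts, grouped by file extension.
    ' SizeByExtension is a hypothetical helper name.
    Function SizeByExtension(ByVal xpsPath As String) As Dictionary(Of String, Long)
        Dim totals As New Dictionary(Of String, Long)
        Using pkg As Package = Package.Open(xpsPath, FileMode.Open, FileAccess.Read)
            For Each part As PackagePart In pkg.GetParts()
                Dim ext As String = _
                    Path.GetExtension(part.Uri.ToString()).ToLowerInvariant()
                Using s As Stream = part.GetStream(FileMode.Open, FileAccess.Read)
                    Dim size As Long = CountBytes(s)
                    If totals.ContainsKey(ext) Then
                        totals(ext) = totals(ext) + size
                    Else
                        totals.Add(ext, size)
                    End If
                End Using
            Next
        End Using
        Return totals
    End Function

    Sub Main(ByVal args As String())
        If args.Length <> 1 Then
            Console.WriteLine("Usage: XpsSizeReport <file.xps>")
            Return
        End If
        For Each entry As KeyValuePair(Of String, Long) In SizeByExtension(args(0))
            Console.WriteLine("{0,-10} {1,12:N0} bytes", entry.Key, entry.Value)
        Next
    End Sub
End Module
```

Comparing the report for an edited file against one printed directly with fewer pages would give a first indication of whether the font parts account for the difference.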
Using the Code
The VB '08 project included in the source files is immediately rebuildable to the application also available above,
containing all necessary source files with the appropriate library references
set in the project.
All of the XPS functionality requires .NET 3.0 or higher; so, the same is
true of this application and/or utility package.
Specific libraries from .NET 3.0 which need to be linked in to the project are :
- PresentationCore
- PresentationFramework
- ReachFramework
- System.Printing
- WindowsBase
In the project, there is the main XPS processing utilities class, which is
probably all you'd want to keep if you were to re-use any of the code. The
public interface of XPSProcessingUtilities.vb provides four external functions :
- getPageCount()
- RemovePageFromXpsFile()
- RemovePagesFromXpsFile()
- RetainPagesFromXpsFile()
Logically, the function to remove a single page is really just a simplified case
of removing multiple pages, and that is actually how it is implemented: it
places the single page input parameter into an array of
length 1 and then calls the RemovePagesFromXpsFile() function. However, I created it as a
separate function because I thought it made sense as a separate, simplified
call. This function is utilized in the GUI of my associated application by the
button "Remove last page". There is certainly a reasonable need to be
able to pick and choose which pages to keep or delete; however, as described in
"Background", my most frequent desire to edit XPS files is to remove
a stray final page. I typically print webpages to the
XPS format, and find that the last page of the printout often contains
spillover (page footer and other unnecessary data) and is not something I
really wish to keep. Thus, the "Remove last page" functionality makes
sense for my needs.
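The single-page wrapper described above can be sketched as follows. The parameter lists shown here are assumptions made for illustration (the article does not give the real signatures), and the multi-page routine is replaced by a stand-in that only prints what it would do.

```vbnet
Imports System

Module RemoveSinglePageSketch
    ' Stand-in for the multi-page routine. The parameter list is an assumption;
    ' the real function in XPSProcessingUtilities.vb may differ.
    Sub RemovePagesFromXpsFile(ByVal sourceFile As String, _
                               ByVal outputFile As String, _
                               ByVal pages As Integer())
        Console.WriteLine("Would remove {0} page(s) from {1}", pages.Length, sourceFile)
    End Sub

    ' The single-page case wraps the page number in a one-element array
    ' and delegates to the multi-page routine.
    Sub RemovePageFromXpsFile(ByVal sourceFile As String, _
                              ByVal outputFile As String, _
                              ByVal page As Integer)
        Dim pages() As Integer = New Integer() {page}
        RemovePagesFromXpsFile(sourceFile, outputFile, pages)
    End Sub

    Sub Main()
        ' "Remove last page" of a hypothetical 6-page file
        RemovePageFromXpsFile("in.xps", "out.xps", 6)
    End Sub
End Module
```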
Also in the project is a GUI to demonstrate the functionality of the
utilities class. The GUI could easily be replaced with a different one which
you find more useful — I created this one to fill my generic XPS editing file
needs. The buttons on the GUI all react to the current state of the application
— for example, the details pane doesn't appear until a
file with more than one page has been analyzed. It also won't let you remove
ALL pages from a file, and so on. Much of the GUI functionality has been reused
from other applications I've developed previously. The GUI is pretty simple and
straight-forward and doesn't contain anything too tricky at all.
The most interesting code provided is the main private routine in the
utilities package which walks through the passed-in XPS file and actually does
the processing necessary to generate the output file.
As referenced in "Development" and specifically detailed in the
source code, the main routine buildNewXPSFile() is built around code posted by
"Jo0815" on the MSDN forums in the thread "Re:
How can I delete pages from already created XPS document? (how
can I Replace, Insert pages also)". The original routine was written in C#
and did not have complete removal functionality, but it certainly laid the
framework for what I've done here. I translated the code to VB, made the code
more generic to expand its capabilities (the option to pass in pages to retain
instead of remove, for example), added processing to remove the resources
(fonts/images) found on the removed pages if un-referenced elsewhere, and added
code to allow the output file to write its pages out in a user-specified
non-linear order.
Prior to calling the buildNewXPSFile() routine, a copy of the original source
XPS file should be made under the designated name of the output file. All
processing in the routine is performed on the file passed in (the COPY,
preferably!), so the source file will remain untouched if the copy is passed to
the processing function. The file passed in to the utility IS directly
manipulated: passing in the original source file WILL change the source file.
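The copy-first pattern can be sketched like this. ProcessCopy, XpsEditor and DemoEditor are hypothetical names used only for illustration; the delegate stands in for whichever routine edits the file in place.

```vbnet
Imports System
Imports System.IO

Module CopyBeforeProcessing
    ' Hypothetical delegate standing in for a routine that edits an XPS file in place.
    Delegate Sub XpsEditor(ByVal xpsFileToProcess As String)

    ' Copies the source to the output name, then edits only the copy.
    Sub ProcessCopy(ByVal sourceFile As String, ByVal outputFile As String, _
                    ByVal editor As XpsEditor)
        If String.Equals(Path.GetFullPath(sourceFile), Path.GetFullPath(outputFile), _
                         StringComparison.OrdinalIgnoreCase) Then
            Throw New ArgumentException("Output file must differ from the source file.")
        End If
        File.Copy(sourceFile, outputFile, True)   ' True = overwrite an earlier output
        editor.Invoke(outputFile)
    End Sub

    ' Placeholder editor; a real one would open and modify the package.
    Sub DemoEditor(ByVal xpsFileToProcess As String)
        Console.WriteLine("Editing " & xpsFileToProcess)
    End Sub

    Sub Main(ByVal args As String())
        If args.Length = 2 Then
            ProcessCopy(args(0), args(1), AddressOf DemoEditor)
        Else
            Console.WriteLine("Usage: CopyBeforeProcessing <source.xps> <output.xps>")
        End If
    End Sub
End Module
```

Refusing identical source and output paths is an extra safety check added in this sketch; the article itself only recommends passing in the copy.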
The code opens up the first FixedDocument in the XPS file (see known
limitations below), then walks through the pages contained in the FixedDocument
one-by-one.
When removing pages from the file: as the routine loops through the pages,
if a given page is not selected for removal, then that page reference is
re-written into the new FixedDocument.
If that page is to be removed, then (1) all resources (fonts/images)
associated with the page are marked for further processing, (2) the page
reference is NOT written into the new FixedDocument, and (3) the page
(the PackagePart in XPS spec terms, containing the FPAGE file and FPAGE.RELS file) is deleted
from the XPS file (container).
Once all pages have been cycled through, the FixedDocument is finalized and
written out to the file.
Processing concludes by cycling again through all of the pages in order to
determine if any of the resources marked for potential deletion are still
referenced elsewhere in the document. If a resource IS used elsewhere, it is
removed from the deletion list. Once all remaining pages have had their
resources cross-checked against the deletion list, any resources remaining in
the list can be safely removed [via DeletePart()] from the XPS file.
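The two-pass resource check can be illustrated without any XPS classes at all. The sketch below works on plain strings and page numbers; the function name and the sample data are invented for the example, and the real routine works on package relationships instead.

```vbnet
Imports System
Imports System.Collections.Generic

Module OrphanedResourceSketch
    ' pageResources maps each page number to the resource names it references.
    ' Returns the resources that are referenced only by removed pages.
    Function FindOrphans(ByVal pageResources As Dictionary(Of Integer, List(Of String)), _
                         ByVal removedPages As List(Of Integer)) As List(Of String)
        Dim candidates As New List(Of String)

        ' Pass 1: every resource on a removed page becomes a deletion candidate.
        For Each pageNumber As Integer In removedPages
            For Each res As String In pageResources(pageNumber)
                If Not candidates.Contains(res) Then
                    candidates.Add(res)
                End If
            Next
        Next

        ' Pass 2: a candidate still used by a remaining page is taken off the list.
        For Each entry As KeyValuePair(Of Integer, List(Of String)) In pageResources
            If Not removedPages.Contains(entry.Key) Then
                For Each res As String In entry.Value
                    candidates.Remove(res)   ' no effect if res is not a candidate
                Next
            End If
        Next

        Return candidates
    End Function

    Sub Main()
        Dim pages As New Dictionary(Of Integer, List(Of String))
        pages.Add(1, New List(Of String)(New String() {"fontA", "image1"}))
        pages.Add(2, New List(Of String)(New String() {"fontA", "fontB"}))
        pages.Add(3, New List(Of String)(New String() {"fontB", "image2"}))

        Dim removed As New List(Of Integer)(New Integer() {3})
        For Each res As String In FindOrphans(pages, removed)
            Console.WriteLine(res)   ' prints image2 (fontB is still used by page 2)
        Next
    End Sub
End Module
```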
Processing is slightly different if pages are selected for retention in the
output file (as opposed to selected for removal).
Essentially processing is the same, except that rather than writing to the FixedDocument
in real-time as the pages are walked through, the Uris of the retained pages are written to a Collection
instead. This is because the
pages are stepped through in numerical order 1 --> end. On this initial
pass, we only want to determine if the page is retained or removed. Once we've
made it all the way through, we then cycle through the array of pages passed
in, which allows the FixedDocument
to be written in the order that the user specified when making the call to the
utility routine.
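The retain-and-reorder idea can be sketched in isolation as two passes over simple data. The names and the sample page URIs below are made up for illustration; the real routine stores the URIs in a VB Collection and writes PageContent elements instead of returning a list.

```vbnet
Imports System
Imports System.Collections.Generic

Module RetainOrderSketch
    ' allPageUris holds the page URIs in file order (page 1 first).
    ' pageSet holds the pages to retain, in the order the caller wants them written.
    Function OrderedSources(ByVal allPageUris As String(), _
                            ByVal pageSet As Integer()) As List(Of String)
        ' Pass 1: walk pages 1..N in file order, remembering each retained page.
        Dim retained As New Dictionary(Of Integer, String)
        For pageNumber As Integer = 1 To allPageUris.Length
            If Array.IndexOf(pageSet, pageNumber) >= 0 Then
                retained.Add(pageNumber, allPageUris(pageNumber - 1))
            End If
        Next

        ' Pass 2: emit the retained pages in the caller's order.
        ' A page number outside 1..N would raise KeyNotFoundException here.
        Dim ordered As New List(Of String)
        For Each requested As Integer In pageSet
            ordered.Add(retained(requested))
        Next
        Return ordered
    End Function

    Sub Main()
        Dim uris() As String = New String() _
            {"/Pages/1.fpage", "/Pages/2.fpage", "/Pages/3.fpage", "/Pages/4.fpage"}
        For Each src As String In OrderedSources(uris, New Integer() {3, 1})
            Console.WriteLine(src)   ' prints /Pages/3.fpage, then /Pages/1.fpage
        Next
    End Sub
End Module
```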
This is the main utility routine that does the bulk of the heavy lifting in
the app :
' Requires: Imports System.IO, System.IO.Packaging, System.Text, System.Xml,
'           System.Windows.Xps.Packaging
Private Sub buildNewXPSFile(ByVal xpsFileToProcess As String, _
                            ByVal pageSet As Integer(), ByVal pageRemoval As Boolean)
    ' pageUriStrings: Uris of retained pages, keyed by page number
    ' resourceUris: resources of deleted pages (deletion candidates), keyed by Uri string
    Dim pageUriStrings As Collection = New Microsoft.VisualBasic.Collection()
    Dim resourceUris As Collection = New Microsoft.VisualBasic.Collection()
    Using thePackage As Package = Package.Open(xpsFileToProcess, FileMode.Open, _
                                               FileAccess.ReadWrite)
        Dim xpsDoc As XpsDocument = New XpsDocument(thePackage)
        Dim fixedDocSeqReader As IXpsFixedDocumentSequenceReader = _
            xpsDoc.FixedDocumentSequenceReader
        If (fixedDocSeqReader.FixedDocuments.Count = 0) Then
            Throw New InvalidOperationException( _
                "The source XPS file does not contain any documents!!")
        End If
        ' Only the first FixedDocument in the file is processed
        Dim fixedDocReader As IXpsFixedDocumentReader = fixedDocSeqReader.FixedDocuments(0)
        Dim memStream As MemoryStream = New MemoryStream()
        Using xmlWriter As XmlTextWriter = New XmlTextWriter(memStream, Encoding.UTF8)
            xmlWriter.WriteStartDocument()
            xmlWriter.WriteStartElement("FixedDocument", _
                                        "http://schemas.microsoft.com/xps/2005/06")
            ' Pass 1: delete excluded pages and note their resources; keep the rest
            For currentPageNumber As Integer = 1 To fixedDocReader.FixedPages.Count
                Dim pageReader As IXpsFixedPageReader = fixedDocReader.FixedPages( _
                    currentPageNumber - 1)
                If (shouldThisPageBeExcluded(currentPageNumber, pageSet, _
                                             pageRemoval)) Then
                    Dim thePageBeingDeleted As PackagePart = thePackage.GetPart( _
                        pageReader.Uri)
                    For Each resourceRelationship As PackageRelationship _
                            In thePageBeingDeleted.GetRelationships
                        Try
                            resourceUris.Add(resourceRelationship.TargetUri, _
                                             resourceRelationship.TargetUri.ToString)
                        Catch ex As System.ArgumentException
                            ' Duplicate key: this resource is already a candidate
                        End Try
                    Next
                    thePackage.DeletePart(pageReader.Uri)
                Else
                    If pageRemoval Then
                        ' Removal mode: write the surviving page reference immediately
                        xmlWriter.WriteStartElement("PageContent")
                        xmlWriter.WriteAttributeString("Source", _
                                                       pageReader.Uri.ToString())
                        xmlWriter.WriteEndElement()
                    Else
                        ' Retention mode: remember the Uri; the order is applied below
                        pageUriStrings.Add(pageReader.Uri.ToString(), _
                                           currentPageNumber.ToString)
                    End If
                End If
            Next currentPageNumber
            If Not pageRemoval Then
                ' Write the retained pages in the order the caller specified
                For Each x As Integer In pageSet
                    xmlWriter.WriteStartElement("PageContent")
                    xmlWriter.WriteAttributeString("Source", _
                                                   pageUriStrings.Item(x.ToString).ToString)
                    xmlWriter.WriteEndElement()
                Next
            End If
            ' Pass 2: un-mark any candidate resource still used by a remaining page
            For currentPageNumber As Integer = 1 To fixedDocReader.FixedPages.Count
                Dim pageReader As IXpsFixedPageReader = fixedDocReader.FixedPages( _
                    currentPageNumber - 1)
                Try
                    Dim thisPage As PackagePart = thePackage.GetPart(pageReader.Uri)
                    For Each resourceRelationship As PackageRelationship _
                            In thisPage.GetRelationships
                        For Each looper As Uri In resourceUris
                            If looper = resourceRelationship.TargetUri Then
                                resourceUris.Remove(looper.ToString)
                            End If
                        Next
                    Next
                Catch ex As System.InvalidOperationException
                    ' The page part was deleted in pass 1; nothing to check
                End Try
            Next
            ' Whatever is left in the list is orphaned and can be deleted
            For Each removableUri As Uri In resourceUris
                thePackage.DeletePart(removableUri)
            Next removableUri
            xmlWriter.WriteEndElement()
            xmlWriter.WriteEndDocument()
        End Using
        ' Overwrite the FixedDocument part with the newly built page list
        Dim newFixedDoc() As Byte = memStream.ToArray()
        Dim fixedDocPart As PackagePart = thePackage.GetPart(fixedDocReader.Uri)
        Using partStream As Stream = fixedDocPart.GetStream(FileMode.Create)
            partStream.Write(newFixedDoc, 0, newFixedDoc.Length)
        End Using
    End Using
End Sub
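The helper shouldThisPageBeExcluded() is called by the routine but not shown in the article. Judging purely from how it is used, one plausible form is sketched below; this is an assumption about its behaviour, not the project's actual code.

```vbnet
Imports System

Module PageExclusionSketch
    ' Assumed behaviour: in removal mode a page is excluded when it IS listed in
    ' pageSet; in retention mode a page is excluded when it is NOT listed.
    Function shouldThisPageBeExcluded(ByVal pageNumber As Integer, _
                                      ByVal pageSet As Integer(), _
                                      ByVal pageRemoval As Boolean) As Boolean
        Dim listed As Boolean = (Array.IndexOf(pageSet, pageNumber) >= 0)
        If pageRemoval Then
            Return listed
        Else
            Return Not listed
        End If
    End Function

    Sub Main()
        Dim pageSet() As Integer = New Integer() {2, 4}
        Console.WriteLine(shouldThisPageBeExcluded(2, pageSet, True))    ' True
        Console.WriteLine(shouldThisPageBeExcluded(2, pageSet, False))   ' False
        Console.WriteLine(shouldThisPageBeExcluded(3, pageSet, False))   ' True
    End Sub
End Module
```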
Caveats
- This is intended to be a very simple, straightforward app that produces an
XPS-viewer-readable output XPS file containing a subset of the pages
in the original source XPS file and/or the pages from the source file
re-organized in a new order.
- I would be the first to admit that in spite of the development effort, and
my research into the XPS spec, I am by no means an expert on XPS. I am NOT
certain that the files created by the application I've written here still
conform to the spec proper. I have edited XPS files by hand and seen them
display properly, and also seen them error out in the XPS viewer when I
edited the file internals incorrectly. I have used the app included here
to edit pages out of XPS files, and with the current version, I have not
seen any files which have failed to open in an XPS viewer. I can't be
certain that you don't have a file out there that would fail... I just
don't know enough to guarantee that. However, I do believe that there is
no reason why a file output from this application should fail. The only
changes made to the output file should be
the removal of the desired page files (the FPAGE and corresponding
FPAGE.RELS files), as well as any resources (fonts/images) which are no
longer required. So, if the input XPS file works, I *think* the output XPS
file should work as well. If, by chance, you run an XPS file through this
app, and produce something that no longer works as an XPS file, I would be
happy to troubleshoot and fix the problem. Of course, you'd have to share
the XPS file and the conditions used to create the failure. But, I would
be interested in troubleshooting were that to arise... (which
again I believe will not be the case).
- In light of what I just wrote, please double-check the content of your output
file, to make sure what you expected is what you
got. In my experience, that is now the case, but there could be special
cases I haven't considered or don't fully understand.
- If you truly need a full-featured, powerful tool that will probably
guarantee its output and its conformance to the XPS spec, you should
look into something like the commercial NiXPS tool described earlier, which
pretty much appears to be the only thing out there.
- One interesting wrinkle (see related discussion re: thumbnails below): Using
this application, it is now quite possible to make the thumbnail
associated with the file outdated and inaccurate, as it is (presumably)
the view of page 1 of the original source XPS. Even if you delete page 1
of that file, the thumbnail associated to the global file will NOT change
and will "show" the old first page, even though that actual page
is no longer in the output file.
Future Enhancements and/or Known Shortcomings
- This app (and its underlying utilities package) only processes the first FixedDocument (document 0) in the XPS file. The XPS file has the
capability to store multiple documents, although all simple files I've
worked with have only contained a single document in them. I imagine that
any documents beyond the first one would remain completely unchanged,
because the only changes made by the app/utility package are to re-write
the FixedDocument file which contains the page list (in order), and to
remove the desired page & relationships files, along with any of their
orphaned resources, specifically by removing those parts from the file. I
would guess everything else would be untouched, although at this point I
have not confirmed that.
- Allow multiple copies of a page to be created in the output file (?) -- This is
not currently supported in the GUI provided, which currently prevents a
second copy of a page from being added to the "Selected" list
box. By-hand experimentation shows that simply editing the FixedDocument file to add a duplicate reference to the same page a
second time does *NOT* work. I would have to research the XPS spec further
to determine how to effect this duplicate page in
the output file.
- Merge XPS files -- select pages from a second XPS file to integrate into the
first. This appears to be a much trickier problem.... From the posts I've
read online, the addition of new pages to an XPS file is not trivial in
nature. I have not researched this in depth, but it is something I'm
interested in investigating. Significant GUI changes would be required to
support this interface as well. [Note : In the
same MSDN thread referenced in "Using the Code", a NiXPS employee indicates that their commercial editor
does have this capability.]
- As noted in the update above, XPS files created by Microsoft Word are not
handled correctly by this application. Those files are structured differently
from the ones created by the Microsoft XPS Document Writer, and the processing
logic would have to be updated to support them.
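Regarding the first limitation in the list above (only document 0 is processed), a quick way to see whether a given file is affected is to count its documents and pages. This is a minimal sketch using the XpsDocument reader classes; the module and routine names are invented, and the project needs references to ReachFramework and WindowsBase.

```vbnet
Imports System
Imports System.IO
Imports System.Windows.Xps.Packaging   ' requires references to ReachFramework and WindowsBase

Module XpsDocumentCounter
    ' Reports how many FixedDocuments the file holds and how many pages each has.
    ' Report is a hypothetical helper name, not part of the project.
    Sub Report(ByVal xpsPath As String)
        Dim xpsDoc As New XpsDocument(xpsPath, FileAccess.Read)
        Try
            Dim seqReader As IXpsFixedDocumentSequenceReader = _
                xpsDoc.FixedDocumentSequenceReader
            If seqReader Is Nothing Then
                Console.WriteLine("No FixedDocumentSequence found.")
                Return
            End If
            Console.WriteLine("Documents: {0}", seqReader.FixedDocuments.Count)
            For i As Integer = 0 To seqReader.FixedDocuments.Count - 1
                Console.WriteLine("  Document {0}: {1} page(s)", i, _
                                  seqReader.FixedDocuments(i).FixedPages.Count)
            Next
        Finally
            xpsDoc.Close()
        End Try
    End Sub

    Sub Main(ByVal args As String())
        If args.Length = 1 Then
            Report(args(0))
        Else
            Console.WriteLine("Usage: XpsDocumentCounter <file.xps>")
        End If
    End Sub
End Module
```

If the report shows more than one document, only the pages of document 0 would be candidates for removal or re-ordering with the utility as described.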
History
v1.01 – the original version.
v2.01 – revised during 2/2012 to provide page view support to improve page
selection and application usability.
Acknowledgements
First and foremost, I'd like to acknowledge "Jo0815" for posting
the code in the MSDN forum thread which I used as the starting point for this
app. Without finding that thread, I probably would have given up again and
continued to hope that someone else would eventually create an app to perform
the functionality I was looking for.
Also, my thanks to Pravesh Soni for his Code Project article, from which I
borrowed some new ideas.
Additionally, thanks to the others who have posted articles and/or code, here on Code
Project or elsewhere on the 'net. I've learned much through the years picking
through what was out there... I'm glad to finally have something (worthwhile?)
I can contribute back.
Conclusion
I have no specific data on which this opinion is based, but... due to the relatively wide
availability of PDF toolsets compared to practically non-existent ones for XPS,
I would guess that XPS probably has not made too much of a dent in the PDF
"monopoly" of document image formatting.
I have read that one of the complaints about XPS is that the typical file
size when printing the same output to PDF is smaller than that produced in the
XPS format. Additionally, I have heard rumblings about the PDF quality being
better than that of XPS.
Whatever the truth really is as to those issues, the fact is that the XPS
Writer is embedded in Vista (and future Windows versions, I would presume...).
Also, the XPS Writer is available to XP users through the .NET 3.0 Framework.
Being built into the infrastructure as it is, I would expect that XPS is here
to stay even if it never overtakes PDF in popularity... And despite its
widespread availability and acceptance, PDF has its own shortcomings of course.
Since it should be around for the foreseeable future (I, for one, plan to
continue using the format), I hope that what I've provided here fills a gap in
available functionality and that someone else out there finds this tool as
important and useful as I have.
This is my first Code Project project, and I'm
quite happy to give back to the community after much time spent as a lurker.
I welcome any comments (positive or negative) -- I'm certain there are
improvements that could be made to either the article or my implementation of
the solution, be it in the code or the methodologies/algorithms used.
Additionally, if you find a bug or the application fails in some way, I'd be
interested in fixing the problem and learning more about my code, trying to
make it more robust.
Thank you for reading!