Since Microsoft unveiled OpenXML with Office 2007, many people have been checking whether they can take advantage of it. However, if you search newsgroups, communities, and forums, you will find that it is widely considered complicated and difficult to study and implement.
I think it is not difficult, but lengthy and somewhat complex.
ODF vs. OpenXML: This is another point of debate. ODF (OpenDocument Format) is simpler than MS OpenXML. Both follow the same XML + Zip approach. However, help, tutorials, and support for Open Source technologies are scarce compared with what is available for MS products.
This article: I try to explain here the basics of OpenXML programming to help beginners. I have dealt with Word 2007, and hence I will cover the part regarding Word 2007 only. However, OpenXML implementations are quite similar across Office components.
Nothing here is my own invention; I am simply collecting the basic facts scattered over the internet in one place.
Let's start with the basics. A Word file with the extension DOCX is actually a compressed archive (Zip) of some files. These files are nothing but XML files and some folders/subfolders. These files are inter-related with relations.
The following figure shows the files inside a DOCX:
To view these files, just open the DOCX file with WinZip (or any other archive tool you have). Everything (with some exceptions, such as images and ActiveX controls) is stored as XML. You need to remember the following keywords: Package, Parts, Relations.
Package: Package is nothing but your DOCX file. This zip file is called a Package.
Parts: Parts are nothing but files in the Package. E.g., the area where you type (after opening Word) is the main document part. If you insert an image, it will be another part. Everything is managed in parts (numbering [bullets], images, styles, settings etc.). If you want to insert/delete/retrieve images, then you have to play with ImageParts (a sub-class of part) and so on.
Relations: The parts are linked with relations. The main relations are maintained in the .rels file inside the _rels folder within a package. Of course, this file contains XML tags too. There are other relation files in the word/_rels folder. These are sub-relationships. E.g., if you include an image for a bullet (picture bullets), then you can find the numbering.xml.rels file in this folder. There are many other relation files, and it is hard to list them all.
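The three keywords above can be seen from code without any SDK. The following is a minimal sketch, assuming .NET 3.0 or later with a reference to WindowsBase.dll (which provides System.IO.Packaging) and a document at C:\Test\abc.docx, the sample path used elsewhere in this article. The module name is made up for illustration.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging

Module PackageTour
    Sub Main()
        ' Assumed sample path; opened read-only so the file is not modified.
        Using pkg As Package = Package.Open("C:\Test\abc.docx", FileMode.Open, FileAccess.Read)
            ' Parts: the files stored inside the package.
            For Each part As PackagePart In pkg.GetParts()
                Console.WriteLine("Part: " & part.Uri.ToString() & " [" & part.ContentType & "]")
            Next

            ' Package-level relations: what _rels/.rels describes.
            For Each rel As PackageRelationship In pkg.GetRelationships()
                Console.WriteLine("Relation: " & rel.Id & " -> " & rel.TargetUri.ToString())
            Next
        End Using
    End Sub
End Module
```

Running it against a document you know well is a quick way to compare the output with what WinZip shows.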
Relation IDs: Each relation has a unique ID. This ID is referred to both in the referencing part and in the relation file. With the help of this ID, Office finds the referenced part and displays it accordingly. For instance, add a new image to your document, then save it. Open it with WinZip. Open document.xml, look for the w:drawing tag, then inside that, look for the r:embed attribute of the a:blip tag. The value of this attribute will look like rId2. Then, open the document.xml.rels file and search for rId2; you will find the path of the image in the package!
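The same lookup can be sketched in code. The example below is only an illustration: it parses a hand-written sample of what document.xml.rels might contain (the image file name media/image1.png is an assumption) and returns the Target for a given relation ID. FindTarget and RelationLookup are hypothetical names, not part of any SDK.

```vb
Imports System
Imports System.Xml

Module RelationLookup
    ' Namespace used by the .rels files in an OpenXML package.
    Const RelNs As String = "http://schemas.openxmlformats.org/package/2006/relationships"

    ' Returns the Target of the relation with the given ID, or Nothing if there is none.
    Function FindTarget(ByVal relsXml As String, ByVal relationId As String) As String
        Dim relsDoc As New XmlDocument()
        relsDoc.LoadXml(relsXml)

        ' The file uses a default namespace, so bind it to a prefix for XPath.
        Dim ns As New XmlNamespaceManager(relsDoc.NameTable)
        ns.AddNamespace("r", RelNs)

        Dim xpath As String = "/r:Relationships/r:Relationship[@Id='" & relationId & "']"
        Dim node As XmlNode = relsDoc.SelectSingleNode(xpath, ns)
        If node Is Nothing Then Return Nothing
        Return node.Attributes("Target").Value
    End Function

    Sub Main()
        ' Hand-written sample; a real document.xml.rels has more entries.
        Dim sample As String = _
            "<Relationships xmlns=""" & RelNs & """>" & _
            "<Relationship Id=""rId2"" " & _
            "Type=""http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"" " & _
            "Target=""media/image1.png""/>" & _
            "</Relationships>"

        Console.WriteLine(FindTarget(sample, "rId2"))   ' prints media/image1.png
    End Sub
End Module
```

The Target is relative to the folder of the referencing part, so for document.xml it points into the word folder.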
To deal with DOCX programmatically and to simplify programming, you may want to download the SDK provided by Microsoft [Microsoft SDK for OpenXML Formats] [SDK 2.0 here]. The final release is not out yet. Download it, add a reference to it in your project, and import its namespaces.
To open a document
Dim doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", True)
Dim mainDoc As MainDocumentPart = doc.MainDocumentPart
mainDoc is the main document part (document.xml) and contains every line of text you typed in the document.
To load in XMLDocument
You may want to load the XML of document.xml into an XmlDocument object. Try this:
Dim streamReader As New System.IO.StreamReader(mainDoc.GetStream())
Dim str As String = streamReader.ReadToEnd()
streamReader.Close()
Dim xmlDoc As New System.Xml.XmlDocument()
' Parse the text of document.xml into a DOM
xmlDoc.LoadXml(str)
Remember, to navigate within this XML, you need an XmlNamespaceManager with the required namespaces added to it. The required namespaces are the ones declared in the document.xml file. If you want to add paragraphs, add child nodes in xmlDoc and then save the modified XML back into the part.
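As a rough illustration of the namespace handling, the sketch below works on a small hand-written document.xml fragment instead of a real file. AppendParagraph and DocumentXmlDemo are hypothetical names; the namespace URI is the main WordprocessingML namespace that Word binds to the w prefix.

```vb
Imports System
Imports System.Xml

Module DocumentXmlDemo
    ' Main WordprocessingML namespace, bound to the "w" prefix in document.xml.
    Const WordNs As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Adds a paragraph containing the given text to w:body and
    ' returns the number of paragraphs in the document afterwards.
    Function AppendParagraph(ByVal xmlDoc As XmlDocument, ByVal content As String) As Integer
        Dim ns As New XmlNamespaceManager(xmlDoc.NameTable)
        ns.AddNamespace("w", WordNs)

        Dim body As XmlNode = xmlDoc.SelectSingleNode("/w:document/w:body", ns)

        ' Build <w:p><w:r><w:t>content</w:t></w:r></w:p>.
        Dim p As XmlElement = xmlDoc.CreateElement("w", "p", WordNs)
        Dim r As XmlElement = xmlDoc.CreateElement("w", "r", WordNs)
        Dim t As XmlElement = xmlDoc.CreateElement("w", "t", WordNs)
        t.InnerText = content
        r.AppendChild(t)
        p.AppendChild(r)

        ' If the body ends with section properties, keep them last.
        Dim sectPr As XmlNode = body.SelectSingleNode("w:sectPr", ns)
        If sectPr Is Nothing Then body.AppendChild(p) Else body.InsertBefore(p, sectPr)

        Return xmlDoc.SelectNodes("//w:p", ns).Count
    End Function

    Sub Main()
        Dim xmlDoc As New XmlDocument()
        xmlDoc.LoadXml("<w:document xmlns:w=""" & WordNs & """><w:body>" & _
                       "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>" & _
                       "</w:body></w:document>")
        Console.WriteLine(AppendParagraph(xmlDoc, "World"))   ' prints 2
    End Sub
End Module
```

With a real document, the modified XML still has to be written back to the part, for example with xmlDoc.Save(mainDoc.GetStream(System.IO.FileMode.Create, System.IO.FileAccess.Write)); FileMode.Create is used here so that the old content is truncated first.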
Add New Part
To add a new part, you can use the AddNewPart methods of the WordprocessingDocument and MainDocumentPart classes. These are generic methods, so you need to specify which type of part you want to add. The method returns the part you added, and you can then work with it.
To add a new numbering part in the main document, try the following:
Dim numPart As NumberingDefinitionsPart = _
    mainDoc.AddNewPart(Of NumberingDefinitionsPart)()
Then, load numPart's XML using the GetStream method into the XmlDocument, do the manipulations, and then save it.
To add a new image part in the main document, try the following:
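Dim doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", True)
Dim mainDoc As MainDocumentPart = doc.MainDocumentPart
Dim iPartImage As ImagePart = mainDoc.AddImagePart(ImagePartType.Png)
Dim img As Image = Image.FromFile("C:\images\test.gif")
' Write the image into the new part as PNG so the data matches ImagePartType.Png
img.Save(iPartImage.GetStream(), System.Drawing.Imaging.ImageFormat.Png)
doc.Close()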
The above code will add an image to the package. Remember, it will not be displayed in your document unless you manually add paragraphs and the required nodes in mainDoc's XML. After executing the code above, open the package with WinZip, and check that the image has been added under the media folder. Also, check the relation file document.xml.rels and search for media/image; you will find that a new relation tag has been added and a new unique ID has been created for that image.
[This article would help you to add new paragraphs.]
You can iterate through each part using the Parts property of the Part class. Try using a For Each loop and check each part in Debug mode (put a breakpoint inside the loop).
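A minimal sketch of such a loop, assuming the SDK is referenced and the sample document from earlier in this article exists at C:\Test\abc.docx; the module name is made up for illustration:

```vb
Imports System
Imports DocumentFormat.OpenXml.Packaging

Module ListParts
    Sub Main()
        ' Open read-only (False), since nothing is changed here.
        Using doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", False)
            ' Parts yields one IdPartPair per relation of the main document part.
            For Each pair As IdPartPair In doc.MainDocumentPart.Parts
                ' RelationshipId is the rId value; OpenXmlPart is the part it points to.
                Console.WriteLine(pair.RelationshipId & " -> " & _
                                  pair.OpenXmlPart.Uri.ToString() & _
                                  " (" & pair.OpenXmlPart.GetType().Name & ")")
            Next
        End Using
    End Sub
End Module
```

Each line pairs a relation ID with the location and type of the part, which makes it easy to match against document.xml.rels.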
Delete existing part
You can see the ID of the part in document.xml. Once you have the ID of the part, call the GetPartById method of mainDoc. This returns the part that you want to delete. Then, call the DeletePart method. This deletes the part and also updates the relation file (document.xml.rels).
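Put together, the steps might look like the sketch below. The path is the sample path used in this article, and rId2 is only a placeholder for whatever ID you found in document.xml; GetPartById throws an exception if the ID does not exist.

```vb
Imports DocumentFormat.OpenXml.Packaging

Module DeletePartDemo
    Sub Main()
        ' Open for editing (True). "rId2" is a placeholder ID.
        Using doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", True)
            Dim mainDoc As MainDocumentPart = doc.MainDocumentPart

            ' Look up the part through its relation ID.
            Dim part As OpenXmlPart = mainDoc.GetPartById("rId2")

            ' Remove the part and its relation from the main document part.
            mainDoc.DeletePart(part)
        End Using ' Leaving the Using block closes the document.
    End Sub
End Module
```

Note that any markup in document.xml that still refers to the deleted ID (for example a w:drawing pointing at a removed image) has to be removed separately.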
WordprocessingDocument.Close() automatically saves and closes the document. You don't need to save it explicitly.
You need to work hard to understand OpenXML. Debugging and some R&D will help you know it better.