|
||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
Note: This is an unedited contribution. If this article is inappropriate,
needs attention or copies someone else's work without reference then please
Report This Article
Microsoft is trying to push new file formats that are using ZIP and XML. Are those new file formats any good for Office developers ? In other words, should anyone feel safe to make direct access to file parts, and start getting free of running instances of Microsoft Office and its COM object model, usually through VBA ? Microsoft does not run out of teasing. There is ton of videos, see here, and here for example, screencasts, articles and blog posts (self-serving Microsoft blog posts mostly) about how much they are opening up. It boils down to the following, excerpt from Microsoft Office 12 introduction white paper : (...) The use of XML offers the benefits of greater transparency and openness than were possible with the previous binary file formats. The new formats allow Office documents to easily integrate with existing and future line-of-business systems, as the contents are now open and accessible. The new formats are also designed with long-term robustness and accessibility in mind. (...) The binary file formats in use currently were designed in 1994—before the advent of XML and before widespread exchange of documents and data that is common today. These file formats, .doc, .xls, and .ppt, were introduced with the release of Microsoft Office 97, at a time when it was important to optimize the files for storage on slow hard drives and "floppy" disks; it was not as crucial to focus on easy access to data within the files for better content reuse, document generation, and seamless integration of the documents into business processes. (...) The new XML-based file formats in these programs enable broader integration and interoperability between Office documents and enterprise applications. Additionally, "Office 12" files are all wrapped using ZIP technologies, which allows for easy access to the content parts as well as standard compression, reducing file sizes and improving reliability and data recovery. (...) Because documents stored in the Open XML Formats are machine-readable and editable by any text editor or XML processor, solutions need not use Microsoft Office programs to view or edit content within the documents. Enterprise business solutions can access document contents easily and efficiently. Technology providers can utilize the Microsoft Office System and Office authoring applications within their solutions, reuse Microsoft Office documents as other Office documents, or open and act on Office documents on other platforms and in other applications. They insist on the fact that, provided you make a valid use of the XML, pretty much changing the content of anything in an existing document can be achieved by sequentially 1) unzipping the content 2) making appropriate changes to one or more XML parts that are compatible with the provided XML schemas and open packaging relationships 3) zipping the content back . Let's see if that's true. 1) Self-exploding spreadsheets 2) Entered versus stored values 3) Optimization artefacts become a feature instead of an embarrasment 4) VML isn't XML 5) Open packaging parts minefield 6) International, but US English first and foremost 7) Many ways to get in trouble 8) Windows dates 9) All roads lead to Office 2007 10) A world of ZIP+OLE files 11) Document security is a (bad) joke 12) BIFF is gone...not! 13) Document backwards compatibility subject to neutrino radio-activity 14) ECMA 376 documents just do not exist 15) How the ISO OpenDocument format (ODF) compares?
1) Self-exploding spreadsheetsTo reproduce the scenario :
The relevant XML in the corresponding part xl/worksheets/sheet1.xml is : <row r="2" spans="3:5"> <c r="C2"> <v>10</v> </c> <c r="D2"> <v>20</v> </c> <c r="E2"> <f>SUM(C2:D2)</f> <v>30</v> </c> </row> Pretty simple XML. Now say we want to edit cell E2 and set a constant value of 40 in place of a formula. But instead of doing that with Excel 2007 interactively, we are going to do it manually :
The corresponding valid (and carefully changed) XML for setting the constant value of 40 in cell E2 is : <row r="2" spans="3:5"> <c r="C2"> <v>10</v> </c> <c r="D2"> <v>20</v> </c> <c r="E2"> <v>40</v> </c> </row> Now open the file in Excel 2007. You get a blocking error message which says :
Followed by another even more frightening message :
Interestingly enough, we thought parts of a spreadsheet file were individually updatable as long as we did not touch elements that are, according to the ECMA 376 documentation, indexes to other parts. Now that's an interesting issue. The ECMA 376 documentation says that the calculation chain is the graph of formulas, sorted wrt to their dependencies. Suddenly, the little change we would like to make looks way more expensive. Rebuilding the graph of formulas ourselves is what Excel itself does, and sure enough it involves parsing the entire spreadsheet, discovering formulas, and applying formula parsing algorithms to induce a graph of the dependencies. It certainly sounds like we are going to have to rewrite a portion of Excel itself. Not every employer can afford waiting that much...Or perhaps that's Microsoft's way to tell us : you shall not touch this without our approbation. We can perhaps get rid of the calculation chain, if we carefully delete the part and associated relationship. But then, there are three problems : - Parts are intertwined together through implicit and explicit relationships (relationships are separate zip entries). It gets even more grainy, and the risk of corruption increases. Again, we have to further take into account a number of details that are out of our scope, like how the calculation chain is defined as per the workbook part. - If I were a programmer, the question would be : Microsoft gives no library that I can use, at least not a library that I could use no matter the execution environment. Microsoft provides an API which works in a recent .NET run-time, and installs on Windows XP SP2 and Windows Vista. There goes the platform independence. - If I delete the calculation chain, I am admitting that the resulting spreadsheet is being degraded, something that Excel 2007 does not suffer from since it takes care of that thanks to its infrastructure. In other words, making changes outside of Excel 2007 doesn't look as first-class citizen and safe and robust as it sounded first, and in either case either the application takes care of a lot of details, replicating what Excel itself does (there is no known non-Microsoft Excel 2007 implementation available out there), or Excel 2007 will have to do that for me next time the spreadsheet is opened. If the changes are made on a server without Excel 2007 installed and the resulting spreadsheet is distributed to employees throughout the organization, then every single employee will have to get through a full spreadsheet recalculation (arbitrarily lengthy) next time they open the spreadsheet. The application is a second-class citizen compared to Excel 2007, that's what everybody thinks : me and the employees. What this shows pretty clearly is that either we lack the tools, or Microsoft does not think we should be doing that in the first place. With that being said, if making a simple change to a cell is too much to ask, then what is this new format good for? The prospect of getting our recipients facing those dreaded message boxes is not exactly a change compared to the well known ugly stories with corrupted binary file formats. Let's play the devil's advocate now and see how far it will take us. Here is the contents of the calculation chain, xl/calcChain.xml : <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <calcChain xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> <c r="E2" i="1"/> </calcChain> The reference to formula in cell E2 is what seems to be causing the problem. All right, then since it's just XML, let's remove that reference. Edit the calculation chain xl/calcChain.xml so it looks like this : <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <calcChain xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> </calcChain> After the modification, if we open the file in Excel 2007, it still complains :
I particularly like the wording : catastrophic failure. Now it looks like a lot of blood is spilling. Well, then let's stop the hemorrhage, and remove the contents of this part altogether. Make the update, and open the file in Excel 2007, it continues to complain :
I guess it's time to take a look at the ECMA 376 documentation. In Part 4, page 2087, it says (emphasis mine) : 3.6.2 calcChain (Calculation Chain Info) This element represents the root of the calculation chain. <complexType name="CT_CalcChain"> <sequence> <element name="c" type="CT_CalcCell" minOccurs="1" maxOccurs="unbounded"/> <element name="extLst" minOccurs="0" type="CT_ExtensionList"/> </sequence> </complexType> I guess there we have it, we can't have a calculation chain part with no cell reference in it. Excel guts spilling through the specs here, aren't it? Wait a minute, is this supposed to make into an international standard? Let's do a quick summary : we had Excel 2007 complain that the calculation chain was left with a reference to a formula that did not exist anymore, but in fact Excel 2007 complains even with this manually fixed. The solution is to delete the physical calculation chain part and the relationship it has with the workbook (defined in the workbook relationship part xl/_rels/workbook.rels), and to update some other parts as well. Let's get a visual diff of what it takes to make a "proper" change in a cell :
2) Entered versus stored valuesWe all take for granted that when we type a value such as 1234.1234 in a cell of a spreadsheet, that's what actually gets stored. Excel has this auto-number format matching capability where it tries to make sense of what is manually entered in order to deduce if that's a string, a number, a boolean or a date and applies a number format accordingly, but what's being stored as a value is what is entered. By the way, if you hit Alt+F11 in Excel, and enter something like Range("C3").Value and run the macro, you'll get the entered value in cell C3 ; if you enter something like Range("C3").Text, you'll get the formatted value in cell C3, where the number format has been applied to the value. Note that the return value takes advantage of the locale and number formatting which means you may get "1234,1234" (note the comma) instead of "1234.1234". Is this storage neutrality true with the new formats? To reproduce the scenario :
Here is a screenshot of what you should see at this point : ![]() Typing a few numeric values in Excel 2007 The corresponding XML in the main part xl/worksheets/sheet1.xml is : <sheetData> <row r="2" spans="4:4"> <c r="D2"> <v>123456.123456</v> </c> </row> <row r="3" spans="4:4"> <c r="D3"> <v>12345.123449999999</v> </c> </row> <row r="4" spans="4:4"> <c r="D4"> <v>1234.1233999999999</v> </c> </row> <row r="5" spans="4:4"> <c r="D5"> <v>123.123</v> </c> </row> <row r="6" spans="4:4"> <c r="D6"> <v>12.12</v> </c> </row> </sheetData> The problem is that Excel 2007 does not store what we entered. If we read the XML, we are going to grab numbers that have rounding errors compared to the actual numbers we typed. Let's see how far the problem goes :
Not only there is a rounding error, but its order of magnitude changes depending on the value. Ironically enough, if you entered 4321.4321, it would be stored as is, with no rounding error. It is absolutely lost on me how implementers are expected to deal with this mess. The spreadsheet does not reflect the proper values, and you can easily see where it goes. Imagine non-Microsoft applications used in healthcare and critical systems relying on the spreadsheet data. Not only the rounding error seems arbitrary (one would have to go back and study the artefacts of IEEE floating-point values, several decades of work), but it changes. There is no way we can possibly take advantage of this, with one notable exception : if we are able to be in an execution environment for which reading those floating-point values does not produce those artefacts, and returns the proper entered values, then we are good. Problem : Microsoft does not document the execution environment. We can fairly assume its Windows, but what else? And if I am using Linux, how do I work with this? It's important to understand that if we open the spreadsheet in Excel 2007, we see the proper values. No loss (based on the values entered) seem to have occured, the problem is that the data in XML just cannot be used as is. As an aside, the stored value does not use the locale (it always uses the dot as decimal separator), therefore we have to assume this is all US English. If we wrote software in Excel VBA that grabs the value in cells, then processes it, there is no way we could migrate our VBA code to work with this XML part without substantial rework. We are left with Excel's own international implementation artefacts, undocumented. 3) Optimization artefacts become a feature instead of an embarrasmentHistorically, the BIFF file format used in Excel spreadsheets was designed to be small and fast. But this design decision goes all way back to early 90s, when the Pentium CPU did not exist yet. Regular desktop computers we use everyday are at least several orders of magnitude faster and memory-friendly that those early computers were. Yet, Microsoft chose to keep those optimization artefacts as is, with the side effect that they are now exposed to the surface as part of the XML. Among interesting optimizations is Excel insistence in trying to factor formulas as much as possible. This happens with Excel when you create a formula, then drag the cell to replicate it in other cells. That's shared formulas. Excel chooses to declare the formula itself only once, and then creates a mechanism to infer the formula in other cells (the relative position counts as an offset). Shared formulas are supposed to be transparent for developers. Is it true? To reproduce the scenario :
Here is a screenshot of what you should see at this point : ![]() Typing a few numbers and formulas in Excel 2007 The corresponding XML in the main part xl/worksheets/sheet1.xml is : <sheetData> <row r="4" spans="3:5"> <c r="C4"> <v>2</v> </c> <c r="D4"> <v>3</v> </c> <c r="E4" s="1"> <f>C4-D4</f> <v>-1</v> </c> </row> <row r="5" spans="3:5"> <c r="C5"> <v>2</v> </c> <c r="D5"> <v>3</v> </c> <c r="E5" s="1"> <f t="shared" ref="E5:E10" si="0">C5-D5</f> <v>-1</v> </c> </row> <row r="6" spans="3:5"> <c r="C6"> <v>2</v> </c> <c r="D6"> <v>3</v> </c> <c r="E6" s="1"> <f t="shared" si="0"/> <v>-1</v> </c> </row> <row r="7" spans="3:5"> <c r="C7"> <v>2</v> </c> <c r="D7"> <v>3</v> </c> <c r="E7" s="1"> <f t="shared" si="0"/> <v>-1</v> </c> </row> <row r="8" spans="3:5"> <c r="C8"> <v>2</v> </c> <c r="D8"> <v>3</v> </c> <c r="E8" s="1"> <f t="shared" si="0"/> <v>-1</v> </c> </row> <row r="9" spans="3:5"> <c r="C9"> <v>2</v> </c> <c r="D9"> <v>3</v> </c> <c r="E9" s="1"> <f t="shared" si="0"/> <v>-1</v> </c> </row> <row r="10" spans="3:5"> <c r="C10"> <v>2</v> </c> <c r="D10"> <v>3</v> </c> <c r="E10" s="1"> <f t="shared" si="0"/> <v>-1</v> </c> </row> </sheetData> In cell E5, we see a "ref "attribute where the shared formula range applies, a "si" attribute which identifies the shared formula range (itself redundant with the "ref" attribute), and the actual formula for that cell. But in cell E6, the cell below E5 in the grid, we see a definition with just a "si" attribute. In other words, cell E6 is linked to cell E5. Here is the problem, let's say we make a manual change to cell E5 and remove the formula. We've seen in the first section of this article that the calculation chain is left unsynched, but an additional problem is that cell E6 is also left unsynched because its "si" attribute now points to nowhere. Note that the situation isn't any better if we merely update the formula, by doing so cell E5 is fine, but linked cells will reference the new formula, not the old one. So a simple change in cell E5 actually spreads unintentionally. It goes without saying that it's very expensive to make a simple change. That is a direct result of the optimization artefact. Someone willing to make a change cannot proceed without taking care of depending cells if the cell defines a shared formula range. The problem gets only bigger since we have to either remove the shared formula altogether in all cells, or translate the shared formula definition accordingly, which implies parsing the formula (the goal was only make a change in a cell!) and making a number of offset changes, many of which are left for the user to discover (formula tokens are complex). To remove the shared formula, the actual formula definitions in linked cells have to be built, and that's where an algorithm has to find a way to create these, essentially rolling back Excel's optimization as a preliminary step. We have a case where an optimization artefact becoming an embarassment, not a feature. Let's see how the situation compares to the old Excel binary file format (BIFF internal format is stored inside an OLE container). Here are the corresponding BIFF records : // BIFF : a shared formula, and cells linked to the shared formula [SHRFMLA 0015] 03 00 08 00 03 03 00 06 0B 00 4C 00 00 FE C0 4C 00 00 FF C0 04 [FORMULA 001B] 04 00 03 00 3E 00 00 00 00 00 00 00 00 40 08 00 05 00 03 FE 05 00 01 03 00 03 00 [FORMULA 001B] 05 00 03 00 3E 00 00 00 00 00 00 00 00 00 08 00 06 00 03 FF 05 00 01 03 00 03 00 [FORMULA 001B] 06 00 03 00 3E 00 00 00 00 00 00 00 F0 BF 08 00 07 00 03 FF 05 00 01 03 00 03 00 [FORMULA 001B] 07 00 03 00 0F 00 00 00 00 00 00 00 F0 3F 08 00 08 00 03 FF 05 00 01 03 00 03 00 [FORMULA 001B] 08 00 03 00 0F 00 00 00 00 00 00 00 00 C0 08 00 03 00 03 FF 05 00 01 03 00 03 00 // BIFF : same than above, made understandable [SHRFMLA 0015] (ref = 03 00 08 00 03 03) (formula = [currentrow][currentcolumn-2]-[currentrow][currentcolumn-1]) [FORMULA 001B] (cell row = 5 col = E) (formula is a link to the shared formula) [FORMULA 001B] (cell row = 6 col = E) (formula is a link to the shared formula) [FORMULA 001B] (cell row = 7 col = E) (formula is a link to the shared formula) [FORMULA 001B] (cell row = 8 col = E) (formula is a link to the shared formula) [FORMULA 001B] (cell row = 9 col = E) (formula is a link to the shared formula) With the Excel binary format, the design is right. The shared formula is defined outside a cell, so you can very simply remove the formula in cell E5 without breaking other cells. From a programming perspective, the new XML format qualifies as a regression compared to the binary file format. 4) VML isn't XMLContrary to what the ECMA 376 documentation says in many places, VML drawing parts are not deprecated at all. VML is in fact very pervasive in Word, Excel and Powerpoint documents, so it's even more a blatant problem. In the ECMA 376 documentation, part 4, page 4343, we learn : (emphasis mine) [Note: The VML format is a legacy format originally introduced with Office 2000 and is included and fully defined in this Standard for backwards compatibility reasons. The DrawingML format is a newer and richer format created with the goal of eventually replacing any uses of VML in the Office Open XML formats. VML should be considered a deprecated format included in Office Open XML for legacy reasons only and new applications that need a file format for drawings are strongly encouraged to use preferentially DrawingML .end note] Here is a way to create a VML part in a new document :
The corresponding XML in the drawing part xl/drawings/vmlDrawing1.vml is : <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <xml xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel"> <o:shapelayout v:ext="edit"> <o:idmap v:ext="edit" data="1"/> </o:shapelayout> <v:shapetype id="_x0000_t202" coordsize="21600,21600" o:spt="202" path="m,l,21600r21600,l21600,xe"> <v:stroke joinstyle="miter"/> <v:path gradientshapeok="t" o:connecttype="rect"/> </v:shapetype> <v:shape id="_x0000_s1025" type="#_x0000_t202" style="position:absolute; margin-left:203.25pt;margin-top:37.5pt;width:96pt;height:55.5pt;z-index:1; visibility:hidden" fillcolor="#ffffe1" o:insetmode="auto"> <v:fill color2="#ffffe1"/> <v:shadow on="t" color="black" obscured="t"/> <v:path o:connecttype="none"/> <v:textbox style="mso-direction-alt:auto"> <div style="text-align:left"/> </v:textbox> <x:ClientData ObjectType="Note"> <x:MoveWithCells/> <x:SizeWithCells/> <x:Anchor> 4, 15, 2, 10, 6, 15, 6, 4</x:Anchor> <x:AutoFill>False</x:AutoFill> <x:Row>3</x:Row> <x:Column>3</x:Column> </x:ClientData> </v:shape> </xml> From a pure markup perspective, it is XML,
but there are application-encoded values such as <v:shape style='top: 0; left: 0; width: 250; height: 250' stroke="true" strokecolor="red" strokeweight="2" fill="true" fillcolor="green" coordorigin="0 0" coordsize="175 175"> <v:path v="m 8,65 l 72,65,92,11,112,65,174,65,122,100,142,155,92,121,42,155,60,100 x e"/> <formulas> <f eqn="sum #0 0 10800"/> <f eqn="prod #0 2 1"/> <f eqn="sum 21600 0 @1"/> <f eqn="sum 0 0 @2"/> <f eqn="sum 21600 0 @3"/> <f eqn="if @0 @3 0"/> <f eqn="if @0 21600 @1"/> <f eqn="if @0 0 @2"/> <f eqn="if @0 @4 21600"/> <f eqn="mid @5 @6"/> <f eqn="mid @8 @5"/> <f eqn="mid @7 @8"/> <f eqn="mid @6 @7"/> <f eqn="sum @6 0 @5"/> </formulas> </v:shape> If that is XML, then I propose writing C code the following way : <v:shape style='top: 0; left: 0; width: 250; height: 250' stroke="true" strokecolor="red" strokeweight="2" fill="true" fillcolor="green" coordorigin="0 0" coordsize="175 175"> <v:path v="m 8,65 l 72,65,92,11,112,65,174,65,122,100,142,155,92,121,42,155,60,100 x e"/> <formulas> <f eqn=" int main(int argc, char** argv) { printf("Hello World\n"); return 0; } "/> </formulas> </v:shape> This is XML, right? There are angle brackets, it conforms to the XML W3C recommendation, therefore it's XML. VML also contains application-specific markup, with no documentation associated to it, for instance in the example above, The implication for an implementer, or for someone willing to make a change is that there is no way someone can possibly edit this thing without a proper implementation of the VML library itself. The risk of corruption is extremely high. Obviously, Microsoft expects that VML parts are replaced by other VML parts as a whole, without a finer granularity. The problem is that VML too can contain references to objects and other parts, so that contradicts even a simple template scenario. VML is an old, undocumented, library that speaks volume of past Microsoft lock-in strategy. Mr Bill Gates in person sent in 1998 a memo to the Office product group (led by Steven Sinofsky at the time), memo undisclosed to the public thanks to the IOWA consumer case :
The undocumented VML library shipped in Internet Explorer 5 in 2000, and has been part of Internet Explorer ever since. The DAV protocol (Distributed Authoring and Versioning) is an international cross-platform standard, open to everybody. God only knows why Bill Gates likes so much VML, and dislikes so much DAV... For the record, the ECMA 376 documentation describes the VML markup, but it does not specify it. Much of application-defined behaviors are left for one to guess. 5) Open packaging parts minefieldThe underlying architecture of how zip entries relate together is called by Microsoft "open packaging conventions". What it means is that zip entries are not independent, or even related by way of a single master zip entry which would work as a directory of all zip entries of relevance. There is a logical tree of entries which uses separate zip entries to define relations between zip entries. The logical tree has nothing to do with the physical tree of zip entries in a package, despite Microsoft continuously using screenshots of Windows XP's built-in ZIP folders to mimic a folder hierarchy. The problem with such an architecture is that a part may or may not relate to another and there is no standard way to know. Often, there is a r:id attribute right in the content of some XML part that tells the application that there is a relation, but this is not standard. By the way, Microsoft's PDF fixed format competitor called XPS is also based on the same underlying architecture, except that the team who developed XPS did not quite want to play by the same rules than the Office team. For instance the XPS main zip document entry related to one or more XPS pages with an attribute such as : Source="Pages/1.fpage". In other words, they are not using the r:id attribute, instead relying on their own mechanism. This makes it impossible for a generic library to know which part relates to which part, and it has an unfortunate consequence. The unfortunate consequence is being unable to know whether a part relates or not to another part makes it impossible to know, when you delete a part, if you are going to corrupt the document or not. The document becomes corrupt if it points or relates (implicitely or explicitely) to a missing part. It's unclear why Microsoft chose this way of doing things, obviously leading to an internal chaos, instead of just copying the research from the OpenOffice project, where a central directory is used (OpenOffice ZIP initiative predates Microsoft's by at least three years, despite Microsoft stealing the thunder). When you don't know the dependencies of a part, the consequence is obvious, you leave those parts alone. If you do this enough times, it clutters up the package, and soon enough you end up with a package containing any number of parts god only knows why they are there. Add to this you can add a part of any content type (arbitrary MIME type), and you have a recipe for disaster. Among other things, virus could proliferate. Microsoft's deletePart() function which is available in their System.IO.Packaging library (itself part of .NET), does not solve this problem. We have a case of poor engineering, creating unnecessary problems for others to worry about. 6) International, but US English first and foremostAn important ongoing tension with Office documents is the support for locales. Microsoft historically used a number of mechanisms to address this need, but they kept evolving and Microsoft aggregated all mechanisms to keep compatibility with older versions. What was hidden is being surfaced with the new XML. Anything that gets displayed, calculated, rendered or stored depends one way or another on an complex and undocumented combination of locale settings including : the Office application language, the Office application language settings (per application), the Office document language settings (per document), the system locale of the operating system. To save them time, Microsoft chose to store XML using the US English locale regardless of all settings above. This has an unfortunate consequence for implementers or those willing to make a manual change. Indeed, Microsoft is imposing everybody else to adapt to US English locale options (separators, date formats, formula conventions, ...) despite the fact that when using Office interactively, this fact is hidden to the user. The Office application infrastructure manages to abstract it away from users, which is a good thing. Office developers using VBA all over the world are used to work with localized functions, the complexity is hidden to them. But since the XML resurfaces this US English locale, all the complexity is left for one to implement. We are talking two decade worth of internationalization issues, for Office-related locale issues and Windows-related locale issues. To get an idea of how bad the situation is, suffice to say that a Microsoft employee part of the internationalization team in Windows has a blog where he posts daily horror stories. Also, for Excel formulas, it means the formula names are US English formula names, which you'll never see in Excel if you are using a locale version such as French or Brazilian. It's left for one to guess how to map function names one way to another, and of course the ECMA 376 documentation does not provide those localized formula names. If you intend not to implement a mapping to a locale, ideally your customer's locale, it implies you are willing to work with US English function names (plus US English separators, ...). If your company has invested in libraries or developed libraries in-house, they cannot be used anymore. Can it get any worse than that? Unfortunately yes. Despite Microsoft insistence to store everything using the US English locale, they still manage to store a number of contradicting country/encoding flags in the XML. Examples of that are DrawingML and VML languages. They store encoding tags for storing text chunks, but text chunks in document itself does not use any such encoding tag. It is in fact entirely possible that DrawingML and VML are implementations which involve nothing localized itself but which store localized tags in the document, while the rest of languages (WordML, SpreadsheetML, ...) are implemented otherwise : their implementation is chockful of encoding settings, but they need not store anything in the document itself. In other words, everything gets localized at run-time with WordML, SpreadsheetML, ... except DrawingML and VML. It's clear at this point that the legacy shows...One would have expected Microsoft to fix this once for all, provide a consistent framework. They chose not do so, and as a result, it's left for any implementer or someone willing to make a change to do the heavy lifting. What we are talking about here is entire internationalization implementation stacks which can represent years of work and stabilization. Ironically enough, you will not only have to implement this stuff (reverse engineering since it is not addressed by the ECMA 376 documentation), you will have to implement in a way that reproduces current Office flaws. No matter how correct your implementation is, you have to retrofit it to work just like Office does. To get a flavor of non-US English within US-English (thereby violating ECMA 376's own rules), all you have to do is to insert a chart :
Here is an excerpt of the chart part xl/charts/chart1.xml : <c:chartSpace xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"> <c:lang val="fr-FR"/> <c:chart> <c:title> <c:tx> <c:rich> <a:bodyPr/> <a:lstStyle/> <a:p> <a:pPr> <a:defRPr/> </a:pPr> <a:r> <a:rPr lang="fr-FR"/> <a:t>Some title</a:t> </a:r> ... A reader of the article also mentions that within a strict US-English stream, you can get localized text formatting. Here is how :
I am using a French version of Excel 2007. Here is an excerpt of the chart part xl/worksheets/sheet1.xml : <headerFooter> <oddHeader>&Ctrt&"-,Gras Italique"uiy tuieyrtui</oddHeader> </headerFooter> Gras and Italique are French for Bold and Italic. Just because I am using a French version of Excel 2007, the format produced inserts French localized fragments, therefore anyone willing to read Excel 2007 files in the most general case must be ready to parse non-US English. The ECMA 376 documentation says, in section 3.3.1.36, Part 4, page 1965 (emphasis mine) : &"font name,font type" - code for "text font name" and "text font type", where font name and font type are strings specifying the name and type of the font, separated by a comma. When a hyphen appears in font name, it means "none specified". Both of font name and font type can be localized values. What can be is supposed to mean? Who reviewed this documentation? And where are the localized values to expect? Reading the documentation I have the feeling that :
7) Many ways to get in troubleThe extensiveness of the ECMA 376 documentation, over 6000 pages, is telling how much legacy Microsoft is willing to bring into the future. Taking an example of such legacy clarifies what it takes to implement even a portion of the documentation. The example is text formatting. Any of the 3 applications, Word, Excel and Powerpoint uses its own text formatting markup. Worse, the shared libraries themselves (VML, DrawingML, MathML, ...) also use separate text formattings, each different. Even worse, if that's possible, Word has many own ways to do text formatting. Excel has many own ways to do text formatting. Powerpoint has many own ways to do text formatting. By "many ways" is meant different markup, sometimes drastically different : in one you could have no country/encoding at all, and in another it's cluttered up with country/encoding markup. If Microsoft were to design a general purpose Office document model (note : ECMA 376 is a description of one specific Office document : Microsoft's), they would have factorized all of this into a single text formatting markup. God only knows why they chose not to do so, keep all the legacy, and try to get away with this mess by making as little publicity as possible about it. Now enter the implementer, or someone willing to make a change to a document. There are three scenarios :
The third scenario is just a combination of the others, so there is nothing interesting to say about it. The first scenario is the most simple. To write a document, of a given type, including a given set of objects (from shared languages or not), you only need to write the document in a way that is compatible with the expected XML. In other words, you can use only one text formatting markup model. It's you who decides which one, whether you implement one or more, and so on. So from a writer perspective, you don't suffer the problem very much. Now consider the second scenario. To read a document, you cannot assume what's in that document, therefore you've got to implement all possible combinations of objects that may be part of the document. In particular, you've got to implement all ways to get text formatting markup models because that may well be the XML you face. This is a horrible scenario. To support this scenario, either you are Microsoft, or you have a number of years of work ahead on the subject with plenty of implementation done already. There is no way around, the barrier to entry to this scenario is sky high. Of course, if you read a document, read the markup, and do nothing with it, or nothing of substance with it, it's not quite the same problem. But then, remember that even reading a small chunk of markup can be complicated because of the implicit semantics. You don't need a lot of XML markup to find yourself unable to process it in any meaningful way. To give you an example of how bad the situation is, here is 4 different Excel text formatting markup chunks, all meant to do the same thing (not entirely accurate here, but you get the idea) : 1) regular cell formatting <xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1"/><font><sz val="11"/><color theme="6" tint="-0.249977111117893"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font> 2) shared-string cell formatting (note that the shared-string is a technical artefact surfacing as everyone's problem now) <r><rPr><sz val="11"/><color rgb="FFFF0000"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t>ruir</t></r> 3) cell formatting in a conditional alert (note: the conditional alert itself is declared elsewhere) <dxf><font><b/><i val="0"/></font><numFmt numFmtId="2" formatCode="0.00"/><fill><patternFill patternType="solid"><fgColor auto="1"/><bgColor rgb="FFFFFFFF"/></patternFill></fill></dxf> 4) text formatting in charts <c:rich><a:bodyPr/><a:lstStyle/><a:p><a:pPr><a:defRPr/></a:pPr><a:r><a:rPr lang="en-US"/><a:t>t t</a:t></a:r><a:r><a:rPr lang="en-US" sz="1850" u="dash" baseline="0"><a:solidFill><a:schemeClr val="accent4"><a:lumMod val="60000"/><a:lumOff val="40000"/></a:schemeClr></a:solidFill><a:uFill><a:solidFill><a:schemeClr val="accent1"><a:lumMod val="60000"/><a:lumOff val="40000"/></a:schemeClr></a:solidFill></a:uFill></a:rPr><a:t>ruiry</a:t></a:r><a:r><a:rPr lang="en-US"/><a:t>t gfgfgfg</a:t></a:r></a:p></c:rich> The beauty about a file format that is impossibly hard to read and update, and is decently easy to write from scratch, is that it fits perfectly in the read-only model that is exactly the reason why Microsoft has a monopoly in Office documents. As a side effect to its proprietary-ness (Office 2007 documents are extensions of ECMA 376 documents), it provides zero interoperability with anything | |||||||||||||||||||||||||||||||||||||||