Tuesday, June 14, 2005

Microsoft Office XML Document Formats

Microsoft has announced plans to make XML the native format for Office 12 documents. To explain the change and give an overview of what these documents look like, MS has just published the Microsoft Office Open XML Formats Architecture Guide.

In essence, these new documents are zip files. So a Word 12 document will be called .docx, but can be opened, viewed, and changed with WinZip. Each of these zip files contains mainly XML files, but also other files representing embedded resources. Thus, the structure of an OLE compound document (i.e. an old-style .doc) has been exposed as a zip file, with the constituent parts (what used to be streams in the compound document) surfaced as native OS file types.
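To make the "it's just a zip file" point concrete, here is a minimal sketch in Python that lists the parts inside a package. It assumes only that the file is an ordinary zip archive; the function name `list_parts` is mine, not anything from the Architecture Guide.

```python
import zipfile


def list_parts(package_path):
    """Return (part_name, size_in_bytes) for every entry stored in the package."""
    # Assumption: the document is a plain zip archive, as described above.
    with zipfile.ZipFile(package_path) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]
```

Calling `list_parts("spec.docx")` on a (hypothetical) document should give back the XML parts and any embedded resources, much as WinZip would show them.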

The result is that instead of using complex tools such as OLEVIEW to see inside a compound file, and developing complex (and sometimes imperfect) code to manage the file, you can open, view and manage the contents of a .docx file with WinZip and native OS tools. Better yet, you can use command-line tools to do document management.

This change makes it possible for you to write simple, but very, very powerful add-ins, or bolt-ons, to Office 12. Consider a development project that has lots of specs containing lots of bitmaps (i.e. the UI design), where these compound docs are now .docx documents. It would be relatively simple (at least compared to OLE compound documents) to write code to open each .docx in a folder, search for the table of contents, then look directly for the sub-documents you want, e.g. all the bitmaps used in the spec set. You could then just dump these to a folder somewhere. With Monad-MSH such a script could probably be written in half a dozen lines. Heck, you could probably add a few more lines to the script to make a NEW .docx with just the individual bitmaps, add in the index information, generate an xref, etc. And for what can't be done in Monad-MSH, there's always VB.NET or C#.
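As a rough sketch of that bitmap-dumping script, here is a Python version (rather than Monad-MSH, whose syntax I won't guess at). It takes a shortcut: instead of reading the package's own table of contents, it walks the zip directory and picks out parts by file extension. The function name, the extension list and the output naming scheme are all my own assumptions.

```python
import pathlib
import zipfile

# Assumption: embedded pictures can be recognised by file extension alone.
IMAGE_SUFFIXES = {".bmp", ".png", ".jpg", ".jpeg", ".gif"}


def extract_images(spec_folder, output_folder):
    """Copy every image part from each .docx in spec_folder into output_folder."""
    out = pathlib.Path(output_folder)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for docx in sorted(pathlib.Path(spec_folder).glob("*.docx")):
        with zipfile.ZipFile(docx) as package:
            for info in package.infolist():
                part = pathlib.PurePosixPath(info.filename)
                if part.suffix.lower() not in IMAGE_SUFFIXES:
                    continue
                # Prefix with the document name so images from different
                # specs do not overwrite each other.
                target = out / f"{docx.stem}_{part.name}"
                target.write_bytes(package.read(info))
                written.append(target.name)
    return written
```

Something like `extract_images("specs", "all_bitmaps")` would then leave one file per embedded image in the output folder. Building a new .docx from those images would need the real package structure, which this sketch does not attempt.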

This change of format really opens up access to the components of an Office document, and makes it incredibly easy to manage a document's contents. MS said at TechEd that one big source of errors in Office documents (i.e. ones leading to a support call!) is add-ins to Office that usually work great (but not always). Managing OLE compound documents was always complex, and the idea of implementing this structure directly in the OS (NSS, for those who remember it!) never made it to release. So, instead of storing a complex document in an easy-to-corrupt (and hard-to-fix) binary format, Office 12 stores it in a form that old, tried and trusted tools (WinZip, Notepad, et al.) can handle. As an IT Pro, I might now be able to fix (or better, salvage) damaged documents. This is cool - and it also opens up the possibility of some incredible fusion.
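Here is a minimal sketch of what such a salvage might look like in Python: try to read every part, keep the ones that come out clean, and note the ones that don't. It assumes the zip directory itself is still readable (if it isn't, opening the file fails outright), and the name `salvage_parts` is just something I made up for illustration.

```python
import zipfile
import zlib


def salvage_parts(package_path):
    """Return ({name: bytes} for readable parts, [names of unreadable parts])."""
    recovered, damaged = {}, []
    # Assumption: the archive's directory is intact; only part data is damaged.
    with zipfile.ZipFile(package_path) as package:
        for info in package.infolist():
            try:
                # read() checks the stored CRC, so corrupted data raises here.
                recovered[info.filename] = package.read(info)
            except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
                damaged.append(info.filename)
    return recovered, damaged
```

The recovered parts could then be written out to a folder and inspected in Notepad, leaving only the damaged parts to worry about.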

Finally, a few questions. First, with this new structure, why are there different programs (Word, Excel, PowerPoint, etc.)? This new paradigm should allow one program (Office) to do it all. As I open new document types, Office just brings in the parts I need and adds the relevant components to the zip file. As I embed other things inside the document (dropping an Excel spreadsheet into a report, bringing a PowerPoint slide into an Excel spreadsheet, with a link to a live video feed), it would adapt - the base document just getting a more complex TOC, and more components getting added to the zip file. Second, does this pave the way for putting basic Office functionality directly into the OS? Given the idea of adding basic workflow to Windows at some point (i.e. WinOE), why not also add some basic document management as well? You could layer an Office document management interface on top of, or to the side of, WinOE. This could simplify the creation of basic workflow objects in an Office-based productivity solution. The possibilities here are significant. And finally, does this mean you can author Office documents in Notepad and WinZip? I can't wait for the Office 12 beta!!
