Hack 17 Convert Microsoft Office Files, Old or New, to XML


Use OpenOffice as a tool to convert Microsoft Office files to XML.

OpenOffice (http://www.openoffice.org/), the free, open source, multiplatform office application suite that provides an alternative to Microsoft Office, uses a documented XML format as its native file format. Put this together with OpenOffice 1.1's ability to read Word, Excel, and PowerPoint files from Office 97, 2000, and XP, plus Word 6.0 files, Word 95 files, and Excel 4.0, 5.0, and 95 files, and you've got a simple way to convert these files to XML.

When you store a document in OpenOffice's own file format [Hack #65], you create a ZIP file with the extension .sxw if you saved it with the OpenOffice Writer word processing program, .sxc if you saved it with the OpenOffice Calc spreadsheet program, or .sxi if you used the OpenOffice Impress slideshow program. The six files that you'll find in these ZIP files have self-explanatory names: mimetype, content.xml, styles.xml, meta.xml, settings.xml, and manifest.xml (the last of which is stored in a META-INF directory inside the ZIP file).
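Because the saved file is an ordinary ZIP archive, any ZIP-aware tool can open it. The following Python sketch shows one way to list the members of a package and pull out content.xml. The function names are made up for illustration, and make_demo_package builds a tiny stand-in archive in memory rather than a real OpenOffice document; with a real file you would pass a path such as report.sxw (a hypothetical name) instead.

```python
import io
import zipfile


def list_package_members(path_or_file):
    """Return the member names stored in an OpenOffice ZIP package."""
    with zipfile.ZipFile(path_or_file) as package:
        return package.namelist()


def read_content_xml(path_or_file):
    """Return the raw bytes of content.xml from an OpenOffice ZIP package."""
    with zipfile.ZipFile(path_or_file) as package:
        return package.read("content.xml")


def make_demo_package():
    """Build a tiny stand-in package in memory (not a real OpenOffice file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.sun.xml.writer")
        package.writestr("content.xml", "<doc>hello</doc>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = make_demo_package()
    print(list_package_members(demo))
    print(read_content_xml(demo))
```

Both helpers accept either a filename or an open file-like object, which is what zipfile.ZipFile itself accepts.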

Unless you're strongly interested in the inner workings of OpenOffice, the file content.xml should hold the most interest. Along with file content, it stores information about the use of built-in styles, styles you defined yourself, and even on-the-fly styling information not tied to defined styles, such as bolding of text with Ctrl-B. For word-processing files, the XML also identifies bulleted and numbered lists and footnotes. XML versions of spreadsheets include information about spanned cells and calculation formulas as well as results, and OpenOffice XML versions of slideshows store separate slides in separate elements, with slide notes in their own elements. (As soon as I found out about that, I wrote an XSLT stylesheet to pull slide titles and slide notes, minus slide content, into a single document that I could print and hold in my hand when giving presentations, something I'd always wanted to do when giving PowerPoint presentations, but could not.)
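To give a feel for working with content.xml, here is a minimal Python sketch that pulls headings and paragraphs out of a Writer document's XML. It assumes the OpenOffice 1.x text namespace is http://openoffice.org/2000/text and that headings and paragraphs appear as text:h and text:p elements; check a real content.xml before relying on either assumption. SAMPLE is a hand-written fragment, not output from OpenOffice, and extract_blocks is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for OpenOffice 1.x text elements.
TEXT_NS = "http://openoffice.org/2000/text"

# Hand-written stand-in for a Writer content.xml (illustration only).
SAMPLE = """<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <office:body>
    <text:h text:level="1">Intro</text:h>
    <text:p>First <text:span>bold</text:span> paragraph.</text:p>
  </office:body>
</office:document-content>"""


def extract_blocks(xml_text):
    """Return (kind, text) pairs for each heading and paragraph, in order."""
    root = ET.fromstring(xml_text)
    blocks = []
    for element in root.iter():
        if element.tag == f"{{{TEXT_NS}}}h":
            blocks.append(("heading", "".join(element.itertext())))
        elif element.tag == f"{{{TEXT_NS}}}p":
            blocks.append(("para", "".join(element.itertext())))
    return blocks


if __name__ == "__main__":
    for kind, text in extract_blocks(SAMPLE):
        print(kind, text)
```

The sketch flattens inline markup such as text:span, so on-the-fly styling is discarded. It also visits every text:p in the tree, so paragraphs nested inside footnotes or list items would be reported alongside top-level ones.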

2.8.1 DocBook

The OpenOffice Writer application provides an added bonus for DocBook [Hack #62] users: a DocBook (simplified) option in the Save As menu. This saves your document with a document type of article and with each paragraph in a para element. If the document used built-in Word styles such as Heading 1 and Heading 2, OpenOffice saves them as title children of appropriate containers such as sect1 and sect2, and it adds the container start and end tags in the right places to explicitly identify the hierarchical structure that the original Word file only hinted at.
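The idea of turning a flat run of headings into nested containers can be sketched in a few lines of Python. This is only an illustration of the technique, using a stack of open section levels; it is not how OpenOffice's own export is implemented, and blocks_to_docbook and its input format are invented for the example.

```python
from xml.sax.saxutils import escape


def blocks_to_docbook(blocks):
    """Nest flat blocks into sectN containers.

    blocks is a list of ("heading", level, text) or ("para", None, text).
    """
    out = ["<article>"]
    open_levels = []  # stack of section levels that are currently open
    for kind, level, text in blocks:
        if kind == "heading":
            # Close every open section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                out.append(f"</sect{open_levels.pop()}>")
            open_levels.append(level)
            out.append(f"<sect{level}><title>{escape(text)}</title>")
        else:
            out.append(f"<para>{escape(text)}</para>")
    # Close whatever is still open at the end of the document.
    while open_levels:
        out.append(f"</sect{open_levels.pop()}>")
    out.append("</article>")
    return "\n".join(out)


if __name__ == "__main__":
    demo = [
        ("heading", 1, "A"),
        ("para", None, "x"),
        ("heading", 2, "B"),
        ("para", None, "y & z"),
        ("heading", 1, "C"),
        ("para", None, "w"),
    ]
    print(blocks_to_docbook(demo))
```

The sketch does not handle skipped levels: a Heading 3 directly under a Heading 1 would produce a sect3 inside a sect1, which DocBook does not allow, so a real converter would need to insert or renumber the intermediate level.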

The conversion to DocBook format loses any references to defined paragraph and inline styles or on-the-fly formatting in the document. However, because the conversion is done with an XSLT stylesheet installed as part of the OpenOffice distribution, anyone familiar with XSLT can edit it, adding template rules to handle specialized cases for their own documents. From the OpenOffice Writer Tools menu, select XML Filter Settings, and then, with DocBook File highlighted, click the Edit button and pick the Transformation tab to find out the name and location of the stylesheet that creates exported DocBook files.

The formats I've listed above aren't the only formats that OpenOffice can read. If you have data dating back to the earlier days of personal computers, you may be interested in OpenOffice Calc's ability to read Lotus 1-2-3 and dBase files. So download OpenOffice (remember, it's free) and take a look at the formats it can read and the XML that it can create from these formats.

-Bob DuCharme