Welcome to MSDN Blogs Sign in | Join | Help

Brian Jones: Open XML Formats

I'm Brian Jones, a program manager in Office. I've been working on the XML functionality and file formats in Office for about 5 years now. In this blog, I'll mainly focus on XML in Office and the Open XML File Formats coming in the 2007 Microsoft Office system.
Links 12.13.2006

Some interesting links from the past few days:

File Formats

Why China's UOF is good - O'Reilly XML Blog

Very interesting post from Rick Jelliffe about how there are often needs for multiple formats, in order to properly meet the needs of the industry.

"Standardization is about the consolidation of variety: the reduction of variety is the means not the ends; but the total reduction of variety is not really a necesary means or a worthwhile ends."

Open Document Format (ODF) Accessibility Evaluator

Cool tool for scanning ODF documents looking for a set of criteria that helps determine how accessible the documents are. It would be really cool to see a similar tool built for Open XML.

I love these types of tools that show the power you can get from an open format. We have some examples that do similar things looking for what we call PII (personally identifiable information). It basically scans looking for XML that represents deletions, comments, hidden text, etc. and automatically strips it out.

Palimpsest: XML 2006: The ODF Plugin for MS Office

This was an interesting blog post from someone who attended a presentation that was made at the XML conference last week by one of the guys from the OpenDocument Foundation. It sounds like this may be where the whole "war of the file formats" is coming from :-). Even more interesting was that he actually compared it to Star Wars:

"Having said that, the presenter just compared the document debate to Star Wars...with an empire and forces of light. He left it to the attendee to figure out whether Microsoft or the OpenDocument Foundation represents the Empire."

Open XML (Ecma 376)

Wouter van Vugt : Paris and back again

Wouter van Vugt discusses the Open XML sessions in Paris.

[Open XML] GOAAAAL ! Open XML devient un standard de l'ECMA , Blog de Neodante (Julien Chable)

A French blog discussing the standardization of the Open XML formats.

Peter O'Kelly's Reality Check: Ecma International Approves Office Open XML as Worldwide Industry Standard

 

Doug Mahugh : Images in Open XML documents

Doug goes into more details around working with images inside of Open XML documents.

Doug Mahugh : Open XML in Estonia

"It's great to see so many young developers working with Open XML. I've met several college students in recent weeks who are already working with Open XML, and some even as young as high school (hi Antoine). It will be fun to see what kinds of creative Open XML applications this new generation of developers comes up with in the months and years ahead."

Novell & Microsoft

Technology Decision Makers Upbeat About Microsoft-Novell Deal

Coverage on the Microsoft-Novell deal from a bit of a different angle.

Stop bashing the Novell / Microsoft agreement | John Carroll | ZDNet.com

 

 

Posted Wednesday, December 13, 2006 10:14 AM by BrianJones | 1 Comments

What’s up with all those “rsids”?

As many folks who worked with the 2003 wordprocessingML format have probably noticed by now, there are is a new set of attributes/elements in the Open XML wordprocessingML format that shows up all over the place. I'm talking about RSIDs.

The rsid element is used to allow applications to more effectively merge two documents that have forked. It's best to use an example for explaining the use, so let's image I have a document that has the following text (we'll call this document "Brian1"):

Clearly this is a great thing for the industry. I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

I then send this document out to my coworker Steve to review and make changes. Steve decides that he wants to add in a bit of a sarcastic remark for the first sentence so when he sends back the document it looks like this (we'll call it "Steve1"):

Clearly this is a great thing for the industry (unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway). I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

While Steve was reviewing his copy of the document, I also made some changes. I removed that second sentence, so now my document looks like this (we'll call it "Brian2"):

Clearly this is a great thing for the industry. We now have an official standard that provides all the details necessary to read and write office documents.

Now, when Steve sends me his copy back, I'd like to have my word processor merge my document and his so that I get the most up to date version with both of our edits. Ultimately, the merged document would look like this (we'll call it "Final"):

Clearly this is a great thing for the industry (unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway). I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

The blue text is tracked as an insertion and the red text is a deletion.

Now, why is this example interesting at all? Well, if we only stored the basic text of this document, it would be very difficult to merge. In looking at the difference between "Brian2" and "Steve1", how would the application know what was an insertion and what was a deletion? If I still had my original file ("Brian1"), it would be easy to track this, but that's most likely not the case. I only have my edited document "Brian2", and Steve's document "Steve1". How do you know that the text "I personally feel like it's really cool" wasn't something that Steve added, as opposed to something that I deleted?

One way you can do this is via "track changes" functionality, where the application tracks the insertions and deletions as they happen and stores that in the format, but this often isn't desired. Often, for privacy reasons, people don't want to have the revisions tracked in their documents. Instead, they just want to be able to merge the documents later, and have the application figure out what was inserted, and what was deleted.

Well, the way we deal with this is through revision identifiers (rsids). Every time a document is opened and edited a unique ID is generated, and any edits that are made get labeled with that ID. This doesn't track who made the edits, or what date they were made, but it does allow you to see what was done in a unique session. The list of RSIDS is stored at the top of the document, and then every piece of text is labeled with the RSID from the session that text was entered.

This approach is what allows us to properly merge the two documents. When we merge documents, we can see what RSIDS the two documents share. Any shared RSIDS will represent text that was entered before the document was forked. Any RSIDS that are unique to one of the documents represent edits that were made after it was forked.

This means that if we see text in one document, but not in the other, all we need to do is look at the RSID applied to that text. If it's one of the shared RSIDs, that means the text existed before the documents were forked. That also means that when we merge the documents, we can assume that the text was deleted from one of the documents, rather than added to the other.

Let's go back to our example. In the original file, the XML would look something like this:

<w:body>
  <w:p w:rsidRDefault=
"00544FOB">
    <w:r>
      <w:t> Clearly this is a great thing for the industry. I personally feel like it
's really cool. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

This is saying that all runs (<w:r>) in the paragraph by default have the RSID "00544FOB". And in the document settings, we would have "00544FOB" listed as one of the RSIDs for the document. (note that there are a number of other places that RSIDs show up, but we're only focusing on the text for this case).

Now, after the document went to Steve, and he made his edits, the document "Steve1" would look like this:

<w:body>
  <w:p w:rsidRDefault=
"00544FOB">
    <w:r>
      <w:t> Clearly this is a great thing for the industry</w:t>
    </w:r>
    <w:r w:rsidR=
"00FF1F58">
      <w:t>(unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway)</w:t>
    </w:r>
    <w:r>
      <w:t>. I personally feel like it
's really cool. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

Notice that while the formatting properties on all three runs are the same, the RSID values are different. This happens because Steve added that additional text, so it was assigned to a new RSID value "00FF1F58". If you look in the document settings for this document, there will be two RSIDS: "00544FOB" and "00FF1F58".

Now, separately I opened my copy and deleted some text. So the document "Brian2" is going to look like this:

<w:body>
  <w:p w:rsidRDefault=
"00544FOB">
    <w:r>
      <w:t> Clearly this is a great thing for the industry. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

Notice that the runs in the paragraph all have the same RSIDs still. There aren't any new RSIDs in the body because I didn't add any text. I did however edit the document, so if you look in the document settings, there will be a new RSID. So in "Brian2", we have the following two RSIDs: "00544FOB" and "00A95BA5".

So, when we go to generate the "final" document, we merge "Brian2" with "Steve1". As we merge the two documents, we see that they share the RSID "00544FOB", but that all other RSIDs are unique to those copies. This means that any text with the RSID "00544FOB" existed in the original file, and any other text was added after the fork. There are two pieces of text in Steve's document that aren't in mine. The first piece of text that reads "(unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway)" was an addition made by Steve, rather than something I deleted. That text had an RSID unique to Steve's document. The other text that reads: "I personally feel like it's really cool." on the other hand has an RSID that is shared between the two documents. That tells us that it was deleted from my copy, rather than added to Steve's.

So, next time you're looking at a wordprocessingML document and you're wondering why it's broken out into so many runs, you'll know the answer. This is another example of how the simplicity of the flat schema wordprocessingML uses makes it easy to add properties to the various runs of text. The RSID isn't a container, but rather just a property of the text. If we had the ability to nest runs within other runs (similar to the HTML <span> model), then it would be a bit more complicated (not impossible, just more complicated). The architecture of a wordprocessing file is much simpler. Since runs can be nested in other runs, you have a more predictable ancestor list to walk through when finding the properties of that particular run.

If you would rather not have these RSIDs in your files, it's easy enough to turn off. Just go to the trust center and turn off the setting: "Store random number to improve combine accuracy"

Two other important things to note. First is that the RSID tells us nothing about the time or order things were done. They are completely random, and are only used for seeing where things match. So they aren't of much use unless you are merging with another document that also has RSIDs. Another thing to note is that these are not just used for content, but other settings as well like styles, layout, etc.

-Brian

Posted Monday, December 11, 2006 4:11 PM by BrianJones | 7 Comments

More on Ecma Standard 376

I wanted to post a few links that were interesting from the news yesterday about Ecma approving the Office Open XML formats as Ecma Standard 376. I'm sure by now most of you have seen all the articles and blogs discussing this big milestone.

Clearly this is a great thing for the industry (unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway). We now have an official standard that provides all the details necessary to read and write office documents. Of course for most solutions they will only need to leverage certain pieces of the standard, and not the whole thing. But for those that do want to build a full-fledged office suite, all the details about how to read and write the formats is there. The standardization process did two very important things. The first was that it allowed a group of experts from different areas in the industry (IT; archivists; software developers; hardware manufacturers) to ensure that the specification was fully documented and could work cross platform. The second was that the ownership and stewardship of the specification is now in the public's hand (and no longer owned or operated by a single vendor). This point could become even stronger once the Open XML formats have gone through their ISO submission. So now you have more choices available to you. Last year ODF was introduced as a choice, and this year Open XML is available as well. Remember, that just because something is a standard it doesn't mean you have to use it. It just means that if you want to use it, it's fully documented and available for use with no encumbrances. You know that a group has worked through it to ensure that there will be no barriers to implementation. That's the key that gives you the freedom to choose.

Here are some blogs and news stories worth pointing to:

Vive le Open XML Revolution

Doug Mahugh discusses an Open XML workshop that took place this week in Paris, where a collection of developers delved into the details of the Open XML formats, as well as the translation tools currently under development

More on the motivations behind IBM's opposition (hint: it's not about making the world a better place)

Yates said he does not understand why a large company such as IBM is at the forefront of creating conflict around the OpenXML format.

"They are also really focused on mandating ODF, mandating a single format that their commercial products support. This push to mandate ODF seems to be so antithetical to what they ordinarily talk about around open standards, interoperability and choice," he said.

Format approved as standard

Microsoft submitted the proposal with Ecma International, a Geneva-based industry group that establishes technical standards, and got backing from other players, including rival Apple Computer Inc. Ecma International announced Thursday its approval of Office Open XML as a standard, touting the step as vital for document creation and archiving.

BetaNews | Office Open XML Gains ECMA Approval; IBM Votes No

Sutor's claims run contrary to those of Ecma's Open XML white paper, offered to prospective supporters and the general public, well prior to today's vote.

"The interoperability of OpenXML has been accomplished through extensive contributions, modification, and review of the Specification by members of the Ecma TC45 committee with diverse backgrounds and corporate interests," the paper reads. "During preparation, committee members raised and resolved hundreds of issues regarding policy, clarity, semantics, and possible dependence on environment."

The paper goes on to list "specific areas in which OpenXML departs from the original binary formats for the sake of interoperability." Among them are the fact that embedded images may be of any type, embedded functionality is not dependent on any one programming language or runtime environment, and that embedded fonts utilize font metrics systems for determining the best available font on any user's system, when the specified font is not available.

Massachusetts to review the new Ecma standard

The Initiative for Software Choice, a trade association, hailed the ECMA approval. "There is no downside here," said Melanie Wyne, ISC executive director, in a statement.

"ECMA's action enhances document manipulation, interoperability, and archival storage for public and private institutions. The ECMA process also represents an important step toward expedited ratification by ISO, which will give governments and enterprises added assurance that Office Open XML meets the rigors of the evolving technological marketplace -- especially as it pertains to interoperability of documents between competing products," Wyne said.

Have a great weekend everyone

-Brian

Posted Friday, December 08, 2006 3:02 PM by BrianJones | 16 Comments

Ecma Standard 376 – Office Open XML formats

It's finally official. Today the Ecma General Assembly voted almost unanimously to approve the Office Open XML formats as an official Ecma standard. They also voted to submit the standard to ISO for fast track certification. The official press release from Ecma International can be found here: http://www.ecma-international.org/news/PressReleases/PR_TC45_Dec2006.htm

Here's a quote from Jan van den Beld, the Secretary General of Ecma International:

"The broad spectrum of sponsors from the industry and public institutions ensure the creation of an open standard that can create a wide range of possibilities for document processing, archival and interoperability" said Jan van den Beld, Secretary General of Ecma International. "The Open XML standard recognizes the benefit of backward compatibility preservation of the billions of documents that have already been created while enabling new future applications of document technology."

As I blogged about back in early October when we finalized the draft, it was a ton of hard work by around 20+ individuals from about 12 corporations, but it was well worth it. The dramatic improvements that the specification has undergone over the past year show it all. I was just looking back at one of my earlier blog posts from last year when we first submitted the Open XML formats to Ecma for standardization. It really is amazing that the group was able to accomplish so much work this past year. I can't wait to see what we're able to accomplish as we start working on the next version.

There are a few things that will be pretty fun to watch over the next few months. The one I'm most excited is the huge growth we're seeing in the Open XML developer community. We already have larger corporations/applications like Apple, Corel, Novel, Intel and Microsoft either participate directly in the standardization, or publicly comment on upcoming support for the Open XML formats. This is cool, but not the piece I'm as excited about. I'm looking forward to all the smaller companies (like the folks participating up on openxmldeveloper.org) that take advantage of this interoperable format as a building block for rich solutions that create and consume documents, spreadsheets, presentations, and other document types we haven't thought of yet. There are simple tools out there like the docx converter (http://docx-converter.com). There is the open source project up on sourceforge for converting from Open XML to ODF and back (http://odf-converter.sourceforge.net/). There is even the MindManager tool that converts mind maps into wordprocessing documents. The possibility are infinite… who knows how long it will be before we even see things like rich spreadsheetML files with branding, charting, pivot tables, etc. being output directly from hardware devices (like medical imaging equipment) rather than a basic CSV file.

Next step is going to be ISO certification, and here's hoping that IBM doesn't try to push too hard against this too. They've already said they were the only ones who voted against the Ecma certification which is a shame. Having a complete spec that outlines every piece of the Office file formats and is publicly maintained is a good thing. We shouldn't have to waste time arguing about that (unless we're in training to be a politician or something). Open XML may not be the best format for every use, but what is? HTML is great for some things, but really limited for others. XML is great for some things, but really limited to others. The same goes for these newer formats like ODF an Open XML. Microsoft had no problem with ODF becoming an ISO standard for instance, and didn't put up any roadblocks. It was a good thing. Hopefully IBM will be able to say the same about Open XML, but we'll have to wait and see. I haven't seen any public comments on that yet.

So, to all of my colleagues from TC45, congratulations! I think these stats say it all in terms of your hard work:

  • 72 presentations were given to the technical committee explaining the existing behaviors of features so that discussion on how to best structure and document it could then take place.
  • 66 hours of live meeting discussions (starting at 6am every Thursday for those of us on the west coast of the US)
  • 88 schema files
  • 128 hours of face to face meetings held in Brussels (ECMA); Cupertino, CA (Apple); London (British Library); Sapporro, Japan (Toshiba); Redmond, WA (Microsoft); Trondheim, Norway (StatOil)
  • 6,000 pages of documentation between the 5 parts of the standard
  • 9,422 different items to document (3,114 attributes, 2,500 elements, 3,243 enumerations, 567 simple types)

-Brian

Posted Thursday, December 07, 2006 6:14 PM by BrianJones | 16 Comments

Friday Thoughts 12-01-2006

I wanted to point out a few interesting things that have popped up over the past couple weeks:

IDC Report on Open Standards

Interesting study by IDC where they polled companies in Finland, Sweden, Norway, and Denmark to try and gauge the level of interest in open document standards. As you would imagine, the level of interest in the Open XML formats continues to grow, especially with the standardization vote only a week away and Office 2007 already out the door with full support for the final draft.

CompTIA believes Open XML will benefit the industry

"Our members believe that approval of Open XML by ECMA as an open standard will be a key advancement to the IT industry and provide critically needed choice in the document format space allowing for greater vendor independence and reduced lock-in. Competition among multiple open document standards will enhance innovation in document formats and increase flexibility and interoperability all to the benefit of software consumers. The approval of Open XML as an open standard will ensure that digital content is more efficiently stored and managed today as well as into the future."

Office 2007 "Business Launch"

Yesterday we announced the business launch of Windows Vista; Office 2007; and Exchange Server 2007. There is a blog describing all the details from the launch.

40 example code snippets for Open XML

Kevin Boske has an update on the code snippets that we last published back in June. They have now all been updated to work with the final draft of the Ecma standard. These are really great for folks trying to get started with development on top of the open xml formats. Kevin just had a baby girl (congratulations Kevin), so I'm not sure how active he'll be on his blog, but I'm sure he'll be checking every once in a while if you have any questions.

The snippets show how to do the following:

Excel Snippets

  • Add Custom UI
  • Delete Comments by a specific User
  • Delete Worksheet
  • Delete Excel 4.0 Macro sheets
  • Retrieve hidden rows or columns
  • Export Chart
  • Get Cell Value
  • Get Comments as XML
  • Get Hidden Worksheets
  • Get Worksheet Information
  • Get Cell for Reading
  • Get Cell for Writing
  • Insert Custom XML
  • Insert Header or Footer
  • Insert a Numeric Value into a Cell
  • Insert a String Value into a Cell
  • Set Recalc Option

PowerPoint Snippets

  • Delete Comments by User
  • Delete Slide by Title
  • Get Slide Count
  • Get Slide Titles
  • Modify Slide Title
  • Reorder Slides
  • Replace Image
  • Retrieve Slide Location by Title

Word Snippets

  • Accept Revisions
  • Add Header
  • Convert DOCM to DOCX
  • Remove Comments
  • Remove Headers and Footers
  • Remove Hidden Text
  • Replace Style
  • Retrieve Application Property
  • Retrieve Core Property
  • Retrieve Custom Property
  • Retrieve Table of Contents
  • Set Application Property
  • Set Core Property
  • Set Custom Property
  • Set Print Orientation

Doug Mahugh talks about arbitrary content within an Office Open XML file

Doug has a post where he talks more about the packaging conventions, and what you can to to put your own content into an Open XML file while still maintaining it's validity.

Technorati Profile

Posted Friday, December 01, 2006 4:14 PM by BrianJones | 6 Comments

Convert OpenXML files to HTML on sharepoint

The sharepoint team recently posted an article up on OpenXMLdeveloper.org on how they allow you to convert files in the Office Open XML format into HTML directly on the server: http://openxmldeveloper.org/articles/MOSSconvert.aspx

When they started investigating this solution, I made sure to point them at the XSLT that we build in Office 2003 to convert the earlier version of WordprocessingML into HTML (http://blogs.msdn.com/brian_jones/archive/2005/09/30/475794.aspx).

This article goes into the details on the challenge they faced with images as images are embedded by default in OpenXML and HTML doesn't allow for embedded images.

-Brian

Posted Thursday, November 16, 2006 10:18 AM by BrianJones | 2 Comments

Filed under:

Simple SpreadsheetML file (part 2 of 3)

This is a continuation on the "Simple SpreadsheetML file Part 1" post I made a couple weeks ago. In that post we created a SpreadsheetML file that consisted of a simple table with 3 columns and 3 rows of data (plus a header row). The table looked like this:

Sub Total

Tax

Total

14.95

   

19.95

   

4.95

   

Our goal though at the end of this series is to create a table that looks like this:

Sub Total

Tax

Total

$ 14.95

$ 1.20

$ 16.15

$ 19.95

$ 1.60

$ 21.55

$ 4.95

$ 0.40

$ 5.35

The pieces that we still need to add are the functions that calculate the values for the second and third columns, as well as the cell formatting. Today we're going to add the functions.

Part 1 - Simple Table

Let's pick up where we ended in Part 1 with the following parts:

workbook.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <sheets>
    <
sheet name="Brian" sheetId="1" r:id="rId1"/>
  </
sheets>
</workbook>

_rels/workbook.xml.rels

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <
Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheet.xml"/>
</
Relationships>

worksheet.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <
sheetData>
    <
row>
      <
c t="inlineStr">
        <
is>
          <
t>Sub Total</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Tax</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Total</t>
        </
is>
      </
c>
    </
row>
    <
row>
      <
c>
        <
v>14.95</v>
      </
c>
   
</row>
    <
row>
      <
c>
        <
v>19.95</v>
      </
c>
    </
row>
    <row>
      <
c>
        <
v>4.95</v>
      </
c>
    </
row>
  </
sheetData>
</
worksheet>

_rels/.rels

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <
Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="workbook.xml"/>
</
Relationships>

[Content_Types].xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <
Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <
Override PartName="/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
  <
Override PartName="/worksheet.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
</
Types>

Re-Create Version 1 of our simple SpreadsheetML file

So if you take those five parts and ZIP them up you'll get a spreadsheet that looks like this:

Sub Total

Tax

Total

14.95

   

19.95

   

4.95

   

Version 2 - Add functions to the first row of data

Now we're going to add a function in cells B2 and C2 that will give us the tax, and total value for the first row of data. Let's say that the tax we apply is going to be 8%. That means that the function to calculate the tax is going to be: =A2*0.08. The function for the total will then just be: =A2+B2.

In order to update our spreadsheet, we're only going to need to edit the worksheet.xml part, and we can leave the rest of the parts alone.

worksheet.xml (version 2)

We will need to create cells for B2 and C2, and add the formula definition we want for each of those cells. So, in the second row, we'll add two more <c> elements. This time though, rather than using an inline string (<is>) or value (<v>), we'll use the function tag <f>.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <
sheetData>
    <
row>
      <
c t="inlineStr">
        <
is>
          <
t>Sub Total</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Tax</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Total</t>
        </
is>
      </
c>
    </
row>
    <
row>
      <
c>
        <
v>14.95</v>
      </
c>
   <c>
    <
f>A2*0.08</f
>
   </
c
>
   <
c
>
    <
f>A2+B2</f
>
   </
c>

   
</row>
    <
row>
      <
c>
        <
v>19.95</v>
      </
c>
    </
row>
    <row>
      <
c>
        <
v>4.95</v>
      </
c>
    </
row>
  </
sheetData>
</
worksheet>

Create Version 2 of the spreadsheet

Take the original parts (_rels/.rels; _rels/workbook.xml.rels; workbook.xml; [Content_Types].xml) as well as our new worksheet.xml part and ZIP them up. When you open the resulting file, you should have a spreadsheet with the formulas automatically calculated and it looks something like this:

Sub Total

Tax

Total

14.95

1.196

16.146

19.95

   

4.95

   

Version 3 - Make the formulas repeat for the other rows

There are now two options for adding the formulas to the next two rows. You could do the same thing we did in the first row, and update the cell references (ie A3*0.08 & A4*0.08), but that requires you to update the cell references for each row. It also requires the consuming application to parse each formula, which can be time consuming when you get into larger spreadsheets and more complex formulas.

Another approach is to use a shared formula. If you were in an application like Microsoft Excel, you could copy the formula from the first row and paste it into the rows below it. Excel would automatically update the cell references in each row so that the tax and total was properly calculated. You can do the exact same thing in the file format by specifying that the formula is a shared formula.

worksheet.xml (version 3)

To specify that the formula from the first row is shared, we use the t="shared" attribute. We then specify what the range is that we want it to apply to, and give the formula an id that the lower cells can reference. Then we create the cells for B3:C4 and specify that they are sharing a formula. The resulting worksheet.xml part will look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <
sheetData>
    <
row>
      <
c t="inlineStr">
        <
is>
          <
t>Sub Total</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Tax</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Total</t>
        </
is>
      </
c>
    </
row>
    <
row>
      <
c>
        <
v>14.95</v>
      </
c>
   <c>
    <f t="shared" ref="B2:B4" si="0">A2*0.08</f>
   </c>
   <
c
>
    <
f t="shared" ref="C2:C4" si="1">A2+B2</f
>
   </
c>

   
</row>
    <
row>
      <
c>
        <
v>19.95</v>
      </
c>
   <c>
    <f t="shared" si="0"/>

   </c>
   <
c
>
    <
f t="shared" si="1"/>

   </
c>

    </
row>
    <row>
      <
c>
        <
v>4.95</v>
      </
c>
   <c>
    <f t="shared" si="0"/>

   </c>
   <
c
>
    <
f t="shared" si="1"/>

   </
c>

    </
row>
  </
sheetData>
</
worksheet>

Create Version 3 of the spreadsheet

So, again we've only updated the worksheet.xml part, so re-generate the ZIP file with the updated worksheet.xml part. You should get the following table of data:

Sub Total

Tax

Total

14.95

1.196

16.146

19.95

1.596

21.546

4.95

0.396

5.346

So, now you know the basics of using a formula in a spreadsheetML file. It's pretty straightforward. You could also specify the values of the formulas using the <v> tag as a sibling of the <f> tag, but that isn't necessary.

In the next post, we'll format the cells so that they actually look like currency, and not just plain numbers.

-Brian

Posted Wednesday, November 15, 2006 5:29 PM by BrianJones | 1 Comments

Filed under: ,

Week off

As I just mentioned in a comment I left in last week's post, I'm actually attempting to take a bit of a break this week. We actually wrapped up the development work on Office 2007 last Friday. It's been a huge release, and it's really exciting to see it finally go out the door. I've been working on these file formats for years now, and it's going to be fun to see this release wrap up. I need a vacation :-)

I didn't want to be totally lame though and not blog at all, so here are a few interesting things I wanted to point out:

  • Office 2007 released to manufacturing - it's been a lot of hard work, but it's been worth it.
  • Save Open XML from Older versions of Office - since it's just a web release, you can already get the final version of the free updates that allows older versions of Office to open and save in the Office Open XML format (before Office 2007 is available). This latest release allows you to save using the final 1.5 version of the Ecma working draft (which will be up for a final approval vote by the Ecma General Assembly in about a month).
  • Latest version of the ODF to Open XML translator available - The team working on this open source project has announced an updated release of the translator is now available. It sounds like they are pretty much done with the ODF to Open XML conversion for wordprocessing documents, and they even have a prototype going the other way (Open XML to ODF). Another really funny thing they found when testing this tool was that while Google claims ODF support in google docs, they actually are supporting the old star office XML format, but putting the ODF extension on it.

-Brian

Posted Tuesday, November 07, 2006 2:08 PM by BrianJones | 6 Comments

Novell and Microsoft teaming up on document interoperability

I'm sure by now most folks have read the news on the collaboration agreement between Novell and Microsoft. If not, you can read more up on the Microsoft interoperability site: http://www.microsoft.com/interop (Novell's press release is here)

There are a number of really cool pieces to this (Jason Matusow blogged on this yesterday), and one of those directly relates to the Office Open XML file formats. With this announcement Novell has said they will do the development work to allow OpenOffice to support the Office Open XML formats. This plug-in will be directly distributed with their edition of OpenOffice, but it will also be provided back to the OpenOffice.org organization so that everyone can leverage it. 

Another great piece is that they are going to start participating in the Open XML Translator project. It's an open source project we helped start back in the summer that will translate from Open XML to ODF and from ODF to Open XML. It's a really sweet project because it could be plugged in just about anywhere. You can read more about it on the blog that the developers of the project set up (http://odf-converter.sourceforge.net/blog/index.php).

Here is some more information from the FAQ up on Novell's site (http://www.novell.com/linux/microsoft/faq.html):

Document Format Compatibility. Microsoft and Novell have been focusing on ways to improve interoperability between office productivity applications. The two companies will now work together on ways for OpenOffice and Microsoft Office users to best share documents and both will take steps to make translators available to improve interoperability between Open XML and OpenDocument Formats.

As you know, Novell has been working with us the past year on the Ecma standardization of the Office Open XML file formats. Jody Goldberg, who works on both Gnumeric as well as Open Office was a huge help with SpreadsheetML. One of the key things that Novell was focused on was ensuring that the formats were fully documented and interoperable so that they could support them as well.

I also really liked reading Michael Meeks' blog on this subject. Michael is a distinguished engineer from Novell who works on OpenOffice (and he's Jody's manager):

  • Inevitably people will have some really good questions here, and to save my fingers I thought I'd point out a few points.
    • Why help Microsoft with OpenXML interop. ? OpenXML sucks, OpenDocument Rocks !?. So several thoughts:
      • This should not be a surprise - Jody Goldberg (on my team) has been working hard for months with Microsoft and others on the ECMA process. At one stage there around 1/2 the open 'issues' wrt. improving disclosure (and hence the spec.) came from Jody. I for one am proud of the job that he did there, an (ongoing) investment that will yield better interoperability for years to come.
      • As I have said for many months now, focusing on an 'Open-Standard' of ~700 pages written by a small team over a short period, is to miss the staggering value that is found in Free software. OpenOffice (as anyone who tried to start it recently knows) contains millions of lines of code, and a staggering investment of thousands of man years of sweat, tears (and perhaps blood). It's localized to umpteen languages, has deep help, scripting, accessibility, interoperability; it's just an immensely feature rich and powerful product.
      • To re-emphasise this, the value in OpenOffice.org is not what file format it supports (eg. we want to add good Lotus Word Pro support) but that it is truly Free software, that gives people critical Freedoms. An open format is anyhow implicit in the native file format of any openly developed Free software project.
      • Telling people about open standards, instead of Free Software is easy - 'normal' people generate data, not software so they understand, but it sells them radically short. In my view better interoperability (with any and all formats) strengthens Free Software, quite without the obvious pragmatic benefits to users & customers.
    • Why do business with these scum ?
      • It's true there is a widespread perception of unfair business practice from Microsoft out there, but my experience of working in the ECMA process with the developers, has been of meeting a (to my mind) mis-directed, but equally passionate world-view based around the love of their technology.
      • Broadly, I think it's fair to say there is a certain kind of person that loves to solve complex, technical problems, and I like that kind of person. It's also interesting to note that the average Microsoft (from my small sample) political viewpoint is -way- to the left of the average Novell Free software developer (perhaps a statistical aberration but ...). So, in a nutshell, they're good guys, if mis-directed. The great news is that we can help change that direction and get these guys addicted to the Free Software model.
      • One couple it was fun to meet, both on the Office team, obviously in love, confided in me that they had delayed their marriage to meet the Office 12 schedule: is that dedication ? Let's hope Wedding 2007 will ship on time; but imagine if we can help focus these guys on improving Linux <-> Windows interoperability, and in time Free software for it's own sake.
    • What does it mean for OpenOffice ? - my hope is over the long haul: better interop, more bodies hacking on OO.o, wider penetration of (Novell's) OpenOffice into the enterprise, and more individuals able to boldly hack on Free software.
    • What does it mean for Hackers ? - of course, I'm pleased that our team got such a great formal IPR covenant for individual developers from Microsoft. For sceptics that think this is a pure gesture, it's always surprising to me how a few key people seem to pop up again and again in Free software, and not everyone has the 7 year stamina that can be required, the RIAA demonstrates the danger well.
    • What does it mean for Novell ? - I'm pleased that it seems Microsoft will be distributing lots of SLES coupons, the more the merrier. Of course Nat and Miguel who helped setup the deal have a clearer view.
    • What is this Translator ? - it's the early stages of a open-source project to make a standalone bi-directional Open XML to OpenDocument converter. See SourceForge: odf-converter. What is important to me is not the set of design choices here (eg. a standalone XML to XML converter, though that may be useful for other Free software projects, or it's capabilities: a sub-set of Word only so far), but the end-goal of getting substantially better MS Office interop. (with OpenXML) into OO.o.

It's really cool to see that there will be a number of office applications (Corell, OpenOffice, older versions of MS Office) that will have support for these formats. I've personally been even more excited about the smaller 3rd party non-"office-type" applications that can also now get involved in consuming and generating rich office documents. It continues to raise the value of office documents, as they are no longer just a black box, but instead every office document can serve as a data source. I'm like the possibilities for the developer community that keeps growing here. There is obviously a ton of valuable information that's going to be made available via solution providers. The fact that to both consuming and generating documents has become so much easier is huge.

The translator project is particularly interesting though as it makes it easier for folks to choose the format they want to work with. The Office Open XML formats clearly have customer needs that they were designed to solve; and the OpenDocument format had customer needs that those folks were trying to solve.

-Brian

P.S. Michael,
Tristan and Krista are still on track for the wedding (and it will actually be Wedding 2006, not 2007 :-)

Posted Friday, November 03, 2006 2:32 AM by BrianJones | 8 Comments

Simple SpreadsheetML file Part 1 of 3

I posted a bunch of "Intro to SpreadsheetML" posts about a year or so ago, but those were all based on the Office XP spreadsheetML format. I think an updated series based on the Open XML standard is long overdue. I'll start off just building a simple table, and in future posts show more about formatting, formulas, and maybe even some charts.

Today, we'll start by creating a simple table. Then we'll add a little bit of number formatting and some formulas. In the end, we'll have the following table (where the tax & total columns are automatically calculated based on the Sub Total column):

Sub Total

Tax

Total

$ 14.95

$ 1.20

$ 16.15

$ 19.95

$ 1.60

$ 21.55

$ 4.95

$ 0.40

$ 5.35

We'll take this in 3 separate blog posts:

  • Part 1 - Create the simple table without formatting or calculations
  • Part 2 - Add functions to calculate "Tax" and "Total"
  • Part 3 - Add formatting so the data shows up as currency

Part 1 - Simple Table

Since we won't do the formatting or formulas initially, our table will look like this:

Sub Total

Tax

Total

14.95

   

19.95

   

4.95

   

As I discussed in my "simple wordprocessingML document" post, the Office Open XML format is comprised of a number of XML files within a ZIP package. The files follow a simple set of conventions called the open packaging conventions (described in Part 2 of the standard). You need to declare the content types of the parts, as well as tell the consuming application where it should start (via the package relationship).

Unlike the WordprocessingML document we created though, a SpreadsheetML file has a bit more structure to it (the same is true for presentations). A SpreadsheetML file is actually a workbook that can contain multiple worksheets. so even your most simple workbooks will have at least 5 files within the ZIP package. So for this example, let's start by creating a folder somewhere and in that folder create the following files:

  • workbook.xml
  • worksheet.xml
  • [Content_Types].xml
  • _rels/.rels
  • _rels/workbook.xml.rels

workbook.xml

The workbook is essentially the container for the various worksheets. The workbook is where you can reference the styles part, shared string tables, and any other pieces of information that apply to the entire Spreadsheet file. In this example, since we're just creating a super basic spreadsheet, the workbook will be very simple:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <sheets>
    <
sheet name="Brian" sheetId="1" r:id="rId1"/>
  </
sheets>
</workbook>

The only interesting thing we did here was to create the sheet tag, which then references out worksheet via the r:id attribute. Remember that almost every time you reference another part or even something outside of the file like a hyperlink or a linked image, you will use a relationship. The next thing we need to do is actually create that relationship in the workbook.xml part's relationship file.

_rels/workbook.xml.rels (part 1)

This is pretty basic. We just need to create a relationship that has an id of rId1 so that it will match the reference from the workbook.xml part:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <
Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheet.xml"/>
</
Relationships>

Notice that a relationship has three main attributes. It has an Id attribute who's use will be more obvious in a bit. The Target attribute tells you where to go, and the path is relative to the parent directory of the "_rels" folder that relationship file is in (in this case that's the root directory). The Type attribute describes what kind of relationship it is (ie what kind of stuff is it pointing at).

In this part we have one relationship who's type is "worksheet", and the target points to our worksheet.xml part. Now we need to actually put some content in the worksheet.xml part

worksheet.xml

The worksheet.xml part is going to be pretty simple. The first row in the sheet will have the column titles ("Sub Total", "Tax", and "Total"). The next 3 rows will only have data in the first column as we won't create the calculation functions until later in this example.

The worksheet.xml part should look something like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <
sheetData>
    <
row>
      <
c t="inlineStr">
        <
is>
          <
t>Sub Total</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Tax</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Total</t>
        </
is>
      </
c>
    </
row>
    <
row>
      <
c>
        <
v>14.95</v>
      </
c>
   
</row>
    <
row>
      <
c>
        <
v>19.95</v>
      </
c>
    </
row>
    <row>
      <
c>
        <
v>4.95</v>
      </
c>
    </
row>
  </
sheetData>
</
worksheet>

If you've played around with the SpreadsheetML that Excel outputs, you'll notice that I've handled strings a bit differently. For faster save and load times, Excel actually uses the shared string table that is an optional feature of the Office Open XML formats. It allows you to write a string just once, and then reference that string from within the grid. So, instead of saying "Sub Total" in that first cell, Excel would have an id that references the entry in the string table for "Sub Total". This is just an optional feature though, and for simplicities sake I've just put my strings inline (using the <is> tag). There isn't really any perf gain if the string is only used once.

Also, you'll notice that unlike a typical table format (like HTML, CALS, etc.) the XML above is representing a spreadsheet. It's a subtle difference when working with simple examples like this, but becomes more obvious as you move into more complex spreadsheets. One noticeable difference right away though is that we don't write any elements down for the empty cells B2:C4. If there isn't any data in a cell, then you just don't write anything. This is a bit of a different model from table formats that are more presentation based.

_rels/.rels

How does a consuming application know where it should start when opening an OpenXML file? The first place you always look is the package relationships file. The package relationship file will always be located in the "_rels" directory, and it's always called ".rels". We need to create an XML file that tells the consumer that "workbook.xml" is the first place you should go, and that this type of document is an Office Open XML document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <
Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="workbook.xml"/>
</
Relationships>

[Content_Types].xml

OK, so we've now created the main workbook.xml part and the worksheet.xml part, as well as created a relationship between the two.

Every Office Open XML file must declare the content types used in the ZIP package. That is done with the [Content_Types].xml file. We currently have two parts in this document that we need to declare content types for. The first is the document.xml part; the second is the _rels/.rels part. So, the content types file should look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <
Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <
Override PartName="/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
  <
Override PartName="/worksheet.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
</
Types>

Here we are saying that anything ending with the .rels extension is of type Package Relationship, and we also declare that the part workbook.xml is of type workbook and worksheet.xml is of type worksheet.

Create Version 1 of our simple SpreadsheetML file

OK, we should now have five files. Three of the files are in the root directory (workbook.xml; worksheet.xml & [Content_Types].xml); and in the "_rels" directory we have the ".rels" file and the "workbook.xml.rels" file. Select the two files and the "_rels" directory and ZIP them up. Make sure that when you zip them up, the two files and the _rels directory are all at the root level.

Open this file in Excel, and you now have a simple file.

*Beta 2 TR Note*

Note that the namespaces are different between B2TR of Excel, and the final version of the Ecma standard. In this post, I have shown what the final version of the Ecma standard would look like. The RTM build of Office will use this same format. If you are on B2TR though, then you'll need to tweak the following namespace in the XML files: http://schemas.openxmlformats.org/spreadsheetml/2006/main

Instead of using that namespace, you'll need to use this namespace http://schemas.openxmlformats.org/spreadsheetml/2006/7/main to get it working in B2TR. The two parts you'll need to update with this namespace are wordsheet.xml and workbook.xml. Everything else should work fine

Well, that was the first piece. In the next post, we'll add functions to the spreadsheet.

-Brian

Posted Thursday, November 02, 2006 10:50 AM by BrianJones | 4 Comments

Filed under: ,

Friday thoughts (Oct 27, 2006)

I'd been meaning to post a write-up on how to create a simple SpreadsheetML document from scratch, but just haven't had the time this week to pull it all together. Hopefully I'll get that out early next week. I had already done a similar post for WordprocessingML (both for Beta 1 as well as RTM). Here are a couple things I wanted to point out for the week:

  1. Arccast interview on Office Open XML - Doug Mahugh and I did a live webcast last month with Ron Jacobs. Ron now has both Part 1 and Part 2 of the interview available up on Channel 9.
  2. Upcoming SpreadsheetML Generator - Stephane Rodriguez (who also wrote the Open XML diffing tool) is getting really close to releasing the latest version of his xlsgen tool; and it will include SpreadsheetML support (http://www.arstdesign.com/BBS/BulletinBoard.php?qs_id=1661). I think he's planning on still going with the B2TR version of SpreadsheetML and won't update to the RTM version until after Office 2007 ships. There were still a couple changes that tool place between B2TR and RTM for spreadsheetML that brought it fully inline with the Ecma standard, but that resulted in B2TR not having the ability to open RTM spreadsheetML files (similar to what all three applications experienced between B2 and B2TR).
  3. Apose.Words supports WordprocessingML - I saw this blog post the other day that mentions that Apose.Words now supports exporting as WordprocessingML.
  4. Document your SQL DB using WordprocessingML - From this blog post: "Data Dictionary Creator (DDC) is a simple application which helps you document SQL Server databases. It stores all the information in Extended Properties, so it's easier to keep the documentation in sync with the database as it changes… DDC exports to WordML, Excel, HTML, and XML."
  5. Generate Wordprocessing Documents from your SAP Web Application Server - This is a cool intro article that shows how you can leverage WordprocessingML to generate rich documents directly from your SAP server. It would be really interesting to see some examples of leveraging the custom defined schema support and content controls in Word 2007 to not only populate the documents with SAP data, but to also mine that information back out of the document if the user has edited it.
  6. Leverage SpreadsheetML to build rich reports - I actually don't know anything about this product :-), but I randomly came across it and noticed that it allows you to generate spreadsheets using the original SpreadsheetML format we started working on over 8 years ago (and shipped with Office XP): "Perfect table creation – NEW SPREADSHEETML The new Microsoft format SpreadsheetML is supported. Based on XML, it generates richer editing and formatting of Excel files optimized for Windows Office 2003. Your tables are perfectly reproduced in Excel 2003, retaining the text and the colored background." I always love seeing people leveraging the existing technologies. The new SpreadsheetML format will give them a lot more power, as the old one didn't support Excel's full feature set.
  7. WordML import and export on the Mac - I think that this tool is built on top of the TextEdit functionality built into the Mac (which supports WordML itself). Not sure if they actually do anything additional in terms of the WordML support.

I hope everyone has a great weekend.

-Brian

Posted Friday, October 27, 2006 2:29 PM by BrianJones | 10 Comments

Performance of an XML file format for spreadsheets

I've blogged in the past about some of the things we did with the SpreadsheetML format to help improve the performance of opening and saving those files. While a file format most likely won't affect the performance of an application once the file is loaded, it can significantly impact the performance of the actual load or save. The majority of a wordprocessing file is usually the actual text content and anything you can do to avoid making the storage of that content complex the better. The majority of a presentation is usually the rich media and styling information and it's important to store those in an efficient manner. The biggest challenge though hands down in terms of file format performance though is a spreadsheet.

It's very easy to get Spreadsheets comprised of millions of cells. In older versions of Excel, it used to be that the maximum number of columns allowed in a worksheet was 256 and the maximum number of rows was 64,000. That meant that a single worksheet in a workbook could consist of over 16 million cells. An of course there can me many worksheets in a workbook.

There were a number of customers who were hitting this limit and wanted us to increase the number of rows and columns allowed in a worksheet (this was actually one of Excel's top requests). This had been a request for awhile, but it was going to be extremely difficult to increase this limit in the old binary formats so we didn't do it. The move to the new Open XML formats though allowed us to increase the limits, so now in Excel 2007, you can have up to 16,000 columns and 1 million rows. That means you could have a worksheet with 16 billion cells. I can't imaging anyone hitting that limit, but to be honest, I couldn't believe folks were hitting the older limits.

So, while that's a look at the extreme end of things, it's not too uncommon to have very large spreadsheets, and a workbook with a million cells of data is definitely something we see a lot of.

Now, while in a Wordprocessing document you may potentially have a million words (which would be about 2000 pages), it's not as likely. More importantly though, the structure of the XML doesn't have as big of an impact. In wordprocessingML, you don't have every word contained in it's own XML element. So you while may have a million words, your XML element count would probably only be in the tens of thousands.

In a spreadsheet though, you have each cell represented by at least one, if not more XML elements which means that a spreadsheet of any reasonable size can easily have millions of XML elements and attributes. This can have significant implications on the load and save performance of the files.

A couple examples:

There are a number of things we looked into doing to help improve the performance of these files both using current technology as well as thinking about future approaches. Some of the approaches we took that are easiest to understand are:

  1. Reduce the need to fully parse repeating data - In order to cut back on parsing large strings repeated multiple times, or formulas that are repeated down an entire column, spreadsheetML does a lot of sharing. I already talked about the huge benefits we can get from the shared formulas work in a post last spring. The shared string table can give similar results.
  2. Tag Size - Contrary to this post from IBM's Rob Wier, the size of XML elements actually does directly affect performance times. The more text you have to parse, the longer it will take. Rob I hope you didn't misunderstand some of my earlier points on this though. I don't mean in any way to give the impression that tag size is the biggest factor in terms of file format performance because it's not. There are a number of other architectural issues that play a much bigger role. That said, the size of the tags does have an impact (as I've blogged about before).

    The simplest way to see the impact for yourself is to just take an application like OpenOffice, and get a decent sized spreadsheet (I took a file with just one worksheet, that was 15 columns by 15,000 rows). Save it in ODF and make 2 copies. In one of the copies open content.xml and change some of the namespace prefixes to make them much longer. I aimed to make it so that the resulting content.xml file was about 3 times larger than the original since this helps replicate the size difference you would see if we changed SpreadsheetML so that it's tag length was longer and more descriptive like ODF (it would probably be more than that, but I found that making content.xml 3 times larger was more than enough to get a noticable slowdown in load times). I used namespace prefix approach because it's the easiest way to replicate the effect of a larger element name without changing the actual element names (since that would prevent the file from being readable by OpenOffice). In my test, by essentially tripling the size of the element names, I found that the file went from loading in about 9 seconds to taking about 11 seconds.

  3. Splitting the XML into multiple parts - In SpreadsheetML we break the XML content out into multiple parts in the ZIP (each worksheet, the string table, pivot table data, etc.). This can help a ton with on demand loading, or even loading multiple pieces at once. For example, it means that if you had a multi-proc machine, you could load each worksheet on a separate thread. It also means that you could decide to only load on sheet and you wouldn't have to parse through all the XML from the other sheets.
  4. Relationships stored outside of content - By storing all the relationships from one part to another in separate (and much smaller) files, it makes it really easy to see what other parts you should load when you are loading a particular part of the file. If it weren't for the relationships, you'd actually have to parse through the content of the XML to determine what other resources that part used. For example, if you wanted to just load one slide in a presentation, you can also see what images are used on that slide before you start parsing the XML for the slide.

There are clearly a large number of factors that play into the performance of a particular format. As you can see, it primarily comes into affect when loading or saving a file. If you do any type of delay loading though or incremental saves, it can also start to have an impact on your editing performance though. The key issues I've focused on primarily affect the general load and save times. The examples above are just a handful of issues we looked at when designing the formats.

-Brian

Posted Thursday, October 26, 2006 3:24 AM by BrianJones | 14 Comments

SpreadsheetML Dates

Recently a couple people have questioned the decision to leave the legacy Excel date behavior in the Open XML formats. This was primarily triggered by a post from IBM's Rob Wier. While those folks have referred to it as an Excel bug, it's actually something that was purposely designed into Excel since the beginning. Whenever you work on a feature, you eventually get to a point where you have to make difficult decisions and there are positives and negatives on each side. Let's go though a couple points here around this date issue: (1) how the dates are stored; (2) history of the bug; (3) backwards compatibility concerns; (4) drawbacks.

How dates are stored

Let's first look at the basic issue of how dates are stored in file formats.

When it comes to the storage of dates, there are a number of different approaches you could take. The key issue to first decide is whether you want to store it in a presentation friendly format, or one designed for easy storage and processing. For example, you could just store it as it's displayed to the user (ie "10/23/2006" or "October 23, 2006"), or maybe using ISO 8601. This could actually be a fine approach for certain types of formats (like a wordprocessing application, or maybe for metadata), but in the case of a spreadsheet that's probably not the best way to go. In order to quickly consume and create the files, you should probably choose one simple consistent storage method as you may be dealing with hundreds of thousands of dates in one file and you don't want to have to deal with any complicated parsing.

In the case of spreadsheetML, where you have the possibility of hundreds of thousands (if not millions) of dates existing in a single workbook, you need to think carefully about how those dates are going to be used. You also need to think about the fact that there may be formulas that take those dates into account. For example, you might have a table like this:

Project Number

Start Date

End Date

Daily Cost

Total Cost

1

9/13/2006

9/27/2006

$5,000.00

$70,000

2

9/20/2006

10/3/2006

$1,000.00

$13,000

The "total cost" column will have a function that is essentially:

=([end date] - [start date]) * [daily cost]

The simplest way to store these dates (both internally and in the formats) is to just pick a specific date as the base date and then every other date is just an index from there. This is similar to the Julian date system. So, if you decided that 9/13/2006 was the base date, then the data stored in the table above would be:

Project Number

Start Date

End Date

Daily Cost

Total Cost

1

1

15

$5,000.00

$70,000

2

8

21

$1,000.00

$13,000

This makes the storage and parsing of the dates a lot easier, and it also makes any formulas that work with dates easy to work with. The dates are always just stored as an index, and then it's just a matter of formatting to determine how the date is presented to the user. So the following dates (9/13/2006; Sept. 13 2006; Wednesday, September 13, 2006) are all the exact same values, they are just displayed to the user differently. This approach also has the advantage of working with other calendar systems since the storage of the date doesn't account for months, years, or days of the week, but is instead just a counter of how many days from a given date have passed. Then any unit of measurement smaller than a day (hours, minutes, seconds, milliseconds, etc.) are just a fraction. So 1 second would be 1.157e-5

History of the date "bug"

It's because of this "index" approach that the bug in question presents itself. Excel uses January 1, 1900 as the base date, meaning January 1, 1900 is stored as "1". The problem is that 1900 was not a leap year because while it passes the first test of being divisible by 4, it doesn't pass the second test which is that it must a multiple of 100 that is not divisible by 400. 2000 actually was a leap year, but 2100 will not be. Rob Wier's blog post has some great information on the history of this behavior in our modern calendar system.

Unfortunately this bug occurs because, February 29, 1900 is actually treated as a date when the indexes are calculated. This means that every date after February 29, 1900 is off by one (or you could also say that the base date was supposed to be 12/31/1899 and every day before March 1, 1900 is off by one). This is only the case for the year 1900 though. 2100 and on are treated properly. Why did this happen in the first place? It's actually covered pretty well in the following Microsoft KB article:

When Lotus 1-2-3 was first released, the program assumed that the year 1900 was a leap year, even though it actually was not a leap year. This made it easier for the program to handle leap years and caused no harm to almost all date calculations in Lotus 1-2-3.

When Microsoft Multiplan and Microsoft Excel were released, they also assumed that 1900 was a leap year. This assumption allowed Microsoft Multiplan and Microsoft Excel to use the same serial date system used by Lotus 1-2-3 and provide greater compatibility with Lotus 1-2-3. Treating 1900 as a leap year also made it easier for users to move worksheets from one program to the other.

Although it is technically possible to correct this behavior so that current versions of Microsoft Excel do not assume that 1900 is a leap year, the disadvantages of doing so outweigh the advantages.

If this behavior were to be corrected, many problems would arise, including the following:

  • Almost all dates in current Microsoft Excel worksheets and other documents would be decreased by one day. Correcting this shift would take considerable time and effort, especially in formulas that use dates.
  • Some functions, such as the WEEKDAY function, would return different values; this might cause formulas in worksheets to work incorrectly.
  • Correcting this behavior would break serial date compatibility between Microsoft Excel and other programs that use dates.

If the behavior remains uncorrected, only one problem occurs:

  • The WEEKDAY function returns incorrect values for dates before March 1, 1900. Because most users do not use dates before March 1, 1900, this problem is rare.

NOTE: Microsoft Excel correctly handles all other leap years, including century years that are not leap years (for example, 2100). Only the year 1900 is incorrectly handled.

Backward Compatibility

While it's no longer as important to maintain compatibility with Lotus 1-2-3, it is important to maintain compatibility with the billions of existing spreadsheet files out there. Barclay's Capital was one of the members of the Ecma TC45, and they helped make it crystal clear how important this backwards compatibility was. Imagine how upset our customers would be if we changed the values returned from functions like =WEEKDAY.

I think that one of the reasons some folks think this is just a "stupid decision" made by TC45 is that they aren't looking at all the places where the behavior of a spreadsheet could be affected. Remember, we don't just want the date values to be interoperable, but spreadsheet functions have to be as well. We spent a lot of time documenting how every single spreadsheet function would work so any application can come along and properly consume and generate SpreadsheetML files.

If we decided to fix this bug and shift each date value down by one, how many spreadsheet formulas out there would we break? Here's a really simple example, where the following function that had worked in previous versions would no longer work:

=IF(TODAY()=39013, "Due Today!", "Not Due Today!")

I will admit that the above function is probably not the best approach for comparing dates, but it's something that's always worked. Believe me, I've seen much crazier stuff in spreadsheets. We can't control how our customers use the product, and if something like this worked before, we can't break it now. If we changed our behavior so that 39013 no longer represented 10/23/06 and instead represented 10/24/06, that would completely break the function above.

It's true that we could probably write a solution that parses through all functions out there and tries to determine if the number in the function is supposed to represent a date, but that would be seriously buggy, and to be honest it wouldn't be worth it.

Drawbacks?

One way I like to think about this whole issue is to look at this the other way. Let's instead say that the base date system for SpreadsheetML was supposed to start on December 31, 1899. There's just a bug where dates between then and March 1, 1900 are off by one. So while we have this odd behavior for the first two months of the last century, everything after March 1, 1900 is stored correctly. I'm actually only half kidding here... since the date is just stored as in index, there is no real problem caused by this bug, other than maybe looking a bit foolish. In fact, OpenOffice has a behavior vary similar to this. In OpenOffice and Microsoft Excel, the numerical value for 9/13/2006 is equivalent. The numberical values for dates map all the way back to March 1, 1900; and then from that point back they are off by one (so in OpenOffice, 12/31/1899 maps to "1"). So the key thing to notice is that this behavior is fully documented, and there is no reason it should prevent anyone from properly consuming and generating valid SpreadsheetML files.

As I said in some earlier comments, this did come up in the TC45 meetings fairly early on. We didn't spend too much time on it though as it was pretty obvious that while it looked a bit silly, there was no harm. And we all recognized that if we did try to change this, it could seriously change the meaning of existing files.

Remember, this format was not an attempt at building the ultimate generic Office application file format from the ground up. If we did that, and didn't take our customers existing documents into account, then none of our customers would use the format. I think at times people misunderstand the design goals behind the Office Open XML formats, and I'm sorry if I haven't been clear enough on that. The Office Open XML formats were definitely designed with the existing base of legacy Microsoft Office file formats in mind, and it's important to remember that when you look at the format. There are definitely designs in the Open XML formats that clearly show their origins. A key focus of TC45 though was to make sure those behaviors were completely implementable on any other platform. This is covered in more detail in the TC 45 whitepaper.

We not only wanted to create an open format that folks could build solutions on top of, but we wanted the format to be something that our customers would actually use... otherwise what's the point? We didn't want this to just be another optional format that only some people would use, it's the new default format and we hope that all of our customers will use it.

-Brian

Posted Wednesday, October 25, 2006 2:34 AM by BrianJones | 24 Comments

Whitepaper summarizing the Office Open XML standard

Ecma has now published a 14 page whitepaper that does an excellent job of describing the Office Open XML standard and the different goals and challenges TC45 had over the past year while working on the spec. I highly recommend everyone interested in office file formats take a look: http://www.ecma-international.org/news/TC45_current_work/OpenXML%20White%20Paper.pdf

Rather than describe the whitepaper in my own words, I figured I'd just leverage the introduction as it does an excellent job of summarizing the overall purpose of the whitepaper:

Office Open XML (OpenXML) is a proposed open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on multiple platforms. Its publication benefits organizations that intend to implement applications capable of using the format, commercial and governmental entities that procure such software, and educators or authors who teach the format. Ultimately, all users enjoy the benefits of an XML standard for their documents, including stability, preservation, interoperability, and ongoing evolution.

The work to standardize OpenXML has been carried out by Ecma International via its Technical Committee 45 (TC45), which includes representatives from Apple, Barclays Capital, BP, The British Library, Essilor, Intel, Microsoft, NextPage, Novell, Statoil, Toshiba, and the United States Library of Congress (1).

This white paper summarizes OpenXML. Read it to:

  • Understand the purposes of OpenXML and structure of its Specification
  • Know its properties: how it addresses backward compatibility, preservation, extensibility, custom schemas, subsetting, multiple platforms, internationalization, and accessibility
  • Learn how to follow the high-level structure of any OpenXML file, and navigate quickly to any portion of the Specification from which you require further detail

I also really like the second section in the whitepaper titled "Purposes for the Standard," as it helps in dealing with a lot of the questions I've received over the past several months. I think at times folks still aren't clear on the reasons we created this file format in the first place and then passed ownership of it to Ecma international.

OpenXML was designed from the start to be capable of faithfully representing the pre-existing corpus of word-processing documents, presentations, and spreadsheets that are encoded in binary formats defined by Microsoft Corporation. The standardization process consisted of mirroring in XML the capabilities required to represent the existing corpus, extending them, providing detailed documentation, and enabling interoperability. At the time of writing, more than 400 million users generate documents in the binary formats, with estimates exceeding 40 billion documents and billions more being created each year.

The original binary formats for these files were created in an era when space was precious and parsing time severely impacted user experience. They were based on direct serialization of in-memory data structures used by Microsoft® Office® applications. Modern hardware, network, and standards infrastructure (especially XML) permit a new design that favors implementation by multiple vendors on multiple platforms and allows for evolution.

Concurrently with those technological advances, markets have diversified to include a new range of applications not originally contemplated in the simple world of document editing programs. These new applications include ones that:

  • generate documents automatically from business data;
  • extract business data from documents and feed those data into business applications;
  • perform restricted tasks that operate on a small subset of a document, yet preserve editability;
  • provide accessibility for user populations with specialized needs, such as the blind; or
  • run on a variety of hardware, including mobile devices.

Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority.

The emergence of these four forces – extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation – have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process.

Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. To the best of our knowledge, it is the only XML document format that supports every feature in the binary formats.

Tom Ngo, the editor of the whitepaper did an excellent job of summarizing what I've tried to convey numerous times in this blog (with varying degrees of success I admit). The industry absolutely needs OpenXML, and we've heard this repeatedly from our customers. It doesn't make it the "file format to replace all file formats," and we would never make such ridiculous claims. Instead it's simply an open standard that helps serve very real customer needs.

-Brian

Posted Tuesday, October 24, 2006 9:41 AM by BrianJones | 3 Comments

More on content controls

I posted earlier this year on the support for custom defined schema in wordprocessingML via the new content controls functionality. The key reason for opening the file formats was to make Office a more valuable platform for solution builders. We wanted folks to have the ability to take Office documents and plug them into new and existing business processes. All the work we've done in Ecma has helped to ensure that people have the necessary information for implementing the new file formats, but that isn't enough. The Office Open XML specification defines the XML for Office documents, but it doesn't define the types of structures that our customers use for their own business data. That's where the support for custom defined schema comes into play. It allows people to take their industry specific data structures (either their own, or one defined by someone else), and apply that structure to their documents so they have that additional semantic meaning. I'd promised to post more information on this but that fell through the cracks (sorry).

Tristan Davis from the Word team is now blogging on content controls and he'll be able to cover it in much greater detail than I could. You should check out his first post up on the Word blog: http://blogs.msdn.com/microsoft_office_word/archive/2006/10/23/control-yourself.aspx

-Brian

Posted Monday, October 23, 2006 11:44 AM by BrianJones | 1 Comments

Filed under: ,

More Posts Next page »