2004.12.29 12:50 PM

Normalize Excel's SpreadsheetML using XSLT

[Update: Didn't realize until just now that Microsoft repeats their office:spreadsheet namespace in SpreadsheetML documents, once as the default namespace and then again with a prefix. This led me to unnecessarily reference the same namespace twice with different prefixes in my XSLT. Also, changed the third xsl:when in the row calculation logic from count($prevRows/@ss:Index), which counted row index attributes, to count($prevRows[@ss:Index]), which counts rows with index attributes. The results are the same, but this is clearer, and it's consistent with the column calculation logic.]

[Update: .NET can't get this style sheet right. See my next post.]

[Update: Oops, even though I wrote about it, I still forgot to "prominently display" Microsoft's licensing verbiage in my XSLT. Only by doing so am I granted a royalty-free license to play with their patented SpreadsheetML. My apologies.]

 

Excel as XML is a great thing. It offers endless opportunities for analyzing workbook content and structure. However, SpreadsheetML's hierarchical organization of cells within rows, and offset-based row and column counting, makes some kinds of analysis tough to do.

Here's a small sample that illustrates the problem (I clipped the ss:Type and ss:Style attributes to reduce clutter):

<Row ss:Index="19">
  <Cell ss:Index="2"><Data>14</Data></Cell>
</Row>
<Row>
  <Cell ss:Formula="=1+1"><Data>2</Data></Cell>
  <Cell><Data>14</Data></Cell>
  <Cell ss:Index="11"><Data>1</Data></Cell>
  <Cell><Data>2</Data></Cell>
</Row>
<Row>
  <Cell ss:Index="5" ss:MergeAcross="3" ss:MergeDown="6"/>
  <Cell><Data>gapper</Data></Cell>
  <Cell><Data>What col am I?</Data></Cell>
  <Cell ss:Index="13"><Data>3</Data></Cell>
  <Cell><Data>4</Data></Cell>
</Row>

Quick, in what row and column does the Cell containing "What col am I?" appear?

Well, let's see. The last Row having an explicit ss:Index is 19, and our text appears in a Cell two Row elements below that, so the row is 21. And, within our text's Row, the first Cell has an explicit ss:Index of 5, then an ss:MergeAcross (i.e., span) of 3 more, then another Cell, and finally our text, so the column is 10. Note that Row counting is not affected by ss:MergeDown, as for some reason Excel always marks the first Row following an ss:MergeDown with an explicit ss:Index.

I'm sure this format benefits Excel in reduced size, but it sure makes analysis harder. As my four year old would say, "Good for them, bad for us."

To resolve this problem, I wrote an XSLT style sheet that calculates the row and column of Cell elements, making it possible to normalize the above XML into this:

<cell sheet="Sheet1" row="19" col="2">14</cell>
<cell sheet="Sheet1" row="20" col="1">2</cell>
<cell sheet="Sheet1" row="20" col="2">14</cell>
<cell sheet="Sheet1" row="20" col="11">1</cell>
<cell sheet="Sheet1" row="20" col="12">2</cell>
<cell sheet="Sheet1" row="21" col="9">gapper</cell>
<cell sheet="Sheet1" row="21" col="10">What col am I?</cell>
<cell sheet="Sheet1" row="21" col="13">3</cell>
<cell sheet="Sheet1" row="21" col="17">4</cell>

Of course, this may not be exactly what you want. For instance, this particular style sheet only considers non-blank cells, and it doesn't capture information concerning data types, formulas, named ranges, or styles. But, you should be able to reuse the row and column logic in your own style sheets without much effort. Keep in mind that the logic must execute in the context of a Cell element.

Here's the XSLT style sheet:

<?xml version="1.0"?>
<!--
This product may incorporate intellectual property owned by Microsoft Corporation. The terms 
and conditions upon which Microsoft is licensing such intellectual property may be found at 
http://msdn.microsoft.com/library/en-us/odcXMLRef/html/odcXMLRefLegalNotice.asp.
-->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
                exclude-result-prefixes="ss">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="/">

    <book>

      <xsl:for-each select="//ss:Cell[./ss:Data]">

        <xsl:element name="cell">

          <xsl:attribute name="sheet">
            <xsl:value-of select="../../../@ss:Name"/>
          </xsl:attribute>

          <xsl:variable name="prevRows" select="../preceding-sibling::ss:Row"/>

          <xsl:attribute name="row">
            <xsl:choose>
              <xsl:when test="../@ss:Index">
                <xsl:value-of select="../@ss:Index"/>
              </xsl:when>
              <xsl:when test="count($prevRows) = 0">
                <xsl:value-of select="1"/>
              </xsl:when>
              <xsl:when test="count($prevRows[@ss:Index]) > 0">
                <xsl:value-of select="($prevRows[@ss:Index][position()=1]/@ss:Index) + ((count($prevRows) + 1) - (count($prevRows[@ss:Index][position()=1]/preceding-sibling::ss:Row) + 1))"/>
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="count($prevRows) + 1"/>
              </xsl:otherwise>
            </xsl:choose>              
          </xsl:attribute>

          <xsl:variable name="prevCells" select="preceding-sibling::ss:Cell"/>

          <xsl:attribute name="col">              
            <xsl:choose>
              <xsl:when test="@ss:Index">
                <xsl:value-of select="@ss:Index"/>
              </xsl:when>
              <xsl:when test="count($prevCells) = 0">
                <xsl:value-of select="1"/>
              </xsl:when>
              <xsl:when test="count($prevCells[@ss:Index]) > 0">
                <xsl:value-of select="($prevCells[@ss:Index][position()=1]/@ss:Index) + ((count($prevCells) + 1) - (count($prevCells[@ss:Index][position()=1]/preceding-sibling::ss:Cell) + 1)) + sum($prevCells/@ss:MergeAcross)"/>
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="count(preceding-sibling::ss:Cell) + 1 + sum(preceding-sibling::ss:Cell/@ss:MergeAcross)"/>
              </xsl:otherwise>
            </xsl:choose>              
          </xsl:attribute>

          <xsl:value-of select="ss:Data"/>

        </xsl:element>

      </xsl:for-each>

    </book>

  </xsl:template>

</xsl:stylesheet>

Now imagine combining this with an Excel formula parser and doing things like counting function calls or range references, or building a call graph. As I said above, the possibilities are endless.


Comments

**Very** cool!!
I was just wondering - how difficult would it be to create the "inverse" of this stylesheet, translating back from your (much simpler!) XML to MS' SpreadsheetML?
The reason I ask is - at work, I'm needing to use SAS (programming language) to plonk some data into specific cells in an Excel sheet, and it all has to be done in a Unix environment. So, I'd use SAS to spit the data out into an XML file using your "nice-and-clean" layout, then use the "reverse" stylesheet to translate it back to something that Excel can read.
( The reason I have to do all this is that the data has to go into a quite-elaborate Excel file, and at present we're having to use a combination of remote (Unix) and local (Windows) processing. I want to be able to do it all in Unix. )
The ability to do the translation from your format back to Excel's format would be **EXACTLY** the thing I've been looking for, and would make things **so** much easier! .... :-) Looking forward to hearing from you - bye for now!
- Andy

Andy Elvey | 2005.01.13 12:06 AM

Hi Andy,

Thanks for the comments.

I think I understand what you're asking. You need to generate an Excel-compatible file containing one or more sheets having data disbursed on it in something other than a simple row-and-column layout, as might be accomplished using CSV or some other simple format. And, rather than try to target Excel's SpreadsheetML format directly, you're thinking it would be easier to target a flat/normalized XML format like I've described above and then convert that into Excel's more complex SpreadsheetML format using XSLT.

Yes, I think that's quite possible. The tricky part I think is figuring out the minimum SpreadsheetML schema requirements. For instance, I know from the documentation that it is not necessary to do row/column index counting the way Excel does - you can simply be explicit on each row/column. However, I'm not sure what else is needed (or not) in order to create a valid SpreadsheetML document.

I'll contact you directly by email to discuss, and if we come up with something I'll share it here in a separate post.

ewbi.develops | 2005.01.13 11:26 AM

Hi Andy,

I didn't hear back from you and had a little time this evening, so I cranked out a new simplified XML format and XSLT style sheet to generate SpreadsheetML:

http://ewbi.blogs.com/develops/2005/01/create_excels_s.html

Give it a look and let me know what you think.

ewbi.develops | 2005.01.13 11:43 PM

Hi,

Here is an adaptation of your transformation that keeps your SpreadsheetML document unchanged except that ss:Index attributes are added to each ss:Cell element.

This one can be used as a first step before other processing. It should be harmess enough to insure that Excel can still load a transformed document...

http://eric.van-der-vlist.com/blog/1310_Normalizing_Excel%27s_SpreadsheetML_using_XSLT.item

Thanks for posting your transformation, it did save me several hours of coding :-)...

Eric

Eric van der Vlist | 2005.07.27 06:05 AM

Eric,

Thanks! TypePad apparently couldn't deal with the XML in your comment, so I pulled it and put it in a file and updated your comment to include a link to it.

I haven't had a chance to run your stylesheet yet, but the idea of simply adding an index to an existing spreadsheetML document (as opposed to producing an entirely new document, as in my example) is fantastic. It never even crossed my mind!

Thanks so much for sharing it here.

ewbi.develops | 2005.07.27 06:42 AM

Eric - Better yet, I just saw that you posted your stylesheet on your blog, so I modified your comment to include a link back to it. Thanks again!

ewbi.develops | 2005.07.27 06:48 AM

You're welcome. That just seemed natural to post it back as an answer (and reward) to your own post...

Eric

Eric van der Vlist | 2005.07.27 09:04 AM

Has anyone had a problem with the column attribute? It seems to be incorrect for some cells w/o an explicit index. I'm not sure but it might involve cells that follow empty (not appearing the the xml) cells.

ajr | 2005.12.19 07:22 AM

this code will fail for all cells lacking an ss:Index that are preceded by two or more missing cells

for instance, the input cells (two missing cells, each followed by a cell with an ss:Index)...

1,2,3,4,5,6,,8,,10,11,12

will result in output cells with indices...

1,2,3,4,5,6,8,10,10,11

i haven't tested fully, but in practice i think the output index count is one too low for each one or more contiguous cells with an ss:Index attribute

i just wish my xsl skills were sufficient to fix it!

jc | 2005.12.20 01:16 PM

ajr and jc:

Thanks for bringing these things to my attention. I'm completely overloaded right now, but hope to take a closer look later this week.

ewbi.develops | 2005.12.20 01:22 PM

i did a little searching/experimenting, i'm working on a modified version of the original sheet so this may not be the best/complete answer, but for my usage i it seems to resolve the problem if the column index calculation in the third xsl:when uses this...

count(preceding-sibling::*)
- count(preceding-sibling::*[@ss:Index][1]/preceding-sibling::*)
+ preceding-sibling::*[@ss:Index][1]/@ss:Index

...instead of...

($prevCells[@ss:Index][position()=1]/@ss:Index) + ((count($prevCells) + 1) - (count($prevCells[@ss:Index][position()=1]/preceding-sibling::ss:Cell) + 1)) + sum($prevCells/@ss:MergeAcross)

btw, the calculation came from here...

http://www.biglist.com/lists/xsl-list/archives/200508/msg00538.html

jc | 2005.12.20 03:12 PM

jc and ajr,

Please see my new post that addresses your issues in detail:

http://ewbi.blogs.com/develops/2006/01/roundtripping_s.html

Thanks again for the feedback.

ewbi.develops | 2006.01.05 06:45 AM

Suppose I want to get the value of the cell in column 17? I wote a parametriced function template to replace my Row[17]/Data references: here is the call and then the template. Does this help ?

[xsl:call-template name="getCellData"]
[xsl:with-param name="targetIndex" select="17"/]
[/xsl:call-template]

[xsl:template name="getCellData"]
[xsl:param name="targetIndex"]1[/xsl:param]
[xsl:param name="otherIndex"]0[/xsl:param]
[xsl:choose]
[xsl:when test="name()='Row'"]
[xsl:for-each select="xl:Cell[1]"]
[xsl:call-template name="getCellData"]
[xsl:with-param name="targetIndex" select="$targetIndex"/]
[xsl:with-param name="otherIndex" select="$otherIndex"/]
[/xsl:call-template]
[/xsl:for-each]
[/xsl:when]
[xsl:otherwise]
[xsl:variable name="newIndex"]
[xsl:choose]
[xsl:when test="@ss:Index"]
[xsl:value-of select="number(@ss:Index)"/]
[/xsl:when]
[xsl:otherwise]
[xsl:value-of select="number($otherIndex)+1"/]
[/xsl:otherwise]
[/xsl:choose]
[/xsl:variable]
[xsl:choose]
[xsl:when test="$newIndex=$targetIndex"]
[xsl:value-of select="normalize-space(xl:Data)"/]
[/xsl:when]
[xsl:when test="$newIndex]$targetIndex"/]
[xsl:when test="not(following-sibling::xl:Cell[1])"/]
[xsl:otherwise]
[xsl:for-each select="(following-sibling::xl:Cell)[1]"]
[xsl:call-template name="getCellData"]
[xsl:with-param name="targetIndex" select="$targetIndex"/]
[xsl:with-param name="otherIndex" select="$newIndex"/]
[/xsl:call-template]
[/xsl:for-each]
[/xsl:otherwise]
[/xsl:choose]
[/xsl:otherwise]
[/xsl:choose]
[/xsl:template]

NickN | 2008.11.18 03:36 AM

NickN, thanks for sharing that. TypePad's crappy comment system ate the XML and won't preserve indents, but I replaced the angle brackets with square ones so folks could see what you shared. Thanks for taking the time - I'm sure someone will find this to be a useful addition!

ewbi | 2008.11.26 07:50 AM

Hi, Can i make a question?
I use this technology since 2005. it works fine, but there are problems with office 2003. Microsoft Office 2003 not professional doesn't read this kind of document.
Are there same problems with your solution?
Thank for an answer.
I beg Perdon for my english.

Aldous | 2010.07.27 01:39 AM

Hi Aldous,

Do you mean you can't load workbooks in SpreadsheetML (XML) format with Excel 2003 (non-professional)?

ewbi.develops | 2010.07.27 08:25 AM

Yes. It is.
If You see this http://msdn.microsoft.com/en-us/library/bb226687(office.11).aspx.
this kind of trasformation apply to Applies to: Microsoft Office Excel Professional 2003, Microsoft Office Excel Enterprise 2003, Microsoft Office 2003 Edition XML Schema References.
It's true? Or your solution overcame this kind of problems?
Tahnks for an answer

Aldous | 2010.07.27 11:42 AM

Interesting. I didn't realize that Microsoft had limited their support for SpreadsheetML to the professional+ versions of Office. Unfortunately, this blog post (and some others like it here at ewbi.develops) aren't about overcoming that limitation, but are instead about transforming SpreadsheetML into other XML formats that are easier to work with, primarily outside of Excel. Good luck.

ewbi.develops | 2010.07.27 11:59 AM

Great Blog.
Looks like I landed at right place. Have a query.

Have you tried to embed images/logos?

Sidharath Sharma | 2011.05.17 06:44 PM

Sidharath Sharma, thanks for the comment. I haven't been in here for quite a while. It's unfortunate that I haven't had time to clean up this old blog - it displays just terribly in Firefox.

Anyhow, regarding your question, I haven't been in SpreadsheetML for quite a while either, but my recollection is that it doesn't support autoshapes, graphs, or other objects, like images. The new OpenXML standard does, but it encodes and treats them as separate resources in what is essentially a ZIP file with an XSLX extension - entirely different.

If you learn otherwise, or I've misunderstood, please let me know. Good luck!

ewbi.develops | 2011.05.17 08:30 PM



Post a Comment

 
  (optional)
  (no html)
 
   


TrackBack

TrackBack URL:  http://www.typepad.com/services/trackback/6a00d8341c7bd453ef00d834227a8853ef

Listed below are links to weblogs that reference Normalize Excel's SpreadsheetML using XSLT:

» Normalizing Excel's SpreadsheetML using XSLT from Eric van der Vlist
Spreadsheet tables are full of holes and spreadsheet processors such as OpenOffice or Excel have implemented hacks to avoid having to store empty cells. In the case of Excel, that's done using ss:Index and ss:MergeAcross attributes. [Read More]

Tracked on Jul 27, 2005 6:34:14 AM