2005.02.08 03:10 PM

XML Crosstabs using XSLT

Reader Andy Elvey left a comment on my last post about the Access SQL TRANSFORM statement asking how one might do the same sort of grouping and pivoting of XML using XSLT. Good question.

There are a lot of ways to do this. Choosing an acceptable approach depends in large part on the XML being grouped (e.g., whether it's element- or attribute-oriented, how many nodes it contains, etc.) and the type of grouping you intend to do. Unfortunately, I don't have time to explore all the ins and outs of arbitrary groupings of XML using XSLT, so I will instead just describe a technique that replicates the fixed column SQL TRANSFORM described in my earlier post.

In that post, I explored using a parameterized SQL TRANSFORM statement to convert the following table:

key  project year    amount
1    100     2003    $0.50
2    100     2004    $1.00
3    100     2004    $1.50
4    100     2005    $2.00
5    200     2004    $3.00
6    200     2005    $4.00
7    200     2005    $4.50
8    200     2006    $5.00

...into resultsets having one row per project showing the total of each project's amounts for two years: thisyear and nextyear, where thisyear is specified using a parameter named foryear. The final query produced the following results given foryear parameter values of 2002, 2004, and 2005:

project   thisyear   nextyear
100                  $0.50

project   thisyear   nextyear
100       $2.50      $2.00
200       $3.00      $8.50

project   thisyear   nextyear
100       $2.00
200       $8.50      $5.00

So now let's do the same thing using XML and XSLT.

For simplicity, I settled on the following attribute-based XML format for the data:

<?xml version="1.0" ?>
<data>
  <project key="100" year="2003" amount=".50"/>
  <project key="100" year="2004" amount="1.00"/>
  <project key="100" year="2004" amount="1.50"/>
  <project key="100" year="2005" amount="2.00"/>
  <project key="200" year="2004" amount="3.00"/>
  <project key="200" year="2005" amount="4.00"/>
  <project key="200" year="2005" amount="4.50"/>
  <project key="200" year="2006" amount="5.00"/>
</data>

The XML results we want to generate will be similarly attribute-oriented, with each project element in the results having a thisyear and nextyear attribute. In order to match the SQL TRANSFORM results above, we will make sure that for a given foryear, only those projects having at least one row for the specified year or next year will be included, and failure to find a row for the specified year or next year will result in a blank value for the respective attribute.

Here's what the results will look like given foryear parameter values of 2002, 2004, and 2005:

<?xml version="1.0"?>
<data>
  <project key="100" thisyear="" nextyear="0.5" />
</data>

<?xml version="1.0"?>
<data>
  <project key="100" thisyear="2.5" nextyear="2" />
  <project key="200" thisyear="3" nextyear="8.5" />
</data>

<?xml version="1.0"?>
<data>
  <project key="100" thisyear="2" nextyear="" />
  <project key="200" thisyear="8.5" nextyear="5" />
</data>

And here's an XSLT style sheet that will produce those results:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:param name="foryear"/>
   
  <xsl:variable name="thisyear" select="$foryear"/>
  <xsl:variable name="nextyear" select="$foryear + 1"/>

  <!-- key used to group projects using the Muenchian Method -->

  <xsl:key name="projects" match="project" use="@key" />

  <!-- key used to avoid inefficient full traversals using // for conditional checks and sums -->

  <xsl:key name="projects.years" match="project" use="concat(@key, ':', @year)" />
  
  <xsl:template match="data">
    <data>
    
      <!-- group projects using the Muenchian Method via the 'projects' key -->
    
      <xsl:apply-templates select="project[count(. | key('projects', @key)[1]) = 1]">
        <xsl:sort select="@key" data-type="number"/>      
      </xsl:apply-templates>

    </data>
  </xsl:template>
  
  <xsl:template match="project">
  
    <!-- skip projects not having a row for thisyear or nextyear -->
  
    <xsl:if test="count(key('projects.years', concat(@key, ':', $thisyear)) | 
                        key('projects.years', concat(@key, ':', $nextyear))
                       ) > 0">

      <project key="{@key}">
      
        <!-- include thisyear and nextyear attributes for all projects, but only populate if rows exist -->
      
        <xsl:attribute name="thisyear">
          <xsl:if test="count(key('projects.years', concat(@key, ':', $thisyear))) > 0">
            <xsl:value-of select="sum(key('projects.years', concat(@key, ':', $thisyear))/@amount)"/>
          </xsl:if>
        </xsl:attribute>      
        
        <xsl:attribute name="nextyear">
          <xsl:if test="count(key('projects.years', concat(@key, ':', $nextyear))) > 0">
            <xsl:value-of select="sum(key('projects.years', concat(@key, ':', $nextyear))/@amount)"/>
          </xsl:if>
        </xsl:attribute>      
        
      </project>

    </xsl:if>

  </xsl:template>

</xsl:stylesheet>

Some notes concerning the style sheet:

  • The style sheet utilizes the Muenchian Method for grouping projects. The Muenchian Method, named for Steve Muench, is a very efficient way to group nodes by one or more properties using an xsl:key. In this case, the grouping of project elements relies on a single property, specifically the key attribute, but it's easy to use multiple properties by simply concatenating them in the xsl:key use attribute.

    Unfortunately, if you're not using an XSLT processor that supports xsl:key, you'll have to utilize the less efficient method of looking back up the preceding-sibling axis to identify first instances of each value. The results will be the same, but the reverse-axis check will get increasingly slower for XML having many nodes.

  • The first xsl:if in the project template, which skips projects that don't have a row for thisyear or nextyear, is only necessary because the style sheet is parameterized. If it weren't parameterized, this filter could be defined in the 'projects' xsl:key instead, like so:

      <xsl:key name="projects" match="project[@year = 2003 or @year = 2004]" use="@key" />
    

    Unfortunately, parameter and variable references cannot be used in xsl:key match or use attributes, ostensibly to avoid circular references, so this is invalid:

      <xsl:key name="projects" match="project[@year = $thisyear or @year = $nextyear" use="@key" />
    

    Thus the secondary xsl:if.

  • The 'projects.years' xsl:key is used to efficiently determine whether a project has a row for a given year and to sum project amounts for a given year. Again, it's possible to do these things without using an xsl:key, but it would require multiple inefficient full node traversals from the root using //.

  • Finally, the only reason the style sheet double-checks that a project has at least one row for both thisyear and nextyear before populating their respective attributes is to match the NULL behavior of the SQL TRANSFORM. If you don't mind seeing the "0" that results from a sum of no nodes, feel free to remove these xsl:if elements.

Mr. Elvey, hope this answers your question, at least a little. Please leave a comment if you'd like to discuss more.

Oh yeah, I forgot to mention that XSLT 2.0 will include a new for-each-group element, among other things, making this all pretty trivial.


Comments

Good stuff!

All this XML and XSLT stuff makes the ol' brain spin a wee bit ... :-)
Thanks very much for your great work and comments - bye for now!
- Andy

Andy Elvey | 2005.02.08 11:17 PM


TrackBack

TrackBack URL:  http://www.typepad.com/services/trackback/6a00d8341c7bd453ef00d83436779853ef

Listed below are links to weblogs that reference XML Crosstabs using XSLT: