This article focuses on some of the challenges of working with authored XML documents, a term that in this context refers to data sets originated by content creators and typically guided by a DTD or schema. Many environments have adopted guided authoring with XML for various reasons, including consistency, efficiency, and cost savings. With the DTD in place, the expectation of 100% consistency is not unreasonable. If authors are guided (and restricted) by a DTD, then surely the result will be predictable. Right?
A close examination of most DTDs, however, reveals a fair degree of flexibility—particularly if the DTD is from a generic source or derived from a standard or specification. This is not a shortcoming of the DTD paradigm, but it can create headaches if the infrastructure surrounding the XML data set—expensive publishing pipelines, document management systems, translation routines—anticipates documents in a particular way.
The practical solution is often to enforce a style guide that limits the variances. The expensive way to implement the style guide is to cross-check the data set against it manually; a more effective way might be to automate the checking.
What is appropriate for automation?
Content creators have an XML data set and a series of tasks that they need to perform. Can these tasks be automated?
The straightforward answer to this is usually a question: "Is the task predictable, repeatable, and definable?" The easier checks are those that do not involve parsing textual content but concentrate on the document structure:
- Does the new section have a cross-reference target? It is highly likely that someone else will link to this topic.
- Does the list have more than one item? If not, it's not really a list.
- Does the all-important safety information appear before the task list? It's best to warn the reader about the potential electric shock in advance.
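The first two checks in the list above can be sketched in a few lines. This is a minimal illustration in Python rather than the XSLT used later in the article, and the element names (section, a, ol, li) and the function name structural_problems are assumptions made for the example, not part of any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = """
<doc>
  <section><a name="intro"/><title>Intro</title>
    <ol><li>One</li><li>Two</li></ol>
  </section>
  <section><title>Setup</title>
    <ol><li>Only item</li></ol>
  </section>
</doc>
"""

def structural_problems(root):
    """Return a list of human-readable problems found in the tree."""
    problems = []
    for i, section in enumerate(root.iter("section"), start=1):
        # A section needs an <a name="..."> child to act as a link target.
        if section.find("./a[@name]") is None:
            problems.append(f"section {i}: no cross-reference target")
    for i, ol in enumerate(root.iter("ol"), start=1):
        # A list with fewer than two items is not really a list.
        if len(ol.findall("./li")) < 2:
            problems.append(f"ol {i}: fewer than two items")
    return problems

if __name__ == "__main__":
    for problem in structural_problems(ET.fromstring(SAMPLE)):
        print(problem)
```

Both checks look only at structure, never at the text itself, which is what makes them predictable, repeatable, and definable.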
You can either identify a known document model or aim to be generic. After all, this is an issue of scalability: If the utility serves one purpose and saves time through automation, then maybe a simple, self-contained script with the business logic tied directly to the code is a reasonable approach. If there is interest in a multi-purpose utility that a user can customize, then maybe a more ambitious, configurable approach is needed. I take the latter option here.
An example utility
XSLT serves as an all-purpose XML processing language. It is not the only choice, however, as much XML processing is carried out using DOM techniques, and some of what is done in this article can be replicated in a DOM. But when you look at taking corrective action, XSLT shows itself to be the ideal tool.
The XSLT example included here is combined with an HTML wrapper to demonstrate how to easily deploy the utility as a stand-alone application. This combination implies that the XSLT should be version 1.0 (see Resources for more information) and that the embedded script is in the Microsoft® JScript® scripting language.
Process a document and return a set of error messages according to business logic
The first step is to capture the business logic. For the purpose of this exercise, you make the checks against the XML source behind this article. The rules are based on the style guide provided for the authors of these documents.
The checks are designed to enforce that written content is complete by checking the structure of the document rather than analyzing the textual content. XPath is the ideal candidate for capturing these types of checks.
Formalize and genericize the checks by encoding in an XML vocabulary
This approach means going through a design process that defines how best to capture document errors, how to categorize the errors, and then how to handle the errors. Why not just embed this process in the XSLT? The benefit of this technique is that after the error checking is encoded in an XML vocabulary, the utility becomes generic code that can handle one or more profiles, so a user can select different rule sets for different document data. The design steps are:
- Define the means by which to test the document. For the purpose of this exercise, the design decision is to use XPath.
- Define the pass–fail criteria. Use XPath-based document checks to query for the existence of one or more nodes that comply with the XPath expression.
- Define the severity of the fail. Individual checks can be categorized as:
  - Enforceable. The error-checking process fails for the first instance of this type of check.
  - Advisable. There is no process failure, but the process logs instances as errors.
  - Conditional. A variant of enforce, this check is more complex, as an additional context check is made based on the node returned from the XPath expression test.
- Create and import a mapping file. The file should use these document checks:
  - Define a namespace for the document, for example:
    <err:document xmlns:err="http://error.com/mynamespace">
  - Create each error definition.
  - Note error checks made against the document at a high level:
    <err:element type="structure" name="dw-document" context="/dw-document" enforce="yes">
  - Note error checks made at the element level:
    <err:element type="element" name="ol" context="./li" pass=">=2"/>
For a full set of sample tests, see Resources.
When you have defined the error-checking syntax, you can define one or more rule sets to be applied to different data sets.
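To make the vocabulary concrete, the following sketch reads a small rule set into plain records. It is written in Python for illustration only. The Rule class and load_rules function are hypothetical names, and the two rules are the examples shown above, with the structural rule self-closed for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

ERR_NS = "http://error.com/mynamespace"  # namespace URI from the example above

RULESET = f"""
<err:document xmlns:err="{ERR_NS}">
  <err:element type="structure" name="dw-document" context="/dw-document" enforce="yes"/>
  <err:element type="element" name="ol" context="./li" pass=">=2"/>
</err:document>
"""

@dataclass
class Rule:
    kind: str                     # "structure" or "element"
    name: str                     # element the rule applies to
    context: str                  # XPath expression to evaluate
    enforce: bool                 # True when enforce="yes"
    pass_criteria: Optional[str]  # for example ">=2", or None if absent

def load_rules(xml_text):
    """Parse every err:element in the rule set into a Rule record."""
    root = ET.fromstring(xml_text)
    rules = []
    for el in root.iter(f"{{{ERR_NS}}}element"):
        rules.append(Rule(
            kind=el.get("type"),
            name=el.get("name"),
            context=el.get("context"),
            enforce=el.get("enforce") == "yes",
            pass_criteria=el.get("pass"),
        ))
    return rules
```

Because the rules live in data rather than in code, swapping in a different rule set file changes what is checked without touching the processing logic.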
Create the XSLT to process the rule set file
The XSLT can potentially have two output streams: the log messages and the refined document source (if corrective action is taken).
The design is to use the XSLT output stream to create a new, refined document and an XSLT extension to write the log messages to a separate output stream. The stand-alone example adds the log messages to an HTML logging pane.
The error checks are categorized in two distinct ways: top-level, structural checks and element-level checks. The XSLT first processes the top-level checks; then, if applicable (in other words, if all these checks pass), it processes the document's content using conventional XSLT templates.
To create the XSLT, perform the following steps:
- Define a script element in the XSLT to define embedded scripting. First, create a logging environment, then create a function to store messages, as in Listing 1.
Listing 1. Define embedded scripting in the XSLT
    <msxsl:script language="JScript" implements-prefix="xslext">
    <![CDATA[
      // Log messages collected during the transform
      var messages = new Array();
      var msgct = 0;

      // Store a message; return "" so that the call adds nothing to the XSLT output
      function addMsg( msg ){
        messages[msgct++] = msg;
        return "";
      }
    ]]>
    </msxsl:script>
- Add a template to handle the messages. Listing 2 shows the code.
Listing 2. Add a template
    <xsl:template name="handlemsg">
      <xsl:param name="msg"/>
      <xsl:param name="terminate">no</xsl:param>
      <xsl:param name="lvl">1</xsl:param>
      <xsl:variable name="logmsg">
        <!-- Indent the log messages to help with readability -->
        <xsl:choose>
          <xsl:when test="$lvl=2"> • </xsl:when>
          <xsl:when test="$lvl=3"> • </xsl:when>
          <xsl:when test="$lvl=4"> • </xsl:when>
        </xsl:choose>
        <xsl:value-of select="$msg"/>
      </xsl:variable>
      <xsl:variable name="log" select="xslext:addMsg( string( $logmsg ) )"/>
      <xsl:if test="$terminate='yes'">
        <xsl:variable name="errormsg" select="xslext:addMsg( 'ERROR: Error checking caused the process to stop' )"/>
        <!-- If the error message forces termination, the process must first output all existing log messages -->
        <xsl:variable name="output" select="xslext:outputMsgs( $logfileout )"/>
        <xsl:message terminate="yes"></xsl:message>
      </xsl:if>
    </xsl:template>
The template is called from throughout the XSLT to handle the messages sent to the message extension functions.
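The same logging contract can be sketched outside XSLT. The following Python sketch is only an analogy for the handlemsg template: MessageLog and CheckHalted are hypothetical names, the two-space indent per level is an assumption, and raising an exception stands in for xsl:message with terminate set to yes.

```python
class CheckHalted(Exception):
    """Raised when a check forces the process to stop."""

class MessageLog:
    def __init__(self):
        self.messages = []

    def add(self, msg, level=1, terminate=False):
        # Indent and bullet nested messages so the log is easier to read.
        indent = "  " * (level - 1)
        prefix = "• " if level > 1 else ""
        self.messages.append(f"{indent}{prefix}{msg}")
        if terminate:
            # Record why the run stopped before halting, so the log is complete.
            self.messages.append("ERROR: Error checking caused the process to stop")
            raise CheckHalted(msg)
```

A caller would typically catch CheckHalted at the top level and write out log.messages, which mirrors the way the template flushes the existing messages before it terminates.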
- Use a global document variable against which XPath expressions are evaluated, and create a function to which you can pass an expression. Listing 3 shows the code.
Listing 3. Create a global variable
    <msxsl:script language="JScript" implements-prefix="xslext">
    <![CDATA[
      // DOM copy of the source document; later XPath checks run against it
      var xpathdoc = null;

      // ns: node set holding the document root
      // trialexpr: expression evaluated once to confirm that the setup worked
      function setUpXPath( ns, trialexpr ){
        var xml = ns.nextNode().xml;
        try{
          xpathdoc = new ActiveXObject( "Msxml2.DOMDocument.3.0" );
          xpathdoc.loadXML( xml );
          return trialexpr + ": " + xpathdoc.selectNodes( trialexpr ).length;
        } catch(e) {
          return "ERROR: " + e.description;
        }
      }
    ]]>
    </msxsl:script>
Listing 3 shows a function that creates a DOM document to use as a context node for further XPath evaluations.
- Call this initialization function from within the main body of the XSLT, as in Listing 4.
Listing 4. Add an initialization function
    <xsl:call-template name="handlemsg">
      <xsl:with-param name="msg">Setup '<xsl:value-of select="xslext:setUpXPath( $root, concat( '//', name($root) ) )"/>'</xsl:with-param>
    </xsl:call-template>
Note how the extension function is called using the namespace prefix (xslext in this example). This prefix distinguishes this custom function from the standard functions available through XSLT, such as number(), string(), and contains().
- Process the top-level document tests:
  - Define a parameter for the ruleset file:
    <xsl:param name="rulesetfile"></xsl:param>
    Supply this parameter as a file URI. The stand-alone example takes a user selection at run time.
  - Create a template to process each test:
    <xsl:template name="process-check">
    This template works in the following way. First, you create an extension function that uses xpathdoc as a context node and evaluates the test expression set in the rule file:
        function evalXPath( exp ){
          try{
            return xpathdoc.selectNodes( exp ).length;
          } catch(e) {
            return "Exception: " + e.description;
          }
        }
    If successful, this code returns an integer; it should be at least 1. A zero indicates that the test ran successfully but no matches were found; an error description indicates either that the function threw an exception or that the XPath expression was badly formed.
  - Call the function, and store the return value in a variable:
    <xsl:variable name="check" select="xslext:evalXPath( string( $context ) )"/>
    where $context is the expression string set for the err:element (for example, /dw-document//meta-dcsubject).
    If the value of $check is at least 1 and the test is set to Enforce, then the test has passed.
    If the value of $check is 0 and the test is not set to Enforce, then the test has passed, but the user should see a warning.
    Otherwise, the test has failed and the process should halt. You can force the termination with an xsl:message that has terminate set to yes (see Listing 2). The template is called with the log message and the terminate parameter set to yes.
  - Define a nodeset of all enforceable tests to process:
    document($rulesetfile)//err:element[@type='structure'][@enforce='yes']
  - Process all other top-level tests that are not enforceable:
    document($rulesetfile)//err:element[@type='structure'][not(@enforce='yes')]
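The decision logic for the top-level tests can be sketched as follows. This is an illustrative Python version, not the article's XSLT: count_matches and run_top_level are hypothetical names, and Python's ElementTree supports only a subset of XPath, so the sketch wraps the root in a synthetic parent to emulate absolute paths. The text does not say what happens when a test that is not enforced finds a match; this sketch assumes that it passes.

```python
import xml.etree.ElementTree as ET

def count_matches(root, expr):
    """Count nodes matching an absolute path, or return an error string.

    ElementTree does not accept absolute paths on an element, so the root is
    placed under a synthetic parent and '/x' is evaluated as './x'.
    """
    holder = ET.Element("holder")
    holder.append(root)
    try:
        return len(holder.findall("." + expr))
    except SyntaxError as exc:
        return f"Exception: {exc}"

def run_top_level(root, rules):
    """Run (label, xpath, enforce) rules, enforceable ones first.

    Returns (ok, log); ok is False as soon as one test fails.
    """
    log = []
    ordered = sorted(rules, key=lambda rule: not rule[2])
    for label, expr, enforce in ordered:
        result = count_matches(root, expr)
        if isinstance(result, int) and result >= 1:
            log.append(f"PASS {label}: {result}")
        elif result == 0 and not enforce:
            log.append(f"WARN {label}: no match")
        else:
            # Enforced test with no match, or an error string from the lookup.
            log.append(f"FAIL {label}: {result}")
            return False, log
    return True, log
```

Running the enforceable tests first means that a document with a fatal structural problem is rejected before any time is spent on the advisory checks.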
- Process the element-level tests.
These tests are processed in the individual templates. To keep the process generic, the XSLT has a simple template to process elements:
    <xsl:template match="node()">
Within this generic template, you set a variable to determine whether the rule set contains an applicable test:
    <xsl:variable name="match" select="document($rulesetfile)//err:element[@type='element'][@name=$name]"/>
where $name is defined as the name of the current element.
If $match is found to be true, the context expression of this test is evaluated using another extension function. This function, similar to the top-level XPath evaluation, passes in the current node from the XSLT and evaluates the expression against that, as in Listing 6.
Listing 6. Function to evaluate an expression
    function evalXPathAgainstNode( node, exp ){
      try{
        return node.nextNode().selectNodes( exp ).length;
      } catch(e) {
        return "Exception: " + e.description;
      }
    }
If this function returns a nonzero number (that is, the return value isn't 0 or an error message), the integer is passed to another function to test the number against the pass–fail criteria, defined in the pass attribute:
    <err:element type="element" name="ol" context="./li" pass=">=2" />
- Test that the ol element has a number of li children greater than or equal to 2, as in Listing 7.
Listing 7. Test the number of li elements
    function evalExpr( str, pass ){
      // Builds and evaluates a string such as "3>=2"
      return eval( str + pass );
    }
    ...
    <xsl:variable name="eval" select="xslext:evalExpr( $check, $pass )"/>
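Listing 7 relies on eval(), which executes whatever string the rule set supplies. Where the rule sets come from less trusted sources, one alternative is to parse the pass attribute into an operator and an integer. The following Python sketch shows that idea under stated assumptions: the function names meets and check_elements are hypothetical, only the six comparison operators shown are accepted, and unlike the article's process it also evaluates a count of 0.

```python
import operator
import re
import xml.etree.ElementTree as ET

# Comparison operators accepted in a pass attribute (an assumption of this sketch).
_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        "!=": operator.ne, ">": operator.gt, "<": operator.lt}
_CRITERIA = re.compile(r"^\s*(>=|<=|==|!=|>|<)\s*(\d+)\s*$")

def meets(count, criteria):
    """Compare count against a criteria string such as '>=2' without eval()."""
    m = _CRITERIA.match(criteria)
    if m is None:
        raise ValueError(f"unsupported pass criteria: {criteria!r}")
    op, operand = m.groups()
    return _OPS[op](count, int(operand))

def check_elements(root, name, context, criteria):
    """Yield (count, passed) for each element called `name` in document order."""
    for el in root.iter(name):
        count = len(el.findall(context))
        yield count, meets(count, criteria)
```

For the ol rule, each list yields its number of li children and whether that number satisfies ">=2"; anything that is not a recognized comparison is rejected instead of executed.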
- The XSLT returns log results similar to Listing 8.
Listing 8. XSLT log results
    Start
    Setup '//dw-document: 1'...
    · Check (Top-level document?) '1'
    · Conditional check '(Document ID missing?) '1' (1==1) == true'
    · Conditional check '(Article missing?) '1' (1==1) == true'
    · Conditional check '(Meta field (document type) missing?) '1' (1==1) == true'
    · Conditional check '(Meta field (subject) missing?) '1' (1==1) == true'
    · Conditional check '(Article title missing?) '1' (1==1) == true'
    · Conditional check '(Document author missing?) '1' (1==1) == true'
    · Conditional check '(Published date missing?) '1' (1==1) == true'
    · Check (Missing abstract?) '1'
    · Conditional check '(Dates out of sync?) '0' (00) == 0'
    · Conditional check '(Broken internal links?) '0' (0==0) == true'
    · Context checking 'heading' (./a[@name]) '(1==1) == true'...
    · Error context checking 'heading' (./a[@name]) '(0==1) == false'...
    · Context checking 'heading' (./a[@name]) '(1==1) == true'...
    · Context checking 'ol' (./li) '(3>=2) == true'...
    / End
Where next?
Having built a process that makes checks, identifies errors, and re-creates the document, the next obvious step is to take corrective action on the element. This example includes basic code to add content back to the document.
If the ruleset shows an err:onfail element as a child of err:element, the code can take any of the following:
<err:insertbefore></err:insertbefore>
<err:insertatstart></err:insertatstart>
<err:insertatend></err:insertatend>
<err:insertafter></err:insertafter>
The insert element contains XML tags to correct the document, for example:
<err:insertatstart> <a name="function:generate-id()" /></err:insertatstart>
The XSLT needs to process this.
Then, you can create a template to iterate over a nodeset:
<xsl:template name="copy-nodeset">
Pass the contents of the err:insertbefore, err:insertatstart, err:insertatend, and err:insertafter elements to this template at the relevant points in the XSLT, for example:
    <!-- Add 'err:insertbefore' here -->
    <xsl:element name="{name()}">
      <xsl:copy-of select="@*"/>
      <!-- Add 'err:insertatstart' here -->
      <xsl:apply-templates/>
      <!-- Add 'err:insertatend' here -->
    </xsl:element>
    <!-- Add 'err:insertafter' here -->
The template has special treatment for the function:generate-id() method.
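The insertatstart case, including the placeholder substitution, might look like the following in a DOM-style language. This Python sketch is an assumption-laden illustration rather than the article's copy-nodeset template: insert_at_start is a hypothetical name, and a simple counter stands in for whatever generate-id() would return in the XSLT.

```python
import itertools
import xml.etree.ElementTree as ET

GENERATE_ID = "function:generate-id()"  # placeholder used in the rule set
_counter = itertools.count(1)           # hypothetical ID source for this sketch

def insert_at_start(element, fragment_xml):
    """Insert the corrective fragment as the first node inside element."""
    fragment = ET.fromstring(fragment_xml)
    # Replace the placeholder with a generated value wherever it appears.
    for node in fragment.iter():
        for key, value in list(node.attrib.items()):
            if value == GENERATE_ID:
                node.set(key, f"id{next(_counter)}")
    # Move any leading text behind the new node so that the node comes first.
    fragment.tail = element.text
    element.text = None
    element.insert(0, fragment)
    return fragment
```

Applied to a heading that failed the ./a[@name] check, this adds the missing anchor ahead of the heading text, so a second run of the same check would find one match.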
For completeness, add logging as the content is inserted into the document.
Summary
This article showed how to use XSLT to analyze document structure to determine whether a set of business rules is met. This process can perform an important function in two significant ways: first, as an aid to the content creator to enable him or her to meet authoring objectives—for example, users can work offline and run the tests multiple times to verify that they have completed certain tasks—and second, as a formal part of a documentation workflow—for example, the utility can be embedded in a document repository workflow, and the pass–fail criteria can control the movement of a managed document between edit, review, and acceptance.
Separating the business logic from the XSLT makes the utility more flexible. The code becomes generic, as multiple rule sets can be applied using a single code base. Using XSLT instead of DOM methods proves powerful, as doing so allows document refinement using the transform process to correct the document.
Download
Description | Name | Size |
---|---|---|
Example XSLT code | xslt_source.zip | 9KB |
Resources
Learn
- Simplified English (Wikipedia): Read about this initiative to encourage the use of unambiguous and consistent writing styles among technical authors.
- XSL Transformations (XSLT) Version 1.0 specification (W3C, November 1999): Learn about the syntax and semantics of XSLT, which is a language for transforming XML documents into other XML documents.
- Expand XSL with extensions (Jared Jackson, developerWorks, April 2002): Use extensions, a technique that allows you to expand the capabilities of core XSL features.
- XML for Data: Extend XSLT's functionality with EXSLT (Kevin Williams, developerWorks, December 2002): Look at the EXSLT standard and how it extends the functionality of XSLT 1.0.
- Plan to use XML namespaces, Part 1 (David Marston, developerWorks, April 2004): Learn the best ways to use XML namespaces to your advantage.
- XML area on developerWorks: Get the resources you need to advance your skills in the XML arena.
- My developerWorks: Personalize your developerWorks experience.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks. Also, read more XML tips.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.
Get products and technologies
- Business Rules Exchange (Svante Ericsson, TPSMG, September 2004): Consider an XML vocabulary for encoding project-specific rules for document creation. It is designed to specify authoring rules above and beyond the DTDs and schemas used to validate the documents.
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- Yahoo! Groups related to XML: Join the discussions.
- XML zone discussion forums: Participate in any of several XML-related discussions.
- The developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.