Processing XML Documents SNU IDB Lab.

Processing XML documents
 Processing XML Data
 Document Formatting (XSL & XSLT)
Contents : processing XML data
Writing XML
Reading XML
Event processing
Tree manipulation
Events or trees?
Transformation tools
Concepts (1/4)
 Developing software to generate XML output is a trivial matter.
However, reading an XML document can be complicated by a
number of issues and features of the language. For example, the DTD may
need to be processed, either to add default information or to
compare against the document instance in order to validate it.
XML processor
Concepts (2/4)
 Programmers wishing to read XML data files need an XML-aware
processing module, termed an XML processor.
 XML processor
– The XML processor is responsible for making the content of the document
available to the application.
– It also detects problems such as file formats that the application cannot process, or
URLs that do not point to valid resources.
Concepts (3/4)
 Two fundamentally different approaches to reading the content
of an XML document are known as the ‘event-driven’ and ‘tree-manipulation’ techniques.
 Event-driven
– Document is processed in strict sequence.
– Each element in the data stream is treated as an event trigger, which may
precipitate some special action on the part of the application.
Concepts (4/4)
 Tree-manipulation
– The tree approach provides access to the entire document, allowing its
contents to be interrogated and manipulated in any order.
Writing XML (1/3)
 To produce XML data, it is only necessary to include XML tags in
the output strings. However, one decision that has to be made is
whether to output line-end codes or whether to omit them.
 In many respects it is simpler and safer to omit line-end codes.
But if the XML document is likely to be viewed or edited using
tools that are not XML-aware, this approach makes the document
very difficult to read.
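The trade-off above can be sketched in a few lines of Python. This is only an illustration: the helper names `element` and `write_book` and the sample content are assumptions, not part of the original slides. The tags are written directly into the output strings, and one parameter decides whether line-end codes are emitted.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def element(tag, text):
    """Wrap escaped text in a start-tag and an end-tag."""
    return f"<{tag}>{escape(text)}</{tag}>"


def write_book(title, author, paragraphs, line_end=""):
    """Serialize a small book; line_end is placed between the pieces."""
    pieces = ["<book>", "<front>", element("title", title),
              element("author", author), "</front>", "<body>"]
    pieces += [element("para", p) for p in paragraphs]
    pieces += ["</body>", "</book>"]
    return line_end.join(pieces)


# Same content, written once without and once with line-end codes.
compact = write_book("The Book Title", "J. Smith", ["Cats & dogs."])
readable = write_book("The Book Title", "J. Smith", ["Cats & dogs."], "\n")

# Both forms parse to the same element content; only the white space
# between elements differs.
for text in (compact, readable):
    root = ET.fromstring(text)
    print(root.findtext("front/title"), "|", root.findtext("body/para"))
```

Note that the line-end codes are only placed between elements here, never inside text content, so the recipient can discard them safely.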
Writing XML (2/3)
 Some text editors will only display as much text as will fit on one
line in the window.
 Although some editors are able to display more text by creating
‘soft’ line breaks at the right margin, the content is still not very readable.
 It would seem to be more convenient to break the document into
separate lines at obvious points in the text. However, there may
be a problem for the recipient application in determining when
line-end codes are there purely to make the XML data file more readable.
Writing XML (3/3)
<book><front><title>The Book Title</title><author>J.
Smith</author><date>October 1917</date></front><body>
<chapter><title>First Chapter</title><para>This is the
first chapter in the book.</para><para>This is the …….
<book>
<front>
<title>The Book Title</title>
<author>J. Smith</author>
<date>October 1917</date>
</front>
<body>
<chapter>
<title>First Chapter</title>
<para>This is the first chapter in the book.</para>
<para>This is the …….
Reading XML (1/4)
XML processor
Reading XML (2/4)
 The XML processor hides many complications from the application.
 The XML processor has at least one sub-unit, termed the entity
manager, which is responsible for locating fragments of the
document held in entity declarations or in other data files, and
handling replacement of all references to them.
Reading XML (3/4)
 The XML processor delivers data to the application, but there are two
distinct ways in which this can be done.
 (1) Event driven
– The simplest is to pass the data directly to the application as a stream. The
application accepts the data stream and reacts to the markup as it is encountered.
Reading XML (4/4)
 (2) Tree-walking
– The XML processor holds onto the data on the application’s behalf,
allowing the application to ask questions about the data and request
portions of it.
 Grove
– A tree or group of trees can be stored in a data structure.
Event processing (1/2)
 The simplest method of processing an XML document is to read
the content as a stream of data, and to interpret markup as it is encountered.
 If out-of-sequence processing is required, such as needing to
collect all the titles in a document for insertion at the start of the
document as a table of contents, then a ‘two-pass’ processor is needed.
 In the first pass, the titles are collected. In the second pass, they
are inserted where they are required.
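A two-pass processor of this kind might look as follows, using Python's standard `xml.sax` module. This is a minimal sketch under stated assumptions: the sample document, the class names, and the `toc` and `entry` output elements are all invented for illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

DOC = ("<book><chapter><title>One</title><para>Text.</para></chapter>"
       "<chapter><title>Two</title><para>More.</para></chapter></book>")


class TitleCollector(xml.sax.ContentHandler):
    """First pass: remember the text of every title element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


class TocInserter(XMLGenerator):
    """Second pass: copy the document, adding a table of contents."""

    def __init__(self, out, titles):
        super().__init__(out, encoding="utf-8")
        self._titles = titles

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        if name == "book":
            # Hypothetical <toc>/<entry> markup, written right after <book>.
            super().startElement("toc", {})
            for title in self._titles:
                super().startElement("entry", {})
                super().characters(title)
                super().endElement("entry")
            super().endElement("toc")


def two_pass(doc):
    collector = TitleCollector()
    xml.sax.parseString(doc.encode("utf-8"), collector)          # pass 1
    out = io.StringIO()
    xml.sax.parseString(doc.encode("utf-8"), TocInserter(out, collector.titles))  # pass 2
    return out.getvalue()


result = two_pass(DOC)
print(result)
```

The document is parsed twice because, in a single linear pass, the start of the document has already been written out before the later titles are seen.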
Event processing (2/2)
 Simple API for XML (SAX 1.0)
– To reduce the workload of the application developer, and make it easy to
replace one parser with another, a common event-driven interface has
been proposed for object-oriented languages such as Java.
Tree manipulation (1/3)
 Software that holds the entire document in memory needs to
organize the content so that it can be easily searched and manipulated.
 There is no need for multi-pass parsing when any part of the
document can be accessed instantly.
 Applications that benefit from this approach include XML-aware
editors, pagination engines and hypertext-enabled browsers.
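As a rough illustration of tree manipulation, the sketch below uses Python's standard `xml.dom.minidom`. The sample document and variable names are assumptions. Once the whole document is in memory, nodes can be visited and rearranged in any order without a second parse.

```python
from xml.dom import minidom

DOC = ("<book><chapter><title>First</title><para>A</para></chapter>"
       "<chapter><title>Second</title><para>B</para></chapter></book>")

dom = minidom.parseString(DOC)
book = dom.documentElement
chapters = book.getElementsByTagName("chapter")
first, second = chapters[0], chapters[1]

# Interrogate the tree out of document order: the last chapter first.
last_title = second.getElementsByTagName("title")[0].firstChild.data

# Manipulate the tree: move the second chapter in front of the first.
book.insertBefore(second, first)

titles = [t.firstChild.data for t in book.getElementsByTagName("title")]
print(last_title, titles)
```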
Tree manipulation (2/3)
 The abstract description of the model for SGML documents is
called a grove, and the grove scheme is equally applicable to XML.
 The name ‘grove’ is appropriate because it mainly describes a
series of trees.
 A grove is a ‘directed graph of nodes’
 Each node is an object of a specified type: a package of
information that conforms to a pre-defined template.
Tree manipulation (3/3)
 A property has a name and a value, so can be compared to an attribute.
Property value
 A node that describes a person may have a property called ‘age’
which holds the value representing the age of an individual.
 A node must have a type property, and name property, so that it
can be identified, or referred to.
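The node-and-property idea can be pictured with a small Python class. This is a loose sketch, not the grove standard itself: the `Node` class, its `get` method, and the sample values are assumptions chosen to mirror the ‘age’ example above.

```python
class Node:
    """A package of named properties; type and name are mandatory."""

    def __init__(self, type, name, **properties):
        # Every node carries a type and a name so it can be identified;
        # any further properties are optional name/value pairs.
        self.properties = {"type": type, "name": name, **properties}

    def get(self, prop):
        return self.properties[prop]


person = Node(type="person", name="p1", age=42)
print(person.get("type"), person.get("name"), person.get("age"))
```

Omitting either `type` or `name` raises a `TypeError`, which is one simple way to enforce the rule that both properties must be present.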
Events or trees ? (1/3)
 Event-driven benefits
– The parser does not have to hold much information about the document
in memory.
– The document structure does not have to be managed in memory, either
by the parser or, depending on what it needs to do, by the application. This
makes parsing very fast.
– It does not have to do anything special in order to process the document in
a simple linear fashion, from start to end.
Events or trees ? (2/3)
 Tree-walking benefits.
– With the entire document held in memory, the document structure can be
analyzed several times over, quickly and easily.
– The data structure management module may be profitably utilized by the
application to manage the document components on its behalf.
– A document that contains errors can be rejected before the application
begins to process its contents, thereby eliminating the need for messy rollback routines.
Events or trees ? (3/3)
 Other considerations
– The memory usage advantage of the event-driven approach may be only
– If the application uses an event-driven API, the parser need not build a
document tree, but if the application uses a tree-walking API, it can itself
use the event-driven API to build its tree model.
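The last point, a tree model built on top of an event-driven API, can be sketched with Python's standard library. The handler class, the `build_tree` helper, and the sample document are assumptions for illustration. SAX events are simply forwarded to `xml.etree.ElementTree.TreeBuilder`, which assembles the tree.

```python
import xml.sax
import xml.etree.ElementTree as ET


class TreeBuildingHandler(xml.sax.ContentHandler):
    """Receives SAX events and forwards them to a tree builder."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self.root = None

    def startElement(self, name, attrs):
        self._builder.start(name, dict(attrs.items()))

    def characters(self, content):
        self._builder.data(content)

    def endElement(self, name):
        self._builder.end(name)

    def endDocument(self):
        # close() returns the root element of the finished tree.
        self.root = self._builder.close()


def build_tree(xml_text):
    handler = TreeBuildingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.root


root = build_tree('<book><chapter id="c1"><title>One</title></chapter></book>')
print(root.tag, root.find("chapter").get("id"), root.findtext("chapter/title"))
```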
Transformation tools
 When the intent is simply to change an XML document structure
into a new structure, there are existing tools.
 These tools can usually do much more advanced things, such as
changing the order of elements, sorting them, and generating
new content automatically.
 They can transform an XML document into another XML document, or
into an HTML document.
Processing XML documents
 Processing XML Data
 Document Formatting (XSL & XSLT)
Contents : Document Formatting
Selecting a style sheet
Style sheet DTD issues
Concepts of XSL
 Extensible Stylesheet Language (XSL)
 XML documents are intended to be easily read by both people
and software
 People don’t want to see documents with tags
 It is necessary to replace the tags with appropriate text styles
Concepts of Style sheets (1/2)
<title>An example of style</title>
<intro><para>This example shows how important style
Is to material intended to be read.</para></intro>
<para>This is a <em>normal</em> paragraph.</para>
<warning><para>Styles are important!</para></warning>
Style applied
Removal of tag ?
An example of style This example shows how important
style Is to material intended to be read. This is a normal
paragraph. Styles are important!
An example of style
This example shows how important
style Is to material intended to be read.
This is a normal paragraph.
Warning: Styles are important!
Concepts of Style sheets (2/2)
style sheet
<title>This is a title</title>
This is a title
<p>This paragraph contains
a <em>highlighted</em> term.</p>
This paragraph contains
a highlighted term
This is a title
This paragraph contains a
highlighted term
Concepts of DTD and style sheet
 A single style sheet may be applied to a number of documents
formatted in the same way
 An XML document can be associated with more than one style sheet.
Style sheet A
Style sheet B
Concepts of Styling with XSL
 A set of formatting objects
 In this first version, all allowed formatting objects are rectangular
 FO DTD(Formatting Objects DTD)
– Elements such as ‘block’
– Attributes such as ‘text-align’
Concepts of
Transforming with XSLT(1/2)
 Authoring XML documents directly with the FO DTD would obviously negate the
entire philosophy of XML: documents should be self-describing, not self-formatting.
 An XSLT processor takes an existing XML document as input, and
generates a new XML document with new DTD as output.
Concepts of
Transforming with XSLT (2/2)
Source DTD
XSLT style sheet
XML document
XSLT processor
An <emph>emphasized</emph> word.
<template match=“emph”>
New XML document
XSL processor
An emphasized word.
Selecting a style sheet
 An XML processing instruction is used for selecting a style sheet.
<?xml-stylesheet href=“mystyles.xsl”
title=“default” ?>
<?xml-stylesheet href=“myBIGstyles.xsl”
title=“bigger font”
alternate=“yes” ?>
XSLT : general structure (1/3)
 Root element – stylesheet, transform
– <stylesheet xmlns=“”>
– <transform xmlns=“”>
 Another namespace – an XSLT style sheet may also contain
elements that are not part of stylesheet or transform
– <stylesheet xmlns=“”
… <X:my-element>…</X:my-element>…
XSLT : general structure (2/3)
 Result namespace – Indicator of what the output of the XSL
processor is
– <stylesheet xmlns=“”
 Id – embedded stylesheet in a larger XML document
– <?xml-stylesheet type=“text/xsl” href=“#MyStyles” ?>
<stylesheet id=“MyStyles” …>
XSLT : general structure (3/3)
 Result Version
Result Encoding – to specify which version of XML and a character
set encoding scheme should be used for the output file
– <stylesheet … result-version=“2.0”
XSLT : White space
 An XSLT processor creates a tree of nodes, including nodes for
each text string in and between the markup tags.
 Default – all white space is preserved.
Default Space – when ‘strip’ applied, it is possible to remove the
white space.
– <stylesheet … default-space=“strip”>
<preserve-space elements=“pre poetry”/>
XSLT : Templates
 The body of the style sheet consists of at least one transformation
rule, as represented by the Template element
– <template match=“para”>
– <template match=“warning/para”>
XSLT : Imports and Inclusions
 Multiple style sheets may share some definitions.
– <stylesheet …>
<import href=“tables.xsl”/>
<import href=“colours.xsl”/>
<template …>…</template>
– <include href=“…”>…</include>
 Imported rules are not considered to be as important as other rules.
 The include element can be used anywhere and included rules are
not considered to be less important than other rules
XSLT : Priorities
 When more than one complex rule matches the current element,
it is necessary to explicitly give one rule a higher priority than the
others, using the Priority attribute.
– <template match=“chapter//para”><!-- priority = 1-->
– <template match=“warning//para” priority = “2”>
 If the priority attribute is not used, or not used correctly, an XSLT
processor may choose to simply select the last rule.
XSLT : Recursive processing
 If an animal element existed within the paragraph, and there was
no rule for this element, but it could contain the emphasis
element, then the emphasized text would not be formatted.
– <para>A <animal><emph>Giraffe</emph></animal> is an animal.</para>
 To eliminate this problem, a rule is needed to act as a catch-all,
representing the elements not covered by explicit formatting rules.
– <template match=“/|*”>
<apply-templates />
XSLT : Selective processing
 The Apply Templates element can take a Select attribute, which
overrides the default action of processing all children. Using XPath
patterns, it is possible to select specific children, and ignore the rest.
– <template match=“names”>
<apply-templates select=“name[@type=‘company’]” />
 The Apply Templates element can be used more than once in a template.
XSLT : Output formats
 An XSLT transformation tool is expected to write out a new XML
document. One way to do this is simply to insert the appropriate
elements into the templates.
– <template match=“para”>
 Comments and processing instructions can be inserted into the
output document using comment and processing instruction elements.
– <processing-instruction name=“ACME”>INSERT_TOC</processing-instruction>
– <comment>This is the HTML version</comment>
XSLT : Sorting elements
 The Sort element is used within the Apply Templates element to
sort the elements it selects:
– <list>
<item sortcode=“1”>ZZZ</item>
<item sortcode=“3”>MMM</item>
<item sortcode=“2”>AAA</item>
</list>
<template match=“list”>
<apply-templates>
<sort select=“@sortcode” />
</apply-templates>
</template>
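The effect of this sort can be mimicked in Python to show the expected ordering. This is not an XSLT processor, only a sketch of the same selection and ordering using the standard `xml.etree.ElementTree` module; the variable names are assumptions. The `sortcode` values are compared as text, which is assumed here to match the default behaviour of the Sort element.

```python
import xml.etree.ElementTree as ET

LIST = ('<list><item sortcode="1">ZZZ</item>'
        '<item sortcode="3">MMM</item>'
        '<item sortcode="2">AAA</item></list>')

root = ET.fromstring(LIST)

# Select the item children, then order them by their sortcode attribute.
sorted_items = sorted(root.findall("item"), key=lambda item: item.get("sortcode"))
ordered_text = [item.text for item in sorted_items]
print(ordered_text)  # ['ZZZ', 'AAA', 'MMM']
```

The items come out in sort-code order (1, 2, 3), not in alphabetical order of their content, because the key is the attribute rather than the element text.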
XSLT : Automatic numbering
 In many XML documents, list items are not physically numbered
in the text, making it easy to insert, move or delete items without
having to edit all the items, so the style sheet must add the
required numbering.
– <template match=“section/title”>
<number level=“multi” count=“chapter|section” format=“1.A” />
– 1.A First section of Chapter One
2.C Third section of Chapter Two
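The multi-level numbering shown above can be reproduced by hand to make the rule explicit. The Python sketch below is illustrative only: the sample document and the function name `numbered_section_titles` are assumptions, and it emulates only the ‘1.A’ format for chapter and section counts.

```python
import xml.etree.ElementTree as ET

DOC = """<book>
<chapter><title>Chapter One</title>
<section><title>First section of Chapter One</title></section>
</chapter>
<chapter><title>Chapter Two</title>
<section><title>First section of Chapter Two</title></section>
<section><title>Second section of Chapter Two</title></section>
<section><title>Third section of Chapter Two</title></section>
</chapter>
</book>"""


def numbered_section_titles(root):
    """Prefix each section title with '<chapter number>.<section letter>'."""
    labels = []
    for ch_no, chapter in enumerate(root.findall("chapter"), start=1):
        for sec_no, section in enumerate(chapter.findall("section"), start=1):
            letter = chr(ord("A") + sec_no - 1)  # only valid up to 26 sections
            labels.append(f"{ch_no}.{letter} {section.findtext('title')}")
    return labels


labels = numbered_section_titles(ET.fromstring(DOC))
for label in labels:
    print(label)
```

Because the numbers are derived from each element's position, inserting or moving a section changes the labels without any edit to the source text.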
XSLT : Variables and templates(1/3)
 A style sheet often contains a number of templates that produce
output that is identical, or very similar, and XSLT includes some
mechanisms for avoiding such redundancy.
 Variable, Value Of
– <variable name=“Colour”>red</variable>
The colour is <value-of select=“$Colour”/>.
The colour is red.
XSLT : Variables and templates (2/3)
 When the same formatting is required in a number of places, it is
possible to simply reuse the same template.
– <template name=“CreateHeader”>
<template match=“title”>
<call-template name=“CreateHeader” />
<template match=“head”>
<call-template name=“CreateHeader” />
XSLT : Variables and templates (3/3)
 Such a mechanism is even more useful when the action
performed by the named template can be modified, by passing
parameters to it that override default values.
– <template name=“CreateHeader”>
<param name=“Prefix”>%%%</param>
<html:h2><value-of select=“$Prefix”/>
<call-template name=“CreateHeader”>
<with-param name=“Prefix”>%%%%%</with-param>
Creating and copying elements(1/2)
 An element can be created in the output document using the
Element element, with the element name specified using the
Name attribute, and an optional namespace specified using the
Namespace attribute
 Elements can also be created that are copies of the source
element, using the Copy element.
– <template match=“third-header-level”>
<element namespace=“html” name=“h3”>
Creating and copying elements(2/2)
 Source document elements can also be selected and copied out to
the destination document using the Copy Of element, which uses
a Select attribute to identify the document fragment or set of
elements to be reproduced at the current position.
– <template match=“body”>
<copy-of select=“//h1 | //h2” />
XSLT : Repeating structures
 When creating tabular output from source elements, or some
other very regular structure, a technique is available that reduces
the number of templates needed significantly, and in so doing
improves the clarity of the style sheet.
– <template match=“countries”>
<for-each select=“country”>
<html:th><apply-templates select=“name”/></html:th>
<for-each select=“borders”>
XSLT : Conditions (1/2)
 When a template transforms a source document element into
formatted output, it is possible to vary the output depending on
certain conditions.
– <template match=“para”>
<if test=“not (position() mod 2 = 0)”>
<attribute name=“style”>color: red</attribute>
XSLT : Conditions (2/2)
 When an attribute can take a number of different values, each one
producing a different format, the output can be selected with a series of
When tests inside a Choose element.
– <template match=“para”>
<when test=“@type=‘normal’”>
<attribute name=“style”>color:black</attribute>
<attribute name=“style”>color:yellow</attribute>
XSLT : Keys
 XSLT allows keys to be defined and associated with particular elements.
 The Name attribute provides a name for a set of identifiers.
 The Match attribute specifies the elements to be included in this
set of identifiers, using an XPath pattern.
 The Use attribute is an XPath expression that identifies the
location of the identifier values.
– <key name=“global” match=“*” use=“@id” />
<book id=“book”>
<chapter id=“chap1”>…</chapter>
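Conceptually, a key behaves like a lookup table from identifier values to elements. The Python sketch below builds such a table for every element that has an `id` attribute, loosely mirroring `match=“*” use=“@id”`. The `build_key` helper and the sample document are assumptions, and this is not how an XSLT processor is required to implement keys.

```python
import xml.etree.ElementTree as ET

DOC = ('<book id="book"><chapter id="chap1">One</chapter>'
       '<chapter id="chap2">Two</chapter><para>No identifier</para></book>')


def build_key(root, use="id"):
    """Map each value of the 'use' attribute to the element carrying it."""
    return {el.get(use): el for el in root.iter() if el.get(use) is not None}


root = ET.fromstring(DOC)
global_key = build_key(root)
print(sorted(global_key), global_key["chap1"].text)
```

Elements without the attribute, such as the `para` above, are simply left out of the table.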
Style sheet DTD issues
 The XSLT standard includes a DTD that defines the XSLT elements
and attributes. But this DTD alone may not be sufficient. The fact
that XSLT markup can be mixed with output markup means that a
DTD may need to be defined that includes both sets of elements.
 The DTD must, of course, also contain the definition for the
elements concerned.
 However, this problem can be avoided entirely, by using the
Element and Attribute elements throughout the style sheet.
XSL : Representation format
 Each formatting object can be represented by an element from
the FO(Formatting Objects) DTD, which is defined in an annex to
the standard.
 An XSL processor is expected to receive input from the XSLT
processor, though in some cases an implementation may be able
to receive an XML document that conforms to the FO DTD
instead, created by other means.
General presentation model (1/3)
 XSL creates formatting objects to hold the content to be
presented. Formatting objects create rectangular areas.
 Areas are divided into four categories: area-containers, block-areas, line-areas, and inline-areas.
General presentation model (2/3)
 An area-container has a coordinate system, by which embedded
objects can be placed, defining the ‘top’, ‘bottom’, ‘left’ and ‘right’
directions, and is able to contain other area-containers.
 Area-containers may also contain block-areas. The placement of
block-areas within area-containers depends on the ‘writing mode’.
 When a block-area is too long to fit in the area-container, another
block-area may be created in the next area-container.
area container
General presentation model (3/3)
 A block-area may contain more block-areas.
 Embedded blocks may be narrower than the enclosing block, in
the non-writing-mode direction, using indent properties.
 Blocks may contain line-areas, which are adjacent to each other in
the line-progression direction.
 Line-areas can contain inline-areas, which correspond with XML
inline elements.
 Inline-areas are drawn from an initial position-point, and it is
possible to adjust this point upward or downward in relation to
neighboring inline-areas.
XSL – CSS-compatible formatting objects and properties
 Many of the XSL formatting object types correspond with
established CSS display property types.
 Many of the formatting options available in XSL are derived from
properties provided in CSS.
XSL : Block-level objects
 The Block element is used to enclose any simple block of text.
 Text styles can be defined, margin, border and padding attributes
may be added, text may be aligned in different ways,
hyphenation can be controlled, and the whole block can be
removed from the flow and positioned explicitly.
<fo:block>A block of text.</fo:block>
<fo:block>Another block of text.</fo:block>
A block of text.
Another block of text.
 Graphics between text blocks are represented by Display Graphic elements.
 A rule can be drawn horizontally or vertically between text blocks,
using the Display Rule element.
XSL : Inline objects
 Individual characters can be represented by the Character element.
 The first object inside a Block element may be a First Line Marker element.
 A graphic can be presented within a line of text, using the Inline
Graphics element.
 Rule lines can be drawn inline, as well as between blocks.
 The Inline Sequence element has already been demonstrated.
 The current page number can be inserted into the text using the
Page Number element
XSL : Lists
 The items are a sequence of List Item elements.
– <fo:list-block>
 The block may directly contain a sequence of labels followed by
contents, using the List Item Label and List Item Body elements.
– <fo:list-item-label><fo:block>LABEL</fo:block></fo:list-i …
<fo:block>First Block In Content</fo:block>
XSL : Tables
 When a table has a caption, the main element is called Table and Caption.
 The Table element contains the actual table grid. The model
follows the HTML approach
– <fo:table>
XSL : Hypertext links
 A range of text can be enclosed in a Simple Link element, which is
used to provide a mechanism for hypertext linking to another
– See <fo:simple-link internal-destination=“chap9”>Chapter 9</fo:simple-link> for details.
See <fo:simple-link external-destination=“file:///book3.xml”>Book
3</fo:simple-link> for details.
XSL: Alternative document fragments
 When publishing electronically, it is possible to hide and reveal
portions of the document depending on user actions.
– <fo:multi-case name=“closed” initial=“true”>
<fo:multi-toggle switch-to=“opened”>
<fo:multi-case name=“opened”>
<fo:multi-toggle switch-to=“closed”>
XSL : Alternative properties
 The Multi Properties element contains a number of initial, empty
Multi Property Set elements, each one providing the style to
apply under a given circumstance, identified using the State attribute.
– <fo:multi-properties>
<fo:multi-property-set state=“visited” color=“#FF0000” />
<fo:multi-property-set state=“active” color=“#00FF00” />
This text to be coloured depending on the state
XSL: Floating objects and footnotes
 The Float element is used to contain such items, indicating to the
pagination engine that it may move the content as appropriate.
– <fo:float><fo:table>…</fo:table></fo:float>
 The Footnote element contains the footnote text, which will float
to the base of the page, and may also contain a reference to the
– Here is a reference<fo:footnote>
<fo:block>* The footnote</fo:block>
</fo:footnote> to a footnote
XSL : Building pages (1/2)
 The Flow element contains all the block-level objects that
constitute the main text flow content of the document.
 The Page Sequence element may contain the Flow element and
any number of objects that are to be repeated in the same place
on each page in the sequence, which is termed static content.
 A page sequence must include information on how different page
master templates are to be used in the sequence.
XSL : Building pages (2/2)
[Figure: sample page layout illustrating writing mode and binding margins]
XSL : Hyphenation
 The ‘hyphenate’ property defaults to ‘false’, but can be set to
‘true’, so enabling the hyphenation of words.
– <fo:block hyphenate=“true”