Implicitly deriving the SPL attributes from the XML output stream
An implicit output statement is generated by extrapolating the XML elements from the output tuple schema.
Converting XML to tuples is complicated by the structure of XML. XML elements can hold XML attributes, a text value, and also nested XML elements. The closest approximation to an XML element is the SPL tuple:
type XMLElement = tuple<map<rstring, rstring> _attrs, rstring _text, optionalNestedXMLElements>;
- When flatten is set to none, an XML element is matched to an SPL attribute of the same name that has the type tuple<rstring _text, map<rstring, rstring> _attrs> splAttributeName. The text of the XML element is assigned to _text, and any XML attributes are assigned to _attrs. If either _text or _attrs is omitted from the tuple, the corresponding XML information is ignored. The _text and _attrs names can be changed with the textName and attributesName parameters.
- When flatten is set to elements, an XML element is matched to an SPL attribute with the same name. The value of the XML element is assigned to the SPL attribute. If a nested element itself has nested elements for which the operator is to capture information, that element cannot be flattened.
- When flatten is set to attributes, an XML attribute from the XML element is matched to an SPL attribute with the same name. The value of the XML attribute is assigned to the SPL attribute. Only XML attributes for which there is a corresponding SPL attribute are stored in the tuple.
The _attrs attribute represents all the name="value" XML attributes, and _text represents the element value itself. This allows conversion to a tuple with little loss of information. Consider the following XML:
<a b="1" c="vc1">
    va1
    <d>vd1</d>
    <e>ve1a</e>
    <e>ve1b</e>
</a>
<a b="2">
    va2
    <d>vd2</d>
    <e>ve2</e>
    <f>vf2</f>
</a>
This is represented with an SPL tuple:
type aElem = tuple<
    map<rstring,rstring> _attrs,
    rstring _text,
    tuple<rstring _text> d,
    list<tuple<rstring _text>> e,
    list<tuple<rstring _text>>[1] f
>;
In this example, list<X> is used to represent 0 or more elements of X, and list<X>[1] is used to represent 0 or 1 X. The _attrs attribute is omitted for elements without XML attributes.
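The mapping described above can be modeled outside of SPL. The following is a minimal Python sketch, not the XMLParse implementation: it uses the standard xml.etree.ElementTree module to build a dictionary shaped like the aElem tuple. The function name to_a_elem, the <root> wrapper that makes the sample a single well-formed document, the trimming of surrounding whitespace, and the truncation of f to one entry are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The sample XML from above, wrapped in a hypothetical <root> element so that
# ElementTree can parse the two <a> elements as one document.
SAMPLE = """<root>
<a b="1" c="vc1">
va1
<d>vd1</d>
<e>ve1a</e>
<e>ve1b</e>
</a>
<a b="2">
va2
<d>vd2</d>
<e>ve2</e>
<f>vf2</f>
</a>
</root>"""


def to_a_elem(a):
    """Build a dict shaped like the aElem tuple from one <a> element."""

    def text_tuples(tag):
        # One {"_text": ...} entry per nested element with this tag.
        return [{"_text": (child.text or "").strip()} for child in a.findall(tag)]

    d_items = text_tuples("d")
    return {
        "_attrs": dict(a.attrib),                       # map<rstring,rstring> _attrs
        "_text": (a.text or "").strip(),                # rstring _text (trimmed: assumption)
        "d": d_items[0] if d_items else {"_text": ""},  # tuple<rstring _text> d
        "e": text_tuples("e"),                          # list<...> e: 0 or more
        "f": text_tuples("f")[:1],                      # list<...>[1] f: 0 or 1 (assumption)
    }


records = [to_a_elem(a) for a in ET.fromstring(SAMPLE).findall("a")]
```

In this model the first <a> element yields an empty list for f, and the second yields an _attrs map that contains only b, which mirrors the "0 or more" and "0 or 1" reading of the two list types.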
If the above tuple definition is used to define the output stream, the output clause can be omitted. The XMLParse operator infers the XML shape from the output tuple schema. For example:
stream<aElem> O = XMLParse(...) { param trigger : "/a"; }
The tuple schema definition is somewhat complex for relatively simple XML. The spl-schema-from-xml command is provided to compute the tuple schema definition for a given representative sample XML document.
If you do not use the spl-schema-from-xml command, you can make XML attributes easier to use by representing them as SPL tuple attributes. To do so, provide an alternative tuple schema that automatically associates XML attribute values with scalar tuple values. For example, consider the following tuple definition:
type aElem = tuple<
    int32 b, list<rstring>[1] c,
    rstring _text,
    tuple<rstring _text> d,
    list<tuple<rstring _text>> e,
    list<tuple<rstring _text>>[1] f
>;
In this example, the map named _attrs is removed and two SPL attributes, b and c, are added; each is either a scalar or a list<scalar>[1]. When a tuple contains scalar (or list<scalar>[1]) attributes, the XMLParse operator assigns the values of the XML attributes with matching names to those SPL attributes. In the example XML above, the b attribute of element a is assigned to SPL attribute b. This optimization is referred to as flattening. Flattening introduces implicit type conversion: because b has the type int32, the value of XML attribute b is converted to int32. With this form, the tuple can no longer accept any XML attributes other than b and c.
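As a rough illustration of attribute flattening, the Python sketch below models the b and c attributes of the tuple definition above. It is not the operator's code: the function name flatten_attributes and the choice of defaults for missing attributes (0 and an empty list) are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def flatten_attributes(a):
    """Model of 'int32 b, list<rstring>[1] c' taken from XML attributes."""
    rec = {"b": 0, "c": []}  # assumed defaults when an attribute is absent
    if "b" in a.attrib:
        rec["b"] = int(a.attrib["b"])  # implicit conversion to an integer
    if "c" in a.attrib:
        rec["c"] = [a.attrib["c"]]     # list<rstring>[1]: present at most once
    # XML attributes other than b and c have no matching SPL attribute
    # and are therefore not stored.
    return rec


first = flatten_attributes(ET.fromstring('<a b="1" c="vc1"/>'))
second = flatten_attributes(ET.fromstring('<a b="2" z="dropped"/>'))
```

Here the z attribute of the second element is lost, which is the trade-off of replacing the _attrs map with named attributes.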
Another flattening optimization is to notice that a tuple<rstring _text> d is not more expressive than rstring d. This allows the following:
type aElem = tuple<
    map<rstring,rstring> _attrs,
    rstring _text,
    rstring d,
    list<rstring> e,
    list<rstring>[1] f
>;
The SPL attributes for elements d, e and f have been flattened to either scalar or list<scalar>. The flattened expression for element d is indistinguishable from an XML attribute named d, so the optimization can only be applied to nested elements or attributes, but not both.
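To make the difference between the two shapes concrete, the short Python sketch below builds both the unflattened and the flattened representation of the d and e elements from a small fragment. The variable names and the fragment itself are hypothetical; only the shapes follow the tuple definitions above.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a><d>vd1</d><e>ve1a</e><e>ve1b</e></a>")

# Unflattened: every nested element becomes a tuple with a single _text field.
nested = {
    "d": {"_text": a.findtext("d", default="")},
    "e": [{"_text": e.text or ""} for e in a.findall("e")],
}

# Flattened: the _text wrapper is dropped and the string is stored directly.
flat = {
    "d": a.findtext("d", default=""),
    "e": [e.text or "" for e in a.findall("e")],
}
```

Both dictionaries carry the same strings, which is why the wrapper tuple adds nothing when an element has text only.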
For information on parameters that control how the tuple schema is interpreted, see the flatten, textName, and attributesName parameters in the XMLParse operator.
The output tuple is initialized to a default value (see below) at operator startup and after each tuple is submitted.
type T = tuple<int32 id, tuple<rstring b, list<int32> x, float64 d> a, rstring c>;
stream<rstring xmlData, int32 id> Data = Op() {};
stream<T> OutTuples = XMLParse (Data) {
    param
        xmlInput : xmlData;         // Accept XML from input attribute xmlData.
        trigger : "/something/bar"; // Submit a tuple to OutTuples when the end of a
                                    // top-level /something/bar element is seen.
        flatten : elements;         // Any unassigned scalar value is assumed
                                    // to be a nested element of that name.
    output OutTuples:
        id = Data.id,               // Set from the incoming data.
        a = { b = XPath("@bdata"),  // b is taken from the XML attribute bdata.
              x = (list<int32>)XPathList("foo/text()"),
              d = (float64)XPath("a/d/text()") };
        // c is not assigned here, so it defaults to XPath("c/text()").
}
The input XML to XMLParse looks like:
<something>
    <bar bdata="t">
        <foo>1</foo>
        <foo>2</foo>
        <a>
            <d>1.6</d>
        </a>
        <c>a string</c>
    </bar>
    <bar bdata="t2">
        <foo>5</foo>
        <foo>6</foo>
        <a>
            <d>100.6</d>
        </a>
        <c>another string</c>
    </bar>
</something>
If Data.xmlData contains the above XML and Data.id is 1, the output tuples produced are:
{id = 1, a = { b = "t", x = [1, 2], d = 1.6}, c = "a string" }
{id = 1, a = { b = "t2", x = [5, 6], d = 100.6}, c = "another string" }
XML elements and attributes are implicitly converted to the type of the matching SPL attribute when that type is supported for conversion. Any conversion errors are treated as dictated by the parsing parameter value.
All other XML data types, for example, timestamp, should be read as rstring/ustring, and then converted to the correct type downstream. For scalar SPL attributes, the attribute is assigned the value of the XML element/attribute when it is seen. For lists, the value of the XML element/attribute is appended to the current value of the attribute. If a given tuple attribute has not been seen in the XML, then the default initializer for the type is used for the value (0 for numeric, empty for strings and lists, and recursively the same for tuples).
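The assignment rules above (reset to defaults, overwrite scalars, append to lists) can be traced with a small Python model of the earlier /something/bar example. This is a sketch under stated assumptions rather than the operator's algorithm: the names DEFAULT and parse are hypothetical, the whole document is parsed at once instead of being streamed, and the paths are hard-coded to match the output clause shown earlier.

```python
import copy
import xml.etree.ElementTree as ET

XML = """<something>
  <bar bdata="t">
    <foo>1</foo>
    <foo>2</foo>
    <a>
      <d>1.6</d>
    </a>
    <c>a string</c>
  </bar>
  <bar bdata="t2">
    <foo>5</foo>
    <foo>6</foo>
    <a>
      <d>100.6</d>
    </a>
    <c>another string</c>
  </bar>
</something>"""

# Default initializers: 0 for numeric values, empty for strings and lists,
# and recursively the same for nested tuples.
DEFAULT = {"id": 0, "a": {"b": "", "x": [], "d": 0.0}, "c": ""}


def parse(xml_data, data_id):
    """Return one dict per /something/bar element, modeled on tuple type T."""
    out = []
    for bar in ET.fromstring(xml_data).findall("bar"):
        t = copy.deepcopy(DEFAULT)           # start again from the default value
        t["id"] = data_id                    # id = Data.id
        t["a"]["b"] = bar.get("bdata", "")   # b = XPath("@bdata")
        for foo in bar.findall("foo"):       # list: each value is appended
            t["a"]["x"].append(int(foo.text))
        d = bar.find("a/d")                  # scalar: assigned when seen
        if d is not None:
            t["a"]["d"] = float(d.text)
        c = bar.find("c")                    # c defaults to the nested <c> element
        if c is not None:
            t["c"] = c.text
        out.append(t)                        # trigger: end of a <bar> element
    return out


tuples = parse(XML, 1)
empty = parse("<something><bar/></something>", 7)
```

With the sample input, the two dictionaries in tuples match the two output tuples listed earlier. The empty <bar/> case keeps every field except id at its default initializer, because none of the corresponding XML items are seen.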