Operator XMLParse
The XMLParse operator accepts a single input stream and generates tuples as a result.
Checkpointed data
When the XMLParse operator is checkpointed in a consistent region, any partially parsed input data and logic state variables (if present) are saved in the checkpoint. When the XMLParse operator is checkpointed in an autonomous region, only logic state variables (if present) are saved in the checkpoint.
Behavior in a consistent region
The XMLParse operator can be an operator within the reachability graph of a consistent region. It cannot be the start of a consistent region. When in a consistent region, the operator checkpoints and resets any partially parsed input data. Logic state variables (if present) are also automatically checkpointed and reset.
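The placement rule above can be sketched as follows. This is a minimal illustration, not a reference configuration: the composite name, stream names, file name, XML layout, and the 10-second period are all assumptions made for the example.

```spl
composite ConsistentRegionSketch {
  graph
    // An application that uses consistent regions needs a JobControlPlane operator.
    () as JCP = JobControlPlane() {}

    // The region starts at the source operator, because XMLParse cannot start one.
    // Hypothetical input: one line of catalog.xml per tuple.
    @consistent(trigger = periodic, period = 10.0)
    stream<rstring xmlChunk> XmlChunks = FileSource() {
      param
        file   : "catalog.xml";
        format : line;
    }

    // XMLParse is reachable from the start of the region, so its partially
    // parsed input is checkpointed and reset together with the region.
    stream<rstring title> Books = XMLParse(XmlChunks) {
      param
        trigger : "/catalog/book";
        flatten : elements;
    }
}
```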
Checkpointing behavior in an autonomous region
When the XMLParse operator is in an autonomous region and is configured with a config checkpoint : periodic(T) clause, a background thread in the SPL runtime checkpoints the operator every T seconds. This periodic checkpointing is asynchronous to tuple processing. Upon restart, the operator returns its internal state to the initial state and restores logic state variables (if present) from the last checkpoint.
When the XMLParse operator is in an autonomous region and is configured with a config checkpoint : operatorDriven clause, no checkpoint is taken at run time. Upon restart, the operator returns to its initial state.
Such checkpointing behavior is subject to change in the future.
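As a rough sketch of the periodic case, the config clause might be attached to the operator invocation like this. The composite name, stream names, file name, XML layout, and the 30-second interval are assumptions chosen for illustration.

```spl
composite PeriodicCheckpointSketch {
  graph
    // Hypothetical source: one line of catalog.xml per tuple.
    stream<rstring xmlChunk> XmlChunks = FileSource() {
      param
        file   : "catalog.xml";
        format : line;
    }

    stream<rstring title> Books = XMLParse(XmlChunks) {
      param
        trigger : "/catalog/book";
        flatten : elements;
      config
        // Autonomous region: a checkpoint is taken every 30 seconds.
        // Only logic state variables (none in this sketch) would be saved.
        checkpoint : periodic(30.0);
    }
}
```

Because partially parsed input is not part of an autonomous-region checkpoint, a restart in this configuration begins parsing from the initial state.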
Exceptions
- If the XML is invalid.
- If the parsing parameter is strict and there is an invalid conversion of XML data to SPL attributes.
Summary
- Ports
- This operator has 1 input port and 1 or more output ports.
- Windowing
- This operator does not accept any windowing configurations.
- Parameters
- This operator supports 11 parameters.
Required: trigger
Optional: attributesName, flatten, ignoreNamespaces, ignorePrefix, ignorecase, nullify, parsing, textName, xmlInput, xmlParseHuge
- Metrics
- This operator reports 1 metric.
Properties
- Implementation
- C++
- Threading
- Always - Operator always provides a single-threaded execution context.
- Ports (0)
-
The XMLParse operator has one input port, which contains XML to be converted to tuples.
The XMLParse operator accepts as input a single stream that contains an attribute with XML data to convert. The attribute that contains the XML data must have type rstring, ustring, blob, or xml. If the attribute type is xml, each tuple carries a complete XML document. If the attribute type is rstring, ustring, or blob, the attribute might contain a chunk of XML that is not well-formed on its own, because a complete XML document can span multiple input tuples. The XMLParse operator acts as if the chunks are concatenated together. The concatenated XML can contain multiple sequential XML documents.
- Properties
-
- Optional: false
- ControlPort: false
- TupleMutationAllowed: false
- WindowingMode: NonWindowed
- WindowPunctuationInputMode: Oblivious
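To illustrate chunked input, the sketch below reads a file in fixed-size blocks, so that element boundaries rarely line up with tuple boundaries. The composite name, stream names, file name, XML layout, and block size are assumptions for the example.

```spl
composite ChunkedInputSketch {
  graph
    // Each tuple carries up to 1024 bytes of the file as a blob.
    // A chunk is usually not well-formed XML by itself.
    stream<blob xmlData> XmlBlocks = FileSource() {
      param
        file      : "catalog.xml";
        format    : block;
        blockSize : 1024u;
    }

    // XMLParse treats the chunks as one concatenated document.
    stream<rstring title> Books = XMLParse(XmlBlocks) {
      param
        trigger : "/catalog/book";
        flatten : elements;
    }
}
```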
- Assignments
- This operator allows any SPL expression of the correct type to be assigned to output attributes.
- Output Functions
-
- XMLPathFunctions
-
- <any T> T AsIs(T)
-
Passthrough function
- public rstring XPath(rstring xpathExpn)
-
Extracts a scalar value from a nodeset that contains a single node.
- public list<rstring> XPathList(rstring xpathExpn)
-
Extracts a list of scalars from XML.
- <tuple T> public T XPath (rstring xpathExpn, T tupleLiteral)
-
Extracts a nested tuple value from a nodeset that contains a single node.
- <any T> public list<T> XPathList(rstring xpathExpn, T elements)
-
Extracts a list of objects from XML.
- public map<rstring,rstring> XPathMap(rstring xpathExpn)
-
Extracts a map of XML attributes.
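A hedged sketch of how these output functions might be combined in one invocation follows. It assumes input of the form <catalog> <book id="b1" lang="en"> <title>T</title> <author>A</author> <author>B</author> </book> </catalog>. The stream names, file name, and XPath expressions are illustrative assumptions; in particular, the "@*" argument to XPathMap is assumed to select all XML attributes of the trigger element.

```spl
composite OutputFunctionsSketch {
  graph
    // Hypothetical source: one line of catalog.xml per tuple.
    stream<rstring xmlChunk> XmlChunks = FileSource() {
      param
        file   : "catalog.xml";
        format : line;
    }

    stream<rstring id, rstring title, list<rstring> authors,
           map<rstring, rstring> attrs> Books = XMLParse(XmlChunks) {
      param
        trigger : "/catalog/book";
      output
        Books :
          // Paths are relative to the element matched by the trigger.
          id      = XPath("@id"),
          title   = XPath("title/text()"),
          authors = XPathList("author/text()"),
          attrs   = XPathMap("@*");
    }
}
```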
- Ports (0)
-
The XMLParse operator is configurable with one or more output ports, which have tuples generated from XML input.
Each output port generates tuples that correspond to one subtree of the input XML. The subtree of the XML document that triggers a tuple for a particular port is specified by the trigger parameter, which uses a subset of XPath. Each output stream corresponds to one expression in the trigger parameter. Tuples are generated as the XML documents are parsed: a tuple is output on a stream when the end tag of the element that is identified by the trigger expression for that stream is seen. A WindowMarker punctuation is generated at the end of each XML document. If errors that do not result in an exception occur while the XML is parsed, the errors are logged and no tuples are generated until the start of the next trigger. Receipt of a WindowMarker punctuation resets the XMLParse operator, causing it to start parsing from the beginning of a new XML document.
- Properties
-
- Optional: false
- TupleMutationAllowed: true
- WindowPunctuationOutputMode: Generating
- Ports (1...)
-
Tuples generated from XML input
- Properties
-
- TupleMutationAllowed: true
- WindowPunctuationOutputMode: Generating
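The pairing of output ports with trigger expressions can be sketched as below: the first trigger expression drives the first output stream and the second drives the second. The composite name, stream names, file name, and XML element names are assumptions made for the example.

```spl
composite TwoPortSketch {
  graph
    // Hypothetical source: one line of catalog.xml per tuple.
    stream<rstring xmlChunk> XmlChunks = FileSource() {
      param
        file   : "catalog.xml";
        format : line;
    }

    (stream<rstring title> Books;
     stream<rstring name> Publishers) = XMLParse(XmlChunks) {
      param
        // One expression per output stream, in declaration order.
        trigger : "/catalog/book", "/catalog/publisher";
      output
        Books      : title = XPath("title/text()");
        Publishers : name  = XPath("name/text()");
    }
}
```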
Required: trigger
Optional: attributesName, flatten, ignoreNamespaces, ignorePrefix, ignorecase, nullify, parsing, textName, xmlInput, xmlParseHuge
- attributesName
-
Specifies the SPL attribute name to be used in the handling of implicit XPath. The default value is _attrs.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- flatten
-
Specifies the interpretation of scalar (or list<scalar>) attributes seen in the tuple definition for implicit XPath generation. The valid values are attributes, elements, and none. The default is none.
- Properties
-
- Type: Flatten (none, attributes, elements)
- Cardinality: 1
- Optional: true
- ExpressionMode: CustomLiteral
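As an illustration of flatten, the sketch below assumes input of the form <readings> <reading sensor="t1" value="21.5"/> </readings> and maps the scalar SPL attributes to XML attributes of the same name. The composite name, stream names, file name, and XML layout are assumptions; with flatten : elements, the same schema would instead be matched against child elements named sensor and value.

```spl
composite FlattenSketch {
  graph
    // Hypothetical source: one line of readings.xml per tuple.
    stream<rstring xmlChunk> XmlChunks = FileSource() {
      param
        file   : "readings.xml";
        format : line;
    }

    // No output clause: the XPath for each attribute is generated implicitly.
    stream<rstring sensor, float64 value> Readings = XMLParse(XmlChunks) {
      param
        trigger : "/readings/reading";
        flatten : attributes;
    }
}
```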
- ignoreNamespaces
-
Specifies whether to ignore namespaces in names. If the parameter value is true, the leading namespace prefix (namespace:) of each name in the XML is ignored and only the local name is compared. By default, the parameter value is false and the whole name, including the colon (:), is used. In that case, a name such as foo:bar can be matched only by using XPath("foo:bar") or similar functions.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- ignorePrefix
-
Specifies a string that, if present, is removed from the start of an attribute name that is used to form an implicit XPath directive. You can use this method for XML that contains elements or attributes with SPL or C++ keywords. For example:
stream<rstring __graph> A = XMLParse(Input) {
  param
    trigger : "/a";
    flatten : elements;
    ignorePrefix : "__";
}
This example accepts XML of the following form:
<a> <graph>value</graph> </a>
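Because graph is an SPL keyword, a declaration such as stream<rstring graph> A = XMLParse(Input) is not valid SPL. The __ prefix makes the attribute name legal, and ignorePrefix removes it again when the implicit XPath is formed.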
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- ignorecase
-
Specifies whether to ignore the case of elements and attributes.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- nullify
-
Specifies whether to set the values of missing output attributes by using the default initializer for the type (0 for numeric types, empty for strings and lists), instead of using a default XPath or XPathList expression.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- parsing
-
Specifies the parsing behavior of the XMLParse operator. The valid values are strict and permissive. The default value is strict.
When the parameter value is strict, an exception is raised for invalid conversions of XML data to SPL attributes and the operator terminates. When the parameter value is permissive, an error is logged and execution continues.
- Properties
-
- Type: ParseOption (strict, permissive)
- Cardinality: 1
- Optional: true
- ExpressionMode: CustomLiteral
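A minimal sketch of permissive parsing follows. It assumes input such as <orders> <order> <qty>three</qty> </order> </orders>, where the text cannot be converted to int32. The composite name, stream names, file name, and XML layout are assumptions for the example.

```spl
composite PermissiveSketch {
  graph
    // Hypothetical source: one line of orders.xml per tuple.
    stream<rstring xmlChunk> XmlChunks = FileSource() {
      param
        file   : "orders.xml";
        format : line;
    }

    stream<int32 qty> Orders = XMLParse(XmlChunks) {
      param
        trigger : "/orders/order";
        flatten : elements;
        // An invalid conversion is logged and execution continues,
        // instead of raising an exception that terminates the operator.
        parsing : permissive;
    }
}
```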
- textName
-
Specifies the SPL attribute content name that is used in the handling of implicit XPath. The default value is _text.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
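The sketch below shows how attributesName and textName might be used together to replace the default names _attrs and _text. It rests on an assumption about implicit XPath handling that this page does not spell out: that a map<rstring,rstring> attribute with the attributesName name receives the XML attributes of the trigger element, and an rstring attribute with the textName name receives its text content. The composite name, stream names, file name, and XML layout are also illustrative.

```spl
composite NamedImplicitSketch {
  graph
    // Hypothetical source: one line of notes.xml per tuple,
    // for example <notes> <note author="ann">Call back</note> </notes>.
    stream<rstring xmlChunk> XmlChunks = FileSource() {
      param
        file   : "notes.xml";
        format : line;
    }

    // xmlAttrs and body stand in for the default names _attrs and _text.
    stream<map<rstring, rstring> xmlAttrs, rstring body> Notes = XMLParse(XmlChunks) {
      param
        trigger        : "/notes/note";
        attributesName : "xmlAttrs";
        textName       : "body";
    }
}
```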
- trigger
-
Specifies the subtree of the XML document that triggers a tuple to be output. This parameter is a list of rstring values, one for each output stream, in output stream declaration order. Each rstring contains an absolute XPath expression that identifies the top-level element of a subtree within the XML document. The XPath expression is a UTF-8 string value.
- Properties
-
- Type: rstring
- Optional: false
- ExpressionMode: Constant
- xmlInput
-
Specifies which attribute of the input stream carries the XML data that the operator parses. If there is only one attribute in the input stream, this parameter is optional.
- Properties
-
- Optional: true
- ExpressionMode: Attribute
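When the input stream has more than one attribute, xmlInput names the one to parse. The sketch below uses a Beacon operator to emit a single hypothetical message; the composite name, stream names, attribute names, and XML content are assumptions for the example.

```spl
composite XmlInputSketch {
  graph
    // One tuple with two attributes; only payload holds XML.
    stream<rstring origin, rstring payload> Messages = Beacon() {
      param
        iterations : 1u;
      output
        Messages : origin  = "demo",
                   payload = "<order><id>42</id></order>";
    }

    stream<int32 id> Orders = XMLParse(Messages) {
      param
        trigger  : "/order";
        flatten  : elements;
        // Selects the input attribute that carries the XML data.
        xmlInput : payload;
    }
}
```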
- xmlParseHuge
-
The XMLParse operator uses libxml2, which imposes some arbitrary size limits for internal buffers used for XML parsing. These limits can be removed by setting this parameter to true. The default value for this parameter is false.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- ExpressionMode: AttributeFree
- Implicit XMLParse
-
stream<${schema}> ${outputStream} = XMLParse(${inputStream}) { param trigger : ${triggerExpression}; }
- Explicit XMLParse
-
stream<${schema}> ${outputStream} = XMLParse(${inputStream}) { param trigger : ${triggerExpression}; output ${outputStream} : ${outputAttribute} = ${value}; }
- nInvalidTuples - Counter
-
The number of tuples that failed to convert from XML to an SPL tuple.
- xml-spl