Batching in a sink operator
Batching is used in a sink operator to transmit a batch of information that represents multiple tuples to the external system in a single operation.
When batching tuples, an implementation choice here is to either manage a batch of tuple objects, or manage the batch in the external system's Java™ API.
Batching using Tuple objects
When batching
tuples, the Operator.process()
adds the incoming
tuple object to a set of tuples. A separate batch processing method
receives the set of tuples and interacts with the external system's Java™ API to translate and transmit
the batch as efficiently as possible. The Java™ Operator sample class com.ibm.streams.operator.samples.patterns.TupleConsumer
provides
generic batching function that is based on a set of tuples.
Batching using the external system's Java™ API
For batching using the external
system's Java™ API, the Operator.process()
method
translates the tuple object, and then adds it to any batching mechanism
provided by the API. Then, the batch processing method would transmit
the batch to the external system. For example, with JDBC, the PreparedStatement API
information from the tuples would be translated into parameters and
added using addBatch()
, and then run later using executeBatch()
.
Batching considerations
- What defines a complete batch, for example, number of tuples, elapsed time, or memory consumption of the batch.
- How to handle final punctuation, which indicates no more tuples arrive and thus the processing of the operator can complete. Typically, the batch is processed even though it might be incomplete according to the definition of a complete batch.
- How to handle delays in tuple arrivals. In any stream, there might be delays in tuple arrivals, which would delay the transmission of the batch to the external system. This arrival delay might cause delays in responses as seen by the external system. For example, a dashboard might not be updated. Typically, this problem is solved by having a timeout where the incomplete batch is transmitted to the external system if there is no incoming tuple for some time.