SPL binary encoding

SPL uses a platform-independent binary encoding for serializing and deserializing data.

This encoding is used by the FileSource, TCPSource, UDPSource, FileSink, TCPSink, and UDPSink operators when they are configured with the bin format specifier. It is also used by the binary file read and write functions that are provided as part of the spl.file namespace, and by the binary serialization and deserialization functions that are provided for the SPL C++ types.

SPL uses a variable-size scheme to encode the size of types that do not have a fixed size, such as strings and lists. A non-negative size n is encoded as follows: If n<128, the size is encoded as a single byte that represents n in binary. If n>=128, the first byte is 128 (0x80), followed by the 4-byte encoding of n in Network Byte Format (NBF), that is, with the most significant byte first. A few examples are given as follows:

  3    -> 0x03
  85   -> 0x55
  127  -> 0x7F
  128  -> 0x80 0x00 0x00 0x00 0x80
  240  -> 0x80 0x00 0x00 0x00 0xF0
  1234 -> 0x80 0x00 0x00 0x04 0xD2
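
The size encoding above can be sketched in Python as follows. This is a minimal illustration, not part of any SPL API: the names encode_size and decode_size are hypothetical, and rejecting a first byte greater than 0x80 is an assumption, since the text only describes the 0x80 marker.

```python
import struct

def encode_size(n):
    """Encode a non-negative size using the variable-size scheme."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("size must fit in an unsigned 32-bit integer")
    if n < 128:
        return bytes([n])                  # single byte holding n
    return b"\x80" + struct.pack(">I", n)  # marker 0x80, then 4 bytes in NBF

def decode_size(buf, offset=0):
    """Decode a size at `offset`; return (size, offset of the next byte)."""
    first = buf[offset]
    if first < 128:
        return first, offset + 1
    if first != 0x80:
        # Assumption: any other first byte is treated as malformed input.
        raise ValueError("unexpected size marker")
    (n,) = struct.unpack_from(">I", buf, offset + 1)
    return n, offset + 5

# Reproduce the examples listed above.
for n in (3, 85, 127, 128, 240, 1234):
    print(n, "->", " ".join(f"0x{b:02X}" for b in encode_size(n)))
```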

Each SPL type has its own binary encoding, as described in the following paragraphs.

Integer types, which include int8, uint8, int16, uint16, int32, uint32, int64, and uint64, are encoded in binary using NBF. Signed values are encoded using two's complement binary encoding.

A Boolean is encoded with a single byte, where 0 in binary represents false, and 1 in binary represents true.
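
As a rough sketch of the integer and Boolean encodings, Python's struct module can produce the same byte layout, because its ">" prefix selects big-endian order with standard sizes and two's complement for signed values. The helper names encode_int and encode_boolean and the INT_FORMATS table are hypothetical.

```python
import struct

# struct format for each SPL integer type; ">" selects big-endian (NBF).
INT_FORMATS = {
    "int8": ">b", "uint8": ">B",
    "int16": ">h", "uint16": ">H",
    "int32": ">i", "uint32": ">I",
    "int64": ">q", "uint64": ">Q",
}

def encode_int(spl_type, value):
    """Encode an integer of the named SPL type in NBF."""
    return struct.pack(INT_FORMATS[spl_type], value)

def encode_boolean(value):
    """Encode a Boolean as a single byte: 1 for true, 0 for false."""
    return b"\x01" if value else b"\x00"

print(encode_int("int16", -2).hex())     # two's complement, high byte first
print(encode_int("uint32", 1234).hex())
print(encode_boolean(True).hex())
```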

Float types, which include float32 and float64, are encoded with the IEEE 754 binary32 and binary64 formats, respectively. Again, the encoding is in NBF.

Complex types, which include complex32 and complex64, are encoded by first encoding the real part as a float and then encoding the imaginary part as a float. In other words, complex<n> is encoded as two float<n> values, where the first float is the real part and the second one is the imaginary part.
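
The float and complex layouts can be illustrated with a short Python sketch. The function names are hypothetical; the struct formats ">f" and ">d" correspond to IEEE 754 binary32 and binary64 in big-endian order.

```python
import struct

def encode_float32(x):
    return struct.pack(">f", x)   # IEEE 754 binary32, NBF

def encode_float64(x):
    return struct.pack(">d", x)   # IEEE 754 binary64, NBF

def encode_complex32(real, imag):
    # complex32: real part first, then imaginary part, each a float32.
    return encode_float32(real) + encode_float32(imag)

def encode_complex64(real, imag):
    # complex64: real part first, then imaginary part, each a float64.
    return encode_float64(real) + encode_float64(imag)

print(encode_float64(1.0).hex())
print(encode_complex32(1.0, -2.0).hex())
```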

decimal32 values are encoded as IEEE 754 values in little endian (LE) byte order:

s - sign bit
c - combination bits
e - exponent continuation
x - coefficient continuation
(n) - refers to 'byte number'
x(n) - is 8 bits of coefficient in byte n

IEEE 754 logical representation: scccccee(0) eeeexxxx(1) x(2) x(3)
SPL binary encoding in LE order:  x(3) x(2) eeeexxxx(1) scccccee(0) (i.e. the low order byte of the decimal32 is first)

decimal64 values are encoded as 64-bit IEEE 754 values that are stored in LE byte order:

IEEE 754 logical representation: scccccee(0) eeeeeexx(1) x(2) x(3) x(4) x(5) x(6) x(7)
SPL binary encoding in LE order: x(7) x(6) x(5) x(4) x(3) x(2) eeeeeexx(1) scccccee(0)

decimal128 values are encoded as 128-bit IEEE 754 values that are stored as two 64-bit words in LE byte order:

IEEE 754 logical representation: scccccee(0) eeeeeeee(1) eexxxxxx(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(a) x(b) x(c) x(d) x(e) x(f)
SPL binary encoding in LE order: x(7) x(6) x(5) x(4) x(3) eexxxxxx(2) eeeeeeee(1) scccccee(0) x(f) x(e) x(d) x(c) x(b) x(a) x(9) x(8)
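
The byte reordering for the three decimal types can be sketched as follows. This example only rearranges bytes that are already in the IEEE 754 logical order shown above (byte 0 holds the sign bit). It does not build the decimal bit pattern itself, and the function name to_spl_decimal_order is hypothetical.

```python
def to_spl_decimal_order(logical):
    """Reorder IEEE 754 logical bytes into the SPL binary encoding order."""
    if len(logical) in (4, 8):
        # decimal32 and decimal64: plain little-endian byte order.
        return logical[::-1]
    if len(logical) == 16:
        # decimal128: two 64-bit words, each stored in little-endian order,
        # with the word that holds bytes 0-7 first.
        return logical[:8][::-1] + logical[8:][::-1]
    raise ValueError("expected 4, 8, or 16 bytes")

# Use the byte numbers themselves as placeholder contents.
print(list(to_spl_decimal_order(bytes(range(4)))))
print(list(to_spl_decimal_order(bytes(range(16)))))
```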

String types include rstring and ustring. An rstring is encoded by first encoding the number of bytes in the string using SPL's variable-size encoding for sizes, followed by the individual bytes contained in the string.

An rstring[n] is encoded by first encoding the number of bytes in the string using uint8 for n<2^8, uint16 for 2^8<=n<2^16, and uint32 for 2^16<=n<2^32, followed by n+1 bytes encoding the string. Any bytes past the number of bytes in the string are not part of the current value of the string and can be ignored.

A ustring is encoded by first encoding the number of code units in the UTF-16 encoding of the Unicode string using SPL's variable-size encoding for sizes, followed by the individual code units that are contained in the string. Each UTF-16 code unit is a uint16 and is encoded using NBF.
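
The unbounded rstring and ustring encodings can be sketched in Python as shown here. The helper names are hypothetical, encode_size repeats the variable-size scheme for sizes so that the example is self-contained, and rstring[n] is not covered.

```python
import struct

def encode_size(n):
    """Variable-size encoding for sizes (1 byte below 128, else 0x80 + 4 bytes)."""
    return bytes([n]) if n < 128 else b"\x80" + struct.pack(">I", n)

def encode_rstring(data):
    """Encode raw bytes as an rstring: byte count, then the bytes."""
    return encode_size(len(data)) + data

def encode_ustring(text):
    """Encode a Unicode string as a ustring: code unit count, then code units."""
    units = text.encode("utf-16-be")   # big-endian code units, no byte order mark
    return encode_size(len(units) // 2) + units

print(encode_rstring(b"abc").hex())
# A character outside the Basic Multilingual Plane counts as two code units.
print(encode_ustring("\U0001F600").hex())
```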

A time stamp is encoded as an int64 representing the seconds field, followed by a uint32 representing the nanoseconds field, followed by a uint32 representing the machine id field.
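
A time stamp therefore occupies 16 bytes. A minimal sketch, assuming that each field uses NBF like the other integer types (encode_timestamp is a hypothetical name):

```python
import struct

def encode_timestamp(seconds, nanoseconds, machine_id=0):
    # int64 seconds, uint32 nanoseconds, uint32 machine id, all big-endian.
    return struct.pack(">qII", seconds, nanoseconds, machine_id)

encoded = encode_timestamp(1, 2, 3)
print(len(encoded), encoded.hex())
```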

A blob is encoded by first encoding the size of the blob as a uint64, followed by the binary encodings of the individual bytes in the blob.
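
Unlike strings and lists, a blob always uses a fixed 8-byte size prefix. A small sketch, assuming the uint64 size is written in NBF like other integers (encode_blob is a hypothetical name):

```python
import struct

def encode_blob(data):
    # uint64 byte count in big-endian order, followed by the raw bytes.
    return struct.pack(">Q", len(data)) + data

print(encode_blob(b"\x01\x02").hex())
```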

Lists are encoded by first encoding the size of the list using SPL's variable-size encoding for sizes, followed by the SPL binary encodings of the individual elements in the list.

Sets are encoded by first encoding the size of the set using SPL's variable-size encoding for sizes, followed by the SPL binary encodings of the individual elements in the set.

Maps are encoded by first encoding the size of the map using SPL's variable-size encoding for sizes, followed by the SPL binary encodings of the individual pairs of keys and values in the map. For each pair, the key is encoded first, followed by the value.
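
The unbounded list, set, and map encodings share the same shape: a size followed by the elements. The following Python sketch illustrates that shape with hypothetical helper names. Elements are written in the container's iteration order, which is an assumption, because the text does not state an element order for sets and maps.

```python
import struct

def encode_size(n):
    """Variable-size encoding for sizes (1 byte below 128, else 0x80 + 4 bytes)."""
    return bytes([n]) if n < 128 else b"\x80" + struct.pack(">I", n)

def encode_int32(value):
    return struct.pack(">i", value)

def encode_collection(items, encode_elem):
    """Encode a list or set: element count, then each element."""
    items = list(items)
    return encode_size(len(items)) + b"".join(encode_elem(x) for x in items)

def encode_map(mapping, encode_key, encode_value):
    """Encode a map: pair count, then each key followed by its value."""
    body = b"".join(encode_key(k) + encode_value(v) for k, v in mapping.items())
    return encode_size(len(mapping)) + body

print(encode_collection([1, 2], encode_int32).hex())   # list<int32>
print(encode_map({1: 2}, encode_int32, encode_int32).hex())   # map<int32, int32>
```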

Bounded lists are encoded by first encoding the size (the number of used elements) of the bounded list, followed by the SPL binary encodings of the individual elements in the list, including the unused elements (using the default value of the element type). The number of elements that are encoded is always equal to the bound of the list. Unused elements come after the used elements. The size of the bounded list is encoded using the smallest SPL unsigned integer type that can hold the bound of the list. If n is the bound, then uint8 is used for n<2^8, uint16 for 2^8<=n<2^16, and uint32 for 2^16<=n<2^32.
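
A bounded list therefore always has the same encoded length for a given type. The following sketch illustrates this with hypothetical helper names. It assumes that the size field uses NBF and that 0 is the default value of the integer element type.

```python
import struct

def size_format(bound):
    """Pick the smallest unsigned integer format that can hold the bound."""
    if bound < 2**8:
        return ">B"
    if bound < 2**16:
        return ">H"
    if bound < 2**32:
        return ">I"
    raise ValueError("bound too large")

def encode_bounded_list(items, bound, encode_elem, default):
    """Encode the used-element count, then exactly `bound` elements."""
    if len(items) > bound:
        raise ValueError("more elements than the bound allows")
    padded = list(items) + [default] * (bound - len(items))  # unused slots last
    head = struct.pack(size_format(bound), len(items))
    return head + b"".join(encode_elem(x) for x in padded)

def encode_int16(value):
    return struct.pack(">h", value)

# list<int16>[4] holding two used elements.
print(encode_bounded_list([7, 9], 4, encode_int16, 0).hex())
```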

Bounded sets are encoded by first encoding the size (the number of used elements) of the bounded set, followed by the SPL binary encodings of the individual elements in the set, including the unused elements (using the default value of the element type), followed by a series of Boolean encodings that represent whether the previously encoded elements are used or unused in the set. Unlike in a list, the used elements in a set can be dispersed. Both the number of elements that are encoded and the number of Booleans that are encoded for tracking which elements are used are always equal to the bound of the set. The size of the bounded set is encoded using the smallest SPL unsigned integer type that can hold the bound of the set. See bounded lists for more details.

Bounded maps are encoded by first encoding the size (the number of used elements) of the bounded map, followed by the SPL binary encodings of the individual elements (pairs of keys and values) in the map, including the unused elements (using the default value of the element type), followed by a series of Boolean encodings that represent whether the previously encoded elements are used or unused in the map. Similar to a set, the used elements in a map can be dispersed. Both the number of elements that are encoded and the number of Booleans that are encoded for tracking which elements are used are always equal to the bound of the map. Each element in a bounded map contains a key and value, and the key is encoded before the value. The size of the bounded map is encoded using the smallest SPL unsigned integer type that can hold the bound of the map. See bounded lists for more details.
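
Bounded sets and bounded maps share one layout: a used count, one element per slot, and one Boolean per slot. The following sketch models the slots as a Python list in which None marks an unused slot. The helper names, this slot model, the NBF size field, and the use of 0 as the default value are all assumptions made for illustration.

```python
import struct

def size_format(bound):
    """Pick the smallest unsigned integer format that can hold the bound."""
    if bound < 2**8:
        return ">B"
    if bound < 2**16:
        return ">H"
    return ">I"

def encode_bounded_slots(slots, encode_elem, default):
    """Encode used count, every slot (default if unused), then the used flags."""
    used = [slot is not None for slot in slots]
    out = struct.pack(size_format(len(slots)), sum(used))
    out += b"".join(encode_elem(default if s is None else s) for s in slots)
    out += bytes(1 if u else 0 for u in used)   # one Boolean byte per slot
    return out

def encode_int8(value):
    return struct.pack(">b", value)

def encode_pair(pair):
    # Map element: key first, then value (both uint8 in this example).
    return struct.pack(">BB", pair[0], pair[1])

# set<int8>[3] with used elements in slots 0 and 2.
print(encode_bounded_slots([5, None, 8], encode_int8, 0).hex())
# map<uint8, uint8>[2] with one used pair in slot 1.
print(encode_bounded_slots([None, (1, 2)], encode_pair, (0, 0)).hex())
```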

Tuple types are encoded by following the order of the attributes that are specified in the SPL language definition of the tuple type and encoding each attribute using SPL's binary encoding for the attribute type. Attribute names are not encoded.
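
Because attribute names are not encoded, the reader must know the tuple type in advance. A minimal sketch for a hypothetical type tuple<int32 id, rstring name, boolean active>, with hypothetical helper names and with string sizes limited to the single-byte case for brevity:

```python
import struct

def encode_int32(value):
    return struct.pack(">i", value)

def encode_rstring(data):
    # Sketch only: handles sizes below 128, which need a single size byte.
    assert len(data) < 128
    return bytes([len(data)]) + data

def encode_boolean(value):
    return b"\x01" if value else b"\x00"

def encode_tuple(values, encoders):
    """Encode attribute values in declaration order; names are not written."""
    return b"".join(enc(v) for v, enc in zip(values, encoders))

schema = (encode_int32, encode_rstring, encode_boolean)
print(encode_tuple((7, b"ab", True), schema).hex())
```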

Optional types are encoded by first setting a single byte to 1 if the optional type has a data value, or to 0 if there is no data value (the value of the optional type is null). If the optional type has a data value, this is followed by the SPL binary encoding of the data value, otherwise there are no additional bytes.
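
The optional encoding can be sketched as a thin wrapper around the encoder of the underlying type. In this illustration, None stands in for the SPL null value, and the helper names are hypothetical.

```python
import struct

def encode_int32(value):
    return struct.pack(">i", value)

def encode_optional(value, encode_inner):
    """Write 0 for null, or 1 followed by the encoded data value."""
    if value is None:
        return b"\x00"
    return b"\x01" + encode_inner(value)

print(encode_optional(None, encode_int32).hex())
print(encode_optional(5, encode_int32).hex())
```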

Enumerations are encoded using a uint32 that represents the index of the enumeration value in the list of valid values as they appear in the SPL language definition of the enum type.
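
For example, for a hypothetical type enum {red, green, blue}, the value blue has index 2. A small sketch, assuming the index is zero-based and written in NBF (encode_enum is a hypothetical name):

```python
import struct

def encode_enum(valid_values, value):
    """Encode an enum value as the uint32 index of its position in the type."""
    return struct.pack(">I", valid_values.index(value))

colors = ["red", "green", "blue"]
print(encode_enum(colors, "blue").hex())
```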

XML values are encoded as a 1-byte version number with value 0x01, followed by the XML value that is encoded as an rstring.
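
A short sketch of the XML layout follows. It assumes that the XML text is converted to bytes as UTF-8 before it is written as an rstring, which the text does not state. The helper names are hypothetical.

```python
import struct

def encode_size(n):
    """Variable-size encoding for sizes (1 byte below 128, else 0x80 + 4 bytes)."""
    return bytes([n]) if n < 128 else b"\x80" + struct.pack(">I", n)

def encode_xml(text):
    data = text.encode("utf-8")   # assumption: UTF-8 bytes
    # Version byte 0x01, then the XML value as an rstring.
    return b"\x01" + encode_size(len(data)) + data

print(encode_xml("<a/>").hex())
```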

C++ APIs provided for serializing and deserializing SPL types also support a non-NBF, native encoding. The native encoding follows the same encoding scheme except that it does not employ NBF for encoding individual pieces.
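
The difference between the two encodings can be illustrated in Python, where the struct prefix "=" selects the byte order of the host and ">" selects NBF. This sketch only mimics the layout and does not use the SPL C++ APIs. The encode_int32 helper and its native flag are hypothetical.

```python
import struct
import sys

def encode_int32(value, native=False):
    # "=" uses the host byte order with standard sizes; ">" is big-endian (NBF).
    return struct.pack("=i" if native else ">i", value)

print("host byte order:", sys.byteorder)
print("NBF:   ", encode_int32(1234).hex())
print("native:", encode_int32(1234, native=True).hex())
```

On a little-endian host the two results differ, so data written with the native encoding is not portable across platforms with different byte orders.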

The binary encoding is also used at the transport layer for communicating across PEs. The largest tuple that can be transported across PEs can have at most 2^32-1 bytes when encoded in binary.