SPL binary encoding
SPL uses a platform-independent binary encoding for serializing and deserializing data. This encoding is used by the FileSource, TCPSource, UDPSource, FileSink, TCPSink, and UDPSink operators when they are configured with the bin format specifier. It is also used by the binary file read and write functions that are provided as part of the spl.file namespace, and by the binary serialization and deserialization functions that are provided for the SPL C++ types.
SPL uses a variable-size scheme to encode the size of values whose types do not have a fixed size, such as strings and lists. A non-negative size n is encoded as follows: If n<128, the size is encoded as a single byte that holds n. If n>=128, the first byte is 128 (0x80), followed by the 4-byte encoding of n in Network Byte Format (NBF), that is, with the most significant byte first. A few examples are given as follows:
3 -> 0x03
85 -> 0x55
127 -> 0x7F
128 -> 0x80 0x00 0x00 0x00 0x80
240 -> 0x80 0x00 0x00 0x00 0xF0
1234 -> 0x80 0x00 0x00 0x04 0xD2
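The variable-size rule above can be sketched in Python as follows. The function names encode_size and decode_size are illustrative and not part of any SPL API. Rejecting a first byte greater than 0x80 is an assumption, because the text only defines 128 as the marker.

```python
import struct

def encode_size(n):
    """Encode a non-negative size with the variable-size scheme."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("size must fit in 4 bytes")
    if n < 128:
        return bytes([n])                      # single byte holding n
    return b"\x80" + struct.pack(">I", n)      # marker 128, then 4 bytes in NBF

def decode_size(buf, offset=0):
    """Return (size, offset of the next unread byte)."""
    first = buf[offset]
    if first < 128:
        return first, offset + 1
    if first != 0x80:
        # Assumption: only 0x80 is a valid marker byte.
        raise ValueError("unexpected marker byte")
    (n,) = struct.unpack_from(">I", buf, offset + 1)
    return n, offset + 5
```

Calling encode_size(1234) yields the bytes 0x80 0x00 0x00 0x04 0xD2, which matches the last example in the list.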
Each SPL type has its own binary encoding, as follows.
Integer types, which include int8, uint8, int16, uint16, int32, uint32, int64, and uint64, are encoded in binary using NBF. Signed values are encoded using two's complement binary encoding.
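As a rough illustration, Python's struct module produces the same layout when the ">" (big-endian) prefix is used. The INT_FORMATS table and the encode_int helper are hypothetical names chosen for this sketch.

```python
import struct

# struct format codes for the SPL integer types.
# ">" selects big-endian (network) byte order with no padding.
INT_FORMATS = {
    "int8": ">b", "uint8": ">B",
    "int16": ">h", "uint16": ">H",
    "int32": ">i", "uint32": ">I",
    "int64": ">q", "uint64": ">Q",
}

def encode_int(spl_type, value):
    """Encode an integer; signed values come out in two's complement."""
    return struct.pack(INT_FORMATS[spl_type], value)

def decode_int(spl_type, data):
    """Decode an integer that was encoded by encode_int."""
    return struct.unpack(INT_FORMATS[spl_type], data)[0]
```

For example, the int16 value -2 is encoded as 0xFF 0xFE.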
A Boolean is encoded with a single byte, where 0 in binary represents false, and 1 in binary represents true.
Float types, which include float32 and float64, are encoded with the IEEE 754 binary32 and binary64 formats, respectively. Again, the encoding is in NBF.
Complex types, which include complex32 and complex64, are encoded by first encoding the real part as a float and then encoding the imaginary part as a float. In other words, complex<n> is encoded as two float<n> values, where the first float is the real part and the second one is the imaginary part.
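The float and complex layouts can be sketched with Python's struct module, which packs IEEE 754 binary32 and binary64 values. The helper names are illustrative only. Python's built-in complex type stands in for the SPL complex types here.

```python
import struct

def encode_float32(x):
    return struct.pack(">f", x)    # IEEE 754 binary32, big-endian

def encode_float64(x):
    return struct.pack(">d", x)    # IEEE 754 binary64, big-endian

def encode_complex32(z):
    # complex32 is two float32 values: real part first, then imaginary part
    return encode_float32(z.real) + encode_float32(z.imag)

def encode_complex64(z):
    # complex64 is two float64 values: real part first, then imaginary part
    return encode_float64(z.real) + encode_float64(z.imag)
```

For example, the complex32 value with real part 1.0 and imaginary part -2.0 is encoded as 0x3F800000 followed by 0xC0000000.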
decimal32 values are encoded as IEEE 754 values in little endian (LE) byte order. The layouts use the following notation:
s - sign bit
c - combination bits
e - exponent continuation
x - coefficient continuation
(n) - refers to 'byte number'
x(n) - is 8 bits of coefficient in byte n
IEEE 754 logical representation: scccccee(0) eeeexxxx(1) x(2) x(3)
SPL binary encoding in LE order: x(3) x(2) eeeexxxx(1) scccccee(0) (that is, the low order byte of the decimal32 is first)
decimal64 values are encoded as 64-bit IEEE 754 values that are stored in LE byte order:
IEEE 754 logical representation: scccccee(0) eeeeeexx(1) x(2) x(3) x(4) x(5) x(6) x(7)
SPL binary encoding in LE order: x(7) x(6) x(5) x(4) x(3) x(2) eeeeeexx(1) scccccee(0)
decimal128 values are encoded as 128-bit IEEE 754 values that are stored as two 64-bit words in LE byte order:
IEEE 754 logical representation: scccccee(0) eeeeeeee(1) eexxxxxx(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(a) x(b) x(c) x(d) x(e) x(f)
SPL binary encoding in LE order: x(7) x(6) x(5) x(4) x(3) eexxxxxx(2) eeeeeeee(1) scccccee(0) x(f) x(e) x(d) x(c) x(b) x(a) x(9) x(8)
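The byte reordering for the decimal types can be sketched as follows. The sketch covers only the byte order shown in the layouts above and does not construct the IEEE 754 bit pattern itself. The function name decimal_to_spl_order is hypothetical, and its input is assumed to be the logical representation with byte 0 first.

```python
def decimal_to_spl_order(logical):
    """Reorder the logical bytes (byte 0 first) into SPL's encoded order."""
    if len(logical) in (4, 8):
        # decimal32 and decimal64: plain byte reversal
        return logical[::-1]
    if len(logical) == 16:
        # decimal128: two 64-bit words, each reversed, bytes 0-7 emitted first
        return logical[:8][::-1] + logical[8:][::-1]
    raise ValueError("expected 4, 8, or 16 bytes")
```

Applied to sixteen bytes numbered 0 through f, this returns 7 6 5 4 3 2 1 0 f e d c b a 9 8, which is the order given for decimal128.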
String types include rstring and ustring. An rstring is encoded by first encoding the number of bytes in the string using SPL's variable-size encoding for sizes, followed by the binary encodings of the individual bytes contained in the string.
An rstring[n] is encoded by first encoding the number of bytes in the string using uint8 for n<2^8, uint16 for 2^8<=n<2^16, and uint32 for 2^16<=n<2^32, followed by n+1 bytes encoding the string. Any bytes past the number of bytes in the string are not part of the current value of the string and might be ignored.
A ustring is encoded by first encoding the number of code units in the UTF-16 encoding of the Unicode string using SPL's variable-size encoding for sizes, followed by the binary encodings of the individual code units that are contained in the string. Each UTF-16 code unit is a uint16 and is encoded using NBF.
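A minimal sketch of the unbounded string encodings follows. The helper names are illustrative. Python's "utf-16-be" codec is used because it emits big-endian code units without a byte order mark, which matches the NBF requirement for code units. Note that the ustring size counts code units, so a character outside the Basic Multilingual Plane counts as two.

```python
import struct

def encode_size(n):
    """SPL variable-size encoding for sizes."""
    if n < 128:
        return bytes([n])
    return b"\x80" + struct.pack(">I", n)

def encode_rstring(data):
    """data is a bytes object; the size prefix counts bytes."""
    return encode_size(len(data)) + data

def encode_ustring(text):
    """text is a str; the size prefix counts UTF-16 code units."""
    units = text.encode("utf-16-be")       # big-endian code units, no BOM
    return encode_size(len(units) // 2) + units
```

For example, the rstring "abc" is encoded as 0x03 followed by the three bytes of the string, and the ustring "hi" is encoded as 0x02 0x00 0x68 0x00 0x69.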
A timestamp is encoded as an int64 representing the seconds field, followed by a uint32 representing the nanoseconds field, followed by a uint32 representing the machine id field.
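The timestamp layout amounts to 16 bytes. The sketch below assumes that each field follows the NBF rule for integer types. The function name and the default machine id of 0 are choices made for this example.

```python
import struct

def encode_timestamp(seconds, nanoseconds, machine_id=0):
    # int64 seconds, uint32 nanoseconds, uint32 machine id.
    # ">" means big-endian with no padding between the fields.
    return struct.pack(">qII", seconds, nanoseconds, machine_id)

def decode_timestamp(data):
    """Return (seconds, nanoseconds, machine_id)."""
    return struct.unpack(">qII", data)
```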
A blob is encoded by first encoding the size of the blob as a uint64, followed by the binary encodings of the individual bytes in the blob.
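Unlike strings and lists, a blob uses a fixed 8-byte size prefix. A short sketch, assuming the uint64 size is written in NBF like other integers (encode_blob is an illustrative name):

```python
import struct

def encode_blob(data):
    """Fixed-width uint64 size, then the raw bytes of the blob."""
    return struct.pack(">Q", len(data)) + data
```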
Lists are encoded by first encoding the size of the list using SPL's variable-size encoding for sizes, followed by the SPL binary encodings of the individual elements in the list.
Sets are encoded by first encoding the size of the set using SPL's variable-size encoding for sizes, followed by the SPL binary encodings of the individual elements in the set.
Maps are encoded by first encoding the size of the map using SPL's variable-size encoding for sizes, followed by the SPL binary encodings of the individual pairs of keys and values in the map. For each pair, the key is encoded first, followed by the value.
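The unbounded collection encodings compose the size prefix with per-element encoders. In this sketch the element encoders are passed in as functions, which is a design choice for the example only. A set is encoded the same way as a list; the order in which set elements or map pairs are written is not specified in the text, so the sketch simply uses Python's iteration order.

```python
import struct

def encode_size(n):
    """SPL variable-size encoding for sizes."""
    if n < 128:
        return bytes([n])
    return b"\x80" + struct.pack(">I", n)

def encode_list(items, encode_elem):
    """Size prefix, then each element; also usable for sets."""
    return encode_size(len(items)) + b"".join(encode_elem(x) for x in items)

def encode_map(mapping, encode_key, encode_value):
    """Size prefix, then key followed by value for each pair."""
    out = encode_size(len(mapping))
    for key, value in mapping.items():
        out += encode_key(key) + encode_value(value)
    return out

# Hypothetical element encoders used for illustration
def encode_int32(v):
    return struct.pack(">i", v)

def encode_boolean(v):
    return b"\x01" if v else b"\x00"
```

For example, encode_list([1, 2], encode_int32) produces one size byte followed by two 4-byte integers.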
Bounded lists are encoded by first encoding the size (the number of used elements) of the bounded list, followed by the SPL binary encodings of the individual elements in the list, including the unused elements (using the default value of the element type). The number of elements that are encoded is always equal to the bound of the list. Unused elements come after the used elements. The size of the bounded list is encoded using the smallest SPL unsigned integer type that can hold the bound of the list. If n is the bound, then uint8 is used for n<2^8, uint16 for 2^8<=n<2^16, and uint32 for 2^16<=n<2^32.
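A sketch of the bounded list layout follows. The names size_format and encode_bounded_list are hypothetical. The sketch picks the smallest unsigned type that can hold the bound, so a bound of exactly 2^8 uses uint16, and it assumes that the size field is written in NBF.

```python
import struct

def size_format(bound):
    """struct format of the smallest unsigned type that can hold the bound."""
    if bound < 2**8:
        return ">B"
    if bound < 2**16:
        return ">H"
    if bound < 2**32:
        return ">I"
    raise ValueError("bound too large")

def encode_bounded_list(items, bound, encode_elem, default):
    if len(items) > bound:
        raise ValueError("more items than the bound allows")
    # Used elements first, then unused slots filled with the default value
    padded = list(items) + [default] * (bound - len(items))
    out = struct.pack(size_format(bound), len(items))
    return out + b"".join(encode_elem(x) for x in padded)

# Hypothetical element encoder used for illustration
def encode_int16(v):
    return struct.pack(">h", v)
```

A list<int16>[4] that holds 7 and 9 would therefore occupy 1 + 4*2 = 9 bytes regardless of how many elements are in use.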
Bounded sets are encoded by first encoding the size (the number of used elements) of the bounded set, followed by the SPL binary encodings of the individual elements in the set, including the unused elements (using the default value of the element type), followed by a series of Boolean encodings that indicate whether each of the previously encoded elements is used or unused in the set. Unlike in a list, the used elements in a set can be dispersed. The number of elements that are encoded and the number of Boolean values that are encoded for tracking which elements are used are both always equal to the bound of the set. The size of the bounded set is encoded using the smallest SPL unsigned integer type that can hold the bound of the set. See bounded lists for more details.
Bounded maps are encoded by first encoding the size (the number of used elements) of the bounded map, followed by the SPL binary encodings of the individual elements (pairs of keys and values) in the map, including the unused elements (using the default value of the element type), followed by a series of Boolean encodings that indicate whether each of the previously encoded elements is used or unused in the map. Similar to a set, the used elements in a map can be dispersed. The number of elements that are encoded and the number of Boolean values that are encoded for tracking which elements are used are both always equal to the bound of the map. Each element in a bounded map contains a key and value, and the key is encoded before the value. The size of the bounded map is encoded using the smallest SPL unsigned integer type that can hold the bound of the map. See bounded lists for more details.
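The bounded set and bounded map layouts can be sketched as below. The in-memory model is an assumption made for this example: the container is a list of slots whose length equals the bound, and None marks an unused slot. The function names are hypothetical, and the size field is assumed to be in NBF.

```python
import struct

def size_format(bound):
    """struct format of the smallest unsigned type that can hold the bound."""
    if bound < 2**8:
        return ">B"
    if bound < 2**16:
        return ">H"
    return ">I"

def encode_bounded_set(slots, encode_elem, default):
    used = [s is not None for s in slots]
    out = struct.pack(size_format(len(slots)), sum(used))   # number of used slots
    out += b"".join(encode_elem(default if s is None else s) for s in slots)
    return out + bytes(1 if u else 0 for u in used)          # one Boolean per slot

def encode_bounded_map(slots, encode_key, encode_value, default_key, default_value):
    used = [s is not None for s in slots]
    out = struct.pack(size_format(len(slots)), sum(used))
    for s in slots:
        key, value = (default_key, default_value) if s is None else s
        out += encode_key(key) + encode_value(value)         # key before value
    return out + bytes(1 if u else 0 for u in used)

# Hypothetical element encoder used for illustration
def encode_uint8(v):
    return struct.pack(">B", v)
```

With a bound of 3 and only the middle slot used, the set encoding is one size byte, three element bytes, and three Boolean bytes.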
Tuple types are encoded by following the order of the attributes that are specified in the SPL language definition of the tuple type and encoding each attribute using SPL's binary encoding for the attribute type. Attribute names are not encoded.
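Because attribute names are not encoded, both sides must agree on the tuple type in advance. The sketch below encodes a hypothetical tuple<int32 id, boolean ok, float64 score> by concatenating the attribute encodings in declaration order. The helper names are illustrative.

```python
import struct

def encode_int32(v):
    return struct.pack(">i", v)

def encode_boolean(v):
    return b"\x01" if v else b"\x00"

def encode_float64(v):
    return struct.pack(">d", v)

def encode_tuple(values, encoders):
    """values and encoders are both given in declared attribute order."""
    if len(values) != len(encoders):
        raise ValueError("one encoder is needed per attribute")
    return b"".join(enc(v) for enc, v in zip(encoders, values))

# Encoders for the hypothetical tuple<int32 id, boolean ok, float64 score>
EXAMPLE_ENCODERS = [encode_int32, encode_boolean, encode_float64]
```

Encoding the values (1, true, 1.0) yields 4 + 1 + 8 = 13 bytes with no separators or names between the attributes.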
Optional types are encoded by first setting a single byte to 1 if the optional type has a data value, or to 0 if there is no data value (the value of the optional type is null). If the optional type has a data value, this byte is followed by the SPL binary encoding of the data value; otherwise, there are no additional bytes.
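A short sketch of the optional encoding, using Python's None to stand in for the SPL null value (the function names are illustrative):

```python
import struct

def encode_optional(value, encode_value):
    if value is None:
        return b"\x00"                      # no data value, nothing follows
    return b"\x01" + encode_value(value)    # presence byte, then the value

# Hypothetical value encoder used for illustration
def encode_int32(v):
    return struct.pack(">i", v)
```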
Enumerations are encoded using a uint32 that represents the index of the enumeration value in the list of valid values as they appear in the SPL language definition of the enum type.
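As an illustration, the sketch below encodes a value of a hypothetical enum {red, green, blue}. It assumes that the index is zero-based and that the uint32 is written in NBF; neither detail is stated explicitly in the text.

```python
import struct

def encode_enum(value, declared_values):
    """declared_values lists the enum values in declaration order."""
    index = declared_values.index(value)    # assumption: zero-based index
    return struct.pack(">I", index)

# Hypothetical enum type used for illustration
COLORS = ["red", "green", "blue"]
```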
XML values are encoded as a 1-byte version number with value 0x01, followed by the XML value that is encoded as an rstring.
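The XML layout can be sketched as a version byte in front of an rstring. The function name encode_xml is illustrative, and the document is passed as bytes because the text does not specify a character encoding for the XML content.

```python
import struct

def encode_size(n):
    """SPL variable-size encoding for sizes."""
    if n < 128:
        return bytes([n])
    return b"\x80" + struct.pack(">I", n)

def encode_xml(document):
    """document is a bytes object that holds the XML text."""
    version = b"\x01"                                  # 1-byte version number
    return version + encode_size(len(document)) + document
```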
C++ APIs provided for serializing and deserializing SPL types also support a non-NBF, native encoding. The native encoding follows the same encoding scheme except that it does not employ NBF for encoding individual pieces.
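The difference between the two encodings can be illustrated with Python's struct module, where ">" selects big-endian order and "=" selects the byte order of the host. This is only an analogy for the C++ APIs, under the assumption that native encoding means host byte order; the function name is hypothetical.

```python
import struct
import sys

# True when the host already stores integers most significant byte first
HOST_IS_BIG_ENDIAN = sys.byteorder == "big"

def encode_uint32(value, native=False):
    # "=" is host byte order with standard sizes; ">" is big-endian (NBF)
    return struct.pack("=I" if native else ">I", value)
```

On a little-endian host the native form of 1 is 0x01 0x00 0x00 0x00, while the NBF form is 0x00 0x00 0x00 0x01 on every host.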
The binary encoding is also used at the transport layer for communicating across PEs. The largest tuple that can be transported across PEs can have at most 2^32-1 bytes when encoded in binary.