Operator ObjectStorageSink
The ObjectStorageSink operator writes objects to S3-compliant object storage. It supports basic (HMAC) and IAM authentication.
This operator writes tuples that arrive on its input port to the output object that is named by the objectName parameter. You can optionally have the operator close the current output object and create a new object for writing based on the size of the object in bytes, the number of tuples that are written to the object, the time in seconds that the object is open for writing, or the receipt of a punctuation marker.
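The closing conditions described above can be sketched as a small decision function. This is a minimal illustration, not the operator's implementation: the class RollPolicy and its method should_roll are hypothetical names, and the sketch simply checks every trigger that is configured, without assuming whether the operator allows the triggers to be combined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RollPolicy:
    """Hypothetical container for the object-closing parameters."""
    bytes_per_object: Optional[int] = None     # mirrors bytesPerObject
    tuples_per_object: Optional[int] = None    # mirrors tuplesPerObject
    time_per_object: Optional[float] = None    # mirrors timePerObject (seconds)
    close_on_punct: bool = False               # mirrors closeOnPunct

    def should_roll(self, bytes_written: int, tuples_written: int,
                    seconds_open: float, punct_received: bool = False) -> bool:
        """Return True when the current object should be closed and a new one opened."""
        # Size trigger: the object size exceeds the configured number of bytes.
        if self.bytes_per_object is not None and bytes_written > self.bytes_per_object:
            return True
        # Count trigger: the configured number of tuples has been received.
        if self.tuples_per_object is not None and tuples_written >= self.tuples_per_object:
            return True
        # Time trigger: the object has been open for the configured number of seconds.
        if self.time_per_object is not None and seconds_open >= self.time_per_object:
            return True
        # Punctuation trigger: only relevant when closeOnPunct is enabled.
        return self.close_on_punct and punct_received
```

For example, RollPolicy(tuples_per_object=1000) reports that the object should be closed once 1000 tuples have been written, regardless of size or elapsed time.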
Summary
- Ports
- This operator has 1 input port and 1 output port.
- Windowing
- This operator does not accept any windowing configurations.
- Parameters
- This operator supports 33 parameters.
Required: endpoint, objectStorageURI
Optional: appConfigName, bytesPerObject, closeOnPunct, credentials, dataAttribute, encoding, headerRow, maxAttempts, nullPartitionDefaultValue, objectName, objectNameAttribute, objectStoragePassword, objectStorageUser, parquetBlockSize, parquetCompression, parquetDictPageSize, parquetEnableDict, parquetEnableSchemaValidation, parquetPageSize, parquetWriterVersion, partitionValueAttributes, s3aFastUploadActiveBlocks, s3aFastUploadBuffer, s3aMultipartSize, skipPartitionAttributes, sslEnabled, storageFormat, timeFormat, timePerObject, tuplesPerObject, uploadWorkersNum
- Metrics
- This operator reports 12 metrics.
Properties
- Implementation
- Java
- Ports (0)
-
The ObjectStorageSink operator has one input port. The operator writes the contents of the input stream to the object that you specified. It supports writing data into object storage in two formats: parquet and raw. The raw storage format supports line format and blob format. For line format, the schema of the input port is tuple<rstring line>, which specifies a single rstring attribute that represents a line to be written to the object. For blob format, the schema of the input port is tuple<blob data>, which specifies a block of data to be written to the object.
- Properties
-
- Optional: false
- ControlPort: false
- WindowingMode: NonWindowed
- WindowPunctuationInputMode: Oblivious
- Assignments
- Java operators do not support output assignments.
- Ports (0)
-
The ObjectStorageSink operator is configurable with an optional output port. The schema of the output port is <rstring objectName, uint64 objectSize>, which specifies the name and size of objects that are written to object storage. Note that the tuple is generated when the object upload completes.
- Properties
-
- Optional: true
Parameters
- appConfigName
-
Specifies the name of the application configuration that contains the IBM Cloud Object Storage (COS) IAM credentials. If not set, the default application configuration name is cos. Create a property named cos.creds in the cos application configuration. The value of the cos.creds property should be the raw IBM Cloud Object Storage credentials JSON.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- bytesPerObject
-
Specifies the approximate size of the output object, in bytes. When the object size exceeds the specified number of bytes, the current output object is closed and a new object is opened.
- Properties
-
- Type: int64
- Cardinality: 1
- Optional: true
- closeOnPunct
-
Specifies whether the operator closes the current output object and creates a new object when a punctuation marker is received. The default value is true if none of the parameters timePerObject, tuplesPerObject, and bytesPerObject is set.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
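The default rule for closeOnPunct can be written as a one-line check. The function name default_close_on_punct is hypothetical, and the sketch assumes that the default is false whenever one of the three parameters is set, which the description implies but does not state.

```python
from typing import Optional


def default_close_on_punct(time_per_object: Optional[float] = None,
                           tuples_per_object: Optional[int] = None,
                           bytes_per_object: Optional[int] = None) -> bool:
    """Return the assumed default of closeOnPunct for a given configuration."""
    # True only when none of the other closing parameters is set.
    return (time_per_object is None
            and tuples_per_object is None
            and bytes_per_object is None)
```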
- credentials
-
Specifies the JSON credentials of the IBM Cloud Object Storage (COS) service. When this parameter is set, the application configuration property cos.creds is ignored.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
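The precedence between the credentials parameter and the application configuration can be sketched as follows. The function resolve_credentials and the dictionary-based representation of application configurations are assumptions made for illustration; only the names cos and cos.creds come from the parameter descriptions.

```python
from typing import Mapping, Optional

DEFAULT_APP_CONFIG_NAME = "cos"       # default value of appConfigName
CREDENTIALS_PROPERTY = "cos.creds"    # property that holds the credentials JSON


def resolve_credentials(credentials_param: Optional[str],
                        app_configs: Mapping[str, Mapping[str, str]],
                        app_config_name: Optional[str] = None) -> Optional[str]:
    """Return the credentials JSON string that would take effect, or None."""
    # The credentials parameter wins; cos.creds is ignored when it is set.
    if credentials_param is not None:
        return credentials_param
    # Otherwise look up cos.creds in the named (or default) application configuration.
    name = app_config_name or DEFAULT_APP_CONFIG_NAME
    return app_configs.get(name, {}).get(CREDENTIALS_PROPERTY)
```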
- dataAttribute
-
The name of the attribute containing the data to be written to the object storage.
- Properties
-
- Cardinality: 1
- Optional: true
- ExpressionMode: Attribute
- encoding
-
Specifies the character set encoding that is used in the output object.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- endpoint
-
Specifies the endpoint for the connection to Cloud Object Storage (COS). For example, for S3 the endpoint might be 's3.amazonaws.com'. The default value is the IBM Cloud Object Storage (COS) public endpoint 's3.us.cloud-object-storage.appdomain.cloud'.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: false
- headerRow
-
Specifies whether the operator adds a header row when it starts writing an object. By default, no header row is generated.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- maxAttempts
-
Specifies the number of times the operator retries after an error. The default value is 20.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
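A bounded retry loop in the spirit of maxAttempts might look like the sketch below. The helper call_with_retries is hypothetical, it treats the limit as the total number of tries, and it retries on any exception; which errors the operator actually retries, and whether it waits between attempts, is not specified here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T], max_attempts: int = 20) -> T:
    """Call operation until it succeeds or max_attempts tries have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # Give up and re-raise the last error once the limit is reached.
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")
```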
- nullPartitionDefaultValue
-
Specifies the default value for partitions with null values.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- objectName
-
Specifies the name of the object that the operator writes to.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- objectNameAttribute
-
The name of the attribute containing the object name.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- ExpressionMode: Attribute
- objectStoragePassword
-
Specifies the password for the connection to Cloud Object Storage (COS), also known as 'SecretAccessKey' for S3-compliant COS.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- objectStorageURI
-
Specifies the URI for the connection to Cloud Object Storage (COS). For S3-compliant COS, the URI should be in 'cos://bucket/' or 's3a://bucket/' format. The bucket or container must exist. The operator does not create a bucket or container.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: false
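The URI format can be checked before the application is submitted. The sketch below uses Python's standard urllib.parse module; the function parse_object_storage_uri is a hypothetical helper, and treating the authority part of the URI as the bucket name is an assumption based on the 'cos://bucket/' and 's3a://bucket/' forms.

```python
from typing import Tuple
from urllib.parse import urlsplit

SUPPORTED_SCHEMES = ("cos", "s3a")


def parse_object_storage_uri(uri: str) -> Tuple[str, str]:
    """Split an object storage URI into (scheme, bucket) or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme in {uri!r}; expected cos:// or s3a://")
    if not parts.netloc:
        raise ValueError(f"missing bucket name in {uri!r}")
    # Assumption: the authority component is the bucket name.
    return parts.scheme, parts.netloc
```

A check like this only validates the shape of the URI. It does not verify that the bucket exists, which remains a prerequisite because the operator does not create it.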
- objectStorageUser
-
Specifies the username for the connection to Cloud Object Storage (COS), also known as 'AccessKeyID' for S3-compliant COS.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- parquetBlockSize
-
Specifies the block size, which is the size of a row group that is buffered in memory. The default is 128M.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- parquetCompression
-
Specifies the compression type for the parquet storage format. Supported compression types are 'UNCOMPRESSED', 'SNAPPY', and 'GZIP'.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- parquetDictPageSize
-
Specifies the dictionary page size. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size, but for the dictionary.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- parquetEnableDict
-
Specifies whether the parquet dictionary is enabled.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- parquetEnableSchemaValidation
-
Specifies whether schema validation is enabled.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- parquetPageSize
-
Specifies the page size, which is used for compression. A block is composed of pages. The page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression deteriorates. The default is 1M.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- parquetWriterVersion
-
Specifies the parquet writer version.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- partitionValueAttributes
-
Specifies the list of attributes to be used for partition column values.
- s3aFastUploadActiveBlocks
-
The parameter is valid for protocol s3a only. Defines the maximum number of blocks that a single output stream can have actively uploading at one time. If not set, the default value 8 is used.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- s3aFastUploadBuffer
-
The parameter is valid for protocol s3a only. Determines the buffering mechanism to use for s3a multipart upload. Allowed values are disk, array, and bytebuffer (default). "disk" uses local file system directories as the locations to save data before it is uploaded. "array" uses arrays in the JVM heap. "bytebuffer" uses off-heap memory within the JVM. Both "array" and "bytebuffer" consume memory in a single stream up to s3aMultipartSize * s3aFastUploadActiveBlocks bytes. If you use either of these mechanisms, keep this value low.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
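To make the memory bound concrete, the product can be evaluated with the documented defaults of s3aMultipartSize (5242880 bytes) and s3aFastUploadActiveBlocks (8). The function name max_buffer_bytes_per_stream is hypothetical; the sketch only performs the multiplication stated in the description.

```python
S3A_MULTIPART_SIZE_DEFAULT = 5242880           # bytes, default of s3aMultipartSize
S3A_FAST_UPLOAD_ACTIVE_BLOCKS_DEFAULT = 8      # default of s3aFastUploadActiveBlocks


def max_buffer_bytes_per_stream(multipart_size: int = S3A_MULTIPART_SIZE_DEFAULT,
                                active_blocks: int = S3A_FAST_UPLOAD_ACTIVE_BLOCKS_DEFAULT) -> int:
    """Upper bound on buffered bytes for one output stream with array or bytebuffer."""
    return multipart_size * active_blocks


bound = max_buffer_bytes_per_stream()
print(bound, "bytes =", bound // (1024 * 1024), "MiB")  # 41943040 bytes = 40 MiB
```

With the defaults, one stream can therefore buffer up to 40 MiB. The total footprint grows with the number of objects that are open at the same time.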
- s3aMultipartSize
-
The parameter is valid for protocol s3a only. Defines the size (in bytes) of the chunks into which the upload will be split up. If not set, then the default value 5242880 is used.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- skipPartitionAttributes
-
Specifies whether the attributes that are used as partition columns are omitted from the data files. The default value is false.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- sslEnabled
-
Enables or disables SSL connections to S3. The default is true.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- storageFormat
-
Specifies the storage format that the operator uses. The default is raw, that is, the data is stored in the same format as it is received.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- timeFormat
-
Specifies the time format to use when the objectName parameter value contains %TIME. The parameter value must contain conversion specifications that are supported by java.text.SimpleDateFormat. The default format is yyyyMMdd_HHmmss.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
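The %TIME substitution can be illustrated with a short sketch. The operator itself expects java.text.SimpleDateFormat patterns; the Python code below uses the strftime pattern %Y%m%d_%H%M%S as a stand-in for the default yyyyMMdd_HHmmss, and the function expand_object_name is a hypothetical helper rather than part of the toolkit.

```python
from datetime import datetime

# strftime stand-in for the default SimpleDateFormat pattern yyyyMMdd_HHmmss
DEFAULT_TIME_FORMAT = "%Y%m%d_%H%M%S"


def expand_object_name(object_name: str, now: datetime,
                       time_format: str = DEFAULT_TIME_FORMAT) -> str:
    """Replace every %TIME placeholder in object_name with the formatted timestamp."""
    return object_name.replace("%TIME", now.strftime(time_format))


# Example with a fixed timestamp so the result is reproducible.
print(expand_object_name("events_%TIME.txt", datetime(2024, 1, 2, 3, 4, 5)))
# events_20240102_030405.txt
```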
- timePerObject
-
Specifies the approximate time, in seconds, after which the current output object is closed and a new object is opened for writing.
- Properties
-
- Type: float64
- Cardinality: 1
- Optional: true
- tuplesPerObject
-
Specifies the maximum number of tuples that can be received for each output object. When the specified number of tuples are received, the current output object is closed and a new object is opened for writing.
- Properties
-
- Type: int64
- Cardinality: 1
- Optional: true
- uploadWorkersNum
-
Specifies the number of upload workers.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
Metrics
- cachedData - Counter
-
Data stored in cache in bytes.
- cachedDataMax - Gauge
-
Maximal size of data stored in cache in bytes.
- closeTimeAvg - Time
-
Average time for closing objects on COS in milliseconds.
- closeTimeMax - Time
-
Maximal duration for closing an object on COS in milliseconds.
- closeTimeMin - Time
-
Minimal duration for closing an object on COS in milliseconds.
- nActiveObjects - Counter
-
Number of active (open) objects.
- nCloseFailures - Counter
-
Number of close failures.
- nClosedObjects - Counter
-
Number of closed objects.
- nWriteFailures - Counter
-
Number of failures while creating an object or writing to an object.
- objectSizeMax - Gauge
-
Maximal size of an object uploaded to COS in bytes.
- objectSizeMin - Gauge
-
Minimal size of an object uploaded to COS in bytes.
- startupTimeMillisecs - Time
-
Operator startup time in milliseconds.
- Operator class library