Operator HBASEScan

The HBASEScan operator scans an HBase table. Like the FileSource operator, it has an optional input port. If no input port is specified, the operator scans the table according to the parameters that you specify, and then sends the final punctuation. If you specify an input port, the operator does not start a scan until it receives a tuple. After the operator receives a tuple, it scans according to that tuple and produces a punctuation. When you use this operator without an input port, the operator might be multithreaded. One thread is used per HBase region, up to the value that you specify in the maxThreads parameter. If the operator is in a parallel region, it further divides the scanning work among the copies of the operator in the region, based on the channel and maxChannels parameter values, such that each row is scanned exactly once. When the scan is multithreaded or used as part of a parallel region, the tuples are not in order.
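The division of work described above can be pictured with a small, self-contained sketch. The class name, the method names, and the round-robin assignment of regions to channels are assumptions made for illustration; the operator's actual partitioning scheme is not documented here and may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits region indexes across parallel channels. */
final class RegionPartitionSketch {

    /**
     * Returns the indexes of the regions that the given channel scans.
     * Round-robin assignment is an assumption for illustration only.
     */
    static List<Integer> regionsForChannel(int regionCount, int channel, int maxChannels) {
        if (maxChannels <= 0 || channel < 0 || channel >= maxChannels) {
            throw new IllegalArgumentException("channel must be in [0, maxChannels)");
        }
        List<Integer> mine = new ArrayList<>();
        for (int region = channel; region < regionCount; region += maxChannels) {
            mine.add(region);
        }
        return mine;
    }

    /** Threads for one operator copy: one per assigned region, capped by maxThreads, at least one. */
    static int threadsFor(int assignedRegions, int maxThreads) {
        return Math.max(1, Math.min(assignedRegions, maxThreads));
    }
}
```

Because every region index belongs to exactly one channel, each row is visited once across the whole parallel region, which is the property the operator description promises.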

By default, when you do not specify an input port, the operator scans the whole table. If you specify both the startRow and endRow parameters, the scan starts at the row specified in the startRow parameter and stops before the row specified in the endRow parameter, which is excluded from the scan. If you specify only the startRow parameter, the table scan starts there and continues to the end of the table. If you specify only the endRow parameter, the table scan starts at the beginning and scans until the row specified in the endRow parameter. If you specify the rowPrefix parameter, the operator scans all rows with that row prefix. When you specify an input port, the operator waits for a tuple to begin the scan. The operator expects the input tuple to contain either (1) a startRow attribute, (2) an endRow attribute, (3) a startRow and an endRow attribute, or (4) a rowPrefix attribute. It then scans according to that attribute and outputs punctuation. The attributes can be any of the valid input types, such as rstring, ustring, long, or blob. Any other attributes are copied through to the output tuple.
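The range rules above can be sketched with a sorted map standing in for the table. This is an illustration only: the class and method names are hypothetical, and Java string ordering is used in place of the byte-wise ordering that HBase applies to row keys (the two agree for plain ASCII keys).

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch of the scan-range rules; a TreeMap stands in for the table. */
final class ScanRangeSketch {

    /** Rows from startRow (inclusive) up to endRow (exclusive); null means unbounded. */
    static SortedMap<String, String> range(TreeMap<String, String> table,
                                           String startRow, String endRow) {
        if (startRow == null && endRow == null) return table;          // whole table
        if (startRow == null) return table.headMap(endRow, false);     // beginning .. endRow
        if (endRow == null) return table.tailMap(startRow, true);      // startRow .. end
        return table.subMap(startRow, true, endRow, false);            // startRow .. endRow
    }

    /** Rows whose key begins with rowPrefix. */
    static SortedMap<String, String> prefix(TreeMap<String, String> table, String rowPrefix) {
        SortedMap<String, String> hits = new TreeMap<>();
        // Keys that share a prefix are contiguous in sorted order, so stop at the first miss.
        for (String row : table.tailMap(rowPrefix, true).keySet()) {
            if (!row.startsWith(rowPrefix)) break;
            hits.put(row, table.get(row));
        }
        return hits;
    }
}
```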

Two output modes are supported. In tuple mode, each row/columnFamily/columnQualifier/value entry is mapped to a Streams tuple. The row populates the row attribute, the columnFamily populates the columnFamily attribute, the columnQualifier populates the columnQualifier attribute, and the value populates the value attribute. The value can be either a long or a string data type; all other attributes must be of the rstring data type.

In record mode, the value attribute is of tuple type, and each row produces one Streams tuple. The value is populated by taking the attribute names in the value tuple as column qualifiers and placing the values in the attributes that are specified by their column qualifiers.
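The difference between the two output modes can be illustrated with a short sketch. The Cell class, the method names, and the string and map representations of output tuples are hypothetical stand-ins; they are not the operator's types. Whether a row with no matching qualifiers still produces a tuple is not stated above, so the sketch simply emits an empty record in that case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the two output modes; not the operator's code. */
final class OutputModeSketch {

    /** One HBase entry: row, columnFamily, columnQualifier, and value. */
    static final class Cell {
        final String row, family, qualifier, value;
        Cell(String row, String family, String qualifier, String value) {
            this.row = row; this.family = family; this.qualifier = qualifier; this.value = value;
        }
    }

    /** Tuple mode: every entry becomes its own output tuple. */
    static List<String> tupleMode(List<Cell> cells) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            out.add("{row=" + c.row + ", columnFamily=" + c.family
                    + ", columnQualifier=" + c.qualifier + ", value=" + c.value + "}");
        }
        return out;
    }

    /** Record mode: one record per row, filled only for qualifiers that name a value attribute. */
    static Map<String, Map<String, String>> recordMode(List<Cell> cells,
                                                       List<String> valueAttributes) {
        Map<String, Map<String, String>> out = new LinkedHashMap<>();
        for (Cell c : cells) {
            Map<String, String> record = out.computeIfAbsent(c.row, r -> new LinkedHashMap<>());
            if (valueAttributes.contains(c.qualifier)) {
                record.put(c.qualifier, c.value);
            }
        }
        return out;
    }
}
```

In this sketch the size of each per-row record corresponds to the count that the outputCountAttr parameter reports in record mode.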

Behavior in a consistent region

If the operator has an input port, it may not be the source of a consistent region. If it does not have an input port, it can be the source of a region, and the region may either be operator-driven or periodic.

When the operator is in an operator-driven consistent region, the triggerCount parameter must be set to the number of rows to process before triggering a drain. The operator processes approximately that many rows before starting a drain.
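The role of triggerCount can be pictured as a simple row counter. The class below is a hypothetical sketch, not the operator's implementation. It triggers exactly every triggerCount rows, whereas the operator only promises to do so approximately.

```java
/** Hypothetical sketch: counts processed rows and signals when a drain is due. */
final class DrainTriggerSketch {
    private final long triggerCount;
    private long rowsSinceDrain = 0;
    private long drains = 0;

    DrainTriggerSketch(long triggerCount) {
        if (triggerCount <= 0) {
            throw new IllegalArgumentException("triggerCount must be positive");
        }
        this.triggerCount = triggerCount;
    }

    /** Records one processed row; returns true when a drain should be requested. */
    boolean rowProcessed() {
        rowsSinceDrain++;
        if (rowsSinceDrain >= triggerCount) {
            rowsSinceDrain = 0;   // start counting toward the next drain
            drains++;
            return true;
        }
        return false;
    }

    /** Number of drains requested so far. */
    long drains() {
        return drains;
    }
}
```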

Summary

Ports: This operator has 1 input port and 2 output ports.
Windowing: This operator does not accept any windowing configurations.
Parameters: This operator supports 19 parameters.
Optional: authKeytab, authPrincipal, channel, endRow, hbaseSite, initDelay, maxChannels, maxThreads, maxVersions, minTimestamp, outAttrName, outputCountAttr, rowPrefix, startRow, staticColumnFamily, staticColumnQualifier, tableName, tableNameAttribute, triggerCount
Metrics: This operator does not report any metrics.

Properties

Implementation: Java

Input Ports

Ports (0)

Tuple describing scan. Should contain either (1) startRow, (2) endRow, (3) startRow and endRow, or (4) rowPrefix attribute.

Properties

Optional: true

ControlPort: false
WindowingMode: NonWindowed
WindowPunctuationInputMode: Oblivious

Output Ports

Assignments: Java operators do not support output assignments.

Ports (0)

If outAttrName is a list or a primitive type, there will be one tuple per HBase entry. If outAttrName is of type tuple, there will be one output tuple per row, and the attribute names will be taken as the columnQualifiers for those attributes.

Properties

Optional: false

WindowPunctuationOutputMode: Generating

Ports (1)

Optional port for error information. This port submits an error message and a tuple when an error occurs during HBase actions.

Properties

Optional: true

WindowPunctuationOutputMode: Free

Parameters

This operator supports 19 parameters.

Optional: authKeytab, authPrincipal, channel, endRow, hbaseSite, initDelay, maxChannels, maxThreads, maxVersions, minTimestamp, outAttrName, outputCountAttr, rowPrefix, startRow, staticColumnFamily, staticColumnQualifier, tableName, tableNameAttribute, triggerCount

authKeytab

The authKeytab parameter specifies the Kerberos keytab file that is created for the principal.

Properties

Type: rstring
Cardinality: 1
Optional: true

authPrincipal

The authPrincipal parameter specifies the Kerberos principal, which is typically the principal that is created for the HBase server.

Properties

Type: rstring
Cardinality: 1
Optional: true

channel

If this operator is part of a parallel region, it shares the work of scanning with the other copies of the operator in the region. To do this, set this parameter value by calling getChannel(). This parameter is required if the maxChannels parameter has a value other than zero.

Properties

Type: int32
Cardinality: 1
Optional: true

endRow

This parameter specifies the row to use to stop the scan. The row that you specify is excluded from the scan.

Properties

Type: rstring
Cardinality: 1
Optional: true

hbaseSite

The hbaseSite parameter specifies the path of the hbase-site.xml file. This is the recommended way to specify the HBase configuration. If it is not specified, then HBASE_HOME must be set when the operator runs, and the operator uses $HBASE_HOME/conf/hbase-site.xml.

Properties

Type: rstring
Cardinality: 1
Optional: true
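The lookup order for the hbaseSite parameter can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that the fallback location is the conf directory under HBASE_HOME. How the operator resolves a relative hbaseSite path is not covered here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch of how the configuration file location could be chosen. */
final class HBaseSiteSketch {

    /**
     * Prefers the explicit hbaseSite value; otherwise falls back to
     * HBASE_HOME/conf/hbase-site.xml (assumed fallback location).
     */
    static Path resolve(String hbaseSite, String hbaseHome) {
        if (hbaseSite != null && !hbaseSite.isEmpty()) {
            return Paths.get(hbaseSite);
        }
        if (hbaseHome == null || hbaseHome.isEmpty()) {
            throw new IllegalStateException("Set hbaseSite or the HBASE_HOME environment variable");
        }
        return Paths.get(hbaseHome, "conf", "hbase-site.xml");
    }
}
```

A caller would typically pass System.getenv("HBASE_HOME") as the second argument.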

initDelay

Delay, in seconds, before starting the scan.

Properties

Type: float64
Cardinality: 1
Optional: true

maxChannels

If this operator is part of a parallel region, set this parameter value by calling getMaxChannels(). If the operator is in a parallel region, then the regions to be scanned are divided among the other copies of this operator in the other channels. If this parameter is set, you must also set the channel parameter.

Properties

Type: int32
Cardinality: 1
Optional: true

maxThreads

Maximum number of threads to use to scan the table. Defaults to one.

Properties

Type: int32
Cardinality: 1
Optional: true

maxVersions

This parameter specifies the maximum number of versions that the operator returns. It defaults to a value of one. A value of 0 indicates that the operator gets all versions.

Properties

Type: int32
Cardinality: 1
Optional: true

minTimestamp

This parameter specifies the minimum timestamp that is used for queries. The operator does not return any entries with a timestamp older than this value. Unless you specify the maxVersions parameter, the operator returns only one entry in this time range.

Properties

Type: int64
Cardinality: 1
Optional: true
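The interaction between the maxVersions and minTimestamp parameters can be sketched for the versions of a single cell. The class and method names are hypothetical. The sketch assumes that versions are considered newest first and that an entry whose timestamp equals minTimestamp is still returned.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: picks which versions of one cell are returned. */
final class VersionFilterSketch {

    /**
     * @param timestampsNewestFirst version timestamps of one cell, sorted descending
     * @param maxVersions           maximum versions to return; 0 means all versions
     * @param minTimestamp          oldest timestamp to accept, or null for no limit
     */
    static List<Long> select(List<Long> timestampsNewestFirst, int maxVersions, Long minTimestamp) {
        List<Long> kept = new ArrayList<>();
        for (long ts : timestampsNewestFirst) {
            if (minTimestamp != null && ts < minTimestamp) continue;    // too old
            if (maxVersions > 0 && kept.size() >= maxVersions) break;   // enough versions
            kept.add(ts);
        }
        return kept;
    }
}
```

With the default maxVersions value of one, only the newest entry inside the time range is kept, which matches the behavior described for minTimestamp.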

outAttrName

This parameter specifies the name of the attribute in which to put the value. It defaults to value. If the attribute is a tuple data type, the attribute names are used as columnQualifiers. If multiple families are included in the scan and they have the same columnQualifiers, there is no way of knowing which columnFamily was used to populate a tuple attribute.

Properties

Type: rstring
Cardinality: 1
Optional: true

outputCountAttr

This parameter specifies the output attribute in which to put the number of results that are found. When the result is a tuple, this parameter value is the number of attributes that were populated in that tuple.

Properties

Type: rstring
Cardinality: 1
Optional: true

rowPrefix

This parameter specifies that the scan returns only rows that have this prefix.

Properties

Type: rstring
Cardinality: 1
Optional: true

startRow

This parameter specifies the row to use to start the scan. The row that you specify is included in the scan.

Properties

Type: rstring
Cardinality: 1
Optional: true

staticColumnFamily

If this parameter is specified, it will be used as the columnFamily for all operations. (Compare to columnFamilyAttrName.) For HBASEScan, it can have cardinality greater than one.

Properties

Type: rstring
Optional: true

staticColumnQualifier

If this parameter is specified, it will be used as the columnQualifier for all tuples. HBASEScan allows it to be specified multiple times.

Properties

Type: rstring
Optional: true

tableName

Name of the HBase table. This parameter is optional, but exactly one of the 'tableName' and 'tableNameAttribute' parameters must be set in the operator; the two cannot be used together. If the table does not exist, the operator throws an exception.

Properties

Type: rstring
Cardinality: 1
Optional: true

tableNameAttribute

Name of the attribute on the input tuple that contains the table name. Use this parameter to pass the table name to the operator via the input port. Cannot be used with the 'tableName' parameter. This is suitable for tables with the same schema.

Properties

Type: rstring
Cardinality: 1
Optional: true
ExpressionMode: Attribute
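The rule that exactly one of tableName and tableNameAttribute must be set can be expressed as a small check. The class and method names are hypothetical, and the operator performs its own validation, which may report the problem differently.

```java
/** Hypothetical sketch of the tableName / tableNameAttribute rule. */
final class TableNameCheckSketch {

    /** Returns the name of the parameter in use, or throws if zero or both are set. */
    static String validate(String tableName, String tableNameAttribute) {
        boolean hasName = tableName != null && !tableName.isEmpty();
        boolean hasAttr = tableNameAttribute != null && !tableNameAttribute.isEmpty();
        if (hasName == hasAttr) {
            throw new IllegalArgumentException(
                    "Set exactly one of tableName or tableNameAttribute");
        }
        return hasName ? "tableName" : "tableNameAttribute";
    }
}
```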

triggerCount

This parameter specifies the number of rows to process before triggering a drain. This parameter is valid only in an operator-driven consistent region.

Properties

Type: int64
Cardinality: 1
Optional: true

Libraries

Operator class library: Library Path: ../../impl/lib/com.teracloud.streams.hbase.jar, ../../opt/downloaded/*