Operator HBASEScan
The HBASEScan operator scans an HBase table. Like the FileSource operator, it has an optional input port. If no input port is specified, the operator scans the table according to the parameters that you specify and then sends a final punctuation. If you specify an input port, the operator does not start a scan until it receives a tuple. After the operator receives a tuple, it scans according to that tuple and produces a punctuation. When you use this operator without an input port, the operator might be multithreaded. One thread is used per HBase region, up to the value that you specify in the maxThreads parameter. If the operator is in a parallel region, it further divides the scanning work among the copies of the operator in the region, based on the channel and maxChannels parameter values, so that each row is scanned exactly once. When the scan is multithreaded or runs as part of a parallel region, the output tuples are not in order.
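The parallel scan described above can be sketched as the following SPL fragment. It is a minimal illustration, not a tested application: the namespace `com.ibm.streamsx.hbase`, the table name `exampleTable`, the width of 3, and the thread count of 4 are assumptions chosen for the example. Because hbaseSite is not set here, HBASE_HOME would have to be set when the operator runs.

```spl
use com.ibm.streamsx.hbase::HBASEScan;   // assumed toolkit namespace

composite ParallelScanSketch {
    graph
        // Three copies of the operator share the scan; per the description
        // above, each row is scanned exactly once across the channels.
        @parallel(width = 3)
        stream<rstring row, rstring columnFamily, rstring columnQualifier, rstring value> Entries = HBASEScan() {
            param
                tableName   : "exampleTable";     // hypothetical table name
                maxThreads  : 4;                  // upper bound on scan threads per copy
                channel     : getChannel();       // index of this copy
                maxChannels : getMaxChannels();   // total number of copies
        }

        // Tuples arrive in no particular order, so the sink only prints them.
        () as Sink = Custom(Entries) {
            logic
                onTuple Entries : println(Entries);
        }
}
```

Downstream logic that depends on row order would need to sort or key the tuples itself, because neither multithreading nor the parallel region preserves order.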
By default, when you do not specify an input port, the operator scans the whole table. If you specify both the startRow and endRow parameters, the scan starts at the row specified in the startRow parameter and ends at the row specified in the endRow parameter. If you specify only the startRow parameter, the table scan starts there. If you specify only the endRow parameter, the table scan starts at the beginning and continues until the row specified in the endRow parameter. If you specify the rowPrefix parameter, the operator scans all rows with that row prefix. When you specify an input port, the operator waits for a tuple to begin the scan. The operator expects the input tuple to contain either (1) a startRow attribute, (2) an endRow attribute, (3) a startRow and an endRow attribute, or (4) a rowPrefix attribute. It then scans according to those attributes and outputs a punctuation. The attributes can be any of the valid input types, such as rstring, ustring, long, or blob. Any other attributes are copied through to the output tuple.
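The two ways of bounding a scan can be sketched as follows. This is an illustrative fragment under stated assumptions: the namespace `com.ibm.streamsx.hbase`, the table name, the row keys, and the `requestId` attribute are hypothetical. The first invocation uses parameters; the second is driven by an input tuple that carries a rowPrefix attribute.

```spl
use com.ibm.streamsx.hbase::HBASEScan;   // assumed toolkit namespace

composite RangeScanSketch {
    graph
        // No input port: scan from startRow (included) up to endRow (excluded).
        stream<rstring row, rstring columnFamily, rstring columnQualifier, rstring value> RangeEntries = HBASEScan() {
            param
                tableName : "exampleTable";   // hypothetical table name
                startRow  : "row100";         // hypothetical row keys
                endRow    : "row200";
        }

        // One request tuple with a rowPrefix attribute and an extra attribute.
        stream<rstring rowPrefix, rstring requestId> Requests = Beacon() {
            param
                iterations : 1u;
            output
                Requests : rowPrefix = "user_", requestId = "r1";
        }

        // With an input port, the scan starts when a tuple arrives. The extra
        // requestId attribute is copied through to the output tuples.
        stream<rstring row, rstring columnFamily, rstring columnQualifier, rstring value, rstring requestId> PrefixEntries = HBASEScan(Requests) {
            param
                tableName : "exampleTable";
        }
}
```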
Two output modes are supported. In tuple mode, each row/columnFamily/columnQualifier/value entry is mapped to a Streams tuple. The row populates the row attribute, the columnFamily populates the columnFamily attribute, the columnQualifier populates the columnQualifier attribute, and the value populates the value attribute. The value attribute can be either a long or a string data type; all other attributes must be of the rstring data type.
In record mode, the value attribute is of tuple type, and each row produces one Streams tuple. The value is populated by taking the attribute names in the value tuple as column qualifiers and placing the values in the attributes that are specified by their column qualifiers.
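The difference between the two output modes shows up in the output schema. The sketch below assumes the namespace `com.ibm.streamsx.hbase`, a hypothetical table `exampleTable`, a hypothetical column family `info`, and hypothetical column qualifiers `title` and `author`; only the attribute names row, columnFamily, columnQualifier, and value come from the description above.

```spl
use com.ibm.streamsx.hbase::HBASEScan;   // assumed toolkit namespace

composite OutputModesSketch {
    type
        // Attribute names double as column qualifiers in record mode.
        BookInfo = tuple<rstring title, rstring author>;

    graph
        // Tuple mode: one output tuple per HBase entry.
        stream<rstring row, rstring columnFamily, rstring columnQualifier, rstring value> Entries = HBASEScan() {
            param
                tableName : "exampleTable";
        }

        // Record mode: the value attribute is a tuple, so each row yields one
        // output tuple whose title and author fields are filled from the
        // columns with those qualifiers.
        stream<rstring row, BookInfo value> Records = HBASEScan() {
            param
                tableName          : "exampleTable";
                staticColumnFamily : "info";
        }
}
```

Restricting record mode to a single column family avoids the ambiguity that arises when several families contain the same column qualifier.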
Behavior in a consistent region
If the operator has an input port, it may not be the source of a consistent region. If it does not have an input port, it can be the source of a region, and the region may either be operator-driven or periodic.
When the operator is in an operator-driven consistent region, the triggerCount parameter must be set to the number of rows to process before triggering a drain. The operator processes approximately that many rows before starting a drain.
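An operator-driven consistent region that starts at the scan might look like the following sketch. The namespace `com.ibm.streamsx.hbase`, the table name, and the trigger count of 1000 are assumptions for illustration.

```spl
use com.ibm.streamsx.hbase::HBASEScan;   // assumed toolkit namespace

composite ConsistentScanSketch {
    graph
        // Applications with a consistent region need a JobControlPlane operator.
        () as JCP = JobControlPlane() {}

        // No input port, so the scan can be the start of the region.
        @consistent(trigger = operatorDriven)
        stream<rstring row, rstring columnFamily, rstring columnQualifier, rstring value> Entries = HBASEScan() {
            param
                tableName    : "exampleTable";   // hypothetical table name
                triggerCount : (int64) 1000;     // drain after roughly 1000 rows
        }

        () as Sink = Custom(Entries) {
            logic
                onTuple Entries : println(Entries);
        }
}
```

For a periodic region, the annotation would instead specify a periodic trigger with a period, and triggerCount would be omitted, because that parameter is valid only in an operator-driven region.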
Summary
- Ports
- This operator has 1 input port and 2 output ports.
- Windowing
- This operator does not accept any windowing configurations.
- Parameters
- This operator supports 19 parameters.
Optional: authKeytab, authPrincipal, channel, endRow, hbaseSite, initDelay, maxChannels, maxThreads, maxVersions, minTimestamp, outAttrName, outputCountAttr, rowPrefix, startRow, staticColumnFamily, staticColumnQualifier, tableName, tableNameAttribute, triggerCount
- Metrics
- This operator does not report any metrics.
Properties
- Implementation
- Java
- Ports (0)
-
Tuple describing scan. Should contain either (1) startRow, (2) endRow, (3) startRow and endRow, or (4) rowPrefix attribute.
- Properties
-
- Optional: true
- ControlPort: false
- WindowingMode: NonWindowed
- WindowPunctuationInputMode: Oblivious
- Assignments
- Java operators do not support output assignments.
- Ports (0)
-
If outAttrName is a list or a primitive type, there will be one tuple per HBase entry. If outAttrName is of type tuple, there will be one output tuple per row, and the attribute names will be taken as the columnQualifiers for those attributes.
- Properties
-
- Optional: false
- WindowPunctuationOutputMode: Generating
- Ports (1)
-
Optional port for error information. This port submits an error message and a tuple when an error occurs during HBase actions.
- Properties
-
- Optional: true
Optional: authKeytab, authPrincipal, channel, endRow, hbaseSite, initDelay, maxChannels, maxThreads, maxVersions, minTimestamp, outAttrName, outputCountAttr, rowPrefix, startRow, staticColumnFamily, staticColumnQualifier, tableName, tableNameAttribute, triggerCount
- authKeytab
-
The authKeytab parameter specifies the Kerberos keytab file that is created for the principal.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- authPrincipal
-
The authPrincipal parameter specifies the Kerberos principal, which is typically the principal that is created for the HBase server.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- channel
-
If this operator is part of a parallel region, it shares the work of scanning with the other copies of the operator in the region. To do this, set this parameter value by calling getChannel(). This parameter is required if the maxChannels parameter has a value other than zero.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- endRow
-
This parameter specifies the row to use to stop the scan. The row that you specify is excluded from the scan.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- hbaseSite
-
The hbaseSite parameter specifies the path of the hbase-site.xml file. This is the recommended way to specify the HBase configuration. If it is not specified, HBASE_HOME must be set when the operator runs, and the operator uses $HBASE_HOME/conf/hbase-site.xml.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- initDelay
-
Delay, in seconds, before starting scan.
- Properties
-
- Type: float64
- Cardinality: 1
- Optional: true
- maxChannels
-
If this operator is part of a parallel region, set this parameter value by calling getMaxChannels(). In a parallel region, the regions to be scanned are divided among the copies of this operator across the channels. If this parameter is set, you must also set the channel parameter.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- maxThreads
-
Maximum number of threads to use to scan the table. Defaults to one.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- maxVersions
-
This parameter specifies the maximum number of versions that the operator returns. It defaults to a value of one. A value of 0 indicates that the operator gets all versions.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- minTimestamp
-
This parameter specifies the minimum timestamp that is used for queries. The operator does not return any entries with a timestamp older than this value. Unless you specify the maxVersions parameter, the operator returns only one entry in this time range.
- Properties
-
- Type: int64
- Cardinality: 1
- Optional: true
- outAttrName
-
This parameter specifies the name of the attribute in which to put the value. It defaults to value. If the attribute is a tuple data type, the attribute names are used as columnQualifiers. If multiple families are included in the scan and they have the same columnQualifiers, there is no way of knowing which columnFamily was used to populate a tuple attribute.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- outputCountAttr
-
This parameter specifies the output attribute in which to put the number of results that are found. When the result is a tuple, this parameter value is the number of attributes that were populated in that tuple.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- rowPrefix
-
This parameter specifies that the scan returns only rows that have this prefix.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- startRow
-
This parameter specifies the row to use to start the scan. The row that you specify is included in the scan.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- staticColumnFamily
-
If this parameter is specified, it will be used as the columnFamily for all operations. (Compare to columnFamilyAttrName.) For HBASEScan, it can have cardinality greater than one.
- staticColumnQualifier
-
If this parameter is specified, it will be used as the columnQualifier for all tuples. HBASEScan allows it to be specified multiple times.
- tableName
-
Name of the HBase table. This parameter is optional, but one of the parameters 'tableName' or 'tableNameAttribute' must be set in the operator. Cannot be used with 'tableNameAttribute'. If the table does not exist, the operator throws an exception.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- tableNameAttribute
-
Name of the attribute on the input tuple containing the tableName. Use this parameter to pass the table name to the operator via input port. Cannot be used with parameter 'tableName'. This is suitable for tables with the same schema.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- ExpressionMode: Attribute
- triggerCount
-
This parameter specifies the number of rows to process before triggering a drain. This parameter is valid only in an operator-driven consistent region.
- Properties
-
- Type: int64
- Cardinality: 1
- Optional: true
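As a rough illustration of how the connection-related parameters above fit together, the following sketch sets hbaseSite, authKeytab, authPrincipal, initDelay, and maxVersions in one invocation. Every path, the principal name, the table name, and the namespace `com.ibm.streamsx.hbase` are hypothetical values chosen for the example; substitute the values for your own cluster.

```spl
use com.ibm.streamsx.hbase::HBASEScan;   // assumed toolkit namespace

composite SecureScanSketch {
    graph
        stream<rstring row, rstring columnFamily, rstring columnQualifier, rstring value> Entries = HBASEScan() {
            param
                tableName     : "exampleTable";                        // hypothetical
                hbaseSite     : "/opt/hbase/conf/hbase-site.xml";      // hypothetical path
                authKeytab    : "/etc/security/keytabs/hbase.keytab";  // hypothetical path
                authPrincipal : "hbase/host.example.com@EXAMPLE.COM";  // hypothetical principal
                initDelay     : 5.0;   // wait 5 seconds before starting the scan
                maxVersions   : 0;     // 0 requests all versions of each entry
        }
}
```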
- Operator class library