Operator HBASEScan
The HBASEScan operator scans an HBase table, driven either by operator parameters or by input tuples.
- No input port: The operator scans the table using the specified parameters and sends a final punctuation marker.
- With input port: The operator waits for an input tuple, then scans according to the tuple’s attributes and produces a punctuation marker.
Controlling scan range
- By default (no input port, no range parameters), the entire table is scanned.
- startRow and endRow parameters can be used to define the scan range:
- When both specified: scan from startRow to endRow.
- When only startRow is specified: scan from startRow to end of table.
- When only endRow is specified: scan from start of table to endRow.
- rowPrefix parameter can be used to scan all rows with the given prefix.
Input tuple attributes (when using an input port)
- The input tuple must contain at least one of: startRow, endRow, or rowPrefix.
- Attribute types can be rstring, ustring, int64, or blob.
- Other attributes are copied to the output tuple.
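The input-tuple rules above can be sketched as follows. This is a minimal illustration, not a tested application: the composite name, the stream names, the row keys, and the queryId attribute are assumptions made for the example. It relies only on the attribute names (startRow, endRow) and the copy-through behavior described above.

```spl
use com.teracloud.streams.hbase::HBASEScan;

composite RangeQueryScan {
  graph
    // A single scan request. The row keys and queryId are made-up values.
    stream<rstring startRow, rstring endRow, int32 queryId> RangeQueries = Beacon() {
      param
        iterations: 1u;
      output
        RangeQueries: startRow = "Jasper", endRow = "Xavier", queryId = 1;
    }

    // queryId is not a scan attribute, so it is expected to be copied
    // from the input tuple to every output tuple of that scan.
    stream<rstring row, rstring columnFamily, rstring columnQualifier,
           rstring value, int32 queryId> RangeResults = HBASEScan(RangeQueries) {
      param
        tableName: "streamsSample";
        outAttrName: "value";
    }

    () as Printer = Custom(RangeResults) {
      logic
        onTuple RangeResults: println(RangeResults);
    }
}
```

Carrying an identifier such as queryId through the scan lets a downstream operator match results back to the request that produced them.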
Output modes
- Cell mode: If the output stream attribute is of type rstring, ustring, int64, or blob, then an output tuple is submitted per (row, columnFamily, columnQualifier) entry and the attribute value will be the value of the HBase cell.
- Column mode: If the output stream attribute is a tuple type, then an output tuple is submitted per (row, columnFamily) entry. Each attribute name in the tuple type is considered a columnQualifier and the attribute value will contain its associated value.
Threading and parallelism
- Without an input port, the operator may use multiple threads (one per HBase region, up to maxThreads).
- In a parallel region, scanning is divided among parallel channels using the channel and maxChannels parameters, ensuring each row is scanned once.
- When multithreaded or parallel, output tuple order is not guaranteed.
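A parallel scan might be set up as in the following sketch. The composite name, stream names, and the width of 3 are assumptions for illustration. The channel and maxChannels values come from the standard SPL functions getChannel() and getMaxChannels(), as the parameter descriptions suggest.

```spl
use com.teracloud.streams.hbase::HBASEScan;

composite ParallelScan {
  graph
    // Three channels (an arbitrary choice); each copy of the operator
    // scans its own share of the table.
    @parallel(width = 3)
    stream<rstring row, rstring columnFamily, rstring columnQualifier,
           rstring value> Cells = HBASEScan() {
      param
        tableName: "streamsSample";
        outAttrName: "value";
        channel: getChannel();
        maxChannels: getMaxChannels();
    }

    // Tuples from different channels can arrive in any order.
    () as Printer = Custom(Cells) {
      logic
        onTuple Cells: println(Cells);
    }
}
```

Because output order is not guaranteed in this configuration, any downstream logic that needs rows in key order has to sort them itself.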
Behavior in a consistent region
If the operator has an input port, it may not be the source of a consistent region. If it does not have an input port, it can be the source of a region, and the region may either be operator-driven or periodic.
In an operator-driven consistent region, the triggerCount parameter must be set to the number of rows to process before triggering a drain. The operator processes approximately that many rows before starting a drain.
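As a rough sketch of the operator-driven case, the invocation below starts a consistent region at the scan and asks for a drain about every 1000 rows. The composite name, stream names, and the value 1000 are assumptions. The @consistent annotation and the JobControlPlane operator are standard SPL features rather than part of this toolkit.

```spl
use com.teracloud.streams.hbase::HBASEScan;

composite ConsistentScan {
  graph
    // Applications that use consistent regions need a JobControlPlane operator.
    () as JCP = JobControlPlane() {}

    // No input port, so the scan can be the start of the region.
    @consistent(trigger = operatorDriven)
    stream<rstring row, rstring columnFamily, rstring columnQualifier,
           rstring value> Cells = HBASEScan() {
      param
        tableName: "streamsSample";
        outAttrName: "value";
        // int64 literal: drain after roughly this many rows (arbitrary value).
        triggerCount: 1000l;
    }

    () as Printer = Custom(Cells) {
      logic
        onTuple Cells: println(Cells);
    }
}
```

For a periodic region, the annotation would instead use trigger = periodic with a period, and triggerCount would be omitted.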
Examples
use com.teracloud.streams.hbase::HBASEScan;

public composite Main {
  type
    BookType = rstring author_lname, rstring author_fname, rstring year, rstring rating, rstring title;

  graph
    // No input port, full scan. Output cell values.
    stream<rstring row, rstring columnFamily, rstring columnQualifier, int64 value> CellResults = HBASEScan() {
      param
        tableName: "streamsSample";
        outAttrName: "value";
    }

    // No input port, range scan for two column families. Output cell values.
    // A parameter can appear only once per invocation, so multiple
    // column families are given as a comma-separated list.
    stream<rstring row, rstring columnQualifier, rstring value> CellResults2 = HBASEScan() {
      param
        tableName: "streamsSample";
        staticColumnFamily: "books", "songs";
        startRow: "Jasper";
        endRow: "Xavier";
        outAttrName: "value";
    }

    // Input port, row prefix scan. Output columns.
    // The query values below are illustrative; without them Beacon would
    // send empty strings for tableName and rowPrefix.
    stream<rstring tableName, rstring rowPrefix> ScanQueries = Beacon() {
      param
        iterations: 1u;
      output
        ScanQueries: tableName = "streamsSample", rowPrefix = "J";
    }

    stream<rstring row, rstring columnFamily, BookType books> ColumnResults = HBASEScan(ScanQueries) {
      param
        tableNameAttribute: "tableName";
        outAttrName: "books";
    }
}
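The connection and authentication parameters can be combined in the same way. The following sketch assumes a Kerberos-secured cluster. The composite name, the file paths, and the principal are hypothetical placeholders, not values taken from the toolkit.

```spl
use com.teracloud.streams.hbase::HBASEScan;

composite SecureScan {
  graph
    stream<rstring row, rstring columnFamily, rstring columnQualifier,
           rstring value> SecureCells = HBASEScan() {
      param
        tableName: "streamsSample";
        outAttrName: "value";
        // Hypothetical location of the HBase client configuration.
        hbaseSite: "etc/hbase-site.xml";
        // Hypothetical Kerberos principal and its keytab file.
        authPrincipal: "hbase/host.example.com@EXAMPLE.COM";
        authKeytab: "etc/hbase.service.keytab";
    }

    () as Printer = Custom(SecureCells) {
      logic
        onTuple SecureCells: println(SecureCells);
    }
}
```

Setting hbaseSite explicitly avoids the dependency on the HBASE_HOME environment variable at run time.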
Summary
- Ports
- This operator has 1 input port and 2 output ports.
- Windowing
- This operator does not accept any windowing configurations.
- Parameters
- This operator supports 19 parameters.
Optional: authKeytab, authPrincipal, channel, endRow, hbaseSite, initDelay, maxChannels, maxThreads, maxVersions, minTimestamp, outAttrName, outputCountAttr, rowPrefix, startRow, staticColumnFamily, staticColumnQualifier, tableName, tableNameAttribute, triggerCount
- Metrics
- This operator does not report any metrics.
Properties
- Implementation
- Java
- Input port (0)
-
Tuple that describes the scan. It should contain one of the following: (1) a startRow attribute, (2) an endRow attribute, (3) both startRow and endRow attributes, or (4) a rowPrefix attribute.
- Properties
-
- Optional: true
- ControlPort: false
- WindowingMode: NonWindowed
- WindowPunctuationInputMode: Oblivious
- Assignments
- Java operators do not support output assignments.
- Output port (0)
-
If the attribute named by outAttrName is a list or a primitive type, there is one output tuple per HBase entry. If it is of type tuple, there is one output tuple per row, and the attribute names of that tuple type are taken as the columnQualifiers.
- Properties
-
- Optional: false
- WindowPunctuationOutputMode: Generating
- Output port (1)
-
Optional port for error information. This port submits an error message and a tuple when an error occurs during an HBase action.
- Properties
-
- Optional: true
Parameters
- authKeytab
-
The authKeytab parameter specifies the Kerberos keytab file that is created for the principal.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- authPrincipal
-
The authPrincipal parameter specifies the Kerberos principal, which is typically the principal that is created for the HBase server.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- channel
-
If this operator is part of a parallel region, it shares the work of scanning with the other copies of the operator in the region. To enable this, set this parameter to getChannel(). This parameter is required if the maxChannels parameter has a value other than zero.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- endRow
-
This parameter specifies the row to use to stop the scan. The row that you specify is excluded from the scan.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- hbaseSite
-
The hbaseSite parameter specifies the path of the hbase-site.xml file. This is the recommended way to specify the HBase configuration. If it is not specified, HBASE_HOME must be set when the operator runs, and the operator uses $HBASE_HOME/conf/hbase-site.xml.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- initDelay
-
Delay, in seconds, before the scan starts.
- Properties
-
- Type: float64
- Cardinality: 1
- Optional: true
- maxChannels
-
If this operator is part of a parallel region, set this parameter to getMaxChannels(). The regions to be scanned are then divided among the copies of this operator in the parallel channels. If this parameter is set, you must also set the channel parameter.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- maxThreads
-
Maximum number of threads to use to scan the table. Defaults to one.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- maxVersions
-
This parameter specifies the maximum number of versions that the operator returns. It defaults to a value of one. A value of 0 indicates that the operator gets all versions.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- minTimestamp
-
This parameter specifies the minimum timestamp that is used for queries. The operator does not return any entries with a timestamp older than this value. Unless you specify the maxVersions parameter, the operator returns only one entry in this time range.
- Properties
-
- Type: int64
- Cardinality: 1
- Optional: true
- outAttrName
-
This parameter specifies the name of the output attribute that receives the value. It defaults to "value". If the attribute is a tuple data type, its attribute names are used as columnQualifiers. If multiple families are included in the scan and they have the same columnQualifiers, there is no way of knowing which columnFamily was used to populate a tuple attribute.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- outputCountAttr
-
This parameter specifies the output attribute in which to put the number of results that are found. When the result is a tuple, this value is the number of attributes that were populated in that tuple.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- rowPrefix
-
This parameter restricts the scan to rows whose key starts with this prefix.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- startRow
-
This parameter specifies the row to use to start the scan. The row that you specify is included in the scan.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- staticColumnFamily
-
If this parameter is specified, it will be used as the columnFamily for all operations. (Compare to columnFamilyAttrName.) For HBASEScan, it can have cardinality greater than one.
- staticColumnQualifier
-
If this parameter is specified, it will be used as the columnQualifier for all tuples. HBASEScan allows it to be specified multiple times.
- tableName
-
Name of the HBase table. This parameter is optional, but one of 'tableName' or 'tableNameAttribute' must be set on the operator, and the two cannot be used together. If the table does not exist, the operator throws an exception.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- tableNameAttribute
-
Name of the attribute on the input tuple that contains the table name. Use this parameter to pass the table name to the operator via the input port. Cannot be used with the 'tableName' parameter. This is suitable for tables that share the same schema.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- ExpressionMode: Attribute
- triggerCount
-
This parameter specifies the number of rows to process before triggering a drain. This parameter is valid only in an operator-driven consistent region.
- Properties
-
- Type: int64
- Cardinality: 1
- Optional: true
- Operator class library