Operator HDFS2FileCopy
The HDFS2FileCopy operator copies files from an HDFS file system to the local disk, and also in the opposite direction, from the local disk to the HDFS file system.
- copyFromLocalFile : Copies a file from the local disk to the HDFS file system.
- copyToLocalFile : Copies a file from the HDFS file system to the local disk.
Recursive copying of directories and subdirectories is not supported.
Summary
- Ports
- This operator has 1 input port and 1 output port.
- Windowing
- This operator does not accept any windowing configurations.
- Parameters
- This operator supports 23 parameters.
Required: direction
Optional: appConfigName, authKeytab, authPrincipal, configPath, credFile, credentials, deleteSourceFile, hdfsFile, hdfsFileAttrName, hdfsPassword, hdfsUri, hdfsUser, keyStorePassword, keyStorePath, libPath, localFile, localFileAttrName, overwriteDestinationFile, policyFilePath, reconnectionBound, reconnectionInterval, reconnectionPolicy
- Metrics
- This operator does not report any metrics.
Properties
- Implementation
- Java
- Input Ports (0)
-
The HDFS2FileCopy operator has one input port, which delivers the file names that you specified. The input port is non-mutating, and its punctuation mode is Oblivious.
The schema of the input port, which specifies the local and HDFS file names, can have one of the following formats:
- <rstring localFileAttrName, rstring hdfsFileAttrName>
- <rstring localFileAttrName>
- <rstring hdfsFileAttrName>
- Properties
-
- Optional: true
- ControlPort: false
- WindowingMode: NonWindowed
- WindowPunctuationInputMode: Oblivious
- Assignments
- Java operators do not support output assignments.
- Output Ports (0)
-
The HDFS2FileCopy operator is configurable with an optional output port. The output port is non-mutating, and its punctuation mode is Free.
The schema of the output port is <rstring result, uint64 elapsedTime>, which delivers the result of the copy process and the elapsed time in milliseconds.
In case of an error, the error message is returned in the result attribute.
- Properties
-
- Optional: true
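As a sketch of how the ports fit together, the following SPL excerpt copies files named on the input port from the local disk into HDFS and logs the results. The file name filesToCopy.csv, the stream names, and the name node address are assumptions chosen for illustration:

    use com.teracloud.streams.hdfs::HDFS2FileCopy;

    composite CopyFilesToHDFS {
        graph
            // Hypothetical source: one tuple per file to copy, read from a CSV file.
            stream<rstring localFile, rstring hdfsFile> FilesToCopy = FileSource() {
                param
                    file   : "filesToCopy.csv";
                    format : csv;
            }

            // Copy each named file from the local disk into HDFS.
            stream<rstring result, uint64 elapsedTime> CopyStatus = HDFS2FileCopy(FilesToCopy) {
                param
                    direction         : copyFromLocalFile;
                    localFileAttrName : "localFile";            // input attribute holding the local path
                    hdfsFileAttrName  : "hdfsFile";             // input attribute holding the HDFS path
                    hdfsUri           : "hdfs://hdfshost:8020"; // assumed name node address
            }

            // Print the result message and elapsed time of every copy operation.
            () as ResultSink = Custom(CopyStatus) {
                logic onTuple CopyStatus :
                    printStringLn(result + " (" + (rstring)elapsedTime + " ms)");
            }
    }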
Parameters
Required: direction
Optional: appConfigName, authKeytab, authPrincipal, configPath, credFile, credentials, deleteSourceFile, hdfsFile, hdfsFileAttrName, hdfsPassword, hdfsUri, hdfsUser, keyStorePassword, keyStorePath, libPath, localFile, localFileAttrName, overwriteDestinationFile, policyFilePath, reconnectionBound, reconnectionInterval, reconnectionPolicy
- appConfigName
-
This optional parameter specifies the name of the application configuration that contains the HDFS connection property credentials. The credentials property is a JSON string that contains key/value pairs for user, password, and webhdfs. If a value is specified both in the application configuration and as an operator parameter, the application configuration value takes precedence. An application configuration can be created in the Streams Console or with streamtool mkappconfig ... <configObject name>; a sketch of such a command follows the properties below.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
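For illustration, an application configuration holding the credentials property could be created as follows. The configuration name hdfsConfig and all property values are placeholders, and the exact quoting depends on your shell:

    streamtool mkappconfig --property credentials='{"user":"clsadmin","password":"IAE-password","webhdfs":"webhdfs://ip-address:8443"}' hdfsConfig

The operator would then reference it with appConfigName : "hdfsConfig";.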
- authKeytab
-
This optional parameter specifies the file that contains the encrypted keys for the user that is specified by the authPrincipal parameter. The operator uses this keytab file to authenticate the user. The keytab file is generated by the administrator. You must specify this parameter to use Kerberos authentication.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- authPrincipal
-
This parameter specifies the Kerberos principal that you use for authentication. This value is set to the principal that is created for the Streams instance owner. You must specify this parameter if you want to use Kerberos authentication.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
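As a sketch, Kerberos authentication could be configured by setting the two parameters together; the principal name and keytab path are placeholders:

    param
        authPrincipal : "streamsadmin@EXAMPLE.COM"; // placeholder Kerberos principal
        authKeytab    : "etc/streamsadmin.keytab";  // placeholder path to the administrator-generated keytab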
- configPath
-
This optional parameter specifies the path to the directory that contains the HDFS configuration file core-site.xml.
If you have extra HDFS configuration in your hdfs-site.xml file, you must also copy that configuration file from your Hadoop server into the configPath directory.
If this parameter is not specified, the operator looks by default for the core-site.xml and hdfs-site.xml files in the following locations:
- $HADOOP_HOME/etc/hadoop
- $HADOOP_HOME/conf
- $HADOOP_HOME/lib
- $HADOOP_HOME/
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- credFile
-
This optional parameter specifies a file that contains login credentials. The credentials are used to connect remotely to WebHDFS by using the scheme webhdfs://hdfshost:webhdfsport. The credentials file must contain a valid JSON string with the HDFS credentials as key/value pairs for user, password, and webhdfs.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- credentials
-
This optional parameter specifies the JSON string that contains the HDFS credentials key/value pairs for user, password, and webhdfs.
This parameter can also be specified in an application configuration.
The JSON string must have the following format:
{ "user" : "clsadmin", "password" : "IAE-password", "webhdfs" : "webhdfs://ip-address:8443" }
In the JSON string, it is also possible to use hdfsUser instead of user, hdfsPassword instead of password, and hdfsUri instead of webhdfs.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
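Because the parameter value is an SPL rstring, the quotation marks inside the JSON string must be escaped. A sketch with placeholder values:

    param
        credentials : "{ \"user\" : \"clsadmin\", \"password\" : \"IAE-password\", \"webhdfs\" : \"webhdfs://ip-address:8443\" }";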
- deleteSourceFile
-
This optional parameter specifies whether to delete the source file when processing is finished.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- direction
-
This parameter specifies the direction of copy. The parameter can be set with the following values.
- copyFromLocalFile : Copies a file from the local disk to the HDFS file system.
- copyToLocalFile : Copies a file from the HDFS file system to the local disk.
- Properties
-
- Type: com.teracloud.streams.hdfs.HDFS2FileCopy.copyDirection (copyFromLocalFile, copyToLocalFile)
- Cardinality: 1
- Optional: false
- ExpressionMode: CustomLiteral
- hdfsFile
-
This optional parameter specifies the name of the HDFS file or directory. If the name starts with a slash, it is considered an absolute path of the HDFS file that you want to use. If it does not start with a slash, it is considered a relative path under /user/userid (that is, /user/userid/hdfsFile). If you want to copy all incoming files from the input port into a directory, set the value of direction to copyFromLocalFile and set the value of this parameter to a directory with a slash at the end, e.g. /user/userid/testDir/ (see the sketch after this parameter's properties).
This parameter is mandatory if hdfsFileAttrName is not specified in the input port schema.
The parameter hdfsFile cannot be set when the parameter hdfsFileAttrName is set.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
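A minimal sketch of the directory case described above; the attribute name localFile and the directory name are placeholders. Every file named on the input port is copied into /user/userid/testDir/ because the value ends with a slash:

    param
        direction         : copyFromLocalFile;
        localFileAttrName : "localFile";             // attribute that delivers the local file names
        hdfsFile          : "/user/userid/testDir/"; // trailing slash: treated as the target directory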
- hdfsFileAttrName
-
This optional parameter specifies the name of the input stream attribute that delivers the HDFS file name. If the attribute value starts with a slash, it is considered an absolute path of the HDFS file that you want to copy. If it does not start with a slash, it is considered a relative path under /user/userid.
This parameter is mandatory if hdfsFile is not specified.
The parameter hdfsFileAttrName cannot be set when the parameter hdfsFile is set.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- hdfsPassword
-
This parameter specifies the password to use when connecting to a Hadoop instance deployed on IBM Analytics Engine. If this parameter is not specified, attempts to connect to a Hadoop instance deployed on IBM Analytics Engine cause an exception.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- hdfsUri
-
This parameter specifies the uniform resource identifier (URI) that the operator uses to connect to the HDFS file system. The URI has one of the following formats:
- To access HDFS locally or remotely, use hdfs://hdfshost:hdfsport
- To access GPFS locally, use gpfs:/// .
- To access GPFS remotely, use webhdfs://hdfshost:webhdfsport .
- To access HDFS via a web connection for HDFS deployed on IBM Analytics Engine, use webhdfs://webhdfshost:webhdfsport .
If this parameter is not specified, the operator expects that the HDFS URI is specified as the fs.defaultFS or fs.default.name property in the core-site.xml HDFS configuration file. The operator expects the core-site.xml file to be in $HADOOP_HOME/../hadoop-conf or $HADOOP_HOME/etc/hadoop or in the directory specified by the configPath parameter. Note: For connections to HDFS on IBM Analytics Engine, the $HADOOP_HOME environment variable is not supported and so either hdfsUri or configPath must be specified.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
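For example, a WebHDFS connection to HDFS on IBM Analytics Engine might combine hdfsUri with the user, password, and keystore parameters. A sketch; host, port, credentials, and the keystore path are placeholders:

    param
        hdfsUri      : "webhdfs://webhdfshost:8443";
        hdfsUser     : "clsadmin";
        hdfsPassword : "IAE-password";
        keyStorePath : "etc/store.pem"; // public certificate of the HDFS server (placeholder path)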
- hdfsUser
-
This parameter specifies the user ID to use when you connect to the HDFS file system. If this parameter is not specified, the operator uses the instance owner ID to connect to HDFS. When you connect to Hadoop instances on IBM Analytics Engine, this parameter must be specified; otherwise, the connection fails. When you use Kerberos authentication, the operator authenticates with the Hadoop file system as the instance owner by using the values that are specified in the authPrincipal and authKeytab parameters. After successful authentication, the operator uses the user ID that is specified by the hdfsUser parameter to perform all other operations on the file system.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- keyStorePassword
-
This optional parameter is supported only when connecting to a Hadoop instance deployed on IBM Analytics Engine. It specifies the password for the keystore file. Specify this parameter when the keyStorePath parameter is specified and the keystore file is protected by a password. If the keyStorePassword is invalid, the operator terminates.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- keyStorePath
-
This optional parameter is supported only when connecting to a Hadoop instance deployed on IBM Analytics Engine. It specifies the path to the keystore file, which is in PEM format. The keystore file is used when making a secure connection to the HDFS server and must contain the public certificate of the HDFS server that will be connected to. Note: If this parameter is omitted, invalid certificates for secure connections are accepted. If the keystore file does not exist, or if the certificate it contains is invalid, the operator terminates. The location of the keystore file can be an absolute path on the file system or a path that is relative to the application directory. See the section on SSL Configuration in the main page of this toolkit's documentation for information on how to configure the keystore.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- libPath
-
This optional parameter specifies the absolute path to the directory that contains the Hadoop library files. If this parameter is omitted and $HADOOP_HOME is not set, the Apache Hadoop libraries in the impl/lib/ext folder of the toolkit are used. When specified, this parameter takes precedence over the $HADOOP_HOME environment variable, and the libraries in the folder indicated by $HADOOP_HOME are not used.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- localFile
-
This optional parameter specifies the name of the local file to be copied. If the name starts with a slash, it is considered an absolute path of the local file that you want to copy. If it does not start with a slash, it is considered a relative path, relative to your project data directory. If you want to copy all incoming files from the input port into a directory, set the value of direction to copyToLocalFile and set the value of this parameter to a directory with a slash at the end, e.g. data/testDir/ (see the sketch after this parameter's properties).
This parameter is mandatory if localFileAttrName is not specified in the input port schema.
The parameter localFile cannot be set when the parameter localFileAttrName is set.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
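A minimal sketch of the opposite direction, copying every HDFS file named on the input port into the local directory data/testDir/; the attribute name hdfsFile and the directory name are placeholders:

    param
        direction        : copyToLocalFile;
        hdfsFileAttrName : "hdfsFile";      // attribute that delivers the HDFS file names
        localFile        : "data/testDir/"; // trailing slash: treated as the target directory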
- localFileAttrName
-
This optional parameter specifies the name of the input stream attribute that delivers the local file name. If the attribute value starts with a slash, it is considered an absolute path of the local file that you want to copy. If it does not start with a slash, it is considered a relative path, relative to your project data directory.
This parameter is mandatory if localFile is not specified.
The parameter localFileAttrName cannot be set when the parameter localFile is set.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- overwriteDestinationFile
-
This optional parameter specifies whether to overwrite the destination file.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- policyFilePath
-
This optional parameter is relevant when connecting to IBM Analytics Engine on IBM Cloud. It specifies the path to the directory that contains the Java Cryptography Extension policy files (US_export_policy.jar and local_policy.jar). The policy files enable the Java operators to use encryption with key sizes beyond the limits specified by the JDK. See the section on Policy file configuration in the main page of this toolkit's documentation for information on how to configure the policy files. If this parameter is omitted the JVM default policy files will be used. When specified, this parameter takes precedence over the JVM default policy files.
Note: This parameter changes a JVM property. If you set this property, be sure that it is set to the same value in all HDFS operators that are in the same PE. The location of the policy file directory can be an absolute path on the file system or a path that is relative to the application directory.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- reconnectionBound
-
This optional parameter specifies the number of successive connection attempts that occur when a connection fails or a disconnect occurs. It is used only when the reconnectionPolicy parameter is set to BoundedRetry; otherwise, it is ignored. The default value is 5.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- reconnectionInterval
-
This optional parameter specifies the amount of time (in seconds) that the operator waits between successive connection attempts. It is used only when the reconnectionPolicy parameter is set to BoundedRetry or InfiniteRetry; otherwise, it is ignored. The default value is 10.
- Properties
-
- Type: float64
- Cardinality: 1
- Optional: true
- reconnectionPolicy
-
This optional parameter specifies the policy that is used by the operator to handle HDFS connection failures. The valid values are NoRetry, InfiniteRetry, and BoundedRetry. The default value is BoundedRetry.
- If NoRetry is specified and an HDFS connection failure occurs, the operator does not try to connect to HDFS again. The operator shuts down at startup time if the initial connection attempt fails.
- If BoundedRetry is specified and an HDFS connection failure occurs, the operator tries to connect to HDFS again up to a maximum number of times. The maximum number of connection attempts is specified by the reconnectionBound parameter. The sequence of connection attempts occurs at startup time. If a connection does not exist, the sequence of connection attempts also occurs before each operator run.
- If InfiniteRetry is specified, the operator continues to try to connect indefinitely until a connection is made. This behavior blocks all other operator operations while a connection is not successful. For example, if an incorrect connection password is specified in the connection configuration document, the operator remains in an infinite startup loop until a shutdown is requested.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
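For example, to retry a failed connection up to 10 times with 30 seconds between attempts (values chosen for illustration):

    param
        reconnectionPolicy   : "BoundedRetry";
        reconnectionBound    : 10;
        reconnectionInterval : 30.0;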
- Operator class library