Supporting multiple Teracloud® Streams versions with a single toolkit

Toolkits can support multiple versions of Teracloud® Streams with a minimum number of source files, if you follow the guidelines provided.

As Teracloud® Streams evolves, new functionality is added that might require changes to the implementation of operators that your toolkit provides in order to take advantage of that new capability. The question then becomes, How do I write my operator code so that it can support both the old and new versions of the Streams product? The answer to this question depends on the versions of Teracloud® Streams you wish to support and the capabilities of Teracloud® Streams you wish to exploit. This discussion will focus primarily on the changes between Version 4.0 and previous versions because the changes in Version 4.0 were the most significant.

There are two approaches you can take to multiple version support: the lowest common denominator approach and the exploitation of new features approach.

The lowest common denominator is the easiest approach to use, where you decide on a base version of Teracloud® Streams that you will support. You limit the features that your operators use to those supported by the base product level.
Under the exploitation of new features approach, you want your operators to use product capabilities that are available based on the Teracloud® Streams version under which you are building your application. More work and more care are required in order to maintain a minimal source version of your toolkit.
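One way to keep a single source tree under the exploitation of new features approach is to gate newer code paths at compile time. The following is a minimal sketch, not a mechanism provided by the product: the macro `MYTK_STREAMS_MAJOR`, the constant `kHasVersion4Apis`, and the helper `configurationRoot` are all hypothetical names, and the sketch assumes that your toolkit's own build passes the major version of the target Streams installation as `-DMYTK_STREAMS_MAJOR=<n>`.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro: assumed to be supplied by the toolkit's own build
// (for example -DMYTK_STREAMS_MAJOR=4). Defaults to a pre-4.0 base level.
#ifndef MYTK_STREAMS_MAJOR
#define MYTK_STREAMS_MAJOR 3
#endif

// True only when building against Version 4.0 or later.
#if MYTK_STREAMS_MAJOR >= 4
constexpr bool kHasVersion4Apis = true;
#else
constexpr bool kHasVersion4Apis = false;
#endif

// Hypothetical helper: choose the root for configuration files.
// The toolkit directory is only usable as a root from Version 4.0 onward,
// so older builds fall back to the application directory.
inline std::string configurationRoot(const std::string& applicationDir,
                                     const std::string& toolkitDir) {
    assert(!applicationDir.empty() && "application directory must be known");
    return kHasVersion4Apis ? toolkitDir : applicationDir;
}
```

With this arrangement the same source builds for each toolkit instance, and only the build setting differs between instances.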

This discussion assumes that you maintain the source code for your toolkit in some sort of source control system and that you are able to extract the source and build a unique instance of your toolkit. While your source might be common, you should build an instance of your toolkit for each version of Teracloud® Streams that you wish to support. This is generally required as there might be changes in the generated toolkit artifacts that the Teracloud® Streams compiler consumes. For example, the toolkit.xml file, which is generated when you index your toolkit using spl-make-toolkit, is an internal interface and can be expected to change from version to version. While a newer version of the Streams compiler might consume a toolkit.xml file generated by a previous version of Teracloud® Streams, the resulting application will likely not function correctly at run time.

Some toolkit artifacts that you provide as part of your toolkit have been extended to enable new features in the product. Taking advantage of those changes limits backwards compatibility with previous Streams versions. For example, the info.xml model has been extended to allow the specification of the sabFiles element. If that element is used, attempting to index the toolkit with a previous version of Teracloud® Streams will fail. In order to support previous versions of Teracloud® Streams, you must either avoid using the newer constructs or provide separate versions of the affected files for each instance of the toolkit you wish to produce.

The two most significant changes that Version 4.0 of Teracloud® Streams has introduced are application bundles and the treatment of the data directory. While these might require changes to your toolkit implementation, these changes, if made correctly, should be backwards compatible with previous versions of the Streams product.

The introduction of application bundles affects the way toolkits must be structured. This is true if your toolkit requires that any toolkit artifact be accessible at run time. For example, if any operator your toolkit provides requires a shared object or jar file to support its operation, and that shared object or jar file is part of your toolkit, then there are expectations as to where those entities exist within the toolkit directory structure. In order to support both current and previous versions of Teracloud® Streams, you must ensure that those toolkit artifacts are in the expected locations. The expectations in Version 4.0 are stricter than in previous versions, so that satisfying the expectations for Version 4.0 should also satisfy previous versions of Teracloud® Streams.

With the addition of support for non-shared file systems in Streams Version 4.0, it no longer makes sense to assume that there is always a data directory that exists at some constant location across all nodes on which a Streams application might be running. When running under versions of Teracloud® Streams prior to Version 4.0, operators can assume that, at run time, the current directory is the data directory. Starting with Version 4.0, operators that need access to files that are specified using a path relative to the data directory can no longer assume that the current directory is the data directory. Operators that make this assumption must be modified to build a fully qualified path to the data directory using an API to get the actual, run time, data directory location and use the result to compose a fully qualified path to the file they wish to access. This API has existed since Version 2 of Teracloud® Streams and so is backwards compatible.
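The path composition described above can be sketched as follows. This is an illustration under stated assumptions rather than product code: `resolveAgainstRoot` is a hypothetical helper, the `root` argument stands in for the value an operator would obtain from `ProcessingElement::getDataDirectory()`, and paths are assumed to be POSIX style.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper. In an operator, `root` would be the value returned
// by ProcessingElement::getDataDirectory(); here it is a plain parameter
// so that the sketch compiles without the Streams headers.
inline std::string resolveAgainstRoot(const std::string& root,
                                      const std::string& file) {
    assert(!root.empty() && "root directory must not be empty");
    // Absolute paths (POSIX style assumed) are returned unchanged.
    if (!file.empty() && file[0] == '/') {
        return file;
    }
    // Avoid a doubled separator when the root already ends in '/'.
    if (root.back() == '/') {
        return root + file;
    }
    return root + "/" + file;
}
```

For example, `resolveAgainstRoot("/home/streams/data", "in/input.csv")` yields `/home/streams/data/in/input.csv`, while an absolute argument such as `/tmp/x.csv` is returned as is, which matches the rule that operators already using absolute paths need no change.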

Streams Version 4.0 introduced another, somewhat more subtle, difference in the use of the data directory in the context of application bundles. Streams Version 4.0 or later differentiates between data files and configuration files. Data files are those files that are considered unique to an instance of a running Streams application, while configuration files are common to all running instances of a given Streams application. Data files are expected to be found in the data directory and accessed using APIs that provide access to the unique data directory for the specific instance of the application. Configuration files, on the other hand, are expected to be packaged in the application bundle and accessed using APIs that provide access to the file in the location where the application bundle lives at run time, which is a common location to all instances of a given Streams application. If you always use absolute paths to files, then there is no change to the behavior or expectations between Version 4.0 and previous versions of Teracloud® Streams. The documentation you provide with your toolkit should specify, for each file that it consumes, where that file is expected to be found by the operator and how relative paths to each of those files will be interpreted.
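The distinction between data files and configuration files can be sketched as a choice of root directory. The names `FilePurpose`, `RuntimeDirectories`, and `resolveByPurpose` are hypothetical; the two directory fields stand in for the values an operator would obtain from `ProcessingElement::getDataDirectory()` and `ProcessingElement::getApplicationDirectory()`, and POSIX style paths are assumed.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of the files an operator consumes.
enum class FilePurpose { Data, Configuration };

// Hypothetical holder for directories obtained from the runtime.
struct RuntimeDirectories {
    std::string dataDir;         // stand-in for getDataDirectory()
    std::string applicationDir;  // stand-in for getApplicationDirectory()
};

// Resolve `file` against the root that matches its purpose.
inline std::string resolveByPurpose(const RuntimeDirectories& dirs,
                                    FilePurpose purpose,
                                    const std::string& file) {
    // Absolute paths behave the same before and after Version 4.0.
    if (!file.empty() && file[0] == '/') {
        return file;
    }
    const std::string& root = (purpose == FilePurpose::Data)
                                  ? dirs.dataDir
                                  : dirs.applicationDir;
    assert(!root.empty() && "the selected root directory must be known");
    return root + "/" + file;
}
```

Keeping the purpose explicit in the call makes it straightforward to document, per file, which root a relative path is interpreted against.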

While there are multiple languages in which operators might be implemented, the mechanisms they use to support the differences between Version 4.0 and previous versions are common in concept and are described in the following table. Language differences are further detailed in the sections listed.

Table 1. File treatment
The table describes the differences in how files are treated in Teracloud® Streams, including the recommended root for relative files and the relevant APIs for each file.
File Purpose	Recommended root for relative files	Relevant API(s)	Notes
Data files	data directory	`ProcessingElement::getDataDirectory()`	Relative paths should be converted to absolute using the value from the `getDataDirectory` API. Operators that already build an absolute path to a file need no change. Version 4.0 introduced a `hasDataDirectory` API, but its use is not backwards compatible.
Configuration files	either the application directory or the toolkit directory	`ProcessingElement::getApplicationDirectory()` `ProcessingElement::getToolkitDirectory()`	Operators might have used the data directory as a relative root for configuration files, but this is no longer recommended. They should be modified to use the recommended root. Relative files should be specified in the SPL code as relative to the root of the containing directory. For example, if the file is expected to be found in the `etc` sub-directory, then it should be specified as `etc/theFile`. Operators might have used the application directory as a relative root, but might have used a sub-directory that is not included in the application bundle by default. In that case, either move the file to one of the sub-directories that is included in the application bundle, for example `etc`, or use the `sabFiles` element of the info.xml file to force the configuration file into the bundle. The latter mechanism is not backwards compatible and would require separate info.xml files for Version 4.0 or later toolkits and for pre-Version 4.0 toolkits. The `getToolkitDirectory` API was introduced in Version 4.0, so its use is not backwards compatible.
Compile time artifacts intended for consumption at run time	the output directory	`ProcessingElement::getOutputDirectory()`	The `usr/impl` sub-directory of the output directory is included in the application bundle by default. Such artifacts should be generated into a unique sub-directory of `usr/impl` so as not to collide with artifacts generated by another operator. This is a rarely used feature.
Working or temporary files	current working directory or some temporary location created using system/language specific services	N/A	There is always a current working directory. Prior to Version 4.0 this is the data directory. As of Version 4.0 this is a temporary directory that is created by the Streams platform. Its location is non-deterministic and it is removed when a PE is restarted or the job is canceled, so there is no access to this directory from outside of Streams. If your operator needs to generate files that must be accessed later, either for its operation or for dumping debug information, it is recommended that the data directory, or some absolute location, be used.
Binary files	either the application directory or toolkit directory	N/A	Binary files such as Java™ jar files or C/C++ shared objects should be located in an application or toolkit sub-directory that is included in the application bundle by default. Use either the operator model's dependency mechanism to establish the path to the binary, or a language specific mechanism such as the Java™ resource mechanism.
Toolkit files	the toolkit's root directory	`ProcessingElement::getToolkitDirectory()`	These files are essentially the same as configuration or binary files, but are in the same toolkit as the SPL source containing the operator invocation (in SPL code) that is referencing the file. Prior to Version 4.0 these files were referenced using the SPL compiler intrinsic function `getThisFileDir()` and using that as the root of a path to the desired file. While this mechanism will still work, toolkits intended for use with Version 4.0 and above should specify the root directory relative to which such files are to be found, and then use the `getToolkitDirectory()` API to build an absolute path to the file.
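For the compile time artifacts row of Table 1, one way to arrive at a unique sub-directory of `usr/impl` is to derive its name from the operator's name. This is only a sketch of one possible naming scheme: `artifactDirectory` is a hypothetical helper, the `outputDir` argument stands in for the value from `ProcessingElement::getOutputDirectory()`, and the choice to replace non-alphanumeric characters with underscores is an assumption, not a product convention.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build a per-operator directory under usr/impl.
// `outputDir` stands in for the value of getOutputDirectory().
inline std::string artifactDirectory(const std::string& outputDir,
                                     const std::string& operatorName) {
    assert(!outputDir.empty() && !operatorName.empty());
    std::string sub;
    for (char c : operatorName) {
        // Keep ASCII letters and digits; map everything else to '_'.
        const bool keep = (c >= 'a' && c <= 'z') ||
                          (c >= 'A' && c <= 'Z') ||
                          (c >= '0' && c <= '9');
        sub += keep ? c : '_';
    }
    return outputDir + "/usr/impl/" + sub;
}
```

Under this scheme an operator named `com.example::MyOp` would write to `usr/impl/com_example__MyOp`. Two names that differ only in punctuation would map to the same directory, so a real toolkit might prefer a stricter encoding.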