Default values for operator parameters

Every operator parameter must either have a default value, which makes it optional at the invocation site, or be required. Choose default values that are safe, that do not adversely affect performance, and that follow the principle of least surprise.

The primary concern for defaults is that they are safe. In this context, safety means that the operator performs all appropriate error checking and reports all errors. For example, the FileSource operator in the SPL Standard Toolkit has a parsing parameter. The default mode is strict, which performs all error checking on input files and generates a runtime exception if it encounters an error. The strict mode is the safest because any error in the input terminates the application, so the error cannot go unnoticed. Developers must explicitly request the less safe options that either try to recover from the error (permissive mode) or avoid error checking completely (fast mode). Typically, operators follow this pattern: default values are safe, and unsafe options must be explicitly requested.
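The safe-by-default pattern can be sketched outside of SPL. The following Python example is only an illustration under stated assumptions: the function parse_ints and the Parsing enumeration are hypothetical names, not part of the SPL Standard Toolkit, and the fast mode is left out for brevity. It shows a parser whose default mode stops on the first bad record, while the more lenient mode must be requested explicitly.

```python
from enum import Enum


class Parsing(Enum):
    """Hypothetical parsing modes, loosely modeled on strict and permissive."""
    STRICT = "strict"
    PERMISSIVE = "permissive"


def parse_ints(lines, parsing=Parsing.STRICT):
    """Parse one integer per line. The safe mode (STRICT) is the default."""
    values = []
    for number, line in enumerate(lines, start=1):
        try:
            values.append(int(line))
        except ValueError:
            if parsing is Parsing.STRICT:
                # Safe default: report the error and stop processing.
                raise ValueError(
                    f"line {number}: not an integer: {line!r}"
                ) from None
            # PERMISSIVE was explicitly requested: skip the bad line.
    return values
```

A caller who writes parse_ints(lines) gets full error checking. Skipping malformed lines requires the explicit argument parsing=Parsing.PERMISSIVE, so the less safe behavior is always visible at the call site.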

The exception to this rule is when the safest option has unacceptable performance. For example, by default the FileSink operator in the SPL Standard Toolkit does not explicitly flush its output. Instead, it relies on buffered I/O as provided by C++. Flushing each tuple to disk as the operator receives it is the safest behavior, because it avoids losing data if the operator crashes. However, the FileSink operator would then run at the speed of the disk, which is unacceptable performance. Developers who want FileSink to flush tuples explicitly must request it with the flush parameter. Do not use the safest default value if it results in unacceptable performance. However, when in doubt, err on the side of caution.
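The trade-off between safety and performance can be illustrated with a small Python sketch. The class names TupleSink and CountingStream are hypothetical, and the convention that flush=0 means "never flush explicitly" is an assumption made for this example rather than a description of how FileSink is implemented.

```python
import io


class CountingStream(io.StringIO):
    """In-memory stream that counts explicit flush calls (illustration only)."""

    def __init__(self):
        super().__init__()
        self.flushes = 0

    def flush(self):
        self.flushes += 1
        super().flush()


class TupleSink:
    """Hypothetical sink that relies on buffering unless flush is requested."""

    def __init__(self, stream, flush=0):
        self._stream = stream
        self._flush = flush      # 0 (the default) means no explicit flushing
        self._pending = 0

    def write(self, tup):
        self._stream.write(",".join(str(field) for field in tup) + "\n")
        self._pending += 1
        # Flush only when the caller asked for it and enough tuples are pending.
        if self._flush and self._pending >= self._flush:
            self._stream.flush()
            self._pending = 0
```

With the default, TupleSink(stream) never calls flush itself and leaves the timing to the underlying buffer. TupleSink(stream, flush=1) flushes after every tuple, which is the safest setting and also the slowest on a real disk.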

When safety and performance concerns do not affect the feature, follow the principle of least surprise for default parameters. For example, the filter parameter in the Functor operator from the SPL Standard Toolkit is optional; it defaults to the value true. This value is the least surprising because many invocations of the Functor operator do not need a filter expression. If the default were false, then for each of those invocations, the programmer would have to explicitly set the parameter to true so that no tuples are filtered out.

Another example is the order parameter for the Sort operator in the SPL Standard Toolkit. The default value is ascending, which sorts items in increasing order. When people want to sort a sequence, they typically want it sorted from least to greatest. Defaulting to sorting in descending order (from greatest to least) is more likely to surprise users. When you choose among possible default values, choose the value that results in the least surprising behavior.

Related to the principle of least surprise is choosing common formats. If the parameter specifies a particular format, and one format is more common than the others, make that format the default. An example of this practice is the format parameter on both the FileSource and FileSink operators in the SPL Standard Toolkit. Their default file format is comma-separated values (csv), which is among the most common ways of representing structured data in text form. In this instance, it would also make no sense for the FileSource and FileSink operators to use different default values. If an operator is in a family of operators, or if it has a logical counterpart (such as sources and sinks), make the defaults the same.
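One way to keep the defaults of logical counterparts the same is to define the default once and share it. The following Python sketch assumes hypothetical functions write_tuples and read_tuples and a hypothetical second format named tsv; none of these names come from the SPL Standard Toolkit.

```python
import csv
import io

# One shared constant, so the reader and writer cannot drift apart.
DEFAULT_FORMAT = "csv"

# Hypothetical format table; "tsv" exists only to give the example a choice.
_DELIMITERS = {"csv": ",", "tsv": "\t"}


def write_tuples(tuples, fmt=DEFAULT_FORMAT):
    """Serialize tuples to text in the requested format."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=_DELIMITERS[fmt], lineterminator="\n")
    writer.writerows(tuples)
    return buffer.getvalue()


def read_tuples(text, fmt=DEFAULT_FORMAT):
    """Parse text produced by write_tuples back into tuples of strings."""
    reader = csv.reader(io.StringIO(text), delimiter=_DELIMITERS[fmt])
    return [tuple(row) for row in reader]
```

Because both functions take their default from the same constant, read_tuples(write_tuples(data)) works without any format argument, which is the behavior a user would expect from a matching source and sink.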

Finally, if none of these criteria applies and all of the available options are equally valid, then do not give the parameter a default value. Without a default value, users of the operator must choose a value for the parameter at its invocation site. An example is the position parameter in the Punctor operator from the SPL Standard Toolkit. The options are to generate punctuation before or after each incoming tuple. Neither option is safer, faster, more common, or less surprising than the other. Hence, users must specify which behavior they want.
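A parameter without a default can be modeled in Python as a required keyword-only argument. In this sketch, the function punctuate, the Position enumeration, and the PUNCT marker are hypothetical names chosen for illustration, and the function inserts a marker for every tuple as a simplification.

```python
from enum import Enum


class Position(Enum):
    """Hypothetical counterpart of the before and after options."""
    BEFORE = "before"
    AFTER = "after"


PUNCT = "<punct>"  # stand-in marker for a punctuation, for illustration only


def punctuate(tuples, *, position):
    """Insert a marker before or after each tuple. position has no default."""
    if not isinstance(position, Position):
        raise TypeError("position must be Position.BEFORE or Position.AFTER")
    out = []
    for tup in tuples:
        if position is Position.BEFORE:
            out.append(PUNCT)
        out.append(tup)
        if position is Position.AFTER:
            out.append(PUNCT)
    return out
```

Calling punctuate(items) without the position argument raises a TypeError, so the choice cannot be made silently on the caller's behalf.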