Checkpointing and Cleanup

The ITE application implements a checkpointing mechanism to allow recovery after failures.

Checkpointing

Some components of the ITE application are stateful: they hold in-memory data that they need to provide their functionality. These components are the filename deduplication, the record deduplication, and potentially the custom correlation functions that the user implements in the CustomContext component.

The data held in memory includes:
  • The list of already processed filenames, for the filename deduplication
  • The data used by the Bloom filter for record deduplication
  • Data structures used in the custom context to implement correlation functions, for example tables or lists for aggregations

Without additional protection, this data would be lost after a host or application failure. The application could be restarted to continue processing files, but the results would be incorrect. For example, the record deduplication could no longer detect duplicates of records that were processed before the failure. To solve this problem, the ITE application writes checkpoint files after each processed input file. After a failure, the ITE application automatically recovers its internal state from the checkpoint files when it is restarted. The recovery process may take some time. When you initiate a graceful shutdown of the ITE application with the provided command line tool, the application applies optimizations that reduce the time needed to restore the state after a restart.
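The checkpoint-per-file behavior described above can be illustrated with a minimal Python sketch. It is not the ITE implementation: the JSON file format, the function names (write_checkpoint, read_checkpoint, process_files) and the atomic write via a temporary file are assumptions chosen for illustration only.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Persist the state as JSON; the rename makes the update atomic."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # A crash before this line leaves the previous checkpoint untouched.
    os.replace(tmp_path, path)


def read_checkpoint(path):
    """Return the saved state, or an empty state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"processed_files": []}
    with open(path) as f:
        return json.load(f)


def process_files(filenames, checkpoint_path):
    """Process each new file once and checkpoint after every file."""
    seen = set(read_checkpoint(checkpoint_path)["processed_files"])
    handled = []
    for name in filenames:
        if name in seen:
            continue  # already processed before a restart or failure
        handled.append(name)  # placeholder for the real file processing
        seen.add(name)
        write_checkpoint(checkpoint_path, {"processed_files": sorted(seen)})
    return handled
```

Calling process_files again with the same checkpoint path after a simulated restart skips every file that was already recorded, which is the property the checkpoint files are meant to provide.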

Cleanup

Keeping the state of the deduplication components in memory raises a second concern. The filename deduplication holds a list of already processed filenames. This list can grow very large over time, so old entries must be removed periodically. A similar effect occurs in the record deduplication: if old entries are never removed from the Bloom filter, its error rate increases up to the point where the filter becomes useless.

To address these issues, the ITE application periodically runs a cleanup process that removes old data from the deduplication components. The cleanup process is invoked at a configurable time and interval. By default, it runs every day at midnight. You can also configure how long entries in the deduplication components are retained, for example to keep the filename history for 10 days.
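As a rough illustration of retention-based cleanup, the following Python sketch removes filename entries that are older than a retention period. The function name, the dictionary layout (filename mapped to processing time) and the parameter retention_days are hypothetical and do not correspond to actual ITE configuration parameters.

```python
from datetime import datetime, timedelta


def cleanup_filename_history(history, now, retention_days=10):
    """Return a copy of the history without entries older than the retention period."""
    cutoff = now - timedelta(days=retention_days)
    return {name: ts for name, ts in history.items() if ts >= cutoff}


# Example: a cleanup run at midnight on 2024-01-20 with 10 days of retention.
history = {
    "old.csv": datetime(2024, 1, 5),
    "new.csv": datetime(2024, 1, 15),
}
history = cleanup_filename_history(history, now=datetime(2024, 1, 20))
```

After this call, only the entry for new.csv remains, so a file named old.csv that arrives again would no longer be recognized as a duplicate.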

If you want to optimize the cleanup process, you can configure partitions in the Bloom filter. In this case, you must customize your transformer logic so that it detects the partition identifier.
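One way to see why partitions can make cleanup cheaper is the following Python sketch: entries cannot be removed individually from a plain Bloom filter, but a whole partition can be dropped at once. The sketch assumes that the partition identifier is derived from the record itself (for example its date), so that duplicates always fall into the same partition. The class names, filter sizes and hashing scheme are illustrative assumptions, not the ITE implementation.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over string keys."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class PartitionedBloomFilter:
    """One Bloom filter per partition identifier (hypothetical design)."""

    def __init__(self):
        self.partitions = {}

    def is_duplicate(self, partition_id, record_key):
        """Report whether the key was probably seen, and record it if not."""
        bloom = self.partitions.setdefault(partition_id, BloomFilter())
        if record_key in bloom:
            return True
        bloom.add(record_key)
        return False

    def cleanup(self, oldest_partition_to_keep):
        """Drop every partition that sorts before the given identifier."""
        for pid in [p for p in self.partitions if p < oldest_partition_to_keep]:
            del self.partitions[pid]
```

With date strings such as "2024-01-01" as partition identifiers, cleanup("2024-01-02") discards the complete filter for the older day without touching the newer partitions.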

Checkpointing for custom correlation functions

If you implement stateful functionality in the Group component, you can decide whether your function participates in the checkpointing and recovery mechanism. Your composite operator receives commands from the ITE application control that tell it when to read, write, or clear its internal state. If your use case does not require checkpointing and recovery in this component, you can simply ignore these commands.
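The command handling described above can be sketched in Python as follows. This is a conceptual illustration only: a real implementation would be a composite operator in the application's own language, and the command names ("read", "write", "clear"), the class AggregationState and the JSON file format are assumptions made for this example.

```python
import json
import os


class AggregationState:
    """Hypothetical stateful correlation function that counts records per key."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.counts = {}

    def add(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def handle_command(self, command):
        if command == "write":
            # Persist the in-memory state to the checkpoint file.
            with open(self.checkpoint_path, "w") as f:
                json.dump(self.counts, f)
        elif command == "read":
            # Restore the in-memory state if a checkpoint file exists.
            if os.path.exists(self.checkpoint_path):
                with open(self.checkpoint_path) as f:
                    self.counts = json.load(f)
        elif command == "clear":
            # Discard the in-memory state.
            self.counts = {}
        # Any other command is ignored.
```

A function that does not need checkpointing would correspond to a handle_command method that does nothing.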