Context checkpoint files are missing after housekeeping with the partitioned Bloom Filter

In general, the partitioned Bloom Filter operator automatically evicts outdated data from its partitions. The scheduled housekeeping for the checkpoint files isn’t synchronized with this internal eviction process. You must decide how to configure the housekeeping scheduler according to your business requirements, and you must define suitable values for the ite.ingest.deduplication.timeToKeep and ite.businessLogic.group.deduplication.timeToKeep parameters. If you enable the ite.businessLogic.group.deduplication.checkpointing and ite.ingest.deduplication parameters and the clean-up process starts, then the framework removes outdated checkpoint files.
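
A minimal configuration sketch with the parameters mentioned above; the retention values and the on/off switch syntax are illustrative assumptions only, so verify the documented units and value formats for your framework version:

    # enable file ingest deduplication and context checkpointing (illustrative syntax)
    ite.ingest.deduplication=on
    ite.businessLogic.group.deduplication.checkpointing=on
    # hypothetical retention values; check the documented unit for your version
    ite.ingest.deduplication.timeToKeep=24
    ite.businessLogic.group.deduplication.timeToKeep=24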

If you define too short timeToKeep values, or an unsuitable clean-up schedule through the ite.cleanup.schedule.minute|hour|dayOfMonth|dayOfWeek parameters, then the framework doesn’t recover the data of your Bloom Filter operator. This applies whether you configure the data deduplication without partitioning or with the partitioned Bloom Filter operator.
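
For example, a clean-up run every 6 hours could be scheduled as in the following sketch, assuming the four parameters accept crontab-style field values (an assumption; verify the supported syntax):

    # hypothetical crontab-style schedule: at minute 0 of hours 0, 6, 12, and 18
    ite.cleanup.schedule.minute=0
    ite.cleanup.schedule.hour=0,6,12,18
    ite.cleanup.schedule.dayOfMonth=*
    ite.cleanup.schedule.dayOfWeek=*

Choose timeToKeep values that are longer than the interval between two clean-up runs; otherwise, the housekeeping removes checkpoint files that would still be needed for recovery.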

If you enable the deduplication partitioning in your application, then you might observe a different behavior after the housekeeping process. The following scenario illustrates it.

Scenario

  • The clean-up scheduler is configured, the timeToKeep parameters are set, and the ite.businessLogic.group.deduplication.checkpointing parameter is enabled.
  • You start the ITE application and process a data file.
  • The clean-up scheduler triggers the housekeeping process, which removes all checkpoint files generated by the file ingestion as well as by the context deduplication.
  • You reprocess the same files with the ITE application.

Results

  • The framework doesn’t re-create all checkpoint files for the record deduplication.

Reason

  • The input files aren’t detected as duplicates because the ite.ingest.deduplication.timeToKeep setting allowed the clean-up process to remove the old checkpoint files.
  • The Bloom Filter operator still keeps the record hashcodes in memory. The clean-up process doesn’t affect the partitioned Bloom Filter operator, so the partitions are still present in memory.
  • If the configured partitions span a long enough time range that the Bloom Filter operator doesn’t evict any partition, then the operator keeps all record data. For example, this happens if a partition covers a whole day while the timeToKeep parameters define shorter housekeeping intervals, such as every 6 hours (see the sketch after this list).
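
The following Python sketch models this timing mismatch. It isn’t the operator’s implementation; it only illustrates that a partitioned store evicts whole partitions, so a day-sized partition survives a 6-hour retention limit:

    class PartitionedDedupModel:
        """Simplified model of partition-based deduplication: hashcodes live
        in time-based partitions, and only whole partitions are evicted."""

        def __init__(self, partition_seconds: int):
            self.partition_seconds = partition_seconds
            self.partitions = {}  # partition index -> set of hashcodes

        def add(self, hashcode: str, ts: int) -> bool:
            """Return True if the hashcode is new (a unique record)."""
            part = self.partitions.setdefault(ts // self.partition_seconds, set())
            if hashcode in part:
                return False  # duplicate record
            part.add(hashcode)
            return True

        def evict(self, now: int, time_to_keep: int) -> None:
            """Drop only the partitions that ended before now - time_to_keep."""
            for index in list(self.partitions):
                if (index + 1) * self.partition_seconds <= now - time_to_keep:
                    del self.partitions[index]

    DAY, SIX_HOURS = 86_400, 21_600
    model = PartitionedDedupModel(partition_seconds=DAY)

    noon = 12 * 3_600                              # noon of day 0
    model.add("hash-1", noon)                      # unique on first processing
    model.evict(now=noon, time_to_keep=SIX_HOURS)  # 6-hour retention limit
    print(model.add("hash-1", noon))               # False: day partition not evicted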

You can configure the presented scenario, and it works as designed. To find the hints why the context checkpoint files aren’t created, investigate the analysis of the rejected files and the statistic files. The DedupCore composite writes only the unique hashcodes, together with the corresponding partition identifier, to the checkpoint file. If the ITE application processes the same input file after the clean-up process without detecting it as a file duplicate, then the Bloom Filter operator detects every record as a duplicate, and no unique hashcodes remain to be checkpointed.
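
A minimal Python sketch of this write decision, assuming a plain set in place of the in-memory Bloom Filter state; the entry format is invented for illustration and isn’t the real checkpoint file layout:

    def checkpoint_entries(records, seen_hashcodes, partition_id):
        """Collect checkpoint entries for the unique records only."""
        entries = []
        for hashcode in records:
            if hashcode in seen_hashcodes:
                continue  # duplicate record: rejected, nothing is checkpointed
            seen_hashcodes.add(hashcode)
            entries.append((hashcode, partition_id))
        return entries

    state = set()
    first_run = checkpoint_entries(["a1", "b2", "c3"], state, partition_id=0)
    second_run = checkpoint_entries(["a1", "b2", "c3"], state, partition_id=0)
    print(len(first_run), len(second_run))  # 3 0 -> no checkpoint file on the rerun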

Statistic hints

  • The rejection output file reports the reason ID 2 for all duplicate records.
  • In the statistic output file, add up the rejectedInvalids, recordDuplicates, and outdatedRecords statistic values. If the sum is equal to the value of the sentRecords statistic attribute, then the application doesn’t write any checkpoint file for the input file (the sketch below shows this check).
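
As a quick plausibility check, the comparison can be scripted as follows; the attribute names come from the statistic output file, while the sample numbers are invented:

    # sample statistic values for one input file (invented numbers)
    stats = {
        "sentRecords": 100,
        "rejectedInvalids": 2,
        "recordDuplicates": 95,
        "outdatedRecords": 3,
    }

    rejected_total = (stats["rejectedInvalids"]
                      + stats["recordDuplicates"]
                      + stats["outdatedRecords"])
    if rejected_total == stats["sentRecords"]:
        print("no checkpoint file was written for this input file")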