Preparing enhanced housekeeping for deduplication

You can enhance the housekeeping of data deduplication in the application framework. The BloomFilter operator provides internal clean-up processing that requires the definition of partitions. The framework provides these configuration parameters:
  • ite.businessLogic.group.deduplication.partitioning (on/off)
  • ite.businessLogic.group.deduplication.partitioning.count (integer)
  • ite.businessLogic.group.deduplication.partitioning.searchAllPartitions (on/off)
The configuration procedure is described in the chapter Grouping with custom correlation and tuple deduplication.
The last two configuration parameters are mapped to parameters of the BloomFilter operator:
  • partitionCount
  • searchAllPartitions
The application framework defines the partitionId stream attribute, which the BloomFilter operator uses to identify the partition. You must define the type and the value of this attribute according to the requirements of your business process.
The following scenario describes an example that you could follow:
  • The parsed records provide the timestamp information in the CdrCreationDate attribute as an rstring in the format YYYY_MM_DD_hh_mm_ss.
  • The business case requires that the application processes 10,000,000 records per day.
  • The probability configured for duplicate detection is 0.001.
  • The application must detect duplicates from the last 3 days, even when records contain errors.

Read the BloomFilter description for more details.
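
To get a feeling for the memory impact of the scenario above, the following Python sketch estimates the size of a Bloom filter per partition and for all partitions. It is only an approximation under stated assumptions: it uses the textbook sizing formula m = -n * ln(p) / (ln 2)^2, it assumes that each partition holds one day of records (10,000,000 entries), and it treats 0.001 as the false-positive probability. The function names are hypothetical, and the memory that the BloomFilter operator actually allocates may differ.

```python
import math


def bloom_filter_bits(expected_entries: int, probability: float) -> int:
    """Textbook optimal bit count for a Bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-expected_entries * math.log(probability) / (math.log(2) ** 2))


def estimate_total_bytes(expected_entries: int, probability: float, partition_count: int) -> int:
    """Estimated bytes for all partitions; every partition is sized identically."""
    bytes_per_partition = math.ceil(bloom_filter_bits(expected_entries, probability) / 8)
    return bytes_per_partition * partition_count


if __name__ == "__main__":
    # Assumption: one partition per day, 10,000,000 records per day, probability 0.001.
    per_partition = estimate_total_bytes(10_000_000, 0.001, 1)
    all_partitions = estimate_total_bytes(10_000_000, 0.001, 3)
    print(f"per partition: {per_partition / 1e6:.1f} MB")
    print(f"3 partitions:  {all_partitions / 1e6:.1f} MB")
```

Under these assumptions, the estimate is roughly 18 MB per partition and roughly 54 MB for three partitions, which illustrates how the partition count multiplies the memory consumption.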

Procedure

  1. Open the <PathToYourApplication>/config/config.cfg file and define the values of the required parameters. Remember that the value of the ite.businessLogic.group.deduplication.probability parameter and the expected number of bloom entries, which you can find by default in the <PathToYourApplication>/config/groups.cfg file, are constant and apply to each partition. Keep in mind the memory that the BloomFilter operator allocates: the ite.businessLogic.group.deduplication.partitioning.count parameter multiplies the calculated memory consumption per partition. Finally, save the file.
    • Example:
      • enable data deduplication
      ite.businessLogic.group.deduplication=on
      • enable checkpointing for restarts and data recovery after error
      ite.businessLogic.group.deduplication.checkpointing=on
      • enable bloom filter partitioning
      ite.businessLogic.group.deduplication.partitioning=on
      • set 3 partitions for 3 days
      ite.businessLogic.group.deduplication.partitioning.count=3
      • enable search for all 3 days
      ite.businessLogic.group.deduplication.partitioning.searchAllPartitions=on
  2. Open the <namespace>.streams.custom::TypesCustom composite operator and modify the value of the PartitionIdType composite attribute that represents the type of the partitionId stream attribute. Save the file.
    • Example: static PartitionIdType = uint32;
  3. Customize the value of the partitionId attribute. The recommended location for this customization is the <namespace>.chainprocessor.transfomer.custom::DataProcessor composite operator, next to the calculation of the hashcode attribute. How you determine the partition identifier depends on your business logic. In the example scenario, you have to convert the date format to the number of days since 1970. For this conversion, you can use the functions that the framework provides in the <namespace>.functions namespace. Save the file when you are done.
    • Example: IN.partitionId=daysSince1970(convertFromFileDateToTimestamp(CdrCreationDate,"YYYY_MM_DD_hh_mm_ss"));
  4. Re-compile your application.
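
The date conversion in step 3 can be illustrated outside of the framework. The following Python sketch is a stand-in for the idea only, not the framework's SPL functions: it parses a timestamp in the YYYY_MM_DD_hh_mm_ss format and returns the number of whole days since 1970-01-01. The function name days_since_1970 is hypothetical, and the sketch assumes that the timestamps are in UTC; the framework functions may handle time zones differently.

```python
from datetime import datetime, timezone

# strptime equivalent of the YYYY_MM_DD_hh_mm_ss file date format
FILE_DATE_FORMAT = "%Y_%m_%d_%H_%M_%S"
SECONDS_PER_DAY = 86_400


def days_since_1970(cdr_creation_date: str) -> int:
    """Return the number of whole days between 1970-01-01 and the given timestamp.

    Assumption: the timestamp is interpreted as UTC.
    """
    parsed = datetime.strptime(cdr_creation_date, FILE_DATE_FORMAT)
    parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp()) // SECONDS_PER_DAY


if __name__ == "__main__":
    # Records created on the same calendar day yield the same identifier.
    print(days_since_1970("2024_01_01_00_00_00"))
    print(days_since_1970("2024_01_01_23_59_59"))
```

With this kind of mapping, all records of one calendar day share one partition identifier, and the identifier increases by one from each day to the next.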