Preparing enhanced housekeeping for deduplication
You can enhance the housekeeping of the data deduplication in the application framework. The BloomFilter operator provides internal clean-up processing that requires you to define partitions. The framework provides these configuration parameters:
- ite.businessLogic.group.deduplication.partitioning (on/off)
- ite.businessLogic.group.deduplication.partitioning.count (integer)
- ite.businessLogic.group.deduplication.partitioning.searchAllPartitions (on/off)
The last two configuration parameters are mapped to parameters of the BloomFilter operator:
- partitionCount
- searchAllPartitions
The following scenario describes an example that you could follow:
- The parsed records provide the timestamp information in the CdrCreationDate attribute as an rstring in the format YYYY_MM_DD_hh_mm_ss.
- The business case requires the application to process 10,000,000 records per day.
- The probability value for the duplicate detection is 0.001.
- The application must detect duplicates within the last 3 days, even when records contain errors.
Read the BloomFilter description for more details.
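To illustrate why partitions make housekeeping possible, the following Python sketch models the idea under simplifying assumptions. It is not the BloomFilter operator's implementation: exact sets stand in for Bloom filters, the class and method names are hypothetical, and the eviction rule (drop the partition with the smallest identifier once more than the configured number of partitions exist) is an assumption made only for this illustration.

```python
class PartitionedDeduplicator:
    """Conceptual model only: one exact set per partition instead of a Bloom filter."""

    def __init__(self, partition_count, search_all_partitions):
        self.partition_count = partition_count              # e.g. 3 partitions for 3 days
        self.search_all_partitions = search_all_partitions  # look in every kept partition?
        self.partitions = {}                                # partition id -> set of hash codes

    def is_duplicate(self, partition_id, hashcode):
        """Return True if the hash code was seen before; otherwise remember it."""
        if self.search_all_partitions:
            candidates = list(self.partitions.values())
        else:
            candidates = [self.partitions.get(partition_id, set())]
        if any(hashcode in seen for seen in candidates):
            return True
        self.partitions.setdefault(partition_id, set()).add(hashcode)
        # Assumed housekeeping rule: discard the oldest partitions as a whole.
        while len(self.partitions) > self.partition_count:
            del self.partitions[min(self.partitions)]
        return False


dedup = PartitionedDeduplicator(partition_count=3, search_all_partitions=True)
print(dedup.is_duplicate(100, "a"))  # first occurrence on day 100
print(dedup.is_duplicate(101, "a"))  # same record one day later
```

In this model, a whole partition is discarded at once, so old entries disappear without any per-record bookkeeping. Searching all partitions is what lets a record from one day be matched against the records of the other retained days.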
Procedure
- Open the <PathToYourApplication>/config/config.cfg file and define the values of the required parameters. Remember that the value of the ite.businessLogic.group.deduplication.probability parameter and the expected number of bloom entries, which you can find by default in the <PathToYourApplication>/config/groups.cfg file, are valid and constant for each partition. Keep in mind the memory that the BloomFilter operator allocates: the ite.businessLogic.group.deduplication.partitioning.count parameter multiplies the calculated memory consumption per partition. Finally, save the file.
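- Example settings for the scenario:
- enable data deduplication
- enable checkpointing for restarts and data recovery after error
- enable bloom filter partitioning (ite.businessLogic.group.deduplication.partitioning=on)
- set 3 partitions for 3 days (ite.businessLogic.group.deduplication.partitioning.count=3)
- enable search for all 3 days (ite.businessLogic.group.deduplication.partitioning.searchAllPartitions=on)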
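Because the partition count multiplies the memory consumption per partition, a rough estimate before you change the configuration can be useful. The following Python sketch uses the textbook formula for an optimally sized Bloom filter, m = -n * ln(p) / (ln 2)^2 bits. This is an assumption: the BloomFilter operator might size its filters differently, and the function names are hypothetical. The inputs are taken from the scenario (10,000,000 entries per partition, probability 0.001, 3 partitions).

```python
import math


def bloom_bits_per_partition(expected_entries, probability):
    """Textbook optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-expected_entries * math.log(probability) / (math.log(2) ** 2))


def total_bloom_bytes(expected_entries, probability, partition_count):
    """Estimated bytes for all partitions; every partition has the same size."""
    bits = bloom_bits_per_partition(expected_entries, probability)
    return partition_count * math.ceil(bits / 8)


# Scenario values: 10,000,000 records per day, probability 0.001, 3 daily partitions.
per_partition_mb = math.ceil(bloom_bits_per_partition(10_000_000, 0.001) / 8) / 1e6
total_mb = total_bloom_bytes(10_000_000, 0.001, 3) / 1e6
print(f"per partition: {per_partition_mb:.1f} MB, total: {total_mb:.1f} MB")
```

Under this formula, the scenario needs roughly 18 MB per partition and about 54 MB for three partitions. Treat the result as an order-of-magnitude check, not as the operator's actual allocation.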
- Open the <namespace>.streams.custom::TypesCustom composite operator and modify the value of the PartitionIdType composite attribute, which represents the type of the partitionId stream attribute. Save the file.
- Example: static PartitionIdType = uint32;
- Customize the value of the partitionId attribute. The recommended location for this customization is the <namespace>.chainprocessor.transfomer.custom::DataProcessor composite operator, next to the calculation of the hashcode attribute. How you determine the partition identifier depends on your business logic. In the example, you must convert the date format to the number of days since 1970. For this conversion, you can use the functions that the framework provides in the <namespace>.functions namespace. Finally, save the file.
- Example: IN.partitionId=daysSince1970(convertFromFileDateToTimestamp(CdrCreationDate,"YYYY_MM_DD_hh_mm_ss"));
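To make the conversion in the example concrete, the following Python sketch computes a day-based partition identifier from a timestamp in the YYYY_MM_DD_hh_mm_ss format. It only mirrors what the expression above is described as doing and is not the framework's implementation. The function name is hypothetical, and interpreting the timestamp as UTC is an assumption.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def days_since_1970(cdr_creation_date):
    """Map 'YYYY_MM_DD_hh_mm_ss' to whole days since 1970-01-01 (UTC assumed)."""
    parsed = datetime.strptime(cdr_creation_date, "%Y_%m_%d_%H_%M_%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() // SECONDS_PER_DAY)


# Records created on the same calendar day receive the same partition identifier.
print(days_since_1970("2000_01_01_00_00_00"))
print(days_since_1970("2000_01_01_23_59_59"))
```

All records of one calendar day map to the same identifier, and consecutive days differ by exactly 1, which fits a setup with one partition per day.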
- Recompile your application.