Grouping with custom correlation and tuple deduplication

You can configure your application to group tuples, correlate them with your own business logic, and remove duplicate tuples from your data.

About this task

Enable grouping by file name or tuple attributes with custom correlation and tuple deduplication.

Procedure

In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group parameter description.
To enable the grouping function, set the parameter to on as follows: ite.businessLogic.group=on
In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.transformation.tupleGroupSplit and the ite.ingest.fileGroupSplit parameter descriptions.
For grouping based on file names, set the parameters as follows: ite.ingest.fileGroupSplit=on and ite.businessLogic.transformation.tupleGroupSplit=off. For grouping based on tuple attributes, set the parameters as follows: ite.ingest.fileGroupSplit=off and ite.businessLogic.transformation.tupleGroupSplit=on
In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group.deduplication parameter description.
To turn on the duplicate detection, set the parameter to on: ite.businessLogic.group.deduplication=on
In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group.deduplication.timeToKeep parameter description.
To check incoming tuples against the tuple data of the last three days when detecting duplicates, set the parameter as follows: ite.businessLogic.group.deduplication.timeToKeep=3d
In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group.deduplication.probability parameter description.
Set the acceptable false-positive rate to the value that you want, for example, one in a million records:

ite.businessLogic.group.deduplication.probability=0.000001

If you want to use a partitioned Bloom filter for enhanced housekeeping, set the parameter to on as follows: ite.businessLogic.group.deduplication.partitioning=on and define the number of partitions. For example, one partition per day for three days means 3 partitions: ite.businessLogic.group.deduplication.partitioning.count=3
If your data deduplication uses partitions and you must look for duplicates in all partitions, for example, for the last 3 days, set the parameter to on as follows: ite.businessLogic.group.deduplication.partitioning.searchAllPartitions=on
In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group.custom parameter description.
To enable the custom correlation, set the parameter to on as follows: ite.businessLogic.group.custom=on
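
Taken together, the settings from the steps in this procedure might look like the following fragment of config.cfg. This is only a sketch for the case of grouping based on tuple attributes, with partitioned deduplication over three days. It assumes that lines starting with # are treated as comments in config.cfg. All parameter names and values are the ones that are named in the procedure.

```
# Sketch of <PathToYourApplication>/config/config.cfg
# Variant shown: grouping based on tuple attributes (assumed choice)
ite.businessLogic.group=on
ite.ingest.fileGroupSplit=off
ite.businessLogic.transformation.tupleGroupSplit=on

# Duplicate detection over three days, one false positive in a million records
ite.businessLogic.group.deduplication=on
ite.businessLogic.group.deduplication.timeToKeep=3d
ite.businessLogic.group.deduplication.probability=0.000001

# Optional: partitioned Bloom filter, one partition per day
ite.businessLogic.group.deduplication.partitioning=on
ite.businessLogic.group.deduplication.partitioning.count=3
ite.businessLogic.group.deduplication.partitioning.searchAllPartitions=on

# Custom correlation
ite.businessLogic.group.custom=on
```

For grouping based on file names, swap the two split parameters: ite.ingest.fileGroupSplit=on and ite.businessLogic.transformation.tupleGroupSplit=off.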

The BloomFilter operator that detects duplicates in your data records needs to know the expected number of records for the period of time that you keep. You configure this information in the group configuration file, which by default is <PathToYourApplication>/config/groups.cfg.
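
To illustrate why the expected number of records matters, the textbook sizing formulas for a standard Bloom filter are shown here. This is a general sketch and not a statement about how the BloomFilter operator is implemented. The record count n = 1,000,000 is a hypothetical value that is chosen only for the arithmetic, and p is the false-positive probability from the example in the procedure.

```latex
% Textbook Bloom filter sizing (assumption: standard, non-partitioned filter)
% n = expected number of records (hypothetical: 10^6)
% p = acceptable false-positive probability (10^{-6}, as in the procedure)
% m = number of bits, k = number of hash functions
\begin{align*}
m &= -\frac{n \ln p}{(\ln 2)^2}
   = \frac{10^{6} \cdot 13.8155}{0.480453}
   \approx 2.88 \times 10^{7}\ \text{bits} \approx 3.6\ \text{MB} \\
k &= \frac{m}{n} \ln 2 = -\log_2 p \approx 19.93 \approx 20
\end{align*}
```

Under these assumptions, the memory that is needed grows linearly with the expected number of records. An estimate that is too low leads to more false positives than the configured probability allows.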

If you use the enhanced housekeeping for duplicate detection, you must define how the partition identifier is determined, follow the related requirements, and be aware of the restrictions of the BloomFilter operator.

For grouping based on tuple attributes, you implement your custom business logic in the <namespace>.chainprocessor.transfomer.custom::DataProcessor composite. The logic must produce a destination group ID in the groupID SPL output attribute. The groupID is a two-digit rstring attribute that supports the range 00 - 99. The default groupID value is 00. If you set the groupID to an unknown value, the processing element shuts down.