Grouping with tuple deduplication

To group tuples and remove duplicates from your data, enable the grouping and deduplication functions in the configuration file of your application.

About this task

Enable grouping based on either file names or tuple attributes, and enable tuple deduplication.

Procedure

  1. In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group parameter description
  2. To enable the grouping function, set the parameter to on as follows: ite.businessLogic.group=on

  3. In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.transformation.tupleGroupSplit and the ite.ingest.fileGroupSplit parameter descriptions

  4. Select one grouping mode. For grouping based on file names, set ite.ingest.fileGroupSplit=on and ite.businessLogic.transformation.tupleGroupSplit=off. For grouping based on tuple attributes, set ite.ingest.fileGroupSplit=off and ite.businessLogic.transformation.tupleGroupSplit=on

  5. In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group.deduplication parameter description

  6. To turn on the duplicate detection, set the parameter to on: ite.businessLogic.group.deduplication=on

  7. In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group.deduplication.timeToKeep parameter description

  8. To check incoming tuples against the tuple data of the previous three days when detecting duplicates, set the parameter as follows: ite.businessLogic.group.deduplication.timeToKeep=3d

  9. In the file <PathToYourApplication>/config/config.cfg, find the ite.businessLogic.group.deduplication.probability parameter description

  10. Set the acceptable false-positive rate to the required value, for example, one in a million records: ite.businessLogic.group.deduplication.probability=0.000001
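
Taken together, the steps above might leave the relevant part of config.cfg looking like the following excerpt. It is a sketch only: it uses the parameter names and example values from the procedure, selects grouping based on file names in step 4, and assumes that lines starting with # are treated as comments in this file. Your config.cfg contains other parameters that are not shown here.

```cfg
# Illustrative excerpt of <PathToYourApplication>/config/config.cfg
# Assumption: '#' starts a comment line in this file.

# Step 2: enable the grouping function
ite.businessLogic.group=on

# Step 4: grouping based on file names
# (swap on/off in the two lines below for grouping based on tuple attributes)
ite.ingest.fileGroupSplit=on
ite.businessLogic.transformation.tupleGroupSplit=off

# Step 6: turn on duplicate detection
ite.businessLogic.group.deduplication=on

# Step 8: keep three days of tuple data for duplicate detection
ite.businessLogic.group.deduplication.timeToKeep=3d

# Step 10: accept one false positive in a million records
ite.businessLogic.group.deduplication.probability=0.000001
```

Only one of the two split parameters is set to on at a time, which matches the either/or choice in step 4.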

Remember that the Bloom filter that is used to detect duplicates in your data records needs to know the expected number of records per day. You configure this value in the group configuration file, which by default is <PathToYourApplication>/config/groups.cfg.
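
To see why the expected record count matters, consider the textbook sizing rule for a standard Bloom filter. The following is a general illustration and not a description of how the application sizes its filter internally, which might differ. Here n is the expected number of records, p is the configured false-positive probability, m is the number of bits in the filter, and k is the number of hash functions. The value of 10 million records per day is a hypothetical figure chosen only for the arithmetic.

```latex
% Textbook sizing of a standard Bloom filter (assumption: the application's
% own filter may use a different scheme).
\[
  m = -\frac{n \ln p}{(\ln 2)^2},
  \qquad
  k = \frac{m}{n}\,\ln 2 = -\log_2 p
\]

% Worked example with p = 10^{-6} (the value from step 10):
\[
  \frac{m}{n} = \frac{-\ln 10^{-6}}{(\ln 2)^2}
              \approx \frac{13.816}{0.4805}
              \approx 28.8 \ \text{bits per record},
  \qquad
  k \approx 19.9 \approx 20
\]

% With a hypothetical n = 10^{7} records per day:
\[
  m \approx 28.8 \times 10^{7} \ \text{bits} \approx 36 \ \text{MB}
\]
```

Under this rule, the memory that the filter needs grows linearly with the expected number of records, and if the actual volume exceeds the configured value, the false-positive rate rises above the configured probability.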

For grouping based on tuple attributes, you implement your custom business logic in the <namespace>.chainprocessor.transformer.custom::DataProcessor composite. The logic must write the destination group ID to the groupID SPL output attribute. The groupID is a two-digit rstring attribute that supports the range 00 - 99. The default groupID value is 00. If you set the groupID to an unknown value, the processing element shuts down.