Distributing files to processing chains defined on job submission

To increase the throughput of your application, you can distribute the detected input files to several processing chains that work on the data in parallel. If your business logic does not use User Defined Parallelism (UDP) itself, you can enable UDP for the framework and provide the required number of chains on job submission.

The round-robin distribution that is used here is simple and requires no extra effort. However, the chains might not be loaded equally if the sizes of the input files differ significantly: chains that process small files can become underutilized, and the overall throughput degrades. To cope with bursts of incoming data files, the file name queues are kept in the ChainProcessors and are not limited in size. Use this mechanism if you expect input files of roughly the same size, or if files land continuously in the input directories.
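The load imbalance described above can be illustrated with a minimal Python sketch. It is not the framework's implementation; the function names, file names, and sizes are hypothetical and serve only to show how assigning files to chains in strict rotation ignores file size.

```python
def distribute_round_robin(files, num_chains):
    """Assign (name, size) entries to chains in arrival order, one chain per file in turn."""
    queues = [[] for _ in range(num_chains)]  # one unbounded queue per chain
    for index, entry in enumerate(files):
        queues[index % num_chains].append(entry)
    return queues


def load_per_chain(queues):
    """Total size queued on each chain."""
    return [sum(size for _, size in queue) for queue in queues]


# Hypothetical input: large and small files arriving alternately.
files = [("a.csv", 100), ("b.csv", 1), ("c.csv", 100), ("d.csv", 1)]
queues = distribute_round_robin(files, 2)
print(load_per_chain(queues))  # [200, 2]
```

In this example every chain receives the same number of files, yet one chain carries almost the entire data volume. With files of similar size, the same rotation yields a balanced load.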


Round Robin distribution

About this task

Distribute input files in a round-robin manner to the parallel processing chains. The number of chains is passed to the application on job submission.

Procedure

  1. In the file <PathToYourApplication>/config/config.cfg, find the description of the ite.ingest.loadDistribution parameter.
  2. To select the round-robin file distribution, set the parameter value as follows: ite.ingest.loadDistribution=roundRobin
  3. In the same file, find the description of the ite.ingest.loadDistribution.udp parameter.
  4. To enable user-defined parallelism, set the parameter to on as follows: ite.ingest.loadDistribution.udp=on
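
After both settings are applied, the relevant entries in config.cfg would look roughly as follows. This fragment shows only the two parameters named in this procedure; any other entries in your file stay unchanged.

```
ite.ingest.loadDistribution=roundRobin
ite.ingest.loadDistribution.udp=on
```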

CAUTION: Remember to provide the required number of chains when you submit the job, by using the streamtool command-line option '-P' together with the parameter ite.ingest.loadDistribution.groupConfigFile.chains.