The ITE application doesn't archive a reprocessed file in the archive folder
If you try to process a file again, or a file that has the same content as an already processed file, the ITE application detects it as a duplicate. It doesn't archive the file as a regular processed file, and it doesn't create output files as expected.
To solve this problem, you must remove the checkpoint information of the affected input file and restart the ITE application. If housekeeping is switched on, you can instead wait and process the file after the clean-up has removed the relevant checkpoint information, which happens after the specified time interval.
Keep in mind that you can't identify the checkpoint entry of a single data set in the checkpoint file. You must remove the whole checkpoint file.
Symptoms
- The input file moves to the duplicate folder.
- The input file moves to the failed folder.
- The input file moves to the archive folder, but the application doesn’t create any output file in the load folder. The ITE application writes a duplication record for each data tuple to the corresponding rejection file (<input-file-name>.rej.csv) in the rejected output folder.
Causes
- If the file ingestion detects a duplicate input file name, the application moves the file to the duplicate folder.
- If the configuration specifies the ite.ingest.deduplication.reprocessFilePattern parameter and the customizing code for the reprocessing is faulty, the application moves the input file to the failed folder.
- If you have a copy of an already processed input file under another file name, the application does not create the output files in the load folder.
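To tell which of these cases applies, it can help to search the data directory of the application for the input file and its rejection file. The following is a minimal sketch, not a product command: the function name locate_input_file is hypothetical, and the data directory in the usage comment is assumed from the demoapp sample paths.

```shell
#!/bin/sh
# Hypothetical helper: list every location below a data directory that holds
# the input file or its rejection file <input-file-name>.rej.csv.
# Usage: locate_input_file <data-dir> <input-file-name>
locate_input_file() {
    data_dir=$1
    file_name=$2
    # -name takes a glob pattern; the file name is assumed to contain
    # no glob characters such as * or ?.
    find "$data_dir" -type f \
        \( -name "$file_name" -o -name "$file_name.rej.csv" \) -print
}

# Example for the demoapp sample (the path is an assumption):
# locate_input_file teda.demoapp/data CDR_RGN0_20140201083000.bin
```

The folder names in the printed paths show whether the file ended up in the duplicate, failed, or archive folder, and whether a rejection file was written.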
Resolving the problem
If your application moves the reprocessed input file to the failed folder and the reprocessing of files is a regular procedure, then you must correct the customizing code.
You can use the following procedure in any case.
To make the procedure easier to follow, each step shows the equivalent commands for the demoapp sample application. The file that needs the reprocessing is CDR_RGN0_20140201083000.bin.
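- Check whether the input directory of the ITE application is empty, that is, whether it contains no files that wait for processing. This folder is specified in the config.cfg file as the value of the ite.ingest.directory.input parameter. In the following example for the demoapp sample application, the command lists only the archive subfolder:
ls teda.demoapp/data/in
archive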
- Check whether the chain process status files are empty. The naming schema is status_<ite-namespace>_<group_id>_<chain_id>.txt, for example, status_demoapp_00_0.txt for the demoapp sample. You find these files in the control directory that is specified by the global.applicationControlDirectory parameter. In the following example, the command prints only the file names, which means that the files are empty:
find teda.lookupmgr/data/control -name "status_demoapp_*" -print -exec cat {} \;
teda.lookupmgr/data/control/status_demoapp_02_0.txt
teda.lookupmgr/data/control/status_demoapp_01_0.txt
teda.lookupmgr/data/control/status_demoapp_00_0.txt
- Cancel the ITE application, for example, the teda.demoapp ITE application with job ID 0:
streamtool canceljob 0
- For later verification, check how often the ITE application processed the data file. Count the entries in the statistics file <date>_<application-namespace>_Statistics.txt, which you find in the statistics subfolder of the output directory that is specified by the ite.storage.directory.outputs parameter, for example:
grep CDR_RGN0_20140201083000.bin teda.demoapp/data/out/statistics/*_demoapp_Statistics.txt | wc -l
1
- Remove the line with your input file name from the ingestion checkpoint file. The checkpoint file is located in the folder that is specified by the ite.checkpointing.directory parameter, for example:
grep CDR_RGN0_20140201083000.bin teda.demoapp/data/checkpoint/fileDedupcheck*
teda.demoapp/data/checkpoint/fileDedupcheck_00_0.chk:{n="CDR_RGN0_20140201083000.bin",t=1461136080}
sed -i '/CDR_RGN0_20140201083000\.bin/d' teda.demoapp/data/checkpoint/fileDedupcheck_00_0.chk
- Remove the checkpoint files of the group deduplication that are located in the subfolder <groupId>/committed/ of the ITE application's checkpointing folder. The name of the checkpoint file is <input-file-name>.chk, for example:
rm -f teda.demoapp/data/checkpoint/00/committed/CDR_RGN0_20140201083000.bin.chk
- Remove the optional custom checkpoint files of the group deduplication that are located in the subfolder custom/<groupId>/committed/ of the ITE application's checkpointing folder. The name of the checkpoint file is <input-file-name>.bin, for example, for teda.demoapp:
rm -f teda.demoapp/data/checkpoint/custom/00/committed/CDR_RGN0_20140201083000.bin.bin
- Submit the ITE application, for example:
streamtool submitjob teda.demoapp/output/ITEMain/demoapp.ITEMain.sab
- Restart the Lookup Manager job by using the list of controlled applications as specified in the LookupMgrCustomizing.xml file or by the lm.controlledApplications submission parameter. This restart command is required because the status of the restarted ITE applications is otherwise not synchronized with the status of the Lookup Manager application. Create an appl.ctl.cmd command file with the content restart,<comma-separated-ite-application-list> in the control folder that is specified by the global.applicationControlDirectory parameter. For the teda.demoapp sample:
echo "restart,demoapp" >teda.lookupmgr/data/control/appl.ctl.cmd
- Reprocess the data file in your ITE application, for example:
mv teda.demoapp/data/in/archive/CDR_RGN0_20140201083000.bin teda.demoapp/data/in/
- Verify in the statistics file that the file was processed, as described in step 4. The count must increase by 1, for example:
grep CDR_RGN0_20140201083000.bin teda.demoapp/data/out/statistics/*_demoapp_Statistics.txt | wc -l
2
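The checkpoint clean-up steps of this procedure (removing the entry from the ingestion checkpoint file and deleting the group deduplication checkpoint files) can be bundled into a small script. This is a minimal sketch, not part of the product: the function name remove_checkpoints is hypothetical, the directory layout follows the demoapp example paths, and the quoted match on "<input-file-name>" assumes the entry format shown in the grep output of the procedure. It relies on GNU sed, as the sed -i command in the procedure already does. Run it only while the ITE application is canceled.

```shell
#!/bin/sh
# Hypothetical helper that bundles the checkpoint clean-up steps.
# Usage: remove_checkpoints <checkpoint-dir> <group-id> <input-file-name>
remove_checkpoints() {
    ckpt_dir=$1
    group_id=$2
    file_name=$3

    # Escape regular-expression metacharacters so that sed matches the
    # file name literally (for example, the dot in ".bin").
    escaped=$(printf '%s\n' "$file_name" | sed 's|[][\.*^$/]|\\&|g')

    # Delete the entry from every ingestion checkpoint file that contains it.
    # Assumption: entries look like {n="<input-file-name>",t=<timestamp>}.
    for chk in "$ckpt_dir"/fileDedupcheck*.chk; do
        [ -f "$chk" ] || continue
        if grep -qF -- "\"$file_name\"" "$chk"; then
            sed -i "/\"$escaped\"/d" "$chk"
        fi
    done

    # Remove the group deduplication checkpoint file and the optional
    # custom checkpoint file; rm -f ignores files that do not exist.
    rm -f "$ckpt_dir/$group_id/committed/$file_name.chk"
    rm -f "$ckpt_dir/custom/$group_id/committed/$file_name.bin"
}

# Example for the demoapp sample (paths taken from the procedure):
# remove_checkpoints teda.demoapp/data/checkpoint 00 CDR_RGN0_20140201083000.bin
```

Matching the file name together with its surrounding quotation marks keeps entries for other files whose names merely contain the same text.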