Recovering from PE failures

Teracloud® Streams can recover automatically from most failures in application processing elements (PEs). When automatic recovery is not possible, use this procedure to recover from PE failures manually.

Procedure

  1. Determine which PEs are failing by viewing their status and health information.
    Notes:
    • PEs that must be restarted have a state of Stopped and a reason code of C (crash) or F (failure).

    • If PEs have a state of Unknown, the domain controller service cannot be contacted. Before you can restart the PE, you must recover from the resource failure.

    • If PEs remain in the Restarting state for an extended time, you might have to cancel and resubmit the job to fix the problem.

      If an instance is configured to allocate application resources by using the --numresources option and all the application resources in the instance fail within a short time, Teracloud® Streams might not be able to restart the PEs automatically. This issue can occur even if the application resources recover successfully. To reduce the possibility that all application resources fail at the same time, use multiple application resources.
  2. Restart the PEs that are stopped.

    The recovery of a PE depends on whether the PE is restartable or relocatable.

    • If the PE is restartable and relocatable, the PE is automatically restarted on an available resource that is chosen by the application manager service.

      If the automatic retry count for the PE is exceeded, the PE cannot be restarted automatically. You must manually restart the PE by using the Streams Console or the streamtool restartpe command.

      Important: Exercise caution when restarting a PE on a specific resource. This action overrides the original resource constraints that are defined in the application and bypasses the application manager service. In the following example, processing element 1 is restarted on the host1 resource and is constrained to always run on host1:
      streamtool restartpe -i instance-id --resource host1 1
    • If the PE is restartable and not relocatable, the PE is automatically restarted on the same resource.

      If the automatic retry count for the PE is exceeded, the PE cannot be restarted automatically. You must manually restart the PE by using the Streams Console or the streamtool restartpe command.

    • If the PE is not restartable, the PE is not automatically restarted. To restart the PE, cancel its job and submit the job again.

    When you restart a PE manually, the automatic retry count for that PE is reset to 0.
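
The decision logic in steps 1 and 2 can be summarized as a small shell sketch. It is illustrative only: the function names, the argument order, and the action labels are hypothetical and are not part of streamtool. The caller supplies the PE state, reason code, and restartability as plain arguments, and nothing in the sketch queries or changes a Streams instance. Relocatability is left out because it affects only where an automatic restart happens, not whether manual action is needed. The printed restartpe command assumes the same form as the example above, without the --resource option.

```shell
#!/usr/bin/env bash

# Map a PE's status to the recovery action described in this procedure.
# Arguments (hypothetical, supplied by the caller):
#   $1 state         Stopped, Unknown, Restarting, or another state
#   $2 reason        reason code, for example C (crash) or F (failure)
#   $3 restartable   true or false
#   $4 retries_left  true or false (false = automatic retry count exceeded)
pe_recovery_action() {
  local state=$1 reason=$2 restartable=$3 retries_left=$4
  case "$state" in
    Unknown)
      # The resource failure must be resolved before the PE can be restarted.
      echo "recover-resource-first" ;;
    Restarting)
      # Only relevant if the PE stays in this state for an extended time.
      echo "wait-then-cancel-and-resubmit-job" ;;
    Stopped)
      if [ "$reason" != "C" ] && [ "$reason" != "F" ]; then
        echo "not-covered-here"
      elif [ "$restartable" != "true" ]; then
        echo "cancel-and-resubmit-job"
      elif [ "$retries_left" = "true" ]; then
        echo "automatic-restart"
      else
        echo "manual-restartpe"
      fi ;;
    *)
      echo "not-covered-here" ;;
  esac
}

# Print (do not run) the manual restart command for one PE.
restartpe_command() {
  local instance=$1 pe_id=$2
  printf 'streamtool restartpe -i %s %s\n' "$instance" "$pe_id"
}
```

For example, `pe_recovery_action Stopped C true false` prints `manual-restartpe`, and `restartpe_command instance-id 1` prints `streamtool restartpe -i instance-id 1`. Printing the command rather than running it keeps the sketch safe to try on a system where the instance is already in a degraded state.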

What to do next

View the status information for the processing elements again to verify that they are working correctly.
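
As a rough aid for this check, the following sketch filters a list of PEs down to those still in one of the problem states named in this procedure (Stopped, Unknown, or Restarting). The function name and the two-column "pe-id state" input format are hypothetical; this is not the output format of any streamtool command, so the input must be prepared from whatever status view you use.

```shell
#!/usr/bin/env bash

# Read "pe-id state" pairs on standard input, one PE per line, and print
# the PEs that are still in a state this procedure treats as a problem.
# The input format is an assumption made for this sketch.
pes_needing_attention() {
  local pe_id state
  while read -r pe_id state; do
    case "$state" in
      Stopped|Unknown|Restarting) echo "$pe_id $state" ;;
    esac
  done
}
```

An empty result suggests that no PE in the list remains in those states. It does not replace reviewing the health information for each PE.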