Recovering from resource and service failures

You can recover from resource and service failures by moving or restarting the services.

About this task

If you configure high availability for the management services in your domain and instance, Teracloud® Streams can switch to using standby services when software, hardware, or network interruptions occur. After this failover occurs, Teracloud® Streams tries to restart the failing management services; if they are successfully restarted, they become standby services and wait in an idle state.

If you did not configure high availability for your domain or instance management services, or if the services cannot be restarted automatically, you can try to restart them manually. If the resource itself failed, for example because of hardware problems, you must fix the resource problems before you can restart services on that resource.

Tip: To ensure that Teracloud® Streams can automatically restart restartable and relocatable processing elements on a different resource after a resource failure, do not specify host names in the application code.

To change the period that Teracloud® Streams waits before it restarts processing elements when a resource fails, update the instance.restartPesOnResourceFailureWaitTime instance property value. For more information about this property, see streamtool man properties.
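As a rough illustration of the property update, the following sketch builds a streamtool command line without running it. It assumes that the property is set with streamtool setproperty in name=value form and that the -d and -i options select the domain and instance. The helper name, the IDs, and the value 30 are hypothetical placeholders. Check streamtool man properties for the valid values and units before you use a real value.

```python
def build_setproperty_command(domain_id, instance_id, wait_time):
    """Return the argument list for updating the PE restart wait time.

    Assumption: `streamtool setproperty` accepts `-d`, `-i`, and a
    `name=value` pair. Verify against your installation's streamtool help.
    """
    prop = "instance.restartPesOnResourceFailureWaitTime"
    return [
        "streamtool", "setproperty",
        "-d", domain_id,      # domain ID (placeholder)
        "-i", instance_id,    # instance ID (placeholder)
        f"{prop}={wait_time}",
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(" ".join(build_setproperty_command("mydomain", "myinstance", 30)))
```

Printing the command rather than executing it lets you review the exact arguments before you apply the change to an instance.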

Procedure

To manually recover from resource and service failures:
  1. Determine which resources and services are failing.
    For example, look for resources and services whose status is not RUNNING or WAITING. An application resource that can no longer be used for scheduling application jobs is reported as no longer schedulable.
  2. Fix any resource problems or move the services to a new resource.
    To move services, change the tags that are associated with the resources.
  3. Restart any failing services.
    If you moved the services to a new resource, this step is necessary only if some of the services failed to restart on the new resource. You can restart services in the Streams Console or by using the streamtool restartdomainservice and streamtool restartservice commands.
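
The status check in step 1 can be sketched as a small filter. This is a minimal sketch, not part of the product: it assumes that you have already collected (name, status) pairs from the Streams Console or from command output, and the sample names and the FAILED and STOPPED status values are hypothetical.

```python
# Statuses that the procedure treats as healthy.
HEALTHY_STATUSES = {"RUNNING", "WAITING"}


def find_failing(entries):
    """Return the names of entries whose status is not RUNNING or WAITING.

    `entries` is an iterable of (name, status) pairs. Status text is
    compared after trimming whitespace and converting to uppercase.
    """
    return [
        name
        for name, status in entries
        if status.strip().upper() not in HEALTHY_STATUSES
    ]


if __name__ == "__main__":
    # Hypothetical sample data; real names and statuses come from your domain.
    sample = [
        ("host1", "RUNNING"),
        ("host2", "FAILED"),
        ("sam", "WAITING"),
        ("srm", "STOPPED"),
    ]
    print(find_failing(sample))
```

The entries that the filter returns are the candidates for steps 2 and 3.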

What to do next

View the status information for the resources and services again to verify that they are working correctly. Then determine the state of the processing elements (PEs) and recover from any PE failures. For more information, see Recovering from PE failures.