Techniques for troubleshooting problems

Troubleshooting is a systematic approach to solving a problem. The goal of troubleshooting is to determine why something does not work as you expect it to work, and then decide how to resolve the problem.

Certain common techniques can help with the task of troubleshooting. The first step in the troubleshooting process is to describe the problem completely. Problem descriptions help you and the Teracloud® Support representative know where to look for the cause of the problem. Ask yourself the following questions:

What are the symptoms of the problem?
Where does the problem occur?
When does the problem occur?
Under which conditions does the problem occur?
Can the problem be reproduced?

The answers to these questions typically establish a good description of the problem, which can then lead you to a resolution.

What are the symptoms of the problem?

When you begin to describe a problem, the most obvious question is "What is the problem?" This question might seem straightforward. However, you can break it into several focused questions that create a more descriptive picture of the problem. These questions can include:

Who, or what, is reporting the problem?
What are the error codes and messages?
How does the system fail? For example, is the failure manifested as a loop, hang, crash, performance degradation, or an incorrect result?

Where does the problem occur?

One of the most important steps in resolving a problem is to determine where the problem originates. Multiple layers of technology can exist between the reporting and failing components. Networks, disks, and drivers are only a few of the components to consider when you investigate problems.

The following questions help you to focus on the layer of technology in which the problem occurs:

Is the problem specific to one platform or operating system, or is it common across multiple platforms or operating systems?
Are the current environment and configuration supported by the product?
Do all the users encounter the problem?
For a multiple-site installation, do all the sites encounter the problem?

When one layer of technology reports the problem, the problem does not necessarily originate in that layer. By understanding the environment in which a problem exists, you can more easily identify where that problem originates. Take the time to completely describe the environment in which the problem exists, including the operating system and version, all the corresponding software and versions, and the hardware that is used in the environment. Verify that you are running in an environment that is supported by Teracloud® Streams. Many problems can be traced to incompatible software levels that are not intended to run together or have not been thoroughly tested together.
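As a minimal sketch of how the operating system and hardware part of that description might be captured, the following Python snippet uses only the standard `platform` module. The function name `describe_environment` and the choice of fields are illustrative assumptions, not part of Teracloud® Streams. The snippet does not collect product or dependency versions, which you still need to record separately.

```python
import platform


def describe_environment():
    """Collect basic host facts for a problem description (illustrative field set)."""
    return {
        "os": platform.system(),            # for example "Linux"
        "os_release": platform.release(),   # kernel or OS release string
        "os_version": platform.version(),   # detailed OS version string
        "machine": platform.machine(),      # hardware architecture
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    # Print one "key: value" line per fact so the output can be pasted into a report.
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Attaching output like this to a problem report gives a support representative a consistent starting point, although it covers only part of the environment.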

When does the problem occur?

Develop a detailed timeline of the events that lead to the failure, especially for problems that occur only once. The easiest way to develop this timeline is to work backwards. Begin at the time when the error was reported, and be as precise as possible, even to the millisecond. Work backwards through the available logs and information. Typically, you need to look only as far as the first suspicious event that you find in a diagnostic log.

To develop a detailed timeline of events, and construct a frame of reference in which to investigate the problem, answer these questions:

Does the problem occur only at a specific time of day or night?
How frequently does the problem occur?
What sequence of events leads to the time when the problem is reported?
Does the problem occur after an environment change such as an upgrade, or after installing software or hardware?
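Working backwards through a log can be sketched in a few lines of Python. The log format used here (`date time LEVEL message` with millisecond timestamps), the sample lines, and the rule that a `WARN` or `ERROR` entry counts as suspicious are all assumptions made for illustration. Real diagnostic logs differ and usually need a different parser.

```python
from datetime import datetime

# Assumed (hypothetical) log format: "2024-05-01 12:00:03.250 LEVEL message"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
SUSPICIOUS_LEVELS = {"WARN", "ERROR"}  # assumption: what counts as suspicious


def parse_line(line):
    """Split one log line into (timestamp, level, message)."""
    date_part, time_part, level, message = line.split(" ", 3)
    timestamp = datetime.strptime(f"{date_part} {time_part}", TIMESTAMP_FORMAT)
    return timestamp, level, message


def events_before(lines, reported_at):
    """Return the events strictly earlier than reported_at, newest first."""
    events = [parse_line(line) for line in lines]
    earlier = [event for event in events if event[0] < reported_at]
    return sorted(earlier, key=lambda event: event[0], reverse=True)


def first_suspicious_event(lines, reported_at):
    """Walk backwards from the report time; stop at the first suspicious event."""
    for event in events_before(lines, reported_at):
        if event[1] in SUSPICIOUS_LEVELS:
            return event
    return None


# Made-up sample data for illustration only.
SAMPLE_LOG = [
    "2024-05-01 11:59:58.100 INFO job submitted",
    "2024-05-01 12:00:01.750 WARN connection retry",
    "2024-05-01 12:00:02.900 INFO processing batch",
    "2024-05-01 12:00:03.250 ERROR operator terminated",
]

if __name__ == "__main__":
    reported_at = datetime(2024, 5, 1, 12, 0, 3, 250000)
    print(first_suspicious_event(SAMPLE_LOG, reported_at))
```

The comparison is strict so that the reported error itself is skipped and the search returns the nearest earlier entry that looks suspicious.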

Under which conditions does the problem occur?

For troubleshooting purposes, it is important to know which systems and applications are running when a problem occurs. The following questions about your environment can help you to identify the root cause of the problem:

Does the problem always occur during one particular task or operation?
Does the problem occur only after a specific sequence of events?
Do other applications fail at the same time?

By answering questions such as these, you can explain the environment in which the problem occurs and correlate any dependencies. However, when multiple problems occur at about the same time, the problems are not necessarily related.

Can the problem be reproduced?

Under ideal troubleshooting conditions, the problem can be reproduced. When a problem can be reproduced, you typically have a larger set of tools or procedures that can help you to investigate the problem. The problems that you can reproduce are often easier to debug and resolve.

However, problems that you can reproduce can have a disadvantage. If the problem has a significant business impact, you do not want it to occur again. In this case, if possible, reproduce the problem in a test environment or a development environment, which typically offers you more flexibility and control during your investigation.

The following questions help you to determine whether and how the problem can be reproduced:

Can the problem be reproduced on a test system?
Do multiple users or applications encounter the same type of problem?
Can the problem be reproduced by running a single command, a set of commands, or a particular application?
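If the problem can be triggered by a single command, a small harness can run that command several times in a test environment and record how each attempt ends. The sketch below uses Python's standard `subprocess` module. The function name `attempt_reproduction`, the attempt count, the timeout, and the placeholder command are illustrative assumptions. Substitute the command that you suspect triggers the problem.

```python
import subprocess
import sys


def attempt_reproduction(command, attempts=3, timeout_seconds=30):
    """Run command repeatedly and record (attempt, outcome, exit code) for each run."""
    results = []
    for attempt in range(1, attempts + 1):
        try:
            completed = subprocess.run(
                command,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
            outcome = "ok" if completed.returncode == 0 else "failed"
            results.append((attempt, outcome, completed.returncode))
        except subprocess.TimeoutExpired:
            # A timeout might indicate a hang or a loop rather than a crash.
            results.append((attempt, "timeout", None))
    return results


if __name__ == "__main__":
    # Placeholder command (assumption): a Python one-liner that exits with status 3.
    placeholder = [sys.executable, "-c", "import sys; sys.exit(3)"]
    for attempt, outcome, code in attempt_reproduction(placeholder):
        print(f"attempt {attempt}: {outcome} (exit code {code})")
```

Recording every attempt, not only the first failure, helps to distinguish a problem that always occurs from one that occurs intermittently.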