Metrics

Teracloud® Streams provides metrics to help evaluate the health of Teracloud® Streams services, to aid in diagnosing performance issues, and to analyze throughput of requests. You can use the streamtool checkinstancemetrics, streamtool checkdomainmetrics, streamtool getdomainmetrics, and streamtool checkresourcemetrics commands to view the metrics data.

Metrics are held only in memory, and collection begins when the services are started. Metrics that are kept in the ZooKeeper server are also held in memory. You can clear the metrics from memory on that server by running the resetzkstat command.

Metrics that are displayed by default are defined in the default template file install-dir/version/etc/cfg/checkMetricsTemplate.json. If you want to make changes to this template file, copy it to a new location. You can control which metrics are displayed by setting the include element for each metric. You can also modify the attributes of a metric to indicate which values to check and to provide threshold values to check against. After you update the file, specify its new location on the streamtool metrics command by using the --file parameter.

Threshold values can be set for the metrics to give a warning if a value is above the limit that is specified in the template. Values that do not meet the thresholds that are established in the template file are flagged with **. Some metrics are specific to an instance. If you want to show these metrics, replace the @@instance-name@@ value in the template with the name of the instance.
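
The flagging rule can be illustrated with a minimal Python sketch. It is not the streamtool implementation: the function name flag_values and the output layout are hypothetical, and the limits are the ones that this page lists for controller.ping.timer. The only behavior taken from the text is that a value above its limit is marked with **.

```python
def flag_values(values: dict[str, float], limits: dict[str, float]) -> list[str]:
    """Return one 'attribute=value' line per attribute, marking breaches with **."""
    lines = []
    for attribute, value in values.items():
        limit = limits.get(attribute)          # None means no threshold is set
        over = limit is not None and value > limit
        lines.append(f"{attribute}={value}{' **' if over else ''}")
    return lines


# Limits as listed for controller.ping.timer; the measured values are invented.
ping_limits = {"min": 500, "max": 1000, "mean": 500, "median": 500}
ping_values = {"min": 120, "max": 1200, "mean": 300, "median": 250}
flagged = flag_values(ping_values, ping_limits)
```

In this example only the max attribute exceeds its limit, so it is the only line that carries the ** marker.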

Time-related data is shown in milliseconds.

Metric types

Teracloud® Streams monitors performance and issues by using the following metric types:
  • counter
  • gauge
  • histogram
  • meter
  • timer

A counter is a simple incrementing and decrementing 64-bit integer. The attribute count is the counter's current value.

A gauge is the simplest metric type: it only returns a value. The attribute value is an instantaneous measurement of the metric's current value. For example, a gauge might measure the number of pending jobs in a queue.
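
The difference between the two types can be sketched in a few lines of Python. The class names and the pending_jobs queue are hypothetical and only illustrate the idea: a counter stores a number that is adjusted explicitly, while a gauge reads its value on demand. Python integers are not limited to 64 bits, so this sketch does not model overflow.

```python
from typing import Callable


class Counter:
    """Integer that can be incremented and decremented."""

    def __init__(self) -> None:
        self.count = 0  # the 'count' attribute: the current value

    def inc(self, n: int = 1) -> None:
        self.count += n

    def dec(self, n: int = 1) -> None:
        self.count -= n


class Gauge:
    """Reads a value when asked instead of storing it."""

    def __init__(self, read: Callable[[], int]) -> None:
        self._read = read

    @property
    def value(self) -> int:
        return self._read()


pending_jobs = ["job-1", "job-2", "job-3"]  # hypothetical queue
starts = Counter()
starts.inc()
starts.inc()
starts.dec()
queue_depth = Gauge(lambda: len(pending_jobs))
```

Here starts.count reflects the net effect of the three calls, and queue_depth.value always reports the current length of the queue.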

A histogram measures the distribution of values in a stream of data, for example, the number of results that are returned by a search. In addition to minimum, maximum, mean, and so forth, it also measures median, 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles.

Traditionally, the way the median or any other quantile is calculated is by sorting the entire data set and determining the value in the middle (or 1% from the end, for the 99th percentile). This works for small data sets, or batch processing systems, but not for high-throughput, low-latency services. The solution for the high-throughput services is to sample the data as it passes through. By maintaining a small, manageable reservoir that is statistically representative of the data stream as a whole, the system can quickly and easily calculate quantiles that are valid approximations of the actual quantiles.

This technique is called reservoir sampling. Teracloud® Streams uses an exponentially decaying reservoir. A histogram with an exponentially decaying reservoir produces quantiles that are representative of approximately the last five minutes of data. It does so by using a forward-decaying priority reservoir with an exponential weighting towards newer data. Unlike the uniform reservoir, an exponentially decaying reservoir represents recent data, enabling you to know very quickly if the distribution of the data has changed. Timers use histograms with exponentially decaying reservoirs by default.
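
The idea of a forward-decaying priority reservoir can be sketched as follows. This is a simplified illustration, not the product's implementation: the class name, the reservoir size, and the decay factor alpha are assumptions, and a production reservoir would also rescale its weights periodically so that they do not overflow. Each sample receives a weight that grows exponentially with its timestamp, the weight is divided by a random number to form a priority, and only the samples with the highest priorities are kept.

```python
import heapq
import math
import random


class DecayingReservoir:
    """Keeps at most `size` samples, favouring recent ones."""

    def __init__(self, size: int, alpha: float, seed: int = 0) -> None:
        self.size = size
        self.alpha = alpha  # assumed decay factor; larger favours newer data more
        self._rng = random.Random(seed)
        self._heap: list[tuple[float, float]] = []  # min-heap of (priority, value)

    def update(self, value: float, timestamp: float) -> None:
        # Forward decay: the weight grows with the time since a fixed
        # landmark (time 0 in this sketch).
        weight = math.exp(self.alpha * timestamp)
        u = 1.0 - self._rng.random()  # uniform in (0, 1], never zero
        priority = weight / u
        if len(self._heap) < self.size:
            heapq.heappush(self._heap, (priority, value))
        elif priority > self._heap[0][0]:
            # Evict the lowest-priority sample in favour of the new one.
            heapq.heapreplace(self._heap, (priority, value))

    def values(self) -> list[float]:
        return sorted(value for _, value in self._heap)


reservoir = DecayingReservoir(size=50, alpha=0.015)
for t in range(100):            # older traffic: every value is 10.0
    reservoir.update(10.0, timestamp=float(t))
for t in range(2000, 2100):     # much later traffic: every value is 500.0
    reservoir.update(500.0, timestamp=float(t))
sample = reservoir.values()
```

Because the later samples carry far larger weights, they displace the older ones, so quantiles computed from the sample describe the recent distribution rather than the whole history.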

The following attributes apply for histograms:
Table 1. Histogram attributes
Attribute  Description
count      The number of values recorded.
min        The lowest value in the snapshot.
max        The highest value in the snapshot.
mean       The arithmetic mean of the values in the snapshot.
median     The median value in the distribution.
stdDev     The standard deviation of the values in the snapshot.
95 pct     The value at the 95th percentile in the distribution.
98 pct     The value at the 98th percentile in the distribution.
99 pct     The value at the 99th percentile in the distribution.
999 pct    The value at the 99.9th percentile in the distribution.
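
As an illustration of how these attributes relate to a snapshot of sampled values, the following Python sketch computes them from a plain list. It makes several assumptions that the documentation does not state: percentiles use the nearest-rank method, stdDev is the population standard deviation, and count is simply the size of the list, whereas a real histogram counts every recorded value even when the snapshot holds only a sample.

```python
import statistics


def percentile(sorted_values: list[float], per_mille: int) -> float:
    """Nearest-rank percentile; per_mille=999 means the 99.9th percentile."""
    n = len(sorted_values)
    rank = max(1, -(-n * per_mille // 1000))  # integer ceiling of n * p / 1000
    return sorted_values[rank - 1]


def snapshot(values: list[float]) -> dict[str, float]:
    """Compute the histogram attributes for a non-empty list of values."""
    s = sorted(values)
    return {
        "count": len(s),
        "min": s[0],
        "max": s[-1],
        "mean": statistics.mean(s),
        "median": statistics.median(s),
        "stdDev": statistics.pstdev(s),
        "95 pct": percentile(s, 950),
        "98 pct": percentile(s, 980),
        "99 pct": percentile(s, 990),
        "999 pct": percentile(s, 999),
    }


stats = snapshot([2, 4, 4, 4, 5, 5, 7, 9])
```

With only eight values, all of the upper percentiles resolve to the largest value; the distinction between them becomes meaningful only for larger snapshots.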

A meter measures the rate at which a set of events occurs. Meters measure the rate of the events in a few different ways. The mean rate is the average rate of events. Although useful, the mean rate represents the total rate for your application’s entire lifetime, so it does not offer a sense of recency. For example, the mean rate might measure the total number of requests handled, divided by the number of seconds the process has been running. However, meters also record three different exponentially-weighted moving average rates: the 1-, 5-, and 15-minute moving averages.
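
The behavior of an exponentially-weighted moving average rate can be sketched in Python. This is one common formulation and not necessarily the one the product uses: the 5-second tick interval, the class name Ewma, and the smoothing formula are assumptions. On each tick, the rate observed during the interval is blended into the running rate, with a smoothing factor derived from the window length.

```python
import math

TICK_SECONDS = 5.0  # assumed interval between rate updates


class Ewma:
    """Exponentially-weighted moving average of an event rate, per second."""

    def __init__(self, window_minutes: float) -> None:
        self.alpha = 1.0 - math.exp(-TICK_SECONDS / (60.0 * window_minutes))
        self.rate = 0.0
        self._uncounted = 0
        self._initialized = False

    def mark(self, n: int = 1) -> None:
        """Record n events since the last tick."""
        self._uncounted += n

    def tick(self) -> None:
        """Fold the events of the last interval into the moving average."""
        instant_rate = self._uncounted / TICK_SECONDS
        self._uncounted = 0
        if self._initialized:
            self.rate += self.alpha * (instant_rate - self.rate)
        else:
            self.rate = instant_rate  # the first interval seeds the average
            self._initialized = True


one_minute, fifteen_minute = Ewma(1), Ewma(15)
for _ in range(12):                 # one minute at 10 events per second
    for m in (one_minute, fifteen_minute):
        m.mark(50)
        m.tick()
for _ in range(12):                 # one minute with no events
    for m in (one_minute, fifteen_minute):
        m.tick()
```

After the idle minute, the 1-minute rate has dropped to about 37% of its earlier value, while the 15-minute rate has fallen only slightly. This is why the shorter windows give a better sense of recency than the mean rate.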

The following attributes apply for meters:
Table 2. Meter attributes
Attribute    Description
count        The number of events that are marked.
15 min rate  The fifteen-minute exponentially-weighted moving average rate at which events have occurred since the meter was created.
5 min rate   The five-minute exponentially-weighted moving average rate at which events have occurred since the meter was created.
1 min rate   The one-minute exponentially-weighted moving average rate at which events have occurred since the meter was created.
mean rate    The mean rate at which events have occurred since the meter was created.

A timer is a histogram of the duration of a type of event and a meter of the rate of its occurrence. A timer measures both the rate that a particular piece of code is called and the distribution of its duration. Elapsed times for events are measured internally in nanoseconds, using Java’s high-precision System.nanoTime() method. Its precision and accuracy vary depending on operating system and hardware.
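
A timer can be approximated in Python with a context manager, as in the following sketch. The Timer class and its measure method are hypothetical, and time.perf_counter_ns() is used here as the Python counterpart of a high-resolution nanosecond clock. Durations are recorded in nanoseconds and converted to milliseconds for display.

```python
import time
from contextlib import contextmanager


class Timer:
    """Records how often a block of code runs and how long each run takes."""

    def __init__(self) -> None:
        self.durations_ns: list[int] = []

    @property
    def count(self) -> int:
        return len(self.durations_ns)

    @contextmanager
    def measure(self):
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            # Record the elapsed time even if the timed block raises.
            self.durations_ns.append(time.perf_counter_ns() - start)

    def durations_ms(self) -> list[float]:
        return [d / 1_000_000 for d in self.durations_ns]


request_timer = Timer()
with request_timer.measure():
    sum(range(10_000))  # stand-in for the work being timed
```

The recorded durations would feed a histogram, and the number of recordings over time would feed a meter, which together yield the timer attributes.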

The following attributes apply for timers:
Table 3. Timer attributes
Attribute    Description
count        The number of events that have been marked.
15 min rate  The fifteen-minute exponentially-weighted moving average rate at which events have occurred since the timer was created.
5 min rate   The five-minute exponentially-weighted moving average rate at which events have occurred since the timer was created.
1 min rate   The one-minute exponentially-weighted moving average rate at which events have occurred since the timer was created.
mean rate    The mean rate at which events have occurred since the timer was created.
min          The lowest value in the snapshot.
max          The highest value in the snapshot.
mean         The arithmetic mean of the values in the snapshot.
median       The median value in the distribution.
stdDev       The standard deviation of the values in the snapshot.
95 pct       The value at the 95th percentile in the distribution.
98 pct       The value at the 98th percentile in the distribution.
99 pct       The value at the 99th percentile in the distribution.
999 pct      The value at the 99.9th percentile in the distribution.

Teracloud® Streams Metrics

The following table shows the metrics that are included in the JSON template for metrics reporting.
Table 4. Teracloud® Streams metrics
Metric Name Type Display by Default Description
domain.aas.counter counter Y The number of times the aas service was started on this node. This counts every time the service is restarted, intentionally or not.

Limit Value: counter = 5

domain.auditlog.counter counter Y The number of times the auditlog service was started on this node. This counts every time the service is restarted, intentionally or not.

Limit Value: counter = 5

domain.jmx.counter counter Y The number of times the jmx service was started on this node. This counts every time the service is restarted, intentionally or not.

Limit Value: counter = 5

domain.sws.counter counter Y The number of times the sws service was started on this node. This counts every time the service is restarted, intentionally or not.

Limit Value: counter = 5

controller.ping.timer timer Y Measures the time for leader to non-leader resource ping requests.

Limit Values:

  • min=500
  • max=1000
  • mean=500
  • median=500
zk.children.timer timer Y Measures the time to perform ZooKeeper get children operations.

Limit Values:

  • min=5
  • max=250
  • mean=10
  • median=10
zk.create.timer timer Y Measures the time to perform ZooKeeper create operations.

Limit Values:

  • min=15
  • max=500
  • mean=20
  • median=20
zk.delete.timer timer Y Measures the time to perform ZooKeeper delete operations.

Limit Values:

  • min=10
  • max=500
  • mean=25
  • median=25
zk.exist.timer timer Y Measures the time to perform ZooKeeper exists operations.

Limit Values:

  • min=5
  • max=250
  • mean=10
  • median=10
zk.read.timer timer Y Measures the time to perform ZooKeeper read operations.

Limit Values:

  • min=5
  • max=250
  • mean=10
  • median=10
zk.transaction.timer timer Y Measures the time to perform ZooKeeper transaction operations.

Limit Values = no warnings

zk.write.timer timer Y Measures the time to perform ZooKeeper write operations.

Limit Values:

  • min=15
  • max=500
  • mean=20
  • median=20
zk.disconnect.timer timer Y Measures the time for ZooKeeper disconnect occurrences.

Limit Values:

  • 1 min rate=1
  • 5 min rate=1
  • 15 min rate=1
sam.job.counter counter Y The number of jobs submitted to SAM since instance start.
sam.pe.counter counter Y The number of PEs created by SAM since instance start.
sam.inputport.counter counter Y The number of PE Input Ports created by SAM since instance start.
sam.outputport.counter counter Y The number of PE Output Ports created by SAM since instance start.
sam.connection.counter counter Y The number of PE Connections created by SAM since instance start.
*.app.counter (note 3) counter N The number of times the app service was started on this node. This counts every time the service is restarted, intentionally or not.
*.sam.counter (note 3) counter N The number of times the sam service was started on this node. This counts every time the service is restarted, intentionally or not.
*.srm.counter (note 3) counter N The number of times the srm service was started on this node. This counts every time the service is restarted, intentionally or not.
*.view.counter (note 3) counter N The number of times the view service was started on this node. This counts every time the service is restarted, intentionally or not.
view.get.data.timer timer N The amount of time it takes the view server to retrieve the data from its internal buffer and return it to the caller.
view.active.view.counter counter N The number of views that are actively buffering data.
rest.request.timer timer N The amount of time it took to service a REST request (this is the published REST API).
sws.get.file.timer timer N The amount of time SWS took to service file downloads (for the HTML, scripts, CSS, and other files that the console needs).
aas.session.counter The number of AAS sessions since the domain started.

Limit Values: counter = 3500

aas.login.timer timer Y Measures the time for authentication with user credentials.
aas.ldap.*.invalidauthreqs.meter (note 1) meter N Measures invalid LDAP authentication.
aas.ldap.*.errorauthreqs.meter (note 1) meter N Measures LDAP authentication errors.
aas.ldap.*.erroranonauthreqs.meter (note 1) meter N Measures anonymous LDAP authentication errors.
aas.ldap.*.invalidsulauthreqs.meter (note 1) meter N Measures invalid secondary user lookup LDAP authentication.
aas.ldap.*.errorsulreqs.meter (note 1) meter N Measures secondary user lookup LDAP errors.
aas.pam.*.invalidauthreqs.meter (note 2) meter N Measures invalid PAM authentication.
aas.pam.*.errorauthreqs.meter (note 2) meter N Measures PAM authentication errors.
aas.ldap.*.commerrors.meter (note 1) meter N Measures LDAP communication errors.
Table notes:
  1. The asterisk (*) displays the metric for each configured LDAP server. When displayed, the asterisk (*) is replaced by a value that is composed of the LDAP hostname and port. The periods within the hostname and the colon are replaced by underscore characters. For example, if the LDAP server URL is ldap://xyz.ibm.com:389, then the value displayed in place of the asterisk (*) is xyz_ibm_com_389.
  2. The asterisk (*) displays the metric for each configured PAM server. When displayed, the asterisk (*) is replaced with the PAM service name.
  3. The asterisk (*) displays the metric for each instance. When displayed, the asterisk (*) is replaced by the instance name.
  4. You can modify the template to replace an asterisk (*) with a specific instance or server. If you do not modify the file to hard code the specific name, the metrics for all instances and servers are displayed.
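
The substitution described in table note 1 can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes that the LDAP URL always includes an explicit port. The host name and port are joined, and every period and colon is replaced by an underscore.

```python
from urllib.parse import urlparse


def ldap_metric_token(url: str) -> str:
    """Build the value that replaces the asterisk for an LDAP server URL."""
    parsed = urlparse(url)
    host_and_port = f"{parsed.hostname}:{parsed.port}"  # assumes a port is present
    return host_and_port.replace(".", "_").replace(":", "_")


def expand(pattern: str, token: str) -> str:
    """Replace the asterisk in a metric name pattern with the token."""
    return pattern.replace("*", token, 1)


token = ldap_metric_token("ldap://xyz.ibm.com:389")
metric_name = expand("aas.ldap.*.commerrors.meter", token)
```

For the example URL in note 1, the token is xyz_ibm_com_389, so the communication errors metric would be displayed as aas.ldap.xyz_ibm_com_389.commerrors.meter.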