Metrics
Teracloud® Streams provides metrics to help evaluate the health of Teracloud® Streams services, to aid in diagnosing performance issues, and to analyze throughput of requests. You can use the streamtool checkinstancemetrics, streamtool checkdomainmetrics, streamtool getdomainmetrics, and streamtool checkresourcemetrics commands to view the metrics data.
Metrics are held only in memory, and collection starts when the services are started. Metrics that are kept in the ZooKeeper server are also held in memory. You can clear the metrics from memory on that server by running the resetzkstat command.
Metrics that are displayed by default are defined in the default template file install-dir/version/etc/cfg/checkMetricsTemplate.json. If you want to make changes to this template file, copy it to a new location. You can affect which metrics are displayed by setting the include element for the metric. You can modify the attributes of the metric to indicate which metrics to check and provide threshold values to check against. After you update the file, specify the file in the new location on the metrics streamtool command by using the --file parameter.
Threshold values can be set for the metrics to give a warning if a value is above the limit that is specified in the template. Values are flagged with ** if they do not meet the thresholds that are established in the template file. Some metrics are specific to an instance. To show these metrics, update the @@instance-name@@ value to the name of the instance.
The data that is time related is shown in milliseconds.
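The threshold check described above can be pictured with a small Python sketch. It is only an illustration of the rule "flag a value with ** when it is above its limit"; the function name and the placement of the ** marker are assumptions, and the sketch does not reproduce the actual streamtool output layout or the template file format.

```python
def flag_value(value, limit=None):
    """Return the value as text, followed by ** if it is above the limit.

    Hypothetical helper: 'limit' stands in for a threshold taken from the
    template file; None means that no threshold is defined for the metric.
    """
    text = str(value)
    if limit is not None and value > limit:
        return text + " **"
    return text


# A restart counter with a limit of 5, as in "Limit Value: counter = 5".
print(flag_value(3, limit=5))
print(flag_value(7, limit=5))
```

Only values strictly above the limit are flagged here, which follows the wording "above the limit" in the text.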
Metric types
- counter
- gauge
- histogram
- meter
- timer
A counter is a simple incrementing and decrementing 64-bit integer. The attribute count is the counter's current value.
A gauge is the simplest metric type and only returns a value. It is an instantaneous measurement: the attribute value returns the metric's current value. For example, a gauge might measure the number of pending jobs in a queue.
A histogram measures the distribution of values in a stream of data, for example, the number of results that are returned by a search. In addition to minimum, maximum, mean, and so forth, it also measures median, 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles.
Traditionally, the way the median or any other quantile is calculated is by sorting the entire data set and determining the value in the middle (or 1% from the end, for the 99th percentile). This works for small data sets, or batch processing systems, but not for high-throughput, low-latency services. The solution for the high-throughput services is to sample the data as it passes through. By maintaining a small, manageable reservoir that is statistically representative of the data stream as a whole, the system can quickly and easily calculate quantiles that are valid approximations of the actual quantiles.
This technique is called reservoir sampling. Teracloud® Streams uses an exponentially decaying reservoir. A histogram with an exponentially decaying reservoir produces quantiles that are representative of approximately the last five minutes of data. It does so by using a forward-decaying priority reservoir with an exponential weighting towards newer data. Unlike the uniform reservoir, an exponentially decaying reservoir represents recent data, enabling you to know very quickly if the distribution of the data has changed. Timers use histograms with exponentially decaying reservoirs by default.
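As a rough illustration of the forward-decaying priority reservoir described above, the following Python sketch keeps a fixed number of samples and favors recent ones. The class name, the reservoir size of 100, and the decay factor of 0.015 per second are assumptions chosen for illustration; the source does not state which parameters Teracloud® Streams uses. The sketch also omits the periodic rescaling of priorities that a long-running implementation would need, and it computes unweighted quantiles over the retained samples.

```python
import heapq
import math
import random


class DecayingReservoir:
    """Forward-decaying priority sample: newer values get larger weights."""

    def __init__(self, size=100, alpha=0.015, seed=None):
        self.size = size      # maximum number of retained samples (assumed)
        self.alpha = alpha    # decay factor per second (assumed)
        self._rng = random.Random(seed)
        self._heap = []       # min-heap of (priority, value) pairs
        self._start = None    # landmark time, set on the first update

    def update(self, value, timestamp):
        """Record a value observed at 'timestamp' (in seconds)."""
        if self._start is None:
            self._start = timestamp
        # The weight grows exponentially with time since the landmark.
        weight = math.exp(self.alpha * (timestamp - self._start))
        # 1 - random() lies in (0, 1], so the division is always defined.
        priority = weight / (1.0 - self._rng.random())
        if len(self._heap) < self.size:
            heapq.heappush(self._heap, (priority, value))
        elif priority > self._heap[0][0]:
            # Evict the lowest-priority sample in favor of the new one.
            heapq.heapreplace(self._heap, (priority, value))

    def quantile(self, q):
        """Approximate the q-quantile (0 <= q <= 1) from the samples."""
        values = sorted(v for _, v in self._heap)
        if not values:
            return 0.0
        index = min(int(q * len(values)), len(values) - 1)
        return values[index]
```

If the distribution of incoming values shifts, the retained samples are soon dominated by values from the new distribution, so the reported median follows the change instead of averaging over the full history.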
Attribute | Description |
---|---|
count | The number of values recorded. |
min | The lowest value in the snapshot. |
max | The highest value in the snapshot. |
mean | The arithmetic mean of the values in the snapshot. |
median | The median value in the distribution. |
stdDev | The standard deviation of the values in the snapshot. |
95 pct | The value at the 95th percentile in the distribution. |
98 pct | The value at the 98th percentile in the distribution. |
99 pct | The value at the 99th percentile in the distribution. |
999 pct | The value at the 99.9th percentile in the distribution. |
A meter measures the rate at which a set of events occur. Meters measure the rate of the events in a few different ways. The mean rate is the average rate of events. Although useful, the mean rate represents the total rate for your application’s entire lifetime, so it does not offer a sense of recency. For example, the mean rate might measure the total number of requests handled, divided by the number of seconds the process has been running. However, meters also record three different exponentially-weighted moving average rates: the 1-, 5-, and 15-minute moving averages.
Attribute | Description |
---|---|
count | The number of events that are marked. |
15 min rate | The fifteen-minute exponentially-weighted moving average rate at which events have occurred since the meter was created. |
5 min rate | The five-minute exponentially-weighted moving average rate at which events have occurred since the meter was created. |
1 min rate | The one-minute exponentially-weighted moving average rate at which events have occurred since the meter was created. |
mean rate | The mean rate at which events have occurred since the meter was created. |
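The exponentially-weighted moving averages in the table above can be sketched in Python as follows. This is a minimal illustration of the general technique, not the product's implementation: the class name, the 5-second tick interval, and the choice to seed the average with the first observed rate are all assumptions.

```python
import math


class Ewma:
    """Exponentially weighted moving average of an event rate (events/second)."""

    def __init__(self, window_seconds, tick_seconds=5.0):
        self.tick_seconds = tick_seconds  # assumed update interval
        # Smoothing factor derived from the averaging window.
        self.alpha = 1.0 - math.exp(-tick_seconds / window_seconds)
        self.rate = None                  # no rate until the first tick
        self._uncounted = 0               # events marked since the last tick

    def mark(self, n=1):
        """Record that n events occurred."""
        self._uncounted += n

    def tick(self):
        """Fold the events of the last interval into the moving average."""
        instant_rate = self._uncounted / self.tick_seconds
        self._uncounted = 0
        if self.rate is None:
            self.rate = instant_rate
        else:
            self.rate += self.alpha * (instant_rate - self.rate)


# One-minute and fifteen-minute averages fed with the same events.
one_min, fifteen_min = Ewma(60.0), Ewma(900.0)
```

After a burst of events stops, the one-minute average falls toward zero much faster than the fifteen-minute average, which is why comparing the three rates gives a sense of recency that the mean rate lacks.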
A timer is a histogram of the duration of a type of event and a meter of the rate of its occurrence. A timer measures both the rate that a particular piece of code is called and the distribution of its duration. Elapsed times for events are measured internally in nanoseconds, using Java’s high-precision System.nanoTime() method. Its precision and accuracy vary depending on operating system and hardware.
Attribute | Description |
---|---|
count | The number of events that have been marked. |
15 min rate | The fifteen-minute exponentially-weighted moving average rate at which events have occurred since the timer was created. |
5 min rate | The five-minute exponentially-weighted moving average rate at which events have occurred since the timer was created. |
1 min rate | The one-minute exponentially-weighted moving average rate at which events have occurred since the timer was created. |
mean rate | The mean rate at which events have occurred since the timer was created. |
min | The lowest value in the snapshot. |
max | The highest value in the snapshot. |
mean | The arithmetic mean of the values in the snapshot. |
median | The median value in the distribution. |
stdDev | The standard deviation of the values in the snapshot. |
95 pct | The value at the 95th percentile in the distribution. |
98 pct | The value at the 98th percentile in the distribution. |
99 pct | The value at the 99th percentile in the distribution. |
999 pct | The value at the 99.9th percentile in the distribution. |
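To make the timer concept concrete, here is a small Python sketch that measures elapsed time in nanoseconds and reports it in milliseconds. It uses Python's time.perf_counter_ns() as a stand-in for Java's System.nanoTime(); the class and method names are hypothetical, and the sketch keeps every duration in a list rather than sampling into a reservoir.

```python
import time
from contextlib import contextmanager


class SimpleTimer:
    """Counts events and records how long each one took."""

    def __init__(self):
        self.count = 0           # number of events that have been timed
        self.durations_ns = []   # elapsed time of each event, in nanoseconds

    @contextmanager
    def measure(self):
        """Time the body of a 'with' block, even if it raises."""
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            self.durations_ns.append(time.perf_counter_ns() - start)
            self.count += 1

    def max_ms(self):
        """Longest recorded duration, converted to milliseconds."""
        if not self.durations_ns:
            return 0.0
        return max(self.durations_ns) / 1_000_000


timer = SimpleTimer()
with timer.measure():
    sum(range(10_000))   # the piece of code being timed
```

A fuller timer would feed each duration into a histogram for the percentile attributes and mark a meter for the rate attributes.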
Teracloud® Streams Metrics
Metric Name | Type | Display by Default | Description |
---|---|---|---|
domain.aas.counter | counter | Y | The number of times the aas service was started on this node. This counts any time the service is restarted, intentionally or not. Limit Value: counter = 5 |
domain.auditlog.counter | counter | Y | The number of times the auditlog service was started on this node. This counts any time the service is restarted, intentionally or not. Limit Value: counter = 5 |
domain.jmx.counter | counter | Y | The number of times the jmx service was started on this node. This counts any time the service is restarted, intentionally or not. Limit Value: counter = 5 |
domain.sws.counter | counter | Y | The number of times the sws service was started on this node. This counts any time the service is restarted, intentionally or not. Limit Value: counter = 5 |
controller.ping.timer | timer | Y | Measures the time for leader to non-leader resource ping requests. Limit Values: |
zk.children.timer | timer | Y | Measures the time to perform ZooKeeper get children operations. Limit Values: |
zk.create.timer | timer | Y | Measures the time to perform ZooKeeper create operations. Limit Values: |
zk.delete.timer | timer | Y | Measures the time to perform ZooKeeper delete operations. Limit Values: |
zk.exist.timer | timer | Y | Measures the time to perform ZooKeeper exists operations. Limit Values: |
zk.read.timer | timer | Y | Measures the time to perform ZooKeeper read operations. Limit Values: |
zk.transaction.timer | timer | Y | Measures the time to perform ZooKeeper transaction operations. Limit Values: no warnings |
zk.write.timer | timer | Y | Measures the time to perform ZooKeeper write operations. Limit Values: |
zk.disconnect.timer | timer | Y | Measures the time for ZooKeeper disconnect occurrences. Limit Values: |
sam.job.counter | counter | Y | The number of jobs submitted to SAM since instance start. |
sam.pe.counter | counter | Y | The number of PEs created by SAM since instance start. |
sam.inputport.counter | counter | Y | The number of PE Input Ports created by SAM since instance start. |
sam.outputport.counter | counter | Y | The number of PE Output Ports created by SAM since instance start. |
sam.connection.counter | counter | Y | The number of PE Connections created by SAM since instance start. |
*.app.counter³ | counter | N | The number of times the app service was started on this node. This counts any time the service is restarted, intentionally or not. |
*.sam.counter³ | counter | N | The number of times the sam service was started on this node. This counts any time the service is restarted, intentionally or not. |
*.srm.counter³ | counter | N | The number of times the srm service was started on this node. This counts any time the service is restarted, intentionally or not. |
*.view.counter³ | counter | N | The number of times the view service was started on this node. This counts any time the service is restarted, intentionally or not. |
view.get.data.timer | timer | N | The amount of time it takes the view server to retrieve the data from its internal buffer and return it to the caller. |
view.active.view.counter | counter | N | The number of views that are actively buffering data. |
rest.request.timer | timer | N | The amount of time it took to service a REST request (this is the published REST API). |
sws.get.file.timer | timer | N | The amount of time SWS took to service file downloads (for the HTML, scripts, CSS, and other files that the console needs). |
aas.session.counter | counter | | The number of AAS sessions since the domain started. Limit Value: counter = 3500 |
aas.login.timer | timer | Y | Measures the time for authentication with user credentials. |
aas.ldap.*.invalidauthreqs.meter¹ | meter | N | Measures invalid LDAP authentication. |
aas.ldap.*.errorauthreqs.meter¹ | meter | N | Measures LDAP authentication errors. |
aas.ldap.*.erroranonauthreqs.meter¹ | meter | N | Measures anonymous LDAP authentication errors. |
aas.ldap.*.invalidsulauthreqs.meter¹ | meter | N | Measures invalid secondary user lookup LDAP authentication. |
aas.ldap.*.errorsulreqs.meter¹ | meter | N | Measures secondary user lookup LDAP errors. |
aas.pam.*.invalidauthreqs.meter² | meter | N | Measures invalid PAM authentication. |
aas.pam.*.errorauthreqs.meter² | meter | N | Measures PAM authentication errors. |
aas.ldap.*.commerrors.meter¹ | meter | N | Measures LDAP communication errors. |
1. The asterisk (*) displays the metric for each configured LDAP server. When displayed, the asterisk (*) is replaced by a value that is composed of the LDAP hostname and port. The periods within the hostname and the colon are replaced by underscore characters. For example, if the LDAP server URL is ldap://xyz.ibm.com:389, then the value displayed in place of the asterisk (*) is xyz_ibm_com_389.
2. The asterisk (*) displays the metric for each configured PAM server. When displayed, the asterisk (*) is replaced with the PAM service name.
3. The asterisk (*) displays the metric for each instance. When displayed, the asterisk (*) is replaced by the instance name.
You can modify the template to replace an asterisk (*) with a specific instance or server. If you do not modify the file to hard code the specific name, the metrics for all instances and servers are displayed.
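The LDAP substitution rule described in the notes above can be expressed as a short Python sketch. The function name is hypothetical, and the sketch assumes that the LDAP server URL always includes an explicit port, as in the example from the text.

```python
from urllib.parse import urlsplit


def ldap_metric_token(url):
    """Build the value that replaces the asterisk in an LDAP metric name.

    Periods in the hostname and the colon before the port are replaced
    by underscore characters.
    """
    parts = urlsplit(url)
    if parts.hostname is None or parts.port is None:
        raise ValueError("expected a URL with a hostname and an explicit port")
    host_and_port = f"{parts.hostname}:{parts.port}"
    return host_and_port.replace(".", "_").replace(":", "_")


token = ldap_metric_token("ldap://xyz.ibm.com:389")
print(token)
print("aas.ldap.*.commerrors.meter".replace("*", token))
```

The same replacement of the asterisk applies to the PAM and instance metrics, except that the PAM service name or the instance name is used directly.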