PingDirectory

Working with alarms, alerts, and gauges

Alarms, alerts, and gauges notify administrators of changes in server conditions that might require attention.

Alarms

An alarm represents a stateful condition of the server or a resource that might indicate a problem, such as low disk space or external server unavailability.

Each alarm has a severity, a name, and a message. An alarm always has a Condition property and can also have a Specific Problem or Resource property. When an alarm is surfaced through SNMP, the Probable Cause and Alarm Type properties are also listed. You can configure alarms to generate alerts when the alarm’s severity changes.

You can configure the Alarm Manager, which governs the actions performed when an alarm state is entered, through the dsconfig tool and the administrative console. A complete list of system alerts, alarms, and their severity is available in <server-root>/docs/admin-alerts-list.csv.

The server complies with the International Telecommunication Union CCITT Recommendation X.733 (1992) standard for generating and clearing alarms. If configured, entering or exiting an alarm state might result in one or more alerts.

An alarm state is exited when the condition no longer applies. The system generates an alarm_cleared alert when an alarm’s severity changes from a non-normal severity to any other severity. An alarm_cleared alert correlates to a previous alarm when the Condition and Resource properties are the same. The Condition corresponds to the Summary column in the admin-alerts-list.csv file.
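To illustrate the correlation rule, the following sketch replays a list of alarm events and reports which alarms remain active, matching each cleared event to an earlier alarm by its Condition and Resource pair. The pipe-delimited event format and the function name are assumptions made for this example; they are not a format produced by the server.

```shell
# Hypothetical event format: STATE|CONDITION|RESOURCE
# STATE is "raised" or "cleared". Prints the alarms that are still active.
active_alarms() {
  awk -F'|' '
    { key = $2 "|" $3 }                 # correlate on Condition + Resource
    $1 == "raised"  { active[key] = 1 }
    $1 == "cleared" { delete active[key] }
    END { for (k in active) print k }
  ' | sort
}

printf '%s\n' \
  'raised|CPU Usage (Percent)|Host System' \
  'raised|Work Queue Size (Number of Requests)|Work Queue' \
  'cleared|CPU Usage (Percent)|Host System' | active_alarms
# prints: Work Queue Size (Number of Requests)|Work Queue
```

A cleared event with the same Condition but a different Resource would not match, so the original alarm would remain active.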

Like the Alerts Backend, which stores information in cn=alerts, the Alarm Backend stores information within the cn=alarms backend. Unlike alerts, alarms have a state over time that can change in severity and be cleared when a monitored value returns to normal. You can view alarms with the status tool. As with other alert types, you can configure alert handlers to manage the alerts generated by alarms.

Alerts

The server supports two alert types: standard and alarm-specific.

The server constantly monitors for conditions that might need administrator attention, such as low disk space. For this condition, the standard alert is low-disk-space-warning, and the alarm-specific alert is alarm-warning.

You can configure the server to generate alarm-specific alerts as well as standard alerts. By default, standard alerts are generated for conditions internally monitored by the server. However, gauges can only generate alarm-specific alerts.

Gauges

A gauge defines a set of threshold values with a specified severity that, when crossed, cause the server to enter or exit an alarm state.

Numeric gauges monitor continuous values like CPU load or free disk space. Indicator gauges monitor an enumerated set of values such as 'server available' or 'server unavailable'. Gauges generate alarms when the gauge’s severity changes due to changes in the monitored value.
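As a rough illustration of how a numeric gauge might map a monitored value to a severity, the following sketch compares a value against ascending thresholds. The function name, the threshold values, and the assumption that higher values are worse are all hypothetical; the server's own evaluation logic is not described here.

```shell
# Hypothetical model of a numeric gauge: map a value to a severity by
# comparing it against ascending thresholds (higher value = worse).
# Usage: gauge_severity VALUE WARNING MINOR MAJOR CRITICAL
gauge_severity() {
  awk -v v="$1" -v w="$2" -v mi="$3" -v ma="$4" -v c="$5" 'BEGIN {
    if      (v >= c)  print "critical"
    else if (v >= ma) print "major"
    else if (v >= mi) print "minor"
    else if (v >= w)  print "warning"
    else              print "normal"
  }'
}

# Example thresholds (illustrative only, not product defaults).
gauge_severity 18.58 70 80 90 95   # prints "normal"
gauge_severity 92 70 80 90 95      # prints "major"
```

In this model, an alarm would be raised or cleared only when two successive evaluations return different severities.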

The server installs a set of gauges that are specific to the product and that can be cloned or configured through the dsconfig tool. You can tailor existing gauges to fit each environment by adjusting the update interval and threshold values. Configuration of system gauges determines the criteria by which alarms are triggered.
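For example, a threshold on an existing gauge might be adjusted as follows. The set-gauge-prop subcommand and the warning-value property both appear in the sample status output later in this section; the value 10 is a placeholder rather than a recommended setting, and connection arguments are omitted.

```shell
# Illustrative only: change the warning threshold on a system gauge.
# The value 10 is a placeholder, not a recommended setting.
dsconfig set-gauge-prop \
  --gauge-name "Cleaner Backlog (Number Of Files)" \
  --set warning-value:10
```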

Use the Stats Logger to view historical information about the value and severity of all system gauges. For more information, see Profiling server performance using the Stats Logger.

Testing alarms and alerts

Steps

  1. Use dsconfig to configure a gauge and set the override-severity property to critical.

    The following example configures the CPU Usage (Percent) gauge.

    Example:

    $ dsconfig set-gauge-prop \
      --gauge-name "CPU Usage (Percent)" \
      --set override-severity:critical

  2. Run the status tool to verify that an alarm was generated with corresponding alerts.

    The status tool provides a summary of the server’s current state with key metrics and a list of recent alerts and alarms.

    Example:

    The sample output has been shortened to show just the alarms and alerts information.

    $ bin/status
    
                            --- Administrative Alerts ---
     Severity : Time            : Message
     ---------:-----------------:------------------------------------------------------
     Info     : 11/Aug/2014     : A configuration change has been made in the
              : 15:48:46 -0500  : Directory Server:
              :                 : [11/Aug/2014:15:48:46.054 -0500]
              :                 : conn=17 op=73 dn='cn=Directory Manager,cn=Root
              :                 : DNs,cn=config' authtype=[Simple] from=127.0.0.1
              :                 : to=127.0.0.1 command='dsconfig set-gauge-prop
              :                 :  --gauge-name 'Cleaner Backlog (Number Of Files)'
              :                 : --set warning-value:-1'
     Info     : 11/Aug/2014     : A configuration change has been made in the
              :  15:47:32 -0500 : Directory Server: [11/Aug/2014:15:47:32.547 -0500]
              :                 : conn=4 op=196 dn='cn=Directory Manager,cn=Root
              :                 : DNs,cn=config' authtype=[Simple] from=127.0.0.1
              :                 : to=127.0.0.1 command='dsconfig set-gauge-prop
              :                 : --gauge-name 'Cleaner Backlog (Number Of Files)'
              :                 :  --set warning-value:0'
     Error    : 11/Aug/2014     : Alarm [CPU Usage (Percent). Gauge CPU Usage (Percent)
              :  15:41:00 -0500 : for Host System has
              :                 : a current value of '18.583333333333332'.
              :                 : The severity is currently OVERRIDDEN in the
              :                 : Gauge's configuration to 'CRITICAL'.
              :                 : The actual severity is: The severity is
              :                 : currently 'NORMAL', having assumed this severity
              :                 : Mon Aug 11 15:41:00 CDT 2014. If CPU use is high,
              :                 : check the server's current workload and make any
              :                 : needed adjustments. Reducing the load on the system
              :                 : will lead to better response times.
              :                 : Resource='Host System']
              :                 : raised with critical severity
    Shown are alerts of severity [Info,Warning,Error,Fatal] from the past 48 hours
     Use the --maxAlerts and/or --alertSeverity options to filter this list
                             --- Alarms ---
     Severity : Severity Start : Condition : Resource    : Details
              : Time           :           :             :
     ---------:----------------:-----------:-------------:------------------------------
     Critical : 11/Aug/2014    : CPU Usage : Host System : Gauge CPU Usage (Percent) for
              : 15:41:00 -0500 : (Percent) :             : Host System
              :                :           :             : has a current value of
              :                :           :             : '18.785714285714285'.
              :                :           :             : The severity is currently
              :                :           :             : 'CRITICAL', having assumed
              :                :           :             : this severity Mon Aug 11
              :                :           :             : 15:49:00 CDT 2014. If CPU use
              :                :           :             : is high, check the server's
              :                :           :             : current workload and make any
              :                :           :             : needed adjustments. Reducing
              :                :           :             : the load on the system will
              :                :           :             : lead to better response times
     Warning  : 11/Aug/2014    : Work Queue: Work Queue  : Gauge Work Queue Size (Number
              : 15:39:40 -0500 : Size      :             : of Requests) for Work Queue
              :                : (Number of:             : has a current value of '27'.
              :                : Requests) :             : The severity is currently
              :                :           :             : 'WARNING' having assumed this
              :                :           :             : severity Mon Aug 11 15:48:50
              :                :           :             : CDT 2014. If all worker
              :                :           :             : threads are busy processing
              :                :           :             : other client requests, then
              :                :           :             : new requests that arrive will
              :                :           :             : be forced to wait in the work
              :                :           :             : queue until a worker thread
              :                :           :             : becomes available
    Shown are alarms of severity [Warning,Minor,Major,Critical]
    Use the --alarmSeverity option to filter this list
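After the test, the override would normally be removed so that the gauge reports its actual severity again. The following is a sketch that assumes dsconfig's --reset option applies to override-severity as it does to other properties; connection arguments are omitted. The --alarmSeverity option is the one named in the status output above.

```shell
# Remove the test override so the gauge reports its actual severity.
dsconfig set-gauge-prop \
  --gauge-name "CPU Usage (Percent)" \
  --reset override-severity

# List only critical alarms to confirm the test alarm is gone.
bin/status --alarmSeverity=critical
```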

Indeterminate alarms

The server raises an indeterminate alarm when it cannot determine the severity of a server condition.

In most cases these alarms are benign and do not issue alerts, nor do they appear in the output of the status tool or Administrative Console by default.

These alarms are usually caused by an enabled gauge that is intended to measure an aspect of the server that is not currently enabled. For example, gauges intended to monitor metrics related to replication might produce indeterminate alarms if a server is not currently replicating data. The gauge can be disabled if needed.
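If the alarm is not useful in a given deployment, disabling the gauge might look like the following. This is a sketch: the enabled property name is an assumption and should be checked against the gauge's configuration reference. Connection arguments are omitted.

```shell
# Assumption: gauges expose a boolean "enabled" property.
dsconfig set-gauge-prop \
  --gauge-name "Replication Latency (Milliseconds)" \
  --set enabled:false
```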

For more information about indeterminate alarms, view the gauge’s associated monitor entry. There might be messages that can help determine the issue.

The following is sample output from the status tool run with the --alarmSeverity=indeterminate option.

                        --- Alarms ---
Severity     : Severity Start : Condition      : Resource   : Details
             : Time           :                :            :
-------------:----------------:----------------:------------:------------------------
Normal       : 26/Aug/2014    : Startup Begun  : cn=config  : The Directory Server
             : 14:16:29 -0500 :                :            : is starting.
             :                :                :            :
Indeterminate: 26/Aug/2014    : Replication    : not        : The value of gauge
             : 14:16:40 -0500 : Latency        : available  : Replication Latency
             :                : (Milliseconds) :            : (Milliseconds) could not
             :                :                :            : be determined. The
             :                :                :            : severity is INDETERMINATE,
             :                :                :            : having assumed this
             :                :                :            : severity Tue Aug 26
             :                :                :            : 14:17:10 CDT 2014.

The alarm shown above is an indeterminate alarm for the Replication Latency (Milliseconds) gauge. Searching the monitor backend for this gauge’s entry returns an error-message value that might explain the indeterminate severity.

$ ldapsearch -w password --baseDN "cn=monitor" \
  -D "cn=directory manager" gauge-name="Replication Latency (Milliseconds)"

dn: cn=Gauge Replication Latency (Milliseconds),cn=monitor
objectClass: top
objectClass: ds-monitor-entry
objectClass: ds-numeric-gauge-monitor-entry
objectClass: ds-gauge-monitor-entry
objectClass: extensibleObject
cn:          Gauge Replication Latency (Milliseconds)
gauge-name:  Replication Latency (Milliseconds)
resource:
severity:    indeterminate
summary:     The value of gauge Replication Latency (Milliseconds) could not
             be determined. The severity is INDETERMINATE, having assumed
             this severity Tue Aug 26 15:42:40 CDT 2014
error-message: No entries were found under cn=monitor having object
               class ds-replica-monitor-entry
              …
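When checking several gauges, it can help to pull only the error-message value out of the search result. The following sketch reads LDIF on standard input, joins folded continuation lines (lines that begin with a single space), and prints the value of one attribute. The function name is hypothetical, and base64-encoded values (written as attribute:: value) are not handled.

```shell
# Read LDIF on stdin and print the value of one attribute, joining
# folded continuation lines (a leading space continues the previous line).
# Usage: ldif_attr ATTRIBUTE_NAME
ldif_attr() {
  awk -v attr="$1" '
    /^ /  { if (n) lines[n] = lines[n] substr($0, 2); next }  # unfold
          { lines[++n] = $0 }
    END {
      prefix = attr ": "
      for (i = 1; i <= n; i++)
        if (index(lines[i], prefix) == 1)
          print substr(lines[i], length(prefix) + 1)
    }'
}

# Sample input modeled on the entry above.
printf '%s\n' \
  'dn: cn=Gauge Replication Latency (Milliseconds),cn=monitor' \
  'severity: indeterminate' \
  'error-message: No entries were found under cn=monitor having object' \
  '  class ds-replica-monitor-entry' | ldif_attr error-message
# prints: No entries were found under cn=monitor having object class ds-replica-monitor-entry
```

In practice, the output of the ldapsearch command shown earlier would be piped into the function in place of the sample input.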