Monitor services
The following sections describe how to set up and maintain monitoring in your deployment, to ensure appropriate performance and service availability.
Monitor the Prometheus endpoint
Java Agent automatically exposes a monitoring endpoint where Prometheus can scrape metrics, in a standard Prometheus format.
For information about installing and running Prometheus, refer to the Prometheus documentation.
By default, no special setup or configuration is required to access metrics at this endpoint. The following example queries the Prometheus endpoint for a route.
Tools such as Grafana are available to create customized charts and graphs based on the information collected by Prometheus. For more information on installing and running Grafana, refer to the Grafana website.
Prometheus performance metrics are provided by an endpoint configured in the protected web application’s web.xml file. The endpoint must be accessible to the Prometheus server that uses the performance data.
When the example in Expose endpoints for Prometheus and Common REST is configured, the Prometheus endpoint is available at https://mydomain.example.com/myapp/metrics/prometheus.
Monitor the Common REST monitoring endpoint
Common REST performance metrics are provided by an endpoint configured in the protected web application’s web.xml file. The endpoint must be accessible to the REST client that uses the performance data.
When the example in Expose endpoints for Prometheus and Common REST is configured, the Common REST performance monitoring endpoint is available at https://mydomain.example.com/myapp/metrics/crest.
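As a rough illustration, a monitoring client could build and poll the two endpoint URLs as sketched below. The base URL is the example value from this page. The `endpoint_url` and `fetch` helpers are hypothetical names introduced here, and the sketch does not handle any credentials or query parameters that your deployment may require.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Example base URL from this page; replace it with your deployment's value.
BASE_URL = "https://mydomain.example.com/myapp/metrics/"


def endpoint_url(kind, base_url=BASE_URL):
    """Return the monitoring URL for the 'prometheus' or 'crest' endpoint."""
    if kind not in ("prometheus", "crest"):
        raise ValueError(f"unknown endpoint kind: {kind}")
    return urljoin(base_url, kind)


def fetch(kind, timeout=5.0):
    """Fetch the raw response body from an endpoint (hypothetical helper)."""
    with urlopen(endpoint_url(kind), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(endpoint_url("prometheus"))
print(endpoint_url("crest"))
```

Calling `fetch("prometheus")` only succeeds when the endpoint is exposed and reachable from the client.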
Expose endpoints for Prometheus and Common REST
Use the following procedure to expose endpoints for Prometheus or Common REST.
- For each protected web application that is to expose metrics, edit the web application’s web.xml file.
  The following Tomcat example exposes a base endpoint named /metrics:
    <servlet>
      <servlet-name>AgentMonitoring</servlet-name>
      <servlet-class>org.forgerock.http.servlet.HttpFrameworkServlet</servlet-class>
      <init-param>
        <param-name>application-loader</param-name>
        <param-value>guice</param-value>
      </init-param>
    </servlet>
    <servlet-mapping>
      <servlet-name>AgentMonitoring</servlet-name>
      <url-pattern>/metrics/*</url-pattern>
    </servlet-mapping>
  Choose a name for the exposed base endpoint that does not conflict with any of the built-in agent endpoints, such as /sunwCDSSORedirectURI.
- Allow access to the monitoring endpoint that is protected by the agent, in one of the following ways:
  - Configure Not-Enforced URIs to create a not-enforced URI rule for the base endpoint.
    The following example rule allows access to the metrics base endpoint:
    */metrics/*
  - Configure Not-Enforced URIs, Not-Enforced Client IP List, and Not-Enforced Compound Rule Separator to create a compound not-enforced rule for the base endpoint.
    The rule allows access from only the IP addresses of the REST clients or Prometheus server.
    The following example rule allows access to the /metrics endpoint for HTTP requests that come from the IP address range from 192.168.1.1 to 192.168.1.3:
    192.168.1.1-192.168.1.3 | */metrics/*
    HTTP requests from other IP addresses are not able to access the metrics base endpoint.
  - Create an authorization policy in AM to restrict access to the metrics base endpoint.
    Note that the metrics base endpoint does not require login credentials. You can use a policy to ensure that requests to the endpoints are authenticated against the AM instance.
    For more information, refer to Policies in AM’s Authorization guide.
- If the monitoring endpoint is protected by AM policies, include the required credentials in requests to the endpoint.
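To make the intent of the compound rule in the procedure above concrete, the following sketch checks a request against the example IP range and URL pattern. It is an illustration only: the agent's own wildcard and range matching may behave differently, and `in_ip_range` and `is_not_enforced` are hypothetical names.

```python
import ipaddress
from fnmatch import fnmatchcase

# Values taken from the example rule: 192.168.1.1-192.168.1.3 | */metrics/*
FIRST_IP = "192.168.1.1"
LAST_IP = "192.168.1.3"
URL_PATTERN = "*/metrics/*"


def in_ip_range(client_ip, first=FIRST_IP, last=LAST_IP):
    """Return True if client_ip lies in the inclusive range first..last."""
    ip = ipaddress.ip_address(client_ip)
    return ipaddress.ip_address(first) <= ip <= ipaddress.ip_address(last)


def is_not_enforced(client_ip, url):
    """Both parts of the compound rule must match (assumed semantics)."""
    return in_ip_range(client_ip) and fnmatchcase(url, URL_PATTERN)


url = "https://mydomain.example.com/myapp/metrics/prometheus"
print(is_not_enforced("192.168.1.2", url))  # inside the range
print(is_not_enforced("192.168.1.4", url))  # outside the range
```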
Write metrics to CSV files
Configure Export Monitoring Metrics to CSV to write metric information to CSV files.
Summary of metric types
Timer fields
Common REST fields
Field | Description |
---|---|
_id | Metric ID. |
_type | Metric type. |
count | Number of events recorded for this metric. |
total | Sum of the durations recorded for this metric. |
min | Minimum duration recorded for this metric. |
max | Maximum duration recorded for this metric. |
mean | Average duration recorded for this metric. |
stddev | Standard deviation of durations recorded for this metric. |
duration_units | Units used for measuring the durations in the metric. |
p50 | 50% of the durations recorded are at or below this value. |
p75 | 75% of the durations recorded are at or below this value. |
p95 | 95% of the durations recorded are at or below this value. |
p98 | 98% of the durations recorded are at or below this value. |
p99 | 99% of the durations recorded are at or below this value. |
p999 | 99.9% of the durations recorded are at or below this value. |
m1_rate | One-minute average rate. |
m5_rate | Five-minute average rate. |
m15_rate | Fifteen-minute average rate. |
mean_rate | Average rate. |
rate_units | Units used for measuring the rate of the metric. |
Duration-based values, such as min, max, and p50, are weighted towards newer data. By representing approximately the last five minutes of data, the timers make it easier to see recent changes in behavior, rather than a uniform average of recordings since the server was started.
The following is an example of the requests.granted.not-enforced metric from the Common REST endpoint:
{
"_id" : "requests.granted.not-enforced",
"_type" : "timer",
"count" : 486,
"total" : 80.0,
"min" : 0.0,
"max" : 1.0,
"mean" : 0.1905615495053855,
"stddev" : 0.39274399467782056,
"duration_units" : "milliseconds",
"p50" : 0.0,
"p75" : 0.0,
"p95" : 1.0,
"p98" : 1.0,
"p99" : 1.0,
"p999" : 1.0,
"m1_rate" : 0.1819109974890356,
"m5_rate" : 0.05433445522996721,
"m15_rate" : 0.03155662103953588,
"mean_rate" : 0.020858521722211427,
"rate_units" : "calls/second"
}
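A client could read such a response as sketched below. The sample is an abbreviated copy of the example above, and `uniform_mean` is a hypothetical helper. Dividing total by count gives about 0.165 milliseconds, whereas the reported mean is about 0.191. This difference is consistent with duration-based values being weighted towards newer data, although the sketch does not prove that this is the cause.

```python
import json

# Abbreviated copy of the example response shown above.
SAMPLE = """
{
  "_id": "requests.granted.not-enforced",
  "_type": "timer",
  "count": 486,
  "total": 80.0,
  "mean": 0.1905615495053855,
  "duration_units": "milliseconds",
  "p95": 1.0
}
"""


def uniform_mean(timer):
    """Plain average over all recordings: total divided by count."""
    return timer["total"] / timer["count"] if timer["count"] else 0.0


timer = json.loads(SAMPLE)
print(
    f"{timer['_id']}: reported mean={timer['mean']:.4f}, "
    f"uniform mean={uniform_mean(timer):.4f} {timer['duration_units']}"
)
```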
Prometheus fields
The Prometheus endpoint does not provide rate-based statistics, as rates can be calculated from the time-series data.
Field | Description |
---|---|
# TYPE | Metric ID, and type. |
_count | Number of events recorded. |
_total | Sum of the durations recorded. |
{quantile="0.5"} | 50% of the durations are at or below this value. |
{quantile="0.75"} | 75% of the durations are at or below this value. |
{quantile="0.95"} | 95% of the durations are at or below this value. |
{quantile="0.98"} | 98% of the durations are at or below this value. |
{quantile="0.99"} | 99% of the durations are at or below this value. |
{quantile="0.999"} | 99.9% of the durations are at or below this value. |
Duration-based quantile values are weighted towards newer data. By representing approximately the last five minutes of data, the timers make it easier to see recent changes in behavior, rather than a uniform average of recordings since the server was started.
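Because the Prometheus endpoint omits rate-based statistics, a rate has to be derived from successive samples of a count. In Prometheus itself, the PromQL rate() function is the usual tool for this. The sketch below shows the underlying arithmetic for two scrapes; the function name, sample values, and the simple treatment of counter resets are assumptions made for illustration.

```python
def rate_per_second(count_then, count_now, seconds_elapsed):
    """Average events per second between two scrapes of a count series."""
    if seconds_elapsed <= 0:
        raise ValueError("seconds_elapsed must be positive")
    if count_now < count_then:
        # The count went backwards, for example after a restart.
        # Assume it restarted from zero.
        return count_now / seconds_elapsed
    return (count_now - count_then) / seconds_elapsed


# Hypothetical samples taken 60 seconds apart.
print(rate_per_second(7.0, 37.0, 60.0))
```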
The following is an example of the ja_requests{access=granted,decision=allowed-by-policy} metric from the Prometheus endpoint:
ja_requests_seconds{access="granted",decision="allowed-by-policy",quantile="0.5",} 0.013000000000000001
ja_requests_seconds{access="granted",decision="allowed-by-policy",quantile="0.75",} 0.022000000000000002
ja_requests_seconds{access="granted",decision="allowed-by-policy",quantile="0.95",} 0.022000000000000002
ja_requests_seconds{access="granted",decision="allowed-by-policy",quantile="0.98",} 0.022000000000000002
ja_requests_seconds{access="granted",decision="allowed-by-policy",quantile="0.99",} 0.022000000000000002
ja_requests_seconds{access="granted",decision="allowed-by-policy",quantile="0.999",} 1.1380000000000001
ja_requests_count{access="granted",decision="allowed-by-policy",} 7.0
ja_requests_seconds_total{access="granted",decision="allowed-by-policy",} 1.21
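The output above can be read with a few lines of code. The following minimal sketch parses sample lines of this shape, including the trailing comma inside the braces, and derives the average duration from the sum and the count. It is not a complete parser for the Prometheus text format (for example, it does not handle escaped quotes or braces inside label values), and `parse_samples` is a hypothetical helper.

```python
import re

SAMPLE = """\
ja_requests_seconds{access="granted",decision="allowed-by-policy",quantile="0.5",} 0.013000000000000001
ja_requests_seconds{access="granted",decision="allowed-by-policy",quantile="0.999",} 1.1380000000000001
ja_requests_count{access="granted",decision="allowed-by-policy",} 7.0
ja_requests_seconds_total{access="granted",decision="allowed-by-policy",} 1.21
"""

LINE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_samples(SAMPLE)
count = next(v for n, _, v in samples if n == "ja_requests_count")
total = next(v for n, _, v in samples if n == "ja_requests_seconds_total")
p50 = next(v for _, labels, v in samples if labels.get("quantile") == "0.5")
print(f"events={count}, average={total / count:.4f}s, median={p50:.3f}s")
```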
Gauge fields
Common REST fields
Metric for a numerical value that can increase or decrease. The value for a gauge is calculated when requested, and represents the state of the metric at that specific time.
Field | Description |
---|---|
_id | Metric ID. |
_type | Metric type. |
value | Current value of the metric. |
The following is an example of the jvm.used-memory metric from the Common REST endpoint:
{
"_id" : "jvm.used-memory",
"_type" : "gauge",
"value" : 2.13385216E9
}
Prometheus fields
Field | Description |
---|---|
# TYPE | Metric ID, and type. Formatted as a comment. |
<metric-id> | Current value. Large values may be represented in scientific E-notation. |
The following is an example of the ja_jvm_used_memory_bytes metric from the Prometheus endpoint:
# TYPE ja_jvm_used_memory_bytes gauge
ja_jvm_used_memory_bytes 1.418723328E9
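Values in E-notation can be converted directly to floating-point numbers. The following sketch, which uses a hypothetical `parse_gauge` helper, reads the type comment and the value from the example above and converts the byte count to mebibytes.

```python
SAMPLE = """\
# TYPE ja_jvm_used_memory_bytes gauge
ja_jvm_used_memory_bytes 1.418723328E9
"""


def parse_gauge(text):
    """Return (name, declared_type, value) for the first sample line."""
    declared = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            # Comment line of the form: # TYPE <name> <type>
            declared[parts[2]] = parts[3]
        elif len(parts) == 2 and not parts[0].startswith("#"):
            name, raw_value = parts
            return name, declared.get(name), float(raw_value)
    raise ValueError("no sample line found")


name, metric_type, value = parse_gauge(SAMPLE)
print(f"{name} ({metric_type}): {value / 2**20:.0f} MiB")
```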
Distinct counter
Metric providing an estimate of the number of unique values recorded.
For example, this could be used to estimate the number of unique users who have authenticated, or unique client IP addresses.
The DistinctCounter metric is calculated per instance of AM, and cannot be aggregated across multiple instances to get a site-wide view.
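The reason that per-instance counts of unique values cannot simply be added together is that the same value may be recorded by more than one instance. The following sketch uses exact sets and made-up user names to show the effect; the real metric is an estimate and does not expose the underlying values.

```python
# Hypothetical unique users recorded by two separate instances.
instance_a = {"alice", "bob"}
instance_b = {"bob", "carol"}

# Adding the per-instance counts counts "bob" twice.
summed = len(instance_a) + len(instance_b)

# The true site-wide figure needs the union of the underlying values,
# which the per-instance metric does not provide.
site_wide = len(instance_a | instance_b)

print(f"sum of per-instance counts: {summed}, true unique values: {site_wide}")
```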
Common REST fields
Field | Description |
---|---|
_id | Metric ID. |
_type | Metric type. Note that the DistinctCounter type is reported as gauge. |
value | Calculated estimate of the number of unique values recorded in the metric. |
The following is an example of the authentication.unique-uuid.success metric from the Common REST endpoint:
{
"_id" : "authentication.unique-uuid.success",
"_type" : "gauge",
"value" : 3.0
}
Prometheus fields
Field | Description |
---|---|
# TYPE | Metric ID, and type. Note that the DistinctCounter type is reported as gauge. |
<metric-id> | Calculated estimate of the number of unique values recorded in the metric. |
The following is an example of the ja_notenforced_ip_unmatched_cache_size metric from the Prometheus endpoint:
# TYPE ja_notenforced_ip_unmatched_cache_size gauge
ja_notenforced_ip_unmatched_cache_size 3.0
Summary of exposed metrics
Java Agent exposes the monitoring metrics described in this section.
Audit handler metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Time taken to generate an audit object. (Timer) |
| | | Time taken to audit outcomes, both locally to the agent and remotely in AM. (Timer) |
Labels:
<handler-type>
- am-delegate. Remote auditing performed by AM. (Prometheus: am_delegate)
- json. Local audit logging using JSON.
<outcome>
- success
- failure
Endpoint and REST SDK metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Time taken to retrieve user session information from AM. (Timer) |
| | | Time taken to retrieve the user profile information from AM. (Timer) |
| | | Time taken to retrieve policy decisions from AM. (Timer) |
JSON Web Token (JWT) metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Size of the JWT cache. (Gauge) |
| | | The eviction count for the JWT cache. (Gauge) |
| | | The load count for the JWT cache. (Gauge) |
| | | The load time for the JWT cache, in milliseconds. (Gauge) |
| | | The hit count for the JWT cache. (Gauge) |
| | | The miss count for the JWT cache. (Gauge) |
JVM metrics
To get Metric name used by Prometheus, prepend |
Name | Description |
---|---|
| | Number of processors available to the Java virtual machine. (Gauge) |
| | Number of classes loaded since the Java virtual machine started. (Gauge) |
| | Number of classes unloaded since the Java virtual machine started. (Gauge) |
| | Number of collections performed by the "parallel scavenge mark sweep" garbage collection algorithm. (Gauge) |
| | Approximate accumulated time taken by the "parallel scavenge mark sweep" garbage collection algorithm. (Gauge) |
| | Number of collections performed by the "parallel scavenge" garbage collection algorithm. (Gauge) |
| | Approximate accumulated time taken by the "parallel scavenge" garbage collection algorithm. (Gauge) |
| | Amount of heap memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of heap memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of heap memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of heap memory used by the Java virtual machine. (Gauge) |
| | Amount of memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of non-heap memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of non-heap memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of non-heap memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of non-heap memory used by the Java virtual machine. (Gauge) |
| | Amount of "code cache" memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of "code cache" memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of "code cache" memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of "code cache" memory used by the Java virtual machine. (Gauge) |
| | Amount of "compressed class space" memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of "compressed class space" memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of "compressed class space" memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of "compressed class space" memory used by the Java virtual machine. (Gauge) |
| | Amount of "metaspace" memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of "metaspace" memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of "metaspace" memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of "metaspace" memory used by the Java virtual machine. (Gauge) |
| | Amount of "parallel scavenge eden space" memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of "parallel scavenge eden space" memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of "parallel scavenge eden space" memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of "parallel scavenge eden space" memory after the last time garbage collection recycled unused objects in this memory pool. (Gauge) |
| | Amount of "parallel scavenge eden space" memory used by the Java virtual machine. (Gauge) |
| | Amount of "parallel scavenge old generation" memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of "parallel scavenge old generation" memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of "parallel scavenge old generation" memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of "parallel scavenge old generation" memory after the last time garbage collection recycled unused objects in this memory pool. (Gauge) |
| | Amount of "parallel scavenge old generation" memory used by the Java virtual machine. (Gauge) |
| | Amount of "parallel scavenge survivor space" memory that the Java virtual machine initially requested from the operating system. (Gauge) |
| | Maximum amount of "parallel scavenge survivor space" memory that the Java virtual machine will attempt to use. (Gauge) |
| | Amount of "parallel scavenge survivor space" memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of "parallel scavenge survivor space" memory after the last time garbage collection recycled unused objects in this memory pool. (Gauge) |
| | Amount of "parallel scavenge survivor space" memory used by the Java virtual machine. (Gauge) |
| | Amount of memory that is committed for the Java virtual machine to use. (Gauge) |
| | Amount of memory used by the Java virtual machine. (Gauge) |
| | Number of threads in the BLOCKED state. (Gauge) |
| | Number of live threads including both daemon and non-daemon threads. (Gauge) |
| | Number of live daemon threads. (Gauge) |
| | Number of threads in the NEW state. (Gauge) |
| | Number of threads in the RUNNABLE state. (Gauge) |
| | Number of threads in the TERMINATED state. (Gauge) |
| | Number of threads in the TIMED_WAITING state. (Gauge) |
| | Number of threads in the WAITING state. (Gauge) |
Not-enforced rule metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Size of the not-enforced URI matched cache. (Gauge) |
| | | Eviction count for the not-enforced URI matched cache. (Gauge) |
| | | Load count for the not-enforced URI matched cache. (Gauge) |
| | | Load time for the not-enforced URI matched cache, in milliseconds. (Gauge) |
| | | Hit count for the not-enforced URI matched cache. (Gauge) |
| | | Miss count for the not-enforced URI matched cache. (Gauge) |
| | | Size of the not-enforced URI unmatched cache. (Gauge) |
| | | Eviction count for the not-enforced URI unmatched cache. (Gauge) |
| | | Load count for the not-enforced URI unmatched cache. (Gauge) |
| | | Load time for the not-enforced URI unmatched cache, in milliseconds. (Gauge) |
| | | Hit count for the not-enforced URI unmatched cache. (Gauge) |
| | | Miss count for the not-enforced URI unmatched cache. (Gauge) |
| | | Size of the not-enforced IP matched cache. (Gauge) |
| | | Eviction count for the not-enforced IP matched cache. (Gauge) |
| | | Load count for the not-enforced IP matched cache. (Gauge) |
| | | Load time for the not-enforced IP matched cache, in milliseconds. (Gauge) |
| | | Hit count for the not-enforced IP matched cache. (Gauge) |
| | | Miss count for the not-enforced IP matched cache. (Gauge) |
| | | Size of the not-enforced IP unmatched cache. (Gauge) |
| | | Eviction count for the not-enforced IP unmatched cache. (Gauge) |
| | | Load count for the not-enforced IP unmatched cache. (Gauge) |
| | | Load time for the not-enforced IP unmatched cache, in milliseconds. (Gauge) |
| | | Hit count for the not-enforced IP unmatched cache. (Gauge) |
| | | Miss count for the not-enforced IP unmatched cache. (Gauge) |
| | | Size of the not-enforced compound matched cache. (Gauge) |
| | | Eviction count for the not-enforced compound matched cache. (Gauge) |
| | | Load count for the not-enforced compound matched cache. (Gauge) |
| | | Load time for the not-enforced compound matched cache, in milliseconds. (Gauge) |
| | | Hit count for the not-enforced compound matched cache. (Gauge) |
| | | Miss count for the not-enforced compound matched cache. (Gauge) |
| | | Size of the not-enforced compound unmatched cache. (Gauge) |
| | | Eviction count for the not-enforced compound unmatched cache. (Gauge) |
| | | Load count for the not-enforced compound unmatched cache. (Gauge) |
| | | Load time for the not-enforced compound unmatched cache, in milliseconds. (Gauge) |
| | | Hit count for the not-enforced compound unmatched cache. (Gauge) |
| | | Miss count for the not-enforced compound unmatched cache. (Gauge) |
Policy decision metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Size of the policy decision cache. (Gauge) |
| | | Eviction count for the policy decision cache. (Gauge) |
| | | Load count for the policy decision cache. (Gauge) |
| | | Load time for the policy decision cache, in milliseconds. (Gauge) |
| | | Hit count for the policy decision cache. (Gauge) |
| | | Miss count for the policy decision cache. (Gauge) |
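The hit and miss counts reported for a cache can be combined into a hit ratio, which is often easier to track than the raw counts. The following is a minimal sketch with made-up values; `hit_ratio` is a hypothetical helper, and returning 0.0 for a cache that has not yet been queried is an arbitrary choice.

```python
def hit_ratio(hit_count, miss_count):
    """Fraction of cache lookups served from the cache."""
    lookups = hit_count + miss_count
    if lookups == 0:
        # No lookups recorded yet; report 0.0 rather than dividing by zero.
        return 0.0
    return hit_count / lookups


# Hypothetical hit and miss counts read from the monitoring endpoint.
print(f"policy decision cache hit ratio: {hit_ratio(90, 10):.0%}")
```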
POST data preservation metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Size of the POST data preservation cache. (Gauge) |
| | | Eviction count for the POST data preservation cache. (Gauge) |
| | | Load count for the POST data preservation cache. (Gauge) |
| | | Load time for the POST data preservation cache, in milliseconds. (Gauge) |
| | | Hit count for the POST data preservation cache. (Gauge) |
| | | Miss count for the POST data preservation cache. (Gauge) |
Request metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Rate of granted/denied requests and their decision. (Timer) |
Labels:
<access>
- granted
- denied
<decision>
- not-enforced: Request matched a not-enforced rule.
- no-valid-token: Request did not have a valid SSO token or an OpenID Connect JWT.
- allowed-by-policy: Request matched a policy, which allowed access.
- denied-by-policy: Request matched a policy, which denied access.
- am-unavailable: The AM instance was not reachable.
- agent-exception: An internal error (exception) occurred within the agent.
Session information metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Size of the session information cache. (Gauge) |
| | | Eviction count for the session information cache. (Gauge) |
| | | Load count for the session information cache. (Gauge) |
| | | Load time for the session information cache, in milliseconds. (Gauge) |
| | | Hit count for the session information cache. (Gauge) |
| | | Miss count for the session information cache. (Gauge) |
SSO token to JWT exchange metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Size of the SSO token exchange cache. (Gauge) |
| | | Eviction count for the SSO token exchange cache. (Gauge) |
| | | Load count for the SSO token exchange cache. (Gauge) |
| | | Load time for the SSO token exchange cache, in milliseconds. (Gauge) |
| | | Hit count for the SSO token exchange cache. (Gauge) |
| | | Miss count for the SSO token exchange cache. (Gauge) |
Websocket metrics
Metric | Prometheus name | Description |
---|---|---|
| | | Number of milliseconds since anything was received over the websocket, for example a ping or a notification. (Gauge) |
| | | Number of milliseconds since anything was sent over the websocket. (Gauge) |
| | | Number of configuration change notifications received. Note that some may be ignored if the realm or agent name is not applicable. (DistinctCounter) |
| | | Number of configuration change notifications that were processed and not ignored. (DistinctCounter) |
| | | Number of policy change notifications received. Note that some may be ignored if the realm or agent name is not applicable. (DistinctCounter) |
| | | Number of policy change notifications that were processed and not ignored. (DistinctCounter) |
| | | Number of session logout notifications received. Note that some may be ignored if the realm or agent name is not applicable. (DistinctCounter) |
| | | Number of session logout notifications that were processed and not ignored. (DistinctCounter) |
| | | Ping/pong round trip time. (Timer) |