Built-in Events Reference

The Events page lists all currently available events: the built-in events that are available out of the box and any user-defined custom events. To view the Events page, click Settings -> Events.

The list can be filtered by:

  • Type: built-in event or custom event.
  • Incidents and severity: incidents, warning, or critical.
  • Full text search.

Important: Built-in events can't be modified. You can create custom events based on the same entities and metrics used for built-in events. Custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

.NET App

Event Description Metric
Garbage collection activity high. Monitors the garbage collection (GC) time spent by the CLR runtime platform and checks it against the maximum allowed percentage value. GC time (mem.time_in_gc).

For more information about this sensor, see the .NET documentation.

ActiveMQ

Event Description Metric
Dead-letter queue size is growing. Dead-letter queue size is increasing. Messages sent are not routed to their correct destination. ActiveMQ queue size.
Memory usage is close to the limit. Memory usage is close to 100% of the memory limit. Memory Usage (memoryPercentage).
Store usage is close to the limit. Store usage is close to 100% of the store limit. Store Usage (storePercentage).

For more information about this sensor, see the ActiveMQ documentation.

ActiveMQ Artemis

Event Description Metric
ActiveMQ Artemis has no connections. There are no connections in the last 5 seconds. The current number of connections is equal to the configured NoConnections count. Total Connections (totalConnectionCount).
ActiveMQ Artemis has no consumers. There are no consumers in the last 5 seconds. The current number of consumers is equal to the configured NoConsumers count. Total Consumers (totalConsumerCount).
Addresses memory usage is close to the limit. Memory usage of all addresses is close to 100% of its memory limit. Address Memory Usage (addressMemoryPercentage).

For more information about this sensor, see the ActiveMQ Artemis documentation.

Apache HTTPd

Event Description Metric
Apache child processes are stuck performing DNS lookups. Detects high usage of server workers by DNS lookup. Dns (worker.dns).
Logging is slowing down Apache HTTPd performance. Detects high usage of server workers for logging purposes. Logging (worker.logging).
Number of busy workers is approaching max workers. Detects a high percentage of busy workers. Busy workers (busy_workers).

For more information about this sensor, see the Apache HTTPd documentation.

Application

Event Description Metric
Complete drop in calls. Detects a rapid drop to zero (essentially, the service is not being called anymore) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. Calls/s (count).
Error rate too high. Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. Error Rate (error_rate).
Increasing trend in error rate. Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is, however, not strict and tolerates a certain number of decreases in the metric value inside the trend candidate. Error Rate (error_rate).
Sudden drop in calls. Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the listed relative and absolute threshold parameters. Calls/s (count).
Sudden increase in error rate. Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors should also exceed the listed relative and absolute threshold parameters. Error Rate (error_rate).
Sudden increase in latency. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. Latency 50th (duration.50th).
Sudden increase in latency for a fraction of requests. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the listed relative and absolute threshold parameters. Latency 99th (duration.99th).
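
The drop rules above combine a relative and an absolute threshold. The following is a minimal sketch of that idea, not the built-in detector: it assumes the comparison window is summarized by its mean, and the function names and default threshold values are hypothetical.

```python
from statistics import mean


def is_sudden_drop(baseline, current, rel_threshold=0.5, abs_threshold=10.0):
    """Return True when `current` has dropped sharply against `baseline`.

    baseline: calls/s samples from the comparison window (for example,
              the last 30 minutes).
    current:  the most recent calls/s value.
    rel_threshold and abs_threshold are hypothetical values, not
    product defaults.
    """
    if not baseline:
        return False
    reference = mean(baseline)  # assumption: the window is summarized by its mean
    drop = reference - current
    if reference <= 0 or drop <= 0:
        return False
    # Both conditions must hold: a large relative drop on a tiny baseline
    # is ignored, and so is an absolute drop that is small relative to
    # normal traffic.
    return drop >= abs_threshold and drop / reference >= rel_threshold


def is_complete_drop(baseline, current, **thresholds):
    """A complete drop is a sudden drop that ends at zero calls."""
    return current == 0 and is_sudden_drop(baseline, current, **thresholds)
```

Requiring both thresholds keeps low-traffic services (where going from 4 calls/s to 0 is a 100% relative drop) from triggering as easily as a busy service that loses most of its traffic.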

AWS DynamoDB

Event Description Metric
Ratio of consumed and provisioned reads is critical. Detects a high ratio of consumed to provisioned reads. Consumed read capacity (consumed_read).
Ratio of consumed and provisioned writes is critical. Detects a high ratio of consumed to provisioned writes. Consumed write capacity (consumed_write) and provisioned write capacity (provisioned_write).

For more information about this sensor, see the AWS DynamoDB documentation.

AWS RDS

Event Description Metric
CPU credit balance reaching zero. Checks if the CPU credit balance is getting closer to zero. CPU Credit Balance (cpu_credit_balance).
Number of CPU credits consumed is high. Checks if the percentage of CPU credits consumed by an instance is reaching max capacity. CPU Credit Usage (cpu_credit_usage) and CPU Credit Balance (cpu_credit_balance).

For more information about this sensor, see the AWS RDS documentation.

Azure API Management Service

The Azure API Management sensor will conduct custom health checks and execute them every minute. If the checks fail for at least one minute, an issue will be raised to inform the user.

Event Description Metric
Azure Api Management capacity is getting closer to the max capacity limit. Checks whether Azure API Management is using more than 90% of the available capacity. Capacity (metrics.Capacity).

For more information about this sensor, see the Azure API Management documentation.
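
The Azure sensors run their health checks every minute and raise an issue once a check has been failing for at least one minute. A minimal sketch of that rule follows. It assumes that "failing for at least one minute" means an uninterrupted run of failed checks spanning 60 seconds or more; the function name and input shape are hypothetical.

```python
def issue_raised(check_results, min_failing_seconds=60):
    """Return True if a health check failed continuously long enough.

    check_results: list of (timestamp_seconds, passed) tuples ordered by
                   time, one entry per executed check.
    """
    failing_since = None
    for timestamp, passed in check_results:
        if passed:
            # Any successful check resets the failure streak.
            failing_since = None
            continue
        if failing_since is None:
            failing_since = timestamp
        if timestamp - failing_since >= min_failing_seconds:
            return True
    return False
```

For the capacity event above, each check would evaluate a condition such as "capacity is not above 90%", so a single failed check that recovers on the next run does not raise an issue under this interpretation.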

Azure CosmosDB

Event Description Metric
Azure CosmosDb storage capacity is getting closer to the max capacity limit. Detects whether the Azure CosmosDb storage capacity is reaching the max capacity limit. CosmosDb storage capacity.

For more information about this sensor, see the Azure CosmosDB documentation.

Azure Redis

The Azure Redis Cache sensor will conduct custom health checks and execute them every minute. If the checks fail for at least one minute, an issue will be raised to inform the user.

Event Description Metric
Azure Redis Cache client connections are getting closer to max connections limit. Azure Redis Cache is using more than 90% of available client connections. Connected Clients (connectedclients).
Azure Redis Cache memory usage is getting closer to max memory limit. Azure Redis Cache is using more than 90% of available memory. Percentage of Memory Used (usedmemorypercentage).

For more information about this sensor, see the Azure Redis documentation.

Azure SQL Database

The Azure SQL Database sensor will conduct custom health checks and execute them every minute. If the checks fail for at least one minute, an issue will be raised to inform the user.

Event Description Metric
Database is running out of space. Checks if Azure SQL Database is running out of space. Warning limit is at 80% and the critical limit is at 90% of the used size. metrics.storage_percent.
Database status. Unhealthy state is caused by the database being unavailable. A database can be unavailable if one of the following conditions is true:
  • The database has been set offline by the user
  • The database is being restored from backup
  • The database is being recovered
  • The database has been corrupted
  • The database has been set to the Emergency state by the administrator
  • The database is in the process of being created by copying another database
metrics.statusCode.
The total DTU utilization is getting closer to max DTU limit. Checks if the Azure SQL Database DTU utilization is reaching max DTU limit. Warning limit is at 75% and the critical limit is at 85% of the DTU utilization. metrics.dtu_consumption_percent.
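
The warning and critical limits listed for Azure SQL Database can be read as a two-level threshold. The sketch below illustrates that reading. Whether the limits are inclusive is an assumption, and the `classify` helper is hypothetical.

```python
def classify(percent, warning, critical):
    """Map a percentage metric to a severity using two limits.

    Assumption: a value equal to a limit already counts as reaching it.
    """
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "ok"


# Limits taken from the table above.
STORAGE_LIMITS = {"warning": 80, "critical": 90}  # metrics.storage_percent
DTU_LIMITS = {"warning": 75, "critical": 85}      # metrics.dtu_consumption_percent
```

For example, `classify(85, **STORAGE_LIMITS)` yields a warning for storage, while the same value of 85 is already critical for DTU utilization.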

Azure SQL Elastic Pool

The Azure SQL Elastic Pool sensor will conduct custom health checks and execute them every minute. If the checks fail for at least one minute, an issue will be raised to inform the user.

Event Description Metric
The total eDTU utilization is getting closer to max eDTU limit. Checks if Azure SQL Elastic Pool eDTU is reaching maximum eDTU limit. metrics.dtu_consumption_percent.

Cassandra

Cassandra Cluster

Event Description Metric
Unreachable Cassandra nodes. One or more nodes are down. Number of unreachable nodes (unreachableNodes).

Cassandra Node

Event Description Metric
Blocked threadpools. Checks whether there are stages with blocked threads. Blocked threads metric for a stage.
Dropped messages. Checks whether there are thread pools dropping messages. Dropped messages metric for a stage.
Pending compactions. Checks whether pending compactions are increasing. Write (compaction.pending).
Pending mutations. Checks whether there are pending mutations. Counter Mutation (stage.mutation.pending).
Pending reads. Checks whether there are pending reads. Read Repair (stage.read.pending).
Pending request responses. Checks whether there are pending request responses. Write (Mutation) (stage.requestresponse.pending).
Sudden drop in write requests. Checks for a sudden drop in the number of Cassandra write requests. Read (clientrequests.write.count).

For more information about this sensor, see the Cassandra documentation.

Ceph

Event Description Metric
Ceph cluster status. Ceph cluster is reporting a problem: HEALTH_WARN or HEALTH_ERR. Status of the Ceph Cluster (overall_status).
Monitor quorum is not reached. The number of healthy monitors is less than 50% of all monitors. Number of monitors (num_mons) and number of active monitors (num_active_mons).
Osd(s) full capacity state. Some of the OSDs are reporting a full state. Number of full osds (num_full_osds).
Osd(s) near full capacity state. Some of the OSDs are reporting a near full state. Number of near full osds (num_near_full_osds).

For more information about this sensor, see the Ceph documentation.
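
The monitor quorum event compares the two metrics num_mons and num_active_mons. A minimal sketch of the stated condition ("fewer than 50% of all monitors are healthy") follows; the function name is hypothetical, and treating a cluster with no monitors as "no event" is an assumption.

```python
def quorum_not_reached(num_mons, num_active_mons):
    """True when fewer than 50% of all monitors are active."""
    if num_mons <= 0:
        return False  # assumption: nothing to evaluate without monitors
    # Integer form of num_active_mons / num_mons < 0.5, avoiding float rounding.
    return 2 * num_active_mons < num_mons
```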

Consul (HashiCorp)

Event Description Metric
Consul cluster health. Detects the overall health of the cluster and if any of the nodes are considered unhealthy by Autopilot. Consul autopilot health status (consul.autopilot.healthy).

CRI-O

Event Description Metric
Memory exhausted. Detects when the container memory usage exceeds specified limits. RSS (memory.total_rss).

Docker

Event Description Metric
Memory exhausted. When the container memory usage exceeds specified limits, a memory warning threshold or a memory critical threshold alert is displayed. RSS (memory.total_rss).

For more information about this sensor, see the Docker documentation.

Elasticsearch

Elasticsearch Cluster

Event Description Metric
Cluster status. Monitors the status of Elasticsearch cluster. Number of Elasticsearch nodes (node_count) and the status of Elasticsearch cluster (cluster_status).
Elasticsearch is in split-brain situation. Checks whether an Elasticsearch cluster has more than one master node. Split brain is triggered for environments with two Elasticsearch clusters that have the same name. Master node count in the Elasticsearch cluster.

Elasticsearch Node

Event Description Metric
Capacity limit while rebalancing. Characterizes the node as being at the capacity limit while rebalancing by checking whether it is relocating shards at the time it is at the capacity limit. Results of the capacity limit evaluation and shard relocation.
Heap overallocation. Evaluates whether the heap size setting of Elasticsearch is too large. Maximum heap size of the underlying JVM and the total memory on the underlying host.
High heap usage. Checks the heap usage of the node along with the recent workload characteristics to detect whether the heap usage is too high. Heap usage by the underlying JVM and workload characterization.
Node at capacity limits. Checks whether the node is at the capacity limit, which is determined by the presence of the following issues: high load and CPU usage on the host, and high heap usage and high GC time in the Elasticsearch JVM. High load and high CPU time on the host, high heap usage by Elasticsearch, as well as high GC time on the underlying JVM.
Node status. Checks the cluster status provided by Elasticsearch. High load and high CPU time on the host, high heap usage by Elasticsearch, as well as high GC time on the underlying JVM.
Rejected actions. Checks for the number of rejected threads being too high. Index (threads.index_rejected), search (threads.search_rejected), bulk (threads.bulk_rejected), and get (threads.get_rejected).

For more information about this sensor, see the Elasticsearch documentation.

Endpoint

Event Description Metric
Complete drop in calls. Detects a rapid drop to zero (essentially the service is not being called anymore) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the relative and absolute threshold parameters below. Calls/s (count).
Error rate too high. Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. Error Rate (error_rate).
Error rate too high for a synthetic endpoint. Detects a consistently high error rate of a synthetic endpoint when the average errors KPI within the last four minutes is above the given threshold value. Synthetic error rate (synthetic_error_rate).
Increasing trend in error rate. Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is, however, not strict and tolerates a certain number of decreases in the metric value inside the trend candidate. Error Rate (error_rate).
Sudden drop in calls. Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the relative and absolute threshold parameters below. Calls/s (count).
Sudden drop in synthetic calls. Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the relative and absolute threshold parameters below. Synthetic calls/s (synthetic_count).
Sudden increase in error rate. Detects a rapid increase in the values of the errors KPI relative to the KPI values in the last 10 minutes. The magnitude of the increase in errors should also exceed the relative and absolute threshold parameters below. Error Rate (error_rate).
Sudden increase in latency. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the relative and absolute threshold parameters below. Latency 50th (duration.50th).
Sudden increase in latency for a fraction of requests. Detects a rapid increase in the given latency KPI percentile relative to the KPI values in the last 30 minutes. The magnitude of the increase in latency should also exceed the relative and absolute threshold parameters below. Latency 99th (duration.99th).
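
The "Increasing trend in error rate" rule looks for a mostly rising metric while tolerating a few decreases inside the trend candidate. The following sketch shows one simple way to express such a tolerant check. It is an illustration only: the actual detector's algorithm is not described here, and `max_decreases` is a hypothetical parameter.

```python
def has_increasing_trend(values, max_decreases=1):
    """Return True if `values` rise overall with at most a few dips.

    values:        metric samples in time order (for example, error_rate).
    max_decreases: how many step-to-step decreases are tolerated
                   (hypothetical tuning knob).
    """
    if len(values) < 2:
        return False
    # Count steps where the metric went down; equal neighbours are
    # allowed, which is what "weakly" monotonic means.
    decreases = sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)
    return decreases <= max_decreases and values[-1] > values[0]
```

With the default tolerance, `[0.01, 0.02, 0.02, 0.03, 0.025, 0.04]` counts as a trend (one dip), while a series that keeps falling back to its starting value does not.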

etcd

Event Description Metric
Abnormally high disk backend commit duration. Detects high disk backend commit duration. Disk backend commit duration (health.disk_backend_commit_duration).
Abnormally high disk wal fsync duration. Detects high disk WAL fsync duration. Disk fsync duration (health.disk_wal_fsync_duration).
Abnormally high snapshot duration. Detects high duration of saving a snapshot. Snap save total duration (health.debugging_snap_save_total_duration).
Frequent leader changes seen in last minute. Detects a high number of leader changes in the last minute. Server leader changes (health.server_leader_changes).
Member doesn't have leader. Detects a member who does not have a leader (unavailable). Server has leader (health.server_has_leader).
Proposal ratio analysis. Detects an unusual fall in applied proposals and an unusual rise in pending and failed proposals. Number of proposals committed (health.server_proposals_committed), number of proposals applied (health.server_proposals_applied), number of proposals pending (health.server_proposals_pending), and number of proposals failed (health.server_proposals_failed).
Usage of open file descriptors is critical. Detects a high usage of open file descriptors. Number of open file descriptors (health.process_open_fds) and the maximum number of file descriptors (health.process_max_fds).

For more information about this sensor, see the etcd documentation.

Garden Container

Event Description Metric
Memory exhausted. Container memory usage is getting close to its memory limit. Usage (memory.usage).

For more information about this sensor, see the Garden documentation.

Glassfish

Event Description Metric
Glassfish file cache hit rate is below 70%. A processing pipeline checks the file cache hit rate and validates whether it's lower than the given threshold value. Hit rate (file_cache_rate).
Maximum number of JDBC connections reached. A processing pipeline checks the total number of JDBC connections. It validates whether it's reaching the maximum limit for the server configuration. Used (jdbc_connection_used).

For more information about this sensor, see the Glassfish documentation.

Google Cloud Storage

Event Description Metric
Sudden increase in size of all objects. Checks for a sudden increase in the size of all objects within 24 hours for non-empty buckets. Total size of all objects in the bucket.

For more information about this sensor, see the Google Cloud Storage documentation.

Hadoop YARN

Event Description Metric
Resource manager is reporting lost node. Detects if the resource manager is reporting lost nodes. Lost Nodes (lostNodes).
Resource manager is reporting unhealthy node. Detects if the resource manager is reporting unhealthy nodes. Unhealthy Nodes (unhealthyNodes).
Submitted app has failed. Detects if a submitted app has failed. Apps Failed (appsFailed).

For more information about this sensor, see the Hadoop YARN documentation.

HAProxy

Event Description Metric
HAProxy backend average queue size is high. HAProxy backend average queue size is large. Backend Queue Size.
HAProxy frontend session usage is high. HAProxy frontend session usage is high. Frontend Session Utilization.
Sudden increase in average response time. Checks for a sudden increase in the average response time of a single backend. Average response time metrics.

For more information about this sensor, see the HAProxy documentation.

Hazelcast

Starting with Hazelcast 3.3, the public method HazelcastInstance::getPartitionService()::isLocalMemberSafe() is used. For older Hazelcast versions, the health status is derived from an internal "has ongoing migrations" status on each local node.

The Hazelcast cluster health status is aggregated from each Hazelcast node. This is exactly what HazelcastInstance::getPartitionService()::isClusterSafe() does internally, but without the additional overhead of calling this method.
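
The aggregation of per-node status into a cluster status can be pictured as a logical AND over the local-member flags. This is a sketch under that assumption, not the sensor's code: the function name is hypothetical, and treating an empty set of nodes as "not safe" is a design choice made here.

```python
def cluster_is_safe(member_safe_flags):
    """Aggregate per-node 'local member safe' flags into a cluster flag.

    member_safe_flags: mapping of node identifier to the boolean status
                       reported for that node.
    """
    # The cluster is only considered safe when every known node is safe.
    return bool(member_safe_flags) and all(member_safe_flags.values())
```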

Hazelcast Cluster

Event Description Metric
Cluster status. Checks the cluster status of Hazelcast. Hazelcast 3.3 or above. Hazelcast cluster status flag.

Hazelcast Node

Event Description Metric
Node status. Checks the status of the local member. Hazelcast 3.3 or above. Hazelcast node status flag.

For more information about this sensor, see the Hazelcast IMDG documentation.

HBase

Event Description Metric
Difference between number of stores and number of store files is significant. Detects unusually low or unusually high number of stores. Stores count (rs_store_count) and stores files count (rs_store_file_count).
Region server block cache hit ratio is low. Detects low cache hit ratio. Block cache hit rate (rs_blk_cache_hit_rate) and block cache hit count (rs_blk_cache_hit_count).
Significant increase in compaction queue length. Checks for a sudden increase in the length of the compaction queue. This rule indicates that all regions are growing at a similar rate and need to split/compact at around the same time. This can be addressed by pre-splitting or turning off auto-compactions. Compaction queue length (rs_comp_queue_length).
Significant increase in flush queue length. Checks for a sudden increase in the length of the flush queue. When triggered, this can be an indication of a lack of RAM or that flushes are faster than what disks can handle. Flush queue length (rs_flush_queue_length).

For more information about this sensor, see the Apache HBase documentation.

Host

Event Description Metric
CPU spends significant time waiting for input/output. Checks whether the system spends significant time waiting for input/output (sampling in a sliding window of 60 seconds). Wait (cpu.wait).
CPU Steal Time exceeded. Checks, on a window that moves every second, whether too much CPU time is stolen between running processes or by the hypervisor / host OS (sampling in a sliding window of 60 seconds). Steal (cpu.steal).
Device has low capacity left or is full. Detects low disk capacity to give an early prediction of a possible capacity breach up to 15 minutes in advance. The detector does not fire when the remaining disk space is more than 1GB or 1% of the total capacity. However, it fires if either the disk is effectively full (<1MB remaining), or the disk would fill up within the next 15 minutes based on the current trend. The disk's free storage capacity.
Disk fills up faster than it is being purged. Detects long-term disk capacity problems and fires when the disk is likely to run out of capacity within the next 48 hours. The detector does not fire when the remaining disk space is more than 20% of the total capacity. However, it fires when the disk would fill up within the next 48 hours based on the current trend. This trend is computed based on local minima collected over time. When these local minima span a timeframe of at least 4 hours, a linear regression model is fitted on these data points to produce the long-term forecast. The disk's free storage capacity.
Frequent TCP errors. Checks whether the host has an unusually high number of TCP errors (sampling in a sliding window of 60 seconds). In Segments/s (tcp.inSegs) and error (tcp.errors).
Frequent TCP fails. Checks whether the host has an unusually high number of TCP fails (sampling in a sliding window of 60 seconds). Fail (tcp.fails) and open/s (tcp.opens).
Permanent TCP retransmissions. Checks whether the host has an unusually high number of TCP retransmissions (sampling in a sliding window of 60 seconds). Retransmission (tcp.retrans) and out Segments/s (tcp.outSegs).
System load too high. Checks whether the system load is too high, by comparing the load against 2 times the CPU cores of the machine (sampling in a sliding window of 120 seconds). Load (load.1min).
System memory exhausted. Checks whether the system memory is close to being exhausted (triggered instantly). Free (memory.free) and used (memory.used).
Too many open files. Processes are opening files faster than they close them (current vs max ratio exceeds threshold). Used (openFiles.used).
Too many used inodes. Low level of free inodes on filesystem triggers this health rule (current vs max ratio exceeds threshold). inode usage.
Too much CPU usage by user processes. Checks whether CPU usage of user processes is too high (sampling in a sliding window of 180 seconds). User (cpu.user) and topPID.
You will run out of disk space soon. Detects short-term capacity problems of a disk and fires when the disk is likely to run out of capacity within the next hour. The detector does not fire when the disk freed up a considerable amount of space (>=100MB) in the recent past, or when the remaining disk space is more than 20% of the total capacity. However, it fires when the disk would fill up within the next hour based on the current trend. This trend is computed based on a linear regression model fitted on the data points of the current sliding window. The disk's free storage capacity.

For more information about this sensor, see the Host documentation.
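
The short-term disk event ("You will run out of disk space soon") is described as a linear regression over the current sliding window, combined with two suppression conditions. The sketch below illustrates that description with an ordinary least-squares fit. It is not the product's implementation: the function names, the sample format, and the reading of ">=100MB" as decimal megabytes freed between two consecutive samples are assumptions.

```python
FREED_BYTES_LIMIT = 100 * 10**6  # ">=100MB"; decimal megabytes are an assumption


def seconds_until_full(samples):
    """Forecast the seconds from the last sample until free space hits zero.

    samples: list of (timestamp_seconds, free_bytes) in time order.
    Returns None when no shrinking trend can be fitted.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope and intercept of free space over time.
    slope = sum((t - mean_t) * (f - mean_f) for t, f in samples) / var_t
    if slope >= 0:
        return None  # free space is stable or growing
    intercept = mean_f - slope * mean_t
    time_of_zero = -intercept / slope
    return max(0.0, time_of_zero - samples[-1][0])


def short_term_disk_alert(samples, total_bytes, horizon_seconds=3600):
    """Apply the suppression rules, then the one-hour forecast."""
    free_values = [free for _, free in samples]
    if not free_values:
        return False
    # Suppressed: more than 20% of the total capacity is still free.
    if free_values[-1] > 0.20 * total_bytes:
        return False
    # Suppressed: a considerable amount of space was freed recently.
    if any(cur - prev >= FREED_BYTES_LIMIT
           for prev, cur in zip(free_values, free_values[1:])):
        return False
    eta = seconds_until_full(samples)
    return eta is not None and eta <= horizon_seconds
```

The long-term event follows the same forecasting idea with a 48-hour horizon, but fits the regression on local minima of the free-space curve rather than on every sample, which makes it less sensitive to periodic purges.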

IIS Internet Information Server

Event Description Metric
Sudden drop in requests to IIS-site. Checks for a sudden drop in the requests for an IIS-site. Total request metric of an IIS-site.

For more information about this sensor, see the Microsoft IIS documentation.

JBoss

Event Description Metric
Average errors on connector too high. A processing pipeline detects the number of errors that occurred on connectors in the given time window and also checks whether the number of errors is greater than the threshold value. JBoss connector errors.
ConnectionPool is running out of connections. A processing pipeline detects the used connections ratio and checks if it is about to reach the threshold value. JBoss connection pool connections used ratio.
Connections on datasources run out. A processing pipeline detects the number of available connections on data sources in the given time window and checks if the total number of connections is about to reach the threshold value. JBoss datasources connections used, datasources connections available.
ThreadPool is running out of threads. A processing pipeline detects the number of max threads and checks if the current thread count is about to reach the threshold value. JBoss thread pool current thread count, thread pool max threads.

For more information about this sensor, see the JBoss AS documentation.

JBoss Data Grid

Event Description Metric
Caches not in the running state. Checks the ratio of the number of caches created to the number of caches running in JBoss Data Grid. If the ratio is below a certain value, it is considered a violation. Running and created caches of cache managers.

For more information about this sensor, see the JBoss Data Grid documentation.

JVM

Event Description Metric
Garbage collection activity high. A processing pipeline monitors the Garbage Collection time spent by the JVM Runtime Platform and validates it against a threshold. JVM Garbage Collection.
JVM code cache is full. A processing pipeline monitors the maximum Code Cache usage of the JVM Runtime Platform. JVM maximum Code Cache usage.
Perm Gen is full (CMS). A processing pipeline detects the maximum Perm Gen CMS Pools utilized. pools.CMS Perm Gen.
Perm Gen is full (G1). A processing pipeline detects the maximum Perm Gen G1 Pools utilized. pools.G1 Perm Gen.
Perm Gen is full (PS). A processing pipeline detects the maximum Perm Gen PS Pools utilized. pools.PS Perm Gen.
Threads are deadlocked. A detector monitors the JVM Runtime Platform and detects if there are any Deadlocked threads. Number of threads deadlocked (threads.deadlocked).

For more information about this sensor, see the JVM documentation.

Kafka

Kafka Cluster

Event Description Metric
Number of active controllers. Checks for an unusual number of active controllers in the Kafka cluster. Broker active controller count (broker.activeControllerCount).

Kafka Node

Event Description Metric
Kafka network thread is under high load. Checks whether the Kafka network thread is under high load. Network Processor (broker.networkProcessorIdle).
Kafka request handler thread is under high load. Checks whether the Kafka request handler is under high load. Request Handler (broker.requestHandlerIdle).
Leader elections are too often. Checks whether there are too many leader elections within a given timeframe. Leader Elections (broker.leaderElections).
Potential data loss due to unclean leader election. Checks for potential data loss due to unclean leader elections. Unclean Leader Elections (broker.uncleanLeaderElections).
Producers and consumer are blocked. Checks whether producers and consumers are blocked due to partitions being offline. Offline Partitions (broker.offlinePartitionsCount).
The number of in-sync replicas has shrunk. Checks whether the number of in-sync replicas has shrunk and did not recover back within the given interval. ISR shrinks (broker.isrShrinks) and ISR expansions (broker.isrExpansions).
Under-replicated partitions. Checks whether the number of under-replicated partitions exceeds the expected number. Under-replicated partitions (broker.underReplicatedPartitions).

For more information about this sensor, see the Kafka documentation.

Kubernetes

Kubernetes Cluster

Event Description Metric
Kubernetes Cluster component status. Kubernetes reports that a master component (API server, scheduler, or controller manager) is unhealthy. Due to a bug in Kubernetes, the health is not always reported reliably. Instana tries to filter out such unreliable reports so that they do not cause an alert and only show up on the Cluster detail page. Instana low level events.

Kubernetes DaemonSet

Event Description Metric
Available replicas is less than desired replicas. Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the Kubernetes DaemonSet is missing replica pods. Desired (desiredReplicas) and available (availableReplicas).

Kubernetes Deployment

Event Description Metric
Available replicas is less than desired replicas. Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the Kubernetes Deployment is missing replica pods. Desired (desiredReplicas) and available (availableReplicas).

Kubernetes Namespace

Event Description Metric
Allocatable cpu requests too low. Requested CPU is approaching max capacity (requested CPU / CPU capacity ratio is greater than 80%). CPU Requests Allocation (required_cpu_percentage).
Allocatable memory requests too low. Requested Memory is approaching max capacity (requested memory/memory capacity ratio is greater than 80%). Memory Requests Allocation (required_mem_percentage).
Allocatable pod count too low. Allocated pods are approaching maximum capacity (allocated pods/pods capacity ratio is greater than 80%). For a namespace, pods in the phases Pending, Running, and Unknown are counted as allocated. The namespace capacity values are based on ResourceQuotas, which can be set per Namespace. For more information, see the Kubernetes documentation. Pods Allocation (used_pods_percentage).

Kubernetes Node

Event Description Metric
Allocatable CPU too low. Requested CPU is approaching max capacity (requested CPU / CPU capacity ratio is greater than 80%). CPU Requests Allocation (required_cpu_percentage).
Allocatable memory too low. Requested Memory is approaching max capacity (requested memory/memory capacity ratio is higher than 80%). Memory Requests Allocation (required_mem_percentage).
Allocatable pod count too low. Allocated pods are approaching maximum capacity (allocated pods/pods capacity ratio is greater than 80%). For a node, pods in the phases Running and Unknown are counted as allocated. For more information, see the Kubernetes documentation. Pods Allocation (alloc_pods_percentage).
Kubernetes Node condition status. The node reports a condition that is not ready for more than one minute. For a node, this covers all conditions besides the Ready condition. For more information, see the Kubernetes documentation. Instana low level events.
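
The pod allocation events for namespaces and nodes differ only in which pod phases count as allocated. The following sketch shows the ratio check under that reading. The helper names are hypothetical, and the capacity value is assumed to come from a ResourceQuota (namespace) or from the node's pod capacity.

```python
# Phases counted as allocated, as described in the tables above.
NAMESPACE_ALLOCATED_PHASES = {"Pending", "Running", "Unknown"}
NODE_ALLOCATED_PHASES = {"Running", "Unknown"}


def pods_allocation_ratio(pod_phases, capacity, allocated_phases):
    """Return allocated pods divided by pod capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    allocated = sum(1 for phase in pod_phases if phase in allocated_phases)
    return allocated / capacity


def pod_count_too_low(pod_phases, capacity, allocated_phases, limit=0.80):
    """True when the allocation ratio is greater than 80%."""
    return pods_allocation_ratio(pod_phases, capacity, allocated_phases) > limit
```

With eight Running pods, one Pending pod, and one Succeeded pod against a capacity of 10, the namespace ratio is 0.9 and exceeds the limit, while the node ratio is 0.8 and does not, because Pending pods are not counted for a node.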

Kubernetes Pod

Event Description Metric
Kubernetes Pod condition status. A pod is not ready for more than one minute, and the reason is not that it’s completed. (PodCondition=Ready, Status=False, Reason != PodCompleted). For more information, see the Kubernetes documentation. Instana low level events.

For more information about this sensor, see the Kubernetes documentation.

Memcached Nodes

Event Description Metric
Flush all command executed. Detects a high number of flush_all commands. Flush (cmd_flush).
High key eviction. Detects high number of key evictions. Evictions (evictions).
Number of queued connections increases. Detects high number of queued connections. Queued (conn_queued).
Number of yielded connections increases. Detects high number of yielded connections. Yields (conn_yields).
Used bytes by Memcached reached maxbytes limit. Used bytes by Memcached reached max bytes limit. Used bytes.

For more information about this sensor, see the Memcached documentation.

MongoDB Node

Event Description Metric
Continuously increasing background flushing latency. Database reports increasing background flushing latency (sampling in a sliding window of 150 seconds). Last background flushing latency (backgroundFlushingLast).
Continuously increasing lock queue length. Monitors the MongoDB Lock Queue metric and validates if the lock queue size is increasing too fast. Lock Queue Length (lockQueue).
Increasing page faults. Increasing page faults (sampling in a sliding window of 150 seconds). Number of Page Faults (pageFaults).
Journal commits in write lock growing. Journal commits in write lock growing (sampling in a sliding window of 150 seconds). Journal Write Lock (journalWriteLock).
Too high ratio of non-mapped virtual memory. Too high ratio of non-mapped virtual memory (triggered instantly and reported by the Instana Host sensor). Virtual and mapped.

MongoDB Replica Set

Event Description Metric
ReplicaSet has member(s) down. The member, as seen from another member of the set, is unreachable. unreachableNodeCount.
ReplicaSet monitoring status. Monitors the health of all the members of MongoDB replica set. Slave Delays Count (slaveDelaysCount), optimes count (optimesCount), and monitored members count (monitoredMembersCount).
Replication lag is growing. Replication lag is growing (sampling in a sliding window of 150 seconds). Slave Delays (slaveDelays) and Optimes (optimes).

For more information about this sensor, see the MongoDB documentation.

MySQL DB

Event Description Metric
Available server connections are at limit. Ratio between the used connections and the connections limit is greater than the configured ratio threshold. Connections (status.THREADS_CONNECTED).

For more information about this sensor, see the MySQL documentation.

Nginx Server

Event Description Metric
Nginx has a problem with offline peers. Inactive Peer (available only for NGINX Plus). Upstreams failed (nginx_plus.http.upstreams.peers.failed).
Nginx is dropping connections. Dropped connections. Dropped connections (connections.dropped).
Nginx is failing with SSL handshakes. Failed SSL handshakes (available only for NGINX Plus). Failed handshakes (nginx_plus.ssl.handshakes_failed).
Number of active connections is close to the max. Used connections ratio exceeds the configured ratio threshold for used connections. Active connections (connections.active).

For more information about this sensor, see the NGINX documentation.

Node.js App

Event Description Metric
Garbage collection activity high. Checks whether the time spent in GC in the given window is above the given threshold. GC pause metrics.
Health checks are failing. Checks whether there are any failing health checks. For more information, see Health check support. Health check result (healthcheckResult).

For more information about this sensor, see the Node.js documentation.

OpenShift Deployment Config

Event Description Metric
Available replicas is less than desired replicas. Checks whether the total number of available replicas is less than the number of desired replicas. This indicates that the OpenShift DeploymentConfig is missing replica pods. Desired (desiredReplicas) and available (availableReplicas).

For more information about this sensor, see the OpenShift documentation.

OracleDB

Event Description Metric
Ratio between DB CPU Time and DB Time is low. Ratio between DB CPU Time and DB Time is below the configured threshold. DB CPU Time/DB Time Ratio (stats.cpuTimeDbTimeRatio).
Tablespace space usage is high. Tablespace used space is greater than the configured amount of maximum space. Tablespace used space percentage.
Total amount of sessions at maximum. Used sessions ratio exceeds the configured used sessions ratio threshold. Sessions/Session Limit (stats.usedSessionsRatio).

For more information about this sensor, see the OracleDB documentation.

OS process

Event Description Metric
CPU Usage. Process is causing high CPU usage on host. The result of a high CPU usage rule evaluation on the underlying host and the CPU user time of the given process.
Open Files Usage. Process is opening files faster than it closes them (current vs max ratio exceeds threshold). Used (openFiles.used).
Abnormal termination. Process terminated as a result of an uncaught signal.
Abnormal termination. Process terminated with a non-zero exit code.

For more information about this sensor, see the OS process documentation.

PHP-FPM Runtime

Event Description Metric
Frequent restarts of PHP-FPM worker pool. Checks for frequent restarts of a PHP-FPM worker pool by evaluating the number of its restarts in a given time window against a given threshold. Start times for a worker pool.
Listen Backlog configured over capacity. Checks whether the listen backlog of a worker pool is over the configured capacity. Worker pool queue length.
Too many connections reset. Checks the number of connection resets to be above the given threshold in the given time window. Connection resets metric for worker pool.
Too many requests piling up in Listen Backlog. Checks the size for various PHP-FPM worker queues and validates it against the threshold value. Listen queue size metrics for various PHP-FPM worker queues.
Too many slow requests. Checks the ratio of slow requests on all monitored PHP-FPM worker pools. Slow requests and accepted connection metric for a worker pool of a PHP-FPM instance.

For more information about this sensor, see the PHP documentation.

Synthetic Check

Event Description Metric
Remote target is not reachable. Checks whether the percentage of failed communication attempts in the given sliding window is above the given threshold. Status of Ping (status). An HTTP status code in the range 200-206 or 300-307 results in a healthy status. For ICMP, an exit value of 0 is seen as healthy. In addition, a maximum execution time of 2 seconds is set.
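
The health rule for a single check attempt and the sliding-window evaluation can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and treating an attempt that runs longer than 2 seconds as unhealthy is an assumption about how the maximum execution time is applied.

```python
from typing import Sequence

# Maximum execution time stated in the table above.
MAX_EXECUTION_SECONDS = 2.0


def http_attempt_healthy(status_code: int, elapsed_seconds: float) -> bool:
    """Healthy for status codes 200-206 or 300-307 within the time limit."""
    in_range = 200 <= status_code <= 206 or 300 <= status_code <= 307
    # Assumption: exceeding the time limit makes the attempt unhealthy.
    return in_range and elapsed_seconds <= MAX_EXECUTION_SECONDS


def icmp_attempt_healthy(exit_value: int, elapsed_seconds: float) -> bool:
    """Healthy when the ping exit value is 0 within the time limit."""
    return exit_value == 0 and elapsed_seconds <= MAX_EXECUTION_SECONDS


def target_unreachable(window: Sequence[bool], threshold_percent: float) -> bool:
    """True when the percentage of failed attempts in the window is above
    the threshold. Each entry in window is True for a healthy attempt."""
    if not window:
        return False
    failed = sum(1 for healthy in window if not healthy)
    return 100.0 * failed / len(window) > threshold_percent
```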

For more information about this sensor, see the Synthetic Check documentation.

PostgreSQL DB

Event Description Metric
Active connection usage. Number of active connections is more than 90% of the maximum connections. Connection Usage (max_conn_pct).

For more information about this sensor, see the PostgreSQL documentation.

Process

Event Description Metric
High CPU usage. Evaluates whether the given process is causing high CPU usage on a host. Results of high CPU usage rule evaluation on the underlying host and CPU user time of the given process.
Too many open files. Open files percentage is higher than the configured threshold. Used (openFiles.used).

RabbitMQ

RabbitMQ Cluster

Event Description Metric
RabbitMQ network partition detected. Detects if network partition occurs inside the RabbitMQ cluster (triggered every 5 seconds). Total number of Network partitions (net_partitions_count).

RabbitMQ Server

Event Description Metric
Queues are filling up with messages. Over a period of 10 minutes, queues are filling up with messages that are not delivered. Messages ready (overview.messages_ready) and messages acknowledged (overview.ack).
RabbitMQ has no consumers. In the last 5 seconds, RabbitMQ has had no consumers. Consumers (overview.consumers).
RabbitMQ has no connections. In the last 5 seconds, RabbitMQ has had no connections. Connections (overview.connections).

RabbitMQ Nodes

Event Description Metric
RabbitMQ File Descriptors Usage is critical. File descriptors usage rate is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds. RabbitMQ file descriptors used rate (fd_used_rate).
RabbitMQ Memory Usage is critical on node. Memory usage rate is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds. RabbitMQ memory used rate (mem_used_rate).
RabbitMQ Erlang Processes count is critical. Erlang Processes count is critical on a specific node (Warning: > 90%, Critical: > 98%). This is triggered every 5 seconds. RabbitMQ processes rate.
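
The two-level thresholds shared by the three node events above (Warning: > 90%, Critical: > 98%) amount to a small classification step. The sketch below is illustrative; the function name and the string labels are hypothetical, and the input is assumed to be a usage rate expressed as a percentage.

```python
from typing import Optional

WARNING_PERCENT = 90.0
CRITICAL_PERCENT = 98.0


def classify_usage(rate_percent: float) -> Optional[str]:
    """Map a usage rate in percent to a severity, or None when no event fires."""
    if rate_percent > CRITICAL_PERCENT:
        return "critical"
    if rate_percent > WARNING_PERCENT:
        return "warning"
    return None
```

Checking the critical limit first ensures that a rate above 98% is reported once, at the higher severity.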

RabbitMQ Queues

Event Description Metric
More messages are being produced than consumed. More messages are being published to a queue than the consumers can process from a queue. RabbitMQ unacknowledged messages in a queue.

For more information about this sensor, see the RabbitMQ documentation.

Redis

Redis Cluster

Event Description Metric
Redis cluster state isn't ok. Cluster is in an inappropriate state. cluster_state.

Redis Node

Event Description Metric
Memory allocation analysis. Redis server is causing external memory fragmentation. Used memory (used_memory) and memory fragmentation ratio (mem_fragmentation_ratio).
Redis hit rate is low. Redis hit rate is below the configured threshold. Cache hit rate (hit_rate), keyspace hits (keyspace_hits), keyspace misses (keyspace_misses), and Redis evicted keys (evicted_keys).
Redis memory usage is getting closer to max memory limit. Redis memory usage is getting closer to max memory limit. Used memory (used_memory).
Redis rejecting connections. Redis is rejecting connections. Number of rejected connections (rejected_connections).
Redis slave node can't connect to master node. Redis slave node can't connect to the master node. master_downtime_seconds.

For more information about this sensor, see the Redis documentation.

Service

Event Description Metric
Complete drop in calls. Detects a rapid drop to zero (essentially the service is not being called anymore) in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the relative and absolute threshold parameters below. Calls/s (count).
Error rate too high. Detects a consistently high error rate when the average errors KPI within the last four minutes is above the given threshold value. Error rate (error_rate).
Increasing trend in error rate. Checks for the presence of an increasing trend in a given metric. The rule is tuned to detect weakly monotonic increases in the given metric. The detector is, however, not strict and tolerates a certain amount of decrease in the metric value inside the trend candidate. Error rate (error_rate).
Sudden drop in calls. Detects a rapid drop in the values of the calls KPI metric relative to the values in the last 30 minutes. The magnitude of the drop in calls should also exceed the relative and absolute threshold parameters below. Calls/s (count).
Sudden increase in error rate. Detects a rapid increase in the values of the errors KPI relative to the KPI's values in the last 10 minutes. The magnitude of the increase in errors should also exceed the relative and absolute threshold parameters below. Error Rate (error_rate).
Sudden increase in latency. Detects a rapid increase in the given latency KPI percentile relative to the KPI's values in the last 30 minutes. The magnitude of the increase in latency should also exceed the relative and absolute threshold parameters below. Latency 50th (duration.50th).
Sudden increase in latency for a fraction of requests. Detects a rapid increase in the given latency KPI percentile relative to the KPI's values in the last 30 minutes. The magnitude of the increase in latency should also exceed the relative and absolute threshold parameters below. Latency 99th (duration.99th).

Solr

Solr Cloud Cluster

Event Description Metric
Unreachable Solr nodes. One or more nodes are down. unreachableNodes.

Solr Node

Event Description Metric
Solr cache hit rate is low. Solr cache hit rate is below 80% over the last minute, possibly due to high evictions or clients querying the wrong data. Solr Hit Ratio (hitratio) and Solr evictions.

For more information about this sensor, see the Apache Solr documentation.

Spark

Spark Application

Event Description Metric
Failed tasks on executor. Number of failed tasks on an executor exceeds the configured threshold. Spark Application failed tasks.
Scheduling delay is high. Scheduling delay is increasing too fast or is too high. Scheduling Delay (schedulingDelay).

Spark Standalone

Event Description Metric
Driver has failed. Number of failed drivers exceeds the configured threshold. Number of failed drivers (drivers.failed).
Spark standalone master is reporting dead worker(s). Number of dead workers exceeds the configured threshold. Dead workers (workers.deadWorkers).
Spark standalone master is reporting worker(s) in unknown state. Number of workers in an unknown state exceeds the configured threshold. Workers in unknown state (workers.workersInUnknownState).
Submitted app has failed. Number of failed applications exceeds the configured threshold.

For more information about this sensor, see the Apache Spark documentation.

Spring Boot App

Event Description Metric
Number of active sessions reached maximum number. A processing pipeline detects the number of active connections of the SpringBoot application in the given time window. It validates whether the number of active sessions is greater than the threshold value. Active sessions (metrics.httpsessions.active).
Spring Boot Application down. Monitors the status of the SpringBoot Application. Status of SpringBoot Application (metrics.status).

For more information about this sensor, see the Spring Boot documentation.

Sybase Server

Event Description Metric
Available server connections are at limit. Number of connections is close to 100% of connections limit per server. Connections (stats.connCount).
The maximum number of databases is at limit. Number of databases is close to 100% of databases limit per server. databasesCount.

For more information about this sensor, see the Sybase SQL Anywhere documentation.

Tibco EMS

Event Description Metric
Connections exceeds max available connections. The max number of connections is almost used up. Connections Count (connectionCount).
Messages memory usage exceeds the limit. The maximum message memory is almost used up. Messages Memory (messagesMemory).
Queues pending messages exceeds the limit. The max number of pending messages for queue is almost used up. Queue pending messages usage.
Topics pending messages exceeds the limit. The max number of pending messages for topic is almost used up. Topic pending messages usage.

For more information about this sensor, see the Tibco EMS documentation.

Tomcat

Event Description Metric
Active connections reached maximum. Detects if the number of connections of specific connector is reaching its maximum configured value. Number of connector connection count.
Sudden drop in the number of sessions. Checks for a significant drop in the number of sessions. Total session count (totalSessionCount).
Sudden increase in the number of sessions. Checks for a significant increase in the number of sessions. Total session count (totalSessionCount).
Threads number reached maximum. Detects if the number of busy threads of specific connector is reaching its maximum configured value. Number of connector busy threads.

For more information about this sensor, see the Tomcat documentation.

Varnish Node

Event Description Metric
Sudden drop in the number of requests. Checks for a sudden drop in the number of client requests. Received client requests (client_req).
Sudden increase in evicted objects. Checks for a sudden increase in the number of evicted objects. Nuked Objects (n_lru_nuked).
Thread creation is failing. Too many thread creations failed. Failed (threads_failed) and limited (threads_limited).
Varnish backend is marked unhealthy. Varnish backend server is unhealthy or is not available. Unhealthy (backend_unhealthy).
Varnish hit rate is low. Varnish hit rate is very low. Cache Hit Rate (cache_hit_rate).
Varnish is out of worker threads. Varnish is out of worker threads. Connections dropped due to a full queue (sess_dropped).

For more information about this sensor, see the Varnish documentation.

Vault

Event Description Metric
Vault is sealed. Detects if the sealed status is set to true. Sealed (sealed).
Sudden increase in secret reads. Checks for a sudden increase (increase by 60% based on the average of the last 5 minutes) in the number of secrets read. Secrets read count (secret.read.count).
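
One way to read the "increase by 60% based on the average of the last 5 minutes" rule is as a comparison of the current value against a recent baseline. The sketch below is a rough illustration under that assumption; the function name, the sample layout, and the handling of an empty or zero baseline are hypothetical and do not describe the actual detection logic.

```python
from typing import Sequence


def sudden_increase(
    current: float, recent_samples: Sequence[float], increase: float = 0.60
) -> bool:
    """True when current exceeds the average of recent_samples by more than
    the given relative increase. recent_samples is assumed to hold the
    secret read counts sampled over the last 5 minutes."""
    if not recent_samples:
        return False  # assumption: no baseline, no event
    baseline = sum(recent_samples) / len(recent_samples)
    if baseline <= 0:
        return False  # assumption: a zero baseline is not evaluated
    return current > baseline * (1.0 + increase)
```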

For more information about this sensor, see the Vault documentation.

WebLogic Server

Event Description Metric
Datasource error state. A processing pipeline monitors status codes of the WebLogicApplications data sources, and checks if any data source is unhealthy. WebLogic datasource status.
Health state. Detects overall system degradation based on reported health state. Health State status.

For more information about this sensor, see the WebLogic documentation.

WebSphere

Event Description Metric
WebContainer thread pool active threads reached maximum. A processing pipeline validates that the number of active threads in the WebContainer thread pool is reaching the maximum limit. Active threads (threadPools.webContainer.activeThreads).

For more information about this sensor, see the WebSphere AS documentation.

ZooKeeper

Event Description Metric
Maximum request latency is high. A processing pipeline checks if the maximum request latency is reaching the threshold value. Max request latency (max_request_latency).
Number of queued requests is high. A processing pipeline detects the number of queued requests and validates whether the number is reaching the threshold value. Outstanding request count (outstanding_requests).

For more information about this sensor, see the ZooKeeper documentation.