Monitoring Elasticsearch

Configuration

Instana automatically monitors up to 1000 indices and collects the 5 most important metrics per index. To enable in-depth index monitoring (about 20 metrics per index) for up to 200 indices, specify indicesRegex in the agent configuration file <agent_install_dir>/etc/instana/configuration.yaml:

com.instana.plugin.elasticsearch:
  enabled: true # true
  indicesRegex: '.*'
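
Before rolling out a pattern, it can help to preview which index names it would select. The following is a minimal sketch, not the agent's actual selection logic. The function name preview_detailed_indices, the sample index names, the whole-name matching, and the sorted-order truncation are all assumptions made for illustration. Only the limit of 200 comes from the text above. Note also that Python's re module differs in places from the regex dialect the agent may use.

```python
import re

# Limit taken from the text above: in-depth monitoring covers up to 200 indices.
MAX_DETAILED_INDICES = 200


def preview_detailed_indices(index_names, indices_regex, limit=MAX_DETAILED_INDICES):
    """Return the index names that match indices_regex, capped at `limit`.

    Assumptions (not confirmed by the agent documentation): the pattern must
    match the whole index name, and when more than `limit` names match, the
    first `limit` names in sorted order are kept.
    """
    pattern = re.compile(indices_regex)
    matched = sorted(name for name in index_names if pattern.fullmatch(name))
    return matched[:limit]


# Hypothetical index names, used only to illustrate the call.
indices = ["logs-2024.01", "logs-2024.02", "metrics-cpu", ".kibana"]
selected = preview_detailed_indices(indices, r"logs-.*")
print(selected)
```

A narrower pattern than '.*' keeps the in-depth metrics focused on the indices that matter when a cluster holds more than 200 indices.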

Metrics collection

Node-Level

Configuration data

  • Version
  • Cluster
  • Health Status
  • Node Name
  • Node Type
  • Node is Master
  • Node is Master Eligible
  • Transport
  • Log Directory
  • Shards
  • Indices

Performance metrics

| Data point | Description | Granularity |
| --- | --- | --- |
| Query Latency | Query latency is collected from NodeIndicesStats#SearchStats. | 1 second |
| Number of Queries | Query count per second is collected from NodeIndicesStats#SearchStats. | 1 second |
| Overall Documents | The total number of documents is collected from DocsStats#count. | 1 second |
| Added Documents | The total number of indexing operations is collected from IndexingStats#indexCount. | 1 second |
| Removed Documents | The number of delete operations executed is collected from IndexingStats#deleteCount. | 1 second |
| Active Shards | The number of active shards is collected from IndexRoutingTable#ShardRouting. | 1 second |
| Active Primary Shards | The number of active primary shards is collected from IndexRoutingTable#ShardRouting. | 1 second |
| Refresh Count | The number of refreshes executed per second is collected from NodeIndicesStats#RefreshStats. | 1 second |
| Refresh Time | The total time spent on refreshes is collected from NodeIndicesStats#RefreshStats. | 1 second |
| Flush Count | The number of flushes executed per second is collected from NodeIndicesStats#FlushStats. | 1 second |
| Flush Time | The total time spent on flushes is collected from NodeIndicesStats#FlushStats. | 1 second |
| Indices metrics | Document count, deleted count, and size per index are collected from IndexStats#DocsStats. | 1 second |
| Lucene Segments | The number of segments is collected from NodeIndicesStats#SegmentsStats#count. | 1 second |
| Active Threads | Active threads for Search, Index, Bulk, Merge, Flush, Get, Management, and Refresh are collected from ThreadPoolStats.Stats#active. | 1 second |
| Queued Threads | Queued threads for Search, Index, Bulk, Merge, Flush, Get, Management, and Refresh are collected from ThreadPoolStats.Stats#queue. | 1 second |
| Rejected Threads | Rejected threads for Search, Index, Bulk, and Get are collected from ThreadPoolStats.Stats#rejected. | 1 second |
| Sent Data | The size of TX packets sent by the node during internal cluster communication is collected from TransportStats#tx_size. | 1 second |
| Received Data | The size of RX packets received by the node during internal cluster communication is collected from TransportStats#rx_size. | 1 second |

Index metrics

| Data point | Description | Granularity |
| --- | --- | --- |
| Total Queries | The total number of query operations is collected from SearchStats.Stats#queryTotal. | 1 second |
| Queries Current | The number of query operations currently running is collected from SearchStats.Stats#queryCurrent. | 1 second |
| Fetches Total | The total number of fetch operations is collected from SearchStats.Stats#fetchCount. | 1 second |
| Fetches Current | The number of fetch operations currently running is collected from SearchStats.Stats#fetchCurrent. | 1 second |
| Query Time | Time in milliseconds spent performing query operations is collected from SearchStats.Stats#queryTimeInMillis. | 1 second |
| Fetch Time | Time in milliseconds spent performing fetch operations is collected from SearchStats.Stats#fetchTimeInMillis. | 1 second |
| Query Cache Memory | The total amount of memory used for the query cache is collected from QueryCacheStats#ramBytesUsed. | 1 second |
| Query Cache Evictions | The number of query cache evictions is collected from QueryCacheStats#evictions. | 1 second |
| Request Cache Memory | The total amount of memory used for the request cache is collected from RequestCacheStats#ramBytesUsed. | 1 second |
| Request Cache Evictions | The number of request cache evictions is collected from RequestCacheStats#evictions. | 1 second |
| Get Requests | The total number of Get requests is collected from GetStats#count. | 1 second |
| Get Requests Time | Time in milliseconds spent on Get requests is collected from GetStats#timeInMillis. | 1 second |
| Get Requests Failed | The number of failed Get requests is collected from GetStats#missingCount. | 1 second |
| Get Requests Failed Time | Time in milliseconds spent on failed Get requests is collected from GetStats#missingTimeInMillis. | 1 second |
| Indexing Operations Failed | The number of failed indexing operations is collected from IndexingStats#indexFailedCount. | 1 second |
| Active Merges Count | The number of merges currently executing is collected from MergeStats#current. | 1 second |
| Total Merges Size | The total size of merges executed is collected from MergeStats#totalSizeInBytes. | 1 second |
| Total Merges Time | The total time spent executing merges is collected from MergeStats#totalTimeInMillis. | 1 second |

The index metrics listed above are enabled only for indices that match the regular expression set in indicesRegex in the agent configuration.
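
Several of the index metrics come in count and time pairs, such as Total Queries with Query Time. If the reported values are cumulative counters, which is an assumption here and should be checked against the data you actually see, an average time per operation over an interval can be derived from two samples. The SearchSample class and the avg_query_time_ms function below are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchSample:
    """One sample of two index metrics (hypothetical container for this sketch)."""
    query_total: int     # Total Queries, assumed to be a cumulative counter
    query_time_ms: int   # Query Time in milliseconds, assumed cumulative


def avg_query_time_ms(prev: SearchSample, curr: SearchSample) -> Optional[float]:
    """Average time per query between two samples.

    Returns None when no queries ran in the interval or when a counter went
    backwards (for example after a node restart), since no meaningful average
    exists in either case.
    """
    delta_queries = curr.query_total - prev.query_total
    delta_time = curr.query_time_ms - prev.query_time_ms
    if delta_queries <= 0 or delta_time < 0:
        return None
    return delta_time / delta_queries


# Illustrative values only: 50 queries took 400 ms in total.
earlier = SearchSample(query_total=1000, query_time_ms=5000)
later = SearchSample(query_total=1050, query_time_ms=5400)
print(avg_query_time_ms(earlier, later))  # 8.0
```

The same pattern applies to the other count and time pairs, for example Get Requests with Get Requests Time.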

Health Signatures

For each sensor, there is a curated knowledge base of health signatures that are evaluated continuously against the incoming metrics and are used to raise issues or incidents depending on user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

For information about built-in events for the Elasticsearch Node, see the Built-in events reference.

Cluster-Level

Configuration data

  • Name
  • Health Status
  • Nodes, Masters

Performance metrics

| Data point | Description | Granularity |
| --- | --- | --- |
| Query Latency | Query latency is calculated as the maximum query latency across all nodes. | 1 second |
| Number of Queries | Query count is calculated as the sum of the query counts of all nodes. | 1 second |
| Overall Documents | Total documents is calculated as the sum of overall documents of all nodes. | 1 second |
| Added Documents | Added documents is calculated as the sum of added documents of all nodes. | 1 second |
| Removed Documents | Removed documents is calculated as the sum of removed documents of all nodes. | 1 second |
| Indices | The number of indices. | 1 second |
| Shards | Active, Active Primary, Initializing, Relocating, and Unassigned shard counts are collected from ClusterHealth. | 1 second |
| Cluster State size | The size of the ClusterState. | 1 second |
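
The first five cluster-level values are derived from node-level values: latency uses the maximum, and the counts use sums. The following sketch illustrates that roll-up under stated assumptions. The NodeMetrics class, its field names, and the aggregate_cluster function are hypothetical and are not part of the sensor.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class NodeMetrics:
    """Node-level values for one sampling interval (hypothetical container)."""
    query_latency_ms: float
    query_count: int
    overall_documents: int
    added_documents: int
    removed_documents: int


def aggregate_cluster(nodes: Iterable[NodeMetrics]) -> Optional[Dict[str, float]]:
    """Roll node metrics up to cluster level as the table describes.

    Query latency is the maximum over all nodes; every other value is a sum.
    Returns None for an empty cluster, because a maximum is undefined there.
    """
    nodes = list(nodes)
    if not nodes:
        return None
    return {
        "query_latency_ms": max(n.query_latency_ms for n in nodes),
        "query_count": sum(n.query_count for n in nodes),
        "overall_documents": sum(n.overall_documents for n in nodes),
        "added_documents": sum(n.added_documents for n in nodes),
        "removed_documents": sum(n.removed_documents for n in nodes),
    }


# Illustrative values for a two-node cluster.
cluster = aggregate_cluster([
    NodeMetrics(12.5, 100, 5000, 40, 3),
    NodeMetrics(30.0, 250, 7000, 10, 1),
])
print(cluster)
```

Because latency uses the maximum rather than an average, a single slow node is enough to raise the cluster-level value.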

Health Signatures

For information about built-in events for the Elasticsearch Cluster, see the Built-in events reference.