Monitoring Kafka

Supported Versions

All Kafka metrics Instana collects are available for every version of Apache Kafka, Cloudera Kafka and Confluent Kafka, apart from the Consumer group lag and the Consumer/Producer Byte Rate/Throttling metrics.

Consumer group lag metrics are available for:

  • Apache Kafka versions from 0.11.x.x to 2.x.x
  • Cloudera Kafka versions from 3.x.x to 4.1.x
  • Confluent Kafka versions from 3.3.x to 6.x.x.

Consumer/Producer Byte Rate/Throttling metrics are available for Java Kafka clients only, with:

  • Apache Kafka versions from 1.1.x to 2.x.x
  • Cloudera Kafka versions from 4.0.x to 4.1.x
  • Confluent Kafka versions from 4.1.x to 6.x.x.

Configuration

The Instana agent automatically detects a running Kafka instance, so no configuration is required.

Instana collects the first 400 topics sorted by topic name.

To filter topics, configure the Kafka plugin in the agent configuration file <agent_install_dir>/etc/instana/configuration.yaml:

com.instana.plugin.kafka:
  topicsRegex: '<OPTIONAL_REGEX_HERE>'
  brokerPropertiesFilePath: '/path/to/server.properties'
  collectLagData: '' # true or false. The default value is true

  • topicsRegex: Optional regular expression that selects up to 400 topics by name. If the value is empty or not set, Instana collects the first 400 topics sorted by name.
  • brokerPropertiesFilePath: Path to the broker's server.properties file, which the agent reads to obtain the broker's network and security protocol settings.
  • collectLagData: Flag that explicitly enables or disables lag data collection (enabled by default).
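The topic selection described above can be sketched as follows. This is only an illustration, not the agent's implementation: the function name select_topics is hypothetical, the sketch assumes the regular expression must match the whole topic name, and Python's regex dialect may differ from the one the agent uses.

```python
import re

MAX_TOPICS = 400  # limit stated in this section


def select_topics(topic_names, topics_regex=None, limit=MAX_TOPICS):
    """Return up to `limit` topic names sorted by name.

    Hypothetical helper: if `topics_regex` is empty or None, no filtering
    is applied, mirroring the documented default behaviour.
    """
    if topics_regex:
        pattern = re.compile(topics_regex)
        # Assumption: the pattern has to match the complete topic name.
        candidates = [name for name in topic_names if pattern.fullmatch(name)]
    else:
        candidates = list(topic_names)
    return sorted(candidates)[:limit]
```

For example, select_topics(["orders.eu", "payments", "orders.us"], r"orders\..*") keeps only the two orders topics, while an empty regex keeps all topics up to the limit.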

If brokerPropertiesFilePath is not specified, the agent looks for server.properties in the following places:

  • Kafka broker process arguments
  • KAFKA_SERVER_PROPERTIES environment variable
  • Predefined paths: /path_to_kafka_home/config/server.properties, or /path_to_kafka_home/etc/kafka/server.properties for Confluent Kafka.

If server.properties cannot be found in any of these places, the agent falls back to the default path /opt/kafka/config/server.properties.
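The lookup for server.properties can be pictured as a chain of fallbacks. The sketch below is an assumption-laden illustration, not the agent's code: the function name, its parameters, the rule for spotting the file in the process arguments (any argument ending in ".properties"), and the idea that the places are tried in the listed order are all hypothetical.

```python
import os
import posixpath

DEFAULT_PATH = "/opt/kafka/config/server.properties"  # default named in this section


def resolve_server_properties(configured_path=None, process_args=(),
                              env=None, kafka_home=None,
                              exists=os.path.isfile):
    """Return the server.properties path following the documented fallbacks."""
    env = os.environ if env is None else env

    # Explicit brokerPropertiesFilePath from configuration.yaml wins.
    if configured_path:
        return configured_path

    # 1. Kafka broker process arguments (assumed detection rule).
    for arg in process_args:
        if arg.endswith(".properties"):
            return arg

    # 2. KAFKA_SERVER_PROPERTIES environment variable.
    from_env = env.get("KAFKA_SERVER_PROPERTIES")
    if from_env:
        return from_env

    # 3. Predefined paths below the Kafka home directory
    #    (the second one is the Confluent Kafka layout).
    if kafka_home:
        for relative in ("config/server.properties",
                         "etc/kafka/server.properties"):
            candidate = posixpath.join(kafka_home, relative)
            if exists(candidate):
                return candidate

    # Nothing found: fall back to the default path.
    return DEFAULT_PATH
```

The exists parameter is injectable only so that the sketch can be exercised without touching the file system.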

SSL/TLS Support

If your Kafka broker requires SSL client connections, configure the Instana agent in <agent_install_dir>/etc/instana/configuration.yaml so that it can collect consumer group lag metrics:

com.instana.plugin.kafka:
  ...
  sslTrustStore: '/path/to/truststore.jks'
  sslTrustStorePassword: 'kafkaTsPassword'
  sslKeyStore: '/path/to/sslKeyStoreFile.jks'
  sslKeyStorePassword: 'kafkaKsPassword'

Keys must be stored in the Java KeyStore (JKS) format. You can create the store files with the Java keytool utility.

Note: This configuration enables the Instana agent to connect to the Kafka broker over SSL and collect consumer group lag metrics.

Kafka Node - Metrics collection

Configuration data

  • Version
  • Zookeeper Connect
  • Process ID
  • Node ID
  • Topics/Partitions

Performance metrics

  • Produce Latency
  • Fetch Consumer Latency
  • Fetch Follower Latency

Broker Traffic

Metric | Description | Granularity
In | Aggregate incoming byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. | 1 second
Out | Aggregate outgoing byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec. | 1 second
Rejected | Aggregate rejected byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec. | 1 second

Broker Messages In

Metric | Description | Granularity
Count | Aggregate incoming message rate. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second

Produce Requests

Metric | Description | Granularity
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce. | 1 second
Mean Latency | Average latency, calculated by dividing the total time in ms spent serving the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce) by the Count above. | 1 second

Fetch Consumer Requests

Metric | Description | Granularity
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer. | 1 second
Mean Latency | Average latency, calculated by dividing the total time in ms spent serving the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer) by the Count above. | 1 second

Fetch Follower Requests

Metric | Description | Granularity
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower. | 1 second
Mean Latency | Average latency, calculated by dividing the total time in ms spent serving the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower) by the Count above. | 1 second
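The Mean Latency rows for Produce, Fetch Consumer and Fetch Follower requests all relate a request count to a total serving time. A minimal sketch of that calculation, assuming the mean is the total time divided by the number of requests observed over the same interval (the function and parameter names are illustrative and not part of Instana or Kafka):

```python
def mean_latency_ms(total_time_ms, request_count):
    """Average time per request in milliseconds.

    Assumption: both inputs cover the same measurement interval.
    Returns 0.0 when no requests were served, to avoid dividing by zero.
    """
    if request_count <= 0:
        return 0.0
    return total_time_ms / request_count
```

With 300 requests that took 1500 ms in total, the sketch yields a mean latency of 5 ms per request.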

Average Idle Time

Metric | Description | Granularity
Network Processor | Average fraction of time the network processor threads are idle, between 0% (all resources are used) and 100% (all resources are available). Collected from kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent. | 1 second
Request Handler | Average fraction of time the request handler threads are idle, between 0% (all resources are used) and 100% (all resources are available). Collected from kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent. | 1 second

Broker Failures

Metric | Description | Granularity
Fetch | Rate of failed fetch requests. Collected from kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec. | 1 second
Produce | Rate of failed produce requests. Collected from kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec. | 1 second

Broker State Metrics

Metric | Description | Granularity
Under-replicated Partitions | Number of under-replicated partitions (ISR < all replicas). Collected from kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. | 1 second
Offline Partitions | Number of partitions that do not have an active leader and are therefore neither writable nor readable. Collected from kafka.controller:type=KafkaController,name=OfflinePartitionsCount. | 1 second
Leader Elections | Leader election rate and latency. Collected from kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs. | 1 second
Unclean Leader Elections | Unclean leader election rate. Collected from kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec. | 1 second
ISR Shrinks | If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISR expands once the replicas are fully caught up. Other than that, the expected value for both the ISR shrink rate and the expansion rate is 0. Collected from kafka.server:type=ReplicaManager,name=IsrShrinksPerSec. | 1 second
ISR Expansions | When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it has caught up, it is added back to the ISR. Collected from kafka.server:type=ReplicaManager,name=IsrExpandsPerSec. | 1 second
Active controller count | Number of active controllers in the cluster. Collected from kafka.controller:type=KafkaController,name=ActiveControllerCount. | 1 second

Partitions

Metric | Description | Granularity
Count | Number of partitions on this broker, which should be roughly even across all brokers. Collected from kafka.server:type=ReplicaManager,name=PartitionCount. | 1 second

Log Flushing

Metric | Description | Granularity
Mean | Log flush rate. Collected from kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. | 1 second
Flushes | Log flush count. Collected from kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. | 1 second

Topics

Metric | Description | Granularity
Name | Name of the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second
Partitions | Number of partitions of the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second
Bytes In | Aggregate incoming byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. | 1 second
Bytes Out | Aggregate outgoing byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec. | 1 second
Bytes Rejected | Aggregate rejected byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec. | 1 second
Messages In | Aggregate incoming message rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second
In-Sync Replicas | In-sync replicas count. Collected from kafka.cluster:type=Partition,name=InSyncReplicasCount. | 1 second

Kafka Cluster - Metrics collection

Configuration data

  • Cluster Name
  • Zookeeper
  • Nodes (Name, Version)
  • Topics/Partitions

Performance metrics

  • All Brokers Messages In
  • Rejected Traffic
  • Fetch Consumer Latency
  • Fetch Follower Latency

Average Request Latency vs Throughput

Metric | Description | Granularity
Produce Throughput | Sum of the Produce Requests Count from all nodes. | 1 second
Fetch Consumer Throughput | Sum of the Fetch Consumer Requests Count from all nodes. | 1 second
Fetch Follower Throughput | Sum of the Fetch Follower Requests Count from all nodes. | 1 second
Produce Latency | Sum of the Produce Requests Latency from all nodes. | 1 second
Fetch Consumer Latency | Sum of the Fetch Consumer Requests Latency from all nodes. | 1 second
Fetch Follower Latency | Sum of the Fetch Follower Requests Latency from all nodes. | 1 second

All Brokers Traffic

Metric | Description | Granularity
In | Sum of the Broker Traffic In from all nodes. | 1 second
Out | Sum of the Broker Traffic Out from all nodes. | 1 second
Rejected | Sum of the Broker Traffic Rejected from all nodes. | 1 second

All Brokers Failures

Metric | Description | Granularity
Fetch | Sum of the Broker Failures Fetch from all nodes. | 1 second
Produce | Sum of the Broker Failures Produce from all nodes. | 1 second

All Brokers State Metrics

Metric | Description | Granularity
Under-replicated Partitions | Sum of the Broker State Metrics Under-replicated Partitions from all nodes. | 1 second
Offline Partitions | Sum of the Broker State Metrics Offline Partitions from all nodes. | 1 second
Leader Elections | Sum of the Broker State Metrics Leader Elections from all nodes. | 1 second
Unclean Leader Elections | Sum of the Broker State Metrics Unclean Leader Elections from all nodes. | 1 second
ISR Shrinks | Sum of the Broker State Metrics ISR Shrinks from all nodes. | 1 second
ISR Expansions | Sum of the Broker State Metrics ISR Expansions from all nodes. | 1 second
Active controller count | Sum of the Broker State Metrics Active controller count from all nodes. | 1 second

Average Idle Time Percentage

Metric | Description | Granularity
Network Processor | Average of the Average Idle Time Network Processor from all nodes. | 1 second
Request Handler | Average of the Average Idle Time Request Handler from all nodes. | 1 second
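The cluster tables combine per-node values in two ways: traffic, failure and state metrics are summed across nodes, while idle-time percentages are averaged. A minimal sketch of that roll-up, using hypothetical dictionary keys that stand in for the per-node metrics (this is an illustration, not Instana's data model):

```python
from statistics import mean

# Hypothetical metric keys, chosen only for this illustration.
SUMMED = ("bytes_in", "bytes_out", "failed_fetch", "under_replicated_partitions")
AVERAGED = ("network_processor_idle_pct", "request_handler_idle_pct")


def aggregate_cluster(nodes):
    """Combine per-node metric dicts into one cluster-level dict.

    Metrics in SUMMED are added up across all nodes; metrics in AVERAGED
    are averaged. An empty node list yields an empty result.
    """
    if not nodes:
        return {}
    cluster = {key: sum(node[key] for node in nodes) for key in SUMMED}
    cluster.update({key: mean([node[key] for node in nodes]) for key in AVERAGED})
    return cluster
```

Summing an idle percentage would produce values above 100% as soon as the cluster has more than one node, which is why the idle-time metrics are averaged instead.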

Log Flushing

Metric | Description | Granularity
Mean | Sum of the Log Flushing Mean from all nodes. | 1 second
Flushes | Sum of the Log Flushing Flushes from all nodes. | 1 second

Cluster Nodes

Metric | Description | Granularity
Controller | Whether the node is the controller (Yes/No). | 1 second
Messages In | Chart with the count of the Broker Messages In. | 1 second
Bytes In | Chart with the count of the Broker Bytes In. | 1 second
Bytes Out | Chart with the count of the Broker Bytes Out. | 1 second
Average Response Time | Chart with the count of the Broker Average Response Time. | 1 second
Health | The node health indicator. | 1 second

Cluster Topics

Metric | Description | Granularity
Partitions | Number of partitions. | 10 minutes
Bytes In | Chart with the count of the Topic Bytes In. | 1 second
Bytes Out | Chart with the count of the Topic Bytes Out. | 1 second
Bytes Rejected | Chart with the count of the Topic Bytes Rejected. | 1 second
Messages In | Chart with the count of the Topic Messages In. | 1 second

Consumer group lag

Metric | Description | Granularity
Lag | Consumer group lag per topic. | 60 seconds
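This section does not say how the lag value is derived. Under the common definition, a consumer group's lag on a partition is the partition's latest (log end) offset minus the group's committed offset, and the per-topic lag is the sum over the topic's partitions. The sketch below assumes that definition; the function name and input shapes are hypothetical, and treating a missing committed offset as 0 is a simplification made for the example.

```python
from collections import defaultdict


def lag_per_topic(end_offsets, committed_offsets):
    """Sum consumer group lag per topic.

    Both arguments map (topic, partition) -> offset. Partitions without a
    committed offset are treated as committed at 0 (assumption), and
    negative differences are clamped to 0.
    """
    lag = defaultdict(int)
    for (topic, partition), end in end_offsets.items():
        committed = committed_offsets.get((topic, partition), 0)
        lag[topic] += max(end - committed, 0)
    return dict(lag)
```

For instance, a topic with two partitions whose group is 10 messages behind on one partition and fully caught up on the other has a lag of 10.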

Consumers

Metric | Description | Granularity
Byte Rate | The number of bytes consumed per second. | 1 second
Throttling | Average throttle time. | 1 second
Latency | Average fetch latency. | 1 second

Producers

Metric | Description | Granularity
Byte Rate | The number of outgoing bytes sent per second. | 1 second
Throttling | Average throttle time. | 1 second
Latency | Average request latency. | 1 second

Note: To enable the Instana agent client to query the Kafka broker for lag-related data, add the PLAINTEXT security protocol for localhost socket connections to the Kafka broker configuration file.

Health Signatures

For each sensor, there is a curated knowledge base of health signatures that are evaluated continuously against the incoming metrics and are used to raise issues or incidents depending on user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

For information about built-in events for Kafka Node and Cluster, see the Built-in events reference.

Troubleshooting

SSL not configured

Monitoring issue type: kafka_ssl_not_configured

To resolve this issue, follow the steps in SSL/TLS Support to configure the Kafka SSL truststore location and password.

SSL client authentication not configured

Monitoring issue type: kafka_ssl_client_not_configured

To resolve this issue, follow the steps in SSL/TLS Support to configure Kafka SSL client authentication (keystore location and password).