Monitoring Host

Supported OS

  • Linux
  • Windows
  • macOS (OS X)
  • Solaris on Sparc
  • AIX

Configuration

For detailed information, see our agent configuration documentation.

Metrics collection

Configuration data

  • Operating System name and version
  • CPU model and count
  • GPU model and count
  • Memory
  • Max Open Files
  • Hostname
  • Fully Qualified Domain Name
  • Machine ID
  • Boot ID
  • Startup time
  • Installed packages

Performance metrics

CPU usage

Overall CPU usage as a percentage.

Collected from: Filesystem

Granularity: 1 second
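On Linux, the overall percentage can be derived from two samples of the aggregate cpu line in /proc/stat. A minimal sketch, assuming a Linux host with a standard /proc layout (the agent uses its own collection mechanism):

```shell
# Sketch: overall CPU usage % from two samples of /proc/stat.
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
read -r _ u n s i w x y z _ < /proc/stat
t1=$((u + n + s + i + w + x + y + z)); idle1=$((i + w))
sleep 1
read -r _ u n s i w x y z _ < /proc/stat
t2=$((u + n + s + i + w + x + y + z)); idle2=$((i + w))
echo "cpu usage: $(( 100 * ((t2 - t1) - (idle2 - idle1)) / (t2 - t1) ))%"
```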

Memory usage

Total memory used as a percentage.

Collected from: Filesystem

Granularity: 1 second

CPU load

The average number of processes that are running or waiting to run over the selected period of time.

Collected from: Filesystem

Granularity: 5 seconds
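On Linux hosts, the kernel exposes these averages directly; a quick way to inspect the values behind this datapoint (assuming a standard /proc layout):

```shell
# /proc/loadavg holds the 1-, 5-, and 15-minute load averages,
# followed by running/total task counts and the last PID.
read -r one five fifteen tasks lastpid < /proc/loadavg
echo "load averages: 1m=$one 5m=$five 15m=$fifteen"
```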

CPU usage

CPU usage values as a percentage: user, system, wait, nice, and steal. The values are displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second

Context switches

The total number of context switches. This is supported only on Linux hosts. The value is displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second
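On Linux, the cumulative counter behind this metric is the ctxt line in /proc/stat; sampling it twice gives the number of switches in the interval:

```shell
# Context switches per second, from two samples of /proc/stat (Linux).
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/s: $((c2 - c1))"
```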

CPU load

CPU load. The value is displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second

Individual CPU usage

Individual CPU usage values as a percentage: user, system, wait, nice, and steal. The values are displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second

Individual GPU usage

Individual GPU usage values.

Datapoint Collected from Granularity Unit
Gpu Usage nvidia-smi 1 second %
Temperature nvidia-smi 1 second °C
Encoder nvidia-smi 1 second %
Decoder nvidia-smi 1 second %
Memory Used nvidia-smi 1 second %
Memory Total nvidia-smi 1 second bytes
Transmitted throughput nvidia-smi 1 second bytes/s
Received throughput nvidia-smi 1 second bytes/s

Supported Nvidia Graphics Cards:

Brand Model
Tesla S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100
Quadro 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series
GeForce varying levels of support, with fewer metrics available than on the Tesla and Quadro products

Supported OS: Linux

Prerequisites: The latest official Nvidia drivers are installed.

Starting the Instana Agent Docker container with GPU support is documented in Enable GPU monitoring through the Instana Agent container.

Note:

Data collection of GPU metrics is designed for minimal impact by splitting polling and querying into two processes that use the nvidia-smi command-line utility. The background process is started in loop mode and kept in memory, which significantly improves collection performance and avoids polling overhead. The sensor queries GPU metrics at the configured poll rate (every second by default), so it can gather accurate, up-to-date metrics every second for multiple GPUs without extra overhead.
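The same query/loop pattern can be reproduced manually with nvidia-smi; a sketch, assuming a Linux host with the official Nvidia drivers installed:

```shell
# Poll GPU utilization, temperature, and memory once per second using
# nvidia-smi's built-in loop mode (Ctrl-C to stop).
nvidia-smi \
    --query-gpu=utilization.gpu,temperature.gpu,memory.used,memory.total \
    --format=csv,noheader -l 1
```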


GPU Memory/Process

The following datapoints are collected for each process that uses the GPU.

Datapoint Collected from Granularity
Process Name nvidia-smi 1 second
PID nvidia-smi 1 second
GPU nvidia-smi 1 second
Memory nvidia-smi 1 second

The supported Nvidia graphics cards, supported operating systems, prerequisites, and collection notes are the same as for Individual GPU usage above.

Memory

Memory used as a percentage, and memory values in bytes: swap total, swap free, buffers, cached, and available. The values are displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second
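On Linux, these datapoints map to fields in /proc/meminfo; a quick way to inspect them:

```shell
# The memory datapoints above correspond to /proc/meminfo fields
# (values are reported by the kernel in kB).
grep -E '^(MemTotal|MemAvailable|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo
```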

Open files

Open files usage when available on the operating system; current vs max. The values are displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second
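On Linux, the current and maximum open-file counts are exposed by the kernel; a quick check:

```shell
# file-nr: allocated handles, unused handles, system-wide maximum
cat /proc/sys/fs/file-nr
# per-process limit for the current shell
ulimit -n
```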

Filesystems

Filesystems per device.

Datapoint Collected from Granularity
Device Filesystem 60 seconds
Mount Filesystem 60 seconds
Options Filesystem 60 seconds
Type Filesystem 60 seconds
Capacity Filesystem 60 seconds
Used Filesystem 1 second
Leaked* Filesystem 1 second
Inode usage Filesystem 1 second
Reads/s, Bytes Read/s** Filesystem 1 second
Writes/s, Bytes Written/s** Filesystem 1 second

* Leaked refers to deleted files that are still in use; it equates to capacity - used - free. On Linux, you can find these files with lsof | grep deleted.

** Reads/Writes are not supported for NFS (Network File System).
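The Leaked datapoint can be demonstrated from a Linux shell; a sketch showing a deleted-but-open file whose space remains allocated:

```shell
# Create a file, keep a descriptor open, then delete it: the space
# stays allocated (and counts as "leaked") until the fd is closed.
tmp=$(mktemp)
exec 3>"$tmp"          # keep a file descriptor open on the file
rm "$tmp"              # delete it while it is still open
ls -l /proc/self/fd/3  # on Linux the link target shows "(deleted)"
exec 3>&-              # close the descriptor; the space is reclaimed
```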

By default, Instana monitors only local filesystems. You can explicitly list the filesystems to be monitored or excluded in the configuration.yaml file. The configuration value is the device name, which can be obtained from the first column of the mtab file or of the df command output.
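For example, the device names can be listed from either source:

```shell
# Device names usable in configuration.yaml: first column of the
# mtab file, or first column of df output.
cut -d' ' -f1 /etc/mtab | sort -u | head -5
df | awk 'NR > 1 {print $1}' | sort -u | head -5
```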

Temporary filesystems need to be specified in the following format: tmpfs:/mount/point. For example, a list of filesystems to be monitored:

com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'

or to be included / excluded:

com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'

Network interfaces

Network traffic and errors per interface.

Datapoint Collected from Granularity
Interface Filesystem 60 seconds
Mac Filesystem 60 seconds
IPs Filesystem 60 seconds
RX Bytes Filesystem 1 second
RX Errors Filesystem 1 second
TX Bytes Filesystem 1 second
TX Errors Filesystem 1 second
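On Linux, the byte and error counters behind these datapoints are exposed per interface under sysfs; a sketch assuming a standard /sys layout:

```shell
# Per-interface counters live under /sys/class/net/<iface>/statistics/.
for i in /sys/class/net/*; do
    n=$(basename "$i")
    printf '%s rx_bytes=%s rx_errors=%s tx_bytes=%s tx_errors=%s\n' "$n" \
        "$(cat "$i/statistics/rx_bytes")" "$(cat "$i/statistics/rx_errors")" \
        "$(cat "$i/statistics/tx_bytes")" "$(cat "$i/statistics/tx_errors")"
done
```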

TCP activity

TCP activity values are displayed on a graph over a selected time period.

Datapoint Collected from Granularity
Established Filesystem 1 second
Open/s Filesystem 1 second
In Segments/s Filesystem 1 second
Out Segments/s Filesystem 1 second
Established Resets Filesystem 1 second
Out Resets Filesystem 1 second
Fail Filesystem 1 second
Error Filesystem 1 second
Retransmission Filesystem 1 second
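On Linux, the kernel-wide counters behind these datapoints can be inspected directly; a sketch assuming a standard /proc layout:

```shell
# Cumulative TCP counters (segments in/out, resets, retransmits,
# failed attempts) are on the Tcp lines of /proc/net/snmp.
grep '^Tcp:' /proc/net/snmp
# Currently established connections: state "01" in /proc/net/tcp.
awk 'NR > 1 && $4 == "01"' /proc/net/tcp | wc -l
```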

Process top list

The process toplist is refreshed approximately every 10 seconds and contains only processes with significant system usage: more than 10% CPU usage over the previous 10 seconds, and more than 512 MB of memory usage (RSS).

Linux top semantics are used; 100% CPU refers to a single CPU core, and you can search a history of snapshots for the previous month.

Datapoint Collected from Granularity
PID Filesystem 30 seconds
Process Name Filesystem 30 seconds
CPU Filesystem 30 seconds
Memory Filesystem 30 seconds
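A rough equivalent of this toplist can be produced with ps (illustrative only; the agent uses its own collection mechanism):

```shell
# Top processes by CPU; %CPU uses top semantics (100% = one core),
# RSS is resident memory in kilobytes.
ps -eo pid,comm,%cpu,rss --sort=-%cpu | head -6
```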

Installed Packages List

When collectInstalledSoftware is set to true in the configuration.yaml file, the packages installed on the operating system are collected once a day.

The following Linux distributions are currently supported:

  • Debian-based (dpkg)
  • RedHat-based (rpm and yum)

com.instana.plugin.host:
  collectInstalledSoftware: true # [true, false]
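The package lists on the supported distributions can also be inspected manually; a sketch (the agent's own collection mechanism may differ):

```shell
# Query installed packages on Debian-based (dpkg) or RedHat-based
# (rpm) systems; prints the first few name/version pairs.
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${Package} ${Version}\n' | head -5
elif command -v rpm >/dev/null 2>&1; then
    rpm -qa --qf '%{NAME} %{VERSION}\n' | head -5
else
    echo "no dpkg or rpm found"
fi
```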

Health signatures

For each sensor, there is a curated knowledgebase of health signatures that are evaluated continuously against the incoming metrics and are used to raise issues or incidents depending on user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

For information about the built-in events for the Host sensor, see the Built-in events reference.

Troubleshooting

eBPF Not Supported

Monitoring issue type: ebpf_not_supported

The Process Abnormal Termination functionality detects when processes running on a Linux-based Operating System terminate unexpectedly due to crashes or getting killed by outside signals.

This functionality is built on top of the extended Berkeley Packet Filter, which seems to be unavailable on this host.

To take advantage of Instana's eBPF-based features you need a 4.7+ Linux kernel with debugfs mounted. Refer to the Process Abnormal Termination documentation for more information on the supported Operating Systems.
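These prerequisites can be checked quickly from a shell; a sketch assuming standard tool locations:

```shell
# Check for a 4.7+ kernel and a mounted debugfs, both required for
# Instana's eBPF-based features.
kver=$(uname -r)
major=${kver%%.*}
rest=${kver#*.}
minor=${rest%%.*}
if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 7 ]; }; then
    echo "kernel $kver: ok"
else
    echo "kernel $kver: too old for eBPF features"
fi
if mount | grep -q debugfs; then
    echo "debugfs: mounted"
else
    echo "debugfs: not mounted"
fi
```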

SELinux policy blocking eBPF

If you have SELinux installed on your host, you usually need to create a policy to allow the agent to leverage eBPF. SELinux may prevent unconfined services such as the host agent from issuing the bpf_* syscalls that the eBPF sensor uses to instrument the Linux kernel. To verify that this is happening, check the log entries of the Audit system, which are stored by default in /var/log/audit/audit.log.

The following example is from a Red Hat Linux machine:

$ cat /var/log/audit/audit.log | grep ebpf
type=AVC msg=audit(1598891569.452:193): avc:  denied  { map_create } for  pid=1612 comm="ebpf-preflight-" 
scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:unconfined_service_t:s0 
tclass=bpf permissive=0
type=SYSCALL msg=audit(1598891569.452:193): arch=c000003e syscall=321 success=no exit=-13 
a0=0 a1=7ffc0e1f5020 a2=78 a3=fefefefefefefeff items=0 ppid=1502 pid=1612 auid=4294967295 
uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebpf-preflight-" 
exe="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin" 
subj=system_u:system_r:unconfined_service_t:s0 key=(null)
type=PROCTITLE msg=audit(1598891569.452:193):
proctitle="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin"

Note that audit log files are usually rotated, so run this command soon after starting the host agent.

In the log file, we see that the map_create operation was denied. To allow the eBPF sensor to perform this operation, we need to create an SELinux policy. For this we need the audit2allow program. On Red Hat systems, it can be installed as follows:

$ yum install policycoreutils-python

With audit2allow, we can then create raw policy files based on the log entries:

$ grep ebpf /var/log/audit/audit.log | audit2allow -M instana_ebpf

The command above will create the following files:

$ ls -Al | grep instana_ebpf
-rw-r--r--. 1 root                    root                      886 31. Aug 18:31 instana_ebpf.pp
-rw-r--r--. 1 root                    root                      239 31. Aug 18:31 instana_ebpf.te

The raw policy file, called instana_ebpf.te, now contains an instruction to allow the denied syscall:

$ cat instana_ebpf.te
module instana_ebpf 1.0;

require {
	type unconfined_service_t;
	class bpf map_create;
}

#============= unconfined_service_t ==============
#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf map_create;

This policy allows any app of the (very generic) unconfined type to perform the bpf map_create operation.

Additionally, the eBPF sensor needs a few more bpf permissions. Edit the instana_ebpf.te file so that it looks like this:

$ cat instana_ebpf.te
module instana_ebpf 1.0;

require {
	type unconfined_service_t;
	class bpf { map_create map_read map_write prog_load prog_run };
}

#============= unconfined_service_t ==============
#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf { map_create map_read map_write prog_load prog_run };

This file must then be compiled into binary format as the instana_ebpf.mod file:

$ checkmodule -M -m -o instana_ebpf.mod instana_ebpf.te
checkmodule:  loading policy configuration from instana_ebpf.te
checkmodule:  policy configuration loaded
checkmodule:  writing binary representation (version 19) to instana_ebpf.mod

The instana_ebpf.mod file must be repackaged as a loadable module:

$ semodule_package -o instana_ebpf.pp -m instana_ebpf.mod

And finally we can apply the policy package:

$ semodule -i instana_ebpf.pp

Any unconfined process, such as the host agent, can now perform those bpf operations.