Monitoring Apache Spark
Currently supported Spark versions are from 1.4.x to 2.4.x.
The two main components of a Spark application are the driver process and the executor processes. Executor processes contain only data relevant to task execution. The driver is the main process and is responsible for coordinating the execution of a Spark application. It therefore holds all data about the performance and execution of the Spark application, including data about each of its executors.
Instana collects all Spark application data (including executor data) from the driver JVM. To monitor Spark applications, the Instana agent must be installed on the host on which the Spark driver JVM is running.
Note that there are two ways of submitting Spark applications to the cluster manager. Depending on how the --deploy-mode option is set, the location where the driver JVM runs changes.
- Deploy mode cluster:
When submitting with the option --deploy-mode cluster, e.g.
./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn --deploy-mode cluster /path/to/app.jar
the Spark driver JVM runs on one of the worker nodes of your cluster manager. If the Instana agent is installed on the worker nodes, the Spark application (driver) is discovered automatically.
- Deploy mode client:
When submitting with the option --deploy-mode client, or without the option --deploy-mode (the default value is client), e.g.
./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn --deploy-mode client /path/to/app.jar
or
./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn /path/to/app.jar
the Spark driver JVM runs on the host on which this command is executed. For Instana to be able to monitor this Spark application, the Instana agent must be installed on the host where spark-submit is executed.
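The two deploy modes can be told apart by inspecting the arguments passed to spark-submit. The following is a minimal sketch, not part of Instana or Spark: driver_location is a hypothetical helper name, and it only looks at the space-separated --deploy-mode flag. It ignores other ways of setting the mode, such as the spark.submit.deployMode configuration property, so treat it as an illustration of the rule rather than a complete parser.

```shell
#!/usr/bin/env bash
# Sketch: report where the Spark driver JVM will run for a given set of
# spark-submit arguments. driver_location is a hypothetical helper.
driver_location() {
  local mode="client"   # spark-submit uses client mode when the flag is absent
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --deploy-mode)
        # The mode is passed as the next argument, if there is one
        if [ "$#" -ge 2 ]; then
          mode="$2"
          shift
        fi
        ;;
    esac
    shift
  done

  case "$mode" in
    cluster) echo "worker-node" ;;   # agent is needed on the worker nodes
    client)  echo "submit-host" ;;   # agent is needed where spark-submit runs
    *)       echo "unknown deploy mode: $mode" >&2; return 1 ;;
  esac
}

# Usage: pass the same arguments you would give to spark-submit
driver_location --class org.apache.spark.examples.JavaWordCount --master yarn --deploy-mode cluster /path/to/app.jar
driver_location --class org.apache.spark.examples.JavaWordCount --master yarn /path/to/app.jar
```

A result of worker-node means the agent has to be present on the cluster's worker nodes, while submit-host means it has to be present on the machine that runs spark-submit.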
Depending on the type of Spark application that Instana monitors, different data is collected:
- Longest completed stages
- Scheduling delay
- Total delay
- Processing time
- Output operations
- Input records
Instana detects and monitors Spark applications through the Spark driver. Therefore, to get visibility into your Spark applications, install the agent on the EC2 instances in your EMR cluster. When deploying Spark apps from the master node with the deploy mode client, it is sufficient to install the agent only on the master node of the EMR cluster.
If you don't want to copy the Spark app jar to the master node, and you want to deploy your Spark app in cluster mode from somewhere else, e.g. from an S3 bucket, you must install the agent on all the nodes in the EMR cluster, because the driver is scheduled on a worker node.
The best method to do this is to create the EMR cluster and, in the advanced configuration, select a custom AMI image that already has the Instana agent installed. For more information on how to start the EMR cluster with the custom AMI, see the AWS documentation. To build the AMI image with the Instana agent installed, see the AWS documentation, and when prompted to SSH into the EC2 instance to install the software, use the one-liner located in your Instana Management Portal as described here. This way you gain insights into all of your EMR cluster nodes, you can monitor Spark applications regardless of the deployment mode, and you gain insights into all the underlying components of EMR, such as Hadoop YARN. If you want to measure only the Hadoop YARN metrics, refer to the documentation.
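For those who create clusters from the command line rather than the console, the same idea can be expressed with the AWS CLI, which accepts a --custom-ami-id option on aws emr create-cluster. The sketch below only prints the command so it can be reviewed before running. The helper name emr_create_cmd, the cluster name, the release label, the instance type, and the AMI ID are all placeholder assumptions, not values from Instana or AWS documentation, and custom AMIs are only available on sufficiently recent EMR releases.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) an "aws emr create-cluster" command that boots
# every node from a custom AMI with the Instana agent preinstalled.
# emr_create_cmd and all default values are hypothetical placeholders.
emr_create_cmd() {
  local ami_id="$1"                  # custom AMI ID that contains the agent
  local release="${2:-emr-5.30.0}"   # placeholder EMR release label
  local count="${3:-3}"              # one master plus the remaining core nodes

  if [ -z "$ami_id" ]; then
    echo "usage: emr_create_cmd <ami-id> [release-label] [instance-count]" >&2
    return 1
  fi

  printf 'aws emr create-cluster --name spark-instana --release-label %s --applications Name=Spark --custom-ami-id %s --instance-type m5.xlarge --instance-count %s --use-default-roles\n' \
    "$release" "$ami_id" "$count"
}

# Usage: review the printed command, then run it yourself if it looks right
emr_create_cmd ami-0123456789abcdef0
```

Because every node is started from the same image, the agent is present wherever the driver ends up, which is what makes this approach independent of the deploy mode.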
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. Spark standalone is a cluster manager made up of master and worker nodes. Instana monitors the whole Spark standalone cluster through the master node of the cluster. It collects cluster-wide data as well as data for each worker node.
- REST URI
- Alive Workers
- Dead Workers
- Decommissioned Workers
- Workers In Unknown State
- Used Memory
- Total Memory
- Used Cores
- Total Cores
- Data and metrics per worker
- Most recent apps
- Most recent drivers