What is EMR?
According to Amazon, “Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).”
Customers across several verticals use Amazon EMR to handle big data use cases, including machine learning, data transformations, financial and scientific simulation, bioinformatics, log analysis, and deep learning. EMR lets customers run these workloads on single-purpose, short-lived clusters that scale to meet demand.
Monitoring Amazon EMR
Amazon EMR provides several tools for gathering information about a cluster, including the console and the CLI. The Hadoop web interfaces and log files are also available on the master node. Additionally, Amazon CloudWatch can be used.
With CloudWatch, metrics are gathered every five minutes for each EMR cluster. These metrics allow users to track the progress of a cluster, detect idle clusters, and detect when a node runs out of storage. They also include counters such as AppsCompleted, AppsFailed, and AppsRunning. While this helps show how the environment is trending over time, it leaves a lot to be desired for customers who run production applications and need real-time visibility.
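To make the five-minute CloudWatch metrics concrete, here is a minimal sketch of querying one EMR metric with boto3. It assumes boto3 is installed and AWS credentials are configured. The helper names build_metric_query and fetch_metric are illustrative, not part of any AWS or Instana API.

```python
from datetime import datetime, timedelta, timezone

EMR_NAMESPACE = "AWS/ElasticMapReduce"  # CloudWatch namespace for EMR metrics
PERIOD_SECONDS = 300  # matches the five-minute update interval


def build_metric_query(cluster_id, metric_name, hours=1, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": EMR_NAMESPACE,
        "MetricName": metric_name,
        # EMR metrics are keyed by the cluster (job flow) ID.
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": PERIOD_SECONDS,
        "Statistics": ["Average"],
    }


def fetch_metric(cluster_id, metric_name, hours=1):
    """Fetch datapoints sorted by time; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the query builder works without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **build_metric_query(cluster_id, metric_name, hours)
    )
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
```

Calling fetch_metric("j-EXAMPLE12345", "AppsRunning"), where the cluster ID is a placeholder, would return at most one datapoint per five-minute period. That granularity is the limitation described above.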
How to Monitor Amazon EMR with Instana
Monitoring Amazon EMR effectively requires visibility at the cluster, node, and Hadoop application layers. Instana provides an efficient way to discover and monitor Amazon EMR clusters, nodes, and Hadoop applications. To begin, install the Instana agent on AWS; the agent then automatically discovers all EMR components running in the environment. Once the components are discovered, the agent deploys the appropriate monitoring sensors and begins tracing and analyzing every request. Using a combination of machine learning and preset health rules, Instana automatically determines the health of the Hadoop applications and EMR components.
Metrics – The Instana agent automatically identifies EMR running in AWS and, with no manual effort required, deploys and configures Instana’s EMR monitoring sensor. Instana references its curated knowledge base to determine which performance metrics are relevant to collect and which parameters must be configured. Users can also set the granularity of the metrics being pulled. Specifically, Instana’s automatic configuration for EMR tracks Cluster Details (Id, Name, Creation time, version, etc.), Cluster metrics (Apps Running, Apps Pending, Memory Allocated, Memory Available, etc.), and Node metrics (Active Nodes, Lost Nodes, Unhealthy Nodes, etc.).
Health – In addition to collecting performance metrics, the Instana EMR monitoring sensor automatically collects KPIs on the jobs in the monitored environment to determine its health. These health signatures are used to raise Issues or Incidents, depending on user impact.
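To illustrate how cluster and node metrics like those listed above could feed a health check, here is a minimal sketch. The sample payload follows the shape of the YARN ResourceManager response from GET /ws/v1/cluster/metrics, with made-up values. The evaluate_health function and its thresholds are hypothetical and are not Instana's actual health rules, which the source describes only as a combination of machine learning and preset rules.

```python
import json

# Sample payload shaped like the YARN ResourceManager cluster metrics
# response; the values are invented for illustration.
SAMPLE = """
{"clusterMetrics": {"appsRunning": 4, "appsPending": 12,
 "allocatedMB": 90112, "availableMB": 8192,
 "activeNodes": 8, "lostNodes": 1, "unhealthyNodes": 0}}
"""


def evaluate_health(metrics, max_pending_apps=10, min_free_memory_ratio=0.1):
    """Return a list of findings; an empty list means no rule fired.

    The thresholds are hypothetical defaults, not values from any product.
    """
    findings = []
    if metrics["lostNodes"] > 0:
        findings.append(f"{metrics['lostNodes']} lost node(s)")
    if metrics["unhealthyNodes"] > 0:
        findings.append(f"{metrics['unhealthyNodes']} unhealthy node(s)")
    if metrics["appsPending"] > max_pending_apps:
        findings.append(f"{metrics['appsPending']} apps pending")
    total_mb = metrics["allocatedMB"] + metrics["availableMB"]
    if total_mb > 0:
        free_ratio = metrics["availableMB"] / total_mb
        if free_ratio < min_free_memory_ratio:
            findings.append(f"only {free_ratio:.0%} of memory available")
    return findings


metrics = json.loads(SAMPLE)["clusterMetrics"]
findings = evaluate_health(metrics)
for finding in findings:
    print("WARNING:", finding)
```

Static thresholds like these are easy to reason about but need tuning per cluster, which is one reason to pair them with anomaly detection.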
With Instana, you’ll have a full, automatically performed analysis of every user impact that correlates the trace data with the underlying EMR metrics. This allows Instana to identify the root cause of an issue within a few seconds, so you can update your services as often as you need to without worrying about regressions that affect your customers.
Instana’s Amazon EMR monitoring includes automatic and continuous discovery, dependency mapping, metric monitoring, distributed tracing, anomaly detection, and analytics across the complete trace data set. This means you’ll always know what EMR is doing and how it affects user requests. To see Instana’s EMR monitoring in action, sign up for a free trial of Instana today.