Performance

SparkOscope: Enabling Apache Spark Optimization through Cross-Stack Monitoring and Visualization

During the last year we have been using Apache Spark to perform analytics on large volumes of sensor data. The Spark applications we have developed revolve around cleansing, filtering, and ingesting historical data. These applications need to be executed on a daily basis, so it was essential for us to understand Spark resource utilization and thereby provide customers with optimal time-to-insight and cost control through better-calculated infrastructure needs.

To date, the conventional way of identifying bottlenecks in a Spark application has been to inspect the Spark Web UI (either during job/stage execution or postmortem), which is limited to the job-level application metrics reported by Spark's built-in metric system (e.g. stage completion time). The current version of the Spark metric system supports recording metric values in local CSV files as well as integration with external metrics systems like Ganglia.
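For reference, enabling the built-in CSV sink takes only a few lines in Spark's `metrics.properties`; the polling period and output directory below are illustrative values, not our production settings:

```properties
# Enable the CSV sink for all metric instances (master, worker, driver, executor)
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# How often metric values are written out
*.sink.csv.period=10
*.sink.csv.unit=seconds
# Directory on each node where the CSV files land (example path)
*.sink.csv.directory=/tmp/spark-metrics
```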

We found it cumbersome to manually consume and efficiently inspect these CSV files generated at the Spark worker nodes. Although using an external monitoring system like Ganglia would automate this process, we were still plagued with the inability to derive temporal associations between system-level metrics (e.g. CPU utilization) and job-level metrics (e.g. job or stage ID) as reported by Spark. For instance, we were not able to trace back the root cause of a peak in HDFS Reads or CPU usage to the code in our Spark application causing the bottleneck.
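The association we were missing can be illustrated with a small sketch: given a job timeline (job IDs with start/end timestamps) and timestamped system-level samples, each sample can be attributed to the job running at that instant. This is not SparkOscope code, just a minimal, self-contained illustration of the timestamp join such a tool automates:

```python
def attribute_samples(jobs, samples):
    """Attach each (timestamp, value) system-metric sample to the job
    whose [start, end] interval contains it; None if no job was running.

    jobs:    list of (job_id, start_ts, end_ts)
    samples: list of (ts, value)
    """
    out = []
    for ts, value in samples:
        job_id = next(
            (jid for jid, start, end in jobs if start <= ts <= end), None
        )
        out.append((ts, value, job_id))
    return out

# Toy timeline: job 0 runs over t=0..9, job 1 over t=10..19.
jobs = [(0, 0, 9), (1, 10, 19)]
samples = [(5, 0.42), (12, 0.97), (25, 0.10)]
print(attribute_samples(jobs, samples))
# → [(5, 0.42, 0), (12, 0.97, 1), (25, 0.1, None)]
```

The same join, done per executor and per metric, is what lets a CPU spike be traced back to a specific stage or job.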

To overcome these limitations we developed SparkOscope. Taking advantage of the job-level information available through the existing Spark Web UI, and to minimize source-code pollution, we use that Web UI to monitor and visualize job-level metrics of a Spark application (e.g. completion time). More importantly, we extend the Web UI with a palette of system-level metrics for the server/VM/container that each of the Spark job's executors ran on. Using SparkOscope, the user can navigate to any completed application and identify application-logic bottlenecks by inspecting the various plots, which provide in-depth time series for all relevant system-level metrics of the Spark executors, while also easily associating them with the stages, jobs, and even source code lines incurring the bottleneck. On a related note, system-level metrics are essential for us when it comes to careful infrastructure planning for the set of Spark applications we know our customers will be running on a daily basis.

In the backend, SparkOscope leverages the open-source Hyperic Sigar library to capture OS-level metrics at the nodes where executors are launched. In the screenshot below, one can see that the user can select which metric to visualize, among the new family of OS-level metrics, for all executors launched by the Spark job under scrutiny.
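Conceptually, the Sigar-based collector boils down to polling an OS-metric source at a fixed period on each executor node and recording timestamped values. The sketch below shows that polling loop with a pluggable `read_metric` callable standing in for the actual Sigar calls (the callable and its values are assumptions for illustration):

```python
import time

def sample_metric(read_metric, period_s, n_samples):
    """Poll read_metric() every period_s seconds, n_samples times,
    returning a list of (timestamp, value) pairs."""
    samples = []
    for _ in range(n_samples):
        samples.append((time.time(), read_metric()))
        time.sleep(period_s)
    return samples

# Stand-in for a Sigar call such as reading CPU utilization.
fake_cpu = iter([0.25, 0.50, 0.75])
samples = sample_metric(lambda: next(fake_cpu), period_s=0.01, n_samples=3)
print([v for _, v in samples])  # → [0.25, 0.5, 0.75]
```

In the real collector, the sampled values are then keyed by hostname and timestamp so the Web UI can line them up with executor and job activity.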

[Screenshot: selecting an OS-level metric to visualize across all executors of a job]

We used our tool to evaluate the RAM requirements of one of our most frequent workloads. Initially, we assigned each Spark executor 80GB of RAM and inspected the RAM Utilization and KBytes Written (to disk) metrics.

[Plots: RAM Utilization and KBytes Written per executor]

It became clear to us that Spark uses all of the RAM given to its executors, and once no more is available it resorts to heavy disk writes of intermediate data. However, the plots showed that the nodes had more RAM available that could be assigned to the executors, reducing the need for heavy disk writes and ultimately yielding faster completion times. With our tool, the user can also plot additional metrics of interest, like the volume of HDFS Read bytes.
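A back-of-the-envelope sizing check makes this concrete. On YARN, each executor container needs its heap plus an off-heap memory overhead (by default the larger of 384MB and roughly 10% of the heap), so the heap you can afford per executor is bounded by the node RAM left after the OS and other daemons. The node sizes and reservation below are illustrative assumptions, not our actual cluster figures:

```python
def max_executor_heap_gb(node_ram_gb, reserved_gb, executors_per_node,
                         overhead_frac=0.10, min_overhead_gb=0.384):
    """Largest per-executor heap (GB) that fits on a node, accounting for
    the per-executor overhead max(min_overhead_gb, overhead_frac * heap)."""
    budget = (node_ram_gb - reserved_gb) / executors_per_node
    # If the fractional overhead dominates: heap * (1 + overhead_frac) <= budget
    heap = budget / (1 + overhead_frac)
    if overhead_frac * heap < min_overhead_gb:
        # Small heaps: the flat minimum overhead applies instead
        heap = budget - min_overhead_gb
    return heap

# e.g. a 128 GB node, 8 GB reserved for the OS, one executor per node:
print(round(max_executor_heap_gb(128, 8, 1), 1))  # → 109.1
```

A quick check like this, driven by the actual utilization seen in the plots, is how we decide whether raising `spark.executor.memory` is safe on a given node.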

[Plot: HDFS Read bytes per executor]

The SparkOscope project is open source. Take a look.

This contribution is part of an ongoing effort by the High Performance Systems team at IBM Research Ireland to better understand Spark's performance bottlenecks by visualizing key metrics, putting us in a better position to extract useful insights that will drive our future work. We are exploring various possible avenues to make this available to the Spark user community.
