Last December, IBM open-sourced a project called the Spark Kernel, an application focused on interactive usage of Apache Spark. This project addresses a problem we encountered when trying to migrate a Storm-based application to Apache Spark: “How do we enable interactive applications against Apache Spark?”
Recently, I gave a talk at the Spark Meetup in San Francisco about the Spark Kernel and how we are using it at IBM. Today, I want to elaborate on the problem we were trying to solve and how we solved it using the Spark Kernel.
When looking for a method to migrate our existing application, we discovered that there were several options to communicate with a Spark cluster, but none of them provided the flexibility we needed combined with a usable API.
Spark Submit was the main advocated approach for Spark job submission; however, forking a process to execute the shell script proved to be both cumbersome and slow. Results from job execution needed to be written to an external datastore, which would then be accessed directly by our application.
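To make that friction concrete, here is a minimal sketch of what forking Spark Submit from an application can look like. The jar name, main class, master URL, and input/output paths are hypothetical placeholders; only the `--class` and `--master` flags and the trailing jar-plus-arguments layout follow the standard `spark-submit` command line.

```python
import subprocess


def build_spark_submit_command(jar_path, main_class, master_url, app_args=()):
    """Assemble the argument list for a single spark-submit invocation."""
    return [
        "spark-submit",
        "--class", main_class,
        "--master", master_url,
        jar_path,
        *app_args,
    ]


def run_job(command):
    """Fork a new process for the job and block until it exits.

    Only the exit code comes back to the caller; the job's results have to
    be read from wherever the job wrote them (for example, a datastore).
    """
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode


# All values below are hypothetical and only illustrate the shape of the call.
command = build_spark_submit_command(
    jar_path="analytics.jar",
    main_class="com.example.WordCount",
    master_url="spark://master:7077",
    app_args=["hdfs:///input", "hdfs:///output"],
)
print(" ".join(command))
```

Every job pays for a new process and a fresh driver, and `run_job` hands back nothing but an exit code, which is why the results had to travel through an external datastore.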
JDBC provided more direct access to a cluster compared to Spark Submit, but it was limited to Spark SQL and did not have an easy way to use other Spark components such as Spark Streaming.
Spark REST Server had an advantage over Spark Submit in that it returned the results of a Spark computation as JSON; however, it only supported submissions through jars and was lagging behind Apache Spark in terms of version support.
Spark Shell offered the most flexibility through its support of executing code snippets, thereby allowing us to accurately and dynamically control the tasks submitted to a Spark cluster. Unfortunately, the shell was not a consumable service that we could use with our applications.
Since none of the available options for communicating with Apache Spark suited our needs for an interactive application, we decided to build our own tool: the Spark Kernel. The kernel serves as the middleman between our application and a Spark cluster.
Because our application was focused on interactivity, the tool needed to be able to serve content in a very chatty manner. This meant that potential forms of communication like a RESTful implementation were not suitable. Furthermore, we wanted to avoid implementing our own protocol, which would have been a requirement had we decided to use WebSockets. Finally, we were striving for the flexibility and control provided by the Spark Shell, a Scala REPL that connects to an Apache Spark cluster, meaning that we needed programmatic access to Spark at the level of line-by-line code snippets.
Figure 1: Spark Kernel Overview
With these restrictions in mind, we turned our attention to the IPython message protocol, used as the backbone of interactive notebooks. Having used IPython for experiments in the past, we knew that the project had recently updated its protocol to be language agnostic due to the growing popularity of alternative languages like Julia and Haskell. The result of our exploration was the Spark Kernel, a remote IPython kernel written in Scala that interfaces with Apache Spark. The kernel enables applications to send code snippets that are evaluated against a Spark cluster. Figure 1 gives a high-level overview of the Spark Kernel, illustrating its position between applications and a Spark cluster.
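As a rough illustration of what "sending a code snippet" means at the protocol level, the sketch below builds and signs an `execute_request` message following the general layout of the IPython message protocol (header, parent header, metadata, content, with an HMAC-SHA256 signature). The protocol version string, the shared key, the username, and the Scala snippet (including the assumption that the kernel exposes a `sc` SparkContext variable) are illustrative and not taken from the Spark Kernel source. Delivering the frames to the kernel over a ZeroMQ socket is deliberately left out.

```python
import hashlib
import hmac
import json
import uuid
from datetime import datetime, timezone

# Frame that separates routing identities from the message body on the wire.
DELIMITER = b"<IDS|MSG>"


def build_execute_request(code, session_id, username="app"):
    """Build the four dictionaries that make up an execute_request message."""
    header = {
        "msg_id": uuid.uuid4().hex,
        "username": username,
        "session": session_id,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.0",  # assumed protocol version
    }
    content = {
        "code": code,  # the snippet the kernel should evaluate
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
    }
    return {
        "header": header,
        "parent_header": {},
        "metadata": {},
        "content": content,
    }


def serialize(message, key):
    """Serialize a message into wire frames, signed with HMAC-SHA256."""
    parts = [
        json.dumps(message[name]).encode("utf-8")
        for name in ("header", "parent_header", "metadata", "content")
    ]
    signer = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        signer.update(part)
    signature = signer.hexdigest().encode("ascii")
    return [DELIMITER, signature, *parts]


# Hypothetical Scala snippet; assumes the kernel binds a SparkContext to `sc`.
snippet = "sc.parallelize(1 to 100).reduce(_ + _)"
request = build_execute_request(snippet, session_id=uuid.uuid4().hex)
frames = serialize(request, key=b"hypothetical-shared-key")
print(request["header"]["msg_type"], len(frames))
```

Because the message only carries source text, the application needs no jar and no Spark dependency of its own; the kernel replies with messages that reference the request's `msg_id` in their parent header, which is how results find their way back to the caller.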
What features does the Spark Kernel offer?
Define and execute raw Scala source code
Execute Spark tasks initiated through code snippets or jars
Collect results directly from a Spark cluster
Communicate using dynamically-defined messages as an alternative to jar and source code execution
What are the benefits of using the Spark Kernel over other options?
Avoids the friction of repackaging and shipping jars such as with Spark Submit and current RESTful services
Removes the requirement to store results into an external datastore
Acts as a proxy between applications and a Spark cluster, removing the requirement that applications have access to the master and all worker nodes
Enables Spark clusters behind firewalls to expose only the ports of the kernel, allowing applications to communicate with clusters through the kernel
Using the kernel as the backbone of communication, we have enabled several higher-level applications to interact with Apache Spark:
Livesheets, a line-of-business tool for data exploration
A RESTful query engine running on top of Spark SQL
A demonstration of a PHP application utilizing Apache Spark at ZendCon 2014
IPython notebook running the Spark Kernel underneath
The following posts in this Spark Kernel series will cover the Spark Kernel’s architecture in detail and provide an overview of the available client library that allows Scala applications to more easily interface with the kernel.
You can find the Spark Kernel project on GitHub here: ibm-et/spark-kernel