How to enable interactive applications against Apache Spark™

Last December, IBM open sourced the Spark Kernel, an application focused on interactive usage of Apache Spark. The project addresses a problem we encountered while migrating a Storm-based application to Apache Spark: "How do we enable interactive applications against Apache Spark?"

Recently, I gave a talk at the Spark Meetup in San Francisco about the Spark Kernel and how we are using it at IBM. Today, I want to elaborate on the problem we were trying to solve and how we solved it with the Spark Kernel.


When looking for a method to migrate our existing application, we discovered that there were several options to communicate with a Spark cluster, but none of them provided the flexibility we needed combined with a usable API.

  • Spark Submit was the primary advocated approach for submitting Spark jobs; however, forking a process to execute the shell script proved both cumbersome and slow. Results from job execution had to be written to an external datastore, which our application then accessed directly.

  • JDBC provided more direct access to a cluster compared to Spark Submit, but it was limited to Spark SQL and did not have an easy way to use other Spark components such as Spark Streaming.

  • Spark REST Server had an advantage over Spark Submit in that it returned the results of a Spark computation as JSON; however, it only supported submissions through jars and was lagging behind Apache Spark in terms of version support.

  • Spark Shell offered the most flexibility through its support of executing code snippets, thereby allowing us to accurately and dynamically control the tasks submitted to a Spark cluster. Unfortunately, the shell was not a consumable service that we could use with our applications.
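To make the first drawback concrete, here is a rough Python sketch of what forking Spark Submit from an application looks like; the jar name, main class, and master URL are invented for the example:

```python
import subprocess

SPARK_SUBMIT = "/opt/spark/bin/spark-submit"

def build_submit_command(jar, main_class, master, args):
    """Build the spark-submit invocation we would have to fork
    for every single job -- the slow, cumbersome path."""
    return [SPARK_SUBMIT,
            "--class", main_class,
            "--master", master,
            jar] + list(args)

cmd = build_submit_command(
    "analytics.jar", "com.example.WordCount",
    "spark://master:7077", ["hdfs:///input/logs"])

# Forking the process (commented out -- requires a Spark installation):
# subprocess.run(cmd, check=True)
# The job would then write its results to an external datastore,
# which the application reads back -- the indirection we wanted to avoid.
```

Every interaction pays the cost of a new JVM launch plus a round trip through the datastore, which is what makes this path a poor fit for interactive use.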


Since none of the available options to communicate with Apache Spark suited our needs for an interactive application, we decided to roll our own tool: the Spark Kernel. The kernel serves as the middleman between our applications and a Spark cluster.

Because our application was focused on interactivity, the tool needed to serve content in a very chatty manner, which ruled out request-oriented options such as a RESTful implementation. Furthermore, we wanted to avoid implementing our own protocol, which would have been necessary had we used raw WebSockets. Finally, we were striving for the flexibility and control provided by the Spark Shell, a Scala REPL that connects to an Apache Spark cluster, meaning we needed programmatic access to Spark at the level of line-by-line code snippets.

Figure 1: Spark Kernel overview

With these constraints in mind, we turned our attention to the IPython message protocol, which serves as the backbone of interactive notebooks. Having used IPython for experiments in the past, we knew the project had recently made its protocol language agnostic, driven by the growing popularity of alternative languages like Julia and Haskell. The result of our exploration was the Spark Kernel: a remote IPython kernel, written in Scala, that interfaces with Apache Spark. The kernel enables applications to send code snippets that are evaluated against a Spark cluster. Figure 1 gives a high-level overview of the Spark Kernel, illustrating its position between applications and a Spark cluster.
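To give a feel for the protocol, the sketch below constructs and signs the kind of `execute_request` message a client sends to a kernel. The field names follow the IPython wire protocol; the session key and the Scala snippet in the `code` field are made-up examples:

```python
import hashlib
import hmac
import json
import uuid

def execute_request(code, session, key):
    """Build the signed multipart frames of an IPython execute_request."""
    header = {
        "msg_id": str(uuid.uuid4()),
        "session": session,
        "username": "app",
        "msg_type": "execute_request",
        "version": "5.0",
    }
    parent_header, metadata = {}, {}
    content = {"code": code, "silent": False, "store_history": True,
               "user_expressions": {}, "allow_stdin": False}
    frames = [json.dumps(part).encode() for part in
              (header, parent_header, metadata, content)]
    # Each message carries an HMAC signature over its frames so the
    # kernel can authenticate the client that sent it.
    signature = hmac.new(key, b"".join(frames), hashlib.sha256).hexdigest()
    return [signature.encode()] + frames

# A Scala snippet travels as plain text in the "code" field:
msg = execute_request("sc.parallelize(1 to 100).sum()",
                      session=str(uuid.uuid4()),
                      key=b"example-secret-key")
```

In practice these frames are sent over ZeroMQ sockets rather than assembled by hand, but the chatty, message-per-snippet shape is exactly what made the protocol a good fit for us.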

What features does the Spark Kernel offer?

  • Define and execute raw Scala source code

  • Execute Spark tasks initiated through code snippets or jars

  • Collect results directly from a Spark cluster

  • Communicate using dynamically-defined messages as an alternative to jar and source code execution

What are the benefits of using the Spark Kernel over other options?

  • Avoids the friction of repackaging and shipping jars required by Spark Submit and current RESTful services

  • Removes the requirement to store results into an external datastore

  • Acts as a proxy between applications and a Spark cluster, removing the requirement that applications must have access to the master and all worker nodes

  • Enables Spark clusters behind firewalls to expose only the ports of the kernel, allowing applications to communicate with clusters through the kernel
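As a sketch of the second benefit, direct result collection: an application reads the result of a snippet straight out of the kernel's reply message instead of polling an external datastore. The raw reply payload below is invented for illustration, but its shape follows the IPython `execute_result` message:

```python
import json

# Hypothetical reply as the kernel would publish it after
# evaluating "sc.parallelize(1 to 100).sum()":
raw_reply = json.dumps({
    "header": {"msg_type": "execute_result"},
    "content": {"data": {"text/plain": "res0: Double = 5050.0"},
                "execution_count": 1},
})

def collect_result(raw_message):
    """Extract the textual result of a snippet evaluation, or None
    if the message is not an execute_result."""
    message = json.loads(raw_message)
    if message["header"]["msg_type"] == "execute_result":
        return message["content"]["data"]["text/plain"]
    return None

result = collect_result(raw_reply)  # "res0: Double = 5050.0"
```

Because results flow back over the same connection that carried the code snippet, the application never needs datastore credentials or direct access to the cluster's nodes.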


Using the kernel as the backbone of communication, we have enabled several higher-level applications to interact with Apache Spark.

Coming up

The following posts in this Spark Kernel series will cover the Spark Kernel’s architecture in detail and provide an overview of the available client library that allows Scala applications to more easily interface with the kernel.

You can find the Spark Kernel project on GitHub: ibm-et/spark-kernel

