Spark users need a way to interface with object stores – something fast, direct, and optimized for use with stored objects.
Now you have it.
Apache Spark can work with multiple data sources, including object stores like Amazon S3, OpenStack Swift (such as on IBM SoftLayer), Azure Blob Storage, and more. To access an object store, Spark uses Hadoop modules that contain drivers for the various object stores. In most flows, Spark sends a request with the data source URI to the Hadoop MapReduce Client Core and receives in response the file system driver capable of handling that URI. In some specific flows, Spark may bypass the Hadoop MapReduce Client and obtain the file system driver directly, via the Hadoop Path resolver.
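From the application side, this resolution is driven entirely by the URI scheme. As a sketch (the container and provider names below are hypothetical, and this assumes a `spark-shell` session where `sc` already exists):

```scala
// The swift:// scheme in the URI is what Hadoop uses to resolve
// the matching file system driver. Container/provider names are
// placeholders; substitute your own.
val rdd = sc.textFile("swift://my-container.my-provider/logs/data.txt")
println(rdd.count())
```

The application never names the driver explicitly; the scheme-to-driver mapping happens inside the Hadoop layer.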
Spark needs only a small set of object store functions: listing objects, creating objects, reading objects, and getting data partitions. Hadoop drivers, however, must be compliant with the whole Hadoop ecosystem. This means they support many more operations, such as shell operations on directories, including move, copy, and rename, which are not native object store operations. Moreover, the Hadoop MapReduce Client Core uses temporary files and folders for every write operation; those temporary files and folders are renamed, copied, and deleted. Targeted at an object store, this leads to dozens of useless requests. It’s clear that Hadoop was designed to work with file systems, not object stores.
Stocator – The New Object Store Cloud Connector
We took a new approach and developed Stocator, a unique object store connector for Apache Spark. The good news is that we just contributed Stocator to the open source community, under Apache License 2.0. It comes packaged with the OpenStack Swift driver, but contains a pluggable mechanism capable of supporting additional object store drivers.
Stocator is written in Java and implements an HDFS interface to simplify its integration with Spark. There is no need to modify Spark’s code to use the new connector: it can be compiled with Spark via a Maven build, or provided as an external package without recompiling Spark.
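For the second option, the connector JAR can be handed to an existing Spark installation at submit time (the JAR name, path, and application class below are illustrative, not fixed names from the project):

```shell
# Hypothetical JAR/class names: add the connector to the classpath
# at submit time, with no rebuild of Spark itself
spark-submit \
  --jars stocator-1.0.jar \
  --class com.example.MyApp \
  my-app.jar
```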
Stocator works smoothly with the Hadoop MapReduce Client Core module, but takes an object store approach. Apache Spark continues to work with the Hadoop MapReduce Client Core, and Spark can use the various existing Hadoop output committers. Stocator intercepts internal requests and adapts them to object store semantics before they reach the object store.
For example, when Stocator receives a request to create a temporary object or directory, it creates the final object instead. Stocator also uses streaming to upload objects. Streaming an object, without knowing its content length in advance, eliminates the need to create the object locally, calculate its length, and only then upload it to the object store. Among the object store solutions available on the market, Swift is unique in supporting this kind of streaming.
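The streaming idea can be sketched with plain HTTP (a simplified illustration, not Stocator’s actual code; the endpoint URL is hypothetical): chunked transfer encoding lets the client start sending the body before the total length is known, so no `Content-Length` has to be computed up front.

```scala
import java.net.{HttpURLConnection, URL}

// Simplified sketch, not Stocator's code: a chunked PUT streams an
// object to a (hypothetical) Swift endpoint without first buffering
// it locally to compute a Content-Length header.
val conn = new URL("https://swift.example.com/v1/account/container/object")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setDoOutput(true)
conn.setRequestMethod("PUT")
conn.setChunkedStreamingMode(0) // 0 = use the default chunk size
val out = conn.getOutputStream
for (i <- 1 to 1000) out.write(s"record $i\n".getBytes("UTF-8"))
out.close()
```

Without chunked streaming, the client would have to materialize the whole object (in memory or on disk) just to set the length header before the first byte goes out.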
Because Stocator is explicitly designed for object stores, it has a very different architecture from the existing Hadoop driver. It doesn’t depend on the heavier Hadoop modules, but rather interacts with object stores directly via the JOSS package. Below is a diagram of the flow.
// Create a small RDD and write it out to the object store
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val distData = sc.parallelize(data)
distData.saveAsTextFile("swift://container.provider/out") // example URI; use your own container
The following table summarizes the overall number of HTTP requests to Swift made by the existing Hadoop driver and by Stocator, along with the number of Spark tasks involved.
How can you use Stocator with an object store?
To use Stocator, you need access to an object store, which lets you store objects of any type (text data, binary data, Parquet objects, etc.). After that, it’s just a matter of using the power of Spark to analyze the content of those objects.
For example, OpenStack Swift can be obtained via the community OpenStack code (e.g., with Swift All in One) or via an object store service such as IBM SoftLayer. Detailed instructions on how to set up Spark with Swift can be found here.
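Setup typically means registering Stocator’s file system implementation in the Hadoop configuration. The snippet below follows the key pattern used by the open source Stocator project at the time of writing; the `swift2d` scheme, the service name `myservice`, and the auth URL are examples to adapt to your deployment (check the project documentation for your version):

```xml
<!-- Register Stocator's file system implementation for the swift2d scheme -->
<property>
  <name>fs.swift2d.impl</name>
  <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>
<!-- "myservice" is a placeholder service name; the auth URL is an example -->
<property>
  <name>fs.swift2d.service.myservice.auth.url</name>
  <value>https://identity.example.com/v3/auth/tokens</value>
</property>
```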
We believe strongly in the power of the Spark-object store integration and will continue to support the new driver in the open source community. Our future plans include exploring further improvements to the way Spark interacts with object stores, and keeping up the open source spirit.
It’s all open source. Don’t waste time – go out and try it!