Secure Resilient Distributed Datasets in Apache Spark™ (S-RDDs)

Syed Yousaf Shah and Petros Zerfos

Apache Spark enables fast computations and greatly accelerates analytics applications by keeping large amounts of data in main memory, in data structures known as RDDs (Resilient Distributed Datasets). However, the data represented in the RDD data structure remains un-encrypted and is prone to attacks such as RAM scrapper malware, making it unsuitable for use with sensitive information that should be secured at all times. Even worse, RDDs may also be persisted (unencrypted) in disk storage under various circumstances, such as when there is lack of available RAM for further caching, RDD check pointing by the Spark framework or the application itself, and data spill during the data shuffling operation, etc. Spark provides a serialization mechanism for RDD objects when persisted and during data shuffling, but the serialized data is still easy to read/retrieve. If data to be analyzed using Spark is sensitive, then the current Spark framework may allow leakage of sensitive data.

We investigate various security approaches for Spark, through which RDDs are encrypted and cryptographically sliced using Perfect Secret Sharing[i]/Information Dispersal Algorithm[ii] (PSS/IDA) such that at least M-out-of-N slices of a RDD are needed for reconstruction and decryption. Whenever these RDDs are needed for computation, then and only then are reconstructed from slices and decrypted. Therefore, we reduce the window of exposure of unencrypted data for the duration of data residing in main memory, which might be long-lived. Moreover, whenever these RDDs are flushed to persistent storage (either by the user or by the Spark framework), they will be stored in a secure form and only be decrypted/reconstructed on-demand, at the time when it is requested for the analytics computation. Last, by performing cryptographic splitting of RDDs (IDA & Perfect Secret Sharing), we allow for high-availability of persisted RDD objects.

We explore three different approaches to implement the above security mechanism in Spark.

  1. Security through Secure Object Serialization: In this approach, we enforce the security mechanism through a secure reserialization mechanism. Spark uses serialization of RDDs to reduce their memory footprint as well as to checkpoint data to the disk. We achieve RDD-level security by implementing a secure (Java) object serializer, which transparently encrypts every object (RDD) and cryptographically splits it whenever serialization is requested by Spark. The reverse deserialization mechanism is likewise transparent and it automatically builds the RDDs and decrypts them for use in computation. This ensures that all RDDs in Spark are secured in memory, as well as when they are persisted on disk either accidently or on-purpose through the check pointing mechanism of Spark. Moreover, the data remains secured during any internal operations performed by Spark that involves object serialization, e.g., shuffling data among worker nodes. We are also exploring policy enforcement on the use of secure serialization such that secure serializer is used under user-specified conditions to lower the overhead of frequently invoking the secure serializer.

The major benefit of this approach is that the security mechanism is transparent to the applications and to the Spark framework as well, in the sense that it can be enabled/disabled through a standard Spark configuration parameter (spark.serializer). This approach does not require recompilation of the Spark Framework. We implemented the secure object serializer by extending the Kryo serialization framework. We use Kryo for its speed and efficiency; however, our approach is not limited to any specific type of object serialization framework and can be implemented with other Java serializers.

SecureSerializerSecure Serializer for Spark

#### How to use Secure Serializer

The Spark serialization can be changed to SecureSerializer via setting the property spark.serializer to com.ibm.watson.spark.KryoSecureSerializer in conf/spark-defaults.conf as

spark.serializer       com.ibm.watson.spark.KryoSecureSerializer

Before running Spark Shell, the serializer can be changed to secure serializer as follows,

 export SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSecureSerializer"

In scala code, the serializer can be changed to secure serializer by change property in configuration as,

val conf = new SparkConf().setAppName("Secure Serializer Test").set("spark.serializer", "com.ibm.watson.spark.KryoSecureSerializer")
  1. Security through Customized SecureRDD: In this approach, we implement a custom RDD that we call SecureRDD, SecureRDD extends the RDD class and implements the PSS/IDA based security mechanism when the data is stored in the SecureRDD. The SecureRDD encrypts and cryptographically splits the data that is stored in it. When computation is performed on the SecureRDD the data is decrypted and reconstructed at run time in the “compute” method. Since, RDDs are immutable, the encryption and data splitting happens only once at the RDD creation. However, the reconstruction and decryption of data occurs each time when an operation is performed on the RDD. The SecureRDD is provided as a library to the user that she can use as needed in her application. The advantage of this approach is that it gives the user control over the amount of data that needs to be secured. Hence, this approach is faster and more efficient. Like approach #1 above, this is also transparent to the Spark Framework and does not require its recompilation. However, it is not transparent to user applications since they need to be explicitly invoke the use SecureRDD./p<>

SecureRDDSecureRDD for Spark

#### How to use Secure RDD

In Scala, an RDD containing data from a text file can created as,

val plainRDD = sc.textFile(“/tmp/data.txt”)

The secured version of such an RDD can be created as follows,

val secureRDD = SecureRDD.getSecureRDDFromFile(“/tmp/data.txt”), sc )

The ‘sc’ in above example is the sparkContext.

  1. Security through Natively Securing RDDs: In this approach, the native creation process of the RDDs is modified such that data stored in the RDDs is always secure. In this approach, whenever a RDD is created, the instantiation process encrypts and splits every partition of the RDD and is only decrypted when computation is performed on the RDD. This ensures that all the RDDs created are secured and information is not revealed in cases where the RDDs are persisted onto disk. This approach is transparent to the user applications but needs modifications and recompilation to the Spark framework itself. The advantage of this approach is that the user applications do not need to change and data always remains secured unless explicitly leaked by the application. The downside of this approach is that every single RDD is encrypted thus the performance drawback might be high, therefore, this approach needs to be accompanied with a policy enforcement mechanism which will control the level of security, e.g., secure only those RDDs flagged by user or secure all user/application specific RDDs.

How to use Natively Secure RDDs

This approach is transparent to applications so RDDs can be created without any changes to the RDD creation code in the application like,

val textFileRDD = sc.textFile("MyData.txt")
//optionally call secure method if by default all RDDs are not secured.

We are in the process of implementing and evaluating the performance of these three approaches in Spark, with the goal of providing usage guidelines based on the various application scenarios.

[i] Shamir, Adi. “How to share a secret.” Communications of the ACM 22, no. 11 (1979): 612-613.
[ii] Rabin, Michael O. “Efficient dispersal of information for security, load balancing, and fault tolerance.” Journal of the ACM (JACM) 36, no. 2 (1989): 335-348.


You Might Also Enjoy

Gidon Gershinsky
Gidon Gershinsky
18 days ago

How Alluxio is Accelerating Apache Spark Workloads

Alluxio is fast virtual storage for Big Data. Formerly known as Tachyon, it’s an open-source memory-centric virtual distributed storage system (yes, all that!), offering data access at memory speed and persistence to a reliable storage. This technology accelerates analytic workloads in certain scenarios, but doesn’t offer any performance benefits in other scenarios. The purpose of this blog is to... Read More

James Spyker
James Spyker
3 months ago

Streaming Transformations as Alternatives to ETL

The strategy of extracting, transforming and then loading data (ETL) to create a version of your data optimized for analytics has been around since the 1970s and its challenges are well understood. The time it takes to run an ETL job is dependent on the total data volume so that the time and resource costs rise as an enterprise’s data volume grows. The requirement for analytics databases to be mo... Read More