Revving Up Performance for the Tachyon File System

Tachyon is an open source, memory-centric distributed storage system. Since its open source release in April 2013, it has grown into a fast-growing project with more than 150 contributors from about 50 organizations.

As a fault-tolerant storage system, Tachyon knows how to reliably preserve data. In contrast to existing solutions, Tachyon doesn’t use replication to achieve fault tolerance. Instead, it relies on recomputing lost data. While data replication remains a common approach today, its drawback is that writing replicas is generally limited by network or disk throughput. Tachyon eliminates the need for replication by using data lineage, a well-known technique, and pushes the tracking of data operations into the storage layer so that lost data can be rebuilt from the operations that produced it.
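To make the lineage idea concrete, here is a minimal toy sketch in Python. It is not Tachyon's implementation or API: the `LineageStore` class and its method names are hypothetical, invented only to illustrate how a dataset that is lost from memory can be rebuilt from recorded operations and checkpointed inputs rather than from a replica.

```python
class LineageStore:
    """Toy model (hypothetical, not Tachyon's API): durable checkpoints
    plus a lossy in-memory cache and a record of how data was derived."""

    def __init__(self):
        self.checkpoints = {}  # stands in for the underlying storage
        self.memory = {}       # in-memory copies; may vanish on failure
        self.lineage = {}      # dataset name -> (function, input names)

    def checkpoint(self, name, value):
        """Store a dataset durably."""
        self.checkpoints[name] = value

    def derive(self, name, fn, inputs):
        """Record how `name` is computed, then compute and cache it."""
        self.lineage[name] = (fn, tuple(inputs))
        self.memory[name] = fn(*(self.get(i) for i in inputs))

    def lose(self, name):
        """Simulate a failure that drops the in-memory copy."""
        self.memory.pop(name, None)

    def get(self, name):
        if name in self.memory:
            return self.memory[name]
        if name in self.checkpoints:
            return self.checkpoints[name]
        # Not in memory and never checkpointed: recompute from lineage.
        fn, inputs = self.lineage[name]
        value = fn(*(self.get(i) for i in inputs))
        self.memory[name] = value
        return value


store = LineageStore()
store.checkpoint("raw", [3, 1, 2])
store.derive("sorted", sorted, ["raw"])
store.derive("top", lambda xs: xs[-1], ["sorted"])

# Drop both derived datasets, then ask for the last one again.
store.lose("sorted")
store.lose("top")
recovered = store.get("top")
```

In this sketch only the checkpointed input is stored durably; everything derived from it is recoverable by replaying the recorded functions, which is the trade-off lineage makes against replication.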

When it comes to Apache Spark™, Tachyon offers many advantages. For example, Tachyon can keep in-memory data safe even when a Spark job crashes. It also allows data to be shared at memory speed between different Spark jobs. Without Tachyon, each job would need to load the data from disk into main memory, greatly slowing down performance. Tachyon also goes beyond Spark and can be used with Hadoop MapReduce, Apache HBase, Apache Flink, and others.

Tachyon consists of two major layers: the lineage layer and the persistent layer. In this post, I focus on the persistent layer, which is responsible for preserving Tachyon’s checkpoint data in the underlying storage (which may be Amazon S3, HDFS, or GlusterFS).

Tachyon internally implements HDFS interfaces to interact with the underlying storage. As a result, any storage system that exposes the HDFS interface can easily be plugged into Tachyon. At IBM Research, we recently extended Tachyon’s persistent layer to work with OpenStack Swift and the SoftLayer public object store. We based this integration on the Swift driver from the Hadoop OpenStack module.

While testing the Tachyon-Swift integration, we found that using Hadoop modules for Tachyon is far from optimal, primarily because the default Hadoop code is not optimized to work with object stores. To explain this problem, let’s look at the example of FileOutputCommitter, which comes with Hadoop and is built to work with a file system as opposed to an object store. This means we must maintain the file system structure of directories and sub-directories any time we want to work with a file. For example, if we want to work with the file container/a/b/c/data.txt, the Swift driver has to create empty objects for container, a, a/b, and a/b/c in order to maintain the nested structure demanded by Hadoop. In contrast, an architecture that is optimized for object storage would allow Swift to simply create a container with an object called a/b/c/data.txt. Because Swift supports object names that contain delimiters and supports listing by prefix, it wouldn’t have to generate all of these intermediate structures. In short, by having Tachyon work directly with Swift through a different architecture, we can make things work much more efficiently.
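The difference between the two layouts can be sketched with a small, self-contained Python model. This is only an illustration that uses plain lists in place of a real object store: the function names are hypothetical, no Swift, Hadoop, or Tachyon API is called, and `list_prefix` is a rough emulation of a prefix-plus-delimiter listing rather than the actual Swift response format.

```python
def hierarchical_objects(path):
    """Objects a file-system-style driver would create for <container>/<path>.

    Returns (container, object names), including one empty marker object
    per intermediate "directory".
    """
    container, _, object_name = path.partition("/")
    parts = object_name.split("/")
    markers = ["/".join(parts[:i]) for i in range(1, len(parts))]
    return container, markers + [object_name]


def flat_objects(path):
    """Objects an object-store-native layout would create: just the data."""
    container, _, object_name = path.partition("/")
    return container, [object_name]


def list_prefix(object_names, prefix, delimiter="/"):
    """Roughly emulate a prefix + delimiter listing over flat object names.

    Names nested deeper than one level below the prefix are rolled up
    into a single pseudo-directory entry ending with the delimiter.
    """
    entries = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        entries.add(prefix + head + sep)
    return sorted(entries)


path = "container/a/b/c/data.txt"
hier = hierarchical_objects(path)
flat = flat_objects(path)

# A directory-style listing still works without any marker objects.
names = ["a/b/c/data.txt", "a/b/other.txt", "x/y.txt"]
listing = list_prefix(names, "a/b/")
```

For this path the hierarchical layout needs four objects (three empty markers plus the data), while the flat layout needs one, and the listing over flat names still yields a directory-like view.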

To overcome these drawbacks, we developed a new architecture that doesn’t depend on the existing Hadoop Swift driver and helps Tachyon work more efficiently with Swift. Our approach accesses Swift directly via the JOSS library. Our early tests show a significant improvement in performance and user experience compared to the previous solution. We are about to contribute our new architecture to the Tachyon community.

IBM Research sees Tachyon as a promising new technology. We will continue to evaluate the new architecture and advance the integration between Spark and Tachyon, while looking into innovative combinations of Spark, Tachyon, and Swift. Stay tuned: a Spark + Tachyon evaluation post is coming soon.