
Introducing EclairJS

In this post, we describe the motivation behind the EclairJS project and provide a glimpse into its capabilities. Node.js is fast becoming one of the more popular frameworks for quickly developing front-end applications for the enterprise. The simplicity of JavaScript programming combined with the network scalability of Node.js enables developers to quickly build new applications that can handle very large numbers of concurrent requests. Despite its network scaling capabilities, however, Node.js is not a good platform for large-scale data processing.

Node.js’s scaling is achieved through its so-called non-blocking event loop, which favors accepting new network connections over completing back-end processing tasks, so a typical Node.js application will offload any significant data processing work to back-end engines. A Node.js server operates on a system of asynchronous call-backs that are essentially contracts between the server and the back-end engines to deliver results sometime in the future. While a contract is in place, the server is free to accept more requests, and it does not respond to the contract’s originating request until the results are returned from the back-end engine.
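To make this pattern concrete, here is a minimal sketch; the queryBackend function is a hypothetical stand-in for any call to a back-end engine:

// Hypothetical stand-in for a back-end engine call; it returns a
// promise that is fulfilled when the engine delivers its result.
function queryBackend(request) {
    return new Promise(function (resolve) {
        setTimeout(function () {
            resolve("result for " + request);
        }, 1000); // simulate a slow back-end task
    });
}

// The call-back passed to .then runs only once the result arrives...
queryBackend("monthly-report").then(function (result) {
    console.log(result);
});

// ...while the event loop stays free to serve other requests.
console.log("still accepting connections");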

Another trend in enterprises is to use more data to better understand their customers, provide better customer service, and improve internal business processes. The data available to enterprises for these purposes is rapidly growing along all three of the “v” dimensions, i.e. velocity, volume and variety. For example, businesses are looking to streaming data to provide more up-to-date information so they can make business decisions more quickly. Such data may be structured or unstructured, as in the case of social media, and it may be best represented in forms other than the familiar tables of relational databases, such as the graph structures linking social and organizational entities. Furthermore, the range of analytics functions available to the enterprise has grown considerably, ranging from simple descriptive statistics such as counts and averages to neural networks that can accurately categorize images and model patterns of human behavior.

In recent years, a number of technologies have aimed to provide enterprises with platforms for analyzing data with high values along the three “v” dimensions. The current favorite is Apache Spark, which provides a general-purpose, scalable engine for data processing. Within a single platform, Spark provides capabilities for processing batch and streaming data, data represented as SQL tables and as graphs, and it has a growing library of machine learning algorithms for analyzing all of this data. The Spark platform is highly scalable, with scale achieved by adding compute nodes to a Spark cluster. It is also fast due to its in-memory processing model, and it serves both developers and data scientists by providing its APIs in several languages, namely Scala, Java, Python and R.

The purpose of the EclairJS project is to bridge the gap between Node.js applications and Apache Spark by providing the Spark API in JavaScript, a language that is not otherwise supported in Spark. To illustrate EclairJS’s ability to bridge this gap, we’ll describe a Node.js application written in JavaScript that uses one of Spark’s machine learning algorithms, k-means (https://en.wikipedia.org/wiki/K-means_clustering), which clusters observations that have similar characteristics. Imagine we are building a real-estate application and we want to segment (cluster) the properties in a regional housing market by price, square footage, number of bedrooms, etc., so we can help sellers determine which segment they should sell into. Using EclairJS we can write some JavaScript that will first create a k-means model describing the various segments and then predict, for any new property, which segment it should belong to:

 1 var spark = require("eclairjs");
 2 
 3 var sc = new spark.SparkContext("local[*]", "K Means Example");
 4 
 5 var rawTrainData = sc.textFile("trainData.txt");
 6 var trainData = rawTrainData.map(function (line, Vectors) {
 7     var tokens = line.split(" ");
 8     var point = [];
 9     tokens.forEach(function (t) {
10         point.push(parseFloat(t));
11     });
12     return Vectors.dense(point);
13 }, [spark.mllib.linalg.Vectors]);
14 var nClusters = 3, nIterations = 20; // example parameter values
15 var model = spark.mllib.clustering.KMeans.train(trainData, nClusters, nIterations);
16 
17 model.clusterCenters().then(function (results) {
18     console.log('Cluster centers: ', results);
19 });
20 
21 model.computeCost(trainData).then(function (results) {
22     console.log('Cost (WSSE): ', results);
23 });

We start by loading the EclairJS module using Node.js’s require function, and then create a SparkContext, which is the context for all the Spark operations and variables (lines 1 & 3). In this example, we specify that the context should run on our local machine, but this is where we could specify a remote Spark cluster. After creating the context, we read and prepare the data that will be used to train our model, representing it as an RDD of Vectors (lines 5-13). One point to notice here is that the inline function definitions, i.e. the parameters of .map and .forEach, are written in JavaScript. This is significant because under the covers these functions are executed on the Spark cluster’s distributed worker nodes, which is also why any driver-side objects they use, such as Vectors, are passed in explicitly as bind arguments (line 13). Once the training data has been prepared, we choose the number of clusters and iterations (line 14) and use the data to train the model (line 15).
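For reference, each line of trainData.txt holds one property’s features as space-separated numbers; the values below are purely illustrative (price, square footage, bedrooms):

449000 2100 3
725000 3400 5
315000 1500 2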

Computing the model and applying it to make predictions from new data may take some time. However, in an interactive user application, we will often want to continue executing statements rather than stopping to wait for the results of such long-running processes, and in these cases we need a mechanism to handle the results when they are finally returned. You can see how EclairJS accomplishes this with the .then functions (lines 17 & 21), which take as arguments call-back functions to be executed when the model has been computed. So when our application runs, it will only print the parameters of the model (line 18) and the results of applying the model (line 22) when those results are available; until that time it may execute statements appearing later in the application, i.e. after line 23 in our example.
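For example, once the model is ready we could ask which segment a new listing falls into. The sketch below assumes that EclairJS mirrors Spark’s KMeansModel.predict method and, like the calls above, delivers its result through a promise; the feature values are again illustrative:

// Predict the segment for a hypothetical new property.
var newProperty = spark.mllib.linalg.Vectors.dense([525000, 2400, 4]);
model.predict(newProperty).then(function (segment) {
    console.log('New property belongs to segment: ', segment);
});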

By way of this example, we hope to have shown that EclairJS enables Node.js developers to write applications that take advantage of the power provided by Apache Spark. The basic application shown here will run unchanged on a local machine with a couple of megabytes of data or on a large cluster with terabytes of data. Furthermore, the application is written entirely in JavaScript and uses constructs, such as call-back functions, that should be familiar to Node.js and JavaScript developers. Clearly there are Spark-specific semantics that must be accommodated in EclairJS applications, such as creating a SparkContext; however, these seem to be reasonable accommodations to make in order to take advantage of Apache Spark.

EclairJS exists as an Apache-licensed project on GitHub; see http://github.com/EclairJS/eclairjs-node. In addition to the code, the project provides examples, build instructions, and other resources.

In the second part of our blog, we will delve further into the technical details behind EclairJS and provide more examples of its capabilities.
