
STC Contributions to Apache Spark™

In a relatively short period of time, the IBM Spark Technology Center (STC) has made notable contributions to the greater Apache Spark ecosystem, and it is energetically pushing Spark forward.

Since June 2015, when IBM announced the Spark Technology Center (STC), engineers at the STC have actively contributed to the Spark releases v1.4.x, v1.5.x, v1.6.0, and v1.6.1, as well as to release v2.0 (in progress).

As of this writing, the IBM STC has contributed to 279 JIRAs and counting. About half of them address major JIRAs reported against Apache Spark.

Let’s take a quick peek into these 279 contributions:

  • 127 out of 279 (46%) are deliverables in the Spark SQL area
  • 71 (25%) are in the MLlib module
  • 46 (16%) are in the PySpark module

These top three focus areas account for 87% of the IBM STC's total contributions to date. The rest are in documentation, Spark Core, the Streaming module, and elsewhere.

You can track the STC's cumulative progress on the live dashboard on GitHub.

Specific to Spark 1.6.x, IBM team members made over 95 commits, primarily from the STC. A total of 29 team members contributed to the release (26 of them from the STC), and each contributing engineer is credited in the Spark 1.6.x release notes.


For Spark SQL, we contributed enhancements and fixes to the new Dataset API, the DataFrame API, data types, UDFs, and SQL standard compliance, such as adding EXPLAIN and printSchema capability and support for coalesce and repartition.

In addition, STC engineers:

  • Added support for the CHAR column datatype.
  • Fixed type-extractor failures for complex data types.
  • Fixed DataFrame bugs in saving partitioned Parquet files with long columns, and handled various nullability bugs and optimization issues.
  • Fixed a limitation in the ORDER BY clause to comply with the SQL standard.
  • Fixed a number of UDF code issues while completing Stddev support.

Machine Learning

In the area of machine learning, the STC team met with key influencers and stakeholders in the Spark community to work jointly on the roadmap for machine learning in Spark. Most of the roadmap items discussed made it into 1.6, except for the implementation of the LU decomposition algorithm, which is slated for the upcoming release.

In addition to helping implement the roadmap, here are some notable contributions:

  • Improved PySpark distributed matrix algebra by enriching the matrix operations and fixing bugs.
  • Enhanced the Word2Vec algorithm.
  • Added optimized 1st- through 4th-order summary statistics for DataFrames (technically in Spark SQL, but related to machine learning).
  • Greatly enhanced the PySpark API by adding interfaces to the Scala machine learning tools.
  • Made a performance enhancement to the Linear Data Generator, which is critical for unit testing in Spark ML.

And more…

The team also addressed major regressions in the DataFrame API, enhanced support for Scala 2.11, made enhancements to the Spark History Server, and added a JDBC dialect for Apache Derby.

In addition to the JIRA activities, the IBM STC added JDBC dialect support for DB2 and made the Spark Connector for Netezza v0.1.1 available to the public through Spark Packages and a developer blog on the IBM external site.

We’d love for you to take a look at the work we’re doing and help us with this ongoing effort to contribute in a big way to the open-source community.

