Q & A with Shiv Vaithyanathan, Creator of SystemML and IBM Fellow

Eight years ago, Shiv Vaithyanathan and his colleagues started work on ideas that would evolve into SystemML, the powerful machine learning platform for distributed computing frameworks like MapReduce and Apache Spark. By separating algorithm specification from optimization, Shiv and his team gave data scientists a way to be “super-creative”: to express algorithms in simpler, more flexible terms — without sacrificing speed or scalability. Following on from last month’s conversation about SystemML with Fred Reiss, we’ll hear about the origins and evolution of SystemML, as well as Shiv’s own vision for the future of machine learning in light of IBM’s landmark decision to donate SystemML to the open-source community.

“You need to have passion. Everything else can be learned.”

You are the original creator of SystemML. How did it start? What problem were you trying to solve?

In 2007, my group was already working closely with Hadoop. Most of the processing was ETL, and it was major work to parse documents and extract metadata from them. Even then, parsing was only the first step; our goal was to do more advanced processing with the parsed documents. Because the parsed documents were already on HDFS, I started by experimenting with writing the first machine learning algorithm on Hadoop. To gain experience with more ML algorithms, I hired a summer intern — a really bright graduate student from Wisconsin — and she and I planned to have multiple algorithms running on Hadoop over the summer of 2008. Long story short, at the end of the summer not a single one of our algorithms was working efficiently.

When did you know you had a solution — or something new and meaningful? What did that feel like?

By the end of that summer, my intern and I decided that the current direction of writing every algorithm against low-level APIs (for example, map and reduce) wasn’t tenable. Not only did we need to re-think each algorithm and re-write it around a distributed system’s choices and idiosyncrasies, but such open-source systems were themselves a moving target. With every release, we would have had to put in effort just to keep the algorithms running efficiently. There simply had to be an easier way!

The answer wasn’t actually far away. Almaden, where I work, is the birthplace of SQL — the original declarative system for data management. We needed to think along the same lines as SQL, by separating the “what” from the “how”. In other words, we had to separate the specification of the algorithm — and make that specification easy — from the way we optimized the algorithm.
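To make the “what”/“how” separation concrete, here is a rough sketch of what an algorithm looks like in SystemML’s R-like declarative language: a few steps of gradient descent for least squares, written purely in linear-algebra terms. The variable names, step size, and iteration count are illustrative, not taken from any shipped SystemML script; the point is that nothing here mentions map, reduce, or data partitioning — the optimizer decides how to execute it, whether on a single node or on a cluster.

```
# Illustrative sketch in SystemML's R-like declarative language.
# Reads a feature matrix X and label vector y, then runs a fixed
# number of gradient-descent steps for least-squares regression.
X = read($X)
y = read($y)
w = matrix(0, rows=ncol(X), cols=1)   # initial weights
for (i in 1:100) {
  grad = t(X) %*% (X %*% w - y)       # gradient of squared error
  w = w - 0.0001 * grad               # fixed step size (illustrative)
}
write(w, $w)
```

The same script runs unchanged as the data grows; choosing execution plans for the matrix operations is the system’s job, not the author’s.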

Once that was evident, I just needed to define our initial scope. I remember that once I defined that scope I dropped an email to a faculty friend to get his reaction. I sent the email late in the evening and got a strongly positive response within a couple of hours. That was enough for me to get management approval to put together the larger project — and SystemML was born. Of course, we’ve worked out many details since then, but that was the start.

So how does this change the life of a data scientist?

This was the major motivation for SystemML. In 2007/2008, my intern and I played the role of data scientists. However, we also had the luxury of world-class systems folks at Almaden to help whenever we ran into an issue. That’s not the case for every data scientist! Separating the “how” from the “what” enables two different personas to bring their best to the problem. First, the data scientist can be a data scientist and not a systems hacker or an open-source developer. It lets the data scientist be super-creative — not encumbered by how to optimize the algorithms, and not forced to make compromises to accommodate the constraints of the system. Second, separating out the “how” enables the larger systems community to get involved in a fundamental way. The creative data scientist continues to push the boundaries by writing new algorithms — and the systems programmer continually gets new workloads to keep moving the system forward. That’s a marriage made in heaven and tailor-made for open source!

I’ve heard questions about merging SystemML with the Spark codebase — that the SystemML DSL optimizer is different. Do you think this will affect integration with Spark?

I’d like to bring some SystemML optimization capabilities to Catalyst. Operationally, we can minimize changes to Spark since Catalyst is already available. Further, we can optimize end-to-end workloads ranging from data prep to the core machine learning algorithm. There’s potential not only to bring innovation to customers but to bring value to the entire analytics pipeline.

I hear people wondering whether SystemML being written in Java is going to impact adoption by Spark users. Do you think that matters?

For SystemML users, I don’t think there will be a difference. We’ll have a Scala API for users of algorithms written in SystemML. And we’re working on a flavor of Scala DSL (to augment the R and Python DSLs in SystemML), though that’s still on the drawing board and not yet in plan.

We expect open source to drive the direction of SystemML and IBM is being very aggressive about support for it. IBM will do what it takes to make SystemML successful.

Now that it’s open, are you and the other original creators of SystemML still maintaining it in the open-source world, or are you just putting out a snapshot and going back to other research?

Absolutely! We’re not only maintaining it but growing the team both in Research and in the STC. This is a major commitment for us and we’ll be working with it in open-source very aggressively.

Any thoughts of open-sourcing SystemT in a similar way?

At this point, we haven’t started that discussion, but I do encourage folks interested in SystemT to have a look at the tutorials available here:

What was it like to do machine learning in 2010 versus today?

Until a few years ago, data was only available to a small set of companies who provided insights from data as their core business, but in the last few years virtually all enterprises have realized that they’ve been collecting a lot of valuable data as part of their regular business operations. I have the luxury of working with many customers, and the question enterprises are asking now is: “How do I monetize my data?”

If they’re already in the business of monetizing their data, then they want to know: “How else can I monetize my data?” Or often: “What other data sources can I acquire or partner with to increase the value of my data?”

The first step in monetizing data is to derive insights — and to do that you need machine learning. There’s no paucity of data and the main concern for the data scientist is how to run machine learning algorithms on the largest possible data sets. Data scientists live and die by intuition gained from running their algorithms on data, and in today’s world they need to run their algorithms on larger and larger data sets to truly validate their intuitions with insights from the data.

What’s your vision for where machine learning will go from here? Any clues from IBM Research or elsewhere?

ML is going in multiple directions from here. For one, we’ll see concentrated attempts to make ML easier for business analysts to use. This is a no-brainer and very good work is already happening — and there will be significant advancements here soon. At IBM Research, we’re also exploring ideas at the other end of the spectrum. For mathematicians, who until now have not even bothered to test their algorithms: can we help them test those algorithms against data without forcing them to learn anything else? We’re also working on multiple applications of large-scale machine learning, deep learning, pushing ML algorithms down to hardware, and so on.

What are you doing with Watson? How is it different than what you were doing before?

I’m responsible for the core Watson Content Services. I’ve got the opportunity to work with the varied Watson applications and build a back-end Content Analytics / Services system that can support both the semantic and systemic requirements of those applications.

I read you started your career as a tech reporter in Mumbai, when tech was first booming there. How did that experience compare to the tech boom in Silicon Valley?

Very different experiences. The tech reporter stint came very early in my life and was fascinating. In some ways, it’s what motivated me to move toward a career in tech.

