Just days away from August 15th, the official date IBM contributes SystemML (IBM’s declarative large-scale machine learning research project) to open source, engineers from San Francisco to Singapore are compulsively checking and rechecking GitHub in anticipation. STC researcher Fred Reiss, the author of the first prototype of SystemML’s backend for Apache Spark, took time out from getting SystemML ready for its debut to talk about resource optimization for machine learning (no more picking a huge cluster size on the better-safe-than-sorry theory), overcoming skepticism about Spark, and how the Spark Technology Center is positioned to look out for the interests of mainstream users.
How’d you first get involved with open source and why do you think it’s important?
Fred Reiss: I got involved with open source in grad school. The research group I was a member of produced a streaming query engine called TelegraphCQ, which was based on PostgreSQL. Having a full-fledged database to use as a starting point was a huge help in our project. When we were done writing papers and dissertations, we took the whole system and turned it into a new open source project, which a number of researchers have since used as a basis or a comparison point for their own work. One of my lab-mates took the same open-source codebase and built a startup around it. Because of open source, we were able to build a much more complete system to test our ideas; we were able to share those ideas with future researchers in a very concrete way; and we were able to have our work stay “alive” much longer than it would have otherwise. My experiences at Berkeley left me with a deep appreciation for the benefits of having a code base that exists out in the open and can be freely shared.
Did you work with the AMPLab people when you were at Berkeley?
FR: The AMPLab was just starting as I was leaving Berkeley. I would definitely have been a part of it if that resource had been available. My thesis work required me to consult with our networking and OS faculty, and having an AMPLab would have smoothed the way for that collaboration. I can really appreciate the additional opportunity for interdisciplinary work that my immediate successors at Berkeley had as a result of the AMPLab initiative.
IBM is about to contribute SystemML to open source. I’ve met some very, very excited people in the open source community who want to know more. Why does SystemML have the business and the engineering worlds so excited?
FR: Over the last few years, the research group behind SystemML has worked with a dozen different customers on big data problems involving machine learning. Across all of these engagements, there’s been one consistent factor: Every one of these customers could plausibly argue that their use case was so different that a custom solution was necessary. Machine learning is a really multidimensional discipline. There are so many different factors that define a given real-world problem (things like the number and types of features, type of regression algorithm, need for regularization, number of observations) that the only way to attack an evolving business problem is with software that can also evolve.
So practitioners want a system that makes it easy to change all aspects of the machine learning pipeline, from the initial data transformation all the way down to tweaking the way that the learning algorithm works. This requirement translates into having a high-level language that exposes the aspects of machine learning that change frequently, while hiding the more low-level, system-specific aspects from the user. For problems that fit on a single machine, there are a number of systems that give this kind of control, including open-source systems like R and Jupyter, as well as commercial products like SPSS and SAS. In the big data space, there really hasn’t been any high-level language, just a collection of point solutions and some frameworks for tying them together. SystemML represents a new design point: a flexible, high-level language coupled with an optimizer and runtime that can handle big data problems.
How’s the plan to open source it coming along? Do you have any concerns about releasing it into the wild?
FR: IBM has a long history of working with open source, and the company has done open-source releases many times in the past. We have a very well-thought-out process for releasing open-source code, and the Spark Technology Center has been working steadily through the steps of that process. At this point, the main thing we’re focused on is polishing up the system and clearing out as many of our open work items as possible before the release. It’s important to make a good first impression, and we want our first open release of SystemML to be as well-received as possible.
Tell us about the latest research from the SystemML group at IBM Research.
FR: We just presented a paper at the SIGMOD 2015 conference on resource optimization for machine learning. In today’s cloud-based compute environments, it’s important to allocate the right amount of compute and memory for each task. The current best practice in machine learning is to pick a cluster size that is known to be too large for the problem at hand, on the grounds that doing so is a lot safer than using too small a cluster and running out of memory. On public clouds like the SoftLayer cloud, that approach leaves a lot of money on the table. In private clouds, the first data scientist who gets onto the cluster grabs resources away from everyone else.
The problem we solved in our paper was that of automatically finding the smallest amount of driver and executor memory that gives optimal throughput for a given algorithm and data set. SystemML, with its high-level language and optimizer, is in a unique position to solve this kind of problem. Our system can look at the data and the algorithm and accurately predict its memory requirements. The paper shows how we designed an optimization strategy for resource optimization, implemented that strategy in SystemML, and delivered a 5x improvement in training throughput for multitenant clusters.
We also have a new journal article with an in-depth picture of the internals of SystemML’s optimizer. There are quite a few other papers in the pipeline, but obviously I can’t talk about those yet.
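To make the resource optimization idea in this answer concrete, here is a minimal Python sketch of the general shape of the problem: search over candidate driver and executor memory sizes and keep the smallest configuration whose estimated throughput matches the best one. This is an illustration only, not SystemML's optimizer. The cost model, the function names (`estimated_throughput`, `smallest_optimal_config`), and the parameters (`data_gb`, `model_gb`, `tolerance`) are all assumptions invented for the example.

```python
from itertools import product


def estimated_throughput(driver_gb, executor_gb, data_gb, model_gb):
    """Toy cost model (an assumption, not SystemML's): throughput saturates
    once the model fits in driver memory and the data fits in executor memory."""
    driver_factor = min(1.0, driver_gb / model_gb)
    executor_factor = min(1.0, executor_gb / data_gb)
    return driver_factor * executor_factor


def smallest_optimal_config(driver_options, executor_options,
                            data_gb, model_gb, tolerance=0.0):
    """Return the (driver_gb, executor_gb) pair with the least total memory
    whose estimated throughput is within `tolerance` of the best estimate."""
    scored = [
        (estimated_throughput(d, e, data_gb, model_gb), d, e)
        for d, e in product(driver_options, executor_options)
    ]
    best = max(score for score, _, _ in scored)
    # Keep every configuration that is (nearly) as fast as the best one.
    good_enough = [
        (d + e, d, e) for score, d, e in scored
        if score >= best * (1.0 - tolerance)
    ]
    # Among those, pick the one with the smallest total memory footprint.
    _, driver_gb, executor_gb = min(good_enough)
    return driver_gb, executor_gb


if __name__ == "__main__":
    config = smallest_optimal_config(
        driver_options=[1, 2, 4, 8],
        executor_options=[2, 4, 8, 16, 32],
        data_gb=6,
        model_gb=1.5,
    )
    print(config)
```

In this toy setup, anything larger than the chosen configuration buys no extra estimated throughput, which is the "too large a cluster" waste described above. A real optimizer would replace the toy cost model with estimates derived from the algorithm and the data.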
What are you doing with SystemML on Spark, and what’s your vision for the potential there — either for end-use business applications or for development of the technology itself?
FR: I wrote the first prototype of SystemML’s backend for Spark. At the time, there was quite a bit of skepticism within the group about the utility of Spark. But the results we have been able to get using the system, both in terms of performance and in terms of how easy it is to code against, converted the whole group. Spark has a lot of hype behind it right now, but the system has a habit of really selling itself to systems programmers. I think that the positive first-time experience of Spark will give the system a source of grass-roots momentum that will make it hard to unseat as the leader in distributed analytics.
Why the Spark Technology Center? Besides working in downtown San Francisco, what’s the difference between the STC and the work you were doing within IBM proper?
FR: I came to the STC from IBM Research. The environment at the STC is actually not that different from a research lab, but there are a few differences in degree. We have a lot more manpower at the STC to turn ideas into reality. We target those ideas at a much more broad-based audience, compared with research projects that aim to impress only ivory-tower academics. And we have a more direct connection at the STC with IBM’s strategy and our product line.
What’s your vision for the future of Spark? What swamps of stagnation are you trying to avoid? What great things can you imagine happening with it?
FR: One of the things that Spark has done really well has been to target the needs of mainstream users. Spark really went against the grain by targeting smaller clusters and smaller data sizes, as well as by emphasizing ease of use ahead of scalability. There is a danger right now that Spark could end up being redirected by a few large users to serve those users’ needs ahead of those of the majority. That danger became a reality with Hadoop, where a handful of companies with really big clusters directed nearly all of the project’s resources towards working well at massive scales, while ignoring the needs of the vast majority of users.
The Spark Technology Center has an opportunity here to serve as a neutral party in that regard. IBM’s customers represent the mainstream of analytics, with relatively small clusters, relatively small data, and a need to build solutions quickly and simply. We need to make sure that the needs of those users are represented in a field increasingly dominated by large companies with unique needs and small startups trying to create walled gardens.
Correction: SystemML was officially released to open source on Monday, August 31. Find it here. – Ed