Why you should be using Apache SystemML

What is SystemML? Why is it relevant to you?

Here's the deal: you've probably never heard of SystemML, but you definitely need to know what it is. Why? Not only will SystemML make you look awesome because machine learning is the hot topic right now, but it will also save you a lot of time and trouble. As a new data scientist, I constantly have to spend my time learning new technologies, most of which don't work very well. Here's the thing: SystemML actually does work very well. Because it only recently became open source, it's difficult to find material on how to get started, but that's quickly changing. For now, let me break down what SystemML is, why you want to use it, and how it will get even easier.

SystemML's official definition is this:

"SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single-node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark."

What does that mean in normal people language? Let's break it down.

SystemML is an Apache Incubator project, and IBM's Spark Technology Center is putting a lot of focus on it. At a very high level, it is a language, and hopefully soon a platform, that lets you log into the Spark shell or a notebook like Jupyter or Zeppelin and use R-like or Python-like syntax to do all of your awesome machine learning stuff. Specifically, it caters to the linear algebra, matrices, statistical functions, and so on that you need for your machine learning project. Previously, data scientists would have to hand off their project to someone else to implement, because of the complicated code needed and the sheer size of the data. That issue is no longer relevant. Now you just load SystemML, load your data, and do all of your computations, just like that, even with incredibly large data. Very clear tables and statistics come out of it, and I just wish I had known about it in my classes at UC Berkeley!

But guess what: SystemML is even more awesome than that. The other compelling aspect of SystemML is how it functions. SystemML decides, line by line, whether to run on the driver or on the Spark cluster, which means it scales automatically. Because it decides what is optimal for any given line of code, it saves you valuable time and a great deal of hassle, so you can just focus on what you're trying to do and on your very own algorithms.
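To make that "line by line" idea a bit more concrete, here is a toy sketch in plain Python. It is not SystemML code and not how SystemML's optimizer is actually implemented. It just assumes a simple rule: estimate how much memory each operation's result needs, and run it on the driver if it fits a budget, otherwise on the cluster. The names `MatrixOp`, `choose_backend`, `plan`, and the 1 GiB budget are all made up for illustration.

```python
from dataclasses import dataclass

BYTES_PER_DOUBLE = 8  # assumption: dense matrices of 8-byte doubles


@dataclass
class MatrixOp:
    """Hypothetical stand-in for one line of a script and its output size."""
    name: str
    rows: int
    cols: int

    def estimated_bytes(self) -> int:
        # Dense worst case: every cell is stored.
        return self.rows * self.cols * BYTES_PER_DOUBLE


def choose_backend(op: MatrixOp, driver_budget_bytes: int) -> str:
    """Pick "driver" when the estimate fits the budget, else "cluster"."""
    if op.estimated_bytes() <= driver_budget_bytes:
        return "driver"
    return "cluster"


def plan(ops, driver_budget_bytes):
    """Decide a backend for each operation independently, in order."""
    return [(op.name, choose_backend(op, driver_budget_bytes)) for op in ops]


if __name__ == "__main__":
    budget = 1024 ** 3  # hypothetical 1 GiB driver memory budget
    script = [
        MatrixOp("X (10M x 100 input)", 10_000_000, 100),
        MatrixOp("G = t(X) %*% X", 100, 100),
        MatrixOp("w (100 x 1 weights)", 100, 1),
    ]
    for name, backend in plan(script, budget):
        print(f"{backend:8s} {name}")
```

The point of the sketch is that the decision is made per operation: the huge input lands on the cluster, while the small intermediate results can stay on the driver, all without you writing any of that logic yourself.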

And speaking of code, SystemML is also incredibly concise. One of my mentors, Deron Eriksson, mentioned to me that it was probably 10x to 100x less code than Scala, which is absolutely incredible. After trying it out for myself, I found this to be surprisingly true.

So how can you use SystemML now?

You can run SystemML locally on your computer. I personally find it easiest to do this in a Jupyter notebook or in the Spark shell with the new API. (This is being made public next week! I will write a tutorial, I promise!)

Why don't you know about this already and how is that changing?

SystemML has been an Apache Incubator project for less than a year, but a talented team of researchers at IBM Research in Almaden has actually been working on it since 2010. A lot of work has gone into making it awesome, and the research team has focused on a lot of customer use cases, but because it is only now becoming open source, the public materials haven't quite caught up just yet. Not to worry, that's why I'm here. Although the website is a bit tricky at the moment, it will be changing, and a lot of tutorials will soon be available. The best way to get started on it this very second is to follow this tutorial for Jupyter notebook. Next week, I'll show you how to run it on the Spark shell, which is my favorite.

Now that you've heard about how awesome SystemML is, I encourage you to go experiment with it! We want you to enjoy it as much as we do! Make sure to give us your feedback and let us know what projects you're working on with SystemML! If it's awesome, I'll blog about you!

Stay tuned for tutorials!

By Madison J. Myers

