Moon Soo Lee, Creator of Apache Zeppelin, on its History and Future

Moon Soo Lee

Moon Soo Lee is co-founder and CTO of NFLabs and the creator of Apache Zeppelin (incubating), a web-based notebook for building interactive data analytics using SQL, Scala and more.

Moon sat down with STC designer Jeremy Anderson for a conversation about Zeppelin’s history and its future.

How did Apache Zeppelin get started?

We (NFLabs) created a commercial data analytics product called Peloton back in 2012, on top of AMPLab Spark/Shark. We wanted to build a really helpful tool for data analytics, especially on top of distributed computing systems. Then, in 2013, we decided to open source an interactive analytics feature from Peloton, which we named Zeppelin.

How did Zeppelin become an Apache Software Foundation incubating project?

As Zeppelin gained more adoption in 2014, we realized that to really take Zeppelin to a global level, it had to be part of the Apache Software Foundation.

Have you been involved in other open source projects, prior to Zeppelin? What was it about Zeppelin that drew you in?

I was using many open-source projects and sometimes submitted patches, but I hadn’t been consistently involved in any one open-source project. I always dreamed of contributing back to the open-source world and I thought Zeppelin would be the best way to do that.

How has the response from the community been?

In the early days, there were only a few individual users in the community, and they gave a lot of valuable feedback. In fact, one of them became one of our initial Zeppelin committers. Since then, the community has embraced Zeppelin, and the project is now being driven forward with the community’s help. Today, the Zeppelin project has numerous users, more than 90 contributors worldwide, and lots of adoption in commercial products. Going through all this community growth has been a wonderful experience and really motivating.

In your words, what sets Zeppelin apart from other notebooks and data science tools? Why should data scientists use Zeppelin?

To me, the purpose of Zeppelin is not to be the best notebook, although it’s one of the best notebooks out there. I’d like to see Zeppelin as a platform for analytics applications. A notebook is just one possible application. Helium (ZEPPELIN-533) is an effort to turn Zeppelin into an application platform by making it truly pluggable. I believe the pluggability will give Zeppelin a lot of applications (extensions) for data scientists as well as business users.

The field of data science is rapidly growing and evolving. How do you see it changing over the next year?

Data science is becoming more and more important in any business, especially with the evolution of machine learning. As the field of data science grows, communication between data scientists and business users will become a big challenge.

How will this change influence the future of the Zeppelin notebook?

Zeppelin will be used not only by data scientists but also by business users, as part of how they communicate. That will require Zeppelin to support business users well, including their communication needs.

What’s next on the road map for Apache Zeppelin?

We’ll focus on making Zeppelin enterprise-ready by enhancing stability, security, and multi-user support. We’re also trying to become the best platform for analytics applications — both for data scientists and business users. And it’s an open-source project. The community is open. Please add a user road map!

