Developing Data Products with Jupyter Notebooks and Apache Spark

Note: This post draws material from a talk titled From Doodles to Dashboards: Notebooks as a Cloud Platform presented at Defrag 2015 on November 11, 2015. All of the notebooks shown in this post are included in the associated GitHub repo.

One of the major value propositions of Apache Spark is its ability to span the many phases of the data mining process. From exploration to deployment, Spark offers a consistent API, multi-language support, and scalability with both the volume and velocity of data. With Spark, data science teams can focus on the development of their data products, and avoid the accidental complexity of switching data processing technologies along the way.

To truly capitalize on this potential, data practitioners also need a user experience that caters to their many activities and their many uses of Spark (e.g., data munging, machine learning, stream processing). We see one answer to this need in interactive notebooks—living documents of text, code, visualizations, and widgets, backed by cloud compute and data. Combining the open ecosystem of Project Jupyter notebooks with Spark, for example, can help data science teams grow their data products from back-of-the-napkin ideas to reproducible interactive reports to just-good-enough dynamic dashboards.

As a demonstration of how Spark and Jupyter Notebooks complement each other to ease the creation of data products, consider the following scenario:

My team wants to help increase attendance at IBM meetups. We know from prior research that meetup attendees are more likely to subscribe to IBM cloud services, and so we think the effort is justified. But we don’t yet know how to empower our evangelists to attract new attendees.

While the problem above is hypothetical, the approach summarized below mirrors our path in a number of customer engagements.

Doodling to understand

Starting out, my teammates and I need to learn about the relevant data available to us. Notebooks provide a natural place for us to dabble with data sources and take notes along the way. We might, for instance, poke at the Meetup API in notebooks to fetch meetup lists by topic and to learn how to process the real-time RSVP stream with Spark. We might also try joining other public data sources with the meetup data to see if we can get richer detail about users and venues.
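To make this concrete, here is a minimal sketch of the kind of munging we might try first in a notebook cell before scaling it out with Spark. The record shapes and field names below are stand-ins modeled loosely on Meetup's public RSVP stream, not its exact schema:

```python
import json

# Hypothetical sample records shaped roughly like Meetup's public RSVP
# stream; the field names here are assumptions for illustration only.
sample_rsvps = [
    '{"response": "yes", "group": {"group_name": "Big Data Apps", "group_topics": [{"urlkey": "apache-spark"}]}}',
    '{"response": "no",  "group": {"group_name": "City Hikers",  "group_topics": [{"urlkey": "hiking"}]}}',
    '{"response": "yes", "group": {"group_name": "Cloud Devs",   "group_topics": [{"urlkey": "cloud-computing"}]}}',
]

def is_candidate_rsvp(raw_event, keywords):
    """Return True if the RSVP is a 'yes' for a group tagged with any keyword."""
    event = json.loads(raw_event)
    if event.get("response") != "yes":
        return False
    topics = {t["urlkey"] for t in event["group"].get("group_topics", [])}
    return bool(topics & keywords)

keywords = {"apache-spark", "cloud-computing"}
matches = [json.loads(r)["group"]["group_name"]
           for r in sample_rsvps if is_candidate_rsvp(r, keywords)]
print(matches)  # -> ['Big Data Apps', 'Cloud Devs']
```

Once a filter like this looks right on a handful of records, the same logic can be applied to the live stream at scale, for example inside a Spark Streaming job.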


Documenting to collaborate

After exploring for a while, we start to hit upon viable models for identifying and visualizing meetup candidates. Notebooks provide a nice canvas for comparing and contrasting these approaches with reproducible results. For instance, we might evaluate the performance of our various candidate models in a notebook that we can easily extend and re-run when need be. We might also start to look at different ways of visualizing candidates from our Spark stream and reaching out to them.
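A notebook cell for this kind of comparison can be as simple as scoring each model's ranked candidate list against ground truth from past meetups. The models, names, and labels below are invented stand-ins for whatever we would actually train with Spark:

```python
# A minimal sketch of comparing candidate models side by side.
def precision_at_k(ranked_candidates, true_positives, k):
    """Fraction of the top-k ranked candidates who actually attended."""
    top_k = ranked_candidates[:k]
    hits = sum(1 for c in top_k if c in true_positives)
    return hits / k

attended = {"alice", "dana", "erin"}          # ground truth from past meetups
model_a = ["alice", "bob", "dana", "carol"]   # candidates ranked by model A
model_b = ["bob", "carol", "alice", "frank"]  # candidates ranked by model B

for name, ranking in [("model A", model_a), ("model B", model_b)]:
    print(name, precision_at_k(ranking, attended, k=3))
# model A 0.666..., model B 0.333...
```

Because the notebook captures the metric, the data references, and the results together, re-running the comparison after retraining is a matter of re-executing the cells.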


Deploying to take action

Ultimately, we want to put our work into the hands of our evangelists so they can start to take action. Here, too, notebooks shine thanks to their ability to include front- and backend code in an open document format that we can transform. We might, for instance, roll up our work into a notebook that uses declarative widgets to show meetups and candidates in real time, to provide a one-click way to contact a Meetup user, and to track how many candidates RSVP after we reach out. We might then lay out this notebook as a dashboard and deploy it as a web frontend to be used by our evangelists.
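The conversion tracking behind such a dashboard can start out as a few lines of notebook code. This is a toy sketch with invented names; in practice the two sets would come from our outreach log and the live RSVP stream:

```python
# Which contacted candidates later RSVP'd "yes"? (illustrative data only)
def conversion_report(contacted, rsvped_yes):
    """Return the converted candidates and the conversion rate."""
    converted = contacted & rsvped_yes
    return sorted(converted), len(converted) / len(contacted)

contacted = {"alice", "bob", "carol"}    # candidates our evangelists reached out to
rsvped_yes = {"alice", "carol", "erin"}  # "yes" RSVPs seen on the stream

converted, rate = conversion_report(contacted, rsvped_yes)
print(converted, rate)  # -> ['alice', 'carol'] 0.666...
```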


Discussing new insights

After deploying the app, we start to collect both objective and subjective feedback from our evangelist users. Together, we think of improvements to not only our first-cut dashboard UI, but also to how we identify and rank candidates. These new ideas are ripe for experimentation in notebooks and redeployment as improved dashboards. We can continue iterating in this fashion until our data product is just-good-enough for our evangelists, until it warrants implementation as a production-level Spark data processing pipeline and web application, or until it deserves no further investment.

And more than likely along the way, we'll discover new ways to bring insights to our team. For instance, we might bridge our work in notebooks to Slack to make information about meetups readily available in our ongoing team conversation.
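One lightweight way to build that bridge is Slack's incoming webhooks, which accept a JSON body with a "text" field. The webhook URL below is a placeholder, and the message format is our own invention:

```python
import json
from urllib import request

def build_meetup_message(group_name, event_name, rsvp_count):
    """Format a meetup update as a Slack incoming-webhook payload."""
    return {"text": f"{group_name}: '{event_name}' now has {rsvp_count} RSVPs"}

def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook (makes a network call)."""
    req = request.Request(webhook_url,
                          data=json.dumps(payload).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)

payload = build_meetup_message("Big Data Apps", "Spark 101", 42)
print(payload["text"])  # -> Big Data Apps: 'Spark 101' now has 42 RSVPs
# post_to_slack("https://hooks.slack.com/services/...", payload)  # URL elided
```

A cell like this can run on a schedule or fire from the same stream-processing code that feeds the dashboard.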


Bottom line

Notebooks help data science teams realize the value of Spark throughout the evolution of their data products. Together, they form a powerful platform for performing analytics at scale and developing data products with speed.

