Developing Data Products with Jupyter Notebooks and Apache Spark

Note: This post draws material from a talk titled From Doodles to Dashboards: Notebooks as a Cloud Platform presented at Defrag 2015 on November 11, 2015. All of the notebooks shown in this post are included in the associated GitHub repo.

One of the major value propositions of Apache Spark is its ability to span the many phases of the data mining process. From exploration to deployment, Spark offers a consistent API, multi-language support, and scalability with both the volume and velocity of data. With Spark, data science teams can focus on the development of their data products, and avoid the accidental complexity of switching data processing technologies along the way.

To truly capitalize on this potential, data practitioners also need a user experience that caters to their many activities and their many uses of Spark (e.g., data munging, machine learning, stream processing). We see one answer to this need in interactive notebooks—living documents of text, code, visualizations, and widgets, backed by cloud compute and data. Combining the open ecosystem of Project Jupyter notebooks with Spark, for example, can help data science teams grow their data products from back-of-the-napkin ideas to reproducible interactive reports to just-good-enough dynamic dashboards.

As a demonstration of how Spark and Jupyter Notebooks complement each other to ease the creation of data products, consider the following scenario:

My team wants to help increase attendance at IBM meetups. We know from prior research that meetup attendees are more likely to subscribe to IBM cloud services, and so we think the effort is justified. But we don’t yet know how to empower our evangelists to attract new attendees.

While the problem above is hypothetical, the approach summarized below mirrors our path in a number of customer engagements.

Doodling to understand

Starting out, my teammates and I need to learn about the relevant data available to us. Notebooks provide a natural place for us to dabble with data sources and take notes along the way. We might, for instance, poke at the Meetup API in notebooks to fetch meetup lists by topic and to learn how to process the real-time RSVP stream with Spark. We might also try joining other public data sources with the meetup data to see if we can get richer detail about users and venues.
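To make the exploration step concrete, here is a minimal, Spark-free sketch of the kind of per-topic tallying we might prototype against the RSVP stream in a notebook cell. The field names (`response`, `group.group_topics` with `urlkey` entries) reflect the shape of Meetup's RSVP payloads, but treat the exact schema here as an assumption; the sample events are hand-made stand-ins for the live stream.

```python
from collections import Counter

def topic_counts(rsvp_events):
    """Count 'yes' RSVPs per meetup topic from a list of RSVP events.

    Assumed event shape (based on Meetup's RSVP stream): a 'response'
    field ('yes'/'no') and a 'group' dict with a 'group_topics' list
    of {'urlkey': ..., 'topic_name': ...} entries.
    """
    counts = Counter()
    for event in rsvp_events:
        if event.get("response") != "yes":
            continue
        for topic in event.get("group", {}).get("group_topics", []):
            counts[topic["urlkey"]] += 1
    return counts

# Hand-made events standing in for the live RSVP stream:
sample = [
    {"response": "yes",
     "group": {"group_topics": [{"urlkey": "apache-spark",
                                 "topic_name": "Apache Spark"}]}},
    {"response": "no",
     "group": {"group_topics": [{"urlkey": "apache-spark"}]}},
    {"response": "yes",
     "group": {"group_topics": [{"urlkey": "big-data"}]}},
]
print(topic_counts(sample))
```

In a notebook, the same logic would typically run inside a Spark Streaming job over the live stream, but isolating it as a plain function first makes it easy to test and iterate on.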


Documenting to collaborate

After exploring for a while, we start to hit upon viable models for identifying and visualizing meetup candidates. Notebooks provide a nice canvas for comparing and contrasting these approaches with reproducible results. For instance, we might evaluate the performance of our various candidate models in a notebook that we can easily extend and re-run when need be. We might also start to look at different ways of visualizing candidates from our Spark stream and reaching out to them.
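The side-by-side model comparison we would record in such a notebook can be as simple as precision-at-k against held-out attendance data. A minimal sketch follows; the model names, member IDs, and choice of metric are all illustrative, not part of the original scenario.

```python
def precision_at_k(ranked_candidates, actual_attendees, k=10):
    """Fraction of the top-k ranked candidates who actually attended."""
    top_k = ranked_candidates[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for member_id in top_k if member_id in actual_attendees)
    return hits / len(top_k)

# Hypothetical rankings produced by two candidate-scoring models:
model_a = ["m1", "m2", "m3", "m4"]
model_b = ["m4", "m3", "m9", "m8"]
attended = {"m1", "m3", "m8"}  # who actually showed up

for name, ranking in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(precision_at_k(ranking, attended, k=3), 3))
```

Because the notebook captures both the metric and the data loading alongside the narrative, re-running the comparison after a model tweak is a single "Run All" away.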


Deploying to take action

Ultimately, we want to put our work into the hands of our evangelists so they can start to take action. Here, too, notebooks shine thanks to their ability to include frontend and backend code in an open document format that we can transform. We might, for instance, roll up our work into a notebook that uses declarative widgets to show meetups and candidates in real time, to provide a one-click way to contact a Meetup user, and to track how many candidates RSVP after we reach out. We might then lay out this notebook as a dashboard and deploy it as a web frontend to be used by our evangelists.
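One small piece of that dashboard, tracking how many candidates RSVP after we reach out, can be sketched independently of any widget machinery. The class name and behavior below are illustrative; in the deployed dashboard this bookkeeping would live behind the notebook-turned-web-app rather than in local memory.

```python
class OutreachTracker:
    """Track which candidates we contacted and who later RSVP'd."""

    def __init__(self):
        self.contacted = set()
        self.converted = set()

    def record_contact(self, member_id):
        self.contacted.add(member_id)

    def record_rsvp(self, member_id):
        # Only count RSVPs from candidates we actually reached out to.
        if member_id in self.contacted:
            self.converted.add(member_id)

    def conversion_rate(self):
        if not self.contacted:
            return 0.0
        return len(self.converted) / len(self.contacted)

tracker = OutreachTracker()
for member in ("m1", "m2", "m3"):
    tracker.record_contact(member)
tracker.record_rsvp("m2")
tracker.record_rsvp("m9")  # never contacted, so not counted
print(tracker.conversion_rate())  # one of three contacted candidates converted
```

Keeping the metric logic this small makes it easy to surface in a widget and to evolve as the evangelists tell us what "success" should actually mean.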


Discussing new insights

After deploying the app, we start to collect both objective and subjective feedback from our evangelist users. Together, we think of improvements not only to our first-cut dashboard UI, but also to how we identify and rank candidates. These new ideas are ripe for experimentation in notebooks and redeployment as improved dashboards. We can continue to iterate in this fashion until our data product is just good enough for our evangelists, warrants implementation as a production-level Spark data processing pipeline and web application, deserves no further investment, and so on.

And more than likely along the way, we'll discover new ways to bring insights to our team. For instance, we might bridge our work in notebooks to Slack to make information about meetups readily available in our ongoing team conversation.
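The Slack bridge can start as a few lines in a notebook. The sketch below builds a payload for a Slack incoming webhook, which accepts a JSON body with a `text` field; the webhook URL is a placeholder and the message format is our own invention, not anything prescribed by the scenario.

```python
import json
from urllib import request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_meetup_alert(event_name, event_url, yes_count):
    """Shape a Slack incoming-webhook payload announcing an active meetup."""
    return {"text": "{} now has {} RSVPs: {}".format(
        event_name, yes_count, event_url)}

def post_to_slack(payload, url=SLACK_WEBHOOK_URL):
    """POST the payload to a Slack incoming webhook (makes a network call)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

alert = build_meetup_alert("Spark Meetup", "https://www.meetup.com/...", 42)
print(alert["text"])
# post_to_slack(alert)  # uncomment with a real webhook URL configured
```

Wiring this into the Spark stream means the team hears about promising meetups in the same channel where the conversation is already happening.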


Bottom line

Notebooks help data science teams realize the value of Spark throughout the evolution of their data products. Together, they form a powerful platform for performing analytics at scale and developing data products with speed.

