
Apache Spark + Watson + Twitter

Unlock the power of Spark with IBM Watson and Twitter.

How’s your relationship with your customers? What do they feel about you, your products, or your company?

Answering these questions usually requires building and running complex analytics over a large set of data. This takes time, infrastructure, and the right skills, none of which come cheap.

One solution is to leverage Apache Spark™, the open-source, in-memory computing framework for distributed data processing. One of the nicest things about Spark is that it features a simple programming model that hides the complexity inherent to distributed computing. As an added bonus, the APIs come in multiple flavors: Scala, Java, Python, and R.

IBM Analytics for Apache Spark lets you combine the power of Spark with the rich set of data-centric services available on Bluemix in new and innovative ways. For example, you can leverage the growing set of Watson Cognitive Services to enrich your data with new insights and build more powerful analytics.

To show what’s possible, I created a simple open-source application that uses Spark Streaming to create a feed of live tweets and enrich the data with emotion/tone scores from the Watson Tone Analyzer service. (You can find the project on GitHub.)
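The enrichment step can be sketched in plain Python, outside of Spark. In the real application, each tweet flowing through the Spark Streaming DStream is sent to the Watson Tone Analyzer REST API; here `score_tone` is a hypothetical stand-in for that call, and the tweet fields are illustrative rather than the app's exact data model:

```python
def score_tone(text):
    """Stand-in for a Watson Tone Analyzer call.

    A real implementation would POST `text` to the Tone Analyzer
    endpoint and parse the returned emotion scores (0-100).
    """
    return {"anger": 10.0, "joy": 75.0, "sadness": 5.0}

def enrich(tweet):
    """Attach tone scores to a tweet dict, mirroring the enrichment step."""
    enriched = dict(tweet)
    enriched.update(score_tone(tweet["text"]))
    return enriched

tweet = {"author": "@someone", "text": "Loving the new release!"}
print(enrich(tweet))
```

In the Spark version, `enrich` would simply be mapped over each micro-batch of the tweet stream.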

The following diagram provides a high-level architecture of the application:

[Diagram: Twitter + Watson high-level architecture]

The results were really impressive. Once the tweets were collected and enriched with Watson Tone Analyzer scores, I could build a set of fully functional analyses using the Jupyter notebooks that come with the Spark service:

Compute the distribution of tweets by sentiment scores greater than a specified threshold:

[Chart: Distribution of tweets by sentiment scores greater than 60%]
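The logic behind this first chart can be sketched as follows. The sample tweets and score fields are illustrative, not the app's exact schema; for each emotion, we count the tweets whose score clears the threshold:

```python
# Illustrative tweets already enriched with tone scores (0-100).
tweets = [
    {"text": "t1", "anger": 72.0, "joy": 15.0},
    {"text": "t2", "anger": 40.0, "joy": 88.0},
    {"text": "t3", "anger": 65.0, "joy": 61.0},
]

def distribution(tweets, emotions, threshold):
    """Count the tweets per emotion with a score above `threshold`."""
    return {e: sum(1 for t in tweets if t[e] > threshold) for e in emotions}

print(distribution(tweets, ["anger", "joy"], 60.0))  # {'anger': 2, 'joy': 2}
```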

Compute the top 10 hashtags contained in the tweets set:

[Chart: Top 10 hashtags (pie chart)]
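The hashtag count reduces to extracting `#`-prefixed tokens and keeping the most frequent ones; a minimal sketch with sample data (the real app runs the equivalent aggregation over the Spark RDD):

```python
import re
from collections import Counter

texts = [
    "#spark is great #bigdata",
    "learning #spark with #watson",
    "#bigdata everywhere",
]

def top_hashtags(texts, n=10):
    """Return the n most common hashtags across all texts."""
    counts = Counter(tag.lower() for t in texts for tag in re.findall(r"#\w+", t))
    return counts.most_common(n)

print(top_hashtags(texts, 2))  # [('#spark', 2), ('#bigdata', 2)]
```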

Visualize a breakdown of the top 5 hashtags by sentiment scores:

[Chart: Breakdown of the top 5 hashtags by tone scores]
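For the breakdown, each hashtag's tone profile is the average of the tone scores over the tweets that mention it. A hedged sketch with illustrative field names:

```python
def tone_by_hashtag(tweets, hashtag, emotions):
    """Average the given emotions over tweets whose text contains `hashtag`."""
    matched = [t for t in tweets if hashtag in t["text"]]
    return {e: sum(t[e] for t in matched) / len(matched) for e in emotions}

tweets = [
    {"text": "#spark rocks", "anger": 10.0, "joy": 80.0},
    {"text": "#spark again", "anger": 30.0, "joy": 60.0},
]

print(tone_by_hashtag(tweets, "#spark", ["anger", "joy"]))
# {'anger': 20.0, 'joy': 70.0}
```

Running this for each of the top 5 hashtags yields the grouped bars in the chart above.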

These are just three simple examples that show how easy it is to get value from your data using Spark. The possibilities are endless. So, I encourage you to download the code on GitHub, follow the tutorial, and start building your own data solutions. Here are a few starter ideas:

  1. Figure out where people are feeling the most angry. Use the tweets’ geo location (available in the data model) to build maps that correlate to emotions.
  2. Use another Watson service, Personality Insights, to identify psychological traits.
  3. Enhance the resiliency and scalability of the architecture with the MessageHub Service, which provides Kafka-based events queues.
  4. Connect to the IBM Insights for Twitter service to work with past tweets.
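The first idea above boils down to grouping tone scores by location. A minimal sketch, assuming tweets carry a `geo` field as in the app's data model (the region codes and scores here are made up):

```python
from collections import defaultdict

tweets = [
    {"geo": "US", "anger": 70.0},
    {"geo": "US", "anger": 50.0},
    {"geo": "FR", "anger": 20.0},
]

def anger_by_region(tweets):
    """Average the anger score per region, ready to feed a map visualization."""
    totals, counts = defaultdict(float), defaultdict(int)
    for t in tweets:
        totals[t["geo"]] += t["anger"]
        counts[t["geo"]] += 1
    return {g: totals[g] / counts[g] for g in totals}

print(anger_by_region(tweets))  # {'US': 60.0, 'FR': 20.0}
```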

Spark is the start of something big, growing in popularity every day. It’s a game-changing technology for big data, and IBM is investing heavily in it, along with other open-source solutions and a host of resources to help developers get, build, and analyze data in the cloud.

Whether you’re a data scientist, a developer, or a business analyst, Spark enables you to do things that weren’t possible before, at web scale. You can work with huge amounts of data and make sense of it with easy-to-integrate analytics and visualizations. Start with my tutorial and let me know what you create! I’ll be presenting at DataPalooza in San Francisco on Nov 10-12; if you happen to be there please come find me to discuss your project or just to say hello.

This post was originally published on KDnuggets. – Ed.
