Apache Spark + Watson + Twitter

Unlock the power of Spark with IBM Watson and Twitter.

How’s your relationship with your customers? How do they feel about you, your products, or your company?

Answering these questions usually requires building and running complex analytics over a large set of data. This takes time, infrastructure, and the right skills, none of which come cheap.

One solution is to leverage Apache Spark™, the open-source, in-memory computing framework for distributed data processing. One of the nicest things about Spark is that it features a simple programming model that hides the complexity inherent to distributed computing. As an added bonus, the APIs come in multiple flavors: Scala, Java, Python, and R.

IBM Analytics for Apache Spark lets you combine the power of Spark with the rich set of data-centric services available on Bluemix in new and innovative ways. For example, you can leverage the growing set of Watson Cognitive Services to enrich your data with new insights and build more powerful analytics.

To show what’s possible, I created a simple open-source application that uses Spark Streaming to create a feed of live tweets and enrich the data with emotion/tone scores from the Watson Tone Analyzer service. (You can find my project on GitHub.)
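To make the enrichment step concrete, here is a minimal plain-Python sketch of the idea, not the code from the project. It assumes each tweet is a dict with a `text` field, and it uses a hypothetical stand-in function, `fake_tone_scores`, in place of a real call to the Watson Tone Analyzer service, so the tone names and scores are made up for illustration.

```python
from typing import Callable, Dict, Iterable, List


def fake_tone_scores(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for the Watson Tone Analyzer call.

    The real service is a remote API; this stub returns made-up
    percentage scores (0-100) so the sketch runs offline.
    """
    lowered = text.lower()
    return {
        "anger": 80.0 if "hate" in lowered else 10.0,
        "joy": 90.0 if "love" in lowered else 20.0,
    }


def enrich(
    tweets: Iterable[dict],
    scorer: Callable[[str], Dict[str, float]],
) -> List[dict]:
    """Return a copy of each tweet with a 'scores' field added."""
    return [{**tweet, "scores": scorer(tweet["text"])} for tweet in tweets]


# A tiny, invented batch standing in for one slice of the live feed.
batch = [
    {"author": "a", "text": "I love #spark"},
    {"author": "b", "text": "I hate waiting for builds"},
]

enriched = enrich(batch, fake_tone_scores)
for tweet in enriched:
    print(tweet["text"], tweet["scores"])
```

Passing the scorer in as a parameter keeps the enrichment logic independent of the service client, which makes it easy to test without network access.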

The following diagram provides a high-level architecture of the application:

Twitter + Watson high-level architecture

The results were really impressive. Once the tweets were collected and enriched with Watson Tone Analyzer, I could build a set of fully functional analyses using the Jupyter notebooks that come with the Spark service:

Compute the distribution of tweets by sentiment scores greater than a specified threshold:

Distribution of tweets by sentiment scores greater than 60% 
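As a rough illustration of this first analysis, the sketch below counts, for each tone, how many tweets score above a threshold. It is plain Python rather than Spark, and it assumes a hypothetical record layout in which each enriched tweet carries a `scores` dict of tone name to percentage; the sample data is invented.

```python
from collections import Counter


def tone_distribution(tweets, threshold=60.0):
    """Count tweets per tone whose score is strictly above the threshold."""
    counts = Counter()
    for tweet in tweets:
        for tone, score in tweet["scores"].items():
            if score > threshold:
                counts[tone] += 1
    return counts


# Invented sample records; the field names are assumptions.
sample = [
    {"text": "first", "scores": {"anger": 72.0, "joy": 5.0}},
    {"text": "second", "scores": {"anger": 10.0, "joy": 88.0}},
    {"text": "third", "scores": {"anger": 65.0, "joy": 61.0}},
]

distribution = tone_distribution(sample, threshold=60.0)
for tone, count in sorted(distribution.items()):
    print(tone, count)
```

A tweet can count toward more than one tone, as the third sample record does, so the totals may add up to more than the number of tweets.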

Compute the top 10 hashtags contained in the tweet set:

top 10 hashtags pie chart 
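The hashtag ranking can be sketched in a few lines of plain Python. This is only an illustration of the counting logic, assuming hashtags are matched by a simple `#word` pattern and compared case-insensitively; the sample texts are invented, and on a real data set the same count would run as a distributed Spark job.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(texts, n=10):
    """Return the n most frequent hashtags as (tag, count) pairs."""
    counts = Counter(
        tag.lower() for text in texts for tag in HASHTAG.findall(text)
    )
    return counts.most_common(n)


# Invented sample tweets.
texts = [
    "Loving #Spark and #Watson",
    "#spark streaming is fun",
    "#bluemix + #Spark",
]

ranking = top_hashtags(texts, n=10)
for tag, count in ranking:
    print(tag, count)
```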

Visualize a breakdown of the top 5 hashtags by sentiment scores:

breakdown by tone scores 
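One way to read this breakdown is as the average score per tone for each of the most frequent hashtags. The plain-Python sketch below follows that reading, which is an assumption about how the chart was computed. It counts each hashtag once per tweet and reuses the hypothetical `text` and `scores` fields; the sample data is invented.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")


def tone_breakdown(tweets, top_n=5):
    """Mean score per tone for each of the top_n most frequent hashtags."""
    # Distinct lower-cased hashtags per tweet, so each tweet counts once per tag.
    tags_per_tweet = [
        {tag.lower() for tag in HASHTAG.findall(tweet["text"])}
        for tweet in tweets
    ]
    counts = Counter(tag for tags in tags_per_tweet for tag in tags)
    top = [tag for tag, _ in counts.most_common(top_n)]

    # Sum the scores of every tweet that mentions a top hashtag.
    sums = {tag: defaultdict(float) for tag in top}
    for tweet, tags in zip(tweets, tags_per_tweet):
        for tag in tags:
            if tag in sums:
                for tone, score in tweet["scores"].items():
                    sums[tag][tone] += score

    # Divide by the number of tweets per hashtag to get the mean.
    return {
        tag: {tone: total / counts[tag] for tone, total in sums[tag].items()}
        for tag in top
    }


# Invented sample records.
tweets = [
    {"text": "#spark is great", "scores": {"anger": 10.0, "joy": 90.0}},
    {"text": "#spark job failed again", "scores": {"anger": 70.0, "joy": 10.0}},
    {"text": "trying #watson", "scores": {"anger": 20.0, "joy": 60.0}},
]

breakdown = tone_breakdown(tweets, top_n=5)
for tag, tones in breakdown.items():
    print(tag, tones)
```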

These are just three simple examples that show how easy it is to get value from your data using Spark. The possibilities are endless, so I encourage you to download the code from GitHub, follow the tutorial, and start building your own data solutions. Here are a few starter ideas:

  1. Figure out where people are angriest. Use the tweets’ geolocation (available in the data model) to build maps that correlate location with emotion.
  2. Use another Watson service, Personality Insights, to identify psychological traits.
  3. Enhance the resiliency and scalability of the architecture with the Message Hub service, which provides Kafka-based event queues.
  4. Connect to the IBM Insights for Twitter service to work with past tweets.

Spark is the start of something big, growing in popularity every day. It’s a game-changing technology for big data, and IBM is investing heavily in it, along with other open-source solutions and a host of resources to help developers get, build, and analyze data in the cloud.

Whether you’re a data scientist, a developer, or a business analyst, Spark enables you to do things that weren’t possible before, at web scale. You can work with huge amounts of data and make sense of it with easy-to-integrate analytics and visualizations. Start with my tutorial and let me know what you create! I’ll be presenting at DataPalooza in San Francisco on Nov 10-12; if you happen to be there, please come find me to discuss your project or just to say hello.

This post was originally published on KDnuggets. – Ed.

