RedRock is Open Source


Putting big data analysis in your hands

RedRock on GitHub

We are excited to announce that the RedRock backend is now open source! That’s cool, but what is RedRock, you ask?

RedRock is an example application that demonstrates the power of Spark integrated with ElasticSearch for processing Twitter data.

RedRock puts the power of big data analysis in the hands of everyday users. They only need to provide a word or hashtag, and RedRock will analyze billions of historical tweets and return the results. Users can search for anything they want and can tune the search results by including and excluding terms.

We have paired the RedRock backend with an iPad front end that communicates with it via REST calls. Because the two sides talk only over REST, the backend can be hooked up to any front end you like. To get you excited about the endless possibilities that we see here, we are providing some examples of how we chose to display the data returned from RedRock.

In this scenario, the user searched for "ibm":


On the left is a list of the top tweets for the search, along with some statistics about the data you're working with. This list of tweets can be dismissed and shown at any time by tapping the “Feed” button. On the right, we have five different visualizations, each providing more information about your search. You can switch between the visualizations by tapping the icons at the bottom, or by swiping left and right on the visualization.

Let’s go through some of the key features of the feed and the visualizations by describing what you're seeing in the screenshots...

  • Twitter Feed: Top 100 tweets related to your search sorted by most influential user (followers count)

  • Found Tweets: Total number of tweets that matched your search
  • Found Users: Total number of unique users within tweets that matched your search

  • Total Tweets: Total number of tweets in the database

  • Cluster: Shows relationships between the top 20 most closely related terms to your search. Each colored group represents a different topic, and the size of each circle represents the volume of occurrences. 
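As a rough illustration of how the feed statistics in the list above could be derived from a set of matching tweets, here is a minimal Python sketch. The field names (`user`, `followers`, `text`) and the function `feed_summary` are hypothetical and are not taken from the RedRock code base.

```python
def feed_summary(matching_tweets, total_tweets, top_n=100):
    """Compute the feed panel's statistics from tweets that matched a search."""
    # Most influential first: sort by the author's follower count.
    top = sorted(matching_tweets, key=lambda t: t["followers"], reverse=True)[:top_n]
    return {
        "feed": top,                                  # Twitter Feed
        "found_tweets": len(matching_tweets),         # Found Tweets
        "found_users": len({t["user"] for t in matching_tweets}),  # Found Users
        "total_tweets": total_tweets,                 # Total Tweets
    }


# Made-up sample data, for illustration only.
sample = [
    {"user": "alice", "followers": 120, "text": "ibm watson demo"},
    {"user": "bob", "followers": 9000, "text": "ibm and spark"},
    {"user": "alice", "followers": 120, "text": "more ibm news"},
]
summary = feed_summary(sample, total_tweets=1000, top_n=2)
```

Here two of the three matching tweets come from the same user, so the found-users count is lower than the found-tweets count.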


Network Graph

Shows the top 20 most closely related terms to the search term. The size of the circle represents its frequency, and the length of the line represents how closely it relates to the search term.


Sentiment Bar Chart

Shows the positive and negative sentiment related to the search term over time.


Profession Treemap

Shows the careers of users tweeting about the given topic.


World Map

Shows the volume of tweets from each country over time. 


When we started this project, we set it up as if we were using Spark from a notebook: every time a user searched for a term, a Spark job would run. This worked fine as long as there was only one user, but it would not scale the way we wanted. Instead, we decided to use Spark Streaming to process and annotate the tweets as they came in. This meant that Spark would extract the information that we needed from each tweet, do some analysis, and then save the annotated tweet in ElasticSearch. ElasticSearch is very good at text searches, so that is just what we used it for. By combining these two technologies, we were able to get better analysis and faster response times than if we had used either technology individually.
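To make the "annotate on ingest, search later" idea concrete, here is a minimal sketch in plain Python. It does not use Spark or ElasticSearch: the `annotate` function stands in for the streaming step, and the hypothetical `TinyIndex` class stands in for the search store. The word lists, field names, and scoring rule are all assumptions made for illustration and do not reflect RedRock's actual analysis.

```python
from collections import defaultdict

# Hypothetical word lists; a toy stand-in for real sentiment analysis.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "hate", "slow"}


def annotate(tweet):
    """Ingest-time step: derive the fields that later searches will need."""
    tokens = [w.strip(".,!?").lower() for w in tweet["text"].split()]
    tokens = [w for w in tokens if w]
    score = sum(w in POSITIVE for w in tokens) - sum(w in NEGATIVE for w in tokens)
    return {**tweet, "tokens": tokens, "sentiment": score}


class TinyIndex:
    """In-memory inverted index standing in for the search store."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)  # term -> ids of docs containing it

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for term in doc["tokens"]:
            self.postings[term].add(doc_id)

    def search(self, include, exclude=()):
        """Return docs containing every include term and no exclude term."""
        ids = set(range(len(self.docs)))
        for term in include:
            ids &= self.postings.get(term.lower(), set())
        for term in exclude:
            ids -= self.postings.get(term.lower(), set())
        return [self.docs[i] for i in sorted(ids)]


index = TinyIndex()
incoming = [
    {"user": "alice", "text": "I love Spark, great tool!"},
    {"user": "bob", "text": "Spark job is slow today"},
    {"user": "carol", "text": "Nothing to see here"},
]
for tweet in incoming:  # in a streaming setup this would run as tweets arrive
    index.add(annotate(tweet))

# Query time only touches the index; no per-search analysis job is needed.
hits = index.search(include=["spark"], exclude=["slow"])
```

The point of the design is that the expensive work happens once per tweet at ingest, so each user search reduces to a cheap lookup over already-annotated documents.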

RedRock was initially put together in a very short amount of time for a demo. Using Spark allowed us to get the project up and running quickly by making it easy to set up, develop algorithms, and scale. We know it isn’t perfect, but we figured we would share it with you all in hopes that it will help you as you use Spark to tackle any big data blocking your plans of world domination.

Enjoy RedRock. We look forward to your feedback.

