SETI Data Publicly Available on Spark

Last year, the IBM jStart group began a collaboration with the SETI Institute of Mountain View, using IBM’s data storage services and IBM’s Spark service to analyze the many terabytes of radio-telescope data the Institute has acquired over the past few years.

Today, we are proud to announce the launch of the public-facing component of this collaboration, which we call SETI@IBMCloud.

Our team has constructed a system that delivers the raw SETI data to the general public, along with the basic signal processing tools to consume that data and connect it to machine-learning tools.

SETI Institute satellite dishes. Image credit: SETI Institute

SETI, IBM, NASA collaboration

The jStart team’s mission is to exercise emerging IBM technologies with real data and analytics problems in order to find their strengths and weaknesses. The team chose SETI because of the size and complexity of its data set, which mirror the data challenges many businesses face today.

Computational power

The SETI Institute believes that by using IBM’s Cloud Data Services they can improve their analysis through increased computation, faster analysis algorithm development, and better data management. It has been technically difficult for the relatively small SETI team to handle increases in data production and to perform multiple analysis iterations.

Additionally, moving data to IBM Object Storage and Spark allows them to use big-data machine-learning tools, which are quickly being adopted throughout the scientific community, complementing many of the standard data analysis methods.

The SETI Institute, with collaborators from NASA, IBM Research, IBM Cloud Data Services, and Swinburne University, has thus far moved over 16 TB of recently taken data onto IBM Object Storage and IBM dashDB and has analyzed that data using the IBM Spark framework. This collaboration has been so successful that the SETI Institute and IBM are now working together on new observations of planetary systems and new acquisition schemes (which produce significantly more data).

You can follow what’s happening each day at the Allen Telescope Array, where SETI makes its observations, by checking out the Twitter feed of Jon Richards (@jrseti), a leading engineer at SETI.

Join us!

Despite the amount of data collected thus far, because the SETI Institute has a finite number of scientists, most of their data have not yet been analyzed in novel ways. They are looking for YOU to help develop those new ways to search for signs of extraterrestrial intelligent life.

Are there patterns in SETI’s data that have been overlooked thus far? We’ll show you how to read the data and extract features. Can your signal processing algorithm detect signals that SETI cannot? If so, your code could end up in SETI’s data acquisition system. Or you might even be the first to discover intelligent alien life!

To get started, you will need an IBM Bluemix or Data Science Experience account. An IBM Bluemix account comes with a free 30-day trial to use most services, including the Spark and Object Storage services that will facilitate your work with SETI data.

You will learn, hands-on, how to use IBM’s data and analytics platform tools while simultaneously analyzing interesting scientific data from one of the world’s most recognizable scientific institutes.


The technical landing page for this project is on GitHub, in the ibm-cds-labs/seti_at_ibm repo.

Through example Jupyter notebooks and other documentation, it will guide you through the handful of steps you need to get the raw SETI data and to start your analysis, extract features, and share results. You do not need to be an expert in signal processing, astronomy, or astrophysics to get involved!
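To give a flavor of what "extract features" can mean, here is a minimal sketch in Python with NumPy. It is not code from the project, and it does not read the real SETI data. Instead it assumes a complex-valued time series and fabricates one (a tone that drifts slowly in frequency, buried in noise) purely for illustration. The function names, frame length, and signal parameters are all hypothetical. The sketch builds a spectrogram and then pulls out one simple feature, the drift of the strongest frequency bin from frame to frame.

```python
import numpy as np


def spectrogram(samples, frame_len):
    """Split a complex time series into frames and return power per frequency bin."""
    n_frames = len(samples) // frame_len
    frames = np.reshape(samples[: n_frames * frame_len], (n_frames, frame_len))
    # FFT each frame, then shift so that zero frequency sits in the middle column.
    spectra = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    return np.abs(spectra) ** 2


def peak_bins(power):
    """Feature: index of the strongest frequency bin in each frame."""
    return np.argmax(power, axis=1)


def make_synthetic_signal(n_frames=32, frame_len=256, start_bin=20.0, drift=1.0, seed=0):
    """Hypothetical stand-in for real data: a drifting tone in Gaussian noise."""
    rng = np.random.default_rng(seed)
    n = n_frames * frame_len
    t = np.arange(n)
    # Instantaneous frequency (cycles per sample) rises by `drift` bins per frame.
    freq = (start_bin + drift * t / frame_len) / frame_len
    phase = 2 * np.pi * np.cumsum(freq)
    noise = rng.normal(size=n) + 1j * rng.normal(size=n)
    return 2.0 * np.exp(1j * phase) + noise


signal = make_synthetic_signal()
power = spectrogram(signal, frame_len=256)
# Undo the fftshift offset so that bin 0 corresponds to zero frequency.
bins = peak_bins(power) - 256 // 2
# Fit a straight line to the peak positions; the slope is the drift in bins per frame.
drift_per_frame = np.polyfit(np.arange(len(bins)), bins, 1)[0]
print(power.shape, round(float(drift_per_frame), 2))
```

With real observations you would replace `make_synthetic_signal` with whatever loading step the example notebooks describe. The rest is just one of many possible starting points for feature extraction.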

Look forward to future blog posts from me and others about technical details of this project, new data analysis, interesting results from citizen scientists, and new data and updates from the SETI Institute.

Example Python notebook from the /ibm-cds-labs/seti_at_ibm/ GitHub repo

Final thoughts

Personally, I hope this project will create an active, long-lasting open-science collaboration between citizen scientists from around the world and scientists from this collaboration (the SETI Institute, NASA, IBM, and Swinburne University of Technology), from Stanford University, and from other institutions.

I hope you are as excited about this project as I am! I can’t wait to see what we build together.

A number of individuals at SETI, NASA and IBM directly helped or supported my work to make this happen. I’d like to thank Graham Mackintosh, Jill Tarter, Bill Diamond, Jon Richards, Jeff Scargle, Gerry Harp, Chris Henze, Francois Luus, Niru Anisetti, Ted Morris, Sven Hafeneger, Steve Moore, Randy Horman, Mark Watson, Brad Noble, Derek Schoettle and Rob Thomas.


