
Rocking Data Science for Sustainable Innovation: A Datapalooza Dispatch

Datapalooza San Francisco was a great success. The event (the first of an ongoing, international series) fulfills an important role: educating the next generation of data scientists on how to apply their minds, creativity, and tools to the creation of innovative data products.

Datapalooza was also a lot of fun, starting with Jeff Jonas’ hilarious (and, of course, thought-provoking) keynote on the morning of the first day; progressing through the offsite receptions and a concert on the evening of the second day; and culminating in the still-vibrant buzz of data scientists pooling their efforts to build innovative apps by the end of the third day.

Here are the high-level takeaways from the first Datapalooza:

• SPONSORSHIP: The Spark Technology Center (STC) was the sponsor of Datapalooza San Francisco. Orange-shirted mentors were available on-site during all 3 days at Galvanize to help participants in their data science exploration, ideation, and development activities. And if any participants wished to check out the STC itself, it’s just a few blocks away at 425 Market Street, in IBM’s office on the 20th floor.

• VENUE: Datapalooza took place at the downtown San Francisco campus of Galvanize, a university that’s training the next generation of data scientists. Hundreds of data science professionals participated in the 3-day community event. Datapalooza will come to cities throughout the world in 2016 (specific dates, sites, and sponsors are TBD): Tokyo, Berlin, Hong Kong, Tel Aviv, London, Sao Paulo, Sydney, Moscow, New York, Seattle, Denver, and Austin.

CEO & Founder Brandon Schatz at the Design Thinking station

• CURRICULUM: Datapalooza San Francisco featured over 30 different sessions. These included “101”-level courses in Apache Spark and Scala, as well as sessions that featured data products, including apps for Apple Watch, Facebook, Watson, Spark, and many others. The Datapalooza curriculum covered the principal subject matter that data scientists need to master to be effective at building innovative data products. Courses were delivered by STC-based data scientists and by data science professionals from Galvanize, Silicon Valley Data Science, Typesafe, Nitro, Cake Solutions, Facebook, zData, and other organizations. Rolled up by the principal categories, the courses were as follows:

o Data Engineering: Quick Intro to Scala, Spark Search, Scraping Reddit with Akka Streams, Building a Word2Vec Model with Twitter Data, Large Scale Topic Modeling with Scala and Spark, InfoSphere Streams Audio Data, NLP Enriched Product Review, Data pipelines with Watson, Part 1: Functional Programming for Data Pipelines and Machine Learning, Part 2: Functional Programming for Data Pipelines and Machine Learning
o Data Science: Spark 101, General Question and Answering System, Making Big Data Small: Visualizing Petabytes of Data with Spark and D3.js, RedRock, Better Search at Scale: Leveraging Spark for Contextual NLP, Customer Relationship Prediction with Machine Learning Ensembles, Real Time Vehicle Telematics, Deploying Machine Learning Models in Production, Search by Selfie: a Spark Facial Recognition Algorithm, Network Experimentation
o Data App Development: Caltrain Rider: A Complete Data Product, Who’s in the News: Build a Quality web app with minimal code, Spark Streaming application with Twitter and Watson, Constructing a Fast Data Application with muvr

Data Scientist Jorge Castañón shows how to build a Word2Vec model with Twitter data

• DIVERSITY: Datapalooza had a diverse group of attendees and presenters. Around 40 percent of Datapalooza attendees were women from science, technology, engineering, and mathematics (STEM) backgrounds. In addition, under the Women in Data Science and Engineering program, IBM announced at Datapalooza that it has committed to provide support to female participants in Galvanize University programs. These are full scholarships covering everything from tuition assistance to mentorship, internships, and employment opportunities. As part of the IBM-Galvanize partnership, $150,000 is being made available in tuition assistance for female data scientists and data engineers. In conjunction with Galvanize, IBM announced the award of full scholarships to the following individuals: Pooja Ramesh (Denver campus), Yihua Leng (Seattle campus), Emily Spahn (Seattle campus), Samaneh Sadighi (San Francisco campus), Prathishta Rebala (San Francisco campus), and Susie Sun (San Francisco campus). To be considered for future scholarships, individuals should submit their applications here.

Big Data at the after party

• PROJECTS: At Datapalooza, IBM’s Leon Katsnelson discussed Big Data University’s support for the United Nations Sustainable Development Goals, with specific emphasis on using data science to promote the education of girls and address other global concerns. In addition, DataKind and IBM announced a partnership with MicroCred Group, an organization that provides financial services to entrepreneurs and individuals who can’t access them through traditional means, so that they in turn can strengthen the economic development of their countries. The partnership was announced at Datapalooza by Julia Rhodes Davis, DataKind’s managing director, capital and growth. Under the partnership, IBM is sponsoring a long-term DataCorps project with MicroCred that will use predictive modeling to generate customer scores more efficiently. This will enable financial services organizations to target their services more inclusively to meet the needs of individuals, small businesses, and midmarket firms around the world. Specifically, DataKind and IBM will work with MicroCred to develop an in-house credit data product that defines the profile of a good repayer more accurately and reduces the cost of loan evaluation. The product is intended to enable MicroCred to lower minimum loan amounts, thereby lowering barriers to entry in the financial sector. Further details on the program are in this DataKind blog.

Get involved. Click here to become part of the STC community and contribute projects, design, and code to Apache Spark.

Also, Datapalooza may soon be coming to a city near you. Stay tuned here for updates. We hope to engage the world’s brightest data scientists wherever and whenever makes sense for you.

