
Changing the World with Watson Health: An Interview with Ethan Xu

Saving the world is easier said than done.

In my professional opinion, the process of making the world a better place must be broken down into smaller sub-projects. Each project can identify a unique problem and work towards a solution using creativity and expertise. That is exactly what IBM Watson Health data scientists and engineers (part of the Explorys acquisition) are working on right now. Specifically, the team is using SystemML to train a model that predicts emergency room visits. Given the impact this could have, I sat down with one of the team members, Ethan Xu, to learn more.

(Don't know what I'm doing? Check this out.)

What brought you to IBM and what are your interests? Have you always been in this field? If not, what brought the change?

I learned about IBM Explorys from a seminar presentation given by our senior director Jason Gilder at Case Western Reserve University. In the presentation he talked about challenges in the healthcare industry and how the team utilized Big Data to derive insights. That captured my attention immediately, because at that time I was working as a Research Associate focusing on dimension reduction methodologies and modeling of healthcare data. As a Statistician I knew that many insights can be learned with the right techniques from even small data, so I was very excited about the endless possibilities from the rich EHR data from IBM Explorys.

My PhD was in Statistics and Probability, so I have been in the field of Statistics/Machine Learning, but my training and previous work were more theoretically focused. While that was very interesting, I like challenges that arise from real-world scenarios, which motivated the move from a pure academic research role to a data scientist role. I also hold an adjunct faculty position at CWRU, so there’s still a tie to the academic world.

What do you like most about the work that you do?

What I like the most is working in a team of like-minded people to tackle problems using massive (and messy) healthcare data. There are a lot of unknown territories to explore and it’s fun.

You are currently working on a project that could save lives. What has been your process? In broad terms, what implications might this project have for the real world?

We are trying to build a model to predict 30-day emergency room (ER) visits and investigate important risk factors. According to a CDC report, there were 445 ER visits per 1000 people in the US, and a large portion of them were non-emergencies, which led to billions in wasted healthcare costs. Identifying the at-risk population helps us better understand the pattern, reduce healthcare costs, and make better use of healthcare resources.

How are you using SystemML? Why do you think SystemML is important?

I’ve been using SystemML to train a model to predict the probability of emergency room visits. What I like about SystemML is that it separates the development of machine learning algorithms from their implementation on different platforms, including distributed frameworks like Hadoop MR and Spark. This allows data analysts to focus on algorithms without worrying too much about the platform, and ultimately accelerates the growth of the Machine Learning community. It also helps that DML can be written in R-like or Python-like syntax.
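For readers wondering what "predicting a probability" looks like in practice, here is a minimal sketch in plain Python. It is not SystemML or DML, and it is not the team's model: the interview does not say which algorithm was used, so logistic regression is assumed here purely for illustration. The two features (prior ER visits and number of chronic conditions) and the six toy records are invented for the example.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Predicted probability of the positive class for one record."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression with batch gradient descent."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y  # gradient of log-loss w.r.t. score
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical toy data: [prior ER visits, chronic conditions] -> ER visit within 30 days
rows = [[0, 0], [0, 1], [1, 0], [3, 2], [4, 1], [5, 3]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic(rows, labels)
p_low = predict_proba(weights, bias, [0, 0])
p_high = predict_proba(weights, bias, [5, 3])
print(f"low-risk record:  {p_low:.3f}")
print(f"high-risk record: {p_high:.3f}")
```

The output is a probability per patient record rather than a yes/no label, which is what makes it possible to rank an at-risk population.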

How are you using Spark? Why do you think Spark is important?

I’m new to Spark. I tested the performance of the same SystemML scripts on the same data with Hadoop MR and Spark, and the improvement in speed was quite significant (from 8.8 hrs on 249 nodes to 1.3 hrs on 6 nodes). I have some experience writing Hadoop MR jobs in Java, which seems very cumbersome compared to the Spark API (Scala and PySpark), and Spark’s in-memory caching makes certain operations much faster, so I’m looking forward to writing Spark jobs for data manipulation.
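To put those figures in perspective, here is a small back-of-the-envelope calculation in Python. It uses only the numbers quoted above. Treating node-hours as a rough cost measure assumes the nodes in both clusters were comparable, which the interview does not state.

```python
# Figures quoted in the interview
mr_hours, mr_nodes = 8.8, 249        # Hadoop MR run
spark_hours, spark_nodes = 1.3, 6    # Spark run

# How much sooner the job finished
wall_clock_speedup = mr_hours / spark_hours

# Total compute consumed, assuming comparable nodes (an assumption)
mr_node_hours = mr_hours * mr_nodes
spark_node_hours = spark_hours * spark_nodes
node_hour_ratio = mr_node_hours / spark_node_hours

print(f"Wall-clock speedup: {wall_clock_speedup:.1f}x")
print(f"Node-hours: {mr_node_hours:.1f} (MR) vs {spark_node_hours:.1f} (Spark)")
print(f"Node-hour ratio: {node_hour_ratio:.0f}x")
```

The job finished nearly seven times sooner on Spark, and because it also ran on far fewer nodes, the reduction in total node-hours is much larger than the wall-clock figure alone suggests.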

There you have it. Someone who has merged academia, healthcare and data to help make an impact.

by Madison J. Myers
