This is part 2 of a practical guide to scoring health data with Apache Spark™. We'll post the final part of the guide in the coming weeks. (See part 1 here.)
As a reminder, these posts take inspiration from an R tutorial here.
The guide is divided into four parts:
- Objective description of the study and data
- Data preparation and initial analysis
- Construction and validation of the score
- Interpretation of results
For the code I reference, visit the full GitHub repository here.
Let's continue with step 3: searching for meaningful explanatory variables.
3. Search for meaningful explanatory variables
Let's try to identify possible correlations between descriptors, starting with a look at our age variable:
The raw table is a bit hard to read, so we can use Zeppelin's plotting features to take a better look at the variable, starting with the histogram view.
On its own, the default chart is still not very informative. We can open the visualization settings to choose which Keys, Groups, and Values to display. For the age variable, we choose age as the key, chd (the target variable) for groups, and count(sum) for values.
Now let's do the same for sbp, alcohol, tobacco, ldl, and obesity:
Take care when using graphical analysis to detect possible collinearities. The variables for alcohol consumption and quantity of tobacco seem to be distributed in the same way, as do the variables for cholesterol and obesity.
Another analysis tool is a scatter plot (point cloud) across all variables, optionally coloring the points according to the target variable. I use seaborn with some helper visualization scripts to support such plotting. The scripts are available on the project's GitHub page under notebooks.
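Similar-looking histograms are only suggestive, so it can help to put a number on the relationship. Here is a minimal sketch that computes pairwise Pearson correlations with Spark's `DataFrame.stat.corr`. The helper name `pairwiseCorr` is hypothetical, and the sketch assumes that the listed columns are numeric.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: Pearson correlation for every unordered pair of columns.
// Assumes all listed columns are numeric.
def pairwiseCorr(df: DataFrame, cols: Seq[String]): Seq[(String, String, Double)] =
  for {
    (a, i) <- cols.zipWithIndex
    b      <- cols.drop(i + 1)
  } yield (a, b, df.stat.corr(a, b))

// Possible usage in a Zeppelin paragraph, assuming `encoded` is the DataFrame from part 1:
// pairwiseCorr(encoded, Seq("alcohol", "tobacco", "ldl", "obesity")).foreach(println)
```

A coefficient close to 1 or -1 would support the visual impression of collinearity. A value near 0 would suggest the distributions merely have a similar shape.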
4. Outliers and missing values
What counts as an outlier depends on the distribution and on the object of the study. In the literature, the treatment of missing values and outliers is sometimes the subject of endless discussions, which practitioners should be aware of. In general, it is easier to decide what an outlier is with some domain knowledge. Let's look again at the basic statistics:
famhist and chd are both categorical variables, so we will drop them from the statistical description:
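As a rough sketch of that step, Spark's built-in `describe()` returns the count, mean, standard deviation, min, and max of each remaining column. The helper name `describeContinuous` is hypothetical, and the column names in the usage line are taken from the baseline code later in this post.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: summary statistics for every column except the categorical ones.
def describeContinuous(df: DataFrame, categorical: Seq[String]): DataFrame =
  df.drop(categorical: _*).describe()

// Possible usage in a Zeppelin paragraph, assuming `encoded` is the DataFrame from part 1:
// z.show(describeContinuous(encoded, Seq("famhist", "chd")))
```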
The distribution of tobacco consumption is very spread out, as is the one for alcohol. The other distributions seem rather consistent. So, for now, we leave those values as they are.
This data set contains neither missing values nor visible outliers.
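A claim like this is easy to check directly. The sketch below counts null values per column. The helper name `nullCounts` is hypothetical, and the check only covers nulls (not NaN values or sentinel codes such as -1).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, when}

// Hypothetical helper: one row holding the number of null values in each column.
// `when` without `otherwise` yields null for non-matching rows, and `count` skips nulls.
def nullCounts(df: DataFrame): DataFrame =
  df.select(df.columns.map(c => count(when(col(c).isNull, 1)).alias(c)): _*)

// Possible usage in a Zeppelin paragraph, assuming `encoded` is the DataFrame from part 1:
// z.show(nullCounts(encoded))
```

If every count is zero, there is nothing to impute or drop at this stage.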
5. Discretize or not?
Whether to discretize continuous variables is a common question in exploratory analysis.
In the data set, the following variables could potentially be discretized:
Should we discretize continuous variables? Yes, mostly. But how? In line with the target variable? Based on business knowledge? Based on quantiles of the distribution?
There's no definitive answer. The method you choose will generally depend on the problem and on how much time you want to spend. Always remember: consider your actual results, and don't hesitate to reconsider (and reverse) any cutting you do.
Let's consider the variable age, which is the simplest to discretize. Heart issues are not uniformly common across age groups, as shown here. Let's also discretize the variable tobacco, distinguishing between light, medium, and heavy smokers.
With Spark we can perform both actions separately or chain them using a Pipeline from Spark ML (more on that in the next part).
For now, we will just use the QuantileDiscretizer for that purpose, as follows:
    import org.apache.spark.ml.feature.QuantileDiscretizer
    import org.apache.spark.sql.types.DoubleType

    // Cast age to Double before fitting the discretizer
    val encodedWithDoubleAge = encoded.withColumn("age", $"age".cast(DoubleType))

    // Split age into 4 quantile-based buckets
    val ageDiscretizer = new QuantileDiscretizer()
      .setInputCol("age")
      .setOutputCol("age_discret")
      .setNumBuckets(4)

    val result = ageDiscretizer.fit(encodedWithDoubleAge).transform(encodedWithDoubleAge)
    z.show(result.orderBy($"age".asc))

As you may have noticed, we cast our age variable to a double-precision floating-point type. Otherwise, we'd get the following error:
scala.MatchError:  (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
(Which is ambiguous if you ask me.)
We will perform the same action on the tobacco variable (no need to cast here).
    val tobaccoDiscretizer = new QuantileDiscretizer()
      .setInputCol("tobacco")
      .setOutputCol("tobacco_discret")
      .setNumBuckets(3)

    val result = tobaccoDiscretizer.fit(encoded).transform(encoded)
    z.show(result.orderBy($"tobacco_discret".asc))
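As mentioned earlier, the two discretizations could also be chained in a single Spark ML Pipeline. The following is only a sketch of that alternative, not the approach used in this post. The function name `discretizeAgeAndTobacco` is hypothetical, and the column names and bucket counts simply mirror the snippets above.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper: applies both discretizers in one Pipeline.
// Assumes `df` has a numeric "age" column and a Double "tobacco" column.
def discretizeAgeAndTobacco(df: DataFrame): DataFrame = {
  // Same cast as in the standalone age example
  val input = df.withColumn("age", col("age").cast(DoubleType))

  val ageDiscretizer = new QuantileDiscretizer()
    .setInputCol("age")
    .setOutputCol("age_discret")
    .setNumBuckets(4)

  val tobaccoDiscretizer = new QuantileDiscretizer()
    .setInputCol("tobacco")
    .setOutputCol("tobacco_discret")
    .setNumBuckets(3)

  // Stages are fitted and applied in order
  val pipeline = new Pipeline().setStages(Array(ageDiscretizer, tobaccoDiscretizer))
  pipeline.fit(input).transform(input)
}

// Possible usage in a Zeppelin paragraph:
// z.show(discretizeAgeAndTobacco(encoded))
```

The benefit of the Pipeline form is that the fitted model can be reused to apply exactly the same bucket boundaries to new data.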
People under 15 years old are hardly represented in the sample, and the same is true of those under 30.
Half of the people over 45 suffer from a heart problem.
One could consider that heart problems are simply hereditary. But we also find that smoking has a real influence: the share of people with heart problems is significantly higher among smokers, regardless of the amount of tobacco they consume.
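Observations like these can be read off a simple aggregation rather than a chart. Below is a minimal sketch that computes the share of positive chd cases per bucket. The helper name `chdRateBy` is hypothetical, and it assumes chd is encoded as 0/1.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, count, lit}

// Hypothetical helper: group size and share of chd cases for each value of `bucketCol`.
// Assumes "chd" is numeric with values 0 and 1, so its mean is the case rate.
def chdRateBy(df: DataFrame, bucketCol: String): DataFrame =
  df.groupBy(bucketCol)
    .agg(
      count(lit(1)).alias("n"),
      avg(col("chd").cast("double")).alias("chd_rate")
    )
    .orderBy(bucketCol)

// Possible usage in a Zeppelin paragraph, with the discretized columns from above:
// z.show(chdRateBy(result, "age_discret"))
// z.show(chdRateBy(result, "tobacco_discret"))
```

Keeping the group size `n` next to the rate makes it easier to spot buckets that are too small to support any conclusion.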
This initial analysis indicates that:
- It is not very useful to keep individuals under 15 in the sample: if age plays a role, a model built on this data would not be calibrated to predict the likelihood of heart disease for that group.
- The results seen so far are only descriptive, so we still need to confirm them in a modeling phase. Our baseline will be:
    val baseline = step2
      .filter($"age" > 15)
      .drop("age")
      .drop("tobacco")
      .drop("chd")
      .drop("famhist")

    z.show(baseline)
We'll pause there. In the final part of this guide, we'll talk about sampling, modeling, model validation, and more.