Apache Spark™ 2.0 was released in July with an impressive list of new features, improvements, and bug fixes. If you are using Apache Spark™ you probably studied the release notes and you may have looked at the long list of more than 2,500 issues that were addressed in this latest major release — the highest number for any Apache Spark™ release yet — resulting in more than 200,000 lines of new code. And of course no less impressive is the list of over 300 contributors who made this release possible.
Rightfully, there has been a lot of buzz about all the major accomplishments in this release, and Google will gladly serve up many blog posts and detailed presentations. However, if you like spreadsheets, data tables, and comparing numbers, and you are curious about how this latest Spark release compares to previous releases in terms of issue numbers and lines of code, then this blog post is for you.
Since Apache Spark™ is an open source project, all of its source code and its issue tracker are available online, on GitHub and Apache JIRA respectively. GitHub makes some statistics readily available, like commit frequency or lines of code added in a certain period, and JIRA provides its own query language to search and filter issues, the results of which can be presented on dashboards. If you want to know how many developers contributed to a certain component, how long it took on average to fix an issue, or how many people are involved in the average code review, then you need to dig deeper into what the GitHub Developer API and the JIRA REST API have to offer.
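As a minimal sketch of that digging, the snippet below pages through the public Apache JIRA REST API search endpoint. The `/rest/api/2/search` endpoint, the JQL syntax, and the `startAt`/`maxResults` pagination parameters are standard JIRA REST API features; the exact JQL filter is our assumption about how the release issues were tracked, so adjust it to taste.

```python
# Sketch: fetch the resolved Spark 2.0.0 issues from the public Apache JIRA
# REST API, paging through the results 100 issues at a time.
import json
import urllib.parse
import urllib.request

JIRA_SEARCH_URL = "https://issues.apache.org/jira/rest/api/2/search"
# Assumed JQL filter for the issues resolved in the 2.0.0 release:
JQL = 'project = SPARK AND fixVersion = "2.0.0" AND status in (Resolved, Closed)'

def build_search_url(jql, start_at=0, page_size=100):
    """URL for one page of a paginated JIRA issue search."""
    params = urllib.parse.urlencode({
        "jql": jql,
        "startAt": start_at,
        "maxResults": page_size,
        "fields": "components,issuetype,priority,reporter,created,resolutiondate",
    })
    return JIRA_SEARCH_URL + "?" + params

def fetch_all_issues(jql):
    """Page through the search results until every matching issue is fetched."""
    issues, start_at = [], 0
    while True:
        with urllib.request.urlopen(build_search_url(jql, start_at)) as resp:
            page = json.load(resp)
        issues.extend(page["issues"])
        start_at += len(page["issues"])
        if start_at >= page["total"]:
            return issues
```

Calling `fetch_all_issues(JQL)` should return roughly the 2,500-odd issue records analyzed below, each with the fields requested in the query.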
Let's start with a look at how the 2,522 issues break down by component and issue type to underscore the focus points of this release. Since there are about 27 components, let's limit ourselves to Spark Core, GraphX, ML/MLlib, PySpark, SparkR, Spark SQL, and Spark Streaming. Overall, we find that more than 1,600 of the issues resolved in Apache Spark™ 2.0 were New Features and Improvements (broken down into Tasks and Sub-Tasks), and about 900 were Bug fixes. Looking at the component breakdown, we see that about half of the resolved issues fall into the Spark SQL component with 1,254, followed by Machine Learning with 407 and Spark Core with 258.
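Once the issues are fetched, a breakdown like this is a simple tally. The sketch below follows the field layout of the JIRA REST API response (`fields.components` is a list, so an issue touching several components counts toward each of them); the two sample issues are made-up illustrations, not real SPARK tickets.

```python
from collections import Counter

def breakdown(issues):
    """Tally issues by component and by issue type, given issue dicts
    shaped like the JIRA REST API search response."""
    by_component, by_type = Counter(), Counter()
    for issue in issues:
        by_type[issue["fields"]["issuetype"]["name"]] += 1
        # One issue can belong to several components; count it in each.
        for component in issue["fields"]["components"]:
            by_component[component["name"]] += 1
    return by_component, by_type

# Hypothetical sample issues in the JIRA response shape:
sample = [
    {"fields": {"issuetype": {"name": "Improvement"},
                "components": [{"name": "SQL"}]}},
    {"fields": {"issuetype": {"name": "Bug"},
                "components": [{"name": "SQL"}, {"name": "PySpark"}]}},
]
by_component, by_type = breakdown(sample)
```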
Each issue is also categorized by severity with the highest severity being Blocker and the lowest severity being Trivial. When breaking down all of the issues resolved in Apache Spark 2.0 by component and severity, one can see that more than 60% of all resolved issues were categorized as Major, Critical, or Blocker. The majority of issues addressed in the PySpark and Machine Learning components were categorized as Minor.
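The 60% figure is just a share computed over the per-priority counts. A quick sketch, using illustrative counts rather than the real Spark 2.0 numbers:

```python
def high_severity_share(by_priority):
    """Fraction of issues whose priority is Blocker, Critical, or Major,
    given a mapping of priority name -> issue count."""
    high = sum(by_priority.get(p, 0) for p in ("Blocker", "Critical", "Major"))
    total = sum(by_priority.values())
    return high / total if total else 0.0

# Illustrative counts only (not the actual Spark 2.0 breakdown):
share = high_severity_share({"Blocker": 50, "Critical": 100, "Major": 1400,
                             "Minor": 900, "Trivial": 72})
```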
When we take a look at how this release compares to previous releases, we can see that Spark 2.0 clearly stands out with 2,522 issues addressed, compared with 1,195 in Spark 1.6 and 1,563 in Spark 1.5. Another observation is that for the past several releases the strongest focus of the Spark community has been on Spark SQL and Machine Learning, followed by Spark Core and PySpark.
Thanks to the wealth of attributes that can be recorded with each issue in the JIRA issue tracking system, and thanks to the Apache Spark community and its committers who diligently fill them in, there are a lot more metrics that we can drill into. The table below contains a few of those metrics, along with some derived and aggregated ones: the number of reporters, watchers, and code contributors, and the average number of days it takes to resolve an issue that was reported by someone other than the implementer of the code change ("turn-around time"). Since each code contribution (via "Pull Request") is thoroughly reviewed by other members of the Spark community, we also added columns for the average number of code reviewers per issue and for the duration of the code review process.
| Component | Issues | Contributors | Reporters | Issues per Contributor | Self-Reported | Dev-Reported | Avg. Watchers | Turn-Around (days) | Reviewers per PR | Review Duration (days) |
|---|---|---|---|---|---|---|---|---|---|---|
| Spark Core | 258 | 90 | 101 | 2.9 | 69% | 94% | 4.1 | 113.6 | 2.7 | 6.7 |
There are a couple of interesting things to notice in the table above. Overall, the 2,522 issues raised by 409 issue reporters were addressed by 301 code contributors in Apache Spark™ 2.0. This translates to a workload of about 8.4 issues per code contributor. Two thirds of the issues were actually reported by the same developer who also contributed the code change. Overall, 89 percent of all 2.0 issues were raised by Spark developers and only 11 percent by Spark users who did not also contribute code. This 8-to-1 ratio could be an indication that the Spark developers find and fix issues before they arise in the field, or that issues get raised via other channels, like mailing lists or online forums such as StackOverflow, which are monitored by Spark developers who then enter those issues into JIRA. On average, each issue is followed by 3.9 community members, and it takes about 66 days from the time an issue is raised to the time it gets resolved (excluding self-reported issues, where all too often the reporting developer has a code change ready at the time the issue is created). Each proposed code change ("Pull Request") is reviewed by 2 to 3 community members, and the code review process takes 9 to 10 days on average.
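The turn-around metric can be sketched as follows, using the `created` and `resolutiondate` timestamps from the JIRA REST API. Note that `contributors_by_key` (mapping issue key to the usernames who contributed the fix) is a hypothetical input here — JIRA does not record pull-request authors directly, so in practice that mapping would be assembled from GitHub data.

```python
from datetime import datetime

# JIRA REST API timestamp format, e.g. "2016-03-01T09:30:00.000+0000"
JIRA_TS = "%Y-%m-%dT%H:%M:%S.%f%z"

def avg_turnaround_days(issues, contributors_by_key):
    """Average days from issue creation to resolution, skipping issues whose
    reporter also contributed the code change ('self-reported' issues)."""
    durations = []
    for issue in issues:
        fields = issue["fields"]
        reporter = fields["reporter"]["name"]
        if reporter in contributors_by_key.get(issue["key"], set()):
            continue  # exclude self-reported issues
        created = datetime.strptime(fields["created"], JIRA_TS)
        resolved = datetime.strptime(fields["resolutiondate"], JIRA_TS)
        durations.append((resolved - created).total_seconds() / 86400)
    return sum(durations) / len(durations) if durations else 0.0

# Hypothetical sample: SPARK-2 is self-reported (bob fixed his own issue)
# and is therefore excluded; SPARK-1 took 10 days.
sample = [
    {"key": "SPARK-1", "fields": {"reporter": {"name": "alice"},
        "created": "2016-03-01T00:00:00.000+0000",
        "resolutiondate": "2016-03-11T00:00:00.000+0000"}},
    {"key": "SPARK-2", "fields": {"reporter": {"name": "bob"},
        "created": "2016-03-01T00:00:00.000+0000",
        "resolutiondate": "2016-05-01T00:00:00.000+0000"}},
]
avg = avg_turnaround_days(sample, {"SPARK-2": {"bob"}})
```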
In the component breakdown, Spark SQL stands out with the most productive developers, each of whom addressed almost 11 issues. It also shows the quickest turn-around time, at less than 45 days to resolve issues, and the swiftest code reviews, taking only 6 days on average.
A somewhat biased take on the GraphX component would be that it is the least widely used, since 19 of the 20 addressed issues were raised by Spark developers themselves. This apparent lack of usage and interest seems to be reflected in the highest number of days (almost 130) that it took to resolve GraphX issues. However, with an average of 4.3 watchers, the issues in the GraphX component also garner the highest interest from other members of the community.
Coming soon: A follow-up post that details the lines of code added per component with Spark 2.0.