Apache Spark SQL Version 1.6 runs queries faster!
That’s the good news from Berni Schiefer, an IBM Fellow, an STC engineer with deep expertise in rigorous performance testing, and a native of the great Canadian prairie Province of Saskatchewan. “Our results are encouraging for anyone moving to Spark from a SQL database background,” is how Berni sums up the results. “Developers essentially need to know three things about moving their queries to run in Spark SQL: Can Spark SQL parse the query, can it get the right answer, and once it can do that, how long does it take to run?”
“In earlier releases, developers would sometimes get slowed down at step one: they’d expect the syntax to work unmodified—and find they could not run a query because it did not work as is. The developer would need to make major or minor edits, or even have to give up.”
Spark SQL is a Spark module for structured data processing; see Spark SQL: Relational Data Processing in Spark by Matei Zaharia et al. for background.
In version 1.6, Spark SQL has a better and broader command of the SQL language, especially in the nuanced cases where the meaning is the same but the syntax is slightly different.
“Imagine you had to translate everything from British English to American English—you’d have to remember to change ‘chemists’ to ‘drug store’ and ‘flats’ to ‘apartments’ every time. In version 1.6, this annoyance is mitigated to a significant degree. This reduces barriers to entry for Spark. If you’re hiring, for example, you can now build a more skilled labor force more easily and for fewer dollars, because you can hire from the large pool of SQL developers rather than from the smaller pool of developers with specialized Spark SQL skills.”
And people building Spark apps with regular SQL can have greater confidence that it’s going to work.
To measure the performance of Spark SQL, we ran a subset of the TPC-DS* workload. This workload is derived from TPC-DS and uses dsdgen and dsqgen to generate the data and the queries from the TPC-DS query templates. The results reported are not intended to represent an official TPC benchmark result. No changes were made to the schema or data generator, but only 32 of the 99 queries were executed.
*The TPC-DS workload consists of 99 queries – but Spark SQL can’t run all 99 yet. We tracked the performance of the queries that do run across multiple Spark versions. In 1.6, the number of queries that run has also gone up.
Overall, we observed a 9.3% reduction in elapsed time between Spark 1.4.1 and Spark 1.6 for the 31 TPC-DS queries that we ran, using a 1 TB TPC-DS data set on a 5-node cluster. The maximum gain was 25%, while the maximum degradation was 31%. However, 15 of the queries showed substantial elapsed-time improvements, while only 2 queries visibly degraded.
We ran all workloads using spark-submit in yarn-client mode. We found that Spark 1.6 requires more driver memory than previous versions. For example, the TPC-DS SQL workload ran with 8GB of driver memory on 1.5.1 but needed 12GB on Spark 1.6 to run without failed stages. So the memory footprint of Spark 1.6 has increased, and depending on your application, you may also need to increase executor memory.
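For reference, here is a sketch of what such a spark-submit invocation might look like. The application class, jar name, and executor settings are illustrative placeholders rather than our actual test configuration; the 12GB driver memory reflects the Spark 1.6 requirement described above.

```shell
# Hedged sketch of a spark-submit launch in yarn-client mode (Spark 1.x syntax).
# The class name, jar, and executor sizing below are hypothetical examples;
# only the 12g driver memory is taken from the workload discussed in the text.
spark-submit \
  --master yarn-client \
  --driver-memory 12g \
  --executor-memory 24g \
  --num-executors 20 \
  --class com.example.TpcdsRunner \
  tpcds-queries.jar
```

If your application fails with out-of-memory errors in the driver or with failed stages after upgrading to 1.6, increasing --driver-memory (and, if needed, --executor-memory) is the first knob to try.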
The TPC-DS SQL workload improved by almost 20% in performance compared to 1.5.1. Almost all queries in the workload ran faster, making Spark 1.6 a really solid release for Spark SQL.
*TPC is a trademark of the Transaction Processing Performance Council. The results reported are not intended to represent an official TPC benchmark. The workload is derived from TPC-DS and uses dsdgen and dsqgen to generate the data and the TPC-DS SQL from the TPC-DS templates.
Berni noted that the temporary elapsed-time increase between versions 1.4.1 and 1.5.1, as well as the 2 queries that remained regressed from Spark 1.4 to Spark 1.6, were not unusual or exceptional in the world of query processing. “Improving software as complex as Spark is like Whac-A-Mole. When you make one thing better with focused attention, it’s common for other things to get worse for a little while. But we see how quickly this was addressed in Spark 1.6. We continue to work closely with the Spark community to investigate the origin of the 2 remaining regressions and find ways to improve them.”
The testing in this article was done over a 5-month period after the July 15th, 2015, release of Spark version 1.4.1. Apache Spark has a 7-year history, beginning in 2009 when Romanian-Canadian Matei Zaharia started the project as a PhD student at UC Berkeley’s AMPLab.
The STC’s benchmarking team continues to test Spark SQL and other functions of Apache Spark. In the next couple of months, we’ll publish results from:
More tests with a larger subset of the 99 queries, possibly on bigger data sets.
Additional testing using the query-optimized Parquet format. We are confident that exploiting the latest enhancements in Spark 1.6 for the Parquet format will yield substantial benefits.
Testing the way production Spark users would consume Spark SQL: measuring throughput on multi-user runs with varying workload profiles.
For more Spark 1.6 and Spark SQL benchmarking:
Recent performance improvements in Apache Spark: SQL, Python, DataFrames, and More by Reynold Xin, Databricks
Spark 1.6.0 Performance Sneak Peek by Jesse F. Chen, IBM
Evaluating Hive and Spark SQL with BigBench by Todor Ivanov and Max-Georg Beer, Frankfurt Big Data Lab