Tatsuhiro Chiba and Tamiya Onodera
As mentioned in many Apache Spark tuning articles, the performance of Spark applications is improved by configuring Spark parameters and Java Virtual Machine (JVM) options. Besides tuning these parameters, operating system side optimization has helped to accelerate Spark performance more on recent hardware. Nowadays, the multi-core system architecture is common; a processor has many cores, and multiple processors on the system are connected over the non-unified memory access (NUMA) architecture. Moreover, recent processors deliver many hardware threads in each core with simultaneous multi-threading (SMT) technology. For example, POWER8 supports using up to eight hardware threads per core and Haswell supports up to two hardware threads. As a result, the operating system manages many logical cores in one system. The Linux Completely Fair Scheduler (CFS) handles this load balancing across cores and tries to schedule tasks while taking NUMA nodes into consideration because the cost of memory access on a remote NUMA node is several times higher than on a local NUMA node.
Let’s get back to the Spark story. Spark launches executor JVMs with many task worker threads in a system. Then, the operating system tries to schedule these worker threads to multiple cores. However, the scheduler does not always bind worker threads to the same NUMA node, so several worker threads are often scheduled to distant cores on remote NUMA nodes. Once a thread moves to another core over NUMA, the thread has to incur an overhead to access data on remote memory. Since Spark’s compute intensive workloads such as machine learning continue to compute many times for the same RDD dataset in memory, the remote memory access overhead is not negligible in total.
We applied NUMA aware process affinity to the executor JVMs on the same system and compared the computation performance with multiple Spark workloads; one was Kmeans with MLlib, and the other was the TPC-H benchmark with Spark SQL. We observed a 3 – 13% improvement in both experiments by binding process affinity to executor JVMs.
We performed both experiments on a single machine that had two 3.3 GHz POWER8 processors, and each POWER8 had 12 cores, so 24 physical cores were available, and also, 192 logical cores were active with SMT8 mode. The POWER8 processor is also divided into two modules internally, so we could utilize four different NUMA nodes. We delivered 4 executor JVMs, and each had 12 worker threads and a 48GB Java heap. We used Spark 1.4.1 and IBM J9 JVM (1.8.0 SR1FP10).
We prepared two types of Spark workloads, Kmeans and TPC-H. For Kmeans, we synthetically generated a dataset of about 6 GB that included 68M points, and Kmeans loaded this input data in memory as RDD and continued to iterate over the data ten times until convergence. For the TPC-H benchmark, we prepared a scale factor data set of about 100 GB and executed Hive queries through a Thrift server to submit to the Spark SQL. Kmeans is a compute intensive application, whereas the TPC-H workload is I/O bound.
Figure 1 shows the result of the performance with and without NUMA aware process affinity for executor JVMs. Each value represents the average performance of a five-time test. We show the performance for several selected queries for TPC-H, but we observed a 3 – 4% improvement for all queries on average. The query characteristics were a little different between Q1 and the others. Speaking concretely, Q1 had less data shuffling, and Q5 and Q9 had a larger amount of data shuffling during computation. For Kmeans, the NUMA aware setting also achieved an improvement of over 13%. Consequently, this setting was more helpful for compute intensive workloads.
Next, we evaluated worker thread scheduling and memory access hardware performance counter events to estimate NUMA efficiency. Figure 2 shows the logical core id mapping where Spark worker threads run in one JVM without NUMA aware process affinity. We periodically captured where worker threads were running every five seconds and plotted it to the y-axis. Each line corresponded to 12 running worker threads. As shown in Figure 2, the worker threads were scheduled over NUMA at first and then almost all threads were gathered into the same NUMA node. As mentioned above, this machine had four NUMA nodes with 24 cores, so each NUMA node had six cores. However, several threads often exceeded the NUMA node domain. In contrast, Figure 3 shows the result after applying process affinity and focuses on the cores in a NUMA node. We can confirm that all of the 12 worker threads kept running within their NUMA node domain.
In addition, we captured hardware events by using the Linux perf command. In Spark, memory is heavily consumed while RDD generates many immutable objects, so the TLB miss and page fault rates are high. Considering NUMA locality helps reduce these rates and remote memory access. As a result, the penalty cycles also decreased, and the performance was accelerated.
Setting NUMA aware locality for executor JVMs achieves better performance in many Spark applications, so let’s try to enable core bindings while launching executor JVMs.