In this post, we describe a test of how well Apache Mesos handles 30 continuous streams of Apache Spark jobs running simultaneously in a shared, 10-node, multi-user cluster. It’s not as easy as you might think. Spark’s integration with Mesos has two scheduling modes, each with its own tradeoffs. We tried both and identified some areas that need improvement, which we’re working on with the Mesos community.
For the last year or so, we at IBM have been taking a serious interest in the work that the Apache Mesos community has been doing. Mesos provides the base cluster management, resource allocation, and task execution capabilities that enable frameworks to co-exist on the same infrastructure. We feel that Mesos provides a promising open platform that can be a foundation for developing distributed systems. Mesos cleanly separates resource allocation, handled by the Mesos master, from the scheduling and management of workloads, handled at the framework level. The Mesos ecosystem has a number of frameworks, such as Marathon, Aurora, Chronos, and Singularity, that integrate with Mesos.
One important framework today is Spark, which looks like it will be the next-generation big data analytics platform. Spark’s intuitive API and in-memory task processing, along with its support for streaming, machine learning, and graph processing, make it a popular tool for data scientists. However, in enterprise environments one of the keys to running Spark in production is being able to support multiple concurrent users, tenants, or applications that must run a variety of interactive or batch jobs on a shared infrastructure.
We believe that a scenario where multiple analysts come in at different times and concurrently submit short-duration jobs is an important use case, and one for which Spark is eminently suited because of its speed and support for interactive queries, among other features. Such exploratory analysis of data is of growing importance in many organizations. We therefore chose such a scenario for our experiment, which simulated up to 30 users, each coming onto the system 60 seconds apart and submitting a job. Each job has 3 stages of 40 tasks each, for a total of 120 tasks per job, and processes 2 GB of data. We ran this on 10 Intel x86 nodes, each with 16 cores (2 hyperthreads per core) and 96 GB of memory. The workload consists of TeraSort jobs and the scripts can be found here. This figure gives a view of the workload:
We tested several different configurations of Spark on Mesos using Mesos 0.26 and Spark 1.5.1. We experimented with both coarse-grained and fine-grained modes of running Spark on Mesos. In coarse-grained mode, Spark uses the offers from Mesos to launch Spark executors and re-uses them for all the tasks of a job. In fine-grained mode, each Spark task is mapped to a Mesos task, which adds overhead to launching each task. This mode may be inappropriate for low-latency workloads. In our tests with fine-grained mode, the TeraSort jobs showed a very high variance in response time, and the system didn’t recover when the workload was stepped down. Since the default for Spark on Mesos and the community best practice is to use coarse-grained mode, we focused on that.
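
For readers who want to try both modes, Spark 1.5.x selects the mode through the `spark.mesos.coarse` property. The sketch below builds a `spark-submit` command line for either mode. The helper name `spark_submit_args`, the master address, and the core cap are illustrative assumptions, not necessarily the settings used in this experiment; only the property names and the `--master` and `--conf` flags come from Spark itself.

```python
def spark_submit_args(master_url, coarse_grained, max_cores=None):
    """Build spark-submit arguments that select the Mesos scheduling mode.

    Hypothetical helper for illustration. Only the property names
    (spark.mesos.coarse, spark.cores.max) are real Spark settings.
    """
    mode = "true" if coarse_grained else "false"
    args = ["spark-submit", "--master", master_url,
            "--conf", "spark.mesos.coarse=" + mode]
    if max_cores is not None:
        # Caps the total number of cores one application may hold.
        args += ["--conf", "spark.cores.max=%d" % max_cores]
    return args


# Assumed master address; replace it with your own Mesos master.
print(" ".join(spark_submit_args("mesos://mesos-master.example:5050", True, max_cores=32)))
```

In coarse-grained mode a cap such as `spark.cores.max` matters because, without one, an application accepts all the cores it is offered.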
The graphic below shows job duration (job response-time) distribution produced by the experiment. Each dot on the graph represents a single, complete TeraSort job execution, including all the Spark and Mesos tasks that comprise it.
We can see that at the beginning and very end of the benchmark run, when the number of jobs executing concurrently is small, the data points cluster tightly. Towards the middle of the run, where the number of concurrent jobs is large, the data points spread out, reflecting increasing differences in duration among jobs.
To understand why this happens, let’s take a look at the graphic below, which shows how the offer batch size changes as the workload progresses. In coarse-grained mode Mesos always offers all resources available on a given host to Spark in a single offer. When only a single job runs on the cluster, the number of offers in the offer batch equals the number of hosts in the cluster. As the number of jobs running concurrently increases, the number of offers in the batch begins to vary, and so does the total amount of CPU and memory offered to each job. Towards the middle of the run, when the number of jobs running on the cluster is large, each offer batch contains only a single offer. Each job then runs on a single executor, on a single host. Jobs that don’t have a host to run on wait until other jobs finish. Lucky jobs will run quickly. Unlucky ones could wait a long time.
Given that we are simulating 30 users/jobs running on a cluster of 10 hosts, that translates into many jobs waiting for their turn to run on a host. We can imagine hitting this scenario in any heavily used cluster where there are more users than physical hosts.
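
The waiting effect can be illustrated with a toy queueing model. It is a minimal sketch, not a model of the Mesos allocator: it assumes all jobs are already queued at time zero, every job needs a whole host, hosts are handed out first-come, first-served, and every job takes the same arbitrary 10 seconds. The function name `whole_host_start_times` is hypothetical.

```python
import heapq


def whole_host_start_times(num_jobs, num_hosts, job_seconds):
    """Return the start time of each job when every job needs a whole host.

    Toy model: all jobs are queued at t=0, each takes job_seconds, and
    hosts are assigned first-come, first-served. Real Mesos allocation
    and real job durations differ.
    """
    host_free_at = [0.0] * num_hosts  # min-heap of times each host becomes free
    heapq.heapify(host_free_at)
    starts = []
    for _ in range(num_jobs):
        start = heapq.heappop(host_free_at)  # earliest available host
        starts.append(start)
        heapq.heappush(host_free_at, start + job_seconds)
    return starts


starts = whole_host_start_times(num_jobs=30, num_hosts=10, job_seconds=10.0)
waiting_at_t0 = sum(1 for s in starts if s > 0)
print(waiting_at_t0, max(starts))  # 20 20.0
```

Under these assumptions, 20 of the 30 jobs cannot start immediately, and the last group waits for two full job durations before it gets a host.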
But even when a job gets its turn to run, can a single executor make effective use of all CPU and memory resources on that host? The graphic below shows a typical CPU utilization trace on one of the cluster hosts during one of our benchmark runs.
In this CPU trace you can see a pattern whereby high CPU utilization (blue area) alternates with periods of low utilization. This is because a single 2 GB TeraSort job executes a total of 120 tasks in 3 stages (40 per stage), which don’t neatly divide into the 32 CPUs and 96 GB RAM available on any one host. With one task per CPU, each 40-task stage runs as a wave of 32 tasks followed by a wave of 8. Since the executor holds the offer for the duration of a job, there will be periods where only 8 CPUs are being used concurrently and 24 are left idle. From a system utilization point of view it would make sense to run multiple executors on each host, but the current coarse-grained mode precludes this for our scenario.
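
The utilization gap follows from simple arithmetic, sketched below. The function name `task_waves` is hypothetical, and the sketch assumes one CPU per task (Spark’s default `spark.task.cpus=1`) and tasks of roughly equal length, so that a new wave starts only when the previous one ends.

```python
def task_waves(num_tasks, slots):
    """Split a stage's tasks into waves that run concurrently on `slots` CPUs.

    Assumes one CPU per task and equal-length tasks, which is a
    simplification of how Spark actually schedules tasks.
    """
    waves = []
    remaining = num_tasks
    while remaining > 0:
        running = min(remaining, slots)
        waves.append(running)
        remaining -= running
    return waves


waves = task_waves(num_tasks=40, slots=32)
idle_in_last_wave = 32 - waves[-1]
print(waves, idle_in_last_wave)  # [32, 8] 24
```

With 3 stages per job, this two-wave pattern repeats three times, which matches the alternating high and low utilization described above.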
Spark on Mesos supports dynamic allocation only in coarse-grained mode, where it can resize the number of executors based on statistics of the application. In our scenario, when the cluster is busy a job gets at most one host to run on, so resizing the *number* of executors doesn’t help. Also, for the workload we’re running, the size of the tasks does not change between stages.
Another observation we can make is that the average job lasts only a few seconds, which was meant to simulate interactive analytics, as explained above. Hence the small data set and fast jobs. Since a new Spark context is created for each job, executors are restarted for every job. This points to a need for an effective scheduling and workload management layer between Mesos as the resource allocator and Spark, so that executors can be reused by multiple jobs.
We are continuing to do our part to improve Apache Mesos by actively participating in the project. We are involved in a number of work streams to improve the system for Spark, including:
- Task Resizing
- Fine-grained Offer
- Container Network Support
- Implement isolator for Docker volume
- Optimistic Offer
The task resizing feature could be used by the Spark scheduler to dynamically resize executors according to the number of tasks running in parallel. Fine-grained offers will help to reduce the offer granularity from all the available resources on a host to slices of those resources, so that resources are shared more fairly among Spark jobs. The container network and volume isolator can help to run multiple tenants’ Spark workloads in Docker environments with greater security and resource isolation.
Optimistic offers could potentially further optimize the Spark on Mesos experience as well as help other framework schedulers. The idea behind optimistic offers is to offer the same resources to multiple frameworks and resolve conflicts at task execution time. This will allow multiple Spark schedulers to see the same set of resources and may improve the overall resource utilization.
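
The idea of resolving conflicts at launch time can be sketched with a toy model. The `Host` class and its methods below are hypothetical stand-ins, not the Mesos API or the proposed design; the first-launch-wins rule is an assumption chosen to keep the example short.

```python
class Host:
    """Toy stand-in for a cluster host; not the Mesos API."""

    def __init__(self, cpus):
        self.free_cpus = cpus

    def offer(self):
        # Optimistic: every framework is shown the same free capacity.
        return self.free_cpus

    def launch(self, cpus):
        # Conflicts are resolved here, at launch time: first launch wins.
        if cpus > self.free_cpus:
            return False
        self.free_cpus -= cpus
        return True


host = Host(cpus=32)
seen_by_a = host.offer()     # 32
seen_by_b = host.offer()     # 32, the same resources are offered twice
ok_a = host.launch(24)       # succeeds, 8 CPUs remain
ok_b = host.launch(24)       # rejected, only 8 CPUs are left
ok_b_retry = host.launch(8)  # succeeds with a smaller request
print(seen_by_a, seen_by_b, ok_a, ok_b, ok_b_retry)  # 32 32 True False True
```

In this sketch the second scheduler is not blocked while the first holds the offer; it only has to handle a rejected launch and retry with what is left.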
These are our initial experiments and observations in running Spark on Mesos in a multi-user context. We’ve attempted to explain the behavior we saw and explored ways that the system might be improved. We would welcome other opinions and suggestions!