Improving BLAS library performance for MLlib

If you’re making use of Apache Spark’s MLlib component, you may have seen the following warnings in your application’s logfile:

16/03/23 10:49:18 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS

16/03/23 10:49:18 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

This post will explain what these warnings mean, and what your options are to eliminate them.

Apache Spark makes use of a component called netlib-java, which provides a Java API for linear algebra routines such as BLAS and LAPACK. The netlib-java package doesn’t implement these routines itself, but rather delegates incoming calls to one of three implementations, tried in the following order:


1) A system-specific native library, such as OpenBLAS, Intel MKL, or Nvidia’s cuBLAS

2) A built-in native reference implementation, compiled from the Netlib reference Fortran sources

3) A pure-Java implementation, provided by F2J

The two warnings above mean that the first two implementations could not be loaded, and MLlib is falling back to the pure-Java F2J implementation under the covers. If your program processes large amounts of data and spends a lot of time in BLAS functions, you may get a significant performance boost by enabling a native BLAS.
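The fallback behavior can be sketched in shell: try each candidate in order, warn on failure, and settle for the next one down. This is purely an illustrative simulation, not netlib-java’s actual loader; the `available` variable below stands in for whatever your JVM can really load, and here it reproduces the failure scenario behind the two warnings.

```shell
# Hypothetical sketch of netlib-java's fallback order. "available" simulates
# a system where only the pure-Java implementation can be loaded.
available="F2jBLAS"
chosen=""
for impl in NativeSystemBLAS NativeRefBLAS F2jBLAS; do
  if [ "$impl" = "$available" ]; then
    chosen="$impl"
    break
  fi
  # Mirrors the log lines shown at the top of this post:
  echo "WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.$impl"
done
echo "using: $chosen"
```

Run as-is, this prints the two familiar WARN lines and then selects F2J, which is exactly the situation the rest of this post is about fixing.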

We’ll tackle the second warning first: NativeRefBLAS. This message indicates that the built-in reference implementation didn’t load on your system, probably because you’ve downloaded a pre-built version of Spark, and the reference code isn’t part of that build. To fix this, you need to download the Spark source code and build it yourself, passing the “-Pnetlib-lgpl” option to Maven. See this note on MLlib dependencies for more details.
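As a sketch, the from-source build looks like the following. The clone URL and Maven invocation follow the standard Spark build documentation; the actual build needs a JDK, Maven, and network access, so the heavyweight steps are shown commented out as a recipe.

```shell
# Recipe (commented out -- run these on a build machine):
#   git clone https://github.com/apache/spark.git && cd spark
#   ./build/mvn -Pnetlib-lgpl -DskipTests clean package
# The essential difference from a stock build is the netlib-lgpl profile flag:
BUILD_CMD="./build/mvn -Pnetlib-lgpl -DskipTests clean package"
echo "$BUILD_CMD"
```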

It’s also possible that the pre-built binary implementations are not compatible with your system. You can see a list of netlib-java’s pre-built platforms here. Power CPUs are not among them, and older Linux distributions, such as RedHat 6.x, are not compatible with the pre-built Linux binaries. You can attempt to build netlib-java yourself on such systems, but this is not straightforward and I won’t cover it here.

Hopefully, you are able to get past the NativeRefBLAS error with the simple “-Pnetlib-lgpl” build flag. Getting past this is a requirement before tackling the NativeSystemBLAS warning, because the same platform-specific native JNI code is used to call either the built-in reference or system-provided BLAS implementations.

Once you’ve eliminated the NativeRefBLAS warning, you may or may not still see the NativeSystemBLAS warning. Fixing this requires installing your favorite implementations of BLAS and LAPACK. On a RedHat/CentOS 7 system you would run “yum install openblas lapack”, while on a Debian/Ubuntu system you would run “apt-get install libatlas3-base libopenblas-base”. This brings in the necessary binaries, but you may still need to create symlinks so that the generic BLAS and LAPACK shared-library names point to your chosen implementation. For example, after installing on RedHat 7, you will need to go into the /usr/lib64/ directory and create the appropriate “ln -s” symlinks. At this point, you should be able to run your Spark program and not get any warnings about NativeSystemBLAS.
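The symlink pattern looks like this; the filenames below are stand-ins created in a throwaway directory purely to illustrate the mechanics, since the exact sonames and paths depend on your distribution and the packages you installed.

```shell
# Demonstrate the generic-name -> implementation symlink pattern in a throwaway
# directory. Filenames are stand-ins; use the sonames your distro actually ships.
workdir="$(mktemp -d)"
cd "$workdir"
touch libopenblas.so.0                  # stand-in for the installed OpenBLAS binary
ln -sf libopenblas.so.0 libblas.so.3    # the generic name now resolves to OpenBLAS
target="$(readlink libblas.so.3)"
echo "libblas.so.3 -> $target"
```

On a real system you would create the equivalent links in /usr/lib64/ (as root), pointing the generic names at the libraries your package manager installed.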

There are other options for a high-performing BLAS which, depending on your system and program, may get you even better performance. Eliminating the warnings means that you’re using a native system-provided BLAS implementation, but it doesn’t mean that you’re using the best implementation for your hardware, so it is worth exploring these options a bit further.

If you have a modern Nvidia GPU in your system, you might be considering using Nvidia’s cuBLAS implementation. This is possible through Nvidia’s NVBLAS library, and you can read about how to configure it in netlib-java here. However, I don’t recommend going this route for Apache Spark. If you take a look at section 3 of the NVBLAS documentation, you’ll see that only a handful of BLAS3 routines get sent to the GPU: gemm, syrk, herk, syr2k, her2k, trsm, trmm, symm, and hemm. Of these, only gemm is available in Spark’s MLlib, and only if explicitly used by the end user; MLlib doesn’t make use of any of these internally. This means that all other BLAS calls will get directed to the backup implementation that you have defined in your nvblas configuration file, and won’t run on the GPU. If you’re on a RedHat system like me, this means you’ll likely be building and using the reference CBLAS library, and not using a better-performing library such as OpenBLAS. So unless you have an application that sends a large amount of data through explicit calls to Matrix.multiply(y: DenseMatrix), you’re not gaining anything for your efforts to get cuBLAS working in Spark.
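For reference, the backup library mentioned above is declared in the nvblas.conf file that NVBLAS reads at startup. A minimal fragment might look like the following; the library path is an assumption for illustration, so point it at whichever CPU BLAS you actually have installed.

```
# nvblas.conf (illustrative fragment; adjust paths for your system)
NVBLAS_LOGFILE nvblas.log
NVBLAS_GPU_LIST ALL
# CPU library used for every BLAS call NVBLAS does not offload to the GPU:
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
```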


That leaves CPU-based BLAS implementations, of which there are several to choose from. On OSX, you’ll use the standard vecLib framework. On Linux, you’ll want to build OpenBLAS or ATLAS from scratch to get binaries that work best on your specific hardware, or if you have purchased an Intel MKL license, you can use that. On Power hardware, you’ll need to build netlib-java yourself to get usable JNI code, and then probably make use of IBM’s ESSL library.

