
Configuring the Apache Spark™ SQL Context

The Apache Spark website documents the properties you can configure, including settings that control the Spark application and the Spark SQL Context. Let’s look at some of the Spark SQL Context parameters and at a convenient feature in Spark 1.6 that makes them easier to set.

Spark SQL Context (SQLContext) serves as the entry point for creating DataFrames and Datasets and for running SQL queries. SQLContext accepts Spark SQL configuration properties through its setConf method, or you can specify them in the spark-defaults.conf file in the conf/ directory. For example, to enable Parquet schema merging (the property behind the internal constant PARQUET_SCHEMA_MERGING_ENABLED), you can add:

spark.sql.parquet.mergeSchema=true

Or you can set the parameter using the SET key=value command in SQL.
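To make this concrete, here is a minimal Scala sketch of the same options. The SparkConf/SQLContext setup and the specific property names and values are illustrative, not taken from the post:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("sql-conf-demo"))
  val sqlContext = new SQLContext(sc)

  // 1. Programmatically, through setConf (values are always passed as strings):
  sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")

  // 2. In SQL, with the SET key=value command:
  sqlContext.sql("SET spark.sql.shuffle.partitions=200")

  // 3. The third option is spark-defaults.conf, as shown above; a value set by any
  //    of these routes can be read back with getConf, which returns the stored string:
  val partitions = sqlContext.getConf("spark.sql.shuffle.partitions")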

The Spark SQL, DataFrames and Datasets Guide on the Apache Spark website documents most Spark SQL Context properties, which all begin with the prefix “spark.sql.”. Look for them in these sections of the guide:

  • Configuration
  • Caching Data In Memory
  • Other Configuration Options
  • Migration Guide

After you set a SQLContext configuration property, the property name and its value are stored as a string pair in an internal map, and the value is looked up at SQL execution time. Although it is stored as a string, the value may represent a boolean, an integer, or a long.

Internally, Spark converts the string to the corresponding boolean, integer, or long at the point where the property is used. For some properties the value is a size in bytes, for example SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE, which sets the target input size of a post-shuffle partition.

In previous Spark releases, a user who wanted to specify a size of 1 MB had to do the arithmetic (1024 × 1024 = 1048576), because the string ‘1m’ could not be converted to a long with the string’s toLong method; it would throw a NumberFormatException. With Spark 1.6, the user can write the value with a unit suffix such as ‘1m’ or ‘1g’ (k, m, g), and Spark uses an existing utility to interpret the string and convert it to the corresponding number of bytes. That should save Spark users a lot of time.
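As a small illustration of the 1.6 behavior, the sketch below sets the post-shuffle target size both ways, continuing with the sqlContext from the earlier sketch. The key string spark.sql.adaptive.shuffle.targetPostShuffleInputSize is assumed here to correspond to SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE; check the configuration reference for your Spark version:

  // Pre-1.6 style: the size has to be spelled out in bytes (1024 * 1024 = 1048576).
  sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "1048576")

  // Spark 1.6 style: the same value written with a unit suffix (k, m, or g).
  sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "1m")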

That’s just one example of how Spark is becoming easier to use and easier to configure with each new release.
