
Configuring the Apache Spark™ SQL Context

The Apache Spark website documents the properties you can configure, including settings that control the Spark application and the Spark SQL Context. Let’s look at some of the Spark SQL Context parameters, and at a convenient feature in Spark 1.6 that makes them easier to set.

The Spark SQL Context (SQLContext) is the entry point for creating DataFrames and Datasets and for running SQL queries. SQLContext accepts Spark SQL configuration properties through its setConf method, or you can specify them in the spark-defaults.conf file in the conf/ directory. For example, to enable Parquet schema merging (the setting behind the internal PARQUET_SCHEMA_MERGING_ENABLED constant), you can add:

spark.sql.parquet.mergeSchema=true

Or you can set the parameter using the SET key=value command in SQL.
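As a quick illustration, here is a minimal Scala sketch of both programmatic approaches. It assumes the Spark 1.6-era SQLContext API; the application name and the local master URL are placeholders.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Spark 1.6-style setup; the app name and local master are placeholders.
  val sc = new SparkContext(new SparkConf().setAppName("sql-conf-demo").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)

  // 1. Set a Spark SQL property programmatically through setConf.
  sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")

  // 2. The same property set with the SQL SET command.
  sqlContext.sql("SET spark.sql.parquet.mergeSchema=true")

  // Read the current value back; it comes back as a string.
  println(sqlContext.getConf("spark.sql.parquet.mergeSchema"))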

The Spark SQL, DataFrames and Datasets Guide on the Apache Spark website documents most Spark SQL Context properties, all of which share the “spark.sql.” prefix. Look for them in these sections of the guide:

  • Configuration
  • Caching Data In Memory
  • Other Configuration Options
  • Migration Guide

After you set a SQLContext configuration property, the property name and its value are stored as a string pair in an internal map, and the value is looked up at SQL execution time. Logically, a value may be a boolean, an integer, or a long, but it is always stored as a string.
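For example (a small sketch reusing the sqlContext from above and the standard spark.sql.shuffle.partitions property), an integer-valued setting is stored and returned as a string:

  // The value is passed and stored as a string, even though it is logically an integer.
  sqlContext.setConf("spark.sql.shuffle.partitions", "400")

  // getConf returns the stored string...
  val partitions: String = sqlContext.getConf("spark.sql.shuffle.partitions")   // "400"

  // ...and the conversion happens only when the value is actually used.
  val numPartitions: Int = partitions.toInt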

Internally, Spark converts the stored string to the corresponding boolean, integer, or long at the point where the property is used. Some properties represent sizes in bytes, for example SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE (spark.sql.adaptive.shuffle.targetPostShuffleInputSize), the target input size for a post-shuffle partition.

In earlier Spark releases, if you wanted to specify a size of 1 MB you had to do the arithmetic yourself (1024 × 1024 = 1048576), because a string such as ‘1m’ could not be converted to a long with toLong; it would throw a NumberFormatException. With Spark 1.6, you can write the value with a unit suffix such as ‘k’, ‘m’, or ‘g’, and Spark uses an existing utility to interpret the string and convert it to the corresponding byte size. That should save Spark users a lot of time.
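The difference looks roughly like this (a sketch; it assumes spark.sql.adaptive.shuffle.targetPostShuffleInputSize is the key behind SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE, and uses Spark’s JavaUtils byte-string helper to show how the suffix is parsed):

  import org.apache.spark.network.util.JavaUtils

  // Before Spark 1.6: spell out the byte count yourself.
  sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", (1024 * 1024).toString)

  // Spark 1.6: use a unit suffix and let Spark parse it.
  sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "1m")

  // The byte-string utility Spark uses internally interprets the suffix:
  println(JavaUtils.byteStringAsBytes("1m"))   // 1048576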

That’s just one example of how Spark is becoming easier to use and easier to configure with each new release.
