
Configuring the Apache Spark™ SQL Context

The Apache Spark website documents the properties you can configure, including settings that control the Spark application and the Spark SQL Context. Let’s look at some of the Spark SQL Context parameters, and at a convenient feature in Spark 1.6 that makes them easier to set.

The Spark SQL Context (SQLContext) is the entry point for creating DataFrames and Datasets and for running SQL queries. SQLContext accepts Spark SQL configuration properties through its setConf method, or you can specify them in the spark-defaults.conf file in the conf/ directory. For example, you can add:

spark.sql.parquet.mergeSchema true

Or you can set the parameter using the SET key=value command in SQL.
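Here is a minimal sketch of both approaches using the Spark 1.6-style API; the application name and the property values are illustrative, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// A minimal, self-contained sketch using the Spark 1.6-style API.
val conf = new SparkConf().setAppName("sqlconf-example").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Set a property programmatically through setConf...
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")

// ...or with the SQL SET command.
sqlContext.sql("SET spark.sql.shuffle.partitions=50")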

The Spark SQL, DataFrames and Datasets Guide on the Apache Spark website documents most Spark SQL Context properties, all of which start with the “spark.sql.” prefix. Look for them in these sections of the guide:

  • Configuration
  • Caching Data In Memory
  • Other Configuration Options
  • Migration Guide

After you set a SQLContext configuration property, the property name and its value are stored as a string pair in an internal map, and the value can then be looked up at SQL execution time. Conceptually, the value may be a boolean, an integer, or a long.

Internally, Spark converts the string value to the corresponding boolean, integer, or long at the point where the property is used. For some properties the value is a size in bytes, for example SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE, which is the target size of a post-shuffle partition’s input data.
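As a small illustration, continuing with the sqlContext from the snippet above (the conversions shown here mirror the idea rather than Spark’s internal code):

// Every configuration value is stored and returned as a string.
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
val raw: String = sqlContext.getConf("spark.sql.shuffle.partitions") // "200"

// The string is converted to the expected type only when it is used.
val numPartitions: Int = raw.toInt

// The second argument to getConf is a default value, also given as a string.
val mergeSchema: Boolean =
  sqlContext.getConf("spark.sql.parquet.mergeSchema", "false").toBoolean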

In previous Spark releases, if you wanted to specify a size of 1 MB, you had to do the arithmetic yourself (1024 × 1024 = 1048576), because the string ‘1m’ could not be converted to a long with the string’s toLong method; it would throw a NumberFormatException. With Spark 1.6, you can specify the value with a unit suffix such as ‘k’, ‘m’, or ‘g’, and Spark uses an existing utility to interpret the string and convert it to the corresponding number of bytes. That should save Spark users some time and prevent conversion mistakes.
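For example, here is a sketch of the difference, assuming (based on my reading of the Spark 1.6 source) that the SQLConf constant SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE corresponds to the key spark.sql.adaptive.shuffle.targetPostShuffleInputSize:

// Before Spark 1.6: the value had to be the raw number of bytes,
// computed by hand.
sqlContext.setConf(
  "spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
  (64 * 1024 * 1024).toString) // 64 MB = 67108864 bytes

// Spark 1.6 and later: a unit suffix ('k', 'm', 'g') is understood
// and converted to bytes for you.
sqlContext.setConf(
  "spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "64m")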

That’s just one example of how Spark is becoming easier to use and easier to configure with each new release.
