
CACHE Table in Apache Spark™ SQL

For users who want to improve performance by caching table data in memory, here are some considerations.

You can either call sqlContext.cacheTable("tableName") or dataFrame.cache() in an application, or run "CACHE TABLE tableName" in the Spark SQL shell. A subsequent query against the cached table will use InMemoryColumnarTableScan to scan and retrieve only the required column(s).

For example:

scala> sqlContext.cacheTable("t4")

scala> val df = sqlContext.sql("select col1 from t4")
df: org.apache.spark.sql.DataFrame = [col1: int]

scala> df.explain(true)
== Parsed Logical Plan ==
'Project ['col1]
+- 'UnresolvedRelation `t4`, None

== Analyzed Logical Plan ==
col1: int
Project [col1#103]
+- MetastoreRelation default, t4, None

== Optimized Logical Plan ==
Project [col1#103]
+- InMemoryRelation [col1#103,col2#104,col3#105], true, 10000, StorageLevel(true, true, false, true, 1), HiveTableScan [col1#73,col2#74,col3#75], MetastoreRelation default, t4, None, Some(t4)

== Physical Plan ==
InMemoryColumnarTableScan [col1#103], InMemoryRelation [col1#103,col2#104,col3#105], true, 10000, StorageLevel(true, true, false, true, 1), HiveTableScan [col1#73,col2#74,col3#75], MetastoreRelation default, t4, None, Some(t4)
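The same cache can also be populated through the DataFrame API or directly in SQL. Here is a minimal sketch of the equivalent calls, assuming the same table t4 and a Spark 1.5/1.6-era sqlContext (the SQL form of CACHE TABLE is eager by default, while cacheTable() and cache() are lazy until an action runs):

scala> val df2 = sqlContext.table("t4")
scala> df2.cache()                        // marks the DataFrame for in-memory caching (lazy)
scala> df2.count()                        // an action materializes the cache

scala> sqlContext.sql("CACHE TABLE t4")   // SQL form; eager by default
scala> sqlContext.uncacheTable("t4")      // release the cached data when done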

It’s worth noting that prior to Apache Spark™ 1.5.2, caching a Parquet table had an issue. Specifically, a query selecting from the cached Parquet table did not actually scan through InMemoryColumnarTableScan. Instead, it scanned the ParquetRelation in the physical plan, which could degrade performance.
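A quick way to check whether a cached Parquet table is really being served from memory is to look for InMemoryColumnarTableScan in the physical plan. A minimal sketch, assuming a hypothetical Parquet file at /tmp/t5 registered as a temp table t5:

scala> // Register a Parquet-backed table, cache it, and inspect the physical plan
scala> sqlContext.read.parquet("/tmp/t5").registerTempTable("t5")
scala> sqlContext.cacheTable("t5")
scala> val plan = sqlContext.sql("select col1 from t5").queryExecution.executedPlan.toString
scala> // On affected versions this can print false, i.e. the plan scans ParquetRelation instead
scala> println(plan.contains("InMemoryColumnarTableScan"))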

The problem was that the LogicalRelation wrapping the ParquetRelation carries expectedOutputAttributes, a list of resolved fields whose exprIds are not guaranteed to be the same across resolutions. When a table is cached, the LogicalRelation that wraps the ParquetRelation becomes the key in the cache and the resulting InMemoryRelation is the value. When a new query comes in, the newly resolved LogicalRelation wrapping the same ParquetRelation has expectedOutputAttributes with different exprIds than the cached key. As a result, the cache lookup misses and the plan falls back to scanning the physical ParquetRelation instead of the in-memory data.

Instead of comparing the wrapping LogicalRelations when looking up the key in the cache, the code should directly compare the underlying ParquetRelations. This issue is fixed in Spark 1.5.2 and 1.6.0.
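The following simplified Scala analogy (hypothetical stand-in classes, not Spark's actual internals) illustrates why keying the cache on a wrapper that carries freshly generated IDs misses on the next lookup, while comparing the underlying relation still hits:

import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-ins, only to illustrate the lookup problem
case class UnderlyingRelation(path: String)                        // plays the role of ParquetRelation
case class Wrapper(rel: UnderlyingRelation, exprIds: Seq[Long])    // plays the role of LogicalRelation

object Ids { private val c = new AtomicLong(0); def fresh(n: Int): Seq[Long] = Seq.fill(n)(c.incrementAndGet()) }

object CacheLookupDemo extends App {
  val parquet = UnderlyingRelation("/tmp/t5")

  // At cache time, the wrapper (with the exprIds minted at that moment) becomes the cache key
  val cacheByWrapper = Map(Wrapper(parquet, Ids.fresh(3)) -> "InMemoryRelation")

  // A later query re-resolves the plan and mints new exprIds, so the wrapper no longer matches
  val newWrapper = Wrapper(parquet, Ids.fresh(3))
  println(cacheByWrapper.contains(newWrapper))            // false: cache miss, falls back to the Parquet scan

  // Comparing/keying on the underlying relation instead survives re-resolution
  val cacheByRelation = Map(parquet -> "InMemoryRelation")
  println(cacheByRelation.contains(newWrapper.rel))       // true: the cached in-memory data is reused
}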

Bio: Xin Wu is an active contributor to Apache Spark with the IBM Spark Technology Center (STC). Xin’s main focus is the Spark SQL component. Prior to joining STC, he was a developer of Big SQL, a SQL-on-Hadoop engine by IBM.
