For users who want to improve performance by caching table data in memory, here are some considerations.
You can cache a table by calling sqlContext.cacheTable("tableName") or dataFrame.cache() in an application, or by running CACHE TABLE tableName in the Spark SQL shell. A new query against the cached table then uses InMemoryColumnarTableScan, which scans and retrieves only the required column(s).
scala> sqlContext.cacheTable("t4")
scala> val df = sqlContext.sql("select col1 from t4")
df: org.apache.spark.sql.DataFrame = [col1: int]
scala> df.explain(true)
== Parsed Logical Plan ==
'Project ['col1]
+- 'UnresolvedRelation `t4`, None
== Analyzed Logical Plan ==
col1: int
Project [col1#103]
+- MetastoreRelation default, t4, None
== Optimized Logical Plan ==
Project [col1#103]
+- InMemoryRelation [col1#103,col2#104,col3#105], true, 10000, StorageLevel(true, true, false, true, 1), HiveTableScan [col1#73,col2#74,col3#75], MetastoreRelation default, t4, None, Some(t4)
== Physical Plan ==
InMemoryColumnarTableScan [col1#103], InMemoryRelation [col1#103,col2#104,col3#105], true, 10000, StorageLevel(true, true, false, true, 1), HiveTableScan [col1#73,col2#74,col3#75], MetastoreRelation default, t4, None, Some(t4)
It’s worth noting that prior to Apache Spark™ 1.5.2, caching a Parquet table had an issue. Specifically, a query against the cached Parquet table did not actually scan through InMemoryColumnarTableScan. Instead, the physical plan scanned from ParquetRelation, which could degrade performance.
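One way to notice this situation is to look at the text that df.explain(true) prints and check whether the physical plan mentions the in-memory scan at all. The following is only a minimal sketch of that check: the PlanCheck object and its method names are hypothetical helpers written for this illustration, and the only assumption taken from the output shown earlier is the "== Physical Plan ==" header and the InMemoryColumnarTableScan operator name.

```scala
// Hypothetical helper for illustration; it is not part of Spark.
object PlanCheck {
  // Header that precedes the physical plan in the explain(true) output shown above.
  private val PhysicalHeader = "== Physical Plan =="

  // Returns the text that follows the physical plan header, if the header is present.
  def physicalPlanSection(explainOutput: String): Option[String] = {
    val idx = explainOutput.indexOf(PhysicalHeader)
    if (idx < 0) None
    else Some(explainOutput.substring(idx + PhysicalHeader.length).trim)
  }

  // True when the physical plan contains the in-memory columnar scan operator.
  def scansFromCache(explainOutput: String): Boolean =
    physicalPlanSection(explainOutput).exists(_.contains("InMemoryColumnarTableScan"))
}
```

Passing the captured explain output to PlanCheck.scansFromCache would return false for a plan that reads directly from ParquetRelation, which is the symptom described above.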
The problem was that the LogicalRelation that wraps the ParquetRelation has an expectedOutputAttributes field that stores a list of resolved fields with exprIds, yet these exprIds are not expected to be the same each time the relation is resolved. When a table is cached, the LogicalRelation that wraps the ParquetRelation becomes the key in the cache and the resulting InMemoryRelation is the value. Then, when a new query comes in, the newly resolved LogicalRelation that wraps the same ParquetRelation has expectedOutputAttributes with different exprIds than the cached key. As a result, the lookup does not find the cached relation, and the plan falls back to scanning the physical ParquetRelation instead of the in-memory data.
Instead of comparing the wrapping LogicalRelations when looking up the key in the cache, the code should directly compare the underlying ParquetRelation. This issue is fixed in 1.5.2 and 1.6.0.
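The mismatch and the idea behind the fix can be illustrated with a small toy model. This is a sketch only, not Spark's actual code: the case classes below are simplified, hypothetical stand-ins that reuse the names from the description, the cache is a plain list of key/value pairs, the cached value is a placeholder string, and the path "/tmp/t4" is made up for the example.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical, simplified stand-ins for the classes named above (not Spark's real classes).
final case class Attribute(name: String, exprId: Long)
final case class ParquetRelation(path: String)
final case class LogicalRelation(
    relation: ParquetRelation,
    expectedOutputAttributes: Seq[Attribute])

object CacheLookupSketch {
  private val nextId = new AtomicLong(0)

  // Every resolution assigns fresh expression IDs, so two resolutions of the
  // same table produce wrappers that are not equal to each other.
  def resolve(path: String, columns: Seq[String]): LogicalRelation =
    LogicalRelation(
      ParquetRelation(path),
      columns.map(c => Attribute(c, nextId.incrementAndGet())))

  // Problematic lookup: compares the whole wrapper, including the expression IDs.
  def lookupByWrapper(
      cache: Seq[(LogicalRelation, String)],
      query: LogicalRelation): Option[String] =
    cache.collectFirst { case (key, value) if key == query => value }

  // Idea behind the fix: compare only the underlying relation.
  def lookupByRelation(
      cache: Seq[(LogicalRelation, String)],
      query: LogicalRelation): Option[String] =
    cache.collectFirst { case (key, value) if key.relation == query.relation => value }
}

object CacheLookupDemo extends App {
  val columns = Seq("col1", "col2", "col3")
  // Caching the table stores the resolved wrapper as the key.
  val cache = Seq(CacheLookupSketch.resolve("/tmp/t4", columns) -> "InMemoryRelation")
  // A later query resolves the same table again and gets different expression IDs.
  val query = CacheLookupSketch.resolve("/tmp/t4", columns)

  println(CacheLookupSketch.lookupByWrapper(cache, query))  // cache miss
  println(CacheLookupSketch.lookupByRelation(cache, query)) // cache hit
}
```

In this model the first lookup misses because the two wrappers differ only in their expression IDs, while the second lookup hits because both wrap the same underlying relation.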
Bio: Xin Wu is an active contributor to Apache Spark with the IBM Spark Technology Center (STC). Xin’s main focus is the Spark SQL component. Prior to joining STC, he was a developer of Big SQL, a SQL-on-Hadoop engine by IBM.