Curious about enabling analytic capabilities using Apache Spark for data in Netezza? The Spark SQL team at Spark Technology Center developed a Spark Connector for Netezza, published as part of spark-packages.org. The connector can access data in Netezza inside the Spark ecosystem and move data from Netezza tables to Spark. We’ll demonstrate how to define Netezza tables as a Spark data source using the package — and how to perform analytics.
Overview of the package
Spark SQL provides a Data Sources API to plug in external data sources. The Spark Netezza package is an implementation of this API that enables users to map Netezza tables as Spark SQL data sources. The package supports all the language interfaces supported by Spark.
You can deploy the package as part of an application program or from Spark tools such as spark-shell and spark-sql. To use the package in the application, specify it in your application’s build dependency. When using from Spark tools, add the package using the –packages command-line option. Be sure to also add the Netezza JDBC driver to the application dependencies.
For example, to use the Spark Netezza package with Spark’s interactive shell:
$SPARK_HOME/bin/spark-shell –packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 –driver-class-path ~/nzjdbc.jar
Reading data from Netezza*
You can define a data frame that corresponds to a Netezza table as shown below, and use it to join with other data frames defined on other data sources:
scala> val nzoptions = Map("url" -> "jdbc:netezza://hostname:5480/gosales", "user" -> "ugosales", "password" -> "gosales", "dbtable" -> "PRODUCT", "numPartitions" -> "8") scala> val productdf = sqlContext.read.format("com.ibm.spark.netezza").options(nzoptions).load() scala> productdf.printSchema root |– PRODUCTNUMBER: integer (nullable = false) |– INTRODUCTIONDATE: timestamp (nullable = false) |– PRODUCTNAME: string (nullable = false) |– PRODUCTTYPECODE: integer (nullable = false) |– PRODUCTIONCOST: double (nullable = false) |– MARGIN: double (nullable = false) |– PICTURE: string (nullable = true) |– PICTUREURL: string (nullable = true) |– DESCRIPTION: string (nullable = true) scala> productdf.filter("PRODUCTIONCOST > 100").select("PRODUCTNAME") res15: org.apache.spark.sql.DataFrame = [PRODUCTNAME: string] [/code]
Saving data from Netezza into Spark*
If you’re using the data in the Netezza table interactively to perform advanced analytics, it’s better to cache/persist the data in the Spark cluster to avoid fetching the same data repeatedly. You can save data frames as tables in Spark SQL or as parquet files.
productdf.cache() scala> productdf.saveAsTable("product") scalla>val highcostProducts = sqlContext.sql("select * from product where productioncost > 100") scala> highcostProducts.count() res17: Long = 23
Querying Netezza data source using Spark SQL CLI*
You can also define a Netezza data source using SQL syntax with the Spark SQL CLI:
$SPARK_HOME/bin/spark-sql –packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 –driver-class-path ~/nzjdbc.jar CREATE TEMPORARY TABLE product_table USING com.ibm.spark.netezza OPTIONS ( url ‘jdbc:netezza://hostname:5480/gosales’, user ‘ugosales’, password ‘gosales’, dbtable ‘PRODUCT’, numPartitions ‘8’ ) select * from product_table where productioncost > 100;
How data is transferred to Spark?
This library uses the external table mechanism to stream the data from Netezza system to Spark nodes. Table data slices are mapped to RDD partitions to be read in parallel. You can control the number of partitions to create for a given data source. The maximum number of partitions is the number of data slices of the Netezza table. If the number of partitions you specify is less than the data slices in Netezza, each RDD partition will be mapped to more than one data slice. The diagram depicts how data moves from a Netezza appliance to a Spark SQL data source.
We demonstrated how easy it is to transfer data from a Netezza appliance to Spark and to use the data stored in Netezza for big data analytics. The Spark Netezza data source library is an open source GitHub project. We invite you to report and fix any issues here.
Suresh Thalamati is an Advisory software engineer at the Spark Technology Center at IBM. He is an active Apache Spark contributor and works in the open source community. He is a Apache Derby committer and a PMC member. He is experienced in Relational Databases, Distributed Computing and Big Data Analytics with focus on Hadoop MapReduce technologies.