Reading Netezza data as Apache Spark SQL data source

Curious about enabling analytic capabilities using Apache Spark for data in Netezza? The Spark SQL team at Spark Technology Center developed a Spark Connector for Netezza, published as part of The connector can access data in Netezza inside the Spark ecosystem and move data from Netezza tables to Spark. We’ll demonstrate how to define Netezza tables as a Spark data source using the package — and how to perform analytics.

Overview of the package

Spark SQL provides a Data Sources API to plug in external data sources. The Spark Netezza package is an implementation of this API that enables users to map Netezza tables as Spark SQL data sources. The package supports all the language interfaces supported by Spark.

You can deploy the package as part of an application program or from Spark tools such as spark-shell and spark-sql. To use the package in the application, specify it in your application’s build dependency. When using from Spark tools, add the package using the –packages command-line option. Be sure to also add the Netezza JDBC driver to the application dependencies.

For example, to use the Spark Netezza package with Spark’s interactive shell:

$SPARK_HOME/bin/spark-shell –packages –driver-class-path ~/nzjdbc.jar

Reading data from Netezza

You can define a data frame that corresponds to a Netezza table as shown below, and use it to join with other data frames defined on other data sources:

scala> val nzoptions = Map("url" -> "jdbc:netezza://hostname:5480/gosales",
 "user" -> "ugosales",
 "password" -> "gosales",
 "dbtable" -> "PRODUCT",
 "numPartitions" -> "8")

scala> val productdf ="").options(nzoptions).load()

scala> productdf.printSchema
 |– PRODUCTNUMBER: integer (nullable = false)
 |– INTRODUCTIONDATE: timestamp (nullable = false)
 |– PRODUCTNAME: string (nullable = false)
 |– PRODUCTTYPECODE: integer (nullable = false)
 |– PRODUCTIONCOST: double (nullable = false)
 |– MARGIN: double (nullable = false)
 |– PICTURE: string (nullable = true)
 |– PICTUREURL: string (nullable = true)
 |– DESCRIPTION: string (nullable = true)

scala> productdf.filter("PRODUCTIONCOST > 100").select("PRODUCTNAME")
 res15: org.apache.spark.sql.DataFrame = [PRODUCTNAME: string]

Saving data from Netezza into Spark

If you’re using the data in the Netezza table interactively to perform advanced analytics, it’s better to cache/persist the data in the Spark cluster to avoid fetching the same data repeatedly. You can save data frames as tables in Spark SQL or as parquet files.


 scala> productdf.saveAsTable("product")
 scalla>val highcostProducts = sqlContext.sql("select * from product where productioncost > 100")
 scala> highcostProducts.count()
 res17: Long = 23

Querying Netezza data source using Spark SQL CLI

You can also define a Netezza data source using SQL syntax with the Spark SQL CLI:

$SPARK_HOME/bin/spark-sql –packages –driver-class-path ~/nzjdbc.jar

 url ‘jdbc:netezza://hostname:5480/gosales’,
 user ‘ugosales’,
 password ‘gosales’,
 dbtable ‘PRODUCT’,
 numPartitions ‘8’

select * from product_table where productioncost > 100;

How data is transferred to Spark?

This library uses the external table mechanism to stream the data from Netezza system to Spark nodes. Table data slices are mapped to RDD partitions to be read in parallel. You can control the number of partitions to create for a given data source. The maximum number of partitions is the number of data slices of the Netezza table. If the number of partitions you specify is less than the data slices in Netezza, each RDD partition will be mapped to more than one data slice. The diagram depicts how data moves from a Netezza appliance to a Spark SQL data source.


We demonstrated how easy it is to transfer data from a Netezza appliance to Spark and to use the data stored in Netezza for big data analytics. The Spark Netezza data source library is an open source GitHub project. We invite you to report and fix any issues here.

About Author

Suresh Thalamati is an Advisory software engineer at the Spark Technology Center at IBM. He is an active Apache Spark contributor and works in the open source community. He is a Apache Derby committer and a PMC member. He is experienced in Relational Databases, Distributed Computing and Big Data Analytics with focus on Hadoop MapReduce technologies.


You Might Also Enjoy

Kevin Bates
Kevin Bates
2 months ago

Limit Notebook Resource Consumption by Culling Kernels

There’s no denying that data analytics is the next frontier on the computational landscape. Companies are scrambling to establish teams of data scientists to better understand their clientele and how best to evolve product solutions to the ebb and flow of today’s business ecosystem. With Apache Hadoop and Apache Spark entrenched as the analytic engine and coupled with a trial-and-error model to... Read More

Gidon Gershinsky
Gidon Gershinsky
4 months ago

How Alluxio is Accelerating Apache Spark Workloads

Alluxio is fast virtual storage for Big Data. Formerly known as Tachyon, it’s an open-source memory-centric virtual distributed storage system (yes, all that!), offering data access at memory speed and persistence to a reliable storage. This technology accelerates analytic workloads in certain scenarios, but doesn’t offer any performance benefits in other scenarios. The purpose of this blog is to... Read More