0 to Life-Changing App: Scala First Steps and an Interview with Jakob Odersky

Scala! The language that evokes extreme differences in opinion.

Being new to Silicon Valley, I have only recently come across the very strong opinions of developers. Whether it be spaces versus tabs or Scala versus Python, people definitely feel strongly one way or the other.

So whether you love Scala for its brevity and concise nature or whether you hate it for being different, the fact is, Scala is very important for Spark, and after all, this is the Spark Technology Center. That is why this week I am giving you some context around Scala and a means to get you started, you know, before we move forward with our life-changing app and generally saving the world.

Not sure what I'm talking about or what I'm doing? Look here and here and here.

For those of you who are new to Spark or Scala, what you might not know is that although Spark has a shell available in Scala and Python and supports Scala, Java, Python, and R, Scala has an advantage. Spark is written in the Scala programming language and runs on the Java Virtual Machine (JVM). This means that Scala has more capabilities on Spark than the PySpark alternative. (How big this difference is depends on who you ask. Again, lots of opinions!) Not only this, but Scala inherently allows you to write more succinct code, which is great for working with big data.

To understand Scala even better, I sat down with Jakob Odersky, a real-life, bona fide Scala expert, to ask him a few serious Scala questions.

Why is Scala important for Spark?

Spark's core APIs are implemented in Scala; it is the lingua franca of the engine. I would also suggest that Scala's features, specifically its conciseness combined with type safety, make it ideal for implementing any kind of collection framework, which, if you think about it, Spark really is at its highest level of abstraction.
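To make the "conciseness combined with type safety" point a bit more concrete, here is a minimal sketch that uses only Scala's standard collections, not Spark itself. The names `Reading`, `readings`, and `totalsBySensor` are made up for illustration and do not come from the interview.

```scala
// A minimal sketch using plain Scala collections (no Spark required).
// `Reading`, `readings`, and `totalsBySensor` are hypothetical names.
object CollectionSketch {
  final case class Reading(sensor: String, value: Double)

  val readings: List[Reading] = List(
    Reading("a", 1.5),
    Reading("b", 4.0),
    Reading("a", 2.5),
    Reading("b", 6.0)
  )

  // Group the readings by sensor and sum each group in one typed pipeline.
  // A typo such as `_.valeu`, or summing a String field, fails at compile time.
  def totalsBySensor(rs: List[Reading]): Map[String, Double] =
    rs.groupBy(_.sensor).map { case (sensor, group) =>
      sensor -> group.map(_.value).sum
    }

  def main(args: Array[String]): Unit =
    println(totalsBySensor(readings))
}
```

The whole aggregation is one expression, and the compiler still checks every field access and every type along the way.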

Does it perform differently than Python?

Python is generally an interpreted language and therefore runs slower than Scala, which is compiled to Java bytecode and can run on the heavily optimized Java Virtual Machine. In the original Spark APIs, where you write a sequence of operations on RDDs, the difference in performance is quite noticeable. However, the newer Spark APIs (Datasets, DataFrames, etc.) are opaque, in that they hide operation details and let you specify "what" you want rather than "how" you want it. This enables them to apply further optimization and expose a uniform entry point to all languages, thus making performance differences negligible (if you require only the functionality provided by the newer APIs).

What do you like most about Scala?

There are a couple of things I like about the language. Its type system is incredibly complete, yet it doesn't get in your way of writing elegant and concise code. I would say that my favorite feature is its simplicity relative to its expressivity: the language itself offers few, yet extremely powerful, constructs, allowing you to build libraries that feel "native" or "built-in", yet are just implemented with regular features offered by Scala to anyone.
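As a rough illustration of libraries that feel "built-in", here is a minimal sketch using two regular Scala features: a by-name parameter in a second parameter list, and an implicit class. The helper `repeat` and the method `squared` are hypothetical names invented for this example; neither is part of the standard library.

```scala
// A minimal sketch; `repeat` and `squared` are hypothetical, not standard library APIs.
object BuiltInFeelSketch {
  // The second parameter list takes a by-name block, so callers can write
  // repeat(3) { ... } and it reads like a language keyword.
  def repeat(times: Int)(body: => Unit): Unit = {
    var i = 0
    while (i < times) {
      body
      i += 1
    }
  }

  // An implicit class adds a method to an existing type (here, Int).
  implicit class IntOps(n: Int) {
    def squared: Int = n * n
  }

  def main(args: Array[String]): Unit = {
    var count = 0
    repeat(3) {
      count += 1
    }
    println(count)        // prints 3

    val four = 4
    println(four.squared) // prints 16
  }
}
```

Both `repeat(3) { ... }` and `four.squared` look like language features at the call site, but they are ordinary definitions anyone can write.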

Why is it relevant to big data and SystemML?

Making so-called "big data" accessible from easy-to-use abstractions is essential for fast and productive analysis. Scala makes it very simple to write domain-specific languages that can leverage analytics engines such as SystemML but offer a low-barrier entry point to anyone. Furthermore, it is also possible to use Scala in an interpreter, making it a natural choice to integrate into data science notebooks [like Jupyter and Zeppelin]. This in turn makes it possible to rapidly explore data and, with all the benefits of the language's safety and expressivity, also makes it a fun experience!

Do you have any resources you would recommend for new developers and data scientists?

My recommendation would be to check out the first weeks of some online courses, just to get a basic understanding of the language. As a beginner, you are extremely susceptible to either liking or hating a topic depending on the way you learn it, so a good source is essential. There is no need to follow the whole program, however; just a few hours should give you a solid foundation to continue on your own. If you already have some knowledge of Java, I would also recommend reading Cay Horstmann's book "Scala for the Impatient".

Now that you have the context, below is a basic tutorial on how to get going with Scala.

Quick Note: going beyond this cheat sheet is essential. I definitely recommend reading the book 'Atomic Scala' by Bruce Eckel and Dianne Marsh to understand the basics of Scala syntax once you have your shell or REPL up and running.

Assuming you followed my first blog, you should have already downloaded Spark and set Spark Home in your bash profile. If you haven't, then do this before you try to enter the spark shell in the step below. Make sure to also set your path! My Scala and Spark are together in the following example.

First, make sure Java is installed.

//In your terminal type:
java -version  
//Update if needed
//Or install if needed
brew tap caskroom/cask  
brew install Caskroom/cask/java  

Update or install Scala.

//check what version of scala you have installed
brew which scala  
//If you want to switch versions type this:
brew switch scala 2.9.2  
brew switch scala 2.10.0  
//If you need to install scala
brew install scala  

Set Scala Home and put Scala in your path.

//Pay attention to where you saved Scala!
//Go to your bash profile.
vi ~/.bash_profile  
//Type i for insert.
//Now set Scala Home and put it in your path.
export SCALA_HOME=/Users/stc/scala  
/*Notice my Scala Home and Spark Home are on the same line of code for my path.*/
export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin  
//Now write and quit the changes: press Esc, then type
:wq  

Load the changes you made in your bash profile.

source ~/.bash_profile  

Now you can load the REPL (Read-Evaluate-Print-Loop) or the Spark Shell to work in Scala.

//To load the REPL just type while in your terminal:
scala  
/*If you saved Scala Home and put it in your path it should work */
//For the spark-shell, type:
spark-shell  
//The scala> prompt should now be showing.
//If it's not, double check your .bash_profile

You're ready to start experimenting!

//Try setting some variables and running simple math.
scala> val a = 15  
scala> val b = 15.15  
scala> a * b  
//should return:
res0: Double = 227.25  
//Double means a fractional number. 
//An Int means a whole number.
//Knowing this, you could rewrite the above code as:
scala> val a: Int = 15  
scala> val b: Double = 15.15  
/*Just remember that val is immutable and var is mutable. Immutable means that once a val is set it can't be reassigned, so if you want a different value, you create a new one. Mutable means a var can be reassigned in place. Be careful using mutable values if you're working with others. This can make it very difficult for everyone to be on the same page at the same time.*/
//You can also print your first line.
scala> println("What up Scala coder?")  
//If you're ready to exit, type:
scala> :quit  
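Before you exit, the val versus var distinction is worth trying out for yourself. Below is a minimal sketch; the object name `ValVarSketch` and the helper `countTo` are made up for illustration. In the REPL you can enter a multi-line block like this with `:paste`.

```scala
// A minimal sketch; the object and member names here are hypothetical.
object ValVarSketch {
  val a = 15            // type inferred as Int
  val b: Double = 15.15 // explicit type annotation

  // a = 16             // would not compile: reassignment to val

  // "Changing" a val really means building a new value from it.
  val doubled: Int = a * 2

  // A var can be reassigned in place.
  def countTo(limit: Int): Int = {
    var counter = 0
    while (counter < limit) counter += 1
    counter
  }

  def main(args: Array[String]): Unit = {
    println(doubled)    // prints 30
    println(countTo(3)) // prints 3
  }
}
```

Keeping the var local to `countTo` means nothing outside the function can see it change, which is one common way to limit the risks of mutable state.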

Now you are ready to use Scala in the Spark shell! Before we move forward with our life-changing app, I'd recommend viewing some tutorials or reading one of the recommended books. Knowledge of Scala will be super helpful as we move forward with saving the world!

Stay tuned for our next step!

By Madison J. Myers

