Scala for Data Engineering: Harnessing the Power of Functional Programming

You want to write a program that handles data. Which language should you choose?

Introduction

In the dynamic world of data engineering, where processing and managing vast amounts of data have become paramount, programming languages that offer flexibility, efficiency, and scalability are highly sought after. Enter Scala - a powerful and versatile language that has been gaining traction in the data science community for its exceptional capabilities.

When choosing a programming language for a program that handles data, you have several options. You might choose a dynamic language such as Python or R, or a more traditional object-oriented language such as Java.

In this post, we will explore how Scala differs from these languages and when it might make sense to use it.

Why Scala?

Scala, a statically typed programming language, has been steadily gaining popularity in the data engineering community thanks to a unique combination of features that make it well suited to data-intensive tasks.

In the next sections, we examine how Scala compares to these languages in the field of data science.

Static typing and type inference

Scala's static typing system offers remarkable versatility: a significant amount of information about the program's behavior can be encoded in types and verified by the compiler before the program ever runs. This guarantees a certain level of correctness and is especially valuable for rarely used code paths. Dynamic languages, by contrast, only surface an error when the branch containing it actually executes, so bugs in seldom-exercised paths can persist undetected.
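
As a minimal sketch of the idea, consider a function whose argument types are checked at compile time (parseAge is an illustrative name):

def parseAge(raw: String): Int = raw.toInt

parseAge("42")  // OK: returns 42
// parseAge(42) // Rejected by the compiler: found Int, required String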

One common criticism of statically typed object-oriented languages, like Java, is their verbosity. For instance, when initializing an instance of an Example class in Java, the class name is repeated twice: once to declare the compile-time type of the variable and once to construct the instance.

Scala leverages type inference, enabling the compiler to deduce the type of a variable from the instance assigned to it. As a result, Scala code is more concise and readable without compromising type safety. By specifying only the argument and return types of a function, you let the compiler infer the types of all variables within the function's body. This elegant approach to type inference significantly streamlines code development, making Scala a compelling choice for data engineers and programmers.
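
A small sketch makes the contrast concrete (Example is a hypothetical class):

// Java: the type appears twice.
// Example example = new Example();

// Scala: the compiler infers the type from the right-hand side.
class Example
val example = new Example  // example: Example

def double(x: Int): Int = x * 2
val result = double(21)    // result is inferred to be Int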

Scala encourages immutability

Scala promotes the adoption of immutable objects, making it effortless to define attributes as immutable.
For instance:

val amountSpent = 500

Additionally, the default collections in Scala are immutable, as demonstrated with the List:

val clientIds = List("123", "456") // List is immutable
clientIds(1) = "589" // Compile-time error
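
Rather than mutating a list in place, the idiomatic approach is to derive a new list and leave the original untouched:

val updatedIds = clientIds.updated(1, "589") // List("123", "589")
// clientIds itself is unchanged: List("123", "456")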

Embracing immutability eliminates a common source of bugs. When an object cannot change after construction, the places where a bug can hide shrink from the object's entire lifetime to its constructor, leading to more robust and predictable code.

Scala and functional programming

Scala strongly encourages functional programming, which involves using higher-order functions to transform collections. As a programmer, you don't have to worry about the intricate details of iterating over the collection.
Let's take a look at an example of an "occurrencesOf" function in Scala:

def occurrencesOf[A](elem: A, collection: List[A]): List[Int] = {
  for {
    (currentElem, index) <- collection.zipWithIndex
    if (currentElem == elem)
  } yield index
}

In this Scala code, collection.zipWithIndex builds a new list of pairs, each containing an element of the original collection and its index. The for-comprehension then iterates over these pairs, binding currentElem to the current element and index to its position, and yields the indexes at which currentElem equals elem.
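
For example:

occurrencesOf(2, List(1, 2, 3, 2, 2)) // List(1, 3, 4)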

The equivalent Java code for the same functionality looks like this:

static <T> List<Integer> occurrencesOf(T elem, List<T> collection) {
  List<Integer> occurrences = new ArrayList<>();
  for (int i = 0; i < collection.size(); i++) {
    if (collection.get(i).equals(elem)) {
      occurrences.add(i);
    }
  }
  return occurrences;
}

In Java, we start by defining a mutable list to store occurrences as we find them. We then iterate over the collection using a counter and check each element to see if it matches "elem." If it does, we add its index to the list of occurrences. This Java code requires managing more moving parts, and the logic of the function is somewhat obscured by the iteration mechanism.

It's important to note that this comparison is not meant to criticize Java; in fact, Java 8 introduced functional constructs like lambda expressions and stream processing. However, it highlights the benefits of functional approaches in Scala, which minimize the potential for errors and improve code clarity, making it easier to work with collections and focus on the core logic of the function.

Null pointer uncertainty

In many scenarios, representing the possible absence of a value becomes necessary. For example, consider a case where we read a list of usernames from a CSV file, and some users have opted not to provide their email addresses. In Java, this absence of email information is often denoted by setting the reference to null, while in Python, None is used.

However, this approach can be risky as it does not explicitly encode the possibility of a value's absence. Determining whether an instance attribute can be null becomes cumbersome in larger programs, leading to potential issues if not handled carefully. Scala, inspired by functional languages, addresses this concern by introducing the Option[T] type to represent attributes that might be absent.

In Scala, we can achieve this by writing:

class User {
  ...
  val email: Option[Email]
  ...
}

By utilizing Option[Email], we clearly convey to programmers using the User class that email information may be absent. The compiler also becomes aware of this possibility, prompting us to handle the situation explicitly rather than risking null pointer exceptions at runtime.
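
A minimal sketch (with Email as a stand-in type) shows how the compiler pushes us to handle the absent case:

case class Email(address: String)

val maybeEmail: Option[Email] = None

// Pattern matching makes both cases explicit:
val display = maybeEmail match {
  case Some(email) => email.address
  case None        => "no email provided"
}

// Equivalently, using Option's higher-order methods:
val display2 = maybeEmail.map(_.address).getOrElse("no email provided")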

By eliminating the use of null, we can achieve a higher level of provable correctness and mitigate null-related issues. In languages without Option[T], developers often resort to writing unit tests on the client code to ensure correct behavior when dealing with null attributes.

It's worth noting that similar functionality can be achieved in Java using external libraries like Google's Guava library or the Optional class in Java 8. However, the convention of using null to indicate the absence of a value has long been ingrained in Java. In contrast, Scala embraces Option[T], offering a more natural and safer way to handle optional values.

Easier parallelism

Developing programs that exploit parallel architectures is challenging, yet it is essential for many data science problems. Parallel programming is hard because we naturally tend to think sequentially, and reasoning about the possible orderings of events in a concurrent program quickly becomes complex.

Scala addresses these challenges by providing several abstractions that facilitate the creation of parallel code. These abstractions impose constraints on the approach to parallelism. For example, parallel collections require computations to be expressed as a sequence of operations, such as map, reduce, and filter, on collections. Actor systems encourage thinking in terms of encapsulated actors that communicate through message passing.

Restricting the programmer's freedom to write parallel code however they like may seem paradoxical, but it actually makes the program's behavior easier to reason about. For instance, if an actor misbehaves, the problem lies either in the actor's code or in one of the messages it receives.

To illustrate the power of coherent, restrictive abstractions, let's solve a probability problem using parallel collections in Scala. We aim to estimate the probability of getting at least 60 heads out of 100 coin tosses using a Monte Carlo simulation. By running the experiment many times and aggregating the results, we can compute this estimate, and parallel collections let us spread the trials across multiple CPUs with minimal effort, as the sketch below shows.
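
Here is a minimal sketch of that simulation. Note that from Scala 2.13 onwards the .par method lives in the separate scala-parallel-collections module; on earlier versions the import below is unnecessary:

import scala.util.Random
// Required for .par on Scala 2.13+ (scala-parallel-collections module):
import scala.collection.parallel.CollectionConverters._

// Count the heads observed in a single trial of nTosses coin flips.
def countHeads(nTosses: Int): Int =
  (1 to nTosses).count(_ => Random.nextBoolean())

val nTrials = 100000

// .par spreads the trials across the available CPUs.
val nSuccesses = (1 to nTrials).par.count(_ => countHeads(100) >= 60)

val probability = nSuccesses.toDouble / nTrials
println(f"P(at least 60 heads in 100 tosses) = $probability%.4f")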

While not all problems are as straightforward to parallelize as the Monte Carlo example, Scala's rich set of intuitive abstractions makes writing parallel applications more manageable, providing an effective way to leverage parallel architectures and improve the performance of data science tasks.

Interoperability with Java

Scala is built to run on the Java Virtual Machine (JVM), and its compiler translates Scala programs into Java bytecode. This compatibility allows Scala developers to seamlessly utilize Java libraries within their Scala code. Considering the vast number of Java applications, both open-source and in legacy systems, this interoperability between Scala and Java has significantly contributed to Scala's widespread adoption.
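
For instance, classes from the Java standard library can be used from Scala as if they were native Scala APIs:

import java.time.LocalDate
import java.util.UUID

// Plain Java classes, called directly from Scala.
val today = LocalDate.now()
val requestId = UUID.randomUUID().toString
println(s"$today / $requestId")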

Moreover, the interoperability between Scala and Java is not limited to one direction. Some Scala libraries, like the Play framework, have gained popularity among Java developers as well, indicating the bidirectional nature of this compatibility. This mutual interaction between the two languages fosters a thriving ecosystem and encourages developers from both communities to explore and leverage each other's tools and frameworks.

When not to use Scala

When considering whether to use Scala for your next project, there are certain factors to take into account. While Scala's strong type system, preference for immutability, functional capabilities, and parallelism abstractions make it an excellent choice for writing reliable programs and minimizing unexpected behavior, there are some reasons why you might decide against it.

One crucial consideration is familiarity. Scala introduces various concepts, such as implicits, type classes, and composition using traits, that may be unfamiliar to programmers coming from an object-oriented background. Mastering Scala's expressive type system and harnessing its full power takes time and requires adapting to a new programming paradigm. Additionally, working with immutable data structures can feel foreign to those coming from languages like Java or Python.

However, with time and effort, these drawbacks can be overcome. Nevertheless, Scala does fall short in terms of library availability compared to other data science languages. For data exploration, the IPython Notebook coupled with matplotlib remains unparalleled. Although there are ongoing efforts to provide similar functionality in Scala (like Spark Notebooks or Apache Zeppelin), these projects may not have reached the same level of maturity.

Considering the above, in the author's biased opinion, Scala shines when used for more permanent programs. If you're writing a quick throwaway script or primarily focusing on data exploration, you might find Python better suited for the task. However, for projects that require reusability and a higher level of provable correctness, Scala proves to be an extremely powerful and beneficial choice. Ultimately, the decision on whether to use Scala will depend on your project's specific requirements and your team's familiarity with the language and its ecosystem.

Conclusion

In conclusion, Scala presents a compelling option for developers seeking a robust and functional language, offering the potential to build cutting-edge applications and tackle complex data science challenges. Whether you choose Scala for its expressiveness and parallel processing capabilities or opt for another language based on familiarity and immediate project needs, making an informed decision will ultimately lead to successful and impactful software development endeavors.