Key takeaways
Big Data Analytics with Spark, authored by Mohammed Guller, provides a practical guide to learning the Apache Spark framework for different types of big-data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. It covers Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, MLlib, and Spark ML.
Readers can learn how to perform data analysis using Spark's in-memory caching and advanced execution engine.
The author discusses how to use Spark as a unified platform for data processing tasks such as ETL pipelines, business intelligence, real-time data stream processing, graph analytics, and machine learning. He also covers related topics such as cluster managers and monitoring of Spark programs.
The book includes an introduction to other technologies and frameworks that are commonly used with Spark, such as distributed file systems (HDFS), data serialization formats (Avro, Parquet), distributed messaging (Kafka), NoSQL databases (Cassandra, HBase), and cluster managers (Mesos).
InfoQ spoke with Mohammed Guller about the book, the Spark framework, and tools for developers who are working on big data applications using Spark.
InfoQ: How do you define the Apache Spark framework, and how does it help with big data analytics projects and initiatives?
Guller: Apache Spark is a fast, easy-to-use, general-purpose cluster computing framework for processing large datasets. It gives you both scale and speed. More importantly, it has made it easy to perform a variety of data processing tasks on large datasets. It provides an integrated set of libraries for batch processing, ad hoc analysis, machine learning, stream processing, and graph analytics.
Data is growing almost exponentially. In addition, most of the data generated today is not structured, but either multi-structured or unstructured. Traditional tools such as relational databases cannot handle the volume, velocity or the variety of the data generated today. That is why you need frameworks such as Spark. It makes it easy to handle the three V's of big data. Another important thing to keep in mind is that organizations process or analyze data in different ways to get value out of it. Spark provides a single platform for different types of data processing and analytical tasks. You don't need to duplicate code or data, unlike special-purpose frameworks that do either only batch or stream processing.
InfoQ: Can you discuss what development and testing tools are available for developers to use when they are working on projects using Spark?
Guller: In general, developers can use whatever tools are available for the programming languages supported by Spark. Currently, Spark supports Scala, Java, Python and R.
Let's take Scala as an example. Spark comes prepackaged with an interactive shell known as spark-shell, which is based on the Scala REPL (Read-Evaluate-Print Loop). It provides a quick and easy way to get started with Spark. In addition, developers can use standard Scala IDEs such as Eclipse and IntelliJ IDEA. And if you are not a big fan of IDEs, you can write your code in your favorite text editor and compile it with SBT (Simple Build Tool).
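To make that concrete, here is a minimal sketch of a first spark-shell session (not taken from the book); the shell provides a preconfigured SparkContext named sc, and the file name used here is just an example:

```scala
// Start the shell from the Spark installation directory with: ./bin/spark-shell
// The shell creates a SparkContext and exposes it as `sc`.
val lines = sc.textFile("README.md")               // load a text file as an RDD
val sparkLines = lines.filter(_.contains("Spark")) // keep only lines that mention Spark
println(sparkLines.count())                        // trigger the computation and print the count
```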
InfoQ: What are some best practices you suggest to new developers who are just learning to use the Spark framework?
Guller: The best way to learn Spark is to experiment with it and write code using the Spark API. The concepts become much clearer when you write and execute code. This is generally true for learning any new language or tool.
Even though Spark is a big data processing framework, you don't necessarily need a big cluster or large datasets to learn Spark. You can run Spark on your laptop with a small dataset and get comfortable with the API and the various libraries provided by Spark. My book has a chapter on how to get started easily with Spark.
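As an illustration of that point (a sketch, not code from the book), a self-contained application can run Spark in local mode on a laptop by setting the master to local[*]; the object name and dataset below are made up for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalExample {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM, using all available cores.
    val conf = new SparkConf().setAppName("LocalExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A tiny in-memory dataset is enough to get comfortable with the API.
    val numbers = sc.parallelize(1 to 100)
    val sumOfSquares = numbers.map(n => n * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}
```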
InfoQ: How do you compare the different programming languages Spark currently supports: Scala, Java, Python, and R? If new developers have to choose a language, which one do you recommend?
Guller: Spark itself is written in Scala. So, historically, Scala was a first-class citizen and support for other languages lagged behind a little bit. However, that gap is shrinking with every new release of Spark. Similarly, Spark applications written in Scala used to have a speed advantage over applications written in Python. However, with all the new optimizations Spark provides under the hood, the speed difference has also narrowed.
I personally like Scala, because it increases your productivity and enables you to write concise and better quality code. It rekindled my love for programming.
Having said that, a developer can use whatever language they are most comfortable with. So if you are a Python guru, use Python. You don't need to switch or learn a new language as long as you know one of the languages supported by Spark.
If you want to learn a new language and get optimal performance, I recommend Scala. That is the reason I included a chapter on functional programming and Scala in my book.
InfoQ: What is the best way to set up a Spark cluster on a local machine or in the cloud?
Guller: Spark provides a script, spark-ec2, for setting up a Spark cluster on Amazon AWS. This script allows you to launch, manage and shut down a Spark cluster on Amazon cloud. It installs both Spark and HDFS. It is a pretty flexible script with a number of input arguments, allowing you to create custom clusters for your specific processing needs and budget.
InfoQ: Can you talk about real-time streaming data analytics using the Spark Streaming library?
Guller: The Spark Streaming library extends Spark for stream processing. It provides operators for analyzing stream data in near real-time. It uses a micro-batching architecture: essentially, a stream of data is split into micro-batches, with the batch interval specified by the developer. Each micro-batch is represented by an RDD (Resilient Distributed Dataset), which is Spark's primary data abstraction.
The micro-batching architecture has both advantages and disadvantages. On the plus side, it provides high throughput, so Spark Streaming is great for performing analytics on stream data. However, if your application needs to process each event in a stream individually with very low-latency (milliseconds) requirements, Spark Streaming may not be a good fit.
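To make the micro-batching model concrete, here is a minimal Spark Streaming sketch in Scala (not from the book); it assumes a text stream on localhost port 9999, for example one produced with nc -lk 9999, and uses a 10-second batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")

    // Every 10 seconds of input becomes one micro-batch, i.e. one RDD.
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)   // assumed example source
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```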
InfoQ: What are the considerations for performance and tuning of Spark programs?
Guller: This is a vast topic, since Spark provides many knobs for performance tuning. I will discuss some of the important things to keep in mind.
First, for most data processing applications, disk I/O is a big contributor to application execution time. Since Spark allows you to cache data in memory, take advantage of that capability whenever you can. Caching data in memory can speed up your application by up to 100 times. Obviously, this also means it is better to set up your Spark cluster with machines that have large amounts of memory.
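For example (a sketch with a hypothetical input path), caching an RDD means the expensive read from disk happens only once:

```scala
// Hypothetical HDFS path, used only for illustration.
val logs = sc.textFile("hdfs:///data/app-logs")
logs.cache()                                               // keep the RDD in memory once computed

val errorCount = logs.filter(_.contains("ERROR")).count()  // first action: reads from disk, then caches
val warnCount  = logs.filter(_.contains("WARN")).count()   // second action: served from memory
```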
Second, avoid operators that require data shuffling. Shuffling data across a network is an expensive operation. Keep this in mind when you write your data processing logic. Sometimes the same logic can be implemented with a more efficient operator. For example, instead of the groupByKey operator, use the reduceByKey operator.
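The following sketch shows the difference; both compute per-key sums, but reduceByKey combines values locally on each partition before shuffling, so far less data crosses the network:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// groupByKey ships every individual value across the network before summing.
val sumsWithGroupByKey = pairs.groupByKey().mapValues(_.sum)

// reduceByKey pre-aggregates on each partition, shuffling only partial sums.
val sumsWithReduceByKey = pairs.reduceByKey(_ + _)
```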
Third, optimize the number of partitions in your data. If your data is not correctly partitioned, you are not taking advantage of the data parallelism provided by Spark. For example, assume you have a Spark cluster with 100 cores. But if your data has only 2 partitions, you are underutilizing your compute power.
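A quick way to check and adjust this (sketched here with a hypothetical input) is to inspect an RDD's partition count and repartition it to match the available cores:

```scala
val events = sc.textFile("hdfs:///data/events")   // hypothetical input path
println(events.partitions.length)                 // how many partitions do we actually have?

// Repartition so the work can be spread across, say, 100 cores.
// repartition performs a full shuffle; coalesce can reduce partitions without one.
val balanced = events.repartition(100)
```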
Fourth, co-locate data nodes with compute nodes for optimal performance. For example, if your data is in HDFS, install Spark on the same HDFS cluster. Spark will execute data processing code as close to the data as possible: it will first try to execute a task on the same machine where the data is located; if it cannot execute the task on that machine, it will try to find a machine on the same rack; and if that is not possible, it will use any machine. The goal is to minimize both disk and network I/O.
These are some of the common performance related things to keep in mind.
InfoQ: What is the current support for securing Spark programs so that only authorized users or applications are able to execute them?
Guller: Spark supports two methods of authentication: shared secret and Kerberos. The shared secret authentication mechanism can be used with all the cluster managers: YARN, Mesos and Standalone. In addition, YARN allows you to use Kerberos with Spark.
Spark also supports encryption using SSL and SASL. SSL is used for securing communication protocols, while SASL is used for securing the block transfer service.
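As a rough sketch of what shared-secret authentication with SASL encryption looks like in code (the property names follow Spark's security configuration; the secret value is a placeholder), the relevant settings can be supplied through SparkConf or spark-defaults.conf:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.authenticate", "true")                        // require a shared secret
  .set("spark.authenticate.secret", "replace-with-secret")  // placeholder value
  .set("spark.authenticate.enableSaslEncryption", "true")   // encrypt the block transfer service
```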
InfoQ: How do you monitor Spark programs using the Spark web console and other tools? What metrics do you typically measure in Spark programs?
Guller: Spark provides comprehensive monitoring capabilities. My book has a complete chapter on this topic. Spark not only exposes a wealth of metrics, but also provides a web-based UI for monitoring both a Spark cluster and the applications running on it. In addition, it supports third-party monitoring tools such as Graphite, Ganglia, and JMX-based monitoring applications.
I use monitoring for both performance optimization and debugging. The specific metrics that I review depend on the problem that I am trying to solve. For example, you can use the monitoring UI to check the state of your cluster and the allocation of resources amongst your applications. Similarly, you can use the monitoring UI to see the amount of parallelism within the jobs submitted by your application. You can also check the amount of data processed by different tasks and the time taken, which can help you find straggler tasks. These are just a few examples.
InfoQ: What are some new features you would like to see added in future releases of the Spark framework?
Guller: The Spark developer community has done an amazing job in enhancing Spark with every new release. So my wish list is not big. Most of the new features that I would like to see are related to machine learning.
One thing that I feel is missing from Spark is a graphing or data plotting library for Scala developers. Exploratory visualization is a critical part of data analysis. R developers can use ggplot2. Python has matplotlib. It would be good to have something similar for Scala developers.
Another thing that I would like to see is Spark's statistical and machine learning library reaching parity with what R provides. Finally, I would like to see better support for exporting and importing machine learning models using standards such as PMML and PFA.
InfoQ: Spark Machine Learning currently provides several different algorithms. Do you see any other ML libraries that may add value to the machine learning and data science needs of organizations?
Guller: You are right in that Spark's machine learning library comes with a rich set of algorithms. In addition, new algorithms are added in every release.
Spark can be used with external machine learning libraries, so whatever capabilities Spark is missing, those gaps can be filled with other libraries. For example, the Stanford CoreNLP library can be used with Spark for NLP-heavy machine learning tasks. Similarly, SparkNet, CaffeOnSpark, DeepLearning4J, or TensorFlow can be used with Spark for deep learning.
Guller also talked about the value the Spark framework brings to the table.
Guller: Spark is a great framework for analyzing and processing big data. It is easy to use and provides a rich set of libraries for a variety of tasks. Plus, it provides scale and speed for processing really large datasets. Anyone working with Big Data or eager to get into the Big Data space should definitely learn it.
He also said he gets questions from a lot of people about the relationship between Hadoop and Spark, and he responded to the following two questions that come up from time to time.
InfoQ: Will Spark replace Hadoop?
Guller: The short answer is no. Today, Hadoop represents an ecosystem of products, and Spark is a part of that ecosystem. Even core Hadoop consists of three components: a cluster manager, a distributed compute framework, and a distributed file system. YARN is the cluster manager, MapReduce is the compute framework, and HDFS is the distributed file system. Spark is a successor to the MapReduce component of Hadoop.
Many people are either replacing existing MapReduce jobs with Spark jobs or writing new jobs in Spark. So you can say that Spark is replacing MapReduce, but not Hadoop.
Another important thing to keep in mind is that Spark can be used with Hadoop, but it can also be used without Hadoop. For example, you can use Mesos or the Standalone cluster manager instead of YARN. Similarly, you can use S3 or other data sources instead of HDFS. So you don't need to install Hadoop to use Spark.
InfoQ: Why are people replacing MapReduce with Spark?
Guller: Spark offers many advantages over MapReduce.
First, Spark is much faster than MapReduce. Depending on the application, it can be up to 100 times faster than MapReduce. One reason Spark is fast is its advanced job execution engine; Spark jobs can have any number of stages, unlike MapReduce jobs, which always have two stages. In addition, Spark allows applications to cache data in memory. Caching tremendously improves application execution time. Disk I/O is a significant contributor to execution time for data processing applications, and Spark allows you to minimize disk I/O.
Second, Spark is easy to use. Spark provides a rich, expressive API with 80+ operators, whereas MapReduce provides only two operators: Map and Reduce. The Spark API is available in four languages: Scala, Python, Java, and R. You can write the same data processing job in Scala/Spark using 5x-10x less code than you would have to write in MapReduce. Thus Spark also significantly improves developer productivity.
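The classic word count illustrates the difference; the complete Spark job below is a handful of lines (the input and output paths are hypothetical), whereas the equivalent MapReduce program typically runs to dozens of lines of Java:

```scala
val counts = sc.textFile("hdfs:///data/input")    // hypothetical input path
  .flatMap(_.split("\\s+"))                       // split lines into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs:///data/output")      // hypothetical output path
```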
Third, Spark offers a single tool kit for a variety of data processing tasks. It comes prepackaged with integrated libraries for doing batch processing, interactive analytics, machine learning, stream processing, and graph analytics. So you don't need to learn multiple tools. In addition, you don't need to duplicate code and data in multiple places. Operationally, it is also easier to manage one cluster instead of multiple special-purpose clusters for different types of jobs.
About the Book Author
Mohammed Guller is the principal architect at Glassbeam, where he leads the development of advanced and predictive analytics products. Over the last 20 years, Mohammed has successfully led the development of several innovative technology products from concept to release. Prior to joining Glassbeam, he was the founder of TrustRecs.com, which he started after working at IBM for five years. Before IBM, he worked at a number of hi-tech start-ups, leading new product development. Mohammed has a master's degree in business administration from the University of California, Berkeley, and a master's degree in computer applications from RCC, Gujarat University, India.