Learn how to use PySpark in under 5 minutes (Installation + Tutorial)

Apache Spark is one of the hottest and largest open source projects in data processing, with rich high-level APIs for programming languages like Scala, Python, Java and R. It realizes the potential of bringing together Big Data and machine learning.



By Georgios Drakos, Data Scientist at TUI

I’ve found that it is a little difficult for most people to get started with Apache Spark (this guide will focus on PySpark) and install it on their local machines. With this simple tutorial you’ll get there really fast!

Apache Spark is a must for Big Data lovers, as it is a fast, easy-to-use general engine for big data processing with built-in modules for streaming, SQL, machine learning and graph processing. This technology is an in-demand skill for data engineers, but data scientists can also benefit from learning Spark when doing Exploratory Data Analysis (EDA), feature extraction and, of course, ML. But please remember that Spark’s potential is only truly realized when it is run on a cluster with a large number of nodes.

 

Table of Contents

 

  • Introduction
  • Spark definition
  • Spark Application
  • Install PySpark on Mac
  • Open Jupyter Notebook with PySpark
  • Launching a SparkSession
  • Conclusion
  • References

 

Introduction

 
Apache Spark is one of the hottest and largest open source projects in data processing, with rich high-level APIs for programming languages like Scala, Python, Java and R. It realizes the potential of bringing together Big Data and machine learning. This is because:

  • Spark is fast (up to 100x faster than traditional Hadoop MapReduce) due to in-memory operation.
  • It offers robust, distributed, fault-tolerant data objects (called RDDs)
  • It integrates beautifully with the world of machine learning and graph analytics through supplementary packages like MLlib and GraphX.

Spark is implemented on Hadoop/HDFS and written mostly in Scala, a functional programming language. However, for most beginners, Scala is not a great first language to learn when venturing into the world of data science.

Fortunately, Spark provides a wonderful Python API called PySpark. This allows Python programmers to interface with the Spark framework — letting you manipulate data at scale and work with objects over a distributed file system. So, Spark is not a new programming language that you have to learn but a framework working on top of HDFS.

This introduces new concepts like nodes, lazy evaluation, and the transformation-action (or ‘map and reduce’) paradigm of programming. In fact, Spark is versatile enough to work with file systems other than Hadoop, such as Amazon S3 or Databricks (DBFS).
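
As a small taste of that paradigm, here is a minimal sketch (assuming a SparkSession named spark is already available; we create one later in this tutorial): transformations only build up an execution plan, and nothing is computed until an action is called.

# Transformations (filter, map) are lazy: they only build up an execution plan
rdd = spark.sparkContext.parallelize(range(1, 1001))
evens = rdd.filter(lambda x: x % 2 == 0)   # nothing has run yet
squares = evens.map(lambda x: x * x)       # still nothing has run

# Actions (count, take, collect) trigger the actual distributed computation
print(squares.count())   # 500
print(squares.take(3))   # [4, 16, 36]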

Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes.

 

Spark Definition

 
Typically, when you think of a computer, you think about one machine sitting on your desk at home or at work. This machine works perfectly well for applying machine learning to small datasets. However, when you have a huge dataset (hundreds of gigabytes or terabytes), there are some things that your computer is not powerful enough to perform. One particularly challenging area is data processing. Single machines do not have enough power and resources to perform computations on huge amounts of information (or you may have to wait a very long time for the computation to finish).


A cluster, or group of machines, pools the resources of many machines together, allowing us to use all the cumulative resources as if they were one. Now, a group of machines alone is not powerful; you need a framework to coordinate work across them. Spark is a tool for just that: managing and coordinating the execution of tasks on data across a cluster of computers.

 

Spark Application

 
A Spark Application consists of:

  • Driver
  • Executors (set of distributed worker processes)

 

Driver

 
The Driver runs the main() method of our application and has the following duties:

  • Runs on a node in our cluster, or on a client, and schedules the job execution with a cluster manager
  • Responds to user’s program or input
  • Analyzes, schedules, and distributes work across the executors

 

Executors

 
An executor is a distributed process responsible for the execution of tasks. Each Spark Application has its own set of executors, which stay alive for the life cycle of a single Spark application.

  • Executors perform all the data processing of a Spark job
  • They store results in memory, only persisting to disk when specifically instructed by the driver program
  • They return results to the driver once they have been completed
  • Each node can have anywhere from one executor to one executor per core

** A node is a single machine or server.
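
As an illustrative (not prescriptive) sketch, executor sizing is usually controlled through configuration when building the session (SparkSession is covered later in this tutorial). The property names below are standard Spark settings, but the YARN master and the values are only examples:

from pyspark.sql import SparkSession

# Example executor sizing when submitting to a cluster manager such as YARN
# (the values here are illustrative only)
spark = SparkSession.builder \
    .master("yarn") \
    .config("spark.executor.instances", 4) \
    .config("spark.executor.cores", 2) \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()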


 

Spark’s Application Workflow

 
When you submit a job to Spark for processing, there is a lot that goes on behind the scenes (see the minimal sketch after the list below).

  1. Our Standalone Application is kicked off, and initializes its SparkContext. Only after having a SparkContext can an app be referred to as a Driver
  2. Our Driver program asks the Cluster Manager for resources to launch its executors
  3. The Cluster Manager launches the executors
  4. Our Driver runs our actual Spark code
  5. Executors run tasks and send their results back to the driver
  6. SparkContext is stopped and all executors are shut down, returning resources back to the cluster
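
A minimal PySpark sketch of that life cycle in local mode (the cluster manager details are hidden behind the master URL; local[2] simply simulates two worker threads):

from pyspark.sql import SparkSession

# Steps 1-3: creating the session initializes the SparkContext (the driver)
# and acquires executors from the cluster manager (here: local mode)
spark = SparkSession.builder.master("local[2]").appName("workflow_demo").getOrCreate()

# Steps 4-5: the driver runs our Spark code; executors run the tasks
# and send the results back to the driver
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.sum())   # 45

# Step 6: stopping the session shuts down the executors and
# returns the resources to the cluster
spark.stop()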

 

Install Spark on Mac (locally)

 
First Step: Install Brew

You will need to install brew; if you have it already, skip this step:

1. Open Terminal on your Mac. You can go to Spotlight and type terminal to find it easily (alternatively, you can find it in /Applications/Utilities/).
2. Enter the command below.
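
At the time of writing, the standard Homebrew install one-liner looked like the following (check https://brew.sh for the current command):

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"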


3. Hit Return and the script will run. It will output to your terminal a log of what it is going to install. Hit Return to continue or any other key to abort.

4. It might ask for sudo privileges. If this happens, you will have to type your admin password and hit Return again.

Note: Apple's Xcode Command Line Tools will be installed as part of this step if they are not already present.

The installation will look like the image below.

Installation of Homebrew through the command line

When the installation finishes successfully, Homebrew prints a confirmation message in the terminal.

By default, Homebrew sends anonymous analytics data; you can find additional information in the Homebrew documentation. You can choose to opt out by running the command:

$ brew analytics off


Second Step: Install Anaconda

In the same terminal, simply type: $ brew cask install anaconda. Please see the References section in case you face any issue with this step.

Third and Final Step: Install PySpark

1. On a terminal, type $ brew install apache-spark

2. If you see an error message telling you that Java needs to be installed, enter $ brew cask install caskroom/versions/java8 to install Java 8. You will not see this error if you already have Java installed.

3. Check that PySpark is properly installed by typing $ pyspark on the terminal. If the Spark welcome screen appears, it has been installed properly.

 

Open Jupyter Notebook with PySpark Ready

 
This section assumes that PySpark has been installed properly and that no errors appear when typing $ pyspark on a terminal. Here, I present the steps you have to follow in order to create Jupyter Notebooks automatically initialised with a SparkContext.
In order to create a global profile for your terminal session, you will need to create or modify your .bash_profile or .bashrc file. Here, I will use .bash_profile as my example.

1. Check if you have a .bash_profile in your system with $ ls -a; if you don't have one, create it using $ touch ~/.bash_profile

2. Find Spark path by running $ brew info apache-spark


3. If you already have a .bash_profile, open it with $ vim ~/.bash_profile, press i in order to insert, and paste the following lines in any location (DO NOT delete anything in your file):

export SPARK_PATH=(path found above by running brew info apache-spark)
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
# For Python 3, you have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'


4. Press ESC to exit insert mode, then enter :wq to save and exit VIM. You can find more VIM commands here.

5. Refresh your terminal profile by running $ source ~/.bash_profile
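
With the profile reloaded, the snotebook alias defined above should launch a Jupyter Notebook server with PySpark attached (assuming SPARK_PATH points to your Spark installation):

$ snotebook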

My favourite way to use PySpark in a Jupyter Notebook is by installing the findspark package, which allows me to make a SparkContext available in my code.

The findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too.

Install findspark by running the following command on a terminal

$ pip install findspark


Launch a regular Jupyter Notebook and run the following command:

# Useful to have this snippet to avoid getting an error in case
# we forgot to stop a previous Spark session

try:
    spark.stop()
except:
    pass

# Use findspark to locate the Spark installation automatically
import findspark
findspark.init()

# Import Python libraries
import random

# Initialize a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
num_samples = 100000000

# Monte Carlo estimation of pi: check whether a random point
# falls inside the unit circle
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = spark.sparkContext.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)


The output should be an approximation of π (roughly 3.14).

Please note that with Spark 2.2 a lot of people recommend simply doing pip install pyspark. I tried using pip to install PySpark, but I couldn't get the PySpark cluster to start properly. Reading several answers on Stack Overflow and the official documentation, I came across this:

The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.

Therefore, I would suggest following the steps that I described above.

 

Launching a SparkSession

 
Well, it’s the main entry point for Spark functionality: it represents the connection to a Spark cluster, and you can use it to create RDDs and to broadcast variables on that cluster. When you’re working with Spark, everything starts and ends with this SparkSession. Note that SparkSession is a new feature of Spark 2.0 which minimizes the number of concepts to remember or construct (before Spark 2.0.0, the three main connection objects were SparkContext, SqlContext and HiveContext).

In interactive environments, a SparkSession will already be created for you in a variable named spark. For consistency, you should use this name when you create one in your own application.

You can create a new SparkSession through a Builder pattern which uses a "fluent interface" style of coding to build a new object by chaining methods together. Spark properties can be passed in, as shown in these examples:

from pyspark.sql import SparkSession
spark = SparkSession\
        .builder\
        .master("local[*]")\
        .config("spark.driver.cores", 1)\
        .appName("understanding_sparksession")\
        .getOrCreate()


At the end of your application, please remember to call spark.stop() in order to end the SparkSession. Let's understand the various settings that we defined above (a short sanity-check sketch follows the list):

  • master: Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone cluster.
  • config: Sets a config option by specifying a (key, value) pair.
  • appName: Sets a name for the application; if no name is set, a randomly generated name will be used.
  • getOrCreate: Gets an existing SparkSession or, if there is none, creates a new one based on the options set in this builder. If an existing SparkSession is returned, the config options specified in this builder that affect the SQLContext configuration will be applied: SQLContext configuration can be modified at runtime, whereas SparkContext configuration cannot (you have to stop the existing context first).
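
A quick sanity check (a minimal sketch, assuming the spark variable created above is still in scope): you can read these settings back from the running session, confirm that getOrCreate() reuses it, and finally stop it.

# Read the settings back from the running session
print(spark.sparkContext.appName)             # understanding_sparksession
print(spark.sparkContext.master)              # local[*]
print(spark.conf.get("spark.driver.cores"))   # 1

# While this session is active, getOrCreate() returns the same session
from pyspark.sql import SparkSession
same_session = SparkSession.builder.getOrCreate()
print(same_session is spark)                  # True

# End the SparkSession when you are done
spark.stop()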

 

Conclusion

 
Spark has seen immense growth over the past several years. Hundreds of contributors working collectively have made Spark an amazing piece of technology and the de facto standard for big data processing and data science across all industries. But please remember to use it for manipulating huge datasets when facing performance issues; otherwise it may have the opposite effect. For small datasets (a few gigabytes) it is advisable to use Pandas instead.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.

 

References

 

 
Bio: Georgios Drakos is a Data Scientist with a BSc and MSc in Electrical and Computer Engineering (National Technical University of Athens) as well as a MSc from Imperial College London, currently working as a Data Scientist for TUI in the travel industry. He specialised in Computational Intelligence for Decision Support, Data Engineering, Sophisticated Analytics and Technological Innovation Management. Highly interested in Cloud-based technologies, Distributed Computing, Big Data and the business applications of Data Science & Machine Learning.

Original. Reposted with permission.

Related: