Apache PySpark Tutorial

Introduction

PySpark Documentation

PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.

Spark SQL and DataFrame

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.
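For instance (a minimal sketch; the local session and the people view name are illustrative), a DataFrame can be registered as a temporary view and queried with plain SQL:

Code
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("sql-demo").getOrCreate()

# Build a tiny DataFrame, expose it to SQL, and query it
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()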

Streaming

Running on top of Spark, the streaming feature in Apache Spark enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics.
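A minimal Structured Streaming sketch (the built-in rate source and console sink are used purely for illustration):

Code
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("stream-demo").getOrCreate()

# The "rate" source emits one timestamped row per second
stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Print each micro-batch to the console, run ~10 seconds, then stop
query = stream.writeStream.format("console").start()
query.awaitTermination(10)
query.stop()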

MLlib

Built on top of Spark, MLlib is a scalable machine learning library that provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
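A small pipeline sketch (the two-row training set and parameter values are illustrative, not a recommended configuration):

Code
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local").appName("mllib-demo").getOrCreate()

train = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop mapreduce", 0.0)],
    ["text", "label"])

# Chain feature extraction and a classifier into a single pipeline
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)
model.transform(train).select("text", "prediction").show()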

Spark Core

Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides the RDD (Resilient Distributed Dataset) abstraction and in-memory computing capabilities.
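A short RDD sketch (the app name and numbers are illustrative): distribute a list, transform it, cache it in memory, and reduce it:

Code
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

# Distribute a list, square each element, and keep the result in memory
rdd = sc.parallelize(range(100)).map(lambda x: x * x).cache()
print(rdd.reduce(lambda a, b: a + b))  # 328350, the sum of squares 0..99
sc.stop()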


Installation

Prerequisites: Java for macOS

Install Homebrew

Homebrew

CLI
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Java through Homebrew

How to install Java JDK on macOS
Note (2021-07-26): the JDK version must be 8 or 11, installed from AdoptOpenJDK (the project has since moved to Eclipse Temurin).

CLI
brew tap adoptopenjdk/openjdk

brew search adoptopenjdk
CLI
brew install adoptopenjdk11

Review your Java version.

CLI
java -version
# java --version  # does not work on Java 8


Prerequisites: Hadoop for macOS

Install Hadoop

CLI
brew install hadoop

Review your Hadoop version.

CLI
hadoop version
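
With Java and Hadoop in place, PySpark itself is typically installed from PyPI; pip is one common route (conda works as well):

CLI
pip install pyspark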


Test

A minimal smoke test (an illustrative sketch, not a canonical check): create a SparkSession and display a tiny DataFrame. If a table prints, the installation works.
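
Code
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("install-test").getOrCreate()

# If the installation works, this prints a two-row table
spark.createDataFrame([(1, "ok"), (2, "ok")], ["id", "status"]).show()
spark.stop()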


Quickstart

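A few DataFrame basics in the spirit of the official quickstart (a minimal sketch; all names and values are illustrative):

Code
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").appName("quickstart").getOrCreate()

df = spark.createDataFrame(
    [("red", 1), ("blue", 2), ("red", 3)],
    ["color", "value"])

df.printSchema()                       # inspect the inferred schema
df.filter(df.color == "red").show()    # row-level filtering
df.groupBy("color").agg(F.sum("value").alias("total")).show()  # aggregation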


Database Connection

MySQL

Download MySQL JDBC connector

Download

For macOS, select Platform Independent.


Connect to MySQL Server

from pyspark.sql import SparkSession

# Connector path format (macOS): /.../connector_name.jar
# Do NOT prefix the path with file://
# The path must be absolute
connector_path = "[connector path]"
appName = "[app name]"

spark = SparkSession.builder \
    .master("local") \
    .appName(appName) \
    .config("spark.jars", connector_path) \
    .getOrCreate()
sc = spark.sparkContext

sc.setLogLevel('ERROR')
# MySQL Driver
driver = "com.mysql.cj.jdbc.Driver"
db = "[db name]"
url = f"jdbc:mysql://localhost:3306/{db}"
tb = "[tb name]"
user = "[user name]"
password = "[password]"
df = spark.read.format("jdbc") \
    .option("url", url) \
    .option("driver", driver) \
    .option("dbtable", tb) \
    .option("user", user) \
    .option("password", password) \
    .load()
df.show()

Alternatively, pass the parameters with options():

df = spark.read.format("jdbc").options(
    driver=driver,
    url=url,
    dbtable=tb,
    user=user,
    password=password
).load()
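
The JDBC source can also push a full query down to MySQL through the query option (available in Spark 2.4+ and mutually exclusive with dbtable); the SELECT below is only illustrative:

df = spark.read.format("jdbc").options(
    driver=driver,
    url=url,
    query=f"SELECT * FROM {tb} LIMIT 10",
    user=user,
    password=password
).load()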


SQLite

Download SQLite JDBC connector

Download


Connect to SQLite Server

from pyspark.sql import SparkSession

# Connector path format (macOS): /.../connector_name.jar
# Do NOT prefix the path with file://
# The path must be absolute
connector_path = "[connector path]"
appName = "[app name]"

spark = SparkSession.builder \
    .master("local") \
    .appName(appName) \
    .config("spark.jars", connector_path) \
    .getOrCreate()
sc = spark.sparkContext

sc.setLogLevel('ERROR')
# SQLite Driver
driver = "org.sqlite.JDBC"
# SQLite Database absolute path
db_path = "[db path]"
url = f"jdbc:sqlite:{db_path}"
tb = "[tb name]"
df = spark.read.format("jdbc").options(
    driver=driver,
    url=url,
    dbtable=tb
).load()
df.show()
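
Writing back works through the same connector; a minimal sketch (the target table name is an assumption, and since Spark has no dedicated SQLite dialect, keep column types simple; SQLite also allows only one writer at a time, so coalescing to a single partition is safer):

df.coalesce(1).write.format("jdbc").options(
    driver=driver,
    url=url,
    dbtable="[new tb name]"
).mode("append").save()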