Apache Spark is a data processing framework widely used in data engineering. It distributes processing across multiple nodes, which lets us work with large data sets. PySpark lets us use Python to interact with Spark.

Setup:
I’m going to run Docker and use the jupyter/pyspark-notebook image. This image includes everything we need to run PySpark and lets us write our code in Jupyter notebooks. I have a post on running Docker on Windows, if you haven’t done that before.
Once we open Docker Desktop to start the engine, we can use the command line to obtain the image.

docker pull jupyter/pyspark-notebook

Once we pull down the jupyter/pyspark-notebook image (The one I’m using is around 4.85 GB), we can start it from the command line with this command:

docker run -d -p 8888:8888 jupyter/pyspark-notebook

We’ll map port 8888 so that we can use the browser to run our PySpark code.
Open a browser to localhost:8888 to reach the Admin page.
We can run

docker ps

to see all of the running containers, and get the name. Running

docker logs container_name

will display the logs for the container (Substituting the actual name for container_name).
The log will list a token as part of a URL. Copy the token and enter it on the Admin page. Logging in takes you to the lab page.

Running PySpark:
On the lab page there is an Upload Files button (The upward arrow). This lets us upload a data file to work with. I’ve put together a CSV with the results of Atlanta Falcons football games from 2023. Once the file has been uploaded, you’ll see it listed in the left pane, under Work. You can double-click on the file here to see its contents displayed.
Once we return to the lab page, we can select Notebook. A notebook lets us put together repeatable code and see the results as we run each command.
First, we’ll import the libraries we’ll need:

import pyspark
from pyspark.sql import SparkSession
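
As an optional sanity check, we can confirm the uploaded CSV is visible from the notebook before going any further. This is just a sketch; it assumes the notebook’s working directory is the same folder the file was uploaded to.

import os

# List any CSV files in the notebook's working directory.
# If Falcons2023.csv is not listed here, adjust the path passed to
# read.csv() later so it points at wherever the file was uploaded.
print(os.getcwd())
print([f for f in os.listdir(".") if f.endswith(".csv")])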

Next, we create a Spark Context, which is our connection to the Spark cluster. 'local' connects to the local Spark instance in the container, and 'lab' is an optional name for the application.

c = pyspark.SparkContext('local', 'lab')

Then we create a Spark Session.

s = SparkSession.builder.getOrCreate()
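
As a side note, we don’t have to create the context ourselves; the builder can handle both steps at once. Here’s a minimal sketch of that alternative, using the same 'local' master and 'lab' name (keep in mind that if a context already exists, getOrCreate will reuse it rather than apply these settings):

# Alternative: build the session directly and let Spark create the context.
# master("local") and appName("lab") mirror the SparkContext arguments above.
s = SparkSession.builder.master("local").appName("lab").getOrCreate()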

With the session, we can read our data file into a data frame. A data frame has named columns and rows of data, much like a table. Setting the header option to true uses the first line of the CSV for the column names.

df = s.read.option("header",True).csv("Falcons2023.csv")
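
By default every column is read in as a string. The averages below still work, but if you’d rather have real numeric types, Spark can infer them from the data. An optional variation on the same read:

# Optional: infer column types instead of reading everything as strings.
df_typed = s.read.option("header", True).option("inferSchema", True).csv("Falcons2023.csv")

# printSchema shows the column names and the types Spark inferred.
df_typed.printSchema()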

The show method displays the contents of the data frame.

df.show()

Now we can start querying our data. We’ll use an aggregate function to find the average score for Atlanta for the entire season.

atlAvg = df.agg({"AtlScore": 'avg'})
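
The dictionary form is the most compact way to write this. An equivalent version uses the functions module, which also lets us name the result column (avg_atl_score is just a name I picked for this sketch):

from pyspark.sql import functions as F

# Same average, written with an explicit aggregate function and an alias.
atlAvg2 = df.agg(F.avg("AtlScore").alias("avg_atl_score"))
atlAvg2.show()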

We’ll also get averages for the home games, as well as the away games. We’ll use the same aggregate, but first we need to apply a filter to select only home or away games (For home games, the HomeAway value is H, and for away games it’s A).

atlAvgHome = df.filter(df["HomeAway"]=='H').agg({"AtlScore": 'avg'})
atlAvgAway = df.filter(df["HomeAway"]=='A').agg({"AtlScore": 'avg'})
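
Since both filters split on the same column, a groupBy can produce the home and away averages in a single pass. A quick sketch of that alternative, using the same HomeAway and AtlScore columns:

# One average per HomeAway value (H and A) instead of two filtered aggregates.
df.groupBy("HomeAway").agg({"AtlScore": "avg"}).show()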

For the last step, we’ll display our calculated values. Each aggregate returns another data frame with one row and one column, so we’ll use the head method to get the first row, then [0] for the first (and only) column. We’ll also round each of the values to one decimal place.

print("Average Atl Score: " + str(round(atlAvg.head()[0],1)))
print("Average Atl Home Score: " + str(round(atlAvgHome.head()[0],1)))
print("Average Atl Away Score: " + str(round(atlAvgAway.head()[0],1)))

In the notebook, we can run the code cell by cell with the play button (or Shift + Enter), or run everything at once with the double play button (The one with two arrows).

Wrap-up:
This was a simple example to get familiar with using PySpark. There wasn’t anything here that we couldn’t do with SQL, but using Python gives us an alternative processing method. Plus, we’re able to use Python libraries to perform more complicated processing, like machine learning. We can also utilize multiple nodes if we’re working with a large dataset.

Links:

Github – jupyter/docker

Datacamp – Pyspark tutorial

Spark by examples – Pyspark tutorial