Apache Spark is an in-memory distributed data processing framework that is part of the Apache open source foundation. Spark was created to work with large datasets utilizing multiple nodes. Developers can use Java, Python, Scala or R to interact with and process large datasets. Generally Spark is used as part of a Hadoop deployment, and can be used to query large datasets, process streaming data, or by data scientists using machine learning libraries to analyze data.
Previously Spark worked with Resilient Distributed Datasets. RDDs are read-only, in-memory datasets. The data is distributed across several nodes, and the data sets are fault tolerant.
DataSets were introduced in Spark 1.6, which build on RDDs. A DataFrame is a DataSet with named columns. Users of R are familar with a DataFrame, which are used widely in that language.
Spark Core: The central component, handling task scheduling, memory management, etc.
Spark SQL: Works with structured data, allowing a mixture of SQL queries and data manipulation in DataFrames.
Spark Streaming: Works with data in streams (as opposed to batch processing of data).
MLlib: Machine learning library.
GraphX: Graph data processing.
There are several options for deployment. We can create a standalone cluster, we can download a VM from Hortonworks to deploy in Virtual Box, VM Ware or Docker, or we can setup a cluster an Azure, among other options. I chose to create a cluster in Azure.
To create the cluster in Azure, go to New, then Data + Analytics, then HDInsight. Here you run through a regular Azure setup, User Name/Password, resource group, etc. Once we get to the Cluster Type, we can specify that we want a Spark setup. I chose the latest edition available (Spark 2.1.0) on the Standard tier.
On the next page we specify storage options. We can choose Azure Storage (BLOB) or Data Lake storage, I went with Azure storage. If you don’t already have a storage account, you can set one up here. When specifying a name for the storage account, you aren’t allowed to use upper case letters, only lower case and numbers.
Once you create the cluster, you are billed by the hour. The cluster I created came to $3.63 an hour, with 6 nodes and 40 cores. The deployment took about 15 minutes to complete.
In a future post I’ll run through a Spark demo job and look at some use cases.