Databricks is a cloud-based data platform for analytics. It’s ranked #17 on the DB-Engines popularity list.
Databricks stores data in a Data Lakehouse, which combines the functionality of a data lake and a traditional data warehouse. Databricks uses Apache Spark, a distributed data processing framework. We can run SQL to retrieve and manipulate data, or we can write code (like Python) to work with Spark.
We can create multiple compute nodes to process our data, and manage everything from a centralized workspace.

Setup:
There are a couple of options for trying out Databricks. There’s a free 14-day trial, where you run Databricks on AWS, Azure, or Google Cloud. You’ll need an account with your preferred cloud service first.
I’ve opted to try the Databricks Community edition. It’s limited to a single node and doesn’t have all of the features of the full edition, but it doesn’t require our own cloud service account.
You can sign up at this address. You’ll need to provide your email address. They ask for a company email, but I was able to use a Gmail address. The next page offers a trial of the full product; I selected ‘Get started with Community edition’ instead.
Once you verify your email address and set a password, you’re ready to start.

Concepts and Components:
First, we’ll define a few terms and concepts within Databricks.
Cluster: A compute resource for running notebooks and jobs. In the Community edition, a cluster is a single node.
Delta Lake: The storage layer, used as the default table format. It stores data in Parquet files and uses a transaction log to provide ACID transaction capability.
We don’t necessarily have to define a schema when we create a table; we can have a data import pick up attribute names (and types) from the data source. There’s a short sketch of both ideas after this list.
Library: For storing reusable code modules that can be attached to clusters. Databricks supports Python, Scala, R, and Java.
DBFS: The Databricks File System, an abstraction over object storage.
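Here’s a minimal sketch of those two points in a Python notebook cell, assuming the defaults described above; the CSV path and table name are placeholders of my own, and spark is the SparkSession Databricks provides in notebooks:

    # Read a CSV, letting Spark take column names from the header and infer types
    df = spark.read.csv("/FileStore/tables/example.csv", header=True, inferSchema=True)

    # Save it as a managed table; with Delta as the default format, this creates
    # a Delta table backed by Parquet files plus a transaction log
    df.write.saveAsTable("example_table")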

Back to the UI. Once you’ve signed in, there are several options available on the left-hand menu. The indicator at the top lets us know which environment we’re in. The default is the Data Science & Engineering dashboard, and there’s also a Machine Learning option available.
The next option is ‘Create’. From here we can create a Notebook, a Table, or a Cluster.
A Notebook presents and runs code, along with any comments we would like to add. We can write code in SQL, Python, Scala, or R. If you’ve worked in Data Science, you may already be familiar with Jupyter or similar notebooks.
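As a rough sketch, in a notebook whose default language is Python, each cell runs Python, and a magic command on a cell’s first line can switch languages for that cell:

    # A Python cell; 'spark' is the SparkSession that Databricks provides
    print(spark.version)

    # A cell that starts with %sql runs SQL instead, %md renders Markdown,
    # and %scala / %r switch to those languages for that one cell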
A Cluster is a compute resource. We’re limited to one in the Community edition, but normally you would probably have multiple clusters available.
Creating a Table gives us a way to upload a file into DBFS (the Databricks File System). We can also retrieve data from Amazon S3 or from other data stores.
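Here’s a rough sketch of what reading that data back looks like in Python; the file name and S3 bucket are placeholders I made up, and /FileStore/tables is simply where the UI upload landed for me:

    # List files uploaded into DBFS via the Create Table UI
    display(dbutils.fs.ls("/FileStore/tables"))

    # Read one of them into a Spark DataFrame
    df = spark.read.csv("/FileStore/tables/my_upload.csv", header=True, inferSchema=True)

    # An S3 path works the same way once access credentials are configured
    # df_s3 = spark.read.csv("s3a://my-bucket/some/prefix/data.csv", header=True, inferSchema=True)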
After Create is the ‘Workspace’ option. The workspace is where our notebooks, libraries, and folders are organized; a workspace corresponds to a Databricks deployment and can contain one or more clusters.
Next is ‘Recents’ for the most recently accessed objects, then ‘Search’.
‘Catalog’ allows us to access our databases and view or create tables; the same information can be queried from a notebook, as sketched after this list.
‘Compute’ gives us a UI to manage the clusters and jobs.
The ‘Workflows’ option isn’t usable in the Community edition; in the full product it’s used to create jobs and data pipelines.
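For example, here’s a minimal sketch of browsing the catalog from a Python cell (the table name is the placeholder used earlier):

    # The same databases and tables the Catalog UI shows, via Spark SQL
    spark.sql("SHOW DATABASES").show()
    spark.sql("SHOW TABLES IN default").show()
    spark.sql("DESCRIBE TABLE default.example_table").show()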

Running Databricks Quickstart:
From the main dashboard, there’s an option to run through a quickstart tutorial. This takes us to a notebook that will load sample data and run through some queries. The notebook is a combination of SQL and Python. There’s a Run All option, but I ran each block on its own, trying to follow what was going on.
Once we try to run a cell, we’ll be prompted to ‘attach to a compute resource’. This creates a cluster.
The dataset contains data on diamonds, with the color, clarity, price, and so on. Running the notebook imports the data into a table and then queries that table, finding the average price for each color of diamond. It does this first with SQL, and then again with Python and Spark.
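The heart of the notebook boils down to something like the following; this is a sketch from memory rather than the notebook’s exact code, and ‘diamonds’ is the table the notebook creates:

    # SQL version: average price per diamond color
    spark.sql("""
        SELECT color, AVG(price) AS avg_price
        FROM diamonds
        GROUP BY color
        ORDER BY color
    """).show()

    # Python/Spark version of the same aggregation
    from pyspark.sql import functions as F
    diamonds = spark.table("diamonds")
    diamonds.groupBy("color").agg(F.avg("price").alias("avg_price")).orderBy("color").show()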

Wrap Up:
Obviously, this barely scratches the surface of the functionality available in Databricks. I stayed within the Data Science & Engineering environment, so next I’d like to look into Machine Learning on this platform.

Links:
Databricks
Databricks Documentation
What Is Databricks?