Dremio is an open data lakehouse. The Dremio query engine allows users to query data from the data lake using SQL. By using Data Virtualization, Dremio presents a unified interface to interact with data, even for data in different formats and different locations.
Dremio can be deployed to the cloud (AWS, Azure, GCP), Docker, or deployed to your own server. It is built to run multiple nodes, although for testing we can run with one node.
There are built-in data connectors to interact with data from cloud storage (Like Amazon S3), or from relational databases. You can also connect an external reporting tool, like Tableau or PowerBI.
Set Up:
There’s a ‘Start For Free’ button on the home page to get the Community version of Dremio. They’ll ask for some information, including your email address, no credit card required. Once you register, you can obtain either the Docker or Kubernetes version.
Or from Docker Desktop, can just pull the dremio/dremio-oss image. I’m running Dremio in Docker.
Run this command in your terminal to start Dremio:
docker run -p 9047:9047 -p 31010:31010 -p 45678:45678 dremio/dremio-oss
Once Dremio is running, go to http://localhost:9047 in your browser, and complete the Admin account setup. The password should be at least 8 characters, with at least one number and one letter.
Query Setup:
There is a SQL Runner built into the Dremio UI, so we don’t necessarily need an external query tool.
The DBeaver query tool can connect to Dremio. Create a ‘Dremio’ connection, and on the Driver Properties settings tab you can download the necessary driver files.
Back on the Main tab, the Host and Port settings can be left as the defaults (localhost \ 31010). We don’t need to specify a database/schema, but we do need to fill in the Admin username and password. Click Finish to save.
Once we connect in DBeaver, we’ll see four databases. One will have an @ plus the Admin user name. There’s also $scratch, INFORMATION_SCHEMA, and sys. The @ database is the main DB, and is the one that the SQL Runner will connect to.
Dremio UI:
Once the admin setup is complete, you’ll come to the Dremio UI. The default view is for Datasets.
In Datasets, we can create Spaces, with are namespaces that can be used to organize your work. You don’t have to create a space, you can simply work in the default space.
Sources is where we can connect to a data source. We can connect to a storage account or to a relational database.
Going down from Datasets in the left-hand menu, the second entry is the SQL Runner. From here we can create queries and manipulate data.
The third item in the left-hand menu is Jobs. Once we run queries to select or manipulate data in the SQL Runner, we’ll see the list of jobs here that interact with the selected dataset(s).
The remaining left-hand menu items are for Settings, Help, and User Account.
For most object in the UI, we can click the ‘Open Details Panel’ icon to see additional information. For some items, like spaces and data sets, Dremio can allow you to add a Wiki page for that item, to contain notes for other users. You can also label items.
Querying:
Dremio has a sample data source. In the UI, we can click Add Source, and select Sample Source under Object Storage.
Once Samples appears under Sources, we can format the data. Click on Samples under Data Sources, then click samples.dremio.com. Click on the NYC-taxi-trips.csv entry to bring up the Format screen. This will give a preview of the data. In the data preview, I see what should be attribute names in the first row. I’ll check ‘Extract Column Names’, and then save the changes. This save will make this dataset available to query.
This should take us to the SQL Runner screen, with this query ready:
SELECT * FROM "NYC-taxi-trips.csv";
Notice the Context has been set to Samples.samples.dremio.com, the same as the path in Object Storage. Some of the other file names have dashes, so we would need to delimit them with double quotes, but we’re OK with this one.
If we go back to the Datasets page and navigate back down to the taxi file, we’ll see a different icon, denoting that it is available as a dataset, as opposed to the other CSV and JSON files. We can repeat the format process with the other files to make them available as well.
Views:
Just as relation database systems allow us to save a query as a view, we can create views in Dremio. As an example, we can copy this query into the SQL Runner:
SELECT pickup_datetime, passenger_count, trip_distance_mi, CASE WHEN tip_amount > 0.00 then TRUE ELSE FALSE end as was_tipped FROM "NYC-taxi-trips.csv";
This returns a subset of the available attributes. Plus we’ve added a true/false was_tipped attribute, based on if the rider left a tip or not.
In the top right of the page, we can see a button labeled ‘Save as View’. Once we save this view, we can access it just as we would the datasets.
Tables:
We can also create tables in Dremio and load data into them. We can create tables in the $scratch database, or we can specify storage options to save it in AWS, Azure, etc. Anything created in scratch is available to any other user, no security is enforced.
We can create a table in the scratch directory by selecting all of the data from an existing data source:
CREATE TABLE $scratch.NycTaxiTrips AS SELECT * FROM "NYC-taxi-trips.csv"; SELECT * FROM $scratch.NycTaxiTrips;
There are plenty of options for creating tables. You can specify a field to use to partition your data. You can also create a table by defining columns, as you would in a relational database. Tables are created using the Apache Iceberg format. We can’t create these custom tables in $scratch, so we would need to have another storage option available.
Links: