In reading articles on The Modern Data Stack, mostt articles mention using the dbt tool for transformations.
dbt allows you to set up a transformation using SQL. It can create a view, or to create a new table for the transformed data.
I worked through a simple demo, my files are available on GitHub. In this example. records from three tables (player, position, team) are combined into one denormalized table. The script to create the sample tables and data are under source_data. I’m using a Postgres database in the demo.
I installed the CLI-based version of dbt (dbt core), to run on my PC. There’s also a cloud-based version available. For this, you’ll need to have Python 3.7 (or later) installed. Installation is a pip command:
pip install dbt-postgres
Pulling down the Postgres specific file includes the core files for install.
To confirm installation, we can run the help command in the CLI.
We can include a specific command to get help on it:
dbt run --help
In the CLI, navigate to the directory where you want your project files, then run the init command:
dbt init nfl_roster
This will create a project named “nfl_roster”
You’ll see a line in the init output ‘Creating dbt configuration folder at…’. In that .dbt directory is a profiles.yml file (Mine was in my Users directory, in the .dbt folder). Set up your database connections in profiles. Since we downloaded the postgres driver, we’ll see two pg profiles set up, for dev and prod. We’ll rename the dev connection to test_pg, and remove the prod.
We’ll set the schema value to report, so that our transform will be build in a separate schema. The run process will create the schema, if it’s not already there.
We’ll also need to update target to test_pg, since the target value is the active profile.
Navigate into the project directory, and run
This will run some checks on the setup and make sure it can connect to our database.
Now we’re going to build a query to put together data from our three tables. We’ll put together a player.sql file in the models\player directory in our project folder. We’ll also remove the example directory created by the initialization.
We’ll need to update the dbt_project file to add player under the models list. We’ll also update the +materialized property to table (the default is view) to create and populate a table.
Running the transformation is a simple run command.
This will create and populate a report.player_list table. Each run will rebuild the table.
It’s possible to set up dbt for incremental transforms, to add new records instead of a complete rebuild of a table.
There’s also functionality for tests and to generate documentation.
dbt definitely is a quick and easy way to set up data transforms. Most data professionals already know SQL, so that knowledge gets leveraged to set up this part of the data pipeline.