A Databricks notebook that calls datetime.now()
in one of its cells will most likely behave differently when it's run again at a later point in time. For example: if you read in data from today's partition (June 1st) based on that datetime, but the notebook fails halfway through, you cannot restart the same job on June 2nd and assume it will read from the same partition.
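To make the problem concrete, here is a rough sketch of that kind of cell; the table name and partition column are invented for illustration, and spark is the session that Databricks provides in every notebook:

from datetime import datetime

# Non-deterministic: the partition depends on the wall clock at run time.
partition_date = datetime.now().strftime('%Y-%m-%d')

# 'events' and 'event_date' are hypothetical names, purely for illustration.
df = spark.read.table('events').where(f"event_date = '{partition_date}'")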
If we borrow the concept of purity from functional programming and apply it to our notebook, we would simply pass any state to the notebook via parameters, so that the notebook produces the same result every time it is run with the same arguments.
Arguments can be accepted in Databricks notebooks using widgets. We can replace our non-deterministic datetime.now() expression with the following:
from datetime import datetime as dt

# Create a text widget that will hold the notebook parameter.
dbutils.widgets.text('process_datetime', '')
In the next cell, we can read the argument from the widget:
# Parse the widget value into a datetime.datetime object.
process_datetime = dt.strptime(
    dbutils.widgets.get('process_datetime'),
    '%Y-%m-%d')
Assuming you've passed the value 2020-06-01 as an argument during a notebook run, the process_datetime variable will contain a datetime.datetime value:
print(process_datetime) # 2020-06-01 00:00:00
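If the notebook is opened interactively, the widget may be left empty. How to handle that is a design choice; the guard below is my own assumption, chosen to fail fast rather than silently fall back to datetime.now(), which would reintroduce the impurity:

raw_value = dbutils.widgets.get('process_datetime')

# Assumption: refuse to run without an explicit parameter.
if not raw_value:
    raise ValueError('process_datetime must be passed as a notebook parameter')

process_datetime = dt.strptime(raw_value, '%Y-%m-%d')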
Using the databricks-cli, for example, you can pass parameters as a JSON string:
databricks jobs run-now \
--job-id 123 \
--notebook-params '{"process_datetime": "2020-06-01"}'
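The same parameter can also be passed when the notebook is triggered from another notebook with dbutils.notebook.run; the notebook path and timeout below are placeholders:

dbutils.notebook.run(
    '/Repos/etl/ingest_events',   # hypothetical notebook path
    600,                          # timeout in seconds
    {'process_datetime': '2020-06-01'})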
We've made sure that no matter when you run the notebook, you have full control over the partition (June 1st) it will read from.
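As a rough sketch of what that read could look like (the table and column names are again invented), the partition now follows directly from the parameter:

partition_date = process_datetime.strftime('%Y-%m-%d')

# Deterministic: rerunning on June 2nd with the same parameter
# still reads the June 1st partition.
df = spark.read.table('events').where(f"event_date = '{partition_date}'")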
You can always get the full list of widget functions by running dbutils.widgets.help()
in a Python cell: