Let’s use the same basic setup as in test python code, then use our knowledge from create python packages to convert our code into a package. Finally, we will install the package on our Databricks cluster.
Python packages are easy to test in isolation. But if packaging your code is not an option and you still want to automatically verify that your code actually works, you can run your Databricks notebook from Azure DevOps directly using the databricks-cli.
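The databricks-cli is a thin wrapper around the Databricks REST API, so triggering a one-off notebook run from an Azure DevOps pipeline boils down to a call against the Jobs API. The sketch below makes that call directly from Python; the workspace URL, token, cluster id and notebook path are placeholders you would inject as pipeline variables.

```python
import time
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # e.g. an Azure DevOps secret variable
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Submit a one-time run of the notebook on an existing cluster.
run = requests.post(
    f"{HOST}/api/2.0/jobs/runs/submit",
    headers=HEADERS,
    json={
        "run_name": "ci-notebook-check",
        "existing_cluster_id": "<cluster-id>",
        "notebook_task": {"notebook_path": "/Shared/my_notebook"},
    },
).json()

# Poll until the run finishes, then fail the pipeline on a bad result state.
while True:
    state = requests.get(
        f"{HOST}/api/2.0/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": run["run_id"]},
    ).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(30)

assert state.get("result_state") == "SUCCESS", state
```

A databricks-cli step in the pipeline does roughly the same thing: it submits an equivalent JSON payload and polls the run until it terminates.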
A Databricks notebook that has datetime.now() in one of its cells will most likely behave differently when it’s run again at a later point in time. For example, if you read in data from today’s partition (June 1st) using that datetime, but the notebook fails halfway through, you cannot simply restart the same job on June 2nd and assume that it will read from the same partition.
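To make that concrete, here is a minimal sketch of the difference between deriving the partition from datetime.now() and passing the run date in explicitly. The path layout is made up, and spark is the SparkSession that a Databricks notebook provides by default.

```python
from datetime import date, datetime

def load_daily_partition(spark, run_date: date):
    # Hypothetical partition layout, one folder per day.
    return spark.read.parquet(f"/mnt/raw/events/date={run_date.isoformat()}")

# Brittle: rerunning on a later day silently reads a different partition.
df = load_daily_partition(spark, datetime.now().date())

# Deterministic: the intended date is passed in (e.g. as a job parameter),
# so restarting the failed job on June 2nd still reads the June 1st data.
df = load_daily_partition(spark, date(2019, 6, 1))
```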
With databricks-connect you can connect your favorite IDE to your Databricks cluster. This means that you can lint, test, and package the code that you want to run on Databricks much more easily. To set it up: install Java on your local machine, uninstall any pyspark versions, and install databricks-connect using the regular pip commands so that no changes are recorded by your virtual environment tooling (this prevents mutations to Pipfile and Pipfile.lock).
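After running databricks-connect configure with your workspace details, building a SparkSession locally targets the remote cluster. A minimal sketch of a pytest-style test you could run from your IDE under that assumption:

```python
from pyspark.sql import SparkSession

# With databricks-connect installed and configured, getOrCreate() returns a
# session that executes on the remote Databricks cluster, not a local Spark.
spark = SparkSession.builder.getOrCreate()

def test_uppercase_labels():
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    result = [row.label.upper() for row in df.collect()]
    assert result == ["A", "B"]
```

Because the DataFrame operations execute on the cluster, this catches issues a purely local Spark would miss, while the test file itself stays an ordinary, lintable Python module.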
Companies hire developers to write Spark applications, using expensive Databricks clusters, to transform and deliver business-critical data for the end user. It is advisable to properly test your software: enhance your databricks workflow. But if there is no time to set up proper package testing, there is always the hacker way of running tests right inside of Databricks notebooks, as sketched below.
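As an illustration of that hacker way, a single notebook cell can define and run unittest cases in place; clean_column_name below is a toy stand-in for whatever logic your notebook actually contains.

```python
import unittest

def clean_column_name(name: str) -> str:
    # Toy stand-in for logic you would normally import from a package.
    return name.strip().lower().replace(" ", "_")

class CleaningTests(unittest.TestCase):
    def test_clean_column_name(self):
        self.assertEqual(clean_column_name(" First Name "), "first_name")

# A notebook cell has no meaningful sys.argv and must not call sys.exit(),
# so pass argv explicitly and keep the interpreter alive after the run.
unittest.main(argv=["notebook"], exit=False, verbosity=2)
```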