Companies hire developers to write Spark applications – using expensive Databricks clusters – that transform and deliver business-critical data to the end user.
Update: It is advised to properly test the code you run on Databricks, like this.
But if there’s no time to set up proper package testing, there’s always the hacker way of running tests right inside of Databricks notebooks.
How can you raise exceptions in Databricks notebooks? And how can you test functions in Databricks notebooks?
import doctest
import sys


class DoctestFailureException(Exception):
    """An exception that occurred during a doctest."""
    pass


def doctest_exception_on_fail(function):
    """Run the function's doctest and see whether 70 *'s were written to stdout."""
    # Relies on Databricks' redirected stdout, which exposes clear() and getvalue().
    sys.stdout.clear()
    doctest.run_docstring_examples(function, globals())
    output = sys.stdout.getvalue()
    # doctest prints a banner of 70 asterisks when an example fails.
    if output[:70] == 70 * '*':
        raise DoctestFailureException('Doctest failed.')
The function does two things. First, doctest.run_docstring_examples runs doctest on the docstring of the function you pass in, within the given execution context. Second, the captured stdout is inspected: if it starts with doctest's failure banner of 70 asterisks, a DoctestFailureException is raised. We can apply doctest_exception_on_fail to a subset of the functions in our notebook:
def doctests(*functions):
    """Run doctest in Databricks.

    Pass functions that are to be tested, or pass nothing to find
    public functions in the current global symbol table.
    """
    import doctest
    import sys

    globals_copy = globals().copy()
    if functions:
        for func in functions:
            doctest_exception_on_fail(func)
        return
    # No functions passed: test every public callable defined in the notebook itself.
    [
        doctest_exception_on_fail(func)
        for name, func in globals_copy.items()
        if not name.startswith('_')
        if callable(func)
        if func.__module__ == '__main__'
    ]
The doctests function does the following: if functions are passed, only those functions are tested. If nothing is passed, a copy is made of globals() at the moment the function is run, and every public callable defined in the __main__ execution context is tested. When we test the noticeably erroneous function f, an exception is raised:
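A minimal sketch of what that looks like in a notebook cell (the broken function f below is the same illustrative example used later in this post):
def f(x):
    """An intentionally erroneous docstring example.

    >>> f(1)
    45
    """
    return x + 1


doctests(f)  # raises DoctestFailureException: Doctest failed.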
There, we’ve done it!
Although this works, it is not the right way going forward when it comes to testing code that you run on Databricks.
Can’t we just run pytest, unittest, or doctest in Databricks?
You could run this piece of code in a Databricks notebook, and it will let you know that it's being executed from /databricks/driver:
import pytest
pytest.main(["-x", ".", "-vv"])
Output:
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-5.3.5, py-1.8.1, pluggy-0.13.1 -- /local_disk0/pythonVirtualEnvDirs/virtualEnv-61a202f1-14af-4ea7-8e29-ba1b137b4a5c/bin/python
cachedir: .pytest_cache
rootdir: /databricks/driver
collecting ... collected 0 items
============================ no tests ran in 0.01s =============================
Out[43]: <ExitCode.NO_TESTS_COLLECTED: 5>
In theory it would be possible to put some Python test files in this directory (a sketch of this follows below, after the directory listing). But that doesn't count, as you're not testing your Databricks notebooks. (The folder contains the following:)
%sh ls /databricks/driver -a
Output:
.
..
conf
derby.log
eventlogs
ganglia
logs
.pytest_cache
Databricks notebooks are not regular .py files that pytest would be able to find on the filesystem.
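To illustrate the "in theory" remark above: you could write a plain Python test file to the driver's working directory and point pytest at it. This is only a sketch (the file name test_smoke.py and its contents are made up), and it exercises an ordinary file, not the notebook:
# Write a throwaway test file to the driver's local filesystem (illustrative only).
test_code = """
def test_addition():
    assert 1 + 1 == 2
"""
with open("/databricks/driver/test_smoke.py", "w") as handle:
    handle.write(test_code)

import pytest
pytest.main(["-x", "/databricks/driver", "-vv"])  # now collects and runs test_smoke.py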
But what about doctest? Execute the following code locally, in a Python shell:
import sys
import doctest


def f(x):
    """
    >>> f(1)
    45
    """
    return x + 1


# Locally, doctest reports the failing example (expected 45, got 2).
my_module = sys.modules[__name__]
doctest.testmod(m=my_module)
Now execute the same code in a Databricks notebook.
It won’t work. The documentation of doctest.testmod
states the following:
Test examples in docstrings in functions and classes reachable from module m (or the current module if m is not supplied), starting with m.__doc__.
Apparently sys.modules[__name__] does not behave like a regular module on Databricks.
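If you want to see what you are actually dealing with, a quick inspection along these lines makes the difference visible (the exact output will differ per environment, so none is shown here):
import sys

current = sys.modules[__name__]
print(__name__)                        # the name the notebook code runs under
print(type(current))                   # the object that backs that entry in sys.modules
print(hasattr(current, '__dict__'))    # testmod discovers docstrings via the module's attributes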
We tested a Databricks notebook. But is this really the way to go?
The doctests function is executed and the tests run at runtime. This means that you have to run the actual code to verify its correctness. Tested functions and data processing cells should be logically separated, so that the tests can run without side effects. And the tests can't be run automatically, meaning that running them – every time a change is pushed through – is a manual burden put on developers, instead of an automated process.
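One hypothetical way to keep the test run separated from the data processing cells is to gate it behind a notebook widget (RUN_TESTS below is a made-up widget name, not a Databricks built-in):
# In a dedicated cell at the bottom of the notebook, run the doctests
# only when the (hypothetical) RUN_TESTS widget is set to "true".
dbutils.widgets.dropdown("RUN_TESTS", "false", ["false", "true"], "Run notebook tests")

if dbutils.widgets.get("RUN_TESTS") == "true":
    doctests()  # fails the run with DoctestFailureException on any broken docstring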
Refactoring Databricks notebooks – on which the business's success depends – into Python packages, and running those tested packages on Databricks, is most likely still the best solution (example).
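As a rough sketch of that approach (the package name my_transforms and its contents are made up): the logic lives in an installable package with ordinary pytest tests, and the notebook only imports and calls the tested functions.
# my_transforms/cleaning.py, built into a wheel and installed on the cluster
def strip_whitespace(value: str) -> str:
    """Remove leading and trailing whitespace from a column value."""
    return value.strip()


# tests/test_cleaning.py, run by pytest in CI rather than on Databricks
from my_transforms.cleaning import strip_whitespace


def test_strip_whitespace():
    assert strip_whitespace("  spark  ") == "spark"
With the wheel installed on the cluster (for example via %pip install), the notebook reduces to a thin layer of calls into code that has already been tested.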
Any great idea’s? Let me hear in the comments below!