One of the more challenging aspects I encountered while working with stream processing was the task of performing near-merge or as-of joins on Kafka streams.
Processing messages from Kafka is relatively easy, and a better understanding of how Kafka works internally helps you avoid common errors. We’ll process messages from the start of the topic and see, step by step, how Kafka handles offsets.
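As a rough sketch of what that walkthrough involves (assuming the kafka-python client and a local broker; the topic and group names below are made up):

```python
from kafka import KafkaConsumer

# Hypothetical setup: a local broker and a topic called "demo-topic".
consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",   # start from the beginning of the topic
    enable_auto_commit=False,       # commit offsets explicitly so we control "where we left off"
)

for message in consumer:
    print(message.partition, message.offset, message.value)
    consumer.commit()               # record this offset as processed for the consumer group
```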
When you want to share your software with a non-developer, or you want to run a complete Python project without having to bother with the setup, executables are quite useful. And it only takes about 5 minutes to get it done.
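As a hedged illustration of how little work it is, a tool like PyInstaller (the post covers the details; app.py is a placeholder for your entry point) does it in two commands:

```sh
pip install pyinstaller
pyinstaller --onefile app.py   # produces a single self-contained executable in ./dist
```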
Sometimes it’s useful to retrieve a Key Vault secret in a Data Factory pipeline (in a secure way).
The Python Package Index (pypi.org) is not the only source for downloading Python packages; it’s possible to host your own package index.
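Pointing pip at another index is a matter of a flag (the URL below is a placeholder for your own index):

```sh
# Replace pypi.org entirely with a private index
pip install demo --index-url https://my-index.example.com/simple/

# Or keep pypi.org as the primary source and add a fallback index
pip install demo --extra-index-url https://my-index.example.com/simple/
```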
This post will be very similar to the last one. But this time, we’ll release our python package to Azure Synapse instead of Databricks. In order to unit test Synapse Notebooks, you’ll have to jump through all sorts of hoops.
Let’s use the same basic setup as in test python code, then use our knowledge from create python packages to convert our code to a package. And finally we will install the package on our Databricks cluster.
There are various reasons to run dockerized GUI apps. For me, the reason was stress testing a web app’s JavaScript logic by keeping 200+ browser tabs open for a long period of time.
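One common approach on a Linux host (not necessarily the exact setup from the post) is to share the X11 socket with the container; the image name below is just an example of an image containing a GUI app:

```sh
xhost +local:docker            # allow local containers to talk to the X server
docker run --rm \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  jess/firefox                 # example image containing a GUI browser
```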
Python packages are easy to test in isolation. But what if packaging your code is not an option and you still want to automatically verify that your code actually works? You can run your Databricks notebook from Azure DevOps directly using the databricks-cli.
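A rough sketch of how that looks with the databricks-cli (the JSON run spec and run id are placeholders; in a pipeline, the token would come from a secret variable):

```sh
pip install databricks-cli
databricks configure --token              # point the CLI at your workspace

# Submit a one-off run of the notebook, described in a small JSON spec
databricks runs submit --json-file notebook-run.json

# Poll the run state using the run id returned by the submit call
databricks runs get --run-id 42
```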
A Databricks notebook that has datetime.now() in one of its cells will most likely behave differently when it’s run again at a later point in time. For example: when you read in data from today’s partition (June 1st) using the datetime, but the notebook fails halfway through, you wouldn’t be able to restart the same job on June 2nd and assume that it will read from the same partition.
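A common way around this (sketched below; the widget name and path are assumptions, not necessarily what the post uses) is to pass the run date in as a parameter and only fall back to datetime.now() when nothing is passed:

```python
from datetime import datetime

# In a Databricks notebook, dbutils.widgets lets the caller (e.g. a scheduled job) pass parameters in.
dbutils.widgets.text("run_date", "")                       # hypothetical parameter name
run_date = dbutils.widgets.get("run_date") or datetime.now().strftime("%Y-%m-%d")

# Restarting the job with run_date="2021-06-01" now reads the same partition again.
df = spark.read.parquet(f"/mnt/data/partition_date={run_date}")
```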
Although the official Python documentation dedicates an in-depth document to Functional Programming, the community does not consider FP techniques best practice at all times; popular FP-related functions have been moved to functools. Still, FP concepts can be very helpful, and in this post I will demonstrate some ‘Functional Programming’-related concepts, using Python.
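For instance, functools contains reduce and partial, two of the staples:

```python
from functools import reduce, partial

# reduce folds a sequence into a single value
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])   # 10

# partial fixes some arguments of a function and returns a new function
def power(base, exponent):
    return base ** exponent

square = partial(power, exponent=2)
print(total, square(5))                                  # 10 25
```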
With databricks-connect you can connect your favorite IDE to your Databricks cluster. This means that you can now lint, test, and package the code that you want to run on Databricks more easily:
Why should you care about creating packages? Packages are easy to install (pip install demo). Packages simplify development (pip install -e . installs your package and keeps it up-to-date during development). Packages are easy to run and test …
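A minimal setup.py is enough to make those commands work (a sketch; the names are placeholders and the posts may use a slightly different layout):

```python
# setup.py -- minimal packaging metadata for a package called "demo"
from setuptools import find_packages, setup

setup(
    name="demo",
    version="0.1.0",
    packages=find_packages(),   # picks up the demo/ package directory
    install_requires=[],        # runtime dependencies go here
)
```

With this file in place, pip install -e . links the source directory into your environment, so code changes are picked up without reinstalling.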
This post covers my personal workflow for python projects, using Visual Studio Code along with some other tools. A good workflow saves time and allows you to focus on the problem at hand, instead of tasks that make you feel like a robot (machines are good for that).
Databricks-connect allows you to connect your favorite IDE to your Databricks cluster. Install Java on your local machine, uninstall any pyspark versions, and install databricks-connect using the regular pip commands, which prevents the changes from being recorded in your virtual environment (no mutations to Pipfile and Pipfile.lock).
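The commands look roughly like this (the version pin is a placeholder and should match your cluster’s Databricks Runtime):

```sh
pip uninstall pyspark                          # databricks-connect ships its own Spark distribution
pip install -U "databricks-connect==7.3.*"     # pin to your cluster's runtime version
databricks-connect configure                   # workspace URL, token and cluster id
databricks-connect test                        # verify that the connection works
```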
If you’re accustomed to using a unix shell as your command-line interface, you may end up being very unproductive on Windows 10. Microsoft (in collaboration with Canonical) has put tremendous effort into closing the gap between Linux and Windows developers by creating a kernel compatibility layer based on Ubuntu.
Writing unit tests should be an integral part of delivering software for every developer. Whenever a piece of code is changed, it has the potential to break all other parts. The broken parts may only be discovered at a much later stage, having caused damage that is hard to repair.
Companies hire developers to write Spark applications – using expensive Databricks clusters – transforming and delivering business-critical data to the end user. It is advised to properly test your software (see enhance your databricks workflow). But if there is no time to set up proper package testing, there is always the hacker way of running tests right inside of Databricks notebooks.
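That hacker way boils down to something like the cell below (a sketch; add_vat is a made-up function under test):

```python
import unittest

def add_vat(amount, rate=0.21):
    return round(amount * (1 + rate), 2)

class AddVatTest(unittest.TestCase):
    def test_default_rate(self):
        self.assertEqual(add_vat(100), 121.0)

# exit=False keeps unittest from terminating the notebook's Python process
unittest.main(argv=["ignored"], verbosity=2, exit=False)
```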
In this post we will take a look at using Pipenv, a dependency manager for python, to boost your python workflow. A virtual environment is an isolated environment in which dependencies for a python project are contained.
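In day-to-day use that looks something like this (requests and pytest are just example dependencies):

```sh
pipenv install requests        # creates the virtualenv plus Pipfile / Pipfile.lock
pipenv install --dev pytest    # development-only dependency
pipenv shell                   # spawn a shell inside the virtual environment
pipenv run python main.py      # or run a single command inside it
```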
This will be a micro-post about managing multiple Python versions on your machine, using a tool called Pyenv. Depending on your operating system, it may be quite a hassle to uninstall and install different Python versions.
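With pyenv, switching versions comes down to a few commands (the version numbers are just examples):

```sh
pyenv install 3.8.12           # download and build a specific interpreter
pyenv install 3.10.4
pyenv global 3.10.4            # default Python for your user
pyenv local 3.8.12             # pin a version for the current project directory
python --version               # -> Python 3.8.12 inside that directory
```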
In this post we’ll create a free static website with Hugo, push our project to Azure Repos, and build and release the site to GitHub Pages using Azure DevOps.
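The Hugo part of that pipeline is only a handful of commands (site and post names are placeholders; the Azure DevOps wiring is what the post is really about):

```sh
hugo new site myblog           # scaffold the project
cd myblog
hugo new posts/hello-world.md  # add a first post
hugo server -D                 # preview locally, drafts included
hugo                           # render the static site into ./public
```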