One of the more challenging aspects I encountered while working with stream processing was the task of performing near-merge or as-of joins on Kafka streams.
Processing messages from Kafka is relatively easy, and a better understanding of how Kafka works internally helps you avoid common errors. We’ll process messages from the start of the topic and see, step by step, how Kafka handles offsets.
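As a rough sketch of what that walkthrough involves (assuming the kafka-python client and a local broker; the topic and group names below are made up):

```python
from kafka import KafkaConsumer

# Hypothetical setup: a local broker and a topic called "demo-topic".
consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",   # start from the beginning of the topic
    enable_auto_commit=False,       # commit offsets explicitly so we control "where we left off"
)

for message in consumer:
    print(message.partition, message.offset, message.value)
    consumer.commit()               # record this offset as processed for the consumer group
```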
When you want to share your software with a non-developer, or you want to run a complete Python project without having to bother with the setup, executables are quite useful. And it only takes about 5 minutes to get it done.
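As a hedged illustration of how little work it is, a tool like PyInstaller (the post covers the details; app.py is a placeholder for your entry point) does it in two commands:

```sh
pip install pyinstaller
pyinstaller --onefile app.py   # produces a single self-contained executable in ./dist
```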
Sometimes it’s useful to retrieve a Key Vault secret in a Data Factory pipeline (in a secure way).
The Python Package Index (pypi.org) is not the only source for downloading Python packages; it’s possible to host your own package index.
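Pointing pip at another index is a matter of a flag (the URL below is a placeholder for your own index):

```sh
# Replace pypi.org entirely with a private index
pip install demo --index-url https://my-index.example.com/simple/

# Or keep pypi.org as the primary source and add a fallback index
pip install demo --extra-index-url https://my-index.example.com/simple/
```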
This post will be very similar to the last one. But this time, we’ll release our python package to Azure Synapse instead of Databricks. In order to unit test Synapse Notebooks, you’ll have to jump through all sorts of hoops.
Let’s use the same basic setup as in test python code, then use our knowledge from create python packages to convert our code to a package. And finally we will install the package on our Databricks cluster.
There are various reasons to run dockerized GUI apps. For me, the reason was stress testing a web app’s JavaScript logic by keeping 200+ browser tabs open for a long period of time.
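One common approach on a Linux host (not necessarily the exact setup from the post) is to share the X11 socket with the container; the image name below is just an example of an image containing a GUI app:

```sh
xhost +local:docker            # allow local containers to talk to the X server
docker run --rm \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  jess/firefox                 # example image containing a GUI browser
```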
Python packages are easy to test in isolation. But what if packaging your code is not an option and you still want to automatically verify that your code actually works? You can run your Databricks notebook from Azure DevOps directly using the databricks-cli.
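A rough sketch of how that looks with the databricks-cli (the JSON run spec and run id are placeholders; in a pipeline, the token would come from a secret variable):

```sh
pip install databricks-cli
databricks configure --token              # point the CLI at your workspace

# Submit a one-off run of the notebook, described in a small JSON spec
databricks runs submit --json-file notebook-run.json

# Poll the run state using the run id returned by the submit call
databricks runs get --run-id 42
```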
A Databricks notebook that has datetime.now() in one of its cells will most likely behave differently when it’s run again at a later point in time. For example: when you read in data from today’s partition (June 1st) using the datetime, but the notebook fails halfway through, you wouldn’t be able to restart the same job on June 2nd and assume that it will read from the same partition.
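A common way around this (sketched below; the widget name and path are assumptions, not necessarily what the post uses) is to pass the run date in as a parameter and only fall back to datetime.now() when nothing is passed:

```python
from datetime import datetime

# In a Databricks notebook, dbutils.widgets lets the caller (e.g. a scheduled job) pass parameters in.
dbutils.widgets.text("run_date", "")                       # hypothetical parameter name
run_date = dbutils.widgets.get("run_date") or datetime.now().strftime("%Y-%m-%d")

# Restarting the job with run_date="2021-06-01" now reads the same partition again.
df = spark.read.parquet(f"/mnt/data/partition_date={run_date}")
```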
Although the official Python documentation dedicates an in-depth document to Functional Programming, the community does not consider FP techniques best practice at all times; popular FP-related functions have been moved to functools. Still, FP concepts can be very helpful, and in this post I will demonstrate some ‘Functional Programming’-related concepts, using Python.
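For instance, functools contains reduce and partial, two of the staples:

```python
from functools import reduce, partial

# reduce folds a sequence into a single value
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])   # 10

# partial fixes some arguments of a function and returns a new function
def power(base, exponent):
    return base ** exponent

square = partial(power, exponent=2)
print(total, square(5))                                  # 10 25
```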
With databricks-connect you can connect your favorite IDE to your Databricks cluster. This means that you can now lint, test, and package the code that you want to run on Databricks more easily:
Why should you care about creating packages? Packages are easy to install (pip install demo). Packages simplify development (pip install -e . installs your package and keeps it up-to-date during development). Packages are easy to run and test …
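A minimal setup.py is enough to make those commands work (a sketch; the names are placeholders and the posts may use a slightly different layout):

```python
# setup.py -- minimal packaging metadata for a package called "demo"
from setuptools import find_packages, setup

setup(
    name="demo",
    version="0.1.0",
    packages=find_packages(),   # picks up the demo/ package directory
    install_requires=[],        # runtime dependencies go here
)
```

With this file in place, pip install -e . links the source directory into your environment, so code changes are picked up without reinstalling.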
This post covers my personal workflow for python projects, using Visual Studio Code along with some other tools. A good workflow saves time and allows you to focus on the problem at hand, instead of tasks that make you feel like a robot (machines are good for that).
Databricks-connect allows you to connect your favorite IDE to your Databricks cluster. Install Java on your local machine, uninstall any pyspark versions, and install databricks-connect using the regular pip commands, which prevents the changes from being recorded in your virtual environment (no mutations to Pipfile and Pipfile.lock).
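The commands look roughly like this (the version pin is a placeholder and should match your cluster’s Databricks Runtime):

```sh
pip uninstall pyspark                          # databricks-connect ships its own Spark distribution
pip install -U "databricks-connect==7.3.*"     # pin to your cluster's runtime version
databricks-connect configure                   # workspace URL, token and cluster id
databricks-connect test                        # verify that the connection works
```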
If you’re accustomed to using a unix shell as your command-line interface, you may end up being very unproductive on Windows 10. Microsoft (in collaboration with Canonical) has put tremendous effort into closing the gap between Linux and Windows developers by creating a kernel compatibility layer based on Ubuntu.
Writing unit tests should be an integral part of delivering software for every developer. Whenever a piece of code is changed, it has the potential to break all other parts. The broken parts may only be discovered at a much later stage, having caused damage that is hard to repair.
Companies hire developers to write Spark applications – using expensive Databricks clusters – transforming and delivering business-critical data to the end user. It is advised to properly test your software (see enhance your databricks workflow). But if there is no time to set up proper package testing, there is always the hacker way of running tests right inside of Databricks notebooks.
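That hacker way boils down to something like the cell below (a sketch; add_vat is a made-up function under test):

```python
import unittest

def add_vat(amount, rate=0.21):
    return round(amount * (1 + rate), 2)

class AddVatTest(unittest.TestCase):
    def test_default_rate(self):
        self.assertEqual(add_vat(100), 121.0)

# exit=False keeps unittest from terminating the notebook's Python process
unittest.main(argv=["ignored"], verbosity=2, exit=False)
```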
In this post we will take a look at using Pipenv, a dependency manager for python, to boost your python workflow. A virtual environment is an isolated environment in which dependencies for a python project are contained.
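In day-to-day use that looks something like this (requests and pytest are just example dependencies):

```sh
pipenv install requests        # creates the virtualenv plus Pipfile / Pipfile.lock
pipenv install --dev pytest    # development-only dependency
pipenv shell                   # spawn a shell inside the virtual environment
pipenv run python main.py      # or run a single command inside it
```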
This will be a micro-post about managing multiple Python versions on your machine, using a tool called Pyenv. Depending on your operating system, it may be quite a hassle to uninstall and install different Python versions.
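With pyenv, switching versions comes down to a few commands (the version numbers are just examples):

```sh
pyenv install 3.8.12           # download and build a specific interpreter
pyenv install 3.10.4
pyenv global 3.10.4            # default Python for your user
pyenv local 3.8.12             # pin a version for the current project directory
python --version               # -> Python 3.8.12 inside that directory
```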
In this post we’ll create a free static website with Hugo, push our project to Azure Repos, and build and release the site to GitHub Pages using Azure DevOps.
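The Hugo part of that pipeline is only a handful of commands (site and post names are placeholders; the Azure DevOps wiring is what the post is really about):

```sh
hugo new site myblog           # scaffold the project
cd myblog
hugo new posts/hello-world.md  # add a first post
hugo server -D                 # preview locally, drafts included
hugo                           # render the static site into ./public
```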