Azure Databricks Notebook Modularization and Interaction
Azure Databricks is one of the most popular platforms and engines for data engineering and machine learning in the Azure cloud. The notebook is its most widely used tool for data processing and data analysis: it not only inherits Jupyter Notebook's powerful Python functionality but also integrates Scala, R, SQL, and even Markdown, so you can build a storyboard for a data process. One common need in large data-process development is for notebooks to call each other: a main notebook calls sub-notebooks to retrieve classes, functions, or properties, and sub-notebooks call a parameter notebook to retrieve parameter values. A notebook can be modularized and imported by other notebooks, and this post covers the methods for notebook modularization and cross-notebook references.
In one recent project, we collaborated with a partner engineering team to build a data solution in Azure on top of their product, through the APIs they provide. We used Databricks to call their API and fetch data from their application, then applied our business logic to perform data transformation and report generation. We created environment-configuration and utility notebooks for reuse and reference, and below we summarize the different ways to modularize notebooks by leveraging Databricks PySpark commands and functionality.
Situation and Background
We have three module notebooks that support the main data process notebook:
- util: generic and common functions used by the main notebooks
- env_config: environment information, parameter values, and common libraries for the main process
- access_token_output: access token generator for obtaining time-bound API access
We need all the variables, parameters, and functions from the util and env_config notebooks, but only the token output from the access_token_output notebook, so we use two different methods:
- %run command
- dbutils.notebook.run() function
Solution
Method 1: %run command
%run is a Databricks magic command, and its usage syntax is very simple:

```
%run <notebook_path> $parameter_1="value" $parameter_2="value" ... $parameter_n="value"
```
We can call one notebook from another with it, and all variables defined in the referenced notebook become available in the main notebook. One thing to pay attention to: %run must be in a cell by itself, because it runs the entire notebook inline (see the Databricks documentation). Additionally, a space is the separator between parameters, and the notebook path should be given relative to the current main notebook.
```
%run ../config/env_config $configuration_file=$configuration_file $environment_params=$environment_params
```
```
%run ../config/util
```
If we want to comment out the command, we just put the cursor in the cell and press Ctrl + /. Now let's look at how the referenced notebook receives parameter values from the main notebook:
```python
executionStatus = []
```
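For illustration, here is a minimal sketch of how the env_config notebook might receive those values through notebook widgets; the widget names mirror the $-parameters passed by %run above, and the empty defaults are assumptions:

```python
# Declare widgets so the notebook can accept parameters from a caller;
# %run binds $configuration_file and $environment_params to them.
dbutils.widgets.text("configuration_file", "")
dbutils.widgets.text("environment_params", "")

# Read the passed-in values into the local variables used below.
configFile = dbutils.widgets.get("configuration_file")
envParams = dbutils.widgets.get("environment_params")
```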
As shown above, the referenced notebook needs values for the variables configFile and envParams from the main process notebook in order to parse the properties below:
```python
# Read Environment Configurations
```
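The rest of that block isn't shown here; a hypothetical sketch of what it might do, assuming the configuration arrives as a JSON string and the property names match the variables used later in this post:

```python
import json

# Hypothetical: parse the configuration string passed in by the main notebook.
envConfig = json.loads(configFile)

# Hypothetical property names; they feed the JDBC URL built further below.
jdbcUrlPrefix = envConfig["jdbcUrlPrefix"]
sqlServerName = envConfig["sqlServerName"]
sqlDatabaseName = envConfig["sqlDatabaseName"]
```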
In the main process notebook, we retrieve the parameter values from ADF (Azure Data Factory) and then use the %run command to bring in all the properties from the referenced notebook:
```python
# Reading the parameters passed from the ADF
```
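ADF passes pipeline parameters to a Databricks notebook activity as base parameters, which the notebook reads back as widgets; a minimal sketch, with the widget names assumed to match the ADF parameters:

```python
# ADF base parameters arrive in the main notebook as widgets.
configuration_file = dbutils.widgets.get("configuration_file")
environment_params = dbutils.widgets.get("environment_params")
```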
```
%run ../config/env_config $configuration_file=$configuration_file $environment_params=$environment_params
```
If we want to use variables defined in the referenced notebook, we can reference them directly in the main process notebook:
```python
sqlJdbcUrl = jdbcUrlPrefix + sqlServerName + ";" + "databaseName=" + sqlDatabaseName
```
If we want to use functions defined in the utility notebook, we can likewise call them directly from the main process notebook:
```python
def writeSqlTable(jdbcUrl, write_df, sqlTable, writeMode, connectionProperties, execStatus):
```
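Only the signature appears above; a sketch of what such a utility might look like, assuming it wraps the standard Spark JDBC writer and records the outcome in the shared executionStatus list:

```python
def writeSqlTable(jdbcUrl, write_df, sqlTable, writeMode, connectionProperties, execStatus):
    # Hypothetical body: write the DataFrame to the SQL table over JDBC
    # and append an outcome record for downstream reporting.
    try:
        write_df.write.jdbc(url=jdbcUrl, table=sqlTable,
                            mode=writeMode, properties=connectionProperties)
        execStatus.append((sqlTable, "succeeded"))
    except Exception as err:
        execStatus.append((sqlTable, f"failed: {err}"))
        raise
```

Because util was brought in with %run, the main notebook can call it as if it were defined locally, e.g. writeSqlTable(sqlJdbcUrl, resultDf, "dbo.Report", "overwrite", connectionProperties, executionStatus), where the DataFrame and table name are hypothetical.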
Method 2: dbutils.notebook.run() function
The syntax is:

```
run(path: String, timeout_seconds: int, arguments: Map): String
```

The arguments map can be a Python dictionary, for example:

```python
dbutils.notebook.run("sample_notebook", 60, {"param1": "value", "param2": "value"})
```
In this way, we can run a notebook and get back its exit/return value. The method starts an ephemeral job that runs immediately, which means a new instance of the executed notebook is created and the computations are done within it, in its own scope, completely apart from the main notebook. As a result, none of the functions and variables defined in the executed notebook can be reached from the main notebook.
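A minimal sketch of that round trip, assuming a child notebook sample_notebook that finishes with dbutils.notebook.exit():

```python
# In the child notebook (sample_notebook), the last command returns a
# value to the caller:
#     dbutils.notebook.exit("some-result-string")

# In the main notebook, run the child as an ephemeral job and capture its
# exit value; none of the child's variables leak into this scope.
result = dbutils.notebook.run("sample_notebook", 60, {"param1": "value"})
print(result)  # "some-result-string"
```

Now let's see how we use this method in our main process notebook.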
Use Case
As shown before, the %run magic command must run in a separate cell and can't be combined with other statements. But sometimes we want to run a notebook in a separate environment without affecting the current main notebook, and then dbutils.notebook.run() is the better choice. It won't overwrite the variables and functions in our current main notebook; it only returns the executed notebook's exit value, so it can be used together with other code. For example, when we use the OAuth2 method to authenticate our SP (Service Principal) API access, we need to get an access token, which is a time-bound string. If our process needs to call the API millions of times, we have to retrieve a new access token before the old one expires. For that case, we have a notebook created specifically for retrieving a new access token, and the main process notebook integrates it into the logic with other code in one cell:
```python
if datetime.now() > tokenExpTime:
```
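Only the first line of that cell appears above; a sketch of how the refresh logic might continue, assuming access_token_output exits with a JSON string carrying the token and its lifetime (the field names and relative path are assumptions):

```python
import json
from datetime import datetime, timedelta

if datetime.now() > tokenExpTime:
    # Run the token notebook as an ephemeral job and capture its exit value.
    tokenOutput = dbutils.notebook.run("../auth/access_token_output", 120, {})
    tokenInfo = json.loads(tokenOutput)

    accessToken = tokenInfo["access_token"]
    # Refresh slightly early so in-flight API calls never use a stale token.
    tokenExpTime = datetime.now() + timedelta(seconds=tokenInfo["expires_in"] - 60)
```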
The Differences Between Method 1 and Method 2
In method 1, the main process notebook includes the entire referenced notebook env_config. The referenced notebook's variables and functions become available to the main process notebook, but they also overwrite any same-named variables and functions in the main process notebook.
In method 2, the referenced notebook access_token_output runs as a separate job. The main process notebook's variables and functions are not overwritten by the referenced notebook and stay as they were.
Summary
Here is a summary of the differences between them.
| | %run | dbutils.notebook.run() |
| --- | --- | --- |
| Running Environment | The same instance as the calling notebook | A new instance of the executed notebook |
| Running Method | In a separate cell by itself | In one line of PySpark code |
| Supported Parameter Types | String (only) | Map |
| Execution Output | Doesn't return an exit value unless dbutils.notebook.exit() is added to the executed notebook explicitly | Returns the executed notebook's exit value and offers details of the notebook execution |