An Application of Microsoft Purview Authentication Workflow for API and Data Sources

Microsoft Purview is a powerful data catalog and governance solution that helps organizations discover metadata and manage their data assets. On the front end, it provides a web portal for business users to browse metadata and manage data mappings; on the back end, it provides APIs for accessing data sources for scanning and for updating business metadata in a Purview instance through batch data pipelines. One critical aspect of using Purview is therefore API authentication, especially when integrating with other services and applications.

In this article, we’ll take a deep dive into how Purview uses OAuth 2.0 authentication for service principals and how to securely access and manage data sources for scanning, providing a comprehensive understanding of the process.
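As a taste of what the article covers, here is a minimal sketch of the service-principal flow: acquiring an OAuth 2.0 token with the client-credentials grant and then calling the Purview catalog with it. The tenant ID, client ID, secret, and account name are placeholders for values from your own Azure AD app registration, and the search endpoint and API version are illustrative of the Purview data-plane REST surface.

```python
import requests

# Placeholders -- substitute values from your own Azure AD app registration.
TENANT_ID = "<tenant-id>"
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"
PURVIEW_ACCOUNT = "<purview-account-name>"

# Step 1: request an access token via the OAuth 2.0 client-credentials grant.
# The resource for Purview's data-plane APIs is https://purview.azure.net.
token_response = requests.post(
    f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "resource": "https://purview.azure.net",
    },
)
token_response.raise_for_status()
access_token = token_response.json()["access_token"]

# Step 2: call the catalog with the bearer token -- here, a simple
# keyword search against the Purview discovery endpoint.
search_response = requests.post(
    f"https://{PURVIEW_ACCOUNT}.purview.azure.com/catalog/api/search/query",
    params={"api-version": "2022-08-01-preview"},
    headers={"Authorization": f"Bearer {access_token}"},
    json={"keywords": "sales", "limit": 5},
)
print(search_response.json())
```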

Read more
Azure Data Factory and Azure Databricks Work Together to Build Robust Data Pipelines for Large-Scale Data Processing

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the Azure cloud for orchestrating and automating data movement and data transformation. If you are familiar with Microsoft BIDS (Business Intelligence Development Studio) and have used SSIS (SQL Server Integration Services) on-premises, you can think of Azure Data Factory (we will call it ADF) as the counterpart of SSIS in the Azure cloud. ADF is also a key component if you’re looking to migrate data to the cloud.

ADF is, in effect, a data platform that lets users create workflows that ingest data from both on-premises and cloud data stores, then transform or process that data using integrated compute services such as Azure Synapse and Azure Databricks (we will call it ADB). The results can then be published to an on-premises or cloud data store, e.g. SQL Server or Azure SQL Database, for business intelligence (BI) applications such as Tableau or Power BI to consume.
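To make the orchestration concrete, here is a minimal sketch of starting an ADF pipeline run through the Azure management REST API. The subscription, resource group, factory, and pipeline names are placeholders, the parameter names in the request body are invented for illustration, and the bearer token is assumed to have been obtained for the https://management.azure.com resource (for example, with a service principal).

```python
import requests

# Placeholders -- substitute your own Azure resource identifiers.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
PIPELINE_NAME = "<pipeline-name>"
ACCESS_TOKEN = "<bearer-token-for-management.azure.com>"

# Trigger a pipeline run; keys in the JSON body become pipeline parameters.
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
    f"/factories/{FACTORY_NAME}/pipelines/{PIPELINE_NAME}/createRun"
)
response = requests.post(
    url,
    params={"api-version": "2018-06-01"},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"source": "blog-demo", "window_start": "2024-01-01"},  # hypothetical parameters
)
response.raise_for_status()
print("Run ID:", response.json()["runId"])
```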

ADF inherits many key components from SSIS, such as the Stored Procedure and Script activities. By leveraging the data-processing power of stored procedures, you can perform almost any data manipulation and transformation; the Script activity is fully compatible with T-SQL syntax, so you can express business logic programmatically rather than as a single SQL statement. Beyond that, the most powerful component is the ADB integration. ADB is now an industry standard for data engineering; we have discussed it many times in previous posts. Now, let’s see how to efficiently process data using ADF and ADB together.
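On the ADB side, a notebook invoked by ADF’s Databricks Notebook activity typically receives its inputs as base parameters and hands a result back to the pipeline. Below is a minimal sketch; the parameter name input_path is invented for illustration.

```python
# Databricks notebook called by an ADF "Databricks Notebook" activity.
# ADF baseParameters arrive in the notebook as widgets.
dbutils.widgets.text("input_path", "")  # default used when run interactively
input_path = dbutils.widgets.get("input_path")

# Process the data with Spark (a trivial example: count rows).
df = spark.read.format("parquet").load(input_path)
row_count = df.count()

# Return a value to ADF; it surfaces as the activity's runOutput.
dbutils.notebook.exit(str(row_count))
```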

Read more
Azure Databricks Notebooks Modularization and Interaction

Azure Databricks is the most common and popular platform and engine for data engineering and machine learning in the Azure cloud. The notebook is its most-used tool for data processing and data analysis: it not only inherits Jupyter Notebook’s powerful Python functionality but also integrates Scala, R, SQL, and even Markdown, letting you build a storyboard for a data process. One common pattern in large data-processing projects is notebooks calling each other: a main notebook calls sub-notebooks to retrieve classes, functions, or properties, and sub-notebooks call a parameter notebook to retrieve parameter values. A notebook can be modularized and imported by other notebooks, and this post covers the methods for notebook modularization and cross-notebook references.
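As a preview, the two standard mechanisms the post compares are %run, which inlines another notebook’s definitions into the caller’s scope, and dbutils.notebook.run, which executes a child notebook as a separate run and collects its return value. The notebook paths below are invented for illustration.

```python
# Option 1: %run inlines the target notebook, so its functions,
# classes, and variables become available in the current session.
# (Magic commands must be the only content of their cell.)
# %run ./shared/param_notebook

# Option 2: dbutils.notebook.run executes the child notebook as a
# separate run and returns whatever it passes to dbutils.notebook.exit.
# Arguments: path, timeout in seconds, and a dict that arrives in the
# child notebook as widgets.
result = dbutils.notebook.run("./shared/child_notebook", 600, {"env": "dev"})
print("Child notebook returned:", result)
```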

Read more
Azure Databricks DBFS and Ways to Interact with Local Images and Azure ADLS

In today’s technology industry, Databricks has undoubtedly become a unicorn company in big data distributed processing, data analysis, data visualization, and machine learning. In the current era of cloud computing, the three major cloud service providers, Microsoft, Amazon, and Google, have all incorporated Databricks into their cloud computing platforms. This shows Databricks’ unique contribution to data cloud computing and its pivotal role in the development of enterprise-level data products.

With the decreasing cost of cloud storage and the improvement in network speeds, more and more enterprises are choosing to store all their data in a central repository rather than separately storing different types of data. This trend towards centralization helps companies better understand their business operations through real-time business intelligence and predictive analytics. At the same time, the explosive growth of data has made it impractical for companies to maintain multiple large data stores, leading to the merging of data lakes and data warehouses into a single platform. Based on the Lakehouse technology architecture, Databricks provides platform-level services that integrate data storage, processing, visualization, and machine learning into a unified environment. As a result, more and more enterprises are choosing Databricks as their primary cloud data service platform, and developers also prefer Databricks Notebook as a unified development and presentation environment.
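As a preview of the interactions the post walks through, here is a minimal sketch of browsing DBFS and reading from ADLS Gen2 inside a Databricks notebook. The file paths, container, and storage account are placeholders, and the cluster is assumed to already hold credentials for the storage account (for example, via a service principal).

```python
# Browse the Databricks File System (DBFS) root.
display(dbutils.fs.ls("dbfs:/"))

# Copy a file previously uploaded to DBFS (e.g. a local image pushed
# through the UI or CLI) into a working folder.
dbutils.fs.cp("dbfs:/FileStore/images/sample.png", "dbfs:/tmp/sample.png")

# Read data directly from ADLS Gen2 using an abfss:// URI.
# <container> and <storage-account> are placeholders.
adls_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/data/events"
df = spark.read.format("delta").load(adls_path)
df.show(5)
```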

Read more