Efficient Ways to Process and Scan Large Data in Azure Databricks

When you work with data and databases, writing queries is a daily routine. Two queries that serve the same purpose can have very different running times, which shows that query performance is not determined solely by database configuration, server hardware, table indexes, statistics, and so on, but also by how well you use SQL to construct your query.

To meet strong user demand, we built a sandbox database on production for business partners to serve their BI reporting and business analytics needs. Since we are also in charge of database maintenance and monitoring, we have had the chance to collect all kinds of user queries, some of which crashed the system or visibly dragged down database performance. Here I want to share my thinking on query optimization over large amounts of data.
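To illustrate the point, here is a minimal PySpark sketch (the table and column names are hypothetical) of two filters that serve the same purpose but scan very differently: wrapping the filter column in a function blocks partition pruning and predicate pushdown, so the engine reads far more data than it needs to.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table partitioned by order_date.
orders = spark.read.table("sandbox.orders")

# Slow: wrapping the partition column in a function forces a full scan,
# because the engine can no longer prune partitions or push the predicate down.
slow = orders.filter(F.year(F.col("order_date")) == 2023)

# Fast: an explicit range on the raw column lets the engine skip
# every file and partition outside the range.
fast = orders.filter(
    (F.col("order_date") >= "2023-01-01") & (F.col("order_date") < "2024-01-01")
)
```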

Azure Data Factory and Azure Databricks Work Together to Build a Robust Data Pipeline for Large Data Processing

Azure Data Factory is a cloud-based data integration service that lets you create data-driven workflows in the Azure cloud for orchestrating and automating data movement and data transformation. If you are familiar with Microsoft BIDS (Business Intelligence Development Studio) and used to work with SSIS (SQL Server Integration Services) on-prem, you can see Azure Data Factory (we will call it ADF) as the counterpart of SSIS in the Azure cloud. ADF is also a key component if you are looking to migrate data to the cloud.

ADF is in fact a data platform that allows users to create workflows that ingest data from both on-prem and cloud data stores, and transform or process it using integrated compute services such as Synapse and Azure Databricks (we will call it ADB). The results can then be published to an on-prem or cloud data store, e.g. SQL Server or Azure SQL Database, for business intelligence (BI) applications such as Tableau or Power BI to consume.

ADF inherits most of the key components of SSIS, such as the Stored Procedure and Script activities. By leveraging the data processing capabilities of stored procedures, you can carry out almost any data manipulation and transformation, and the Script activity is fully compatible with T-SQL syntax, so you can express your business logic programmatically rather than as single SQL statements. Beyond that, the most powerful component is the ADB support. ADB is now the data engineering industry standard, and we have talked about it many times in previous posts. Now, let's see how to process our data efficiently using ADF and ADB together.
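To make the interplay concrete, here is a minimal sketch of a Databricks notebook that an ADF Notebook activity could orchestrate; the widget names, paths, and table name are assumptions for illustration. ADF passes pipeline parameters into the notebook as widgets, and the notebook hands a result string back to the pipeline through `dbutils.notebook.exit`.

```python
# Parameters supplied by the ADF Notebook activity arrive as widgets.
# The names source_path and target_table are illustrative assumptions.
source_path = dbutils.widgets.get("source_path")    # e.g. an ADLS folder
target_table = dbutils.widgets.get("target_table")  # e.g. a Delta table name

# Ingest the raw files that ADF landed in the data lake.
df = spark.read.format("parquet").load(source_path)

# Any transformation logic goes here; deduplication is a trivial stand-in.
cleaned = df.dropDuplicates()
row_count = cleaned.count()

# Publish the result as a table that downstream BI tools can query.
cleaned.write.mode("overwrite").saveAsTable(target_table)

# Return a status string that ADF exposes in the activity output,
# so downstream pipeline activities can branch on it.
dbutils.notebook.exit(f"rows_written={row_count}")
```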

Read more
Azure Databricks Notebook Modularization and Interaction

Azure Databricks is the most common and popular platform and engine for data engineering and machine learning in the Azure cloud. The notebook is its most used tool for data processing and data analysis: it not only inherits the powerful Python functionality of Jupyter notebooks but also integrates Scala, R, Java, and even Markdown, so you can build a storyboard for a data process. One common pattern in large data processing projects is notebooks calling each other: a main notebook calls sub-notebooks to retrieve classes, functions, or properties, and sub-notebooks call a parameter notebook to retrieve parameter values. Notebooks can be modularized and imported by other notebooks, and this post covers the methods for notebook modularization and referencing.
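As a minimal sketch of the two standard mechanisms (the notebook paths and parameter names are illustrative): `%run` inlines another notebook's definitions into the caller's scope, while `dbutils.notebook.run` executes a sub-notebook as a separate job and returns a string.

```python
# In the main notebook, %run inlines another notebook so that its classes,
# functions, and variables land in the caller's scope. The magic command
# must sit alone in its own cell:
#
#   %run ./utils_notebook
#
# Alternatively, execute a sub-notebook as a separate job, passing parameters
# in and receiving a string result back (path and names are illustrative):
result = dbutils.notebook.run("./parameter_notebook", 60, {"env": "dev"})
print(result)

# Inside ./parameter_notebook, the parameter arrives as a widget and the
# return value is handed back with notebook.exit:
#
#   env = dbutils.widgets.get("env")
#   dbutils.notebook.exit("connection_string_for_" + env)
```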

Read more
Azure Databricks DBFS and Ways to Interact with Local Images and Azure ADLS

In today’s technology industry, Databricks has undoubtedly become a unicorn company in big data distributed processing, data analysis, data visualization, and machine learning. In the current era of cloud computing, the three major cloud service providers, Microsoft, Amazon, and Google, have all incorporated Databricks into their cloud computing platforms. This shows Databricks’ unique contribution to data cloud computing and its pivotal role in the development of enterprise-level data products.

With the decreasing cost of cloud storage and the improvement in network speeds, more and more enterprises are choosing to store all their data in a central repository rather than separately storing different types of data. This trend towards centralization helps companies better understand their business operations through real-time business intelligence and predictive analytics. At the same time, the explosive growth of data has made it impractical for companies to maintain multiple large data stores, leading to the merging of data lakes and data warehouses into a single platform. Based on the Lakehouse technology architecture, Databricks provides platform-level services that integrate data storage, processing, visualization, and machine learning into a unified environment. As a result, more and more enterprises are choosing Databricks as their primary cloud data service platform, and developers also prefer Databricks Notebook as a unified development and presentation environment.
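As a small sketch of what the post covers (the storage account, container, secret scope, and paths below are placeholders), `dbutils.fs` moves a locally staged image from the driver's file system into DBFS, and Spark reads ADLS Gen2 directly through an `abfss://` URI once credentials are configured.

```python
# Copy a file from the driver's local disk (file:/) into DBFS (dbfs:/).
# The local path assumes the image was uploaded to the driver node first.
dbutils.fs.cp("file:/tmp/logo.png", "dbfs:/FileStore/images/logo.png")

# List it to confirm; files under /FileStore are also browsable via the UI.
display(dbutils.fs.ls("dbfs:/FileStore/images/"))

# Configure access to an ADLS Gen2 account with an account key pulled from
# a secret scope (a service principal or SAS token would be the more common
# production choice; all names here are placeholders).
spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key"),
)

# Read data straight from ADLS through the abfss:// URI.
df = spark.read.parquet("abfss://mycontainer@mystorageacct.dfs.core.windows.net/data/")
```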

Read more