Efficient Ways of Large Data Processing and Scanning in Azure Databricks

When working with data and databases, writing queries is a daily routine. Two queries that serve the same purpose can have very different running times, which shows that query performance is not determined solely by database system configuration, server hardware, table indexes, statistics, and so on, but also by how well you construct your SQL.

To satisfy strong user demand, we built a sandbox database on production for business partners to serve their BI reporting and business analytics needs. Since we are also in charge of database maintenance and monitoring, we have had the chance to collect all kinds of user queries; some of them even crash the system or noticeably drag down database performance. Here I want to share my thoughts on query optimization over large amounts of data.
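
To make the point concrete, here is a minimal sketch (using a hypothetical date-partitioned table `sales.orders`) of two queries that return the same result but behave very differently at scale: wrapping the partition column in a function defeats partition pruning, while a plain range predicate lets the engine skip partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same business question, two ways to ask it.

# Scans every partition: the function call hides the partition column from the planner.
slow = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales.orders
    WHERE date_format(order_date, 'yyyy') = '2023'
    GROUP BY customer_id
""")

# Same result, but the range predicate allows partition pruning on order_date.
fast = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales.orders
    WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
    GROUP BY customer_id
""")

# Compare the physical plans to see the full scan vs. the pruned scan.
slow.explain()
fast.explain()
```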

An Application of Microsoft Purview Authentication Workflow for API and Data Sources

Microsoft Purview is a powerful data catalog and governance solution that helps organizations discover metadata and manage their data assets. On the front end, it provides a web portal for business users to browse metadata and manage data mappings; on the back end, it provides APIs to access data sources for scanning and to update business metadata in the Purview instance from batch data pipelines. One critical aspect of using Purview is therefore API authentication, especially when integrating with other services and applications.

In this article, we'll take a deep dive into how Purview uses OAuth 2.0 authentication for service principals, and into the secure way to access and manage data sources for scanning, providing a comprehensive understanding of the process.
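
As a preview, here is a minimal sketch of that flow, assuming a service principal with a client secret and the standard Purview scope; the tenant ID, client ID/secret, account name, and the api-version value below are placeholders to replace with your own.

```python
import requests

tenant_id = "<tenant-id>"
client_id = "<service-principal-app-id>"
client_secret = "<service-principal-secret>"
purview_account = "<purview-account-name>"

# Step 1: OAuth 2.0 client-credentials flow against Azure AD to get a bearer token
# scoped to the Purview resource.
token_resp = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://purview.azure.net/.default",
    },
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# Step 2: attach the token to a Purview data-plane call, e.g. listing registered
# scan data sources (check the current API reference for the api-version to use).
sources = requests.get(
    f"https://{purview_account}.purview.azure.com/scan/datasources",
    params={"api-version": "2022-02-01-preview"},
    headers={"Authorization": f"Bearer {access_token}"},
)
print(sources.status_code, sources.json())
```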

Read more
Azure Data Factory and Azure Databricks Work Together to Build Robust Data Pipeline for Large Data Process

Azure Data Factory is a cloud-based data integration service that lets you create data-driven workflows in the Azure cloud for orchestrating and automating data movement and data transformation. If you are familiar with Microsoft BIDS (Business Intelligence Development Studio) and used to use SSIS (SQL Server Integration Services) on-prem, you can think of Azure Data Factory (we will call it ADF) as the counterpart of SSIS on the Azure cloud. ADF is also a key component if you're looking to migrate data to the cloud.

ADF is really a data platform that lets users create workflows that ingest data from both on-prem and cloud data stores and transform or process it using integrated compute services such as Synapse and Azure Databricks (we will call it ADB). The results can then be published to an on-prem or cloud data store, e.g. SQL Server or Azure SQL Database, for business intelligence (BI) applications (Tableau or Power BI) to consume.

ADF inherits most of the key components from SSIS, such as Stored Procedure and Script activities. By leveraging the data processing capabilities of stored procedures, you can perform almost any data manipulation and transformation; the Script activity is fully compatible with T-SQL syntax, so you can execute your business logic programmatically rather than with single SQL statements only. Beyond that, the most powerful component is the ADB integration. ADB is now the data engineering industry standard, and we have talked about it many times in previous posts. Now, let's see how we can efficiently process our data using ADF and ADB.
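
As a small preview of the pattern, here is a minimal sketch of the notebook side of the handshake, assuming an ADF Databricks Notebook activity passes `run_date` and `source_path` as base parameters (names and paths here are illustrative):

```python
# Runs inside an Azure Databricks notebook, where `spark` and `dbutils` are predefined.

# Declare widgets so the notebook can receive parameters from the ADF activity
# (or be run interactively with defaults).
dbutils.widgets.text("run_date", "")
dbutils.widgets.text("source_path", "")

run_date = dbutils.widgets.get("run_date")
source_path = dbutils.widgets.get("source_path")

# Read, filter to the requested day, and land the result for downstream consumption.
df = spark.read.format("parquet").load(source_path)
daily = df.filter(df.load_date == run_date)
daily.write.mode("overwrite").saveAsTable("staging.daily_orders")

# Return a value to ADF; it surfaces as the activity's runOutput for later activities.
dbutils.notebook.exit(f"rows={daily.count()}")
```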

Read more
Azure Databricks Notebooks Modularization and Interaction

Azure Databricks is the most common and popular platform and engine for data engineering and machine learning in the Azure cloud. The notebook is its most used tool for data processing and data analysis; it not only inherits the powerful Python functionality of Jupyter Notebook but also integrates Scala, R, SQL, and even Markdown, so you can build a storyboard for a data process. One common pattern in large data processing development is that notebooks need to call each other: a main notebook calls sub-notebooks to retrieve classes, functions, or properties, and sub-notebooks call a parameter notebook to retrieve parameter values. Notebooks can be modularized and imported by other notebooks, and this post covers the methods for notebook modularization and referencing.
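
As a preview of the two most common mechanisms, here is a minimal sketch (paths and parameter names are illustrative) of a main notebook pulling in shared code with `%run` and launching a sub-notebook with `dbutils.notebook.run`:

```python
# 1. %run inlines the target notebook into the current session, so its classes,
#    functions, and variables become available here. It is a magic command and
#    must sit alone in its own cell, so it is shown commented out here:
# %run ./utils/common_functions

# 2. dbutils.notebook.run executes the target as a separate run, passes parameters,
#    and returns whatever the child hands back via dbutils.notebook.exit.
result = dbutils.notebook.run(
    "./jobs/load_customers",        # relative path to the sub-notebook
    3600,                           # timeout in seconds
    {"run_date": "2024-01-31"},     # parameters read by the child via widgets
)
print(result)
```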

Read more
Azure Databricks DBFS and Interactive Ways with Local Image and Azure ADLS

In today’s technology industry, Databricks has undoubtedly become a unicorn company in big data distributed processing, data analysis, data visualization, and machine learning. In the current era of cloud computing, the three major cloud service providers, Microsoft, Amazon, and Google, have all incorporated Databricks into their cloud computing platforms. This shows Databricks’ unique contribution to data cloud computing and its pivotal role in the development of enterprise-level data products.

With the decreasing cost of cloud storage and the improvement in network speeds, more and more enterprises are choosing to store all their data in a central repository rather than separately storing different types of data. This trend towards centralization helps companies better understand their business operations through real-time business intelligence and predictive analytics. At the same time, the explosive growth of data has made it impractical for companies to maintain multiple large data stores, leading to the merging of data lakes and data warehouses into a single platform. Based on the Lakehouse technology architecture, Databricks provides platform-level services that integrate data storage, processing, visualization, and machine learning into a unified environment. As a result, more and more enterprises are choosing Databricks as their primary cloud data service platform, and developers also prefer Databricks Notebook as a unified development and presentation environment.

Read more
Github Pages Hosts A Website with A Custom Domain

In a previous post, "Online Document/Resume Deployment on Github Pages with Docsify", we discussed how to deploy static web pages and online documents to Github Pages to save on web hosting costs. But what if we want to use a custom domain like usernameresume.net or resume.username.net rather than username.github.io or username.github.io/pagename? The answer is that we can, because Github provides the functionality to host a website under a custom domain. All we need are three simple steps:

  1. Buy a custom domain from a vendor
  2. Configure the domain's DNS settings
  3. Configure the custom domain on your Github Pages site

Now, let's dive into the details of those steps so you can host your website on Github with your favorite domain.
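
As a taste of step 2, here is a small, hypothetical sketch that checks whether an apex domain's A records already point at GitHub's documented Pages addresses (185.199.108.153 through 185.199.111.153 at the time of writing); the domain name is just an example.

```python
import socket

# GitHub's published A-record addresses for Pages apex domains.
GITHUB_PAGES_IPS = {
    "185.199.108.153",
    "185.199.109.153",
    "185.199.110.153",
    "185.199.111.153",
}

def dns_points_to_github_pages(domain: str) -> bool:
    """Return True if every A record of the domain is a GitHub Pages address."""
    _, _, addresses = socket.gethostbyname_ex(domain)
    return bool(addresses) and set(addresses).issubset(GITHUB_PAGES_IPS)

print(dns_points_to_github_pages("usernameresume.net"))  # hypothetical custom domain
```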

Read more
A Few Thoughts on DP-203: MS Azure Data Engineering Associate Exam

The evolution of cloud computing over the past few years has become one of the most impactful developments in both the technology and business worlds. More and more companies from almost every industry, such as finance, manufacturing, education, and high tech, have migrated or are deploying their computing platforms to the cloud. The cloud provides the capability to integrate all of a company's resources in one central place, which breaks down the isolation and obstacles across all kinds of business operations, and one of the most important digital assets among those resources is data. In these circumstances, data engineers with cloud computing background and experience are in high demand in the workplace and the job market, and a shortcut to break into this career path is getting a certification in the subject.
Each cloud provider, Amazon, Microsoft, and Google, offers a certification specialized in its data engineering services, so the very first question is which one we should choose to start with. My answer is Microsoft Azure, and here are some reasons why I draw that conclusion:

  • Microsoft largely dominates business software and applications, so it has the largest pool of potential customers who will choose its cloud services in order to migrate their business processes transparently and consistently.
  • Microsoft Azure has been growing constantly in recent years, with a market share growth rate even higher than Amazon AWS's; it is acquiring 1,000 customers a day, and Azure is used by 95% of Fortune 500 companies.
  • Most of the skills are easily transferable to other cloud providers such as AWS and Google Cloud.
Read more
Online Document/Resume Deployment on Github Pages with Docsify

The pandemic has changed the world dramatically: digitalization and virtualization are rising like a tide, becoming unstoppable and deeply impacting the way we live, and the metaverse is emerging and becoming prevalent under these circumstances. We are pushed to think seriously about how to build a personal platform to present ourselves and manage our resources efficiently. One of the most important personal documents is the resume, but for people without a development background, publishing one online is never easy even with so many technologies available, because basic web technologies such as HTML, CSS, JavaScript, Node.js, MySQL, and MongoDB remain the unavoidable cornerstones, and web frameworks like React.js, Angular, and Vue.js are even more overwhelming for a non-technical person. The question, then, is what is the most cost-efficient way to deploy and manage online resources using the least technology? Cost-efficient here covers both money and effort. You might say there are plenty of web services like Wix.com that provide a no-code, one-stop web solution, but they are expensive, so there is no perfect solution, only a least-effort one. After quite a lot of tries, I found a way to quickly deploy your digital resume online for free using technologies with the shortest learning curve:

  • Markdown
  • Basic git commands
  • Free github repository
Read more
Setup SPARK Environment Locally for Big Data Development

Spark has been reported as one of the most valuable tech skills for data professionals to learn, and demand for Spark and Big Data skills has exploded in recent years. As one of the latest technologies in the big data space, Spark is quickly becoming one of the most powerful Big Data tools for data processing and machine learning, with the ability to run programs up to 100x faster than Hadoop MapReduce when running in memory.

Unlike installing a single program, setting up a Spark environment is not that straightforward; it needs a series of installations and configurations with ordering requirements and dependencies, from the operating system down to the software. The main purpose of Spark is dealing with Big Data, which means data too big to fit on a single server, so multiple servers are combined into a cluster. Because it is a cluster environment, Spark is naturally installed on Linux servers; from the operating system perspective, Linux is practically the only option, which means the entire process relies heavily on the CLI instead of a GUI, so basic Linux CLI commands are required before you roll up your sleeves and get your hands dirty.

But if you don't want to set up the environment yourself, there is also a good solution provided by Databricks. Databricks is a company started by the creators of Spark that provides clusters running on top of AWS, and it adds the convenience of a notebook system that is already set up, plus the ability to quickly add files either from storage like Amazon S3 or from your local computer. It has a free community version that supports a 6 GB cluster. Because it is a web-based service, you can easily follow the wizard on the Databricks web portal to finish the setup. We don't show that here because it's out of scope.

Before we begin, there is something worth emphasizing: Spark is written in Scala, and Scala runs on the JVM, so we have to follow the sequence and make sure Java is installed first, followed by Scala. Now let's start.
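
Once Java, Scala, and Spark are installed, a quick way to confirm the stack works end to end is a small local smoke test (this sketch assumes `pyspark` is available to your Python interpreter):

```python
from pyspark.sql import SparkSession

# local[*] runs Spark on all local cores, which is enough to exercise the installation
# without any cluster.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

print(spark.version)  # should print the Spark version you just installed
spark.stop()
```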

Read more
Python Development Bootstrap on Managing Environment and Packages

If you have basic Python experience and knowledge and are willing to develop a real Python application or product, I think this post is right for you. Let's recap how we start using Python: we either go to the official site to download Python, install it system-wide, and jump straight to writing "Hello, world!", or install Anaconda to get a long list of pre-installed libraries and then do the same thing. Using Python globally or system-wide is totally fine at the beginning, but as we accumulate enough Python scripts and work on real Python projects, we may find some really painful issues during development:

  • Version confusion: different Python versions, as well as pip versions, reside together on the same machine
  • Package redundancy: too many libraries live in one place, some of which you used once for practice and never again
  • Package version conflict: this is the most severe one and the biggest headache; imagine
    • your early development was based on Python 2 and you later move to Python 3, but your old applications break after you upgrade a package they depend on, because those libraries are not backwards compatible
    • even if all your development is based on Python 3, you can still get stuck when a library in your new project needs a higher version of a sub-dependency while an earlier project relies on a lower version of the same sub-dependency
    • when you work with a team and pass your work to a teammate to test, your application breaks because the environment and packages are not identical across your machines

Therefore, for real Python production development, the first step should be setting up a proper virtual environment to ensure compatibility and collaboration before jumping into coding. Thanks to community contributors, we have multiple choices to meet different requirements and needs; there are three ways to do it:

  1. Python's early official solution: venv (or virtualenv) and pip
  2. Python's more recently recommended solution: pipenv
  3. The Conda environment and package management tool

Now let's see how we do the configuration.
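
As a quick preview of option 1, here is a minimal sketch that creates an isolated environment with the standard-library `venv` module and installs a pinned package into it (the directory name and package pin are just examples):

```python
import subprocess
import sys
import venv
from pathlib import Path

# Create an isolated environment in ./.venv, equivalent to `python -m venv .venv`.
env_dir = Path(".venv")
venv.create(env_dir, with_pip=True)

# Use the environment's own interpreter so installed packages stay out of the
# global site-packages and versions can be pinned per project.
bin_dir = "Scripts" if sys.platform == "win32" else "bin"
env_python = env_dir / bin_dir / "python"
subprocess.run([str(env_python), "-m", "pip", "install", "requests==2.31.0"], check=True)
```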

Read more