Efficient Ways of Large Data Processing and Scanning in Azure Databricks

Working with data and database, writing query is daily routine, query running time might be big different for different queries but serve the same purpose, which shows query performance is not 100 percent determined by database system configuration, server hard ware, table index and statistics etc. but how well do you use SQL to construct your query.

To satisfy user strong demand, we built sandbox database on production for business partners to server their needs of BI reporting and business analytical, in the meanwhile, we are also in charge of database maintenance and monitoring so that we got chance to collect all kinds of user queries. Some of them even crash the system or drag the database performance down apparently, here I want to share my thinking of query optimization on large amount of data.

Data Manipulation and ETL with Pandas

Data Manipulation and ETL with Pandas

Pandas is the most used library in Python to manipulate data and deal with data transformation, by leveraging Pandas in memory data frame and abundant build-in functions in terms of data frame, it almost can handle all kinds of ETL task. We are going to talk about a data process to read input data from Excel Spreadsheet, make some data transformations by business requirement then load reporting data into SQL Server database.

Read more
Automation Process for Email Attachment Excel in Python

Automation Process for Email Attachment Excel in Python

Working with business data, Excel spreadsheet is the most common file type you might deal with in daily basis, because Excel is a dominated application in the business world. What is the most used way to transfer those Excel files for business operation team, obviously, it’s Outlook, because email attachment is the easiest way for business team to share data and reports. Following this business common logic and convention, you may get quite a lot Excel files from email attachment when you involved into a business initiative or project to design a data solution for BI reporting and business analysis.

The pain points is too much manual work dragging down efficiency of data availability and also increasing the possibility of human error. Imagine, every day get data from email attachment, you need to check your inbox every once for a while, then download those files from attachment, open Excel to edit data or rename file to meet data process requirement such as remove the protected password, after those preparation works all done, push data to NAS drive, finally, launch the job to proceed the data. It’s not surprise that how easily you might make mistake because any single step contains error would cause the whole process failed.

It’s very necessary to automate the whole data process if business project turns to BAU (Business As Usual) program and you have to proceed data in regular ongoing basis. Python and Windows Task Scheduler provides a simple and fast way to solve this problem, now let’s take a look.

Overall speaking, this task can be broken down by a couple of steps:

  1. Access Outlook to read email and download the attachment
  2. Remove Excel file protected password (it’s common in business file to protect data privacy)
  3. Manipulate and edit Excel file
  4. Copy file to NAS drive
  5. Setup the Windows Task Scheduler to run python script automatically.

emailexcelmindset.png

Read more