Data Manipulation and ETL with Pandas

Pandas is the most widely used Python library for manipulating and transforming data. By leveraging its in-memory DataFrame and the abundant built-in DataFrame functions, it can handle almost any kind of ETL task. In this post we walk through a data process that reads input data from an Excel spreadsheet, applies transformations according to business requirements, and then loads the reporting data into a SQL Server database.
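
A minimal sketch of that extract-transform-load flow (the file name, sheet name, column names, target table, and connection string below are all placeholder assumptions, and the transformation is just an illustrative aggregation):

    import pandas as pd
    from sqlalchemy import create_engine

    # Extract: read the raw input from an Excel spreadsheet.
    df = pd.read_excel("input.xlsx", sheet_name="Sheet1")

    # Transform: an illustrative business rule -- keep active accounts
    # and aggregate the balance per customer.
    report = (
        df[df["status"] == "active"]
        .groupby("customer_id", as_index=False)["balance"]
        .sum()
    )

    # Load: write the reporting data into SQL Server via SQLAlchemy
    # (the pyodbc connection string must match your environment).
    engine = create_engine(
        "mssql+pyodbc://user:password@server/database"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )
    report.to_sql("customer_balance_report", engine,
                  if_exists="replace", index=False)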

You may receive quite a lot of Excel files as email attachments when you are involved in a business initiative or project to design a data solution for BI reporting and business analysis.

The pain point is that too much manual work drags down the efficiency of data availability and increases the possibility of human error. Imagine getting data from email attachments every day: you need to check your inbox every once in a while, download the files from the attachments, open Excel to edit the data or rename files to meet the data process requirements (such as removing the protection password), and once all that preparation work is done, push the data to a NAS drive and finally launch the job to process it. It is no surprise how easily you might make a mistake, because an error in any single step would cause the whole process to fail.

It becomes necessary to automate the whole data process once a business project turns into a BAU (Business As Usual) program and you have to process data on a regular, ongoing basis. Python and the Windows Task Scheduler provide a simple and fast way to solve this problem, so let's take a look.

Overall, this task can be broken down into a few steps (a Python sketch of steps 1, 2, and 4 follows the list):

  1. Access Outlook to read emails and download the attachments
  2. Remove the Excel file's protection password (password-protecting business files to preserve data privacy is common)
  3. Manipulate and edit the Excel file
  4. Copy the file to the NAS drive
  5. Set up the Windows Task Scheduler to run the Python script automatically.
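
A rough sketch of steps 1, 2, and 4, assuming Outlook is installed locally and the pywin32 and msoffcrypto-tool packages are available (all paths, file names, and the password are placeholder assumptions):

    import os
    import shutil
    import msoffcrypto
    import win32com.client

    SAVE_DIR = r"C:\data\incoming"                 # placeholder staging folder
    DECRYPTED = r"C:\data\decrypted\report.xlsx"   # placeholder output path

    # Step 1: talk to the locally installed Outlook client over COM and
    # save Excel attachments from the Inbox (6 = olFolderInbox).
    outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    inbox = outlook.GetDefaultFolder(6)
    for message in inbox.Items:
        for attachment in message.Attachments:
            if attachment.FileName.lower().endswith(".xlsx"):
                attachment.SaveAsFile(os.path.join(SAVE_DIR, attachment.FileName))

    # Step 2: strip the protection password with msoffcrypto-tool
    # ("secret" and "report.xlsx" are placeholders).
    with open(os.path.join(SAVE_DIR, "report.xlsx"), "rb") as locked, \
         open(DECRYPTED, "wb") as unlocked:
        office_file = msoffcrypto.OfficeFile(locked)
        office_file.load_key(password="secret")
        office_file.decrypt(unlocked)

    # Step 4: push the cleaned file to the NAS drive (UNC path is a placeholder).
    shutil.copy(DECRYPTED, r"\\nas-server\reporting\report.xlsx")

Step 3 depends on the business requirement (Pandas works well there), and for step 5 you create a scheduled task whose action runs python.exe with the path to this script, triggered at whatever cadence the business needs.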

A little thought on SQL query performance optimization 1

When you work with data and databases, writing queries is a daily routine. Running time can differ greatly between queries that serve the same purpose, which shows that query performance is not determined solely by database system configuration, server hardware, table indexes, statistics, and so on, but also by how well you use SQL to construct your query.

To satisfy strong user demand, we built a sandbox database on production for business partners to serve their BI reporting and business analytics needs. Since we are also in charge of database maintenance and monitoring, we got the chance to collect all kinds of user queries. Some of them even crashed the system or noticeably dragged down database performance, so here I want to share my thinking on query optimization over large amounts of data.

Before writing a query

Just like a project, a query implements backend business logic even when it is small, so planning before you write it is necessary if you want the query to return results in a flash while consuming minimal system resources.

  • What level of detail do you want to query? Business data is built at certain levels; in a common scenario there are customer-level, account-level, and transaction-level data. It is better to clarify which level, or mix of levels, the target data sits at.
  • What is the time scope? The data volume can be large if you query multiple months or years, so if you expect the result set to be big, it is better to use a loop or a paging technique instead of returning the outcome in one single batch (see the sketch after this list).
  • What do the query tables look like? A table carries a lot of information: not only business attributes (columns) but also data types, keys, indexes, constraints, and triggers. Being familiar with the table structure gives you an extra edge when writing queries against those tables.
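
On the time-scope point, one simple paging approach from Python is to stream the result in chunks rather than pulling everything back at once. A minimal sketch, assuming a SQLAlchemy connection and a hypothetical transactions table (connection string, table, and column names are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        "mssql+pyodbc://user:password@server/database"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

    query = """
        SELECT customer_id, txn_date, amount
        FROM transactions
        WHERE txn_date >= '2019-01-01'  -- constrain the time scope up front
    """

    # chunksize makes read_sql return an iterator of DataFrames, so a
    # multi-year result set never has to fit in memory all at once.
    for chunk in pd.read_sql(query, engine, chunksize=50_000):
        process(chunk)  # process() is a placeholder for per-page handling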