Data Manipulation and ETL with Pandas
Pandas is the most used library in Python to manipulate data and deal with data transformation, by leveraging Pandas
in memory data frame and abundant build-in functions in terms of data frame, it almost can handle all kinds of ETL task. We are going to talk about a data process to read input data from Excel Spreadsheet
, make some data transformations by business requirement then load reporting data into SQL Server
database.
Pandas functions brief introduction
First, we use read_excel()
function to read in input Excel file, then use slicing function iloc()
or loc()
to get the data which need to be proceeded. The difference between iloc()
and loc()
is iloc()
use row and column index number to slice data but loc()
is use column name instead of index number.
If we need to rename the column, we use rename(column={},inplace=True/False)
function to do that. We also can use drop()
function to drop columns. If we want to re-order the sequence of columns, the reindex()
will help us to do that. For the purpose of merging or joining multiple tables, we can use merge()
function. we can use to_sql()
function to write data into database table after transformations. If we want to create UDF (user defined function) to apply to Pandas
data frame then we can use apply(UDF name)
function.
For SQL windows function, Pandas
also has equivalent way on those kind of function such as Top N rows per group which is equivalent to row_number()over() sql window function. Use assign(().groupby().cumcount()).query().sort_values()
function
Pandas use cases to manipulate Excel data to load into SQL Server database
1 | import numpy as np |
1 | import numpy as np |