Increase Disk Space for Linux Virtual Machine Created by VMware

A virtual machine is a common way to set up a Linux development environment on a Windows PC. One of the great things about recent Linux distros such as RHEL, Ubuntu and CentOS is that most of them adopt LVM (Logical Volume Manager) for their filesystems, which is a natural fit for virtualization platforms like VMware or VirtualBox.

The main characteristic of LVM is that it can adjust filesystem volumes dynamically: multiple partitions are integrated so that they appear as a single partition on one disk, and adding or removing partitions to grow or shrink disk space becomes much easier than before. This applies to virtualized hard drives as well, which makes it possible to grow the disk space with just a few steps of setup. You might ask what the point of this is. If a virtual machine is your main development environment, its disk space runs out quickly as more and more tools and libraries are installed. My VM's hard disk started at the default 20GB, but after I set up all my environment, my root directory had only 300MB of space left.

I will grow the disk space of my Linux virtual machine to 40GB:

Before we get our hands dirty, it is worth understanding three key LVM concepts: PV (physical volume), VG (volume group) and LV (logical volume). LVM combines multiple partitions or disks into one big independent pool (a VG), which is then carved into logical volumes (LVs) that can be formatted and mounted as filesystems.
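To make the PV/VG/LV relationship concrete, here is a minimal sketch of the typical grow sequence, driven from Python via subprocess. The device name (/dev/sda3), the volume group and logical volume names (centos, root) and the XFS filesystem are assumptions; check your own layout with pvs, vgs and lvs first.

```python
# Minimal sketch of the LVM grow sequence after the virtual disk has been
# enlarged in VMware and a new partition (here /dev/sda3) has been created on
# the free space. All device/volume names are placeholders.
import subprocess

def run(cmd):
    """Run a shell command, echo it, and stop on the first failure."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("pvcreate /dev/sda3")                      # turn the new partition into a PV
run("vgextend centos /dev/sda3")               # add the PV to the existing VG
run("lvextend -l +100%FREE /dev/centos/root")  # give all free space to the root LV
run("xfs_growfs /")                            # grow the filesystem (resize2fs for ext4)
```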

Read more
SQL Server Table Partitioning in Large Scale Data Warehouse 3

This post is the last part of the series on table partitioning. As planned, it focuses on advanced topics such as partition merge, split and conversion, and on performance optimization for different business demands.

The first topic might be interesting. We know one benefit of table partitioning is faster data loading and archiving; the performance improvement when querying large data sets after partitioning is easy to feel, but the gain for data archiving is less apparent, because it happens at the lower file-system level and helps you manage data on the backend through filegroups and database files. After all, the intent is to manage data through partitions and, additionally, to manage partitions through database files, but that is not as straightforward.
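As a preview of what merge and split mean in practice, here is a brief sketch run from Python with pyodbc; the partition function, scheme, filegroup and boundary values (pf_TxnByMonth, ps_TxnByMonth, FG_2024_01, the dates) are hypothetical placeholders.

```python
# Preview sketch of partition split/merge, executed through pyodbc.
# All object names and boundary values are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=DataWarehouse;Trusted_Connection=yes;",
    autocommit=True)
cur = conn.cursor()

# Split: tell the scheme which filegroup the next partition should use, then
# add a new boundary so incoming data gets its own partition.
cur.execute("ALTER PARTITION SCHEME ps_TxnByMonth NEXT USED FG_2024_01;")
cur.execute("ALTER PARTITION FUNCTION pf_TxnByMonth() SPLIT RANGE ('2024-01-01');")

# Merge: remove an old boundary so the oldest (already archived) partition is
# folded into its neighbour.
cur.execute("ALTER PARTITION FUNCTION pf_TxnByMonth() MERGE RANGE ('2022-01-01');")
conn.close()
```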

Read more
SQL Server Table Partitioning in Large Scale Data Warehouse 2

This post is part 2. It focuses on designing and building a data process to extract, transform and load (ETL) a large amount of data into partitioned tables, so that it can be integrated into an SSIS package and run automatically as a scheduled job.

In this case, suppose transaction data with 100 columns and 100 million records needs to be loaded into the data warehouse from a staging table. Technically, we could do partition switching from the staging table (non-partitioned) to the data warehouse table (partitioned), but in practice we cannot simply do that, because:

  1. The source table and the target table are usually in different databases, so the data cannot be switched directly; doing so would violate the same-filegroup requirement for partition switching described in part 1.
  2. The staging table would be empty after partition switching, yet most of the data transformations are applied to the staging table before the final results are loaded into the target table in the data warehouse. So from a business point of view, we cannot switch partitions directly from staging to target either, because of the big impact on the entire daily, weekly or monthly data process.

There might be another question: why don't we just bulk load the data from source to target, and what benefit does partition switching give us? Technically, yes, we can do a bulk insert. However, for such a large volume of data the lead time would be hours (2-3 hours); if the process ran during business hours it would cause a big impact that is hard for business users to tolerate for critical data such as transactions. So from an efficiency and performance perspective, we have to leverage partition switching to deliver transaction data in a short period of time without interrupting BI reports, dashboard refreshes and business analysis.

What is the main idea for production operation?

To guarantee that transaction data is always available, we create a middle table to hold the transaction data and act as the source table for partition switching; we call this middle table the switch table. To satisfy all the requirements for partition switching, the switch table has to be created on the same filegroup as the target transaction table, with identical columns and indexes, and using the same partition column. We bulk insert from the staging table into the switch table, then switch the partition into the target transaction table in the data warehouse; as mentioned in part 1, this step finishes in a flash as long as there is no blocking. Finally, we drop the switch table and the entire data process is complete. A sketch of these steps is shown below; after that, let's dig a little deeper into the details of each step.
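Here is a simplified sketch of those steps, executed from Python with pyodbc. All object names (StagingDB.dbo.StagingTxn, dbo.SwitchTxn, dbo.FactTransaction) and the partition number are hypothetical, and the switch table is assumed to be partitioned on the same scheme as the target.

```python
# Simplified sketch of the switch-table workflow: bulk insert into the switch
# table, switch the partition into the target fact table, then drop the switch
# table. Object names and the partition number are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=DataWarehouse;Trusted_Connection=yes;",
    autocommit=True)
cur = conn.cursor()

# 1. Bulk insert the transformed staging data (in another database) into the
#    switch table, which shares the target's filegroup, columns, indexes and
#    partition column.
cur.execute("""
    INSERT INTO dbo.SwitchTxn WITH (TABLOCK)
    SELECT * FROM StagingDB.dbo.StagingTxn;
""")

# 2. Metadata-only partition switch into the warehouse fact table; this
#    completes almost instantly as long as nothing is blocking.
cur.execute("""
    ALTER TABLE dbo.SwitchTxn
    SWITCH PARTITION 5 TO dbo.FactTransaction PARTITION 5;
""")

# 3. Clean up the switch table to finish the data process.
cur.execute("DROP TABLE dbo.SwitchTxn;")
conn.close()
```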

Read more
SQL Server Table Partitioning in Large Scale Data Warehouse 1

This is a series of articles demonstrating table partitioning and how to apply it in a large-scale data warehouse to improve database performance, and even to benefit maintenance operations, based on Microsoft SQL Server. The series is composed of three parts:

  1. The first part is a fundamental introduction to the basic knowledge
  2. The second part showcases the entire production workflow for applying table partitioning to large tables
  3. Finally, the third part covers advanced topics such as partition merge, split, conversion and performance optimization for different business demands

Table Partitioning: The Basics

For this part, I will reference Cathrine Wilhelmsen's work. She is a Microsoft Data Platform MVP, BimlHero Certified Expert, international speaker, author, blogger, and chronic volunteer. She loves data and coding, as well as teaching and sharing knowledge - oh, and sci-fi, chocolate, coffee, and cats :)

I learnt a lot and got many ideas from her blog, especially about table partitioning. Her blog address is below; I hope it can help you as well :blush:

Cathrine's blog

Read more
Hadoop Data Side Load from SQL Server

Agile development and DevOps bring flexibility and quick solutions to support business intelligence in a timely manner. A technical platform and its associated applications and tools can be turned around very quickly, so that business analysts and data scientists can leverage them for data modeling or machine learning. On the other side, unlike building functionality, syncing data across different platforms is not that easy or quick, especially for large organizations.

Background and the gap in data modeling for early Hadoop adopters

In recent years, big data and Hadoop have become the trend for the next generation of data technology. Many companies adopt Hadoop as their major data platform, but most of their data still lives in an RDBMS data warehouse. The business intention is to leverage the high-quality data in the SQL database to build analytical work in Hadoop, so data consistency is the first consideration from a data perspective. It is not an easy task, though, because the data has to migrate to a different platform on a different operating system (from Windows to Linux). Technically, the best solution for the project would be to build a direct connection from SQL Server and SSIS to Hive using Apache Sqoop, or to use the JVM to build a JDBC connection in Java. For a large organization, however, applying a new tool in production needs quite a lot of approval work, and developing a JDBC connection facility also needs multiple levels of testing; both take a long time.

Therefore the solution goes back to the foundation of Hadoop: the file system. Since SSIS cannot write to Hive directly using ODBC (before the 2015 version), the alternative is to create a file in the appropriate format, copy it directly to the Hadoop file system, and then use a Hive command to write the metadata to the Hive metastore; the data will then show up in the Hive table and also be available in Cloudera Impala.
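A minimal sketch of that file-based side load is shown below, driven from Python via subprocess. The local extract path, the HDFS landing directory and the dw.transactions Hive table are hypothetical, and the flat file is assumed to match the table's delimiter and column order.

```python
# Sketch of the file-based side load: push the SSIS-produced flat file to HDFS,
# register it in the Hive metastore, and refresh Impala's metadata.
# All paths and table names are placeholders.
import subprocess

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# 1. Copy the extract onto the Hadoop file system.
run("hdfs dfs -put /data/export/transactions_20240101.csv /landing/transactions/")

# 2. Point Hive at the file so the data shows up in the target table.
load_stmt = ("LOAD DATA INPATH '/landing/transactions/transactions_20240101.csv' "
             "INTO TABLE dw.transactions")
run(f'hive -e "{load_stmt}"')

# 3. Let Cloudera Impala pick up the new metadata as well.
run('impala-shell -q "INVALIDATE METADATA dw.transactions"')
```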

Read more
Data Manipulation and ETL with Pandas

Pandas is the most used Python library for manipulating and transforming data; by leveraging its in-memory data frame and the abundance of built-in data frame functions, it can handle almost any kind of ETL task. We are going to walk through a data process that reads input data from an Excel spreadsheet, applies some transformations according to business requirements, and then loads the reporting data into a SQL Server database.
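As a taste of that process, here is a minimal sketch: read a spreadsheet with pandas, apply a couple of illustrative transformations, and load the result into SQL Server. The file name, sheet name, column names and connection string are all hypothetical.

```python
# Minimal ETL sketch with pandas: Excel in, SQL Server out.
import pandas as pd
from sqlalchemy import create_engine

# Extract: read the input spreadsheet into an in-memory data frame.
df = pd.read_excel("sales_input.xlsx", sheet_name="Raw")

# Transform: the kind of clean-up a reporting requirement might ask for.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.dropna(subset=["order_id"])
df["order_date"] = pd.to_datetime(df["order_date"])
df["revenue"] = df["quantity"] * df["unit_price"]

# Load: append the reporting rows to a SQL Server table.
engine = create_engine(
    "mssql+pyodbc://myserver/ReportingDB"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes")
df.to_sql("sales_report", engine, if_exists="append", index=False)
```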

Read more
Blocking Process Monitoring and Auto Email Notification in SQL Server

From a functional perspective, maintaining a large-scale business data warehouse means keeping the database system stable, robust and fast, which is essential for boosting the business team's productivity and performance. For business users, however, the fundamental thing is data, so delivering data frequently and on time is the cornerstone of all business analysis and BI reporting. The primary mandate for the data management team is therefore to closely monitor ETL jobs, ensuring data processes run smoothly and are never blocked or corrupted by user processes.

In this article, I am going to talk about how to build an ETL job monitoring system that watches user queries automatically at a designed frequency. To accomplish this task, we need the following (a sketch of the controlling script is shown after the list):

  1. Create a view to collect user queries in real time
  2. Create a stored procedure to detect blocking in different situations
  3. Create another stored procedure to handle sending the notification email
  4. Create a console script to control the overall workflow
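Here is a minimal sketch of the controlling script (step 4). For brevity it queries sys.dm_exec_requests directly rather than the view from step 1, and it assumes the notification stored procedure from step 3 exists under a hypothetical name (dbo.usp_SendBlockingAlert).

```python
# Poll for blocking via sys.dm_exec_requests.blocking_session_id and hand any
# findings to a (hypothetical) notification stored procedure.
import time
import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
            "DATABASE=master;Trusted_Connection=yes;")

BLOCKING_QUERY = """
SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, t.text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
"""

def check_once():
    conn = pyodbc.connect(CONN_STR)
    try:
        cur = conn.cursor()
        rows = cur.execute(BLOCKING_QUERY).fetchall()
        if rows:
            detail = "\n".join(str(r) for r in rows)
            # Hypothetical wrapper around msdb.dbo.sp_send_dbmail (step 3).
            cur.execute("EXEC dbo.usp_SendBlockingAlert ?", detail)
            conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    while True:          # run at the designed frequency, e.g. every 5 minutes
        check_once()
        time.sleep(300)
```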
Read more
Automation Process for Email Attachment Excel in Python

When working with business data, Excel spreadsheets are the most common file type you deal with on a daily basis, because Excel is the dominant application in the business world. What is the most common way to transfer those Excel files for a business operations team? Obviously, it's Outlook, because an email attachment is the easiest way for a business team to share data and reports. Following this common business logic and convention, you may receive quite a lot of Excel files as email attachments when you get involved in a business initiative or project to design a data solution for BI reporting and business analysis.

The pain point is that too much manual work drags down the efficiency of data availability and increases the possibility of human error. Imagine getting data from email attachments every day: you need to check your inbox every once in a while, download the files from the attachments, open Excel to edit the data or rename the file to meet the data process requirements (such as removing the protection password), then, after all the preparation work is done, push the data to the NAS drive and finally launch the job to process the data. It's no surprise how easily you might make a mistake, because an error in any single step would cause the whole process to fail.

It is very necessary to automate the whole data process once the business project turns into a BAU (Business As Usual) program and you have to process data on a regular, ongoing basis. Python and the Windows Task Scheduler provide a simple and fast way to solve this problem, so let's take a look.

Overall, this task can be broken down into a few steps (a sketch of steps 1 and 4 follows the list):

  1. Access Outlook to read emails and download the attachments
  2. Remove the Excel file's protection password (it's common for business files to be protected for data privacy)
  3. Manipulate and edit the Excel file
  4. Copy the file to the NAS drive
  5. Set up the Windows Task Scheduler to run the Python script automatically
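Below is a minimal sketch of steps 1 and 4 using the pywin32 Outlook COM API: it saves Excel attachments from the default inbox and copies them to a NAS share. The folder paths, subject filter and share name are hypothetical, and steps 2-3 (removing the protection password and editing the workbook) would slot in where indicated.

```python
# Pull Excel attachments from the Outlook inbox and stage them on a NAS share.
# Paths, share name and the subject filter are placeholders.
import os
import shutil
import win32com.client

DOWNLOAD_DIR = r"C:\data\attachments"
NAS_DIR = r"\\nas-server\bi_share\incoming"

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)          # 6 = olFolderInbox

for message in inbox.Items:
    if "Daily Sales" not in str(message.Subject):   # hypothetical filter
        continue
    for attachment in message.Attachments:
        name = str(attachment.FileName)
        if not name.lower().endswith((".xlsx", ".xls")):
            continue
        local_path = os.path.join(DOWNLOAD_DIR, name)
        attachment.SaveAsFile(local_path)
        # ... remove the protection password and edit the workbook here ...
        shutil.copy(local_path, os.path.join(NAS_DIR, name))
```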

(Figure: emailexcelmindset.png)

Read more
Linux Network Management Tool nmcli

There are over 60 Linux networking commands you can use to perform all of the system network configuration. Some of them are well known and widely used, such as ifconfig, ip addr, traceroute, netstat and ping. One command is very useful but relatively rarely used: nmcli. As its name suggests, it is the NetworkManager command-line tool for controlling your network, and it can do everything needed to configure the network, such as displaying network device status and creating, editing, activating/deactivating and deleting network connections.

Syntax and options

An nmcli invocation has two leading arguments, options and an object, followed by a command (or help):

```
nmcli [options] object {command | help}
```
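For a feel of how the object/command pairs look in practice, here is a small sketch that drives nmcli from Python via subprocess; the connection name eth0-static, the interface and the addresses are placeholders.

```python
# A few common nmcli invocations, run from Python so the output can be captured.
import subprocess

def nmcli(*args):
    """Run nmcli with the given arguments and return its output."""
    result = subprocess.run(["nmcli", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(nmcli("device", "status"))       # show all network devices
print(nmcli("connection", "show"))     # list configured connections

# Create, modify and activate a connection (placeholder names/addresses).
nmcli("connection", "add", "type", "ethernet", "con-name", "eth0-static",
      "ifname", "eth0", "ip4", "192.168.1.50/24", "gw4", "192.168.1.1")
nmcli("connection", "modify", "eth0-static", "ipv4.dns", "8.8.8.8")
nmcli("connection", "up", "eth0-static")
```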
Read more
Oracle client side configuration for 12C

The client side needs some configuration after Oracle 11g is upgraded to 12c on the server, in order to make the database server connectable. Before starting to configure your clients, you have to get the following new server information from the DBA:

  1. Host name
  2. Port number
  3. Service name
  4. Your user name (usually it won't be changed and will be replicated from the old version)
  5. Password (an initial password for the test connection, which you then need to update)

Then you need to build a new connection string and add it to the ORA file (*.ora).
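Before (or after) editing the ORA file, you can verify the new values with a quick connectivity test from Python using the cx_Oracle driver, as in this minimal sketch; the host, port, service name, user and initial password are placeholders for whatever your DBA provides.

```python
# Quick connectivity test against the new 12c host/port/service name.
import cx_Oracle

# Build a DSN directly from the DBA-provided values (bypasses tnsnames.ora).
dsn = cx_Oracle.makedsn("dbhost.example.com", 1521,
                        service_name="orcl12c.example.com")
conn = cx_Oracle.connect(user="myuser", password="initial_password", dsn=dsn)

cur = conn.cursor()
cur.execute("SELECT 1 FROM dual")   # trivial query to prove the connection works
print(cur.fetchone())
conn.close()
```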

(Figure: oracle12cinfo.png)

Read more