SQL Server Table Partitioning in Large Scale Data Warehouse 3

This post is the last part of the table partitioning series. As planned, this part focuses on advanced topics such as partition merge, split, and conversion, and on performance optimization for different business demands.

The first topic we will talk about might be interesting. We know that one benefit of table partitioning is faster data loading and archiving. The performance improvement on queries against large tables is easy to feel after partitioning, but the benefit for data archiving is less apparent, because it works at the lower file-system level and helps you manage data more efficiently on the backend through filegroups and database files. After all, the intent is to manage data through partitions and, additionally, to manage partitions through database files, but that is not straightforward.
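To make the idea concrete, below is a minimal T-SQL sketch of the kind of operations this part walks through. All object names (pfSalesDate, psSalesDate, dbo.FactSales, dbo.FactSales_Archive, the filegroup FG_2024_07) are hypothetical; treat it as an illustration of the split, switch-out, and merge pattern rather than an exact production script.

```sql
-- Assumes a RANGE RIGHT partition function pfSalesDate on an int yyyymmdd key
-- and a partition scheme psSalesDate (hypothetical names).

-- Add a new boundary for an upcoming month: first tell the scheme which
-- filegroup the new partition should live on, then split the range.
ALTER PARTITION SCHEME psSalesDate NEXT USED [FG_2024_07];
ALTER PARTITION FUNCTION pfSalesDate() SPLIT RANGE (20240701);

-- Archive the oldest partition: switch it out to an empty archive table that
-- sits on the same filegroup with an identical structure (metadata only),
-- then merge the now-empty boundary so the partition count stays manageable.
ALTER TABLE dbo.FactSales
    SWITCH PARTITION $PARTITION.pfSalesDate(20230101) TO dbo.FactSales_Archive;
ALTER PARTITION FUNCTION pfSalesDate() MERGE RANGE (20230101);
```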

SQL Server Table Partitioning in Large Scale Data Warehouse 2

This post is part 2. We focus on designing and creating a data process to extract, transform, and load (ETL) a large amount of data into partitioned tables, so that it can be integrated into an SSIS package and the ETL process can be run automatically by a scheduled job.

In this case, suppose transaction data with 100 columns and 100 million records needs to be loaded into the data warehouse from a staging table. Technically, we could do partition switching from the staging table (a non-partitioned table) to the data warehouse table (a partitioned table), but in business reality we can't simply do it like that, because:

  1. The source table and the target table are usually in different databases, so the data cannot be switched directly; that would violate the same-filegroup condition of partition switching covered in part 1.
  2. The staging table would be empty after partition switching, but most data transformations are applied to the staging table before the final results are loaded into the target table in the data warehouse. So from a business point of view, we can't do partition switching from staging to target either, because it would have a big impact on the entire daily, weekly, or monthly data process.

There might be another question: why don't we just bulk load data from source to target, and what benefit do we get from partition switching? Technically, yes, we can do a bulk insert. However, for such a big volume of data movement the lead time would be hours (2-3 hours); if the process ran during business hours, it would cause a big impact that is hard for business users to tolerate for critical data like transactions. So from an efficiency and performance perspective, we have to leverage partition switching to deliver transaction data in a short period of time without interrupting BI reports, dashboard refreshes, and business analysis.

What is the main idea for production operation?

To guarantee transaction data is always available, we create a middle table to hold the transaction data and act as the source table for partition switching; we call that middle table the switch table. To satisfy all the requirements for partition switching, the switch table has to be created on the same filegroup as the target transaction table, with identical columns and indexes, and using the same partition column. We bulk insert from the staging table into the switch table, then switch partitions into the target transaction table in the data warehouse; as mentioned in part 1, this step finishes in a flash as long as there is no blocking. Finally, we drop the switch table and the entire data process is complete. Now let's dig a little deeper into the details of each step.
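As a rough outline, the whole workflow can be sketched in T-SQL roughly as below. The table, scheme, and column names (dbo.FactTransaction, dbo.FactTransaction_Switch, psTxnDate, pfTxnDate, StagingDB.stg.[Transaction], TxnDateKey) are hypothetical placeholders, and the target partition is assumed to be empty before the switch.

```sql
-- 1. Create the switch table on the SAME partition scheme (hence the same
--    filegroups) with columns and indexes identical to the target table.
CREATE TABLE dbo.FactTransaction_Switch
(
    TransactionId bigint        NOT NULL,
    TxnDateKey    int           NOT NULL,
    Amount        decimal(18,2) NULL
    -- ... remaining columns and indexes identical to dbo.FactTransaction
) ON psTxnDate (TxnDateKey);

-- 2. Bulk insert the transformed data from the staging table.
INSERT INTO dbo.FactTransaction_Switch WITH (TABLOCK)
      (TransactionId, TxnDateKey, Amount)
SELECT TransactionId, TxnDateKey, Amount
FROM   StagingDB.stg.[Transaction];

-- 3. Switch the loaded partition into the target table; this is a
--    metadata-only operation and finishes almost instantly when nothing
--    is blocking and the target partition is empty.
ALTER TABLE dbo.FactTransaction_Switch
    SWITCH PARTITION $PARTITION.pfTxnDate(20240601)
    TO dbo.FactTransaction PARTITION $PARTITION.pfTxnDate(20240601);

-- 4. Clean up once the switch succeeds.
DROP TABLE dbo.FactTransaction_Switch;
```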

SQL Server Table Partitioning in Large Scale Data Warehouse 1

This is a series of articles demonstrating table partitioning and how to apply it in a large-scale data warehouse to improve database performance and even benefit maintenance operations, based on Microsoft SQL Server. The series consists of three parts:

  1. The first part is a fundamental introduction to the basic knowledge
  2. The second part showcases the entire production workflow for applying table partitioning to large tables
  3. The final part covers advanced topics like partition merge, split, and conversion, and performance optimization for different business demands

Table Partitioning: The Basics

For this part, I will reference Cathrine Wilhelmsen's work. She is a Microsoft Data Platform MVP, BimlHero Certified Expert, international speaker, author, blogger, and chronic volunteer. She loves data and coding, as well as teaching and sharing knowledge - oh, and sci-fi, chocolate, coffee, and cats :)

I learned a lot and got many ideas from her blog, especially about table partitioning. Below is her blog address; I hope it can help you as well.

Cathrine's blog
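As a quick preview of the mechanics her material covers, here is a minimal sketch of the three building blocks a partitioned table needs in SQL Server: a partition function, a partition scheme, and a table created on that scheme. The names (pfOrderYear, psOrderYear, dbo.Orders) and the single-filegroup mapping are only illustrative.

```sql
-- 1. Partition function: defines the boundary values (here, one partition per year).
CREATE PARTITION FUNCTION pfOrderYear (int)
    AS RANGE RIGHT FOR VALUES (20220101, 20230101, 20240101);

-- 2. Partition scheme: maps every partition to a filegroup (all to PRIMARY here).
CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear ALL TO ([PRIMARY]);

-- 3. Partitioned table: created ON the scheme, keyed by the partitioning column.
CREATE TABLE dbo.Orders
(
    OrderId      bigint        NOT NULL,
    OrderDateKey int           NOT NULL,
    Amount       decimal(18,2) NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId, OrderDateKey)
) ON psOrderYear (OrderDateKey);
```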

Hadoop Data Side Load from SQL Server

Agile development and DevOps bring flexibility and quick solutions to support business intelligence in a timely manner. A technical platform and its associated applications and tools can be turned around very quickly, so business analysts and data scientists can leverage them for data modeling or machine learning. On the other hand, unlike building up functionality, syncing data across different platforms is not that easy or quick, especially in large organizations.

Background and the gap of data modeling for Hadoop early adopter

In recent years, big data and Hadoop have become the trend for next-generation data technology. Many companies adopt Hadoop as a major data platform, but most of their data still lives in RDBMS data warehouses. The business intention is to leverage the high-quality data in SQL databases to build analytical work in Hadoop, and data consistency is the first consideration from a data perspective. It is not an easy task, though, because the data has to move to a different platform on a different operating system (from Windows to Linux). Technically, the best solution for the project would be a direct connection from SQL Server and SSIS to Hive using Apache Sqoop, or a JDBC connection built on the JVM in Java. However, in a large organization, approving a new tool for production takes a lot of work, and developing a JDBC connection facility also needs multiple levels of testing; both take a long time.

Therefore, the solution goes back to the foundation of Hadoop: the file system. SSIS cannot write to Hive directly using ODBC (before the 2015 version), so the alternative is to create a file in the appropriate format, copy it directly to the Hadoop file system, and then use a Hive command to write the metadata to the Hive metastore. The data will then show up in the Hive table and also be available in Cloudera Impala.
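A minimal sketch of that side-load pattern is below, assuming a tab-delimited export file and hypothetical table and path names (transactions_ext, /data/warehouse/transactions/); the exact DDL depends on the file format you generate from SSIS.

```sql
-- Copy the exported file to HDFS first (shell step, outside Hive), e.g.:
--   hdfs dfs -put /staging/transactions_20240101.txt /data/warehouse/transactions/

-- Register the data in the Hive metastore as an external table over that path.
CREATE EXTERNAL TABLE IF NOT EXISTS transactions_ext (
    transaction_id BIGINT,
    customer_id    BIGINT,
    amount         DECIMAL(18,2),
    txn_date       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/warehouse/transactions/';

-- In impala-shell, refresh the metadata so the data also shows up in Impala:
--   INVALIDATE METADATA transactions_ext;
```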

Blocking Process Monitoring and Auto Email Notification in SQL Server

From a functional perspective, maintaining a large-scale business data warehouse means keeping the database system stable, robust, and fast, which is essential for boosting the business team's productivity and performance. However, for business users the fundamental thing is data, so delivering data frequently and on time is the cornerstone of all business analysis and BI reporting. Obviously, the primary mandate for the data management team is to closely monitor ETL jobs so that data processes run smoothly and are never blocked or corrupted by user processes.

In this article, I am going to talk about how to build an ETL job monitoring system that watches user queries automatically at a designed frequency. To accomplish this task, we need to:

  1. Create a view to collect user queries in real time
  2. Create a stored procedure to detect blocking in different situations (see the sketch after this list)
  3. Create another stored procedure to handle sending the notification email
  4. Create a console script to control the overall workflow
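As a starting point for step 2, a blocking check can be built on the dynamic management views. The sketch below is a simplified version (no email logic, no thresholds) that lists every request currently waiting on another session.

```sql
-- Sessions that are currently blocked, with the blocking session id and the
-- statement being executed; step 3 would feed rows like these into
-- msdb.dbo.sp_send_dbmail for the notification email.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time      AS wait_time_ms,
    s.login_name,
    s.host_name,
    t.text           AS running_sql
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
```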
Sandbox solution for BI reporting and business analytics

To begin, I'd like to share a story from my client and business partner. One day, my team worked on a big marketing project with data from all kinds of sources: spreadsheets, CSV files, the mainframe, and SQL Server. We had to cross-reference all that data to generate analysis reports, but the headache was that the sources were isolated and there was no way to put them into a single environment, run one query, and get the results back. All we could do was open multiple windows and compare them by eyeball.

During the project, we often composed complex queries that ran for a long time before returning their result datasets. Those datasets were quite important for further analysis and for sharing with the whole team, but another painful thing was that we could not save them into the database due to insufficient access. So what we did was copy and paste everything into Excel spreadsheets. After a while, the number of Excel files exploded and it became hard to find a report among all those files. We were fed up with the tedious work and decided to create a set of views in the database, but that was not controlled by us either; it belonged to the infrastructure team. All we could do was submit a request, follow the infrastructure team's schedule, and wait for the month-end deployment, no matter how urgent those reports were.

That is the story. If you have ever experienced something like it, this solution might be right for you.

A little thought on SQL query performance optimization 1

When working with data and databases, writing queries is a daily routine. Running time can differ greatly between queries that serve the same purpose, which shows that query performance is not 100 percent determined by database system configuration, server hardware, table indexes, statistics, and so on, but also by how well you use SQL to construct your query.

To satisfy strong user demand, we built a sandbox database on production for business partners to serve their BI reporting and business analytics needs. In the meantime, we are also in charge of database maintenance and monitoring, so we get the chance to collect all kinds of user queries. Some of them even crash the system or visibly drag database performance down. Here I want to share my thinking on query optimization for large amounts of data.

Before writing a query

Just like a project, even a small query carries backend business logic, so making a plan before writing it is necessary if you want the query to return results quickly while consuming minimal system resources.

  • What level of detail do you want to query? Business data is built at certain levels; in a common scenario there are customer-level, account-level, and transaction-level data. It's better to clarify which level the target data is on, or whether it mixes different levels.
  • What is the time scope? The data volume can be big if you query multiple years or even multiple months, so if you expect the result to be large, it's better to use a loop or a paging technique instead of returning the outcome in one single result set (see the paging sketch after this list).
  • What do the query tables look like? A table contains a lot of information: not only business attributes or columns, but also data types, keys, indexes, constraints, and triggers. Being familiar with the table structure gives you additional benefits when writing queries against those tables.
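For the time-scope point above, here is a minimal paging sketch using OFFSET/FETCH (SQL Server 2012+); the table and column names (dbo.FactTransaction, TransactionId, TxnDate, Amount) are hypothetical.

```sql
DECLARE @PageSize   int = 50000;
DECLARE @PageNumber int = 1;   -- increase in a loop to walk through the result set

SELECT TransactionId, TxnDate, Amount
FROM   dbo.FactTransaction
WHERE  TxnDate >= '2024-01-01'
  AND  TxnDate <  '2024-04-01'
ORDER BY TransactionId
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;
```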