Hadoop Data Side Load from SQL Server
Agile development and DevOps bring flexibility and quick solutions to support business intelligence in a timely manner. A technical platform and its associated applications and tools can be turned around very quickly, so business analysts and data scientists are able to leverage them for data modeling or machine learning. On the other side, unlike building up functionality, syncing data across different platforms is not that easy or quick, especially for large organizations.
Background and the data modeling gap for Hadoop early adopters
In recent years, big data and Hadoop have become the trend in next-generation data technology. Many companies have adopted Hadoop as their major data platform, yet most of their data still lives in an RDBMS data warehouse, and the business intention is to leverage that high-quality data in the SQL database to build analytical work in Hadoop. From a data perspective, consistency is the first consideration, but it is not an easy task because the data has to migrate to a different platform running a different operating system (from Windows to Linux). Technically, the best solution for such a project is to build a direct connection from SQL Server and SSIS to Hive by using Apache Sqoop, or to use the JVM to build a JDBC connection in Java. In a large organization, however, putting a new tool into production requires a great deal of approval work, and developing a JDBC connection facility needs multiple levels of testing; both take a long time.
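For reference, a Sqoop import from SQL Server into Hive looks roughly like the sketch below; the server name, database, table, credentials, and mapper count are illustrative assumptions, not values from this project.

    # One-shot import of a SQL Server table straight into a Hive table
    # (all connection details below are placeholders).
    sqoop import \
      --connect "jdbc:sqlserver://sqlprod01:1433;databaseName=SalesDW" \
      --username etl_user \
      --password-file /user/etl/.sqlserver.pwd \
      --table FactSales \
      --hive-import \
      --hive-table salesdw.fact_sales \
      --num-mappers 4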
Therefore the solution goes back to the foundation of Hadoop: the file system. Because SSIS cannot write to Hive directly using ODBC (before the 2015 version), the alternative is to create a file in the appropriate file format, copy it directly to the Hadoop file system, and then use a Hive command to write the metadata to the Hive metastore. The data will then show up in the Hive table and also be available in Cloudera Impala.
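As a minimal sketch of that side load, assuming a pipe-delimited export, hypothetical server and HDFS paths, and an illustrative three-column table definition, the steps can be scripted as follows:

    # 1. Export the table to a delimited text file on the Windows side.
    #    An SSIS Flat File Destination works too; bcp is used here as an
    #    assumption for a scriptable example.
    bcp "SELECT * FROM SalesDW.dbo.FactSales" queryout fact_sales.txt \
        -c -t"|" -S sqlprod01 -T

    # (Copy fact_sales.txt from the Windows host to a Hadoop edge node,
    #  e.g. via scp or a shared drop folder -- an environment assumption.)

    # 2. Copy the file into the Hadoop file system.
    hdfs dfs -mkdir -p /data/salesdw/fact_sales
    hdfs dfs -put fact_sales.txt /data/salesdw/fact_sales/

    # 3. Write the metadata to the Hive metastore so the data shows up
    #    as a Hive table over the files just landed.
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS salesdw.fact_sales (
        sale_id   BIGINT,
        sale_date STRING,
        amount    DOUBLE
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
      STORED AS TEXTFILE
      LOCATION '/data/salesdw/fact_sales';"

    # 4. Refresh Impala's metadata so the table is visible there as well.
    impala-shell -q "INVALIDATE METADATA salesdw.fact_sales;"

Using an external table keeps the HDFS files as the system of record: dropping the table later removes only the metadata, not the loaded data.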