While Apache Hadoop is mainly used for processing unstructured and semi-structured data, such as logs and files, there may also be a need to process structured data that is stored in relational databases.
Apache Sqoop is a tool for transferring data between Hadoop clusters and relational databases. You can use it to import data from a relational database management system (RDBMS) such as SQL Server, MySQL, or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop with MapReduce or Apache Hive, and then export the data back into an RDBMS.
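For example, a typical import pulls a table from SQL Server into HDFS. The command below is a minimal sketch, not a definitive recipe: the server name, database, credentials, table, and target directory are all placeholders that would need to match your environment.

```bash
# Hypothetical example: import a log_data table from SQL Server into HDFS.
# Replace the server, database, credentials, table, and target directory
# with values from your own environment.
sqoop import \
    --connect "jdbc:sqlserver://myserver.example.com:1433;database=mydatabase" \
    --username sqoopuser \
    --password 'MyPassword!' \
    --table log_data \
    --target-dir /user/sqoop/log_data \
    -m 1   # single mapper; increase for parallel imports if the table has a split key
```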
Things to consider
- Both the Hadoop cluster and SQL Server must be on the same virtual network.
- SQL Server must be configured to allow SQL authentication and remote connections; a quick connectivity check is sketched after this list.
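One way to verify both conditions before running any jobs is to list the databases from a cluster node over the same JDBC connection string Sqoop will use. This is a hedged sketch; the host name and credentials are placeholders.

```bash
# Hypothetical connectivity check from a cluster head node.
# If SQL authentication or remote connections are not enabled,
# this fails before any import/export job is attempted.
sqoop list-databases \
    --connect "jdbc:sqlserver://myserver.example.com:1433" \
    --username sqoopuser \
    --password 'MyPassword!'
```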
Limitations
- Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or Azure SQL Database doesn't support bulk inserts.
- Batching - With Linux-based HDInsight, when using the `-batch` switch to perform inserts, Sqoop makes multiple individual inserts instead of batching the insert operations (see the export sketch after this list).
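To make the batching limitation concrete, the sketch below shows a hypothetical export back to SQL Server. The table and directory names are placeholders; per the note above, adding `-batch` on Linux-based HDInsight would not produce truly batched inserts, so it is omitted here.

```bash
# Hypothetical example: export processed results from HDFS back to SQL Server.
# On Linux-based HDInsight, the -batch switch issues multiple single-row
# inserts rather than batched inserts, so don't rely on it for throughput.
sqoop export \
    --connect "jdbc:sqlserver://myserver.example.com:1433;database=mydatabase" \
    --username sqoopuser \
    --password 'MyPassword!' \
    --table log_summary \
    --export-dir /user/sqoop/log_summary \
    -m 1
```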