Spark SQL Transfer from Database to Hadoop
Hadoop can store both structured and unstructured data; that is the benefit of its schemaless approach. However, much of our customers' data still resides in relational databases. We first need to bring that data into Hadoop so we can query and transform it inside the Hadoop cluster and take advantage of its parallelism.
To transfer data from a relational database to Hadoop, you would usually reach for Apache Sqoop. However, Sqoop has some limitations and weaknesses in preserving data types, especially around datetime and timestamp columns. That's why I suggest using Spark SQL for this job instead; a minimal sketch follows below. Spark can also be used as an ETL tool!
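As a rough illustration, here is a minimal sketch of pulling a table into Spark through its JDBC data source. The connection URL, table name, and credentials are placeholders, not values from the posts linked below, and you will need the vendor's JDBC driver jar on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdbms-to-hadoop")
  .getOrCreate()

// Read a table from the relational database over JDBC.
// The URL, table, user, and password below are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
  .option("dbtable", "dbo.orders")
  .option("user", "etl_user")
  .option("password", "secret")
  .load()

df.printSchema() // check that datetime/timestamp columns came through intact
```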
Spark can write relational database tables out as Parquet or Avro files, saving space by compressing the data with Snappy. You can find good explanations on the net of why Avro and Parquet are used.
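Continuing the sketch above, writing the DataFrame out in either format is a one-liner. The output paths are placeholders; the Avro write assumes a spark-avro package is available on the classpath (it ships with Spark from 2.4 onward, while earlier versions used the separate Databricks spark-avro package).

```scala
// Write as Parquet; Snappy is Spark's default Parquet compression codec.
df.write
  .option("compression", "snappy")
  .parquet("hdfs:///data/orders_parquet")

// Or write as Avro; requires the spark-avro package on the classpath.
df.write
  .format("avro")
  .save("hdfs:///data/orders_avro")
```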
Please refer to the blog posts below for transferring data via Spark with Avro and Parquet as the data files:
https://weltam.wordpress.com/2017/03/27/spark-sql-transfer-from-sql-server-to-hadoop-as-parquet/
https://weltam.wordpress.com/2017/03/27/extract-rdbms-as-avro/
Cheers