Spark SQL Transfer from Database to Hadoop

March 27, 2017

Hadoop can store both structured and unstructured data; that is the benefit of its schemaless approach. However, a lot of our customers' data resides in relational databases. We need to bring that data into Hadoop first so we can query and transform it inside the Hadoop cluster and take advantage of its parallelism.

For transferring data from a relational database to Hadoop, you would usually use Apache Sqoop. However, Sqoop has some limitations and weaknesses around preserving data types, especially datetime and timestamp columns. That is why I suggest using Spark SQL for this job. Spark can also be used as an ETL tool!

Spark can read a relational database and write it out as Parquet or Avro files, compressed with Snappy to save space. You can find good explanations of why we use Avro and Parquet on the net.

Please refer to the blog posts below for transferring the data via Spark with Avro and Parquet as the data files.

https://weltam.wordpress.com/2017/03/27/spark-sql-transfer-from-sql-server-to-hadoop-as-parquet/

https://weltam.wordpress.com/2017/03/27/extract-rdbms-as-avro/

Cheers


Spark SQL Transfer from SQL Server to Hadoop as Parquet

March 27, 2017

Spark SQL Transfer from SQL Server to Hadoop as Avro

March 27, 2017

Big Data, Apache Hadoop and Cloudera

March 26, 2017

Big data is everywhere and people are talking about it; we need to be prepared to embrace this wave. If you search the internet you will find that Hadoop sits at the centre of big data: Hadoop is the operating system for big data. So let's get started and meet Apache Hadoop in action.

The most convenient way to get introduced to Hadoop is to use the virtual machine provided by Cloudera. Cloudera is one of the biggest vendors that bundles the Hadoop ecosystem into one package. It also provides monitoring and a manager that makes administering your cluster easier, and it is really easy to deploy a whole cluster with this package. Let's continue by downloading the package.

Please download the Cloudera QuickStart VM from this link; it is almost 4 GB. If you would like to register and find the latest installer, please go to this site.


 

Make sure that you have VMware Player installed on your machine. You can find the installer on this site.


Extract the QuickStart files and open the virtual machine from VMware.


You need 8 GB of RAM and 2 virtual CPUs. Please configure this in the virtual machine settings.


 

Run the virtual machine and, after it has finished booting, execute “Launch Cloudera Express” from the desktop. Please be patient until all services have started.

 


 

If you haven't already, open the browser and go to Cloudera Manager from the bookmarked address.


 

Log in with username: cloudera and password: cloudera.


 

Make sure that all services are running. If any service is down, please start it manually.


 

You can also access Hue (Hadoop User Experience) from the browser bookmarks. You can log in with the same username and password you used for Cloudera Manager.


 

If you want to make sure the installation is correct, you can do some health checking by running this test.

Congratulations, you have successfully run a single-node cluster with the Cloudera distribution. 🙂

For a more detailed tutorial, you can download this workshop. Thanks to Gandhi Manalu and Institut Teknologi Del.

If you still have some spirit and energy left, please follow the comprehensive Cloudera tutorial on this site.

 

Cheers


Getting Started with Apache Hadoop Programming

February 25, 2017

The term big data is now everywhere and everyone is talking about it. We need to get ready quickly to embrace this very fast technological change. When we talk about big data we never stray far from Apache Hadoop: Apache Hadoop is the operating system for big data. Let's start this first lesson by getting acquainted with Apache Hadoop.

The quickest way to get to know Hadoop is to use the virtual machine provided by Cloudera. Cloudera is one of the vendors that supports and bundles the Hadoop ecosystem into a single package. This makes deployment much easier than installing the Hadoop components one by one.

Please download the Cloudera QuickStart VM from this link. It is quite large, roughly 4 GB.

Make sure VMware Player is installed as well. If you have not installed VMware yet, please download it from this link.

Extract the QuickStart files and open them with VMware.

To run Cloudera Express we need at least 8 GB of RAM and 2 virtual CPUs. Please set this on the Cloudera QuickStart virtual machine.

Run the virtual machine and, once it has finished booting, execute “Launch Cloudera Express” from the desktop. Please be patient while all the services start.

Open your browser and go to Cloudera Manager.

Log in with username: cloudera, password: cloudera

Make sure all services are running. If any are not, you can start them by choosing Start from the Cloudera host drop-down menu.

If you want to make sure the installation really succeeded, you can test it in the following way.

Point your browser at Hue; you can find it in the web browser's bookmarks. Log in with the same username and password.

Congratulations, you have successfully run your single-node cluster with the Cloudera distribution. 🙂

A more detailed tutorial can be downloaded via the following workshop. Thanks to Gandhi Manalu and Institut Teknologi Del.

If you are still enthusiastic and even more curious, please follow this tutorial.

Cheers


.NET Core Microservices using GeekseatBus

October 22, 2016


GeekseatBus is a simple message bus that can be used to create microservices in .NET.

Here's the background on why we built this ourselves.

Background

A lot of microservice architectures today favour REST APIs for communication between services. At Geekseat we take a different approach to microservices and avoid request/response between services. This is aligned with the SOA tenet of autonomous components: a component can't be autonomous if it still uses request/response, because if one service dies, the services that depend on it die too. This creates temporal coupling between services.

We are big fans of Udi Dahan's style of microservices, as it enables high cohesion and loose coupling in our large system. If you want to learn more, you can register for a free two-day course here.

The central requirement of this approach is a message bus. The message bus is used for fire-and-forget and publish/subscribe communication between services, so it promotes loose coupling between components.

Geekseat had been looking for a simple message bus on top of RabbitMQ. We found NServiceBus and MassTransit, but neither platform could be used on Linux. That is a big problem for us, as our backend is currently written on .NET Core (cross-platform .NET), so we decided to create our own message bus implementation on top of RabbitMQ.

We have published GeekseatBus to NuGet. This is a big first step for us as we start to open-source our microservices infrastructure.

We are great believers in keeping things simple with minimal configuration. As an agile company we like to see our simple stuff work in production. GeekseatBus also relies on convention over configuration, which makes it easier to use.

Getting Started

OK, let's get started. Here's a diagram of what we are going to achieve in this head-first look at GeekseatBus.

[Diagram: Order Service, Order Client and Billing Service]

From the diagram above we can see that we have two services: Order Service and Billing Service. The Order Service side has two components, the Order Client and the Order Service itself (the server).

Geekseat.BillingService subscribes to the OrderPlaced event published by OrderService and does its own thing: billing the customer according to the product ordered.

Fire and Forget Demo

Now let’s open our beloved Visual Studio IDE.

Create a solution and a .NET Core console application for OrderService. This will be the OrderService endpoint. This service will handle the PlaceOrder command and publish the OrderPlaced event.

[Screenshot: creating the Geekseat.OrderService project]

Add a reference to the GeekseatBus NuGet package in the Geekseat.OrderService project.

[Screenshot: adding the GeekseatBus NuGet package]

Create a class library project for the messages. We have two messages: the PlaceOrder command and the OrderPlaced event. There is a convention for naming this project: the project name should have the service name as a prefix. For example, if your service is named Geekseat.OrderService then your message project should be Geekseat.OrderService.Messages.

[Screenshot: the Geekseat.OrderService.Messages project]

You should also create two directories for events and commands, like this. This gives you the namespace for events (Geekseat.OrderService.Messages.Events) and for commands (Geekseat.OrderService.Messages.Commands).

Create the PlaceOrder command in the Commands folder and the OrderPlaced event in the Events folder, and delete Class1.cs.

Make the content of PlaceOrder.cs like this.

[Screenshot: PlaceOrder.cs]

And the content of OrderPlaced.cs like this.

[Screenshot: OrderPlaced.cs]

Please make sure that your namespaces follow the conventions we mentioned above.
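
Since the class contents are only shown in the screenshots above, here is a minimal sketch of what the two message classes could look like. Only the class names and namespaces follow the conventions described above; the ProductName property is just an illustrative assumption.

```csharp
// Commands/PlaceOrder.cs
namespace Geekseat.OrderService.Messages.Commands
{
    // Command sent by the client to the OrderService endpoint.
    // The property here is illustrative; carry whatever data your order needs.
    public class PlaceOrder
    {
        public string ProductName { get; set; }
    }
}

// Events/OrderPlaced.cs
namespace Geekseat.OrderService.Messages.Events
{
    // Event published by OrderService after the order has been handled.
    public class OrderPlaced
    {
        public string ProductName { get; set; }
    }
}
```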

Create another endpoint for Geekseat.OrderClient.

[Screenshot: creating the Geekseat.OrderClient project]

Add a GeekseatBus reference to Geekseat.OrderClient. Also add a reference to Geekseat.OrderService.Messages in both OrderClient and OrderService.

[Screenshot: adding the message project references]

Now we can start creating a message handler for PlaceOrder in OrderService.

[Screenshot: PlaceOrderHandler.cs]
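
In case the screenshot is hard to read, here is a rough sketch of the handler. The IHandleMessage<T> interface name is only an assumption for illustration; GeekseatBus discovers handlers by convention, so check its source for the actual handler contract.

```csharp
using System;
using Geekseat.OrderService.Messages.Commands;

namespace Geekseat.OrderService
{
    // Hypothetical handler shape; the interface name may differ in the real GeekseatBus API.
    public class PlaceOrderHandler : IHandleMessage<PlaceOrder>
    {
        public void Handle(PlaceOrder command)
        {
            // Fire-and-forget: just do the work, no reply goes back to the client.
            Console.WriteLine($"Order placed for {command.ProductName}");
        }
    }
}
```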

In Program.cs (in OrderService) you should start the bus with this code. Really simple startup, right?

[Screenshot: Program.cs in OrderService]
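
As a hypothetical sketch of that startup (the GsBus type and its Start/Stop methods are assumed names; only IGsBus is mentioned later in this post), the endpoint really only has to start the bus, and the queue and exchange bindings come from convention:

```csharp
using System;

namespace Geekseat.OrderService
{
    public class Program
    {
        public static void Main(string[] args)
        {
            // Assumed API: starting the bus creates the service queue by convention.
            var bus = new GsBus();
            bus.Start();

            Console.WriteLine("Geekseat.OrderService started. Press ENTER to exit.");
            Console.ReadLine();

            bus.Stop();
        }
    }
}
```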

You can run OrderService to see which conventions are used when creating the queue and exchange in RabbitMQ. Basically, each service has its own queue (a single queue for handling multiple message types), and this queue can be bound to the events the service is interested in.

A service also creates an exchange for each event it publishes. You can check that OrderService has a queue named Geekseat.OrderService and an exchange named Geekseat.OrderService.Messages.Events.OrderPlaced.

[Screenshot: the Geekseat.OrderService queue in RabbitMQ]

[Screenshot: the OrderPlaced exchange in RabbitMQ]

Now let's send a message to our service. We will concentrate on Geekseat.OrderClient.

[Screenshot: Program.cs in OrderClient]
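
A rough sketch of the client program could look like the following; again, GsBus and Send are assumed names used for illustration, not necessarily the exact GeekseatBus API.

```csharp
using System;
using Geekseat.OrderService.Messages.Commands;

namespace Geekseat.OrderClient
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var bus = new GsBus();
            bus.Start();

            Console.WriteLine("Press ENTER to place an order.");
            Console.ReadLine();

            // Fire and forget: the command is routed to the Geekseat.OrderService queue by naming convention.
            bus.Send(new PlaceOrder { ProductName = "Geekseat T-Shirt" });

            Console.ReadLine();
        }
    }
}
```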

Run both OrderService and OrderClient, and press Enter to send the message from the client.

[Screenshot: OrderService and OrderClient running]

Voilà, it receives the message!

Publish and Subscribe Demo

Now we will publish an OrderPlaced event from OrderService. The publishing will be handled in PlaceOrderHandler: we inject IGsBus into the handler and publish from there.

[Screenshot: PlaceOrderHandler publishing OrderPlaced]
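
A sketch of the updated handler is below. IGsBus is the interface named above and the constructor injection follows from that, but the Publish method signature is an assumption.

```csharp
using Geekseat.OrderService.Messages.Commands;
using Geekseat.OrderService.Messages.Events;

namespace Geekseat.OrderService
{
    public class PlaceOrderHandler : IHandleMessage<PlaceOrder>
    {
        private readonly IGsBus _bus;

        // The bus is injected so the handler can publish events.
        public PlaceOrderHandler(IGsBus bus)
        {
            _bus = bus;
        }

        public void Handle(PlaceOrder command)
        {
            // Publish the event; any subscriber (such as BillingService) will receive it.
            _bus.Publish(new OrderPlaced { ProductName = command.ProductName });
        }
    }
}
```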

OK, now let's create a subscriber for that event. We will leverage exchanges in RabbitMQ, but of course this is transparent to the user.

Create a new console project, Geekseat.BillingService. Add references to the messages project and GeekseatBus, then create the message handler OrderPlacedHandler.

[Screenshot: OrderPlacedHandler.cs in BillingService]
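
The subscriber is just another handler, this time for the OrderPlaced event. A rough sketch, using the same assumed handler interface as before:

```csharp
using System;
using Geekseat.OrderService.Messages.Events;

namespace Geekseat.BillingService
{
    // Declaring a handler for OrderPlaced is what lets GeekseatBus bind the
    // BillingService queue to the OrderPlaced exchange, so this runs for every event.
    public class OrderPlacedHandler : IHandleMessage<OrderPlaced>
    {
        public void Handle(OrderPlaced orderPlaced)
        {
            Console.WriteLine($"Billing the customer for {orderPlaced.ProductName}");
        }
    }
}
```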

Add the service startup for BillingService and we’re done.

billingstartup

Run all the console applications (OrderService, BillingService and OrderClient). Send the message from OrderClient and you can see that the event has been published to BillingService.

[Screenshot: all three consoles running]

It works!

We have open-sourced this library on GitHub. You can download it, experiment, and send a pull request!

Happy Microservicing 🙂

 

Cheers

 

 

 

Porting Apache Avro into .NET Core

September 7, 2016

At Geekseat we have been using Avro extensively. Our current project needs to run on .NET Core, so, as I posted previously, I am working on porting Apache Avro to .NET Core. There are basically two Avro libraries available, one from Apache and one from Microsoft. I want to port both libraries, as they are used a lot in big data applications.

Let’s get back to Apache Avro.

There are two main problems that cause headaches when porting this library: AppDomain is missing, and so are ILGenerator / DynamicMethod.

[Screenshot: compile error, AppDomain is missing]

[Screenshot: compile error, DynamicMethod is missing]

Both problems live in ObjectCreator.cs. This class is responsible for dynamically creating objects on the fly based on the Avro type.

ILGenerator is known from Ayende's research to be the fastest way to create objects, so I thought it could be replaced with an expression tree, although I wasn't sure how. Then I remembered that Zeddy Iskandar had written a forum post on how to create objects with an expression tree that calls a constructor. You can see the method in that forum post. Here's the complete source code of the ctor delegate.
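
The complete source is in the linked post; as a minimal illustration of the technique (not the exact code used in the port), compiling a parameterless constructor call into a reusable delegate with an expression tree looks roughly like this:

```csharp
using System;
using System.Linq.Expressions;

public static class CtorCompiler
{
    // Builds the equivalent of "() => (object)new T()" and compiles it once,
    // so the caller can cache a fast factory per Avro type without ILGenerator.
    public static Func<object> GetDefaultConstructor(Type type)
    {
        NewExpression newExpr = Expression.New(type); // calls the parameterless constructor
        UnaryExpression boxed = Expression.Convert(newExpr, typeof(object));
        return Expression.Lambda<Func<object>>(boxed).Compile();
    }
}
```

The compiled Func<object> can then be cached per type and invoked for each record instance instead of emitting IL with DynamicMethod.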

One down, one more to go.

The AppDomain problem has also been solved by Michael Whelan. Here's the blog post on how to replace AppDomain.

I used the polyfill approach from that blog post. You can see it in action here.
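
For context, the core idea of such a polyfill (this is a simplified sketch, not Michael's exact code) is to recreate the one AppDomain member the port needs, GetAssemblies(), on top of Microsoft.Extensions.DependencyModel:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using Microsoft.Extensions.DependencyModel; // NuGet package of the same name

public static class AppDomainPolyfill
{
    // Stand-in for AppDomain.CurrentDomain.GetAssemblies(): load the assembly names
    // that DependencyContext reports for the running application.
    public static IEnumerable<Assembly> GetAssemblies()
    {
        return DependencyContext.Default
            .GetDefaultAssemblyNames()
            .Select(Assembly.Load);
    }
}
```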

After fixing both problems I got 382 tests passing and 6 failing. I'm quite happy with this result, so I declared victory and pushed it to the repository. Thanks Zeddy and Michael.

[Screenshot: test run, 382 passed, 6 failing]

Cheers

 

 

Categories: .NET, .NET Core