Thursday, March 14, 2013

BIG DATA Getting Started with HADOOP


Hadoop is an open source project from Apache that has evolved rapidly into a major technology movement. It is capable of handling large data sets, both structured and unstructured. It can run on low cost clusters and scale up rapidly.

The Hadoop architecture helps applications run on clusters of nodes holding thousands of terabytes of data. It has a distributed file system called HDFS (Hadoop Distributed File System) that provides fast data transfer rates between the clustered nodes and tolerates node failures. It does not require RAID storage, because it achieves reliability by replicating data across multiple hosts.
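As a rough illustration of that replication behaviour, here is a minimal sketch using the standard HDFS FileSystem Java API (the path /demo/sample.txt and the class name are just illustrative); the dfs.replication property controls how many copies of each block HDFS keeps:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath on a Hadoop node.
        Configuration conf = new Configuration();
        // dfs.replication = number of copies of each block; replication across
        // hosts is what lets HDFS tolerate node failures without RAID. 3 is the usual default.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/sample.txt");   // hypothetical HDFS path
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello hdfs");
        out.close();

        // Each block of the file is now stored on up to 3 different DataNodes.
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    }
}
```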

Hadoop is a collection of components like MapReduce, Hive, Pig, NoSQL databases, ZooKeeper, Ambari, HCatalog, Oozie, Hue and more.

MapReduce is one key component. It is a framework for writing applications that process large amounts of data, structured or unstructured. A MapReduce application can be designed on a single-node Hadoop cluster and then deployed unchanged on a 100-node cluster.
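To give a flavour of what such an application looks like, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API (the class names WordCountMapper and WordCountReducer are my own): the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```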

The MapReduce engine has one JobTracker, to which jobs are submitted. The JobTracker then pushes work out to the TaskTracker nodes in the cluster, and together the JobTracker and TaskTrackers complete the submitted job.
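To see where the JobTracker fits in, here is a minimal driver sketch (again with my own class name WordCountDriver, reusing the mapper and reducer above, and using the Hadoop 1.x-era Job constructor). The call to waitForCompletion() is what submits the job; the JobTracker then hands the map and reduce tasks to TaskTrackers on the cluster nodes and monitors them until the job finishes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job to the JobTracker and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```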

I will share my experience of getting a single-node Hadoop installation running, and then running a sample MapReduce application, in detail.
