Hadoop Cluster / Ecosystem

Hadoop Cluster
Core Hadoop consists of a way to store data, known as the Hadoop Distributed File System (HDFS), and a way to process the data, called MapReduce. Hadoop splits the data up and stores it across a collection of machines, known as a cluster.

Then, when we want to process the data, we process it where it’s actually stored. Rather than retrieving the data from a central server, we leave it on the cluster and process it in place. You can add more machines to the cluster (make the cluster bigger) as the amount of data you’re storing grows. The machines in the cluster don’t need to be particularly high-end, although most clusters are built using rack-mount servers.
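
The split/process idea above can be sketched on a single machine. This is a minimal, plain-Python illustration of the MapReduce model (not Hadoop's actual API): a map phase that emits key/value pairs, a shuffle that groups them by key, and a reduce phase that combines each group. On a real cluster, the map tasks run on the nodes that store each block of data.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (key, value) pairs — here (word, 1) per word.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: combine all the values seen for one key.
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle/sort: group intermediate values by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

print(map_reduce(["hadoop stores data", "hadoop processes data"]))
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Word count is the classic first MapReduce example because the map and reduce steps are independent per line and per key, which is exactly what lets Hadoop spread them across many machines.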

Hadoop Ecosystem – A collection of projects that make it easier to query data on the cluster without knowing how to code MapReduce. Two key ones are Hive and Pig.

Hive – Instead of having to write Mappers and Reducers, you write statements in HiveQL (which looks very much like SQL). The Hive interpreter turns that SQL into MapReduce code, which it then runs on the cluster.
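
To make that translation concrete, here is a sketch in plain Python (not Hive's actual internals) of the work a GROUP BY statement compiles down to. The table and column names are invented for illustration.

```python
# Hypothetical HiveQL statement (table and columns invented):
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;
#
# Conceptually, Hive compiles this into a MapReduce job: the map
# phase emits (dept, 1) for every row, and the reduce phase sums
# the values for each dept.

rows = [("sales", "ann"), ("eng", "bob"), ("sales", "carol")]

counts = {}
for dept, _name in rows:                    # map: emit (dept, 1) per row
    counts[dept] = counts.get(dept, 0) + 1  # reduce: sum per key

print(counts)  # {'sales': 2, 'eng': 1}
```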

Pig – Lets you write code to analyse your data in a fairly simple scripting language (Pig Latin) rather than MapReduce. Again, the code is turned into actual Java MapReduce and run on the cluster.
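
As a sketch of what a Pig step expresses, the comments below show a hypothetical Pig Latin script (file and field names invented), followed by the same filtering step written in plain Python. Each Pig statement transforms a whole dataset, and Pig compiles the pipeline into MapReduce jobs.

```python
# Hypothetical Pig Latin script (names invented for illustration):
#   emps = LOAD 'employees' AS (dept:chararray, salary:int);
#   high = FILTER emps BY salary > 50000;
#   DUMP high;
#
# The FILTER step, expressed directly in Python:

emps = [("sales", 48000), ("eng", 72000), ("eng", 55000)]
high = [row for row in emps if row[1] > 50000]
print(high)  # [('eng', 72000), ('eng', 55000)]
```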

Impala – Lets you query your data using SQL, but accesses the data directly rather than going through MapReduce. Impala is optimized for low-latency queries; in other words, Impala queries run very quickly, typically many times faster than Hive queries, while Hive is optimized for long-running batch processing jobs.

Sqoop – Takes data from a traditional relational database server, such as Microsoft SQL Server, and puts it in HDFS as delimited files so it can be processed along with the other data on the cluster.
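
In a delimited export of this kind, each database row becomes one line of text with fields separated by a delimiter character. The sample data below is invented; the point is simply that files like this are trivial to parse downstream:

```python
import csv
import io

# Two rows as they might appear in a comma-delimited export of a
# relational table (the table contents here are invented).
exported = "1,alice,sales\n2,bob,eng\n"

rows = list(csv.reader(io.StringIO(exported)))
print(rows)  # [['1', 'alice', 'sales'], ['2', 'bob', 'eng']]
```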

Flume – Ingests data as it’s generated by external systems.

HBase – A real-time database built on top of HDFS.

Hue – A graphical front-end to the cluster.

Oozie – A workflow management tool.

Mahout – A machine learning library.

CDH (Cloudera’s distribution of Hadoop) – Takes all the key ecosystem projects, along with Hadoop itself, and packages them together so that installation is a really simple process. You could install everything from scratch yourself, but it’s far easier to use CDH.

