Need: As we are generating
more and more data everyday we need tools to deal with this huge scale.
The tools can be programming languages, software or infrastructure.
Obviously the infrastructure is the foundation for everything else.
Short history:
Google, one of the internet giants faced large scale computation
problem pretty early. So Google researchers like Sanjay Ghemawat and
Jeff Dean did some work to solve this issue. They published two research
papers on Google File System (2003) and MapReduce (2004). These papers
were basically trying to solve large scale computations in distributed
manner. In 2005 two researchers Doug Cutting (from Yahoo) and Mike Cafarella build something based on GFS and MapReduce called Hadoop.
Hadoop basics: The Hadoop file system is called HDFS, Hadoop Distributed File System. It deals with group of machine called clusters which are part of distributed computing environment. Every single machine is referred as node. This HDFS has minimum amount of data chunk which should be stored called as block size,
the default size is 64 mb. So if we have 640 mb of data then it will be
divided over 10 nodes for storing as default block size is 64 mb. There
are two types of nodes called name node and data node. Name node will
decide details about splitting and storing the data. Name node also
manages the metadata and tree hierarchy associated with it. Data node
will actually have the data chunks physically stored on them. In a
single cluster we will have only one name node and multiple data nodes.
Pitfalls:
Based on discussion above we can say the name node acts as authority or
gateway for accessing the data stored on data nodes. Now if name node
is down then we can not access the data associated with its data nodes.
This is one of the reasons why Hadoop is called single point of failure. This can be avoided by having secondary name node,
which acts as backup if primary name node goes down. Hadoop was also
built for batch processing and not real-time processing. Though few
versions offer the real-time processing capability its not available
with Apache Hadoop as of now.
How it is used:
Now that we have discussed high-level concepts of infrastructure lets
see how it is used. Companies dealing with huge amount of data will
configure the Hadoop clusters with distributed data. For example, the
retail giant Walmart deals with 250 node Hadoop cluster. So far we are
done with the data storing and distribution part. Next thing is to run
jobs (programs) on top of it which will do the actual computation.
MapReduce or similar techniques will be used for this.
No comments:
Post a Comment