Beginners Guide - Introduction of Big Data & Hadoop






4.85/5 (13 votes)
What is Big Data and how Hadoop been introduced to overcome the problems associated with Big Data?
Big Data
As the name implies, Big Data is the huge amount of data which is complex and difficult to store, maintain or access in regular file system using traditional data processing applications. And what are the sources of this huge set of data.
- A typical large stock exchange
- Mobile phones
- Video sharing portal like YouTube, Vimeo, Dailymotion, etc.
- Social networks like Facebook, Twitter, LinkedIn, etc.
- Network sensors
- Web pages, text and documents
- Web logs
- System logs
- Search index data
- CCTV images
Data Types
Data can be identified as the following three types:
- Structured Data: Data which is presented in a tabular format and stores in RDMS (Relational Database Management System)
- Semi-structured Data: Data which does not have a formal data model and stores in XML, JSON, etc.
- Unstructured Data: Data which does not have a pre-defined data model like video, audio, image, text, web logs, system logs, etc.
Characteristics of Big Data Technology
A regular file system with typical data processing application faces the following challenges:
- Volume – The volume of data coming from different sources is high and potentially increasing day by day
- Velocity – A single processor, limited RAM and limited storage based system is not enough to process this high volume of data
- Variety – Data coming from different sources varies
And therefore, the Big Data Technology comes into the picture:
- It helps to store, manage and process high volume and variety of data in cost & time effective manner.
- It analyzes data in its native form, which could be unstructured, structured or streaming.
- It captures data from live events in real time.
- It has a very well defined and strong system failure mechanism which provides high-availability. It handles system uptime and downtime:
- Using commodity hardware for data storage and analysis
- Maintain multiple copies of the same data across clusters
- It stores data in blocks in different machines and then merges them on demand.
Hadoop
Hadoop is a platform or framework which helps to store high volume and variety of data in single or distributed file storage. It's open source, programmed in Java and distributed by Apache Foundation. It has a distributed filesystem called HDFS (Hadoop Distributed File System) which enables storing & fast data transfer among distributed file storages and MapReduce
to process the data.
So, Hadoop has two main components:
- HDFS is a specially designed file system to store and transfer of data among parallel servers using streaming access pattern.
- MapReduce to process data.
Hadoop Hardware Architecture
There are some key terms that need to be understood:
- Commodity Hardware: PCs/Servers uses cheap hardware can be used to make clusters.
- Cluster: A set of commodity PCs/Servers interconnected in a network
- Node: Each of the commodity PCs/Servers is called node.
So, Hadoop supports the concept of distributed architecture. The above diagram shows how a set of interconnected nodes makes clusters and how the clusters are interconnected through Hadoop framework.
- The number of nodes in each cluster depends on the network speed
- Uplink from cluster to node is 3 to 4 Gb/s
- Uplink from cluster to cluster is 1 GB/s
HDFS vs Regular File System
Regular File System | Hadoop Distributed File System |
The size of each data block is small like 4KB | The size of each data block is 64 MB or 120 MB |
If a 2KB file is stored in one block, the remaining 2 KB is unused or wasted | If a file of 50 MB is stored in one block, the remaining 14 MB can be used |
Slow access to blocks | Provides high-throughput access to data blocks |
Large data access suffers from disk I/O problems, mainly because of multiple seek operation | Reads huge data sequentially after a single seek |
Provides fancy and user friendly interface for managing the file system | Provides limited interface for managing the file system |
Creates only one copy of each data block. If the data block is erased, the data is lost | Creates 3 replicas of each data block by default and distributes them on computers throughout the clusters to enable reliable and rapid data access |
- HDFS exposes a specially designed file system on top of OS defined file system.
- It facilitates the user to store data in files.
- It maintains Hierarchical file system with directories and files.
- HDFS supports different file I/O operations like create, delete, rename, move, etc.
Hadoop Core Services
Hadoop follows Master-Slave architecture. There are 5 services run in Hadoop:
NameNode
Secondary NameNode
JobTracker
DataNode
TaskTracker
NameNode
, Secondary NameNode
and JobTracker
are called master services and DataNode
and TaskTracker
are called slave services.
As the diagram denotes, each of the master services can talk to each other and each of the slave services can talk to each other. Since DataNode
is a slave service of NameNode
, they can talk to each other and TaskTracker
is a slave service of JobTracker
, they can also talk to each other.
HDFS Operation Principle
The HDFS components comprise different servers like NameNode
, DataNode
and Secondary NameNode
.
NameNode Server
NameNode
server is a single instance server which is responsible for the following:
- Maintain the file system namespace
NameNode
behaves like table of content of a book. It knows the location of each block of data.- Manage the files and directories in file system hierarchy.
- It uses a file called
FsImage
to store the entire file system namespace including mapping of blocks to file and file system properties. This file is stored inNameNode
server’s local file system. - It uses a transaction log called
EditLog
to record every change that occurs to the file system meta data. This file is stored inNameNode
server’s local file system. - If there is any I/O operation occurs in HDFS, the Meta Data files of
NameNode
server is updated. - Meta Data files are loaded into Memory of
NameNode
server. Whenever there is a newDataNode
server joined to the cluster, the Meta Data files in memory are updated and then keep an image of the files in local file system as checkpoint. - The
Metadata
size is limited to the RAM available inNameNode
server. NameNode
is a critical one point of failure. If it fails, the entire cluster will fail.- But the
NameNode
server can partially be restored from a secondary namenode server.
DataNode Server
There could be any number of DataNode
servers in a cluster depending on type of network and storage system in place. It is responsible for the following:
- Store and maintain the data blocks
- Report to
NameNode
server periodically to update meta data information - Store and retrieve the blocks when there is a request comes from client or
NameNode
server - Execute read, write requests, performs block creation, deletion and replication upon instruction from
NameNode
- Each of the
DataNode
servers sendsHeartbeat
andBlockReport
toNameNode
server in specific duration. - If any
DataNode
server does not report toNameNode
server in specific duration,NameNode
server thinks thisDataNode
server as dead and remove the metadata information of thatDataNode
server.
Secondary NameNode Server
There could be a single instance of Secondary NameNode
server. It is responsible for the following:
- Maintain a backup of
NameNode
server - It’s not treated as a disaster recovery of
NameNode
server but theNameNode
server can partially be restored from this server. - Keeps namespace image through edit log periodically
When a client request Hadoop to store a file, the request goes to NameNode
server. For example, the file size is 300 MB. Since the size of each data block is 64MB, the file will be divided into 5 chunks of data blocks where 4 of them equals 64 MB and the 5th one is 44MB and stores them in 5 different data node servers within same cluster with 3 replicas. Here, the chunks of data are called inputsplit
. NameNode
service then keep the information like where the data blocks are stored, how much is the block size, etc. This information is called meta data.
Here is the complete flow of the operation:
- The file is divided into 5 inputsplit say a.jpg, b.jpg, c.jpg, d.jpg and e.jpg and the original file name is photo.jpg and file size 300 MB
- The client sends request to
NameNode
server with this details asking what are theDataNode
Servers has available data blocks to store them. NameNode
server respond to client with the details ofDataNode
servers which has enough space to store the file. Let’s say it sends the following details:
InputSplit | DataNode Server |
a.jpg | Data Node Server 1 |
b.jpg | Data Node Server 3 |
c.jpg | Data Node Server 5 |
d.jpg | Data Node Server 6 |
e.jpg | Data Node Server 7 |
- Once the client receives response from
NameNode
server, it starts requesting theDataNode
servers to store the file. It starts sending the firstinputsplit
a.jpg toDataNode
server 1. - Once the
DataNode
Server 1 receives the request, it stores a.jpg in its local file system and request replication toDataNode
Server 3. - Once the
DataNode
Server 3 receives the request, it stores a.jpg in its local file system and request another replication toDataNode
Server 7. - Once the
DataNode
Server 7 receives the request, it stores a.jpg in its local file system and send acknowledgement back toDataNode
Server 3 saying that the file has been stored properly. DataNode
Server 3 then send acknowledgement back toDataNode
Server 1 saying that the file has been properly replicated inDataNode
Server 3 & 5DataNode
Server 1 then send acknowledgement back to the client saying that the file has been stored and replicated properly.DataNode
Server 1, 3 & 5 sendBlockReport
toNameNode
server to update metadata information.- The same process repeats for the other inputsplits.
- If any of the
DataNode
server 1, 3 & 5 stops sendingHeartbeat
andBlockReport
,NameNode
server thinks that theDataNode
server is dead and choose anotherDataNode
server to replace the replication of the a.jpginputsplit
. - There should be a program written in Java or any other language to process the file photo.jpg. The client sends this program to
JobTracker
component of Hadoop. TheJobTracker
component gets the meta data information fromNameNode
server and then communicates withTaskTracker
components of the respectiveDataNode
servers to process the file. The communication betweenJobTracker
andTaskTracker
is calledMap
. The number ofinputslit
s inDataNode
servers equals the number ofMapper
. In the above example, there will be 5 mappers running to process photo.jpg file. - The
TaskTracker
components keep reporting toJobTracker
component if they are processing the request properly or are they alive. If there is anyTaskTracker
stops reporting toJobTracker
, theJobTracker
assigns the same task to one of theTaskTracker
s where the replications of theinputslipt
are stored. - The
JobTracker
assigns tasks toTaskTracker
depending on how close theTaskTracker
is and how many mappers are running. - Each
Mapper
produces one output file for every task assigned. In this example, there will be 5 mappers producing 5 different output files. There will be aReducer
who will combine these 5 input files and report to theDataNode
server where theReducer
is running,DataNode
Server 4 for example. TheDataNode
Server 4 will then communicate withNameNode
server by providing meta data saying that there is one out file called output.jpg has been processed and ready to use. - Client will keep watching and communicate with
NameNode
server once the processing is completed 100% and output.jpg file is generated.NameNode
server response back to client saying that the file has been processed and ready to use inDataNode
Server 4. - Client then sends request to
DataNode
Server 4 directly and get the output.jpg file
Conclusion
Hope you enjoyed reading it and have learnt something new. In my next consecutive articles, I will show you how to install Hadoop and explain different components of Hadoop in details.
Thanks for reading my article and keep in touch.
History
- 22nd January, 2017: Initial version