Hadoop Beginners Guide - How to Install





A step-by-step procedure for installing Hadoop 2.7.3 on the Ubuntu 16.04 operating system
Introduction
In my previous article, I tried to give an overview of Big Data and Hadoop. In this article, I will show you how to install Hadoop (as a single-node cluster) on the Ubuntu operating system. Windows users can also follow this article by installing Ubuntu in a virtual machine and get a flavor of Hadoop. :)
Prerequisites for Hadoop
- JDK: The Java Development Kit (JDK) is a software development environment used for developing Java applications and applets. It includes the Java Runtime Environment (JRE), an interpreter/loader (java), a compiler (javac), an archiver (jar), a documentation generator (javadoc) and other tools needed in Java development. Since the Hadoop framework is written in Java, it requires a JDK.
- SSH: SSH ("Secure SHell") is a protocol for securely accessing one computer from another. Despite the name, SSH allows you to run command line and graphical programs, transfer files, and even create secure virtual private networks over the Internet. Hadoop's start-up scripts use SSH to launch its services, even on a single machine.
Install VMware Player and Ubuntu Operating System
This step is for Windows users only. If you already have an Ubuntu system installed, skip it and start from the step "Install Java 8 JDK".
- Download VMware Player from here
- Install VMware Player
- Download Ubuntu from here
- Open VMware Player
- Click on “Create a New Virtual Machine”.
- Choose the option “I will install the operating system later” and click on the “Next” button.
- Choose the option “Linux”, select “Ubuntu 64-bit” from the version drop-down list and click on the “Next” button.
- Enter the name of the virtual machine, set its location and click on the “Next” button.
- Set the maximum disk size to 40 GB if you have enough disk space, choose the option “Store virtual disk as a single file” and click on the “Next” button.
- Click on “Customize Hardware” if you have more than 4 GB of RAM.
- Select 2 GB of RAM and click on the “Close” button. Then click on the “Finish” button.
- Click on “Edit virtual machine settings”.
- Click on the “CD/DVD (SATA)” hardware entry, choose the option “Use ISO image file” and browse to the Ubuntu ISO file. Click “OK” to close this window.
- Click on “Play Virtual Machine”. This starts the Ubuntu installer. Follow its step-by-step procedure to finish the installation.
Install Java 8 JDK
- Log in to the Ubuntu machine.
- Open a terminal by pressing Ctrl+Alt+T.
- Switch to the super user (root) using the following command. Enter the password you chose when you installed Ubuntu:
sudo su
- Type "cd" (change directory) and press Enter to move to the home directory of the current user:
cd
- Type the following command and press Enter:
apt-get install openjdk-8-jdk
- This will ask for confirmation. Type Y and press Enter.
- The installation takes some time to complete. Afterwards, execute the “clear” command to clear the screen:
clear
- Execute the following commands to check that the JDK was installed successfully:
java -version
javac -version
Setting JAVA_HOME Variable
- Run this command to get the JDK path:
update-alternatives --config java
- In my case, the output shows that the JDK is installed under “/usr/lib/jvm/java-8-openjdk-amd64”.
- Edit the system-wide environment variables by typing the following command:
gedit /etc/environment
- This will open an editor. Add the following line to the end of the file:
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
- Click on “Save” and close the window.
- Run this command to load the edited file into the current shell. An error in the file shows up at this point:
source /etc/environment
- Run this command to check that the JAVA_HOME variable has been set properly. It should print the JDK path:
echo $JAVA_HOME
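The JDK path can also be derived instead of being read off the update-alternatives output. The following is a minimal sketch, assuming that javac is on the PATH and that the JDK keeps it under "<JDK home>/bin/javac", as the OpenJDK package used here does. The function name jdk_home_from_javac is made up for this illustration.

```shell
# Hypothetical helper: strip the trailing /bin/javac from a javac path
# to obtain the JDK home directory.
jdk_home_from_javac() {
  printf '%s\n' "${1%/bin/javac}"
}

# Resolve the symlink chain behind javac (only if javac is installed)
# and print the directory that JAVA_HOME should point to.
if command -v javac >/dev/null 2>&1; then
  jdk_home_from_javac "$(readlink -f "$(command -v javac)")"
fi
```

On the setup described in this article, the printed directory should match the path entered in /etc/environment.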
Installing SSH
- Run the following command:
apt-get install ssh
- This will ask for confirmation. Type Y and press Enter.
- Once done, generate a public/private RSA key pair with an empty passphrase by executing the following command:
ssh-keygen -t rsa -P ""
- This will ask “Enter file in which to save the key (/root/.ssh/id_rsa):”. Type nothing and press Enter to accept the default location.
- Authorize the generated public key by running the following command:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- Check that ssh is installed and running properly by executing the following command:
ssh localhost
- This will ask “Are you sure you want to continue connecting (yes/no)?”. Type yes and press Enter.
- If it shows an error, execute the same command again:
ssh localhost
- If ssh is installed and running properly, you are logged in without being asked for a password.
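Running the "cat ... >> authorized_keys" step more than once appends the same key again each time. The sketch below shows one way to make that step repeatable; the function name authorize_key is hypothetical, and the permission change assumes the default strict mode checks of the SSH server.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# only if that exact line is not already present.
authorize_key() {
  pub_file=$1
  auth_file=$2
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  # With strict mode checks, sshd can reject a key file that other
  # users are able to write to, so restrict it to the owner.
  chmod 600 "$auth_file"
  # -x: whole-line match, -F: fixed string, -f: read the pattern from file
  if ! grep -q -x -F -f "$pub_file" "$auth_file"; then
    cat "$pub_file" >> "$auth_file"
  fi
}

# Usage for the key generated above:
#   authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
```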
Download Hadoop
Download Hadoop version 2.7.3 from this link.
- On the download page, click on the binary for version 2.7.3.
- Click on the link marked in red to download the file. This will open a window. Select the “Save File” option and click on the “Save” button.
- This will start downloading the file.
- The file will be saved in the default download location set in the browser.
Installing Hadoop
- Close the terminal and open it again. There is no need to log in as “su” this time.
- Find the path where the Hadoop installation file was downloaded and run the following command to unpack it:
tar -xvzf '<downloaded package path>'
- In my case, it is:
tar -xvzf '/home/fazlur/Downloads/hadoop-2.7.3.tar.gz'
- This creates a directory "hadoop-2.7.3" in the current directory, which is the home directory in a freshly opened terminal.
Configuring Hadoop
- In the terminal, log in as root using the following command. Enter the password you chose when you installed Ubuntu:
sudo su
- Run this command to edit the “.bashrc” file:
gedit ~/.bashrc
- This will open an editor. Add the following lines to the end of the file. Replace <JAVA_PATH> and <HADOOP_HOME_PATH> with the appropriate paths:
#HADOOP VARIABLES START
export JAVA_HOME=<JAVA_PATH>
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_INSTALL=<HADOOP_HOME_PATH>
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
- In my case, <JAVA_PATH> is “/usr/lib/jvm/java-8-openjdk-amd64” and <HADOOP_HOME_PATH> is “/home/fazlur/hadoop-2.7.3”.
- Save and close the editor.
- Run the following command to apply the changes to the current shell. An error in the .bashrc file shows up at this point:
source ~/.bashrc
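A quick way to confirm that the new variables reached the shell is sketched below. The helper check_vars_set is a hypothetical name introduced for this example; "hadoop version" is the regular Hadoop command that prints the installed version.

```shell
# Hypothetical helper: report which of the given variable names are unset
# or empty in the current shell. Returns 0 only if all of them are set.
check_vars_set() {
  status=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      status=1
    fi
  done
  return "$status"
}

# Usage after "source ~/.bashrc":
#   check_vars_set JAVA_HOME HADOOP_INSTALL && hadoop version
```

If both variables are set and the PATH entries are correct, the second command should run without a "command not found" error.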
- Go to the “hadoop-2.7.3/etc/hadoop” directory by running the following command:
cd <HADOOP PATH>
In my case, it is:
cd /home/fazlur/hadoop-2.7.3/etc/hadoop
- Edit the “hadoop-env.sh” file using the following command:
gedit hadoop-env.sh
- This will open an editor. Append this line to the end of the file. Save and close the editor.
export JAVA_HOME=<Your Java Path>
In my case, it looks like this:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
- Run the following command to check that there is no error in the hadoop-env.sh file:
source hadoop-env.sh
- Make a directory called “hadoop_store” in the same directory where hadoop-2.7.3 exists, and go into it. Run the following commands to do that:
cd <HOME PATH>
mkdir hadoop_store
cd hadoop_store
- In my case, it is:
cd /home/fazlur
mkdir hadoop_store
cd hadoop_store
- Make a directory called “hdfs” and go into it. Run these commands to do that:
mkdir hdfs
cd hdfs
- Make two directories called “namenode” and “datanode” inside the “hdfs” directory. Run these commands to do that:
mkdir namenode
mkdir datanode
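The directory steps above can also be collapsed into a single call. This is only a sketch: make_hdfs_dirs is a hypothetical helper name, and the base directory is whatever directory holds hadoop-2.7.3 on your machine.

```shell
# Hypothetical helper: create <base>/hadoop_store/hdfs/namenode and
# <base>/hadoop_store/hdfs/datanode in one go. "mkdir -p" creates any
# missing parent directories and does not fail if they already exist.
make_hdfs_dirs() {
  base=$1
  mkdir -p "$base/hadoop_store/hdfs/namenode" \
           "$base/hadoop_store/hdfs/datanode"
}

# Usage for the layout used in this article:
#   make_hdfs_dirs /home/fazlur
```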
- Go back to the “hadoop-2.7.3/etc/hadoop” directory by running the following command:
cd <HADOOP PATH>
In my case, it is:
cd /home/fazlur/hadoop-2.7.3/etc/hadoop
- Edit “hdfs-site.xml” by running the following command. This will open an editor:
gedit hdfs-site.xml
- Add the following lines between the <configuration></configuration> tags. Replace <NAMENODE_FOLDER_PATH> and <DATANODE_FOLDER_PATH> with the paths of the “namenode” and “datanode” directories created earlier.
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:<NAMENODE_FOLDER_PATH></value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:<DATANODE_FOLDER_PATH></value>
</property>
- In my case, the two values are “file:/home/fazlur/hadoop_store/hdfs/namenode” and “file:/home/fazlur/hadoop_store/hdfs/datanode”.
- Save and close the editor.
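To avoid typing the two paths by hand, the property blocks can be printed with the paths already filled in. The sketch below assumes the "namenode" and "datanode" directories share one parent, as in this article; print_hdfs_properties is a hypothetical helper name, and the optional description element is left out for brevity.

```shell
# Hypothetical helper: print the three <property> blocks for
# hdfs-site.xml, with the NameNode and DataNode paths built from
# their common parent directory.
print_hdfs_properties() {
  hdfs_dir=$1
  cat <<EOF
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:$hdfs_dir/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:$hdfs_dir/datanode</value>
</property>
EOF
}

# Usage: print the blocks, then paste them between the
# <configuration></configuration> tags of hdfs-site.xml.
#   print_hdfs_properties /home/fazlur/hadoop_store/hdfs
```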
- Go into the “hadoop-2.7.3” folder and create a directory called “tmp”. The following commands do this:
cd <hadoop-2.7.3 path>
mkdir tmp
In my case:
cd /home/fazlur/hadoop-2.7.3
mkdir tmp
- Return to the “etc/hadoop” directory, where the configuration files are:
cd etc/hadoop
- Edit the “core-site.xml” file using the following command:
gedit core-site.xml
- This will open an editor. Add the following lines between the <configuration></configuration> tags. Replace <TMP_FOLDER_PATH> with the path of the “tmp” directory created above.
<property>
  <name>hadoop.tmp.dir</name>
  <value><TMP_FOLDER_PATH></value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
- In my case, <TMP_FOLDER_PATH> is “/home/fazlur/hadoop-2.7.3/tmp”. Note that Hadoop 2.x prefers the property name “fs.defaultFS”; the older name “fs.default.name” used here is deprecated but still accepted.
- Save and close the editor.
- Run the following command to create the “mapred-site.xml” file from the “mapred-site.xml.template” template:
cp mapred-site.xml.template mapred-site.xml
- Edit “mapred-site.xml” using the following command:
gedit mapred-site.xml
- This will open an editor. Add the following lines between the <configuration></configuration> tags. This block contains no placeholders, so it can be used as is.
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
- Save and close the editor.
- Go to the home directory by executing the command “cd”.
- Format the Hadoop file system by running the following command:
hadoop namenode -format
- Restart your machine.
- Open the terminal and log in as “su”.
- Run this command to start Hadoop:
start-all.sh
- Run this command to check whether all the services have been started:
jps
- If the NameNode service is missing from the output, follow these steps to get it working:
  - Restart your machine.
  - Open the terminal and log in as “su”.
  - Type “cd” to move to the home directory.
  - Execute the command “hadoop namenode -format” to format the Hadoop file system.
  - Execute the command “start-all.sh” to start all services.
  - Execute the command “jps” to check whether all the services have been started.
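Reading the jps output by eye is easy to get wrong, so the check can be scripted. The sketch below assumes the five services that start-all.sh launches on a single-node Hadoop 2.x setup: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. The function name missing_daemons is hypothetical.

```shell
# Hypothetical helper: read "jps" output on standard input and print the
# name of every expected Hadoop service that is missing from it.
# Returns 0 when all five services are present.
missing_daemons() {
  jps_output=$(cat)
  status=0
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # jps prints "<pid> <name>"; compare the second column exactly so
    # that "SecondaryNameNode" is not mistaken for "NameNode".
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   jps | missing_daemons
```

An empty result means every expected service is listed; otherwise the printed names are the ones to investigate.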
- Now open your favourite browser and enter the following URL, which is the web interface of the YARN ResourceManager:
http://localhost:8088
- If everything is up and running, it opens the cluster overview page.
- Enter the following URL, the web interface of the NameNode, to check the DataNodes and browse the Hadoop file system:
http://localhost:50070
- This opens the NameNode overview page.
- Navigate to “Utilities --> Browse the file system” to check the Hadoop file system.
Conclusion
I hope you enjoyed reading and now have a successful installation of Hadoop on your Ubuntu system. In my next articles, I will explain the different components of Hadoop in detail.
Thank you for reading my article and keeping in touch.
History
- 26th January, 2017: Initial version