Hadoop Beginners Guide - How to Install

Fazlur Rahman

Jan 26, 2017

CPOL

A step-by-step procedure for installing Hadoop 2.7.3 on the Ubuntu 16.04 operating system

Introduction

In my previous article, I tried to give an overview of Big Data and Hadoop. In this article, I will show you how to install Hadoop (as a single-node cluster) on the Ubuntu operating system. Windows users can also follow along by installing Ubuntu in a virtual machine to get a flavor of Hadoop. :)

Prerequisite of Hadoop

JDK: The Java Development Kit (JDK) is a software development environment used for developing Java applications and applets. It includes the Java Runtime Environment (JRE), an interpreter/loader (java), a compiler (javac), an archiver (jar), a documentation generator (javadoc) and other tools needed in Java development. Since the Hadoop framework is written in Java, it requires the JDK.
SSH: SSH ("Secure SHell") is a protocol for securely accessing one computer from another. Despite the name, SSH allows you to run command line and graphical programs, transfer files, and even create secure virtual private networks over the Internet.

Install VMWare Player and Ubuntu Operating System

This step is for Windows users only. Skip it if you already have an Ubuntu system installed, and start from the step "Install Java 8 JDK".

Download VMWare Player from here
Install VMWare Player
Download Ubuntu from here
Open VMWare Player
Click on “Create a New Virtual Machine”.
Choose the option “I will install the operating system later” and click “Next”.
Choose the option “Linux”, select “Ubuntu 64-bit” from the version drop-down list and click “Next”.
Enter the name of the virtual machine, set its location and click “Next”.
Set the maximum disk size to 40 GB if you have enough disk space, choose the option “Store virtual disk as a single file” and click “Next”.
Click on “Customize Hardware” if you have more than 4 GB of RAM.
Select 2 GB of RAM and click “Close”. Then click “Finish”.
Click on “Edit virtual machine settings”.
Click on the “CD/DVD (SATA)” hardware, choose the option “Use ISO image file” and browse to the Ubuntu ISO file. Click “OK” to close this window.
Click on “Play Virtual Machine”. This starts the Ubuntu installation. Follow the installer's steps to finish the installation.

Install Java 8 JDK

Login to Ubuntu machine
Open Terminal by pressing Ctrl+Alt+T
Switch to the super user (root) using the following command. Use the password you set when installing Ubuntu:
```
sudo su
```
Type "cd" (change directory) and press Enter to move to root's home directory:
```
cd
```
Type the following command and press Enter:
```
apt-get install openjdk-8-jdk 
```
This will ask for a confirmation. Type Y and press Enter:
This will take some time to complete. Execute “clear” command to clear the screen:
```
clear
```
Execute the following commands to see if the JDK is installed successfully:
```
java -version
javac -version
```
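As an optional sketch, the same check can be scripted so that a missing tool is reported in plain words. The function name `check_tool` is made up for this illustration; `command -v` is the standard shell way to look up a program on the PATH.

```shell
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper name, not part of the JDK or Hadoop.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Both tools are expected to be found after installing openjdk-8-jdk.
check_tool java
check_tool javac
```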

Setting JAVA_HOME Variable

Run this command to get JDK path:
```
update-alternatives --config java
```
In this case, the JDK is installed in the “/usr/lib/jvm/java-8-openjdk-amd64” directory.
Edit environment variables by typing the following command:
```
gedit /etc/environment
```
This will open an editor. Add the following line to the end of the editor:
```
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
```
Click on “Save” and close the window.
Run this command to load the edited file into the current shell. It also reveals any syntax error in the file:
```
source /etc/environment
```
Run this command to check if JAVA_HOME variable has been added properly:
```
echo $JAVA_HOME
```
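If you prefer not to read the path off the update-alternatives listing, the JDK directory can also be derived from the location of the java binary. The sketch below assumes the binary sits under "bin/java" or "jre/bin/java" inside the JDK directory, which is typical for OpenJDK 8 on Ubuntu but is an assumption here. The function name `java_home_from_bin` is made up for this illustration.

```shell
# Strip the trailing "/bin/java" (and an optional "/jre") from the full
# path of the java binary to get the JDK directory.
# java_home_from_bin is a hypothetical helper name.
java_home_from_bin() {
    local dir="${1%/bin/java}"
    echo "${dir%/jre}"
}

# Example with the JDK path used in this article:
java_home_from_bin /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

# On a machine with Java installed, resolve the real binary behind the
# "java" symlink and feed it to the helper.
if command -v java >/dev/null 2>&1; then
    java_home_from_bin "$(readlink -f "$(command -v java)")"
fi
```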

Installing SSH

Run the following command:
```
apt-get install ssh
```
This will ask for a confirmation. Type Y and press Enter.
Once done, generate a public/private RSA key pair by executing the following command:
```
ssh-keygen -t rsa -P ""
```
This will ask “Enter file in which to save the key (/root/.ssh/id_rsa):”. Type nothing and press Enter.
Make the generated public key authorized by running the following command:
```
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
```
Check if ssh is installed and running properly by executing the following command:
```
ssh localhost
```
This will ask “Are you sure you want to continue connecting (yes/no)?”. Type yes and press Enter.
If it shows an error, execute the same command again:
```
ssh localhost
```
It should display the above message if ssh is installed and running properly.
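The "cat ... >>" step above appends the key every time it is run, so repeating it leaves duplicate lines in authorized_keys. A minimal sketch of a repeat-safe variant follows. The function name `authorize_key` is made up for this illustration, and the `chmod 600` is added on the assumption that your SSH server uses its default strict permission checks.

```shell
# Append a public key to an authorized_keys file only if it is not
# already listed, then restrict the file's permissions.
# authorize_key is a hypothetical helper name.
authorize_key() {
    local pub="$1" auth="$2"
    touch "$auth"
    # -x: whole-line match, -F: fixed string, -f: read the pattern from the key file
    if ! grep -qxFf "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    chmod 600 "$auth"
}

# Usage with the default key location from the steps above:
if [ -f "$HOME/.ssh/id_rsa.pub" ]; then
    authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
fi
```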

Download Hadoop

Download Hadoop version 2.7.3 from this link.

Click on 2.7.3 version binary:

Click on the link marked as red to download the file. This will open a window. Select “Save File” option and click on “Save” button.
This will start downloading the file:
The file will be saved in default download location set in the browser.

Installing Hadoop

Close the terminal and open it again. No need to login as “su”.
Find the path where the hadoop installation file is downloaded and run the following command to unpack it.
```
tar -xvzf '<downloaded package path>'
```

In my case, it is:

tar -xvzf '/home/fazlur/Downloads/hadoop-2.7.3.tar.gz'

This creates a directory "hadoop-2.7.3" in the directory where you ran the command (the home directory, if you just opened a new terminal):
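If you would rather not depend on the terminal's current directory, tar's `-C` option extracts into a directory of your choice. The following is a small sketch of that idea. The function name `unpack_into` is made up for this illustration, and the example path assumes the archive was saved to a "Downloads" folder under your home directory.

```shell
# Extract an archive into a chosen directory and print the name of the
# top-level directory it contains.
# unpack_into is a hypothetical helper name.
unpack_into() {
    local archive="$1" dest="$2"
    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest"
    # List the archive and keep the first path component of the first entry.
    tar -tzf "$archive" | head -n 1 | cut -d/ -f1
}

# Example (adjust the path to your download location):
# unpack_into "$HOME/Downloads/hadoop-2.7.3.tar.gz" "$HOME"
```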

Configuring Hadoop

In Terminal, log in as root using the following command. Use the password you set when installing Ubuntu:
```
sudo su
```
Run this command to edit “.bashrc” file:
```
gedit ~/.bashrc
```

This will open an editor. Add the following lines to the end of this editor. Replace <JAVA_PATH> and <HADOOP_HOME_PATH> with appropriate paths:

#HADOOP VARIABLES START
export JAVA_HOME=<JAVA_PATH>
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_INSTALL=<HADOOP_HOME_PATH>
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

In my case, it looks like this:
Save and close the editor.
Run the following command to reload the .bashrc file in the current shell and check it for errors:
```
source ~/.bashrc
```
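Sourcing the file only shows syntax errors; it does not tell you whether the variables ended up with values. A small sketch for listing them follows. The function name `report_vars` is made up for this illustration, and the variable names are the ones added to .bashrc above.

```shell
# Print each named variable with its value, or flag it as unset.
# report_vars is a hypothetical helper name.
report_vars() {
    local name value
    for name in "$@"; do
        # Look up the variable whose name is stored in "name".
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set"
        fi
    done
}

report_vars JAVA_HOME HADOOP_INSTALL HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
    HADOOP_HDFS_HOME YARN_HOME
```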
Get into path “hadoop-2.7.3/etc/hadoop” by running the following command:
```
cd <HADOOP PATH>
```
In my case, it is:
```
cd /home/fazlur/hadoop-2.7.3/etc/hadoop
```
Edit “hadoop-env.sh” file using the following command:
```
gedit hadoop-env.sh
```
This will open an editor. Append this line to the end of the editor. Save and close the editor.
```
export JAVA_HOME=<Your Java Path>
```
In my case, it looks like this:
```
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```
Run the following command to check if there is any error in hadoop-env.sh file:
```
source hadoop-env.sh
```
Make a directory called “hadoop_store” in the same directory where hadoop-2.7.3 exists. And get into the directory. Run the following commands to do that:
```
cd <HOME PATH>
mkdir hadoop_store
cd hadoop_store
```
In my case, the home path is “/home/fazlur”, so the first command is:
```
cd /home/fazlur
```
Make a directory called “hdfs” and get into it. Run these commands to do that:
```
mkdir hdfs
cd hdfs
```
Make two directories called “namenode” and “datanode” inside the “hdfs” directory by running these commands:
```
mkdir namenode
mkdir datanode
```
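The same directory tree can be created in one step with `mkdir -p`, which also creates any missing parent directories and does not complain if a directory already exists. This is only a sketch; the function name `make_hdfs_dirs` is made up for this illustration.

```shell
# Create the namenode and datanode directories, including any missing
# parents, under a base directory.
# make_hdfs_dirs is a hypothetical helper name.
make_hdfs_dirs() {
    local base="$1"
    mkdir -p "$base/hadoop_store/hdfs/namenode" \
             "$base/hadoop_store/hdfs/datanode"
}

# Example with the home path used in this article:
# make_hdfs_dirs /home/fazlur
```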
Get into path “hadoop-2.7.3/etc/hadoop” by running the following command:
```
cd <HADOOP PATH>
```
In my case, it is:
```
cd /home/fazlur/hadoop-2.7.3/etc/hadoop
```
Edit “hdfs-site.xml” by running the following command. This will open an editor:
```
gedit hdfs-site.xml
```

Append the following lines between <configuration></configuration> tags. Replace <NAMENODE_FOLDER_PATH> and <DATANODE_FOLDER_PATH> with appropriate paths.

<property>
 <name>dfs.replication</name>
 <value>1</value>
 <description>Default block replication.
 The actual number of replications can be specified when the file is created.
 The default is used if replication is not specified in create time.
 </description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
 <value>file:<NAMENODE_FOLDER_PATH></value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:<DATANODE_FOLDER_PATH></value>
</property>

It looks like this in my case:
Save and close the editor.
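To avoid typing mistakes in the two paths, the directory properties can be printed with the paths filled in and then pasted into the editor. The sketch below does only that; it does not modify hdfs-site.xml. The function name `hdfs_dir_properties` is made up for this illustration, and the example paths assume the “hadoop_store” layout created earlier under /home/fazlur.

```shell
# Print the two directory properties for hdfs-site.xml with the given
# paths substituted.
# hdfs_dir_properties is a hypothetical helper name.
hdfs_dir_properties() {
    local namenode="$1" datanode="$2"
    cat <<EOF
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:$namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:$datanode</value>
</property>
EOF
}

hdfs_dir_properties /home/fazlur/hadoop_store/hdfs/namenode \
                    /home/fazlur/hadoop_store/hdfs/datanode
```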
Get into “hadoop-2.7.3” folder and create a directory called “tmp”. The following commands do this:
```
cd <hadoop-2.7.3 path>
mkdir tmp
```
In my case:
```
cd /home/fazlur/hadoop-2.7.3
mkdir tmp
```
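Move back into the “etc/hadoop” directory, where the configuration files live, and edit the “core-site.xml” file using the following commands:
```
cd etc/hadoop
gedit core-site.xml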
```

This will open an editor. Append the following lines between <configuration></configuration> tags. Replace <TMP_FOLDER_PATH> with appropriate path.

<property>
 <name>hadoop.tmp.dir</name>
 <value><TMP_FOLDER_PATH></value>
 <description>A base for other temporary directories.</description>
</property>

<property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:54310</value>
 <description>The name of the default file system.  A URI whose
 scheme and authority determine the FileSystem implementation.  The
 uri's scheme determines the config property (fs.SCHEME.impl) naming
 the FileSystem implementation class.  The uri's authority is used to
 determine the host, port, etc. for a filesystem.</description>
</property>

Here is what mine looks like:
Save and close the editor.
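After saving, it can be reassuring to read a value back from the file on the command line. The sketch below is a line-based lookup, not a real XML parser: it assumes each `<name>` and `<value>` element sits on its own line, as in the snippets in this article. The function name `get_property` is made up for this illustration.

```shell
# Print the <value> that follows a given <name> in a Hadoop *-site.xml file.
# Line-based sketch: assumes <name> and <value> are each on their own line.
# get_property is a hypothetical helper name.
get_property() {
    local file="$1" name="$2"
    awk -v name="$name" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}

# Example, run from the etc/hadoop directory:
# get_property core-site.xml fs.default.name
```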
Run the following command to create “mapred-site.xml” file using “mapred-site.xml.template” template:
```
cp mapred-site.xml.template mapred-site.xml
```
Edit “mapred-site.xml” using the following command:
```
gedit mapred-site.xml
```

This will open an editor. Append the following lines between <configuration></configuration> tags.

<property>
 <name>mapred.job.tracker</name>
 <value>localhost:54311</value>
 <description>The host and port that the MapReduce job tracker runs
 at.  If "local", then jobs are run in-process as a single map
 and reduce task.
 </description>
</property>

Here is what mine looks like:
Save and close the editor.
Move to root's home directory by executing the command “cd”.
Format Hadoop File System by running the following command:
```
hadoop namenode -format
```
Restart your machine.
Open the terminal and login as “su”.
Run this command to start hadoop:
```
start-all.sh
```
Run this command to check if all the services have been started:
```
jps
```
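Reading the jps listing by eye works, but a short script can name exactly which daemon is absent. The sketch below assumes the five daemons that a Hadoop 2 single-node setup normally starts with start-all.sh: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. The function name `missing_daemons` is made up for this illustration.

```shell
# Read jps output on standard input and print the name of every expected
# Hadoop daemon that does not appear in it.
# missing_daemons is a hypothetical helper name.
missing_daemons() {
    local output daemon
    output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # jps prints "<pid> <class name>", so compare against the last field.
        # An exact comparison keeps NameNode from matching SecondaryNameNode.
        if ! printf '%s\n' "$output" |
            awk -v d="$daemon" '$NF == d { ok = 1 } END { exit !ok }'; then
            echo "$daemon"
        fi
    done
}

# Usage, once Hadoop has been started:
# jps | missing_daemons
```

An empty result means all five daemons were found.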
If the NameNode service is missing from the output, follow these steps to get it working:
- Restart your machine.
- Open terminal and login as “su”.
- Type “cd” to move to root's home directory.
- Execute command “hadoop namenode -format” to format the Hadoop file system.
- Execute command “start-all.sh” to start all services.
- Execute command “jps” to check if all the services have been started.
Now open your favourite browser and type the following url:
```
http://localhost:8088
```
It opens a page like this if everything is up and running:
Type the following url to check datanodes as well as browse hadoop file system:
```
http://localhost:50070
```
This opens a page like this:
Navigate to “Utilities-->Browse the file system” to check hadoop file system:

Conclusion

I hope you enjoyed reading this and now have a successful installation of Hadoop on your Ubuntu system. In my next articles, I will explain the different components of Hadoop in detail.

Thank you for reading my article and keeping in touch.

History

26th January, 2017: Initial version