This article is intended for beginners who want to setup Hadoop debug environment, i.e., debugging Namenode, Datanode and Map Reduce APIs. This will help in better understanding of the framework itself. Being a Microsoft developer, I was new to Java, IntelliJ IDEA and Hadoop development. The reason why I wrote this article is I struggled for a couple of days to setup debug environment as I didn't get complete details online. However I had a strong feeling of learning this wonderful bigdata technology and that helped me cross all the hurdles.:) I hope this article helps those who want to start learning hadoop from scratch.
This article is divided into two parts:
- Part 1. Running HDFS by building the source code
- Part 2. Setting up debug environment using IntelliJ
In general, Hadoop technology consists of two parts - data storage and analyzing the stored data.
HDFS (Hadoop Distributed File System) that stores and manages data and MapReduce framework that does data analysis job.
HDFS consists of Namenode to manage metadata of file system and several Datanodes to store the data.
I assume the reader has sufficient understanding of the basics of Hadoop architecture. If not, please spend some time learning and come back to this article. I suggest reading Tom White's "Hadoop, The Definitive Guide" book.
Namenode and Datanode are implemented as Linux daemons and both exposes RPC and HTTP interfaces for communication.
Namenode RPC interface listen on port 8020 by default and HTTP interface on port 50070 by default.
Datanode RPC interface listen on port 50010 by default and HTTP interface on port 50075.
These ports can be overridden using the configuration.
The Namenode and Datanode runs on their own JVM process. So we can run them on the same machine with some simple configuration changes.
I've used Ubuntu 14.x Linux VM on my Windows box.
- Install Java SDK 1.8
Run the below commands on your command prompt and Ubuntu will install sdk on your machine.
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
Verify the installation by running
java -version on command prompt. On my machine, it displays as below.
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
- Install Maven
sudo apt-get install maven
Verify maven installation. Type
mvn -version on command prompt. You should get:
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.8.0_65, vendor: Oracle Corporation
Java home: /usr/local/jdk1.8.0_65/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "Linux", version: "3.19.0-25-generic", arch: "amd64", family: "unix"
- Install native libraries
sudo apt-get -y install maven build-essential autoconf automake libtool
cmake zlib1g-dev pkg-config libssl-dev libfuse-dev
- Install Google's protocol buffers
sudo apt-get install protobuf-compiler
Verify it by typing
protoc on command prompt. You should get:
Missing input file.
- Download IntelliJ IDEA for Linux.
https://www.jetbrains.com/idea/download/#section=linux You should choose community edition.
Download Hadoop Source Code
There are different ways to get hadoop source code. I chose to get it from github directly so that I can get latest codebase. When you start contributing to hadoop, you can use git tool to clone the repository. For now, we can download it directly from github.
- Goto https://github.com/apache/hadoop
- Click on download ZIP option and it'll be downloaded to 'Downloads' folder.
- Unzip it and you get a folder called hadoop-trunk
- Place this folder in home directory. I placed it under ~/code/hadoop-trunk
Hadoop Source Code Structure
Hadoop consists of many projects that are listed below. Each project contains various modules.
The projects that we're interested right now are hadoop-common-project and hadoop-hdfs-project.
The hadoop-common-project has module called hadoop-common that has all the common libraries shared by all the projects/modules.
The hadoop-hdfs-project has hadoop-hdfs module that's the standard implementation of HDFS. It contains Namenode and Datanode implementation.
Build Hadoop Source Code
Hadoop has many dependencies that are typically available in maven repository. So we need to build the source code once to get all the dependencies. Make sure that you're connected to the Internet.
Goto hadoop source code and run this command
mvn clean install -Pdist -DskipTests
~/code/hadoop-trunk/ $mvn clean install -Pdist -DskipTests
It will take some time based on your Internet speed to resolve dependencies and build the source code, so relax!
Make sure that build is successful. If you've followed instructions correctly, then you shouldn't get any errors. If you get any errors, then Google it for resolution or please post in the comment section. We will try to help!
The build produces two important folders that are interests us now. In my case, they are:
One is in hadoop-common : /home/mallan/code/hadoop-trunk/hadoop-common-project/hadoop-common/target/hadoop-common-3.0.0-SNAPSHOT
One is in hadoop-hdfs: /home/mallan/code/hadoop-trunk/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-3.0.0-SNAPSHOT
3.0.0 indicates the hadoop version. It might change based on your source version.
Whatever we need for HDFS exist in only these two folders. Spend some time in exploring the content of these folders.
Let's configure the hadoop environment variables for these two folders.
open the .bashrc file (exists in your home location) and copy below lines.
HADOOP_MAPRED_HOME are needed to just make script happy, else it will will cry. So, I've provided dummy paths. They need to be configured properly when you need to enable Map reduce and YARN framework.
Close all open Terminals and open them back so that these environment variables will be set.
Next, we need to configure default file system home in /home/mallan/code/hadoop-trunk/hadoop-common-project/hadoop-common/target/hadoop-common-3.0.0-SNAPSHOT/etc/core-site.xml and add this property to the configuration section.
<description>The name of the default file system. </description>
Run HDFS (Namenode and Datanode)
You are ready to run your own built Namenode and Datanode.
Go to prompt and type
hdfs namenode -format. This will prepare the file system metadata files in the /tmp folder.
hdfs namenode. This will run the Namenode that will start listening at 8020 and 50070 port.
In another terminal, type
hdfs datanode. This will run Datanode that will start listening at 50010 and 50075 port as explained above.
Now using other Terminal, you can run dfs commands such as:
hdfs dfs -ls /
hdfs dfs -mkdir /folder1
hdfs dfs -coptyFromLocal <source> <destination>
I hope you are able to run HDFS using the binaries that you've built. In the following section let's setup our own debug environment where you we debug Namenode and Datanode.
Setup source code with IntelliJ
The hadoop project uses Maven for building and maintaining the project. If you are new to Maven then I suggest spending some time learning it.
Maven project consists POM.xml file that defines the project structure and dependecies. The IntelliJ is very good in parsing the POM file.
Follow these steps
1. Run IntelliJ, you get below screen
2. Click Open and select the POM.xml in the hadoop-trunk folder.
3. IntelliJ IDEA will read the POM file and loads all the projects. Each project/module in turn contains its own POM.xml file. The IntelliJ resolves all the dependencies and finally it looks as below.
4. If you build the projects using Build->Rebuild option, it should just build fine. However if you goto Namenode.java (hadoop-hdfs-project->hadoop-hdfs->src->main->java->org.apache.hadoop->hdfs->server->Namenode->Namenode.java)
And tries to Run is using Run option (you get it if you right click on the file) It doesn’t run, it throws some NoClassFoundErros. Use below steps to fix it.
4.1 We need to make some changes to POM file in (hadoop-hdfs-project->hadoop- hdfs->POM.xml file.
Go to <dependencies> section, you see that some of the dependency has provided as scope type. It means that the container that runs process should provide these dependencies in run time. The IntelliJ IDEA cannot provide this so change all ‘provided’ to ‘compile’ option
Learn more about Maven dependency here
4.2 Go to hadoop-common-projectàhadoop-commonàsrcàmainàresourcesàcore-defaults.xml.
Make sure that fs.defaultFS and fs.default.name has the hdfs://localhost/ as a value
<description>The name of the default file system. </description>
4.3 We need to fix one last thing. The Namenode runs HttpServer and it has to locate the web application folder.
In explorer Go to ~/code/hadoop-trunk/hadoop-hdfs-project/hadoop-hdfs/target/ you’ll fine webapps folder. Copy this and place it inside ‘classes’ folder. Where run time looks for this webapps folder. The below screenshot shows after copying webapps folder.
5. Before we start debugging run hdfs namenode –format This will prepares the metadata folder in /tmp.
6. Now we are ready to run Namenode in debug mode. Go to Namenode.java, right click on it and select ‘debug Namenode’ option. Before this you may want to place break point in main method of Namenode.
7. Once Namenode is running in debug mode, you can run some hdfs commands from command prompt. i.e. hdfs dfs –ls / Below screenshot shows hitting the breakpoint in NamenodeRpcServer class
Side by Side Execution
When debugging either Namenode or Datanode we can make use of the instances that we are running from the binaries that we built. For e.g.
When trying to debug Namenode, we can run only Datanode from the binaries that we built. The debugging Namenode from IntelliJ can connect to Datanode that’s running. Similarly for Datanode debugging.
I hope you're succefully able to debug HDFS now. I believe it's a agreat way to learn Hadoop architecture. If you come across any issues please do post a comment. Thank you.
- 26th December, 2015: Initial version
- 27th Decemeber, 2015: Update with setting up debug environment.