Posted 9 Sep 2017

Big Data MapReduce Hadoop Scala on Ubuntu Linux by Maven IntelliJ Idea

12 Sep 2017 · CPOL · 8 min read
A comprehensive article that explains big data from beginning to end, from the concepts down to a practical example.

Introduction

When I started searching the net about big data, I could not find any complete article which explained the process from beginning to end; each article described only part of this huge concept. Feeling the lack of a comprehensive article, I decided to write this essay.

Nowadays, we face a growing volume of data. Our desire to keep all of this data for the long term, and to process and analyze it with high velocity, calls for a good solution. Finding a suitable substitute for traditional databases, one which can store any type of structured and unstructured data, is the most challenging task for data scientists.

Image 1

I have selected a complete scenario, from the first step to the result, which is hard to find anywhere on the internet. I chose MapReduce, processing data with Hadoop and Scala inside IntelliJ IDEA, while all of this story happens on Ubuntu Linux as the operating system.

What is Big Data?

Big data is a huge volume of massive data, structured, unstructured or semi-structured, which is difficult to store and manage with traditional databases. Its volume ranges from terabytes to petabytes. Relational databases such as SQL Server are not suitable for storing unstructured data. As you see in the picture below, big data is the set of all kinds of structured or unstructured data with fundamental requirements such as storage, management, sharing and analysis.

Image 2

What is Hadoop?

Indeed, we expect two things from any database. First, we need to store data; second, we want to process the stored data properly, in a fast and accurate way. Because of the arbitrary shape and large volume of big data, it is impossible to store it in traditional databases. We need to think about a new one which can handle both the storing and the processing of big data.

Hadoop is a revolutionary database for big data: it has the capacity to save data of any shape and to process it across a cluster of nodes. It also dramatically reduces the cost of data maintenance. With the aid of Hadoop, we can store any sort of data, for example all user clicks over a long period, which makes historical analysis easy. Hadoop has distributed storage and also a distributed processing system, such as MapReduce.

Image 3

The two duties expected from a database are obvious in the picture below: first storage, on the left as the beginning step, then processing on the right side. Hadoop is an innovative database which differs from traditional and relational databases. Hadoop runs as a bunch of clusters on multiple machines, and each cluster includes numerous nodes.

What is Hadoop Ecosystem?

As I mentioned above, Hadoop is proper for both storing unstructured data and processing it. The picture below gives an abstract view of the Hadoop ecosystem. Saving data is on the left side, with two different storage possibilities: HDFS and HBase. HBase sits on top of HDFS, and both are written in Java.

Image 4

HDFS:

  1. The Hadoop Distributed File System allows storing large data in distributed flat files.
  2. HDFS is good for sequential access to data.
  3. There is no random real-time read/write access to data; it is more proper for offline batch processing.

HBase:

  1. Stores data as key/value pairs in a columnar fashion.
  2. HBase supports real-time read/write access.

Hadoop was written in Java, but you can also work with it from R, Python or Ruby.

Spark is an open-source cluster computing framework, implemented in Scala, which is suitable for iterative jobs on distributed computations and offers high performance.

Hive, Pig, Sqoop and Mahout provide data access and make querying the database possible. Hive and Pig are SQL-like; Mahout is for machine learning.

MapReduce and YARN both work to process data. MapReduce is built to process data in distributed systems; it handles all tasks such as job and task tracking, monitoring, and parallel execution. YARN (Yet Another Resource Negotiator) is MapReduce 2.0, with better resource management and scheduling.

Sqoop is a connector to import and export data from/to external databases. It makes transferring data easy and fast, in a parallel way.

How Hadoop Works?

Hadoop follows a master/slave architecture model. For HDFS, the name node on the master monitors and keeps track of all the slaves, which form a bunch of storage clusters.

Image 5

There are two different types of jobs which do all of the magic for MapReduce processing. The Map job sends a query for processing to the various nodes in the cluster, broken down into smaller tasks. Then the Reduce job collects all the output that each node has produced and combines it into one single value as the final result.

Image 6

This architecture makes Hadoop an inexpensive solution which is very quick and reliable: divide a big job into smaller ones and put each task on a different node. This story might remind you of multithreading. In multithreading, all concurrent processes share data with the aid of locks and semaphores, whereas data access in MapReduce is under the control of HDFS.
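
The two phases described above can be sketched in plain Java. This is only a minimal illustration with no Hadoop involved; all class and method names here are my own invention, not part of the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // "Map" phase: a node emits a (word, 1) pair for every word in its split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" phase: pairs grouped by key are summed into one value per word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two "splits" stand in for two nodes of the cluster.
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        all.addAll(map("sun moon sun"));
        all.addAll(map("moon night"));
        System.out.println(reduce(all)); // {moon=2, night=1, sun=3}
    }
}
```

In real Hadoop, the shuffle between the two phases happens across the network under HDFS control; here, collecting all pairs into one list plays that role.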

Get Ready for Hands On Experience

Until now, we have covered the abstract meaning of Hadoop and big data. Now it is time to dig in and build a practical project. We need to prepare some hardware which meets our requirements.

Hardware

  1. RAM: 8 GB
  2. Hard Free Space: 50 GB

Image 7

Software

  1. Operating System: Windows 10 - 64-bit
  2. Oracle Virtual Machine VirtualBox
  3. Linux-ubuntu 17.04
  4. Jdk-8
  5. Ssh
  6. Hadoop
  7. Hadoop jar
  8. Intellij idea
  9. Maven
  10. Writing Scala code or wordcount example

1. Operating System: Windows 10 - 64-bit

If you still do not have Windows 10 and want to upgrade your OS, follow the link below to get it right now:

Image 8

2. Oracle Virtual Machine VirtualBox

Please go to this link, click on "Windows hosts" and get the latest VirtualBox. Then follow the instructions below:

Images 9–15 (VirtualBox installation steps)

3. Linux-ubuntu 17.04

Please go to this link, click on "64-bit" and get the latest Ubuntu. Then follow the instructions below:

Image 16

Click on "New" to create a new VM for the new OS, which is Linux/Ubuntu 64-bit.

Image 17

In the picture below, you assign memory to the new OS.

Image 18

Image 19

Image 20

Image 21

In the picture below, you assign a hard disk to the new OS.

Image 22

Now, it is time to select your Ubuntu 17 image, which is an .iso file. Please select "ubuntu-17.04-desktop-amd64".

Image 23

Image 24

Image 25

Image 26

Now, double-click on the new VM to start the installation.

Image 27

Select your language and click on "Install Ubuntu".

Images 28–32 (Ubuntu installation steps)

In the picture below, you specify who you are: the first user who can use Ubuntu.

Image 33

Images 34–39 (completing the Ubuntu installation)

sudo apt-get update

If you see the error below, try one of these two solutions:

Image 40

First Solution:

sudo rm /var/lib/apt/lists/lock
sudo rm /var/cache/apt/archives/lock
sudo rm /var/lib/dpkg/lock

Image 41

Second Solution:

sudo systemctl stop apt-daily.timer
sudo systemctl start apt-daily.timer

Image 42

Then again:

sudo apt-get update

Image 43

Image 44

Image 45

Image 46

Let's create a new user and group for hadoop.

sudo addgroup hadoop 

Image 47

Add user to group:

sudo adduser --ingroup hadoop hduser

Image 48

Enter your details, then press "y" to save them.

Image 49

4. Installing JDK

sudo apt-get update
java -version
sudo apt-get install default-jre

Image 50

Image 51

Image 52

5. Installing the OpenSSH Protocol

OpenSSH provides secure, encrypted communication over the network using the SSH protocol. Why do we use OpenSSH? Because Hadoop uses SSH to reach its nodes, and a key without a passphrase lets it log in without prompting for a password.

sudo apt-get install openssh-server

Image 53

Image 54

ssh-keygen -t rsa -P ''

After running the above command, if you are asked for a file name, leave it blank and press Enter.

Image 55

Then try to run:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If you get a "No such file or directory" error, follow the instructions below:

Image 56

After running "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys" -> error: "No such file or directory"

  1. mkdir ~/.ssh
  2. chmod 700 ~/.ssh
  3. touch ~/.ssh/authorized_keys
  4. chmod 600 ~/.ssh/authorized_keys
  5. touch ~/.ssh/id_rsa.pub
  6. chmod 600 ~/.ssh/id_rsa.pub

Image 57

Image 58

Image 59

Image 60

6. Installing Hadoop

First, download Hadoop with the command below:

wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Image 61

Image 62

To extract the Hadoop archive:

tar xzf hadoop-2.7.1.tar.gz 

Image 63

To find Java path:

update-alternatives --config java

Everything before /jre in the output is your Java path.

Image 64

1. Edit ~/.bashrc

nano ~/.bashrc

Image 65

The above command opens ~/.bashrc in the nano editor. The Java path you found with "update-alternatives --config java" will be hard-coded as JAVA_HOME.

Then at the end, enter the below lines:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/home/mahsa/hadoop-2.7.1
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
#HADOOP VARIABLES END

Image 66Image 67

Also change JAVA_HOME to the hard-coded path, such as:

Image 68

To close and save the file, press Ctrl+X; when asked to save, press "y", and when asked for the file name, just press Enter without changing it.

To apply the changes, execute:

source ~/.bashrc

Image 69

2. Edit core-site.xml

sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/core-site.xml

Image 70

Write between the <configuration> tags:

XML
<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>

Image 71

3. Edit yarn-site.xml

sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/yarn-site.xml

Write between the <configuration> tags:

XML
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

Image 72

4. Edit mapred-site.xml

cp /home/mahsa/hadoop-2.7.1/etc/hadoop/mapred-site.xml.template /home/mahsa/hadoop-2.7.1/etc/hadoop/mapred-site.xml

sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/mapred-site.xml

Image 73

XML
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>

Image 74

5. Edit hdfs-site.xml

mkdir -p /home/mahsa/hadoop_store/hdfs/namenode

Image 75

mkdir -p /home/mahsa/hadoop_store/hdfs/datanode

Image 76

sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/hdfs-site.xml

XML

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/home/mahsa/hadoop_store/hdfs/namenode</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/home/mahsa/hadoop_store/hdfs/datanode</value>
</property>

Image 77

6. Format the NameNode

hdfs namenode -format

Image 78

Image 79

start-dfs.sh

Image 80

start-yarn.sh

Image 81

By running "jps", you can make sure that Hadoop has been installed properly.

jps

Image 82

7. Installing IntelliJ IDEA

Images 83–95 (IntelliJ IDEA installation steps)

There are two possibilities for the MapReduce implementation: one is scala-sbt and the other is Maven. I want to describe both of them, starting with scala-sbt.

Images 96–104 (scala-sbt project setup)

8. Installing Maven

Images 105–109 (Maven project setup)

There are three text files in the picture below for practicing word count. MapReduce starts by splitting each file across the cluster of nodes, as I explained at the top of this article. In the mapping phase, each node is responsible for counting the words in its split. After the intermediate splitting, each node holds homogeneous words together with the count of each specific word from the previous step. Then, in the reducing phase, the values are summed up and each node collects its own result to produce a single value per word.

Image 110

Images 111–114

In the pop-up shown in the picture below, you must click "Enable Auto Import", which will handle all the libraries.

Image 115

Image 116

Image 117

Image 118

Go to Menu "Run" -> Select "Edit Configurations":

  1. Select "Application"
  2. Set "Main class" to "App"
  3. Set the program arguments to: "input/ output/"

You just need to create "input" as a directory; the "output" directory will be created by the job.

After running application, you will see two files:

  1. "_SUCCESS"
  2. "part-r-00000"

If you open "part-r-00000", you will see the result as:

  • moon 2
  • sun 3
  • day 3
  • night 3
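
By default, Hadoop's TextOutputFormat writes each key/value pair on its own line, with a tab between the word and its count. As a small sketch (the class name ResultParser is my own, hypothetical helper, not part of the project), this is how such a "part-r-00000" file could be parsed back into a map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultParser {
    // Each line of the reducer output is "word<TAB>count".
    static Map<String, Integer> parse(String fileContents) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : fileContents.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] kv = line.split("\t");
            counts.put(kv[0], Integer.parseInt(kv[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "moon\t2\nsun\t3\nday\t3\nnight\t3";
        System.out.println(parse(sample)); // {moon=2, sun=3, day=3, night=3}
    }
}
```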

Using the Code

Java
package bigdata.mahsa;

/**
 * wordcount!
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.StringTokenizer;

public class App extends Configured implements Tool {

    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final transient Text word = new Text();

        @Override
        public void map(final LongWritable key, final Text value, final Context context)
                throws IOException, InterruptedException {
            final String line = value.toString();
            final StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {  // splitting
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(final Text key, final Iterable<IntWritable> values,
                           final Context context)
                throws IOException, InterruptedException {
            int sumofword = 0;
            for (final IntWritable val : values) {
                sumofword += val.get();
            }
            context.write(key, new IntWritable(sumofword));
        }
    }

    @Override
    public int run(final String[] args) throws Exception {
        final Configuration conf = this.getConf();
        final Job job = Job.getInstance(conf, "Word Count");
        job.setJarByClass(App.class);

        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // read text from input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // write result to output

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(final String[] args) throws Exception {
        final int result = ToolRunner.run(new Configuration(), new App(), args);
        System.exit(result);
    }
}

Pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>bigdata.mahsa</groupId>
  <artifactId>App</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>wordcount</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.7.1</hadoop.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>bigdata.mahsa.App</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Feedback

Feel free to leave any feedback on this article; it is a pleasure to see your opinions and votes about this code. If you have any questions, please do not hesitate to ask me here.

History

  • 9th September, 2017: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Mahsa Hassankashi
Doctorandin Technische Universität Berlin
Iran (Islamic Republic of)
I have been working with different technologies and data for more than 10 years.
I like to take on complex problems and then make them easy for everyone to use; that is the best joy.

ICT Master in Norway 2013
Doctorandin at Technische Universität Berlin in Data Scientist ( currently )
-------------------------------------------------------------
Diamond is nothing except the pieces of the coal which have continued their activities finally they have become Diamond.

http://www.repocomp.com/
