Big Data MapReduce Hadoop Scala on Ubuntu Linux with Maven and IntelliJ IDEA

12 Sep 2017
This article is a complete walkthrough of big data, from the basic concepts to a practical, working example.

When I started searching the net for material about big data, I could not find a single article that explained the process from start to finish. Each article described only part of this huge concept. I felt the lack of a comprehensive article, so I decided to write this essay.

Nowadays we face a constantly growing volume of data. Our desire to keep all of this data for the long term, and to process and analyze it with high velocity, calls for a dedicated solution. Finding a suitable substitute for traditional databases, one that can store any type of structured and unstructured data, is the biggest challenge for data scientists.
 

I have selected a complete scenario, from the first step to the final result, of the kind that is hard to find anywhere on the internet. I chose MapReduce, processing data with Hadoop and Scala inside IntelliJ IDEA, and all of this happens on Ubuntu Linux as the operating system.

What is big data?

Big data is a huge volume of data, structured, unstructured, or semi-structured, which is difficult to store and manage with traditional databases. Its volume ranges from terabytes to petabytes. Relational databases such as SQL Server are not suitable for storing unstructured data. As you see in the picture below, big data covers all kinds of structured and unstructured data with fundamental requirements such as storing, management, sharing, and analysis.

What is Hadoop?

Indeed, we expect two things from any database. First, we need to store data; second, we want to process the stored data properly, in a fast and accurate way. Because of the arbitrary shape and large volume of big data, it is impossible to store it in traditional databases. We need to think about a new solution which can handle both the storing and the processing of big data.

Hadoop is a revolutionary platform for big data, which has the capacity to store data of any shape and process it across a cluster of nodes. It also dramatically reduces the cost of data maintenance. With the aid of Hadoop, we can store any sort of data, for example all user clicks over a long period, which makes historical analysis easy. Hadoop offers distributed storage and also a distributed processing system, such as MapReduce.

 

The two duties expected from a database are obvious in the picture below: first storing, on the left as the beginning step, then processing, on the right side. Hadoop is an innovative system which differs from traditional and relational databases. Hadoop runs as a cluster across multiple machines, and each cluster includes numerous nodes.

 

What is the Hadoop ecosystem?

As I mentioned above, Hadoop is suitable for both storing unstructured data and processing it. There is an abstract definition of the Hadoop ecosystem: saving data is on the left side, with two different storage possibilities, HDFS and HBase. HBase sits on top of HDFS, and both have been written in Java.

HDFS:

1. The Hadoop Distributed File System allows storing large data in distributed flat files.

2. HDFS is good for sequential access to data.

3. There is no random real-time read/write access to data; it is more suitable for offline batch processing (basic file operations are shown in the sketch below).
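As a quick illustration of this file-oriented, batch-style access, here is a minimal sketch of the basic HDFS shell commands (this assumes Hadoop is already installed and running, as described later in this article; the file and directory names are only examples):

# create a home directory for the hduser user in HDFS
hdfs dfs -mkdir -p /user/hduser

# copy a local file into HDFS, then list and read it back sequentially
hdfs dfs -put localdata.txt /user/hduser/
hdfs dfs -ls /user/hduser
hdfs dfs -cat /user/hduser/localdata.txt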

HBase:

1. Stores data as key/value pairs in a columnar fashion.

2. HBase supports real-time read/write access.

Hadoop was written in Java, but you can also work with it using R, Python, or Ruby.

Spark is an open-source cluster computing framework which is implemented in Scala and is well suited to iterative jobs over distributed computations. Spark offers high performance.

Hive, Pig, Sqoop, and Mahout are data-access tools which make querying the data possible. Hive and Pig are SQL-like; Mahout is for machine learning.

MapReduce and YARN both work to process data. MapReduce is built to process data in distributed systems; it handles all the tasks such as job and task tracking, monitoring, and parallel execution. YARN (Yet Another Resource Negotiator) is MapReduce 2.0, which offers better resource management and scheduling.

Sqoop is a connector for importing and exporting data from/to external relational databases. It makes transferring data easy and fast, in a parallel way.
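For example, a typical Sqoop import of one relational table into HDFS looks roughly like the sketch below; the JDBC connection string, credentials, table, and target directory are hypothetical and only illustrate the command shape:

# import the "customers" table from a local MySQL database into HDFS with a single parallel task
sqoop import --connect jdbc:mysql://localhost/shopdb --username dbuser --password dbpass --table customers --target-dir /user/hduser/customers -m 1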

How does Hadoop work?

Hadoop follows a master/slave architecture model. In HDFS, the name node on the master monitors and keeps track of all the slave nodes, which form the storage cluster.

There are two different types of jobs which do all the magic of MapReduce processing. The Map job sends a query for processing to the various nodes in the cluster, and this job is broken down into smaller tasks. Then the Reduce job collects all of the output which each node has produced and combines it into a single value as the final result.

This architecture makes Hadoop an inexpensive solution which is very quick and reliable: a big job is divided into smaller tasks, and each task is placed on a different node. This story might remind you of multithreading. In multithreading, all concurrent processes share data with the aid of locks and semaphores, but data access in MapReduce is under the control of HDFS.

Get Ready for Hands-On Experience

Until now, we have covered the abstract meaning of Hadoop and big data. Now it is time to dig in and build a practical project. We need to prepare hardware and software which meet our requirements.

Hardware:

1. RAM: 8 GB

2. Hard Free Space: 50 GB

Software:

1. Operating System: Windows 10 - 64-bit

2. Oracle VM VirtualBox

3. Linux Ubuntu 17.04

4. JDK 8

5. SSH

6. Hadoop

7. Hadoop jar

8. IntelliJ IDEA

9. Maven

10. Writing Scala code for the wordcount example

 

1. Operating System: Windows 10 - 64-bit

If you do not have Windows 10 yet and want to upgrade your OS, follow the link below to get it:

https://www.microsoft.com/en-us/software-download/windows10

2. Oracle Virtual Machine VirtualBox

Please go to this link, click on "Windows hosts", and get the latest VirtualBox. Then follow the instructions below:


3.  Linux-ubuntu 17.04

Please go to this link, click on 64-bit, and get the latest Ubuntu. Then follow the instructions below:

Click on "New" to create new vm for new OS which is Linux/Ubuntu 64-bit. 

In the picture below, you should assign memory to the new OS.


In the picture below, you should assign a hard disk to the new OS.

Now it is time to select your Ubuntu 17 image, which is an .iso file. Please select "ubuntu-17.04-desktop-amd64".


Now, double-click on the new VM and start the installation.

Select your language and click on "Install Ubuntu".


In the picture below, you specify who you are; this is the first user who can use Ubuntu.


sudo apt-get update

If apt fails with an error about a lock file, try one of these two solutions:

First Solution:

sudo rm /var/lib/apt/lists/lock

sudo rm /var/cache/apt/archives/lock

sudo rm /var/lib/dpkg/lock

Second Solution:

sudo systemctl stop apt-daily.timer

sudo systemctl start apt-daily.timer

Then again:

sudo apt-get update

 

 

 

Let's create a new user and group for Hadoop.

sudo addgroup hadoop 

Add user to group:

sudo adduser --ingroup hadoop hduser

Enter your details, then press "y" to save them.
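If you plan to run the remaining commands in this article as hduser, it also helps to give that user sudo rights. This step is my addition and is not strictly required if you keep working with your original user:

# optionally allow hduser to use sudo
sudo adduser hduser sudo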

 

4. Installing JDK

sudo apt-get update
java -version
sudo apt-get install default-jre
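Because the exports later in this article point JAVA_HOME at a JDK directory (java-8-openjdk-amd64), you may prefer to install the full JDK rather than only the JRE; a minimal sketch:

# install the default JDK (compiler plus runtime)
sudo apt-get install default-jdk

# confirm the installation
javac -version
java -version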

 

 

 

5. Installing OpenSSH

OpenSSH provides secure and encrypted communication over the network using the SSH protocol. Why do we use OpenSSH? Because Hadoop uses SSH to reach its nodes, and key-based (passwordless) authentication lets the Hadoop scripts log in to them without prompting for a password.

sudo apt-get install openssh-server

 

ssh-keygen -t rsa -P ''

After running the command above, if you are asked for a file name, leave it blank and press Enter.
 

Then try to run:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If you have "No File Directory" error, lets follow below instructions:

After running  "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys" -> error: "No such file or dierctory"

  i.      mkdir ~/.ssh

 ii.      chmod 700 ~/.ssh

iii.      touch ~/.ssh/authorized_keys

iv.      chmod 600 ~/.ssh/authorized_keys

v.      touch ~/.ssh/id_rsa.pub

vi.      chmod 600 ~/.ssh/id_rsa.pub
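To confirm that the key-based setup works, SSH into the local machine; it should log you in without asking for a password (the first connection only asks you to accept the host key):

ssh localhost
# type "exit" to leave the SSH session
exit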

 

 

6. Installing Hadoop

First, download Hadoop with the command below:

wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

 

To extract the Hadoop archive:

tar xzf hadoop-2.7.1.tar.gz 

To find java path:

update-alternatives --config java

The part of the path before /jre is your Java home.
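If the output of update-alternatives is hard to read, another way to locate the Java home, assuming a typical OpenJDK 8 layout, is to resolve the real path of the java binary:

readlink -f /usr/bin/java
# a result such as /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
# means the Java home is /usr/lib/jvm/java-8-openjdk-amd64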

1. Edit ~/.bashrc

nano ~/.bashrc

The line above opens ~/.bashrc. Scroll to the end of the file and add the lines below, using the hard-coded Java path you found with "update-alternatives --config java".

 

#HADOOP VARIABLES START

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export HADOOP_INSTALL=/home/mahsa/hadoop-2.7.1

export PATH=$PATH:$HADOOP_INSTALL/bin

export PATH=$PATH:$HADOOP_INSTALL/sbin

export HADOOP_MAPRED_HOME=$HADOOP_INSTALL

export HADOOP_COMMON_HOME=$HADOOP_INSTALL

export HADOOP_HDFS_HOME=$HADOOP_INSTALL

export YARN_HOME=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"

#HADOOP VARIABLES END

 

Also hard-code the Java home in hadoop-env.sh in the same way, as shown in step 2 below.

 

To close and save the file, press "Ctrl+X"; when it asks whether to save, press "y", and when it asks about the file name, just press "Enter" and do not change it.

To apply the changes, execute:

source ~/.bashrc
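To verify that the new variables are in effect, you can check them in the same terminal; the path shown is the one exported above:

# should print /home/mahsa/hadoop-2.7.1
echo $HADOOP_INSTALL

# the hadoop binary should now be on the PATH and report version 2.7.1
hadoop version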

2. Edit hadoop-env.sh and core-site.xml

sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/hadoop-env.sh

In hadoop-env.sh, set JAVA_HOME to the same hard-coded path as before:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The fs.default.name property goes into core-site.xml (hadoop-env.sh is a shell script and cannot hold XML):

sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/core-site.xml

Write inside the <configuration> tags:

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>

3. Edit yarn-site.xml

sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/yarn-site.xml

Write inside the <configuration> tags:

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

 

4. Edit mapred-site.xml

cp /home/mahsa/hadoop-2.7.1/etc/hadoop/mapred-site.xml.template /home/mahsa/hadoop-2.7.1/etc/hadoop/mapred-site.xml


sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/mapred-site.xml

Write inside the <configuration> tags:

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>

 

5. Edit hdfs-site.xml

mkdir -p /home/mahsa/hadoop_store/hdfs/namenode

 

mkdir -p /home/mahsa/hadoop_store/hdfs/datanode

sudo nano /home/mahsa/hadoop-2.7.1/etc/hadoop/hdfs-site.xml

Write inside the <configuration> tags:

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/home/mahsa/hadoop_store/hdfs/namenode</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/home/mahsa/hadoop_store/hdfs/datanode</value>
</property>

6. Format the NameNode and start Hadoop

hdfs namenode -format

 

start-dfs.sh

start-yarn.sh

By running "jps", you can be sure that hadoop has been installed properly. 

jps
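If everything started correctly, jps lists the Hadoop daemons. The process IDs will differ on your machine, but the output should look roughly like the following; if a daemon is missing, check the log files under the Hadoop installation:

# example output (IDs are illustrative)
# 4433 NameNode
# 4629 DataNode
# 4854 SecondaryNameNode
# 5012 ResourceManager
# 5316 NodeManager
# 5643 Jps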

 

7. Installing IntelliJ IDEA


There are two possibilities for implementing MapReduce: one is scala-sbt and the other is Maven. I want to describe both of them, starting with scala-sbt.



8. Installing Maven


There are three text files in the picture below for practicing word count. MapReduce starts by splitting each file across the cluster of nodes, as I explained at the top of this article. In the mapping phase, each node is responsible for counting words. In the intermediate (shuffle) phase, each node groups identical words together with their counts from the previous step. Then, in the reducing phase, each node sums up and collects its own results to produce a single value per word (a rough shell analogy is sketched below).
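The same map / shuffle / reduce idea can be imitated with plain shell tools on a single machine. The sketch below is only an analogy for intuition, not how Hadoop actually executes the job, and file1.txt is a hypothetical input file:

# "map": split the file into one word per line
# "shuffle": sort so that identical words end up next to each other
# "reduce": count how many times each word occurs
cat file1.txt | tr -s ' ' '\n' | sort | uniq -c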


In the pop-up shown in the picture below, you must click on "Enable Auto Import", which will handle all of the libraries.


Go to Menu "Run" -> Select "Edit Configurations" 

1. Select "Application"

2. Select "Main Class = "App"

3. Select arguments as: "input/ output/"

You just need to create the "input" directory; the output directory will be created automatically when the job runs (a sketch for creating sample input files follows below).
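For example, the input directory can be created next to the project with a few small text files; the file names and contents below are only an illustration, chosen so that they produce the counts shown underneath:

mkdir input
echo "sun moon day night" > input/file1.txt
echo "sun day night" > input/file2.txt
echo "sun moon day night" > input/file3.txt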

After running the application, you will see two files:

1. "_SUCCESS"

2. "part-r-00000"

If you open "part-r-00000" you will see the result as:

moon 2

sun 3

day 3

night 3

 

Using the code

package bigdata.mahsa;

/**
 * wordcount!
 */

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.StringTokenizer;

public class App extends Configured implements Tool {

    // Mapper: emits (word, 1) for every token in the input line
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final transient Text word = new Text();

        @Override
        public void map(final LongWritable key, final Text value, final Context context)
                throws IOException, InterruptedException {
            final String line = value.toString();
            final StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {   // splitting
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts collected for each word
    public static class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(final Text key, final Iterable<IntWritable> values, final Context context)
                throws IOException, InterruptedException {
            int sumofword = 0;
            for (final IntWritable val : values) {
                sumofword += val.get();
            }
            context.write(key, new IntWritable(sumofword));
        }
    }

    @Override
    public int run(final String[] args) throws Exception {
        final Configuration conf = this.getConf();
        final Job job = Job.getInstance(conf, "Word Count");
        job.setJarByClass(App.class);

        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // ** get text from input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // ** write result to output

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(final String[] args) throws Exception {
        final int result = ToolRunner.run(new Configuration(), new App(), args);
        System.exit(result);
    }
}

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>bigdata.mahsa</groupId>
  <artifactId>App</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>wordcount</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.7.1</hadoop.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>bigdata.mahsa.App</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
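Besides running the job inside IntelliJ, you can build the jar with Maven and submit it to the Hadoop installation configured earlier. This is a minimal sketch, assuming the pom.xml above (so the jar lands in target/App-1.0-SNAPSHOT.jar) and that HDFS and YARN are running:

# build the jar
mvn clean package

# copy the local input files into HDFS
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put input/*.txt /user/hduser/input

# run the word count and print the result
hadoop jar target/App-1.0-SNAPSHOT.jar /user/hduser/input /user/hduser/output
hdfs dfs -cat /user/hduser/output/part-r-00000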

 

Feedback

Feel free to leave any feedback on this article; it is a pleasure to see your opinions and votes on this code. If you have any questions, please do not hesitate to ask me here.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
