Posted 11 Jul 2014

MongoDB Hadoop Hortonworks in Windows Server 2008

LM Heah, 11 Jul 2014, CPOL


MongoDB works well for storage and queries when the data set is not too big (fewer than about 1 million rows?). But if your data grows by 10 million rows daily and reaches more than a hundred million rows, complex calculations such as aggregation take ages.

With 100 million rows totaling about 200 GB, I tried writing a multithreaded program to perform the aggregation using Map/Reduce and the Mongo aggregation framework. Neither approach turned out to be workable, and while either one is running it also degrades MongoDB performance (e.g., streaming record inserts become slower and slower).

Hence, MongoDB needs to work with Hadoop for complex calculations. Hadoop basically acts as a processing engine that performs the aggregation work for MongoDB.

I was looking for a decent Hadoop distribution that works well with MongoDB and found Hortonworks Hadoop. The good thing about it is that it ships with Windows setup packages for ease of installation.

More good news: the MongoDB Hadoop connector is certified on Hortonworks Hadoop.

So, let's get started running Hadoop with MongoDB.


MongoDB Installation

  • Download MongoDB from here

Hortonworks Hadoop Installation

  • Download the installation package from here
  • For the installation steps, refer to here
  • Some notes:
    • I tried to install Hadoop on the "D" drive, but some of the Hadoop services could not be started; I suspect some paths are hardcoded. I was only able to start all Hadoop services (except Hadoop HWI) by installing the program files on the "C" drive and the data files on the "D" drive.
    • Force the name node to leave safe mode by running the following command in a command prompt: bin/hadoop dfsadmin -safemode leave

MongoDB Hadoop Connector Installation

  • Download the MongoDB Hadoop connector from here
  • Extract it and copy the following files to "<drive>:\hdp\hive-\lib" and "<drive>:\hdp\pig-\lib"
    • mongo-hadoop-core-1.3.0
    • mongo-hadoop-hive-1.3.0
    • mongo-java-driver-2.12.2
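
The copy step above can also be scripted. The following is a minimal Python sketch and not part of the original setup: the function name copy_connector_jars and its arguments are hypothetical, and it assumes the extracted connector files are .jar files sitting in a single folder. Pass in the actual Hive and Pig lib folders of your own installation.

```python
import shutil
from pathlib import Path


def copy_connector_jars(source_dir, lib_dirs,
                        patterns=("mongo-hadoop-*.jar", "mongo-java-driver-*.jar")):
    """Copy connector jars from source_dir into every directory in lib_dirs.

    Returns the list of destination paths that were written.
    The glob patterns are an assumption about how the jar files are named.
    """
    source = Path(source_dir)
    jars = sorted(p for pattern in patterns for p in source.glob(pattern))
    copied = []
    for lib_dir in lib_dirs:
        target = Path(lib_dir)
        if not target.is_dir():
            # Fail early instead of silently creating a wrong folder.
            raise FileNotFoundError(f"lib directory not found: {target}")
        for jar in jars:
            destination = target / jar.name
            shutil.copy2(jar, destination)
            copied.append(destination)
    return copied
```

A call would look like copy_connector_jars(extracted_dir, [hive_lib_dir, pig_lib_dir]), where the three names stand for paths on your machine. Raising an error for a missing lib folder is a deliberate choice here, since a mistyped path would otherwise go unnoticed until Hive or Pig fails to find the classes.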

Using the Code

Hadoop Hive

-- Connect to input source

STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES('mongo.columns.mapping'='{

-- Create summary table

CREATE TABLE SummaryTable ( RecordDate STRING, SourceName STRING, TOTAL INT );

-- Aggregate and insert into summary table

INSERT INTO TABLE SummaryTable
SELECT to_date(RecordDate), SourceName, COUNT(*) FROM InputCollection 
GROUP BY to_date(RecordDate), SourceName;
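
To make clear what the aggregation above computes, here is a small in-memory Python sketch of the same grouping. It is only an illustration, not how Hive executes the query. It assumes RecordDate is stored as a 'yyyy-MM-dd HH:mm:ss' string, which the original does not state, and the function name summarize and the sample records are made up.

```python
from collections import Counter


def summarize(records):
    """Count records per (day, SourceName).

    Mirrors: GROUP BY to_date(RecordDate), SourceName with COUNT(*).
    """
    counts = Counter()
    for record in records:
        # to_date() keeps the date part of a timestamp string; taking the
        # first 10 characters mimics that for well-formed input (assumption).
        day = record["RecordDate"][:10]
        counts[(day, record["SourceName"])] += 1
    # One (RecordDate, SourceName, TOTAL) row per group, like SummaryTable.
    return sorted((day, source, total) for (day, source), total in counts.items())


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        {"RecordDate": "2014-07-10 08:15:00", "SourceName": "A"},
        {"RecordDate": "2014-07-10 23:59:01", "SourceName": "A"},
        {"RecordDate": "2014-07-11 00:00:05", "SourceName": "A"},
    ]
    for row in summarize(sample):
        print(row)
```

Each returned tuple corresponds to one row of SummaryTable: the day, the source name, and the number of records in that group.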

Hadoop PIG

The following sample Pig script aggregates records by SourceName and by day, where the day is derived from a datetime field. I spent 3 days learning Pig scripting and came up with the parts below. The most challenging parts were reading from and writing to MongoDB, and aggregating records by the day of a date.

rawData = LOAD 'mongodb://userid:password@server:27017/DatabaseName.InputCollection' 
USING com.mongodb.hadoop.pig.MongoLoader('id,RecordDate,SourceName','id'); 

RecordDateConversion = FOREACH rawData GENERATE SourceName, 
ToDate(UnixToISO(RecordDate),'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ') AS RecordDateDT;

DataGetSrcAndDateOnly = FOREACH RecordDateConversion GENERATE SourceName, 
CONCAT(CONCAT(CONCAT((chararray)GetYear(RecordDateDT), '-'), 
CONCAT((chararray)GetMonth(RecordDateDT), '-')),(chararray)GetDay(RecordDateDT)) AS RecordDayOnly;

DataGetSrcAndDateOnlyGroup = GROUP DataGetSrcAndDateOnly BY (SourceName, RecordDayOnly);

-- Keep both group keys so that each count can be tied to its day
result = FOREACH DataGetSrcAndDateOnlyGroup GENERATE group.SourceName, group.RecordDayOnly, COUNT(DataGetSrcAndDateOnly) AS Total;

STORE result INTO 'mongodb://userid:password@server:27017/DatabaseName.OutputCollection' 
USING com.mongodb.hadoop.pig.MongoInsertStorage('', '' );
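
The day-key logic of the script above can be checked on a small sample outside Hadoop. The following Python sketch mirrors it under two assumptions that the original does not state: RecordDate holds a Unix timestamp in milliseconds, and days are computed in UTC. The function names day_key and count_by_source_and_day are made up for this illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def day_key(unix_millis):
    """Build a 'year-month-day' key from a millisecond Unix timestamp.

    Like the CONCAT of GetYear/GetMonth/GetDay in the script, the parts are
    plain integers converted to text, so there is no zero padding
    (for example 2014-7-11 rather than 2014-07-11).
    """
    moment = datetime.fromtimestamp(unix_millis / 1000, tz=timezone.utc)
    return f"{moment.year}-{moment.month}-{moment.day}"


def count_by_source_and_day(rows):
    """Count rows per (SourceName, day key).

    rows: iterable of (SourceName, RecordDate in Unix milliseconds).
    """
    return dict(Counter((source, day_key(ts)) for source, ts in rows))
```

Because the keys are not zero padded, they do not sort chronologically as text. If the output collection will be sorted or compared by day, a padded format may be the better choice.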

Points of Interest

Hadoop Server Configuration

  • How to Plan and Configure YARN and MapReduce 2 in HDP 2.0. Refer here.

Hadoop Performance Tuning Best Practices


More to Come...

  • Query time comparison of Map/Reduce, Mongo aggregation and Hadoop
  • Running a Hadoop cluster with multiple nodes
  • Controlling the Mongo split size
  • Configuring the number of Hadoop mappers and reducers
  • Processing BSON documents
  • Output from Hadoop Hive to Mongo
  • Input BSON from Mongo in Pig, output back to Mongo storage
  • Running Hadoop HWI


History

  • 11th July, 2014: Initial version


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

LM Heah
Malaysia

Article Copyright 2014 by LM Heah