MongoDB is good for storage and querying when the data set is not too big (fewer than roughly 1 million rows?). But if your data grows by 10 million rows a day and reaches hundreds of millions of rows, complex calculations such as aggregation take ages.
With 100 million rows (about 200 GB), I tried writing a multithreaded program to perform aggregation using Map/Reduce and the Mongo aggregation framework. It turned out not to be feasible, and while either of them is running it also degrades MongoDB performance (e.g. streaming record inserts get slower and slower).
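For reference, the kind of per-day, per-source count discussed in this article could be expressed for the Mongo aggregation framework roughly as follows. This is only a sketch under assumptions: the field names RecordDate and SourceName are borrowed from the sample scripts in this article, the helper name build_daily_count_pipeline is made up, and the pipeline is built as plain Python data rather than sent to a server.

```python
# Sketch only: builds an aggregation pipeline as plain Python data.
# Field names (RecordDate, SourceName) are assumed from the sample scripts;
# build_daily_count_pipeline is a hypothetical helper name.

def build_daily_count_pipeline(date_field="RecordDate", source_field="SourceName"):
    """Return a pipeline that counts documents per calendar day and source."""
    date_ref = "$" + date_field
    return [
        {
            "$group": {
                # Group key: the date parts of the record date plus the source name
                "_id": {
                    "year": {"$year": date_ref},
                    "month": {"$month": date_ref},
                    "day": {"$dayOfMonth": date_ref},
                    "source": "$" + source_field,
                },
                # Count one per document in each group
                "Total": {"$sum": 1},
            }
        }
    ]

pipeline = build_daily_count_pipeline()
```

With a driver such as pymongo, a list like this would be passed to the collection's aggregate call. The work still runs on the MongoDB server itself, which is why it competes with the streaming inserts.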
Hence MongoDB needs to work with Hadoop for complex calculations. Hadoop basically acts as a processing engine that performs the aggregation work for MongoDB.
I was looking for a decent Hadoop distribution that works well with MongoDB and found Hortonworks Hadoop. The good thing about it is that it provides Windows setup packages for ease of installation.
More good news: the MongoDB Hadoop connector is certified on Hortonworks Hadoop.
So, let's get started running Hadoop with MongoDB.
- Download MongoDB from here
Hortonworks Hadoop Installation
- Download the installation package from here
- For the installation steps, refer here
- Some notes:
- I tried to install Hadoop on the "D" drive, but some of the Hadoop services could not be started, which I suspect is caused by hardcoded paths. In the end I was only able to start all Hadoop services (except Hadoop HWI) by installing the program files on the "C" drive and the data files on the "D" drive.
- Force the name node to leave safe mode by executing the following command in a command prompt: bin/hadoop dfsadmin -safemode leave
MongoDB Hadoop Connector Installation
- Download MongoDB Hadoop connector driver from here
- Extract and copy the following files to "<drive>:\hdp\hive-0.13.0.2.1.1.0-1621\lib" & "<drive>:\hdp\pig-0.12.1.2.1.1.0-1621\lib"
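If you prefer to script the copy step, a minimal sketch in Python could look like the following. The function name copy_jars is made up for illustration, and both the list of connector jars and the lib directories are passed in by the caller, since they depend on the connector version you downloaded and on your install drive.

```python
import shutil
from pathlib import Path

def copy_jars(jar_paths, lib_dirs):
    """Copy each jar into every lib directory and return the files written.

    jar_paths: the connector jars you extracted (supplied by the caller).
    lib_dirs:  the Hive and Pig lib directories of your installation.
    """
    written = []
    for lib_dir in lib_dirs:
        target = Path(lib_dir)
        # Fail early instead of silently creating a wrong directory
        if not target.is_dir():
            raise FileNotFoundError(f"lib directory not found: {target}")
        for jar in jar_paths:
            # shutil.copy2 returns the path of the newly created file
            written.append(Path(shutil.copy2(jar, target)))
    return written
```

It would be called with the extracted jar paths and the two lib directories named above. Restart the Hadoop services afterwards if they were already running, so that the new jars are picked up.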
Using the Code
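-- Assumed: the column mapping and connection URI below are placeholders (the URI reuses the one from the Pig example); adjust them to your collection.
CREATE EXTERNAL TABLE InputCollection(id STRING, RecordDate TIMESTAMP, SourceName STRING)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id"}')
TBLPROPERTIES('mongo.uri'='mongodb://userid:password@server:27017/DatabaseName.InputCollection');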
CREATE TABLE SummaryTable ( RecordDate STRING, SourceName STRING, TOTAL INT );
INSERT INTO TABLE SummaryTable
SELECT to_date(RecordDate), SourceName, COUNT(*) FROM InputCollection
GROUP BY to_date(RecordDate), SourceName;
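What the Hive query above computes can be sketched on a handful of in-memory records in Python. This is only an illustration of the grouping logic, not of how Hive executes it. The sample records and the source names sensorA and sensorB are invented, and summarize is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

def summarize(records):
    """Count records per (calendar date, SourceName), like the GROUP BY query."""
    counts = Counter(
        # Truncating the timestamp to its date mirrors to_date(RecordDate)
        (rec["RecordDate"].date().isoformat(), rec["SourceName"])
        for rec in records
    )
    # Rows shaped like SummaryTable: (RecordDate, SourceName, TOTAL)
    return sorted((day, src, total) for (day, src), total in counts.items())

# Invented sample data for illustration only
sample = [
    {"RecordDate": datetime(2014, 7, 11, 9, 30), "SourceName": "sensorA"},
    {"RecordDate": datetime(2014, 7, 11, 17, 5), "SourceName": "sensorA"},
    {"RecordDate": datetime(2014, 7, 11, 8, 0), "SourceName": "sensorB"},
    {"RecordDate": datetime(2014, 7, 12, 0, 1), "SourceName": "sensorA"},
]
rows = summarize(sample)
```

Each output row corresponds to one row that the INSERT would write into SummaryTable.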
Sample Hadoop Pig script to aggregate data by record day from a datetime. I spent 3 days learning Pig script and came up with the following. The most challenging parts were input/output to Mongo and aggregating records by the day of a date.
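-- Assumed: the loader schema below is a guess at the collection's fields and types; adjust it to your documents.
rawData = LOAD 'mongodb://userid:password@server:27017/DatabaseName.InputCollection'
USING com.mongodb.hadoop.pig.MongoLoader('SourceName:chararray, RecordDate:long');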
RecordDateConversion = FOREACH rawData GENERATE SourceName,
ToDate(UnixToISO(RecordDate),'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ') AS RecordDateDT;
DataGetSrcAndDateOnly = FOREACH RecordDateConversion GENERATE SourceName,
CONCAT(CONCAT(CONCAT((chararray)GetYear(RecordDateDT), '-'), CONCAT((chararray)GetMonth(RecordDateDT), '-')), (chararray)GetDay(RecordDateDT)) AS RecordDayOnly;
DataGetSrcAndDateOnlyGroup = GROUP DataGetSrcAndDateOnly BY (SourceName, RecordDayOnly);
result = FOREACH DataGetSrcAndDateOnlyGroup GENERATE group.SourceName AS SourceName, group.RecordDayOnly AS RecordDate, COUNT(DataGetSrcAndDateOnly) AS Total;
STORE result INTO 'mongodb://userid:password@server:27017/DatabaseName.OutputCollection'
USING com.mongodb.hadoop.pig.MongoInsertStorage('', '' );
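The day key that the Pig script builds with nested CONCAT calls can be sketched in Python as below. The sketch assumes that RecordDate holds epoch milliseconds (which is what UnixToISO expects), that the date is interpreted in UTC, and that the key includes the year so that days from different years stay apart. day_key is a hypothetical helper name.

```python
from datetime import datetime, timezone

def day_key(unix_millis):
    """Build a 'year-month-day' key from epoch milliseconds.

    The parts are not zero-padded, mirroring what casting the integer
    results of GetYear/GetMonth/GetDay to chararray would produce.
    """
    # Assumption: timestamps are interpreted in UTC
    dt = datetime.fromtimestamp(unix_millis / 1000, tz=timezone.utc)
    return f"{dt.year}-{dt.month}-{dt.day}"

# 11 July 2014, 09:30 UTC expressed as epoch milliseconds
example_millis = int(datetime(2014, 7, 11, 9, 30, tzinfo=timezone.utc).timestamp() * 1000)
example_key = day_key(example_millis)
```

Because the parts are not zero-padded, such keys group correctly but do not sort chronologically as strings (for example, "2014-10-1" sorts before "2014-7-11"). If sorted output matters, a padded format may be preferable.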
Points of Interest
Hadoop Server Configuration
- How to Plan and Configure YARN and MapReduce 2 in HDP 2.0. Refer here.
Hadoop Performance Tuning Best Practices
More to Come...
- Query time comparison using Map/Reduce, Mongo aggregation and Hadoop
- Running a Hadoop cluster with multiple nodes
- Controlling the Mongo split size
- Configuring the number of Hadoop mappers and reducers
- Processing BSON documents
- Output from Hadoop Hive to Mongo
- Input BSON from Mongo to Pig, output back to Mongo storage
- Running Hadoop HWI
- 11th July, 2014: Initial version