Posted 20 Jun 2014

Don’t Use Elephant for Your Garden Work


While learning about the new Tez engine and query vectorization in Hadoop 2.0, I read that query vectorization is claimed to be 3x more powerful and to consume less CPU time on an actual Hadoop cluster. The Hortonworks tutorial imports sample sensor data from a CSV file into Hive and then uses a sample query to demonstrate the performance.

The intention of this post is not to explain the Tez engine, query vectorization, or Hive queries. Let me first describe the problem I worked on; the purpose of the post will become clear after that. :)

A sample CSV file called ‘HVAC.csv’ contains 8000 records holding temperature information for different buildings on different days. Part of the file content:


In Hive, the following settings are specified to enable the Tez engine and query vectorization.

hive> set hive.execution.engine=tez;
hive> set hive.vectorized.execution.enabled=true;

I executed the following query in my sandbox. Surprisingly, it took 48 seconds to run a ‘group by’ and a ‘count’ over 8000 records:

select date, count(buildingid) from hvac_orc group by date;

This query groups the sensor data by date and counts the readings that have a building ID for each date. It produces 30 rows; the output below shows the first three:

Status: Finished successfully
6/1/13  267
6/10/13 267
6/11/13 267
Time taken: 48.261 seconds, Fetched: 30 row(s)

I then decided to write a simple program without the MapReduce machinery, since there are just 8000 records. I created an F# script that reads the CSV (note that I did not use a CSV type provider) and uses the Deedle exploratory data library (LINQ could do the job too). It achieves the same result, as shown below.

module ft

#I @"..\packages\Deedle.1.0.0"
#load "Deedle.fsx"
open System
open System.IO
open System.Globalization
open System.Diagnostics
open Deedle

// One sensor reading; only the two columns the query needs are kept
type hvac = { Date : DateTime; BuildingID : int }

let execute =
    let stopwatch = Stopwatch.StartNew()

    let enus = new CultureInfo("en-US")
    // File.ReadAllLines copes with both \r\n and \n line endings
    let lines = File.ReadAllLines(@"..\ml\SensorFiles\HVAC.csv")

    // Skip the header row and blank lines, then parse each record
    let ohvac = lines.[1..]
                |> Array.filter (fun s -> s.Trim() <> "")
                |> Array.map (fun s -> s.Split(','))
                |> Array.map (fun s -> { Date = DateTime.Parse(s.[0], enus); BuildingID = int s.[6] })
                |> Frame.ofRecords

    // Group the rows by date and count the values in each group
    let result = ohvac.GroupRowsBy<DateTime>("Date")
                |> Frame.getNumericCols
                |> Series.mapValues (Stats.levelCount fst)
                |> Frame.ofColumns

    (stopwatch.ElapsedMilliseconds, result)

Running it in F# Interactive (FSI):

> #load "finalTouch.fsx";;
> open ft;;
> ft.execute;;
val it : int64 * Deedle.Frame =
01-06-2013 12:00:00 AM -> 267
02-06-2013 12:00:00 AM -> 267
03-06-2013 12:00:00 AM -> 267
04-06-2013 12:00:00 AM -> 267

The F# script completed within 83 milliseconds. You may argue that I am comparing apples with oranges. No! My intention is to understand when MapReduce is the savior. The moral of the above exercise is to be cautious and analyze well before moving your data processing into MapReduce clusters.

Elephants are very effective at labor that requires hard slogging and heavy lifting. Not for your gardens! :)

Note that the sample CSV files from Hortonworks are clearly meant for training purposes. This blog post just takes them as an example of the maximum amount of data a small or medium-sized application might generate over a period. The above script may not scale and will not perform well with much more data than this. Hence, this is not an anti-MapReduce proposal.
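
As mentioned earlier, plain collection functions (the F# counterpart of LINQ) could do the same job without Deedle. The following is only a minimal sketch of that idea, not the script that was timed above. The sample rows are made up for illustration: only the positions of the date (column 0) and the building ID (column 6) follow the script above, and the other columns are placeholders. The name countByDate is hypothetical.

```fsharp
open System
open System.Globalization

// Made-up rows for illustration; NOT the real HVAC.csv content.
// Column 0 is the date and column 6 is the building ID, as in the script above.
let sampleLines =
    [| "Date,c1,c2,c3,c4,c5,BuildingID"
       "6/1/13,x,x,x,x,x,4"
       "6/1/13,x,x,x,x,x,17"
       "6/2/13,x,x,x,x,x,4"
       "" |]

let enus = CultureInfo("en-US")

// Skip the header, drop blank lines and rows without a building ID,
// then count the remaining rows per date (like count(buildingid) ... group by date).
let countByDate (lines : string[]) =
    lines.[1..]
    |> Array.filter (fun line -> not (String.IsNullOrWhiteSpace line))
    |> Array.map (fun line -> line.Split(','))
    |> Array.filter (fun cols -> cols.Length > 6 && cols.[6].Trim() <> "")
    |> Seq.countBy (fun cols -> DateTime.Parse(cols.[0], enus))
    |> Seq.sortBy fst
    |> Seq.toArray

let counts = countByDate sampleLines

// Print one "date count" line per group
counts |> Array.iter (fun (d, n) -> printfn "%s %d" (d.ToString("M/d/yy", enus)) n)
```

With the real file, File.ReadAllLines could replace the hard-coded sampleLines array. Everything stays in memory, so this fits a few thousand records and is not a substitute for a cluster on large data.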


This article, along with any associated source code and files, is licensed under The Common Public License Version 1.0 (CPL)


About the Author

M Sheik Uduman Ali
Working as Architect for Aditi, Chennai, India.

Article Copyright 2014 by M Sheik Uduman Ali