Don’t Use Elephant for Your Garden Work

20 Jun 2014

While learning the new Tez engine and query vectorization concepts in Hadoop 2.0, I came to know that query vectorization is claimed to be 3x faster and to consume less CPU time on an actual Hadoop cluster. The Hortonworks tutorial imports sample sensor data from a CSV file into Hive, and then uses a sample query to demonstrate the performance.

The intention of this post is to explain neither the Tez engine and query vectorization nor Hive queries. Let me first describe the problem I worked on, before getting to the purpose of this post. :)

One sample CSV file called ‘HVAC.csv’ contains 8,000 records with temperature readings for different buildings on different days. Part of the file content:

Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID
6/1/13,0:00:01,66,58,13,20,4
6/2/13,1:00:01,69,68,3,20,17
6/3/13,2:00:01,70,73,17,20,18
6/4/13,3:00:01,67,63,2,23,15
6/5/13,4:00:01,68,74,16,9,3
… 

In Hive, the following settings switch the execution engine from MapReduce to Tez and confirm that query vectorization is enabled:

hive> set hive.execution.engine=mr;
hive> set hive.execution.engine=tez;
hive> set hive.vectorized.execution.enabled;
      hive.vectorized.execution.enabled=true

I executed the following query in my sandbox; surprisingly, it took 48 seconds for a ‘group by’ and ‘count’ on 8,000 records, as shown below:

select date, count(buildingid) from hvac_orc group by date;

This query groups the sensor data by date and counts the building readings for each date. It produces 30 rows as shown below:

Status: Finished successfully
OK
6/1/13  267
6/10/13 267
6/11/13 267
...
Time taken: 48.261 seconds, Fetched: 30 row(s)

Since it is just 8,000 records, I then planned to write a simple program without the MapReduce machinery. I created an F# script that reads the CSV (note that I did not use any CSV type provider) and uses the Deedle exploratory data library (again, LINQ can also help). It achieved the same result, as shown below.

module ft

#I @"..\packages\Deedle.1.0.0"
#load "Deedle.fsx"
open System
open System.IO
open System.Globalization
open System.Diagnostics
open Deedle

type hvac = { Date : DateTime; BuildingID : int }

let execute =
    let stopwatch = Stopwatch.StartNew()

    let enus = CultureInfo("en-US")
    // Read the file and split into lines; RemoveEmptyEntries drops the
    // empty strings that splitting on '\r' and '\n' separately produces
    let lines = File.ReadAllText(@"..\ml\SensorFiles\HVAC.csv")
                    .Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)

    // Skip the header row, parse each record, and build a Deedle frame
    let ohvac = lines.[1..]
                |> Array.map (fun s -> s.Split(",".ToCharArray()))
                |> Array.map (fun s -> { Date = DateTime.Parse(s.[0], enus); BuildingID = int s.[6] })
                |> Frame.ofRecords

    // Group rows by date and count the BuildingID values per group,
    // mirroring the Hive ‘group by’/‘count’ query
    let result = ohvac.GroupRowsBy("Date")
                 |> Frame.getNumericCols
                 |> Series.mapValues (Stats.levelCount fst)
                 |> Frame.ofColumns

    stopwatch.Stop()
    (stopwatch.ElapsedMilliseconds, result)

In the FSI,

> #load "finalTouch.fsx";;
> open ft;;
> ft.execute;;
val it : int64 * Deedle.Frame =
(83L,
BuildingID
01-06-2013 12:00:00 AM -> 267
02-06-2013 12:00:00 AM -> 267
03-06-2013 12:00:00 AM -> 267
04-06-2013 12:00:00 AM -> 267
...
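As mentioned above, Deedle is not strictly necessary; the LINQ route works too. Here is a minimal sketch using plain F# sequence functions, assuming the same HVAC.csv layout (header row first, date in column 0). Note that `countByDate` is a hypothetical helper name, not part of the original script:

```fsharp
// Group-and-count by date without Deedle, equivalent in spirit to:
//   select date, count(buildingid) from hvac_orc group by date
let countByDate (lines : string[]) =
    lines.[1..]                              // skip the CSV header row
    |> Seq.map (fun line -> line.Split ',')
    |> Seq.countBy (fun cols -> cols.[0])    // key = the Date column
    |> Seq.toList
```

Calling `File.ReadAllLines @"..\ml\SensorFiles\HVAC.csv" |> countByDate` yields pairs such as `("6/1/13", 267)`, one per date.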

The computation completed within 83 milliseconds. You may argue that I am comparing apples with oranges. No! My intention is to understand when MapReduce is the savior. The moral of the above exercise is to be cautious and analyze well before moving your data processing into MapReduce clusters.

Elephants are very effective for labor that requires hard slogging and heavy lifting. Not for your garden!! :)

Note that the sample CSV file from Hortonworks is clearly for training purposes. This blog post just takes it as an example of the maximum data a small or medium-sized application might generate over a period. The above script may not scale and will not perform well with much larger volumes. Hence, this is not an anti-MapReduce proposal.

License

This article, along with any associated source code and files, is licensed under The Common Public License Version 1.0 (CPL)


About the Author

M Sheik Uduman Ali
Working as Architect for Aditi, Chennai, India.

My Website: www.udooz.net
My Blog: www.udooz.net/blog
