HDInsight – Big Data for All

28 Aug 2014, CPOL
Editorial Note

This article is in the Product Showcase section for our sponsors at CodeProject. These reviews are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

HDInsight is a Microsoft-provided distribution of Apache Hadoop. It is available as a cloud-hosted service on Windows Azure.

Hadoop has gained broad acceptance as an effective storage and processing mechanism for big data. At its core, Hadoop has two pieces: one for storing large amounts of unstructured data, and another for processing that data in a cost-effective manner.

  • The data storage solution is called the Hadoop Distributed File System (HDFS).
  • The processing solution is an implementation of the MapReduce programming model documented by Google.

HDFS

HDFS is a file system designed to store large amounts of data economically. It does this by storing data on multiple commodity machines—scaling out instead of scaling up.

  • Each file that is stored by HDFS is split into large blocks (typically 64 MB each, but this setting is configurable).
  • Each block is then stored on multiple machines that are part of the HDFS cluster. A centralized metadata store has information on where individual parts of a file are stored.
  • Because HDFS runs on commodity hardware, machines and disks are expected to fail. When a node fails, HDFS ensures that the data blocks it held are re-replicated to other systems.

This scheme allows for the storage of large files in a fault-tolerant manner across multiple machines.

[Figure: HDFS visually — a name node holding metadata and three data nodes holding replicated blocks]

In HDFS, the metadata store is typically on a machine referred to as the name node. The nodes where data is stored are referred to as data nodes. In the previous diagram, there are three data nodes. Each of these nodes contains a copy of each block of data that is stored on the HDFS cluster. A production implementation of HDFS will have many more nodes, but the essential structure still applies.
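
MapReduce jobs normally read and write HDFS for you, but it can help to see the file system API directly. The following Java sketch is our own illustration (the class name and file paths are made up), using the standard org.apache.hadoop.fs.FileSystem API to copy a local file into HDFS and report the block size and replication factor the cluster will apply:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml/hdfs-site.xml from the classpath to locate the cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The file is split into blocks and replicated across data nodes for us.
            Path local = new Path("input.txt");             // hypothetical local file
            Path remote = new Path("/user/demo/input.txt"); // hypothetical HDFS path
            fs.copyFromLocalFile(local, remote);

            System.out.println("Block size:  " + fs.getDefaultBlockSize(remote));
            System.out.println("Replication: " + fs.getDefaultReplication(remote));
            fs.close();
        }
    }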

MapReduce

MapReduce is a functional programming model that moves away from shared resources and related synchronization or contention issues. It mandates the use of simple building block operations that are trivially scalable. It also turns out that most common data processing tasks can be modelled using MapReduce.

The MapReduce programming model is not hard to understand, especially if we study it using a simple example. Assume we have a text file, and we would like a count of how many times each word appears in it.

MapReduce as implemented in Hadoop is composed of three stages: Map, Shuffle, and Reduce. We will look at each of these stages in detail.

Map

The Map stage takes input in the form of a key and a value, processes the input, and then outputs another key and value. In this sense, it is no different from the map operation found in many programming environments.

Considering the word count example, a Map task is likely to follow these steps:

Input

  • Key — Identifies the value being provided to the Mapper.
    • In the context of Hadoop and the word-counting problem, this key is simply the starting index of the text within the data block being processed. We can consider it opaque and ignore it.
  • Value — A single line of text. Consider it the unit to be processed.

Processing (implemented by user code)

  • Splits the provided line of text into individual words.

Output

For each word, the output is a key/value pair. The mechanism to output these values is provided by Hadoop.

  • Key — A suitable key is the actual word detected.
  • Value — A suitable value is "1". Think of this as a distributed marker that simply denotes that we saw a particular word once. It is important to distinguish this from a dictionary approach. With a dictionary, we would look up the current count and increment it by one. In our case, we do not do this. Every time we see a word, we simply mark that we have seen it again by outputting a "1". Aggregation will happen later.

Example walkthrough

Input to Mapper

Key:   {any number indicating the index within the block being processed}
Value: "Twinkle, Twinkle Little Star"
Output by Mapper

We assume that punctuation does not count in our context. Note that the word "Twinkle" was seen twice during processing, and therefore appears twice with "1" as the value and "Twinkle" as the key.

Key       Value
Twinkle   1
Twinkle   1
Little    1
Star      1
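
A minimal Mapper with exactly this behavior, sketched in Java against the standard org.apache.hadoop.mapreduce API (this is our own illustration rather than the exact code from the downloadable sample; the class name and the rule of splitting on non-word characters to drop punctuation are assumptions):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key: starting index within the block being processed -- we ignore it.
            // value: a single line of text.
            for (String token : value.toString().split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, "1"); aggregation happens later
                }
            }
        }
    }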

Shuffle

Once the Map stage is over, data collected from the Mappers (remember, there could be several Mappers operating in parallel) is sent to the Shuffle stage.

During the Shuffle stage, all values that have the same key are collected and stored as a conceptual list tied to the key under which they were registered.

In the word count example, assuming the single line of text we observed earlier was the only input, the Shuffle stage output should be:

Key       List of values
Twinkle   1, 1
Little    1
Star      1

The Shuffle stage guarantees that data under a specific key will be sent to exactly one reducer (the next stage).

Shuffle is not typically implemented by the application. Hadoop implements Shuffle and guarantees that all data values that belong to a single key will be gathered together and passed to a single reducer. In the instance mentioned above, the key "Twinkle" will be processed by a single reducer. It will never be processed by more than one reducer. Data under different keys can of course be routed to different reducers.
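
Although Hadoop performs the Shuffle for us, its guarantee is easy to model in plain Java. This sketch is purely illustrative (it is not Hadoop code): it groups the Mapper output from our walkthrough by key, producing exactly the lists shown above:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class ShuffleSketch {
        public static void main(String[] args) {
            // Simulated Mapper output for "Twinkle, Twinkle Little Star".
            List<Map.Entry<String, Integer>> mapperOutput = List.of(
                    Map.entry("Twinkle", 1), Map.entry("Twinkle", 1),
                    Map.entry("Little", 1), Map.entry("Star", 1));

            // Gather every value emitted under the same key into one list.
            Map<String, List<Integer>> shuffled = new TreeMap<>();
            for (Map.Entry<String, Integer> pair : mapperOutput) {
                shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
            // Prints {Little=[1], Star=[1], Twinkle=[1, 1]}.
            System.out.println(shuffled);
        }
    }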

Reduce

The reducer’s role is to process the transformed data and output yet another key-value pair. This is the key-value pair that is actually written to the output. In the word count sample, the reducer can simply return the word as a key again, and the value as a summation of all the ones that appear in the provided list of values. This will, of course, be the number of times the word has appeared in the text—the desired output.

Key       Value
Twinkle   2
Little    1
Star      1
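
The matching Reducer receives each key together with the list of values the Shuffle stage gathered for it, and simply sums the ones. As with the Mapper above, this is a minimal sketch against the standard org.apache.hadoop.mapreduce API rather than the exact code from the downloadable sample:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // add up the "1" markers emitted by the Mappers
            }
            context.write(key, new IntWritable(sum)); // e.g. ("Twinkle", 2)
        }
    }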

The beauty of MapReduce is that once a problem is broken into MapReduce terms and tested on a small amount of data, you can be confident you have a scalable solution that can handle large volumes of similar data.
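
Wiring the two classes into a runnable job takes only a small driver. The sketch below uses the standard Hadoop 2.x Job API; the input and output paths come from the command line, and the class names refer to the hypothetical Mapper and Reducer sketched earlier:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);          // output key: the word
            job.setOutputValueClass(IntWritable.class); // output value: the count
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }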

A complete working Word Count sample in C# and Java, ready for HDInsight, is available here: https://bitbucket.org/syncfusion/hdinsightwp/src.

HDInsight combines the power of Hadoop with the ease of use of Windows. It truly aims to make big data available for all.

Interested in learning more? Download our free book on HDInsight

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Syncfusion
United States
Syncfusion, Inc. provides the broadest range of enterprise-class .NET components for desktop, web, and mobile applications, whether you’re working in Windows Forms, WPF, ASP.NET, ASP.NET MVC, Silverlight, WinRT, Windows Phone, or JavaScript. In addition to our award-winning products, Syncfusion can provide solutions for big data and predictive analytics, custom-built applications, and consulting services. We also offer a variety of free products, like Metro Studio, a suite of over 2500 customizable icons, and the Succinctly Series, an educational e-book library for developers. To learn more, please contact us at sales@syncfusion.com or +1 919.481.1974, or visit syncfusion.com.