How to Communicate to Hadoop via Hive using .NET/C#

Rajibdotnet05, 4 Mar 2014, CPOL
Connect to a database in Hive

Introduction

Before describing my problem, I define a few terms that are relevant to it. The definitions are essentially excerpts from Wikipedia.

What is BigData?

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers.

What is Hadoop?

Hadoop is an open-source framework from the Apache Software Foundation. It emerged as a solution for both storing and processing big data. Hadoop consists of the Hadoop Common package, which provides filesystem and OS-level abstractions, a MapReduce engine, and the Hadoop Distributed File System (HDFS).

What is MapReduce?

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of:

  1. A Map() procedure that performs filtering and sorting.
  2. A Reduce() procedure that performs a summary operation.
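
As a rough illustration of the two phases, the sketch below counts words in memory with LINQ. It is not Hadoop code: the class name WordCountSketch and its method names are invented for this example, and a real job would distribute both phases across the nodes of a cluster.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountSketch
{
    // Map phase: emit a (word, 1) pair for every word in every input line.
    public static IEnumerable<KeyValuePair<string, int>> Map(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .Select(word => new KeyValuePair<string, int>(word.ToLowerInvariant(), 1));
    }

    // Reduce phase: group the pairs by key and sum the values of each group.
    public static Dictionary<string, int> Reduce(IEnumerable<KeyValuePair<string, int>> pairs)
    {
        return pairs
            .GroupBy(pair => pair.Key)
            .ToDictionary(group => group.Key, group => group.Sum(pair => pair.Value));
    }
}
```

Calling WordCountSketch.Reduce(WordCountSketch.Map(lines)) yields one entry per distinct word. Hive generates jobs of this general shape from a query, so you do not have to write them by hand.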

What is Hive?

Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

What is HiveQL?

HiveQL is based on SQL, but does not strictly follow the full SQL-92 standard. Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.

What is my problem?

I was looking for a code snippet that connects to Hadoop via Hive using C#. The following discussion will help you connect to Hive and work with the tables and data underneath. It also gives you a starting point for exploring Hadoop/Hive from C#/.NET.

Background

I searched widely on this topic but found only a few vague references on Stack Overflow and some other sites. I also had the added constraint that I could not use Azure HDInsight.

Using the Code

To begin, you need to download the Microsoft® Hive ODBC Driver. The available parameters and the values that can be assigned to them are explained in detail in Appendix C: Driver Configuration Options of the driver's documentation.

The following are the important parameters for building the ConnectionString. The remaining parameters can be set as your application requires.

  • DRIVER={Microsoft Hive ODBC Driver}
  • Host=server_name
  • Port=10000
  • Schema=default
  • DefaultTable=table_name

DRIVER={Microsoft Hive ODBC Driver} is the name of the actual driver.

Host=server_name is the name of the server where Hadoop is running.

Port=10000 is the default port, but you can assign your own.

Schema=default is the default database. You can create your own.

DefaultTable=table_name is the name of a table in the Hive system.
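
If you prefer not to assemble the string by hand, the same key/value pairs can be put together with OdbcConnectionStringBuilder from System.Data.Odbc. The following is only a sketch: the class name HiveConnectionStringSketch and its Build method are hypothetical, and on .NET Core or later the System.Data.Odbc NuGet package is assumed to be referenced.

```csharp
using System.Data.Odbc;

public static class HiveConnectionStringSketch
{
    // Builds a connection string from the parameters described above.
    public static string Build(string host, int port, string schema, string defaultTable)
    {
        var builder = new OdbcConnectionStringBuilder
        {
            // The builder adds the surrounding braces to the driver name.
            Driver = "Microsoft Hive ODBC Driver"
        };

        // Driver-specific keys are set through the indexer.
        builder["Host"] = host;
        builder["Port"] = port;
        builder["Schema"] = schema;
        builder["DefaultTable"] = defaultTable;

        return builder.ConnectionString;
    }
}
```

For example, HiveConnectionStringSketch.Build("server_name", 10000, "default", "table_name") returns a string that can be assigned to OdbcConnection.ConnectionString. Any further driver options can be added through the indexer in the same way.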

Function GetDataFromHive() connects to Hadoop/HIVE using Microsoft® Hive ODBC Driver.

The query SELECT * FROM table_name LIMIT 10 asks Hive to return at most 10 records, much as TOP(10) would in SQL Server.

// Required namespaces; the method itself belongs inside a class.
using System;
using System.Data;
using System.Data.Odbc;

private void GetDataFromHive()
{
    // Joining the options with ';' keeps line breaks and padding out of the string.
    var connectionString = string.Join(";",
        "DRIVER={Microsoft Hive ODBC Driver}",
        "Host=server_name",
        "Port=10000",
        "Schema=default",
        "DefaultTable=table_name",
        "HiveServerType=1",
        "ApplySSPWithQueries=1",
        "AsyncExecPollInterval=100",
        "AuthMech=0",
        "CAIssuedCertNamesMismatch=0",
        @"TrustedCerts=C:\Program Files\Microsoft Hive ODBC Driver\lib\cacerts.pem");

    // The using block closes the connection even if an exception is thrown.
    using (var conn = new OdbcConnection(connectionString))
    {
        try
        {
            conn.Open();

            // Fill a DataSet with the first 10 rows of the table.
            var adp = new OdbcDataAdapter("SELECT * FROM table_name LIMIT 10", conn);
            var ds = new DataSet();
            adp.Fill(ds);

            foreach (DataTable dataTable in ds.Tables)
            {
                Console.WriteLine("Records found " + dataTable.Rows.Count);

                foreach (DataRow dataRow in dataTable.Rows)
                {
                    // Print every column of the row, separated by spaces.
                    Console.WriteLine(string.Join(" ", dataRow.ItemArray));
                }
            }
        }
        catch (Exception ex)
        {
            // Report the failure instead of swallowing it silently.
            Console.Error.WriteLine("Failed to read from data source: " + ex.Message);
        }
    }
}
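
When the rows only need to be read once, a forward-only reader avoids loading the whole result into a DataSet. The sketch below is one possible variant, not part of the original code: the class HiveReaderSketch and its methods FormatRows and Query are hypothetical names, and Query assumes a connection string like the one shown above and an installed Hive ODBC driver.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.Odbc;

public static class HiveReaderSketch
{
    // Turns each row of any IDataReader into one space-separated line.
    public static List<string> FormatRows(IDataReader reader)
    {
        var lines = new List<string>();
        var values = new object[reader.FieldCount];

        while (reader.Read())
        {
            reader.GetValues(values);
            lines.Add(string.Join(" ", values));
        }

        return lines;
    }

    // Opens the connection, runs the query and reads the rows one by one.
    public static List<string> Query(string connectionString, string hiveQl)
    {
        using (var conn = new OdbcConnection(connectionString))
        using (var cmd = new OdbcCommand(hiveQl, conn))
        {
            conn.Open();

            using (var reader = cmd.ExecuteReader())
            {
                return FormatRows(reader);
            }
        }
    }
}
```

Because FormatRows accepts any IDataReader, it can be exercised without a Hive server, for instance with the reader returned by DataTable.CreateDataReader().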

Points of Interest

Big data is gaining ground as traditional relational databases such as SQL Server, Oracle, Sybase and others find it increasingly difficult to handle large volumes of data and data in varied (structured, document-style, unstructured, etc.) formats. In this regard, Hadoop is fast emerging as one of the solutions that big banks and other data-mining industries are embracing. This piece of code will help you talk to Hadoop and should speed up your effort to solve the problem at hand.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Rajibdotnet05
Architect Siliconguys Inc
United States
Worked on various projects including WPF, WCF, Silverlight, MongoDB, Hadoop and web development projects using ASP.NET, AJAX, C#, JavaScript, Web Services, SQL Server and Oracle.

Last Updated 4 Mar 2014
Article Copyright 2014 by Rajibdotnet05