
Hadoop For .Net & AngularJS Developers


Dec 30, 2015

CPOL

13 min read


Query Hadoop using Microsoft oriented technologies (C#, SSIS, SQL Server, Excel etc.)

Download C# Hadoop Visual Studio 2013 Solution

Download SQL & Oracle Scripts

Download SSIS To Hadoop

Introduction

I’d like to show Microsoft developers that Hadoop and .NET are technologies that work well together. I will demonstrate how to set up and configure a Hadoop cluster, import data (from an RDBMS, flat files, etc.), query that data using Hive and through a REST API, and finally display the resulting dataset within your .NET\AngularJS applications – all from a Microsoft developer’s perspective.

We will import tables from Oracle and SQL Server, then run queries within the Hortonworks Ambari dashboard (Hive) to join the two tables (originating from different data sources) within one application.

I plan to demonstrate Hadoop interaction with the following technologies\applications:

  1. C# Console
  2. C# REST Service
  3. AngularJS
  4. SSIS
  5. SQL Server Linked Server
  6. Excel
  7. Query Hadoop from within Visual Studio IDE

Background

“From four nodes to the future of data…”

Back in 2003, Google invented the basic frameworks that constitute what is today popularly called Hadoop. They faced the future first, with the problem of handling billions of searches and indexing millions of web pages. When they could not find any large-scale, distributed, scalable computing platform for their needs, they went ahead and created their own.

Doug Cutting (from Yahoo) was inspired by Google’s white papers and decided to create an open source project called “Hadoop”. Yahoo contributed further to this project and played a key role in developing Hadoop for enterprise applications. Since then, many companies such as Facebook, LinkedIn, eBay, Hortonworks and Cloudera have contributed to the Hadoop project.

The name itself came from Doug’s son’s toy elephant, Hadoop: ”...easy to remember and easy to Google…”

Terminology

Hadoop 

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Sqoop

Apache Sqoop (pronounced “scoop”) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps to move data between Hadoop and other databases, and it can transfer data in parallel for performance.

Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by a number of companies.

Hadoop Security

In a nutshell, the security layer is built upon Linux LDAP. When an RDBMS table is imported into Hadoop, it is stored in a flat-file format (like CSV) within a folder, so security revolves around restricting access to these folders.

MapReduce

Hadoop MapReduce is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.
The primary objective of Map/Reduce is to split the input dataset into independent chunks that are processed in a completely parallel manner. The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
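For .NET developers, the map/shuffle/reduce flow can be pictured with a plain LINQ word count. The sketch below is an in-process analogy only (it is not Hadoop or MapReduce API code): the SelectMany step plays the role of the mappers, GroupBy stands in for the shuffle/sort phase, and the final aggregation is the reducer.

using System;
using System.Linq;

class MapReduceAnalogy
{
    static void Main()
    {
        // Input "splits" – each string stands in for a chunk of the dataset.
        string[] documents =
        {
            "the quick brown fox",
            "the lazy dog",
            "the quick dog"
        };

        var wordCounts = documents
            // "Map" – emit a (word, 1) pair for every word in every chunk.
            .SelectMany(doc => doc.Split(' ').Select(word => new { Key = word, Value = 1 }))
            // "Shuffle/Sort" – group the intermediate pairs by key.
            .GroupBy(pair => pair.Key)
            // "Reduce" – sum the values for each key.
            .Select(group => new { Word = group.Key, Count = group.Sum(p => p.Value) });

        foreach (var wc in wordCounts)
        {
            Console.WriteLine(wc.Word + ": " + wc.Count);
        }
    }
}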

 

Prerequisites

Download and install the following applications (if you have the full version installed, use that).

Once Oracle Express is installed, you should find the entry within your Start menu.

Open SQL Developer and connect to your Oracle Express instance as the SYSTEM user (using the password you were prompted for during installation).

Open the Oracle script (attached at the top) and execute it by clicking the green play button. This will create a new Oracle user and install a database containing the demo tables we will import into Hadoop later.

Once the script has finished, create a new connection to test that everything is configured correctly – logging in as OracleHadoop/Finnharps1. These are the credentials you will use within Hadoop Sqoop to import the Artist table.

 

Once SQL Server is installed, make a note of your System Administrator (sa) password, then open the attached SQL Server script and execute it by clicking the green play button.

 

This will create the same database that you created earlier for Oracle Express. The idea is that we will import a table from Oracle and a table from SQL Server, then create a Hive query that joins these tables in Hadoop.

 

 

We will now create a new SQL Server user. Since we are also using Oracle, it’s best to have a distinctive username for each database server. Create a new SQL login called SQLServerHadoop with the password Finnharps1.

NB: You must use a strong password or Hadoop Sqoop will fail to connect to the database.

 

Create your SQL Server User

Below are the SQL Server and Oracle databases, which use the same table structure created by the SQL scripts.

 

 

 

VM Setup

Install the VM Player. Once the Hortonworks Sandbox has been downloaded, start the VM Player and play the sandbox image.

As the VM loads the Sandbox, you will see script information being displayed as follows:

Once the VM has finally started, note the IP address (mine below is 192.168.181.129).

 

Press <Alt + F5> to enter the VM, then enter the default VM login and password (root/hadoop). You may be prompted to change the password; make a note of the new one.

Start Putty and enter the VM’s IP address.

Microsoft JDBC Driver Setup

Inside Putty, enter the following command to see where you are within the VM:

pwd

Then enter the command to list the directory contents and sub-paths. You will notice I have already downloaded the Microsoft JDBC driver (the .tar file in red).

ls -la

NB: Blue coloured text indicates folders.

Let’s download the Microsoft JDBC driver, unzip it and copy the sqljdbc4.jar file into the Sqoop lib folder.

NB: To get your Sqoop directory path, enter the following into Putty:

ls -la /usr/bin/sqoop

Enter the following commands (one at a time) into Putty:

cd /usr/local/

curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz

cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/hdp/current/sqoop-client/lib/

 

Preparing the SQL Server Environment

Execute the following SQL statement to determine whether your “Shared Memory” transport configuration needs to be changed.

NB: If you are executing SSIS packages from stored procedures within this SQL Server instance, you will want to undo these changes when you finish this tutorial.

SELECT
   CONNECTIONPROPERTY('net_transport') AS net_transport,
   CONNECTIONPROPERTY('protocol_type') AS protocol_type,
   CONNECTIONPROPERTY('auth_scheme') AS auth_scheme,
   CONNECTIONPROPERTY('local_net_address') AS local_net_address,
   CONNECTIONPROPERTY('local_tcp_port') AS local_tcp_port,
   CONNECTIONPROPERTY('client_net_address') AS client_net_address

 

If the code returns Shared Memory as the net_transport (as shown above), you’ll need to go into the SQL Server Configuration Manager, disable Shared Memory, make sure TCP/IP is enabled, and reboot the SQL Server service.

Rerun the SQL statement and you should see your configuration properties below – make a note of the SQL Server IP address and port, as these will be used when Hadoop’s Sqoop imports table data from SQL Server.

 

Listing Databases Available to User

NB: The password is plain text, for test purposes only.

NB: The SQL Server password must comply with strong password rules or Linux\Hadoop will throw an ambiguous error message.

Give the root account permission to write to HDFS by running the following commands (one at a time) in Putty:

sudo -u hdfs hadoop fs -mkdir /user/root
sudo -u hdfs hadoop fs -chown root:root /user/root

Then enter the following command into Putty to list the databases that Hadoop can import for this user (do not use localhost as the IP address, as that refers to the VM’s local server; we are looking for the SQL Server instance on your desktop\server):

sqoop list-databases --connect jdbc:sqlserver://192.168.1.100:1433 --username SqlServerHadoop --password Finnharps1

 

Import the Artist table from SQL Server into Hadoop:

sqoop import --connect "jdbc:sqlserver://192.168.1.100:1433;database=Chinook;username=SqlServerHadoop;password=Finnharps1" --table Artist --hive-import -- --schema dbo

You will see information being displayed while the import process runs. Use <Ctrl> + PageUp\PageDown to view the details.

Finally, information regarding the imported table will be displayed.

If you list the directory contents, you will now see a Java file for the imported table.

I have jumped ahead to show you the newly imported table from within Hortonworks’ Hive application.

Oracle Driver Setup

We will download and install the Oracle driver the same way we did the SQL Server driver.

Navigate to the /usr/ folder and download the driver:

cd /usr/

wget https://afirs.googlecode.com/files/ojdbc6.jar

In Putty, navigate to the /usr/ folder and list the contents to see the downloaded jar file.

Now let’s copy it to the Sqoop lib folder using the following command:

cp ojdbc6.jar  /usr/hdp/current/sqoop-client/lib/

 

List Oracle Databases

Log into SQL Developer as the SYS user and run the following script for the user OracleHadoop:

GRANT DBA TO ORACLEHADOOP;


grant CREATE SESSION, ALTER SESSION, CREATE DATABASE LINK,
  CREATE MATERIALIZED VIEW, CREATE PROCEDURE, CREATE PUBLIC SYNONYM,
  CREATE ROLE, CREATE SEQUENCE, CREATE SYNONYM, CREATE TABLE,
  CREATE TRIGGER, CREATE TYPE, CREATE VIEW, UNLIMITED TABLESPACE
  to ORACLEHADOOP  ;

We can list the Oracle databases that Hadoop can see by executing the following command within Putty:

sqoop list-databases --connect jdbc:oracle:thin:@192.168.1.100:1521/xe  --username ORACLEHADOOP  --password Finnharps1

 

Now, import the Album database table into Hadoop:

sqoop import --connect jdbc:oracle:thin:@192.168.1.100:1521/xe --username ORACLEHADOOP --password Finnharps1 --table ALBUM --hive-import

NB: This can be a time consuming process even for small data dumps!

List the contents of the /usr/ directory to see the newly generated ALBUM (Java) file.

View Java File

Enter the following command to bring up the vi editor and see the (ORM) Java class associated with the newly imported Album table:

vi ALBUM.java

Log Into Ambari Browser Dashboard

Enter the VM IP address into your browser to be routed to the Hortonworks dashboard page (use the credentials admin\admin when prompted). From here we will be able to navigate to the Hive query editor and perform some SQL statements on the newly imported tables.

NB: Even though the tables originally resided in an RDBMS, in Hadoop their format is more akin to a CSV file (comma delimited).

 

Using Hive to Query Hadoop's Schema Tables

Click on the menu box and select Hive from the dropdown list.

You will see the database explorer on the left-hand side of the page; it displays the existing tables within the default database.

Simply enter a SQL statement into the editor and click the (green) execute button to see the results rendered below:

Table Schema

If you click on a table link, it will expand to display the table’s field structure.

ODBC Setup (Used by SQL Server, Tableau, Excel etc.)

Use the following Microsoft link to download the Hive ODBC driver (x32 or x64) and install it.

SQL Server Linked to a Hadoop Server - Setup

Create a “System DSN” with the following attributes – enter your Hadoop VM IP address and click the Test button to make sure the connection is valid. In the advanced options, make sure to set the default string column length to 8000; otherwise, an error will occur when retrieving (n)varchar data from Hadoop.

 

 

 

Within SQL Server, create a new Linked Server entry by executing the following command (note the @datasrc property points to our newly created DSN).

EXEC master.dbo.sp_addlinkedserver @server = N'HiveFinnharps',
@srvproduct=N'HIVE', @provider=N'MSDASQL',
@datasrc=N'Microsost C# Hadoop',
@provstr=N'Provider=MSDASQL.1;Persist Security Info=True;User ID=root; Password=finnharps;'

The image below shows how the SQL properties map to our Linked Server entry.

To help you with the database, schema and table path, right-click the table and click through to Select To -> Clipboard.

Execute the following (inner join) statement to see the data from the Artist and Album tables.

Quick C# Console Query Hadoop Example

A better example is outlined in the Visual Studio section below – but here is a quick reference on how to connect to Hadoop, execute a query and process its results.

static void Main(string[] args)
        {
            var conn = new OdbcConnection
            {
                ConnectionString = "Dsn=Microsost C# Hadoop;Uid=root;Pwd=finnharps;"
            };
            try
            {
                conn.Open();
                var adp = new OdbcDataAdapter("Select * from Artist limit 100", conn);
                var ds = new DataSet();
                adp.Fill(ds);
                foreach (var table in ds.Tables)
                {
                    var dataTable = table as DataTable;
                    if (dataTable == null) continue;
                    var dataRows = dataTable.Rows;
                    if (dataRows == null) continue;
                    System.Console.WriteLine("Records found " + dataTable.Rows.Count);
                    foreach (var row in dataRows)
                    {
                        var dataRow = row as DataRow;
                        if (dataRow == null) continue;
                    System.Console.WriteLine(dataRow[0].ToString() + " " + dataRow[1].ToString());
                    }
                }
            }
            catch (Exception ex)
            {
                System.Console.WriteLine(ex.Message);
            }
            finally
            {
               conn.Close();
            }
        }

Office (Excel\Access) Query Hadoop

If your Office version is 32-bit, make sure to install the 32-bit Microsoft Hive ODBC driver. Create a new System DSN entry for your version of Excel (32- or 64-bit) – we will use this to connect to Hadoop. Use the images below to associate the DSN with your connection.

 

You will be prompted for the root password. Then a table\field selection dialog will appear for you to choose which fields you wish to import into Excel.

You then have the opportunity to filter the data (just like any other type of ODBC import). Finally the query dialog will process your request and insert the data into Excel as normal.

Imported data from Hadoop in Excel worksheet.

 

Various Visual Studio Projects Querying Hadoop

To run the following projects, the Hortonworks VM must be up and running. Below are the four projects I will demonstrate:

  • AngularJS
  • Console
  • Data Access Layer (as Entity Framework will not work with an ODBC connection, I have designed a DAL to perform the SQL actions)
  • Rest Service

Data Access Layer

Below are a couple of methods that perform SQL queries against the Hadoop tables. The DAL methods all use the ODBC connection to query the database; then it is simply a case of executing the command and passing back the results. This DAL is used by the Console application, the AngularJS application and the REST service.

        public ActionResults ExecuteSqlStatement(string sqlStatement)
        {
            using (conn = new OdbcConnection("Dsn=Microsost C# Hadoop;Uid=root;Pwd=finnharps;")) // retrieve password from encrypted ConnectionString section of Web.Config!!!
            {
                using (var command = new OdbcCommand(sqlStatement, conn))
                {
                    try
                    {
                        conn.Open();
                        using (var reader = command.ExecuteReader())
                        {
                            actionResult.ResultTable.Load(reader);
                        }
                    }
                    catch (OdbcException ex)
                    {
                        actionResult.ActionStatus = false;
                        actionResult.ErrorException = ex;
                        throw;
                    }                   
                }
            }
            return actionResult;
        }

        public ActionResults RetrieveTop10Albumns()
        {
            using (conn = new OdbcConnection("Dsn=Microsost C# Hadoop;Uid=root;Pwd=finnharps;")) // retrieve password from encrypted ConnectionString section of Web.Config!!!
            {               
                using (var command = new OdbcCommand(@"SELECT * FROM album LIMIT 100;", conn))
                {
                    try
                    {
                        conn.Open();
                        using (var reader = command.ExecuteReader())
                        {
                            actionResult.ResultTable.Load(reader);
                        }
                    }
                    catch (OdbcException ex)
                    {
                        actionResult.ActionStatus = false;
                        actionResult.ErrorException = ex;
                        throw;
                    }
                }
            }
            return actionResult;
        }
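
The Console application in the next section also calls RetrieveTop10Artists and RetrieveTop10ArtistAndAlbum, which are not reproduced above; they follow exactly the same pattern as RetrieveTop10Albumns. Below is a hedged sketch of the join variant – the HiveQL text and column names are my assumptions, and the query in the downloadable solution may differ.

        public ActionResults RetrieveTop10ArtistAndAlbum()
        {
            using (conn = new OdbcConnection("Dsn=Microsost C# Hadoop;Uid=root;Pwd=finnharps;")) // retrieve password from encrypted ConnectionString section of Web.Config!!!
            {
                // NB: assumed HiveQL – adjust the table/column names to match your imported schema.
                const string sql = @"SELECT ar.name, al.title FROM artist ar JOIN album al ON ar.artistid = al.artistid LIMIT 10;";
                using (var command = new OdbcCommand(sql, conn))
                {
                    try
                    {
                        conn.Open();
                        using (var reader = command.ExecuteReader())
                        {
                            actionResult.ResultTable.Load(reader);
                        }
                    }
                    catch (OdbcException ex)
                    {
                        actionResult.ActionStatus = false;
                        actionResult.ErrorException = ex;
                        throw;
                    }
                }
            }
            return actionResult;
        }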

Console

The Console application references the DAL assembly to perform its SQL requests; the DAL code is explained in the previous section.

static void Main(string[] args)
        {
            try
            {
                // initialise objects
                dal = new HadoopDataAccessLayer();
                result = new ActionResults();

                // Top 10 albums
                result = dal.RetrieveTop10Albumns();
                foreach (DataRow row in result.ResultTable.Rows) // Loop over the rows.
                {
                    System.Console.WriteLine("--- Row ---"); // Print separator.
                    foreach (var item in row.ItemArray) // Loop over the items.
                    {
                        System.Console.Write("Item: ");
                        System.Console.WriteLine(item);
                    }
                }

                // Top 10 artists & albums
                result = dal.RetrieveTop10ArtistAndAlbum();
                foreach (DataRow row in result.ResultTable.Rows) // Loop over the rows.
                {
                    System.Console.WriteLine("--- Row ---"); // Print separator.
                    foreach (var item in row.ItemArray) // Loop over the items.
                    {
                        System.Console.Write("Item: ");
                        System.Console.WriteLine(item);
                    }
                }

                // Top 10 artists
                result = dal.RetrieveTop10Artists();
                foreach (DataRow row in result.ResultTable.Rows) // Loop over the rows.
                {
                    System.Console.WriteLine("--- Row ---"); // Print separator.
                    foreach (var item in row.ItemArray) // Loop over the items.
                    {
                        System.Console.Write("Item: ");
                        System.Console.WriteLine(item);
                    }
                }

                System.Console.WriteLine("Hit <Enter> to close window.")
                System.Console.ReadLine();
            }
            catch (Exception ex)
            {
                System.Console.WriteLine(ex.Message);
                System.Console.ReadLine();
            }           
        }

REST

The REST service also references the DAL assembly and calls the public interface method ExecuteSqlStatement with the SQL statement passed in from the AngularJS client. The result is serialized to JSON and passed back to the calling client.

[HttpGet]
        [Route("api/GenericSql")]
        public HttpResponseMessage Get(string SqlStatement)
        {
            try
            {
                result = dal.ExecuteSqlStatement(SqlStatement);               
                JSONresult = JsonConvert.SerializeObject(result);
                var response = Request.CreateResponse(HttpStatusCode.OK);
                response.Content = new StringContent(JSONresult, System.Text.Encoding.UTF8, "application/json");
                return response;
            }
            catch (Exception ex)
            {
                var response = Request.CreateResponse(HttpStatusCode.BadRequest);
                JSONresult = JsonConvert.SerializeObject(ex);
                response.Content = new StringContent(JSONresult, System.Text.Encoding.UTF8, "application/json");
                return response;
            }
        }
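
The dal, result and JSONresult members used in that action are controller-level fields that are not shown in the snippet. A minimal sketch of how they might be declared follows – the controller name and declarations simply mirror the names used above and may differ from the downloadable solution.

public class GenericSqlController : ApiController
{
    // Assumed field declarations – they mirror the names used in the Get action above.
    private readonly HadoopDataAccessLayer dal = new HadoopDataAccessLayer();
    private ActionResults result;
    private string JSONresult;

    // ... the Get(string SqlStatement) action shown above belongs here ...
}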

 

Deployed Rest Service in IIS

Below is the REST service deployed in IIS, which in turn is called by AngularJS.

 

Testing Service Methods in browser

Enter the following URL into your browser to see the JSON results for the Artist table:

http://localhost/Hadoop/api/Artists

Enter the following URL into your browser to see the JSON results for the Album table:

http://localhost/Hadoop/api/Albums
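
The Artists and Albums controller actions behind these two URLs are not listed in this article. The sketch below shows what such an action might look like if it follows the same pattern as the GenericSql action above – the route attribute and the use of the DAL method RetrieveTop10Artists are assumptions.

        [HttpGet]
        [Route("api/Artists")]
        public HttpResponseMessage GetArtists()
        {
            try
            {
                // Assumed DAL call – loads the artist rows from Hive over the ODBC DSN.
                result = dal.RetrieveTop10Artists();
                JSONresult = JsonConvert.SerializeObject(result);
                var response = Request.CreateResponse(HttpStatusCode.OK);
                response.Content = new StringContent(JSONresult, System.Text.Encoding.UTF8, "application/json");
                return response;
            }
            catch (Exception ex)
            {
                var response = Request.CreateResponse(HttpStatusCode.BadRequest);
                JSONresult = JsonConvert.SerializeObject(ex);
                response.Content = new StringContent(JSONresult, System.Text.Encoding.UTF8, "application/json");
                return response;
            }
        }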

Testing in SoapUI

The following image shows how to execute the Artist controller (GET) method and display its JSON results (just like we did with the browser above, only better formatted).

 

AngularJS

The screenshots below demonstrate the UI of the AngularJS-to-Hadoop application. The UI lets the user select from a ribbon-style menu and displays the results in a static grid.

When you click a ribbon button, the respective REST service method is called, which in turn calls the respective DAL method and initiates the SQL request against Hadoop (a busybox is rendered to inform the user that the request is in progress).

The grid below shows the results from Hadoop.

 

AngularJS calling a REST method to filter Hadoop;

 

HTML

The (Ribbonbar) snippet below is standard (SPA) AngularJS syntax, using controllers, directives, routing, services and expressions, with the (resulting) HTML pages injected into the main page (MVC).

<div id="testRibbon" class="officebar" ng-controller="ribbonController">
        <ul>
            <li class="current">
                <a href="#" rel="home">Hadoop</a>
                <ul>
                    <li>
                        <span>Albums</span>
                        <div class="button" ng-click="buttonClick('Top10Albums');">
                            <a href="#/Results" rel="table"><img src="Content/ribbon/images/cover32.png" alt="" />Top 10</a>
                        </div>
                    </li>
                    .
                    .
                    .
                 
    <!--Inject views into ui-view-->
    <div id="content" ui-view="main"></div>

Controllers

The Controller snippet takes the page events and calls the respective service method, which in turn calls the REST service, which references the DAL and thus can communicate with Hadoop.

$scope.buttonClick = function (value) {
            $log.debug('Enter buttonClick');
            blockUI.start(); // block UI
            if (value == 'GenericSQL') {               
                $scope.$parent.ResultTitle = 'Generic SQL'
                if ($scope.genericSQL == '') {
                    alert('Please enter a Hadoop SQL statement!');
                    return;
                }
                // call respective service
                ribbonService.getGenericSQL($scope.genericSQL) // pass back the promise and handle in controller (service is only a pass through\shared logic between ctrls)
                    .then(function (results) {
                        $scope.$parent.Data = results.data.ResultTable // update the parent scope as you have nested controllers in view
                    },
                    function (results) {
                        blockUI.stop(); // unblock UI
                        alert("Failed Hadoop data request" + results); // log error
                    });
            }
            else if (value == 'FilteredArtist') {               
                $scope.$parent.ResultTitle = 'Filtered Artists'
                if ($scope.filterArtist == '') {
                    alert('Please enter an artist to filter by!');
                    return;
                }
                // call respective service
                ribbonService.getFilteredArtist($scope.filterArtist) // pass back the promise and handle in controller (service is only a pass through\shared logic between ctrls)
                    .then(function (results) {
                        $scope.$parent.Data = results.data.ResultTable // update the parent scope as you have nested controllers in view
                    },
                    function (results) {
                        blockUI.stop(); // unblock UI
                        alert("Failed Hadoop data request" + results); // log error
                    });
            }

Services

Below are two service methods – note the URL pointing to our deployed Hadoop REST service, which makes it simple to communicate with Hadoop.

this.getGenericSQL = function (sqlStatement) {         
        return $http({
            method: 'GET',
            withCredentials: true,
            data: 'json',
            url: 'http://localhost/Hadoop/api/GenericSql?sqlStatement=' + sqlStatement
        });
      };
   
      this.getFilteredArtist = function (filteredArtist) {        
          return $http({
              method: 'GET',
              withCredentials: true,
              data: 'json',
              url: 'http://localhost/Hadoop/api/Artists/FilteredArtistAlbum?filter=' + filteredArtist             
          });
      };

SSIS (Using 32bit ODBC DSN)

Once we know how to create an ODBC DSN, we can use basically any Microsoft product to communicate with Hadoop. The screenshots below demonstrate how to set up SSIS to extract (source) data from Hadoop and export it to an Excel worksheet (destination). The following is a Visual Studio SSIS project.

Adding a Data Conversion to convert the mismatched data types (developers who import from Oracle into SQL Server will recognise this conversion step as a normal part of the process).

The Excel (destination) worksheet once the SSIS package has been executed.

 

 

Query Hadoop From Within Visual Studio

We can also use Visual Studio to query Hadoop, without having to link the Hadoop server into SQL Server, by adding a new connection and using the ODBC DSN as normal.

Perform SQL queries as if it were any other RDBMS.

To get the connection string – select the properties of the Hive Server.