DataCube

darkoman

5.00/5 (6 votes)

Aug 24, 2018

CPOL

8 min read

19264

488

C# implementation of the serverless OLAP engine

Download source - 33.1 KB

Introduction

This article is about the C# .NET implementation of the serverless OLAP engine that deals with an off-line data aggregation and processing. It is a Microsoft SQL Server dependent solution and based on that storage engine. The future work is targeting the flat data sources like the CSV files.

Background

The OLAP (Online Analytical Processing) is a well-known technique of performing an off-line processing of the information stored inside the database and aggregating of those data to be reported by the different parameters. It is explained on the following web page https://en.wikipedia.org/wiki/Online_analytical_processing. It is composed from the several parts:

data extraction (E)
data transformation (T)
data loading (L)
data aggregation and processing
data reporting

This C# .NET dynamic class library is written in order to support the requested minimum for the successful data processing and reporting. For accessing the Microsoft SQL Server database schema, it uses the INFORMATION_SCHEMA in order to create "in-memory" database model. It collects the information about the tables, columns, primary and foreign keys, as also for the existing referential constraints. The DataCube functionality was tested on the Northwind sample database provided by Microsoft. The SQL script for populating this sample database is included in the project. This database contains the general information about the Products, Orders and Customers, as the main identities which are being processed. Please check the Northwind database schema below:

The DataCube Structure

The DataCube class library contains the following logical layers:

database model
multi-dimensional cube model
data query module

Using the Northwind database as the test-case, here is how the DataCube library is used
(NOTICE: The DataCube.dll must be attached as the reference to the .NET project that is built.)

Modeling The Database

To create a DataCube object, use the following constructor:

DataCube.DataCube dataCube = new DataCube.DataCube
(@"NORTHWIND_DATACUBE", @"Server=localhost; Initial Catalog=NORTHWIND; 
Integrated Security=SSPI", @"E:\\Firma\\NORTHWIND_DataCube\\", DataCubeType.DataCubeTypeMSSS);

The DataCube.DataCube constructor takes the following parameters:

name of the datacube (string)
database connection (string)
server repository path (string)
datacube type (enumerable of the DataCubeType type - the only supported values for now is DataCubeType.DataCubeTypeMSSS)

Next, the following method is called in order to create an "in-memory" model of the database:

dataCube.Model();

Now, the database model is loaded into the memory and can be used for further data processing.

Modeling the Cube

The Cube is a multi-dimensional array that holds the pre-aggregated information from the database. It is made of the following:

dimensions
facts
hierarchies

The Dimension is formed from the selected table from the database. In this example, we will create four different dimensions, please check the code below:

DataCube.Dimension dimensionCustomers = 
       new DataCube.Dimension("Customers", new string[] { "CompanyName" });
dataCube.Dimensions.Add(dimensionCustomers.Name, dimensionCustomers);
DataCube.Dimension dimensionEmployees = 
       new DataCube.Dimension("Employees", new string[] { "FirstName", "LastName" });
dataCube.Dimensions.Add(dimensionEmployees.Name, dimensionEmployees);
DataCube.Dimension dimensionProducts = 
       new DataCube.Dimension("Products", new string[] { "ProductID", "ProductName", "UnitPrice" });
dataCube.Dimensions.Add(dimensionProducts.Name, dimensionProducts);
DataCube.Dimension dimensionCategories = 
       new DataCube.Dimension("Categories", new string[] { "CategoryName" });
dataCube.Dimensions.Add(dimensionCategories.Name, dimensionCategories);

The DataCube.Dimensions object is a "named collection" - the .NET Dictionary object type. The DataCube.Dimension constructor takes the following params:

name of the dimension (string)
array of the columns (string[])

The dimension name is the actual name of the table from the database. The array of the columns is the array holding the unique name of the columns from the same table. In the previous example, we have created the four dimensions: Customers, Employees, Products and Categories.

The Fact is formed also from the selected table from the database. In this example, we will create only one fact, please check the code below:

DataCube.Fact fact = new DataCube.Fact("Order Details");
DataCube.Measure measureQuantity = 
      new DataCube.Measure("Quantity", DataCubeFactAggregationType.DataCubeFactAggregationTypeSum);
fact.Measures.Add(measureQuantity.Name, measureQuantity);
DataCube.DerivedMeasure measureSales = 
      new DataCube.DerivedMeasure("Sales", DataCubeFactAggregationType.DataCubeFactAggregationTypeSum, 
                                  "UnitPrice * Quantity * (1 - Discount)");
fact.DerivedMeasures.Add(measureSales.Name, measureSales);
dataCube.Facts.Add(fact.Name, fact);

The DataCube.Fact object is created on the transaction tables "Order Details". This database table keeps track of the Products, of different Categories, sold to the Customers, by the Employees. Each DataCube.Fact object has a different DataCube.Measure objects that are being processed and calculated in the later steps. As the "Order Details" table has the Quantity and the UnitPrice, as also the Discount column, we create the exact measures we need.

The DataCube.Measure object constructor has the following params:

name of the measure (string)
type of the data aggregation (enumerable of the DataCubeFactAggregationType type)

There are five basic aggregation functions supported: SUM, COUNT, MIN, MAX and AVG.

If there is a need to define a pre-calculated column, like we need a Sales column here, we should create the DataCube.DerivedMeasure object. The additional parameter in its constructor is the expression that is used to calculated its value, like here where the Sales = UnitPrice * Quantity * (1 - Discount).

For basic data processing, most of the work is finished here.

Processing the Cube

In order to get the desired data processing results, we need to have the data aggregated and processed. Please check the code below:

dataCube.Process();

As an output of the previous step, the repository will be filled with the corresponding XML files that represent the processed data.

Here is the inner structure of the generated XML file, Categories.xml:

This XML file has the schema written together with the data. The data is contained inside the elements which tag name matches the dimension name. It is easily loaded inside the .NET DataTable object for later processing. The original type of the data (the column data type) is saved inside the XML schema. All processed dimensions generate the identical output file. Also, the data inside the Fact is aggregated and calculated for every single Dimension value.

Here is the actual SQL query that is automatically generated:

And, here are the results from the execution of the upper SQL query:

At this point, we might want to save the created Cube for later processing.

Saving the Cube

In order to save the Cube, please check the code below:

dataCube.Save();

This action will generate the XML file at the repository by the name of the Model.xml. Here is the inner structure of this file:

This file keeps track of the dimensions and facts created inside the DataCube object. Once the Cube is saved, after processing of its dimensions and measures, it can be loaded again off-line. No connection to the database is needed any more.

Loading the Cube

To load the Cube, check the following code:

DataCube.DataCube dataCube = new DataCube.DataCube("NORTHWIND_DATACUBE", null, 
                  "E:\\Firma\\NORTHWIND_DataCube\\", DataCubeType.DataCubeTypeMSSS);
dataCube.Load();

This constructor has already been explained, but this time, it is important to see that there is no more a database connection string provided. The Load() method will load the Cube from the repository, if there is an existing Model.xml file.

Querying the Cube

To perform a data analysis, we will have to query the data that is processed by the DataCube object. The DataCube.Query class has been written for this purpose. I have tried here to mimic the MDX query logic, and provide the actual data pivoting, in order to represent the data similar to the OLAP clients, like the Microsoft Excel is. So, from this point, we are processing and analysing the off-line data that is stored inside the XML output files generated from the processing of the dimensions and facts. Check the code below:

DataCube.Query query = new DataCube.Query("Test", new string[] 
    { "Categories.Categories_CategoryName" }, null, new string[] { "Quantity", "Sales" }, 
      "Categories_CategoryName IN ('Produce', 'Seafood')");

The DataCube.Query object constructor has the following params:

name of the query (string)
array of row dimensions (string[])
array of column dimensions (string[])
array of measures (string[])
condition (string)

Here, we actually define what data to collect, by what criteria, and what calculations to perform. The dimensions can be placed on rows or on the columns. The measures are precalculated values from the processed dimension file outputs, and the condition is just a filtering criteria for this data. To actually execute the Query, use the following code:

dataCube.Execute();

If we run the Execute() method on the previously created and processed Cube, the following output XML file will be generated, by the name of Test.xml:

As the output file shows, we have got the Categories dimension data for the specific filtering conditions ('Produce' and 'Seafood').

Another, a more complex, example would be to get the [Products Sales By Year] data. We need here a so called "time hierarchy" to be engaged. Please check the following code:

// Model the database
DataCube.DataCube dataCube = new DataCube.DataCube
(@"NORTHWIND_DATACUBE", @"Server=localhost; Initial Catalog=NORTHWIND; 
Integrated Security=SSPI", @"E:\\Firma\\NORTHWIND_DataCube\\", DataCubeType.DataCubeTypeMSSS);
dataCube.Model();

// Model the cube
DataCube.Fact fact = new DataCube.Fact("Order Details");
DataCube.Measure measureQuantity = new DataCube.Measure
("Quantity", DataCubeFactAggregationType.DataCubeFactAggregationTypeSum);
fact.Measures.Add(measureQuantity.Name, measureQuantity);
DataCube.DerivedMeasure measureSales = new DataCube.DerivedMeasure
("Sales", DataCubeFactAggregationType.DataCubeFactAggregationTypeSum, 
"UnitPrice * Quantity * (1 - Discount)");
fact.DerivedMeasures.Add(measureSales.Name, measureSales);
dataCube.Facts.Add(fact.Name, fact);
DataCube.Dimension dimensionProducts = new DataCube.Dimension
("Products", new string[] { "ProductID", "ProductName", "UnitPrice" });
dataCube.Dimensions.Add(dimensionProducts.Name, dimensionProducts);
DataCube.Dimension dimensionOrders = new DataCube.Dimension
("Orders", new string[] { "OrderID", "ShippedDate" });
dataCube.Dimensions.Add(dimensionOrders.Name, dimensionOrders);
DataCube.DimensionGroup dimensionGroupProductsOrders = 
new DataCube.DimensionGroup("Products_Orders", new string[] { "Products", "Orders" });
DataCube.DimensionGroupHierarchy hierarchyProductsOrdersYear = 
new DataCube.DimensionGroupHierarchy("Products_Orders_ShippedDate_YEAR", 
"Orders", "ShippedDate", DataCubeDimensionHierarchyType.DataCubeDimensionHierarchyTypeYear, 
null, new string[] { "Products.ProductName", "Orders.ShippedDate" });
dimensionGroupProductsOrders.Hierarchies.Add(hierarchyProductsOrdersYear.Name, 
hierarchyProductsOrdersYear);
dataCube.DimensionGroups.Add(dimensionGroupProductsOrders.Name, dimensionGroupProductsOrders);

// Process the cube
dataCube.Process();

// Save the cube
dataCube.Save();

Here, we create the necessary dimensions (from the "Products" and "Orders" tables) as also the facts (from the "Order Details" table) with the corresponding measures (Quantity and Sales). Next a "dimension-group" is created. The constructor for the DataCube.DimensionGroup object takes the following params:

name of the group (string)
array of dimensions from the group (string[])

Processing of the created group generates the following SQL query:

And the results that are obtained are the following:

This group should be processed on the yearly base, so we create the time hierarchy for this group using the DataCube.DimensionGroupHierarchy class. Please check the constructor params for this object:

name of the hierarchy (string)
name of the table (string)
name of the column (string)
hierarchy type (enumerable of the DataCubeDimensionHierarchyType type)
parent hierarchy object (DataCube.DimensionGroupHierarchy type)
array of tables and corresponding columns (string[])

Processing of the created group, with the yearly based hierarchy, generates the following SQL query:

Also, the results we get are:

Next, we should create a Query object and execute it, so check the following code:

// Load the cube
DataCube.DataCube dataCube = new DataCube.DataCube
("NORTHWIND_DATACUBE", null, "E:\\Firma\\NORTHWIND_DataCube\\", DataCubeType.DataCubeTypeMSSS);
dataCube.Load();

// Model the queries
DataCube.Query query = new DataCube.Query
("Test", new string[] { "Products.Products_ProductName" }, new string[] 
{ "Orders.Orders_ShippedDate_YEAR" }, new string[] { "Quantity", "Sales" }, 
"Orders_ShippedDate_YEAR IN ('1996', '1997', '1998') 
AND Products_ProductName IN ('Chai', 'Chocolade')");
dataCube.Queries.Add(query.Name, query);

// Execute the queries
dataCube.Execute();

Finally, the following XML output file is generated at the repository:

Here, it should be noted that a multi-level data pivoting is supported, on rows, as also on the columns.

Also, please check the provied examples in the test project for the correct naming convention of the generated XML files, etc.

Points of Interest

This was a very challenging project for me, and I am sure someone will find this class library useful.

History

DataCube v1.0 - August, 2018