
Write ETL jobs in pure C#

29 Mar 2009, CPOL
An introduction to using Rhino ETL to avoid dealing with designers such as the ones in DTS/SSIS job creation.

Introduction

If you like writing C# and dislike using DTS/SSIS to create ETL jobs (or the general idea of clicking through a designer in order to get work done), then this is for you.

If you dislike writing C# and/or don't mind creating ETL jobs through a designer and feel that all scenarios you need are covered by the designer, then this article is not for you.

The example process in this article has four operations. Two to read from the different user sources, one to join them together, and one to write out the combined data:

public class MainProcess : EtlProcess
{
    protected override void Initialize()
    {
        Register(new JoinUserRecords()
            .Left(new UserNameRead(Settings.Default.NamesFile))
            .Right(new UserAddressRead(Settings.Default.AddressesFile))
        );

        Register(new UserFullWrite(Settings.Default.OutputFile));
    }
}

User name file:

Id Name
1 Bob
2 John
3 Frank

Addresses file:

Id Address
1 123 Main St.
2 42 Everywhich way
3 1 Microsoft way

Output file from process:

Id Name Address
1 Bob 123 Main St.
2 John 42 Everywhich way
3 Frank 1 Microsoft way

If, when you look at the above, it isn't terribly clear what is going on, then I don't think this method is something you will be interested in. But, if you find the above to be descriptive of what this process is actually doing, and easier to understand than clicking through a bunch of dialogs, then you will probably like Rhino ETL very much.

ETL

ETL stands for Extract, Transform, Load. For example, you receive files or other data from vendors or other third parties, manipulate it in some way, and then insert it into your own database.

Getting Rhino ETL

It isn't terribly easy to get a working copy of Rhino ETL right now, which is the reason I wanted to write this. I'm hoping this will spur more interest in the project and drive more features, improvements, regular releases, etc. I've seen newsgroup postings where people trying to figure out how to get started have to download and compile the Rhino tools source first. This article will give you something you can quickly and easily download, so you can see how it works and whether it is for you. The DLLs included here are ones I compiled locally, in debug mode. So, until the project starts creating standard releases and downloads, you should eventually figure out how to download and compile the source yourself.

To download and compile Rhino ETL, you will need to:

  • Get TortoiseSVN or any other SVN client.
  • Download the Rhino tools source from http://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools.
  • Compile the solution in the ETL sub folder. This isn't as easy as opening up the solution and building it - there is a NAnt build you have to get to work correctly. To avoid doing that, I just deleted all the referenced AssemblyInfo.cs files, and changed the references so I could build just the projects needed to get the ETL tests to run.

See https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/trunk/How%20to%20build.txt for more details on doing it the right way.

Using Rhino ETL

The way I learned how all this works is by looking at the Unit Tests in the ETL project. They are very good for understanding the finer points of all the options that are available. There isn't any documentation, so this also happens to be the only way to figure out how it works. But if you have a problem, see if you can find a similar Unit Test and step through the code. That's how I've solved most of the issues I've had.

Understanding yield return

Rhino ETL makes extensive use of yield return, which I thought I understood before I started working with it - a belief I was quickly relieved of once I started creating more complicated ETL jobs. I'll attempt to explain here what you need to know in order to work with it.

If you have a class that generates numbers up to the maximum value of an integer, like this:

private class NumberGenerator
{
    public int NumbersGenerated { get; set; }
    
    public IEnumerable<int> GenerateNumbers()
    {
        for (int i = 1; i < int.MaxValue; i++)
        {
            Console.WriteLine("NumberGenerator generated a number");
            NumbersGenerated++;
            yield return i;
        }
    }
}

then the only numbers generated would be the ones you iterated over. For example, the following test confirms that the loop does not occur over the entire set of integers from 1 to MaxValue if the results are never enumerated:

[TestMethod]
public void UnenumeratedDoesNoWork()
{
    var generator = new NumberGenerator();

    generator.GenerateNumbers();

    Assert.AreEqual(0, generator.NumbersGenerated);
}

Likewise, the test below confirms that enumerating the first five items only generates five numbers, instead of generating all numbers and then taking the first five. So, work is only performed for exactly what you use and nothing else.

[TestMethod]
public void GenerateFiveNumbers()
{
    var generator = new NumberGenerator();
    
    foreach (int number in generator.GenerateNumbers())
    {
        Console.WriteLine(number);

        if (number == 5) break;
    }

    Assert.AreEqual(5, generator.NumbersGenerated);
}

Now, for the final and more interesting example: chaining enumerators, so that the results of one are consumed by another.

private class ChainedNumberGenerator
{
    public int NumbersGenerated { get; set; }

    public IEnumerable<int> GenerateNumbers(IEnumerable<int> inNumbers)
    {
        foreach(int i in inNumbers)
        {
            Console.WriteLine("ChainedNumberGenerator generated a number");
            NumbersGenerated++;
            yield return i;
        }
    }
}

[TestMethod]
public void ChainedNumberGeneratorsAreDependentOnEachOther()
{
    var firstGenerator = new NumberGenerator();
    var lastGenerator = new ChainedNumberGenerator();

    foreach (int number in 
      lastGenerator.GenerateNumbers(firstGenerator.GenerateNumbers()))
    {
        Console.WriteLine(number);
        if (number == 5) break; 
    } 

    Assert.AreEqual(5, lastGenerator.NumbersGenerated); 
    Assert.AreEqual(5, firstGenerator.NumbersGenerated); 
}

This outputs the following:

NumberGenerator generated a number
ChainedNumberGenerator generated a number
1

NumberGenerator generated a number
ChainedNumberGenerator generated a number
2

NumberGenerator generated a number
ChainedNumberGenerator generated a number
3

NumberGenerator generated a number
ChainedNumberGenerator generated a number
4

NumberGenerator generated a number
ChainedNumberGenerator generated a number
5

If the above does not make sense, then you should study up on yield return before going any further. If this example makes sense, then you should be set to start using Rhino ETL.

The pipeline

The reason it is important to understand yield return is that this is how rows get sent from one operation to another in Rhino ETL. So, if you have a process with a read operation followed by an operation that does something with those rows, you might expect all the rows to be read first, and only then see the next operation start.

But, what could actually happen (depending on how you are iterating the rows) is that the first operation will process one row, then the second operation will process that same row, etc., through the whole pipeline which would repeat for the second, third, etc., rows. Likewise, if you never iterated the rows in the second operation, the first would never do any work.

That can be confusing, but gives you some important performance benefits - instead of looping over the input rows multiple times, you loop once and perform each of the actions you need to perform.

If you want to process all rows in the first step and then go to the second, that is also possible by simply consuming (iterating) all the rows at the beginning of the next process before it yields its own return rows.
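This blocking pattern can be sketched with the same plain-C# generator style used earlier (the PipelineDemo class below is illustrative, not a Rhino ETL type): a step that materializes its entire input into a list before yielding anything forces the upstream to run to completion as soon as the first downstream row is requested.

```csharp
using System;
using System.Collections.Generic;

public static class PipelineDemo
{
    public static int Produced;

    static IEnumerable<int> Produce()
    {
        for (int i = 1; i <= 5; i++)
        {
            Produced++;
            yield return i;
        }
    }

    // A "blocking" step: consume the entire upstream first,
    // then yield the buffered rows one at a time.
    static IEnumerable<int> Blocking(IEnumerable<int> input)
    {
        var all = new List<int>(input); // the upstream runs to completion here
        foreach (int i in all)
            yield return i;
    }

    public static int CountAfterFirstPull()
    {
        Produced = 0;
        using (var e = Blocking(Produce()).GetEnumerator())
        {
            e.MoveNext(); // ask for just the first downstream row...
        }
        return Produced;  // ...but the producer has already run 5 times
    }

    public static void Main()
    {
        Console.WriteLine(PipelineDemo.CountAfterFirstPull()); // prints 5, not 1
    }
}
```

Contrast this with the ChainedNumberGenerator test earlier, where pulling one row from the end of the chain did one unit of work at each stage.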

Understanding the "Quacking Dictionary"

The row object uses something called a quacking dictionary which allows you to add/update items for a row easily, and to also retrieve them without checking for null before every access. For example, the below is valid:

object a = row["SomemadeupstringThatDoesntExist"];
row["Column1"] = "abc";
row["Column1"] = "123";
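To see what that buys you over a plain Dictionary<string, object> (which throws on a missing key), the idea can be sketched in a few lines - note that this QuackingDictionary is my own minimal stand-in to illustrate the behavior, not the actual Rhino ETL type:

```csharp
using System;
using System.Collections.Generic;

// A minimal stand-in for the idea: missing keys return null instead
// of throwing, and assignment adds or overwrites without ceremony.
public class QuackingDictionary
{
    private readonly Dictionary<string, object> items =
        new Dictionary<string, object>();

    public object this[string key]
    {
        get
        {
            object value;
            return items.TryGetValue(key, out value) ? value : null;
        }
        set { items[key] = value; }
    }
}

public static class QuackingDemo
{
    public static void Main()
    {
        var row = new QuackingDictionary();
        Console.WriteLine(row["SomemadeupstringThatDoesntExist"] == null); // True, no exception
        row["Column1"] = "abc";
        row["Column1"] = "123"; // overwrite, no Add/ContainsKey dance
        Console.WriteLine(row["Column1"]); // 123
    }
}
```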

The example code - combining two files into one

The example I've included has Unit Tests for each operation so you can examine each one in detail. You can also launch the console application to see it run all at once. All it does is read in two files, join them together on a common property (Id), and then write out the final results to another text file.

Here is the main process code again:

public class MainProcess : EtlProcess
{
    protected override void Initialize()
    {
        Register(new JoinUserRecords()
            .Left(new UserNameRead(Settings.Default.NamesFile))
            .Right(new UserAddressRead(Settings.Default.AddressesFile))
        );

        Register(new UserFullWrite(Settings.Default.OutputFile));
    }
}

Let's look at the first two operations:

public class UserNameRead : AbstractOperation
{
    public UserNameRead(string filePath)
    {
        this.filePath = filePath;
    }

    string filePath = null;

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        using (FileEngine file = FluentFile.For<UserNameRecord>().From(filePath))
        {
            foreach (object obj in file)
            {
                yield return Row.FromObject(obj);
            }
        }
    }
}

public class UserAddressRead : AbstractOperation
{
    public UserAddressRead(string filePath)
    {
        this.filePath = filePath;
    }

    string filePath = null;

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        using (FileEngine file = FluentFile.For<UserAddressRecord>().From(filePath))
        {
            foreach (object obj in file)
            {
                yield return Row.FromObject(obj);
            }
        }
    }
}

The only thing to note here is the use of FileHelpers (through Rhino ETL) for the file IO, which frees you from needing to deal with how to parse a CSV, tab delimited file, etc. It reads strongly typed C# records from the files, which are then converted to the Row object Rhino ETL uses for passing information from operation to operation. So, these operations will create a Row object for each row in the input files, and pass them on to the next operation, the join:

public class JoinUserRecords : JoinOperation
{
    protected override void SetupJoinConditions()
    {
        InnerJoin
            .Left("Id")
            .Right("Id");
    }

    protected override Row MergeRows(Row leftRow, Row rightRow)
    {
        Row row = new Row();
        row.Copy(leftRow);

        //copy over all properties not in the user records
        row["Address"] = rightRow["Address"];

        return row;
    }
}

The rows are joined on their common property in the SetupJoinConditions method, and then there is another abstract method that needs to be implemented to specify what you want to do with the joined rows. For example, you may want just one column from the joined row, or you may want all of them. That is controlled in the MergeRows method.
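To make the inner join semantics concrete, here is the same join expressed over plain in-memory data with LINQ (the JoinDemo class and inline data are illustrative, not part of the example project): rows with matching Id values pair up, and each matched pair is merged into one output row, just as MergeRows does above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinDemo
{
    public static List<string> JoinUsers()
    {
        var names = new[]
        {
            new { Id = 1, Name = "Bob" },
            new { Id = 2, Name = "John" },
        };
        var addresses = new[]
        {
            new { Id = 1, Address = "123 Main St." },
            new { Id = 2, Address = "42 Everywhich way" },
        };

        // Inner join on Id: only rows present on both sides survive,
        // and each matched pair becomes one merged output row.
        return names
            .Join(addresses, n => n.Id, a => a.Id,
                  (n, a) => n.Id + "\t" + n.Name + "\t" + a.Address)
            .ToList();
    }

    public static void Main()
    {
        foreach (string line in JoinUsers())
            Console.WriteLine(line);
    }
}
```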

Then, we have the final process which outputs the merged rows:

public class UserFullWrite : AbstractOperation
{
    public UserFullWrite(string filePath)
    {
        this.filePath = filePath;
    }

    string filePath = null;

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        FluentFile engine = FluentFile.For<UserFullRecord>();
        engine.HeaderText = "Id\tName\tAddress";
        using (FileEngine file = engine.To(filePath))
        {
            foreach (Row row in rows)
            {
                file.Write(row.ToObject<UserFullRecord>());

                //pass through rows if needed for another later operation 
                yield return row;
            }
        }
    }
}

Again, using FileHelpers through the Rhino ETL helper methods/classes.
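Conceptually, the Row.FromObject / row.ToObject&lt;T&gt;() pair used in these operations is just reflection over the record's public members. A rough sketch of the idea - RowDemo and the dictionary-based row below are my own simplified stand-ins, not the Rhino ETL implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public class UserNameRecord
{
    public int Id;
    public string Name;
}

public static class RowDemo
{
    // Roughly what Row.FromObject does: copy each public field
    // of the record into a name -> value dictionary.
    public static Dictionary<string, object> FromObject(object obj)
    {
        var row = new Dictionary<string, object>();
        foreach (FieldInfo field in obj.GetType().GetFields())
            row[field.Name] = field.GetValue(obj);
        return row;
    }

    // And the reverse, roughly row.ToObject<T>(): create a T and set
    // any public fields whose names match a key in the row.
    public static T ToObject<T>(Dictionary<string, object> row) where T : new()
    {
        var obj = new T();
        foreach (FieldInfo field in typeof(T).GetFields())
        {
            object value;
            if (row.TryGetValue(field.Name, out value))
                field.SetValue(obj, value);
        }
        return obj;
    }

    public static void Main()
    {
        var row = FromObject(new UserNameRecord { Id = 1, Name = "Bob" });
        Console.WriteLine(row["Name"]); // Bob
        var back = ToObject<UserNameRecord>(row);
        Console.WriteLine(back.Id);     // 1
    }
}
```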

What else?

I only showed file operations here because it was simpler to distribute and get working on other people's machines that way. SQL is, of course, supported, and there are quite a few operations you can use that will, among other things, read in a table and populate the rows on the fly. Take a look at the Unit Tests for more details on those.

The real power comes when you need to do something more complicated, like involve a Web Service, use COM objects, threading, etc., within an ETL job. Now, you can use your existing skill set and experience to do those tasks instead of trying to come up with workarounds and hacks in a designer based product.

Enjoy!

History

  • 29-Mar-2009 - Initial version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Paul B.

United States
I've been a software developer since 1996 and have enjoyed C# since 2003. I have a Bachelor's degree in Computer Science and for some reason, a Master's degree in Business Administration. I currently do software development contracting/consulting.
