Click here to Skip to main content
12,622,527 members (29,194 online)
Click here to Skip to main content
Add your own
alternative version

Stats

11K views
16 bookmarked
Posted

A C# data import caching solution utilizing Generics and your existing business classes

, 7 Jul 2010 CPOL
Rate this:
Please Sign up or sign in to vote.
Leverage your existing classes and the power of Generics to speed up and simplify importing denormalized data into your system.

Introduction

Importing data from an external system is a common task that almost all programmers have to deal with. One of the most must frustrating aspects of this task is reconciling the structure of the foreign data with your own. very often, the data comes from a legacy system that has no notion of relations and simply dumps the data in a flat format. This article demonstrates one possible technique that can be used if you have an established system with separate business classes.

Background

When importing data, I always find myself caching certain pieces of information in order to speed up the import process and add a safety net for myself. For example, many systems not only export the active records, but ones that have been "deleted" or, more precisely, deactivated. This can lead to a large amount of data duplication and a very confusing situation for someone trying to import it. It would be nice if one record in the foreign source corresponded to one record in your database, but that will rarely be the case.

For the sake of simplicity, I will assume the external data source is a CSV file. This is the de-facto standard when all else fails and, in my experience, is a prime example of a format that is misused and often poorly formatted. I will also be using the standard Customer -> Order -> OrderLineItem example that is common in many data discussions. While using these simple standards may seem like a cop-out, I know from experience they represent not only an extremely common scenario, but one that is complex enough to validate the solution. Please note the code examples are truncated for brevity.

Step 1 - Take Care of the Basics

OK, let's start out very simply. We are going to assume that all we are doing is importing customers. Our business class looks like this:

// assume your favorite ORM attributes are in place
public class Customer
{
    public int CustomerID { get; set; } // primary key
    public string CustomerName { get; set; }
}

And our data file looks like this:

ID,Name,City,State,Country,IsHQ,Employees
42,Acme Inc,Boston,MA,USA,0,50
28,Acme Inc,Denver,CO,USA,1,100
87,Foo Corp,Topeka,KS,USA,1,20
99,Acme Inc,Albany,NY,USA,0,50
31,Foo Corp,Newark,NJ,USA,0,10

I'm sure many of you see the problem right away. Our file contains multiple corporate branches for each company, but we only care about storing the name once. This means we can't simply import the rows on a one-to-one basis. Since there could be dozens or even hundreds of branch locations, we don't want to bother processing all the duplicate records. Because there is no way to predict how many unique companies and branches will be in the file, our only choice is to read the customer name every time. This obviously becomes the piece of data we want to cache; once "Acme" has been processed and stored, we can skip the rest of the "Acme" records. One easy solution is shown below:

List<string> customerNames = new List<string>();
IDataReader reader = ... // open data source
while (reader.Read())
{
    string custName = reader.GetString(1);
    if (!customerNames.Contains(custName))
    {
        // ...create and store new customer in DB...
        customerNames.Add(custName);
    }
}

This approach works fine when you are only importing a couple unique entities (i.e., classes). But what if we had 5, 10, or even 20 pieces of data per record? It's not really practical to keep 20 Lists around to cache all the different names. This is where generics and my DataLoadCache class comes in.

public static class DataLoadCacheV1<T>
{
    private static List<string> _names = new List<string>();

    public static bool ContainsName(string name)
    {
        return _names.Contains(name);
    }

    public static void StoreName(string name)
    {
        _names.Add(name);
    }
}

At first glance, it seems like we haven't accomplished much, but think about what kinds of "Ts" (generic types) we are going to use; yep, our existing business classes. Because we have unique classes defined and DataLoadClass is static, we don't have to worry about creating a unique cache instance for every business class we import. The compiler will create a unique DataLoadCache class every time we use a new "T". Our simple example becomes:

IDataReader reader = ... // open data source
while (reader.Read())
{
    string custName = reader.GetString(1);
    if (!DataLoadCacheV1<Customer>.ContainsName(custName))
    {
        // ...create and store new customer in DB...
        DataLoadCacheV1<Customer>.StoreName(custName);
    }
}

Notice we did not declare a List or any local caching instances. OK, now that we have gotten the basics out of the way, we can move on to the good stuff; stay with me, I promise it will get better.

Step 2 - Adding Orders to the Mix

Keeping in mind that we are assuming to be working with a flat, legacy data format, we are now going to tackle importing the Orders. The relevant parts of our import file are now:

ID,Name,OrderNumber
42,Acme Inc,1234
87,Foo Corp,5555
31,Foo Corp,1234
28,Acme Inc,5678

As you can see, order number 1234 is repeated showing that order numbers are not unique. When we add these orders to our system, we want to make sure we have quick access to the new database IDs for the orders, not just the order number. Assume our system associates Orders to Customers through a simple foreign key.

public class Order
{
    public int CustomerID { get; set; } // foreign key
    public int OrderID { get; set; } // primary key
    public string OrderNumber { get; set; }
}

Since we need to store an association, we must expand the DataLoadCache class to handle ID lookups.

public static class DataLoadCacheV2<T>
{
    private static Dictionary<string, int> _nameIDMaps = 
            new Dictionary<string, int>();

    public static bool ContainsIDByName(string name)
    {
        return _nameIDMaps.ContainsKey(name);
    }

    public static void StoreIDByName(string name, int ID)
    {
        _nameIDMaps.Add(name, ID);
    }

    public static int GetIDByName(string name)
    {
        return _namesIDMaps[name];
    }
}

Storing just the customer names is irrelevant now, so the methods from DataLoadCacheV1 have been replaced. We now store the actual database ID of the Customer we added. Now we have knowledge of what customer names have been processed and what the corresponding IDs are. This comes in handy when adding the Orders to the database.

IDataReader reader = ... // open data source 
while (reader.Read())
{
    string custName = reader.GetString(1);
    if (!DataLoadCacheV2<Customer>.ContainsIDByName(custName))
    {
        int newCustID = // ...create, store, and retrieve new customer in DB...
        DataLoadCacheV2<Customer>.StoreIdByName(custName, newCustID);
    }
    int custID = DataLoadCacheV2<Customer>.GetIDByName(custName);

    Order order = new Order()
    {
        CustomerID = custID,
        OrderNumber = ...
    };
    // Add order to DB....
}

Step 3 - Adding a One to Many Relationship

The last example showed why it's important to store the ID of the entity we are importing, not just the fact that it had already been processed. Since we know there are OrderLineItems on an Order, we need to add a way to look up the OrderID we just added.

public static class DataLoadCacheV3<T, U>
{
    private static Dictionary<int, Dictionary<string, int>> _foreignIDsByName =
        new Dictionary<int, Dictionary<string, int>>();

    public static void StoreForeignIDByName(int primaryID, 
                       string foreignName, int foreignID)
    {
        _foreignIDsByName[primaryID].Add(foreignName, foreignID);
    }

    public static bool ContainsForeignIDByName(int primaryID, string foreignName)
    {
        return _foreignIDsByName.ContainsKey(primaryID) 
            && _foreignIDsByName[primaryID].ContainsKey(foreignName);
    }

    public static int GetForeignIDByName(int primaryID, string foreignName)
    {
        return _foreignIDsByName[primaryID][foreignName];
    }
}

All we are doing here is tracking the associations between the string representations of the data we have (CustomerName and OrderNumber) and the new database IDs we created. The example below will make this more clear:

IDataReader reader = ... // open data reader
while (reader.Read())
{
    string custName = reader.GetString(1);
    if (!DataLoadCacheV3<Customer>.ContainsIDByName(custName))
    {
        int newCustID = ... // add customer to DB, get ID
        DataLoadCacheV3<Customer>.StoreIDByName(custName, newCustID);
    }
    int custID = DataLoadCacheV3<Customer>.GetIDByName(custName);
    string orderNumber = reader.GetString(2);
    if (!DataLoadCacheV3<Customer, Order>.ContainsForeignIDByName(custID, orderNumber))
    {
        int newOrderID = ... // add order to DB, get ID...
        DataLoadCacheV3<Customer, Order>.StoreForeignIDByName(custID, 
                                         orderNumber, newOrderID);
    }
    int orderID = DataLoadCacheV3<Customer, Order>.GetForeignIDByName(custID, orderNumber);
}

By adding a Dictionary of Dictionarys to the DataLoadCache, we are allowing multiple layers of mapping to take place. The DataLoadCache class doesn't have to worry about what kinds of names and IDs we are passing, because the generic type parameters ensure we will get a unique static class for each combination of types we use.

A Quick Note

I realize that up to this point much of this has been boring and tedious. If you have gotten this far and you are confused, I encourage you to start at the beginning and try to see how using the generic types in the last example opens up a world of possibilities and takes the burden of maintaining the various cache Dictionary instances yourself. By simply writing the line of code: DataLoadCachev3<Foo,Bar>, you create two custom lookup mechanisms (plus many more to come). Not only that, but the code is very readable, and there can be no confusion that you are relating Foos to Bars, unlike a local variable which can have any name.

Step 4 - Tracking Sets of IDs

The previous example assumed a given OrderNumber would only appear once per customer. When we start to consider OrderLineItems in our flat data format, we see this assumption is not valid. Again, only showing the relevant parts of the data file:

ID,Name,OrderNumber,LineItemNumber,Quantity
42,Acme Inc,1234,1,53
42,Acme Inc,1234,2,91
31,Foo Corp,1234,2,62
31,Foo Corp,1234,1,88
42,Acme Inc,1234,3,57

And our existing OrderLineItem class:

public class OrderLineItem
{
    public int OrderID { get; set; } // foreign key
    public int OrderLineItemID { get; set; } // primary key
    public string LineItemNumber { get; set; }
    public int Quantity { get; set; }
    public double Discount {get; set; }
}

Let's assume for a moment that some sort of further processing is required on the OrderLineItems. It could be another import file, or just setting some post-import data. To do it right, we need to keep track of what OrderLineItem IDs belong to which order. This is easily solved by adding the ability to handle Lists of IDs to DataLoadCache.

public static class DataLoadCacheV4<T, U>
{
    private static Dictionary<int, List<int>> _foreignIDLists =
        new Dictionary<int, List<int>>();

    public static void StoreForeignIDInList(int primaryID, int foreignID)
    {
        if (!_foreignIDLists.ContainsKey(primaryID))
        {
            _foreignIDLists.Add(primaryID, new List<int>());
        }
        _foreignIDLists[primaryID].Add(foreignID);
    }

    public static bool ContainsForeignIDInList(int primaryID, int foreignID)
    {
        return _foreignIDLists.ContainsKey(primaryID) && 
               _foreignIDLists[primaryID].Contains(foreignID);
    }

    public static List<int> GetForeignIDsInList(int primaryID)
    {
        return _foreignIDLists[primaryID];

Step 5 - A Complete Example

Hopefully by now, I have done a good enough job getting the point across. This final example shows a full implementation of the DataLoadCache class. Obviously, the methods in the class are specialized for this article, but you should see how easily you can create new methods to handle almost any relationship between your data classes.

The example uses two import sources to show how you can solve difficult situations with ease. Assume the second file contains discounts, but they are broken down by company branch which we do not track (see example 1). Also, we track discounts on a line item basis, so we will need to break up the discount amount equally among the line items. Just to make it more fun, assume they are not giving us the company name and just the branch ID.

First Data File
ID,Name,OrderNumber,LineItemNumber,Quantity
42,Acme Inc,1234,1,53
42,Acme Inc,1234,2,91
31,Foo Corp,1234,2,62
31,Foo Corp,1234,1,88
42,Acme Inc,1234,3,57

Second Data File
ID,OrderNumber,Discount
42,1234,50
99,5678,67
31,1234,31
28,1234,10

-

public static class DataLoadCacheV5<T, U>
{
    // Simple text name to database ID mappings
    private static Dictionary<string, int> _namesIDMaps = 
            new Dictionary<string, int>();

    public static bool ContainsIDByName(string name)
    {
        return _namesIDMaps.ContainsKey(name);
    }

    public static void StoreIDByName(string name, int id)
    {
        _namesIDMaps.Add(name, id);
    }

    public static int GetIDByName(string name)
    {
        return _namesIDMaps[name];
    }

    // Simple external to local ID mapping
    private static Dictionary<int, int> _externalIDMaps = 
            new Dictionary<int, int>();

    public static void StoreIDByExternalID(int externalID, int primaryID)
    {
        _externalIDMaps[externalID] = primaryID;
    }

    public static bool ContainsIDByExternalID(int externalID)
    {
        return _externalIDMaps.ContainsKey(externalID);
    }

    public static int GetIDByExternalID(int externalID)
    {
        return _externalIDMaps[externalID];
    }

    // Mapping a "foreign name" within the context a unique primary entity
    // (primaryID = CustomerID, foreignName = OrderNumber, foreignID = OrderID)
    private static Dictionary<int, Dictionary<string, int>> 
      _foreignIdsByName = new Dictionary<int, Dictionary<string, int>>();

    public static void StoreForeignIDByName(int primaryID, 
                       string foreignName, int foreignID)
    {
        _foreignIdsByName[primaryID].Add(foreignName, foreignID);
    }

    public static bool ContainsForeignIDByName(int primaryID, string foreignName)
    {
        return _foreignIdsByName.ContainsKey(primaryID)
            && _foreignIdsByName[primaryID].ContainsKey(foreignName);
    }

    public static int GetForeignIdByName(int primaryID, string foreignName)
    {
        return _foreignIdsByName[primaryID][foreignName];
    }

    // Track lists of foreign keys so we don't have to do the DB lookup later
    private static Dictionary<int, List<int>> _foreignIDLists =
        new Dictionary<int, List<int>>();

    public static void StoreForeignIDInList(int primaryID, int foreignID)
    {
        if (!_foreignIDLists.ContainsKey(primaryID))
        {
            _foreignIDLists.Add(primaryID, new List<int>());
        }
        _foreignIDLists[primaryID].Add(foreignID);
    }

    public static bool ContainsForeignIDInList(int primaryID, int foreignID)
    {
        return _foreignIDLists.ContainsKey(primaryID) && 
               _foreignIDLists[primaryID].Contains(foreignID);
    }

    public static List<int> GetForeignIDsInList(int primaryID)
    {
        return _foreignIDLists[primaryID];
    }
}

-

public void TestData5()
{
    IDataReader reader = ... // open 1st file
    while (reader.Read())
    {
        int branchID = reader.GetInt32(0);
        string custName = reader.GetString(1);
        if (!DataLoadCacheV5<Customer,Customer>.ContainsIDByName(custName))
        {
            int newCustID = ... // add customer to DB, get ID...
            DataLoadCacheV5<Customer,Customer>.StoreIDByName(custName, newCustID);
        }
        int custID = DataLoadCacheV5<Customer,Customer>.GetIDByName(custName);

        if (!DataLoadCacheV5<Customer,Customer>.ContainsIDByExternalID(branchID))
        {
            DataLoadCacheV5<Customer,Customer>.StoreIDByExternalID(branchID, custID);
        }

        string orderNumber = reader.GetString(5);
        if (!DataLoadCacheV5<Customer, Order>.ContainsForeignIDByName(custID, orderNumber))
        {
            int newOrderID = ... // add order to DB, get ID...
            DataLoadCacheV5<Customer,Order>.StoreForeignIDByName(
                     custID, orderNumber, newOrderID);
        }
        int orderID = DataLoadCacheV5<Customer,Order>.GetForeignIdByName(custID, orderNumber);

        string orderLineItemNumber = reader.GetString(6);
        if (!DataLoadCacheV5<Order,OrderLineItem>.ContainsForeignIDByName(
                 orderID, orderLineItemNumber))
        {
            int newOrderLiID = ... // add order line item to DB, get ID...
            DataLoadCacheV5<Order, OrderLineItem>.StoreForeignIDByName(
                      orderID, orderLineItemNumber, newOrderLiID);
        }
        int orderLineItemID = 
          DataLoadCacheV5<Order, OrderLineItem>.GetForeignIdByName(
          orderID, orderLineItemNumber);
        DataLoadCacheV5<Order, OrderLineItem>.StoreForeignIDInList(
          orderID, orderLineItemID);
    }
    
     // Here is where you will see the payoff (about time, I know)
    IDataReader reader2 = ... // open 2nd file
    while (reader2.Read())
    {
        int branchID = reader2.GetInt32(0);
        int custID = DataLoadCacheV5<Customer, Customer>.GetIDByExternalID(branchID);
        string orderNumber = reader2.GetString(1);
        int orderID = 
          DataLoadCacheV5<Customer, Order>.GetForeignIdByName(custID, orderNumber);
        List<int> orderLineItemIDs = DataLoadCacheV5<Order, 
                        OrderLineItem>.GetForeignIDsInList(orderID);

        double totalDiscount = reader.GetDouble(3);
        double discountPart = totalDiscount / (double)orderLineItemIDs.Count;
        foreach (int orderLineItemID in orderLineItemIDs)
        {
            // ...update DB or ORM object with discountPart...
        }
    }
}

Conclusion

It's important to note that every time you use the DataLoadCache class with different types, you are actually creating a completely new class. It's basically the same as if you hand typed all the specialized versions you would need (Customer-cache, Order-cache, etc,). If you do any kind of data importing, or even sequential processing, I hope you find this article useful.

If you are asking yourself "What's the point, why don't I just stick the data in the database and query it when I need it?", I ask you to consider if that solution scales and how well it will work on a slow network connection. My solution attempts to utilize the resources on the client machine (namely processing power and RAM) to keep the calls to the database down to a minimum.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Josh Fischer
Architect
United States United States
CodeProject MVP 2010
CodeProject prize winner - Best C# article of December 2009


You may also be interested in...

Pro
Pro

Comments and Discussions

 
Generalnice have5 Pin
Pranay Rana2-Jan-11 20:11
memberPranay Rana2-Jan-11 20:11 
GeneralMy vote of 5 Pin
Member 222628718-Jul-10 22:00
memberMember 222628718-Jul-10 22:00 
GeneralRe: My vote of 5 Pin
Josh Fischer19-Jul-10 9:37
mvpJosh Fischer19-Jul-10 9:37 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.161128.1 | Last Updated 7 Jul 2010
Article Copyright 2010 by Josh Fischer
Everything else Copyright © CodeProject, 1999-2016
Layout: fixed | fluid