Click here to Skip to main content
Click here to Skip to main content

Junk Drawer

, 28 May 2014
Rate this:
Please Sign up or sign in to vote.
Importing a File to a Database Table in One Step
Introduction

analyst: "Is there something that just automatically imports files into a database?"

nerd: "No. Use the data import wizard."

The data analyst sighed.. remembering the wizard...

Using the wizard to import a text file goes something like this:

  • Install SQL Server Management Studio.
  • Find Tasks and choose Import Data.
  • Select "Flat File Source."
  • Browse for the file.
  • Preview the data.
  • Specify the delimiter.
  • Specify if the first row is column names.
  • Preview the data (again).
  • Go to each column and choose the correct data type or use "Suggest Types" feature.
  • Choose if you want to save the SSIS package for later.
  • Execute it

nerd: "The wizard helps you import any kind of file."

analyst: "That's great. But, it would be better if I didn't have to answer so many questions and the program just analyzed and imported the file for me."

nerd: "You're dreaming. You have to use the wizard. Go back to your cube. Press the keys. Click the mouse. Do this until the Wizard says it successfully imported the file. If you get the fail message, you did something wrong. Respond to the error message and go back and fix it."

analyst: "I get a lot of slightly different files. Using the wizard is repetitive. This wastes a lot of my time..."

At this, the nerd put the data analyst in a head lock. Balling up his fist, he pressed his knuckles hard against the analyst's head; rubbing back and forth, causing a great deal of painful friction.

nerd: "You come to my cube, without a ticket, complaining about YOUR time being wasted?"

The nerd pushed the analyst away from him. As the analyst fell forward, the nerd Tae Kwon Doe'd him squarely in the buttocks.

nerd: "Now get out of here!"

Humiliated, and unable to match the nerd's mightiness, the data analysts sulked back to his cubicle.

---

Sadly, this scenario happens a lot in IT offices.

Just recently, while forcing an IT staff member to learn SQL, he asked me something similar:

staff: "How do I get the files into the database so I can query them?"

I began to explain the data import wizard (as described above), but it didn't feel right. I felt bad that he'd have to run the wizard every time. I knew it would take him forever. Also, I knew the conversation would most likely end in a destructive battle (as in the scenario depicted above). So, I thought there might be a better way.

---

I always tell people, that if you can think of it, then chances are it already exists. You just have to Google for it. I should have followed my own advice, but I didn't. See, I'm a programmer. I have these times where all I want to do is program. I'm not even a great programmer (as indicated by the comments below), but, I just love it, and I don't care what anyone thinks because it makes me happy, happy, happy.

So, I decided to bypass the Google search and create an open source project called Junk Drawer. As it turns out, the name Junk Drawer is already used for a variety of things, but it's too late to change the name. Anyway, its goal is to make the majority of Excel or text file importing a one step, no-brainer process. It should also reduce the likelihood of terrible brawls between data analysts and nerds.

Requirements

I don't want staff members to need a copy of Junk Drawer on their computer. I just want them to drop a file on a network share and have it imported. For myself, since I have Junk Drawer, I want to be able to right-click and use the Open With option. For both requirements, all I need is a .NET Console application. I can enable the file monitoring and action trigger with job automation software (i.e. Visual Cron).

A Demo

Before you run Junk Drawer for the first time, make sure you create a "Junk" database for all the files you're going to import. Then, put a SQL Server connection string in the configuration file jd.exe.config.

<transformalize>
    <processes>
        <add name="JunkDrawer">
            <connections>
                <add name="output" connection-string="connection-string" />
            </connections>
        </add>
    </processes>
</transformalize> 

Here are the contents of a sample text file:

Name,Birthday,Points
Dale,3/3/1981 9 AM,73
Tara,12/31/1990,1042
Grace,9/9/2000 11 PM,56
Gavin,7/3/2010,13        

To import it, just run:

jd c:\sample.txt 

Junk Drawer (jd.exe) imports the file, and now I can go query it:

SELECT Name, Birthday, Points FROM sample;  
Name  Birthday                Points
----- ----------------------- -----------
Dale  1981-03-03 09:00:00.000 73
Gavin 2010-07-03 00:00:00.000 13
Tara  1990-12-31 00:00:00.000 1042
Grace 2000-09-09 23:00:00.000 56    

The table structure looks something like this:

CREATE TABLE sample(
    BirthDay DATETIME,
    Name NVARCHAR(5),
    Points INT
);    

How Does it Work?

To be able to import a text file into a database, JD has to figure out three things:

  1. the delimiter
  2. the column names (if available)
  3. the column data types

For Excel files, you can skip the first step.

Delimiters

First, a number of lines are loaded from the file. Then, popular delimiters are counted in each line. If the number of delimiters is greater than zero, and the same number of delimiters is found in each line, then that delimiter may be the delimiter. That is to say, if the same number of commas are found in each line, then odds are this is a comma delimited file.

Column Names

To determine column names, the first row is split by the delimiter and tested. We don't know if it contains column names, or it is merely the first record in the file. So, I run these tests:

  • Are there any duplicate field values?
  • Are there any empty values?
  • Are there any white space values?
  • Are there any numeric values?
  • Are there any date time values?

If any of the answers are "Yes," then the first line cannot be used as column names. If this happens, default column names are generated (i.e. A, B, C, etc.). In the example above; Name, Birthday, and Points answer "No" to all these questions and make good column names.

Data Types

Rather than write new code to efficiently run through a file and check data types, I have elected to use a library called Transformalize. The guy who wrote it is AWESOME! Well, he's not that awesome, but it's based on Rhino ETL, which was created by Ayende, who actually IS awesome! So, it inherits a degree of awesomeness.

Given that we know the file location, the delimiter, and the column names, I am able to "configure" (not code) Transformalize to load all the records and perform "type conversion" validation on every value.

If every value in the column passes a data type validation, then I use that data type for database storage. For example, if all values in a column respond "Yes," to the question "Are you an integer?," then I store it in an INTEGER data type. If there are mixed or failing validation results, I default to use NVARCHAR(x). Anything goes in an NVARCHAR data type. It's for a variable number of Unicode characters, where x is set to the maximum length found in the column.

The data types are the final piece of information we need to import the file into a database. Again, instead of writing new code, I configure Transformalize to take care of importing the file into the database.

Some Code?

There's nothing spectacular in the source code. But, this site is called CodeProject, so, its kind of frowned upon to not include code in your article. So, we'll just take a look at it's entry point.

public class Program {

    static void Main(string[] args) {

        GlobalDiagnosticsContext.Set("process", Common.LogLength("JunkDrawer", 16));
        GlobalDiagnosticsContext.Set("entity", Common.LogLength("Request"));

        var request = new Request(args);

        if (!request.IsValid) {
            LogManager.GetLogger(string.Empty).Error(request.Message);
            Environment.Exit(1);
        }

        new FileImporter().Import(request.FileInfo);
    }
}

There's three things happening:

  1. I'm setting some global variables for NLog.
  2. I'm validating the Request:
    1. Did the user pass in a file name?
    2. Is the file there?
    3. Is it in use? I give up after 5 tries.
  3. I'm importing the file with FileImporter. Note: It's in its own library so you can reference and use it in your program if you want.

If you're curious at all, you're welcome to explore the source code. More likely, the only time you'll want to look at the source code is if you run Junk Drawer and it doesn't import your file right. Like, for example, I'm only searching for commas, pipes, semi-colons, and tabs for delimiters... You might have files that are delimited by 6's or something, I don't know. So, you could create a fork on GitHub and add "6" in the list of delimiters to check for, or better yet, make it configurable!

Conclusion

Well that's the Junk Drawer in a nut-shell. I called it Junk Drawer because importing files directly into a database can get messy. It ends up an uncontrolled staging area for data. Nevertheless, it empowers your trusted friends to put some data in there and run ad-hoc queries until their heart's content.

Now if you're the acting DBA, and you put your "Junk" database on an important server, make sure you have monitors in place for disk space, CPU abuse, and excessive resource blocking. You should have this all setup regardless if you have Junk Drawer running, but especially if you do. Remember to control access to both your network share and your Junk database with Active Directory groups. Smile | :)

Thanks for reading.

Updates

More Configuration

Since the original article, I added more configuration. Here's a sample:

  <junkdrawer>
    <types sample="100"
           default="string"
           default-length="1024">
      <add type="boolean"/>
      <add type="byte"/>
      <add type="int16"/>
      <add type="int32"/>
      <add type="int64"/>
      <add type="single"/>
      <add type="double"/>
      <add type="datetime"/>
    </types>
    <delimiters>
      <add name="comma" character=","/>
      <add name="pipe" character="|"/>
      <add name="tab" character="&#009;"/>
      <add name="semicolon" character=";"/>
    </delimiters>
  </junkdrawer>   

The types collection gives you control of what data types are checked for. You may remove them all. If you do, every field will assume the default type (e.g. string). This can be useful it you don't care about types, or are troubleshooting.

The delimiters collection gives you control over what delimiters are checked for. Currently, only single characters are supported.

The sample attribute allows you to reduce the amount of data type inspection. The lower the percentage, the fewer records (and fields) get checked. You may set this to 100, or remove it altogether if you want to check everything (the original behavior).

Bug Fixes

2014-05-28 Handle .xls (older binary excel files) correctly.

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

Share

About the Author

dale.newman
Software Developer (Senior)
United States United States
No Biography provided
Follow on   Twitter   LinkedIn

Comments and Discussions

 
GeneralGreat article Pinmemberjoelcarroll29-May-14 9:46 
GeneralGood article, nicely written Pinmemberwicklowwanderer29-May-14 6:46 
BugGreat work but had a few bugs PinmemberMattPenner27-Mar-14 9:10 
GeneralRe: Great work but had a few bugs Pinpremiumdale.newman28-Mar-14 17:36 
QuestionUnspecified Error Pinmembermwalsh6225-Mar-14 10:54 
AnswerRe: Unspecified Error Pinpremiumdale.newman25-Mar-14 10:59 
AnswerRe: Unspecified Error Pinpremiumdale.newman31-Mar-14 5:10 
Questionsource zip is corrupted Pinmembercapacitas.sgl23-Mar-14 0:56 
AnswerRe: source zip is corrupted Pinpremiumdale.newman28-Mar-14 17:33 
AnswerRe: source zip is corrupted Pinpremiumdale.newman31-Mar-14 5:11 
GeneralMy vote of 5 PinmemberMember 982548822-Mar-14 7:03 
QuestionCan't find process PinmemberMember 104203445-Mar-14 7:02 
AnswerRe: Can't find process Pinmemberdale.newman5-Mar-14 16:25 
GeneralRe: Can't find process PinmemberMember 104203446-Mar-14 5:19 
GeneralMy vote of 4 Pinmemberectisch6-Feb-14 6:01 
GeneralRe: My vote of 4 Pinmemberdale.newman6-Feb-14 11:29 
QuestionSpiffy Pinmembernitewulf504-Feb-14 8:49 
QuestionMy vote of - uh, I forget Pinmemberdmjm-h3-Feb-14 10:04 
AnswerRe: My vote of - uh, I forget Pinmemberdale.newman3-Feb-14 10:25 
QuestionHow do I vote for this article? Pinmembermrjava662-Feb-14 10:31 
AnswerRe: How do I vote for this article? Pinmembermrjava662-Feb-14 10:59 
GeneralRe: How do I vote for this article? Pinmemberdale.newman3-Feb-14 4:40 
QuestionQuestion about delimiters PinprotectorMarc Clifton29-Jan-14 5:46 
AnswerRe: Question about delimiters Pinmemberdale.newman29-Jan-14 6:45 
AnswerRe: Question about delimiters Pinmemberdale.newman31-Jan-14 4:20 
QuestionJD could not connect to the server Pinprofessionaldumnovs28-Jan-14 23:16 
AnswerRe: JD could not connect to the server Pinmemberdale.newman29-Jan-14 4:09 
GeneralRe: JD could not connect to the server Pinprofessionaldumnovs29-Jan-14 12:47 
GeneralRe: JD could not connect to the server Pinmemberdale.newman29-Jan-14 12:53 
Questionsource file cannot be downloaded Pinmemberfredatcodeproject26-Jan-14 22:48 
AnswerRe: source file cannot be downloaded Pinmemberdale.newman28-Jan-14 17:54 
GeneralRe: source file cannot be downloaded Pinmemberfredatcodeproject29-Jan-14 2:20 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web03 | 2.8.140821.2 | Last Updated 28 May 2014
Article Copyright 2014 by dale.newman
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid