Click here to Skip to main content
11,576,064 members (57,932 online)
Click here to Skip to main content

Junk Drawer

, 18 Mar 2015 GPL3 28.5K 1.5K 62
Rate this:
Please Sign up or sign in to vote.
Importing Excel & Delimited Files into a Junk Database.

JunkDrawer is a tool that imports excel or delimited files into a database. It is open source under GNU General Public License, version 3 (GPL-3.0).

Introduction

analyst: "Is there something that just automatically imports files into a database?"

programmer: "No. You have to use the data import wizard."

The data analyst sighed as he recalled the wizard.

SQL Server Import Wizard

Using the wizard to import a text file goes something like this:

  • Install SQL Server Management Studio.
  • Find Tasks and choose Import Data.
  • Select "Flat File Source."
  • Browse for the file.
  • Preview the data.
  • Specify the delimiter.
  • Specify if the first row is column names.
  • Preview the data (again).
  • Go to each column and choose the correct data type or use "Suggest Types" feature.
  • Choose if you want to save the SSIS package for later.
  • Execute it

programmer: "The wizard helps you import any kind of file."

analyst: "That's great. But, it would be better if I didn't have to answer so many questions and the program just figured it out and imported the file."

programmer: "Sorry. You have to follow the wizard. If you get an error message, fix the problem and try again."

analyst: "I get a lot of different files. Using the wizard is repetitive. This wastes my time."

At this, the programmer started shouting assembly language and put the data analyst in a head lock. Balling up his fist, he pressed his knuckles against the analyst's head; rubbing back and forth.

programmer: "You come to my cube, without a ticket, complaining about YOUR time being wasted?"


Sadly, this scenario happens a lot in IT offices.

Just recently, while forcing an staff member to learn SQL, he asked me something similar:

staff: "SQL is great, but how do I get these files into the database?"

I began to explain the data import wizard, but it didn't feel right. I felt bad that he'd have to run the wizard every time he wanted to query data. I knew it would take him forever, and he'd probably give up. Also, I knew the conversation would most likely end in a head-lock (as depicted above). So, instead of giving him the beat down, I tried to think of another way.


I tell people that if you can think of it, then it already exists. You just have to Google it. I should have followed my own advice, but I didn't. See, I'm a programmer. I have these times where all I want to do is program. I'm not even a great programmer (as indicated by the comments below), but I just love it, and I don't care what anyone thinks because it makes me happy, happy, happy.

So, I decided to bypass Google and create an open source project called Junk Drawer. The goal is to make importing an Excel or text file to a database easier.

Configuration

Before you run Junk Drawer for the first time, make sure you create a Junk database for all the files you're going to import. Then, update Junk Drawer's configuration file default.xml.

<junk-drawer>
  <connections>
    <add name="output" 
         provider="sqlserver" 
        server="localhost" 
        database="Junk" />
  </connections>
  <!-- more later -->
</junk-drawer>

Note: For a connection, you may use the server and database attributes, or one called connection-string. This lets Junk Drawer know where you want to keep your junk.

Get a File

Now, in order for Junk Drawer to work, the file must be either an Excel file, or some kind of delimited text file.

Here's a sample text file:

Name,Birthday,Points
Dale,3/3/1981 9 AM,73
Tara,12/31/1990,1042
Grace,9/9/2000 11 PM,56
Gavin,7/3/2010,13

If we save this (above) in a file called sample.txt, we could import it from the command line like this:

jd.exe c:\sample.txt

Junk Drawer (jd.exe) imports the file, and now it can be queried:

USE Junk
SELECT Name, Birthday, Points 
FROM sample;
Name  Birthday            Points
----- ------------------- -----------
Dale  1981-03-03 09:00:00 73
Gavin 2010-07-03 00:00:00 13
Tara  1990-12-31 00:00:00 1042
Grace 2000-09-09 23:00:00 56

The table structure looks similar to this:

CREATE TABLE sample(
    BirthDay DATETIME,
    Name NVARCHAR(5),
    Points INT
);

How Does it Work?

When we see sample text above, it's easy for us to notice the first row is different. We recognize it as a set of column names. The lines that follow are records.

Because there are only a few columns (3), and a few records (4), it's easy for us to see that the comma is delimiting the values in each record. Moreover, we can see Name is text, Birthday is a date, and Points is numeric.

Junk Drawer just has to do the same thing as we do. It has to figure out three things:

  1. the delimiter
  2. the column names (if available)
  3. the column data types

For Excel files, you can skip the first step.

Finding the Delimiter

By default, 100 lines are examined. A set of pre-defined delimiters are counted in each line. If any delimiters are found, the average per line and standard deviation are calculated.

Then, the delimiter with the lowest coefficient of variation is declared the winner.

I'm no statistician, but from what I gather, taking the delimiter with the lowest coefficient of variation gives us the most consistent delimiter across all lines.

Column Names

We don't know if the first record is column names, or just another record. So, it is split by the winning delimiter and run through a series of tests:

  • Are there any duplicate values?
  • Are there any empty values?
  • Are there any white space values?
  • Are there any numbers?
  • Are there any dates?

If any of the answers are "Yes," then the first line cannot be used as column names. If this happens, default column names are generated (i.e. A, B, C, etc. like Excel). In the example above; Name, Birthday, and Points answer "No" to all the questions, so they make good column names.

Data Types

By default, every value of every line is tested against a set of types defined in default.xml.

<junk-drawer>
  <!-- connections -->
  <file-inspection>
    <add name="default" sample="100">
      <!-- the pre-defined set of types -->
        <types>
        <add type="boolean"/>
        <add type="int32"/>
        <add type="int64"/>
        <add type="decimal"/>
        <add type="datetime"/>
      </types>
      <!-- delimiters -->
    </add>
  </file-inspection>
</junk-drawer>

Take Points for example:

  • Is 73 a boolean? No
  • Is 73 an int32? Yes
  • Is 13 an int32? Yes
  • Is 1042 an int32? Yes
  • Is 56 an int32? Yes

So, Points is compatible with an int32, and it is stored as an INTEGER in the database.

If there are mixed or failing validation results, like a combination of numbers and strings, I default to a string, which is then stored as an NVARCHAR(x) in the database.

If you want to increase the speed of data type validation, at the cost of accuracy, change the sample size to something less than 100 (100%).

Once all the values are checked, Junk Drawer has a compatible set of data types. The data types are the final piece of information we need in order to import the file.

Doing it in Code

JunkDrawer comes with a library and an executable. If you want to use the library, you can reference JunkDrawer.dll (and Transformalize.dll) and run it like this:

var cfg = new JunkCfg(File.ReadAllText(@"default.xml"));
var request = new Request(@"sample.txt", cfg);
var response = new JunkImporter().Import(request);

Console.WriteLine("Table: {0}", response.TableName);
Console.WriteLine("Records: {0}", response.Records);

This should produce output like this:

Table: sample
Records: 4

If you want to name your table something specific, you can set TableName in the Request.

Note: Both the executable and the libary need a configuration. I loaded the default.xml configuration above. However, you can make as many different configurations as you want.

To use a custom configuration with the executable, pass the configuration file in as the second argument (i.e. jd sample.txt other.xml). If you want to know more about the configuration provider, check out my Cfg-NET article.

Precautions

Overwriting Tables

If you import the same file into Junk Drawer twice, it will overwrite the table. So, if you run jd.exe sample.txt and it creates a sample table, then you run jd.exe sample.txt again, it will over-write the sample table. Don't worry though; it's only your junk.

Things Might Get Crazy

I called it Junk Drawer because importing files directly into a database can get messy. It may end up an uncontrolled staging area for data. If you're the acting DBA, and you put your Junk on an important server, make sure you have monitors in place for disk space, CPU abuse, and excessive resource blocking.

Conclusion

Once in place, Junk Drawer can empower your trusted friends to import their data into a Junk database and run ad-hoc queries until their heart's content.

Of course there are going to be files that are so messed up, that JunkDrawer won't be able to make any sense of them. In that case, you'll have to resort to shouting, head-locks, and noogies.

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

Share

About the Author

dale.newman
Software Developer (Senior)
United States United States
No Biography provided

You may also be interested in...

Comments and Discussions

 
QuestionThanks! Pin
paul000919-Mar-15 16:26
memberpaul000919-Mar-15 16:26 
AnswerRe: Thanks! Pin
dale.newman20-Mar-15 10:28
memberdale.newman20-Mar-15 10:28 
QuestionMultiple Tabs Pin
SteveHolle18-Mar-15 12:18
memberSteveHolle18-Mar-15 12:18 
AnswerRe: Multiple Tabs Pin
dale.newman19-Mar-15 8:26
memberdale.newman19-Mar-15 8:26 
QuestionJunk Drawer - good stuff Pin
Member 1092409215-Dec-14 5:31
memberMember 1092409215-Dec-14 5:31 
AnswerRe: Junk Drawer - good stuff Pin
dale.newman16-Dec-14 3:16
memberdale.newman16-Dec-14 3:16 
SuggestionIs it possible to name the table based on the file name? Pin
MattPenner12-Dec-14 11:34
memberMattPenner12-Dec-14 11:34 
GeneralRe: Is it possible to name the table based on the file name? Pin
dale.newman16-Dec-14 3:15
memberdale.newman16-Dec-14 3:15 
GeneralRe: Is it possible to name the table based on the file name? Pin
dale.newman19-Mar-15 8:24
memberdale.newman19-Mar-15 8:24 
QuestionCan't find Transformalize.Libs.NLog Pin
MattPenner10-Dec-14 14:10
memberMattPenner10-Dec-14 14:10 
AnswerRe: Can't find Transformalize.Libs.NLog Pin
dale.newman11-Dec-14 5:16
memberdale.newman11-Dec-14 5:16 
Questionscientific notation on large numbers Pin
Mike Pelikan17-Sep-14 7:34
memberMike Pelikan17-Sep-14 7:34 
AnswerRe: scientific notation on large numbers Pin
dale.newman23-Sep-14 5:42
memberdale.newman23-Sep-14 5:42 
GeneralGreat article Pin
joelcarroll29-May-14 9:46
memberjoelcarroll29-May-14 9:46 
GeneralGood article, nicely written Pin
wicklowwanderer29-May-14 6:46
memberwicklowwanderer29-May-14 6:46 
BugGreat work but had a few bugs Pin
MattPenner27-Mar-14 9:10
memberMattPenner27-Mar-14 9:10 
GeneralRe: Great work but had a few bugs Pin
dale.newman28-Mar-14 17:36
memberdale.newman28-Mar-14 17:36 
QuestionUnspecified Error Pin
mwalsh6225-Mar-14 10:54
membermwalsh6225-Mar-14 10:54 
AnswerRe: Unspecified Error Pin
dale.newman25-Mar-14 10:59
memberdale.newman25-Mar-14 10:59 
AnswerRe: Unspecified Error Pin
dale.newman31-Mar-14 5:10
memberdale.newman31-Mar-14 5:10 
Questionsource zip is corrupted Pin
capacitas.sgl23-Mar-14 0:56
membercapacitas.sgl23-Mar-14 0:56 
AnswerRe: source zip is corrupted Pin
dale.newman28-Mar-14 17:33
memberdale.newman28-Mar-14 17:33 
AnswerRe: source zip is corrupted Pin
dale.newman31-Mar-14 5:11
memberdale.newman31-Mar-14 5:11 
GeneralMy vote of 5 Pin
Member 982548822-Mar-14 7:03
memberMember 982548822-Mar-14 7:03 
QuestionCan't find process Pin
Member 104203445-Mar-14 7:02
memberMember 104203445-Mar-14 7:02 
AnswerRe: Can't find process Pin
dale.newman5-Mar-14 16:25
memberdale.newman5-Mar-14 16:25 
GeneralRe: Can't find process Pin
Member 104203446-Mar-14 5:19
memberMember 104203446-Mar-14 5:19 
GeneralMy vote of 4 Pin
ectisch6-Feb-14 6:01
memberectisch6-Feb-14 6:01 
GeneralRe: My vote of 4 Pin
dale.newman6-Feb-14 11:29
memberdale.newman6-Feb-14 11:29 
QuestionSpiffy Pin
nitewulf504-Feb-14 8:49
membernitewulf504-Feb-14 8:49 
QuestionMy vote of - uh, I forget Pin
dmjm-h3-Feb-14 10:04
memberdmjm-h3-Feb-14 10:04 
AnswerRe: My vote of - uh, I forget Pin
dale.newman3-Feb-14 10:25
memberdale.newman3-Feb-14 10:25 
QuestionHow do I vote for this article? Pin
mrjava662-Feb-14 10:31
membermrjava662-Feb-14 10:31 
AnswerRe: How do I vote for this article? Pin
mrjava662-Feb-14 10:59
membermrjava662-Feb-14 10:59 
GeneralRe: How do I vote for this article? Pin
dale.newman3-Feb-14 4:40
memberdale.newman3-Feb-14 4:40 
QuestionQuestion about delimiters Pin
Marc Clifton29-Jan-14 5:46
protectorMarc Clifton29-Jan-14 5:46 
AnswerRe: Question about delimiters Pin
dale.newman29-Jan-14 6:45
memberdale.newman29-Jan-14 6:45 
AnswerRe: Question about delimiters Pin
dale.newman31-Jan-14 4:20
memberdale.newman31-Jan-14 4:20 
AnswerRe: Question about delimiters Pin
SteveHolle11-Dec-14 11:37
memberSteveHolle11-Dec-14 11:37 
GeneralRe: Question about delimiters Pin
dale.newman16-Dec-14 3:20
memberdale.newman16-Dec-14 3:20 
QuestionJD could not connect to the server Pin
dumnovs28-Jan-14 23:16
professionaldumnovs28-Jan-14 23:16 
AnswerRe: JD could not connect to the server Pin
dale.newman29-Jan-14 4:09
memberdale.newman29-Jan-14 4:09 
GeneralRe: JD could not connect to the server Pin
dumnovs29-Jan-14 12:47
professionaldumnovs29-Jan-14 12:47 
GeneralRe: JD could not connect to the server Pin
dale.newman29-Jan-14 12:53
memberdale.newman29-Jan-14 12:53 
Questionsource file cannot be downloaded Pin
fredatcodeproject26-Jan-14 22:48
memberfredatcodeproject26-Jan-14 22:48 
AnswerRe: source file cannot be downloaded Pin
dale.newman28-Jan-14 17:54
memberdale.newman28-Jan-14 17:54 
GeneralRe: source file cannot be downloaded Pin
fredatcodeproject29-Jan-14 2:20
memberfredatcodeproject29-Jan-14 2:20 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web04 | 2.8.150603.1 | Last Updated 18 Mar 2015
Article Copyright 2014 by dale.newman
Everything else Copyright © CodeProject, 1999-2015
Layout: fixed | fluid