Click here to Skip to main content
12,063,909 members (65,432 online)
Click here to Skip to main content
Add your own
alternative version

Stats

32.5K views
1.7K downloads
65 bookmarked
Posted

Junk Drawer

, 25 Jan 2016 Apache
Rate this:
Please Sign up or sign in to vote.
Importing Excel & Delimited Files into a Junk Database.

Introduction

analyst: "Is there something that just automatically imports files into a database?"

programmer: "No. You have to use the data import wizard."

The data analyst sighed as he recalled the wizard.

SQL Server Import Wizard

Using the wizard to import a text file into SQL Server goes something like this:

  • Install SQL Server Management Studio.
  • Find Tasks and choose Import Data.
  • Select "Flat File Source."
  • Browse for the file.
  • Preview the data.
  • Specify the delimiter.
  • Specify if the first row is column names.
  • Preview the data (again).
  • Go to each column and choose the correct data type or use "Suggest Types" feature.
  • Choose if you want to save the SSIS package for later.
  • Execute it

programmer: "The wizard helps you import any kind of file."

analyst: "That's great. But, it would be better if a program just figured it out for me."

programmer: "Sorry. You have to use the wizard. If you get an error message, fix the problem and try again."

analyst: "I get a lot of different files. Using the wizard is repetitive. This wastes my time."

At this, the programmer shouted assembly language and put the data analyst in a head lock. Balling his fist, he pressed his knuckles hard against the analyst's head and rubbed back and forth.

programmer: "You come to my cube, without a ticket, complaining about YOUR time being wasted?"


Sadly, this scenario happens a lot in IT offices. Recently, while forcing a staff member to learn SQL, he said:

staff: "SQL is great, but how do I get these files into the database?"

I thought of the import wizard, but it didn't feel right. If he found out that he'd have to run the wizard every time and most likely deal with error messages, it would be a stumbling block for him. Also, I knew the conversation would most likely end in a head-lock (as depicted above).

So, instead of giving him the beat down, I decided to create a program that makes it easier to import an Excel or text file into a database.

Getting Started

Junk Drawer refers to files as junk, and the database as a drawer. The file is an input, and the database is an output. Both are connections.

To configure the connections, open Junk Drawer's default configuration file default.xml.

<jd>
  <connections>
    <add name="input" 
         provider="file" 
         file="*.*" />
    <add name="output" 
         provider="sqlserver" 
         server="localhost" 
         database="Junk" />
  </connections>

The input is set to a file. The file *.* is changed to your file at run-time. The output is set to a local SQL Server database named Junk. For SQL Server, the default database connection uses trusted security. If you're using a native account, you may add a user and password attribute. You may also use a connection-string attribute instead.

Get a File

The file must be Excel (e.g. .xls, .xlsx), or a delimited text file (e.g. .csv, .txt).

I searched Google for filetype:csv colors and found colors.csv. You can find some pretty neat stuff on Google with the filetype term. Here's a sample of colors.csv:

Code,Name,Hex,Red,Green,Blue
air_force_blue_raf,"Air Force Blue (Raf)",#5d8aa8,93,138,168
air_force_blue_usaf,"Air Force Blue (Usaf)",#00308f,0,48,143
air_superiority_blue,"Air Superiority Blue",#72a0c1,114,160,193
alabama_crimson,"Alabama Crimson",#a32638,163,38,56
alice_blue,"Alice Blue",#f0f8ff,240,248,255
alizarin_crimson,"Alizarin Crimson",#e32636,227,38,54
alloy_orange,"Alloy Orange",#c46210,196,98,16
almond,"Almond",#efdecd,239,222,205

Junk Drawer (jd.exe) imports it from the command line like this:

jd.exe c:\temp\colors.csv

Now it can be queried:

USE Junk;

SELECT TOP 10 Code, Name, Hex, Red, Green, Blue
FROM colors;
Code                  Name                   Hex     Red Green Blue
--------------------- ---------------------- ------- --- ----- ----
air_force_blue_raf    Air Force Blue (Raf)   #5d8aa8 93  138   168
air_force_blue_usaf   Air Force Blue (Usaf)  #00308f 0   48    143
air_superiority_blue  Air Superiority Blue   #72a0c1 114 160   193
alabama_crimson       Alabama Crimson        #a32638 163 38    56
alice_blue            Alice Blue             #f0f8ff 240 248   255
alizarin_crimson      Alizarin Crimson       #e32636 227 38    54
alloy_orange          Alloy Orange           #c46210 196 98    16
almond                Almond                 #efdecd 239 222   205
amaranth              Amaranth               #e52b50 229 43    80
amber                 Amber                  #ffbf00 255 191   0

The resulting data structure is similar to this:

CREATE TABLE colors(
    Code NVARCHAR(40),
    Name NVARCHAR(42),
    Hex NVARCHAR(8),
    Red TINYINT,
    Green TINYINT,
    Blue TINYINT
);

How Does it Work?

When we glance at colors.csv above, it's easy for us to see the first row is a header, and subsequent rows are records.

Moreover, we see that a comma delimits the values. We also recognize patterns with the fields. We see that Code, Name, and Hex are text, and Red, Green, and Blue are numeric.

Junk Drawer has to see the same thing as we do:

  1. the delimiter
  2. the column names (if available)
  3. the column data types

Finding the Delimiter

100 lines are examined for delimiters. If delimiters are found, the average number per line and standard deviation is calculated.

Then, the delimiter with the lowest coefficient of variation is declared winner. This provides us with the most consistent delimiter across the first 100 records.

The default delimiters searched for are comma, pipe, tab, and semicolon. If you want control over the delimiters, configure them in the input connection like this:

<add name="input" provider="file" file="*.*">
    <delimiters>
        <add name="comma" character=","/>
        <add name="pipe" character="|"/>
        <add name="tab" character="&#009;"/>
        <add name="semicolon" character=";"/>
    </delimiters>
</add>

Column Names

The first line is split by the winning delimiter and tested for:

  • duplicates
  • empties
  • white space values
  • numbers
  • dates

If there are any of the above, the first line is not suitable for column names. Excel-like column names are generated (i.e. A, B, C) if necessary. In colors.csv, the first line doesn't have any duplicates, empties, white space values, numbers, or dates, so it is used as column names.

Data Types

Initially, every field is considered a string. Often, when importing a file for ad-hoc queries, strings are fine. However, if you want to type-check the data, add types into the input connection like this:

<add name="input" provider="file" file="*.*">
    <types>
        <add type="bool"/>
        <add type="byte"/>
        <add type="short"/>
        <add type="int"/>
        <add type="long"/>
        <add type="single"/>
        <add type="double"/>
        <add type="decimal"/>
        <add type="datetime"/>
    </types>
</add>

Currently, types are checked in the order they appear. So, to select the most efficient type, add more restrictive types first. For example, if you test for a short (-32,768 to 32,767), before a byte (0 to 255), any would-be bytes end up as shorts.

Every value in a field is checked against a type's restrictions. The first compatible type is used. If no type allows all the values, a string is used.

A string is tested for length. A field assumes the length of the longest value in the file (+1). If you want control over string length, add min-length and/or max-length to the connection:

<add name="input" 
     provider="file" 
     file="*.*"
     min-length="64"
     max-length="4000" />

Once the values are type and/or length checked, Junk Drawer tries to import the file

In Code

JunkDrawer may be used in code like this:

JunkResponse response;
var request = new JunkRequest(@"c:\temp\colors.csv", "default.xml");
using (var scope = new AutofacJunkBootstrapper(request)) {
    response = scope.Resolve<JunkImporter>().Import();  
}

Just like the jd.exe executable, JunkRequest requires the file name you want to import, and a configuration.

In the case above, I'm using Autofac to wire up the JunkImporter dependencies. A JunkDrawer.Autofac project is included in the solution to demonstrate how JunkImporter is composed.

Options

Table Name

By default, Junk Drawer creates a view named after your file (without the extension). For example, colors.csv is named colors. If you want to name your view something else, set the TableName property in JunkRequest.

Configuration

If you do not provide a configuration, default.xml is used.

The configuration is file based. You may make as many configurations as you want. For example, to import into SQLite instead of SQL Server, create a configuration like this:

<jd>
    <connections>
        <add name="input" provider="file" file="*.*" />
        <add name="output"
             provider="sqlite"
             file="c:\temp\junk.sqlite3" />
    </connections>
</jd>

Save it as sqlite.xml. Now import colors.csv into SQLite:

jd.exe c:\temp\colors.csv sqlite.xml

MySql and PostgreSql are also supported.

Precautions

Junk Overwrite

If you import the same file into Junk Drawer twice, it overwrites the previous table. Don't worry though; it's only your junk.

Junk Overflow

I called it Junk Drawer because allowing folks to import files directly into a database can create a mess. You may want to keep an eye on it, or put your Junk database on an isolated test server where it can't hurt anybody.

Conclusion

Once in place, Junk Drawer can empower your trusted friends to import their data into a Junk database and run ad-hoc queries until their heart's content.

Of course, there are files out there that are so jacked up that Junk Drawer won't be able handle them. In that case, you'll have to resort to shouting, head-locks, and noogies (or an import wizard).

Credits

Junk Drawer is not possible without:

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0

Share

About the Author

dale.newman
Software Developer (Senior)
United States United States
No Biography provided

You may also be interested in...

Comments and Discussions

 
QuestionMessage Automatically Removed Pin
19-Mar-15 17:26
memberpaul000919-Mar-15 17:26 
QuestionMultiple Tabs Pin
SteveHolle18-Mar-15 13:18
memberSteveHolle18-Mar-15 13:18 
AnswerRe: Multiple Tabs Pin
dale.newman19-Mar-15 9:26
memberdale.newman19-Mar-15 9:26 
QuestionJunk Drawer - good stuff Pin
Member 1092409215-Dec-14 6:31
memberMember 1092409215-Dec-14 6:31 
AnswerRe: Junk Drawer - good stuff Pin
dale.newman16-Dec-14 4:16
memberdale.newman16-Dec-14 4:16 
SuggestionIs it possible to name the table based on the file name? Pin
MattPenner12-Dec-14 12:34
memberMattPenner12-Dec-14 12:34 
GeneralRe: Is it possible to name the table based on the file name? Pin
dale.newman16-Dec-14 4:15
memberdale.newman16-Dec-14 4:15 
GeneralRe: Is it possible to name the table based on the file name? Pin
dale.newman19-Mar-15 9:24
memberdale.newman19-Mar-15 9:24 
QuestionCan't find Transformalize.Libs.NLog Pin
MattPenner10-Dec-14 15:10
memberMattPenner10-Dec-14 15:10 
AnswerRe: Can't find Transformalize.Libs.NLog Pin
dale.newman11-Dec-14 6:16
memberdale.newman11-Dec-14 6:16 
Questionscientific notation on large numbers Pin
Mike Pelikan17-Sep-14 8:34
memberMike Pelikan17-Sep-14 8:34 
AnswerRe: scientific notation on large numbers Pin
dale.newman23-Sep-14 6:42
memberdale.newman23-Sep-14 6:42 
GeneralGreat article Pin
joelcarroll29-May-14 10:46
memberjoelcarroll29-May-14 10:46 
GeneralGood article, nicely written Pin
wicklowwanderer29-May-14 7:46
memberwicklowwanderer29-May-14 7:46 
BugGreat work but had a few bugs Pin
MattPenner27-Mar-14 10:10
memberMattPenner27-Mar-14 10:10 
GeneralRe: Great work but had a few bugs Pin
dale.newman28-Mar-14 18:36
memberdale.newman28-Mar-14 18:36 
QuestionUnspecified Error Pin
mwalsh6225-Mar-14 11:54
membermwalsh6225-Mar-14 11:54 
AnswerRe: Unspecified Error Pin
dale.newman25-Mar-14 11:59
memberdale.newman25-Mar-14 11:59 
AnswerRe: Unspecified Error Pin
dale.newman31-Mar-14 6:10
memberdale.newman31-Mar-14 6:10 
Questionsource zip is corrupted Pin
capacitas.sgl23-Mar-14 1:56
membercapacitas.sgl23-Mar-14 1:56 
AnswerRe: source zip is corrupted Pin
dale.newman28-Mar-14 18:33
memberdale.newman28-Mar-14 18:33 
AnswerRe: source zip is corrupted Pin
dale.newman31-Mar-14 6:11
memberdale.newman31-Mar-14 6:11 
GeneralMy vote of 5 Pin
Member 982548822-Mar-14 8:03
memberMember 982548822-Mar-14 8:03 
QuestionCan't find process Pin
Member 104203445-Mar-14 8:02
memberMember 104203445-Mar-14 8:02 
AnswerRe: Can't find process Pin
dale.newman5-Mar-14 17:25
memberdale.newman5-Mar-14 17:25 
GeneralRe: Can't find process Pin
Member 104203446-Mar-14 6:19
memberMember 104203446-Mar-14 6:19 
GeneralMy vote of 4 Pin
ectisch6-Feb-14 7:01
memberectisch6-Feb-14 7:01 
GeneralRe: My vote of 4 Pin
dale.newman6-Feb-14 12:29
memberdale.newman6-Feb-14 12:29 
QuestionSpiffy Pin
nitewulf504-Feb-14 9:49
membernitewulf504-Feb-14 9:49 
QuestionMy vote of - uh, I forget Pin
dmjm-h3-Feb-14 11:04
memberdmjm-h3-Feb-14 11:04 
AnswerRe: My vote of - uh, I forget Pin
dale.newman3-Feb-14 11:25
memberdale.newman3-Feb-14 11:25 
QuestionHow do I vote for this article? Pin
mrjava662-Feb-14 11:31
membermrjava662-Feb-14 11:31 
AnswerRe: How do I vote for this article? Pin
mrjava662-Feb-14 11:59
membermrjava662-Feb-14 11:59 
GeneralRe: How do I vote for this article? Pin
dale.newman3-Feb-14 5:40
memberdale.newman3-Feb-14 5:40 
QuestionQuestion about delimiters Pin
Marc Clifton29-Jan-14 6:46
protectorMarc Clifton29-Jan-14 6:46 
AnswerRe: Question about delimiters Pin
dale.newman29-Jan-14 7:45
memberdale.newman29-Jan-14 7:45 
AnswerRe: Question about delimiters Pin
dale.newman31-Jan-14 5:20
memberdale.newman31-Jan-14 5:20 
AnswerRe: Question about delimiters Pin
SteveHolle11-Dec-14 12:37
memberSteveHolle11-Dec-14 12:37 
GeneralRe: Question about delimiters Pin
dale.newman16-Dec-14 4:20
memberdale.newman16-Dec-14 4:20 
QuestionJD could not connect to the server Pin
dumnovs29-Jan-14 0:16
professionaldumnovs29-Jan-14 0:16 
AnswerRe: JD could not connect to the server Pin
dale.newman29-Jan-14 5:09
memberdale.newman29-Jan-14 5:09 
GeneralRe: JD could not connect to the server Pin
dumnovs29-Jan-14 13:47
professionaldumnovs29-Jan-14 13:47 
GeneralRe: JD could not connect to the server Pin
dale.newman29-Jan-14 13:53
memberdale.newman29-Jan-14 13:53 
Questionsource file cannot be downloaded Pin
fredatcodeproject26-Jan-14 23:48
memberfredatcodeproject26-Jan-14 23:48 
AnswerRe: source file cannot be downloaded Pin
dale.newman28-Jan-14 18:54
memberdale.newman28-Jan-14 18:54 
GeneralRe: source file cannot be downloaded Pin
fredatcodeproject29-Jan-14 3:20
memberfredatcodeproject29-Jan-14 3:20 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web01 | 2.8.160204.4 | Last Updated 26 Jan 2016
Article Copyright 2014 by dale.newman
Everything else Copyright © CodeProject, 1999-2016
Layout: fixed | fluid