Very fast test data generation using exponential INSERT

28 Aug 2009, CPOL
Instead of using incremental INSERT to generate test data, this method effectively copies the existing data multiple times.

Introduction

This article (my first) will describe an algorithm that enables a large amount of data to be generated very quickly using a SQL query. The test data can be static or incremental, such as “Item Name” and “Item ID”, respectively, as shown below:

tableSample.png

Background

One of my tasks in a project involved generating a test table with 103,680,000 records. The conventional method of data generation would have taken a month; hence, a faster method of data insertion was required. The new method took only 5 hours.

Using the code

Conventional method – Sequential INSERT

The conventional way of generating a list of numbers from 1 to 100,000 would be to use a loop and an INSERT statement as follows:

CREATE TABLE #tempTable([Item ID] [bigint], [Item Name] nvarchar(30))
DECLARE @counter int
SET @counter = 1
-- One INSERT per row; <= so that row 100,000 is included
WHILE (@counter <= 100000)
BEGIN
        INSERT INTO #tempTable VALUES (@counter, 'Hammer')
        SET @counter = @counter + 1
END
SELECT * FROM #tempTable
DROP TABLE #tempTable

Let's call this method of data generation “Sequential INSERT”.

New method – Exponential INSERT

The new method effectively makes a copy of the existing data and appends it as new data, and does so repeatedly until the desired amount of data is generated.

Here is the code for the exponential INSERT:

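CREATE TABLE #tempTable([Item ID] [bigint], [Item Name] nvarchar(30))
-- Seed row; every pass below doubles the row count
INSERT INTO #tempTable VALUES (1, 'Hammer')
WHILE((SELECT COUNT(*) FROM #tempTable) < 100000)
BEGIN
    -- Append a copy of all existing rows, offsetting [Item ID] by the
    -- current row count so the IDs stay unique and consecutive
    INSERT INTO #tempTable ([Item ID], [Item Name])
        (SELECT [Item ID] + (SELECT COUNT(*) FROM #tempTable), 
                                 'Hammer' FROM #tempTable)
END
-- The loop stops at the first power of two >= 100,000, i.e. 131,072 rows
SELECT * FROM #tempTable
DROP TABLE #tempTable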

Points of interest

The WHILE loop condition in the code above is (SELECT COUNT(*) FROM #tempTable), which has to count the rows on every pass and takes a long time to evaluate as the table grows. A faster method is to calculate in advance how many iterations are needed. Starting from one row, n iterations produce 2^n rows. For 100,000 records, the smallest sufficient power of two is 2^17 = 131,072 (2^16 = 65,536 is too few), so we can rewrite the code to stop after the 17th iteration.
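The iteration count does not have to be worked out by hand. The following is a minimal sketch, not part of the original code: it doubles a running row total until it reaches the target and counts the doublings. The variable names @target, @rows and @iterations are illustrative assumptions.

```sql
-- Hypothetical helper: count the doublings needed to reach @target rows,
-- starting from the single seed row.
DECLARE @target bigint
DECLARE @rows bigint
DECLARE @iterations int
SET @target = 100000
SET @rows = 1
SET @iterations = 0
WHILE (@rows < @target)
BEGIN
    SET @rows = @rows * 2
    SET @iterations = @iterations + 1
END
-- For @target = 100000 this yields 17 iterations and 131072 rows
SELECT @iterations AS Iterations, @rows AS RowsGenerated
```

Integer doubling is used here instead of a logarithm so that exact powers of two are not affected by floating-point rounding.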

Sequential INSERT took 4 seconds to count from 1 to 100,000; the exponential method took 2 seconds with the code below:

CREATE TABLE #tempTable([Item ID] [bigint], [Item Name] nvarchar(30))
INSERT INTO #tempTable VALUES (1, 'Hammer')
DECLARE @counter int
SET @counter = 1
-- 17 doublings of the single seed row give 2^17 = 131,072 rows
WHILE(@counter <= 17)
BEGIN
    -- Append a copy of all existing rows with [Item ID] offset by the row count
    INSERT INTO #tempTable ([Item ID], [Item Name])
        (SELECT [Item ID] + (SELECT COUNT(*) FROM #tempTable), 
                           'Hammer' FROM #tempTable)
    SET @counter = @counter + 1
END
SELECT * FROM #tempTable
DROP TABLE #tempTable

Also, not only can you use this to increment a number field, but it can be applied to datetime fields as well.
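As an illustration of the datetime case, here is a minimal sketch under assumptions that are not in the original article: a hypothetical table #tempDates with a single [Event Time] column, a start value of 1 Jan 2009, and one-minute spacing between rows. It tracks the row count in a variable (@existingRows, also an illustrative name) and shifts each copied batch forward with DATEADD.

```sql
-- Sketch only: #tempDates, [Event Time] and @existingRows are hypothetical names.
CREATE TABLE #tempDates([Event Time] datetime)
INSERT INTO #tempDates VALUES ('2009-01-01T00:00:00')
DECLARE @counter int
DECLARE @existingRows int
SET @counter = 1
SET @existingRows = 1
WHILE (@counter <= 10)
BEGIN
    -- Copy all existing rows, shifted forward by one minute per existing row,
    -- so the new timestamps continue directly after the old ones
    INSERT INTO #tempDates ([Event Time])
        SELECT DATEADD(minute, @existingRows, [Event Time]) FROM #tempDates
    SET @existingRows = @existingRows * 2
    SET @counter = @counter + 1
END
-- 10 doublings give 1,024 rows, one per minute
SELECT * FROM #tempDates ORDER BY [Event Time]
DROP TABLE #tempDates
```

Changing the DATEADD datepart (for example to hour or day) changes the spacing without altering the loop.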

History

  • This is v1.0.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

weilidai2001

United Kingdom

Comments and Discussions

 
General: My vote of 2 (Aborq, 31-Aug-09 22:50)
General: Re: My vote of 2 (weilidai2001, 1-Sep-09 1:03)
General: Re: My vote of 2 (Aborq, 3-Sep-09 4:50)
Question: Why go row-by-row? (Aborq, 31-Aug-09 22:37)
Answer: Re: Why go row-by-row? (weilidai2001, 1-Sep-09 1:04)
Question: Could a variation be used to speed up creation of non-trivial records? (supercat9, 26-Aug-09 13:10)
Answer: Re: Could a variation be used to speed up creation of non-trivial records? (weilidai2001, 26-Aug-09 14:12)
Hi supercat9, thanks for the message.
I've read your first paragraph a few times, but still couldn't figure out exactly what it meant.
Anyway, for the SELECT COUNT(*) problem, I did specify in the "Points of interest" section that a counter variable would be more appropriate.
And I do agree that this method will always overgenerate test data, and when the table gets large, it is very likely that you overgenerate hundreds of thousands of records, which takes a long time. What I normally do to avoid this is to generate test data in batches. So for example, if I want to generate 1,500,000 records, instead of doing 21 iterations to generate 2,097,152 (i.e. 2^21) records and then trimming them off, I would generate 3 batches of 524,288 (i.e. 2^19) and then combine them, so I only overgenerate 3*524,288-1,500,000=72,864 instead of 2,097,152-1,500,000=597,152 records.
 
supercat9 wrote:
If it's necessary to add a substantial number of records with non-trivial data, especially if a significant amount of the data will be duplicated on the new records, I wonder whether it would make sense to create a temporary or memory table with 'k' instances of all the columns that have 'unique' data, put data into that table using one record for each 'k' records of the original data, and then use 'k' SQL statements to copy records from the temporary table into the real one? It would seem in some ways a bit clunky and inefficient, and it would only allow linear rather than exponential speedup, but for versions of SQL that don't otherwise allow the creation of multiple records with a single statement I wonder if it might be helpful?
 
With regard to your exponential-growth strategy, I would suggest that it might be better to use a variable rather than SELECT COUNT(*) to keep track of how many items exist; also, it might be worthwhile to, on the last pass through the loop, limit the number of items returned in case there is a need for a total number of records that isn't a power of two.
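The two suggestions above (a row-count variable instead of SELECT COUNT(*), and a limited final pass) can be sketched as follows. This is one possible reading of the suggestion, not code from the article or the comment: the variables @target, @rows and @batch are illustrative, and the use of TOP with a variable assumes SQL Server 2005 or later.

```sql
-- Sketch only: @target, @rows and @batch are hypothetical names.
CREATE TABLE #tempTable([Item ID] [bigint], [Item Name] nvarchar(30))
INSERT INTO #tempTable VALUES (1, 'Hammer')
DECLARE @target bigint
DECLARE @rows bigint
DECLARE @batch bigint
SET @target = 100000
SET @rows = 1
WHILE (@rows < @target)
BEGIN
    -- Copy every existing row, or only as many as are still missing on the last pass
    SET @batch = @rows
    IF (@target - @rows < @batch) SET @batch = @target - @rows
    -- ORDER BY makes TOP take the lowest IDs, so the new IDs stay consecutive
    INSERT INTO #tempTable ([Item ID], [Item Name])
        SELECT TOP (@batch) [Item ID] + @rows, 'Hammer'
        FROM #tempTable
        ORDER BY [Item ID]
    SET @rows = @rows + @batch
END
-- With @target = 100000 the table ends with exactly 100000 rows, IDs 1 to 100000
SELECT COUNT(*) AS TotalRows, MAX([Item ID]) AS MaxItemID FROM #tempTable
DROP TABLE #tempTable
```

The first 16 passes double the table to 65,536 rows; the final pass copies only the remaining 34,464 rows, so nothing has to be trimmed afterwards.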

General: Re: Could a variation be used to speed up creation of non-trivial records? (supercat9, 27-Aug-09 6:23)
General: Re: Could a variation be used to speed up creation of non-trivial records? (Md. Marufuzzaman, 27-Aug-09 11:26)

Article Copyright 2009 by weilidai2001