|
|
Comments and Discussions
|
|
 |

|
Very good job
|
|
|
|

|
Thx! David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
Amazing article, easy to read and it helped me solve an issue I had on my application!
|
|
|
|

|
Glad you liked David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
Simple, solves the problem.
"There are only 10 types of people in the world: Those who understand binary, and those who don't"
More articles on my blog
|
|
|
|
|

|
I had to solve a similar problem myself. In order to keep things relatively fast I would do the Bulk Copies in batches of 1000 to 10000 records. If no errors were generated, great! Those records were all "good". If there was an exception then I'd iterate through just that batch record by record in order to find the actual culprits.
Over all this kept the speed up to reasonable levels, even in those cases where the data being copied was a bit dodgy.
|
|
|
|

|
That works, except that if you have multiple failures in more than one batch you need to redo your process over and over again and until they've all been found. With this technique you only search once and find everything in one shot. You save a lot of time this way. David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
Not true. If there's a failure in the batch, you check all the records in the batch. You'd only miss a problem if stopped looking as soon as you find your first error. This is why setting a sensible batch size is important.
The only errors you can miss doing what I've described is an individual record having more than one thing wrong with it, and your method won't pick that up either. (Nor indeed could I think of a method that would that didn't involve writing a whole lot of business rules to be checked for, etc. which is exactly what we're trying to avoid.)
Cheers,
Michael
|
|
|
|

|
I don't think I'm getting my point across. You approach would find the problems in the batch, but you still would have to run your process several more times if there are issues in other batches. The approach I'm using finds all errors from the entire bulk update in one shot, regardless of how you partition your batches. As for multiple errors on the same record - yes, I wouldn't be able to catch that either. I guess not everything can be automated. David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
I think the idea is that if there are 100,000 records and e.g. ten of them have problems, it would be far better to do 90 successful batches of 1,000 records each, ten unsuccessful batches of 1,000 records, and 10,000 individual records, than to do one unsuccessful batch of 100,000 records and 100,000 individual records.
|
|
|
|

|
supercat, that's exactly the problem. You have no idea what the distribution of the bad records is, so if in your example you have ten unsuccessful batches the bulk copy will stop on the first of those ten. Once you've corrected your errors from that batch you'll need to rerun the whole thing again and then stop a second time because of errors in the second batch. And so on until the tenth batch. At this point, you've rerun the entire thing ten times and have probably punched the monitor in aggravation because of the seemingly unending errors. What I'm doing is saying "ok, so I've found an error. I don't intend to run through this whole update n times just find the others. I'm going to run it once, get all the errors, and correct them in one shot". I find this is much more efficient and also gives you a much clearer picture of how many issues your data has as a whole. David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
Assuming there's a failure somewhere, wouldn't the overall speed really depend on the speed of GetBulkCopyFailedData?
That method is going to read every single record, and going down the batch route it would only do that for those batches that contained a failure. So 10k records, 1k batches, 3 errors would worst case read 3k rows. No batches, would always read the full 10k.
I guess it depends on how many errors your expecting, and if your doing bulk ops I think that should be none...if your expecting errors bulk into a staging table (with less restrictions) first, then clean.
But for the occasional error this method would get the majority into the db, and let you know the exact row...so good job
|
|
|
|

|
Yes, the overall speed would definitely depend on running GetBulkCopyFailedData. Re batches - I guess it depends on how you manage your batches. Are you referring to loading completely independent batches, where every one of them is being run within its own transaction? In that case, I suppose your method works. My personal preference, however, is to get a complete list of all the failures up front instead of having to rerun the process until everything is done. Also, I don't think that running separate batches would give you any particular speed or time advantage. You would be saving time if there are some failures (so you don't have to rerun batches that were already saved), so I guess that's an advantage. In the end, however, I think the methods don't differ much in time; they will probably come down to personal preferences. Loading everything into a staging table is an excellent way of cleaning things up before loading into your final database. Your staging area would ideally contain the same data constraints as your final database, so you might be running into issues while loading into the staging area. David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
You are quite incorrect in your assumptions. I am not rerunning the process multiple times each time I find an error.
Here's my assumption: The data is mostly good. I'm not going into the process blind, I have at least some confidence in the data I'm importing and I expect it to mostly import fine.
Now an example: I need to import 100,000 records, and scattered in there are 10 bad records. I start the process in batches of 1,000 records, so if the data was perfect there would be 100 bacthes and I'm done. But because there are errors, 10 (or less, if errors are "clustered") batches will fail. The moment a batch fails the program shifts to re-doing that particular batch record-by-record, as per your example. When an individual record fails it can be noted to a log somewhere, but the process continues, record-by-record for the rest of that batch, then returns to doing 1,000 record batches.
At the end of all this I have a log file with 10 records in it (hopefully identified by a primary key), and I have 99,990 good records in my database. All that good data got there with 100 batches of 1,000 records transferred, 10 of which failed, and 10,000 records individually transferred, of which 10 failed. I fix the 10 records and import them - job done.
Yes, in total I have sent 110,010 records with 10,010 rejected, compared to your 100,010 records with 10 rejected, but in my experience, the time spent resending the failed batches automatically (I don't know where you got the idea that I'm stopping, fixing a record, running it again, stopping again, etc. from) is easily compensated for by the time saved by doing 90% of the work in larger batches. I mean, otherwise what is the point of using the BulkCopyTransfer process at all? Is doing 100,000 single record BulkCopyTransfers any faster than doing 100,000 single record Inserts? (Possibly it is, marginally, since logging isn't done, triggers aren't fired, etc. This is a genuine question which I don't know the answer to... )
And if I had less confidence in the data I could do batches of 500, or 100, and probably still save time. (Once you get down to batches of 10 or so then your data is crap and you might as well do it record-by-record, I agree.)
Fundamentally you and I are agreeing on the technique of getting around the issue of BulkCopyTransfer not beiung very helpful in its error messages, and I think your article is a good one, so don't think I'm ripping your idea to shreads. I've just introduced an extra bit to the recipe which in my experience speeds things up (at least in the case of "mostly good" input data) with no downsides.
Cheers,
Michael
|
|
|
|

|
Yes, looks like my assumptions were off. I've never really tried updating a table with multiple batches of a bulk copy because if there were problems I would then have to somehow filter out the data that had already been loaded or just re-update those rows with no changes. A bulk load, for me, has always been an all-or-nothing process. I guess there is simply more than one way to skin this cat. As for ripping the idea - no worries there. I'm always open to a good discussion and exchange of ideas; that's what helps us keep on learning. P.S. no cats were harmed in the writing of this comment. David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
I've an application that imports excel files into SQL server. The application works fine on my local machine and it imports all rows perfectly. However, when I deploy this to the server IIS 7.0 it works for small files but for larger ones imports partial data. Also, it does not throw any exception. For example it imported only 17837 out of 67456 rows in one of the files and there's no pattern cause it imports different (yet, less) data for each large file.
The only fishy thing I see is that, one of the columns is "NULL" for the very last row that was converted.
Original 17837 row : 358487 12/12/2012 N NULL #87726 UHC NULL
Converted 17837 row 358487 12/12/2012 N N/A Per Provider #87726 UHC NULL
Can you please put some light on what can be the issue here?
|
|
|
|

|
Hi Gudi There's nothing wrong in having a NULL in a column. You'd need to check the column's definition first to tell if that NULL is allowed or not. As for the bulk copy failure - there must be a log of it somewhere. Either within IIS or under the computer event viewer. In any case - if you can't tell what's going, I suggest implementing the code from the article and just log the exceptions to a file of your choice when they occur. Then you'll have all the info you need. You need some kind of log to give you an idea of the problem; without it, there's really no way anyone can help you figure this out. David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
Actually, I implemented your method, but that gave no exceptions. So there's no log of anything erroring out. That is really my confusion. It's almost like sqlbulkcopy only sees those many rows to be imported and imports them without error, except for the very last one.
|
|
|
|

|
That is strange. I would focus on one of the files causing you problems. Does reimporting it a couple of times always fail on the same record? If so, compare that record's values with the schema of the table and see which one is failing the constraints. If you have Enterprise Manager you could also try importing the file by using an Excel ODBC pipe. It will be way slower than a bulk copy, but might give you an idea of what's wrong. David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|

|
Actually, it was the datatable that was being loaded from the excel file that is creating the issue. The datatable itself has less rows when populated on the server and hence, sqlbulkcopy is only seeing those many rows. Thanks for your prompt input.
|
|
|
|

|
NP. Glad you found the issue. David. http://www.linkedin.com/pub/david-catriel/44/b01/382
|
|
|
|
 |
|
|
General News Suggestion Question Bug Answer Joke Rant Admin
|
How to get a list of data rows that caused an exception (or several of them) in a SqlBulkCopy operation
| Type | Article |
| Licence | CPOL |
| First Posted | 18 May 2012 |
| Views | 9,931 |
| Bookmarked | 18 times |
|
|