Hi,

I'm using Parallel.ForEach to read files and then insert the processed information into a database. There may be cases where multiple files contain similar data; when saving to the database in that scenario, I need to update the previous record instead of creating a new one.

C#
Parallel.ForEach(files, (FileInfo datafile, ParallelLoopState state) =>
                {
                    if (errorCount > maxErrors)
                    {
                        iterationMaxErrorAchieved = true;
                        state.Stop();

                    }
                    else
                    {
                        ImportDataProcess(datafile);
                    }
                });



C#
void ImportPolicyInternalProcess(FileInfo datafile)
        {
            try
            {
                XmlDocument xmldoc = new XmlDocument();
                using (FileStream stream = datafile.OpenRead())
                {
                    xmldoc.Load(stream);
                }

                //....some code...

                if (InsertData(xmldoc.DocumentElement, value1ID, value2ID))
                {
                    
                }

                processedPolicyFileCounter++;
            }
            catch (Exception E)
            {
                
            }
        }


So in the InsertData method I'm using Entity Framework, and a distinct record is created in the DB for each combination of value1ID and value2ID. If such a record already exists, InsertData needs to update it with the latest information instead.

It worked fine as long as I was using synchronous inserts, i.e. no parallelism. But now it keeps creating new records in the database even when the combination already exists in the DB.

Kindly assist.

Thanks,
Abhishek
Comments
Sergey Alexandrovich Kryukov 26-May-15 2:28am    
Why is thread safety required here? What is the common resource that could lead to any inconsistency?
Remember, there is one simple principle (with limited applicability, but very important anyway):
"Best synchronization is no synchronization".

Can you see the point?

—SA
Mehdi Gholam 26-May-15 2:32am    
processedPolicyFileCounter is one, and so is whatever happens in InsertData().

In any case it is a very hard question to answer with the little information provided.
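As a minimal sketch of the counter issue mentioned here: `processedPolicyFileCounter++` is a read-modify-write, so concurrent iterations can lose updates. Assuming the counter is a plain `int` field (the question does not show its declaration), `Interlocked.Increment` makes the update atomic without a lock. The class name `CounterDemo` and the method `Run` are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class CounterDemo
{
    // Stand-in for the shared field from the question (assumed to be an int).
    static int processedPolicyFileCounter;

    // Simulates processing fileCount files in parallel and counting them.
    public static int Run(int fileCount)
    {
        processedPolicyFileCounter = 0;
        Parallel.ForEach(Enumerable.Range(0, fileCount), fileNumber =>
        {
            // A plain ++ here could lose increments under contention;
            // Interlocked.Increment performs the update atomically.
            Interlocked.Increment(ref processedPolicyFileCounter);
        });
        return processedPolicyFileCounter;
    }
}
```

Calling `CounterDemo.Run(10000)` returns 10000, because every increment is applied atomically and `Parallel.ForEach` returns only after all iterations finish.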
Sergey Alexandrovich Kryukov 26-May-15 2:51am    
This problem is correctly commented on by King Julien in his comment below.
—SA
King Julien 26-May-15 2:48am    
One general thing to consider when using parallel tasks with a database is the DB connection. Your sample code does not show how the InsertData method behaves. Even assuming a separate connection object is created inside InsertData, the calls may still share a single physical connection because of connection pooling (SQL Server Connection Pooling).

Typically, you can make your method thread-safe by wrapping the critical sections (the code that uses the common resource) in a lock statement. In your case, the critical section seems to be the DB access in the InsertData method.

 
Use "lock" before calling InsertData. This way you can be sure that only one thread at a time is updating the DB. But it beats the necessity of Parallel.ForEach. Nevertheless, if a lot of data processing happens before InsertData, then locking just before the insert should give you better results than completely sequential processing.
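A minimal sketch of this suggestion, under stated assumptions: the real InsertData is not shown in the question, so an in-memory `Dictionary` stands in for the table, and `UpsertDemo`, `dbLock`, `Parse` and `Run` are hypothetical names. The point is only that the "does the record exist? then update, else insert" check and the write happen under one lock, while the per-file processing stays parallel.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class UpsertDemo
{
    static readonly object dbLock = new object();

    // Hypothetical stand-in for the DB table, keyed by (value1ID, value2ID).
    static readonly Dictionary<(int, int), string> table =
        new Dictionary<(int, int), string>();

    // Stand-in for the expensive per-file processing done before the insert.
    static string Parse(int fileNumber) => "payload from file " + fileNumber;

    static void InsertData(int value1ID, int value2ID, string payload)
    {
        // The existence check and the write must be one atomic step; otherwise
        // two threads can both see "no record yet" and both insert.
        lock (dbLock)
        {
            var key = (value1ID, value2ID);
            if (table.ContainsKey(key))
                table[key] = payload;      // update the existing record
            else
                table.Add(key, payload);   // create a new record
        }
    }

    // Processes fileCount simulated files and returns the number of records.
    public static int Run(int fileCount)
    {
        lock (dbLock) { table.Clear(); }
        Parallel.ForEach(Enumerable.Range(0, fileCount), i =>
        {
            string payload = Parse(i);       // runs in parallel, outside the lock
            InsertData(i % 3, 7, payload);   // only three distinct key combinations
        });
        return table.Count;
    }
}
```

`UpsertDemo.Run(100)` returns 3, one record per distinct key combination. Note that a `lock` only coordinates threads inside one process; if several processes write to the same table, a unique constraint on the two ID columns in the database is the more robust safeguard.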
 
Comments
Sergey Alexandrovich Kryukov 26-May-15 2:55am    
"Beats the necessity of Parallel.ForEach" is a good idea to consider, but this is not exactly so. It all depends on typical type inside lock, compared to the typical total iteration time. Generally, it's bad to "oversynchronize" the data... I've seen too many design mistakes made by different developers who first develop parallel execution and then synchronize the action which effectively makes the thread doing their jobs sequentially (but getting all the threading overhead). This is way too ridiculous, but it happens with frustrating regularity... :-)
—SA
vinayvraman 26-May-15 3:08am    
Exactly, Sergey. I second your thoughts. That is the reason I suggested adding the lock just before the insert, after all the processing is completed. But since the InsertData method is not shown, it is difficult to give any concrete solution.
Sergey Alexandrovich Kryukov 26-May-15 3:32am    
Agree.
—SA
Rather than a specific solution, I would like to lay out some points that have to be considered whenever multithreading is applied.
1. Is it really necessary? A naturally serial flow of execution cannot be parallelized. If your code is like that, don't try!
2. Analyze which parts of the execution flow can be parallelized and which ones interact and need arbitration. Divide your code into pieces: the parts that can be parallelized go in the 'main' thread module; the parts that cannot go in interlocked modules that are appropriately arbitrated (with locks, mutexes or semaphores) depending on the interaction between them (yes, arbitration is more than simple locking).
3. Consider that beyond the number of processors in your CPU (physical or hyperthreaded), all other tasks run sequentially (switched fast enough to look parallel, but always serial), so if the workload does not spend time waiting on external resources you will get no real benefit. A few words on multitasking to make the concept clearer: a multitasking system gets most of its efficiency from doing something else while waiting for a resource. :) The real point is: if my code is waiting for a resource, such as the disk, why can't the CPU do something else in the meantime? The I/O subsystem, and many other resource managers, are coded to suspend a thread as soon as it requests a resource, allowing the OS to run another thread until the resource becomes available, e.g. while the disk reads some sectors into memory.
4. Last, some words about the 'thread safe' concept. It can basically be expressed as "whatever can be corrupted by being overwritten from many threads has to be protected". You can see that this could mean a lot or nothing, because every variable that is not strictly local to a thread can be corrupted if updated from another thread, and the same goes for a disk write or a GUI update. Once again: analyze your code, check for access conflicts on variables, I/O etc., and code the appropriate protections (locks or whatever fits) to avoid conflicting access. This is, on a small scale, the same arbitration that an OS does on its resources... :)
I hope this is of some interest...
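Point 2 above can be sketched as a two-stage pipeline: the independent part (parsing each file) runs in parallel, and the part that needs arbitration (deciding which record wins for a given key, then writing it) runs on a single thread, so no lock is needed at all. This is only an illustration under assumptions: `TwoStageDemo`, `Parsed`, `Parse` and `Run` are hypothetical names, a `Dictionary` stands in for the database, and `Sequence` stands in for whatever defines the "latest" file.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class TwoStageDemo
{
    // Hypothetical result of processing one file.
    sealed class Parsed
    {
        public int Value1ID;
        public int Value2ID;
        public int Sequence;    // assumed ordering: higher means more recent
        public string Payload;
    }

    // Stand-in for loading and processing one XML file.
    static Parsed Parse(int fileNumber) => new Parsed
    {
        Value1ID = fileNumber % 3,
        Value2ID = 7,
        Sequence = fileNumber,
        Payload = "file " + fileNumber
    };

    public static Dictionary<(int, int), string> Run(int fileCount)
    {
        // Stage 1: parallel and independent; ConcurrentBag is safe for concurrent Add.
        var parsed = new ConcurrentBag<Parsed>();
        Parallel.ForEach(Enumerable.Range(0, fileCount), i => parsed.Add(Parse(i)));

        // Stage 2: single-threaded; keep only the latest record per key and write it.
        var table = new Dictionary<(int, int), string>();
        foreach (var group in parsed.GroupBy(p => (p.Value1ID, p.Value2ID)))
        {
            Parsed latest = group.OrderByDescending(p => p.Sequence).First();
            table[group.Key] = latest.Payload;
        }
        return table;
    }
}
```

With `Run(10)`, the result holds three entries, and the entry for key `(0, 7)` is `"file 9"`, the highest-numbered file that maps to that key. The trade-off is that all parsed results are held in memory before anything is written.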
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


