Oh yes, as soon as your thread count exceeds the core count, CPU-bound work is going to see some slowdown from the extra context switching.
Didn't even do that.
I'm fully aware of where I went wrong. I posted it for the netizens of the Lounge to have a laugh at my expense.
In this case the specific problem is that each piece of work is smaller than the cost of creating the task that runs it.
And my error in the bigger picture is that one cannot simply convert a task running in sync to one running in async. It has to be purpose-built.
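What fixed it for me conceptually, as a minimal sketch (not anyone's actual code; the item count and chunk size are made up): batch the tiny items so each Task carries enough work to amortise its own creation cost.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class ChunkingDemo
{
    static void Work(int i) { /* a tiny piece of work */ }

    static void Main()
    {
        int[] items = Enumerable.Range(0, 1_000_000).ToArray();

        // Naive: one Task per item -- the task-creation cost dwarfs the work.
        // var tasks = items.Select(i => Task.Run(() => Work(i))).ToArray();

        // Batched: one Task per chunk, so the per-task overhead is paid
        // once per 10,000 items instead of once per item.
        const int chunkSize = 10_000;
        var tasks = new List<Task>();
        for (int start = 0; start < items.Length; start += chunkSize)
        {
            int lo = start, hi = Math.Min(start + chunkSize, items.Length);
            tasks.Add(Task.Run(() =>
            {
                for (int i = lo; i < hi; i++) Work(items[i]);
            }));
        }
        Task.WaitAll(tasks.ToArray());
    }
}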
Except you are putting a lot more vehicles on the same roads, which means more chance of traffic jams, accidents, breakdowns, and so forth. Put too many on the same roads and they get blocked up with cars, and nobody can move anywhere because there is a car in their way ...
Good analysis. The best way to think about it is this: you yourself cannot really multitask. You can time slice (we used to call this time sharing), or you can delegate. Everything done internally is really just time slicing, partitioned according to the rules and privileges you assign to processes, and to threads within those processes.
ThisOldTony has it right; I am just an echo.
In my case I actually understand them: we're not the only customer for this data, so for them it's just easier to upload a weekly XML file to an FTP server.
And it's not even my own government in this case.
I don't understand Danish, and Danes take offence if I speak English to them. (Quite rightly so, I might add.) So if I want support I need to employ Johnny.
I wrote a command-line app that imports a NESSUS security scan XML data file - the largest I've seen to date is about 8 GB. We import the data into a SQL Server database. It's not multi-threaded at all, as I recall. I do remember that the file was too big for XDocument to work.
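For what it's worth, the usual way round that is XmlReader: XDocument loads the whole document into memory, while XmlReader streams it, so you materialise one subtree at a time. A rough sketch (the element name is hypothetical, not the actual Nessus schema):

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static class XmlStreaming
{
    // Yields one XElement per matching element without ever
    // holding the whole file in memory.
    public static IEnumerable<XElement> StreamElements(string path, string elementName)
    {
        using (var reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                    yield return (XElement)XNode.ReadFrom(reader); // consumes the whole subtree
                else
                    reader.Read();
            }
        }
    }
}

// foreach (var item in XmlStreaming.StreamElements("scan.nessus", "ReportItem")) { ... }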
I feel your pain.
".45 ACP - because shooting twice is just silly" - JSOP, 2010 ----- You can never have too much ammo - unless you're swimming, or on fire. - JSOP, 2010 ----- When you pry the gun from my cold dead hands, be careful - the barrel will be very hot. - JSOP, 2013
If the parsing can be partitioned into n subproblems, where n is the number of cores, then I would consider creating n daemons and locking each one into its own core. If any of them block, offloading the blocking operations to thread pools might help.
Partitioning the problem will help to reduce semaphore contention and cache collisions.
But I haven't had to populate a large database this way, so I could be full of shite.
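For the record, a rough sketch of the one-worker-per-partition idea (thread-to-core affinity isn't directly exposed in .NET, so this just gives each core's worth of work its own dedicated thread and lets the OS scheduler spread them out):

using System;
using System.Linq;
using System.Threading;

class PartitionedWorkers
{
    static void Process(int item) { /* per-item work */ }

    static void Main()
    {
        int[] work = Enumerable.Range(0, 1_000_000).ToArray();
        int n = Environment.ProcessorCount;
        var threads = new Thread[n];
        for (int p = 0; p < n; p++)
        {
            int part = p; // capture a copy for the closure
            threads[p] = new Thread(() =>
            {
                // Each worker owns a contiguous slice: no shared writes,
                // so no semaphore contention and fewer cache collisions.
                int lo = (int)((long)work.Length * part / n);
                int hi = (int)((long)work.Length * (part + 1) / n);
                for (int i = lo; i < hi; i++) Process(work[i]);
            });
            threads[p].Start();
        }
        foreach (var t in threads) t.Join();
    }
}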
I was vaguely reminded of an episode of Home Improvement, where one of the kids got himself into some trouble, so one of the brothers says, "I wouldn't wanna be you right now", and the other responds with, "I wouldn't wanna be you, ever".
I was going to say - the size of the work done in each task is key...
But the underlying technology can also have an effect, by reducing the cost of task creation. If you're using a work queue on top of a thread pool, you're not creating a thread for each task, you're pushing/popping tasks on and off a queue.
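Something like this, as a minimal sketch (BlockingCollection standing in for whatever queue the real library uses): submitting work is just an enqueue, and a fixed set of workers pops jobs off the queue.

using System;
using System.Collections.Concurrent;
using System.Threading;

class WorkQueueDemo
{
    static void Main()
    {
        var queue = new BlockingCollection<Action>();
        var workers = new Thread[Environment.ProcessorCount];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(() =>
            {
                // Blocks until a job arrives; exits cleanly once the
                // queue is marked complete and drained.
                foreach (var job in queue.GetConsumingEnumerable())
                    job();
            });
            workers[i].Start();
        }

        for (int i = 0; i < 100; i++)
        {
            int id = i; // capture a copy
            queue.Add(() => Console.WriteLine($"job {id}"));
        }
        queue.CompleteAdding();
        foreach (var w in workers) w.Join();
    }
}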
The file search library that I use adds a new task for each directory it sees. Each task processes just the files that are immediate children of the directory the task was created for.
The detection of duplicates is split so that each task hashes a group of files that have the same size. This is performed using a data parallelism library, which makes parallelising things very easy.
The speedup I get isn't anywhere near the number of processor cores in use (just over 2x on an eight-core machine), but I think the amount of IO being done serialises the processing to a certain degree. Benchmarking ripgrep, another tool that uses similar parallelism, shows that running with 8 threads (on 8 logical / 4 physical cores) is just over 3x faster than running with 1.
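Not the library I actually use, but PLINQ gives the same shape if anyone wants to try it: group files by size (files of different sizes can't be duplicates), then hash each same-size group in parallel and flag matching hashes.

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

class DupFinder
{
    static string Hash(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return Convert.ToBase64String(sha.ComputeHash(stream));
    }

    static void Main(string[] args)
    {
        var duplicateGroups = Directory
            .EnumerateFiles(args[0], "*", SearchOption.AllDirectories)
            .GroupBy(f => new FileInfo(f).Length)
            .Where(g => g.Count() > 1)   // a unique size can't have a duplicate
            .AsParallel()                // hash the remaining groups in parallel
            .SelectMany(g => g.GroupBy(Hash).Where(h => h.Count() > 1));

        foreach (var group in duplicateGroups)
            Console.WriteLine(string.Join(" == ", group));
    }
}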
Why are you even parsing XML files, and 80 GB ones at that, and then saving them to the database?! You could try the SQL Server bulk import tools to do this and avoid programming such stuff altogether...
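And if the XML still has to be reshaped in code first, a middle road is SqlBulkCopy: parse in a streaming fashion and push rows in batches rather than one INSERT at a time. A hedged sketch; the table and column names are invented:

using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient on older stacks

static class BulkLoader
{
    public static void BulkLoad(string connectionString,
                                IEnumerable<(string Host, string Finding)> rows)
    {
        var table = new DataTable();
        table.Columns.Add("Host", typeof(string));
        table.Columns.Add("Finding", typeof(string));

        using (var bulk = new SqlBulkCopy(connectionString)
        {
            DestinationTableName = "dbo.ScanResults", // hypothetical table
            BatchSize = 10_000
        })
        {
            foreach (var (host, finding) in rows)
            {
                table.Rows.Add(host, finding);
                if (table.Rows.Count == 10_000)
                {
                    bulk.WriteToServer(table); // push a full batch
                    table.Clear();
                }
            }
            if (table.Rows.Count > 0)
                bulk.WriteToServer(table);     // push the remainder
        }
    }
}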
"Progress doesn't come from early risers – progress is made by lazy men looking for easier ways to do things." Lazarus Long