I was vaguely reminded of an episode of Home Improvement, where one of the kids got himself in some trouble, so one of the brothers says "I wouldn't wanna be you right now", and the other responds with "I wouldn't wanna be you, ever".
I was going to say - size of the work done in each task is key...
But the underlying technology can also have an effect, by reducing the cost of task creation. If you're using a work queue on top of a thread pool, you're not creating a thread for each task, you're pushing/popping tasks on and off a queue.
The file search library that I use adds a new task for each directory it sees. Each task processes just the files that are immediate children of the directory the task was created for.
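A rough sketch of that one-task-per-directory idea, in Python rather than whatever the library itself is written in. The names `list_directory` and `parallel_walk` are made up for illustration, and the thread pool from `concurrent.futures` stands in for the library's own work queue:

```python
import os
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def list_directory(path):
    """One task: return (files, subdirs) that are immediate children of path."""
    files, subdirs = [], []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    files.append(entry.path)
    except OSError:
        pass  # unreadable directory: skip it rather than abort the whole walk
    return files, subdirs


def parallel_walk(root, workers=8):
    """Walk root, queueing one new task for every directory that is seen."""
    found = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(list_directory, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                files, subdirs = future.result()
                found.extend(files)
                # Each subdirectory becomes a task on the queue, not a new thread.
                for subdir in subdirs:
                    pending.add(pool.submit(list_directory, subdir))
    return found
```

The pool size stays fixed however deep the tree is; only the queue of pending tasks grows and shrinks.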
The detection of duplicates is split so that each task hashes a group of files that have the same size. This is performed using a data parallelism library, which makes parallelising things very easy.
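Roughly, the duplicate detection looks like the sketch below. This is an illustration in Python, not the actual code: `find_duplicates`, `hash_group` and `hash_file` are hypothetical names, SHA-256 is an assumed choice of hash, and a plain thread pool stands in for the data parallelism library.

```python
import hashlib
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_group(paths):
    """One task: hash every file in a same-size group, return the duplicate sets."""
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[hash_file(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]


def find_duplicates(paths, workers=8):
    """Group files by size, then hash each multi-file group as its own task."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    # A file with a unique size cannot have a duplicate, so it is never read.
    candidates = [group for group in by_size.values() if len(group) > 1]
    duplicates = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for groups in pool.map(hash_group, candidates):
            duplicates.extend(groups)
    return duplicates
```

Grouping by size first means most files are never opened at all, which matters more than the parallelism when the disk is the bottleneck.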
The amount of speedup I get isn't anywhere near the number of processor cores in use (I get a factor of just over two speedup on an eight core machine), but I think that the amount of IO being done serialises the processing to a certain degree. Benchmarking ripgrep, another tool that uses similar parallelism, shows that running with 8 threads (on 8 logical/4 physical cores) is just over 3x faster than using 1.
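One way to put a number on that serialisation, assuming Amdahl's law is a fair model here (that is an assumption; IO contention is not really a fixed serial fraction), is to solve it for the parallel fraction. The function names are made up for this sketch:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: S = 1 / ((1 - p) + p / n)."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / workers)


def implied_parallel_fraction(speedup, workers):
    """Amdahl's law solved for p: p = (1 - 1/S) / (1 - 1/n)."""
    return (1 - 1 / speedup) / (1 - 1 / workers)


# A 2x speedup on 8 cores implies roughly 57% of the run time is parallel work.
print(round(implied_parallel_fraction(2, 8), 2))
# A 3x speedup with 8 threads implies roughly 76%.
print(round(implied_parallel_fraction(3, 8), 2))
```

Under that model, even the 3x case leaves about a quarter of the run effectively serial, which fits the suspicion that IO is the limit rather than the task scheduling.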
Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p
Why are you even parsing XML files, and 80 GB of them at that, and then saving it all to the database? You could try the SQL Server bulk import tools to do this and avoid programming such stuff altogether...
"Progress doesn't come from early risers – progress is made by lazy men looking for easier ways to do things." Lazarus Long
I can't help, but reading "parse" in the body of the message...
This is clearly a case for... HONEY THE @CODE-WITCH tatatataaaaaaa
If something has a solution... Why do we have to worry about?. If it has no solution... For what reason do we have to worry about?
Help me to understand what I'm saying, and I'll explain it better to you
Rating helpful answers is nice, but saying thanks can be even nicer.
Yes, just to make sure.
I've made test runs just reading an ID from every record which goes twice as fast, and that's on a slow HDD here at home.
And when I move this to a server the disks will be considerably faster.
Lots of Linux guys work in Windows environments. Lots of them hate it; they do it just to earn money so they can pay for their home computers and contribute to Linux-based open source projects in their spare time. And they spend a lot of energy bitching about things not being exactly as they are used to in the Linux world.
My comment was "based on a true story". I made one application storing a fairly complex persistent data structure in a binary format. This was met with heavy criticism: What if that data structure becomes inconsistent - how can we fix up the inconsistencies when it is not in a readable format? I guess I wasn't too polite when answering them that one major reason for not using a readable format was to prevent them from poking into the file with vi, introducing inconsistencies.
In the system I am working on now: It is a Windows desktop application, but there is a function for converting all file system paths to Unix-style forward slash path separators, and a handful of utility functions that fail if you submit a DOS/Windows-style path with backslashes. Forward slashes are the only "correct" path format, they claim - DOS/Windows was simply wrong until it started accepting the correct format. So the (Windows) users of this program must simply accept that when using the conventions of their OS, they are simply wrong.
In an earlier project, the Linux mafia forced me to make special adaptations in my (very) Windows-specific utility: They insisted on running it, in their shell-based batch jobs, from a Linux-adapted command shell that enforced case-sensitive environment symbols. They made use of it, too: Their jobs started crashing, and it boiled down to my utility treating symbols differing only in case as synonyms, while they were distinct in their jobs.
In my current project, one of the first things I did was to replace case-sensitive file name comparisons with case-insensitive ones. It was argued, "But cmake always uses CMakeLists.txt, with exactly that casing! There is no need to do a case-insensitive comparison!" Well... Why did the program then barf? Someone wrote CmakeLists.txt, and the program just failed, because it didn't find the file.
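The replacement amounts to something like this sketch (Python for illustration only; the project's real code and language are not shown here, and `find_file_case_insensitive` is a made-up name):

```python
import os


def find_file_case_insensitive(directory, name):
    """Return the path of the entry in directory matching name, ignoring case."""
    wanted = name.casefold()
    for entry in os.listdir(directory):
        if entry.casefold() == wanted:
            # Return the casing actually on disk, not the one that was asked for.
            return os.path.join(directory, entry)
    return None
```

Asking for `CMakeLists.txt` then finds `CmakeLists.txt` as well, regardless of whether the underlying file system is case sensitive.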
I would have tolerated this a lot more if it weren't for the constant bitching from the Linux mafia about Windows users refusing to learn anything new and clinging to Windows ways of doing things (when working under Windows), rather than learning how these wonderful command-line utilities, ported from the wonderful world of free and unsupported software, expect you to put everything in a loooong command line. This bitching about unwillingness to learn is nothing new: I have heard it constantly repeated for at least 15-20 years, in numerous different environments.
I recently discovered that the Compare plugin for Notepad++ was incapable of comparing two generated build jobs: The command lines invoking gcc were in excess of 3800 characters. Looking up the documentation for the generator, I found explicit warnings about Windows being incapable of handling command lines exceeding 8 Ki characters; this could cause problems (but Npp Compare obviously has a far lower limit). Hooray for command line interfaces, where every detail is available at your fingertips, not in a silly screen form!
Now, a lot of newer Linux-born utilities do use binary formats - but the Linux mafia always has an explanation for why this very special case is justified. For those of us who have lived through several wars, it is interesting to see how a lot of arguments that were touted as super-essential are, a few years after the war was won, silently laid down and more or less replaced by what the losing side was promoting, although usually with a twist, so it will not be recognized.
Let me give one example of this: Packet routing. One of the fundamental strengths of IP as compared to e.g. ATM/FR/OSI-NP is that if a link is broken or congested, packets can select a different route: Each packet contains the full address and is, in principle, routed independently, and can follow any route to the destination. Connection-oriented protocols assume that all packets follow the same route; a link/physical failure requires a full connection reestablishment. ... Yeah, right. In today's Internet, every IP packet finds its own path. Right. You chop off an international trunk fiber, and no router anywhere in the world requires manual intervention to have its routing tables changed; that goes by itself, automatically. Believe the old myths, if you like.
There are several other examples, like Internet and US phone guys insisting on inband signaling (reducing a 64 kbps line to 56 kbps data capacity), while Europeans favor OOB signaling, both in phone systems and data networks. Then when the SIP protocol for establishing IP phone connections (or other kinds of connection) was defined, all that heavy criticism of OOB signaling was kept low - SIP is OOB signaling in a nutshell.
There is no way to completely escape poor solutions promoted by the Linux and Internet communities (which have a large degree of overlap); we have to live with them, in spite of extremely poor tools, poor user interfaces and high overhead (especially when it comes to space requirements). But when I make Windows-specific tools, aimed at a different target audience than Linux hackers, I prefer to do things in better ways.