|
Greetings Colby,
Thanks for your interest. No matter what you do with C#/.NET, or GPL code in general, it is rarely a waste of time, especially when we're talking hours and not days.
"Safe", or managed, .NET code for a situation like this is almost unbearable. That is to say, without the benefit of pointer arithmetic you add a lot more overhead than an unsafe solution does.
Let me be clear about something (and I only found this out somewhat recently myself): using unsafe code does not mean that code which calls my unsafe port must itself be marked unsafe. To elaborate, .NET allows you to write unsafe DLLs, and once compiled into an assembly, the unsafe code sits behind a surface that appears "safe" to any other software which references that DLL.
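To illustrate, here is a minimal sketch of what I mean (the class and method names are hypothetical, not the actual ones from my port): the assembly is compiled with the /unsafe switch, but callers only ever see an ordinary byte[] API.

using System;

// Compiled into its own assembly with the /unsafe compiler switch.
// Anything referencing the DLL sees only this safe-looking signature.
public static class PacketCompressor
{
    public static byte[] Compress(byte[] source)
    {
        byte[] destination = new byte[source.Length + 400]; // assumed worst-case scratch size
        int written;

        unsafe
        {
            // 'fixed' pins both arrays so the GC cannot move them while raw pointers are held.
            fixed (byte* src = source)
            fixed (byte* dst = destination)
            {
                written = CompressCore(src, source.Length, dst); // unsafe worker, private to the DLL
            }
        }

        Array.Resize(ref destination, written);
        return destination;
    }

    private static unsafe int CompressCore(byte* src, int length, byte* dst)
    {
        // Placeholder body: just copies the input so the sketch is complete.
        // The real pointer-arithmetic implementation would live here.
        for (int i = 0; i < length; i++)
            dst[i] = src[i];
        return length;
    }
}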
While I have not had a chance to look at your code yet (being Christmas and all, I'm kinda busy), I will state that a 100% safe and managed solution will end up considerably slower as a result of boundary checking and other overhead that does not apply to unsafe code and fixed pointers. For example, garbage collection occurs a lot more frequently in managed code than in the unmanaged version, because "fixed" pins the buffer in memory; since it stays pinned during the entire compress and decompress process, little garbage collection or memory defragmenting happens in the meantime.
I wouldn't care to guess how the managed solution holds up, not just under profiling but in actual use. The reason I ported QuickLZ was the high performance, and even my unsafe port has only seen approximately 1/3rd of the C code's speed. I would guess that fully managed code degrades that performance considerably further.
I have, since the writing of this article and with the help of the original author, fine-tuned and rewritten parts of the algorithm to be better suited to packet compression: smaller headers, optimization for smaller data subsets, and removal of some code the original author described as "added overhead" for smaller packets.
Anyway, the reason I ported it was out of boredom; I didn't see it as a waste of time even if I never personally use the code... I know someone will, and perhaps your pure managed solution will be the right one for someone who doesn't want to mark their code unsafe and doesn't want to use my port as an external DLL. A pure managed solution, while without a doubt slower, may fill a small void, so don't be too discouraged. Thanks for giving us all a link, and don't give up porting that archaic C code, I'm in the same boat.
|
|
|
|
|
Guessing about optimization is one of those seven deadly sins, isn't it? It will always come back to bite you.
Before you read any further, please note that I am not comparing code, algorithms, or "real world" situation tests. I am only attempting (very poorly) to make the point that array access is very rarely significantly slower than using unsafe pointers. As I interpret it, an unsafe pointer accesses managed memory through a C-style pointer, without the bounds checking (and possibly type checking) of array access.
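To put a concrete picture on that claim (a minimal sketch of my own, not code from either of our ports), these two loops do the same work. In the first, the JIT can prove i never leaves the array's range and elide the bounds checks, so it often compiles down to nearly the same machine code as the second:

// Safe version: the JIT recognizes the i < buffer.Length pattern
// and can remove the per-access bounds check entirely.
static int SumSafe(byte[] buffer)
{
    int sum = 0;
    for (int i = 0; i < buffer.Length; i++)
        sum += buffer[i];
    return sum;
}

// Unsafe version: no bounds checks by construction, but the array
// must be pinned for the duration of the loop.
static unsafe int SumUnsafe(byte[] buffer)
{
    int sum = 0;
    fixed (byte* p = buffer)
    {
        for (int i = 0; i < buffer.Length; i++)
            sum += p[i];
    }
    return sum;
}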
Don't be so quick to write off the capabilities of the JIT optimizer; it has surprised me (and made me feel dumb) on more than one occasion. I have found that it handles array access very well and only does checks as necessary. I've ported lots of C code to C#, and there have only been one or two situations where using unsafe methods proved faster than the array access code I was able to come up with. Just to fight boredom over my lunch break, I ran a few tests using Stanislav Panasik's test app, modified to fix a bug where the reading FileStream was not being closed and disposed of in a timely manner. I renamed my class to QuickLZ2 and uncommented the following lines one at a time in Tester.cs:
Byte[] ResultBytes = QuickLZ.Compress(SourceBytes, 0, (uint)tr.lSourceBytes);
Byte[] ResultBytes = QuickLZ2.Compress(SourceBytes);
Byte[] ResultBytes = QuickLZ.Decompress(SourceBytes, 0);
Byte[] ResultBytes = QuickLZ2.Decompress(SourceBytes);
I ran 3 series of tests against a couple of binaries I had on my machine. Code changes were limited to changing a single line of code for each test. All were done using a Release mode build and I restarted my machine before each test. There was no other software running at the time, background or foreground, other than the normal Windows processes. All that said, this is a pretty rudimentary test and in no way conclusive or at all scientific.
Before each series I ran the Compress/Decompress code 3 times. To get my measurements I ran the test 14 times, dropped the 2 highest and 2 lowest scores and averaged the remaining 10. I checked to ensure that the compression ratios were the same for both versions of code. This involved modifying the sequence length variable in my code to match yours. Sorry for the difficulty formatting the data.
                      Array Access                     Unsafe Pointers
Size      Ratio       Compression    Decompression     Compression    Decompression
412Kb     70.08%      48.45Mbps      68.45Mbps         32.51Mbps      70.23Mbps
7.27Mb    65.87%      48.33Mbps      63.32Mbps         33.74Mbps      67.39Mbps
18.64Mb   58.95%      45.35Mbps      59.19Mbps         37.43Mbps      65.59Mbps
As you can see, at least on these 3 binary application files, compression was always faster using the "slow" array access code. For some reason (I still point the finger at the memcpy_up method), decompression was always slower. After looking at your code, the final array copy into the destination buffer of the Compress method could be smudging the results slightly. Still, I am doing an Array.Resize at the end of my Compress method, so this should not be a deciding factor in the results I gathered.
Really the test needs to be run again using packet-sized data. Unfortunately it is hard to find data of that size without generating it, which would reflect a real-world situation even less than these tests do. I see a few possible hot spots glaring in your code, but I think I have already made my point that array access is not always slower than pointer access, at least not by any significant margin, and in many cases it can be faster.
As a side note, I ran the last series of tests again using a Debug build. The array access code was faster by a very significant margin for both compression and decompression. Makes you wonder what kind of magic actually happens behind the scenes. I have been wanting to finish an article that I started some time ago comparing array access to unsafe pointers; maybe now would be a good time, seeing as I have two comparable code bases to work with.
I must say that reading your response to my earlier message was a true pleasure and that I look forward to corresponding further.
Colby
|
|
|
|
|
Well done on the research end of things.
As for my code, it is far from optimized; I found a few rather ugly spots in it as well that could be optimized better. It was more of a direct, first-shot 1:1 port. The extra cost you see is definitely from the difference between resizing and reallocating. If you wanted to do a better test, make the 2 unsafe methods public and call them with preallocated memory. There is still quite a bit of code to clean up in the decompression; however, it is a more accurate comparison, as it is a one-time allocation in both code bases.
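Something along these lines is what I have in mind (a rough sketch only; CompressInto is a stand-in name, not an actual method in my port), so both versions pay for their output buffers once, outside the timed loop:

using System;
using System.Diagnostics;
using System.IO;

class PreallocatedBenchmark
{
    static void Main()
    {
        byte[] source = File.ReadAllBytes("test.bin");   // any sample file
        byte[] scratch = new byte[source.Length + 400];  // assumed worst-case output size

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 100; i++)
        {
            // Stand-in for the (currently private) unsafe worker, exposed so it
            // writes into a caller-supplied buffer instead of allocating its own.
            int written = QuickLZ.CompressInto(source, scratch);
        }
        sw.Stop();

        Console.WriteLine("Average: {0:F1} ms per pass", sw.Elapsed.TotalMilliseconds / 100.0);
    }
}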
However, I digress. The difference we see here is the kind of difference we'd care about with C and machine code, not with .NET, so using a fully managed solution may well be viable.
The memcpy_up concept is really a rather difficult one to solve in .NET in general, and since you seem to have found a somewhat cleaner solution for it through managed code, your version may supersede my attempt. I don't intend to keep maintaining mine; I'd rather start pushing towards what I did all this for in the first place, which is a robust and scalable secure TCP network engine. However, I do believe I will keep using unsafe code in the engine for one reason: once it's inlined, compression and decompression need every bit of speed they can get so as not to become a bottleneck as the engine scales up to the goal of (hopefully) 1000 concurrent connections.
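For anyone following along, this is roughly what makes memcpy_up awkward (my reading of the original C, not code lifted from either port): it is an overlapping, forward-moving copy, so bytes written earlier in the loop feed the bytes written after them, which is how a short match expands into a long RLE-style run. Array.Copy can't be used for it because Array.Copy preserves the original source values when the regions overlap, instead of reading the freshly written ones.

// Overlapping forward copy: when the match offset is smaller than the match
// length, bytes written earlier in this loop become the source of later bytes.
static void CopyUp(byte[] buffer, int src, int dst, int length)
{
    for (int i = 0; i < length; i++)
        buffer[dst + i] = buffer[src + i];
}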
This is starting to get a bit off topic, so I will add some additional details and get your feedback in a private message.
|
|
|
|
|
I'm going to reply to your private message soon but I thought I would share this for everyone else.
I modified the Compress method of your code to use the same Array.Resize approach as mine. Compressing a 7.5Mb buffer in a loop 100 times, it shaved about 0.8 seconds off of a 23 second run, so I don't think that is the major hold-up. I think the main thing is that the JIT is able to compile the array access code down to something remarkably similar to the unsafe pointer code. Also, I believe that because I am using the language's own array access, the JIT is free to perform better optimizations, whereas your pointer code is mostly limited to what you can do with your algorithm and pointer-math smarts.
The end result is that I believe our code turns into something similar, which is why there isn't a huge speed gap. Aside from the memcpy_up method, I don't believe there are any optimizations you could make in your code that I couldn't make in mine.
I work in the medical imaging field and have been able to write C# code that performs real-time operations on large images (2000x2000 pixels) without using pointer math. Much of our software is deployed using ClickOnce and runs in an IE app domain, so we don't use unsafe assemblies. The .NET runtime provides many features to help you write very efficient code in a short amount of time, if you know where to find them. Also, because C# is compiled to IL, finding hot spots in your code isn't always as intuitive as you would think, so a good profiler such as Red Gate's or dotTrace is invaluable.
It comes down to asking yourself if you want to fight the machine or use it to your advantage. Arrays are much nicer to work with.
Colby
|
|
|
|
|
I think I found a bug in your code. I created an empty byte[] of size 256 and tried to compress it; up to a size of about 270 or so, it was crashing. I tested my code and it worked fine, though I think I may have spotted other errors in my own code to fix (they have to do with the placement of the second QCLZ string; I haven't looked closely yet).
Anyway, hopefully you'll maintain this code and look into the problem. I tried to track it down, but your code is considerably different from the original and I found it difficult to see where it's going wrong. All I could figure out is that the loop appears to run further than it should on the source, and FastRead4 tries to read past the end of the source. This wouldn't happen in typical use, but with a large empty buffer it tries to match a really long RLE sequence, reaches the end, and then tries to match another 4 bytes when only 1 or 2 are available. Adding the boundary check wasn't enough; the loop continued to run, I believe because the main loop uses a subtraction of two uints, which wraps around to a really high value. That's as far as I got before I tested my code again and confirmed that it compresses an empty buffer, although the output stays larger than the input until around 300 bytes, where it starts to shrink.
|
|
|
|
|
This bug has to do with the main while() loop and GuaranteeUncompressed and SequenceLength both being uints in an unchecked region of code. SequenceLength is larger than GuaranteeUncompressed for such a small buffer, so the subtraction underflows. There is a fix on my site if you want to download the latest zip. Hopefully I will have time to get the rest of the site back operational soon.
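For anyone curious, the failure mode looks roughly like this (a contrived sketch with made-up values, not the actual loop from either port): in an unchecked context a uint subtraction silently wraps, so a loop condition based on it stays true long after it should have failed.

using System;

class UnderflowDemo
{
    static void Main()
    {
        uint guaranteeUncompressed = 200;   // a small input buffer
        uint sequenceLength = 2050;         // the SEQLEN default

        unchecked
        {
            // Instead of going negative, the subtraction wraps to a huge value,
            // so a loop driven by it keeps running and reads past the source.
            uint remaining = guaranteeUncompressed - sequenceLength;
            Console.WriteLine(remaining);   // prints 4294965446
        }

        // The fix amounts to comparing before subtracting (or doing the math in signed ints).
        if (guaranteeUncompressed > sequenceLength)
        {
            // only then is the subtraction meaningful
        }
    }
}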
As for maintaining this code.. if you can talk the original author into allowing a license other than the GPL then I would be glad to address bugs and even work to improve the code and compression algorithm. Otherwise this code is not of much use to me and so it will eventually bitrot and disappear.
Your private message was not forgotten, it still has a star beside it. With the holidays I haven't had much time to reply but I will get to it.
|
|
|
|
|
Ultimately, I ended up just plopping my code back in. I need to optimize the protocol for data < 64kb: remove the unnecessary headers and try to reduce them to one 8-bit header and one 16-bit header; the rest is unnecessary.
I understand your reluctance to continue maintaining the code; my original port is probably going to find its way to nirvana eventually as well. As nice as the GPL is, I admit that anything short of the LGPL is going to cause this code to go to waste.
No worries on the private reply, it wasn't time critical. I have already written another iteration of my network engine and I think I'm happy with this one. Just need to optimize the compression/decompression for packet type data.
At this point now I'm just bored, and will probably be looking for something new to port, or try to think of a hobby project to consume a little time.
|
|
|
|
|
I have been trying to work out how to detect the IP addresses of the printers installed around my network based on their drivers, but I am getting more frustrated each day.. heheh.. Win32_Printer, Win32_NetworkAdapter and Win32_TCPIPPrinter all failed to give me the information that I want.
If you're kind enough to share your knowledge, please email it to me at janverge@gmail.com.
I would prefer your solution to be in C#
thanks again..
Merry Christmas
nice one..
|
|
|
|
|
In the current port, up to SEQLEN = 2050 bytes of the source are saved uncompressed. To increase the compression ratio for small data sources, find the line
const UInt32 SEQLEN = 2 + (1 << 11);
and replace it with
const UInt32 SEQLEN = 64;
The constant can be fine-tuned experimentally, but setting it too low can decrease both compression ratio and speed, because it also determines the maximum RLE length and LZ length allowed. It could also be set at runtime according to the 'size' argument - the only reason it isn't is for optimization reasons in the ANSI C version.
Maximum setting is 2050, never exceed that.
- Lasse Reinhold (author of original QuickLZ)
|
|
|
|
|
Thank you for gracing us with a little wisdom on the specifics. Today I got a little adventurous and murdered the algorithm into a ComplexBuffer class I've been working on, and got some nice results integrating it. I did fine-tune it down to 64-byte sequence matching, and I didn't notice much change in the ratio, as we discussed. However, I was able to compress a 390 byte file down to 328, so that was good.
In my butchering for my buffer, I cut the header costs down to 5 bytes on compressed data and 1 byte on uncompressible data (no easy task, let me tell you...), and now it seems suitable for packet compression of smaller input sources.
Glad to see this is stimulating a little discussion
|
|
|
|
|
I think this algo is very suitable for embedded devices, where you need faster compression on a slow CPU.
Thanks for the code, very interesting!
|
|
|
|
|
I'm not sure if I'm doing something wrong, but using it straight out of the box, so to speak, my compressed array was larger than the uncompressed array. Then again, I know this can be the case with some binary files.
In another test the difference favoured the compressed array, but only marginally. I compressed the same files using standard PKZip and WinRAR and they performed significantly better. The important thing is that the routines worked. I will do some more tests on other files.
The important thing for me at this point is (a) it does work and (b) I've been interested in seeing 'unsafe' code in practical use. Thanks for an excellent set of routines.
|
|
|
|
|
You can use my testing utility to discover all aspects of this algo.
Source code is available here
|
|
|
|
|
Very interesting, I see you wrote a performance-counter-based test. Would you be willing to post your results in a message here for all to see, and give a quick blurb on how you profiled it versus the original C code? Your form-based app looks interesting, but I admit I didn't dig in deep, as I do intend to port the original quick32.exe for as close to a real comparison with the original code as possible.
All the same, some preliminary profiling would be nice; please share some results with us.
|
|
|
|
|
Yes, it is based on the QueryPerformanceCounter API function. I measured only the compression part of the code (the time spent in the Compress function).
I have not tested it against the C implementation, sorry.
About the results - they vary quite a bit: for larger files the compression speed is higher, for smaller files it is lower.
On the test text file, compression is about 72% and the speed is between 30 and 40 Mb/s (on the same file, across several different runs).
I think the difference in results for the same file is caused by the .NET platform: this is not pure machine code, so it can execute differently from run to run - same output, but different execution speed.
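For reference, this is roughly the shape of the measurement (a simplified sketch, not my actual test utility): Stopwatch uses QueryPerformanceCounter under the hood, and only the Compress call sits inside the timed region.

using System;
using System.Diagnostics;
using System.IO;

class ThroughputTest
{
    static void Main(string[] args)
    {
        byte[] source = File.ReadAllBytes(args[0]);

        Stopwatch sw = Stopwatch.StartNew();   // wraps QueryPerformanceCounter
        byte[] compressed = QuickLZ.Compress(source, 0, (uint)source.Length);
        sw.Stop();

        double mbPerSec = (source.Length / (1024.0 * 1024.0)) / sw.Elapsed.TotalSeconds;
        double ratio = 100.0 * compressed.Length / source.Length;

        Console.WriteLine("Ratio: {0:F2}%  Speed: {1:F2} MB/s", ratio, mbPerSec);
    }
}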
|
|
|
|
|
Yeah, unfortunately with .NET it is very difficult to get stable, accurate results. A lot of this has to do with, as you said, the specific implementation. It also has to do with .NET's garbage collection. I haven't done extensive research on the default garbage collector, but I believe it collects everything in size and generation blocks, with the result that local-scope stuff typically gets destroyed quickly. However, because of the nature of .NET, you don't really have local-scope variables with a cheap, linear allocation on the stack; since most things end up on the heap, features like the heap compaction that prevents memory fragmentation start to kick in, which should make memory access faster in general but may cause slowdowns in routines like this.
There may be a way to improve the performance by providing a new garbage collection routine during the compression routine.
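I haven't tried it here, but one lighter-weight option than writing a collector (purely a sketch, and it assumes a framework version where GCSettings.LatencyMode exists, .NET 3.5 SP1 or later) would be to ask the runtime to hold off on non-critical collections around the hot call:

using System.Runtime;

static class GcTuning
{
    // Discourage full, blocking collections while the compressor runs,
    // then restore whatever latency mode was in effect before.
    public static byte[] CompressWithLowLatency(byte[] source)
    {
        GCLatencyMode previous = GCSettings.LatencyMode;
        try
        {
            GCSettings.LatencyMode = GCLatencyMode.LowLatency;
            return QuickLZ.Compress(source, 0, (uint)source.Length);
        }
        finally
        {
            GCSettings.LatencyMode = previous;
        }
    }
}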
However, at 30-40MB/sec I'm not entirely impressed with the implementation, and I may look at ways to improve that, particularly for the packet compression implementation. But as you accurately stated, since it's not machine code there are no guarantees about performance: .NET can do a lot in the background right in the middle of a single instruction during Compress. It might do a round of garbage cleanup, defragment memory, and optimize the VM state all at the same time. Still, seeing a top end of 40MB/sec is a bit disconcerting. I wonder if this is a C# aspect, or whether writing it in managed C++ would make any difference at all...
|
|
|
|
|
Astaelan wrote: However, at 30-40MB/sec I'm not entirely impressed with the implementation
I can't agree with you
This rate is for a 7 Mb text file, but for a 19 Mb text file the speed is about 60-70 MB/s, so the implementation is good.
The only way to improve speed is to modify the algo, but then you lose compression, don't you? So for this specific type of compression, and for the .NET platform, the speed is good, I think.
Astaelan wrote: by providing a new garbage collection routine
That would be overkill, in my opinion. You can't build an ideal piece of code; there are always some pros and cons, so it's better to play a bit with the parameters than to spend a week on a new garbage collector to improve one metric by a few percent.
But if you are able to write a super-fast compressed memory buffer, that sounds very good. Memory can always be put to use, so a compressed buffer is a very interesting idea. Even if you only get 30% compression, that is very good - it is 30% of your memory freed.
Happy new year from Russia !
|
|
|
|
|
Writing it in C++/CLI would certainly make a difference: have the compression done in native code, and then have the .NET side of the assembly hand the results back.
(C++/CLI can mix native and managed code)
ROFLOLMFAO
|
|
|
|
|
You should get ~50% for text. Is it a small .txt file? If so, try setting SEQLEN as described in my other thread.
|
|
|
|
|
Good point. If the file was small, there could easily be low compression, because the current implementation is tuned for files > 100kb, as mentioned. If your text file was smaller, it could easily be realizing less of the potential.
As rlasse pointed out in another post, by reducing SEQLEN to 64 I got the 12132 byte quick.c (original source) text file down to 5300 bytes or thereabouts, I forget now. So with a minor optimization specific to smaller input data, it managed a pretty significant ratio over what you would see with a 2050 byte sequence match, which leaves a fairly large chunk of uncompressed data at the end.
|
|
|
|
|
Okay, lacking any formal knowledge of the algorithm: yes, it is possible that the current implementation may not compress anything. To clarify, I spoke with the original author about the downsides of the algorithm, and one of them is rather significant... Here's why:
Within the code is an optimization to find sequences of up to SEQLEN = 2 + (1 << 11) = 2050 bytes; of course, it's unlikely you will ever find a sequence that long. A side effect of what makes the sequence matching so fast is an obscure reason why some files may not compress well: with the algorithm in its current state, if you only had, say, 5000 bytes of data to compress, the worst case could be that only 5000 - 2050 of those bytes actually get compressed, because the last stretch of the source remains uncompressed for faster processing (talk to the author if you wish to know more).
Also, it should be noted that yes, PKZip and WinRAR are fully expected to get better compression ratios, but they also take about 100 times longer to compress and decompress data. This algorithm is best suited to streaming compression. However, in its current form it isn't fully streamlined for packet compression, so there are some cases where QuickLZ may seem to be doing nothing. The algorithm needs to be adjusted for that, but doing so would break compatibility with other QuickLZ implementations, so this needs to be understood before hacking it apart to write a more efficient class.
It should also be realized that any file which is already compressed is going to get little to no compression from QuickLZ (e.g. DivX files and other previously compressed formats).
To give you an idea of the compression, I took a very simple, relatively small EXE (Putty.exe) with an original size of 412 KB (421,888 bytes).
Compressed with QuickLZ, this came down to around 292 KB (299,008 bytes).
It's a modest saving, considering the code is still optimized for larger files, but that still works out to roughly a 29% reduction in size.
It should also be noted that with a file which, worst case, cannot be compressed by QuickLZ, the output will be slightly larger than the original, because the compression headers are still written on uncompressed results (more accurately, one of the headers is marked compressed or not, so you don't need to track yourself whether data was compressed before decompressing it).
By slightly larger, read: 8 * 4 bytes (8 headers, a few of them unused and reserved). I have discussed modifying the algorithm to suit packet compression, and that may follow soon; however, I think I am first going to port quick32.exe to profile the speed difference between .NET and the original C code, to see how much is lost going from C to unsafe C#.
I appreciate that the result may not have been what you expected; please continue your testing and let me know if you find any actual malfunction of the code... You can use the existing quick32.exe to confirm that both compression and decompression work, regardless of whether the ported code performed the compression or not.
As far as unsafe code goes, there are a few obscure limitations (e.g. you can't modify a fixed pointer directly; you have to declare a new pointer and adjust that instead). Otherwise, unsafe C# is a lot like working with pointers in C, with a little more control over overflow checking via checked/unchecked (which I have no reason to use in this implementation, so the code is pretty close to a 1:1 port of the original C).
Thanks to all for your interest in the algorithm, nice to get positive feedback.
|
|
|
|
|
This was actually one of the original motivations for porting it; however, I have not tested whether the code will work on the Compact Framework or not. If you have the opportunity to test it, I'd happily add a little blurb in the article about its potential uses for the Compact Framework and small devices like PDAs.
I imagine anything involving wireless communication between small devices could benefit tremendously from this. However, when dealing with communication protocols, this is where I've been holding my breath on QuickLZ until I can optimize it for small packet compression. After talking to the original author, it is clear that for data in the 1kb-64kb range, a maximum sequence length of around 32 bytes would significantly outperform the current large default.
After I do some other testing, profiling, and some work on another class I'm writing called ComplexBuffer, I will be looking back at the compression routine and improving it for integration into ComplexBuffer as a packet compression implementation. Stay tuned if you're interested.
|
|
|
|
|
QuickLZ for compressing and decompressing packets? There is a lot of hope that this algorithm will be very useful in this area, but it also raises some issues, like malicious input and garbage input that might crash a program using the algorithm to decode QuickLZ streams. I wonder whether it would be safe to leave the issue alone and let the consumer of the algorithm worry about it, or whether to add some checks within the code and hurt performance a little.
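Just to make the concern concrete (a hypothetical sketch; the size field and its position are made up, not the real QuickLZ header layout): before trusting anything taken off the wire, a decoder would want checks along these lines, and every such check costs a branch in the hot path.

using System;
using System.IO;

static class PacketGuard
{
    // "decompressedSize" is a made-up header field; the point is only that sizes
    // read from the network must be sanity-checked before allocating or writing,
    // otherwise garbage input can crash (or exhaust the memory of) the receiver.
    public static byte[] SafeDecode(byte[] packet, int maxAllowedSize)
    {
        if (packet == null || packet.Length < 8)
            throw new InvalidDataException("Packet too short to contain a header.");

        int decompressedSize = BitConverter.ToInt32(packet, 0);
        if (decompressedSize < 0 || decompressedSize > maxAllowedSize)
            throw new InvalidDataException("Unreasonable decompressed size in header.");

        byte[] output = new byte[decompressedSize];
        // ... decode into output, validating every offset and length against the buffers ...
        return output;
    }
}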
ROFLOLMFAO
|
|
|
|
|
I completely agree. Overall it's interesting!
|
|
|
|
|