|
Message Closed
modified 28-Sep-18 18:39pm.
|
|
|
|
|
Holy cow... are you for real? Please take your chest beating elsewhere. You appear to be a typical IT manager: an overbearing alpha male who views management as a means of control instead of getting something accomplished. I hope I never cross paths with you. Types like you give movies like Office Space meaning. Good luck with your bloated resume, dude.
|
|
|
|
|
If you need small data streams and high speed, Microsoft's solution for binary serialization is useless, because of the use of reflection, and the assembly versioning overhead/madness. There are plenty of uses for efficient binary serialization, besides moving typical database application data. Any kind of data that requires millions of records or objects for example: mapping, graphics, data analysis, etc.
check out VG.net: www.vgdotnet.com
An animated vector graphics system integrated in VS.net
|
|
|
|
|
Hi Damon
Well, what a response! I can't think of another piece of similar sized text that has
a) pissed me off...then made me chuckle
b) made me want to punch the author...then ask him for a job
c) made me want to just ignore it...then left me flummoxed as to where to start in a response
You want a debate, well you got one!
Let's start here and put some unwarranted accusations to bed:
DamonCarr wrote: You also state that you can 'deserialize' 2.1MB in .82 seconds... That is 26,224 KB/Sec! I wish I had the network connection you had!!!! Although I am sure you are ignoring this vital fact and are just reporting 'in-process' deserialization, no? Misleading..... But I suppose if this is apples to apples with the previous example, it's fair, but still misleading to the amazing talent pool on this site.
Network connection speed doesn't come into it. It is not a "vital fact" as it is constant whether you use the code or not - further, any saving in the amount of data being transmitted will also reduce transmission time.
The article said "...(to a MemoryStream for maximum speed).." so yes, I am reporting 'in-process' deserialization - no one has been misled at all - hopefully you will be good enough to retract that particular accusation.
Next, let's look at your 'alternatives':
DamonCarr wrote: Instead I would use standard compression technologies (as you describe) and/or 'chunk' the data, perhaps in an queued (MSMQ/MQ Series) fashion, or using remoting perhaps on multiple threads.
All of these options fail as a direct alternative to my code for one simple reason - they are transmission techniques and not serialization techniques.
You can't 'chunk' a dataset or zip/compress a list of entities; they need to be serialized first.
If you were designing a new system and had control over, and full knowledge of, all the classes involved, then you might be able to write some code that could stream portions of a list of similar items over multiple threads, for example - but surely that would be more fragile than this code?
I did have the idea of using two network streams, one for the serialized data and one for the tokens (strings/tokenized objects as they are added), so that serialization and deserialization could occur in parallel on their respective machines. The transient memory footprint would be smaller too, since a circular list could be used. However, this would be a replacement for remoting rather than an optimization and, since the current code is so fast, it would fill up a fixed size buffer before the network connection was made, so there would probably be little or no gain (though it sounds cool).
onwards...the size of the data I was using in an example:
Yes, 34,423 could be considered a lot but for the purpose at the time (the purpose you admit you don't know and didn't bother to ask about before putting pixels on screen) it was, I believe, perfectly reasonable. It served perfectly as a test: the data was readily available, was real-world data, and, probably most importantly, was large enough to give accurate timing figures for each optimization 'trick' I was trying to determine whether it was worth including or not.
One of our existing apps does happen to transmit data of this size across the world, so should we rearchitect the whole application from the ground up, or should we just add a handful of lines to the remoting sinks as the 'minimum necessary change' to achieve the required results (and speed up all other data transmissions as well, for free)?
Your other considerations:
DamonCarr wrote: 1) Cost of Incremental Development and Maintenance. Remember the majority of costs for a system are in the maintenance AFTER release, not development. Most systems die a slow and painful death due to entropy and code like you advocate here which nobody can figure out (when the rookies come in to maintain the thing).
I don't think it is that hard to figure out - method names like WriteObjectArray or ReadString are pretty clear in themselves and a simple reference to this article should be enough to explain to 'rookies' what is happening and why.
Remember that only one method and a constructor are typically involved (for a given class); only key classes might need optimizing; and the same code might suffice for many variations on a class. DataSets and DataTables may be used in many places but the code for their serialization hasn't, fingers crossed, needed to be changed.
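To make "only one method and a constructor" concrete, here is a minimal, self-contained sketch of the pattern. SimpleWriter/SimpleReader are cut-down stand-ins for the article's SerializationWriter/SerializationReader (the real classes add string tokenization and per-type size optimizations); the Voyage class, its fields, and the member names on the stand-ins are hypothetical.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

// Cut-down stand-ins for the article's SerializationWriter/SerializationReader.
public class SimpleWriter
{
    private readonly MemoryStream stream = new MemoryStream();
    private readonly BinaryWriter writer;
    public SimpleWriter() { writer = new BinaryWriter(stream); }
    public void WriteString(string value) { writer.Write(value); }
    public void WriteInt32(int value) { writer.Write(value); }
    public byte[] ToArray() { writer.Flush(); return stream.ToArray(); }
}

public class SimpleReader
{
    private readonly BinaryReader reader;
    public SimpleReader(byte[] data) { reader = new BinaryReader(new MemoryStream(data)); }
    public string ReadString() { return reader.ReadString(); }
    public int ReadInt32() { return reader.ReadInt32(); }
}

[Serializable]
public class Voyage : ISerializable   // hypothetical entity class
{
    public string PortName;
    public int DayCount;

    public Voyage(string portName, int dayCount)
    {
        PortName = portName;
        DayCount = dayCount;
    }

    // The one method: pack all owned data into a single byte[] slot.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        SimpleWriter w = new SimpleWriter();
        w.WriteString(PortName);
        w.WriteInt32(DayCount);
        info.AddValue("data", w.ToArray());
    }

    // The one constructor: read back in exactly the same order.
    protected Voyage(SerializationInfo info, StreamingContext context)
    {
        SimpleReader r = new SimpleReader((byte[])info.GetValue("data", typeof(byte[])));
        PortName = r.ReadString();
        DayCount = r.ReadInt32();
    }
}
```

A BinaryFormatter roundtrip then moves a single "data" slot instead of one named slot per field, which is the whole trick.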
I see you are against using ADO.Net objects and I have some sympathy for that point of view, but LLBLGenPro entities and collections, for example, benefit in the same way - serialization code is written once and will work for all generated entities (and can be extended by the developer if required).
DamonCarr wrote: 2) This would be considered by many to be an inappropriate use of Large Data Transfer across remote processes (not my words or ideas... I am just the messenger - see Martin Fowler's Patterns of Enterprise Architecture as well as many others).
Sometimes, we don't have control over what is being sent via remoting.
The code will work equally well for small amounts of data if you are in the lucky position of being able to ensure that large amounts of data are never sent.
DamonCarr wrote: This article appears to me at least, to be a very smart guy flexing his skills, but perhaps unknowingly spreading bad design concepts to the readers. The fact is, you ALMOST NEVER would want to do this (please, let's debate this.. I would love to hear your opinions)...
"spreading bad design concepts to the readers" rankles a little bit - the code does exactly what it says on the tin - it is a utility class (not a design your whole application must follow) that may be used to help optimize serialization under certain circumstances.
I took a basic idea, acknowledgements are in the article, and just developed it to the nth degree. It was getting almost silly at one point - I drove 30 miles to work at 3am once because I had to know whether Hashtables for string tokens would be fast enough to make a difference (definitely yes, as it turned out). Crazy? Well yes but heck, I enjoyed doing it, the code has provided some major benefits in existing apps and will be reused in future apps. I even learned how to use NUnit as a bonus! Writing an article for CodeProject was probably the hardest bit as I am not a writer and find it incredibly difficult and frustrating - however I'm glad I did.
The readers of this article, as you mention, are more than capable of determining whether the code would be useful to them or not. "Design concepts", bad or otherwise, don't come into it because it is not a design concept, just a utility.
ADO.Net Types:
You mention several times ADO.Net types and the perils of using them and how you hope I am not using them, yet the second part of the article is devoted to serializing/deserializing DataSets and DataTables! Maybe you didn't read this part?
Thanks for the article links - I was aware of them as it happens but the problems detailed therein are not directly relevant since FastSerializer avoids them.
Your Conclusion:
DamonCarr wrote: Rather then create a new Binary Serialization technique which Microsoft spent countless amounts of money and time on, why not consider that the entire architecture you are proposing is flawed (or not.. Like I said I would love to hear more)? Can you really do a better job the Microsoft here? There is a reason it is so big - You shouldn't send so much data unless you have some weird situation
As the article mentioned, Microsoft are hamstrung in some respects since they have no choice but to work to the lowest common denominator and use reflection to retrieve class data. They have acknowledged this limitation by providing the ISerializable interface to allow savvy developers the opportunity to take over the process where they have better insight into the particular serialization requirements of a class. All FastSerialization is doing is optimizing this somewhat.
Are you aware of the LosFormatter class? (I wasn't until recently, as I am not a web developer.) It uses a lot of similar techniques - tokens for specific types etc. - so it would seem that Microsoft are not averse to supplementing standard serialization and optimizing it in certain circumstances.
Also, use Reflector and have a look at the deserialization ctor on DateTime - imagine running that code for every DateTime deserialized - certainly not optimized but, as I said, Microsoft have to work to the lowest common denominator so yes, I really can do a better job than Microsoft here.
Again, it is unlikely you will have control over, or maybe even know, how large a lump of data is created by serializing an object graph built from a single object. Regardless, whatever size it is, both it and the time taken to create it can be reduced with a little effort if the circumstances deem it desirable.
My conclusions (No disrespect intended but since we're being brutally honest...):
You have had a lot to say but very little seems to be of useful substance.
You have made unwarranted assumptions.
You have made unwarranted accusations.
For a self-proclaimed architect/designer guru you have shown a certain lack of design methodology in not researching and understanding the user's requirements/motivation first.
You have shown a lack of understanding between a utility class and a design concept.
You have confused me regarding Agile methodology. For example, you are a "process leader" but your "members" are "basically peers"; "...my lead developers..." - surely contradictory - you are either all at the same level or there is a hierarchy/pecking order.
You have shown some questionable management/inter-personal skills in that you both praise and bollock in public under the guise of "feedback" and "another perspective" yet meaning "No disrespect" and "..was very careful to not ruffle feathers". This is an interesting way of wanting to "start a lively debate" and get some "interesting responses" even when "But these are not my ideas...".
Now having said all that, if you are still up to furthering this debate in a constructive, fair and honest manner, I am happy to spend the time writing up the history of how this code came to be and show why it is a good solution for that history and why I personally believe the code is about as good as it can get (but would be more than happy to be proved wrong).
Cheers
Simon
|
|
|
|
|
Message Closed
modified 28-Sep-18 18:43pm.
|
|
|
|
|
Hi Damon
Looking over your comments and sample code, I can't be certain but I think there may be a misunderstanding here of how Fast Serialization works within a remoting scenario... we'll see..
It doesn't change the remoting technology at all. The MemoryStream used internally (not to be confused with the one I happened to use for testing - that was only there so that serialization memory usage/timing could be measured without being affected by the network) is not transmitted at all, it is merely used to construct a single byte[] representing the owned data for a given object. That single byte[] is then stored within the SerializationInfo block provided by the serialization infrastructure exactly in keeping with the ISerializable contract.
Therefore any talk about chunking *this* MemoryStream is irrelevant. It will disappear anyway as soon as the SerializationWriter instance goes out of scope. What is important is the byte[] it helped to create, which was passed back to the remoting infrastructure via SerializationInfo - it is this that is stored in another MemoryStream, one which is created by the remoting sink and will hold a binary representation of *all* the objects contained in the object graph, regardless of whether they were serialized using the default standard reflection, ISerializable, FastSerialization, Surrogates or whatever.
This sink-owned MemoryStream is also not chunked in any way and will get larger and larger until the whole object graph is built. Only at this point will any custom sinks have any access to this stream and therefore be able to do anything with it such as apply compression or whatever (and as mentioned you may get an OutOfMemoryException first). The point is that Fast Serialization does not have anything to do with this process whatsoever - all it will have done (and done quickly) is reduce the overall size of this sink-owned MemoryStream's buffer in a completely *transparent* way which is a *good thing*.
Therefore I still say that it is not a design concept because we are not changing the remoting design at all - simply storing the data representation for a given object in a different (and better) way.
Now about the code samples you gave which use chunking:
Firstly, that isn't Remoting in the .Net sense (or at least as I understand it) - it is an alternative to Remoting which is stream-based rather than message-based as Remoting is.
Secondly, it describes the transmission process rather than the serialization process - the PDF being received must have been serialized to bytes by the web server first, OR it is being serialized in chunks on request, which is good because it means the whole PDF (or its serialized byte stream) does not necessarily have to be in memory in its entirety - it could be being read chunk by chunk from a filestore or database.
However, .NET remoting does not work that way. The *whole* remoting message/object graph is serialized into *memory* first and *then* it is transmitted and reassembled and only when *all* of the data is received is the object graph deserialized.
In addition, wouldn't you need to write specific methods for each object type being transmitted? This may be fine for certain often-used types such as PDF content, but Remoting should be transparent for any type (or at least those with a [Serializable] attribute).
Your #1 point about me improving performance in a "...very limited situation".
Well *every single entity transmitted* isn't a very limited situation - it is at or very near 100%.
"But the custom plumbing required is extensive..."
By "plumbing", I'm guessing you're referring to custom sinks etc. Well no, nothing like this is actually *required* - the only code you *need* to write is in the GetObjectData() method and the deserialization constructor to meet the ISerializable interface.
If you are not able to do that because you don't have control over the class source then you *can* use some custom plumbing to create a surrogate - even then, the code is almost trivial and is still part of the standard .Net remoting paradigm.
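The surrogate route really is almost trivial. Below is a self-contained sketch of the idea: ThirdPartyPoint stands in for a class whose source you don't control (such as DataSet), and the surrogate packs its fields into a single byte[] slot the same way the FastSerializer approach does, using a plain BinaryWriter as a simplified stand-in for the article's SerializationWriter. All class and member names here are hypothetical.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

// A class we cannot modify (stands in for DataSet etc.).
[Serializable]
public class ThirdPartyPoint
{
    public int X;
    public int Y;
}

// The surrogate supplies GetObjectData/SetObjectData on the class's behalf.
public class PointSurrogate : ISerializationSurrogate
{
    public void GetObjectData(object obj, SerializationInfo info, StreamingContext context)
    {
        var p = (ThirdPartyPoint)obj;
        // Pack both fields into one byte[] slot, FastSerializer-style.
        var ms = new MemoryStream();
        var w = new BinaryWriter(ms);
        w.Write(p.X);
        w.Write(p.Y);
        w.Flush();
        info.AddValue("data", ms.ToArray());
    }

    public object SetObjectData(object obj, SerializationInfo info,
                                StreamingContext context, ISurrogateSelector selector)
    {
        var p = (ThirdPartyPoint)obj;
        var r = new BinaryReader(new MemoryStream((byte[])info.GetValue("data", typeof(byte[]))));
        p.X = r.ReadInt32();
        p.Y = r.ReadInt32();
        return p;
    }
}
```

Registering it is a couple of lines on the formatter (which is exactly what the remoting sink does internally): create a SurrogateSelector, call AddSurrogate for the type, and assign the selector to the BinaryFormatter - still entirely within the standard .Net serialization paradigm.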
The code I wrote for LLBLGenPro has a simple switch to turn Fast Serialization optimization on or off - no change to sinks, no change to remoting configuration. Turn it off and you will get standard .NET serialization.
DataSets and DataTables:
No, they are not read-only, and they are used within an n-tier/domain-driven/OO architecture (not written by me, though). The DataSet is just a transmission device and, for that, is as good as any other really.
When the DataTable is received via remoting it is wrapped in an Entity collection - each contained DataRow is used as the backing store for an Entity instance - all access is done through the Entity Collection and Entity properties (autogenerated from a schema). When changes are made to any/all of the collection contents, just the changed entities/DataRows are sent back to the server (in a DataSet which may contain one or more EntityCollections/DataTables) as a UnitOfWork to be executed on the server. So we are using DataTables/DataRows both as backing fields for Entity instances and for data transmission.
Optimistic/pessimistic is configurable on a per-entity-type basis.
LLBLGenPro/NHibernate:
LLBLGenPro has Entity and EntityCollection classes (plus one or two other relevant classes). Since it comes with the source code, it is possible to add the Fast Serialization code directly to these classes. Once this is done - and no, this wasn't trivial, but it was perfectly doable - Fast Serialization is available for *any* entity/entitycollection, both now and in the future, just by setting a single static boolean flag. The investment is made once but reaped many times. In fact you don't even have to do this yourself anymore as the next release of LLBLGenPro (v2.1) will have this code built into the runtime libraries.
Our existing app, the one that used DataSets, uses a version of the code from pt 2 of the article. It uses a surrogate since we can't change the DataSet code directly but again, both surrogate and serialization code were written once and work for all the tables we have now and any we add in the future. Nothing else in the app needed to be changed and we were already using custom sinks for compression anyway.
NHibernate was in fact the reason I started writing all this anyway ('the purpose'!). We were going to compare our existing system vs NHibernate vs LLBLGenPro (which was my preferred option). I wrote some code to compute the time/memory used by each of the three systems, from single entities to massive chunks of data (including calculations for the total time to transmit over the local network, to the USA and to Singapore, all with and without compression).
Now I knew nothing about NHibernate anyway and found the documentation to be sparse and confusing especially since some of it refers to features only available in its java roots.
As I understood it at the time (disclaimer: you may correct me here!), you have an XML schema file and create entities with properties to match. From the samples I have seen, a private backing field is normally used to store the data which may be an elemental data type or a collection of other entities.
Basically the results using standard serialization showed that NHibernate was no faster than our current system and got slightly slower as the amount of data increased. However LLBLGenPro, which otherwise had all the features we wanted (and no xml file to write!), took up way too much space/time for serialization.
I started a dialogue with the developer, Frans Bouma, on his support forums about how it could be improved back in July last year. ISerializable was already implemented pretty much along the Microsoft recommendations so there wasn't an awful lot to improve there. As I mention in this article, I took ideas from other people, expanded on them and came up with this code. The next version of the test results showed exactly what I wanted (of course!): that LLBLGenPro was much faster and smaller than NHibernate for smaller amounts of data (to which I couldn't see an easy way of applying Fast Serialization because it only used backing fields), and that the gap increased as the data got larger. (To be fair, NHibernate is still better with regards to entity memory footprint but, as you say, memory is cheap!)
The serialization code was refined to cope with all LLBLGenPro-specific scenarios including coping with circular references and so on and now will be incorporated into the next release.
"We all sometimes get legacy crap, but we must refactor it or suffer.."
I agree, but I think what we did, both for our existing DataSet apps and the future LLBLGenPro app, *was* refactoring. We identified bottlenecks and found just a handful of places to make modifications which resulted in massive improvements, did not affect existing code, and are quite manageable - I don't expect the code to change much or often.
DataSets will probably change in the future as they did between v1.1 and v2.0, and I just reflectored and made the necessary changes (and posted them here so no one else needed to do it) - even that was just the names of a few private fields as I recall.
LLBLGenPro - well that also may be changed between (major?) versions, but as Solutions Design now own the code, they will do it anyway.
Microsoft:
I still believe that Microsoft are hamstrung over serialization since they have to write code that works for *any* object (with a Serializable attribute) written both now and in the future. I believe reflection is used but it doesn't really matter - they cannot know, other than maybe via the NonSerialized attribute, which private data fields should be included/excluded or whatever. This isn't a criticism - it works great given the constraints they have to work under.
The DateTime point I made:
Yes, in-memory execution is cheap and so the point could be considered irrelevant as the difference is not easily measured. However, that small difference gets multiplied for each use. I suppose that is the difference between adequate and optimized - probably not worth the effort by itself, but if you get that optimization for free as a bonus because you are really optimizing something else, then all the better.
A further point on this subject. Consider that in v1.1, DateTime did not implement ISerializable, and so the private long ticks field was read via reflection and would therefore always take 8 bytes regardless of its content. In v2.0, the ISerializable.GetObjectData looks like this:
info.AddValue("ticks", this.InternalTicks);
info.AddValue("dateData", this.dateData);
It's now storing two values rather than one, and incredibly InternalTicks just returns (((long) this.dateData) & 0x3fffffffffffffff);
so it now takes 16 bytes rather than the original 8 and the new, additional data could have been generated from the old anyway! So yes, I did better than Microsoft for DateTime.
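The redundancy is easy to demonstrate. The sketch below packs ticks and kind into a single ulong the way v2.0's dateData field does (low 62 bits for ticks, top 2 bits for the DateTimeKind - a layout inferred from the mask above) and then recovers both by masking:

```csharp
using System;

const ulong TicksMask = 0x3FFFFFFFFFFFFFFF;

DateTime original = new DateTime(2007, 6, 1, 12, 0, 0, DateTimeKind.Utc);

// Reconstruct the packed field: low 62 bits = ticks, top 2 bits = kind.
ulong dateData = (ulong)original.Ticks | ((ulong)original.Kind << 62);

// Everything the old 8-byte representation held is recoverable by masking,
// which is exactly what InternalTicks does.
long recoveredTicks = (long)(dateData & TicksMask);
DateTimeKind recoveredKind = (DateTimeKind)(dateData >> 62);
```

Since recoveredTicks equals original.Ticks, serializing a "ticks" slot alongside "dateData" stores nothing that wasn't already there.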
Other points about Writing a New Serializer or adding options on the Binary serializer and using sizeof(object) in unsafe code:
I don't think these are relevant given my first response - there is nothing wrong with the existing serializers - we are just giving them less data to store.
sizeof(object) only gives the size of a single object and not the *object graph* that would be involved in serialization.
The bottom line to me is:
1) If your application is providing adequate remoting performance then it ain't broke so don't fix it.
2) If you need to do something then you would likely be looking at implementing ISerializable on *some* classes. If performance is now adequate then the job is done.
3) Still need more? Since you have modified your code to implement ISerializable anyway, have a look at using Fast Serializer to store your class data instead of storing each in a named slot. You can use an if/then/else to make it conditional if required and you won't have to alter the design of your application.
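The if/then/else in step 3 can be sketched in a self-contained way. Below, a BinaryWriter over a MemoryStream stands in for the article's SerializationWriter, and the Order class, its fields, and the FastSerializationEnabled flag name are all hypothetical; the point is that one switch chooses between the single-byte[] slot and the standard named slots, with the constructor detecting which form it received:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

[Serializable]
public class Order : ISerializable   // hypothetical class
{
    public static bool FastSerializationEnabled = true;  // the single switch

    public string Customer;
    public int Quantity;

    public Order(string customer, int quantity) { Customer = customer; Quantity = quantity; }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        if (FastSerializationEnabled)
        {
            // Fast path: everything packed into one byte[] slot.
            var ms = new MemoryStream();
            var w = new BinaryWriter(ms);
            w.Write(Customer);
            w.Write(Quantity);
            w.Flush();
            info.AddValue("data", ms.ToArray());
        }
        else
        {
            // Standard path: one named slot per field.
            info.AddValue("customer", Customer);
            info.AddValue("quantity", Quantity);
        }
    }

    protected Order(SerializationInfo info, StreamingContext context)
    {
        // Detect which representation was stored rather than trusting the flag,
        // so payloads serialized before the switch was flipped still deserialize.
        byte[] data = null;
        foreach (SerializationEntry entry in info)
            if (entry.Name == "data") data = (byte[])entry.Value;

        if (data != null)
        {
            var r = new BinaryReader(new MemoryStream(data));
            Customer = r.ReadString();
            Quantity = r.ReadInt32();
        }
        else
        {
            Customer = info.GetString("customer");
            Quantity = info.GetInt32("quantity");
        }
    }
}
```

Either way the application design is untouched; only the contents of the SerializationInfo block change.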
Now then, about that NYC job.....
Cheers
Simon
|
|
|
|
|
Message Closed
modified 28-Sep-18 18:38pm.
|
|
|
|
|
Hi Damon
The remoting MemoryStream is created by, owned by and processed by the remoting infrastructure - FastSerializer doesn't even get to see it, let alone manipulate it or start raising events from it! If you want to do anything with it at all then you will have to write custom sinks and basically take over the whole remoting process. Now who's making things complicated!! (Actually, I did trace through using Reflector and got as far as creating a custom binary formatter, but there was no way of controlling creation of that MemoryStream - its creation is *very* deep.)
So unless you can prove differently, I am confident you can forget chunking memory streams using .NET remoting - you would have to write an equivalent infrastructure.
Your sample code demonstrates how to serialize an object, but where exactly does remoting come into this?
Again, I don't know NHibernate, so it's not clear to me whether the GetBookmark method is on a client machine or a server machine (to keep terminology simple). Is BookmarkDAO allowed on the client machine? I doubt it from what you've said previously (though I'm not certain), since a client shouldn't need to access the database, but that would mean that your CommandInvoker object would have to be remoted back to the client machine. Alternatively, if this really is client-side code and .BookmarkContent returns the contents of a (decompressed?) MemoryStream, then you would have to repeat this for every Type that could be remoted?
"If your objects are large, don't serialize them..."
Forget 50Megabyte objects - I don't have them, you don't have them, they only exist in this thread.
The point should be "large object graphs", not "large objects". Given an object, you do not know how large its object graph will be. Since you don't know how large the object graph is, you cannot know how much space it will take up when serialized. Even if you have the source code to a particular class, you still can't know (unless it has no object references) how much space it will take to serialize it.
I may have a Ship class (I work for BP Shipping) and I request a single ship entity and it may serialize to a couple of hundred bytes. I may ask for the same ship plus all of its Voyages as aggregates for the last 12 months. I may be getting say 27 objects now - the original Ship entity plus 25 Voyage entities contained in one VoyageCollection class. I then ask for the same ship plus its voyages plus every Port Call - we could be talking several hundred objects now, all linked together with a single Ship object at the root but lots of collections/back references etc.
I now ask for all ships owned by BP over the last 5 years plus all their voyages, the voyage's port calls, all the port calls' operations, the port names, the daily noon-day readings, cargo pick ups and drops offs plus details of all invoices made on those voyages with order-level details and full contact details for any agent/inspector involved in any of the transactions.
OK, we're getting silly now but this *is* possible - and how many methods/stubs/proxies did I have to write to do this? None. They are all LLBLGenPro entities/collections and so share the same FastSerialization code, so all I needed to do was turn on Fast Serialization. (Without the switch, .NET crashes with an OutOfMemoryException.)
"..Move to a MemoryStream, compress it, serialize it in chunks, when the receiver has it all uncompress it and deserialize it. SIMPLE!"
But to put it into a MemoryStream you use a BinaryFormatter which is what Remoting does already!!! We are going around in circles!!!
You omit to say exactly how the serialized chunks get from one machine to another. Are you suggesting writing your own transmission framework along the lines of HttpRequest/HttpResponse? Or are you planning to use .Net remoting to move your compressed MemoryStream in which case you have just duplicated the remoting process for no reason???
"Well no, nothing like this is actually *required* - the only code you *need* to write is in the GetObject() method and the deserialization constructor to meet the ISerializable interface.
Of course I know this, but you would not gain the purpose of your article."
Yes you would! That is exactly the point of the article!
It is also the logical place to put this code - it has code directly pertinent to the class at hand; it has full access to all private data; it is the only way (without surrogates) that .Net serialization will let you initialize a deserialized object.
With regard to the DataSet/DataTable stuff:
I was describing an inhouse ORM we have which sounds similar to NHibernate but uses DataTables/DataRows as the backing store. It does just about everything you mention except one thing: entity inheritance. LLBLGenPro supports all forms of this and that is one of the reasons I recommended it. (Incidentally, NHibernate was suggested later; I was only comparing to make sure NHibernate did not have significant advantages over what was already more or less decided.)
We don't access the DataSet/DataTable via app code - only the Entity and EntityCollection are used. We have a service interface (in a separate assembly, naturally) that contains the DataTable-to-EntityCollection code - not that there is much: the EntityCollection just accepts a DataTable in the constructor and creates enough Entity instances to wrap each DataRow. I omitted to tell you that prior to FastSerialization being incorporated we had other similar code in a surrogate which basically did the optimization/bug workaround from the articles you mentioned previously - this was enough to give performance comparable to NHibernate.
This gave adequate performance but only one line was need to switch to the new code. Why have cotton when you can have silk at the same price?
I should leave the nHibernate/LLBLGenPro/DataSet wrapping ORM religious debate alone but.... what the heck...
You mention writing a 30 line XML file which works across all major databases with no code changes or code gen.
- no code changes or code gen but then you mention an option to create stub classes for DomainObjects. I get the bit about no Data Access Code but you are still coding by hand??
- Hmm. Suppose the XML doesn't quite match the database schema - what happens then?
- You want a cool UI tool to create your relationships? LLBLGenPro has one - it even lets you define virtual relationships in the GUI if you feel so inclined. No XML, no accidental database/schema mismatch. Just press the button and you will get two projects created ready for compilation - one containing data access code and one containing your entities, so you can keep data access off the client machine altogether. Want some POCO classes? Just add a template and they will be created for you.
(Incidentally, I think Frans Bouma is maybe the only man on the planet who can write as much as your good self.)
- Data Access Code Gen is Evil - I agree on this one - LLBLGenPro creates a parameterized SQL query specific to the required operation not code gen'ed at all.
Okey dokey, your turn......
Cheers
Simon
|
|
|
|
|
Simon,
As you said, this is just a utility class (a very useful+well written one!) that can be used in many scenarios.
Thank you very much for publishing it and please let me know if you have any updates to the code
Kemal
|
|
|
|
|
I read all the comments following Damon Carr's post. The conversations between people from different backgrounds are very interesting.
After all, I always think that old technology is not necessarily bad technology, and new technology is not necessarily a more powerful silver bullet. I do agree that the "standard compression" suggested by Damon might be more appropriate, depending of course on more unspoken "factors".
It is always risky to put a lot of programming effort on top of an architecture not intended for the use case. This could be considered hacking the architecture, comparable to over-clocking the CPU.
I don't deny that sometimes we need to hack to make things work (for a deadline, or for lack of knowledge of alternative architectures), but we need to keep in mind that we are hacking.
For articles published on CodeProject, I wish the authors who publish their work, in particular programming work, would talk about more "factors", such as use cases, limitations, and how to choose or balance between alternative solutions.
Zijian
|
|
|
|
|
Hi Zijian
The issue with using "standard compression" for remoting (I'm assuming you mean by creating custom sinks) is that it can only be applied *after* the object graph has been serialized.
It will therefore help to reduce the amount of information that is *transmitted* across the wire, but it won't help with memory usage/fragmentation prior to transmission and it won't help with the speed of the serialization process - in fact it will add an overhead both timewise and memorywise. It also won't help if the serialization process throws an OutOfMemoryException.
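The ordering is worth spelling out in code. In the sketch below (GZipStream standing in for whatever compression a custom sink might use, and the 2MB buffer standing in for a serialized object graph), the fully serialized buffer must already exist in memory before compression can even begin - so peak memory and serialization time are untouched; only the bytes on the wire shrink:

```csharp
using System.IO;
using System.IO.Compression;

// Stand-in for the sink-owned buffer holding the WHOLE serialized object graph;
// by the time a custom sink sees it, all serialization cost has been paid.
byte[] serialized = new byte[2 * 1024 * 1024];

byte[] compressed;
using (var output = new MemoryStream())
{
    using (var gzip = new GZipStream(output, CompressionMode.Compress))
        gzip.Write(serialized, 0, serialized.Length);
    compressed = output.ToArray();
}
// 'compressed' is smaller on the wire, but 'serialized' had to exist in full
// first - compression reduces transmission cost, not serialization cost.
```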
I don't believe it is a hack at all. Microsoft supplies the ISerializable interface specifically to allow the developer to store one or more data items each with a string tag.
All this code is doing is storing a single data item with a string tag (a byte[]) which contains a binary representation of all the data items combined.
Therefore it is still well within the contract defined by ISerializable and therefore works transparently with respect to the remoting process. I don't agree that it is a new technology from the remoting point of view - that stays *exactly* the same - the only difference is the data supplied to that technology.
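Roughly, the pattern looks like this (a sketch only - class and field names are illustrative, not the article's actual code): all fields are hand-packed into one byte[] and stored under a single SerializationInfo tag, staying within the normal ISerializable contract so remoting works unchanged.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

[Serializable]
public class FastOrder : ISerializable
{
    private int id;
    private string name;

    public FastOrder(int id, string name)
    {
        this.id = id;
        this.name = name;
    }

    // Serialization: pack everything into a single byte[] under one tag.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        using (MemoryStream ms = new MemoryStream())
        using (BinaryWriter w = new BinaryWriter(ms))
        {
            w.Write(id);
            w.Write(name);
            info.AddValue("_", ms.ToArray()); // one string tag, one blob
        }
    }

    // Deserialization constructor: unpack in exactly the same order.
    protected FastOrder(SerializationInfo info, StreamingContext context)
    {
        byte[] data = (byte[])info.GetValue("_", typeof(byte[]));
        using (BinaryReader r = new BinaryReader(new MemoryStream(data)))
        {
            id = r.ReadInt32();
            name = r.ReadString();
        }
    }
}
```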
Cheers
Simon
|
|
|
|
|
Hi Simmo
Well, read all the comments, and I just thought I'd drop you a note to say Thank You for the excellent and (to me) relevant code. I work in the Financial Industry, and am specifically involved a lot with Trading Systems and Internet Trading Systems for Stock Exchanges and Futures Exchanges. Every byte shaved off the data, not to mention improvements in serialization speed, is extremely important. With your serialization technique I was able to shave 40% off the size of our quote messages (90% of messages sent). While this might not appear to make a major difference when the packets are only +- 180 bytes in size, to us, living in a country where bandwidth is still an extremely expensive resource, this is major. This provides me with extremely happy clients as they do not need to upgrade expensive bandwidth.
So, my 2 cents for the current thread is that, indeed, for a lot of scenarios this is extremely relevant code (and no, we are not using DataTables). If you take a look at the FIX protocol, even they are using 7-bit encoding for their new FAST protocol.
Drop me a mail and I will send you a nice bottle of wine as a thank you.
Cheers
Andre
|
|
|
|
|
Hi Andre
Thanks for your comments and thanks for the offer of wine, very much appreciated but there really is no need - just happy that someone is making use of the code.
Cheers
Simon
|
|
|
|
|
To jump in here...
1. It's difficult to criticize a design without knowing the history that led up to the design decision.
2. The real point of this article and my article that the author referenced is that Microsoft gives you the illusion that you just slap a BinaryFormatter onto a DataSet or DataTable and you're done. If you want to criticize anybody for design issues, I'd criticize Microsoft for creating a framework that makes it so easy to do serialization the wrong way and for the wrong reasons.
1) Cost of Incremental Development and Maintenance. Remember the majority of costs for a system are in the maintenance AFTER release, not development. Most systems die a slow and painful death due to entropy and code like you advocate here which nobody can figure out (when the rookies come in to maintain the thing).
This is a most dangerous statement. It certainly reflects reality. I was essentially booted off a project because my code needed to be "dumbed down" for exactly the reasons you state--the rookies couldn't figure it out. But it's a dangerous statement because it implies that complicated, difficult to maintain code is not acceptable. Sometimes complicated code, or more precisely, the complicated issues that require the complicated code, are unavoidable.
In fact, what really surprised me was that most people seem to view layers of abstraction, declarative programming, and other techniques, which I personally view as promoting maintainability, as making the code *more* difficult to maintain! It makes no sense to me.
DamonCarr wrote: I am an Agile process leader
Then you of all people should appreciate that the real issue is not the complexity of the code but the quality of the infrastructure supporting the code--the documentation, change logs, ATP's, unit tests, and so forth. Given a quality infrastructure, yes, a rookie should be able to maintain code to any level of complexity.
Yet in my experience, that's the area ignored in the cost of development. If the development cost accurately reflected the cost of documentation, testing, test plans, procedures, tools, etc., then the maintenance cost wouldn't be burdened with the costs of an incomplete development budget.
Marc
Thyme In The Country | Interacx
People are just notoriously impossible. --DavidCrow
There's NO excuse for not commenting your code. -- John Simmons / outlaw programmer
People who say that they will refactor their code later to make it "good" don't understand refactoring, nor the art and craft of programming. -- Josh Smith
|
|
|
|
|
Marc,
I love your points. Almost all I agree with.
To jump in here...
1. It's difficult to criticize a design without knowing the history that led up to the design decision.
Agreed. That is why I repeatedly asked for more information on what led to the design. Nice point, but I tried my best not to do what you have described.
2. The real point of this article and my article that the author referenced is that Microsoft gives you the illusion that you just slap a BinaryFormatter onto a DataSet or DataTable and you're done. If you want to criticize anybody for design issues, I'd criticize Microsoft for creating a framework that makes it so easy to do serialization wrong way and for the wrong reasons.
Well... I would say it is a poor design (again, depending on (1) above) to serialize those types... But that is just me... Microsoft HAD to make this work. They are large types because they are amazingly powerful.
Most people doing read-only should get a DataReader and move info into Domain Classes stored in a Generic collection or other serializable collections (Like HashSet from IESI (SP?)- another plug for NHibernate) (with the domain types defined by Interface or AbstractBase).
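As a rough sketch of what I mean (table, class and column names are purely illustrative), stream rows from a DataReader into small, serializable domain objects held in a generic List<T>, instead of shipping a whole DataSet around:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// A lightweight, serializable domain class - cheap to serialize compared
// to a DataSet carrying schema and change-tracking baggage.
[System.Serializable]
public class Quote
{
    public string Symbol;
    public decimal Price;
}

public static class QuoteRepository
{
    public static List<Quote> LoadQuotes(string connectionString)
    {
        List<Quote> quotes = new List<Quote>();
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
                   "SELECT Symbol, Price FROM Quotes", conn))
        {
            conn.Open();
            // Forward-only reader: no in-memory DataTable is ever built.
            using (IDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Quote q = new Quote();
                    q.Symbol = reader.GetString(0);
                    q.Price = reader.GetDecimal(1);
                    quotes.Add(q);
                }
            }
        }
        return quotes;
    }
}
```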
1) Cost of Incremental Development and Maintenance. Remember the majority of costs for a system are in the maintenance AFTER release, not development. Most systems die a slow and painful death due to entropy and code like you advocate here which nobody can figure out (when the rookies come in to maintain the thing).
This is a most dangerous statement.
Like Darwin's? That was called dangerous. It still doesn't make it false or irrelevant (not that I mean to imply you are saying that).
It certainly reflects reality. I was essentially booted off a project because my code needed to be "dumbed down" for exactly the reasons you state--the rookies couldn't figure it out.
I have almost left this industry for this reason. I fundamentally do not believe in 'entry level' developers except for the most irrelevant projects. Give me 4 superstars and I will blow away 50 mediocre people any day (depending on (1) above - ha ha).
But it's a dangerous statement because it implies that complicated, difficult to maintain code is not acceptable. Sometimes complicated code, or more precisely, the complicated issues that require the complicated code, are unavoidable.
Kent Beck says he is often criticized for this. He says: NO! Don't make the code simpler than it needs to be; make it AS SIMPLE AS IT MUST BE TO MEET CURRENT AND FUTURE CHANGES! (OK, I added that last bit, but it is the same idea.)
If someone does not know Design Patterns and other techniques (plug-in architectures using reflection, etc.) then they should:
1) Learn - Buy 'Head First Design Patterns' and read it for god's sake! Buy 'Refactoring' and 'Refactoring to Patterns' (Fowler/Kerievsky)! I have a full reading list on Amazon for .NET people. Many others do as well. The best developers are up to 28x better than the worst. Where are you, reader? I only hire those top people; I make a massive return and they get paid hundreds of thousands a year. You want to make $40,000 a year? Want to go home, have a beer and watch American Idol? See 2. I will profit from your lack of interest.
2) Find a new profession
3) Work for the Government (not NASA please or any other area where I could die)
In fact, what really surprised me was that most people seem to view layers of abstraction, declarative programming, and other techniques, which I personally view as promoting maintenance, they view them as making the code more difficult to maintain!
See above. Abstract or go home. SERIOUSLY! We need a zero tolerance policy for this crap! Why will it never happen? The people hiring them are even stupider.
It makes no sense to me.
Nor me... That is why I come in and charge insane amounts to usually be told 'we cannot change like that. It is too hard'.
Fine by me. I still get paid and you still lose another $10,000,000 a year in wasted dev projects. Oh, and I know to short your stock.
DamonCarr wrote:
I am an Agile process leader
Then you of all people should appreciate that the real issue is not the complexity of the code but the quality of the infrastructure supporting the code
OK, here we diverge a little. It's not the complexity, but the NECESSARY complexity. No more. In other terms, as simple as possible but no simpler (as I believe I said before). This article presented (see (1) for a caveat) what some might see as a license to violate this.
THIS IS HOW THIS ALL STARTED! Now it is a full-on discussion and I love it.
--the documentation, change logs, ATP's, unit tests, and so forth.
See my article on this in 'Agile Development'. I am with you here (although it depends on what you mean by documentation)..... For me I need:
1) Test Driven Development
2) Continuous Integration
3) Nightly full-Automated System Regression Testing (like AutomatedQA)
4) Daily Stand-Up
5) Short (1-2 week) iterations
and here is where I start to break away:
6) An obsession with Design Patterns as the iterations evolve. We call this 'Pattern Hunting'
7) We reverse engineer the code into UML Diagrams at the start of an Iteration, look for patterns, play with ideas, and start TDD, eventually throwing away all UML diagrams. They are just a base of reference and an 'AH HA! Here is a place where a Command pattern would help!'.
TDD is the 'architect' of a system for lack of a better term, not large up-front UML (and I know you never said that).
8) MASSIVE customer involvement
9) 'Iteration Planning' and 'Index Card' style thinking
10) LOTS of whiteboards everywhere
11) I know I am forgetting at least 1 thing......
Given a quality infrastructure, yes, a rookie should be able to maintain code to any level of complexity.
Absolutely, positively not true. Here is my main disagreement with you.
We cannot compromise our profession to (as people would say here) the 'lowest common denominator', which is an idiot in most cases. Do this and all is lost.
Instead? REQUIRE ALL DEVELOPERS TO 'RAISE THE WATER LEVEL'. How? I certainly do not know. I know Brainbench has a Design Patterns certification, which is probably the single most important thing I can think of for a developer to master.
Often my first interview question: Name 3 Design patterns and how you used them to make your software more flexible to change. Fail on that? Interview is over.
90% fail.
Second question:
What are the two main kinds of data types in .NET and how are they managed differently in terms of data structure (reference types on the heap, value types on the stack)? Another 6% fail there. That leaves 4%. I am lucky to hire 1 (if 100 were starting). Bonus points for the large object heap (almost nobody mentions it).
So we have a failure of: 1) OO principles and practices that are over a decade old and 2) platform knowledge. And I have like 18 more questions!
No wonder Microsoft and Google get the best people!
Assume we receive a new requirement which fits perfectly into the existing Decorator we have set up.
The 'rookie' has never even heard of the Decorator. So he will f**k things up. Again, Michael Feathers wrote the absolute definitive book (Working Effectively with Legacy Code) on this (although he is far more diplomatic than I).
Hell, most NEW projects are starting out by writing legacy code!
Yet in my experience, that's the area ignored in the cost of development.
Yes... Managers need to hire FAR better people and FAR fewer people. But how? They need to:
1) Be amazing themselves or
2) Have help from someone amazing in the hiring process
3) Read and re-read the book 'Facts and Fallacies of Software Engineering' by Robert Glass.
If the development cost accurately reflected the cost of documentation, testing, test plans, procedures, tools, etc., then the maintenance cost wouldn't be burdened with the costs of an incomplete development budget.
They may never. Can you imagine going to get money approval for a $10,000,000 project (3 years) and saying 'well it is now $100,000,000 over 15'?
It's like political figures with a constituency, CEOs, etc. Deliver short term and get as much (which is not much) strategy into your legacy. That is why CTOs are the least likely to become CEO and the most likely to be fired.
Thanks,
Damon Carr
|
|
|
|
|
What I've done:
I read the article and the ensuing discussion in full... I did not download the code and play with it, but I did read over it enough to understand what it's doing.
What I still don't understand:
Why was this developed? What is the problem this is solving? Optimization, in general, is done to create performance gains in a situation where performance is lacking, or is the major failing point of an existing stable system.
What is the system? Could someone please describe, in simple terms, the system in which this code is playing a part? Could someone also describe how the standard serialization techniques would be applied, and then compare and contrast it with this solution?
Could someone describe how the standard serialization scheme created a performance problem which was best solved by this optimization?
I apologize if I missed where this was already clearly explained.
The reason I ask, is that this seems like an interesting topic, and the discussion has some obviously experienced and capable people involved, and if I was able to put it in some kind of concrete context, I think I would get a LOT more out of the discussion as a whole.
As a side note, this might also be very useful to me in the near future, as we're currently working on a dataserver architecture for a distributed computing system which will involve extremely large object counts as well as large individual object sizes in some cases, being shuffled around between various systems depending what work needs to be done on the object at the time... A fast and stable serialization and transport system would benefit this project greatly, and the existing systems have already proven somewhat insufficient for our needs.
Thanks,
Troy
|
|
|
|
|
Troy,
My stack pointer is at an invalid memory location for my Win32 process and about to try to jump to pop the stack and execute the pointer for a routine that is supposed to be there..... - GRIN.
I hope on the behalf of the small group you can see that EVERYONE is right from the correct perspective. Nietzsche helped us understand that there is no one universal right and wrong vantage point. It is dependent on the observer and their own mental filters and the specifics of their concerns. Listen to the reports of a crime from 10 suspects and you will see just how important and powerful this is.
We can say things like 'rape is wrong and we need laws to prosecute those who commit it' and who would argue? Not me….
However, let’s take the moral and legal points out for a moment and ask ‘Why do men rape’? Actually, people already have. Some very smart ‘Evolutionary Psychologists’ (a field I am deeply interested in as a way to gain a better understanding of ‘human nature’) wrote a book that just said MAYBE this was an evolutionary trait that evolved as a mating option for those with no other. Does that make it MORALLY right? Hell no.
Does it help us deal with this profound problem? Absolutely it does, as it would lead to a better understanding of rape prevention and treatment for those who perform these vile acts.
But you might know what happened to them. They were attacked viciously (as I was at least once here).
Our society says to scientists:
1) You Can work over here
2) But don’t even THINK about going there (as in ‘The Bell Curve’, which tried to attribute certain traits (intelligence being just one of many) to ethnic groups). They were also attacked viciously as it went against one of the largest false statements ever made: ‘All men are created equal’.
Yeah? Then what about the twin studies? What about children from highly intelligent parents who are adopted? Yet statistically they are FAR more likely to carry on that intelligence. I could go on and on….
JUST FOR RAISING THE INTELLECTUAL ARGUMENT they (and hundreds, even thousands of scientists who choose to study 'unfavorable' subject matter) are prosecuted, jailed, exiled, even killed for their work. Am I likening myself in some grandiose way to a scientist? Well, I am a scientist. So are all of you. I just don’t have a PhD (grin). Seriously, we all must not allow our own baggage to keep us from an objective view of new ideas.
These scientists were destroyed in the media and, as far as I know, lost their careers or had them significantly damaged (anyone? I am not sure…). Did they deserve it for just doing what they are supposed to do? Hell no… NOTHING is beyond deep examination at any time, especially that which you hold most sacred and truthful.
My point? They were just scientists investigating a hypothesis and were doing exactly what they should; screw the 'taboo' nature of the subject. Religion teaches us not to lie, yet it practically discourages scientific discovery of the truth. So religion and science are aligned in one way, yet science often proves religion to be incorrect in 'faith'-based beliefs.
My god the Pope even now says Darwin had it right, however the instant we became conscious, it was a divine act. Hmm.. Interesting that of the millions of species that evolve in exactly the same way we did, we are the only ones.. But I have nothing against religion (OK I do but it is not relevant here).
How did we now get to religion? Just stay with me…
Religion demands we not lie and seek truth. You may not immediately see my larger point here. Both institutions (Science and Religion) are correct and form a kind of balance. People need faith based ‘supernatural’ beliefs as humans. We now know that. However science is destroying these ‘supernatural’ religious beliefs one by one (for all religions).
Science proved the ‘Shroud of Turin’ to be a fake. Yet it is still proudly displayed and worshipped because WE NEED FAITH.
We even have the big bang figured out to something like everything after 10 to the -100000 seconds. It is predicted we will be able to create ‘carbon-based life’ in the same form as the first life on Earth in labs within 10-25 years (or less), and the ‘singularity’, where computers exceed human intelligence, is likely in the next 20-50 years.
Both have a place in society, just as all of the opinions here are valid.
So I may appear to be asking others to agree with me in my posts (and in a way I would be lying to myself and my ego if I said it wasn't a nice thought), but what I am REALLY doing instead is ASKING THEM to try to see my 'observer's' perspective and to try to view this problem from an opposing (but no less valid) viewpoint.
I think if you were to say 'He was right here and he was wrong here' it would be a negative and possibly destructive way to move forward. There is no way one person can say this as a kind of ‘universal truth’.
All I can even say is 'based on the work of others and based on my experiences, this APPEARS to have been a bad implementation'. And even then ALL I AM ASKING FUTURE AUTHORS TO DO is to first write a disclaimer as such in similar situations:
1) "The techniques provided in this article are not generally recommended as a 'first line of attack'. Instead, this is a solution for when you are not faced with any other alternatives and must get yourself out of a jam you likely didn’t create. To START a design using this work would be a mistake, as optimization should be left to the end of an iteration. Optimize last, never before there is a very good business-driven reason, and be sure it is not caused by a flawed design/architecture first if you have the luxury to revisit it. Also remember solutions in software are not ‘waiting to be found’, as Michelangelo said of the people ‘trapped’ in the huge slabs of marble he simply ‘set free’. There is no one solution; there are hundreds, all with Greek-God-like pros and cons. Your job is to help find the best balance."
How do you do this? The only way in my 16-17 year career, and decades of obsessive reading, is through an iterative process. In other words, you cannot know what you want before you start so don’t even try. Software is what is known as a ’Wicked Problem’.
I could go off here on another tangent. Just please (unless you already know the idea) educate yourself on what is the central principle in software development which has made our industry a kind of joke. We have insane losses and a miserable track record and it is not improving.
http://en.wikipedia.org/wiki/Wicked_problems
All points (from what I have heard, especially the gentleman who likened me to a movie's idiotic middle manager) have merit.
He was just expressing his ‘perspective’ that I was full of it (and perhaps I am).
Hell, I even emailed that person directly to see if I could learn more from them. LEARN! I thought he could teach me something about myself and help me improve how I communicate.
I would not have been angry, only interested in their life experiences that would make them read my writings and think what they concluded. THAT interests me, not any absolute 'right and wrong'.
Others have already benefited from this article, so from a utilitarian perspective, perhaps THAT moral framework calls it a success. In my framework, it is far more complex (and I could argue even the Utilitarian model fails here). I would look at all of the POTENTIALLY negative forces the article would move people towards.
What I've done:
I read the article and the ensuing discussion in full... I did not download the code and play with it, but I did read over it enough to understand what it's doing.
That is excellent. But I would only humbly ask you not to attempt to provide 'judgment' or 'right and wrong' here. It does not exist. We are all right and wrong from our perspectives. By sharing mine, I hope some people could see the article in a different light, one that I am paid to represent and one that I would never change as it is the largest good I can do (far more good than any development role in the big picture on almost all occasions).
Most superstar developers are instinctively against my position here. 10 years ago I would probably have flamed me as well. There is a yin and yang here (sorry, this is sounding more like a Dalai Lama speech than a post, I am starting to realize)…..
I represent the opposing force, where I must consider the 3-10 year picture in which the 'superstar' developer(s) will probably be long gone. I represent the client's interests and am paid many orders of magnitude above what a developer is. Why? Because the ROI I provide is many orders of magnitude higher. Am I some tyrant who kills all creativity and advanced code? HELL NO! Code must be as complicated as it must be, no more! And that in my experience has been pretty damn complex many times.
However, as a coder (just as much as I was at 26 – I am now 36) and as a previous 'young superstar', I can instinctively feel the pull of the other side – OK, now this sounds like Star Wars (grin)….
The other main opposing force is that of the developer who:
1) Optimizes code before any indication it needs optimizing, as it is challenging and shows his/her peers their status
2) From their perspective this is almost always the 'right thing to do', even though it may create waste and no tangible benefit (NOT what I am saying about the article. That would have to be looked at on a case by case basis).
What I still don't understand:
Why was this developed?
This was clear: The writer had a requirement to serialize very large data (and did not have any ability to re-architect the solution). For this, it appears to have been the ‘least worst option’.
I wrote my comments as I am very alarmed at the undeniable dynamic of the ‘guru developer culture' of 'screw the client', and just coding the coolest (most complex) work possible.
What is the problem this is solving?
Overall time to transfer via Single Machine-Cross AppDomain or cross-machine serialization of very large objects is significantly reduced (like storing that dataset in the SQL Server session). A bad design, but we all must live with them.
Optimization, in general, is done to create performance gains in a situation where performance is lacking,
I would agree but add: optimization almost always is (and should almost always be) done ONLY after a performance problem has been shown to be a problem. The anticipation of areas to optimize and the work done BEFORE they occur are almost always waste. Why? We have many studies that show we are usually wrong (not always). This is a foundation of Agile. Don't optimize until you must, or if you are sure it will be a problem, do it last. Make sure to get it working first, then add small levels of complexity and unit tests (you started with one, remember – TDD) as you refactor your way to NECESSARY optimization.
Also, "code without Unit Tests - think NUnit or equivalent - is Legacy code". Why?
You cannot verify the stability of your code base when any change is made (I am of course speaking of non-trivial apps here), and you almost certainly do not have time to pay a QA individual to manually regression-test the entire system every day (and that is only SYSTEM regression; you still need unit regression).
AGAIN: I highly recommend you all read Michael Feather's book "Working Effectively with Legacy Code". It should be called:
"Being an Excellent Developer: Both on Legacy projects and New Ones - or How not to Code a New Project as a Legacy One"
To most of the readers here, I probably do represent 'the man'. However, I can code at the level of 99% of the people here, I would guess (in C# or Java across all domains, especially large distributed object systems like the one this article is trying to help).
or is the major failing point to a existing stable system.
I believe the author said this was an existing problem caused by an unfortunate prior architecture that fundamentally eroded by designing things this way (again, just my 'perspective').
What is the system? Could someone please describe, in simple terms, the system in which this code is playing a part? Could someone also describe how the standard serialization techniques would be applied, and then compare and contrast it with this solution?
Author?
Could someone describe how the standard serialization scheme created a performance problem which was best solved by this optimization?
Easy... It was far too bloated to support the flawed architecture in place (again my perspective) so a kind of 'hack' was required to get around the architectural ignorance that was present I believe before the author became involved. I could be wrong. It would appear however the Author did a damn fine job in the nasty place he found himself in.
I apologize if I missed where this was already clearly explained.
The reason I ask, is that this seems like an interesting topic, and the discussion has some obviously experienced and capable people involved, and if I was able to put it in some kind of concrete context, I think I would get a LOT more out of the discussion as a whole.
I agree. But your 'context' is unlike anyone else's. Your experiences, biases, concerns, etc. make the questions you asked wise ones in my opinion.
What are my thoughts to you? If you have the luxury of architecting this correctly, serializing large single transaction style objects over the wire is a recipe for deep disaster in most cases (and not even my opinion really).
Ask yourself:
1) How are recoveries performed and security enforced?
2) Are these required to be 'guaranteed' in any way if the destination server is down (or overloaded because you are slamming it so hard)?
3) Are there multiple units of work that need to be atomic?
4) Do not rely on sending large data over a network unless you do so in a guaranteed way and usually in a batch style mode, not transactional
5) If you ARE doing ATOMIC work, learn from TCP/IP and other protocols and how they deal with this problem.
6) Design your domain objects with MANY small classes each with very specific and singular responsibilities
7) With the use of Generics, there is little argument among the top .NET gurus (not me) now (as there was before) that sending around DataSets/Types is a legacy concept.
It is just too easy to use Generic Collections, which better represent your domain (where almost all development should be focused anyway) and allow you to easily get around the large DataSet scenario that so many people try to force. It's one of the most common 'short-term' consulting engagements I get asked to fix:
“A web app has moved session state from 'In-Proc' to SQL Server and now, all of a sudden (with a new farm of web servers), the app takes 10 seconds to load a page instead of 1, when only 1 server existed before.”
Why? Every user is storing a 100,000-row DataSet in their session, which must now serialize to SQL Server. When it was in-memory all was OK (a bad design from my perspective, but to the business there was no visible problem). This brings home the point that FUTURE CHANGES and entropy kill systems, not guns (grin).
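A hypothetical fix for that scenario (class and property names are illustrative only): rather than parking a 100,000-row DataSet in SQL Server session state, store only the tiny, serializable state needed to rebuild the view, and re-query one page of rows per request.

```csharp
using System;

// A few dozen bytes of session state instead of megabytes: just enough
// to re-create the grid the user was looking at.
[Serializable]
public class GridState
{
    public string Filter;
    public int PageIndex;
    public int PageSize = 50;
}

// In the page:
//   Session["grid"] = state;   // cheap to serialize to SQL Server
// On each request, fetch only rows starting at (PageIndex * PageSize),
// PageSize at a time, from the database - never the full result set.
```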
As a side note, this might also be very useful to me in the near future, as we're currently working on a dataserver architecture for a distributed computing system which will involve extremely large object counts as well as large individual object sizes in some cases, being shuffled around between various systems depending what work needs to be done on the object at the time...
Well, this demands the highest levels of .NET expertise and architectural expertise IN GENERAL across all the .NET Distributed Object technologies. I would be happy to offer ideas as I have done on many systems like this (most for global Financial Services firms in New York and London). I am sure others here could help as well.
Remember: There is basically never a 'best' solution for a scenario but there are almost always MANY bad solutions.
A fast and stable serialization and transport system would benefit this project greatly, and the existing systems have already proven somewhat insufficient for our needs.
No, you already possess a 'fast and stable serialization and transport system' in .NET.
This amazing system is called Remoting and represents millions and millions of dollars of investment. What you DO NOT seem to have is an architecture that will use this 'fast and stable serialization and transport system' in the best way for your needs.
Thanks,
Damon Carr
|
|
|
|
|
Hi Damon,
Thank you for your lengthy response.
Regarding our current data server design, we've been going with a standard setup of .NET Remoting for transport and Serialization for persistence. We have already done a lot of work on that. My main goal was not to think about changing our existing design to implement the custom serialization done here, but rather to gain some perspective on our design by hearing, in detail, why the original poster chose to implement this design.
I agree that there as many ways to skin a cat as there are cats with skin, but as everyone knows, you generally start with a knife and a living cat that you have to first chase down.
On my way walking to the light-rail a couple days ago, I saw something that reminded me (for some reason) of this discussion. On a smallish residential street near the city, I saw a crow sort of dancing around under a walnut tree. The crow was playing with a fallen walnut. He picked it up in his beak, then flew to the top of a light pole. He waited there a moment, then dropped the nut on the ground. He immediately flew after it, and tapped it around a little while on the ground, before picking it up again, and flying to the top of the pole once more. He dropped the walnut again, and repeated the whole process about two or three times, before finally, the walnut had broken open, and he pecked away at the soft nut meats inside.
As I said, I don't know what the relevance is, but I thought I'd share.
Talk to you soon,
Troy
|
|
|
|
|
Hi Troy
The reasons why I need some optimization are mainly detailed in the article (with some additional detail in these comments) - standard .Net serialization took too much time and memory space and could crash with an out of memory exception under certain circumstances.
Let me start by briefly describing how Serialization works (as I understand it)
Any class that will be involved in serialization/remoting *must* have the [Serializable] attribute applied to it - if the .Net serializer encounters a class anywhere in the object graph that doesn't, it will throw an exception.
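To illustrate the rule above, here is a minimal sketch (class names are hypothetical) where serializing an `Order` fails because the `Customer` it references is not marked:

```csharp
using System;

[Serializable]
public class Order
{
    public int Id;
    public Customer Customer; // every class in the graph must also be [Serializable]
}

// Not marked [Serializable] - attempting to serialize an Order that
// references a Customer will throw a SerializationException.
public class Customer
{
    public string Name;
}
```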
During serialization, .Net will examine all of the private fields within a class and attempt to store them in a binary stream. If a field is a value type then it is stored directly; if it is a reference type then the object is added to the object graph and a reference to it is stored - this is done so that a given object will only be stored once.
The examination of objects is done via reflection so that .Net serialization can cope with, in theory, any type of object without any prior knowledge of it. Now we know that reflection is relatively slow in comparison to direct field access but for most serialization/remoting work the performance is acceptable and you don't need to write any code to make it work.
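As a sketch of this standard, reflection-based mechanism (the class name is hypothetical), an in-process round trip to a MemoryStream looks like this:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class Point3D
{
    public double X, Y, Z;
}

public static class Demo
{
    public static void Main()
    {
        Point3D original = new Point3D { X = 1, Y = 2, Z = 3 };
        BinaryFormatter formatter = new BinaryFormatter();

        // Serialize to a MemoryStream - purely in-process, no network involved.
        using (MemoryStream stream = new MemoryStream())
        {
            formatter.Serialize(stream, original);
            stream.Position = 0;
            Point3D copy = (Point3D)formatter.Deserialize(stream);
            Console.WriteLine(copy.Z); // 3
        }
    }
}
```

Note that no per-class serialization code was needed - reflection does all the work, at the cost of speed and stream size.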
There are some options to 'help' the .Net serializer during its object examination. You can apply a [NonSerialized] attribute to any field to indicate it should be ignored, for example, but that's about all without writing some code.
The next level of optimization is to implement the ISerializable interface on your class, for which you will need to implement two methods (well, one method and a special constructor). The GetObjectData method allows you to take over the process and store any data you deem necessary to reconstruct your object in the SerializationInfo object passed into the method. It is like a dictionary in that you tag each of your data items with a string. In your deserialization constructor, you do the reverse and extract your data via its string name. One thing you need to be careful of is that the objects you retrieve are not necessarily populated at this point - so don't try to use them - just store them in your fields and they will be populated once serialization is complete.
There is also the optional IDeserializationCallback interface which gives you an opportunity to be 'notified' when deserialization of the entire object graph has been completed, i.e. your objects are now fully populated and usable.
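Putting the two interfaces together, a sketch (class names hypothetical) might look like this:

```csharp
using System;
using System.Runtime.Serialization;

[Serializable]
public class Employee : ISerializable, IDeserializationCallback
{
    private string name;
    private Department department; // may not be populated until the callback

    public Employee(string name, Department department)
    {
        this.name = name;
        this.department = department;
    }

    // Called during serialization: store each item under a string name.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("name", name);
        info.AddValue("dept", department);
    }

    // Special deserialization constructor: extract by the same names.
    protected Employee(SerializationInfo info, StreamingContext context)
    {
        name = info.GetString("name");
        // Don't use 'department' yet - it may not be fully deserialized.
        department = (Department)info.GetValue("dept", typeof(Department));
    }

    // Called once the whole object graph is populated.
    public void OnDeserialization(object sender)
    {
        // 'department' is now safe to use.
    }
}

[Serializable]
public class Department
{
    public string Name;
}
```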
This level allows more control (and possibly some increase in speed since reflection is not used) but involves some manual work on the developer's part. You won't want to do this on all your objects, only those frequently used or that can result in large object graphs.
It still has some drawbacks however. Suppose your object has an object[10] as part of its data. You can either store this as a single object[] or as 10 object items. The latter is actually quicker, but involves writing a loop, and you will have 10 string names all of which take up extra space.
My code is a utility (not a design!) to allow further optimization as a replacement for some or all of the code you would write in the previous ISerializable optimization. It allows you to store, extremely quickly and compactly, all of your class's 'owned' data in a single byte[]. Instead of identifying which data items you want to store and giving them names, you Write each one into a SerializationWriter instance and, once all data has been written, you store the resulting byte[] into the SerializationInfo block using a single name. Most data types have some level of optimization, but where it really shines is when you have data where the type is unknown at compile time, such as an object[], in which case it identifies the type and stores it in its most optimized form. The biggest payback, though, is when you can identify certain 'root' classes that encapsulate potentially many other objects (a DataSet for example) - by having a single SerializationWriter instance store data for the whole object graph, you get the advantages of string tokenization across all your objects and just one single, relatively small, byte[] to store.
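A rough sketch of how that might look inside an ISerializable implementation - note that the exact method names (Write, WriteObject, ToArray, and their Read counterparts) are my reading of the library and should be checked against the download:

```csharp
using System;
using System.Runtime.Serialization;

// Hypothetical class using SerializationWriter/SerializationReader from the
// article to pack all of its data into a single named byte[].
public partial class MyClass
{
    private string name;
    private int count;
    private object[] values;

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        SerializationWriter writer = new SerializationWriter();
        writer.Write(name);          // string - tokenized/optimized
        writer.Write(count);         // int - stored in minimal bytes
        writer.WriteObject(values);  // object[] - each element typed and optimized
        info.AddValue("data", writer.ToArray()); // one name, one byte[]
    }

    protected MyClass(SerializationInfo info, StreamingContext context)
    {
        SerializationReader reader =
            new SerializationReader((byte[])info.GetValue("data", typeof(byte[])));
        name = reader.ReadString();
        count = reader.ReadInt32();
        values = (object[])reader.ReadObject();
    }
}
```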
So you are still doing some manual coding work for optimization, but typically no more than if you were using the standard .Net way of using ISerializable, and typically only on the few classes that need it.
In my particular case, DataSets and LLBLGenPro entities/collections (2 completely different projects) gave excellent results for size, speed, and (indirectly) memory fragmentation and network usage. The code only needed to be written once for each type but will work for *any* DataSet we now pass across the network regardless of whether it is small or large or even empty and the same for the LLBLGenPro entities/collections we have now or create in the future because they are all ultimately derived from the same class. Write well once, use many times.
Other posters have found value even when they *know* that the objects they are remoting will always be small - see andreboom's comment about saving 40% even on an object that is around 180 bytes. Another poster needed to serialize web state and found FastSerialization to be much faster than Microsoft's LosFormatter which does a similar thing but only for a few, specific types. v2.1 added support for Surrogate helper classes so that you can move the code for FastSerialization into separate classes to help optimize serialization of classes where you don't have control of the source code - hence a WebFastSerializationHelper sample to get you started.
Damon's main point (there are many!) is that your application should have been designed so that the objects going across the network are small and few - anything else indicates a design flaw. I don't necessarily fully subscribe to this point of view. Yes, by all means reduce what is going across the network, but it's not the individual object size that causes problems but the object *graph* size, and that is very rarely predictable. If you let your user enter range criteria, for a set of reports say, then it is very difficult to predict how much data this will actually involve, especially if there are a large number of criteria permutations.
The intention of the article is not to tell you how to design or write your application but provide an option for optimization (and some technique to use) where you deem it necessary or desirable. If you can identify certain classes that will be remoted frequently or in large quantities (as part of an object graph, not just their individual size), it may be worthwhile investing a little time to see if they are optimizable using FastSerialization. If you use DataSets or LLBLGenPro entities then the work has already been done for you.
In your particular case, you seem to have already identified that many and possibly large objects will need to be moved around. You can either speak to Damon who will tell you that your design is wrong or you can see whether the speed is acceptable using as-is .Net remoting and, if not, try ISerializable (.Net style) and then ISerializable (Fast Serialization style). Compression is also an option to look at but bear in mind it is usually applied *after* serialization not *during* and so helps only with the network transmission side.
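To illustrate the compression point - applied *after* serialization, not during - something like .NET 2.0's GZipStream can wrap the already-serialized bytes before they go over the wire:

```csharp
using System.IO;
using System.IO.Compression;

public static class WireHelper
{
    // Compress an already-serialized byte[] to reduce network transmission.
    // This saves bandwidth but adds CPU cost; measure before committing to it.
    public static byte[] Compress(byte[] serialized)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(serialized, 0, serialized.Length);
            }
            return output.ToArray();
        }
    }
}
```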
Cheers
Simon
|
|
|
|
|
@Damon Wow! What a clever man! You're definitely one of those dangerous "code religious" types (database or else). This whole discussion has been taken over by one huge ego massage. Others have found benefit, so why try to convince us that we have not, when we are not blind?
modified on Thursday, February 07, 2008 3:43:19 AM
|
|
|
|
|
Dear Damon Carr:
I wasn't going to add something to a 1½ year old thread, but I see somebody else couldn't resist, so I'll give in to my innate fish-slapping urges and reply, too.
My favorite quote from your original post:
"Can you really do a better job the Microsoft here?"
Well, yes... yes, he can. He has proven he can, and taken the time to analyze the results and post them for us to review. He has given a set of detailed posts explaining the underlying problem, come up with an answer, presented us with the code, and explained in detail what he is doing to address the problem and why.
He has researched shortcomings of his approach and improved his code. He has listened to the input provided by many readers (even you), and incorporated their responses into his solution. He even tracked down a faster compression library (one that I, at least, had never heard of), improving his results even further.
Simon Hewitt is the perfect consultant: smart, insightful, and clever. He has the perfect balance of theory and practicality, and, most importantly, he solves problems (and his solutions are not the horrific mess you make them out to be).
Damon, if I were looking for a company to contract for work, and read this thread, I would dump agilefactor out of consideration and not give them another thought. I would, however, feel very confident in SimmoTech's ability to meet my needs and would be willing to trust them with my most important projects.
Not only does your post come across as quite arrogant, but I see you wearing blinders, on a moral high road that you will follow regardless of the reality in front of you. You say you are willing to acknowledge circumstances, but that sounds like lip-service to me. Nobody truly willing to acknowledge circumstances would even consider writing a reply like yours.
Be honest: if the circumstances required you to pass large quantities of data across the wire, how willing would you be to do it? How much time would you lose, certain there is a 'better' way? At the end of the day, the goal is a working solution, and the key to success is balance.
You probably can't imagine a case where you would have to pass so much data, so answering those questions might be tough for you. I don't have to imagine any specific situation to acknowledge that such a case could easily exist. But as a lead developer for a data-analysis/metrics company, I could rattle off a dozen situations where we have to aggregate a million rows or more by arbitrary parameters and display the results in various graphs and controls across the web. Simon's code is salvation for a problem we face, one that "better design" wouldn't resolve if we spent the rest of the year on it.
You may feel that your post was respectful and encouraged a "lively debate", but I found it condescending, uninformed, and near-sighted, and every reply you posted merely cast you in a worsening light. These posts are now preserved for countless generations to discover when they search for more information about you and your company.
When you are looking for a "lively debate", consider the difference between "I know best and your way is wrong" and statements such as "The problem I see with this approach", or "I've found there's often a better way of handling this much data", or "I'm concerned that a, b, and c".
When I say statements such as these, I mean them, because (a) no matter how much experience I gain, I know that the person across from me has different, but equally valid, experiences; (b) a tone like that of your post will lead to an attack-and-defend debate rather than an open meeting of minds; and (c) that even if I were 100% right, and the other person 100% wrong, talking down to them and implying they have no idea what they're doing is unproductive.
I am extremely unimpressed with your knowledge, maturity, and professionalism, and sincerely hope you take these thoughts into consideration when you work with others.
Best Regards,
James B
modified 23-Jan-13 22:00pm.
|
|
|
|
|
Hi,
You write that the WriteString method is always optimized, so I wonder whether ReadString can deserialize a v1-serialized string object.
Background: I'm using your FastSerializer v1 to store serialized objects in a database (as BLOBs). Is it possible to deserialize these objects using v2 of your FastSerializer, or would there be any problems?
Example:
Greetings
Klaus-Jürgen
|
|
|
|
|
Hi Klaus
It's not really recommended to use different versions for this purpose, since even the tiniest change can result in a failure to deserialize - remoting was the original aim, where the code would always be the same version. (Having said that, I have an app which stores very large DataSets of audit data in a BLOB field - I ran a script to reserialize all of the stored data when I moved to .NET 2.0. That worked for me but might not be appropriate for you.)
Even if the string optimization code was identical between versions, changes to other parts of the code, even reordering the Enum of type codes, might result in the data not being deserializable.
The easiest way to test this for your particular situation is to write a quick app using v2 to read and deserialize your BLOB fields - you will soon get an exception if the data is not exactly in the expected format.
Another alternative to be absolutely safe is to incorporate both versions into your code - just move v1 to a different namespace. You still have the problem of knowing which to run for a given BLOB field though.
Further still, you could change the source code to add some versioning information to the stream.
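As a hypothetical sketch of that versioning idea, you could prefix each BLOB with a format version byte so the correct deserializer can be chosen when it is read back:

```csharp
using System;

public static class VersionedBlob
{
    // Hypothetical scheme: one leading byte identifies the serializer
    // version that produced the rest of the data.
    private const byte CurrentVersion = 2;

    public static byte[] AddVersion(byte[] serialized)
    {
        byte[] result = new byte[serialized.Length + 1];
        result[0] = CurrentVersion;
        Array.Copy(serialized, 0, result, 1, serialized.Length);
        return result;
    }

    public static byte[] StripVersion(byte[] blob, out byte version)
    {
        version = blob[0]; // dispatch to the v1 or v2 reader based on this
        byte[] data = new byte[blob.Length - 1];
        Array.Copy(blob, 1, data, 0, data.Length);
        return data;
    }
}
```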
Persisting serialized data will always have this problem - even Microsoft have this problem.
Cheers
Simon
|
|
|
|
|
Hi,
Great code !
What's the best approach for enum ?
I'm currently doing this:
writer.Write(MyEnumType.ToString());
... read
_myenumtype = (MyEnumType)Enum.Parse(typeof(MyEnumType),reader.ReadString());
Do you have a better option ?
|
|
|
|
|
Depends on the Enum.
If you will know its type at deserialization time then you can cast it to/from an int and store it optimized.
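For example (the writer/reader names follow the article's library; ReadInt32 is assumed to exist as the counterpart of Write(int)):

```csharp
// Writing: cast the enum to its underlying int so it is stored
// in the optimized integer form rather than as a string.
writer.Write((int)myEnumValue);

// Reading: the enum type is known at compile time, so just cast back.
myEnumValue = (MyEnumType)reader.ReadInt32();
```

This avoids both the string storage cost and the Enum.Parse call on the way back in.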
If you don't know the type, then v2.1 has support in WriteObject for Enums anyway and I hope to release it this week.
Cheers
Simon
|
|
|
|
|