So I have this method (C#, .NET); let's call it F(i). The parameter i is an integer representing a UTF-16 character. The method determines which one of the following classes the character is a member of:
Control (ASCII control characters)
Delimiter (the caller can specify which characters are delimiters)
Non-ASCII (i > 127)
Normal (ASCII characters which are not members of any other class)
This is implemented as an array look-up wrapped in a catch for IndexOutOfRangeException, which fires for EOF and for non-ASCII characters. This has been working well for a while. The data (JSON files mostly, but not exclusively) is nearly all ASCII, with only an occasional non-ASCII character -- maybe a few "smart quotes" or similar, which are OK; in many cases I replace those with their ASCII versions anyway.
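To make the setup concrete, here is a minimal sketch of what such a lookup-with-catch classifier might look like. The enum name, table name, and constructor are my assumptions, not the author's actual code:

```csharp
using System;

// Hypothetical reconstruction of the original approach. The table has
// only 128 entries, so EOF and non-ASCII inputs both throw and are
// handled by the catch.
enum CharClass { Normal, Control, Delimiter, NonAscii }

class Classifier
{
    private readonly CharClass[] _classTable = new CharClass[128];

    public Classifier(string delimiters)
    {
        for (int i = 0; i < 32; i++) _classTable[i] = CharClass.Control;
        _classTable[127] = CharClass.Control;              // DEL
        foreach (char d in delimiters) _classTable[d] = CharClass.Delimiter;
        // Everything else defaults to Normal (enum value 0).
    }

    // F(i): EOF (a negative value) and i > 127 both fall into the catch.
    public CharClass F(int i)
    {
        try
        {
            return _classTable[i];
        }
        catch (IndexOutOfRangeException)
        {
            return CharClass.NonAscii;  // (EOF signaling elided here)
        }
    }
}
```

The appeal of this shape is that the hot path -- an ASCII character -- is a single array index with no branching.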
BUT once in a while we receive a corrupt file which (in the latest case) includes a JSON value containing more than a million non-ASCII characters (encoded in the file as three-byte UTF-8).
F(i) was not performing well in this case. Apparently, having the catch fire occasionally is OK, but having it fire a million times in rapid succession is decidedly not.
Once I tracked the issue to F(i), I could try altering it to add a test for i > 127 and avoid the exception (which I am loath to do on principle). But unit testing did show that it improved performance considerably for non-ASCII characters without significantly hindering performance for ASCII characters (EOF is still handled by a catch).
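That bounds-tested variant might look like the sketch below (my reconstruction; the names and the Eof value are hypothetical). One cheap comparison short-circuits non-ASCII input before the array access, so only EOF still reaches the catch:

```csharp
using System;

// Eof is my addition for illustration; the original enum had four classes.
enum CharClass { Normal, Control, Delimiter, NonAscii, Eof }

class Classifier
{
    private readonly CharClass[] _classTable = new CharClass[128];

    public CharClass Fn(int i)
    {
        if (i > 127) return CharClass.NonAscii;  // the added test
        try
        {
            return _classTable[i];               // now throws only for i < 0
        }
        catch (IndexOutOfRangeException)
        {
            return CharClass.Eof;                // EOF still handled by a catch
        }
    }
}
```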
That sounds like a win, except... I just don't like having the extra test, which is essentially needless given that we don't expect any (or many) non-ASCII characters in most files we receive.
Sooo... I named the original version Fa(i) and the new version Fn(i), and I made F(i) a delegate which starts out pointing to Fa(i). When Fa(i) encounters a non-ASCII character, it re-points the delegate to Fn(i); when Fn(i) encounters an ASCII character, it re-points the delegate back to Fa(i).
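The self-switching delegate described above can be sketched roughly like this (assumed names throughout; EOF handling elided for brevity):

```csharp
using System;

enum CharClass { Normal, Control, Delimiter, NonAscii }

class AdaptiveClassifier
{
    private readonly CharClass[] _classTable = new CharClass[128];

    // F is what callers invoke; it starts out pointing at Fa.
    public Func<int, CharClass> F;

    public AdaptiveClassifier() { F = Fa; }

    // Fa: exception-based lookup, cheapest while the input is all ASCII.
    private CharClass Fa(int i)
    {
        try { return _classTable[i]; }
        catch (IndexOutOfRangeException)
        {
            F = Fn;                    // hit non-ASCII: switch to the tested version
            return CharClass.NonAscii; // (EOF handling elided)
        }
    }

    // Fn: bounds-tested lookup, cheapest during a long non-ASCII run.
    private CharClass Fn(int i)
    {
        if (i > 127) return CharClass.NonAscii;
        F = Fa;                        // back to ASCII: switch back
        return _classTable[i];         // (EOF handling elided)
    }
}
```

The nice property is that each variant only pays its switching cost on a transition, so long runs of either kind of input get the variant that is cheapest for them.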
Slick as snot. Unit testing shows good performance:
Time required to read the million non-ASCII characters with Fa == 12 seconds
Time required to read the million non-ASCII characters with Fn == 0.06 seconds
I have integration testing running now. The current production version times out the file read after ten seconds (a protection I had to add a while back for another corrupt file), but with the new version it should read successfully, and then I should get an error when trying to stuff more than a million non-ASCII characters into a database column defined for 500 ASCII (CP-1252) characters.
In the meantime, the people who send us this file are trying to find out what's causing the issue. So far, it's intermittent (a dozen times in the last four years), so it hasn't become critical.
I'm pretty sure I've done this sort of thing before -- having a delegate which points to one of two slightly different implementations of a method depending on what has been encountered in the data, and flipping back and forth dynamically as required. I guess I'll be code-spelunking this afternoon to review that code.
This is the way.