Anyway, what's so special about "special" characters? It's a phrase that's always got my back up. Is it just meant to be non-alphanumeric characters? These "special" characters are not so special when it comes to punctuation, etc!
Hey, it could be worse. I recently inherited a code base that won't even compile in release mode, only in debug mode. And OFC it uses C's char type all over the place. The codebase is from 2016, mind you, so both Unicode and Unicode path names are a thing. And we're a German R&D office, so umlauts are a thing as well. Ah, and the running code expects several support files RELATIVE TO THE WORKING DIRECTORY! In the meantime, I was able to toss that monstrosity. Do you want to guess what my successor did? He wrote a batch file that changes to the proper folder and then launches the binary, instead of fixing the source code to ignore the working directory. Ah, and, bloody hell, this batch file relies on an environment variable to tell it where it is located itself.
Part of this, I know for a fact, is down to both my predecessor's and my successor's deep hatred of Windows and love for Linux (so they literally couldn't give less of a damn about how to make things work properly on Windows); another part is simply "I don't want to learn anything new since I learned coding back in the 60s". And, I kid you not, this is but a slightly redacted quotation of the answer I received when trying to teach one of those guys ARC to pass a linked list between a part they were maintaining and the part that I was maintaining.
And here we're back to where you started: some people are just stuck in the 60s, or generally in the past. They learned coding back then, when 7-bit ASCII was the only way to go, and simply couldn't care less about keeping up with the times, even if keeping up with the times is but a matter of using ready-made constructs (like Unicode strings or C++'s list<T>).
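For what it's worth, "fixing the source code to ignore the working directory" usually just means resolving support files against the program's own location. Below is a minimal sketch of that idea in Python rather than the original C; the helper names `program_dir` and `support_file` are made up for illustration, and a C program on Windows would need a different mechanism (for example `GetModuleFileNameW`) to find its own path.

```python
import sys
from pathlib import Path


def program_dir(program=None):
    """Return the directory that holds the program file, not the working directory.

    `program` defaults to sys.argv[0]. For a packaged executable,
    sys.executable may be the better source (assumption, adjust as needed).
    """
    source = program if program is not None else sys.argv[0]
    return Path(source).resolve().parent


def support_file(name, program=None):
    """Build the path of a support file that lives next to the program."""
    return program_dir(program) / name
```

With this, `support_file("settings.ini")` points next to the program no matter which folder it was launched from, so no batch file and no environment variable are needed. Since `pathlib` works on Unicode strings, umlauts in the path are not a special case either.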
Ponder your own code that reflects user input data (like comments) back to a web page.
Realize that disallowing ALL special characters makes the data in the DB very future proof.
Points to consider:
Assume any input is trying to hack you
Don't trust that the data in your DB is really safe if a user entered it originally.
Today you emit from DB -> HTML and everything is safe.
Tomorrow you emit from DB -> JSON and a lurking time bomb blows up in your face
as all of your customers start mining bitcoin for someone (not you).
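The DB -> HTML versus DB -> JSON point can be sketched like this: the stored text stays as the user typed it, and each output path applies its own encoding at emit time. This is only an illustration using Python's standard library; the function names and the sample comment are invented, and it is not a complete defence against every injection route.

```python
import html
import json

# Sample of what a user might have stored in the DB (made up for this sketch).
stored_comment = '<script>alert("x")</script> & "quotes"'


def emit_html(text):
    """Encode for an HTML text context: <, >, & and quotes become entities."""
    return "<p>" + html.escape(text, quote=True) + "</p>"


def emit_json(text):
    """Encode for a JSON response: json.dumps handles quotes and backslashes."""
    return json.dumps({"comment": text})


def emit_json_in_script(text):
    """Encode JSON that will be placed inside an HTML <script> block.

    Plain JSON leaves '<' untouched, so '</script>' could end the block early.
    Replacing '<' with its \\u003c escape keeps the JSON valid and inert in HTML.
    """
    return json.dumps({"comment": text}).replace("<", "\\u003c")
```

The HTML encoding that was safe "today" does nothing for the JSON case "tomorrow", which is why the data in the DB cannot be treated as safe just because one output path handled it correctly.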
What I don't get is why this could be possible for input fields that are only meant for plain text input: aren't these recognizable to the interpreter as containers with content that shouldn't be interpreted at all?
If so, why would the contents - user-provided or not - be subjected to any restriction at all? Why should the browser even try to interpret the input field contents?
If not, why not? Why is there no way to define input field contents as off-limits to the interpreter?
GOTOs are a bit like wire coat hangers: they tend to breed in the darkness, such that where there once were few, eventually there are many, and the program's architecture collapses beneath them. (Fran Poretto)
As a main rule, I write software to handle any name etc. in two forms: One "internal", technical, which may be restricted in encoding, use of special characters (including space), length etc. This is the true identifier for an object, or whatever.
The other is an arbitrary length "User friendly" name which may contain arbitrary characters, and it may even be replaced with another name depending on the UI language; it is used in all user dialogs.
The "disadvantage" is that the application code very much leaves this name to be controlled by the user. So e.g. uniqueness cannot be guaranteed. Sure, the code could have refused to accept a duplicate name, but that would restrict the UI naming. (UI names can originate from different sources, so who has "the highest right" to use a given name?) In an interactive UI both matches can be presented, with additional metadata for the user to select.
For non-interactive applications, I always make a fallback to the internal name (which is not at all "secret"), so that batch files / scripts may uniquely identify an object no matter what the user friendly names are. (But for that reason, some resemblance to the user friendly name turns out to be "user friendly".)
This is very much in the style of database identification of tuples: Selection may be based on one or more user specified values, or on some tuple ID defined as a primary key if no (combination of) user attributes are suitable.
I have followed this naming philosophy, with a restricted syntax "internal" name and a free syntax "external" name, for a handful of systems, and I am so satisfied with it that I will continue to do it that way in future projects whenever it fits in among other requirements.
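A rough sketch of how such a two-name scheme might look, in Python. Everything here is an assumption made for illustration: the `Registry` and `Item` names, the lookup order, and in particular the exact rule for internal names (ASCII letter first, then letters, digits, underscore or hyphen, at most 32 characters).

```python
import re
from dataclasses import dataclass

# Assumed restriction for internal names; the real rule is a design choice.
INTERNAL_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,31}")


@dataclass(frozen=True)
class Item:
    internal_name: str  # restricted syntax, unique, the true identifier
    display_name: str   # free syntax, user controlled, may be duplicated


class Registry:
    def __init__(self):
        self._by_internal = {}

    def add(self, internal_name, display_name):
        """Register an item; only the internal name is validated and unique."""
        if not INTERNAL_NAME.fullmatch(internal_name):
            raise ValueError("invalid internal name: %r" % internal_name)
        if internal_name in self._by_internal:
            raise ValueError("duplicate internal name: %r" % internal_name)
        item = Item(internal_name, display_name)
        self._by_internal[internal_name] = item
        return item

    def find(self, name):
        """Return the item with this internal name, else all display-name matches."""
        item = self._by_internal.get(name)
        if item is not None:
            return [item]
        return [i for i in self._by_internal.values() if i.display_name == name]


registry = Registry()
registry.add("pump_01", "Kühlwasserpumpe")
registry.add("pump_02", "Kühlwasserpumpe")  # same display name is allowed
```

Here `registry.find("Kühlwasserpumpe")` returns both items, which an interactive UI could present with extra metadata for the user to choose from, while a script can call `registry.find("pump_02")` and always get exactly one.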