This article helped me greatly, as I was stuck on inserting and retrieving multilingual data from SQL Server.
With this method in place, all characters look as they should, without all those ?????? in place of the real data.
Thanks a lot.
Uses LINQ and method extensions (.NET v3.5)
// Extension methods must live in a static class; Select needs "using System.Linq;".
public static string ToHtml(this string str)
{
    return string.Join("", str.ToCharArray()
        .Select(c => (int)c > 127 ? "&#" + (int)c + ";" : c.ToString())
        .ToArray());
}
Any qu's to my ICQ: 94053010
This is very useful and I had a very bad implementation of the same thing on my app till now. However, the reason I reached this article is because I need to HTML encode only the 'text' part of a string, leaving the HTML tags intact. For example:
This isn't a <font face='Verdana'>test</font> string & I need to £ it into HTML.
In this string, I don't want the HTML tags encoded, but the quotation marks, ampersand, pound sign, etc need to be encoded. Any ideas? Thanks in advance.
If you know specifically which symbols you don't want encoded (such as the less-than and greater-than signs), you could remove the call to HtmlEncode and do everything manually via the second half of the algorithm. You'd have to add special handling for symbols like the ampersand, which fall below the 127 value threshold.
Alternatively you could apply a regular expression and identify all of the pieces of the string that aren't html tags and just pass those chunks to the encode function.
That's all I can think of at the moment.
Yeah, that's what I thought of too. Here's what I'm doing:
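// Declared once, outside the method, so the pattern is parsed and compiled only once.
private static readonly Regex regexStripHtml = new Regex(@"<[^>]+>", RegexOptions.Compiled);

public static string ProcessHtml(string source)
{
    List<string> parts = new List<string>();
    int position = 0; // start of the text that has not been processed yet
    foreach (Match m in regexStripHtml.Matches(source))
    {
        // Encode the text before the tag, then keep the tag itself untouched.
        parts.Add(ReplaceHtmlWithSymbols(source.Substring(position, m.Index - position)));
        parts.Add(m.Value);
        position = m.Index + m.Length;
    }
    // Encode whatever follows the last tag.
    parts.Add(ReplaceHtmlWithSymbols(source.Substring(position)));
    return String.Join("", parts.ToArray());
}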
The regex is a standard strip-all-HTML regex which I have been using in other parts of the application. Here I use it to identify the HTML tags instead of stripping them out. Then I take the non-HTML parts and run them through the encoding method (I've called it ReplaceHtmlWithSymbols()), and I put the tags back in without processing.
For anyone who intends to use this method, remember to declare the Regex outside the method, so that it is parsed and compiled once in the lifetime of the app, not every time the method is called.
I don't know how many others care about this, but I wanted to keep the new lines (\r\n) and the spacing (including \t).
I realize you can use the pre tag, but I preferred not to in my situation
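I just changed the last else to the following.
Was:
else
{
    result.Append(c);
}
Change to:
else
{
    switch (value)
    {
        case 13: // CR: skipped so that "\r\n" yields one <br />, not two
            break;
        case 10: // LF
            result.Append("<br />");
            break;
        case 9: // tab; assumed to expand to four non-breaking spaces, adjust as needed
            result.Append("&nbsp;&nbsp;&nbsp;&nbsp;");
            break;
        case 32:
            // A space that follows a plain space is assumed to become &nbsp;,
            // so that runs of spaces survive HTML whitespace collapsing.
            if (result.Length > 0 && result[result.Length - 1] == ' ')
                result.Append("&nbsp;");
            else
                result.Append(' ');
            break;
        default:
            result.Append(c);
            break;
    }
}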
Don't know if anyone else would find it useful, but figured I'd put it out there
modified on Tuesday, June 9, 2009 5:58 PM
The proper way to do what you're trying to do is to encode your text as UTF-8. Instead of sending a String class, you need to send a byte array. Here's a code sample.
// Requires "using System.Text;".
public static String Utf8ToString(Byte[] byteArray)
{
    UTF8Encoding coder = new UTF8Encoding(false);
    return coder.GetString(byteArray);
}

public static Byte[] StringToUtf8(String xmlString)
{
    UTF8Encoding coder = new UTF8Encoding(false);
    return coder.GetBytes(xmlString);
}
Then, you'll write the bytes from the StringToUtf8 function directly to the HttpWebRequest. Here's a sample, assuming that you're posting form data.
// Requires "using System.IO;" and "using System.Net;".
private static string PostData(string postURL, string postData)
{
    // Create the request.
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(postURL);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/x-www-form-urlencoded";

    // Get the form post data.
    byte[] postBytes = StringToUtf8(postData);
    // The length must be the byte count, which differs from the character
    // count as soon as the data contains non-ASCII characters.
    webRequest.ContentLength = postBytes.Length;

    // Write the post data.
    Stream webStream = webRequest.GetRequestStream();
    webStream.Write(postBytes, 0, postBytes.Length);

    // Get the response.
    HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
    Stream responseStream = webResponse.GetResponseStream();

    // Read the response into a byte buffer, 16 KB at a time.
    MemoryStream responseBytes = new MemoryStream();
    byte[] responseBuffer = new byte[16 * 1024];
    int bytes;
    while ((bytes = responseStream.Read(responseBuffer, 0, responseBuffer.Length)) > 0)
    {
        responseBytes.Write(responseBuffer, 0, bytes);
    }

    // Prepare the response.
    return Utf8ToString(responseBytes.ToArray());
}
I've left things like closing the streams and error-handling as an exercise for the reader. Also, for a real-world application, you'll want to use async IO, assuming that you don't want to hang the UI. This was just thrown together, though.
Excellent, thank you. However this is well beyond the intended scope of the article and appears to be geared specifically towards ASP.NET, whereas I would like to keep the article universally applicable and string-based for simplicity.
ASP.NET users will be thankful for this post.
Logan
Regarding your question of whether the conversion from char to int returns the Unicode code point: there may be a problem with your naive implementation, because HTML numeric entities refer to the full code point (the UTF-32 value). Windows itself uses UTF-16, which represents characters above 0xFFFF as two surrogate chars (in the range 0xD800 to 0xDFFF) instead of one. So you may have to convert the surrogate pairs into the UTF-32 form, as described at http://www.unicode.org/faq/utf_bom.html#35.
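A minimal sketch of the surrogate handling mentioned above, assuming a numeric entity should carry the full Unicode code point. The names EntitySketch and ToNumericEntities are hypothetical and not taken from the article; char.IsHighSurrogate, char.IsLowSurrogate and char.ConvertToUtf32 are standard .NET calls.

```csharp
using System;
using System.Text;

public static class EntitySketch
{
    // Hypothetical helper: replaces every character above 127 with a numeric
    // entity, combining a UTF-16 surrogate pair into a single code point.
    public static string ToNumericEntities(string text)
    {
        StringBuilder result = new StringBuilder(text.Length + text.Length / 10);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // Two chars form one character above 0xFFFF.
                int codePoint = char.ConvertToUtf32(c, text[i + 1]);
                result.Append("&#").Append(codePoint).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else if (c > 127)
            {
                result.Append("&#").Append((int)c).Append(';');
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

Characters in the Basic Multilingual Plane take the same path as a plain cast to int, so only text containing surrogate pairs produces different output.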
Thanks for the info, I will keep this in mind for future examination. In the meantime, it appears being limited to UTF-16 is not a significant problem.
It does seem weird that this isn't in the Framework, doesn't it?
I ended up solving the same problem a while back (code here) but mine isn't as succinct.
The main reason I post it is because I wrote my own Decode function, which I saw someone ask about in a comment. You are probably right that the Framework Decode works fine, but if anyone's interested, here's a Decode function (with named entity table).
Disclaimer: this was written when I was still relatively new to .NET, so you might have a chuckle at the way I've approached the problem
Took a look at your code and noticed that you also include a lower-bound on the character values--a nice touch in the event that a passed string is mangled and not in good form. Good work, though certainly a little more verbose than my quick hack.
Logan
All you have to do is:
int value = (int)c;
chars are stored as 16-bit integers, so it only requires a type cast to move to another integer type.
Trivial, but just an FYI
Aaron
Good to know. Do you think there's any overhead to calling Convert.ToInt32()? It is a static method after all.
Also this still doesn't do anything to guarantee that the value is, in fact, the standard Unicode value for the character (unless this is stated somewhere in documentation that I am unaware of).
I'm wondering if you really need to call HttpUtility.HtmlEncode since you are anyway encoding all the characters greater than 127.
In fact, the most "dangerous" characters are below the 127 value threshold: namely, &, <, >, and ". If these are left unescaped in HTML, bad things can happen. The call to the HttpUtility method covers these, and a few others, though you are right that whatever it encodes above 127 is redundant. Nevertheless, my method will not "over-process", since anything HtmlEncode has already encoded consists only of characters below the 127 threshold, and my algorithm passes over those.
So think of my method as a "second pass" that covers specifically those higher-order extended characters.
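The two passes can be sketched roughly as follows. This is an illustration, not the article's code: TwoPassSketch and Encode are hypothetical names, and WebUtility.HtmlEncode (System.Net) is swapped in for HttpUtility.HtmlEncode so that the sample does not depend on System.Web.

```csharp
using System;
using System.Net;
using System.Text;

public static class TwoPassSketch
{
    public static string Encode(string text)
    {
        // First pass: escape the markup-significant characters such as &, < and >.
        string escaped = WebUtility.HtmlEncode(text);

        // Second pass: every remaining character above 127 becomes a numeric entity.
        StringBuilder result = new StringBuilder(escaped.Length + escaped.Length / 10);
        foreach (char c in escaped)
        {
            int value = c;
            if (value > 127)
            {
                result.Append('&').Append('#').Append(value).Append(';');
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

The entities written by the first pass contain only ASCII characters, so the second pass leaves them alone and nothing is encoded twice.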
This line:
result.Append("&#" + value.ToString() + ";");
totally defeats the purpose of using a StringBuilder. The reason you use a StringBuilder is to avoid repeated string concatenations, and then you do three on one line. Use this instead:
result.AppendFormat("&#{0};", value);
Edit: just to illustrate the problem, say you have an HTML document of 100,000 characters and 10% of them are > 127. In the line above you create a string "&#" (1), append value.ToString() (2) to it, making a string "&#xx" (3), and then add another string ";" (4) to that. That's 4 strings created and destroyed in one line. 10% of 100,000 is 10,000; times 4, that's 40,000 strings allocated during the process. Granted, some are interned and the compiler will sort out some of the mess, but it's still very bad practice. Using AppendFormat (or at least string.Format) avoids this problem as long as you instantiate the StringBuilder with a large enough initial buffer.
I suggest something like new StringBuilder(1000);
Thanks! I was actually wondering about that. I'm not very familiar with StringBuilders.
I've decided to instantiate the StringBuilder with text.Length + 10%, which should be more than enough room for typical usage. Of course people can customize this to what they need.
I agree that using the AppendFormat is cleaner and more readable. I would recommend that one!
But from a performance perspective, the Append is faster.
This is because AppendFormat has to search through the format string for the placeholder(s) before concatenating them.
I tested the performance of the four approaches I could think of:
A:
sb.AppendFormat("&{0};", i);
B:
sb.Append("&" + i.ToString() + ";");
C:
sb.Append("&");
sb.Append(i);
sb.Append(";");
D: [Corrected]
sb.Append('&');
sb.Append(i);
sb.Append(';');
The results varied some on each run, but would typically be like this:
A: 1.50 (worst time)
B: 1.12
C: 1.07
D: 1.00 (best time, reference index)
-- modified at 18:15 Tuesday 4th September, 2007
Interesting. Yes it makes sense that the search would slow things down a little. I might update the article to reflect these findings. After all, a better algorithm is a better algorithm.
Could you explain the difference between tests C and D? They appear identical to me. Also don't forget the '#' sign after the ampersand; I'm pretty sure it's necessary to identify numerical entities.
Well, there was for a moment NO difference between C and D. Sorry! I fixed the typo in the original post. (Copy-paste problem)
The difference is using ' instead of " for the single characters, so D appends chars rather than strings. Since you need to add two characters (&#), you should use alternative C.
I have an HTML-encoded string. How can I decode it?
I believe HttpUtility.HtmlDecode() works perfectly for this.
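A small sketch of the decoding direction. DecodeSketch is a hypothetical wrapper, and WebUtility.HtmlDecode (System.Net) is used here as a stand-in for HttpUtility.HtmlDecode, which lives in System.Web; both handle named and numeric entities.

```csharp
using System;
using System.Net;

public static class DecodeSketch
{
    // Hypothetical wrapper: turns named and numeric entities back into characters.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

For example, passing "a&lt;b &amp; &#233;" should give back the original text with the less-than sign, the ampersand and the accented character restored.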
Have you tried comparing the size of the file after encoding? And the download time?