
I have some HTML text. When I display it, I want to highlight some keywords. I don't want a keyword to match if it is part of an HTML tag or of a special character (HTML entity).

For example:
My HTML text: <span>Hello&#160;&#160;Welcome to my Spa No. 160</span>

My keywords: spa 160

For highlighting I use <span class="highlight">keyword</span>

But now it is matching the "spa" inside the tag <span> and the "160" inside the special character &#160;

How can I overcome this? I use C# Regex.

I need a regex that matches the keywords, but not inside tags or special characters.

Thanks in advance.

What you want is a negative lookbehind. For the keywords in the question, match

(?<!</?[^>]*|&[^;]*)(\bspa\b|\b160\b)

and replace with
<span class="highlight">$1</span>

The negative lookbehind syntax is: (?<! ... ), which indicates that the keyword cannot be preceded by a certain pattern. That pattern in this case is either the beginning of a tag </?[^>]* or the beginning of an HTML entity &[^;]* that isn't complete.

</?[^>]* indicates an opening angle bracket, possibly followed by a slash, followed by any number of chars that aren't closing angle brackets. Because it cannot cross a >, it only applies while a tag is still open.

&[^;]* indicates an ampersand followed by any number of chars that aren't semicolons.
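
To see what the lookbehind actually rejects, here is a minimal sketch that runs the sample text from the question through a naive pattern and through the lookbehind version. The variable names are only illustrative, and \b is left out on purpose so that the lookbehind alone does the filtering.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<span>Hello&#160;&#160;Welcome to my Spa No. 160</span>";

// Naive pattern: also hits "spa" inside <span> and </span>, and "160" inside &#160;.
MatchCollection naive = Regex.Matches(html, "(spa|160)", RegexOptions.IgnoreCase);

// Same keywords, rejected when an unclosed '<' or an unterminated '&' precedes them.
MatchCollection guarded = Regex.Matches(
    html, @"(?<!</?[^>]*|&[^;]*)(spa|160)", RegexOptions.IgnoreCase);

Console.WriteLine(naive.Count);   // 6
Console.WriteLine(guarded.Count); // 2
foreach (Match m in guarded)
    Console.WriteLine($"{m.Value} at index {m.Index}");
```

Note that a stray < or & in ordinary text (for example "Tom & Jerry") would also suppress matches until the next > or ; so this relies on the HTML being properly encoded.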

Here's how to incorporate this into your C# code:
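string[] keywords = { "spa", "160", "whatever" };
// Verbatim strings (@"...") keep \b as a regex word boundary; in a regular C# string "\b" is a backspace character.
string pattern = @"(?<!</?[^>]*|&[^;]*)(\b" + string.Join(@"\b|\b", keywords) + @"\b)";
string highlighted = Regex.Replace(htmlContent, pattern, "<span class=\"highlight\">$1</span>", RegexOptions.IgnoreCase);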

EDIT: I incorporated the good point made by Andreas Gieriet - that you need to ensure you are matching complete "words" only by matching word boundaries with \b.
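
Put together as a small runnable program, this could look like the sketch below. Highlight is a hypothetical helper name, not something from the answer, and the Regex.Escape call is an addition on the assumption that keywords may come from user input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps each keyword found outside tags and entities in a highlight span.
static string Highlight(string html, params string[] keywords)
{
    // Regex.Escape keeps characters such as '.' or '+' in a keyword from being read as regex syntax.
    string alternatives = string.Join("|", keywords.Select(k => Regex.Escape(k)));
    string pattern = @"(?<!</?[^>]*|&[^;]*)\b(" + alternatives + @")\b";
    return Regex.Replace(html, pattern, "<span class=\"highlight\">$1</span>", RegexOptions.IgnoreCase);
}

string input = "<span>Hello&#160;&#160;Welcome to my Spa No. 160</span>";
Console.WriteLine(Highlight(input, "spa", "160"));
// <span>Hello&#160;&#160;Welcome to my <span class="highlight">Spa</span> No. <span class="highlight">160</span></span>
```

The \b anchors assume that every keyword starts and ends with a letter, digit or underscore; a keyword ending in punctuation would need a different boundary check.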

Not really sure what you are trying to accomplish here. If you want to highlight "Spa No. 160", try this regex:

If you want to highlight just the words "Spa" and "160", then try this one:

The above regex uses a negative lookbehind to ensure that Spa or spa is not preceded by < or </ and that 160 is not preceded by &#.

Negative Look Behind
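
As a rough illustration of that description, one pattern that fits it is sketched below. The pattern is an assumption, not necessarily the one this answer used, and the variable names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<span>Hello&#160;&#160;Welcome to my Spa No. 160</span>";

// Assumed pattern: "spa" not directly preceded by "<" or "</",
// or "160" not directly preceded by "&#".
string pattern = @"(?<!</?)spa|(?<!&#)160";

// $0 inserts the whole match, so the original casing ("Spa") is preserved.
string result = Regex.Replace(html, pattern, "<span class=\"highlight\">$0</span>", RegexOptions.IgnoreCase);
Console.WriteLine(result);
```

These lookbehinds only check the characters immediately in front of the keyword, so a keyword inside an attribute value, or inside a longer word, would still be matched.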

Use \b around the keywords so that they match only as whole words.
To additionally ignore HTML entities, you can take advantage of the fact that the .NET regex engine scans the input from left to right and tries alternatives in order: put an alternative that matches the entities in front of the keywords, so that an entity is consumed before a keyword can match inside it, e.g.
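string pattern = @"[&#]\w+?;|(\bspa\b|\b160\b)";
// Group 1 succeeds only when a keyword matched, not when an entity was consumed.
foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
   if (match.Groups[1].Success)
   {
      string text = match.Groups[1].Value; // the keyword to emphasize
   }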

Or with LINQ:
var emphasis = Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Cast<Match>().Where(m => m.Groups[1].Success).Select(m => m.Groups[1].Value);
foreach(string text in emphasis)
   ...// do emphasize
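
The same "consume it first" idea can also drive the replacement itself through a MatchEvaluator. The sketch below is one possible way to do that, not the answer's own code: the extra alternative for whole tags and the stricter entity pattern &#?\w+; are assumptions added for the example.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<span>Hello&#160;&#160;Welcome to my Spa No. 160</span>";

// Alternatives are tried left to right: a whole tag or entity is consumed first,
// so the keyword alternative (group 1) never gets to match inside one.
string pattern = @"<[^>]*>|&#?\w+;|\b(spa|160)\b";

string result = Regex.Replace(
    input,
    pattern,
    m => m.Groups[1].Success
        ? "<span class=\"highlight\">" + m.Value + "</span>" // keyword: wrap it
        : m.Value,                                             // tag or entity: keep unchanged
    RegexOptions.IgnoreCase);

Console.WriteLine(result);
// <span>Hello&#160;&#160;Welcome to my <span class="highlight">Spa</span> No. <span class="highlight">160</span></span>
```

Compared with the lookbehind version, this form needs no variable-length lookbehind, at the cost of a callback for every tag and entity in the input.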

