Hi guys, I am parsing an HTML file in C# and extracting its text. The file contains a lot of tags, including select and option tags. I need a regex that removes the select and option tags from the file, because I don't want that information in the output. Please help me; any help would be appreciated. Below is the HTML that I want to remove from my file:
HTML
<select name="state" onchange="setCities();" id="state">
                <option value="CA" selected="selected">CA</option>
<option value="WA">WA</option>
<option value="OR">OR</option>
<option value="AZ">AZ</option>
<option value="UT">UT</option>
<option value="IA">IA</option>
<option value="MD">MD</option>

<option value="TX">TX</option>
<option value="NV">NV</option>
<option value="CO">CO</option>
<option value="MI">MI</option>
<option value="SC">SC</option>
<option value="AL">AL</option>
<option value="OH">OH</option>
<option value="KY">KY</option>
<option value="FL">FL</option>

<option value="MT">MT</option>
<option value="WI">WI</option>
<option value="GA">GA</option>
<option value="NY">NY</option>
<option value="KS">KS</option>
<option value="MA">MA</option>
<option value="LA">LA</option>
<option value="VA">VA</option>
<option value=""></option>

<option value="IL">IL</option>
<option value="NM">NM</option>
<option value="IN">IN</option>
<option value="NC">NC</option>
<option value="ID">ID</option>
<option value="NJ">NJ</option>
<option value="DC">DC</option></select>
        
        

            <select name="city" id="city" style="width:150px;"><option value="Anaheim" selected="selected">Anaheim</option>
<option value="Azusa">Azusa</option>
<option value="Baldwin Park">Baldwin Park</option>
<option value="Bellflower">Bellflower</option>
<option value="Brea">Brea</option>
<option value="Buena Park">Buena Park</option>
<option value="Burbank">Burbank</option>
<option value="Canoga Park">Canoga Park</option>

<option value="Cerritos">Cerritos</option>
<option value="Chino">Chino</option>
<option value="Chino Hills">Chino Hills</option>
<option value="Chula Vista">Chula Vista</option>
<option value="Compton">Compton</option>
<option value="Corona">Corona</option>
<option value="Corona Del Mar">Corona Del Mar</option>
<option value="Costa Mesa">Costa Mesa</option>
<option value="Cudahy">Cudahy</option>

<option value="Cypress">Cypress</option>
<option value="Davis">Davis</option>
<option value="E. Los Angeles">E. Los Angeles</option>
<option value="El Monte">El Monte</option>
<option value="El Segundo">El Segundo</option>
<option value="Elk Grove">Elk Grove</option>
<option value="Encinitas">Encinitas</option>
<option value="Fontana">Fontana</option>
<option value="Fountain Valley">Fountain Valley</option>

<option value="Fullerton">Fullerton</option>
<option value="Garden Grove">Garden Grove</option>
<option value="Glendale">Glendale</option>
<option value="Granada Hills">Granada Hills</option>
<option value="Hesperia ">Hesperia </option>
<option value="Hollywood">Hollywood</option>
<option value="Huntington Beach">Huntington Beach</option>
<option value="Huntington Park">Huntington Park</option>
<option value="Inglewood">Inglewood</option>

<option value="Irvine">Irvine</option>
<option value="La Habra">La Habra</option>
<option value="La Palma">La Palma</option>
<option value="La Quinta">La Quinta</option>
<option value="Ladera Ranch">Ladera Ranch</option>
<option value="Laguna Beach">Laguna Beach</option>
<option value="Laguna Hills">Laguna Hills</option>
<option value="Laguna Niguel">Laguna Niguel</option>
<option value="Lake Forest">Lake Forest</option>

<option value="Lakewood">Lakewood</option>
<option value="Lennox">Lennox</option>
<option value="Long Beach">Long Beach</option>
<option value="Los Angeles">Los Angeles</option>
<option value="Lynwood">Lynwood</option>
<option value="Manhattan Beach">Manhattan Beach</option>
<option value="Mission Viejo">Mission Viejo</option>
<option value="Modesto">Modesto</option>
<option value="Montrose">Montrose</option>

<option value="Napa">Napa</option>
<option value="Newport Beach">Newport Beach</option>
<option value="Northridge">Northridge</option>
<option value="Norwalk">Norwalk</option>
<option value="Oceanside">Oceanside</option>
<option value="Ontario">Ontario</option>
<option value="Orange">Orange</option>
<option value="Pacoima">Pacoima</option>
<option value="Palmdale">Palmdale</option>

<option value="Paramount">Paramount</option>
<option value="Pasadena">Pasadena</option>
<option value="Petaluma">Petaluma</option>
<option value="Pomona">Pomona</option>
<option value="Redondo Beach">Redondo Beach</option>
<option value="Rialto">Rialto</option>
<option value="Riverside">Riverside</option>
<option value="Sacramento">Sacramento</option>
<option value="San Bernardino">San Bernardino</option>

<option value="San Carlos">San Carlos</option>
<option value="San Diego">San Diego</option>
<option value="San Fernando Valley">San Fernando Valley</option>
<option value="San Francisco">San Francisco</option>
<option value="San Pedro">San Pedro</option>
<option value="San Ramon">San Ramon</option>
<option value="Santa Ana">Santa Ana</option>
<option value="Santa Barbara">Santa Barbara</option>
<option value="Santa Clarita">Santa Clarita</option>

<option value="Santa Maria">Santa Maria</option>
<option value="Santa Monica">Santa Monica</option>
<option value="Seal Beach">Seal Beach</option>
<option value="Signal Hill">Signal Hill</option>
<option value="Somewhere">Somewhere</option>
<option value="South Gate">South Gate</option>
<option value="Stanton">Stanton</option>
<option value="Studio City">Studio City</option>
<option value="Sun Valley">Sun Valley</option>

<option value="Sunland">Sunland</option>
<option value="Temecula">Temecula</option>
<option value="Thousand Oaks">Thousand Oaks</option>
<option value="Torrance">Torrance</option>
<option value="Tustin">Tustin</option>
<option value="Union City">Union City</option>
<option value="Valencia">Valencia</option>
<option value="Van Nuys">Van Nuys</option>
<option value="Ventura">Ventura</option>

<option value="Vista">Vista</option>
<option value="W. Covina">W. Covina</option>
<option value="West Hollywood">West Hollywood</option>
<option value="Westminster">Westminster</option>
<option value="Whittier">Whittier</option>
<option value="Woodland Hills">Woodland Hills</option>
<option value="Yorba Linda">Yorba Linda</option></select>
        
        

            <input type="submit" value="Go">
Updated 5-Feb-12 23:12pm
Comments
Sergey Alexandrovich Kryukov 6-Feb-12 5:00am    
Such a useless code dump! The problem is pretty simple, but you did not ask any question. What's the problem? Just do it.
--SA
Waseem Fastian 6-Feb-12 5:11am    
@SAKryukov, I want to remove this HTML using a regex. My HTML file has a lot of tags in it, and what I posted is just one part of the file. I do not need this information, so I want to remove it from the file. I need a regex for this. Thanks.

If you want to remove the whole select element together with its content, you can do it with a regex only if one constraint is met: there must be no nested element of the same name (select in this case).

Try this:

C#
string pattern = @"<select\b[\s\S]*?</select>";


Cheers

Andi
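
For later readers, here is a minimal sketch of how the pattern above might be applied with Regex.Replace. The names SelectStripper and RemoveSelects are made up for illustration, and RegexOptions.IgnoreCase is an added assumption so that an upper-case SELECT tag is caught as well. Like the pattern itself, it assumes select elements are not nested.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SelectStripper
{
    // Pattern from the answer: lazily match from "<select" up to the first "</select>".
    // IgnoreCase is an assumption added for this sketch.
    private static readonly Regex SelectElement =
        new Regex(@"<select\b[\s\S]*?</select>", RegexOptions.IgnoreCase);

    // Removes every non-nested <select>...</select> block, option tags included.
    public static string RemoveSelects(string html)
    {
        return SelectElement.Replace(html, string.Empty);
    }
}
```

Calling SelectStripper.RemoveSelects on the HTML from the question should leave only the surrounding whitespace and the submit input, since each option sits inside a select and is removed with it.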
 
 
Comments
1castle1 22-Oct-12 5:23am    
Thanks.... string pattern = @"<select\b[\s\S]*?"; worked for me :) now just need to get the value and the text out the middle
Andreas Gieriet 24-Oct-12 18:27pm    
Hello 1castle1,
Out of the middle of what?
Example: input = ..., expected output = ...?
Cheers
Andi
1castle1 20-Nov-12 9:33am    
I was trying to build an HTML scraper with regex to fill some tables with test data, but then I found a project called Html Agility Pack and used that instead.
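
On getting the value and the text out of each option: for markup as regular as the sample in the question, a regex can probably manage it. The following is a rough sketch only, assuming double-quoted value attributes and plain text (no nested tags) inside each option. OptionReader and ReadOptions are hypothetical names. For anything less regular, a real HTML parser is the safer choice.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class OptionReader
{
    // Assumes value="..." uses double quotes and the option body contains no markup.
    private static readonly Regex Option = new Regex(
        @"<option\b[^>]*?\svalue=""(?<value>[^""]*)""[^>]*>(?<text>[^<]*)</option>",
        RegexOptions.IgnoreCase);

    // Returns (value, text) pairs in document order; the text is trimmed, the value is not.
    public static List<KeyValuePair<string, string>> ReadOptions(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Option.Matches(html))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["value"].Value,
                m.Groups["text"].Value.Trim()));
        }
        return result;
    }
}
```

Each pair holds the value attribute as the key and the visible text as the value, so the first option of the state list would come back as ("CA", "CA").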
I have done it myself. This is the regex that I used:

@"<select(\s+[^>]*)?>(.|\n)*?</select(\s+[^>]*)?>"

Thanks to my mighty Allah. Thanks also to everyone who gave me feedback.
 
Comments
Prerak Patel 6-Feb-12 11:51am    
And it doesn't remove the option tags, as was asked for in the question. :doh:
Waseem Fastian 6-Feb-12 12:23pm    
This regex removes the select tag from the HTML. The option tags are inside the select tag, so they are deleted automatically.
Andreas Gieriet 6-Feb-12 13:57pm    
This regex will not remove the whole select element with its content. Did something get lost while pasting it into the solution?

Cheers

Andi
Waseem Fastian 7-Feb-12 0:25am    
This regex removed all the select content. I have used it in my project and it worked. Cheers
Andreas Gieriet 7-Feb-12 15:03pm    
Yeah. Now it's better. Before, the whole second select part was lost, up to the ].
Now I agree. This would work fine, though it's a bit of an overkill. You use (...), which stores the matched string; use (?:...) instead. And if you want to make sure that a word is not part of a larger word, you may use the word-boundary anchor \b. Combining all that results in my solution #3.

But as you said: yours works as well.

Cheers

Andi
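
To make the (...) versus (?:...) point concrete, here is a small sketch that puts the two forms side by side. It is an illustration under assumptions, not either poster's exact code: the self-answer's pattern is written with a plain </select closing tag, and SelectPatterns, GroupCount and Strip are hypothetical names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical comparison helper for illustration only.
public static class SelectPatterns
{
    // The self-answer's pattern: three capturing groups.
    public const string Capturing =
        @"<select(\s+[^>]*)?>(.|\n)*?</select(\s+[^>]*)?>";

    // Same structure with non-capturing groups and \b after the tag name.
    public const string NonCapturing =
        @"<select\b(?:\s+[^>]*)?>(?:.|\n)*?</select\b(?:\s+[^>]*)?>";

    // Number of groups the engine tracks, counting group 0 (the whole match).
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }

    // Removes every match of the given pattern from the input.
    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

Both patterns should strip the same text from well-formed input; the difference is that the capturing form tracks four groups (the whole match plus three captures) while the non-capturing form tracks only the whole match. Replacing (?:.|\n) with [\s\S] and dropping the optional attribute groups gives the shorter pattern from the other solution.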

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


