Hi guys, I am parsing an HTML file in C# and extracting the text from it. The file contains a lot of tags, including select and option tags. I need a regex that removes the select and option tags from the file, because I don't want this information in the extracted text. Any help would be appreciated. Below is the HTML that I want to remove from my file:
<select name="state" onchange="setCities();" id="state">
                <option value="CA" selected="selected">CA</option>
<option value="WA">WA</option>
<option value="OR">OR</option>
<option value="AZ">AZ</option>
<option value="UT">UT</option>
<option value="IA">IA</option>
<option value="MD">MD</option>
 
<option value="TX">TX</option>
<option value="NV">NV</option>
<option value="CO">CO</option>
<option value="MI">MI</option>
<option value="SC">SC</option>
<option value="AL">AL</option>
<option value="OH">OH</option>
<option value="KY">KY</option>
<option value="FL">FL</option>
 
<option value="MT">MT</option>
<option value="WI">WI</option>
<option value="GA">GA</option>
<option value="NY">NY</option>
<option value="KS">KS</option>
<option value="MA">MA</option>
<option value="LA">LA</option>
<option value="VA">VA</option>
<option value=""></option>
 
<option value="IL">IL</option>
<option value="NM">NM</option>
<option value="IN">IN</option>
<option value="NC">NC</option>
<option value="ID">ID</option>
<option value="NJ">NJ</option>
<option value="DC">DC</option></select>
        
        
 
            <select name="city" id="city" style="width:150px;"><option value="Anaheim" selected="selected">Anaheim</option>
<option value="Azusa">Azusa</option>
<option value="Baldwin Park">Baldwin Park</option>
<option value="Bellflower">Bellflower</option>
<option value="Brea">Brea</option>
<option value="Buena Park">Buena Park</option>
<option value="Burbank">Burbank</option>
<option value="Canoga Park">Canoga Park</option>
 
<option value="Cerritos">Cerritos</option>
<option value="Chino">Chino</option>
<option value="Chino Hills">Chino Hills</option>
<option value="Chula Vista">Chula Vista</option>
<option value="Compton">Compton</option>
<option value="Corona">Corona</option>
<option value="Corona Del Mar">Corona Del Mar</option>
<option value="Costa Mesa">Costa Mesa</option>
<option value="Cudahy">Cudahy</option>
 
<option value="Cypress">Cypress</option>
<option value="Davis">Davis</option>
<option value="E. Los Angeles">E. Los Angeles</option>
<option value="El Monte">El Monte</option>
<option value="El Segundo">El Segundo</option>
<option value="Elk Grove">Elk Grove</option>
<option value="Encinitas">Encinitas</option>
<option value="Fontana">Fontana</option>
<option value="Fountain Valley">Fountain Valley</option>
 
<option value="Fullerton">Fullerton</option>
<option value="Garden Grove">Garden Grove</option>
<option value="Glendale">Glendale</option>
<option value="Granada Hills">Granada Hills</option>
<option value="Hesperia ">Hesperia </option>
<option value="Hollywood">Hollywood</option>
<option value="Huntington Beach">Huntington Beach</option>
<option value="Huntington Park">Huntington Park</option>
<option value="Inglewood">Inglewood</option>
 
<option value="Irvine">Irvine</option>
<option value="La Habra">La Habra</option>
<option value="La Palma">La Palma</option>
<option value="La Quinta">La Quinta</option>
<option value="Ladera Ranch">Ladera Ranch</option>
<option value="Laguna Beach">Laguna Beach</option>
<option value="Laguna Hills">Laguna Hills</option>
<option value="Laguna Niguel">Laguna Niguel</option>
<option value="Lake Forest">Lake Forest</option>
 
<option value="Lakewood">Lakewood</option>
<option value="Lennox">Lennox</option>
<option value="Long Beach">Long Beach</option>
<option value="Los Angeles">Los Angeles</option>
<option value="Lynwood">Lynwood</option>
<option value="Manhattan Beach">Manhattan Beach</option>
<option value="Mission Viejo">Mission Viejo</option>
<option value="Modesto">Modesto</option>
<option value="Montrose">Montrose</option>
 
<option value="Napa">Napa</option>
<option value="Newport Beach">Newport Beach</option>
<option value="Northridge">Northridge</option>
<option value="Norwalk">Norwalk</option>
<option value="Oceanside">Oceanside</option>
<option value="Ontario">Ontario</option>
<option value="Orange">Orange</option>
<option value="Pacoima">Pacoima</option>
<option value="Palmdale">Palmdale</option>
 
<option value="Paramount">Paramount</option>
<option value="Pasadena">Pasadena</option>
<option value="Petaluma">Petaluma</option>
<option value="Pomona">Pomona</option>
<option value="Redondo Beach">Redondo Beach</option>
<option value="Rialto">Rialto</option>
<option value="Riverside">Riverside</option>
<option value="Sacramento">Sacramento</option>
<option value="San Bernardino">San Bernardino</option>
 
<option value="San Carlos">San Carlos</option>
<option value="San Diego">San Diego</option>
<option value="San Fernando Valley">San Fernando Valley</option>
<option value="San Francisco">San Francisco</option>
<option value="San Pedro">San Pedro</option>
<option value="San Ramon">San Ramon</option>
<option value="Santa Ana">Santa Ana</option>
<option value="Santa Barbara">Santa Barbara</option>
<option value="Santa Clarita">Santa Clarita</option>
 
<option value="Santa Maria">Santa Maria</option>
<option value="Santa Monica">Santa Monica</option>
<option value="Seal Beach">Seal Beach</option>
<option value="Signal Hill">Signal Hill</option>
<option value="Somewhere">Somewhere</option>
<option value="South Gate">South Gate</option>
<option value="Stanton">Stanton</option>
<option value="Studio City">Studio City</option>
<option value="Sun Valley">Sun Valley</option>
 
<option value="Sunland">Sunland</option>
<option value="Temecula">Temecula</option>
<option value="Thousand Oaks">Thousand Oaks</option>
<option value="Torrance">Torrance</option>
<option value="Tustin">Tustin</option>
<option value="Union City">Union City</option>
<option value="Valencia">Valencia</option>
<option value="Van Nuys">Van Nuys</option>
<option value="Ventura">Ventura</option>
 
<option value="Vista">Vista</option>
<option value="W. Covina">W. Covina</option>
<option value="West Hollywood">West Hollywood</option>
<option value="Westminster">Westminster</option>
<option value="Whittier">Whittier</option>
<option value="Woodland Hills">Woodland Hills</option>
<option value="Yorba Linda">Yorba Linda</option></select>
        
        
 
            <input type="submit" value="Go">
Posted 5-Feb-12 22:57pm
Edited 5-Feb-12 23:12pm
Comments
SAKryukov at 6-Feb-12 5:00am
   
Such a useless code dump! The problem is pretty simple, but you did not ask any question. What's the problem? Just do it. --SA
Waseem Fastian at 6-Feb-12 5:11am
   
@SAKryukov, I want to remove this HTML using a regex. My HTML file has a lot of tags in it, and the HTML I posted is only part of the file. I do not need this information, so I want to remove it from the file. I need a regex for this. Thanks.

Solution 2

I have done it myself. This is the regex that I used:
 
@"<select(\s+[^>]*)?>(.|\n)*?</select(\s+[^>]*)?>"
 
Thanks to my mighty Allah. Thanks also to those who gave me feedback.
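A minimal sketch of how the pattern above might be applied with Regex.Replace, assuming the closing tag is written `</select` with no space after the `<`. The names `SelectStripper` and `RemoveSelects` are made up for illustration, and the IgnoreCase option is an assumption that is not part of the original solution:

```csharp
using System.Text.RegularExpressions;

public static class SelectStripper
{
    // Pattern from this solution, written with "</select" (no space after "<").
    private const string Pattern =
        @"<select(\s+[^>]*)?>(.|\n)*?</select(\s+[^>]*)?>";

    // Returns the input with every <select>...</select> block removed.
    // The option tags disappear too, because they sit inside the match.
    public static string RemoveSelects(string html)
    {
        // IgnoreCase is an assumption, added so that <SELECT> is matched as well.
        return Regex.Replace(html, Pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

Calling `SelectStripper.RemoveSelects(html)` on the text of the file should leave everything outside the select elements, such as the submit button, untouched.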
Comments
Prerak Patel at 6-Feb-12 11:51am
   
And it doesn't remove the option tags, as was asked for in the question. :doh:
Waseem Fastian at 6-Feb-12 12:23pm
   
This regex removes the select tag from the HTML, and the option tags are inside the select tag, so they are deleted automatically.
Andreas Gieriet at 6-Feb-12 13:57pm
   
This regex will not remove the whole select element with its content. Did something get lost while pasting it into the solution? Cheers, Andi
Waseem Fastian at 7-Feb-12 0:25am
   
This regex removed all the select content. I have used it in my project and it worked. Cheers
Andreas Gieriet at 7-Feb-12 15:03pm
   
Yeah. Now it's better. Before, the whole second select part was lost up to the ]. Now I agree: this would work fine, though it's a bit of an overkill. You use (...), which stores the matched string. Use (?:...) instead. And if you want to make sure that a word is not part of a larger word, you can use the word-boundary anchor \b. Combining all that results in my Solution 3. But as you said, yours works as well. Cheers, Andi
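The difference between (...) and (?:...) can be seen by counting the groups a pattern defines. This is only a sketch: `GroupDemo` and `GroupCount` are hypothetical names, and the non-capturing variant of the Solution 2 pattern mentioned below is an illustration, not something posted in this thread.

```csharp
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Counts the groups a pattern defines.
    // Group 0 (the whole match) is always included in the count.
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }
}
```

With the Solution 2 pattern, `GroupCount` returns 4 (the whole match plus three captured groups). With every `(` changed to `(?:`, or with the Solution 3 pattern, it returns 1, so nothing beyond the match itself is stored.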

Solution 3

If you want to remove the whole select element with its content, you can do it with a regex only if some constraints are met: there must be no nested element of the same name (select in this case).
 
Try this:
 
string pattern = @"<select\b[\s\S]*?</select>";
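A minimal sketch of how this pattern could be used, and of what the constraint above means in practice. `SelectRemover` and `Remove` are names made up for illustration, and IgnoreCase is an assumption added on top of the pattern:

```csharp
using System.Text.RegularExpressions;

public static class SelectRemover
{
    // \b keeps "<select" from matching longer names such as "<selection>".
    // [\s\S]*? lazily matches any characters, including newlines.
    // IgnoreCase is an assumption, added so that <SELECT> is matched as well.
    private static readonly Regex SelectElement =
        new Regex(@"<select\b[\s\S]*?</select>", RegexOptions.IgnoreCase);

    // Returns the input with every <select>...</select> block removed.
    public static string Remove(string html)
    {
        return SelectElement.Replace(html, string.Empty);
    }
}
```

For well-formed input without nesting, the whole element goes away. With a nested select, the lazy match stops at the first `</select>`, so `Remove("<select><select></select>X</select>")` returns `X</select>` and leaves a stray closing tag behind.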
 
Cheers
 
Andi
Comments
1castle1 at 22-Oct-12 5:23am
   
Thanks. string pattern = @"<select\b[\s\S]*?</select>"; worked for me :) Now I just need to get the value and the text out of the middle.
Andreas Gieriet at 24-Oct-12 18:27pm
   
Hello 1castle1, Out of the middle of what? Example: input = ..., expected output = ...? Cheers Andi
1castle1 at 20-Nov-12 9:33am
   
I was trying to make an HTML scraper out of regex to build some tables with test data in, but then I found a project called Html Agility Pack and used that instead.
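For simple markup like the option lists in the question, getting the value and the text of each option with a regex could look roughly like this. It is a sketch under stated assumptions: every value attribute is double-quoted and the option body is plain text with no nested tags. `OptionReader` and `Read` are hypothetical names. An HTML parser is likely to be more robust on real pages.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OptionReader
{
    // Group 1 captures the value attribute, group 2 the option text.
    // Assumes double-quoted attributes and option bodies without nested tags.
    private static readonly Regex Option = new Regex(
        @"<option\b[^>]*?\bvalue=""([^""]*)""[^>]*>([^<]*)</option>",
        RegexOptions.IgnoreCase);

    // Returns one (value, text) pair per option element, in document order.
    public static List<KeyValuePair<string, string>> Read(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Option.Matches(html))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```

Each returned pair holds the value attribute as its key and the visible text as its value, which is enough to fill a small table of test data.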

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 7 Feb 2012
Copyright © CodeProject, 1999-2014. All Rights Reserved.