Hello all,
today I had some business with HTML parsing. The requested result was: using the java.net.URL class, get all the HTML content from http://www.google.com/ and set up a file which can be used to view the website offline. The greatest problem turned out to be fetching the HTML elements' attributes, like src from an <img> tag, href from an <a> tag, etc. So far I have gotten to the src attribute by using regular expressions and the BufferedReader/Writer classes. A code sample:
// imports used below: java.io.BufferedReader, java.io.BufferedWriter,
// java.io.FileWriter, java.io.IOException, java.io.InputStreamReader,
// java.net.URL, java.util.logging.Level, java.util.logging.Logger,
// java.util.regex.Matcher, java.util.regex.Pattern
URL google = new URL("http://www.google.com/");
BufferedReader in = new BufferedReader(new InputStreamReader(google.openStream()));
BufferedWriter wr;
String s;
// group(1) captures the value of the src attribute; the leading ".*" from my
// first attempt is dropped, because it made find() skip to the last <img> on a line
Pattern p = Pattern.compile("<img[^>]*src=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
Matcher m;
try {
    wr = new BufferedWriter(new FileWriter("D:/HTMLFile.txt"));
    while ((s = in.readLine()) != null) {
        m = p.matcher(s);
        wr.write(s);
        wr.newLine(); // readLine() strips line terminators, so put them back
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
    in.close();
    wr.close(); // without closing/flushing the writer the file can end up empty
} catch (IOException ex) {
    Logger.getLogger(JavaNetworking.class.getName()).log(Level.SEVERE, null, ex);
}
For this particular URL the output is: "/textinputassistant/tia.png"
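One thing I noticed: the src value comes back relative, so for the offline copy it would have to be resolved against the page's URL first. If I'm not mistaken, the two-argument java.net.URL constructor does exactly that (untested snippet):

// resolve a relative src value against the base URL of the page
URL base = new URL("http://www.google.com/");
URL img = new URL(base, "/textinputassistant/tia.png");
System.out.println(img); // http://www.google.com/textinputassistant/tia.png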
What I wanted to ask is: can someone give a better example of how to do this? I read on various forums that regex + Java is a hideous monster, so to speak. I have an algorithm in mind that could simplify things for an experienced programmer, unlike me :)... here it is (a rough code sketch of it follows the list):
- read all the HTML from the URL
- copy it to a string variable
- search the string for "<img"
- when "<img" is found, copy that tag to a new string variable
- search it for the "src" or "href" attribute
- extract the attribute's value (System.out.println("..") will do just fine for now)
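To make the steps above concrete, here is a minimal, untested sketch of that algorithm as a standalone class (the class name and the combined regex are just my own guesses, and the regex ignores things like single-quoted or unquoted attributes):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeExtractor {
    public static void main(String[] args) throws Exception {
        // steps 1-2: read all the HTML from the URL into one string
        URL url = new URL("http://www.google.com/");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();

        // steps 3-6: find <img ... src="..."> and <a ... href="..."> and print each value
        Pattern p = Pattern.compile(
                "<(?:img[^>]*?\\ssrc|a[^>]*?\\shref)=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}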
I see this as an almost idiot-proof problem, since I think it could work out just fine like this, but I still think it's better to ask for an opinion from a community made of waaay bigger professionals :)
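P.S. While searching around I also came across the jsoup library, which several threads recommend instead of regex for HTML. If I read its documentation correctly, the whole job shrinks to something like this (assuming jsoup is on the classpath; I haven't tried it myself yet):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // fetch and parse the page; jsoup tolerates malformed HTML
        Document doc = Jsoup.connect("http://www.google.com/").get();

        // every <img> that has a src attribute
        for (Element img : doc.select("img[src]")) {
            System.out.println(img.attr("abs:src")); // "abs:" resolves relative URLs
        }

        // every <a> that has an href attribute
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}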