It is one of those good practices you're asked to follow when writing code because it makes the code cleaner and more concise. It is similar to why we programmers focus so much on indentation. On a big project you are probably not the only person working on it, so if someone else has to review your code, it will be a lot easier for them to see where each variable was declared. You may also be well aware of the haunting NullPointerException, which pops up out of nowhere and can be really tricky to track down if your code isn't clear.
I want to build a web crawler that will take a list of URLs and search those pages for upcoming events. I want the crawler to pick up details such as the address, image URLs, description, and title of each event, plus anything else that would be useful for somebody wanting to know about an event. I would like to write this program in Java or Node.js. Doing this project quickly and simply is important.
I have checked out Nutch, the Java framework, but I had a difficult time getting going with it quickly. I want my web crawler to be up and running by the end of the week, so the simplest, quickest solution is important.
What frameworks should I use and/or what advice do you have to complete such a project?
what advice do you have to complete such a project?
Be prepared for years of work; what you are asking for is far beyond a few simple classes. You would need to read each URL, break the page down into all its different parts, and somehow analyse the content to identify each event (whatever you mean by that). You would then need to follow links from the event to extract any other relevant details. Just take a look at a few websites and see how they advertise events; each one is different.
Thanks for the input. But there is a huge difference between a few simple classes and 'years of work'. I think that maybe my problem was not defined well. Mainly I was hoping to get advice on a simple, quick-to-set-up web crawler with Node.js or Java, like a framework or a tool or something. As mentioned, I checked out Nutch, but it seems like overkill. I am not trying to scrape the whole web, and I don't want to have to type 300 characters into a terminal to start it up. I want to define maybe 5 URLs to scrape in the beginning and slowly but surely add to that list. Any helpful suggestions are appreciated!
I've never built a web crawler, so in the context of a web crawler, no. In the context of the DOM, I was planning to just cycle through those elements and take the data. But yeah, I haven't gotten that far yet; I thought this would be a good first step. I was also looking at Elasticsearch and wondering if that might somehow be useful.
Well, that is easy enough to test. Write a small piece of code to pull a page from any website and go through each element in the DOM. Now that you have those elements, how do you identify the events you are looking for?
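As a rough illustration of that test, here is a minimal Java sketch that downloads a page and visits each element. It is an assumption-laden example, not a recommended crawler design: the class name `PageWalker` and its methods are hypothetical, it needs Java 11 or later for `java.net.http.HttpClient`, and it uses the lenient HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) only to avoid extra dependencies. A dedicated HTML parsing library would likely be more robust on real pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative only.
final class PageWalker {

    // Download the raw HTML of one page.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Visit every tag and text node in document order and record what was seen.
    static List<String> walk(String html) {
        List<String> seen = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                seen.add("<" + tag + ">");
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Tags without a closing counterpart, such as img or br.
                seen.add("<" + tag + "/>");
            }

            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    seen.add("text: " + text);
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return seen;
    }
}
```

Something like `PageWalker.walk(PageWalker.fetch("https://example.com/events"))` (a placeholder URL) would then list the tags and text of that page. Getting this far is the easy part; the list says nothing yet about which elements describe an event.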
I haven't gotten that far yet. I don't really know. There are a lot of tools out there that help developers with these kinds of problems. I don't think I am the first person who wants to crawl websites looking for events, so it seems like there might be some ready-made tools, either a Java dependency I can put into my pom file or some npm package, that will help me solve that issue. If there is no such dependency, then I guess I will have to make it. If that is the case, here is a spontaneous idea (I don't know if it is a good one, because I came up with it 5 seconds ago). You could create a bunch of lists of keywords that websites use as labels for the different data points I am interested in. For example, a date could be labelled "date", "Day of event", "Time", and so forth. When you parse, you look for those keywords, and if you find one, you take the text that is nearby. Then, using regular expressions, you identify and remove any HTML tags. You could also have another list of regular expressions that check whether the text following the keyword matches the expected format; in the case of a date, you would have something like 50 regexes, each identifying a date format, so that you make sure you actually get a date. But yeah, I came up with that in 5 seconds. All I really want is events for a particular city on a particular date; maybe there is an API that takes care of that? Google doesn't have one, last time I checked.
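The keyword-plus-regex idea above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `KeywordEventExtractor`, the label list, the three date formats, and the 60-character "nearby" window are all made up for the example, and stripping tags with a regular expression is crude compared with a real HTML parser.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the keyword + regex idea; names and values are assumptions.
final class KeywordEventExtractor {

    // Labels a site might put in front of a date (assumed, not exhaustive).
    private static final List<String> DATE_LABELS = List.of("day of event", "date", "time");

    // A few date formats; a real list would need many more.
    private static final List<Pattern> DATE_PATTERNS = List.of(
        Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b"),
        Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b"),
        Pattern.compile(
            "\\b\\d{1,2} (?:January|February|March|April|May|June|July|August"
                + "|September|October|November|December) \\d{4}\\b",
            Pattern.CASE_INSENSITIVE));

    // How many characters after a label count as "nearby" (arbitrary choice).
    private static final int WINDOW = 60;

    // Crude tag removal: replace anything between < and > with a space,
    // then collapse runs of whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Look for a label, then accept the first date-like string found shortly after it.
    static Optional<String> findDate(String html) {
        String text = stripTags(html);
        for (String label : DATE_LABELS) {
            Matcher labelMatcher =
                Pattern.compile(Pattern.quote(label), Pattern.CASE_INSENSITIVE).matcher(text);
            while (labelMatcher.find()) {
                int start = labelMatcher.end();
                String nearby = text.substring(start, Math.min(text.length(), start + WINDOW));
                for (Pattern datePattern : DATE_PATTERNS) {
                    Matcher dateMatcher = datePattern.matcher(nearby);
                    if (dateMatcher.find()) {
                        return Optional.of(dateMatcher.group());
                    }
                }
            }
        }
        return Optional.empty();
    }
}
```

For instance, `KeywordEventExtractor.findDate("<div><b>Date:</b> 2024-05-17</div>")` yields `2024-05-17`, while a page with no recognised label yields an empty result. The sketch also shows where the approach is fragile: a site that uses an unlisted label, an unlisted date format, or no label at all is silently missed.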
Like I said, try some basic tests: go and look at some websites and see how they identify the sort of events you are looking for. The issue is really not about getting the web pages; that part is relatively simple. The issue is how you analyse the data and identify the parts you are interested in, and that is the complicated part.