For starters, let me just say that I HAVE searched and tried several possibilities, but I'm too green and perhaps have too few brain cells to piece this together myself.
I am trying to write a web scraper using selenium and scrapy. This should be easy enough, right?
Unfortunately, it was at this point that I realized I needed to learn some python, never mind the fact that I am still relatively new to coding overall.
There are some great examples on the web that I've seen so far about how to write some of these crawlers, but try as I might, I can't seem to get my $#&!@! to work, and frankly, I'm far more used to Visual Studio anyway. All of this command-line work makes my mechanical heart weep.
Here's the link to the example I've tried to use:
Edit: it's now "working" but not doing anything. Neither Chrome nor Firefox opens up, and if I try to export the data to CSV... well, it's blank and empty, BUT a CSV file does show up... for whatever that's worth :P
Here's my code. It's probably garbage, but hopefully some of it... isn't?
from scrapy.item import Item, Field
from selenium import webdriver
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from test.items import TestItem


class TestSpider(CrawlSpider):
    name = "boo4"
    allowed_domains = ["domain.com"]
    # Placeholder URL, matching the anonymized domain above
    start_urls = ["http://www.domain.com"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"\.html",)), callback="parse_items", follow=True),
    )

    def __init__(self, *args, **kwargs):
        # The driver has to be created inside __init__, not at class level,
        # since "self" only exists inside a method
        super(TestSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse_start_url(self, response):
        # Note: overriding parse() on a CrawlSpider silently disables the
        # Rule machinery, so the selenium clicks go in parse_start_url instead.
        # The driver also has to navigate to the page before it can find anything.
        self.driver.get(response.url)
        self.driver.find_element_by_link_text("P Search").click()
        self.driver.find_element_by_css_selector("div.c").click()
        el3 = self.driver.find_element_by_link_text("<<")
        el4 = self.driver.find_element_by_link_text(">")

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//*[@class="s"]')
        items = []
        for title in titles:
            item = TestItem()
            item["url"] = title.select("a/@href").extract()
            items.append(item)
        return items
Thank you for any help you can give.