Click here to Skip to main content
15,878,945 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
For starters, let me just say that I HAVE searched and tried several possibilities, but I'm too green and perhaps have too few brain cells to piece this together myself.

I am trying to write a web scraper using selenium and scrapy. This should be easy enough right?

Unfortunately, it was at this point that I realized I needed to learn some python, never mind the fact that I am still relatively new to coding overall.

There's some great examples on the web that I've seen so far about how to write some of these crawlers, but try as I might, I can't seem to get my $#&!@! to work and frankly, I'm far more used to visual studio anyway. All of this command line prompt makes my mechanical heart weep.

The pages I need to scrape are heavily loaded with javascript. Actually, to even GET to the data, I have to push a couple javascript links and then in order to page through the data...gotta click on another javascript link and the url never seems to change! Joy... This is why I am trying to use Selenium. I used the Selenium IDE to record a test script and would like to incorporate this into my crawl spider but am thoroughly unsure how to do so.

These are the link to the example I've tried to use:

example

just to explain some things...as I said, I need to navigate through a few buttons to get to the data, that's what the el1, el2, el3, el4 stuff is for. I figured, I'd check to see if the element existed, and if it did...the button would be clicked. I feel this is safe because these buttons are unique to the "page" or the javascript function...however you want to call it.

edit: now it's "working" but not doing anything. neither chrome, nor firefox opens up and if i try to export the data to csv...well it's blank and empty BUT a csv file does show up..for whatever that's worth :P

here's my code, it's probably garbage, but hopefully some of it...isn't?

import time
from scrapy.item import Item, Field
from selenium import webdriver
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from test.items import TestItem

class ElyseAvenueSpider(CrawlSpider):
    name = "boo4"
    allowed_domains = ["domain.com"]
    start_urls = [
    "http://www.domain.com/xyz"]
    
rules = (Rule (SgmlLinkExtractor(allow=("\.html", )), callback="parse_items", follow=True),)

def __init__(self):
    self.driver = webdriver.Chrome()

def parse(self, response):
    self.driver.get(response.url)
    el1 = self.driver.find_element_by_link_text("P Search").click()
    el2 = self.driver.find_element_by_css_selector("div.c").click()
    el3 = self.driver.find_element_by_link_text("<<")
    el4 = self.driver.find_element_by_link_text(">")
    if el1:
        el1.click()
        time.sleep(2)
    if el2:
        el2.click()
        time.sleep(2)
    if el3:
        el3.click()
        time.sleep(2)
    if el4:
        el4.click()
        time.sleep(3)

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//*[@class="s"]')
    items = []
    for titles in titles:
        item = TestItem()
        item["url"] = titles.select("a/@href").extract()
        items.append(item)
    return(items)
    self.driver.close()


thank you for any help you can give
Posted
Updated 23-Jul-13 14:06pm
v4

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900