I have a Scrapy script I wrote to extract profiles from LinkedIn through a proxy service. The proxy I am using is ScrapeOps. I created a virtual environment and ran pip install scrapeops-scrapy-proxy-sdk. I also added the proxy API key to my Scrapy project settings, following the proxy's usage rules. When I run my Scrapy script, it finishes with no errors but returns an empty result. What am I missing?
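For reference, the settings I added follow the pattern from the SDK's install instructions (a sketch of the documented setup; the setting names and middleware path should be double-checked against the README of the version installed, and the API key is redacted):

```python
# settings.py -- ScrapeOps proxy SDK setup as described in its README
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'   # placeholder, real key redacted
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}
```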

Here is my code

What I have tried:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProfilespiderSpider(CrawlSpider):
    name = 'profilespider'
    allowed_domains = ['']
    start_urls = ['']

    rules = (
        Rule(LinkExtractor(allow='people-guest_people_search-card'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        # Profile Summary
        item['name'] = response.css('h1::text').get().strip()
        item['description'] = response.css('h2::text').get().strip()
        item['location'] = response.css('').get().strip()
        if 'followers' in item['location'] or 'connections' in item['location']:
            item['location'] = ''
        contacts = response.css('').getall()
        item['followers'] = contacts[0].replace('followers', '').strip()
        item['connections'] = contacts[1].replace('connections', '').strip()

        # About Section
        item['about'] = response.css('.summary p::text').getall()

        # Experience Section
        item['experience'] = []
        experience_blocks = response.css('li.experience-item')
        for block in experience_blocks:
            experience = {}
            try:
                experience['position'] = block.css('h3::text').get().strip()
            except Exception:
                experience['position'] = ''
            try:
                experience['organisation_profile'] = block.css('h4 a::attr(href)').get().split('?')[0]
            except Exception:
                print('No organisation profile found')
                experience['organisation_profile'] = ''
            try:
                date_ranges = block.css('time::text').getall()
                if len(date_ranges) == 2:
                    experience['start_time'] = date_ranges[0]
                    experience['end_time'] = date_ranges[1]
                    experience['duration'] = block.css('').get()
                elif len(date_ranges) == 1:
                    experience['start_time'] = date_ranges[0]
                    experience['end_time'] = 'present'
                    experience['duration'] = block.css('').get()
            except Exception:
                print('No dates')
                experience['start_time'] = ''
                experience['end_time'] = ''
                experience['duration'] = ''
            try:
                experience['location'] = block.css('p.experience-item__location::text').get().strip()
            except Exception:
                print('No location')
                experience['location'] = ''
            try:
                experience['description'] = block.css('').get().strip()
            except Exception:
                try:
                    experience['description'] = block.css('').get().strip()
                except Exception:
                    print('no description found')
                    experience['description'] = ''
            # Without this append, item['experience'] stays empty
            item['experience'].append(experience)

        # Education Section
        item['education'] = []
        education_groups = response.css('li.education__list-item')
        for group in education_groups:
            education = {}
            education['university_link'] = group.css('h3 a::attr(href)').get().split('?')[0]
            degree_info = group.css('h4 span::text').getall()
            if len(degree_info) == 2:
                education['degree'] = degree_info[0]
                education['faculty'] = degree_info[1]
            else:
                print('no degrees acquired')
            date_range = group.css('time::text').getall()
            if len(date_range) == 2:
                education['start_date'] = date_range[0]
                education['end_date'] = date_range[1]
            else:
                print('no degree dates')
            try:
                education['description'] = group.css('p::text').get().strip()
            except Exception:
                education['description'] = ''
            item['education'].append(education)

        # Skills Section
        # Note: use a different name for the SelectorList than for the dict,
        # otherwise the item assignments below fail
        item['skills'] = {}
        try:
            skill_elements = response.css('div.core-section-container__content li.skills__item')
            item['skills']['start_up'] = skill_elements[1].css('a::text').get().strip()
            item['skills']['strategy'] = skill_elements[3].css('a::text').get().strip()
            item['skills']['venture capital'] = skill_elements[5].css('a::text').get().strip()
            item['skills']['Saas'] = skill_elements[7].css('li::text').get().strip()
        except Exception:
            print('no skills found')

        yield item

When I run scrapy crawl profilespider -o profiles.json at my command prompt, the JSON file profiles.json comes back empty. Do you know what I am missing?

Here is my log from console

(venv) C:\Users\LP\Documents\python\ProfileTest\profilescraper>scrapy crawl profilespider -o profiles.json
2023-01-24 15:35:58 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: profilescraper)
2023-01-24 15:35:58 [scrapy.utils.log] INFO: Versions: lxml, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 39.0.0, Platform Windows-10-10.0.19044-SP0
2023-01-24 15:35:58 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'profilescraper',
'NEWSPIDER_MODULE': 'profilescraper.spiders',
'SPIDER_MODULES': ['profilescraper.spiders'],
2023-01-24 15:35:58 [asyncio] DEBUG: Using selector: SelectSelector
2023-01-24 15:35:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-01-24 15:35:58 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-01-24 15:35:58 [scrapy.extensions.telnet] INFO: Telnet Password: d126f5d312c5e917
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled extensions:
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled downloader middlewares:

2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled spider middlewares:
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled item pipelines:
2023-01-24 15:35:59 [scrapy.core.engine] INFO: Spider opened
2023-01-24 15:35:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-01-24 15:35:59 [scrapy.extensions.telnet] INFO: Telnet console listening on
2023-01-24 15:36:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET> (referer: None)
2023-01-24 15:36:07 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-24 15:36:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 405,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 323474,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 8.263713,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 1, 24, 14, 36, 7, 939751),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2023, 1, 24, 14, 35, 59, 676038)}
 2023-01-24 15:36:07 [scrapy.core.engine] INFO: Spider closed (finished)
Updated 29-Jan-23 23:07pm

1 solution

The issue might be with the way you are trying to extract data from the response. In the skills section, you initialise skills as a dict, then immediately overwrite it with the SelectorList returned by response.css(). The lines that follow try to do item assignment on that SelectorList (skills['start_up'] = ...), which raises a TypeError, and the hard-coded indexes [1], [3], [5], [7] will raise an IndexError whenever the page has fewer matching elements. Keep the result dict and the SelectorList in separate variables, or better, loop over the matched elements.

You should also check the structure of the webpage that you are trying to scrape to make sure that the css selectors you are using match the structure of the page correctly.

Additionally, you should check whether the proxy is actually returning valid responses. A 200 status code alone is not enough: a proxy can return 200 for a login wall, a CAPTCHA page, or an anti-bot challenge page that contains none of the elements your selectors expect.
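One concrete way to do that check is to dump the raw body of each response to disk and open it in a browser. A minimal helper sketch (the dump_body name and file path are made up for illustration; inside parse_item you would call dump_body(response.body)):

```python
from pathlib import Path

def dump_body(body: bytes, path: str = 'debug_response.html') -> Path:
    """Save a raw response body to disk so it can be inspected in a browser."""
    out = Path(path)
    out.write_bytes(body)
    return out

# Standalone demonstration with a fake body
saved = dump_body(b'<html><h1>login wall?</h1></html>')
print(saved.read_text())  # <html><h1>login wall?</h1></html>
```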
Asuzor Miracle 30-Jan-23 4:51am    
Thanks for the kind response.

However, I already tried all the CSS selectors I used in the code block, and they all returned the desired results in the console.

Also the proxy works perfectly because when I run `response.status` in scrapy shell, I get a `200 response`. I made a mistake in the start_urls, I corrected it but it still didn't give me any results when I run the full script.

Unfortunately, despite everything I tried, I am still getting an empty file when I run this program. I would appreciate more suggestions from anybody. Thanks in anticipation.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
