Click here to Skip to main content
15,666,479 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I have a scrapy script I wrote to extract profiles from LinkedIn using proxy services. The proxy am using is scrapeops. I created a virtual environment and did pip install scrapeops-scrapy-proxy-sdk. I also added the proxy API to my scrapy project settings following the proxy rules of usage. When i run my scrapy script, it runs successfully with no error but returns empty result. Please what am i missing?

Here is my code

What I have tried:

Python
class ProfilespiderSpider(CrawlSpider):
      name = 'profilespider'
      allowed_domains = ['www.linkedin.com']
      start_urls = ['https://www.linkedin.com/in/reidhoffman?trk=people-guest_people_search-card']

      rules = (
            Rule(LinkExtractor(allow='public_jobs_people-search-bar_search-submit')),
            Rule(LinkExtractor(allow='people-guest_people_search-card'), callback='parse_item'),
      )

      def parse_item(self, response):
          item = {}
    
          """
             Profile Summary
          """
          item['name'] = response.css("div.top-card-layout__entity-info-container h1::text").get().strip()
          item['description'] = response.css("div.top-card-layout__entity-info-container h2::text").get().strip()
    
          try:
             item['location'] = response.css('section.top-card-layout div.top-card__subline-item::text').get()
          except:
               item['location'] = response.css('section.top-card-layout span.top-card__subline-item::text').get().strip()
               if 'followers' in item['location'] or 'connections' in item['location']:
                  item['location'] = ''
            
          contacts = response.css("div.top-card-layout__entity-info-container span.top-card__subline-item::text").getall() 
          item['followers'] = contacts[0].replace('followers', '').strip()
          item['connections'] = contacts[1].replace('connections', '').strip()
    
    
          """
             About Section
          """
          item['about'] = response.css(".summary  p::text").getall() 
    
    
          """
             Experience Section          
          """
          item['experience'] = []
          experience_blocks = response.css('li.experience-item')
          for block in experience_blocks:
              experience = {}
        
              #Position
              try:
                  experience['position'] = block.css('li.experience-item h3::text').get().strip()
              except Exception as e:
                  experience['position'] = ''
                 
              #organisation
              try:
                 experience['organisation_profile'] = block.css('h4 a::attr(href)').get().split('?')[0]
              except Exception as e:
                 print('No organisation profile found')
                 experience['organisation_profile'] = ''
            
              #date
              try:
                 date_ranges = block.css('span.date-range time::text').getall()
                 if len(date_ranges) == 2:
                    experience['start_time'] = date_ranges[0]
                    experience['end_time'] = date_ranges[1]
                    experience['duration'] = block.css('span.date-range__duration::text').get()
                 elif len(date_ranges) == 1:
                    experience['start_time'] = date_ranges[0]
                    experience['end_time'] = 'present'
                    experience['duration'] = block.css('span.date-range__duration::text').get()
              except Exception as e:
                  print('No dates')
                  experience['start_time'] = ''
                  experience['end_time'] = ''
                  experience['duration'] = ''               
            
              #location
              try:
                 experience['location'] = block.css('p.experience-item__location::text').get().strip()
              except Exception as e:
                 print('No location')
                 experience['location'] = ''
            
              #description
              try:
                 experience['description'] = block.css('p.show-more-less-text__text--more::text').get().strip()
              except Exception as e:
                 try:
                     experience['description'] = block.css('p.show-more-less-text__text--less::text').get().strip()
                 except Exception as e:
                     print('no description found')
                     experience['description'] = ''
                
              item['experience'].append(experience)
                
        
              """
                 Education Section
              """
              item['education'] = []
              education_groups = response.css('li.education__list-item')
              for group in education_groups:
                  education = {}
        
                  #university
                  education['university_link'] = group.css('h3 a::attr(href)').get().split('?')[0]
        
                  #degrees
                  try:
                      degree_info = group.css('h4 span::text').getall() 
                      if len(degree_info) == 2:
                      education['degree'] = degree_info[0]
                      education['faculty'] = degree_info[1]
                      else:
                          pass
                  except:
                      print('no degrees acquired')
            
                  #date_range
                  try:
                     date_range = group.css('span.date-range time::text').getall()
                     if len(date_range) == 2:
                        education['start_date'] = date_range[0]
                        education['end_date'] = date_range[1]
                     else:
                         pass
                  except:
                        print('no degree dates')
            
        
                  #description
                  try:
                      education['description'] = group.css('div.show-more-less-text p::text').get().strip()
                  except:
                      education['description'] = ''
            
                  item['education'].append(education)
             
            
                  """
                     Skills Section
                  """
                  item['skills'] = []
                  skills = {}
    
                  try:
                     skills = response.css('div.core-section-container__content li.skills__item') 
                     skills['start_up'] = skills[1].css('li.skills__item a::text').get().strip()
                     skills['strategy'] = skills[3].css('li.skills__item a::text').get().strip() 
                     skills['venture capital'] = skills[5].css('li.skills__item a::text').get().strip() 
                     skills['Saas'] = skills[7].css('li::text').get().strip() 
        
                  except:
                       pass
    
                  item['skills'].append(skills)
    
              yield item



When I run scrapy crawl profilespider -o profiles.json on my cmd prompt, the json file 'profiles' comes back empty. Please do you know what am missing?

Here is my log from console

Python
(venv) C:\Users\LP\Documents\python\ProfileTest\profilescraper>scrapy crawl profilespider -o profiles.json
2023-01-24 15:35:58 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: profilescraper)
2023-01-24 15:35:58 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 39.0.0, Platform Windows-10-10.0.19044-SP0
2023-01-24 15:35:58 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'profilescraper',
'NEWSPIDER_MODULE': 'profilescraper.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['profilescraper.spiders'],
'TWISTED_REACTOR': 
'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-01-24 15:35:58 [asyncio] DEBUG: Using selector: SelectSelector
2023-01-24 15:35:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-01-24 15:35:58 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-01-24 15:35:58 [scrapy.extensions.telnet] INFO: Telnet Password: d126f5d312c5e917
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware' 

 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-01-24 15:35:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-01-24 15:35:59 [scrapy.core.engine] INFO: Spider opened
2023-01-24 15:35:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-01-24 15:35:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-01-24 15:36:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=9c79a52d-f08d-4c45-b8d2-f51ec9a4e7a4&url=https%3A%2F%2Fwww.linkedin.com%2Fpub%2Fdir%3FfirstName%3Dreid%26lastName%3Dhoffman%26trk%3Dpublic_jobs_people-search-bar_search-submit> (referer: None)
2023-01-24 15:36:07 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-24 15:36:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 405,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 323474,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 8.263713,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 1, 24, 14, 36, 7, 939751),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2023, 1, 24, 14, 35, 59, 676038)}
 2023-01-24 15:36:07 [scrapy.core.engine] INFO: Spider closed (finished)
Posted
Updated 29-Jan-23 23:07pm
v2

1 solution

The issue might be with the way you are trying to extract data from the response. In the skills section, you are trying to extract data from the skills variable which is a list of elements. However, you are trying to access the elements in the list by index, which will not work because the elements in the list do not have index numbers. You should use a loop or the css() method to extract the data from the elements in the list.

You should also check the structure of the webpage that you are trying to scrape to make sure that the css selectors you are using match the structure of the page correctly.

Additionally, you should check if the proxy is working correctly and providing you with valid responses. This can be done by checking the status codes of the responses, or by checking if the proxy is blocked or not.
 
Share this answer
 
Comments
Asuzor Miracle 30-Jan-23 4:51am    
Thanks for the kind response.

However I already tried all the css selectors i used in the codeblock and they all returned desired results in the console.

Also the proxy works perfectly because when I run `response.status` in scrapy shell, I get a `200 response`. I made a mistake in the start_urls, I corrected it but it still didn't give me any results when I run the full script.

Unfortunately despite all tried, am still getting an empty file when I run this program. Please am in need of more suggestions from anybody. Thanks in anticipation.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900