python - In Scrapy, how to set a time limit for each URL?


I am trying to crawl multiple websites using Scrapy's link extractor with follow=True (recursive crawling). I am looking for a way to set a time limit on the crawl of each URL in the start_urls list.

Thanks.

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/computers/programming/languages/python/books/",
        "http://www.dmoz.org/computers/programming/languages/python/resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

You need to use the download_timeout meta parameter of scrapy.Request.

To apply it to the starting URLs, you need to override the start_requests() method, like this:

def start_requests(self):
    # 10 second timeout for the first URL
    yield scrapy.Request(self.start_urls[0], meta={'download_timeout': 10})
    # 60 second timeout for the second URL
    yield scrapy.Request(self.start_urls[1], meta={'download_timeout': 60})

You can read more about the Request special meta keys here: http://doc.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys
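
Putting the answer together with the original spider, a minimal sketch might look like the following. The 10 and 60 second values are only illustrative, and note that download_timeout caps the download time of each individual request handled by the downloader:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/computers/programming/languages/python/books/",
        "http://www.dmoz.org/computers/programming/languages/python/resources/"
    ]

    def start_requests(self):
        # Per-URL download timeouts in seconds (illustrative values)
        timeouts = [10, 60]
        for url, timeout in zip(self.start_urls, timeouts):
            yield scrapy.Request(url, meta={'download_timeout': timeout})

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item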

