Python - In Scrapy, how to set a time limit for each URL?
I am trying to crawl multiple websites with Scrapy using a link extractor with follow=True (recursive crawling). I am looking for a way to set a time limit for crawling each URL in the start_urls list.
Thanks.
import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/computers/programming/languages/python/books/",
        "http://www.dmoz.org/computers/programming/languages/python/resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
You need to use the download_timeout meta parameter of scrapy.Request.
To apply it to the starting URLs, you need to override the spider's start_requests(self) method, like this:
    def start_requests(self):
        # 10 second timeout for the first URL
        yield scrapy.Request(self.start_urls[0], meta={'download_timeout': 10})
        # 60 second timeout for the second URL
        yield scrapy.Request(self.start_urls[1], meta={'download_timeout': 60})
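Putting this together with the spider from the question, a minimal self-contained sketch could look like the following. The 10/60 second values and the explicit callback=self.parse are illustrative assumptions, not anything prescribed by Scrapy; adjust them to your needs.

import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/computers/programming/languages/python/books/",
        "http://www.dmoz.org/computers/programming/languages/python/resources/"
    ]

    def start_requests(self):
        # Illustrative per-URL timeouts in seconds (one per start URL).
        timeouts = [10, 60]
        for url, timeout in zip(self.start_urls, timeouts):
            # download_timeout caps how long the downloader waits for this
            # particular request before failing it with a timeout error.
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'download_timeout': timeout})

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Note that download_timeout limits only the download time of the individual request it is set on; it does not cap the total time spent recursively following links discovered from that start URL.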
You can read more about the Request special meta keys here: http://doc.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys