Python scrapy.spiders.CrawlSpider() Examples

The following are 3 code examples that use scrapy.spiders.CrawlSpider. The source file and originating project are listed above each example.
Example #1
Source File: utils.py    From scrapy-autounit with BSD 3-Clause "New" or "Revised" License
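from scrapy.spiders import CrawlSpider

# request_to_dict is presumably Scrapy's helper from scrapy.utils.reqser
# (deprecated in later Scrapy releases); clean_headers and parse_object are
# helpers defined elsewhere in the same project.


def parse_request(request, spider):
    # Serialize the request into a plain dict.
    _request = request_to_dict(request, spider=spider)
    if not _request['callback']:
        # No explicit callback: Scrapy falls back to the spider's parse method.
        _request['callback'] = 'parse'
    elif isinstance(spider, CrawlSpider):
        # CrawlSpider requests carry the index of the rule that produced them;
        # record that rule's callback rather than the internal wrapper.
        rule = request.meta.get('rule')
        if rule is not None:
            _request['callback'] = spider.rules[rule].callback

    clean_headers(_request['headers'], spider.settings)

    # Copy the meta dict, dropping the library's own '_autounit' entry.
    _meta = {}
    for key, value in _request.get('meta').items():
        if key != '_autounit':
            _meta[key] = parse_object(value, spider)
    _request['meta'] = _meta

    return _request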
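The callback lookup in Example #1 can be tried without a running crawl. The following is a minimal sketch, not the project's code: FakeRule, FakeRequest, FakeCrawlSpider, and resolve_callback_name are hypothetical stand-ins that only mimic the attributes the example reads (rules, meta['rule'], callback).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeRule:
    # Hypothetical stand-in for a crawl rule: only the callback name matters here.
    callback: Optional[str] = None


@dataclass
class FakeRequest:
    # Hypothetical stand-in for a request: a callback name and a meta dict.
    callback: Optional[str] = None
    meta: dict = field(default_factory=dict)


@dataclass
class FakeCrawlSpider:
    # Hypothetical stand-in for a spider that exposes a tuple of rules.
    rules: tuple = ()


def resolve_callback_name(request, spider):
    """Mirror the branching in Example #1 on the stand-in objects."""
    if not request.callback:
        return "parse"  # no callback set: fall back to the default name
    rule_index = request.meta.get("rule")
    if rule_index is not None:
        # The request came from a rule: report that rule's callback.
        return spider.rules[rule_index].callback
    return request.callback


spider = FakeCrawlSpider(rules=(FakeRule(None), FakeRule("parse_item")))
request = FakeRequest(callback="internal_wrapper", meta={"rule": 1})
print(resolve_callback_name(request, spider))  # parse_item
```

A request with no callback resolves to "parse", and a request with a callback but no rule index keeps its own callback name.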
Example #2
Source File: utils.py    From scrapy-autounit with BSD 3-Clause "New" or "Revised" License
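from scrapy.spiders import CrawlSpider


def get_filter_attrs(spider):
    # Attribute names common to every spider.
    attrs = {'crawler', 'settings', 'start_urls'}
    if isinstance(spider, CrawlSpider):
        # CrawlSpider also keeps its rules and a compiled copy of them.
        attrs |= {'rules', '_rules'}
    return attrs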
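One plausible use for a set of names like the one returned in Example #2 is to leave framework-managed attributes out of a snapshot of the spider's state. The sketch below illustrates that idea under stated assumptions: FakeSpider and snapshot_attrs are hypothetical and are not taken from scrapy-autounit.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider instance with a few attributes."""

    def __init__(self):
        self.name = "demo"
        self.crawler = object()          # framework-managed, not worth recording
        self.settings = {}               # framework-managed
        self.start_urls = ["https://example.com/"]
        self.page_limit = 10             # user-defined state


def snapshot_attrs(spider, excluded):
    """Return the instance attributes whose names are not in `excluded`."""
    return {key: value for key, value in vars(spider).items() if key not in excluded}


excluded = {"crawler", "settings", "start_urls"}
snapshot = snapshot_attrs(FakeSpider(), excluded)
print(sorted(snapshot))  # ['name', 'page_limit']
```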
Example #3
Source File: haofl_spider.py    From Spiders with Apache License 2.0
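# Method of a CrawlSpider subclass; requires `import scrapy`.
def parse_start_url(self, response):
    """By default, CrawlSpider first creates Requests from start_urls, then calls back parse_start_url."""
    li_list = response.xpath('//*[@id="post_container"]/li')
    for li_div in li_list:
        link = li_div.xpath('.//div[@class="thumbnail"]/a/@href').extract_first()
        # Skip list items without a thumbnail link; Request rejects a None URL.
        if link:
            yield scrapy.Request(link, callback=self.parse_detail_url)

    # Follow the pagination link, if there is one, and parse it the same way.
    next_page = response.xpath('//div[@class="pagination"]/a[@class="next"]/@href').extract_first()
    if next_page:
        yield scrapy.Request(next_page, callback=self.parse_start_url)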