
Scrapy Spider Learning Notes

https://scrapy.org/

1. Installing Scrapy

https://docs.scrapy.org/en/latest/intro/install.html

sudo pip3 install Scrapy
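
The install guide recommends installing Scrapy inside a dedicated virtual environment rather than system-wide with sudo; a sketch (the environment path is arbitrary):

python3 -m venv .venv
source .venv/bin/activate
pip install Scrapy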

2. Documentation

https://docs.scrapy.org/en/latest/intro/tutorial.html

3. Example

3.1. Create a spider project

scrapy startproject <spider_project>
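
This generates roughly the following layout (per the tutorial):

<spider_project>/
    scrapy.cfg            # deploy configuration file
    <spider_project>/     # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code goes here
            __init__.py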

3.2. Add spider code

cd <spider_project>
cd <spider_project>/spiders

Add a spider source file under the spiders directory (the tutorial names it quotes_spider.py):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"   # 注意这里

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

3.3. Run the spider

scrapy crawl quotes
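
Note that scrapy crawl must be run from inside the project directory. A standalone spider file can also be run without a project via scrapy runspider (quotes_spider.py being the file created above):

scrapy runspider quotes_spider.py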

4. Crawling and parsing

4.1. Fetching pages

yield scrapy.Request(url=url, callback=self.parse)
# parse is the default callback when none is specified
def parse(self, response):
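
Instead of implementing start_requests, a spider can declare a start_urls class attribute; Scrapy builds the initial requests from it and routes every response to parse by default (per the tutorial):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # called automatically for each response from start_urls
        page = response.url.split("/")[-2]
        with open('quotes-%s.html' % page, 'wb') as f:
            f.write(response.body)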

4.2. Selecting elements with CSS

1). response.css

response.css("<type>.<class>")

Reference: https://www.w3.org/TR/selectors/

2). Selecting element attributes

response.css("<type>.<class>::attr(<attr_name>)")

response.css('li.next a::attr(href)')
response.css('li.next a').attrib['href']

3). Selecting text

response.css("<type>.<class>::text")

4). Selecting multiple matches at once

quote.css("div.tags a.tag::text").getall()

5). Using regular expressions

response.css('title::text').re(r'Quotes.*')
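
All of the variants above can be tried interactively in scrapy shell. The sample outputs below assume the markup of quotes.toscrape.com, following the official tutorial:

scrapy shell 'http://quotes.toscrape.com/page/1/'

>>> response.css('title::text').get()
'Quotes to Scrape'
>>> quote = response.css('div.quote')[0]
>>> quote.css('div.tags a.tag::text').getall()
['change', 'deep-thoughts', 'thinking', 'world']
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']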

4.3. Selecting with XPath

Reference: https://www.w3.org/TR/xpath/all/

response.xpath('//title')
response.xpath('//title/text()').get()
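
CSS selectors are converted to XPath under the hood, so every CSS query has an XPath counterpart. Two rough equivalents, assuming the single-class markup of quotes.toscrape.com (note that @class="..." matches the attribute value exactly, unlike a CSS class selector):

response.xpath('//span[@class="text"]/text()').getall()  # ~ response.css('span.text::text').getall()
response.xpath('//li[@class="next"]/a/@href').get()      # ~ response.css('li.next a::attr(href)').get()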

4.4. Following links

Reference: https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links

yield scrapy.Request(next_page, callback=self.parse)

or, with response.follow, which also accepts relative URLs directly:

yield response.follow(next_page, callback=self.parse)

To follow several links at once:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)
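
Putting it together, the tutorial's pagination pattern extracts items from each page and keeps following the "next" link until there is none:

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }
    # follow the pagination link, if any; response.follow resolves the
    # relative href against the current page
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)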

5. Saving data

scrapy crawl quotes -o quotes.json

Or as JSON Lines:

Reference: http://jsonlines.org/
jq tool: https://stedolan.github.io/jq

scrapy crawl quotes -o quotes.jl
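
A .jl file holds one JSON object per line, so it can be appended to safely and processed line by line, e.g. with jq. Illustrative contents, given the fields yielded by the parse method above (quote text shortened here):

{"text": "...", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}
{"text": "...", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}

jq '.author' quotes.jl

Note that -o appends to an existing output file; recent Scrapy versions also accept -O to overwrite it instead.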

6. Item pipelines

Reference: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline
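
A minimal pipeline sketch, assuming the dict items with an 'author' field produced by the parse method above; the class name is made up for illustration. process_item must return the item (or raise DropItem), and the pipeline is enabled through ITEM_PIPELINES in settings.py:

# pipelines.py
from scrapy.exceptions import DropItem

class RequireAuthorPipeline:
    def process_item(self, item, spider):
        # drop items that are missing an author field
        if not item.get('author'):
            raise DropItem('missing author')
        return item

# settings.py -- the number (0-1000) orders pipelines, lower runs first
ITEM_PIPELINES = {
    '<spider_project>.pipelines.RequireAuthorPipeline': 300,
}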

7. Miscellaneous

7.1. Filtering duplicate requests

DUPEFILTER_CLASS

参考:https://docs.scrapy.org/en/latest/topics/settings.html#std:setting-DUPEFILTER_CLASS

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.
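
The setting names the dedupe implementation. Written out explicitly, the default is:

# settings.py
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # default request dedupe filter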

7.2. Passing arguments to callbacks

Reference: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
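
A sketch using cb_kwargs (available since Scrapy 1.7), adapted from the referenced page; parse_page2 and main_url are illustrative names:

def parse(self, response):
    yield scrapy.Request(
        'http://quotes.toscrape.com/page/2/',
        callback=self.parse_page2,
        cb_kwargs=dict(main_url=response.url),  # extra kwargs for the callback
    )

def parse_page2(self, response, main_url):
    # main_url arrives as a keyword argument alongside the response
    self.log('visited %s coming from %s' % (response.url, main_url))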

7.3. Using rules (CrawlSpider)

Reference: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.CrawlSpider
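
A CrawlSpider declares rules instead of hand-written link following; each Rule pairs a LinkExtractor with an optional callback. A sketch, assuming the /page/ pagination URLs of quotes.toscrape.com (spider and callback names are illustrative):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes_crawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        # follow pagination links; parse each page they lead to
        Rule(LinkExtractor(allow=r'/page/'), callback='parse_page', follow=True),
    )

    # note: a CrawlSpider must not override parse(), hence the custom name
    def parse_page(self, response):
        for quote in response.css('div.quote'):
            yield {'author': quote.css('small.author::text').get()}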

7.4. Changing spider behavior with command-line arguments

Format: -a <name>=<value>

scrapy crawl quotes -o quotes-humor.json -a tag=humor
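
Arguments passed with -a become attributes on the spider instance. The tutorial uses this to build the start URL from the tag argument:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)  # set by -a tag=humor, if given
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)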