Scrapy Crawler Learning Notes
1. Install Scrapy
sudo pip3 install Scrapy
2. Documentation
Reference: https://docs.scrapy.org/en/latest/
3. Example
3.1. Create a spider project
scrapy startproject <spider_project>
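For reference, the command generates a project skeleton roughly like the following (a sketch; the exact files can vary by Scrapy version):

<spider_project>/
    scrapy.cfg            # deploy configuration file
    <spider_project>/     # the project's Python module
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spider code goes here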
3.2. Add the spider code
cd <spider_project>
cd <spider_project>/spiders
Add a spider source file under the spiders directory:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # note: this is the name used by "scrapy crawl"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
3.3. Run the spider
scrapy crawl quotes
4. Fetching and Parsing
4.1. Fetching pages
yield scrapy.Request(url=url, callback=self.parse)
# parse is the default callback when none is given
def parse(self, response):
4.2. Selecting elements with CSS
1). response.css
response.css("<type>.<class>")
2). Selecting an element attribute
response.css("<type>.<class>::attr(<attr_name>)")
response.css('li.next a::attr(href)')
response.css('li.next a').attrib['href']
3). Selecting text
response.css("<type>.<class>::text")
4). Selecting multiple matches at once
quote.css("div.tags a.tag::text").getall()
5). Using regular expressions
response.css('title::text').re(r'Quotes.*')
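Putting the selectors above together, a minimal parse method for quotes.toscrape.com might look like this (the div.quote / span.text / small.author structure comes from the tutorial site):

def parse(self, response):
    # each quote on the page sits in a <div class="quote"> block
    for quote in response.css("div.quote"):
        yield {
            'text': quote.css("span.text::text").get(),
            'author': quote.css("small.author::text").get(),
            'tags': quote.css("div.tags a.tag::text").getall(),
        }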
4.3. Selecting with XPath
response.xpath('//title')
response.xpath('//title/text()').get()
4.4. Following linked pages
Reference: https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links
yield scrapy.Request(next_page, callback=self.parse)
or
yield response.follow(next_page, callback=self.parse)
Following multiple links:
anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)
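A complete parse that follows pagination, based on the tutorial's pattern (li.next a is the "next page" link on quotes.toscrape.com):

def parse(self, response):
    for quote in response.css("div.quote"):
        yield {'text': quote.css("span.text::text").get()}
    # response.follow accepts relative URLs, so no urljoin is needed
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)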
5. Saving Data
scrapy crawl quotes -o quotes.json
Or in JSON Lines format:
Reference: http://jsonlines.org/
jq tool: https://stedolan.github.io/jq
scrapy crawl quotes -o quotes.jl
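The exported file contains whatever parse yields. Instead of plain dicts, fields can be declared with scrapy.Item (a sketch; the QuoteItem name and its fields are my example, not from the original notes):

import scrapy

class QuoteItem(scrapy.Item):
    # declared fields; assigning an undeclared key raises a KeyError
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()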
6. Item Pipelines
Reference: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline
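A minimal pipeline sketch (the class name is mine): every item yielded by a spider passes through process_item, which must return the item or raise DropItem. Pipelines are enabled in settings.py via ITEM_PIPELINES.

from scrapy.exceptions import DropItem

class DropEmptyTextPipeline:
    def process_item(self, item, spider):
        # discard items that have no 'text' field
        if not item.get('text'):
            raise DropItem('missing text in %s' % item)
        return item

# in settings.py (lower number = runs earlier):
# ITEM_PIPELINES = {'<spider_project>.pipelines.DropEmptyTextPipeline': 300}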
7. Miscellaneous
7.1. Filtering duplicate links
DUPEFILTER_CLASS
Reference: https://docs.scrapy.org/en/latest/topics/settings.html#std:setting-DUPEFILTER_CLASS
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.
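To change the filter, point DUPEFILTER_CLASS at another implementation in settings.py; the value below is Scrapy's default:

# settings.py
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # the default
# per-request escape hatch: scrapy.Request(url, dont_filter=True)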
7.2. Passing arguments to callback functions
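The notes leave this section empty; a sketch using Request.cb_kwargs (available since Scrapy 1.7) to pass extra keyword arguments into a callback:

def parse(self, response):
    for href in response.css('a::attr(href)').getall():
        # entries in cb_kwargs arrive as named parameters of the callback
        yield response.follow(href, callback=self.parse_detail,
                              cb_kwargs={'referer': response.url})

def parse_detail(self, response, referer):
    self.log('reached %s from %s' % (response.url, referer))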
7.3. Using rules
Reference: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.CrawlSpider
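A minimal CrawlSpider sketch based on the linked docs (the allow pattern and names are illustrative; note that CrawlSpider reserves parse for itself, so the callback must use a different name):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes_crawl'
    start_urls = ['http://quotes.toscrape.com/']
    rules = (
        # follow pagination links and parse each matched page
        Rule(LinkExtractor(allow=r'/page/'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}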
7.4. Changing spider behavior with command-line arguments
Format: -a <name>=<value>
scrapy crawl quotes -o quotes-humor.json -a tag=humor
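Arguments passed with -a become attributes on the spider instance; the official tutorial's own example reads the tag argument like this:

def start_requests(self):
    url = 'http://quotes.toscrape.com/'
    # -a tag=humor sets self.tag; fall back to no filter when absent
    tag = getattr(self, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    yield scrapy.Request(url, self.parse)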