Scrapy Crawler Learning Notes
1. Install Scrapy
sudo pip3 install Scrapy
2. Documentation
Reference: https://docs.scrapy.org/en/latest/
3. Example
3.1. Create a spider project
scrapy startproject <spider_project>
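For reference, the command generates a project skeleton roughly like the following (a sketch; the exact files can vary by Scrapy version):

<spider_project>/
    scrapy.cfg            # deploy configuration file
    <spider_project>/     # the project's Python module
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spider code goes here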
3.2. Add the spider code
cd <spider_project>
cd <spider_project>/spiders
Add a spider source file under the spiders directory:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # note: this is the name used by "scrapy crawl"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
3.3. Run the spider
scrapy crawl quotes
4. Fetching and Parsing
4.1. Fetching pages
yield scrapy.Request(url=url, callback=self.parse)
# parse is the default callback when none is given
def parse(self, response):
4.2. Selecting elements with CSS
1). response.css
response.css("<type>.<class>")
2). Selecting an element attribute
response.css("<type>.<class>::attr(<attr_name>)")
response.css('li.next a::attr(href)')
response.css('li.next a').attrib['href']
3). Selecting text
response.css("<type>.<class>::text")
4). Selecting multiple matches at once
quote.css("div.tags a.tag::text").getall()
5). Using regular expressions
response.css('title::text').re(r'Quotes.*')
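Putting the selectors above together, a minimal parse method for quotes.toscrape.com might look like this (the div.quote / span.text / small.author structure comes from the tutorial site):

def parse(self, response):
    # each quote on the page sits in a <div class="quote"> block
    for quote in response.css("div.quote"):
        yield {
            'text': quote.css("span.text::text").get(),
            'author': quote.css("small.author::text").get(),
            'tags': quote.css("div.tags a.tag::text").getall(),
        }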
4.3. Selecting with XPath
response.xpath('//title')
response.xpath('//title/text()').get()
4.4. Following linked pages
Reference: https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links
yield scrapy.Request(next_page, callback=self.parse)
or
yield response.follow(next_page, callback=self.parse)
Following multiple links:
anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)
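A complete parse that follows pagination, based on the tutorial's pattern (li.next a is the "next page" link on quotes.toscrape.com):

def parse(self, response):
    for quote in response.css("div.quote"):
        yield {'text': quote.css("span.text::text").get()}
    # response.follow accepts relative URLs, so no urljoin is needed
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)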
5. Saving Data
scrapy crawl quotes -o quotes.json
Or in JSON Lines format:
Reference: http://jsonlines.org/
jq tool: https://stedolan.github.io/jq
scrapy crawl quotes -o quotes.jl
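The exported file contains whatever parse yields. Instead of plain dicts, fields can be declared with scrapy.Item (a sketch; the QuoteItem name and its fields are my example, not from the original notes):

import scrapy

class QuoteItem(scrapy.Item):
    # declared fields; assigning an undeclared key raises a KeyError
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()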
6. Item Pipelines
Reference: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline
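A minimal pipeline sketch (the class name is mine): every item yielded by a spider passes through process_item, which must return the item or raise DropItem. Pipelines are enabled in settings.py via ITEM_PIPELINES.

from scrapy.exceptions import DropItem

class DropEmptyTextPipeline:
    def process_item(self, item, spider):
        # discard items that have no 'text' field
        if not item.get('text'):
            raise DropItem('missing text in %s' % item)
        return item

# in settings.py (lower number = runs earlier):
# ITEM_PIPELINES = {'<spider_project>.pipelines.DropEmptyTextPipeline': 300}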
7. Miscellaneous
7.1. Filtering duplicate links
DUPEFILTER_CLASS
Reference: https://docs.scrapy.org/en/latest/topics/settings.html#std:setting-DUPEFILTER_CLASS
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.
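To change the filter, point DUPEFILTER_CLASS at another implementation in settings.py; the value below is Scrapy's default:

# settings.py
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # the default
# per-request escape hatch: scrapy.Request(url, dont_filter=True)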
7.2. Passing arguments to callback functions
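The notes leave this section empty; a sketch using Request.cb_kwargs (available since Scrapy 1.7) to pass extra keyword arguments into a callback:

def parse(self, response):
    for href in response.css('a::attr(href)').getall():
        # entries in cb_kwargs arrive as named parameters of the callback
        yield response.follow(href, callback=self.parse_detail,
                              cb_kwargs={'referer': response.url})

def parse_detail(self, response, referer):
    self.log('reached %s from %s' % (response.url, referer))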
7.3. Using rules
Reference: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.CrawlSpider
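A minimal CrawlSpider sketch based on the linked docs (the allow pattern and names are illustrative; note that CrawlSpider reserves parse for itself, so the callback must use a different name):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes_crawl'
    start_urls = ['http://quotes.toscrape.com/']
    rules = (
        # follow pagination links and parse each matched page
        Rule(LinkExtractor(allow=r'/page/'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}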
7.4. Changing spider behavior with command-line arguments
Format: -a <name>=<value>
scrapy crawl quotes -o quotes-humor.json -a tag=humor
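Arguments passed with -a become attributes on the spider instance; the official tutorial's own example reads the tag argument like this:

def start_requests(self):
    url = 'http://quotes.toscrape.com/'
    # -a tag=humor sets self.tag; fall back to no filter when absent
    tag = getattr(self, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    yield scrapy.Request(url, self.parse)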