scrapy

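Introduction: a framework is a project template that bundles many features and is highly reusable.
scrapy is a framework dedicated to asynchronous crawling.
- High-performance data parsing, request sending, persistent storage, whole-site data crawling, middleware, distributed crawling…
twisted: the asynchronous architecture that scrapy is built on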
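Basic usage
- Create a project:
  - scrapy startproject ProName
- Directory structure:
  - spiders: the spider folder
    - must contain at least one spider source file
  - settings: the project's configuration file
- cd ProName
- Create a spider source file:
  - scrapy genspider spiderName www.xxx.com
- Run the project:
  - scrapy crawl spiderName
- Contents of the spider file spiderName:
  - name # unique identifier of the spider source file
  - allowed_domains # allowed domains
  - start_urls # list of start URLs; it may only hold URLs, and a GET request is sent to every URL in the list
  - parse # data parsing
- settings.py:
  - Do not obey robots.txt
    - ROBOTSTXT_OBEY = False
  - Log only a given level
    - LOG_LEVEL = 'ERROR'
  - User-Agent (UA) spoofing via the USER_AGENT setting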
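
The three settings.py items above could look like the fragment below. This is only a sketch: the setting names are scrapy's own, but the User-Agent string is a placeholder value chosen for illustration and should be replaced with whatever browser string you want to present.

```python
# settings.py (fragment)

# Do not fetch or obey the target site's robots.txt.
ROBOTSTXT_OBEY = False

# Only print log records of level ERROR and above.
LOG_LEVEL = 'ERROR'

# UA spoofing: placeholder browser string, an assumption for this example.
USER_AGENT = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
)
```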

Data parsing in scrapy

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for div in div_list:
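            # xpath() returns a list, and every element of that list is a Selector object
            # extract() pulls out the string stored in the Selector object's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # calling extract() on the list pulls out the data string of every Selector object in it
            content = div.xpath('./a[1]/div/span/text()').extract()
            # join the pieces into a single string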
            content = ''.join(content)
            print(author, content)

            break

Persistent storage in scrapy

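Via a terminal command:
- Requirement: only the return value of the parse method can be stored to a local text file
  - scrapy crawl xxx -o filePath
  - Advantages: concise, efficient, and convenient
  - Disadvantage: the data can only be stored in text files with specific extensions (for example .json, .csv, .xml)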

Via pipelines (commonly used):

Coding workflow:
- Parse the data
- Wrap the parsed data in an item object (items.py)

import scrapy


class FirstbloodItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()

Submit the item object to the pipeline for persistent storage (spider.py)

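    # requires at the top of the spider file: from FirstBlood.items import FirstbloodItem
    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for div in div_list:
            # xpath() returns a list, and every element of that list is a Selector object
            # extract() pulls out the string stored in the Selector object's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # calling extract() on the list pulls out the data string of every Selector object in it
            content = div.xpath('./a[1]/div/span/text()').extract()
            # join the pieces into a single string
            content = ''.join(content)
            item = FirstbloodItem()
            item['author'] = author
            item['content'] = content

            # hand the item over to the pipelines
            yield item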

In the pipeline class, process_item must persist the data held by each item object it receives (pipelines.py)

class FirstbloodPipeline:
    fp = None

    # called only once, when the spider starts
    def open_spider(self, spider):
        print('start')
        self.fp = open('./first.text', 'w', encoding='utf-8')

    # handles item objects
    # receives the item objects submitted by the spider file
    # called once for every item received
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')
        return item

    # called only once, when the spider finishes
    def close_spider(self, spider):
        print('end')
        self.fp.close()
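
The call order of the three hooks can be checked without running a crawl. The sketch below uses only the standard library: TextFilePipeline mirrors the pipeline above with a configurable path, and run_pipeline is a hypothetical driver that imitates the order in which scrapy calls the hooks. It is not scrapy's own code.

```python
import os
import tempfile


class TextFilePipeline:
    """Same shape as the pipeline above; `path` is a parameter added for the demo."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()


def run_pipeline(pipeline, items, spider=None):
    """Hypothetical driver: open once, process each item, close once."""
    pipeline.open_spider(spider)
    try:
        for item in items:
            pipeline.process_item(item, spider)
    finally:
        pipeline.close_spider(spider)


def demo():
    path = os.path.join(tempfile.mkdtemp(), 'first.txt')
    items = [{'author': 'alice', 'content': 'hello'},
             {'author': 'bob', 'content': 'world'}]
    run_pipeline(TextFilePipeline(path), items)
    with open(path, encoding='utf-8') as f:
        return f.read()
```

Opening the file once in open_spider rather than in process_item avoids reopening (and truncating) the file for every item.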

Enable the pipeline in the configuration file

ITEM_PIPELINES = {
    'FirstBlood.pipelines.FirstbloodPipeline': 300,  # the number is the priority; the smaller the value, the higher the priority
}

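An item object submitted by the spider file is handed first to the pipeline class with the highest priority (the smallest number); whatever that class's process_item returns is then passed to the next pipeline class.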
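
The effect of the priority numbers can be sketched in plain Python. The classes UppercaseAuthor and Collect and the function run_chain are hypothetical stand-ins written for this example; run_chain only imitates the ordering rule and is not how scrapy is implemented internally.

```python
class UppercaseAuthor:
    """Example pipeline: returns a modified copy of the item."""

    def process_item(self, item, spider):
        return dict(item, author=item['author'].upper())


class Collect:
    """Example pipeline: records whatever it receives."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)
        return item


def run_chain(pipelines, item, spider=None):
    """Pass `item` through the pipelines in ascending priority order.

    `pipelines` maps pipeline object -> priority number, like ITEM_PIPELINES
    maps class path -> priority number.
    """
    for pipeline, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
        # Each pipeline receives what the previous one returned.
        item = pipeline.process_item(item, spider)
    return item


def demo():
    collector = Collect()
    # Collect has the larger number, so it runs after UppercaseAuthor.
    run_chain({collector: 301, UppercaseAuthor(): 300},
              {'author': 'alice', 'content': 'hi'})
    return collector.seen
```

This is also why every process_item should end with return item: a pipeline that returns nothing hands None to the next one.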

Adding data to a database via a pipeline

import pymysql


class mysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='localhost', port=3306, user='root', password='newpassword',
                                    database='fist', charset='utf8')
        # create the cursor once so close_spider always has one to close
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            insert_sql = """
                    insert into first(author, content) VALUES (%s,%s)
                    """
            # insert the row into the database
            self.cursor.execute(insert_sql, (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
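
To experiment with the same pattern without a running MySQL server, the pipeline can be rewritten against the standard-library sqlite3 module. This is a swapped-in variant, not the MySQL code above: the class SqlitePipeline, its db_path parameter and the demo function are assumptions made for the example, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3


class SqlitePipeline:
    """sqlite3 stand-in for the MySQL pipeline; db_path is a demo parameter."""

    def __init__(self, db_path=':memory:'):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS first (author TEXT, content TEXT)')

    def process_item(self, item, spider):
        try:
            # Parameterized insert; the driver handles quoting of the values.
            self.conn.execute(
                'INSERT INTO first (author, content) VALUES (?, ?)',
                (item['author'], item['content']))
            self.conn.commit()
        except sqlite3.Error as exc:
            print(exc)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.conn.close()


def demo():
    pipeline = SqlitePipeline()
    pipeline.open_spider(spider=None)
    pipeline.process_item({'author': 'alice', 'content': 'hello'}, spider=None)
    rows = pipeline.conn.execute('SELECT author, content FROM first').fetchall()
    pipeline.close_spider(spider=None)
    return rows
```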

settings.py:

ITEM_PIPELINES = {
    'FirstBlood.pipelines.FirstbloodPipeline': 300,  # the number is the priority; the smaller the value, the higher the priority
    'FirstBlood.pipelines.mysqlPipeline': 301,  # receives the item returned by the previous pipeline class
}