XPath parsing

Use XPath to scrape image names and image data
- [https://pic.netbian.com/4kmeinv/index.html]
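Local data parsing:
- Treat a tag that has already been located in the page as the data to parse.
- When parsing local data, the XPath expression must start with ./, which refers to the current local data (the tag that was located).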
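
The two points above can be sketched on a small inline document. The HTML snippet and the `alt` values below are made up for illustration; only `etree.HTML` and `xpath` come from lxml.

```python
from lxml import etree

# Hypothetical HTML snippet standing in for a downloaded page.
html = '''
<div class="slist">
  <ul>
    <li><a href="/a.html"><img src="/img/a.jpg" alt="first"></a></li>
    <li><a href="/b.html"><img src="/img/b.jpg" alt="second"></a></li>
  </ul>
</div>
'''

tree = etree.HTML(html)
li_list = tree.xpath('//div[@class="slist"]/ul/li')

# './' starts from the current <li>, so each call only sees that element's subtree.
alts = [li.xpath('./a/img/@alt')[0] for li in li_list]
print(alts)  # ['first', 'second']

# '//' still searches from the document root, even when called on an <li>;
# './/' restricts the search to the current <li>.
global_count = len(li_list[0].xpath('//img'))
local_count = len(li_list[0].xpath('.//img'))
print(global_count, local_count)  # 2 1
```

Forgetting the leading dot is an easy mistake to make: the expression still runs, but it matches tags from the whole page rather than from the located tag.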

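Requirement: how can the local data be parsed out with its HTML tags still attached?

Use bs4: when bs4 locates a tag, the value it returns is the string data of that tag, markup included.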


import os
from lxml import etree
import requests


dirName = 'lsp1'
if not os.path.exists(dirName):
    os.mkdir(dirName)

headers = {
    # The header value must not repeat the "User-Agent:" field name
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'
}

# URL of the first page
main_url = 'https://pic.netbian.com/4kmeinv/index.html'

# Scrape multiple pages
# Generic URL template for page 2 onward; the template itself never changes
url = 'https://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1, 6):
    if page == 1:
        # The first page has no page number in its URL
        new_url = main_url
    else:
        new_url = url % page
    response = requests.get(url=new_url, headers=headers)
    response.encoding = 'gbk'
    page_text = response.text

    # Image name + image data
    tree = etree.HTML(page_text)
    # li_list holds the located <li> tags
    li_list = tree.xpath('//div[@class="slist"]/ul/li')

    for li in li_list:
        # li has the same data type as tree, so li can also call xpath()
        title = li.xpath('./a/img/@alt')[0] + '.jpg'  # xpath() returns a list; [0] takes the string
        img_src = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_data = requests.get(url=img_src, headers=headers).content
        imgPath = os.path.join(dirName, title)
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(title, 'ok!')
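
The script builds page and image URLs by string concatenation. A slightly more defensive variant might use `urljoin` from the standard library, as sketched below. The helper names `page_url` and `absolute_src`, and the sample path `/uploads/x.jpg`, are hypothetical and only serve the illustration.

```python
from urllib.parse import urljoin

BASE = 'https://pic.netbian.com/4kmeinv/'


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    # Page 1 is index.html; later pages follow the index_N.html pattern.
    name = 'index.html' if page == 1 else 'index_%d.html' % page
    return urljoin(BASE, name)


def absolute_src(src):
    """Turn a root-relative src attribute into an absolute URL."""
    return urljoin(BASE, src)


print(page_url(1))
print(page_url(3))
print(absolute_src('/uploads/x.jpg'))  # hypothetical image path
```

`urljoin` resolves a path that starts with `/` against the site root, so it also copes with a `src` that is already absolute.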

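How can an XPath expression be made more general?
- Join several expressions with the pipe operator (|) inside one XPath expression. The result contains whatever either side matches, so it works whether both expressions match or only one does.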
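
A minimal sketch of the pipe operator, assuming two hypothetical page layouts (`page_a` and `page_b`) that keep their items in different tag structures:

```python
from lxml import etree

# Two hypothetical layouts: one uses div.slist/ul, the other div.list/ol.
page_a = etree.HTML('<div class="slist"><ul><li>a1</li><li>a2</li></ul></div>')
page_b = etree.HTML('<div class="list"><ol><li>b1</li></ol></div>')

# One expression covers both layouts; each side matches only where it applies.
expr = '//div[@class="slist"]/ul/li/text() | //div[@class="list"]/ol/li/text()'

items_a = page_a.xpath(expr)
items_b = page_b.xpath(expr)
print(items_a)  # ['a1', 'a2']
print(items_b)  # ['b1']
```

The same expression can then be reused on every page without checking first which layout the page uses.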