bs4 in Practice + XPath Basics

bs4

Scrape the full text of Romance of the Three Kingdoms. Source page: 《三国演义》全集在线阅读_史书典籍_诗词名句网 (shicimingju.com)


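import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent; the value must not repeat the "User-Agent:" header name
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'
}
main_url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
response = requests.get(url=main_url, headers=headers)
# Assumption: letting requests detect the encoding from the body avoids garbled Chinese text
response.encoding = response.apparent_encoding
page_text = response.text

# Data to parse: chapter title, detail-page URL, chapter content
soup = BeautifulSoup(page_text, 'html.parser')
# Locate every <a> tag that matches the table-of-contents selector
a_list = soup.select('.book-mulu > ul > li > a')

# The with-block closes the file even if a request fails midway
with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
    for a in a_list:
        title = a.get_text()
        detail_url = 'https://www.shicimingju.com' + a['href']
        # Request the detail page and parse out the chapter content
        detail_response = requests.get(url=detail_url, headers=headers)
        detail_response.encoding = detail_response.apparent_encoding
        detail_soup = BeautifulSoup(detail_response.text, 'html.parser')
        div_tag = detail_soup.find('div', class_='chapter_content')
        content = div_tag.text
        fp.write(title + ':' + content + '\n')
        print(title, 'saved')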
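
The two bs4 calls that the scraper relies on, select with a child-combinator selector and find with class_, can be tried offline first. The snippet below is only a sketch: sample_html and its href paths are made up to imitate the layout the selector expects, and the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the table-of-contents layout (not copied from the site)
sample_html = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
  </ul>
</div>
<div class="chapter_content">Body text of a chapter.</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# '>' matches direct children only, so this walks div -> ul -> li -> a
links = soup.select('.book-mulu > ul > li > a')
chapters = [(a.get_text(), 'https://www.shicimingju.com' + a['href']) for a in links]
print(chapters)

# find() returns the first matching tag, or None when nothing matches
div_tag = soup.find('div', class_='chapter_content')
content = div_tag.text if div_tag is not None else ''
print(content)
```

Checking for None before reading .text keeps one missing chapter page from stopping the whole run.
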

Note: the commands in newer versions of lxml differ somewhat from those in older versions.

xpath

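Installation: pip install lxml
How parsing works: an HTML document is structured as a tree
- Instantiate an etree object and load the page source to be parsed into it
- Call the etree object's xpath method with different XPath expressions to locate tags and extract data.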
Instantiating an etree object
- etree.parse('filename'): loads a local HTML document into the object
- etree.HTML(page_text): loads page data fetched from a website into the object
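
A minimal sketch of both loading routes follows. The page_text string and the temporary file are stand-ins for a real response body and a saved page, and passing etree.HTMLParser() to etree.parse is an addition here so that HTML which is not well-formed XML still loads.

```python
import os
import tempfile

from lxml import etree

# Hypothetical page source standing in for requests.get(...).text
page_text = "<html><body><div class='song'><p>hello</p></div></body></html>"

# etree.HTML parses a string of markup and returns the root <html> element
tree = etree.HTML(page_text)
print(tree.tag)

# etree.parse reads a local file; the HTMLParser tolerates non-XML-clean HTML
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False, encoding='utf-8') as f:
    f.write(page_text)
    path = f.name
file_tree = etree.parse(path, etree.HTMLParser())
os.remove(path)

# Both objects expose the same xpath() method
from_string = tree.xpath('//p/text()')
from_file = file_tree.xpath('//p/text()')
print(from_string, from_file)
```
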
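Locating tags
- Leading /: if the XPath expression starts with /, it must locate the target tag starting from the root tag
- / not in the leading position: represents one level of the hierarchy
- // not in the leading position: represents multiple levels of the hierarchy
- Leading //: the XPath expression can locate the tag from any position in the document
- Attribute match: tagName[@attrName="value"]
- Index match: tag[index]  # indices start at 1
- Fuzzy matching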
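
The locating rules above can be checked against a small document. This is a sketch only: the markup and class names are invented, and using contains() and starts-with() for fuzzy matching is an assumption about what the notes mean (both are standard XPath 1.0 functions).

```python
from lxml import etree

# Hypothetical markup used only to exercise each rule
html = """
<html><body>
  <div class="song">
    <p>first</p>
    <p>second</p>
  </div>
  <div class="tang">
    <ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>
  </div>
</body></html>
"""
tree = etree.HTML(html)

# Leading /: absolute path from the root; every later / goes down one level
divs_abs = tree.xpath('/html/body/div')
# Leading //: match <div> anywhere in the document
divs_any = tree.xpath('//div')
# // in the middle: <a> at any depth below the matching div
links = tree.xpath('//div[@class="tang"]//a')
# Attribute match
song = tree.xpath('//div[@class="song"]')
# Index match: positions are 1-based, so p[2] is the second <p>
second_p = tree.xpath('//div[@class="song"]/p[2]')
# Fuzzy matching (assumed meaning): substring and prefix tests on an attribute
fuzzy = tree.xpath('//div[contains(@class, "an")]')
prefix = tree.xpath('//div[starts-with(@class, "so")]')

print(len(divs_abs), len(divs_any), len(links), second_p[0].text)
```

xpath() returns a list even when only one node matches, so single results still need an index such as second_p[0].
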
Extracting text
- /text(): text that sits directly inside the tag
- //text(): all text inside the tag, including the text of nested tags
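
The difference between the two forms shows up as soon as a tag contains nested markup. A small sketch with invented markup:

```python
from lxml import etree

# Hypothetical markup: <p> holds its own text plus a nested <b>
tree = etree.HTML('<div class="song"><p>direct <b>nested</b> tail</p></div>')

# /text(): only the text nodes that are direct children of <p>
direct = tree.xpath('//div[@class="song"]/p/text()')
# //text(): every text node at any depth below the div
all_text = tree.xpath('//div[@class="song"]//text()')

print(direct)
print(''.join(all_text))
```

Both calls return a list of strings, so the pieces are usually joined before being written out.
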
Extracting attributes
- /@attrName: the value of the named attribute
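
A short sketch of attribute extraction, using invented markup and href values:

```python
from lxml import etree

# Hypothetical list of chapter links
tree = etree.HTML(
    '<ul>'
    '<li><a href="/book/1.html">One</a></li>'
    '<li><a href="/book/2.html">Two</a></li>'
    '</ul>'
)

# /@href yields the attribute values as strings, one per matching <a>
hrefs = tree.xpath('//li/a/@href')
# Combined with a 1-based index, then unwrapped from the result list
first_href = tree.xpath('//li[1]/a/@href')[0]

print(hrefs, first_href)
```
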