1. Introduction to Beautiful Soup
Beautiful Soup is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages.
2. Basic usage
from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class='title' name='dormouse'><b>The Dormouse's story</b></p>
<p class='story'>Once upon a time there were three little sisters; and their names were
<a href='https://example.com/elsie' class='sister' id='link1'><!--Elsie--></a>,
<a href='https://example.com/Lacie' class='sister' id='link2'><!--Lacie--></a>and
<a href='https://example.com/Tom' class='sister' id='link3'><!--Tom--></a>;
and they lived at the bottom of a wall</p>
<p class='story'>...</p>
'''
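soup=BeautifulSoup(html,'lxml')  # the second argument selects the parser; lxml is used here
# BeautifulSoup repairs malformed markup on initialization (here, the missing </body> and </html> tags)
print(soup.prettify())  # prettify() returns the parsed document as a string with standard indentation
print(soup.title.string)  # soup.title selects the title node; .string returns its text content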
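The automatic repair mentioned in the comments can be seen on a smaller input. The following is a minimal sketch, assuming the standard-library 'html.parser' backend instead of lxml (so it runs without extra packages) and a made-up fragment whose tags are never closed.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <p> nor <b> is closed.
broken = "<p class='title'><b>The Dormouse's story"

# Assumption: 'html.parser' is used here; the article itself uses 'lxml',
# which would additionally wrap the result in <html> and <body>.
soup = BeautifulSoup(broken, 'html.parser')

# The serialized tree now contains the closing tags that were missing.
repaired = str(soup)
print(repaired)
print(soup.b.string)
```

Different parsers can repair the same broken input in different ways, so the exact output depends on the parser you pass as the second argument.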
3. Node selectors
Access a tag name as an attribute of the soup object to select that node, then read its string attribute to get the text inside it.
Selecting elements
from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class='title' name='dormouse'><b>The Dormouse's story</b></p>
<p class='story'>Once upon a time there were three little sisters; and their names were
<a href='https://example.com/elsie' class='sister' id='link1'><!--Elsie--></a>,
<a href='https://example.com/Lacie' class='sister' id='link2'><!--Lacie--></a>and
<a href='https://example.com/Tom' class='sister' id='link3'><!--Tom--></a>;
and they lived at the bottom of a wall</p>
<p class='story'>...</p>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)  # when several nodes match, only the first one is selected
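Extracting information
This covers getting a node's name and its attribute values.
(1) Getting the name
Use the name attribute to get a node's name.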
print(soup.title.name)
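(2) Getting attributes
A node may have several attributes. Use attrs to get all of them.
print(soup.a.attrs)  # returns a dict
print(soup.a['class'])  # shortcut: index the node directly by attribute name
print(soup.p.attrs['name'])  # the first p node has name='dormouse'; the a nodes have no name attribute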
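Two details of attribute access are easy to miss: class is treated as a multi-valued attribute, and indexing a missing attribute raises KeyError. A small sketch, assuming 'html.parser' and a made-up link with two classes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the second class name is invented.
html = "<a href='https://example.com/elsie' class='sister big' id='link1'>Elsie</a>"
soup = BeautifulSoup(html, 'html.parser')
link = soup.a

href = link['href']         # same value as link.attrs['href']
classes = link['class']     # class can hold several values, so a list is returned
missing = link.get('name')  # get() returns None instead of raising KeyError

print(href)
print(classes)
print(missing)
```

Using get() is the safer choice when an attribute may be absent on some of the nodes you iterate over.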
(3) Getting the content
Use the string attribute to get the text contained in a node.
print(soup.p.string)
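The string attribute only works when a node has a single piece of text to return. The sketch below, using a made-up fragment and 'html.parser' as an assumption, shows that string is None for a node that mixes text and child nodes, while get_text() still collects everything.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the <p> holds text, a <b> child, and more text.
html = "<p>Once upon a time <b>three</b> sisters</p>"
soup = BeautifulSoup(html, 'html.parser')

single = soup.b.string     # <b> has exactly one text child
mixed = soup.p.string      # <p> has several children, so this is None
text = soup.p.get_text()   # concatenates all text below <p>

print(single)
print(mixed)
print(text)
```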
Nested selection
Each selection returns a bs4.element.Tag, so you can keep selecting nodes on the result.
print('Nested selection')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)
Relational selection
Sometimes the node you want cannot be selected in one step. In that case, select some node first and then use it as the base for selecting its children, parent, siblings and so on.
(1) Children and descendants
After selecting a node, use the contents attribute to get its direct children.
from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class='title' name='dormouse'><b>The Dormouse's story</b></p>
<p class='story'>Once upon a time there were three little sisters; and their names were
<a href='https://example.com/elsie' class='sister' id='link1'><!--Elsie--></a>,
<a href='https://example.com/Lacie' class='sister' id='link2'><!--Lacie--></a>and
<a href='https://example.com/Tom' class='sister' id='link3'><!--Tom--></a>;
and they lived at the bottom of a wall</p>
<p class='story'>...</p>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.p.contents)
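The result is a list. Here soup.p is the first p node, which holds a single b node, so the list has one item. For a node that contains both text and child nodes, such as the second p node, the text and the nodes are returned together in the same list.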
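To make the difference between direct children and all descendants concrete, here is a minimal sketch. It assumes 'html.parser' and a made-up fragment; children and descendants are the iterator counterparts of contents that Beautiful Soup provides.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: text plus a link that wraps a <span>.
html = "<p>Hello <a href='#'><span>Elsie</span></a></p>"
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

direct = p.contents                 # list of direct children: text and <a>
same_as_direct = list(p.children)   # children iterates over the same nodes
everything = list(p.descendants)    # walks the whole subtree, depth first

# Keep only the tag names to make the traversal order visible.
tag_names = [node.name for node in everything if isinstance(node, Tag)]

print(len(direct), len(everything))
print(tag_names)
```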
(2) Parent and ancestor nodes
# same html and soup as above
print(type(soup.a.parents))  # parents yields every ancestor node; it is a generator
print(list(enumerate(soup.a.parents)))
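The snippet above uses parents; its singular counterpart parent returns only the direct parent. A short sketch of both, assuming 'html.parser' and a made-up, fully closed document:

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed document so the ancestor chain is predictable.
html = "<html><body><p><a href='#'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')
a = soup.a

direct_parent = a.parent.name                  # just the enclosing <p>
ancestors = [node.name for node in a.parents]  # every ancestor, innermost first

print(direct_parent)
print(ancestors)
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.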
(3) Sibling nodes
Getting nodes at the same level
from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class='story'>Once upon a time there were three little sisters; and their names were
<a href='https://example.com/elsie' class='sister' id='link1'>
<span>Elsie</span>
</a>
Hello
<a href='https://example.com/Lacie' class='sister' id='link2'><!--Lacie--></a>
and
<a href='https://example.com/Tom' class='sister' id='link3'><!--Tom--></a>;
and they lived at the bottom of a wall</p>
'''
soup=BeautifulSoup(html,'lxml')
print('Sibling nodes')
print('Previous sibling:',soup.a.previous_sibling)
print('Next sibling:',soup.a.next_sibling)
print('Previous siblings:',list(enumerate(soup.a.previous_siblings)))
print('Next siblings:',list(enumerate(soup.a.next_siblings)))
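Text between tags, including plain whitespace, counts as a sibling too, which is why next_sibling often returns a string rather than a tag. The sketch below assumes 'html.parser' and a made-up fragment, and uses find_next_sibling() to skip straight to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links separated only by a newline.
html = "<p><a id='link1'>Elsie</a>\n<a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

raw_sibling = first.next_sibling            # the newline text node
tag_sibling = first.find_next_sibling('a')  # the next <a> tag, text skipped

print(repr(raw_sibling))
print(tag_sibling['id'])
```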
(4) Extracting information
from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class='story'>
Once upon a time there were three little sisters; and their names were
<a href='https://example.com/elsie' class='sister' id='link1'>Elsie</a><a href='https://example.com/Lacie' class='sister' id='link2'>Lacie</a>
</p>
'''
soup=BeautifulSoup(html,'lxml')
print('Getting text and attributes')
print('next_sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
When the result is a single node, you can read string, attrs and similar attributes on it directly to get its text and attributes.
When the result is a generator of several nodes, convert it to a list, pick one element, and then read string, attrs and similar attributes on that node.
4. Method selectors
find_all()
find_all() returns every element that matches the given conditions. Pass it attributes or text to describe the elements you want.
(1) name
Find elements by node name.
from bs4 import BeautifulSoup
html='''
<div class='panel'>
<div class='panel-heading'>
<h4>Hello</h4>
</div>
<div class='panel-body'>
<ul class='list' id='list-1'>
<li class='element'>Foo</li>
<li class='element'>Tom</li>
<li class='element'>Bob</li>
</ul>
<ul class='list list-small' id='list-2'>
<li class='element'>Foo</li>
<li class='element'>Tom</li>
</ul>
</div>
</div>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))  # each result is a Tag, so queries can be nested
for ul in soup.find_all(name='ul'):
    print('ul====')
    print(ul)
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Tom</li>
<li class="element">Bob</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Tom</li>
</ul>]
<class 'bs4.element.Tag'>
ul====
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Tom</li>
<li class="element">Bob</li>
</ul>
[<li class="element">Foo</li>, <li class="element">Tom</li>, <li class="element">Bob</li>]
Foo
Tom
Bob
ul====
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Tom</li>
</ul>
[<li class="element">Foo</li>, <li class="element">Tom</li>]
Foo
Tom
(2) attrs
Query by attribute values.
from bs4 import BeautifulSoup
html='''
<div class='panel'>
<div class='panel-heading'>
<h4>Hello</h4>
</div>
<div class='panel-body'>
<ul class='list' id='list-1' name='elements'>
<li class='element'>Foo</li>
</ul>
<ul class='list list-small' id='list-2'>
<li class='element'>Foo</li>
</ul>
</div>
</div>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))  # attrs takes a dict; the result is a list of all nodes with id='list-1'
print(soup.find_all(attrs={'name':'elements'}))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
</ul>]
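For common attributes, find_all() also accepts keyword arguments, which saves writing the attrs dict. A minimal sketch, assuming 'html.parser' and a shortened, made-up version of the markup above; class_ carries a trailing underscore because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup with two lists.
html = """
<ul class='list' id='list-1'><li class='element'>Foo</li></ul>
<ul class='list list-small' id='list-2'><li class='element'>Bar</li></ul>
"""
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id='list-1')             # same query as attrs={'id': 'list-1'}
by_class = soup.find_all(class_='list-small')  # matches any one of a node's classes
both_lists = soup.find_all(class_='list')      # both <ul> nodes carry the class 'list'

print([ul['id'] for ul in by_id])
print([ul['id'] for ul in by_class])
print(len(both_lists))
```

Attribute names that are not valid Python identifiers still need the attrs dict.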
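(3) text
The text parameter matches against the text of nodes. It accepts either a string or a compiled regular expression object. Since Beautiful Soup 4.4.0 the same parameter is also available under the name string, and text is the older name.
from bs4 import BeautifulSoup
import re
html='''
<div class='panel'>
<div class='panel-1'>
<a href='https://www.baidu.com'>baidu link</a>
<a href='https://www.sohu.com'>sohu link</a>
</div>
</div>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(text=re.compile('link')))
Output: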
['baidu link', 'sohu link']
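Matching on text alone returns the text nodes, not the tags around them. If you want the tags, combine a node name with the text filter. The sketch below assumes Beautiful Soup 4.4.0 or later (for the string keyword), 'html.parser', and a made-up third link that does not match.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the third link is added so that one node does not match.
html = (
    "<div>"
    "<a href='https://www.baidu.com'>baidu link</a>"
    "<a href='https://www.sohu.com'>sohu link</a>"
    "<a href='#'>other</a>"
    "</div>"
)
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('link')

texts = soup.find_all(string=pattern)       # the matching text nodes
links = soup.find_all('a', string=pattern)  # <a> tags whose text matches
hrefs = [a['href'] for a in links]

print(texts)
print(hrefs)
```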
find()
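find() works like find_all(), except that it returns only the first matching element, as a single node, whereas find_all() returns a list of all matching elements.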
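The difference in return types matters most when nothing matches. A small sketch, assuming 'html.parser' and a made-up list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two list items and no table.
html = "<ul><li>Foo</li><li>Bar</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.find('li')         # a single Tag: the first match
all_li = soup.find_all('li')       # a list with every match
no_table = soup.find('table')      # None when nothing matches
no_tables = soup.find_all('table') # an empty list when nothing matches

print(first_li.string)
print(len(all_li))
print(no_table)
print(no_tables)
```

Because find() can return None, check the result before reading attributes such as string on it.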
5. CSS selectors
from bs4 import BeautifulSoup
html='''
<div class='panel'>
<div class='panel-heading'>
<h4>Hello</h4>
</div>
<div class='panel-body'>
<ul class='list' id='list-1'>
<li class='element'>Foo</li>
<li class='element'>Tom</li>
<li class='element'>Bob</li>
</ul>
<ul class='list list-small' id='list-2'>
<li class='element'>Foo</li>
<li class='element'>Tom</li>
</ul>
</div>
</div>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Tom</li>, <li class="element">Bob</li>, <li class="element">Foo</li>, <li class="element">Tom</li>]
[<li class="element">Foo</li>, <li class="element">Tom</li>]
<class 'bs4.element.Tag'>
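select() always returns a list. When only the first match is needed, select_one() plays the role that find() plays for find_all(). A minimal sketch, assuming 'html.parser' and a shortened, made-up version of the markup above:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup with two lists.
html = (
    "<ul class='list' id='list-1'><li class='element'>Foo</li></ul>"
    "<ul class='list list-small' id='list-2'>"
    "<li class='element'>Tom</li><li class='element'>Bob</li>"
    "</ul>"
)
soup = BeautifulSoup(html, 'html.parser')

first = soup.select_one('#list-2 .element')  # first match only, as a Tag
nothing = soup.select_one('table')           # None when nothing matches
small_items = soup.select('ul.list-small > li')  # child combinator

print(first.string)
print(nothing)
print([li.string for li in small_items])
```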
Nested selection
First select the ul nodes, then iterate over them and select the li nodes inside each one.
for ul in soup.select('ul'):
    print(ul.select('li'))
Getting attributes
for ul in soup.select('ul'):
    print(ul['id'])  # index the Tag directly by attribute name
    print(ul.attrs['id'])  # or go through attrs; both print the id of each ul node
'''
list-1
list-1
list-2
list-2
'''
Getting text
To get the text, you can use the string attribute or the get_text() method.
for li in soup.select('li'):
    print('Get Text',li.get_text())
    print('String:',li.string)
'''
Get Text Foo
String: Foo
Get Text Tom
String: Tom
Get Text Bob
String: Bob
Get Text Foo
String: Foo
Get Text Tom
String: Tom
'''
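Real pages usually contain extra whitespace and nested tags inside a node, in which case the raw text needs cleaning. The following sketch assumes 'html.parser' and a made-up li node. It uses the separator and strip arguments of get_text() and the stripped_strings generator.

```python
from bs4 import BeautifulSoup

# Hypothetical node with surrounding whitespace and a nested <b> tag.
html = "<li class='element'>\n  Foo <b>Bar</b>\n</li>"
soup = BeautifulSoup(html, 'html.parser')
li = soup.li

raw = li.get_text()                   # keeps the whitespace as written
clean = li.get_text(' ', strip=True)  # strip each piece, join with a space
parts = list(li.stripped_strings)     # the stripped pieces as a list

print(repr(raw))
print(clean)
print(parts)
```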