Using Beautiful Soup

1. Introduction to Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages.

2. Basic usage
from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class='title' name='dormouse'><b>The Dormouse's story</b></p>
<p class='story'>Once a time there were three little sisiters;and their names were 
<a href='https://example.com/elsie' class='sister' id='link1'><!--Elsie--></a>,
<a href='https://example.com/Lacie' class='sister' id='link2'><!--Lacie--></a>and
<a href='https://example.com/Tom' class='sister' id='link3'><!--Tom--></a>;
and they lived at the bottom of a wall</p>
<p class='story'>...</p>
'''
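soup=BeautifulSoup(html,'lxml')  # the second argument selects the parser; lxml is used here
# BeautifulSoup repairs malformed markup automatically during initialization
print(soup.prettify())  # prettify() returns the parsed document as a string with standard indentation
print(soup.title.string)  # soup.title selects the title node; .string gives its text content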
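
The automatic repair mentioned in the comments is easier to see on a smaller input. The following is a minimal sketch, not part of the original example. It assumes Python's built-in html.parser instead of lxml so that it runs without an extra install, and the names broken and fixed are only illustrative.

```python
from bs4 import BeautifulSoup

# A fragment whose <p> and <b> tags are never closed.
broken = '<p>Hello <b>world'

# html.parser ships with Python; lxml would need a separate install.
fixed = BeautifulSoup(broken, 'html.parser')

# Serializing the tree shows the closing tags that the parser added.
print(str(fixed))
print(fixed.b.string)
```

Different parsers may repair the same broken markup in slightly different ways, so the exact output can depend on the parser chosen.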

3. Node selectors

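Calling a node by its name selects that element directly; calling the string attribute on it then returns the text inside the node.

  • Selecting elements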

    from bs4 import BeautifulSoup
    html='''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class='title' name='dormouse'><b>The Dormouse's story</b></p>
    <p class='story'>Once a time there were three little sisiters;and their names were 
    <a href='https://example.com/elsie' class='sister' id='link1'><!--Elsie--></a>,
    <a href='https://example.com/Lacie' class='sister' id='link2'><!--Lacie--></a>and
    <a href='https://example.com/Tom' class='sister' id='link3'><!--Tom--></a>;
    and they lived at the bottom of a wall</p>
    <p class='story'>...</p>
    '''
    soup=BeautifulSoup(html,'lxml')
    print(soup.title)
    print(type(soup.title))
    print(soup.title.string)
    print(soup.head)
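    print(soup.p)  # when several nodes match, only the first one is selected
  • Extracting information

    Get a node's name and its attribute values.

    (1) Getting the name

    Use the name attribute to get the name of a node.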

    print(soup.title.name)

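    (2) Getting attributes

    A node may have several attributes; use attrs to get all of them.

    print(soup.a.attrs)  # returns a dict of all attributes
    print(soup.a['class'])  # square brackets on the Tag read a single attribute
    print(soup.p.attrs['name'])  # the first p node has a name attribute; the a nodes do not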
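
One detail worth seeing in isolation is that attribute values do not all have the same type. The sketch below is an illustration added here, assuming the built-in html.parser and a one-line HTML string modelled on the example above; the variable name link is hypothetical.

```python
from bs4 import BeautifulSoup

html = "<a href='https://example.com/elsie' class='sister' id='link1'>Elsie</a>"
soup = BeautifulSoup(html, 'html.parser')

link = soup.a
print(link.attrs)        # dict of all attributes
print(link['id'])        # single-valued attribute: a str
print(link['class'])     # class may hold several values, so it comes back as a list
print(link.get('name'))  # get() returns None for a missing attribute instead of raising KeyError
```

Using get() is a reasonable choice when an attribute may be absent, since square brackets raise KeyError in that case.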

    (3) Getting the content

    Use the string attribute to get the text contained in a node.

    print(soup.p.string)
  • Nested selection

    Each result is of type bs4.element.Tag, so you can keep calling node names on it to select further.

    print('Nested selection')
    print(soup.head.title)
    print(type(soup.head.title))
    print(soup.head.title.string)
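  • Relational selection

    Sometimes the target node cannot be reached in one step. In that case, select some node first and then use it as a reference point to select its children, parent, siblings, and so on.

    (1) Children and descendants

    After selecting a node, call the contents attribute to get its direct children.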

    from bs4 import BeautifulSoup
    html='''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class='title' name='dormouse'><b>The Dormouse's story</b></p>
    <p class='story'>Once a time there were three little sisiters;and their names were 
    <a href='https://example.com/elsie' class='sister' id='link1'><!--Elsie--></a>,
    <a href='https://example.com/Lacie' class='sister' id='link2'><!--Lacie--></a>and
    <a href='https://example.com/Tom' class='sister' id='link3'><!--Tom--></a>;
    and they lived at the bottom of a wall</p>
    <p class='story'>...</p>
    '''
    soup=BeautifulSoup(html,'lxml')
    print(soup.p.contents)

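    The result is a list. Here soup.p is the first p node, so the list holds only its b node. For a node that mixes text and child nodes, such as the second p node, the text and the nodes are all returned together in one list.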
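
The heading also mentions descendants, which contents does not cover. As a hedged sketch (the shortened HTML, the html.parser choice, and the variable names are assumptions made for this illustration), the children and descendants attributes can be compared with contents like this:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<p class='story'>Once <a id='link1'><span>Elsie</span></a> and <a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html, 'html.parser')

p = soup.p

# contents: direct children as a list, text nodes and tags mixed
print(p.contents)

# children: the same direct children, but produced lazily by an iterator
for i, child in enumerate(p.children):
    print(i, repr(child))

# descendants: every node below p, walked depth first; keep only the tags
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]
print(descendant_tags)
```

Note that the span inside the first link shows up only in descendants, because it is a grandchild of p rather than a direct child.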

    (2) Parent and ancestor nodes

    # html and soup are the same as above
    print(type(soup.a.parents))  # parents yields all ancestor nodes; it is a generator
    print(list(enumerate(soup.a.parents)))
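
For the direct parent alone, Tag objects also expose a parent attribute. The following minimal sketch contrasts the two; it assumes the built-in html.parser and a reduced HTML string, and ancestor_names is an illustrative name.

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='story'><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

a = soup.a

# parent: only the node directly above
print(a.parent.name)

# parents: every ancestor, from the nearest one up to the document itself
ancestor_names = [node.name for node in a.parents]
print(ancestor_names)
```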

    (3) Sibling nodes

    Getting nodes at the same level.

    from bs4 import BeautifulSoup
    html='''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class='story'>Once a time there were three little sisiters;and their names were 
    <a href='https://example.com/elsie' class='sister' id='link1'>
    <span>Elsie</span>
    </a>
        Hello
    <a href='https://example.com/Lacie' class='sister' id='link2'><!--Lacie--></a>
        and
    <a href='https://example.com/Tom' class='sister' id='link3'><!--Tom--></a>;
    and they lived at the bottom of a wall</p>
    '''
    soup=BeautifulSoup(html,'lxml')
    print('Sibling nodes')
    print('Previous sibling',soup.a.previous_sibling)
    print('Next sibling',soup.a.next_sibling)
    print('Previous siblings',list(enumerate(soup.a.previous_siblings)))
    print('Next siblings',list(enumerate(soup.a.next_siblings)))
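
Because text between tags counts as a sibling too, next_sibling often returns a text node rather than the next tag. A small sketch of one way around this, using find_next_sibling() (the shortened HTML, the html.parser choice, and the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html = "<p><a id='link1'>Elsie</a>\n  Hello\n<a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html, 'html.parser')

first = soup.a

# next_sibling is the text node sitting between the two links
print(repr(first.next_sibling))

# find_next_sibling('a') skips text nodes and returns the next <a> tag
next_link = first.find_next_sibling('a')
print(next_link['id'])
```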

    (4) Extracting information

    from bs4 import BeautifulSoup
    html='''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class='story'>
        Once a time there were three little sisiters;and their names were 
    <a href='https://example.com/elsie' class='sister' id='link1'>Elsie</a><a href='https://example.com/Lacie' class='sister' id='link2'>Lacie</a>
    </p>
    '''
    soup=BeautifulSoup(html,'lxml')
    print('Getting text and attributes')
    
    print('next_sibling:')
    print(type(soup.a.next_sibling))
    print(soup.a.next_sibling)
    print(soup.a.next_sibling.string)
    
    print('Parent:')
    print(type(soup.a.parents))
    print(list(soup.a.parents)[0])
    print(list(soup.a.parents)[0].attrs['class'])

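    If the result is a single node, you can call string, attrs, and similar attributes on it directly to get its text and attributes.

    If the result is a generator of several nodes, convert it to a list, take out one element, and then call string, attrs, and so on to get the text and attributes of that node.

4. Method selectors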
  • find_all()

    Finds all elements that match the given conditions. Pass it attributes or text and it returns the matching elements.

    (1) name

    Finds elements by node name.

    from bs4 import BeautifulSoup
    html='''
    <div class='panel'>
    <div class='panel-heading'>
    <h4>Hello</h4>
    </div>
    <div class='panel-body'>
    <ul class='list' id='list-1'>
    <li class='element'>Foo</li>
    <li class='element'>Tom</li>
    <li class='element'>Bob</li>
    </ul>
    <ul class='list list-small' id='list-2'>
    <li class='element'>Foo</li>
    <li class='element'>Tom</li>
    </ul>
    </div>
    </div>
    '''
    soup=BeautifulSoup(html,'lxml')
    print(soup.find_all(name='ul'))
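    print(type(soup.find_all(name='ul')[0]))  # each result is a Tag, so queries can be nested
    for ul in soup.find_all(name='ul'):
        print('ul====',ul)  # show the ul being processed, as in the output below
        print(ul.find_all(name='li'))
        for li in ul.find_all(name='li'):
            print(li.string)

    Output: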

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Tom</li>
    <li class="element">Bob</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Tom</li>
    </ul>]
    <class 'bs4.element.Tag'>
    ul==== <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Tom</li>
    <li class="element">Bob</li>
    </ul>
    [<li class="element">Foo</li>, <li class="element">Tom</li>, <li class="element">Bob</li>]
    Foo
    Tom
    Bob
    ul==== <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Tom</li>
    </ul>
    [<li class="element">Foo</li>, <li class="element">Tom</li>]
    Foo
    Tom
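
The name argument is not limited to a single tag name. As a hedged sketch (the reduced HTML, the html.parser choice, and the variable names both and first_two are assumptions for this illustration), find_all() also accepts a list of names and a limit on the number of results:

```python
from bs4 import BeautifulSoup

html = '''
<div><h4>Hello</h4>
<ul id='list-1'><li>Foo</li><li>Tom</li><li>Bob</li></ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# A list of names matches any of them, in document order
both = soup.find_all(['h4', 'li'])
print([tag.name for tag in both])

# limit stops the search after the given number of matches
first_two = soup.find_all('li', limit=2)
print([li.string for li in first_two])
```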

    (2) attrs

    Query by passing in attributes.

    from bs4 import BeautifulSoup
    html='''
    <div class='panel'>
    <div class='panel-heading'>
    <h4>Hello</h4>
    </div>
    <div class='panel-body'>
    <ul class='list' id='list-1' name='elements'>
    <li class='element'>Foo</li>
    </ul>
    <ul class='list list-small' id='list-2'>
    <li class='element'>Foo</li>
    </ul>
    </div>
    </div>
    '''
    soup=BeautifulSoup(html,'lxml')
    print(soup.find_all(attrs={'id':'list-1'}))  # attrs takes a dict; the result is a list of all nodes with id=list-1
    print(soup.find_all(attrs={'name':'elements'}))
    

    Output:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    </ul>]
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    </ul>]
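
For common attributes there is a shorter spelling than the attrs dict. The sketch below is an added illustration, assuming the built-in html.parser and a trimmed copy of the HTML above; it passes id directly as a keyword argument and uses class_ for the class attribute, since class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

html = '''
<ul class='list' id='list-1'>
<li class='element'>Foo</li>
</ul>
<ul class='list list-small' id='list-2'>
<li class='element'>Foo</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keyword form of attrs={'id': 'list-1'}
by_id = soup.find_all(id='list-1')

# class_ matches any node that has 'list' among its classes
by_class = soup.find_all(class_='list')

print(len(by_id), len(by_class))
```

The attrs dict is still needed for attribute names such as name, which would otherwise collide with the first parameter of find_all().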

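    (3) text

    The text parameter matches against the text of nodes. The value can be a string or a compiled regular expression object. Recent Beautiful Soup releases name this parameter string and still accept text as the older spelling.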

    from bs4 import BeautifulSoup
    import re
    html='''
    <div class='panel'>
    <div class='panel-1'>
    <a href='https://www.baidu.com'>baidu link</a>
    <a href='https://www.sohu.com'>sohu link</a>
    </div>
    </div>
    '''
    soup=BeautifulSoup(html,'lxml')
    print(soup.find_all(text=re.compile('link')))

    Output:

    ['baidu link', 'sohu link']
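
The call above returns the matching text itself. If the tags are wanted instead, the text filter can be combined with a tag name. This is a minimal sketch under the assumption that the installed Beautiful Soup is recent enough to accept the string keyword (older releases would use text here); it also assumes the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

html = "<div><a href='https://www.baidu.com'>baidu link</a><a href='https://www.sohu.com'>sohu link</a></div>"
soup = BeautifulSoup(html, 'html.parser')

# With a tag name given, the result is the <a> tags whose text matches
links = soup.find_all('a', string=re.compile('link'))
hrefs = [a['href'] for a in links]
print(hrefs)
```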
  • find()

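    Similar to find_all(), except that find() returns only the first matching element, as a single element, whereas find_all() returns a list of all matching elements.

5. CSS selectors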
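
The difference can be sketched as follows. The HTML is a shortened version of the earlier list example, html.parser is assumed so that the snippet runs without lxml, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<ul id='list-1'><li class='element'>Foo</li><li class='element'>Tom</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.find(name='li')     # a single Tag: the first match
all_li = soup.find_all(name='li')   # a list of every match
missing = soup.find(name='table')   # None when nothing matches

print(first_li.string)
print(len(all_li))
print(missing)
```

Since find() returns None on no match, it is worth checking the result before reading attributes from it.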
from bs4 import BeautifulSoup
html='''
<div class='panel'>
<div class='panel-heading'>
<h4>Hello</h4>
</div>
<div class='panel-body'>
<ul class='list' id='list-1'>
<li class='element'>Foo</li>
<li class='element'>Tom</li>
<li class='element'>Bob</li>
</ul>
<ul class='list list-small' id='list-2'>
<li class='element'>Foo</li>
<li class='element'>Tom</li>
</ul>
</div>
</div>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Tom</li>, <li class="element">Bob</li>, <li class="element">Foo</li>, <li class="element">Tom</li>]
[<li class="element">Foo</li>, <li class="element">Tom</li>]
<class 'bs4.element.Tag'>
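
When only the first match of a CSS selector is needed, select_one() can be used in place of select(). The following is a small sketch added for illustration, assuming the built-in html.parser and a trimmed copy of the HTML above.

```python
from bs4 import BeautifulSoup

html = '''
<ul class='list' id='list-1'>
<li class='element'>Foo</li>
<li class='element'>Tom</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first matching Tag rather than a list
first_item = soup.select_one('#list-1 .element')
print(first_item.string)

# and None when the selector matches nothing
no_match = soup.select_one('table')
print(no_match)
```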
  • Nested selection

    First select the ul nodes, then iterate over them and select the li nodes of each one.

    for ul in soup.select('ul'):
        print(ul.select('li'))
  • Getting attributes

    for ul in soup.select('ul'):
        print(ul['id'])  # read the attribute directly from the Tag
        print(ul.attrs['id'])  # or read the id attribute of each ul node through attrs
    '''
    list-1
    list-1
    list-2
    list-2
    '''
  • Getting text

    To get the text, you can use the string attribute or the get_text() method.

    for li in soup.select('li'):
        print('Get Text',li.get_text())
        print('String:',li.string)
    '''
    Get Text Foo
    String: Foo
    Get Text Tom
    String: Tom
    Get Text Bob
    String: Bob
    Get Text Foo
    String: Foo
    Get Text Tom
    String: Tom
    '''
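
In the output above the two give the same result because every li holds a single piece of text. They differ once a node has more than one child. A minimal sketch of that case, with an invented HTML string and the built-in html.parser as assumptions:

```python
from bs4 import BeautifulSoup

html = "<li class='element'>Foo <b>Bar</b></li><li class='element'>Tom</li>"
soup = BeautifulSoup(html, 'html.parser')

mixed, simple = soup.select('li')

print(mixed.string)      # None: this li has a text node and a <b> child
print(mixed.get_text())  # text of all descendants joined together
print(simple.string)     # a single text child, so string works
```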

   Reposting rules


"Using Beautiful Soup" by White Spider is licensed under the Creative Commons Attribution 4.0 International License.