Python 3 Parsing Library: Beautiful Soup

镇长 · 2017-07-24 · 2023-11-02

Python

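The previous post covered how to request data over the network. This one covers parsing the retrieved data with Beautiful Soup.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.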

Version: 4.4.0

Installing Beautiful Soup

Once Python 3 is installed, a single command is enough.

pip install beautifulsoup4

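Note: on a Mac you may need to use pip3 install beautifulsoup4 instead.

Beautiful Soup also needs an HTML parser. Python's built-in html.parser works out of the box; a popular third-party alternative is lxml: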

pip install lxml

With that, everything is in place.

Quick Start

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

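Kinds of Objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.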

Tag

A Tag object corresponds to a tag in the original XML or HTML document. For example:

>>> tag = soup.b
>>> type(tag)
<class 'bs4.element.Tag'>

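The two most important features of a tag, its name and its attributes, are introduced below. Tags have many more attributes and methods, which are covered in detail under Navigating the Tree and Searching the Tree.

Name

Use .name to get or change a tag's name.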

>>> tag.name
'b'
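Since .name is writable, renaming a tag is a plain assignment. The following is a minimal sketch, assuming the built-in html.parser and the same one-tag document used in the quick start; the new name blockquote is chosen only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

# Assigning to .name renames the tag; the change shows up in the generated markup.
tag.name = 'blockquote'
renamed = str(tag)
print(renamed)  # <blockquote class="boldest">Extremely bold</blockquote>
```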

Attributes

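A tag can have any number of attributes. For example, the tag <b class="boldest"> shown earlier has a class attribute, which you can read by treating the tag like a dictionary: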

>>> tag['class']
['boldest']

To get all of a tag's attributes at once, use .attrs:

>>> tag.attrs
{'class': ['boldest']}

A tag's attributes can also be added, removed, and modified. The operations work the same way as on a dictionary.
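The dictionary-style operations can be sketched as follows. This is an illustrative example rather than anything from the original post: it assumes the built-in html.parser, and the attribute values headline and verybold are made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

tag['id'] = 'headline'       # add a new attribute
tag['class'] = 'verybold'    # overwrite an existing attribute
del tag['class']             # remove an attribute

rendered = str(tag)
missing = tag.get('class')   # .get() returns None instead of raising KeyError
print(rendered)              # <b id="headline">Extremely bold</b>
print(missing)               # None
```

Using tag.get() is the safer choice when an attribute may be absent, because tag['class'] would raise KeyError here.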

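Note: some attributes are multi-valued, meaning a single attribute can hold several values at once. class is the most common example.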
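A short sketch of how a multi-valued attribute differs from an ordinary one, assuming the built-in html.parser and a made-up paragraph tag: class is returned as a list, while id, which is not multi-valued, stays a single string even when it contains a space.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', 'html.parser')

classes = soup.p['class']    # multi-valued attribute: split into a list
element_id = soup.p['id']    # single-valued attribute: kept as one string
print(classes)               # ['body', 'strikeout']
print(element_id)            # my id
```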

NavigableString

Text is usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap these bits of text:

>>> tag.string
'Extremely bold'
>>> type(tag.string)
<class 'bs4.element.NavigableString'>

BeautifulSoup

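The BeautifulSoup object represents the document as a whole. It is not a real HTML or XML tag, so it has no tag name or attributes of its own. Because it is still sometimes useful to look at its .name, it is given the special name [document]:

>>> soup.name
'[document]'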

Comment

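The three classes above cover almost everything you will see in an HTML or XML file, but a few special objects remain, such as comments.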

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> comment = soup.b.string
>>> type(comment)
<class 'bs4.element.Comment'>
>>> comment
'Hey, buddy. Want to buy a used parser?'

A Comment object is a special type of NavigableString.

When it appears as part of an HTML document, a Comment is output with special formatting:

>>> print(soup.b.prettify())
<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
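Because Comment is a kind of NavigableString, code that collects text needs an explicit check to tell the two apart. The sketch below shows one way to do that with isinstance; it assumes the built-in html.parser, and the markup is invented for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup('<p>visible text<!-- hidden note --></p>', 'html.parser')

# Every Comment is also a NavigableString, so test for Comment explicitly.
comments = [node for node in soup.p.contents if isinstance(node, Comment)]
text_only = [node for node in soup.p.contents
             if isinstance(node, NavigableString) and not isinstance(node, Comment)]
print(comments)    # [' hidden note ']
print(text_only)   # ['visible text']
```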

Navigating the Tree

Example:

>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
...     <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')

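Child Nodes

A tag may contain strings and other tags; these are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

Tag Names

The simplest way to navigate the tree is to name the tag you want: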

>>> soup.head 
<head><title>The Dormouse's story</title></head>

>>> soup.title
<title>The Dormouse's story</title>

>>> soup.body.b
<b>The Dormouse's story</b>

.contents and .children

The .contents attribute returns a tag's children as a list:

>>> head_tag = soup.head
>>> head_tag
<head><title>The Dormouse's story</title></head>

>>> head_tag.contents
[<title>The Dormouse's story</title>]
# .contents returns a list

>>> title_tag = head_tag.contents[0]
>>> title_tag
<title>The Dormouse's story</title>


>>> title_tag.contents
["The Dormouse's story"]

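Note: a string has no children, so it has no .contents attribute.

Instead of getting the children as a list, you can iterate over them with the .children generator: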

>>> for child in title_tag.children:
...     print(child)
...
The Dormouse's story

.descendants

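.contents and .children consider only a tag's direct children. The .descendants attribute iterates recursively over all of a tag's descendants: its children, its children's children, and so on.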

>>> for child in head_tag.descendants:
...     print(child)
...
<title>The Dormouse's story</title>
The Dormouse's story

.string

If a tag has exactly one child and that child is a NavigableString, .string returns it:

>>> title_tag.string
"The Dormouse's story"

.strings and .stripped_strings

If a tag contains more than one string, you can iterate over them with .strings.

for string in soup.strings:
    print(repr(string))
    # Typical output (the whitespace-only entries depend on the exact
    # line breaks and indentation in html_doc):
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'

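These strings tend to contain a lot of extra whitespace and blank lines. Use .stripped_strings to skip whitespace-only strings and strip leading and trailing whitespace from the rest.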

for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'

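Parent Nodes

.parent

The .parent attribute returns an element's parent node.

Strings have parents too: the parent of a string is the tag that contains it. The parent of a top-level tag such as <html> is the BeautifulSoup object, and the parent of the BeautifulSoup object is None.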
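The parent relationships described above can be checked directly. This is a minimal sketch, assuming the built-in html.parser and a trimmed-down version of the example document.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, 'html.parser')
title_tag = soup.title

tag_parent = title_tag.parent.name                     # <title> sits inside <head>
string_parent_is_tag = title_tag.string.parent is title_tag
html_parent_is_soup = soup.html.parent is soup         # top-level tag -> BeautifulSoup object
soup_parent = soup.parent                              # nothing above the document itself
print(tag_parent, string_parent_is_tag, html_parent_is_soup, soup_parent)
```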

.parents

.parents iterates over all of an element's ancestors, from its direct parent up to the top of the document.
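As a small illustration of walking upward with .parents, the sketch below collects the name of every ancestor of a link. It assumes the built-in html.parser; the markup is made up for the example.

```python
from bs4 import BeautifulSoup

html = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Walk from the <a> tag up to the BeautifulSoup object, nearest ancestor first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The last entry is the BeautifulSoup object itself, which reports the special name [document].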

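Sibling Nodes

Use the .next_sibling and .previous_sibling attributes to move between nodes on the same level of the tree, that is, nodes that share a parent.

The .next_siblings and .previous_siblings attributes iterate over all of the current node's following or preceding siblings.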
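One detail that often surprises people is that siblings include text nodes, not just tags. The following sketch shows this; it assumes the built-in html.parser and an invented one-paragraph document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a>one</a>, <a>two</a> and <a>three</a></p>', 'html.parser')
first = soup.a

# The immediate neighbour of the first <a> is the text ", ", not the next <a>.
next_node = first.next_sibling
# The first child of <p> has nothing before it.
prev_node = first.previous_sibling
# .next_siblings walks every later sibling, strings and tags alike.
later = [str(node) for node in first.next_siblings]
print(repr(next_node), prev_node, later)
```

To jump straight to the next tag and skip the text in between, find_next_sibling('a') is the usual alternative.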

Searching the Tree

Beautiful Soup defines many methods for searching the tree, for example find() and find_all().
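The practical difference between the two methods is the return type, which the sketch below illustrates. It assumes the built-in html.parser and a made-up two-link document.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')          # first matching tag only
all_links = soup.find_all('a')       # list of every matching tag
no_table = soup.find('table')        # None when nothing matches
no_tables = soup.find_all('table')   # empty list when nothing matches

print(first_link['id'], len(all_links), no_table, list(no_tables))
```

Because find() can return None, check its result before accessing attributes on it.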

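Filters

The common filter types are as follows.

String

The simplest filter. For example, find_all('b') finds every <b> tag.

Regular Expression

Matches names or values that satisfy a regular expression. Beautiful Soup tests each candidate with the pattern's search() method.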
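A brief sketch of a regular-expression filter, assuming the built-in html.parser and an invented document: the pattern ^b selects every tag whose name starts with "b".

```python
import re
from bs4 import BeautifulSoup

html = '<html><body><b>bold</b><p>para</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Tag names are tested with pattern.search(), so "^b" matches <body> and <b>.
names = [tag.name for tag in soup.find_all(re.compile('^b'))]
print(names)  # ['body', 'b']
```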

List

Matches content that equals any one of the items in the list.
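For instance, a list filter can collect two kinds of tags in one pass. The sketch assumes the built-in html.parser; the markup is illustrative only.

```python
from bs4 import BeautifulSoup

html = '<p><b>bold</b> <a href="#">link</a> <i>italic</i></p>'
soup = BeautifulSoup(html, 'html.parser')

# A tag matches if its name equals any item in the list; results keep document order.
names = [tag.name for tag in soup.find_all(['a', 'b'])]
print(names)  # ['b', 'a']
```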

True

The value True matches every tag, but not text strings.
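A quick sketch of the True filter, assuming the built-in html.parser and a small made-up document. Note that the text nodes "t" and "x" do not appear in the result.

```python
from bs4 import BeautifulSoup

html = '<html><head><title>t</title></head><body><p>x</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# True selects every tag in the document, in document order.
names = [tag.name for tag in soup.find_all(True)]
print(names)  # ['html', 'head', 'title', 'body', 'p']
```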

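Function

You can define a function that takes one argument (the element) and returns a boolean. True means the current element matches; False means it does not.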
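A function filter might look like the sketch below, which selects tags that have a class attribute but no id. It assumes the built-in html.parser; the function name has_class_but_no_id and the markup are illustrative choices.

```python
from bs4 import BeautifulSoup

html = '<p class="title">a</p><p class="story" id="s1">b</p><a href="#">c</a>'
soup = BeautifulSoup(html, 'html.parser')

def has_class_but_no_id(tag):
    """Return True for tags that define a class attribute but no id."""
    return tag.has_attr('class') and not tag.has_attr('id')

# find_all() calls the function once per tag and keeps those that return True.
matches = [str(tag) for tag in soup.find_all(has_class_but_no_id)]
print(matches)  # ['<p class="title">a</p>']
```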

find_all

find_all( name , attrs , recursive , string , **kwargs )

Searches all tag descendants of the current tag and returns those that match the filters.

name

The name argument finds every tag whose name matches name.

soup.find_all("title")
# [<title>The Dormouse's story</title>]

Keyword Arguments

Any keyword argument whose name is not one of the built-in search arguments is treated as a filter on the tag attribute of that name.

>>> import re

# id
>>> soup.find_all(id="link2")
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# href
>>> soup.find_all(href=re.compile('elsie'))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# attrs: for attribute names such as data-foo that are not valid keyword arguments
>>> data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
>>> data_soup.find_all(attrs={"data-foo": "value"})
[<div data-foo="value">foo!</div>]

Searching by Class

You can search by CSS class name, but because class is a reserved word in Python, use the keyword argument class_ instead.

>>> soup.find_all("a", class_="sister")
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

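The string Argument

The string argument searches the document's text rather than its tags. It accepts the same kinds of values as name: a string, a regular expression, a list, a function, or True.

>>> soup.find_all(string="Elsie")
['Elsie']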

The limit Argument

Use limit to cap the number of results returned.

>>> soup.find_all("a", limit=2)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]