Using BeautifulSoup4 in a Python Web Scraper

This article covers the basics of using BeautifulSoup4 in a Python web scraper: creating a soup object from an HTML document and working with the four kinds of objects it exposes.

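Web scraping: the BeautifulSoup4 parser

BeautifulSoup makes parsing HTML fairly simple. Its API is easy to use, and it supports CSS selectors, the HTML parser in the Python standard library, and the lxml HTML and XML parsers.

It is simpler to use than regular expressions.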
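
To illustrate the comparison with regular expressions, here is a minimal sketch that extracts the same link texts both ways. The `snippet` string and the variable names are made up for this illustration, and the standard-library `html.parser` is used instead of lxml only so that the sketch has no extra dependency.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
snippet = (
    '<p class="story">'
    '<a class="sister" href="/lacie">Lacie</a> and '
    '<a href="/tillie" class="sister">Tillie</a>'
    '</p>'
)

# Regular expression: the pattern has to anticipate attribute order and spacing
pattern = re.compile(r'<a\s+[^>]*class="sister"[^>]*>(.*?)</a>')
names_by_regex = pattern.findall(snippet)

# BeautifulSoup: a CSS selector states the same intent directly
soup = BeautifulSoup(snippet, "html.parser")
names_by_soup = [a.get_text() for a in soup.select("a.sister")]

print(names_by_regex)
print(names_by_soup)
```

Both approaches find the same two names here, but the selector keeps working if attributes are reordered or quoted differently, whereas the pattern would need to be revisited.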

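Example:

First, import the bs4 library:

#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object and specify the lxml parser
soup = BeautifulSoup(html, "lxml")

# Pretty-print the contents of the soup object
print(soup.prettify())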

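Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a href="http://example.com/elsie" class="sister" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a href="http://example.com/lacie" class="sister" id="link2">
    Lacie
   </a>
   and
   <a href="http://example.com/tillie" class="sister" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>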

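The four kinds of objects

BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

(1) Tag

(2) NavigableString

(3) BeautifulSoup

(4) Comment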
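
As a quick orientation before the sections on each kind, the sketch below produces one object of every kind from a tiny document. The one-line document is made up for this illustration, and `html.parser` is chosen only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Tiny hypothetical document: one tag holding a text node and a comment
soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

p = soup.p
text_node, comment_node = p.contents

kinds = [
    type(soup).__name__,          # the whole document
    type(p).__name__,             # an HTML tag
    type(text_node).__name__,     # text inside a tag
    type(comment_node).__name__,  # an HTML comment
]
print(kinds)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']
```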

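1. Tag

Put simply, a Tag is one of the tags in the HTML, for example:

<title>The Dormouse's story</title>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

HTML tags such as title, head, a, and p, together with the content they enclose, are Tags. Let's use BeautifulSoup to get some Tags: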

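#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object and specify the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the title tag
print(soup.title)

# Print the head tag
print(soup.head)

# Print the first a tag
print(soup.a)

# Print the first p tag
print(soup.p)

# Print the type of soup.p
print(type(soup.p))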

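Output

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>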

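Writing soup followed by a tag name is an easy way to get the tag and its content, and the resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. Retrieving every matching tag requires a search method such as find_all().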
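
The difference between the first match and all matches can be sketched as follows. The shortened document below is made up for this illustration, and `html.parser` is used instead of lxml only to keep the sketch dependency-free.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document
doc = """
<p>Three sisters:
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a></p>
"""
soup = BeautifulSoup(doc, "html.parser")

# Attribute-style access returns only the first <a> tag
first_id = soup.a["id"]

# find_all() returns every matching tag, in document order
all_ids = [a["id"] for a in soup.find_all("a")]

print(first_id)  # link1
print(all_ids)   # ['link1', 'link2', 'link3']
```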

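A Tag has two important attributes: name and attrs.

#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object and specify the lxml parser
soup = BeautifulSoup(html, "lxml")

# The soup object is special: its name is [document]
print(soup.name)

# For any other tag, name is the name of the tag itself
print(soup.head.name)

# Print all attributes of the p tag; attrs is a dictionary
print(soup.p.attrs)

# Print the class attribute of the p tag
print(soup.p['class'])
# get() also returns an attribute by name; for an attribute that exists
# it is equivalent to the line above
print(soup.p.get('class'))

print(soup.p)

# Modify an attribute
soup.p['class'] = "newClass"
print(soup.p)

# Delete an attribute
del soup.p['class']
print(soup.p)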

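Output

[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>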
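
Two details of attribute access are worth a small sketch: class is a multi-valued attribute and comes back as a list, and the two lookup styles differ when an attribute is missing. The one-line document is made up for this illustration, and `html.parser` is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a two-valued class attribute and no id attribute
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]   # multi-valued attribute: a list of strings
name_attr = p["name"]  # ordinary attribute: a plain string
missing = p.get("id")  # get() returns None for a missing attribute

try:
    p["id"]            # subscript access raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(classes)    # ['title', 'big']
print(name_attr)  # dromouse
print(missing)    # None
print(raised)     # True
```

When an attribute may be absent, get() is therefore the safer choice.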

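2. NavigableString

Now that we can get a tag, how do we get the text inside it? Simply use .string, for example:

#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object and specify the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the text inside the first p tag
print(soup.p.string)

# Print the type of soup.p.string
print(type(soup.p.string))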

Output

The Dormouse's story
<class 'bs4.element.NavigableString'>
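
The example above works because the first p tag has a single child. A minimal sketch of what happens otherwise, using a made-up one-line document and `html.parser` to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two children: a <b> tag and a text node
soup = BeautifulSoup("<p><b>Bold</b> and plain</p>", "html.parser")

single = soup.b.string        # one child: .string returns its text
ambiguous = soup.p.string     # several children: .string returns None
all_text = soup.p.get_text()  # get_text() joins all the text inside the tag

print(single)     # Bold
print(ambiguous)  # None
print(all_text)   # Bold and plain
```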

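3. BeautifulSoup

A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a special Tag: like a Tag, it has a name (a plain string) and attributes.

#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object and specify the lxml parser
soup = BeautifulSoup(html, "lxml")

# Type of the name attribute
print(type(soup.name))

# Name
print(soup.name)

# Attributes
print(soup.attrs)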

Output

<class 'str'>
[document]
{}
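
The statement that a BeautifulSoup object can be treated as a special Tag can be checked directly. This is a small sketch with a made-up document, again using `html.parser` rather than lxml to stay dependency-free.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical minimal document
soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

is_tag = isinstance(soup, Tag)     # BeautifulSoup is a subclass of Tag
class_name = type(soup).__name__
found = soup.find("p").string      # Tag search methods work on soup itself

print(is_tag)      # True
print(class_name)  # BeautifulSoup
print(soup.name)   # [document]
print(found)       # hi
```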

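4. Comment

A Comment object is a special type of NavigableString. When printed, its content does not include the comment delimiters.

#!/usr/bin/python3
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object and specify the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the first a tag
print(soup.a)

# Print its content
print(soup.a.string)

# Print the type of its content
print(type(soup.a.string))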

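Output

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>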

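The content of the a tag is actually a comment, but when we print it with .string the comment delimiters (<!-- and -->) have already been stripped, so it can be mistaken for ordinary text unless its type is checked.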
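
One way to guard against this is to check the type before using the string. The sketch below uses a made-up two-link document and `html.parser` (to avoid the lxml dependency) and skips any link whose content is a comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical document: the first link contains only a comment
doc = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(doc, "html.parser")

visible = []
for a in soup.find_all("a"):
    content = a.string
    if isinstance(content, Comment):
        continue  # skip commented-out text
    visible.append(str(content))

print(visible)  # ['Lacie']
```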

Reposted from: http://cdkjz.cn/article/ijoepc.html
