How to Extract the Main Content of a Web Page with Python - 创新互联

This article explains how to extract the title, publish time, and main content of a news web page with Python.


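A typical news page consists of several distinct regions:

[Figure: regions of a news web page]

The news elements we want to extract are found in:

Title region
Metadata region (publish time, etc.)
Image region (if you also want to extract the images)
Main content region

Text in the navigation bar and related-links regions is not part of the news item.

The title, publish time, and main content of a news item are generally extracted from the HTML we crawl. If all the pages come from a single site, extracting these three items is simple: three regular expressions will do the job. But our crawler fetches pages from hundreds or thousands of sites. Writing regular expressions for that many formats is exhausting, and an expression can stop working as soon as a site changes its layout slightly, so maintaining them is just as exhausting.

An approach that exhausting is clearly not workable, so we need a better algorithm.

1. Title extraction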

The title almost always appears in the page's <title> tag, but there it carries extra information such as the channel name and the site name.

The title also appears in the visible "title region" of the page.

Which of these two places is easier to extract the title from?

The "title region" has no distinctive marker, and its HTML differs widely from site to site, so it is hard to locate reliably.

That leaves the <title> tag. Extracting it is easy, whether with a regular expression or with lxml; the hard part is stripping the channel name, site name, and similar additions.

Here is what the extra information in <title> looks like in practice:

上海用“智慧”激活城市交通脉搏，让道路更安全更有序更通畅_浦江头条_澎湃新闻-The Paper
“沪港大学联盟”今天在复旦大学成立_教育_新民网
三亚老人脚踹司机致公交车失控撞墙被判刑3年_社会
外交部：中美外交安全对话9日在美举行
进博会：中国行动全球瞩目，中国担当世界点赞_南方观澜_南方网
资本市场迎来重大改革设立科创板有何深意？-新华网

These titles show that the headline is normally separated from the channel name and site name by connector characters. We can therefore split the title on those connectors and take the longest part as the headline.

This idea is easy to implement, so no code is given here; it is left to readers as an exercise.

2. Publish-time extraction

The publish time is the time at which the page went live on its site. It usually appears below the headline, in the metadata region. In the HTML this region has no special feature we can use to locate it, and across a large number of site layouts locating it is close to impossible. We need a different approach.
As with titles, let us first look at how some sites write the publish time:

央视网2018年11月06日 22:22
时间：2018-11-07 14:27:00
2018-11-07 11:20:37 来源：新华网
来源：中国日报网 2018-11-07 08:06:39
2018年11月07日 07:39:19
2018-11-06 09:58 来源：澎湃新闻

All of these have one thing in common: each contains a string that expresses a time, built from year, month, day, hour, minute, and second. By writing regular expressions for the handful of common time formats, we can match the page text and extract the publish time.

This idea is also easy to implement, but there are many details. The patterns should cover as many formats as possible, so writing a good publish-time extraction function is not trivial. This too is left to readers as an exercise.

3. Main content extraction

The main content (including news images) is the body of a news page. Visually it occupies the middle of the page and is the main text region of the news item. There are many ways to extract it, some complex and some simple. The method described here is a simple, fast one that comes out of the author's years of practice; let us call it the "node text density method".

A page's HTML is a tree of tags, and each tag is a node in that tree. By traversing every node of the tree and finding the node with the most text, we find the node that holds the main content. The code below follows this idea.

3.1 Source code

```python
#!/usr/bin/env python3
# File: maincontent.py
# Author: veelion

import re
import traceback

import cchardet
import lxml
import lxml.etree
import lxml.html
from lxml.html import HtmlComment

# Keywords that often appear in class/id attributes and hint at whether
# a node is (or is not) part of the main content.
REGEXES = {
    'okMaybeItsACandidateRe': re.compile(
        'and|article|artical|body|column|main|shadow', re.I),
    'positiveRe': re.compile(
        ('article|arti|body|content|entry|hentry|main|page|'
         'artical|zoom|arti|context|message|editor|'
         'pagination|post|txt|text|blog|story'), re.I),
    'negativeRe': re.compile(
        ('copyright|combx|comment|com-|contact|foot|footer|footnote|decl|copy|'
         'notice|'
         'masthead|media|meta|outbrain|promo|related|scroll|link|pagebottom|bottom|'
         'other|shoutbox|sidebar|sponsor|shopping|tags|tool|widget'), re.I),
}


class MainContent:
    def __init__(self,):
        # Tags that never hold main content.
        self.non_content_tag = set([
            'head', 'meta', 'script', 'style', 'object', 'embed',
            'iframe', 'marquee', 'select',
        ])
        self.title = ''
        self.p_space = re.compile(r'\s')
        self.p_html = re.compile(r'<html|</html>', re.IGNORECASE|re.DOTALL)
        self.p_content_stop = re.compile(r'正文.*结束|正文下|相关阅读|声明')
        self.p_clean_tree = re.compile(r'author|post-add|copyright')

    def get_title(self, doc):
        title = ''
        title_el = doc.xpath('//title')
        if title_el:
            title = title_el[0].text_content().strip()
        if len(title) < 7:
            tt = doc.xpath('//meta[@name="title"]')
            if tt:
                title = tt[0].get('content', '')
        if len(title) < 7:
            tt = doc.xpath('//*[contains(@id, "title") or contains(@class, "title")]')
            if not tt:
                tt = doc.xpath('//*[contains(@id, "font01") or contains(@class, "font01")]')
            for t in tt:
                ti = t.text_content().strip()
                if ti in title and len(ti)*2 > len(title):
                    title = ti
                    break
                if len(ti) > 20:
                    continue
                if len(ti) > len(title) or len(ti) > 7:
                    title = ti
        return title

    def shorten_title(self, title):
        # Split on the first separator found and keep the leading part.
        spliters = [' - ', '–', '—', '-', '|', '::']
        for s in spliters:
            if s not in title:
                continue
            tts = title.split(s)
            if len(tts) < 2:
                continue
            title = tts[0]
            break
        return title

    def calc_node_weight(self, node):
        weight = 1
        attr = '%s %s %s' % (
            node.get('class', ''),
            node.get('id', ''),
            node.get('style', '')
        )
        if attr:
            mm = REGEXES['negativeRe'].findall(attr)
            weight -= 2 * len(mm)
            mm = REGEXES['positiveRe'].findall(attr)
            weight += 4 * len(mm)
        if node.tag in ['div', 'p', 'table']:
            weight += 2
        return weight

    def get_main_block(self, url, html, short_title=True):
        ''' return (title, etree_of_main_content_block)
        '''
        if isinstance(html, bytes):
            encoding = cchardet.detect(html)['encoding']
            if encoding is None:
                return None, None
            html = html.decode(encoding, 'ignore')
        try:
            doc = lxml.html.fromstring(html)
            doc.make_links_absolute(base_url=url)
        except Exception:
            traceback.print_exc()
            return None, None
        self.title = self.get_title(doc)
        if short_title:
            self.title = self.shorten_title(self.title)
        body = doc.xpath('//body')
        if not body:
            return self.title, None
        candidates = []
        # Breadth-first traversal starting from the children of <body>.
        nodes = body[0].getchildren()
        while nodes:
            node = nodes.pop(0)
            children = node.getchildren()
            tlen = 0
            for child in children:
                if isinstance(child, HtmlComment):
                    continue
                if child.tag in self.non_content_tag:
                    continue
                if child.tag == 'a':
                    continue
                if child.tag == 'textarea':
                    # FIXME: this tag is only part of content?
                    continue
                attr = '%s%s%s' % (child.get('class', ''),
                                   child.get('id', ''),
                                   child.get('style', ''))
                if 'display' in attr and 'none' in attr:
                    continue
                nodes.append(child)
                # Text that sits in a <p> counts three times as much.
                if child.tag == 'p':
                    weight = 3
                else:
                    weight = 1
                text = '' if not child.text else child.text.strip()
                tail = '' if not child.tail else child.tail.strip()
                tlen += (len(text) + len(tail)) * weight
            if tlen < 10:
                continue
            weight = self.calc_node_weight(node)
            candidates.append((node, tlen*weight))
        if not candidates:
            return self.title, None
        candidates.sort(key=lambda a: a[1], reverse=True)
        good = candidates[0][0]
        # If the best node is an inline-ish block, climb to an enclosing <div>.
        if good.tag in ['p', 'pre', 'code', 'blockquote']:
            for i in range(5):
                parent = good.getparent()
                if parent is None:
                    break
                good = parent
                if good.tag == 'div':
                    break
        good = self.clean_etree(good, url)
        return self.title, good

    def clean_etree(self, tree, url=''):
        to_drop = []
        drop_left = False
        for node in tree.iterdescendants():
            if drop_left:
                to_drop.append(node)
                continue
            if isinstance(node, HtmlComment):
                to_drop.append(node)
                # A comment such as "正文结束" marks the end of the content.
                if self.p_content_stop.search(node.text or ''):
                    drop_left = True
                continue
            if node.tag in self.non_content_tag:
                to_drop.append(node)
                continue
            attr = '%s %s' % (
                node.get('class', ''),
                node.get('id', '')
            )
            if self.p_clean_tree.search(attr):
                to_drop.append(node)
                continue
            # Drop nodes whose text is mostly link text (related news, etc.).
            aa = node.xpath('.//a')
            if aa:
                text_node = len(self.p_space.sub('', node.text_content()))
                text_aa = 0
                for a in aa:
                    alen = len(self.p_space.sub('', a.text_content()))
                    if alen > 5:
                        text_aa += alen
                if text_aa > text_node * 0.4:
                    to_drop.append(node)
        for node in to_drop:
            try:
                node.drop_tree()
            except Exception:
                pass
        return tree

    def get_text(self, doc):
        lxml.etree.strip_elements(doc, 'script')
        lxml.etree.strip_elements(doc, 'style')
        for ch in doc.iterdescendants():
            if not isinstance(ch.tag, str):
                continue
            # Add line breaks after block-level tags.
            if ch.tag in ['div', 'h2', 'h3', 'h4', 'p', 'br', 'table', 'tr', 'dl']:
                if not ch.tail:
                    ch.tail = '\n'
                else:
                    ch.tail = '\n' + ch.tail.strip() + '\n'
            # Separate table cells with spaces.
            if ch.tag in ['th', 'td']:
                if not ch.text:
                    ch.text = '  '
                else:
                    ch.text += '  '
            # if ch.tail:
            #     ch.tail = ch.tail.strip()
        lines = doc.text_content().split('\n')
        content = []
        for l in lines:
            l = l.strip()
            if not l:
                continue
            content.append(l)
        return '\n'.join(content)

    def extract(self, url, html):
        '''return (title, content)
        '''
        title, node = self.get_main_block(url, html)
        if node is None:
            print('\tno main block got !!!!!', url)
            return title, ''
        content = self.get_text(node)
        return title, content
```

3.2 Code walkthrough

As with the news crawler, the whole algorithm is implemented as one class: MainContent.

First, a global variable REGEXES is defined. It collects keywords that often appear in the class and id attributes of tags and that indicate whether a tag is likely to be main content or not. These keywords are used to compute a weight for each tag node, which is the job of the method calc_node_weight().

The MainContent constructor first defines the set of tags that never contain main content, self.non_content_tag. Nodes with these tags are simply skipped.

Title extraction is implemented in get_title(). It first reads the content of the <title> tag, then tries to find a title in <meta>, then looks in <body> for nodes whose id or class contains "title", and finally compares the candidates from these places to decide on the title. The comparison rules are:

A candidate found in <meta> or <body> that is contained in the <title> text is treated as a clean title (no channel name or site name).
A candidate that is too long is ignored.
The <title> tag is the primary source of the title.

Taking the title from the <title> tag means the title has to be cleaned. A simple method is implemented for this: shorten_title(). It splits the title on the first separator from a fixed list that occurs in it and keeps the leading part.

In this implementation, lxml.html turns the page's HTML into a tree. Starting from the body node, we visit every node, measure the length of the text its children hold directly (not the text of deeper descendants), and pick the node with the most text. This is implemented in get_main_block(). Some of its details are worth reading closely.

One such detail is clean_etree(). The node returned by get_main_block() may contain links to related news. These links carry many news headlines and, if not removed, pollute the extracted content with related headlines, summaries, and so on.

Another detail is get_text(). When extracting text from the main block we do not call text_content() directly; we first adjust the formatting, for example by adding a newline \n after certain tags and spaces between table cells. The resulting text is closer to the layout of the original page.

Crawler notes

1. The cchardet module
A module for quickly detecting the encoding of text.

2. The lxml.html module
A module that parses HTML into a structure and lets you query the page with XPath. It is efficient and easy to use, and is a staple for writing crawlers.

3. The complexity of content extraction
The main content extraction algorithm implemented here can, for the most part, correctly handle more than 90% of news pages.
However, just as no two web pages are alike, no extraction algorithm works once and for all. When you use this algorithm at scale you will run into unusual pages, and you will then need to refine this class to handle them.

When reposting, please cite: http://cdkjz.cn/article/dccojj.html
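The title exercise from section 1 (split the `<title>` text on connector characters and keep the longest part) might be sketched as follows. This is only one possible answer, not the author's implementation: the function name `pick_longest_segment` and the separator set are assumptions, chosen from the sample titles in section 1.

```python
import re

# Assumed separator set, taken from the sample titles: "_", "|", "-", "–", "—", "::".
_SEPARATORS = re.compile(r'\s*(?:::|[_|\-–—])\s*')


def pick_longest_segment(raw_title: str) -> str:
    """Split a <title> string on common separators and return the longest piece.

    Hypothetical helper: it assumes the headline is longer than the
    channel name and the site name, which is the heuristic from section 1.
    """
    parts = [p.strip() for p in _SEPARATORS.split(raw_title) if p.strip()]
    if not parts:
        return raw_title.strip()
    # max() returns the first longest piece when several have equal length.
    return max(parts, key=len)


if __name__ == '__main__':
    print(pick_longest_segment('“沪港大学联盟”今天在复旦大学成立_教育_新民网'))
```

Note that a plain hyphen is treated as a separator, so a headline that itself contains a hyphen would be split. In practice the separator list probably needs tuning per corpus.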

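For the publish-time exercise in section 2, a minimal sketch could use a single regular expression that covers the date formats shown in that section's samples. The name `extract_pubtime` and the pattern itself are assumptions for illustration; real pages will need more formats (two-digit years, full-width colons, relative times, and so on).

```python
import re
from datetime import datetime
from typing import Optional

# Assumed pattern: year, month, day separated by "-", "/", "." or 年/月/日,
# optionally followed by HH:MM or HH:MM:SS.
_PUBTIME = re.compile(
    r'(\d{4})\s*[-/年.]\s*(\d{1,2})\s*[-/月.]\s*(\d{1,2})\s*日?'
    r'(?:\s*(\d{1,2}):(\d{2})(?::(\d{2}))?)?'
)


def extract_pubtime(text: str) -> Optional[datetime]:
    """Return the first valid date/time found in text, or None."""
    for m in _PUBTIME.finditer(text):
        year, month, day = int(m.group(1)), int(m.group(2)), int(m.group(3))
        hour = int(m.group(4) or 0)
        minute = int(m.group(5) or 0)
        second = int(m.group(6) or 0)
        try:
            return datetime(year, month, day, hour, minute, second)
        except ValueError:
            # Digits matched but do not form a real date; try the next match.
            continue
    return None


if __name__ == '__main__':
    print(extract_pubtime('来源：中国日报网 2018-11-07 08:06:39'))
```

Returning the first match is a simplification: on a full page the first date in the text is not always the publish time, so a fuller version might prefer matches near the title or those that include a time of day.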

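To make the "node text density" idea from section 3 easier to follow, here is a deliberately simplified sketch that uses only the standard library's `html.parser` instead of lxml. It is not the MainContent class: it credits each piece of text to the parent of the element that directly holds it (tripled for `<p>`, as in get_main_block()), ignores tail text and the class/id keyword weights, and all names in it (`TextDensityParser`, `densest_node`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements without an end tag; they never go on the stack.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}
# Text inside these elements is not counted (links, scripts, styles).
IGNORED_TAGS = {'script', 'style', 'a'}


class TextDensityParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.closed = []  # elements that have been closed

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append({'tag': tag, 'attrs': dict(attrs), 'score': 0})

    def handle_endtag(self, tag):
        # Ignore stray end tags that match no open element.
        if all(e['tag'] != tag for e in self.stack):
            return
        # Close everything up to and including the matching element.
        while self.stack:
            element = self.stack.pop()
            self.closed.append(element)
            if element['tag'] == tag:
                break

    def handle_data(self, data):
        length = len(data.strip())
        if length == 0 or len(self.stack) < 2:
            return
        if any(e['tag'] in IGNORED_TAGS for e in self.stack):
            return
        holder, parent = self.stack[-1], self.stack[-2]
        # Text held by a <p> counts three times, as in get_main_block().
        parent['score'] += length * (3 if holder['tag'] == 'p' else 1)


def densest_node(html):
    """Return the element dict with the highest text score, or None."""
    parser = TextDensityParser()
    parser.feed(html)
    parser.close()
    elements = parser.closed + parser.stack
    if not elements:
        return None
    best = max(elements, key=lambda e: e['score'])
    return best if best['score'] > 0 else None


if __name__ == '__main__':
    page = ('<html><body><div id="nav"><a href="/">Home</a></div>'
            '<div id="content"><p>First paragraph of the story.</p>'
            '<p>Second paragraph of the story.</p></div></body></html>')
    print(densest_node(page))
```

The standard-library parser does not repair malformed HTML the way lxml does, so this sketch is for understanding the scoring step only; for real pages the lxml-based class in section 3.1 is the better starting point.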