测试积点老人 posted on 2020-5-22 13:59:43

How can I include news_detail4 in the comparison as well?

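The Python return value news_detail4 is not accepted as a string by get_equal_rate_1. How can I include news_detail4 in the comparison as well?
I first fetch the content of each news page and then compare the results. The first three scraped return values can be compared, but the fourth cannot. What should I do?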

import difflib
from xml.etree.ElementTree import tostring
import requests
from lxml import etree
import time
from gne import GeneralNewsExtractor
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options


def get_chinanew_data():
    # Fetch one chinanews.com article and extract its fields with XPath.
    cookies = {
        'Hm_lvt_0da10fbf73cda14a786cd75b91f6beab': '1587367903',
        'Hm_lpvt_0da10fbf73cda14a786cd75b91f6beab': '1587375545',
    }
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    response = requests.get('http://www.chinanews.com/gn/2020/04-20/9162019.shtml', headers=headers, cookies=cookies,
                            verify=False)
    html = response.content.decode(errors='ignore')
    etree_html = etree.HTML(html)
    # xpath() returns a list, so each result must be indexed before use.
    # The indexes below seem to have been lost when the post was rendered;
    # the values chosen here are assumptions.
    main = etree_html.xpath('//div[@id="cont_1_1_2"]')[0]
    title = main.xpath('./h1/text()')[0]
    pub_time = main.xpath(".//div/div[@class='left-t']/text()")[0]
    # Assumed: the author name is the part after the last colon.
    author = main.xpath('./div/div/div/span/text()')[0][:-2].split(':')[-1]
    # Assumed: the first two whitespace-separated tokens are the date and the time.
    time_parts = pub_time.split()
    pubtime = time_parts[0] + ' ' + time_parts[1]
    content = ''.join(main.xpath('./div[@class="left_zw"]/p/text()')).strip()
    site_url = 'http://www.chinanews.com/gn/2020/04-20/9162019.shtml'
    site_name = '中国新闻网'
    news_detail = {
        'pub_time': pubtime.replace('年', '-').replace('月', '-').replace('日', ''),
        'author': author,
        'title': title,
        'content': content.replace('\u3000', ''),
        'site_url': site_url,
        'site_name': site_name,
    }
    return news_detail


def selenium_download_data():
    # Headless Chrome for the pages that are fetched with Selenium.
    options = Options()
    options.add_argument('--headless')
    driver = Chrome(options=options, executable_path=r"C:\Users\常乐添\AppData\Local\Google\Chrome\Application\chromedriver.exe")
    url_list = [
        'https://news.sina.com.cn/gov/xlxw/2020-04-20/doc-iircuyvh8766402.shtml',
        'https://news.ifeng.com/c/7vovtvQ2gVc',
        'https://baijiahao.baidu.com/s?id=1664460259411900230&wfr=spider&for=pc',
    ]


海海豚 posted on 2020-5-25 09:17:36

return difflib.SequenceMatcher(None, str1, str2).quick_ratio()
->
return str(difflib.SequenceMatcher(None, str1, str2).quick_ratio())
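
The change above converts the returned ratio to a string. If the actual problem is that one of the inputs (such as news_detail4) is not a string, a different approach is to normalize the inputs before comparing them. A minimal sketch, assuming get_equal_rate_1 is a thin wrapper around difflib.SequenceMatcher as in the line quoted above, and that news_detail4 may be a dict shaped like news_detail. The to_text helper and its use of the 'content' key are hypothetical, not part of the original code:

```python
import difflib


def to_text(value):
    """Flatten a scraped value (str, dict, list, tuple or None) into one string."""
    if value is None:
        return ''
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        # Hypothetical convention: compare only the article body.
        return to_text(value.get('content', ''))
    if isinstance(value, (list, tuple)):
        # xpath() results are lists of text nodes; join them into one string.
        return ''.join(to_text(item) for item in value)
    return str(value)


def get_equal_rate_1(str1, str2):
    # Assumed shape of the original helper, with both inputs coerced to str first.
    return difflib.SequenceMatcher(None, to_text(str1), to_text(str2)).quick_ratio()
```

With this in place, a call such as get_equal_rate_1(news_detail['content'], news_detail4) should no longer depend on whether the fourth result is a plain string, a list of text nodes, or a whole dict.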

郭小贱 posted on 2020-5-25 09:49:11

See this article: https://ask.csdn.net/questions/1066805

bellas posted on 2020-5-25 09:53:45

Here to learn.

jingzizx posted on 2020-5-25 11:51:39

Learning.

litingting0214 posted on 2020-5-25 15:51:45

Learning.