51Testing Software Testing Forum

Title: How can news_detail4 be included in the comparison as well?

Author: 测试积点老人    Time: 2020-5-22 13:59
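The Python return value news_detail4 is not accepted as a string by get_equal_rate_1. How can news_detail4 be included in the comparison as well?
I first fetch the content of each news page and then compare the results. The first three scraped return values can be compared, but the fourth cannot. What should I do?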
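
One possible approach is to normalize every value to plain text before it reaches difflib. The sketch below assumes that news_detail4 comes back as a dict, list, or None rather than a str (the post does not show its actual type), and that get_equal_rate_1 wraps difflib.SequenceMatcher(...).quick_ratio(), as the return line quoted in the first reply suggests. The to_text helper is hypothetical and not part of the original code.

```python
import difflib


def to_text(value):
    """Flatten a str, bytes, dict, list/tuple, or None into one plain string."""
    if value is None:
        return ''
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode('utf-8', errors='ignore')
    if isinstance(value, dict):
        # Join the dict's values (not its keys) in insertion order.
        return ' '.join(to_text(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return ' '.join(to_text(v) for v in value)
    # Fallback for numbers and other objects.
    return str(value)


def get_equal_rate_1(str1, str2):
    """Assumed shape of the asker's function, with inputs coerced to str first."""
    return difflib.SequenceMatcher(None, to_text(str1), to_text(str2)).quick_ratio()
```

With this in place, a call such as get_equal_rate_1(news_detail, news_detail4) would compare the flattened text of both values regardless of which container type each one is.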

import difflib
from xml.etree.ElementTree import tostring
import requests
from lxml import etree
import time
from gne import GeneralNewsExtractor
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options


def get_chinanew_data():
    # Fetch one chinanews.com article with requests and extract its fields via XPath.
    cookies = {
        'Hm_lvt_0da10fbf73cda14a786cd75b91f6beab': '1587367903',
        'Hm_lpvt_0da10fbf73cda14a786cd75b91f6beab': '1587375545',
    }
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }

    # verify=False disables TLS certificate verification.
    response = requests.get('http://www.chinanews.com/gn/2020/04-20/9162019.shtml', headers=headers, cookies=cookies,
                            verify=False)
    # Decode as UTF-8 and drop any bytes that cannot be decoded.
    html = response.content.decode(errors='ignore')
    etree_html = etree.HTML(html)
    # Each xpath(...)[0] below raises IndexError if the page layout does not match.
    main = etree_html.xpath('//div[@id="cont_1_1_2"]')[0]
    title = main.xpath('./h1/text()')[0]
    pub_time = main.xpath(".//div[3]/div[@class='left-t']/text()")[0]
    author = main.xpath('./div[5]/div[2]/div/span/text()')[0][:-2].split(':')[1]
    # Keep only the first two whitespace-separated tokens (date and time).
    pubtime = pub_time.split()[0] + ' ' + pub_time.split()[1]
    content = ''.join(main.xpath('./div[@class="left_zw"]/p/text()')).strip()
    site_url = 'http://www.chinanews.com/gn/2020/04-20/9162019.shtml'
    site_name = '中国新闻网'
    news_detail = {
        'pub_time': pubtime.replace('年', '-').replace('月', '-').replace('日', ''),
        'author': author,
        'title': title,
        'content': content.replace('\u3000', ''),
        'site_url': site_url,
        'site_name': site_name,
    }
    return news_detail


def selenium_download_data():
    # Headless Chrome is used for the remaining three news sites.
    options = Options()
    options.add_argument('--headless')
    driver = Chrome(options=options, executable_path=r"C:\Users\常乐添\AppData\Local\Google\Chrome\Application\chromedriver.exe")
    url_list = [
        'https://news.sina.com.cn/gov/xlxw/2020-04-20/doc-iircuyvh8766402.shtml',
        'https://news.ifeng.com/c/7vovtvQ2gVc',
        'https://baijiahao.baidu.com/s?id=1664460259411900230&wfr=spider&for=pc']
    # (The posted code ends here.)
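
Since the goal is to compare all four results with each other, the pairwise step might look like the sketch below. It assumes each result is a dict with a 'content' key, like the news_detail dict built in get_chinanew_data. The names content_similarity and compare_all and the sample data are hypothetical illustrations, not code from the post.

```python
import difflib
from itertools import combinations


def content_similarity(detail_a, detail_b):
    """Compare the 'content' fields of two result dicts; missing or None counts as ''."""
    text_a = str(detail_a.get('content') or '')
    text_b = str(detail_b.get('content') or '')
    return difflib.SequenceMatcher(None, text_a, text_b).quick_ratio()


def compare_all(details):
    """Return {(name_a, name_b): score} for every unordered pair of results."""
    return {
        (name_a, name_b): content_similarity(details[name_a], details[name_b])
        for name_a, name_b in combinations(details, 2)
    }


# Made-up sample data standing in for the four scraped results.
sample = {
    'news_detail1': {'content': 'abc'},
    'news_detail2': {'content': 'abc'},
    'news_detail4': {'content': None},
}
scores = compare_all(sample)
```

Coercing the field with str(... or '') means a result whose content is missing or None still yields a score instead of raising, which makes it easier to see which of the four results is the odd one out.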

Author: 海海豚    Time: 2020-5-25 09:17
return difflib.SequenceMatcher(None, str1, str2).quick_ratio()
->
return str(difflib.SequenceMatcher(None, str1, str2).quick_ratio())
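
For context on the change suggested above: quick_ratio() returns a float, and wrapping it in str() only changes the type of the returned score. It does not convert the two inputs. A minimal sketch of both variants follows, with hypothetical function names (similarity, similarity_as_text) and made-up sample strings.

```python
import difflib


def similarity(str1, str2):
    """Return the quick_ratio() score as a float between 0 and 1."""
    return difflib.SequenceMatcher(None, str1, str2).quick_ratio()


def similarity_as_text(str1, str2):
    """Same score, converted to str (for example for printing or concatenation)."""
    return str(similarity(str1, str2))


# quick_ratio() is an upper bound on ratio(): it ignores character order.
score = similarity('abcd', 'dcba')
exact = difflib.SequenceMatcher(None, 'abcd', 'dcba').ratio()
```

Here score is 1.0 because both strings contain the same characters, while exact is lower because ratio() also takes their order into account.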
Author: 郭小贱    Time: 2020-5-25 09:49
See this article: https://ask.csdn.net/questions/1066805
Author: bellas    Time: 2020-5-25 09:53
Here to learn.
Author: jingzizx    Time: 2020-5-25 11:51
Learning.
Author: litingting0214    Time: 2020-5-25 15:51
Learning.



