51Testing Software Testing Forum

A Brief Look at Python Data Scraping with Selenium and PhantomJS

    Posted on 2018-2-7 16:46:35
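            I have always regarded Selenium and PhantomJS, tools from the ops world, as heavy machinery:
            a big weapon you should not draw unless you absolutely have to.
            But as soon as a page relies on JavaScript or Ajax rendering, Requests is completely lost!

            Recently I went back and took a fresh look at this pair,
            and the heavy weapon turned out to be quite light to use.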

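            1. Installing Selenium and PhantomJS

            Selenium can be installed directly with pip. PhantomJS is an exe executable that has to be downloaded and unzipped; when using it, just specify the absolute path of the exe.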
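As a rough sketch of the "specify the absolute path" step, the helper below turns a configured location into an absolute path and fails early if the file is missing. The name `resolve_phantomjs` and the fallback of searching the PATH are assumptions made for illustration; neither is part of Selenium.

```python
import os
import shutil


def resolve_phantomjs(path=None):
    """Return an absolute path to the PhantomJS executable.

    `path` is the location of the unzipped executable. When it is omitted,
    fall back to searching the PATH (an assumption of this sketch).
    """
    if path is None:
        found = shutil.which("phantomjs")
        if found is None:
            raise FileNotFoundError("phantomjs not found on PATH")
        return os.path.abspath(found)
    absolute = os.path.abspath(path)
    if not os.path.isfile(absolute):
        raise FileNotFoundError("no PhantomJS executable at " + absolute)
    return absolute
```

The result can then be passed as the `executable_path` argument of `webdriver.PhantomJS`, so a wrong path shows up as a clear error before the driver is started.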

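            2. Basic Selenium and PhantomJS settings
    import time

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Copy the defaults so the shared class-level dict is not modified
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = "Mozilla/4.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393"
    # A different request header yields a differently adapted page layout. What a trap!
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    # Set timeouts: there is no need to wait for a page to load completely before scraping it
    driver.set_page_load_timeout(10)
    driver.set_script_timeout(10)

    inurl = "http://pythonscraping.com/pages/javascript/ajaxDemo.html"  # example URL; replace with the page to scrape

    # Fetch the page source
    try:
        driver.get(inurl)
        time.sleep(1)
    except TimeoutException:
        # Loading took too long: stop it and use whatever has been rendered so far
        driver.execute_script('window.stop()')
    content = driver.page_source
    # print(content)
    driver.quit()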
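The complete examples in section 5 wrap the default capabilities in `dict(...)` before setting the user agent. Here is a minimal sketch of why that copy matters. It uses a stand-in class, `FakeCapabilities` (hypothetical, so the snippet runs without Selenium installed), that mimics how the defaults are stored in a class-level dict.

```python
class FakeCapabilities:
    """Stand-in for DesiredCapabilities: defaults live in a class-level dict."""
    PHANTOMJS = {"browserName": "phantomjs", "javascriptEnabled": True}


UA_KEY = "phantomjs.page.settings.userAgent"

# Copying first keeps the shared defaults untouched.
dcap = dict(FakeCapabilities.PHANTOMJS)
dcap[UA_KEY] = "Mozilla/5.0 (example user agent)"
print(UA_KEY in FakeCapabilities.PHANTOMJS)  # False

# Assigning without a copy only creates an alias, so every later user
# of the shared defaults sees the custom user agent as well.
alias = FakeCapabilities.PHANTOMJS
alias[UA_KEY] = "Mozilla/5.0 (example user agent)"
print(UA_KEY in FakeCapabilities.PHANTOMJS)  # True
```

In a script that creates a single driver the difference is invisible; it starts to matter once several drivers with different settings are created in the same process.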
            3. Basic Selenium and PhantomJS operations
             If your network and machine are fast enough, you basically do not need to wait for the page to render.
             Otherwise you still have to wait, and time.sleep() is a rather clumsy way to do it:
    # Wait for the page to finish rendering
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    ...
    try:
        # Wait for a particular element to appear: no duck in sight, no hawk released.
        element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
    finally:  # release the hawk
        print(driver.find_element_by_id("content").text)
        driver.close()
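Conceptually, an explicit wait polls a condition until it returns something truthy or a deadline passes. The sketch below shows that idea in plain Python. `wait_until` is a hypothetical helper written for illustration, not Selenium's `WebDriverWait`, and the 0.5 second poll interval is an assumed default.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll)


# Example: a condition that becomes true on the third check.
checks = {"count": 0}


def ready_on_third_check():
    checks["count"] += 1
    return "ready" if checks["count"] >= 3 else None


print(wait_until(ready_on_third_check, timeout=2.0, poll=0.01))  # ready
```

Unlike a fixed `time.sleep(3)`, this returns as soon as the condition holds, and it fails loudly when it never does.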
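             Or use the following (an excerpt from the waitForLoad function in the complete examples of section 5, where elem is the html element captured before the redirect):
    try:
        elem == driver.find_element_by_tag_name("html")
        # A StaleElementReferenceException means elem no longer exists, i.e. the page has already navigated away.
    except StaleElementReferenceException:
        return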
          Other built-in driver functions can be found by reading the source code or through PyCharm's code completion.
            4. Locating HTML tags with XPath
    1. By id: find_element_by_id(self, id_)
    2. By name: find_element_by_name(self, name)
    3. By class: find_element_by_class_name(self, name)
    4. By tag: find_element_by_tag_name(self, name)
    5. By link text: find_element_by_link_text(self, link_text)
    6. By partial link text: find_element_by_partial_link_text(self, link_text)
    7. By xpath: find_element_by_xpath(self, xpath)
    8. By css: find_element_by_css_selector(self, css_selector)
    9. By id, plural: find_elements_by_id(self, id_)
    10. By name, plural: find_elements_by_name(self, name)
    11. By class, plural: find_elements_by_class_name(self, name)
    12. By tag, plural: find_elements_by_tag_name(self, name)
    13. By link text, plural: find_elements_by_link_text(self, text)
    14. By partial link text, plural: find_elements_by_partial_link_text(self, link_text)
    15. By xpath, plural: find_elements_by_xpath(self, xpath)
    16. By css, plural: find_elements_by_css_selector(self, css_selector)
    17. find_element(self, by='id', value=None)
    18. find_elements(self, by='id', value=None)
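             The element methods are singular: they locate one element directly. The elements methods are plural, as anyone who has studied English will know: they locate a group of elements and return a list. Think of findall in the re module.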
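The comparison with the re module can be made concrete without a browser. In this sketch the HTML string and the patterns are made up for illustration: `re.search` plays the role of the singular methods and `re.findall` the role of the plural ones.

```python
import re

html = '<a href="/a">A</a><a href="/b">B</a>'

# Like find_element_*: only the first match is returned. re.search gives
# None when nothing matches, whereas the Selenium call raises
# NoSuchElementException instead.
first = re.search(r'href="([^"]+)"', html)
print(first.group(1))  # /a

# Like find_elements_*: every match is returned as a list, and the list
# is simply empty when nothing matches.
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['/a', '/b']
print(re.findall(r'src="([^"]+)"', html))  # []
```

The practical consequence is the same in both libraries: the plural form is safe to loop over even when nothing was found, while the singular form needs a check (or an exception handler) first.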

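              5. Complete examples

              These examples follow the standard procedure. In practice they can be simplified as appropriate and combined with the locator methods above.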
    from selenium import webdriver
    import time
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"

    driver = webdriver.PhantomJS(executable_path=r'C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe', desired_capabilities=dcap)
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
    time.sleep(3)
    print(driver.find_element_by_id("content").text)
    driver.close()

    # Set the PhantomJS User-Agent
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"

    driver = webdriver.PhantomJS(executable_path='./phantomjs.exe', desired_capabilities=dcap)
    driver.get("http://dianping.com/")

    cap_dict = driver.desired_capabilities  # List all available desired_capabilities properties
    for key in cap_dict:
        print('%s: %s' % (key, cap_dict[key]))
    print(driver.current_url)
    driver.quit()

    # Wait for the page to finish rendering
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS(executable_path=r'C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
    try:
        element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
    finally:
        print(driver.find_element_by_id("content").text)
        driver.close()

    # Handle JavaScript redirects
    from selenium import webdriver
    import time
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.common.exceptions import StaleElementReferenceException

    def waitForLoad(driver):
        # Remember the current html element; it goes stale once the page redirects
        elem = driver.find_element_by_tag_name("html")
        count = 0
        while True:
            count += 1
            if count > 20:  # 20 checks, 0.5 seconds apart
                print("Timing out after 10 seconds and returning")
                return
            time.sleep(.5)
            try:
                elem == driver.find_element_by_tag_name("html")
            except StaleElementReferenceException:
                return

    driver = webdriver.PhantomJS(executable_path=r'C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
    waitForLoad(driver)
    print(driver.page_source)
    ######
    # Drag and drop with ActionChains
    from selenium import webdriver
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.webdriver import ActionChains

    driver = webdriver.PhantomJS(executable_path='phantomjs/bin/phantomjs')
    driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')

    print(driver.find_element_by_id("message").text)

    element = driver.find_element_by_id("draggable")
    target = driver.find_element_by_id("div2")
    actions = ActionChains(driver)
    actions.drag_and_drop(element, target).perform()

    print(driver.find_element_by_id("message").text)
    #######
    # Take a screenshot (the tmp directory must already exist)
    driver.get_screenshot_as_file('tmp/pythonscraping.png')

    ####
    # Log in to Zhihu, then automatically click "More" at the bottom of the page to load more content
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver import ActionChains
    import time
    import sys

    # Raw string: otherwise "\U" in the Windows path is a syntax error in Python 3
    driver = webdriver.PhantomJS(executable_path=r'C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe')
    driver.get("http://www.zhihu.com/#signin")
    #driver.find_element_by_name('email').send_keys('your email')
    driver.find_element_by_xpath('//input[@name="password"]').send_keys('your password')
    #driver.find_element_by_xpath('//input[@name="password"]').send_keys(Keys.RETURN)
    time.sleep(2)
    driver.get_screenshot_as_file('show.png')
    #driver.find_element_by_xpath('//button[@class="sign-button"]').click()
    driver.find_element_by_xpath('//form[@class="zu-side-login-box"]').submit()

    try:
        # Wait for the page to finish loading
        dr=WebDriverWait(driver,5)
        dr.until(lambda the_driver:the_driver.find_element_by_xpath('//a[@class="zu-top-nav-userinfo "]').is_displayed())
    except TimeoutException:
        print('Login failed')
        driver.quit()
        sys.exit(1)
    driver.get_screenshot_as_file('show.png')
    #user=driver.find_element_by_class_name('zu-top-nav-userinfo ')
    #webdriver.ActionChains(driver).move_to_element(user).perform() # move the mouse to my user name
    loadmore=driver.find_element_by_xpath('//a[@id="zh-load-more"]')
    actions = ActionChains(driver)
    actions.move_to_element(loadmore)
    actions.click(loadmore)
    actions.perform()
    time.sleep(2)
    driver.get_screenshot_as_file('show.png')
    print(driver.current_url)
    print(driver.page_source)
    driver.quit()