Python + Selenium: Scraping Lesson Titles and Durations from NetEase Cloud Classroom (网易云课堂)

lsekfe · Posted on 2021-9-24 09:44:14

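Target page
study.163.com/course/intr…
At first I fetched the page with an ordinary HTTP request, but the returned source contained no lesson information at all, which means the page loads its content dynamically with JavaScript.
Inspecting the traffic with the browser's developer tools shows that the browser requests the following address to get the lesson details:
study.163.com/dwr/call/pl…
In the Preview pane of the developer tools, the lesson information shows up in Unicode-escaped form.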
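
If the lesson data in that response is written as backslash-u escape sequences (an assumption here, since the post does not show the raw body), it can be turned back into readable text with Python's built-in `unicode_escape` codec. A minimal sketch; the sample string and the `decode_escapes` name are made up for illustration:

```python
def decode_escapes(raw: str) -> str:
    """Turn literal backslash-u sequences in an ASCII string into real characters."""
    # The unicode_escape codec decodes bytes, so encode the ASCII text first.
    return raw.encode('ascii').decode('unicode_escape')


# Hypothetical fragment, not copied from the real response.
sample = r'\u8bfe\u65f6 1'
decoded = decode_escapes(sample)  # decoded == '课时 1'
```

This only helps for inspecting a response that has already been captured; it does not solve the problem of requesting the address successfully in the first place.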

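Requesting that address directly fails with an error, as expected. I did not want to work out exactly which headers and parameters the request needs, so I went straight to Selenium; it is only a single page, so performance does not matter.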
Code
Notes
· study163seleniumff.py is the main script
· helper.py is a helper module in the same directory as the main script
· geckodriver.exe must be placed under the relative path ../drivers/
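
Because ../drivers/ is a relative path, it is resolved against the directory the script is started from, not against the script's own location. One way to make the lookup independent of the working directory is to build the path from the script file itself. A small sketch; `driver_path` is a hypothetical helper, not part of the original code:

```python
from pathlib import Path


def driver_path(script_file: str, name: str = 'geckodriver') -> Path:
    """Return <script dir>/../drivers/<name> as an absolute path."""
    script_dir = Path(script_file).resolve().parent
    return script_dir.parent / 'drivers' / name


# In the main script this would be called as driver_path(__file__).
```

The result can be converted with `str()` before it is handed to Selenium.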
study163seleniumff.py

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from lxml import etree
import csv
from helper import Chapter, Lesson

# Fetch the data
url = 'https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1'
options = Options()
options.add_argument('-headless')  # run Firefox without a window
# Selenium 3 style call: options= replaces the deprecated firefox_options=.
# Newer Selenium 4 releases expect a Service object instead of executable_path.
driver = Firefox(
    executable_path='../drivers/geckodriver',
    options=options)
driver.get(url)
# page_source is read as soon as get() returns; if the lesson list were
# still loading at that point, an explicit wait would be needed here.
text = driver.page_source
driver.quit()

# Parse the data
html = etree.HTML(text)
chapters = html.xpath('//div[@class="chapter"]')
# Column headers: chapter no., chapter name, lesson no., lesson name, lesson duration
TABLEHEAD = ['章节号', '章节名', '课时号', '课时名', '课时长']
rows = []
for chapter_el in chapters:
    chapter = Chapter(chapter_el)
    chapter_info = chapter.chapter_info
    # A separate loop variable avoids shadowing the outer one
    for lesson_el in chapter.get_lessons():
        lesson = Lesson(lesson_el)
        values = (*chapter_info, *lesson.lesson_info)
        rows.append(dict(zip(TABLEHEAD, values)))

# Store the data
with open('courseinfo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, TABLEHEAD)
    writer.writeheader()
    writer.writerows(rows)
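
The row-building step is easy to check in isolation. The sketch below uses made-up chapter and lesson tuples (not real course data) and writes to an in-memory buffer instead of a file; it shows how `zip` pairs the header names with the unpacked values before `csv.DictWriter` writes them:

```python
import csv
import io

TABLEHEAD = ['章节号', '章节名', '课时号', '课时名', '课时长']

# Hypothetical values standing in for chapter_info and lesson_info.
chapter_info = ('章节1', 'Intro')
lesson_info = ('课时1', 'Setup', '05:00')

# Unpack both tuples into one flat tuple, then pair each value with its header.
row = dict(zip(TABLEHEAD, (*chapter_info, *lesson_info)))

# newline='' keeps the csv module in charge of line endings, as in the script.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, TABLEHEAD)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

In the real script the file is opened with `encoding='utf-8-sig'`, which writes a byte order mark at the start of the file; spreadsheet programs such as Excel use it to recognize the file as UTF-8, so the Chinese headers display correctly.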


helper.py

class Chapter:
    def __init__(self, chapter):
        self.chapter = chapter       # lxml element for one chapter <div>
        self._chapter_info = None    # cache, filled on first access

    def parse_all(self):
        # Chapter number
        chapter_num = self.chapter.xpath(
            './/span[contains(@class, "chaptertitle")]/text()')[0]
        # Drop the trailing colon from the chapter number
        chapter_num = chapter_num[:-1]
        # Chapter name
        chapter_name = self.chapter.xpath(
            './/span[contains(@class, "chaptername")]/text()')[0]
        return chapter_num, chapter_name

    @property
    def chapter_info(self):
        # Parse once and reuse the result on later accesses
        if self._chapter_info is None:
            self._chapter_info = self.parse_all()
        return self._chapter_info

    def get_lessons(self):
        return self.chapter.xpath(
            './/div[@data-lesson]')


class Lesson:
    def __init__(self, lesson):
        self.lesson = lesson         # lxml element for one lesson <div>
        self._lesson_info = None     # cache, filled on first access

    @property
    def lesson_info(self):
        if self._lesson_info is None:
            # Lesson number. "ks" is also a substring of "kstime",
            # so [0] keeps only the first matching span.
            lesson_num = self.lesson.xpath(
                './/span[contains(@class, "ks")]/text()')[0]
            # Lesson name
            lesson_name = self.lesson.xpath(
                './/span[@title]/@title')[0]
            # Lesson duration
            lesson_len = self.lesson.xpath(
                './/span[contains(@class, "kstime")]/text()')[0]
            self._lesson_info = lesson_num, lesson_name, lesson_len
        return self._lesson_info


Final result
The final result is saved as courseinfo.csv, in the same directory as the main script.
