Crawl an official account's article history and save it as PDFs #33

ceerqingting opened this issue Feb 14, 2020 · 0 comments

Sometimes I come across an official account with high-quality content and want to read all of its articles when I have time. But the official WeChat client's history view is limited: you have to scroll from the top all the way to the end every time, which is inconvenient. I had just worked through the official Python tutorial, so I used Scrapy to write a small practice project that crawls an official account's articles and saves them as PDFs. It is very simple; if there are mistakes, corrections are welcome.

1. Install Anaconda

Download and installation instructions: https://docs.anaconda.com/anaconda/install/

Why install Anaconda?

First: it provides package management, which avoids the frequent failures when installing third-party packages on Windows.

Second: it provides environment management, which makes it easy to keep multiple Python versions side by side and switch between them.

Once Anaconda is installed, Python comes with it, so there is no need to install Python separately. To check that the installation succeeded, type conda or python on the command line; if it does not report "command not found", the installation worked.
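
For example, either of the following should print a version number rather than an error:

conda --version
python --version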

2. Install Scrapy

  • Option 1: install with pip
pip install Scrapy
  • Option 2: install with conda

In my Mac environment, after installing with conda, running scrapy complained about some missing third-party packages; installing once more with pip fixed it.

conda install -c conda-forge scrapy

3. Install Selenium WebDriver

Install Selenium with the following command:

pip install selenium

Download ChromeDriver: https://chromedriver.chromium.org/
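
A minimal smoke test (assuming Selenium 3, which matches the executable_path argument used in the spider below) to confirm that Selenium can drive the downloaded ChromeDriver; the chromedriver path is a placeholder you need to fill in:

from selenium import webdriver

# path to the chromedriver binary you just downloaded (placeholder)
driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get('https://example.com')
print(driver.title)  # should print "Example Domain"
driver.quit()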

4. Create the project

Create a project named tutorial by running:

scrapy startproject tutorial
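
The command generates a project skeleton roughly like the following (the spider file in the next step goes under tutorial/spiders/):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py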

5. Write the spider

Create the spider file quotes_spider.py under the tutorial/spiders directory. Before writing it, let's outline the steps we need to implement.

  • Obtain the login information manually: open the official account's article history (from the desktop WeChat client) in a browser, then copy the article-list API URL and the cookie from the request headers in the developer console. WeChat login is fairly complicated, so automatic login is not feasible yet.

  • Crawl the article-list API page by page and parse out all the article links

  • Load each article link with Selenium WebDriver, then run the JS command window.print(); to print the page and save it as a PDF

What each part does:

  • ChromeDriver configuration
    def init_driver(self):
        appState = {
            'recentDestinations': [
                {
                    'id': 'Save as PDF',
                    'origin': 'local'
                }
            ],
            'selectedDestinationId': 'Save as PDF',
            'version': 2
        }
        profile = {
            'printing.print_preview_sticky_settings.appState': json.dumps(appState),
            'savefile.default_directory': 'fill in the directory where PDFs should be saved'
        }
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_experimental_option('prefs', profile)
        chrome_options.add_argument('--kiosk-printing')
        driver = webdriver.Chrome(options=chrome_options, executable_path='fill in the directory where chromedriver was downloaded/chromedriver')
        return driver
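
In short, the appState prefs preselect "Save as PDF" as the sticky print destination, savefile.default_directory controls where the printed file lands, and --kiosk-printing makes Chrome print straight away without showing the print dialog, which is what lets window.print() save the PDF unattended.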
  • Handling lazy-loaded images

Images inside official account articles are lazy-loaded, so capturing the whole page directly would leave some images missing. We therefore simulate the user scrolling through the entire page so that every image gets loaded.

    def scroll_page(self, driver):
        time.sleep(5)
        page_height = driver.execute_script('return document.body.scrollHeight;')
        screen_height = driver.execute_script('return window.innerHeight;')
        step = screen_height
        while page_height > 0:
            driver.execute_script(f'window.scrollBy(0, {step});')
            page_height -= step
            time.sleep(0.5)
  • Removing ads

Some pages carry ads. To remove them, first run a JS script that manipulates the browser DOM, stripping what you don't want and keeping what you do. Of course, this applies the same treatment to every page; if the ads sit in different places on different pages, each page would have to be handled individually.

    def remove_ads(self, driver):
        driver.execute_script('''document.getElementById("page-content").innerHTML = document.getElementById("js_content").innerHTML;''')
  • Saving the PDF
    driver.execute_script('window.print();')
  • Renaming the downloaded PDF (Chrome names the printed file after the page title, so it is renamed to publish time plus article title afterwards)
    os.rename(f'./pages/{temp_title}.pdf', f'./pages/{title}.pdf')

Note that exception handling is needed here: sometimes the file cannot be found yet, and without handling, the error would interrupt the program.
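
One way to make this step more robust is to wait for the printed file to actually appear before renaming it. This is only a sketch; wait_for_pdf is a hypothetical helper, not part of the original spider:

import os
import time

def wait_for_pdf(path, timeout=30):
    # poll until Chrome has finished writing the printed PDF (hypothetical helper)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(1)
    return False

# usage inside rename_pdf, instead of renaming immediately:
# if wait_for_pdf(f'./pages/{temp_title}.pdf'):
#     os.rename(f'./pages/{temp_title}.pdf', f'./pages/{title}.pdf')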

Complete code

import scrapy
import json
import time
import os
import logging
import re
from selenium import webdriver

class wechat(scrapy.Spider):
    name = 'wechat'
    # offset into the article-list API; incremented by 10 after each page
    offset = 1
    start_urls = [
        'the article-list API URL copied from the developer console'
    ]
    # the cookie string copied from the request headers
    cookies = 'cookie copied from the request headers'

    def convert_cookies(self, cookies):
        # turn the raw "k1=v1; k2=v2" cookie header into a dict
        cookie_dict = dict([l.split("=", 1) for l in cookies.split("; ")])
        return cookie_dict

    def init_driver(self):
        # Chrome print settings: preselect "Save as PDF" as the print destination
        appState = {
            'recentDestinations': [
                {
                    'id': 'Save as PDF',
                    'origin': 'local'
                }
            ],
            'selectedDestinationId': 'Save as PDF',
            'version': 2
        }
        profile = {
            'printing.print_preview_sticky_settings.appState': json.dumps(appState),
            'savefile.default_directory': 'fill in the directory where PDFs should be saved'
        }
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_experimental_option('prefs', profile)
        # --kiosk-printing skips the print dialog so window.print() saves the PDF silently
        chrome_options.add_argument('--kiosk-printing')
        driver = webdriver.Chrome(options=chrome_options, executable_path='fill in the directory where chromedriver was downloaded/chromedriver')
        return driver

    def add_cookie(self, driver):
        cookies = self.convert_cookies(self.cookies)
        for k, v in cookies.items():
            driver.add_cookie({
                'name': k,
                'value': v
            })

    def scroll_page(self, driver):
        # scroll one screen at a time so the lazy-loaded images are fetched before printing
        time.sleep(5)
        page_height = driver.execute_script('return document.body.scrollHeight;')
        screen_height = driver.execute_script('return window.innerHeight;')
        step = screen_height
        while page_height > 0:
            driver.execute_script(f'window.scrollBy(0, {step});')
            page_height -= step
            time.sleep(0.5)

    def down_pdf(self, driver):
        self.scroll_page(driver)
        self.remove_ads(driver)
        # with --kiosk-printing enabled, window.print() saves the page as a PDF without a dialog
        driver.execute_script('window.print();')
        time.sleep(10)

    def url_filter(self):
        # rewrite the offset query parameter in the list API URL to fetch the next page
        return re.sub(r'offset=\d+', f'offset={self.offset}', self.start_urls[0])
    
    def rename_pdf(self, msg, driver):
        # the printed file is named after the page title; rename it to "publish time_article title"
        try:
            publishTime = driver.find_elements_by_css_selector('#publish_time')[0].get_attribute('textContent')
            temp_title= driver.title
            title = publishTime + '_' + msg['app_msg_ext_info']['title']
            os.rename(f'./pages/{temp_title}.pdf', f'./pages/{title}.pdf')
        except Exception as e:
            logging.exception(e)

    def remove_ads(self, driver):
        # keep only the article body (#js_content) as the page content, dropping everything else
        driver.execute_script('''document.getElementById("page-content").innerHTML = document.getElementById("js_content").innerHTML;''')
        
    def start_requests(self):
        yield scrapy.Request(self.url_filter(), cookies=self.convert_cookies(self.cookies))

    def parse(self, response):
        driver = self.init_driver()
        response_json = json.loads(response.text)
        # the article list itself is a JSON string inside the response
        general_msg_list = json.loads(response_json['general_msg_list'])
        msg_list = general_msg_list['list']
        if len(msg_list) == 0:
            driver.quit()
            return
        for msg in msg_list:
            if 'app_msg_ext_info' not in msg:
                continue
            # only keep original articles (copyright_stat == 11)
            if (msg['app_msg_ext_info']['copyright_stat'] == 11):
                driver.get(msg['app_msg_ext_info']['content_url'])
                if self.offset == 1:
                    # on the first batch, add the login cookies (the driver has to be on the
                    # target domain first), then reload the page so they take effect
                    self.add_cookie(driver)
                    driver.get(msg['app_msg_ext_info']['content_url'])
                self.down_pdf(driver)
                self.rename_pdf(msg, driver)
        # close this batch's browser before requesting the next page of the list API
        driver.quit()
        self.offset += 10
        print('current offset', self.offset)
        yield response.follow(self.url_filter(), callback=self.parse)
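
Assuming the spider above is saved in the quotes_spider.py file under tutorial/spiders/ mentioned in step 5, it can be run from the project root with:

scrapy crawl wechat

Note that savefile.default_directory should be the absolute path of a pages directory under the folder you run the command from, since rename_pdf looks for the printed file at ./pages/..., and that directory has to exist before the crawl starts.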

With the above, you can easily crawl all the articles and read them locally.

Reference: Using Selenium to save a web page as PDF
