Collector: fetching WeChat Official Account articles #15

Closed
howie6879 opened this issue Dec 22, 2021 · 1 comment
Labels
help wanted Extra attention is needed

Comments


howie6879 commented Dec 22, 2021

Consider the following approaches:


howie6879 commented Dec 22, 2021

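For fetching via Sogou, the current status is:

  • Progress:
    • 2021-12-22: the playwright-based data fetching script is basically complete
  • Issues:
    • Fetching has to go through a browser driven by playwright. Both approaches below may still trigger Sogou's blocking; opened Issue "[Sogou WeChat collector] CAPTCHA solutions" #31 to track this
      • playwright driving a headless browser with a UA set: tested, works
      • [Use this approach by default] Try a plain crawler to see how strict the CAPTCHA trigger limits are: works
    • [Handled by a fallback] The fetched Official Account article links have a limited lifetime. However, when a link is opened inside the WeChat ecosystem, it automatically redirects to the correct link even after it has expired, so there is no problem as long as the distributor is a WeChat Official Account. Even with other distributors, the link stays valid for a fairly long time, so the impact should be small
  • Plan: 2021-12-22: fetch the latest WeChat articles with a playwright-driven headless browser that has a UA set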
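
A minimal sketch of the "headless browser with a UA set" approach, using playwright's Python sync API. This is not the project's actual script. The search endpoint, the `type=1` parameter, the UA string, and the function names `build_search_url` and `fetch_search_html` are all assumptions made for illustration, and the sketch does nothing about CAPTCHAs.

```python
from urllib.parse import urlencode

# Assumption: Sogou's WeChat search endpoint; the issue does not state the URL.
SOGOU_WEIXIN_SEARCH = "https://weixin.sogou.com/weixin"

# Assumption: any realistic desktop browser UA; the issue does not say which one is used.
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
)


def build_search_url(account_name: str) -> str:
    """Build a search URL for an account name (type=1 is assumed to mean account search)."""
    return f"{SOGOU_WEIXIN_SEARCH}?{urlencode({'type': 1, 'query': account_name})}"


def fetch_search_html(account_name: str, user_agent: str = DEFAULT_UA) -> str:
    """Load the search page in headless Chromium with a custom UA and return its HTML."""
    # Imported lazily so the URL helper can be used without playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # The UA is set on the browser context, so every page in it sends it.
            context = browser.new_context(user_agent=user_agent)
            page = context.new_page()
            page.goto(build_search_url(account_name), wait_until="domcontentloaded")
            return page.content()
        finally:
            browser.close()
```

Calling `fetch_search_html("howie_locker")` requires `pip install playwright` and `playwright install chromium`. If Sogou serves a CAPTCHA page, the returned HTML will be that page, so a caller would still need to detect and handle it.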

Data format:

{
    "doc_author": "howie6879",
    "doc_content": "",
    "doc_ts": 1639702080,
    "doc_date": "2021-12-17 08:48",
    "doc_des": "本周推荐游戏程序员的读书笔记,致敬。",
    "doc_id": "bd998b9c43ba2d91fd6be9f833ecb634",
    "doc_image": "http://mmbiz.qpic.cn/mmbiz_jpg/YRBRJvZXcIVBtU4gtNsZrRQtDLDS725uEGsCGXHbq7GzfDK2KumHOSKkA6TiaWLia1co96EzPqHRoiac7w7wtqlkg/0?wx_fmt=jpeg",
    "doc_keywords": [],
    "doc_link": "https://mp.weixin.qq.com/s?src=11&timestamp=1640227638&ver=3513&signature=KSf-sAynN5L4LZlsLccoZvT7BT2C6BOcinT77piilqyZnDkcBAy8xpN5o1E8XIKNlBei5CiWNuWJ7e8OzqzyvsY6Fr-aF60Sc6mXJLExQrCNDgGf1V-F8LmOuyCxPVZv&new=1",
    "doc_name": "我的周刊(第018期)",
    "doc_source": "2c_wechat",
    "doc_source_account_intro": "编程、兴趣、生活",
    "doc_source_account_nick": "howie_locker",
    "doc_source_meta_list": [
        "howie_locker",
        "编程、兴趣、生活"
    ],
    "doc_source_name": "老胡的储物柜",
    "doc_type": "article"
}
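
A few fields in the sample record above appear to be derivable from others, which the following sketch illustrates. It assumes that `doc_date` is `doc_ts` rendered in UTC+8, that `doc_source_meta_list` is simply the account nick followed by its intro, and that the `timestamp` query parameter in `doc_link` marks when the link was generated. The sample is consistent with the first two, but none of the three is confirmed by the issue, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional
from urllib.parse import parse_qs, urlparse

# Assumption: doc_date is shown in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def format_doc_date(doc_ts: int) -> str:
    """Render a Unix timestamp in the 'YYYY-MM-DD HH:MM' shape used by doc_date."""
    return datetime.fromtimestamp(doc_ts, tz=CST).strftime("%Y-%m-%d %H:%M")


def build_meta_list(nick: str, intro: str) -> List[str]:
    """Assumed layout of doc_source_meta_list: account nick first, then intro."""
    return [nick, intro]


def link_timestamp(doc_link: str) -> Optional[int]:
    """Return the integer `timestamp` query parameter of a link, or None if absent."""
    values = parse_qs(urlparse(doc_link).query).get("timestamp")
    return int(values[0]) if values else None
```

Since the links are time-limited, a distributor could compare `link_timestamp(doc_link)` with the current time to decide whether a link should be refreshed. The actual expiry window is not given in the issue, so any threshold would have to be determined empirically.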

howie6879 added the help wanted label Dec 23, 2021