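Because C's cloud storage was maliciously reported (running a labor of love is damn hard these days) and almost nobody uses this project, we may indefinitely stop further development at any time (anyone who wants to take over is welcome, of course awa)
If you are using this project, or would like to join development, please let us know :)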
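A Python package for detecting and decrypting certain articles
- This project does not provide any method or technique for scraping Zhihu content
- For learning text analysis only; please do not use it to infringe on anyone's interests :)
- The code is rough; PRs are welcome :)
- Most of the code has no comments; I'm lazy :(
- This project is licensed under GPLv3; you are welcome to continue development under the terms of the license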
poetry add git+https://github.com/cxzlw/zhihuDecrypt
import zhihudecrypt
- See zhihuDecryptApp for reference
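Use the code below. Note that the more articles you collect through learn_from_id(), the better.
If the id you pass in points to a garbled article, remember to pass the matching decryption method as well, or only pass ids of normal articles.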
import os
import time
import consts
import jieba
import httpx
from bs4 import BeautifulSoup
path = os.path.dirname(__file__)
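

def get_cdycc(passage_id: int) -> str:
    """Fetch an article page from cdycc.cn and return its body text."""
    response = httpx.get(f"https://cdycc.cn/?id={passage_id}")
    # BeautifulSoup expects markup (str or bytes), not the httpx.Response object.
    soup = BeautifulSoup(response.text, "html5lib")
    selected = soup.select(
        "body > div.wrapper > div.main.fixed > div.wrap > div.content > div:nth-child(1) > div.post > "
        "div.single.postcon > div.readall-body > p")
    # Drop the first two and the last three paragraphs of the match.
    return "\n".join(x.text for x in selected[2:-3])


def in_charset(word: str) -> bool:
    """Return True if `word` contains at least one character from consts.charset."""
    return any(char in word for char in consts.charset)

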
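# Load the saved counts; each line of words.txt has the form "<word> <count>".
with open(path + "/words.txt", "r", encoding="utf-8") as f:
    words = {x.split(" ")[0]: int(x.split(" ")[1])
             for x in f.read().rstrip("\n").split("\n") if in_charset(x)}


def learn_from_id(pid: int, encrypt_method: str = "none"):
    """Fetch article `pid`, count its charset words, and save the counts to words.txt."""
    # Note: encrypt_method is accepted but not used anywhere in this snippet.
    try:
        p = get_cdycc(pid)
        for block in jieba.cut(p):
            if in_charset(block):
                words[block] = words.get(block, 0) + 1
        with open(path + "/words.txt", "w", encoding="utf-8") as f:
            f.writelines(f"{x} {y}\n" for x, y in words.items())
    except Exception as e:
        # Report the failure and carry on, so one bad article does not stop a batch run.
        print(e)
    # Pause before returning so that repeated calls are throttled.
    time.sleep(5)


if __name__ == '__main__':
    learn_from_id(5929, "none")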
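
To try the counting and the words.txt round trip without touching the network, here is a minimal standard-library sketch. It is not the project's code: `CHARSET` is a made-up stand-in for `consts.charset`, `load_words`, `save_words` and `learn_from_tokens` are hypothetical helper names, and the tokens are passed in already split, whereas the script above gets them from `jieba.cut()`.

```python
from collections import Counter
from pathlib import Path

# Hypothetical stand-in for consts.charset; the real set lives in the project.
CHARSET = set("的一是")


def in_charset(word: str) -> bool:
    """Return True if the word contains at least one character from CHARSET."""
    return any(ch in CHARSET for ch in word)


def load_words(path: Path) -> Counter:
    """Read "<word> <count>" lines into a Counter; a missing file gives an empty one."""
    counts = Counter()
    if not path.exists():
        return counts
    for line in path.read_text(encoding="utf-8").splitlines():
        word, _, count = line.rpartition(" ")
        # Skip blank or malformed lines instead of raising.
        if word and count.isdigit():
            counts[word] = int(count)
    return counts


def save_words(path: Path, counts: Counter) -> None:
    """Write the counts back, one "<word> <count>" pair per line."""
    lines = (f"{word} {count}\n" for word, count in counts.items())
    path.write_text("".join(lines), encoding="utf-8")


def learn_from_tokens(counts: Counter, tokens) -> None:
    """Increment the count of every token that contains a CHARSET character."""
    for token in tokens:
        if in_charset(token):
            counts[token] += 1
```

A typical run would call `load_words()` once, feed it the tokens of several articles through `learn_from_tokens()`, and then call `save_words()` once at the end, instead of rewriting the file after every article.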