
BERT paper downloads as a folder instead of a PDF #3

Open
CJYKeepLearning opened this issue Nov 2, 2023 · 5 comments

Comments

@CJYKeepLearning

Thanks for this tool. I ran into the following problem while using it.
My note.md and the pdfs folder are in the same directory. In note.md I entered the example entry - {{BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.}}
Then I ran easyliter -i "./note.md" -o "./pdfs" in cmd.
The cmd output is as follows.

C:\Users\Administrator\Desktop\note>easyliter -i "./note.md" -o "./pdfs"
INFO:easyliter:Updating the file ./note.md
INFO:easyliter:Number of files to download -  1
  0%|                                                                                                | 0/1 [00:00<?, ?it/s]INFO:Downloads:ID type: title.
D:\software\Anaconda3\Lib\site-packages\easy_literature\dblp_source.py:19: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 19 of the file D:\software\Anaconda3\Lib\site-packages\easy_literature\dblp_source.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

  return BeautifulSoup(resp.content)
INFO:Downloads:The Google scholar bib: {'title': 'Bert: Pre-training of deep bidirectional transformers for language understanding', 'author': 'J Devlin and MW Chang and K Lee and K Toutanova', 'journal': 'arXiv preprint arXiv …', 'year': '2018', 'url': 'https://arxiv.org/abs/1810.04805', 'pdf_link': 'https://arxiv.org/pdf/1810.04805.pdf&usg=ALkJrhhzxlCL6yTht2BRmH9atgvKFxHsxQ', 'cited_count': 82010}; The DLBP bib: {'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.', 'author': 'Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova', 'journal': 'NAACL-HLT (1)', 'year': '2019', 'url': 'https://doi.org/10.18653/v1/n19-1423', 'pdf_link': None, 'cited_count': None}.
INFO:utils:The paper's arxiv url: https://arxiv.org/abs/1810.04805; The converted arxiv id: 1810.04805; The pdf link: https://arxiv.org/pdf/1810.04805.pdf&usg=ALkJrhhzxlCL6yTht2BRmH9atgvKFxHsxQ.
100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.54s/it]

The resulting BERT item is a folder named Bert_Pre-training_of_deep_bidirectional_transformers_for_language_understanding.pdf, not a PDF file.
I'm not sure what the problem is.
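For context (this is not easy-literature's actual code): a directory that is named like a PDF file usually appears when a program calls `os.makedirs` on the full output path instead of on its parent directory. A minimal sketch of the buggy vs. correct pattern, with hypothetical names:

```python
import os

out_path = os.path.join("pdfs", "paper.pdf")  # hypothetical output path

# Buggy pattern: calling makedirs on the FULL path creates a
# directory literally named "paper.pdf":
#   os.makedirs(out_path, exist_ok=True)

# Correct pattern: create only the parent directory, then write the file.
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, "wb") as f:
    f.write(b"%PDF-1.4 placeholder")  # stand-in bytes, not a real PDF

print(os.path.isfile(out_path))  # True
```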

@JinjieNi
Owner

JinjieNi commented Nov 2, 2023

This is probably caused by Windows paths differing from Linux paths. I just fixed a bug that may have caused this error.
I don't have a Windows machine at hand right now, so please reinstall and test again:

pip install --upgrade easyliter

Previous versions were only tested on macOS and Linux.

Note: avoid relative paths; use absolute paths whenever possible.
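For reference (not the project's actual code): separator mismatches between Windows (`\`) and Linux (`/`) are usually avoided by building paths with `pathlib` or `os.path.join` rather than concatenating `/` by hand, and resolving relative paths to absolute ones early. A minimal sketch with hypothetical file names:

```python
from pathlib import Path

# Hypothetical paths mirroring the command in this thread
input_md = Path("./note.md").resolve()   # -> absolute, with the OS's separators
output_dir = Path("./pdfs").resolve()

# pathlib joins with the correct separator on every platform,
# so no "/" vs "\" string concatenation is needed.
pdf_path = output_dir / "Bert_paper.pdf"  # hypothetical file name
print(pdf_path.name)
```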

@CJYKeepLearning
Author

After updating, absolute paths work fine, but with a relative path the download is still a folder.

@JinjieNi
Owner

JinjieNi commented Nov 4, 2023

Updated again. If this version still doesn't work, use absolute paths for now; I'll test on a Windows virtual machine another day.

@CJYKeepLearning
Author

CJYKeepLearning commented Nov 5, 2023

With a relative path BERT now downloads correctly, but for XLNet no download URL can be found. Below is the output of downloading XLNet with a relative path:

C:\Users\Administrator\Desktop\note>easyliter -i note.md -o pdfs
INFO:easyliter:Updating the file note.md
INFO:easyliter:Number of files to download -  1
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]INFO:Downloads:ID type: title.
D:\software\Anaconda3\Lib\site-packages\easy_literature\dblp_source.py:19: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 19 of the file D:\software\Anaconda3\Lib\site-packages\easy_literature\dblp_source.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
  return BeautifulSoup(resp.content)
INFO:Downloads:The Google scholar bib: None; The DLBP bib: {'title': 'XLNet: Generalized Autoregressive Pretraining for Language Understanding.', 'author': 'Zhilin Yang and Zihang Dai and Yiming Yang and Jaime G. Carbonell and Ruslan Salakhutdinov and Quoc V. Le', 'journal': 'NeurIPS', 'year': '2019', 'url': 'https://proceedings.neurips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html', 'pdf_link': None, 'cited_count': None}.
INFO:utils:The pdf path to be saved: pdfs\XLNet_Generalized_Autoregressive_Pretraining_for_Language_Understanding.pdf
INFO:utils:PDF link: None
INFO:PDFs:Failed to fetch pdf with all sci-hub urls
INFO:PDFs:Failed to fetch pdf with all sci-hub urls
INFO:utils:Can not find a downloading source for literature id Xlnet: Generalized autoregressive pretraining for language understanding.. You may need to manually download this paper, a template has been generated in the markdown file. Put the pdf file in the folder you specified just now and add its name in the '(pdf)' of your markdown entry.
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [13:47<00:00, 827.09s/it]

@JinjieNi
Owner

JinjieNi commented Nov 5, 2023

This is normal. Google Scholar sometimes blocks highly active users. The download links mainly rely on parsing Google Scholar and arXiv; if those are unavailable, the tool falls back to DBLP, Sci-Hub, and other relatively smaller databases. If none of these yields a PDF link, you need to download the PDF manually, put it into the folder you specified, and fill in the file name in the markdown file (everything else is already formatted there; you only need to fill in the PDF file name in the [pdf] field).
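The fallback order described above can be sketched as a simple loop. These are hypothetical function names and stub data for illustration only, not easy-literature's real API:

```python
# Hypothetical sketch of a "try sources in order" lookup.
def find_pdf_link(title, sources):
    """Try each metadata source in order; return the first PDF link found."""
    for lookup in sources:
        bib = lookup(title)
        if bib and bib.get("pdf_link"):
            return bib["pdf_link"]
    return None  # caller then emits the manual-download template

# Stub sources for illustration only
google_scholar = lambda title: None              # blocked / rate-limited
dblp = lambda title: {"pdf_link": None}          # entry found, but no PDF link
scihub = lambda title: {"pdf_link": "https://example.org/paper.pdf"}

print(find_pdf_link("XLNet", [google_scholar, dblp, scihub]))
```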
