processFulltextDocument fails on 0.23% arXiv PDFs #1113
Comments
Hi @MarksonChen, this is normally fixed with #1075.
Hi kermitt2, thank you for your reply. I was using 0.8.0. However, after switching to the latest master version (using
Thank you @MarksonChen for checking and reporting these arXiv error cases. Indeed, the problem is not related to the issue addressed by #1075, sorry. I just pushed a quick fix, and these files should work too.
Hi kermitt2, thank you so much for your speedy fix! The amount of continual work put into this open-source project has been remarkable. All 22085 fetchable arXiv PDFs can be parsed successfully with processFulltextDocument.
@kermitt2 I have déjà vu about this issue from working on PRs #1097 and #1099. As far as I remember, this happens when a note with the same "label" is identified in the text. So when the notes list is collected from the text, by using the
For the first article in the list, 2202.03169, it happens because there are three notes with the same intervals. Maybe we could just filter them out as an additional precaution. I am also writing down some additional information here, as I will forget it within the hour.
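The precautionary filter suggested above could look roughly like the following. This is only a minimal sketch: `NoteSpan`, `NoteIntervalFilter`, and `dropDuplicateIntervals` are hypothetical names invented for illustration and are not GROBID classes or methods. It assumes each collected note carries the start and end offsets of its interval, and it keeps the first note seen for each interval.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NoteIntervalFilter {

    /** Hypothetical stand-in for a note located in the text (not a GROBID type). */
    record NoteSpan(String label, int start, int end) {}

    /** Keeps the first note for each (start, end) interval, preserving input order. */
    static List<NoteSpan> dropDuplicateIntervals(List<NoteSpan> notes) {
        Set<List<Integer>> seen = new HashSet<>();
        List<NoteSpan> kept = new ArrayList<>();
        for (NoteSpan note : notes) {
            // The interval itself is the deduplication key; the label is ignored.
            if (seen.add(List.of(note.start(), note.end()))) {
                kept.add(note);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<NoteSpan> notes = List.of(
                new NoteSpan("1", 10, 12),
                new NoteSpan("1", 10, 12),
                new NoteSpan("1", 10, 12),
                new NoteSpan("2", 40, 42));
        System.out.println(dropDuplicateIntervals(notes));
    }
}
```

Filtering this way only hides the symptom, since the duplicated intervals would still be produced upstream, which is why it is described as an additional precaution rather than the fix itself.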
I'm reopening this to follow up on my last comment. The duplicated interval is avoided by updating the search space of the indexOf call, that is, by reducing the list of tokens it searches.
identifier, which should be unique from the notes' point of view.
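To illustrate the idea of shrinking the indexOf search space, here is a minimal sketch. It is not the code from the PR: `NoteLocator` and `locate` are hypothetical names, and the sketch assumes that notes are looked up in document order against a flat list of string tokens. After each match the search continues only in the tokens that follow it, so two notes with identical text can no longer resolve to the same interval.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class NoteLocator {

    /**
     * Returns one [start, end) token interval per note, or {-1, -1} when the
     * note is not found in the remaining tokens. Hypothetical helper.
     */
    static List<int[]> locate(List<String> tokens, List<List<String>> notes) {
        List<int[]> spans = new ArrayList<>();
        int from = 0; // first token index still available to the search
        for (List<String> note : notes) {
            // Search only the tail of the token list, not the whole list.
            int rel = Collections.indexOfSubList(tokens.subList(from, tokens.size()), note);
            if (rel < 0) {
                spans.add(new int[] {-1, -1});
                continue;
            }
            int start = from + rel;
            int end = start + note.size();
            spans.add(new int[] {start, end});
            from = end; // reduce the search space for the next note
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("see", "1", "and", "again", "1", "end");
        List<List<String>> notes = List.of(List.of("1"), List.of("1"));
        for (int[] span : locate(tokens, notes)) {
            System.out.println(Arrays.toString(span));
        }
    }
}
```

Searching the full token list each time would map both notes in the example to the first occurrence of "1"; advancing `from` gives each note its own interval.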
I'm submitting the PR with two fixes:
I ran processFulltextDocument on 22103 arXiv PDFs. 22053 PDFs succeeded and 50 failed.
Running on macOS (M2 chip)
Java version: 17.0.10
Server started with Gradle (./gradlew run)
An example error log:
The 50 PDFs that failed:
https://arxiv.org/pdf/2202.03169
https://arxiv.org/pdf/2007.10408
https://arxiv.org/pdf/2008.08076
https://arxiv.org/pdf/2203.00397
https://arxiv.org/pdf/2202.00145
https://arxiv.org/pdf/2110.13423
https://arxiv.org/pdf/2006.16218
https://arxiv.org/pdf/2305.01868
https://arxiv.org/pdf/2206.11939
https://arxiv.org/pdf/1711.05715
https://arxiv.org/pdf/2110.11222
https://arxiv.org/pdf/2006.13025
https://arxiv.org/pdf/1902.00450
https://arxiv.org/pdf/2109.04212
https://arxiv.org/pdf/2105.14849
https://arxiv.org/pdf/cs/9906002
https://arxiv.org/pdf/2101.09398
https://arxiv.org/pdf/1911.00536
https://arxiv.org/pdf/1912.02762
https://arxiv.org/pdf/2104.07857
https://arxiv.org/pdf/2106.15093
https://arxiv.org/pdf/1901.09401
https://arxiv.org/pdf/2201.10129
https://arxiv.org/pdf/2010.04879
https://arxiv.org/pdf/1206.5241
https://arxiv.org/pdf/2203.14101
https://arxiv.org/pdf/1905.06214
https://arxiv.org/pdf/2205.05789
https://arxiv.org/pdf/1810.00953
https://arxiv.org/pdf/1910.11856
https://arxiv.org/pdf/1501.02876
https://arxiv.org/pdf/2202.01987
https://arxiv.org/pdf/2303.02186
https://arxiv.org/pdf/2010.05761
https://arxiv.org/pdf/2204.11918
https://arxiv.org/pdf/2002.12361
https://arxiv.org/pdf/1810.07311
https://arxiv.org/pdf/1905.03817
https://arxiv.org/pdf/1901.07846
https://arxiv.org/pdf/2202.03798
https://arxiv.org/pdf/1711.01244
https://arxiv.org/pdf/2006.03040
https://arxiv.org/pdf/2004.10964
https://arxiv.org/pdf/1803.00590
https://arxiv.org/pdf/1612.06109
https://arxiv.org/pdf/1704.03651
https://arxiv.org/pdf/1610.09534
https://arxiv.org/pdf/2202.03555
https://arxiv.org/pdf/2008.04990