-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement caching for citation relations #11189
Comments
Hey @koppor, I would like to take this up |
Welcome to the vibrant world of open-source development with JabRef! Newcomers, we're excited to have you on board. Start by exploring our Contributing guidelines, and don't forget to check out our workspace setup guidelines to get started smoothly. Having any questions or issues? Feel free to ask here on GitHub. Need help setting up your local workspace? Join the conversation on JabRef's Gitter chat. And don't hesitate to open a (draft) pull request early on to show the direction it is heading towards. This way, you will receive valuable feedback. Happy coding! 🚀 |
Hello @koppor Is this issue still reserved ? If not then I would be quite interested to provide a solution. After taking a look to the main branch code I see possible improvements in the repository design that could also:
I do not know MVStore yet but I saw that H2 provides a fairly extensive documentation. |
I was wondering if you could provide some insights to help me understand the whole design:
Regards |
Dear @alexandre-cremieux, we are currently either on holidays or busy with other projects. Therefore only a short answer: Your ideas sound great! Separation from GUI and logic is required. General package dependencies are listed at https://devdocs.jabref.org/getting-into-the-code/high-level-documentation.html. Feel free to refine. Background: We also get contributions from students with a less strong wish for a clean architecture. After a few rounds, we let the code through: Better a feature with an OKish implementation than a feature not coming, because no one has time to work on clean code... |
Hello @koppor Thanks for your answer and documentation sharing. I was quite busy those last days also. I understand your position. I might take some time, but I'm on my way to refactor the package as we agree on this. Depending of my commit management, I may open separate PRs for refactoring and new implementation to try to ease the review. Regards |
* Move repository, cache, and fetcher to logic package * Move citations model to model/citations/semanticscholar package
* Introduce service layer * Rename LRU cache implementation * Add tests helpers for repository
* Move logic from repository to service * Refactor repositories * Update tab configuration
* Move repository, cache, and fetcher to logic package * Move citations model to model/citations/semanticscholar package
* Introduce service layer * Rename LRU cache implementation * Add tests helpers for repository
* Move logic from repository to service * Refactor repositories * Update tab configuration
* Move logic from repository to service * Refactor repositories * Update tab configuration
* Move logic from repository to service * Refactor repositories * Update tab configuration
* Move logic from repository to service * Refactor repositories * Update tab configuration
* Move logic from repository to service * Refactor repositories * Update tab configuration
* Move logic from repository to service * Refactor repositories * Update tab configuration
* Move logic from repository to service * Refactor repositories * Update tab configuration
Hello @koppor , I refactored the fetch and caching logic by separating the concerns into layers and opened a first PR (as a draft). Could you please take a look and add you comments before I write the documentation for the new interfaces ? Here is this link: #11845 After addressing your comments, I will let the draft become a PR to be merged and start to work on implementing an MVStore based repository. Regards. |
@alexandre-cremieux Thank you for working on this! I am very busy with another task. I think, I can dive into this on Wednesday noon the earliest. - Until then, could you resovle the conflict with |
* Move repository, cache, and fetcher to logic package * Move citations model to model/citations/semanticscholar package
* Introduce service layer * Rename LRU cache implementation * Add tests helpers for repository
* Move logic from repository to service * Refactor repositories * Update tab configuration
Okay, thanks @koppor. Just ping me when you reviewed the code. I fixed the conflicts but they will be back each time |
I think, GitHub will send you a notificatoin :)
Should not be the case. Only if classes you changed were changed on main, too. - Do you feel differently? |
Indeed, there was a commit before that modified the |
Large refactorings do not happen that often 😅. OK, if @leaf-soba continues contributing, changes will much higher than 1% 🏎. |
Just merge the main, much better than rebasing and less conflict potential |
Hello @koppor, Just wondering if you got some chance to take a look to the draft PR ? In fact I do not really need an in-depth review: if you prefer I can also implement the MVStore in the same branch / PR but there will be more files to review. On the other hand, this branch is in fact a refactoring that will help to implement the caching system properly and I would prefer to agree with you on this one to avoid be back-pushed when I will provide the whole implementation. Lets say it is kind of an agreement ;) Please let me know which option could suit best your schedule. |
Context: The PR where the review is asked for is #11845. |
Hello @koppor As pointed out here, I have a concern about using the If we look from a domain point of view, we want the I found it error prone to serialize the
This means that we will use the same object to model two different use cases and sources where it is highly possible that we will apply different logic. For example:
Also, the fact that we are for now somehow based on the Semantic Scholar model leads to some kind of logic duplication between the fetching and saving processes:
Also, it seems perfectly normal to not store the relations directly in the I fear that we might end into a cumbersome code if we try to serialize the relations as bibentries and make new changes error prone, especially if another source for relations is implemented in the future. This mainly because we will couple different mapping and equality logic to the ones dedicated to BibEntry. What I propose is to model the relations with another type that could be named
The rule that needs to be followed would be that it should always be possible to convert a A representation of a relation should probably be as simple as possible (based on what we know from Semantic Scholar for now):
This will make the relation easier to serialize and ease further maintenance of this code, especially if over time the storing logic needs to be updated. In that case, we would redefine the Fetcher like this:
The repository will also return list of And finally make the tab manage I understand that this is a deeper change than what we originally discussed but for now the user won't be impacted cause JabRef does not store the Taking this into consideration, I off course propose to do those changes. What do you think ? |
@alexandre-cremieux Thank you for the long explanation. I would call the class The issue with The only thing I would agree on is to add an interface |
Hello @koppor Thanks for your answer.
I would keep the interface: it helps for testing the service and it will help in the future if there is another source that became available.
From my point of view: the issue is more related to the fact that we are using the BibEntry as the model for relation. Off course we are somehow dependent of the Fetcher but we can always abstract the supplier model by creating an ACL (anti corruption layer - DDD). And the base of this ACL could be done by introducing the But main issue I could see in time is that the BibEntry set rules (equals, hashcode, etc) start to vary in such a way that it would impact also the relation storing and comparison logic because a citation relation is only considering a subset of the fields of a BibEntry. This could potentially lead the code to compare apple and pears at some point.
We can still live with the BibEntry as a model for citation relation but it will add more complexity for the developer for managing replacement in sets, equals and hash code in the future while avoiding mixing the concerns between citation relations and the BibEntry management. I am just wondering if this at some point will make the citations relation logic itself more complex in the long run if a new feature should be added (mapping, set management, etc.). If this does not sound as a problem for you then I will continue to implement the MVStore using BibEntry. |
Side track:
One can instantiate Sets and other collections with a separate comparator. - This is what I would do at this point. Short example: Set<BibEntry> set = new TreeSet<>(
Comparator.comparing((BibEntry entry) -> entry.getField(StandardField.TITLE).orElse(""))
.thenComparing(entry -> entry.getField(StandardField.YEAR).orElse(""))
);
We do have issues at
Maybe you can point me to concrete code parts where I can look at. Currently, I cannot decide on longer code.
In my world, JabRef has ways to compare BibEntrys (similarity etc). We know that the similarity does not work in all cases. But this should be fixed - and not new classes invented only because some cases do not work. With the provided information, I think, a |
Context
PR #10980 introduced a memory saving for citation relations to prevent JabRef's memory usage growing constantly. The drawback is that citation relations might become unavailable due to API limitations of the online service.
Quoting https://libguides.ucalgary.ca/c.php?g=732144&p=5260798
That means, for a large library, I cannot step through the references, because I could hit the rate limit. I know, this is seldom, but it could happen at following setting: In a corporate setting: all requests are going through a proxy. Thus, the rate limit is not per person, but per the SUM of persons. In case 100 researchers work in parallel with JabRef, each researcher can get ONE request per 5 minutes. And per entry TWO requests are needed: For the citing and the cited by.
The following solution would really help to make the citation relations really usable. (Because the information for each DOI is stored independent of each library and is presented as soon it is availbable.)...
Task 1
Let's investiage MVStore. MVStore is a library storing the values of a hashmap on disk. Thus, NOT in memory. Thus, it takes less memory than a full hash map in memory, because it is on disk. -- MVStore routes through the request to a map entry to disk. - See https://www.h2database.com/html/mvstore.html for details.
Estimate: 100 lines of code, but scattered around JabRef. NativeDesktop needs to be touched etc. The most difficult thing will be the closing of the MVStore. Since the DOIs are globally unique, one can close the MVStore when JabRef is shut down. This makes it "easier" (in comparison to #10937, where for each tab some closing thing were necessary). -- Nevertheless, it could be that this will be a back-and-forth code development (meaning: code reviews with significant changes could come up).
Implementation hint: Do NOT do it like
org.jabref.logic.journals.JournalAbbreviationRepository
, because there, there is no direct access to the MVStore, but new hashmaps are created. We really need to work on the Maps provided by the MVStore. Because the MVStore "intelligent" does memory management.Task 2
Follow-up requirement: Refresh the DOI information if one week passed since the last fetch. Maybe this can be baked into the HashMap designed for the MVStore.
The text was updated successfully, but these errors were encountered: