Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement caching for citation relations #11189

Open
koppor opened this issue Apr 12, 2024 · 19 comments
Open

Implement caching for citation relations #11189

koppor opened this issue Apr 12, 2024 · 19 comments
Assignees
Labels
FirstTimeCodeContribution Triggers GitHub Greeter Workflow

Comments

@koppor
Copy link
Member

koppor commented Apr 12, 2024

Context

PR #10980 introduced a memory saving for citation relations to prevent JabRef's memory usage growing constantly. The drawback is that citation relations might become unavailable due to API limitations of the online service.

Quoting https://libguides.ucalgary.ca/c.php?g=732144&p=5260798

The API allows up to 100 requests per 5 minutes. To access a higher rate limit, complete the form to request authentication for your project.

That means, for a large library, I cannot step through the references, because I could hit the rate limit. I know, this is seldom, but it could happen at following setting: In a corporate setting: all requests are going through a proxy. Thus, the rate limit is not per person, but per the SUM of persons. In case 100 researchers work in parallel with JabRef, each researcher can get ONE request per 5 minutes. And per entry TWO requests are needed: For the citing and the cited by.

The following solution would really help to make the citation relations really usable. (Because the information for each DOI is stored independent of each library and is presented as soon it is availbable.)...

Task 1

Let's investiage MVStore. MVStore is a library storing the values of a hashmap on disk. Thus, NOT in memory. Thus, it takes less memory than a full hash map in memory, because it is on disk. -- MVStore routes through the request to a map entry to disk. - See https://www.h2database.com/html/mvstore.html for details.

Estimate: 100 lines of code, but scattered around JabRef. NativeDesktop needs to be touched etc. The most difficult thing will be the closing of the MVStore. Since the DOIs are globally unique, one can close the MVStore when JabRef is shut down. This makes it "easier" (in comparison to #10937, where for each tab some closing thing were necessary). -- Nevertheless, it could be that this will be a back-and-forth code development (meaning: code reviews with significant changes could come up).

Implementation hint: Do NOT do it like org.jabref.logic.journals.JournalAbbreviationRepository, because there, there is no direct access to the MVStore, but new hashmaps are created. We really need to work on the Maps provided by the MVStore. Because the MVStore "intelligent" does memory management.

Task 2

Follow-up requirement: Refresh the DOI information if one week passed since the last fetch. Maybe this can be baked into the HashMap designed for the MVStore.

@rohit-garga
Copy link
Contributor

Hey @koppor, I would like to take this up

@koppor koppor added the FirstTimeCodeContribution Triggers GitHub Greeter Workflow label Apr 29, 2024
Copy link
Contributor

Welcome to the vibrant world of open-source development with JabRef!

Newcomers, we're excited to have you on board. Start by exploring our Contributing guidelines, and don't forget to check out our workspace setup guidelines to get started smoothly.

Having any questions or issues? Feel free to ask here on GitHub. Need help setting up your local workspace? Join the conversation on JabRef's Gitter chat. And don't hesitate to open a (draft) pull request early on to show the direction it is heading towards. This way, you will receive valuable feedback.

Happy coding! 🚀

@alexandre-cremieux
Copy link
Contributor

Hello @koppor

Is this issue still reserved ?

If not then I would be quite interested to provide a solution.

After taking a look to the main branch code I see possible improvements in the repository design that could also:

  • ease third party library integration if MVStore does not fit (maintenance)
  • decouple the calling code from the repository logic
  • somehow simplify calling code

I do not know MVStore yet but I saw that H2 provides a fairly extensive documentation.

@alexandre-cremieux
Copy link
Contributor

alexandre-cremieux commented Jun 27, 2024

Hello @koppor and @ThiloteE ,

I was wondering if you could provide some insights to help me understand the whole design:

  • Regarding the citationrelationtab package, is there any specific reason that it includes the repository, the cache and the semanticsscholar code as its scope belongs to the GUI ? From my point of view repository, cache and fetcher logic could live without the GUI and be exposed in a back-end eventually (should it be local or remote). Could it be interesting to move those classes to a dedicated package in the logic layer, like logic.citation.relation ?
  • To address this issue I would like to try to provide a generic approach to answer multi layer data access but without concrete implementation. I was thinking about creating a dedicated package under logic : logic.util.dao . Do you see other possibilities ?
  • And finally, are there specific rules to respect regarding package dependencies ?

Regards

@koppor
Copy link
Member Author

koppor commented Jul 2, 2024

Dear @alexandre-cremieux, we are currently either on holidays or busy with other projects. Therefore only a short answer: Your ideas sound great! Separation from GUI and logic is required.

General package dependencies are listed at https://devdocs.jabref.org/getting-into-the-code/high-level-documentation.html. Feel free to refine.

Background: We also get contributions from students with a less strong wish for a clean architecture. After a few rounds, we let the code through: Better a feature with an OKish implementation than a feature not coming, because no one has time to work on clean code...

@alexandre-cremieux
Copy link
Contributor

Hello @koppor

Thanks for your answer and documentation sharing. I was quite busy those last days also.

I understand your position. I might take some time, but I'm on my way to refactor the package as we agree on this.

Depending of my commit management, I may open separate PRs for refactoring and new implementation to try to ease the review.

Regards

alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Move repository, cache, and fetcher to logic package
* Move citations model to model/citations/semanticscholar package
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Introduce service layer
* Rename LRU cache implementation
* Add tests helpers for repository
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Move repository, cache, and fetcher to logic package
* Move citations model to model/citations/semanticscholar package
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Introduce service layer
* Rename LRU cache implementation
* Add tests helpers for repository
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 28, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 29, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 29, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
@alexandre-cremieux
Copy link
Contributor

Hello @koppor ,

I refactored the fetch and caching logic by separating the concerns into layers and opened a first PR (as a draft).

Could you please take a look and add you comments before I write the documentation for the new interfaces ?

Here is this link: #11845

After addressing your comments, I will let the draft become a PR to be merged and start to work on implementing an MVStore based repository.

Regards.

@koppor
Copy link
Member Author

koppor commented Sep 29, 2024

@alexandre-cremieux Thank you for working on this! I am very busy with another task. I think, I can dive into this on Wednesday noon the earliest. - Until then, could you resovle the conflict with main branch? 😅

alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 29, 2024
* Move repository, cache, and fetcher to logic package
* Move citations model to model/citations/semanticscholar package
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 29, 2024
* Introduce service layer
* Rename LRU cache implementation
* Add tests helpers for repository
alexandre-cremieux added a commit to alexandre-cremieux/jabref that referenced this issue Sep 29, 2024
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
@alexandre-cremieux
Copy link
Contributor

Okay, thanks @koppor. Just ping me when you reviewed the code. I fixed the conflicts but they will be back each time main is updated.

@koppor
Copy link
Member Author

koppor commented Sep 29, 2024

Okay, thanks @koppor. Just ping me when you reviewed the code.

I think, GitHub will send you a notificatoin :)

I fixed the conflicts but they will be back each time main is updated.

Should not be the case. Only if classes you changed were changed on main, too. - Do you feel differently?

@alexandre-cremieux
Copy link
Contributor

alexandre-cremieux commented Sep 29, 2024

I fixed the conflicts but they will be back each time main is updated.

Should not be the case. Only if classes you changed were changed on main, too. - Do you feel differently?

Indeed, there was a commit before that modified the CitationsRelationsTab on which I had to work: Fix architecture gui/logic (#11379)

@koppor
Copy link
Member Author

koppor commented Sep 29, 2024

Should not be the case. Only if classes you changed were changed on main, too. - Do you feel differently?
Indeed, there was a commit before that modified the CitationsRelationsTab on which I had to work: Fix architecture gui/logic (#11379)

Large refactorings do not happen that often 😅. OK, if @leaf-soba continues contributing, changes will much higher than 1% 🏎.

@Siedlerchr
Copy link
Member

Siedlerchr commented Sep 29, 2024

Just merge the main, much better than rebasing and less conflict potential

@alexandre-cremieux
Copy link
Contributor

alexandre-cremieux commented Oct 7, 2024

Hello @koppor,

Just wondering if you got some chance to take a look to the draft PR ?

In fact I do not really need an in-depth review: if you prefer I can also implement the MVStore in the same branch / PR but there will be more files to review.

On the other hand, this branch is in fact a refactoring that will help to implement the caching system properly and I would prefer to agree with you on this one to avoid be back-pushed when I will provide the whole implementation. Lets say it is kind of an agreement ;)

Please let me know which option could suit best your schedule.

@koppor
Copy link
Member Author

koppor commented Oct 10, 2024

Context: The PR where the review is asked for is #11845.

@alexandre-cremieux
Copy link
Contributor

alexandre-cremieux commented Nov 11, 2024

Hello @koppor

As pointed out here, I have a concern about using the BibEntry as the type definition for citations. This is mainly from a domain context point of view to avoid mixing the concerns between managing a bibliography (BibEntry) and managing a bib entry's own relations. However, this has also impacts on the technical parts independently from the MVStore solution.

If we look from a domain point of view, we want the BibEntry to be able to aggregate relations and store those relations in a separate database (here the MVStore). Saving them in a different store is fine, it is just a technical detail in fact. But the concern is more related to the representation of the relation as BibEntry.

I found it error prone to serialize the BibEntry in the context of the citations relations for two reasons:

  • its structure does not fit our needs: it contains far more information than what is needed for modeling this context
  • in the current state of the application: the serialization that is induced by our need to persist the relations (citations) will be based on the Semantic Scholar model.

This means that we will use the same object to model two different use cases and sources where it is highly possible that we will apply different logic. For example:

  • BibEntry is not serializable while we need to be able to serialize relations. This induces that an ad-hoc serialization logic for a BibEntry as a relation (citation storing) needs to be provided
  • Equality computation will not be the same between the BibDatabase and the relations' store (see last commit code)
  • We will add more code involving BibEntry while we are in fact on relations

Also, the fact that we are for now somehow based on the Semantic Scholar model leads to some kind of logic duplication between the fetching and saving processes:

  1. We will first acquire the relations from the web based on what model the web offers
  2. Then we will map them to BibEntry to represent them as relations in our domain context
  3. Then we will persist them to disk using some kind of serialization taking in account the model from the web (we hit the duplication here and if we change the web model then we will need to change the serialization logic also, etc.)
  4. At the end only and if needed, we will integrate a selection of relations (citations) to the main BibDatabase and only then it would be useful to be able to convert a relation to a BibEntry.

Also, it seems perfectly normal to not store the relations directly in the BibDatabase as it is not meant to manage relations.

I fear that we might end into a cumbersome code if we try to serialize the relations as bibentries and make new changes error prone, especially if another source for relations is implemented in the future. This mainly because we will couple different mapping and equality logic to the ones dedicated to BibEntry.

What I propose is to model the relations with another type that could be named BibRelation and base the relation between a BibEntry and its BibRelations using the DOI. The goal are:

  1. Differentiate the logic between the two domains and make relations logic easier to understand
  2. Avoid technical caveats that will become tech debts
  3. Delay the conversion of a relation to a BibEntry until it is needed by the user

The rule that needs to be followed would be that it should always be possible to convert a BibRelation to a BibEntry.

A representation of a relation should probably be as simple as possible (based on what we know from Semantic Scholar for now):

BibRelation {
    String title;
    String year;
    List<String> authors;
    EntryType type;
    DOI doi;
    String url;
    String abstract;

    public BibEntry toBibEntry() {
        // code here
    }
}

This will make the relation easier to serialize and ease further maintenance of this code, especially if over time the storing logic needs to be updated.

In that case, we would redefine the Fetcher like this:

public interface CitationFetcher {
    List<BibRelation> searchCitedBy(BibEntry entry) throws FetcherException;

    List<BibRelation> searchCiting(BibEntry entry) throws FetcherException;
}

The repository will also return list of BibRelation instead of list of BibEntry.

And finally make the tab manage BibRelation instead of BibEntry. The mapping between BibRelation and BibEntry will then be done only when the user will want to add the relation to the main database. The relation between a bibEntry and its relations will be kept by the DOI. It will still be possible to check for duplicates in the main database by converting the BibRelation to BibEntry before searching duplicates (for the Jump to entry in library use case).

I understand that this is a deeper change than what we originally discussed but for now the user won't be impacted cause JabRef does not store the BibEntry relation for now in its main database, but after that the two (bounded) contexts will be coupled it will be more difficult to extend the relations logic independently from the BibEntry's logic.

Taking this into consideration, I off course propose to do those changes.

What do you think ?

@koppor
Copy link
Member Author

koppor commented Nov 11, 2024

@alexandre-cremieux Thank you for the long explanation. I would call the class Citation to be in line with the name CitationFetcher. Maybe, we should get rid of that interface and just use SemanticScholarFetcher and SemanticScholarEntry to make it clear that the data is coming from SemanticScholar.

The issue with BibRelation and Citation is that the content of that class is dependend on the fetcher -- with BibEntry we abstracted from it. But I do not see a "good" intermediate model covering the current fetcher and some future fetcher we do not know. I would not over-engineer here and just work on the single available fetcher and code around that single fetcher.

The only thing I would agree on is to add an interface ToBibEntry offering public BibEntry toBibEntry(). But this also feels over-engineered. (As we currently do not use elsewhere.

@alexandre-cremieux
Copy link
Contributor

alexandre-cremieux commented Nov 12, 2024

Hello @koppor

Thanks for your answer.

@alexandre-cremieux Thank you for the long explanation. I would call the class Citation to be in line with the name CitationFetcher. Maybe, we should get rid of that interface and just use SemanticScholarFetcher and SemanticScholarEntry to make it clear that the data is coming from SemanticScholar.

I would keep the interface: it helps for testing the service and it will help in the future if there is another source that became available.

The issue with BibRelation and Citation is that the content of that class is dependend on the fetcher -- with BibEntry we abstracted from it. But I do not see a "good" intermediate model covering the current fetcher and some future fetcher we do not know. I would not over-engineer here and just work on the single available fetcher and code around that single fetcher.

From my point of view: the issue is more related to the fact that we are using the BibEntry as the model for relation. Off course we are somehow dependent of the Fetcher but we can always abstract the supplier model by creating an ACL (anti corruption layer - DDD). And the base of this ACL could be done by introducing the BibRelation (or Citation but this a term that already exists in JabRef for another feature).

But main issue I could see in time is that the BibEntry set rules (equals, hashcode, etc) start to vary in such a way that it would impact also the relation storing and comparison logic because a citation relation is only considering a subset of the fields of a BibEntry. This could potentially lead the code to compare apple and pears at some point.

The only thing I would agree on is to add an interface ToBibEntry offering public BibEntry toBibEntry(). But this also feels over-engineered. (As we currently do not use elsewhere.

We can still live with the BibEntry as a model for citation relation but it will add more complexity for the developer for managing replacement in sets, equals and hash code in the future while avoiding mixing the concerns between citation relations and the BibEntry management. I am just wondering if this at some point will make the citations relation logic itself more complex in the long run if a new feature should be added (mapping, set management, etc.).

If this does not sound as a problem for you then I will continue to implement the MVStore using BibEntry.

@koppor
Copy link
Member Author

koppor commented Nov 13, 2024

From my point of view: the issue is more related to the fact that we are using the BibEntry as the model for relation. Off course we are somehow dependent of the Fetcher but we can always abstract the supplier model by creating an ACL (anti corruption layer - DDD). And the base of this ACL could be done by introducing the BibRelation (or Citation but this a term that already exists in JabRef for another feature).

Side track: Citaton can be renamed to LatexCitation. Not much impact. Only locally used.

But main issue I could see in time is that the BibEntry set rules (equals, hashcode, etc) start to vary in such a way that it would impact also the relation storing and comparison logic because a citation relation is only considering a subset of the fields of a BibEntry.

One can instantiate Sets and other collections with a separate comparator. - This is what I would do at this point.

Short example:

        Set<BibEntry> set = new TreeSet<>(
                Comparator.comparing((BibEntry entry) -> entry.getField(StandardField.TITLE).orElse(""))
                          .thenComparing(entry -> entry.getField(StandardField.YEAR).orElse(""))
        );

This could potentially lead the code to compare apple and pears at some point.

We do have issues at org.jabref.logic.database.DuplicateCheck, but I do not see how Citation solves handling typos in the title field and then treating an entry as equal.

We can still live with the BibEntry as a model for citation relation but it will add more complexity for the developer for managing replacement in sets, equals and hash code in the future while avoiding mixing the concerns between citation relations and the BibEntry management.

Maybe you can point me to concrete code parts where I can look at. Currently, I cannot decide on longer code.

I am just wondering if this at some point will make the citations relation logic itself more complex in the long run if a new feature should be added (mapping, set management, etc.).

In my world, JabRef has ways to compare BibEntrys (similarity etc). We know that the similarity does not work in all cases. But this should be fixed - and not new classes invented only because some cases do not work.

With the provided information, I think, a Comparator is the thing you are searching for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FirstTimeCodeContribution Triggers GitHub Greeter Workflow
Projects
Status: Assigned
Development

No branches or pull requests

4 participants