Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opds dumper need to read all archives to have illustration size. #625

Closed
mgautierfr opened this issue Oct 25, 2021 · 6 comments · Fixed by #630
Closed

Opds dumper need to read all archives to have illustration size. #625

mgautierfr opened this issue Oct 25, 2021 · 6 comments · Fixed by #630
Assignees
Milestone

Comments

@mgautierfr
Copy link
Member

https://github.com/kiwix/libkiwix/blob/master/src/opds_dumper.cpp#L57-L70 make the opds dumper read all zim file in the library to create the opds stream.

The normal behavior should be that we trust the library.xml (when loading from it) and do not read the file to update the metadata information we know about the books.
This kind of method should be on the library side (Manager::readBookFromPath or Book::update)

On my side, with 14 zim files on usb extern hard drive, the first opds request from the home page takes more than 1 minutes.

@kelson42 kelson42 added this to the 10.0.0 milestone Oct 28, 2021
@kelson42
Copy link
Collaborator

Seems a blocker for 10.0.0 to me. We deal with thousands of ZIM files in library.kiwix.org, we have to trust the library.xml and it should be performant.

@kelson42
Copy link
Collaborator

@veloman-yunkan This is really a performance blocker for the next release. Please have a look.

@mgautierfr
Copy link
Member Author

To be exact, we don't read all archives in the library but all archives returned by the opds stream. If the stream is paged (limited to N books) we read only N archives, but if not, we read all archives.
But, another issue, we do this every time, for each requests, as there is no cache. The fs cache help a bit but that all.

@veloman-yunkan
Copy link
Collaborator

https://github.com/kiwix/libkiwix/blob/master/src/opds_dumper.cpp#L57-L70 make the opds dumper read all zim file in the library to create the opds stream.

The normal behavior should be that we trust the library.xml (when loading from it) and do not read the file to update the metadata information we know about the books. This kind of method should be on the library side (Manager::readBookFromPath or Book::update)

On my side, with 14 zim files on usb extern hard drive, the first opds request from the home page takes more than 1 minutes.

@kelson42 @mgautierfr

The said change in OPDSDumper was introduced by #577 while fixing #533. The current format of the library XML doesn't allow for multiple illustrations, that's why the needed information is read from the ZIM archives. To address this issue, we must enhance the library XML format:

  1. allow multiple illustrations per book
  2. include information about dimensions of each illustration

Tools that generate library XMLs must be updated as well.

In this context, I wonder why we need a separate XML format for the library. Can't we replace the <book> nodes in the library XML with (probably slightly customized) OPDS <entry> nodes, so that we can eliminate some functional duplication between kiwix::Book::updateFromXml() and Book::updateFromOpds()?

@mgautierfr
Copy link
Member Author

In this context, I wonder why we need a separate XML format for the library. Can't we replace the nodes in the library XML with (probably slightly customized) OPDS nodes, so that we can eliminate some functional duplication between kiwix::Book::updateFromXml() and Book::updateFromOpds()?

This is a question we are asking ourselves (@kelson42 and I)
We've already discussed this in live during the hackathon but here is where we are now :

  • It would be interesting to somehow merge the two formats. Less code is better.
  • But we have to be careful. Although the two formats are pretty similar, they respond to different purposes: The OPDS stream is here to expose a remote catalog and library.xml is here to describe a local library.
    There are small subtle differences (opds stream has no path. It contains only links to illustrations, not the illustration data itself). We don't know if this difference in usage is a blocker or not.
  • We need a transition plan (we don't want the user to "lose" its library when he updates his software).
  • Maybe we could move this xml parsing/generating part in a separated library to allow other projects (CMS, ..) to not reinvent the wheel.

Reuse the <entry> node seems a good idea (I always dislike this huge attributes list in the <book> node)
@veloman-yunkan If you have a idea of how we could do this, please do not hesitate to speak about it (hero or in another dedicated issue).

@kelson42
Copy link
Collaborator

Related to kiwix/overview#59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants