Display in DEBUG all the important ZIM metadata #123

Closed
kelson42 opened this issue Nov 4, 2023 · 13 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@kelson42
Contributor

kelson42 commented Nov 4, 2023

If I look at a log, I see something like this:

[DEBUG] Confirming output is writable using /output/tmpda8c84ht
[DEBUG] Title: Personal Security Checklist
[DEBUG] Language: en
[DEBUG] Favicon: https://github.githubassets.com/favicons/favicon.png
[DEBUG] Found WARC record for favicon: https://github.githubassets.com/favicons/favicon.png

I would like to list all the important ZIM metadata here, in particular the Description and the LongDescription, which might make the whole process die shortly afterwards (if longer than the maximum size). See https://wiki.openzim.org/wiki/Metadata for the full list of ZIM metadata.
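
A minimal sketch of the kind of DEBUG output requested here. This is not warc2zim code: `log_metadata` and `MAX_LENGTHS` are hypothetical names, and the length limits are assumptions taken from my reading of the wiki page linked above, so they should be checked against it.

```python
import logging

logger = logging.getLogger("warc2zim.sketch")  # hypothetical logger name

# Assumed limits (see the openZIM metadata wiki page); not read from any library
MAX_LENGTHS = {"Title": 30, "Description": 80, "LongDescription": 4000}


def log_metadata(metadata: dict[str, str]) -> list[str]:
    """Log every metadata entry at DEBUG level; return the keys over their limit."""
    too_long = []
    for key, value in metadata.items():
        limit = MAX_LENGTHS.get(key)
        # len() counts code points, which only approximates a character count
        length = len(value)
        if limit is not None and length > limit:
            too_long.append(key)
            logger.debug(f"{key} ({length} chars, max {limit}): {value}")
        else:
            logger.debug(f"{key}: {value}")
    return too_long
```

Logging the value together with its length means the log already shows which entry is too long if the process dies right afterwards.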

@kelson42 kelson42 added enhancement New feature or request good first issue Good for newcomers labels Nov 4, 2023
@richterdavid

Should this issue be moved to the repo for warc2zim?

"Confirming output is writable using" is at

logger.debug(f"Confirming output is writable using {fh.name}")

"Found WARC record for favicon" is at

logger.debug(f"Found WARC record for favicon: {self.favicon_url}")

@kelson42
Contributor Author

kelson42 commented Nov 6, 2023

@richterdavid Yes, it is something to implement in warc2zim.

@kelson42 kelson42 transferred this issue from openzim/zimit Nov 6, 2023
@richterdavid

I'd like to work on this.

@benoit74
Collaborator

I don't think this issue is relevant anymore, at least not as it is phrased / oriented today.

What has already been decided for all scrapers is that we must use the validate_metadata methods from zimscraperlib to fail the scraper early when the metadata used is incorrect; this displays the offending metadata value.

This still has to be implemented in warc2zim, I've just opened #123

Do we consider it would still help to display all metadata once validated, e.g. for the case where the validation checks are improperly implemented? If yes, it would be better to do this in python-scraperlib anyway, so that all scrapers benefit from this enhancement.

@benoit74
Collaborator

Oops, I'm wrong: the validation does not display the offending value. Maybe we can consider also displaying the offending value.
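
A minimal sketch of what "also display the offending value" could look like. `MetadataError` and `check_description` are hypothetical names used for illustration only, not the zimscraperlib API, and the default limit of 80 is an assumption.

```python
class MetadataError(ValueError):
    """Hypothetical error type; not part of zimscraperlib."""


def check_description(value: str, max_length: int = 80) -> None:
    """Raise an error that names the limit and includes the rejected value."""
    if len(value) > max_length:
        raise MetadataError(
            f"Description is too long ({len(value)} > {max_length} chars): {value!r}"
        )
```

Using `{value!r}` keeps whitespace and quotes visible in the message, which helps when the problem is a stray newline or trailing space.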

@richterdavid

> This still has to be implemented in warc2zim, I've just opened #123

That link goes to this issue; which issue did you mean, @benoit74?

@kelson42
Contributor Author

> Oops, I'm wrong: the validation does not display the offending value. Maybe we can consider also displaying the offending value.

Yes, this is so important!

@kelson42
Contributor Author

> This still has to be implemented in warc2zim, I've just opened #123

I guess this is the wrong issue number

> Do we consider it would still help to display all metadata once validated, e.g. for the case where the validation checks are improperly implemented? If yes, it would be better to do this in python-scraperlib anyway, so that all scrapers benefit from this enhancement.

Yes and yes

@benoit74
Collaborator

> This still has to be implemented in warc2zim, I've just opened #123

Yes, the proper issue is #235

> Yes and yes

Closing this issue then. We will implement what is needed in python-scraperlib, and the next upgrade of the dependency in warc2zim will deploy it "automatically", so I don't think we need to track this here.

@benoit74
Collaborator

I've just opened it: openzim/python-scraperlib#155

@kelson42
Contributor Author

@benoit74 A few points regarding the process

@kelson42
Contributor Author

Last comment mostly invalidated by newly created ticket at scraperlib. @benoit74 thx

@benoit74
Collaborator

> Last comment mostly invalidated by newly created ticket at scraperlib. @benoit74 thx

Next time I will open the ticket before closing the other one ^^

#235 is indeed a different thing. I will most probably implement it, but it is not necessary, and it is not urgent because recent scraperlib already validates metadata. The goal of #235 is to validate as early as possible: today validation happens a bit late, since it is not part of the "early check" done by zimit but runs after the crawl and after some warc2zim processing.
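
A minimal sketch of the ordering described above, assuming nothing about how zimit or warc2zim are actually structured. `run_pipeline` and its `validate`, `crawl` and `convert` parameters are hypothetical stand-ins for the real steps.

```python
from typing import Callable, Dict

Metadata = Dict[str, str]


def run_pipeline(
    metadata: Metadata,
    validate: Callable[[Metadata], None],
    crawl: Callable[[], str],
    convert: Callable[[str, Metadata], str],
) -> str:
    """Run validation before the crawl so bad metadata fails the run up front."""
    # Early check: if this raises, the crawl below never starts
    validate(metadata)
    warc_path = crawl()
    return convert(warc_path, metadata)
```

With this ordering, an invalid Description stops the run before the crawl rather than after it.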
