Significantly enhance the safety of metadata manipulation #221

Merged
merged 13 commits into from
Dec 17, 2024

Conversation

benoit74

@benoit74 benoit74 commented Nov 21, 2024

Fix #205

This is a full rewrite of #217, so I've opened a new PR: from my PoV, reviewing only the changes since the last review no longer made sense.

  • add types for all metadata, one type per metadata name plus some generic ones for non-standard metadata
    • all types are responsible for validating the metadata value at initialization time
    • validation checks for adherence to the ZIM specification and conventions are automated
    • cleanup of unwanted control characters and stripping of whitespace are automated in all text metadata
    • whenever possible, try to automatically clean "reasonably" bad metadata (e.g. automatically accept and remove duplicate tags, which are harmless, but not duplicate language codes: codes are supposed to be ordered, so duplicates are a weird situation); this is an alignment of paradigm, because for some metadata the lib was permissive while for others it was quite restrictive; this PR tries to align this and make the lib as permissive as possible, so that a scraper does not fail because of something which could be fixed automatically
    • it is now possible to disable ZIM conventions checks with zim.metadata.check_metadata_conventions
  • simplify zim.creator.Creator.config_metadata by using these types and being stricter:
    • add new StandardMetadata class for standard metadata, including the list of mandatory ones
    • by default, all non-standard metadata must start with X- prefix
      • this is not yet an openZIM convention / specification, so it is possible to disable this check with the fail_on_missing_prefix argument
  • simplify add_metadata, use same metadata types
  • simplify zim.creator.Creator.start with new types, and drop all metadata from memory after being passed to the libzim
  • drop zim.creator.convert_and_check_metadata (not useful anymore; simply use the proper metadata type)
  • move MANDATORY_ZIM_METADATA_KEYS and DEFAULT_DEV_ZIM_METADATA from constants to zim.metadata to avoid circular dependencies
  • new inputs.unique_values utility function to compute the list of unique values from a given list while preserving the initial list order
  • in __init__ of zim.creator.Creator, rename disable_metadata_checks to check_metadata_conventions for clarity and brevity
    • beware that this manipulates the global zim.metadata.check_metadata_conventions, so if you have many creators running in parallel, they can't have different settings: the last one initialized will "win"
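
To illustrate the order-preserving deduplication and the "permissive for tags, strict for languages" distinction described above, here is a minimal sketch. It is not the code from this PR: `clean_tags` and `check_languages` are hypothetical helper names, and the body of `unique_values` is only one plausible way to implement the behaviour described.

```python
from collections.abc import Iterable


def unique_values(items: Iterable[str]) -> list[str]:
    """Return the items without duplicates, keeping first-seen order."""
    # dict keys keep insertion order, so this drops later duplicates only
    return list(dict.fromkeys(items))


def clean_tags(tags: Iterable[str]) -> list[str]:
    """Hypothetical helper: duplicate tags are harmless, so drop them silently."""
    stripped = (tag.strip() for tag in tags)
    return unique_values(tag for tag in stripped if tag)


def check_languages(codes: list[str]) -> list[str]:
    """Hypothetical helper: duplicate language codes are suspicious, so fail."""
    if len(set(codes)) != len(codes):
        raise ValueError("duplicate language codes are not accepted")
    return codes
```

A plain `set()` would also remove duplicates but would lose the original order, which is why the sketch relies on dictionary key ordering instead.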

Nota:

  • I've moved many tests from tests/zim/test_zim_creator.py to tests/zim/test_metadata.py since most checks are now done at metadata initialization instead of when config_metadata or start are called, but coverage is similar

@benoit74 benoit74 self-assigned this Nov 21, 2024
@benoit74 benoit74 force-pushed the safe_metadata_revamp branch from 945daa8 to c064b54 Compare November 21, 2024 22:22
@benoit74 benoit74 force-pushed the safe_metadata_revamp branch 2 times, most recently from 2bccb8d to 6033f68 Compare November 21, 2024 22:33

codecov bot commented Nov 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (76a6408) to head (7e2efa1).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##              main      #221    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           38        39     +1     
  Lines         2227      2447   +220     
  Branches       426       335    -91     
==========================================
+ Hits          2227      2447   +220     


@benoit74 benoit74 force-pushed the safe_metadata_revamp branch 2 times, most recently from e759a36 to 298beef Compare November 22, 2024 07:46
@benoit74 benoit74 marked this pull request as ready for review November 22, 2024 07:47
@benoit74 benoit74 requested a review from rgaudin November 22, 2024 07:47

@rgaudin rgaudin left a comment

Wow that's a lot of change!
See inline comments ; maybe we should discuss it live once you've looked at it

src/zimscraperlib/inputs.py
src/zimscraperlib/zim/creator.py (outdated)
src/zimscraperlib/zim/creator.py (outdated)
src/zimscraperlib/zim/creator.py
src/zimscraperlib/zim/creator.py (outdated)
src/zimscraperlib/zim/metadata.py (outdated)
src/zimscraperlib/zim/metadata.py (outdated)
src/zimscraperlib/zim/metadata.py (outdated)
src/zimscraperlib/zim/metadata.py (outdated)
src/zimscraperlib/zim/metadata.py (outdated)

rgaudin commented Dec 12, 2024

Here's my attempt at making those explicit, dedicated metadata easier to apprehend and use. I wrote this quickly at the time, then a long time passed before I looked at it again, and I had to fix the tests to make it work.

  • A base, generic yet usable Metadata type.
    • works with bytes-like input
    • takes an optional name
    • defines the standard API: libzim_value, value, meta_name, etc.
  • Dedicated subclasses for well-known metadata: LanguageMetadata, TagsMetadata, DateMetadata, etc.
    • Takes proper native type as input
  • Dedicated subclasses for CustomMetadata and CustomTextMetadata as you had. I also added XCustomMetadata for auto-prefixing.

I chose an approach in which the flexibility is built into the base Metadata class, using class variables that get overridden by the subclasses and can be shadowed by instances should the need arise.
Given that the constraints/features for those metadata can be multiple and combined, I figured decorators are a very readable way to specify those.
I chose to expose each (is_required, join_list_with, etc) so it's clear, can be properly documented and is properly typed.
The main API for this, though, is the expecting variable, which decides which input-type-to-libzim_data conversion function is used. It could be done differently of course.
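
As a rough picture of that design (class variables overridden by subclasses, shadowed by instances, and set through decorators), here is a simplified sketch. It is an assumption-laden illustration, not the PR's code: the `mandatory` decorator and the `to_libzim` hook are hypothetical names, and the real classes expose more (such as `join_list_with`).

```python
from __future__ import annotations


def mandatory(cls):
    """Hypothetical class decorator: mark a metadata type as required."""
    cls.is_required = True
    return cls


class Metadata:
    # Class-level defaults: subclasses override them, instances may shadow them
    meta_name: str = ""
    is_required: bool = False

    def __init__(self, value, name: str | None = None) -> None:
        if name is not None:
            self.meta_name = name  # instance attribute shadows the class variable
        if not self.meta_name:
            raise ValueError("metadata name is missing")
        # Validation and conversion happen at initialization time
        self.libzim_value: bytes = self.to_libzim(value)

    def to_libzim(self, value) -> bytes:
        """Base type works with bytes-like input only."""
        if not isinstance(value, (bytes, bytearray)):
            raise TypeError("expected bytes-like value")
        return bytes(value)


@mandatory
class TitleMetadata(Metadata):
    meta_name = "Title"

    def to_libzim(self, value) -> bytes:
        """Text metadata: take a native str, strip it, encode it."""
        if not isinstance(value, str) or not value.strip():
            raise ValueError("expected a non-empty string")
        return value.strip().encode("utf-8")
```

With this shape, `TitleMetadata("  My ZIM ")` needs no name, while a generic `Metadata(b"abc", "X-Custom")` receives its name per instance, which is also why the value comes first in the base signature.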

One thing that's a tad annoying is that name is optional in Metadata() so that it can use the class variable (of the subclasses). It thus requires passing value first then name, which seems odd… but Metadata is not to be used frequently.
In the CustomMetadata ones, given that a name is mandatory, I reversed it.
Maybe we could solve it by requiring kwargs on Metadata…

Another take was to stop accepting bytes as inputs, as well as strings, where there's no good reason to. For instance, DateMetadata takes a date or datetime only now. Supporting those extras is an additional burden and there's no real value in our use case. "2024-11-21" and date(2024, 11, 21) are quite close, and in general we use computed dates anyway (.today()). Same goes for tags/languages, which now require a list of strings.
This led to a lot of changes in the tests
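
A minimal sketch of what "date or datetime only" can look like, assuming the stored value is the `YYYY-MM-DD` form used for the ZIM `Date` metadata; `date_to_libzim` is a hypothetical function name, not part of the library.

```python
import datetime


def date_to_libzim(value: datetime.date) -> bytes:
    """Convert a date (or datetime) to its YYYY-MM-DD bytes representation."""
    # datetime.datetime is a subclass of datetime.date, so both are accepted;
    # strings such as "2024-11-21" are deliberately rejected
    if not isinstance(value, datetime.date):
        raise TypeError("expected a date or datetime")
    return value.strftime("%Y-%m-%d").encode("utf-8")
```

Callers holding a string can still parse it themselves with `datetime.date.fromisoformat` before building the metadata.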

Major changes that had to be introduced (we'll work on the CHANGELOG if we pursue this)

  • Use of beartype for zimscraperlib.zim module.

We need to validate this but this frees us from a lot of meaningless tests and really simplifies dev and maintenance.
We should do it on zimscraperlib module so it applies everywhere but of course we have errors in the codebase so it requires some fixes.
I did not remove all the extra tests (only where it was expecting a different exp)

  • Fixed types for CounterMap and MimetypeAndCounter in zim/_libkiwix (revealed by beartype)
  • zim.Archive.counters to return a CounterMap
  • Introduced new typing module with a new Callback class to normalize callback management (we already knew the previous way was sloppy and beartype wasn't happy)
  • Changed filesystem.delete_callback into a simple callback function which doesn't know about callback chaining
  • zim.Creator.add_item_for and zim.creator.add_item to replace callback args with callbacks arg accepting a single Callback or a list of them
  • Removed check_metadata_conventions from creator. Keeping the concept of a single toggle in the codebase (renamed it to APPLY_RECOMMENDATIONS) but such a global var should be explicitly set.
  • new DefaultIllustrationMetadata type for this special one that's required.
  • IllustrationMetadata that automatically takes care of the meta name. Saves a lot of trouble.
  • MANDATORY_ZIM_METADATA_KEYS still built dynamically although I'm not sure it's worth it.
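
To make the "single Callback or a list of them" idea concrete, here is a hedged sketch. The field names (`func`, `args`, `kwargs`) and the `run_callbacks` helper are assumptions for illustration; the actual class in the new typing module may be shaped differently.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Callback:
    """Explicit callback definition: a callable plus the arguments to pass."""

    func: Callable[..., Any]
    args: tuple[Any, ...] = ()
    kwargs: dict[str, Any] = field(default_factory=dict)

    def call(self) -> Any:
        return self.func(*self.args, **self.kwargs)


def run_callbacks(callbacks: Callback | list[Callback] | None) -> None:
    """Hypothetical helper: accept one Callback, a list of them, or nothing."""
    if callbacks is None:
        return
    if isinstance(callbacks, Callback):
        callbacks = [callbacks]
    # Chaining lives here, so each callback can stay a plain function
    for callback in callbacks:
        callback.call()
```

Keeping the chaining in one place is what lets something like delete_callback become a simple function that knows nothing about the callbacks that follow it.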

I wonder if we should offer a way to create the StandardMetadataSet with values directly (to make it less verbose). It would be quite easy now with something like the following (assuming the expects_* decorators add that type):

  class StandardMetadataList:

    ...

    @classmethod
    def from_values(
        cls,
        Name: NameMetadata.input_type,
        Language: LanguageMetadata.input_type,
        Title: TitleMetadata.input_type,
        Creator: CreatorMetadata.input_type,
        Publisher: PublisherMetadata.input_type,
        Date: DateMetadata.input_type,
        Illustration_48x48_at_1: DefaultIllustrationMetadata.input_type,
        Description: DescriptionMetadata.input_type,
        LongDescription: LongDescriptionMetadata.input_type | None = None,
        Tags: TagsMetadata.input_type | None = None,
        Scraper: ScraperMetadata.input_type | None = None,
        Flavour: FlavourMetadata.input_type | None = None,
        Source: SourceMetadata.input_type | None = None,
        License: LicenseMetadata.input_type | None = None,
        Relation: RelationMetadata.input_type | None = None,
    ): ...

All tests are passing. I did not write new ones, but I had to update coverage as it was way off (hence the dev-deps update). It still is to some extent: coverage might have difficulty understanding the decorator thing. We'll have to look into that.

benoit74 and others added 13 commits December 17, 2024 15:22
- Explicit callback definition
- simplified delete_callback to be a dumb callback (not chaining)
Reasoning: coverage reported a lot of missing lines on zim/metadata.py with previous version

Also includes auto linting where new ruff complained
In order to properly expose input type in __init__ (for pyright and user assist),
use one base class (subclassing Metadata) per input type.

Can't get rid of the `Any` on `Metadata` init (otherwise it would mean re-implementing the init everywhere).

Used the opportunity to remove the `expecting` classvar and modified tests accordingly

- Also fixed a minor issue in bytes reading by seeking back to previous position and not zero.
- Also shared binary reading logic inside main base class (was already there) so it can be reused in illustration
- Now explicitly says the type of stored data (can be different to inputs in somewhat flexible ones)
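
The seek fix mentioned in these commit notes can be pictured with the sketch below. It assumes the helper reads the whole stream from its start; the function name `read_preserving_position` is hypothetical and the real shared logic in the base class may differ.

```python
import io
from typing import BinaryIO


def read_preserving_position(stream: BinaryIO) -> bytes:
    """Read the whole stream, then restore the caller's position."""
    position = stream.tell()  # remember where the caller was
    try:
        stream.seek(0)
        return stream.read()
    finally:
        # Seek back to the previous position, not to zero, so a caller that
        # had already consumed part of the stream is not silently rewound
        stream.seek(position)
```

For example, after `stream = io.BytesIO(b"hello world")` and `stream.seek(6)`, the helper returns the full content and leaves the stream at offset 6.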
@benoit74 benoit74 force-pushed the safe_metadata_revamp branch from a2c1456 to 7e2efa1 Compare December 17, 2024 15:23
@benoit74

Thanks a lot, nothing left to add, I like it! Glad we've made this "not-negligible" move. I've just force-pushed to fix up commits and rebase on main.

@benoit74 benoit74 merged commit a0a225b into main Dec 17, 2024
9 checks passed
@benoit74 benoit74 deleted the safe_metadata_revamp branch December 17, 2024 15:30
Successfully merging this pull request may close these issues.

[next major] remove **extra from Creator.config_metadata