generalise "applicable" argument #25

Merged: lukavdplas merged 2 commits into develop from feature/generalise_applicable_arg on Nov 20, 2024


lukavdplas
Contributor

This implements the suggestion in #8. It changes the "applicable" argument of extractors to cover more use cases. See the issue for examples.

API changes

Currently, the value of applicable, if any, must be a function that takes the metadata as input. In this version, applicable should be an extractor.

Providing a function is still supported, but I've added a DeprecationWarning. Extractors that use applicable can be updated to use a Metadata extractor (possibly with transform to add a custom function).
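
A minimal sketch of the migration described above, using stand-in classes rather than the library's actual code. The class bodies, the single `metadata` argument to `apply`, the warning text, and the `language` metadata key are all assumptions made for illustration; only the overall shape (extractor or deprecated function as `applicable`, `Metadata` with `transform`) comes from this description.

```python
import warnings
from typing import Any, Callable, Optional, Union


class Extractor:
    """Hypothetical stand-in for the library's extractor base class."""

    def __init__(
        self,
        applicable: Union["Extractor", Callable[[dict], Any], None] = None,
        transform: Optional[Callable[[Any], Any]] = None,
    ):
        self.applicable = applicable
        self.transform = transform

    def apply(self, metadata: dict) -> Any:
        # Skip extraction entirely when the condition is not met.
        if not self._is_applicable(metadata):
            return None
        value = self._apply(metadata)
        return self.transform(value) if self.transform else value

    def _apply(self, metadata: dict) -> Any:
        raise NotImplementedError

    def _is_applicable(self, metadata: dict) -> bool:
        if self.applicable is None:
            return True
        if isinstance(self.applicable, Extractor):
            # New style: the condition is itself an extractor.
            return bool(self.applicable.apply(metadata))
        # Old style: a plain function of the metadata, still accepted.
        warnings.warn(
            "passing a function as `applicable` is deprecated; use an extractor",
            DeprecationWarning,
            stacklevel=2,
        )
        return bool(self.applicable(metadata))


class Metadata(Extractor):
    """Stand-in that reads one key from the metadata dictionary."""

    def __init__(self, key: str, **kwargs):
        super().__init__(**kwargs)
        self.key = key

    def _apply(self, metadata: dict) -> Any:
        return metadata.get(self.key)


class Constant(Extractor):
    """Stand-in that always returns the same value."""

    def __init__(self, value: Any, **kwargs):
        super().__init__(**kwargs)
        self.value = value

    def _apply(self, metadata: dict) -> Any:
        return self.value


# Deprecated form: a function of the metadata.
old_style = Constant("x", applicable=lambda md: md.get("language") == "en")

# Updated form: a Metadata extractor, with `transform` carrying the custom check.
new_style = Constant(
    "x",
    applicable=Metadata("language", transform=lambda lang: lang == "en"),
)

print(new_style.apply({"language": "en"}))  # x
print(new_style.apply({"language": "nl"}))  # None
```

In this sketch both forms give the same result; the old form additionally emits a `DeprecationWarning` each time the condition is checked.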

if self.applicable is None:
    return True
if isinstance(self.applicable, Extractor):
    return bool(self.applicable.apply(*nargs, **kwargs))
Member

Won't this check here double the time needed for indexing a given field with applicable? In case we have an "expensive" extractor, such as XML tree parsing, this might be an issue. Perhaps good to warn about this in the documentation of the applicable argument.

Member

Or is the only extractor allowed for the applicable argument the Metadata extractor? In this case, this is not a worry, but then this should be documented.

Contributor Author

You mean because there is a processing cost in applying the extractor? If so, yes, that extra processing is added.

But I can't imagine an implementation where you can provide an expensive extractor without incurring its processing cost. How would that even work? Do we need to warn developers that if they provide expensive checks, those checks will actually be run?

I'm genuinely not sure how to formulate such a warning that isn't an obvious statement like "keep in mind that the code you provide needs to be executed as well".
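
To make the cost under discussion concrete, here is a small sketch with a hypothetical `CountingExtractor` class (not part of the library). It counts how often each extractor's work runs: the condition is evaluated for every document, and the main extraction only runs when the condition holds.

```python
from typing import Any, Optional


class CountingExtractor:
    """Hypothetical extractor that records how often its work is executed."""

    def __init__(self, value: Any, applicable: Optional["CountingExtractor"] = None):
        self.value = value
        self.applicable = applicable
        self.runs = 0

    def apply(self) -> Any:
        # The condition is checked first; its cost is paid for every document.
        if self.applicable is not None and not self.applicable.apply():
            return None
        self.runs += 1  # stands in for the (possibly expensive) extraction step
        return self.value


# Condition holds: both the check and the main extraction run once.
check_true = CountingExtractor(True)
extracted = CountingExtractor("bar", applicable=check_true)
print(extracted.apply(), check_true.runs, extracted.runs)

# Condition fails: only the check runs, the main extraction is skipped.
check_false = CountingExtractor(False)
skipped = CountingExtractor("bar", applicable=check_false)
print(skipped.apply(), check_false.runs, skipped.runs)
```

So the total work is roughly the cost of the check plus, when it passes, the cost of the extraction. It only approaches double when the check is about as expensive as the extraction itself.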

Member

Hmm, I see your point. I guess the current setup actually makes the potential processing time more explicit, so perhaps that is itself better documentation than a callable here, which might also be a wrapper around a time-consuming XML tree parse.

Just a brainfart: Could we make the purpose of what is now implemented as a combination of Backup/Combined and applicable even more explicit in a ConditionalExtractor class?

Contributor Author

Something like this?

Condition(
    if=XML('foo'),
    then=XML('bar'),
    else=XML('baz'),
)

I can see that working. Or did you have something else in mind?
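
As a rough sketch of how that could look in runnable Python: `if` and `else` are reserved keywords and cannot be used as argument names, so this version uses `condition`, `then`, and `otherwise` instead. The `Conditional` class, its argument names, and the `Constant` stand-in are hypothetical; nothing like this exists in the PR.

```python
from typing import Any


class Constant:
    """Hypothetical stand-in for any extractor: returns a fixed value."""

    def __init__(self, value: Any):
        self.value = value

    def apply(self, **kwargs) -> Any:
        return self.value


class Conditional:
    """Hypothetical extractor that picks a branch based on a condition extractor."""

    def __init__(self, condition, then, otherwise=None):
        self.condition = condition
        self.then = then
        self.otherwise = otherwise

    def apply(self, **kwargs) -> Any:
        # Only the selected branch is evaluated; the other one is never applied.
        branch = self.then if self.condition.apply(**kwargs) else self.otherwise
        return branch.apply(**kwargs) if branch is not None else None


# Constant("foo") plays the role of a condition that finds something,
# Constant(None) the role of one that finds nothing.
with_foo = Conditional(Constant("foo"), then=Constant("bar"), otherwise=Constant("baz"))
without_foo = Conditional(Constant(None), then=Constant("bar"), otherwise=Constant("baz"))

print(with_foo.apply())     # bar
print(without_foo.apply())  # baz
```

Leaving out `otherwise` makes the extractor return `None` when the condition fails, which matches how an inapplicable extractor behaves in the sketch.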

Member

Something like that, indeed. Not necessary to overhaul this PR, but might be more readable in the long run.

Member

BeritJanssen left a comment

Very good idea to generalize the applicable argument! I noticed that the documentation of the applicable argument still needs to be updated. Also to clarify what kind of extractors can be used in the applicable argument: only the Metadata extractor? If this is meant to work with any kind of extractor, please warn about "expensive" extractors, which might, for instance, parse a large XML tree, and add unit tests for other extractors used in applicable.

@lukavdplas
Contributor Author

Also to clarify what kind of extractors can be used in the applicable argument: only the Metadata extractor?

Good point, I'll add that to the documentation. It's any extractor, as long as it's supported by the Reader.

lukavdplas force-pushed the feature/generalise_applicable_arg branch from c8bb9e1 to fea5636 on November 14, 2024 13:53
lukavdplas merged commit 948bef3 into develop on Nov 20, 2024
8 checks passed
lukavdplas deleted the feature/generalise_applicable_arg branch on November 20, 2024 15:06