Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generalise "applicable" argument #25

Merged
merged 2 commits into from
Nov 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
77 changes: 64 additions & 13 deletions ianalyzer_readers/extract.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
import logging
import traceback
from typing import Any, Dict, Callable, Union, List, Optional, Iterable
import warnings

import bs4
import html
Expand All @@ -28,25 +29,37 @@ class Extractor(object):
An extractor contains a method that can be used to gather data for a field.

Parameters:
applicable: optional predicate that takes metadata and decides whether this
extractor is applicable. If left as `None`, the extractor is always
applicable.
transform: optional function to transform the postprocess the extracted value.
applicable:
optional argument to check whether the extractor can be used. This should
be another extractor, which is applied first; the containing extractor
is only applied if the result is truthy. Any extractor can be used, as long as
it's supported by the Reader in which it's used. If left as `None`, this
extractor is always applicable.
transform: optional function to transform or postprocess the extracted value.
'''

def __init__(self,
applicable: Optional[Callable[[Dict], bool]] = None,
applicable: Union['Extractor', Callable[[Dict], bool], None] = None,
transform: Optional[Callable] = None
):

if callable(applicable):
warnings.warn(
'Using a callable as "applicable" argument is deprecated; provide an '
'Extractor instead',
DeprecationWarning,
)

self.transform = transform
self.applicable = applicable


def apply(self, *nargs, **kwargs):
'''
Test if the extractor is applicable to the given arguments and if so,
try to extract the information.
'''
if self.applicable is None or self.applicable(kwargs.get('metadata')):
if self._is_applicable(*nargs, **kwargs):
result = self._apply(*nargs, **kwargs)
try:
if self.transform:
Expand All @@ -73,6 +86,31 @@ def _apply(self, *nargs, **kwargs):
raise NotImplementedError()


def _is_applicable(self, *nargs, **kwargs) -> bool:
'''
Checks whether the extractor is applicable, based on the condition passed as the
`applicable` parameter.

If no condition is provided, this is always true. If the condition is an
Extractor, this checks whether the result is truthy.

If the condition is a callable, it will be called with the document metadata as
an argument. This option is deprecated; you can use the Metadata extractor to
replace it.

Raises:
TypeError: Raised if the applicable parameter is an unsupported type.
'''
if self.applicable is None:
return True
if isinstance(self.applicable, Extractor):
return bool(self.applicable.apply(*nargs, **kwargs))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this check here double the time needed for indexing a given field with applicable? In case we have an "expensive" extractor, such as XML tree parsing, this might be an issue. Perhaps good to warn about this in the documentation of the applicable argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the only extractor allowed for the applicable argument the Metadata extractor? In this case, this is not a worry, but then this should be documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean because there is a processing cost in applying the extractor? If so, yes, that extra processing is added.

But I can't imagine an implementation where you can provide an expensive extractor without incurring its processing cost. How would that even work? Do we need to warn developers that if they provide expensive checks, those checks will actually be run?

I'm genuinely not sure how to formulate such a warning that isn't an obvious statement like "keep in mind that the code you provide needs to be executed as well".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I see your point. I guess the current setup actually makes the potential processing time more explicit, so perhaps that is in itself already a better documentation than having a callable here, which might be a wrapper around a time consuming XML tree parse, too.

Just a brainfart: Could we make the purpose of what is now implemented as a combination of Backup/Combined and applicable even more explicit in a ConditionalExtractor class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this?

Condition(
    if=XML('foo'),
    then=XML('bar'),
    else=XML('baz'),
)

I can see that working. Or did you have something else in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like that, indeed. Not necessary to overhaul this PR, but might be more readable in the long run.

if callable(self.applicable):
return self.applicable(kwargs.get('metadata'))
return TypeError(
f'Unsupported type for "applicable" parameter: {type(self.applicable)}'
)

class Choice(Extractor):
'''
Use the first applicable extractor from a list of extractors.
Expand All @@ -89,8 +127,21 @@ class Choice(Extractor):
This would extract `'foo'` if `some_condition` is met; otherwise,
the extracted value will be `'bar'`.

Note the difference with `Backup`: `Choice` is based on _metadata_, rather than the
extracted value.
Note the difference with `Backup`: `Backup` will select the first truthy value from a
list of extractors, but `Choice` only checks the `applicable` condition. For example:

Choice(
CSV('foo', applicable=Metadata('bar')),
CSV('baz'),
)

Backup(
CSV('foo', applicable=Metadata('bar')),
CSV('baz'),
)

These extractors behave differently if the "bar" condition holds, but the "foo" field
is empty. `Backup` will try to extract the "baz" field, `Choice` will not.

Parameters:
*extractors: extractors to choose from. These should be listed in descending
Expand All @@ -102,10 +153,10 @@ def __init__(self, *extractors: Extractor, **kwargs):
self.extractors = list(extractors)
super().__init__(**kwargs)

def _apply(self, metadata: Dict, *nargs, **kwargs):
def _apply(self, *nargs, **kwargs):
for extractor in self.extractors:
if extractor.applicable is None or extractor.applicable(metadata):
return extractor.apply(metadata=metadata, *nargs, **kwargs)
if extractor._is_applicable(*nargs, **kwargs):
return extractor.apply(*nargs, **kwargs)
return None


Expand Down Expand Up @@ -149,8 +200,8 @@ class Backup(Extractor):
Since the first extractor returns `None`, the second extractor will be used, and the
extracted value would be `'foo'`.

Note the difference with `Choice`: `Backup` is based on the _extracted value_, rather
than metadata.
Note the difference with `Choice`: `Backup` is based on the _extracted value_,
`Choice` on the `applicable` parameter of each extractor.

Parameters:
*extractors: extractors to use. These should be listed in descending order of
Expand Down
18 changes: 16 additions & 2 deletions tests/test_generic_extractors.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
import pytest
from ianalyzer_readers.extract import (
Constant, Combined, Backup, Choice, Metadata, Pass, Order
)
Expand Down Expand Up @@ -32,7 +33,7 @@ def test_choice_extractor():
extractor = Choice(
Constant(
'first',
applicable=lambda metadata: metadata.get('check'),
applicable=Metadata('check'),
),
Constant(
'second'
Expand Down Expand Up @@ -70,4 +71,17 @@ def test_pass_extractor():
def test_order_extractor():
extractor = Order()
output = extractor.apply(index=1)
assert output == 1
assert output == 1


def test_extractor_applicable_extractor():
extractor = Constant('test', applicable=Metadata('testing'))
assert extractor.apply(metadata={'testing': True}) == 'test'
assert extractor.apply(metadata={'testing': False}) == None


def test_extractor_applicable_callable():
with pytest.warns(DeprecationWarning):
extractor = Constant('test', applicable=lambda metadata: metadata['testing'])
assert extractor.apply(metadata={'testing': True}) == 'test'
assert extractor.apply(metadata={'testing': False}) == None
Loading