Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add version parsing #661

Open
wants to merge 3 commits into
base: images
Choose a base branch
from
Open

add version parsing #661

wants to merge 3 commits into from

Conversation

MaxDall
Copy link
Collaborator

@MaxDall MaxDall commented Nov 13, 2024

This PR reworks the versioning system of images, adds extended serialization and parses size for image versions.

@flairNLP flairNLP deleted a comment Nov 13, 2024
Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, this is a really useful addition. I do have a couple comments and questions though :)

def __repr__(self) -> str:
return f"{self.width}x{self.height or '...'}"

def __lt__(self, other: "Dimension") -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be a bit cleaner to add type check for other to cause an accurate exception to be raised.

min_width: Optional[int] = None
size: Optional[Dimension] = None
type: Optional[str] = None
file_formats: ClassVar[List[str]] = ["png", "jpg", "jpeg", "webp"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep this as a Class Variable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue yes, cause ultimately it belongs solely to the class. As a compromise, I changed it to __FILE_FORMATS__

return None


_min_width_pattern = re.compile(r"min-width:\s*(?P<minwidth>[0-9]*)(?P<descriptor>[pxem]+)")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to some css docs, possible units are also % and ch



_min_width_pattern = re.compile(r"min-width:\s*(?P<minwidth>[0-9]*)(?P<descriptor>[pxem]+)")
_width_x_height_pattern = re.compile(r"(?P<width>[0-9]+)x(?P<height>[0-9]+)")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to also match max-width, which is sometimes used instead of min-width for smaller images.
e.g. https://www.zeit.de/gesundheit/2023-11/neues-alzheimer-medikament-zulassung-lecanemab-leqembi-demenz

if descriptor != "px":
logger.debug(f"Pixel calculation not implemented for {descriptor}")
else:
min_width = int(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like an iteration over all matches is not the best solution here. Granted, we would only expect one match, but in case there are more, overwriting all previous matches seems not ideal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote the logic. We now assume that there is only one max-/min-width per query string (separated through ,) and either a max- or min-width defined.

if descriptor is not None:
if match := re.search(r"(?P<multiplier>[0-9.]+)x", descriptor):
kwargs["dpr"] = float(match.group("multiplier"))
elif match := re.search(r"(?P<width>[0-9]+)(px|w)", descriptor):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other technically allowed units of measurement are % and em, also w doesn't seem to be a valid unit (https://www.w3schools.com/cssref/css_units.php)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URLs coming are coming from the srcset attribute. Allowed descriptors are x, w, I also found some px

src/fundus/parser/utility.py Outdated Show resolved Hide resolved
if value is not None:
matches[key] = conversion(value) if conversion is not None else value
elif keep_none:
matches[key] = match_dict[key] or value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are match_dict[key] and value not equivalent?

Copy link
Collaborator Author

@MaxDall MaxDall Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, that's a typo and should be match

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants