Skip to content

Issue 155: log metadata prior to verification #160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed
- Simplify type annotations by replacing Union and Optional with pipe character ("|") for improved readability and clarity
- Add debug logging of metadata fields

## [3.3.2] - 2024-03-25

Expand Down
39 changes: 39 additions & 0 deletions src/zimscraperlib/zim/creator.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,14 +20,18 @@
from __future__ import annotations

import datetime
import io
import logging
import pathlib
import re
import weakref
from collections.abc import Callable, Iterable
from typing import Any

import libzim.writer # pyright: ignore
import PIL.Image

from zimscraperlib import logger
from zimscraperlib.constants import (
DEFAULT_DEV_ZIM_METADATA,
FRONT_ARTICLE_MIMETYPES,
Expand Down Expand Up @@ -62,6 +66,9 @@
)


TUPLE_SIZE_2D = 2


def mimetype_for(
path: str,
content: bytes | str | None = None,
Expand Down Expand Up @@ -146,7 +153,39 @@ def config_indexing(
self.__indexing_configured = True
return self

def _is_illustration_metadata_name(self, name: str) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are using it once and it's a very simple check ; why the extra function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving it a name is a way of making the code a little more self-documenting. Really the prefix should be put somewhere shared and reused as well, but I didn't want to edit all the other places the constant is used in this change.

"""Return True if name is a valid illustration metadata name"""
return name.startswith("Illustration_")

def _get_illustration_metadata_details(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is not very appealing. You want to know the image format and its dimensions. Why not returning this instead of building a string in some util function?
As per error handling, you have the choice to handle it when calling or inside the function but anyway, better safe than sorry and handle every exception. Currently if PIL cannot read (and it's not UnidentifiedImageError or if size could not be set to a tuple, it will crash

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, now it won't crash quite as easily. Agreed that this could return a richer data-structure. Doing so would complicate the caller. Avoiding the complication in the caller by pushing the logging into this function smears the knowledge of the logging format in use around. It also prevents converting this whole thing from N logging calls to 1 call contains N items; the latter may be preferable if functions that do other logging end up being called, causing the metadata logging to be spread out rather than in a block.

self,
value: bytes,
) -> str | None:
"""Return image format for debug logging of illustration metadata"""
try:
with PIL.Image.open(io.BytesIO(value)) as img:
if (
img is not None
and img.size is tuple
and len(img.size) >= TUPLE_SIZE_2D
):
return f"{img.format} {img.size[0]}x{img.size[1]}"
except BaseException as e:
return f"Image format issue: {e}"
return f"Unknown image format, {len(value)} bytes"

def _log_metadata(self):
for name, value in sorted(self._metadata.items()):
if self._is_illustration_metadata_name(name) and isinstance(value, bytes):
illus_md = self._get_illustration_metadata_details(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not logging directly? Then you have all the flexibility you want

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logging illus_md inside of get_illustration_metadata_details? I could, but it would spread out knowledge of the metadata logging format. It's preferable to keep that format info in one function.

else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will log everything that's not bytes. What if it cannot be represented as string?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would indeed be unfortunate. The preferred ways to set up metadata, though, go through typed handlers that ensure it will be bytes. Worst case, IIUC, is that it gets logged as is. That might be annoying if it's a 47K character string or somesuch, but it's not terrible.

illus_md = None
logger.debug(f"Metadata: {name} = {(illus_md if illus_md else value)}")

def start(self):
if logger.isEnabledFor(logging.DEBUG):
self._log_metadata()

if not all(self._metadata.get(key) for key in MANDATORY_ZIM_METADATA_KEYS):
raise ValueError("Mandatory metadata are not all set.")

Expand Down
76 changes: 76 additions & 0 deletions tests/zim/test_zim_creator.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,13 +4,15 @@
import base64
import datetime
import io
import logging
import pathlib
import random
import shutil
import subprocess
import sys
import tempfile
import time
from unittest.mock import call, patch

import pytest
from libzim.writer import Compression # pyright: ignore
Expand Down Expand Up @@ -540,6 +542,80 @@ def test_check_metadata(tmp_path):
Creator(tmp_path, "").config_dev_metadata(LongDescription="T" * 5000).start()


@pytest.mark.parametrize(
"tags",
[
(
"wikipedia;_category:wikipedia;_pictures:no;_videos:no;_details:yes;"
"_ftindex:yes"
),
(
[
"wikipedia",
"_category:wikipedia",
"_pictures:no",
"_videos:no",
"_details:yes",
"_ftindex:yes",
]
),
],
)
@patch("zimscraperlib.zim.creator.logger", autospec=True)
def test_start_logs_metadata_log_contents(mocked_logger, png_image, tags, tmp_path):
mocked_logger.isEnabledFor.side_effect = lambda level: level == logging.DEBUG
fpath = tmp_path / "test_config.zim"
with open(png_image, "rb") as fh:
png_data = fh.read()
creator = Creator(fpath, "").config_metadata(
Name="wikipedia_fr_football",
Title="English Wikipedia",
Creator="English speaking Wikipedia contributors",
Publisher="Wikipedia user Foobar",
Date="2009-11-21",
Description="All articles (without images) from the english Wikipedia",
LongDescription="This ZIM file contains all articles (without images)"
" from the english Wikipedia by 2009-11-10. The topics are...",
Language="eng",
License="CC-BY",
Tags=tags,
Flavour="nopic",
Source="https://en.wikipedia.org/",
Scraper="mwoffliner 1.2.3",
Illustration_48x48_at_1=png_data,
TestMetadata="Test Metadata",
)
creator.start()
creator.finish()
mocked_logger.debug.assert_has_calls(
[
call("Metadata: Creator = English speaking Wikipedia contributors"),
call("Metadata: Date = 2009-11-21"),
call(
"Metadata: Description = All articles (without images) from the "
"english Wikipedia"
),
call("Metadata: Flavour = nopic"),
call("Metadata: Illustration_48x48@1 = PNG 48x48"),
call("Metadata: Language = eng"),
call("Metadata: License = CC-BY"),
call(
"Metadata: LongDescription = This ZIM file contains all articles "
"(without images) from the english Wikipedia by 2009-11-10. "
"The topics are..."
),
call("Metadata: Name = wikipedia_fr_football"),
call("Metadata: Publisher = Wikipedia user Foobar"),
call("Metadata: Relation = None"),
call("Metadata: Scraper = mwoffliner 1.2.3"),
call("Metadata: Source = https://en.wikipedia.org/"),
call(f"Metadata: Tags = {tags}"),
call("Metadata: TestMetadata = Test Metadata"),
call("Metadata: Title = English Wikipedia"),
]
)


def test_relax_metadata(tmp_path):
Creator(tmp_path, "", disable_metadata_checks=True).config_dev_metadata(
Description="T" * 90
Expand Down