Parser that allows loading dumped model cards

As discussed here: skops-dev#72 (comment)

Description

This feature adds a new function, skops.card.parse_modelcard. When passed the path to a dumped model card, it parses the card using pandoc and returns a Card object, which the user can modify further.

In the end, this turned out easier than I initially thought it would. The main difficulty is the data structures returned by the pandoc parser, for which I couldn't find any documentation. I guess Haskell code is just self-documenting. For this reason, there are probably quite a few edge cases that I haven't covered yet.

Just as an example, when parsing tables, pandoc tells us how the columns are aligned. This information is currently discarded completely (we let tabulate choose the alignment). If we want to preserve the table alignment, we would need to make some changes.

Implementation

This feature requires the alternative card implementation from skops-dev#203.

pandoc is used for the following reasons:

- widely used and thus battle tested
- can read many other formats, not just markdown, so in theory, we should be able to read, e.g., rst model cards without modifying any code

The disadvantage is that pandoc is not a Python package, so users need to install it separately. But it is available on all common platforms.

For calling pandoc, I chose to shell out using subprocess. I think this should be fine but LMK if there is a better way. There is a Python package that binds pandoc (https://github.com/boisgera/pandoc) but I don't think it's worth it for us to add it, just to avoid shelling out. The package seems to have low adoption and contains a bunch of stuff we don't need.

I chose to implement this such that the parser that generates the Card object does not have to know anything about Markdown. Everything related to Markdown is moved to a separate class in _markup.py.

In an ideal world, we would not have to know anything about markdown either. Instead the Card object should have methods (similar to what we already have for add_plot etc.) that handle all of that. But in practice, this is far from being true. E.g. if a user wants to add bold text, there is no special method for it, so they would need to add raw Markdown. The Card class is thus a leaky abstraction.

TODOs

This PR is not finished. Remaining TODOs that come to mind:

1. We need to merge the alternative card implementation
2. Documentation has to be updated in several places
3. Tests need to be more complex, right now only one Card is tested
4. CI needs to install pandoc so that the tests are actually run
5. There are some specifics here that won't work with all Python versions, like the use of TypedDict.
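
To make the description of pandoc's output a bit more concrete, here is a minimal sketch of the JSON structure the parser has to deal with. The `SAMPLE` document is hand-written to mirror what `pandoc -t json` emits for a heading followed by a short paragraph. The helper names `pandoc_json` and `block_types` are hypothetical and not part of this PR, and `check=True` is an assumption of the sketch (the PR itself does not pass it).

```python
import json
import shutil
import subprocess

# Hand-written sample mirroring the shape of `pandoc -t json` output for
# "# Title\n\nHello **world**". Every node is a dict with a type tag "t"
# and, for most types, a content field "c".
SAMPLE = json.dumps(
    {
        "pandoc-api-version": [1, 22],
        "meta": {},
        "blocks": [
            {
                "t": "Header",
                # (level, (id, classes, keyvals), inline content)
                "c": [1, ["title", [], []], [{"t": "Str", "c": "Title"}]],
            },
            {
                "t": "Para",
                "c": [
                    {"t": "Str", "c": "Hello"},
                    {"t": "Space"},  # note: no "c" field at all
                    {"t": "Strong", "c": [{"t": "Str", "c": "world"}]},
                ],
            },
        ],
    }
)


def pandoc_json(path):
    """Hypothetical helper: shell out to pandoc and return its JSON output."""
    if shutil.which("pandoc") is None:
        raise FileNotFoundError("pandoc binary not found on PATH")
    proc = subprocess.run(
        ["pandoc", "-t", "json", str(path)],
        capture_output=True,
        check=True,  # assumption of this sketch: fail loudly on pandoc errors
    )
    return proc.stdout.decode("utf-8")


def block_types(source):
    """Return the type tag of every top-level block in the document."""
    return [block["t"] for block in json.loads(source)["blocks"]]


print(block_types(SAMPLE))
```

With pandoc installed, `block_types(pandoc_json("README.md"))` should give the same kind of overview for a real model card.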
BenjaminBossan · Nov 30, 2022 · 50b0d39
1 parent 7cfddf9
commit 50b0d39

Showing 4 changed files with 417 additions and 1 deletion.
diff --git a/skops/card/__init__.py b/skops/card/__init__.py
@@ -1,3 +1,4 @@
 from ._model_card import Card, metadata_from_config
+from ._parser import parse_modelcard
 
-__all__ = ["Card", "metadata_from_config"]
+__all__ = ["Card", "metadata_from_config", "parse_modelcard"]
diff --git a/skops/card/_markup.py b/skops/card/_markup.py
@@ -0,0 +1,180 @@
+"""Classes for translating into the syntax of different markup languages"""
+
+from collections.abc import Mapping
+from typing import Any, Sequence, TypedDict
+
+from skops.card._model_card import TableSection
+
+
+class PandocItem(TypedDict):
+    t: str
+    c: dict
+
+
+class Markdown:
+    """Mapping of pandoc parsed document to Markdown
+
+    This class has a ``mapping`` attribute, which is just a dict. The keys are
+    Pandoc types and the values are functions that transform the corresponding
+    value into a string with markdown syntax. Those functions are all prefixed
+    with ``md_``, e.g. ``md_Image`` for transforming a pandoc ``Image`` into a
+    markdown figure.
+
+    From the caller side, only the ``__call__`` method should be used; the rest
+    should be considered internal.
+
+    """
+
+    def __init__(self):
+        # markdown syntax dispatch table
+        self.mapping = {
+            "Space": self.md_space,
+            "Strong": self.md_strong,
+            "Plain": self.md_plain,
+            "Str": self.md_str,
+            "RawInline": self.md_rawline,
+            "RawBlock": self.md_raw_block,
+            "SoftBreak": self.md_softbreak,
+            "Para": self.md_para,
+            "Header": self.md_header,
+            "Image": self.md_image,
+            "CodeBlock": self.md_code_block,
+            "Table": self.md_table,
+            "Div": self.md_parse_div,
+        }
+
+    @staticmethod
+    def md_space(value) -> str:
+        return " "
+
+    def md_strong(self, value) -> str:
+        parts = ["**"]
+        parts += [self.__call__(subitem) for subitem in value]
+        parts.append("**")
+        return "".join(parts)
+
+    def md_plain(self, value) -> str:
+        parts = [self.__call__(subitem) for subitem in value]
+        return "".join(parts)
+
+    @staticmethod
+    def md_str(value) -> str:
+        return value
+
+    @staticmethod
+    def md_rawline(value) -> str:
+        _, line = value
+        return line
+
+    def md_raw_block(self, item) -> str:
+        # throw away the first item, which is just something like 'html'
+        # might have to revisit this if output != markdown
+        _, line = item
+        return line
+
+    @staticmethod
+    def md_softbreak(value) -> str:
+        return "\n"
+
+    def _make_content(self, content):
+        parts = []
+        for item in content:
+            part = "".join(self.__call__(item))
+            parts.append(part)
+        return "".join(parts)
+
+    def md_para(self, value: list[dict[str, str]]) -> str:
+        content = self._make_content(value)
+        return content
+
+    def md_header(self, value: tuple[int, Any, list[dict[str, str]]]) -> str:
+        level, _, content_parts = value
+        section_name = self._make_content(content_parts)
+        return section_name
+
+    def md_image(self, value) -> str:
+        (ident, _, keyvals), caption, (dest, typef) = value
+        # it seems like ident and keyvals are not relevant for markdown
+        assert caption
+        assert typef == "fig:"
+
+        caption = "".join([self.__call__(i) for i in caption])
+        content = f"![{caption}]({dest})"
+        return content
+
+    @staticmethod
+    def md_code_block(item: tuple[tuple[str, list[str], list[Any]], str]) -> str:
+        # a codeblock consists of: (id, classes, keyvals), contents
+        (_, classes, _), content = item
+        block_start = "```"
+        if classes:  # the first class is the language, e.g. "```python"
+            block_start += classes[0]
+        block_end = "```"
+        content = "\n".join((block_start, content, block_end))
+        return content
+
+    def md_table(self, item) -> str:
+        _, alignments, _, header, rows = item
+        fn = self.__call__
+        columns = ["".join(fn(part) for part in col) for col in header]
+        if not columns:
+            raise ValueError("Table with no columns...")
+
+        data = []  # row oriented
+        for row in rows:
+            data.append(["".join(fn(part) for part in col) for col in row])
+
+        table: Mapping[str, Sequence[Any]]
+        if not data:
+            table = {key: [] for key in columns}
+        else:
+            data_transposed = zip(*data)  # column oriented
+            table = {key: val for key, val in zip(columns, data_transposed)}
+
+        res = TableSection(table).format()
+        return res
+
+    def md_parse_div(self, item) -> str:
+        # note that in markdown, we basically just use the raw html
+        (ident, classes, kvs), contents = item
+
+        # build div tag
+        tags = ["<div"]
+        if ident:
+            tags.append(f' id="{ident}"')
+        if classes:
+            classes = " ".join(classes)
+            tags.append(f' class="{classes}"')
+        if kvs:
+            kvparts = []
+            for k, v in kvs:
+                if not v:  # e.g. just ['hidden', '']
+                    kvparts.append(k)
+                else:
+                    kvparts.append(f'{k}="{v}"')
+            tags.append(f' {" ".join(kvparts)}')
+        tags.append(">")
+
+        start = "".join(tags)
+        middle = []
+        for content in contents:
+            middle.append(self.__call__(content))
+        end = "</div>"
+        return "".join([start] + middle + [end])
+
+    def __call__(self, item: str | PandocItem) -> str:
+        if isinstance(item, str):
+            return item
+
+        type_, value = item["t"], item.get("c")
+        try:
+            res = self.mapping[type_](value)
+        except KeyError as exc:
+            msg = (
+                f"The parsed document contains '{type_}', which is not "
+                "supported yet, please open an issue on GitHub"
+            )
+            raise ValueError(msg) from exc
+
+        # recursively call until the value has been resolved into a str
+        return self.__call__(res)
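
The dispatch-table idea used by the `Markdown` class in `_markup.py` can be illustrated with a much smaller, standalone sketch. It is not the implementation from this commit: `render` and `make_handlers` are hypothetical names, only a handful of inline types are covered, and the handler lookup is deliberately kept separate from the handler call so that a `KeyError` raised inside a handler is not reported as an unsupported type.

```python
def render(item, handlers):
    """Render a pandoc JSON node (or an already rendered str) to Markdown."""
    if isinstance(item, str):
        return item

    type_ = item["t"]
    handler = handlers.get(type_)
    if handler is None:
        raise ValueError(f"unsupported pandoc type: {type_!r}")
    # some nodes, e.g. Space, carry no "c" field, hence .get
    return handler(item.get("c"))


def make_handlers():
    """Build a small dispatch table: pandoc type name -> render function."""
    handlers = {}

    def inlines(value):
        # render each child node and concatenate the results
        return "".join(render(sub, handlers) for sub in value)

    handlers.update(
        {
            "Str": lambda value: value,
            "Space": lambda value: " ",
            "SoftBreak": lambda value: "\n",
            "Strong": lambda value: "**" + inlines(value) + "**",
            "Para": inlines,
        }
    )
    return handlers


HANDLERS = make_handlers()
para = {
    "t": "Para",
    "c": [
        {"t": "Str", "c": "Hello"},
        {"t": "Space"},
        {"t": "Strong", "c": [{"t": "Str", "c": "world"}]},
    ],
}
print(render(para, HANDLERS))  # Hello **world**
```

Supporting another markup language would then, in principle, only require a second table with the same keys.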
diff --git a/skops/card/_parser.py b/skops/card/_parser.py
@@ -0,0 +1,160 @@
+"""Contains the PandocParser
+
+This class needs to know about the pandoc parse tree but should not have
+knowledge of any particular markup syntax; everything related to markup should
+be known by the mapping attribute.
+
+"""
+
+import json
+import subprocess
+from pathlib import Path
+
+from skops.card import Card
+from skops.card._model_card import Section
+
+from ._markup import Markdown, PandocItem
+
+
+class PandocParser:
+    """TODO"""
+
+    def __init__(self, source, mapping="markdown") -> None:
+        self.source = source
+        if mapping == "markdown":
+            self.mapping = Markdown()
+        else:
+            raise ValueError(f"Markup of type {mapping} is not supported (yet)")
+
+        self.card = Card(None, template=None)
+        self._section_trace: list[str] = []
+        self._cur_section: Section | None = None
+
+    def get_cur_level(self) -> int:
+        # level 0 can be interpreted implicitly as the root level
+        return len(self._section_trace)
+
+    def get_cur_section(self):
+        # including supersections
+        return "/".join(self._section_trace)
+
+    def add_section(self, section_name: str) -> None:
+        self._cur_section = self.card._add_single(self.get_cur_section(), "")
+
+    def add_content(self, content: str) -> None:
+        section = self._cur_section
+        if section is None:
+            raise ValueError(
+                "Ooops, no current section, please open an issue on GitHub"
+            )
+
+        if not section.content:
+            section.content = content
+        elif isinstance(section.content, str):
+            section.content = section.content + "\n\n" + content
+        else:
+            # A Formattable, no generic way to modify it -- should we add an
+            # update method?
+            raise ValueError(f"Could not modify content of {section.content}")
+
+    def parse_header(self, item: PandocItem) -> str:
+        # Headers are the only type of item that needs to be handled
+        # differently. This is because we structure the underlying model card
+        # data as a tree with nodes corresponding to headers. To assign the
+        # right parent or child node, we need to keep track of the level of the
+        # headers. This cannot be done solely by the markdown mapping, since it
+        # is not aware of the tree structure.
+        level, _, _ = item["c"]
+        content = self.mapping(item)
+        self._section_trace = self._section_trace[: level - 1] + [content]
+        return content
+
+    def generate(self) -> Card:
+        # Parsing the flat structure, not recursively as in pandocfilters.
+        # After visiting the parent node, it's not necessary to visit its
+        # child nodes, because that's already done during parsing.
+        for item in json.loads(self.source)["blocks"]:
+            if item["t"] == "Header":
+                res = self.parse_header(item)
+                self.add_section(res)
+            else:
+                res = self.mapping(item)
+                self.add_content(res)
+
+        return self.card
+
+
+def check_pandoc_installed() -> None:
+    """Check if pandoc is installed on the system
+
+    Raises
+    ------
+    FileNotFoundError
+        When the binary is not found, raise this error.
+
+    """
+    try:
+        subprocess.run(
+            ["pandoc", "--version"],
+            capture_output=True,
+        )
+    except FileNotFoundError as exc:
+        msg = (
+            "This feature requires the pandoc library to be installed on your system, "
+            "please follow these install instructions: "
+            "https://pandoc.org/installing.html"
+        )
+        raise FileNotFoundError(msg) from exc
+
+
+def parse_modelcard(path: str | Path) -> Card:
+    """Read a model card and return a Card object
+
+    This allows users to load a dumped model card and continue to edit it.
+
+    Using this function requires ``pandoc`` to be installed. Please follow these
+    instructions:
+
+    https://pandoc.org/installing.html
+
+    Examples
+    --------
+    >>> import numpy as np
+    >>> from sklearn.linear_model import LinearRegression
+    >>> from skops.card import Card
+    >>> from skops.card import parse_modelcard
+    >>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
+    >>> y = np.dot(X, np.array([1, 2])) + 3
+    >>> regr = LinearRegression().fit(X, y)
+    >>> card = Card(regr)
+    >>> card.save("README.md")
+    >>> # later, load the card again
+    >>> parsed_card = parse_modelcard("README.md")
+    >>> # continue editing the card
+    >>> parsed_card.add(**{"My new section": "My new content"})
+    >>> # overwrite old card with new one
+    >>> parsed_card.save("README.md")
+
+    Parameters
+    ----------
+    path : str or pathlib.Path
+        The path to the existing model card.
+
+    Returns
+    -------
+    card : skops.card.Card
+        The model card object.
+
+    """
+    check_pandoc_installed()
+
+    proc = subprocess.run(
+        ["pandoc", "-t", "json", "-s", str(path)],
+        capture_output=True,
+    )
+    source = proc.stdout.decode("utf-8")
+
+    parser = PandocParser(source)
+    card = parser.generate()
+
+    return card
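
The header bookkeeping in `PandocParser.parse_header` is the least obvious part of the parser, so here is a standalone sketch of just that step. It reuses the slicing expression from the diff, but `update_trace` and the sample headers are made up for illustration.

```python
def update_trace(trace, level, title):
    """Return the new section trace after seeing a header of the given level.

    Everything at the same or a deeper level is dropped, then the new title is
    appended, so the trace always spells out the path from the root section.
    """
    return trace[: level - 1] + [title]


# (level, title) pairs in document order, as they would come from Header nodes
headers = [
    (1, "Model description"),
    (2, "Intended uses"),
    (2, "Training"),
    (3, "Hyperparameters"),
    (1, "Citation"),
]

trace = []
paths = []
for level, title in headers:
    trace = update_trace(trace, level, title)
    paths.append("/".join(trace))  # same join as get_cur_section

for path in paths:
    print(path)
# Model description
# Model description/Intended uses
# Model description/Training
# Model description/Training/Hyperparameters
# Citation
```

One consequence of the slicing: a header that skips a level (say, a level 3 header directly below a level 1 header) ends up nested only one step deeper than its predecessor, since the slice cannot be longer than the existing trace.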