Parser that allows loading dumped model cards
As discussed here:

skops-dev#72 (comment)

Description

This feature adds a new function, skops.card.parse_modelcard. When
passing it the path to a dumped model card, it parses it using pandoc
and returns a Card object, which can be further modified by the user.

In the end, this turned out easier than I initially thought it would.
The main difficulty is the data structures returned by the pandoc
parser, for which I couldn't find any documentation. I guess Haskell
code is just self-documenting.

For this reason, there are probably quite a few edge cases that I
haven't covered yet. Just as an example, when parsing tables, pandoc
tells us how the columns are aligned. This information is currently
completely discarded (we let tabulate choose the alignment). If we want
to preserve the table alignment, we would need to make some changes.
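
As a rough illustration of what preserving the alignment could involve, here is a minimal sketch. It assumes that pandoc encodes each column alignment as an object whose `t` field is one of `AlignLeft`, `AlignRight`, `AlignCenter` or `AlignDefault`. The helper name `pandoc_to_colalign` is hypothetical and not part of this commit, and how the result would be handed to the table formatter is left open.

```python
from typing import Dict, List, Optional, Sequence

# Assumed mapping from pandoc alignment tags to plain alignment names.
# AlignDefault maps to None, meaning "no preference".
_ALIGNMENTS: Dict[str, Optional[str]] = {
    "AlignLeft": "left",
    "AlignRight": "right",
    "AlignCenter": "center",
    "AlignDefault": None,
}


def pandoc_to_colalign(alignments: Sequence[dict]) -> List[Optional[str]]:
    """Translate pandoc column alignments into plain alignment names."""
    # Unknown or missing tags fall back to None instead of raising.
    return [_ALIGNMENTS.get(item.get("t")) for item in alignments]


example = [{"t": "AlignLeft"}, {"t": "AlignDefault"}, {"t": "AlignRight"}]
print(pandoc_to_colalign(example))
```

The table handler currently unpacks the alignments and ignores them, so a helper along these lines would be the place where that information starts to be used.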

Implementation

This feature requires the alternative card implementation from skops-dev#203.

pandoc is used for the following reasons:

- widely used and thus battle tested
- can read many other formats, not just markdown, so in theory, we
  should be able to read, e.g., rst model cards without modifying any
  code

The disadvantage is that pandoc is not a Python package, so users need
to install it separately. But it is available on all common platforms.

For calling pandoc, I chose to shell out using subprocess. I think this
should be fine but LMK if there is a better way.
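
For reference, shelling out to pandoc can be sketched as follows. The `-t json -s` flags are the ones used in this commit; the helper names `build_pandoc_command` and `pandoc_to_ast`, the `shutil.which` lookup and `check=True` are assumptions added for illustration, not what the commit does.

```python
import json
import shutil
import subprocess
from pathlib import Path
from typing import List, Union


def build_pandoc_command(path: Union[str, Path]) -> List[str]:
    """Return the argument list for converting ``path`` to pandoc's JSON AST."""
    # Passing a list (not a single string) means no shell is involved.
    return ["pandoc", "-t", "json", "-s", str(path)]


def pandoc_to_ast(path: Union[str, Path]) -> dict:
    """Run pandoc on ``path`` and return the parsed JSON document."""
    if shutil.which("pandoc") is None:
        raise FileNotFoundError("pandoc executable not found on PATH")
    proc = subprocess.run(
        build_pandoc_command(path),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout.decode("utf-8"))
```

Calling `pandoc_to_ast("README.md")` would return a dict whose `"blocks"` entry holds the top-level items of the document, provided pandoc is installed.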

There is a Python package that binds
pandoc (https://github.com/boisgera/pandoc) but I don't think it's worth
it for us to add it, just to avoid shelling out. The package seems to
have low adoption and contains a bunch of stuff we don't need.

I chose to implement this such that the parser that generates the Card
object should not have to know anything about Markdown. Everything
related to Markdown is moved to a separate class in _markup.py.
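
The separation can be sketched with a stripped-down dispatch table. `TinyMarkdown` is a hypothetical stand-in that handles only four pandoc types; it mirrors the idea of the class in _markup.py but is not the code from this commit.

```python
from typing import Callable, Dict, Union

Item = Union[str, dict]


class TinyMarkdown:
    """Hypothetical, reduced version of a pandoc-to-Markdown mapping."""

    def __init__(self) -> None:
        # dispatch table: pandoc type name -> function producing Markdown
        self.mapping: Dict[str, Callable[[object], str]] = {
            "Str": self.md_str,
            "Space": self.md_space,
            "Strong": self.md_strong,
            "Para": self.md_para,
        }

    def md_str(self, value) -> str:
        return value

    def md_space(self, value) -> str:
        return " "

    def md_strong(self, value) -> str:
        return "**" + "".join(self(sub) for sub in value) + "**"

    def md_para(self, value) -> str:
        return "".join(self(sub) for sub in value)

    def __call__(self, item: Item) -> str:
        if isinstance(item, str):
            return item
        try:
            func = self.mapping[item["t"]]
        except KeyError as exc:
            raise ValueError(f"Unsupported pandoc type: {item['t']}") from exc
        # some items, such as Space, carry no "c" entry at all
        return func(item.get("c"))


para = {
    "t": "Para",
    "c": [
        {"t": "Str", "c": "Hello"},
        {"t": "Space"},
        {"t": "Strong", "c": [{"t": "Str", "c": "world"}]},
    ],
}
print(TinyMarkdown()(para))  # prints: Hello **world**
```

A caller only ever invokes the object itself on an item, so supporting another markup language would mean supplying a different mapping class rather than touching the parser.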

In an ideal world, we would not have to know anything about markdown
either. Instead, the Card object should have methods (similar to what we
already have for add_plot etc.) that handle all of that. But in
practice, this is far from being true. E.g. if a user wants to add bold
text, there is no special method for it, so they would need to add raw
Markdown. The Card class is thus a leaky abstraction.

TODOs

This PR is not finished. Remaining TODOs that come to mind:

1. We need to merge the alternative card implementation
2. Documentation has to be updated in several places
3. Tests need to be more complex, right now only one Card is tested
4. CI needs to install pandoc so that the tests are actually run
5. There are some specifics here that won't work with all Python
   versions, like the use of TypedDict.
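
One common way to handle the TypedDict point is a version-dependent import, sketched below. `TypedDict` is part of `typing` from Python 3.8 onwards; falling back to the third-party `typing_extensions` package on older versions is an assumption here and would add a dependency. The `c: Any` annotation is also a choice made for this sketch.

```python
import sys
from typing import Any

if sys.version_info >= (3, 8):
    from typing import TypedDict
else:
    # assumption: the typing_extensions backport is installed
    from typing_extensions import TypedDict


class PandocItem(TypedDict):
    t: str  # the pandoc type name, e.g. "Str" or "Header"
    c: Any  # the content; its shape depends on the type


item: PandocItem = {"t": "Str", "c": "hello"}
print(item["t"])
```

At runtime a TypedDict instance is a plain dict, so the annotation only matters to type checkers.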
BenjaminBossan committed Nov 30, 2022
1 parent 7cfddf9 commit 50b0d39
Showing 4 changed files with 417 additions and 1 deletion.
3 changes: 2 additions & 1 deletion skops/card/__init__.py
@@ -1,3 +1,4 @@
 from ._model_card import Card, metadata_from_config
+from ._parser import parse_modelcard
 
-__all__ = ["Card", "metadata_from_config"]
+__all__ = ["Card", "metadata_from_config", "parse_modelcard"]
180 changes: 180 additions & 0 deletions skops/card/_markup.py
@@ -0,0 +1,180 @@
"""Classes for translating into the syntax of different markup languages"""

from collections.abc import Mapping
from typing import Any, Sequence, TypedDict

from skops.card._model_card import TableSection


class PandocItem(TypedDict):
    t: str
    c: dict


class Markdown:
    """Mapping of pandoc parsed document to Markdown

    This class has a ``mapping`` attribute, which is just a dict. The keys are
    Pandoc types and the values are functions that transform the corresponding
    value into a string with markdown syntax. Those functions are all prefixed
    with ``md_``, e.g. ``md_Image`` for transforming a pandoc ``Image`` into a
    markdown figure.

    From the caller side, only the ``__call__`` method should be used, the rest
    should be considered internals.

    """

    def __init__(self):
        # markdown syntax dispatch table
        self.mapping = {
            "Space": self.md_space,
            "Strong": self.md_strong,
            "Plain": self.md_plain,
            "Str": self.md_str,
            "RawInline": self.md_rawline,
            "RawBlock": self.md_raw_block,
            "SoftBreak": self.md_softbreak,
            "Para": self.md_para,
            "Header": self.md_header,
            "Image": self.md_image,
            "CodeBlock": self.md_code_block,
            "Table": self.md_table,
            "Div": self.md_parse_div,
        }

    @staticmethod
    def md_space(value) -> str:
        return " "

    def md_strong(self, value) -> str:
        parts = ["**"]
        parts += [self.__call__(subitem) for subitem in value]
        parts.append("**")
        return "".join(parts)

    def md_plain(self, value) -> str:
        parts = [self.__call__(subitem) for subitem in value]
        return "".join(parts)

    @staticmethod
    def md_str(value) -> str:
        return value

    @staticmethod
    def md_rawline(value) -> str:
        _, line = value
        return line

    def md_raw_block(self, item) -> str:
        # throw away the first item, which is just something like 'html'
        # might have to revisit this if output != markdown
        _, line = item
        return line

    @staticmethod
    def md_softbreak(value) -> str:
        return "\n"

    def _make_content(self, content):
        parts = []
        for item in content:
            part = "".join(self.__call__(item))
            parts.append(part)
        return "".join(parts)

    def md_para(self, value: list[dict[str, str]]) -> str:
        content = self._make_content(value)
        return content

    def md_header(self, value: tuple[int, Any, list[dict[str, str]]]) -> str:
        level, _, content_parts = value
        section_name = self._make_content(content_parts)
        return section_name

    def md_image(self, value) -> str:
        (ident, _, keyvals), caption, (dest, typef) = value
        # it seems like ident and keyvals are not relevant for markdown
        assert caption
        assert typef == "fig:"

        caption = "".join([self.__call__(i) for i in caption])
        content = f"![{caption}]({dest})"
        return content

    @staticmethod
    def md_code_block(
        item: tuple[tuple[str, list[str], list[tuple[str, str]]], str]
    ) -> str:
        # a codeblock consists of: (id, classes, namevals) contents
        (_, classes, _), content = item
        block_start = "```"
        if classes:  # the language, e.g. "python", is stored in the classes
            block_start += classes[0]
        block_end = "```"
        content = "\n".join((block_start, content, block_end))
        return content

    def md_table(self, item) -> str:
        _, alignments, _, header, rows = item
        fn = self.__call__
        columns = ["".join(fn(part) for part in col) for col in header]
        if not columns:
            raise ValueError("Table with no columns...")

        data = []  # row oriented
        for row in rows:
            data.append(["".join(fn(part) for part in col) for col in row])

        table: Mapping[str, Sequence[Any]]
        if not data:
            table = {key: [] for key in columns}
        else:
            data_transposed = zip(*data)  # column oriented
            table = {key: val for key, val in zip(columns, data_transposed)}

        res = TableSection(table).format()
        return res

    def md_parse_div(self, item) -> str:
        # note that in markdown, we basically just use the raw html
        (ident, classes, kvs), contents = item

        # build div tag
        tags = ["<div"]
        if ident:
            tags.append(f' id="{ident}"')
        if classes:
            classes = " ".join(classes)
            tags.append(f' class="{classes}"')
        if kvs:
            kvparts = []
            for k, v in kvs:
                if not v:  # e.g. just ['hidden', '']
                    kvparts.append(k)
                else:
                    kvparts.append(f'{k}="{v}"')
            tags.append(f' {" ".join(kvparts)}')
        tags.append(">")

        start = "".join(tags)
        middle = []
        for content in contents:
            middle.append(self.__call__(content))
        end = "</div>"
        return "".join([start] + middle + [end])

    def __call__(self, item: str | PandocItem) -> str:
        if isinstance(item, str):
            return item

        type_, value = item["t"], item.get("c")
        try:
            res = self.mapping[type_](value)
        except KeyError as exc:
            msg = (
                f"The parsed document contains '{type_}', which is not "
                "supported yet, please open an issue on GitHub"
            )
            raise ValueError(msg) from exc

        # recursively call until the value has been resolved into a str
        return self.__call__(res)
160 changes: 160 additions & 0 deletions skops/card/_parser.py
@@ -0,0 +1,160 @@
"""Contains the PandocParser
This class needs to know about the pandoc parse tree but should not have
knowledge of any particular markup syntax; everything related to markup should
be known by the mapping attribute.
"""

import json
import subprocess
from pathlib import Path

from skops.card import Card
from skops.card._model_card import Section

from ._markup import Markdown, PandocItem


class PandocParser:
    """TODO"""

    def __init__(self, source, mapping="markdown") -> None:
        self.source = source
        if mapping == "markdown":
            self.mapping = Markdown()
        else:
            raise ValueError(f"Markup of type {mapping} is not supported (yet)")

        self.card = Card(None, template=None)
        self._section_trace: list[str] = []
        self._cur_section: Section | None = None

    def get_cur_level(self) -> int:
        # level 0 can be interpreted implicitly as the root level
        return len(self._section_trace)

    def get_cur_section(self):
        # including supersections
        return "/".join(self._section_trace)

    def add_section(self, section_name: str) -> None:
        self._cur_section = self.card._add_single(self.get_cur_section(), "")

    def add_content(self, content: str) -> None:
        section = self._cur_section
        if section is None:
            raise ValueError(
                "Ooops, no current section, please open an issue on GitHub"
            )

        if not section.content:
            section.content = content
        elif isinstance(section.content, str):
            section.content = section.content + "\n\n" + content
        else:
            # A Formattable, no generic way to modify it -- should we add an
            # update method?
            raise ValueError(f"Could not modify content of {section.content}")

    def parse_header(self, item: PandocItem) -> str:
        # Headers are the only type of item that needs to be handled
        # differently. This is because we structure the underlying model card
        # data as a tree with nodes corresponding to headers. To assign the
        # right parent or child node, we need to keep track of the level of the
        # headers. This cannot be done solely by the markdown mapping, since it
        # is not aware of the tree structure.
        level, _, _ = item["c"]
        content = self.mapping(item)
        self._section_trace = self._section_trace[: level - 1] + [content]
        return content

    def generate(self) -> Card:
        # Parsing the flat structure, not recursively as in pandocfilters.
        # After visiting the parent node, it's not necessary to visit its
        # child nodes, because that's already done during parsing.
        for item in json.loads(self.source)["blocks"]:
            if item["t"] == "Header":
                res = self.parse_header(item)
                self.add_section(res)
            else:
                res = self.mapping(item)
                self.add_content(res)

        return self.card


def check_pandoc_installed() -> None:
    """Check if pandoc is installed on the system

    Raises
    ------
    FileNotFoundError
        When the binary is not found, raise this error.

    """
    try:
        subprocess.run(
            ["pandoc", "--version"],
            capture_output=True,
        )
    except FileNotFoundError as exc:
        msg = (
            "This feature requires the pandoc library to be installed on your system, "
            "please follow these install instructions: "
            "https://pandoc.org/installing.html"
        )
        raise FileNotFoundError(msg) from exc


def parse_modelcard(path: str | Path) -> Card:
    """Read a model card and return a Card object

    This allows users to load a dumped model card and continue to edit it.

    Using this function requires ``pandoc`` to be installed. Please follow these
    instructions:

    https://pandoc.org/installing.html

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.linear_model import LinearRegression
    >>> from skops.card import Card
    >>> from skops.card import parse_modelcard
    >>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    >>> y = np.dot(X, np.array([1, 2])) + 3
    >>> regr = LinearRegression().fit(X, y)
    >>> card = Card(regr)
    >>> card.save("README.md")
    >>> # later, load the card again
    >>> parsed_card = parse_modelcard("README.md")
    >>> # continue editing the card
    >>> parsed_card.add(**{"My new section": "My new content"})
    >>> # overwrite old card with new one
    >>> parsed_card.save("README.md")

    Parameters
    ----------
    path : str or pathlib.Path
        The path to the existing model card.

    Returns
    -------
    card : skops.card.Card
        The model card object.

    """
    check_pandoc_installed()

    proc = subprocess.run(
        ["pandoc", "-t", "json", "-s", str(path)],
        capture_output=True,
    )
    source = str(proc.stdout.decode("utf-8"))

    parser = PandocParser(source)
    card = parser.generate()

    return card
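
The header bookkeeping in `parse_header` above can be illustrated in isolation. The sketch below reuses the same slicing step on a plain list of `(level, title)` pairs; the function name `section_paths` and the example titles are made up for illustration.

```python
from typing import List, Tuple


def section_paths(headers: List[Tuple[int, str]]) -> List[str]:
    """Return the slash-joined section path for each (level, title) header."""
    trace: List[str] = []
    paths = []
    for level, title in headers:
        # keep the ancestors above this level, then append the new header
        trace = trace[: level - 1] + [title]
        paths.append("/".join(trace))
    return paths


headers = [
    (1, "Model description"),
    (2, "Intended uses"),
    (2, "Training"),
    (1, "Citation"),
]
for path in section_paths(headers):
    print(path)
```

Each level-2 header is placed under the most recent level-1 header, and a new level-1 header resets the trace, which is how the flat list of pandoc blocks is turned back into a tree of sections.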