Test framework for sphinx-simplepdf #83

kreuzberger · 2023-07-28T07:16:43Z

Is there a chance to implement a basic test framework? I don't know whether I should or could take one over "as is" from the other useblocks repositories.

The PDF output could be tested with one of the Python pdftotext modules available on PyPI, e.g. to count pages, or to get the text from individual pages and check whether some expected text appears.
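A minimal sketch of the text check proposed above, assuming the per-page text has already been extracted by some PDF text library. The helper name `pages_containing` and the sample strings are made up for illustration; the extraction step itself is deliberately left out.

```python
from typing import Sequence


def pages_containing(pages: Sequence[str], expected: str) -> list[int]:
    """Return the 1-based numbers of all pages whose text contains `expected`."""
    # Collapse whitespace so that line wraps in the extracted text do not hide a match.
    needle = " ".join(expected.split())
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in " ".join(text.split())
    ]


# Stand-in for the per-page text an extraction library would return (hypothetical data).
sample_pages = [
    "Title page\nsphinx-simplepdf demo",
    "1 Introduction\nThis chapter explains\nthe build.",
]

page_count = len(sample_pages)
hits = pages_containing(sample_pages, "explains the build")
```

A test would then assert on `page_count` and on the page numbers in `hits`.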

Implementing a "basic" test would be good; I feel motivated to add more tests 😀

danwos · 2023-07-28T08:06:53Z

I agree, a test framework would be great.
But just checking for certain text is not enough for me.
I would also like to be able to check the layout, so that the tests cover, for instance:

Does a table fit on the page
Is a page break used correctly
Is the used font-size/family/color correct
Is an image scaled correctly

A quick search didn't turn up any promising solution for this.

@ubmarco: As PDF miner expert, do you have an idea how this could be achieved?

danwos · 2023-07-28T08:17:52Z

Maybe a solution would be to make a pixel-by-pixel comparison with a golden sample, which got checked once manually.

There is a PyMuPDF issue that discusses this:
pymupdf/PyMuPDF#584

technical concept (idea)

A test-case contains:

Sphinx project, which gets built by simplepdf
A PDF as golden-sample, which was checked once

Pytest-fixtures to:

Build the PDF from the Sphinx-project
Extract the textual content as JSON, so that it can be used for tests

A helper function like compare_pdf(new_pdf, golden_sample), which compares the two PDFs pixel by pixel to check for layout problems.
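The pixel-by-pixel comparison could look roughly like the sketch below. It assumes each PDF page has already been rendered to a bitmap (a list of rows of RGB tuples); rendering is left to whatever PDF renderer ends up being used. Unlike the `compare_pdf(new_pdf, golden_sample)` idea above, this variant takes rendered pages rather than PDF files, and the `tolerance` and `max_ratio` parameters are assumptions added for illustration.

```python
Pixel = tuple[int, int, int]
Bitmap = list[list[Pixel]]


def diff_ratio(new: Bitmap, golden: Bitmap, tolerance: int = 0) -> float:
    """Return the share of pixels that differ by more than `tolerance` in any channel."""
    if len(new) != len(golden) or any(len(a) != len(b) for a, b in zip(new, golden)):
        return 1.0  # a different page size counts as a complete mismatch
    total = sum(len(row) for row in golden)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_new, row_golden in zip(new, golden)
        for p, q in zip(row_new, row_golden)
        if any(abs(a - b) > tolerance for a, b in zip(p, q))
    )
    return changed / total


def compare_pdf(new_pages: list[Bitmap], golden_pages: list[Bitmap], max_ratio: float = 0.0) -> bool:
    """True if the page counts match and every page stays within `max_ratio`."""
    if len(new_pages) != len(golden_pages):
        return False
    return all(diff_ratio(n, g) <= max_ratio for n, g in zip(new_pages, golden_pages))


# Two tiny 2x2 "pages": the grey pixel moved, as it would after a layout shift.
white, grey = (255, 255, 255), (200, 200, 200)
golden = [[[white, white], [white, grey]]]
shifted = [[[white, grey], [white, white]]]
```

A non-zero `max_ratio` may be needed in practice, since anti-aliasing can differ slightly between renderer versions.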

So in the end, each test case defines its own little project and therefore its own PDF.
There is no single PDF file that contains everything for all test cases (like our demo PDF).
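Building one small Sphinx project per test case could be sketched as follows. The helper names are hypothetical, and the sketch assumes that the builder is invoked as `simplepdf` and that the resulting PDF lands directly in the output directory; both should be checked against the installed sphinx-simplepdf version.

```python
import subprocess
import sys
from pathlib import Path


def simplepdf_command(source_dir: Path, build_dir: Path) -> list[str]:
    """Assemble the Sphinx call that runs the simplepdf builder for one test project."""
    return [
        sys.executable, "-m", "sphinx",
        "-b", "simplepdf",  # assumed builder name
        str(source_dir),
        str(build_dir),
    ]


def build_pdf(source_dir: Path, build_dir: Path) -> list[Path]:
    """Build the project and return the PDF files found in the output directory."""
    subprocess.run(simplepdf_command(source_dir, build_dir), check=True)
    # Glob instead of hard-coding a file name, which depends on the project config.
    return sorted(build_dir.glob("*.pdf"))


command = simplepdf_command(Path("docs"), Path("out"))
```

In a pytest setup, `build_pdf` could be wrapped in a fixture that uses a temporary directory as `build_dir`, so that test cases stay isolated from each other.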

ubmarco · 2023-07-28T09:07:26Z

I think we should do both:

Read the PDF back into a text representation, so that we could check

is the text on the planned pages
is the text at the planned location
do tables have the correct values in their cells
do images exist

We could use libpdf for this (a pdfplumber and pdfminer wrapper). This kind of test points directly at where things went wrong, and it can also detect whether tables wrapped. Keep in mind that PDFs have no understanding of words, sentences or tables. They only know letters, letter orientation, font and color. Tables are made of lines.
So for proper table detection we need to use tables with borders.
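Once element coordinates have been read back, position checks such as "does the table fit on the page" reduce to bounding-box comparisons. The sketch below uses a hypothetical `Box` type rather than libpdf's own classes, and the page size and margin are example values (A4 is roughly 595 x 842 points; the 57-point margin is an assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in PDF points on one page (hypothetical type)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    def inside(self, other: "Box") -> bool:
        """True if this box lies completely within `other` on the same page."""
        return (
            self.page == other.page
            and other.x0 <= self.x0 and self.x1 <= other.x1
            and other.y0 <= self.y0 and self.y1 <= other.y1
        )


def printable_area(page: int, width: float, height: float, margin: float) -> Box:
    """Return the page area that remains after removing a uniform margin."""
    return Box(page, margin, margin, width - margin, height - margin)


area = printable_area(page=3, width=595.0, height=842.0, margin=57.0)
table_ok = Box(3, 70.0, 400.0, 520.0, 700.0)
table_too_wide = Box(3, 70.0, 400.0, 580.0, 700.0)  # right edge runs into the margin
```

The same comparison works for "is the text at the planned location" by checking a text box against an expected region.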

Then we'll also need an image comparison to be sure the overall layout is still valid and the colors match, and to test theme updates.
A quick search: perceptualdiff or a home-grown solution.

Getting all needed programs installed on the GitHub node that runs the tests (e.g. pillow) might be a problem.

kreuzberger · 2023-07-28T11:33:25Z

The text solution would handle most of the test cases I have in mind. Maybe this approach could be used not only for the internal sphinx-simplepdf tests, but also for testing the real documents produced during a build.

A PDF comparison test (one PDF per test) is also OK, but I am not sure whether it
a) is easy to maintain
b) does not rely too much on weasyprint versions

Here is the question: the tests should not only run against different Sphinx versions, they should maybe also run against different weasyprint versions. This might also be tricky to handle.

danwos · 2023-07-28T12:49:00Z

The last point can easily be handled with matrix tests, which are supported by GitHub Actions.
Sphinx-Needs does this by creating different test-envs based on python, sphinx and docutils versions.

One PDF per test has the advantage that the tests are isolated from each other and therefore normally easier to maintain.
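To illustrate what a test matrix expands to, here is a small sketch in plain Python. The version labels are placeholders and say nothing about which releases are actually supported; the `excluded` rule is likewise invented to show how incompatible combinations would be dropped.

```python
from itertools import product

# Placeholder labels only, not real version numbers.
axes = {
    "python": ["py-A", "py-B"],
    "sphinx": ["sphinx-A", "sphinx-B"],
    "weasyprint": ["weasyprint-A", "weasyprint-B"],
}
# Combinations known not to work together would be listed here (hypothetical rule).
excluded = [{"python": "py-A", "weasyprint": "weasyprint-B"}]


def expand(
    axes: dict[str, list[str]], excluded: list[dict[str, str]]
) -> list[dict[str, str]]:
    """Return every combination of axis values that matches no exclusion rule."""
    names = list(axes)
    envs = [dict(zip(names, combo)) for combo in product(*axes.values())]
    return [
        env
        for env in envs
        if not any(all(env[key] == value for key, value in rule.items()) for rule in excluded)
    ]


matrix = expand(axes, excluded)
```

Each entry in `matrix` corresponds to one test environment; a CI matrix or a tox environment list expresses the same expansion declaratively.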

kreuzberger · 2024-01-09T07:53:48Z

I have to start on a test framework for the PDFs generated by simplepdf in my current project. I saw that libpdf is a repository in your organisation ( https://github.com/useblocks/libpdf ). So I assume work on a test framework could start with this, as there is currently no other solution available?

danwos · 2024-01-09T08:06:19Z

I think so, yes. It may be the easiest solution, as all other PDF libraries are more low-level.

kreuzberger · 2024-01-10T08:23:43Z

Integrating libpdf does not seem to be so easy in an environment with sphinx-simplepdf and weasyprint because of pillow dependencies. libpdf seems to have a (maybe outdated) dependency on an exact pillow version, which conflicts with the weasyprint dependency.

There seems to be a branch in libpdf that fixes this, but it has not been merged into the main branch.
@ubmarco : Maybe you could give me some hints how to solve this?

kreuzberger · 2024-01-10T09:16:00Z

And there seems to be a typo in the pyproject.toml in this branch:
"ruamel.yaml" = "^*"

kreuzberger · 2024-01-10T10:13:23Z

After some hacking I got it running, but it only runs with no_annotations and then gets stuck internally.
So I am stopping here and waiting for further hints on how to proceed.

Hacking steps:

Use the code from the mh-update-pillow branch and fix the typo above.
Build and install libpdf and all its dependencies so that they can run alongside simplepdf / weasyprint.
Try to load the simplepdf-generated PDF file with the flag no_annotations=True.

It then fails internally:

  objects = libpdf.load(pdf_info["document"], verbose=2, no_annotations=True)
../../../../build/debug/pypackages/venv/lib/python3.11/site-packages/libpdf/core.py:228: in main_api
    objects = main(
../../../../build/debug/pypackages/venv/lib/python3.11/site-packages/libpdf/core.py:118: in main
    objects = extract(
../../../../build/debug/pypackages/venv/lib/python3.11/site-packages/libpdf/extract.py:131: in extract
    extract_catalog(pdf, no_annotations)
../../../../build/debug/pypackages/venv/lib/python3.11/site-packages/libpdf/catalog.py:674: in extract_catalog
    des_dict = get_named_destination(pdf)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

pdf = <pdfplumber.pdf.PDF object at 0x7f9d36a4c990>

    def get_named_destination(pdf):  # pylint: disable=too-many-branches
        """Extract Name destination catalog.
    
        Extracts Name destination catalog (link target) from pdf.doc.catalog['Name'] to obtain
        the coordinates (x,y) and page for the corresponding destination's name.
    
        PDFPlumber does not provide explict 'Named Destinations of Document Catalog' like py2pdf, so it needs to be obtained
        by resolving the hierarchical indirect objects.
    
        The first step in this function is to check if the name destination exist in the PDF. If it does not, no extraction
        is executed.
    
        :param pdf: pdf object of pdfplumber.pdf.PDF
        :return: named destination dictionary mapping reference of destination by name object
        """
        LOG.info('Catalog extraction: name destination ...')
    
        # check if name tree exist in catalog and extract name tree
        name_tree = {}
        named_destination = {}
        pdf_catalog = pdf.doc.catalog
        if 'Names' in pdf_catalog:
            # PDF 1.2
            if isinstance(pdf_catalog['Names'], PDFObjRef) and 'Dests' in pdf_catalog['Names'].resolve():
                name_tree = pdf_catalog['Names'].resolve()['Dests'].resolve()
            elif isinstance(pdf_catalog['Names'], dict) and 'Dests' in pdf_catalog['Names']:
>               name_tree = pdf_catalog['Names']['Dests'].resolve()
E               AttributeError: 'dict' object has no attribute 'resolve'
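The traceback shows that `pdf_catalog['Names']['Dests']` can be a plain dict instead of an indirect reference, so calling `.resolve()` unconditionally fails. One defensive pattern is to resolve only when the object actually offers `resolve()`. The sketch below is a duck-typed illustration with hypothetical helper names, not the fix that was applied to libpdf; pdfminer.six also ships a `resolve1` helper in `pdfminer.pdftypes` that serves a similar purpose.

```python
def resolve_if_ref(obj):
    """Return the referenced object for an indirect reference, otherwise `obj` itself."""
    resolve = getattr(obj, "resolve", None)
    return resolve() if callable(resolve) else obj


def get_dests(catalog: dict):
    """Return the 'Dests' name tree from a catalog, or an empty dict if it is absent."""
    names = resolve_if_ref(catalog.get("Names", {}))
    if not isinstance(names, dict) or "Dests" not in names:
        return {}
    return resolve_if_ref(names["Dests"])


class FakeRef:
    """Stand-in for an indirect object reference, used only to exercise the helpers."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


# The same tree, once stored directly and once behind indirect references.
direct = {"Names": {"Dests": {"Kids": []}}}
indirect = {"Names": FakeRef({"Dests": FakeRef({"Kids": []})})}
```

Both layouts then yield the same name tree, which covers the branch that raised the AttributeError above.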

kreuzberger · 2024-01-10T14:26:27Z

patch_libpdf.zip

After some "zero-knowledge-based hacking" on the libpdf source code, I was able to extract some content.

This helps me make progress with my "pdf" check.

May the force be with you, if you decide to integrate it 😄

kreuzberger · 2024-01-11T11:56:40Z

I forked libpdf and applied fixes in https://github.com/procitec/libpdf/tree/upgrade.
I would recommend a review of the solution for the resolve problem above, as this could be the critical part (e.g. it may be better to use resolve_all or other methods). I will stop the discussion here and start a PR on the libpdf repo.

kreuzberger · 2024-01-12T10:34:09Z

With the PR in libpdf I am able to parse and test the PDF, e.g. chapters, headings, page numbering etc.
I still have to check tables.

Open questions currently:

How can I check for colors (background color / character colors), e.g. to test whether a "keyword" has a "grey" background color?
How can I check for formatting options, e.g. that a code block is rendered in a grey box?

ubmarco · 2024-01-23T21:33:37Z

I just released a new version 0.1.0 of libpdf. It now has a new element called Rect which you can find in the architecture diagram.
The rectangle color as well as its contained text with coordinates is also exposed. Any text spilling over the rectangle boundaries is cropped.
Is that feature enough to write test cases?
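With rectangle color and contained text available, the "grey box" question could be tested along these lines. The `Rect` dataclass here is a hypothetical stand-in with made-up field names, not libpdf's actual element, and the thresholds for what counts as grey are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical stand-in for an extracted rectangle."""
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in points
    fill: tuple[float, float, float]         # RGB, each channel in 0..1
    text: str                                # text contained in the rectangle


def is_grey(color, tolerance=0.02, darkest=0.3, lightest=0.95) -> bool:
    """True if the channels are nearly equal and the shade is neither black nor white."""
    r, g, b = color
    return max(r, g, b) - min(r, g, b) <= tolerance and darkest <= r <= lightest


def rects_with_text(rects, keyword):
    """Return all rectangles whose contained text includes `keyword`."""
    return [rect for rect in rects if keyword in rect.text]


rects = [
    Rect((70.0, 400.0, 520.0, 430.0), (0.9, 0.9, 0.9), "make simplepdf"),
    Rect((70.0, 450.0, 520.0, 480.0), (1.0, 1.0, 1.0), "plain text"),
]
code_boxes = rects_with_text(rects, "simplepdf")
```

A test would assert that exactly one box contains the code block text and that `is_grey` holds for its fill color.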

kreuzberger · 2024-01-24T10:06:56Z

For now I think it's enough for testing. See useblocks/libpdf#36 for the integration of tests in libpdf with a PDF generated by sphinx-simplepdf/weasyprint.
I think test implementation for sphinx-simplepdf could start now.

I would expect a member of useblocks to create the test framework, maybe with poetry/nox like the other repositories. libpdf is a little bit different here, and I do not know which Python test framework useblocks currently prefers.

ubmarco · 2024-02-03T22:45:07Z

You're right, we need to set up testing for this repo. I vote for simple tox and pytest, just like for libpdf. nox only makes sense if we need to programmatically configure the test matrix.

This was referenced Jan 16, 2024

Color Information for Paragraphs useblocks/libpdf#25

Closed

fix dependencies from pillow branch and fix processing of pdf files useblocks/libpdf#24

Closed
