Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable correct whitespace handling in write_html() #547

Closed
gmischler opened this issue Sep 18, 2022 · 10 comments · Fixed by #602
Closed

Enable correct whitespace handling in write_html() #547

gmischler opened this issue Sep 18, 2022 · 10 comments · Fixed by #602

Comments

@gmischler
Copy link
Collaborator

gmischler commented Sep 18, 2022

Technically, this is also a bug report...

Problem

It is one of the characteristics of the HTML format that you can add all kinds of whitespace (spaces, tabs, newlines, etc.) in any amount pretty much anywhere. When the data is rendered by a conforming web browser, all that whitespace gets collapsed into a single space character between the actually printing data to the left and right of it.

The fpdf2 HTML parser currently completely ignores this rather important rule.
Consequently, HTML documents rendered with fpdf2 tend to look very ugly.

Sub-Problem

The render_toc() method even makes deliberate use of this un-feature, and uses sequences of space characters to seperate the labels and associated page numbers from each other. Of course, this only "kind of" works with a monospaced font, and ends up looking horribly out of whack with any normal font.

Solution

Within a single data sequence, a regex might be able to do the collapsing.
Beween seperate data sequences (ie. next to any HTML tag), it is probably necessary to maintain a flag, which stores whether the previous data sequence ended with a space. If so, then any space at the beginning of the next data sequence should be skipped.
Note that eg. <pre>...</pre> and <code>...</code> sequences are exceptions that need to be passed through unchanged.

Sub-Solution

The render_toc() method needs to be changed, so that it renders the text and page number of each entry as a seperate [multi_]cell() in its own well-defined horizontal position.

Bonus points for

  • enabling multi-column TOC layout
  • an option to draw a dotted line between labels and page numbers in the TOC
  • Correct handling of non-breaking space characters and other special cases

Anyone up for a cleanup task like this?
PS.: The TOC could be fixed idependently of the general whitespace issue, but if so, then that must happen first.

@ehildebrandtrojo
Copy link

Hey! I'm down to give it a shot; can you assign me the task.

FYI I'm new to contributing to open source, so don't be surprised if I end up asking some basic questions along the way.

@Lucas-C
Copy link
Member

Lucas-C commented Oct 3, 2022

Hi and welcome @ehildebrandtrojo !
Task assigned 😊
Open source newcomers are welcome here, feel free to asky any question!

@Lucas-C
Copy link
Member

Lucas-C commented Oct 17, 2022

Hi @ehildebrandtrojo!
Have you started codding on this?
If you face any difficulty, feel free to report them and ak for help!

Those pages are good starting points if you are new to open source:

@ehildebrandtrojo
Copy link

Thank you for the resources Lucas!
Unfortunately I've been really busy with clases and I still haven't had the chance to start.
In the next few days I should be able to get started though. If I can't, I'll let you know, and you can let someone else take the ticket.
Sorry for any inconveniences caused.

@Lucas-C
Copy link
Member

Lucas-C commented Oct 17, 2022

Unfortunately I've been really busy with clases and I still haven't had the chance to start.
In the next few days I should be able to get started though. If I can't, I'll let you know, and you can let someone else take the ticket.
Sorry for any inconveniences caused.

Don't worry, that's perfectly fine 😊
This is volunteer work, no pressure!

@ehildebrandtrojo
Copy link

Hey @Lucas-C!
Unfortunately, I don't think I have the time to work on this issue.
I'm graduating this semester and so, classes, along with job search, are taking up a lot of my time.
Still, I wanted to thank you for creating such a welcoming environment for first time contributors.
I hope that once I'm done with the semester, I can start contributing to more open source projects like this one!

@Lucas-C
Copy link
Member

Lucas-C commented Oct 24, 2022

Unfortunately, I don't think I have the time to work on this issue.

Ok, no worries!

I'm graduating this semester and so, classes, along with job search, are taking up a lot of my time.

Good look with all of that 😊

I hope that once I'm done with the semester, I can start contributing to more open source projects like this one!

🤞

@seanpmulholland
Copy link

seanpmulholland commented Nov 11, 2022

Hi all, working on this now. Relatively new to OS contributions so let me know if anything looks incorrect!

I have the whitespace issue mostly sorted. Had a question on tests though - would it be preferred to replace the test.pdf files with updated outcomes or to force the code in the tests to display the old extra whitespace?
example:

<h1>H1 {long_title*2}</h1> <- test case with 3 spaces

old html_headings_line_height test
image

updated html_headings_line_height test <- trimmed to 1 space
image

Sub-Problem

Just getting familiar with the code so I might be mistaken but it looks like the ToC rendering happens outside of html processing and writes directly to the pdf. So this change does not break ToC rendering, except for the indentation in the test case 'Table of content' header, since that is now trimmed html whitespace.

before:
image
after:
image

Let me know how to best handle the affected tests and I'll sort that out and open a pull req!

@gmischler
Copy link
Collaborator Author

would it be preferred to replace the test.pdf files with updated outcomes or to force the code in the tests to display the old extra whitespace?

The current tests just verify the existing behaviour, which at the time was considered "good enough" but is technically incorrect.
Once the code is fixed to produce the correct result, the tests should of course also reflect that and verify the corrected output.
There is no point in "veryfying" the previously incorrect output even after it has been corrected. So yes, you'll have to adapt all the affected tests.

@seanpmulholland
Copy link

Got it, thanks! PR opened

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants