Html Reader Process Titles as Headings Not Paragraphs #2533
Fix PHPOffice#1692. Builds on work started some time ago by @0b10011, to whom primary credit is due.

Html Reader does not process the `head` section of the document and, in particular, does not process its `style` section. It does, however, process inline styles, so 0b10011's model of adding the title as a text run (with styles) will work well once this change is applied. That model would not, however, handle the alternative method of assigning a Title Style and then adding the title as plain text. To accommodate that, I have removed the declaration of heading font styles from the head section and now generate them all inline in the body. This has the added benefit that a document can be read as html and then saved as docx while preserving, at least in part, any user-defined font styles. Note that html has pre-defined title styles, but docx does not.

@constip suggests in the original issue that margin top and bottom are being applied too frequently. I believe that was addressed by the recently merged PR PHPOffice#2475.

It is also suggested that the `*` css selector be dropped in favor of `body`. PR 2475 added the `body` selector. I agree that this renders the `*` selector unnecessary and, as stated in the issue, it can cause problems. This PR drops that selector.

It is also suggested that `loadHTML` be used instead of `loadXML`. This is not as easy a change as it seems, because `loadHTML` defaults to the ISO-8859-1 charset rather than UTF-8, so I will not attempt that change.
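To illustrate the charset concern, here is a minimal sketch using only PHP's built-in `DOMDocument`. It is not code from this PR: the sample fragment and the variable names are made up for the example, and the exact bytes `loadHTML` produces may vary with the installed libxml2 version.

```php
<?php
// Illustrative only: the fragment and variable names are hypothetical.
$fragment = '<p>café</p>'; // UTF-8 source text, no charset declared

// loadXML treats input without an encoding declaration as UTF-8.
$xml = new DOMDocument();
$xml->loadXML($fragment);
$fromXml = $xml->getElementsByTagName('p')->item(0)->textContent;

// loadHTML falls back to a single-byte charset when none is declared,
// so the two-byte "é" is typically decoded as two separate characters.
$html = new DOMDocument();
$html->loadHTML('<html><body>' . $fragment . '</body></html>');
$fromHtml = $html->getElementsByTagName('p')->item(0)->textContent;

var_dump($fromXml === $fragment ? 'unexpected' : $fromXml);
var_dump($fromXml === 'café');  // text survives the XML round trip
var_dump($fromHtml === 'café'); // usually false because of the charset fallback
```

A switch to `loadHTML` would therefore also need a charset declaration added to the input (or an equivalent conversion step), which is the extra work the description above declines to take on.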
@oleibman Could you move the changes to the 2.0.0.md file, please?
It seems that this PR is not finished, is it?
@Progi1984 I have made the code change and moved the change notes to the new log. But ...
I'm not sure what you mean. What work do you think is still undone?