-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Special Characters (ampersand) in Title break docx output #401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hi, @jonnsn. You must escape any string you submit to PHPWord. As for me, I prefer Zend\Escaper for that purpose. PHPWord code is not responsible for text escaping. Developer, who use PHPWord, is responsible for escaping string he passes to PHPWord. There maybe lack of documentation on that topic. Feel free to modify PHPWord documentation. ;) |
Hi @RomanSyroeshko Thanks for your reply! $section->addTitle(htmlspecialchars('Foo & Bar'), 1); the document gets opened and the title is correct in TOC, but not in the document body. Its plain 'Foo & amp; Bar' [added whitespace due to github interpreting the entity] there. This is also discussed in the linked StackOverflow-Thread. I just tried quick debugging and changed line 100 of PhpWord\Writer\Word2007\Element\TOC to $xmlWriter->writeRaw(htmlspecialchars($title->getText())); and it seemed to have worked. Maybe others could test this too... |
The behavior is caused by double escaping in the code. That's why you get "& amp;" in title. I fixed this in Could you please test? |
sorry for the late reply, I was on vacation. Thanks for your help! |
OK, closed for now. Thanks for the bug report. |
Hi @RomanSyroeshko -- edit |
@RomanSyroeshko Microsoft made this format for "Encode and Decode XML Element and Attribute Names". https://msdn.microsoft.com/en-us/library/vstudio/35577sxd(v=vs.100).aspx I think we need replace $this->getText() -> htmlspecialchars() here. |
Hi, @maxibul. Thanks for pointing this out. Does it cause any issues: generated document is broken or whatever? |
Yes, generated document is broken without this fix. Same result, as in first post. |
What if you replace I mean, if you say that P. S. |
I little confused with "all escaping must be done outside of the library". PHPWord support export result in different formats (docx, pdf,...). Each format has own rules of character escaping. In result if I need result in 2 different formats, I need learn each required format and create 2 documents with different escaping? Do you think this is normal? Or I miss something? If library has different functions for export in different formats, why it can't prepare data for each format? |
OK, lets align our terminology first. What do you mean under "escaping"? Simple |
under "escaping" I mean:
... |
Hi Roman, commit b2286f8 made PHPWord completely broken for all Word documents that have the ampersand ('&') character anywhere in the body of the document, as opposed to being broken only for that character residing in the title of the document. You said the following: "PHPWord code is not responsible for text escaping." I sincerely hope that you don't mean that PHPWord is not responsible for proper output escaping. If I supply a valid UTF-8 string to PHPWord, PHPWord should apply proper escaping based on the output location. So if it ends up as the contents for an XML element, you apply XML character escaping. If it ends up in HTML, you apply HTML character escaping. If you don't do this, you will effectively create a broken library. Please apply MaxiBul's suggestions for output escaping. |
@gmta, what's the problem? Have you read our release notes? Have you reviewed the changes before applying the new version of PHPWord? Did you try escaping strings BEFORE you pass them to the library? |
@RomanSyroeshko Please stop deleting my comments. You know that people get email notifications of these, right? And my activity feed still contains those comments? I gave you a number of reasons why your current direction on output escaping for PHPWord is just plain wrong. Please supply counter arguments if you think I'm wrong. Again, please read this informative article for more information. Trying to hide critique is a bit dictator-y, if you ask me. |
Your comment was aggressive, not constructive and you didn't answer my questions. Such comments will be deleted in future too. |
I do not think it was aggressive - I answered your questions, I gave reasons why the new course for PHPWord output escaping would be a mistake and even offered you help and referenced you to further reading material. Could you please respond to this issue though? We still do not know if you plan to stick with this new course or plan to change PHPWord in such a way that it becomes a library that can generate valid Word documents without requiring its users to perform escaping themselves? |
We consider introducing escaping components in the next release. For now escaping must be done by user. If you don't like this, continue using v0.11.1. @maxibul, thanks for the given rules. |
Totally agree with @gmta . Programming as spanish developer with this tool is completely crazy. If escaping is a core-must to make it work the library. Why the library does not provide good methods to do it? Or better, why library doesn't do it by itself? |
Based on comments in gethub (#514) I changed my process to replace & amp ; in the text that I'm feeding phpword with & amp ; amp ; and it works. That gets me past this issue for now. You'll need to remove the spaces, I added them to get them to display correctly |
This is crazy - the application is responsible for escaping text passed in through methods that accept text based on some internal use of that text which the application does not get involved in and should not care about in the slightest? What the hell? Really? You really need to rethink this decision. |
Just adding another comment to the pile. edit: |
I suspect it comes down to what level of abstraction is being applied to this library. On one hand the author looks at it as a library to help being do stuff with the the XML that makes up a document, so the XML kind of bleeds out of it. On the other hand, since it has multiple output formats, it is seen as a generic document generation tool, some document types of which have no relation to XML whatsoever. So it's all a but confused. @mjpelmear IMO You are right - the level of abstraction that applications using using the library should be dealing with is not XML - not even in the slightest bit XML. It's all just data: words, structures and formatting instructions. However, those applications should be able to extend the XML layer that converts the generic document objects into an XMLDOC stream, so that additional features can be added at that level and only affect the XML generation and not rtf documents, and of course plug into the various different XML document types separately. For example, adding tabs to a header involves some horrible hacks in the application - inserting different special characters depending on what the known output format is going to be (e.g. escape sequences for RDF, closing tags and inserting other elements then reopening tags again for XMLDOC, and completely different XML hacks for OpenXML). There has to be an easier way to extend it to support tabs instead, The problem in solving this is that the library has evolved the way it has because (I suspect) it has been quite a journey of discovery - there is very little information out there that makes real sense of these document formats. However, having done that, starting again with lessons learnt, and a more formal layered approach is needed to lead on to a more maintainable library. That's a lot of work though, so we are stuck here. I don't want to put down the fantastic effort put into this library so far - I really don't. I do think it is standing at a crossroads though. That's just my personal feeling about it. |
This feels weird
I feel like it should be
But I guess for some reason doing a static class was decided. Its just hard to know what PHPWord can do when things aren't connected. |
Thanks for the comment @mjpelmear, helped solve my issues !
|
Dang spoke too soon, was still causing issues.
Needs review / work as @judgej says. |
Hi,
if you change Sample_17_TitleTOC.php to add a title including an &-sign
and select docx-Result the document won't open (tested in Word 2013):
"The file ____ cannot be opened because of problems with the contents"
Details: "Invalid char"
Position: [position of the ampersand in document.xml]
I did not test this with other characters but there might also be problems with ",',< and >
I found this post which might indicate the problem.
Escaping the character does not really help (see the StackOveflow-Thread linked above).
If I find some time I will try to solve this issue and make a patch, but I just wanted to note it down here already. Maybe someone has a note or already solved it somehow :-)
Regards
The text was updated successfully, but these errors were encountered: