Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow safe tags in escapeHtml #3998

Closed

Conversation

maccabeelevine
Copy link
Member

@maccabeelevine maccabeelevine commented Oct 9, 2024

Allow safe HTML that does not pose any XSS risk to pass through escapeHtml for display output.

This is an idea, just as a starting point to see if any form of it could be practical. I have my doubts. But the alternative is annoying.

I'm working on a new backend that returns "MARC", except the fields like title sometimes have <em> tags in them. I could write a new helper "EscapeExcept" that does the basic logic this one does, allowing those specific fields to pass through unencoded. But because I don't know which "MARC" fields will have this HTML, and because record fields are displayed all over the place, I would have to clone many different templates and search/replace escapeHtml in each of them, and then that would be a maintenance nightmare in the future.

Wouldn't It Be Nice (TM) if we could just replace escapeHtml altogether? No, self, obviously it would not be nice because

  • it affects every other backend that doesn't need it
  • to be practical (a drop-in replacement for escapeHtml) it requires using a default set of tags to replace, globally, which might not be ideal
  • it slows down escapeHtml which is called a zillion times
  • there could still be security issues?

But I'm hoping there is a middle ground I'm not thinking of.

$title = $this->fields['title_short'] ?? '';
$words = explode(' ', $title);
if (count($words) > 3) {
$words[2] = '<em>' . $words[2] . '</em>';
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so the effect is easily demoed, on any record with a reasonably long title.

@maccabeelevine maccabeelevine marked this pull request as ready for review October 9, 2024 12:37
@EreMaijala
Copy link
Contributor

I wouldn't do this with escapeHtml, but I'd be happy to contribute our cleanHtml helper that does what you want. It uses a proper HTML purifier to do the heavy lifting.

Longer explanation: escapeHtml is meant for situations where the input is not supposed to be HTML. If we start treating it as HTML, it means that we mess up stuff by handling as HTML things that are not supposed to be HTML. Consider this fictitious plain-text example:
"This article explains how to use simple HTML elements such as and

."

@demiankatz
Copy link
Member

For what it's worth, in the Finding Augustine bibliography instance I maintain, it is possible for titles to contain italics, so I created an allowItalics() view helper which does more or less the same thing that you're doing here, and then modified templates to wrap relevant title displays. Not ideal, but it has worked.

I definitely do not think we want to change the default behavior of escapeHtml for all the reasons you cite -- and there may even be situations where we really want to display HTML markup, so we don't want to make that impossible.

I think the best bet might be to build a couple of different helpers:

1.) unescapeHtml(string $html, array $tagsToUnescape) - the signature should be self-explanatory
2.) escapeUnsafeHtml(string $text, array $safeTags) - this could be a wrapper around escapeHtml + unescapeHtml

We could potentially put a configuration in theme.config.php to list default safe tags that get used if a list is not provided to the helpers.

It might also be possible to incorporate this into the RecordDataFormatter configuration somehow to make it somewhat automatic for the majority of cases (though things like search results, related links, recommendations, etc. would require separate handling since they don't use the formatter).

@maccabeelevine
Copy link
Member Author

I wouldn't do this with escapeHtml, but I'd be happy to contribute our cleanHtml helper that does what you want. It uses a proper HTML purifier to do the heavy lifting.

Longer explanation: escapeHtml is meant for situations where the input is not supposed to be HTML. If we start treating it as HTML, it means that we mess up stuff by handling as HTML things that are not supposed to be HTML. Consider this fictitious plain-text example: "This article explains how to use simple HTML elements such as _ and _

."

That all makes sense, and please do contribute. But the problem here is I do now expect these non-HTML fields to contain HTML, I would have to clone every template and change each potential escapeHtml to cleanHtml. Hoping there is some alternative I'm not thinking of.

$escaped = $this->laminasEscapeHtml->__invoke($str);

// Revert ok chars
foreach ($except as $tag) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be more efficient to create arrays of searches and replacements and then do a single str_replace call, instead of doing multiple str_replace calls in a loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...but if @EreMaijala's solution is the best way forward, it's a moot point. :-)

@demiankatz demiankatz changed the title Allow safe characters in escapeHtml Allow safe tags in escapeHtml Oct 9, 2024
@demiankatz
Copy link
Member

One other thought: maybe we need to create some kind of wrapper class that we can use as a marker for text that is expected to contain HTML. It could have a toString method so that it would seamlessly integrate with existing code, but then we could detect the wrapper class and escape it differently. This is only a half-baked idea, and it would likely require either significant view helper changes or a lot of template changes... but it might be worth thinking about, and it might offer a more seamless way to support your use case if we do it correctly. (See TranslatableString for an example of the kind of pattern I'm thinking about).

@maccabeelevine
Copy link
Member Author

It might also be possible to incorporate this into the RecordDataFormatter configuration somehow to make it somewhat automatic for the majority of cases (though things like search results, related links, recommendations, etc. would require separate handling since they don't use the formatter).

Yeah, the idea would be, is there a way we could make these changes to the core (non-backend-specific) templates and RecordDataFormatter common classes, etc., so it doesn't become a maintenance nightmare, and yet doesn't have all the negative performance etc. side effects. Because I realize this is an unusual issue, on a backend that most will never even use, and even I don't know the extent of which fields can have the HTML in them or what specific tags there might be. But cloning all of those templates also seems like a really bad approach.

One other thought: maybe we need to create some kind of wrapper class that we can use as a marker for text that is expected to contain HTML. It could have a toString method so that it would seamlessly integrate with existing code, but then we could detect the wrapper class and escape it differently. This is only a half-baked idea, and it would likely require either significant view helper changes or a lot of template changes... but it might be worth thinking about, and it might offer a more seamless way to support your use case if we do it correctly. (See TranslatableString for an example of the kind of pattern I'm thinking about).

I like this idea! If I understand correctly, the new record driver could wrap its own strings in that marker class before returning them. Then as you say we would need a lot of changes to the templates or helpers to check for the wrapper and treat it differently, but at least those changes could be made to core files without actually affecting the other backends. It's a thought. The performance impact would be negligible for unaffected backends. It could also be implemented incrementally. We'd still need consensus though on which fields should actually allow those formatting HTML tags through. And which tags, but maybe that's not a real challenge if @EreMaijala's purifier function has already thought that through.

@demiankatz
Copy link
Member

Thanks, @maccabeelevine, this sounds like the beginnings of a plan.

One other thought, though: as a short-term solution, can you just use strip_tags in the record driver to eliminate the problem entirely? That might be good enough to launch your backend, and then we can consider whether it's worth the effort of allowing this greater flexibility in the longer term. (I do think this is potentially worthwhile, especially since it would help me simplify the Finding Augustine code I maintain, but since it's a pretty big change, I don't think we can rush it, so it would be good to have a short-term easy fix as well).

@maccabeelevine
Copy link
Member Author

One other thought, though: as a short-term solution, can you just use strip_tags in the record driver to eliminate the problem entirely?

Yes great idea, will do so. And will close this PR since we obviously won't be replacing escapeHtml, but I'll look forward to @EreMaijala's code for the purifying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants