Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary preg_replace? #459

Closed
edent opened this issue Sep 21, 2023 · 20 comments
Closed

Unnecessary preg_replace? #459

edent opened this issue Sep 21, 2023 · 20 comments
Labels

Comments

@edent
Copy link
Contributor

edent commented Sep 21, 2023

I'm not sure that this line is needed:

$content = \trim( \preg_replace( '/[\n\r\t]/', '', $content ) );

If I've understood correctly, the trim() removes the leading & trailing whitespace. Then the preg_replace removes a single character inside the remainder of the text.

I see the need for the time, but not the preg_replace.

In my tests, removing the line didn't make any difference to the content of the ActivityPub JSON.

@pfefferle
Copy link
Member

pfefferle commented Sep 22, 2023

Maybe Mastodon has changed something in the meantime, but in the early days they had a nl2br or something similar and that generated a lot of space between paragraphs.

@edent
Copy link
Contributor Author

edent commented Sep 22, 2023

That may still be the case. But I think this regex would only remove the first newline it encountered.

@pfefferle
Copy link
Member

It will replace it in the complete content: https://onlinephp.io/c/3e435

@edent
Copy link
Contributor Author

edent commented Sep 22, 2023

I must stop doing regex in my head.... 😆

That said, I'm not sure it changes the presentation on Mastodon which, in my experience, tends to eat whitespace. But perhaps others aren't so forgiving?

@janboddez
Copy link
Contributor

I also remember it being an issue with Mastodon specifically. Used a callback to overwrite, well, pretty much the entire post (or object) content to better suit my custom post types ... and I ran into the same thing.

@aslakr
Copy link
Contributor

aslakr commented Jan 11, 2024

Shouldn't there be exceptions inside <pre>, as generated by for example code, verse and preformatted Gutenberg blocks?

@janboddez
Copy link
Contributor

janboddez commented Jan 11, 2024

Shouldn't there be exceptions inside \<pre\>, as generated by for example code, verse and preformatted Gutenberg blocks?

Yes, I've run into this a couple times already :-)

Update: ran a quick experiment using the following ...

$content = preg_replace( '~<pre(?:[^>]*)>.*?</pre>(*SKIP)(*FAIL)|\r|\n|\t~s', '', $content );

That works, when I print $content to error_log(), at least. Because it looks like Mastodon itself then goes on to strip the newlines ...?

So I added $content = nl2br( $content, false );, which replaces the leftover newlines with <br>, but ... no dice.

@aslakr
Copy link
Contributor

aslakr commented Jan 11, 2024

Yes, trying to parse HTML using regex might be a bit optimistic.

Newer versions of the Gutenberg-editor seems to replace \n with <br> WordPress/gutenberg#55999 which creates it own problems westonruter/syntax-highlighting-code-block#790. Also, it wouldn't help classic editor or manually edited <pre>-blocks.

Glitch seems to replace newlines with <br>

https://github.com/glitch-soc/mastodon/blob/cbc951627c88000494a18fdf61473b63c9d7f669/app/lib/advanced_text_formatter.rb#L10-L14

It would be nice to preserve the tabs in code blocks.

@aslakr
Copy link
Contributor

aslakr commented Jan 12, 2024

I guess one need to change/remove the following:

$content = \trim( \preg_replace( '/[\n\r\t]/', '', $content ) );

$content = \preg_replace( '/[\n\r\t]/', '', $content );

$content = \preg_replace( '/[\n\r\t]/', '', $content );

Considering that Mastodon and others now support <pre> etc. is it really nessecery to remove newlines and tabs?

@janboddez
Copy link
Contributor

janboddez commented Jan 12, 2024

Considering that Mastodon and others now support <pre> etc. is it really nessecery to remove newlines and tabs?

I'm seeing additional newlines in a lot of (other people's) federated WordPress posts, so I'm thinking: Yes.

My solution -- to the necessary preg_replace() calls seemingly not working (apparently when the block editor is involved) -- has been to use a custom filter.

Works wonders. I've just now updated it to not strip these characters from inside pre as outlined above, and that works as well (when I print $content to the debug log, it looks exactly as it should).

Except there's something stripping them after all, I'm guessing on the Mastodon side? (Which doesn't make a whole lot of sense, seeing as it introduces newlines elsewhere. Although ... clients are inconsistent here, too.)

My one remaining issue is that comments don't get filtered the same way.

@janboddez
Copy link
Contributor

My solution -- to the necessary preg_replace() calls seemingly not working (apparently when the block editor is involved) -- has been to use a custom filter.

Still makes me think the preg_replace() should come after the apply_filters( 'the_content', ....

@aslakr
Copy link
Contributor

aslakr commented Jan 12, 2024

Considering that Mastodon and others now support <pre> etc. is it really nessecery to remove newlines and tabs?

I'm seeing additional newlines in a lot of (other people's) federated WordPress posts, so I'm thinking: Yes.

Probably.

But then, how Mastodon (and others] handles whitespace (space, tabs and newlines) in non-pre context might be more of problem that should be solved on the reciving end (i.e. Mastodon)?

Anyway, I primary just would like to be able to have:

Screenshot of Mastodon showing a code sample with spaces, tabs and newlines

instead of:

Screenshot of Mastodon showing a code sample without tabs and newlines

@janboddez
Copy link
Contributor

janboddez commented Jan 12, 2024

Works wonders. I've just now updated it to not strip these characters from inside pre as outlined above, and that works as well (when I print $content to the debug log, it looks exactly as it should).

Haha, I'm now pretty sure I have a second, older callback cleaning up the remaining newline characters. Gonna try in a sec. Edit: Yes! (https://indieweb.social/@jan@jan.boddez.net/111731925084505754) A single newline too many at the end of the pre, I can live with that (for now).

[H]ow Mastodon (and others] handles whitespace (space, tabs and newlines) in non-pre context might be more of problem that should be solved on the reciving end (i.e. Mastodon)?

As someone who's written a couple feed readers (and thus has had to parse and sanitize others' markup): it's a pain. What's worked for me, for "cleaning up" random bits of HTML, is strip as much whitespace chars as possible and then run the result through WordPress' wpautop(). Mastodon takes a different approach, and how some 3rd-party clients treat whatever Mastodon gives them yet another one ... Not much we can do, I'm afraid. But then I've mostly gotten it to work (through a bit of a hack, i.e., custom PHP).

@janboddez
Copy link
Contributor

janboddez commented Jan 12, 2024

OK, so this seems to work for me (i.e., this is essentially what's currently in my custom activitypub_the_content callback, where I also use my own list of allowed tags):

$content = apply_filters( 'the_content', $post->post_content );

$content = wp_kses( $content, $allowed_tags );

// Strip whitespace, but ignore `pre` elements' contents.
$content = preg_replace( '~<pre[^>]*>.*?</pre>(*SKIP)(*FAIL)|\r|\n|\t~s', '', $content );

// Trim newlines off of (`code` elements inside) `pre` elements.
if ( preg_match_all( '~<pre[^>]*><code[^>]*>(.*?)</code></pre>~s', $content, $matches ) ) {
	foreach ( $matches[1] as $match ) {
		$content = str_replace( $match, trim( $match ), $content );
	}
}

And then I return that. (I completely discard what the plugin generates based on the [ap_content] etc. template in its settings! Note that I started doing this mainly because I want to return a different "content" for different post types. E.g., include a title for longer posts, but not include one for "notes.")

Still gotta test in Tusky, but looks good in Mastodon's web UI.

My custom "ActivityPub add-on plugin": https://gist.github.com/janboddez/58a2f3d2c86717cd799048af651fa6b4#file-activitypub-php (Note: there's a few things in there that are no longer need, gotta clean up one time.)

@pfefferle
Copy link
Member

Feel free to open a PR if you have a better solution that the actual!

@aslakr
Copy link
Contributor

aslakr commented Jan 13, 2024

Does Wordpress have a build in function to do something similar to:

tidy --indent no --wrap --vertical-space yes --show-body-only yes

I suspect one can't rely on https://www.php.net/manual/en/book.tidy.php being available? And the new HTML API can't yet do something similar? Maybe later.

In meantime replacing /[\n\r\t]/ with @janboddez' ~<pre[^>]*>.*?</pre>(*SKIP)(*FAIL)|\r|\n|\t~s seems to work.

@janboddez
Copy link
Contributor

Tusky still rendering an extra empty "paragraph" at the start of the blockquote in https://indieweb.social/@ochtendgrijs@ochtendgrijs.be/111772889651276747. Fairly sure it's a Tusky issue; the culprit seems to be a space between the opening <blockquote> and <p> tags, which I think is there because this particular post uses a "Classic" block. Can't think of a reliable workaround other than maybe https://github.com/derUli/html-minifier, which is probably not something you want to include in the plugin itself, as it is yet another dependency.

@janboddez
Copy link
Contributor

Feel free to open a PR if you have a better solution that the actual!

I'll have another look at this, I think we should (?) at least try to sanitize/filter comments and posts more or less the same way.

Additional filtering is probably best left up to site authors/developers.

@janboddez
Copy link
Contributor

janboddez commented Jan 18, 2024

Case anyone's still interested (or even following), I moved my activitypub_the_content callback to https://github.com/janboddez/share-on-mastodon-addon/blob/8e56c5c5b5416c1ba5e19dc9544042cca3ba4c83/includes/class-activitypub.php, i.e., another and previously unrelated "add-on" plugin of mine. This allowed me to add the aforementioned dependency (a so-called mu-plugin is nice on its own but doesn't work all that well once multiple files and folders are involved).

Not saying y'all should start using it, quite the contrary!

But: if anyone wanted to implement a custom content filter that strips all superfluous whitespace (and thereby having Mastodon, Tusky, and, hopefully, all other clients, render posts just like they would "native" Mastodon posts) while respecting text inside pre elements, feel free to borrow from the code.

Now if only I can also filter comments this way, I'm going to be very happy. 8-)

Copy link

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 18, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants