Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add String.strip_bbcode() and String.bbcode_escape() BBCode methods #78310

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Calinou
Copy link
Member

@Calinou Calinou commented Jun 16, 2023

String.strip_bbcode() can be used to display text from a RichTextLabel in places that don't support rich formatting.

String.bbcode_escape() can be used to prevent BBCode injection in RichTextLabels that display user input.

Thanks @dalexeev for providing the String.strip_bbcode() implementation 🙂

Preview

From top to bottom: unescaped, stripped, escaped

Screenshot_20230616_045732

[b]hello[/b] [u][color=#f89]world[/color][/u]

@Calinou Calinou requested review from a team as code owners June 16, 2023 05:08
@Calinou Calinou added this to the 4.x milestone Jun 16, 2023
core/string/ustring.cpp Outdated Show resolved Hide resolved
@@ -430,6 +431,7 @@ class String {
String c_escape_multiline() const;
String c_unescape() const;
String json_escape() const;
String bbcode_escape() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String bbcode_escape() const;
String escape_bbcode() const;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why rename this method? We have json_escape() above already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I didn't notice that. Just verb + noun looks more standard. Given c_escape() and json_escape(), it is logical to name this method bbcode_escape(). However, should the second one then be named bbcode_strip() to be consistent with the first, or strip_bbcode() to be consistent with strip_edges() and strip_escapes()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer keeping strip_bbcode() as well to be consistent with strip_edges() and strip_escapes().

doc/classes/RichTextLabel.xml Outdated Show resolved Hide resolved
@markdibarry
Copy link
Contributor

markdibarry commented Jun 16, 2023

@Calinou Just tested and using the same format as your example it seems like strip_bbcode() strips both BBCode and non-BBCode tags.

From top to bottom: unescaped, stripped, escaped
image

@dalexeev
Copy link
Member

@markdibarry See godotengine/godot-proposals#5056 (comment):

It should likely strip every [tag]-like substring in the source string (and the closing pair, if present), without relying on Godot's own implementation of BBCode and the tags that we have defined.

It removes all tags, whether they exist or not. It's a String method, not a RichTextLabel method, so it can't know which tags exist and which don't. Just like to_pascal_case() method uses general rules and does not recognize specific abbreviations.

@markdibarry
Copy link
Contributor

markdibarry commented Jun 16, 2023

@dalexeev Ah, I see now. I think it would be misleading to call it strip_bbcode() then, since it strips all tags whether or not it contains BBCode. I think strip_tags() would be a more appropriate name, since it can't verify anything to do with BBCode, and this also would prevent a name conflict if in the future someone implemented a proper way to strip BBCode from text. Though, that would probably be part of the text server or RTL... though I can't remember if Godot's BBCode tag logic is tightly coupled to that node. That would probably need changed in the future as well, if so. escape_bbcode() still makes sense, since it's relying on the Godot BBCode implementation.

@dalexeev
Copy link
Member

I think it is fine as is. bbcode refers to BBCode syntax, not a list of specific codes. Any [tag] is a BBCode tag, any <tag> is a HTML tag. Just like is_valid_identifier() method only analyzes the string itself, but does not answer the question whether this identifier is declared in your script or not (and does not even answer the question whether the string is a keyword like if, var, etc.).

@markdibarry
Copy link
Contributor

I disagree, but you're the decision makers. Just pointing out that this method conveys that it removes BBCode, but instead, it removes BBCode, BBCode-similar tags, and any content that happens to be within brackets.

I, and two other devs I know who have established dialog languages for Godot, use tags for our language and also support BBCode. The problem, as we've discussed, is that in order to assign text, the current workflow is:

  1. Assign the entire string with BBCode and custom tags to the RTL text field.
  2. Call get_parsed_text() to get the text without BBCode
  3. Take that result and parse it for custom tags.
  4. Rebuild the string adding the BBCode as you go, comparing the two strings.
  5. Reassign the new text.

This interested me (and I'm sure others), because I assumed it actually stripped BBCode, preventing us from having to assign the text field twice. It doesn't, so it's not relevant to our cases, but since it also removes all tags, even if they're not in BBCode syntax format, I don't see how it'd would be reliable to use in any scenario.

@Mickeon
Copy link
Contributor

Mickeon commented Dec 21, 2023

Currently would heavily appreciate these methods, as it would prevent users from writing BBCode in a chat system I'm currently experimenting with. Fortunately, writing a workaround isn't exactly difficult, but it would be nice.

@Calinou Calinou force-pushed the string-add-bbcode-methods branch 2 times, most recently from aa9f05c to b216ff1 Compare January 3, 2024 23:26
@markdibarry
Copy link
Contributor

@Calinou Any chance you can name it strip_tags() and tag_escape() instead since this doesn't check for bbcode? Pretty, pretty please? I'd be so grateful.

@Calinou
Copy link
Member Author

Calinou commented Jan 4, 2024

@Calinou Any chance you can name it strip_tags() and tag_escape() instead since this doesn't check for bbcode? Pretty, pretty please? I'd be so grateful.

We already use bbcode in RichTextLabel's properties, while tag is a generic term that doesn't refer to any specific syntax. It's not quite as expressive as actually having bbcode in the function name.

It isn't technically 100% correct, but just because there's no true BBCode standard doesn't mean we can't call it BBCode. It's just like INI 🙂

@markdibarry
Copy link
Contributor

@Calinou I know, I know. It sounds totally pedantic, I just knew I'd regret it if I didn't throw my reasoning out there before it's too late, so I want to be annoying and express my concerns while I still can:

You're absolutely right that there's no official standard for BBCode. Almost every program out there that uses the term BBCode uses their own implementation/flavor, but there is a current established implementation that we use in Godot, and this method doesn't support it. I'm not invested in the word "tag" at all, it just seems like the closest word to what it actually does without being misleading, and I couldn't think of something better. It makes more sense in my head to use a generic word for this method that works in a generic way, rather than use a word that means something else specific in Godot, since that can cause a name conflict in the near future. String.strip_bbcode() seems to work like a method called Node.remove_sprite2ds() that removes all child Node2Ds.

More selfishly: My projects (and a few others I know) use dialog systems that get bogged down with heavy string manipulation required to strip RichTextLabel's implementation of BBCode, since Godot doesn't currently provide an avenue to do so without a bunch of hacks. I've spoken to a few others that maintain dialog systems and we generally all employ the same hack to strip BBCode, because stripping all text between brackets would break our systems. Even though we wouldn't be able to use this method, I would love it if someday a performant way to strip BBCode was provided, so I am invested in an eventual RichTextLabel.strip_bbcode() being available.

If Godot's implementation of BBCode and the BBCode that this method supports are incompatible, then that raises a lot of questions. In the case of a method to strip Godot's implementation of BBCode was eventually exposed, what would happen with these methods? I'd think it'd be confusing for many to have two methods that take strings that are named the same thing, but do something different. Would we rename the string.strip_bbcode()? What would we rename it to? If so, why not now? Would we have to have a different name for the RichTextLabel version? What would that be named? RichTextLabel.strip_godot_bbcode()?

@AThousandShips
Copy link
Member

If you want a specialized and reasonably performant workaround then I'd say just to use regex, as long as the documentation clarifies it isn't selective I don't see any reason to bog this down or rename it, naming it anything else would be confusing IMO, the only other name I'd say is non-confusing would be something like strip_square_bracket_tags or something verbose, because otherwise the assumption to me would be <tag> tags

@Mickeon
Copy link
Contributor

Mickeon commented Jan 4, 2024

That's all a fair point to be made. Perhaps it is best to delegate this method to RichTextLabel itself to have more control over tags. However, there have also been occasions where the editor may need to strip generic BBCode from a text before displaying it, too, without creating a whole RichTextLabel for it. Perhaps there should be a static method for it, too...

@dalexeev
Copy link
Member

dalexeev commented Jan 4, 2024

@markdibarry Formally, [unknown_tag] is also a BBCode (can be BBCode), so its removal is correct. Just RichTextLabel does not report unknown tags and invalid BBCode markup, but displays them as is.

If you are dealing with user input, then you should use escape_bbcode(). If you are dealing with valid BBCode markup, then you can use strip_bbcode() to display text where BBCode is not supported. If you are dealing with invalid BBCode markup, then strip_bbcode() may not be suitable, and you will need to implement the necessary functionality yourself.

If we want to achieve more accurate (but more verbose) naming, then I would suggest the escape_bbcode_markup() and strip_bbcode_tags(). strip_tags() is too general.

<method name="bbcode_escape" qualifiers="const">
<return type="String" />
<description>
Returns the string with BBCode tags escaped to [code][lb][/code] and [code][rb][/code], which makes them ineffective when used in [RichTextLabel] with [member RichTextLabel.bbcode_enabled] set to [code]true[/code]. This is useful for handling user input to prevent BBCode injection. See also [method strip_bbcode].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Returns the string with BBCode tags escaped to [code][lb][/code] and [code][rb][/code], which makes them ineffective when used in [RichTextLabel] with [member RichTextLabel.bbcode_enabled] set to [code]true[/code]. This is useful for handling user input to prevent BBCode injection. See also [method strip_bbcode].
Returns the string with BBCode tags escaped with [code][lb][/code], which makes them ineffective when used in [RichTextLabel] with [member RichTextLabel.bbcode_enabled] set to [code]true[/code]. This is useful for handling user input to prevent BBCode injection. See also [method strip_bbcode].

The same below.

@markdibarry
Copy link
Contributor

I agree that there is good use for this method, and I don't want to hold it up too much, I just wanted to share my thoughts, since I can already see problems in the future, but only with name conflicts. Some background:

In RichTextLabel, when you use valid Godot BBCode tags, it'll be applied, otherwise, it displays it as normal text.
Ex:
This word is [color="green"]green[/color] and this does [foo="bar"]nothing[/foo].
would display as
This word is green and this does [foo="bar"]nothing[/foo].

In most of the dialog systems for Godot (including mine), the built in dialog code is between brackets ([mood="happy"], [GetName("John")], or [jumpTo="Section 3"]), but it also supports Godot's RichTextLabel implementation of BBCode, so the system needs to know the difference between valid Godot BBCode tags, valid dialog code tags, and normal text.

The only current avenue to get the text with valid Godot BBCode tags removed, is to assign it to the text property first and call get_parsed_text() on the node. This needs done every time any text changes:

# in RichTextLabel node
text = fullText
var parsedText = get_parsed_text()
# for dialog parsing
var dialogTags = get_dialog_tags(fullText)
text = reconstruct_text(fullText, parsedText, dialogTags)

This proposed method would remove valid Godot BBCode, valid dialog code, and any intentional non-code bracket surrounded text. So, of course, it can't be used for anything that needs awareness of Godot's implementation of BBCode, since it works on a very general level of anything that is between brackets. However, it'd be very helpful to someday have a static method that parses Godot's implementation of BBCode from a string without needing to update the RichTextLabel's text twice. The question still is, if we were to implement this, would both methods be called strip_bbcode() even though "BBCode" means two different things in these contexts?

@dalexeev
Copy link
Member

dalexeev commented Jan 4, 2024

The only current avenue to get the text with valid Godot BBCode tags removed, is to assign it to the text property first and call get_parsed_text() on the node.

Note that you are not required to use get_parsed_text(). You can implement a BBCode parser that takes into account both standard BBCode and custom pseudo-tags. For example, this is how Editor Help and the doc comment parser in GDScript are made. Also, to simplify things, you can use delimiters other than brackets.

String.strip_bbcode() is designed to deterministically process strings, regardless of the specific RichTextLabel instance, its settings, and pseudo-tags. For example, this method can be used with print_rich().

As for RichTextLabel and advanced scenarios like dialog systems, I'm not sure which method we might need in the future. There may be strip_bbcode() which you described. There may be some kind of parse_bbcode() (although the BBCode parser is not difficult to implement in GDScript). There are also proposals to add support for alternative markup such as Markdown.

The question still is, if we were to implement this, would both methods be called strip_bbcode() even though "BBCode" means two different things in these contexts?

Thanks for your concerns about potential naming conflicts! I don't have a clear opinion on this issue other than the suggestion above. But I'm not sure if clarifying tags vs markup really helps. To be completely precise, the method should be named strip_bbcode_tags_and_special_chars(), but this is too verbose.

Also, see String.is_valid_identifier() and TextServer.is_valid_identifier() methods. Just like here, the methods are named the same, and "identifier" denotes different things in different contexts. However, both methods cannot be used to check whether a string is a valid GDScript identifier, since they do not take into account the _ token, keywords, built-in types, etc. I think String vs RichTextLabel would be enough to understand the difference in purpose.

Finally, if we add the RichTextLabel method in the future, I think we could name it strip_valid_tags(), strip_valid_bbcode() or something like that to emphasize that the RichTextLabel method keeps invalid markup, unlike the String method.

@markdibarry
Copy link
Contributor

But I'm not sure if clarifying tags vs markup really helps

Just to reiterate, I have no need for the word "tag" to be used, it's just the closest general word I could think up that doesn't have a double meaning in the code base. I'm more on the "not bbcode" team than the "yes tag" team. 😂

To be completely precise, the method should be named strip_bbcode_tags_and_special_chars(), but this is too verbose.

I doubt this would be a widely used method, but I also totally agree about keeping verbosity to a minimum.

I think String vs RichTextLabel would be enough to understand the difference in purpose.

I also like to think my code is self-documenting, but I don't think most users know that RichTextLabel and String have two different implementations of the BBCode concept; The first pulling from an internal list of valid tags, and the second meaning anything between brackets. To be honest, it sounds obvious since I assume you and I have been familiar for so long, but I originally thought there was a standard, and it was a surprise to find out that it's still kinda Calvinball out there as far as BBCode goes. 😓

I think if it's really important that they're named the same, some detailed documentation on the method's limitations may help alleviate some potential confusion among the community. I think at least just a note that says "Does not consider any specific BBCode implementation when parsing." or something similar would help.

Also, to simplify things, you can use delimiters other than brackets.

I can't speak for the other dialog system maintainers, but I personally think it's definitely a double-edged sword! You're right that it does make it tricky to parse, but the benefits make it worth it. Using a unified syntax can be very helpful to users looking to adopt, without having them needing to memorize a ton out of the gate. Especially when those who are most likely to use the dialog system aren't programmers. Plus, the obvious, it wouldn't be a great solution to completely rewrite the system and make all users upgrade and relearn a new syntax.

You can implement a BBCode parser that takes into account both standard BBCode and custom pseudo-tags.

Okay, this is something new to me! You've got me very interested! How do you access the internal tag list RichTextLabel supports? I'd be totally down to spend some time making a custom parser, and I know a ton of people in the community who'd be interested in having an optimized solution. I had originally planned to just copy the internal tag list into my project, but I see that the list does get changed every so often for what is/isn't supported, so that was a no-go.

@dalexeev
Copy link
Member

dalexeev commented Jan 4, 2024

Okay, this is something new to me! You've got me very interested! How do you access the internal tag list RichTextLabel supports? I'd be totally down to spend some time making a custom parser, and I know a ton of people in the community who'd be interested in having an optimized solution. I had originally planned to just copy the internal tag list into my project, but I see that the list does get changed every so often for what is/isn't supported, so that was a no-go.

This is not relevant to this PR, but I made a gist (if you have any questions, please leave comments there, not in this PR). It is not possible to obtain the tag list via Godot API, see source code instead. However, you probably won't need to get the full list, you can only process custom pseudo-tags and individual tags, such as [url] (option 2 in the gist). If you don't need to validate user input, you probably don't need a parser, you can just use String.replace() or at most regular expressions.

@markdibarry
Copy link
Contributor

It is not possible to obtain the tag list via Godot API

Ah that's a shame. Your gist looks like what I came up with a few years ago, but this would require constant maintenance, so it wouldn't be of any use. Thanks anyway! I appreciate you trying!

Seeing as though I'm the minority on the naming ambiguity, as long as there's a small note in the method description about its limitations or intended use, I won't push it any further. It's enough to point users to properly read the docs if they're confused. 😅

@Jael-G
Copy link

Jael-G commented May 11, 2024

Hey, is this going to be added? Would be useful to have a neat built-in function.

@Calinou Calinou force-pushed the string-add-bbcode-methods branch from b216ff1 to 949949b Compare May 14, 2024 16:47
@Calinou
Copy link
Member Author

Calinou commented May 14, 2024

Hey, is this going to be added? Would be useful to have a neat built-in function.

4.3 is in feature freeze now, so this will probably not be merged in that version. That said, it should be straightforward to implement the functions from this PR in GDScript so you can use them in your project.

`String.strip_bbcode()` can be used to display text from a RichTextLabel
in places that don't support rich formatting.

`String.bbcode_escape()` can be used to prevent BBCode injection
in RichTextLabels that display user input.
@tumblewed
Copy link

See this comment from the original proposal if you want a work-around in the meantime.

@Mickeon
Copy link
Contributor

Mickeon commented Aug 28, 2024

With time thinking about it, I do think this method belongs inside RichTextLabel. Whether it should be static or directly tied to a RichTextLabel object (hence it'd be able to distinguish valid BBcode tags), I am not sure.

My reasoning is also... Well, the entirety of String's method list has turned into a movie to scroll through. It's a relatively niche method, I'm not sure if it deserves to be there.

To be honest, it's fine. I will still sleep at night if this PR is merged as is.

@tumblewed
Copy link

With time thinking about it, I do think this method belongs inside RichTextLabel. Whether it should be static or directly tied to a RichTextLabel object (hence it'd be able to distinguish valid BBcode tags), I am not sure.

Here are code examples of both implementations.

RichTextLabel:

# node reference
@onready var label = $RichTextLabel
var stripped_text = label.strip_bbcode("[center]Text Here")

# or, class reference
var stripped_text = RichTextLabel.strip_bbcode("[center]Text Here")

String:

var stripped_text = "[center]Text Here".strip_bbcode()

Personally I prefer how the code reads when it is under the String class. It also functions like other immutable String methods, so the behavior is clearer from the outset. Implementing it under RichTextLabel would imply (to me at least) that the method would render or otherwise modify the text in some way, when all it does is return a modified String.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a function to erase BBCode from a String, without depending on RichTextLabel
7 participants