GDScript: Reintroduce binary tokenization on export #87634

vnen · 2024-01-26T22:44:13Z

This adds back a function available in 3.x: exporting the GDScript files in a binary form by converting the tokens recognized by the tokenizer into a data format.

It is enabled by default on export but can be manually disabled. The format helps with loading times since, the tokens are easily reconstructed, and with hiding the source code, since recovering it would require a specialized tool. Code comments are not stored in this format.

The --test command can also include a --use-binary-tokens flag which will run the GDScript tests with the binary format instead of the regular source code by converting them in-memory before the test runs.

Besides the regular option to export GDScript as binary tokens, this also includes a compression option on top of it. The binary format needs to encode some information which generally makes it bigger than the source text. This option reduces that difference by using Zstandard compression on the buffer.

Production edit: See Use an intermediate representation format for GDScript godot-proposals#8605.
Production edit: removed line breaks
Production edit: closes godotengine/godot-roadmap#83

MichaelWengren · 2024-01-27T02:17:26Z

But what happened with intermediate representation proposal, is there is any work done there? Why bring back this useless tokenization if there is zero protection to the source code because gdsdecomp can recover everything in a second?

arkology · 2024-01-27T06:25:51Z

Besides the regular option to export GDScript as binary tokens, this
also includes a compression option on top of it. The binary format
needs to encode some information which generally makes it bigger than
the source text. This option reduces that difference by using Zstandard
compression on the buffer.

Would be great to see exported projects size comparison

modules/gdscript/gdscript_tokenizer_buffer.cpp

modules/gdscript/gdscript_parser.cpp

vnen · 2024-01-27T15:11:28Z

To be clear, this is unrelated to the IR proposal. The proposal itself will take some time to solidify because it needs approval from the maintainers and not everyone think it's feasible.

This PR is a palliative to avoid having plain source code export while we wait for that or figure out something else, because a lot of people has been complaining about this issue.

MichaelWengren · 2024-01-27T16:45:37Z

This PR is a palliative to avoid having plain source code export while we wait for that or figure out something else, because a lot of people has been complaining about this issue.

But returning tokenization does not solve this problem because scripts are still easily opened with gdsdecomp tool.

AThousandShips · 2024-01-27T16:48:21Z

But returning tokenization does not solve this problem because scripts are still easily opened with gdsdecomp tool.

That won't change in any real way with any solution, any compiled/parsed/processed format will be vulnerable to that, it's just a matter of degrees, the biggest improvement will be this over plain source to the vast majority of "attacks"

MichaelWengren · 2024-01-27T17:37:18Z

any compiled/parsed/processed format

You can't run decompiled code because it's gibberish but in the case of GDScript tokenization you absolutely can revert it back to it's original form and run it without any issues, this is why this tokenization is useless.
I had high hopes for intermediate representation format, but after almost two months it turned out that no work had even begun

AThousandShips · 2024-01-27T17:46:35Z

I'd suggest reading up on this and getting a realistic perspective on this, if you genuinely believe that decompilation isn't a realistic way of reverse engineering code then you are not going to have realistic expectations of the outcomes of these things

but after almost two months it turned out that no work had even begun

Please be realistic about the time it takes to undertake such a major project, and the explanations for the expectations on it above by vnen... Just because it hasn't happened yet doesn't mean it won't, especially with just a few months

If you'd rather wait longer with no changes and no improvements for performance then that's your opinion, but I don't think it's reasonable to wait for a feature that doesn't clash with this just to get a bit better obfuscation (not that obfuscation is the main goal here anyway)

(Also if you're going to block me you will not react to my comments please, that's just petty, I don't want to have to block you back...)

fire · 2024-01-29T05:20:38Z

Reducing loading times and applying zstd compression is good reason to reintroduce binary tokenization on its own merits.

I don't know if I'll have much success reviewing the code.

reduz · 2024-02-06T11:16:39Z

editor/export/editor_export_preset.h

@@ -84,6 +90,7 @@ class EditorExportPreset : public RefCounted {
 	bool enc_directory = false;

 	String script_key;
+	int script_mode = MODE_SCRIPT_BINARY_TOKENS;


I think compressed is probably better default.

reduz

Looks good on my end.

reduz · 2024-02-06T11:19:21Z

@MichaelWengren you can decompile and modify anything, be it .net or anything else.
Eventually instead of an IR, what will most likely happen is GDScript files optionally compiled to binary (C) per platform. This will make it pretty impossible to see or modify the code already.

vnen · 2024-02-08T12:12:18Z

Updated to set the compressed mode as the default.

akien-mga · 2024-02-08T12:37:12Z

editor/export/project_export.cpp

+	// Script export parameters.
+
+	VBoxContainer *script_vb = memnew(VBoxContainer);
+	script_vb->set_name(TTR("Scripts"));
+
+	script_mode = memnew(OptionButton);
+	script_vb->add_margin_child(TTR("GDScript Export Mode:"), script_mode);
+	script_mode->add_item(TTR("Text (easier debugging)"), (int)EditorExportPreset::MODE_SCRIPT_TEXT);
+	script_mode->add_item(TTR("Binary tokens (faster loading)"), (int)EditorExportPreset::MODE_SCRIPT_BINARY_TOKENS);
+	script_mode->add_item(TTR("Compressed binary tokens (smaller files)"), (int)EditorExportPreset::MODE_SCRIPT_BINARY_TOKENS_COMPRESSED);
+	script_mode->connect("item_selected", callable_mp(this, &ProjectExportDialog::_script_export_mode_changed));
+
+	sections->add_child(script_vb);


Technically this is GDScript specific code spilling into the main editor, which wouldn't be relevant if the GDScript module is disabled. So it's slightly breaking encapsulation.

But it's not a first, and I'm not sure it's worth adding a complex abstraction for this so it can be handled in the module for each script language individually. Your call if you think it's worth looking into.

For now I think this doesn't need to prevent merging.

I though about this but I'm not sure of what would be the best solution. This is the same way it is handled in 3.x.

AThousandShips

Some accidental reverts, see above

vnen · 2024-02-08T14:14:58Z

Some accidental reverts, see above

I thought git push --force-with-lease would prevent this. I'll have to be more careful from now on.

This adds back a function available in 3.x: exporting the GDScript files in a binary form by converting the tokens recognized by the tokenizer into a data format. It is enabled by default on export but can be manually disabled. The format helps with loading times since, the tokens are easily reconstructed, and with hiding the source code, since recovering it would require a specialized tool. Code comments are not stored in this format. The `--test` command can also include a `--use-binary-tokens` flag which will run the GDScript tests with the binary format instead of the regular source code by converting them in-memory before the test runs.

Besides the regular option to export GDScript as binary tokens, this also includes a compression option on top of it. The binary format needs to encode some information which generally makes it bigger than the source text. This option reduces that difference by using Zstandard compression on the buffer.

akien-mga · 2024-02-09T11:37:37Z

Thanks!

ogapo · 2024-06-10T13:10:28Z

Just wanted to say I really appreciate this commit. I don't see obscuring the code as a security feature, but making the files smaller and faster is great. Most importantly, having comments stripped is very much a value add for a lot of developers. It's easy to leak things accidentally there since many devs used to compiled languages would assume comments are not user-visible.

azur-wolve · 2024-07-08T08:48:55Z

about source safety (reverse-engineering protection, obfuscation, etc) there's any thread to follow?

dalexeev · 2024-07-08T08:56:48Z

@azur-wolf See:

Obfuscate GDScript in production export godot-proposals#4220

There are also proposals that can make decompilation more difficult, although they are not directly intended for this:

jason-cahill · 2024-08-22T01:09:29Z

Hi @vnen : thank you for all the hard work you do. Can I ask if this work is replacing variable names? While classes and function names would also be great to obfuscate, the number one thing to minify would be variable names. Many software organizations don’t write much in the way of comments, and symbols go a long way towards being the documentation. Having the tokenizer scramble variable names would really take a huge step towards making the intent of a function opaque.

Calinou · 2024-08-22T01:17:32Z

Can I ask if this work is replacing variable names?

No, it keeps identifiers as-is. Local variables could technically be renamed, but this is nontrivial to implement. Member variables can't be renamed without breaking the API that other classes may attempt to use, especially since there is no way to mark member variables as private yet.

donte5405 · 2024-08-23T09:01:44Z

Can I ask if this work is replacing variable names?

Without some sort of string name server that's able to recognise internal/GDExtension names for the entire project, not just inside single GDScript, it's impossible to alternate source code safely and makes external API access attempts impossible. There are also dire limitations for this approach such as inability to recogise fields inside string syntaxes.

Take a look at this snippet.

extends ColorRect
class_name MyColorRect

var my_own_custom_field := 0.0

func _ready() -> void:
    var my_tween := create_tween()
    my_tween.tween_property(self, "color", Color.BLACK, 1.0)
    my_tween.tween_property(self, "my_own_custom_field", 1.0, 1.0)

If source code is alternated somehow and get decompiled, it may become something like this.

extends ColorRect
class_name _z0

var _z1 := 0.0

func _ready() -> void:
    var _0 := create_tween()
    _0.tween_property(self, "color", Color.BLACK, 1.0)
    _0.tween_property(self, "my_own_custom_field", 1.0, 1.0)

You'll notice immediately that my_own_custom_field doesn't exist anymore because it gets altered to _z0 and will make access attempt from the Tween useless.

While it's not difficult to overcome the issue by just deep-scanning strings and replace them accordingly (and it will become "_z1" in the end), but take a look at this snippet.

extends Node
class_name WorldController

@onready var world := get_tree().root

func _ready() -> void:
    print("Hello, world!")

It may become:

extends Node
class_name _z0

@onready var _z1 := get_tree().root

func _ready() -> void:
    print("Hello, _z1!")

Instead of it printing Hello, world! normally, it will become Hello, _z1! instead. Simply put, it's close to impossible to do some sort of source alternation without butchering other strings that you don't want it to do. Technically, there are ways to overcome the issue, such as introducing new syntaxes being used exclusively for internal string names similar to C#'s nameof keyword. This way, the final source export will be able to tell which labels can be altered and which one can't. Though, I don't ever think it's going to be a thing since we need more than one proposal to cover up the issue, plus various debates from its contributors that if this type of issue is really worth addressing.

There are other projects that cover this type of source mangling approach such as GDMaim, but you still need to be careful while utilising the tool.

vnen added feature proposal topic:gdscript topic:export labels Jan 26, 2024

vnen added this to the 4.3 milestone Jan 26, 2024

vnen requested a review from a team as a code owner January 26, 2024 22:44

vnen force-pushed the gdscript-binary-tokens branch from d782894 to c5a5728 Compare January 26, 2024 23:15

AThousandShips reviewed Jan 27, 2024

View reviewed changes

modules/gdscript/gdscript_tokenizer_buffer.cpp Outdated Show resolved Hide resolved

modules/gdscript/gdscript_parser.cpp Show resolved Hide resolved

modules/gdscript/gdscript_parser.cpp Show resolved Hide resolved

modules/gdscript/gdscript_parser.cpp Show resolved Hide resolved

vnen force-pushed the gdscript-binary-tokens branch from c5a5728 to 88698e1 Compare January 27, 2024 15:29

reduz reviewed Feb 6, 2024

View reviewed changes

reduz approved these changes Feb 6, 2024

View reviewed changes

vnen force-pushed the gdscript-binary-tokens branch from 88698e1 to ac3fe4d Compare February 8, 2024 12:12

akien-mga reviewed Feb 8, 2024

View reviewed changes

akien-mga approved these changes Feb 8, 2024

View reviewed changes

AThousandShips requested changes Feb 8, 2024

View reviewed changes

vnen added 2 commits February 8, 2024 11:20

vnen force-pushed the gdscript-binary-tokens branch from ac3fe4d to 72e5f8c Compare February 8, 2024 14:20

AThousandShips approved these changes Feb 8, 2024

View reviewed changes

akien-mga merged commit 77af6ca into godotengine:master Feb 9, 2024
16 checks passed

vnen deleted the gdscript-binary-tokens branch February 9, 2024 12:18

dalexeev mentioned this pull request Feb 15, 2024

Exported GDScript: Parse Error: Class "XXXXXX" hides a global script class #88353

Closed

vpellen mentioned this pull request Mar 1, 2024

Obfuscate GDScript in production export godotengine/godot-proposals#4220

Open

nikitalita mentioned this pull request Apr 17, 2024

Add support for 4.x compiled scripts GDRETools/gdsdecomp#172

Closed

dalexeev mentioned this pull request Jul 9, 2024

GDScript: Fix "Mismatched external parser" for autoloads #94131

Merged

HolonProduction mentioned this pull request Jul 13, 2024

Single underscore node name is ignored when using the $ get_node() shorthand #94310

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GDScript: Reintroduce binary tokenization on export #87634

GDScript: Reintroduce binary tokenization on export #87634

vnen commented Jan 26, 2024 •

edited by adamscott

Loading

MichaelWengren commented Jan 27, 2024

arkology commented Jan 27, 2024

vnen commented Jan 27, 2024

MichaelWengren commented Jan 27, 2024

AThousandShips commented Jan 27, 2024

MichaelWengren commented Jan 27, 2024

AThousandShips commented Jan 27, 2024 •

edited

Loading

fire commented Jan 29, 2024 •

edited

Loading

reduz Feb 6, 2024

reduz left a comment

reduz commented Feb 6, 2024 •

edited

Loading

vnen commented Feb 8, 2024

akien-mga Feb 8, 2024 •

edited

Loading

vnen Feb 8, 2024

AThousandShips left a comment •

edited

Loading

vnen commented Feb 8, 2024

akien-mga commented Feb 9, 2024

ogapo commented Jun 10, 2024

azur-wolve commented Jul 8, 2024

dalexeev commented Jul 8, 2024 •

edited

Loading

jason-cahill commented Aug 22, 2024

Calinou commented Aug 22, 2024

donte5405 commented Aug 23, 2024 •

edited

Loading

GDScript: Reintroduce binary tokenization on export #87634

GDScript: Reintroduce binary tokenization on export #87634

Conversation

vnen commented Jan 26, 2024 • edited by adamscott Loading

MichaelWengren commented Jan 27, 2024

arkology commented Jan 27, 2024

vnen commented Jan 27, 2024

MichaelWengren commented Jan 27, 2024

AThousandShips commented Jan 27, 2024

MichaelWengren commented Jan 27, 2024

AThousandShips commented Jan 27, 2024 • edited Loading

fire commented Jan 29, 2024 • edited Loading

reduz Feb 6, 2024

Choose a reason for hiding this comment

reduz left a comment

Choose a reason for hiding this comment

reduz commented Feb 6, 2024 • edited Loading

vnen commented Feb 8, 2024

akien-mga Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

vnen Feb 8, 2024

Choose a reason for hiding this comment

AThousandShips left a comment • edited Loading

Choose a reason for hiding this comment

vnen commented Feb 8, 2024

akien-mga commented Feb 9, 2024

ogapo commented Jun 10, 2024

azur-wolve commented Jul 8, 2024

dalexeev commented Jul 8, 2024 • edited Loading

jason-cahill commented Aug 22, 2024

Calinou commented Aug 22, 2024

donte5405 commented Aug 23, 2024 • edited Loading

vnen commented Jan 26, 2024 •

edited by adamscott

Loading

AThousandShips commented Jan 27, 2024 •

edited

Loading

fire commented Jan 29, 2024 •

edited

Loading

reduz commented Feb 6, 2024 •

edited

Loading

akien-mga Feb 8, 2024 •

edited

Loading

AThousandShips left a comment •

edited

Loading

dalexeev commented Jul 8, 2024 •

edited

Loading

donte5405 commented Aug 23, 2024 •

edited

Loading