Save PackedByteArrays as base64 encoded #89186

Merged

groud
Member

@groud groud commented Mar 5, 2024

That saves a lot of space in text resource files for properties that will likely never need to be modified manually.
This of course keeps compatibility.

This is quite useful for TileMaps (see #89179).

Edit: I made a comparison with my TileMapLayer PR, and this is what I get:

Peek.05-03-2024.18-09.mp4

The character count is roughly halved. But most of the TileMapLayer data consists of single-digit values, which take only 1 character each. With byte values of 10 or more (or worse, 100 or more), the gain would be even larger.
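To make the size argument concrete, here is a minimal Python sketch (not Godot code) that compares the length of a comma-separated decimal list with the base64 text for the same bytes. The `", "` separator and the sample data are assumptions chosen for illustration.

```python
import base64


def as_decimal_list(data: bytes) -> str:
    """Comma-separated decimal values (assumed ", " separator)."""
    return ", ".join(str(b) for b in data)


def as_base64(data: bytes) -> str:
    """Base64 text for the same bytes."""
    return base64.b64encode(data).decode("ascii")


# Hypothetical sample data: 300 bytes each of one-, two- and three-digit values.
samples = {
    "one digit": bytes([7] * 300),
    "two digits": bytes([42] * 300),
    "three digits": bytes([200] * 300),
}

for label, data in samples.items():
    decimal_len = len(as_decimal_list(data))
    base64_len = len(as_base64(data))
    print(f"{label}: decimal={decimal_len} base64={base64_len} "
          f"ratio={decimal_len / base64_len:.2f}")
```

The base64 length depends only on the byte count, while the decimal form grows with the number of digits per value, which is why wider values increase the gain.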

@groud groud requested a review from a team as a code owner March 5, 2024 16:56
@AThousandShips AThousandShips added this to the 4.x milestone Mar 5, 2024
@AThousandShips
Member

This breaks compatibility with the format, right? Data generated with this can't be loaded in older versions. Should we bump the text file version?

@groud
Member Author

groud commented Mar 5, 2024

> This breaks compatibility with the format right? Data generated with this can't be loaded in older versions, should we bump the text file version?

Ah yes, indeed. Older versions cannot read it. If there's a version to increase somewhere, I can do it, I guess.

@AThousandShips
Member

It's right here:

#define FORMAT_VERSION 3

@groud groud force-pushed the save_byte_arrays_as_base64_encoded branch from d3b15e6 to 162f546 Compare March 5, 2024 17:21
Member

@Calinou Calinou left a comment

Tested locally, it works as expected. This has been a longstanding issue, great work 🙂

Testing project: test_packed_byte_array_scenes.zip [1]

.tscn file with 10,000,000 PackedByteArray elements (5 million are zeros, 5 million are random with a fixed seed):

Benchmark

OS: Fedora 39 x86_64
CPU: Intel Core i9-13900K (performance governor)
SSD: Solidigm P44 Pro 2 TB

Saving time

Pressing Ctrl + S with only one scene open in the editor and measuring the time the progress dialog is visible with a stopwatch.

| Binary (for reference) | Before | After (this PR) |
| --- | --- | --- |
| 0.4 seconds | 3.2 seconds | 1.1 seconds |

Loading time

Running the project and measuring the time spent on the splash screen with a stopwatch.

| Binary (for reference) | Before | After (this PR) |
| --- | --- | --- |
| 0.5 seconds | 2.5 seconds | 1.4 seconds |

File size (on filesystem)

| Binary (for reference) | Before | After (this PR) |
| --- | --- | --- |
| 4.8 MB [2] | 36.4 MB | 12.7 MB |

File size (within a ZIP archive)

| Binary (for reference) | Before | After (this PR) |
| --- | --- | --- |
| 4.8 MB [2] | 6.4 MB | 4.8 MB |

Footnotes

  1. The .tscn file included was saved prior to this PR.

  2. Binary resources are compressed by default, so compressing them further will not help.
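The file-size comparison can be approximated at a smaller scale outside the engine. The Python sketch below builds a half-zero, half-random byte buffer like the one described above, then measures the decimal and base64 text forms, both raw and deflated. The element count, the seed, the `", "` separator and the use of `zlib` as a stand-in for ZIP compression are all assumptions; it is not the project used for the benchmark.

```python
import base64
import random
import zlib


def build_test_bytes(count: int, seed: int = 0) -> bytes:
    """First half zeros, second half pseudo-random bytes from a fixed seed."""
    rng = random.Random(seed)
    half = count // 2
    return bytes(half) + bytes(rng.randrange(256) for _ in range(count - half))


data = build_test_bytes(100_000)
decimal_text = ", ".join(map(str, data))
base64_text = base64.b64encode(data).decode("ascii")

sizes = {
    "raw bytes": len(data),
    "decimal text": len(decimal_text),
    "base64 text": len(base64_text),
    "decimal text, deflated": len(zlib.compress(decimal_text.encode("ascii"))),
    "base64 text, deflated": len(zlib.compress(base64_text.encode("ascii"))),
}

for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```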

@Mickeon
Contributor

Mickeon commented Mar 5, 2024

While the PR is fine as is, if the scene format needs to be upgraded, it would be nice to do more now rather than later.
As I already said, I think the other packed arrays would also benefit from this, especially since some are prone to be bundled within a Resource:

  • Gradient;
  • TileData (collision polygon);
  • Line2D;
  • Polygon2D;
  • HeightMapShape3D;
  • Curve3D;
  • ArrayOccluder3D (indices and vertices);
  • Convex/ConcavePolygonShape3D;
  • NavigationObstacle3D;
  • etc.

while (true) {
    if (token.type != TK_NUMBER) {
        bool valid = false;
        if (token.type == TK_IDENTIFIER) {
Member

Wondering if we could do this in a way that avoids duplicated code.

Member Author

I am not sure we can easily do it. _parse_construct handles both the parentheses and the elements in the comma-separated values.
We could extract the comma-separated values into a dedicated function and maybe check the parentheses in another one, but the problem is knowing whether we have a string or a first number, in order to decide what to do afterwards. This is a bit complicated because get_token consumes the token (so checking for a number means we have to consume the number), which makes it hard to check things beforehand.

If we want a simpler solution, I can at least try to add a _parse_number_or_identifier() or something similar, so that the number checking is not duplicated.
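The lookahead problem described here can be illustrated with a small Python sketch. It is not Godot's VariantParser: the `Tokens` class, its `peek` method and the function names are hypothetical, and whether a pushback buffer would fit the real parser is an assumption. It only shows how peeking at the first token lets one shared number-list routine serve both input forms.

```python
import base64
import re

# Matches a number, a double-quoted string, or one punctuation character.
TOKEN_RE = re.compile(r'\s*(?:(\d+)|"([^"]*)"|([(),]))')


class Tokens:
    """Hypothetical token stream with a one-token pushback buffer."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.buffered = None

    def get(self):
        """Consume and return the next token."""
        if self.buffered is not None:
            tok, self.buffered = self.buffered, None
            return tok
        m = TOKEN_RE.match(self.text, self.pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {self.pos}")
        self.pos = m.end()
        if m.group(1) is not None:
            return ("number", int(m.group(1)))
        if m.group(2) is not None:
            return ("string", m.group(2))
        return ("punct", m.group(3))

    def peek(self):
        """Return the next token without consuming it."""
        if self.buffered is None:
            self.buffered = self.get()
        return self.buffered


def parse_number_list(tokens):
    """Shared routine: numbers separated by ',' and closed by ')'."""
    values = []
    while True:
        kind, value = tokens.get()
        if kind != "number":
            raise ValueError("expected a number")
        values.append(value)
        tok = tokens.get()
        if tok == ("punct", ")"):
            return values
        if tok != ("punct", ","):
            raise ValueError("expected ',' or ')'")


def parse_byte_array_args(text):
    """Accept either ("<base64>") or (n, n, ...) and return bytes."""
    tokens = Tokens(text)
    if tokens.get() != ("punct", "("):
        raise ValueError("expected '('")
    kind, value = tokens.peek()
    if kind == "string":
        tokens.get()
        if tokens.get() != ("punct", ")"):
            raise ValueError("expected ')'")
        return base64.b64decode(value)
    if (kind, value) == ("punct", ")"):
        tokens.get()
        return b""
    return bytes(parse_number_list(tokens))


print(parse_byte_array_args("(1, 2, 3)"))
print(parse_byte_array_args('("AQID")'))
```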

Member

At worst you can add a FIXME/TODO comment.

@nikitalita
Contributor

Both the binary format version and the text format version should be incremented for this.

@akien-mga akien-mga modified the milestones: 4.x, 4.3 Mar 6, 2024
@AThousandShips
Member

Why the binary format? This doesn't affect it

@groud groud force-pushed the save_byte_arrays_as_base64_encoded branch from 162f546 to de50735 Compare March 6, 2024 09:17
@MewPurPur
Contributor

MewPurPur commented Mar 6, 2024

Seems like a huge performance buff. Are there any drawbacks or other changes on the user side?

@groud
Member Author

groud commented Mar 6, 2024

> While the PR is fine as is, if the scene format needs to be upgraded, it would be nice to do more now than later.
> As I already said I think the other Packed Arrays would also benefit from this, especially since some are prone to be bundled within a Resource:

I don't think I want to change the format of other arrays for now, for several reasons:

  • The space saved might not be significant, especially for int32/int64.
  • Some types, like the mentioned Line2D points, gain a lot from being saved as editable sets of points. This allows users to fix values in a text editor if needed, which is useful when the Line2D you are dealing with is short. (By contrast, manual editing makes little sense for byte arrays in general, which are meant to store raw binary values.)
  • The first two points make me think that saving as base64 should be doable, but must be optional (with a property hint). This is significantly more work.
  • There is endianness to deal with, which also makes things more complicated with such complex types.
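The endianness point can be shown with a short Python sketch. It is an illustration only: the sample values are arbitrary, and nothing here reflects how Godot lays out packed arrays. The same int32 values produce different base64 text depending on byte order, and decoding with the wrong order silently yields wrong numbers.

```python
import base64
import struct

values = [1, 256, 65536]

# Pack the same three int32 values in both byte orders.
little = struct.pack("<3i", *values)
big = struct.pack(">3i", *values)

little_b64 = base64.b64encode(little).decode("ascii")
big_b64 = base64.b64encode(big).decode("ascii")
print("little-endian:", little_b64)
print("big-endian:   ", big_b64)

# Reading little-endian data as big-endian raises no error but gives wrong values.
misread = list(struct.unpack(">3i", base64.b64decode(little_b64)))
print("misread:", misread)
```

A decimal list has no such ambiguity, which is one reason a text form is simpler for multi-byte element types.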

@groud
Member Author

groud commented Mar 6, 2024

> Seems like a huge performance buff. Are there any drawbacks or other changes from user side?

The main drawback is that those arrays are no longer editable in the text file with a text editor (which is kind of the whole point of text files as resources). Also, it is a bit less VCS-friendly.

@Mickeon
Contributor

Mickeon commented Mar 8, 2024

My suggestion from RocketChat still stands that both formats should be supported, with Base64 used when the Packed Array is deemed too large (arbitrary, I know).

@akien-mga
Member

> My suggestion from RocketChat still stands that both formats should be supported, with Base64 used when the Packed Array is deemed too large (arbitrary, I know).

What would be the benefit? And what would be the threshold for "too large"?

@Mickeon
Contributor

Mickeon commented Mar 9, 2024

Well, that would depend on how precise you want to go about it.

The easy way out is to assume, say, that any packed array with more than 32 elements is incomprehensible even in text form and should be encoded anyway (the exact number depends on the array type).
The hard way is to compare the compression ratio and see if the difference is very notable (say, 75% or higher).
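Both heuristics can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the `", "` separator and the quoted base64 form are assumptions, and reading "compress ratio" as the fraction of characters saved is one possible interpretation.

```python
import base64

ELEMENT_THRESHOLD = 32  # the "easy way" value floated above


def decimal_form(data: bytes) -> str:
    """Classic comma-separated form (assumed ", " separator)."""
    return ", ".join(str(b) for b in data)


def base64_form(data: bytes) -> str:
    """Quoted base64 form (the quotes are an assumption)."""
    return '"' + base64.b64encode(data).decode("ascii") + '"'


def encode_by_count(data: bytes, threshold: int = ELEMENT_THRESHOLD) -> str:
    """Easy way: switch to base64 above a fixed element count."""
    return base64_form(data) if len(data) > threshold else decimal_form(data)


def encode_by_saving(data: bytes, min_saving: float) -> str:
    """Hard way: switch only when base64 saves at least min_saving of the text."""
    plain = decimal_form(data)
    packed = base64_form(data)
    if not plain:
        return plain
    saving = 1 - len(packed) / len(plain)
    return packed if saving >= min_saving else plain


print(encode_by_count(bytes([1, 2, 3])))
print(encode_by_count(bytes(40)))
print(encode_by_saving(bytes([255] * 300), min_saving=0.5)[:20])
```

Under these assumptions a byte costs at most 5 characters in decimal form and about 1.33 in base64, so the saving for byte arrays stays below roughly 73%. A 75% bar would therefore never trigger in this sketch, which suggests the threshold would need tuning per array type.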

The benefit is that users who actually edit PackedByteArray & Co. inside the .tres file are far less likely to lose this functionality, because the "classic" encoding is retained when there are just a few elements.

.tscn files are meant to be as readable as possible, and that is mostly true, but the massive noise that saved packed arrays cause can get rather unbearable and, as we already know, bloats the file size by a ridiculous amount (especially with floats). At the same time, I wouldn't want to outright prevent them from being seen in their original form if it can be read.

@groud
Member Author

groud commented Mar 11, 2024

> My suggestion from RocketChat still stands that both formats should be supported, with Base64 used when the Packed Array is deemed too large (arbitrary, I know).

Once again, the main problem with that approach is that base64 is not necessarily efficient at storing those types.
For example, with an int32 array, you need around 5.3 characters per encoded integer when using base64.

So if your packed int array mostly consists of numbers below, say, 9999 (or 999 if the separator includes a space), then once the comma is counted, base64 encoding will produce a bigger string than the original. I believe that most arrays saved using int, Vector2, etc. will not often need very high values, so the gain for those will not be significant (if there is any gain at all). It might also be harmful if your array contains a lot of zeros, for example.
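The 5.3 figure follows from the encoding itself: an int32 is 4 bytes, and base64 emits 4 characters per 3 bytes, so each integer costs 4 × 4/3 ≈ 5.33 characters. A small Python check of that arithmetic and of the break-even point (helper names are hypothetical, and the separators are the two cases mentioned above):

```python
import base64
import struct


def base64_chars_per_int32(count: int) -> float:
    """Average base64 characters per element for an int32 array."""
    raw = struct.pack(f"<{count}i", *([0] * count))
    return len(base64.b64encode(raw)) / count


def text_chars_per_value(value: int, separator: str = ", ") -> int:
    """Characters one value costs in a decimal list, separator included."""
    return len(str(value)) + len(separator)


print(f"base64: {base64_chars_per_int32(3000):.2f} chars per int32")
print("999 with ', ':", text_chars_per_value(999))
print("9999 with ',':", text_chars_per_value(9999, ","))
print("10000 with ',':", text_chars_per_value(10000, ","))
```

Values up to 999 (with `", "`) or 9999 (with `","`) cost 5 characters, which is below the base64 cost, so the decimal form stays shorter for them.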

On the other hand, packed bytes are mostly used to store raw binary data, which will most likely use the full set of possible values. That's why there's little risk in using base64 here, as we can assume it will be efficient most of the time.

To me, there's no reason to base64-encode other arrays for now. It's complicated and is not guaranteed to help anyway. I don't mind if someone wants to try it later on, wants to make it optional, or maybe finds ways to determine whether an array should be base64-encoded or not, but this is out of the scope of this PR. I just wanted to solve this for TileMaps for now, which this PR does fine.

@akien-mga
Member

@groud I believe what Mickeon suggested was not to encode other types of arrays as base64, but to have a threshold for which PackedByteArrays should be base64 encoded. E.g. only when the PackedByteArray has more than 10 elements or whatever would be a good threshold for a tradeoff between readable and human-editable encoding, and efficiency.

@groud
Member Author

groud commented Mar 11, 2024

> @groud I believe what Mickeon suggested was not to encode other types of arrays as base64, but to have a threshold for which PackedByteArrays should be base64 encoded. E.g. only when the PackedByteArray has more than 10 elements or whatever would be a good threshold for a tradeoff between readable and human-editable encoding, and efficiency.

Well, I don't mind doing so only for PackedByteArrays. Not sure if it is worth the added code, but it's not too much work. As you want.

@akien-mga
Member

Personally I'm not sure it's worth adding some complexity. But it could make sense to have a look at some examples of small PackedByteArrays to see how they end up when base64-encoded, so we can see if having a threshold is indeed worth it.

If the base64 representation of small arrays is bigger than the current representation, it can indeed make sense to avoid. And I guess for small, hand-written arrays (<10 or 20 elements), it could make sense to keep things VCS friendly. But for anything that's just a raw buffer dump, base64 should be the way to go.

But again, let's have a look at some concrete examples if someone wants to put them together, so we can decide.
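A few concrete cases can be generated with a short Python script. This is a sketch, not engine output: the sample arrays are hypothetical, and it assumes the decimal form uses a `", "` separator and the base64 form is wrapped in double quotes.

```python
import base64


def decimal_form(data: bytes) -> str:
    """Comma-separated decimal values (assumed ", " separator)."""
    return ", ".join(str(b) for b in data)


def base64_form(data: bytes) -> str:
    """Base64 text wrapped in double quotes (assumed)."""
    return '"' + base64.b64encode(data).decode("ascii") + '"'


# Hypothetical small arrays of the kind someone might write by hand.
samples = {
    "one small value": bytes([5]),
    "three small values": bytes([1, 2, 3]),
    "ten zeros": bytes(10),
    "ten large values": bytes([255] * 10),
}

for label, data in samples.items():
    plain, packed = decimal_form(data), base64_form(data)
    print(f"{label}: {len(plain)} chars as '{plain}', {len(packed)} chars as {packed}")
```

Under these assumptions, base64 is only longer for the very smallest arrays, and by a handful of characters.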

@AThousandShips
Member

It might even be worth having a hint for this, though I'm unsure if the complexity is desired. It would allow us to mark properties that aren't designed to be human-readable so that they are compressed, and leave the others as they are. This could be applied to all the packed arrays.

@KoBeWi
Member

KoBeWi commented Mar 27, 2024

> If the base64 representation of small arrays is bigger than the current representation, it can indeed make sense to avoid.

I'd say not much. If you have one or two small arrays, it doesn't matter if they get a few characters longer. It would only make a real difference if you have lots of small byte arrays, but that sounds like an unlikely use case, idk.

It would be great if this gets merged for the same release as #89179, because upgrading the TileMaps and then upgrading the arrays would create lots of VCS noise; better to do it in one go.

@Mickeon
Contributor

Mickeon commented Mar 28, 2024

I concur. My opinion should not be a blocker for this. It's better than not doing compression at all.

@KoBeWi
Member

KoBeWi commented Mar 28, 2024

btw the new TileMapLayer format saved in Base64 is shorter than the previous int-based format. And the difference is in thousands of characters apparently.

EDIT:
I also checked and Base64 decode ignores whitespace in the String. We could think about breaking the string into new lines e.g. every 1000 characters. Currently any modification to tile data results in a huge diff, because the whole array is stored in a single line. It can be done later though.
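The line-breaking idea can be tried outside the engine with a small Python sketch. It relies on Python's `base64.b64decode`, which by default discards characters outside the base64 alphabet, newlines included. That mirrors the behaviour described above but is Python's decoder, not Godot's, and the 1000-character width is just the value mentioned in this comment.

```python
import base64

LINE_WIDTH = 1000  # characters per line, as floated above


def wrap_base64(data: bytes, width: int = LINE_WIDTH) -> str:
    """Encode to base64 and insert a newline every `width` characters."""
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


payload = bytes(range(256)) * 20  # 5120 bytes of sample data
wrapped = wrap_base64(payload)

# With the default validate=False, the newlines are skipped while decoding.
decoded = base64.b64decode(wrapped)
print(len(wrapped.split("\n")), "lines, round trip ok:", decoded == payload)
```

With fixed-width lines, a change to a few tiles would touch only the lines from the edit onward rather than one enormous line, although an insertion that shifts the data would still alter every following line.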

@akien-mga akien-mga requested a review from reduz April 10, 2024 11:29
Member

@reduz reduz left a comment

Seems OK for now. From our discussion with @Faless, all this code will at some point need to be isolated to make it auditable for security issues (given that it can be used for network I/O or save files), meaning that in that case it will have to change anyway.

@akien-mga akien-mga merged commit 8764769 into godotengine:master Apr 10, 2024
16 checks passed
@akien-mga
Member

Thanks!

@KoBeWi
Member

KoBeWi commented Apr 10, 2024

It's too late, but I wonder if the format change was necessary. It makes all scenes incompatible with older versions, while the change itself only affects scenes with a PackedByteArray (which I think are super rare outside TileMaps).

While it doesn't really affect regular users, it's going to be annoying to test projects in multiple engine versions. Maybe we could make Godot ignore scene format in dev builds, idk 🤔
Well actually it also affects resources, which means that jumping between versions is going to destroy your editor settings.

@nikitalita
Contributor

Welcome to my world. Dealing with multiple format versions is a pain in the neck and I end up having to create parsers for each version. If the text resource loaders/parsers were at least version-aware and could deal with parsing older versions of the text/binary formats, this wouldn't be a problem.

@Calinou
Member

Calinou commented Apr 16, 2024

> Well actually it also affects resources, which means that jumping between versions is going to destroy your editor settings.

On the Godot contributors chat, we discussed a workaround in which the scene/resource format would be upgraded only if at least one PackedByteArray was serialized. EditorSettings generally doesn't contain any PackedByteArray, so this would allow it to keep working regardless of version.
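The workaround can be sketched as follows in Python. Everything here is hypothetical: the function names, the simplified `[gd_resource format=N]` header and the value 4 for the bumped version are assumptions for illustration (only the value 3 is quoted earlier in the thread), and real files carry more header attributes.

```python
import base64

LEGACY_FORMAT = 3  # value quoted earlier in the thread
BASE64_FORMAT = 4  # assumed value for the bumped version


def serialize_value(value):
    """Return (text, used_base64) for one property value."""
    if isinstance(value, (bytes, bytearray)):
        encoded = base64.b64encode(bytes(value)).decode("ascii")
        return f'PackedByteArray("{encoded}")', True
    return repr(value), False


def serialize_resource(properties: dict) -> str:
    """Write properties first, then pick the lowest format version that fits."""
    lines = []
    needs_new_format = False
    for name, value in properties.items():
        text, used_base64 = serialize_value(value)
        needs_new_format = needs_new_format or used_base64
        lines.append(f"{name} = {text}")
    version = BASE64_FORMAT if needs_new_format else LEGACY_FORMAT
    return "\n".join([f"[gd_resource format={version}]"] + lines)


print(serialize_resource({"speed": 3}))
print(serialize_resource({"tile_data": b"\x01\x02\x03"}))
```

One design consequence is that the header can only be written after every property has been serialized, so a writer taking this approach has to buffer the body or scan the values beforehand.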

@oxplay2

oxplay2 commented May 31, 2024

Currently using 4.2.

As someone who sends bytes over the network from a separate app to a Godot UDP or TCP server, where they are decoded with byte_to_var, will I be affected by this?

For example:

UDP:

var packet: PackedByteArray = peerUDP.get_packet()

TCP:

var data = stream.get_data(packet.size)

So now I'm unsure how I will need to encode this on the separate app's side.

IMO you should just add PackedByteArray and EncodedPackedByteArray.

@groud
Member Author

groud commented May 31, 2024

> as a person who use separate App Network byte sending to byte_to_var from godot UDP or TCP Server,
> will i be affected by this?

No, this does not impact binary serialization, only the text one (mainly for .tres and .tscn files).

@oxplay2

oxplay2 commented May 31, 2024

> as a person who use separate App Network byte sending to byte_to_var from godot UDP or TCP Server,
> will i be affected by this?

> No, this does not impact binary serialization, only the text one (mainly for .tres and .tscn files).

Thanks for the info, but are you sure?

There is definitely usage of it in UDP:

var packet:PackedByteArray = peerUDP.get_packet()

Will this work exactly the same when I send the same byte format from a separate non-Godot app?

If this is really just for files, then I'm happy.

@groud
Member Author

groud commented May 31, 2024

> thanks for info, but are you sure?

Yes I am. The modified VariantParser class isn't used for network, only for files.
