Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stdlib): Add String.encode and String.decode functions to standard library #683

Merged
merged 6 commits into from
Jun 13, 2021

Conversation

peblair
Copy link
Member

@peblair peblair commented May 31, 2021

This pull request adds two new functions to the String module in the standard library:

export enum Encoding {
  UTF8,
  UTF16_BE,
  UTF16_LE,
  UTF32_BE,
  UTF32_LE,
}

// Encodes the given string using the given encoding scheme
// @param s: String - The input string
// @param encoding: Encoding - The encoding to use
// @param dest: Bytes - The bytes object to write the encoded output into
// @param destPos: Number - The location in the byte array to write the output
// @returns Bytes - Returns `dest`
export let encodeInto = (s, encoding, dest, destPos)

// Encodes the given string using the given encoding scheme
// @param s: String - The input string
// @param encoding: Encoding - The encoding to use
// @param dest: Bytes - The bytes object to write the encoded output into
// @param destPos: Number - The location in the byte array to write the output
// @returns Bytes - Returns `dest`
export let encodeIntoWithBom = (s, encoding, dest, destPos)

// Encodes the given string using the given encoding scheme. A byte-order marker
// will not be included in the output.
// @param s: String - The input string
// @param encoding: Encoding - The encoding to use
// @returns Bytes
export let encode = (s: String, encoding: Encoding)

// Encodes the given string using the given encoding scheme, including a byte-order marker
// @param s: String - The input string
// @param encoding: Encoding - The encoding to use
// @returns Bytes
export let encodeWithBom = (s: String, encoding: Encoding)

// -----

// Decodes the given byte sequence into a string using the given encoding scheme, skipping
// the byte-order marker, if it's present.
// @param bytes: Bytes - The input bytes
// @param encoding: Encoding - The encoding to use
// @param start: Number - The byte offset to begin decoding from
// @param size: Number - The maximum number of bytes to decode
// @returns String
export let decodeRange = (bytes: Bytes, encoding: Encoding, start: Number, size: Number)

// Decodes the given byte sequence into a string using the given encoding scheme, including
// the byte-order marker, if it's present.
// @param bytes: Bytes - The input bytes
// @param encoding: Encoding - The encoding to use
// @param start: Number - The byte offset to begin decoding from
// @param size: Number - The maximum number of bytes to decode
// @returns String
export let decodeRangeKeepBom = (bytes: Bytes, encoding: Encoding, start: Number, size: Number)

// Decodes the given byte sequence into a string using the given encoding scheme,
// skipping the byte-order marker, if it's present.
// @param bytes: Bytes - The input bytes
// @param encoding: Encoding - The encoding to use
// @returns String
export let decode = (bytes: Bytes, encoding: Encoding)

// Decodes the given byte sequence into a string using the given encoding scheme,
// including the byte-order marker, if it's present
// @param bytes: Bytes - The input bytes
// @param encoding: Encoding - The encoding to use
// @returns String
export let decodeKeepBom = (bytes: Bytes, encoding: Encoding)

These help enable use cases such as those described by @cician in #661, in which users need to marshal Grain strings into nonstandard encodings.

@peblair peblair added the stdlib label May 31, 2021
@peblair peblair requested a review from a team May 31, 2021 15:25
Copy link
Member

@ospencer ospencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely looks good. Curious to hear your thoughts on the shared utilities.

stdlib/string.gr Outdated Show resolved Hide resolved
stdlib/string.gr Show resolved Hide resolved
stdlib/string.gr Show resolved Hide resolved
stdlib/string.gr Outdated Show resolved Hide resolved
stdlib/string.gr Outdated Show resolved Hide resolved
stdlib/string.gr Show resolved Hide resolved
@cician
Copy link
Contributor

cician commented Jun 4, 2021

May a suggest that it would be useful to have a version of encode that writes to an existing instance of Bytes into a specific offset?

I'm currently using Bytes as a big buffer where I append multiple strings and have to create a temporary Bytes with Bytes.fromString for each write operation, only to then copy it over into my big buffer with Bytes.move.

Also I think it would be useful to add offset and length parameters to the decode function for things like reading multiple strings from structured binary formats.

@peblair
Copy link
Member Author

peblair commented Jun 4, 2021

@cician I am not opposed (in principle, at least), but there is a question about what the API would look like:

Option 1 would be to add two new functions which do the read/write based on the existing buffer. The design question here is what these variants would be named.

Option 2 would be to wait until #388 is resolved, and then have a buffer/offset be optional arguments to the existing functions. I lean towards this option, but, of course, this means waiting until #388 is resolved.

@ospencer, what are your thoughts?

@ospencer
Copy link
Member

ospencer commented Jun 4, 2021

I'd say go ahead and add them in now, as we'll end up refactoring the whole stdlib once we've got optional args.

I'd call them encodeBuffered and decodeBuffered.

@cician
Copy link
Contributor

cician commented Jun 4, 2021

For encode i think it would make sense to just have a separate function like this:

export let encodeInto = (s: String, encoding: Encoding, includeBom: Bool, dstPos: Number, dst: Bytes)

For decode optional parameters would be nice, though maybe a bit strange if the user could pass one, but not the other.

export let decodeRange = (bytes: Bytes, encoding: Encoding, skipBom: Bool, srcPos: Number, srcLen: Number)

Or encodeBuffered/decodeBuffered I'm not picky about names.

@ospencer ospencer changed the title feat: Add String.encode and String.decode functions to standard library feat(stdlib): Add String.encode and String.decode functions to standard library Jun 6, 2021
@peblair
Copy link
Member Author

peblair commented Jun 7, 2021

I've added encodeInto and decodeRange. cc @ospencer for re-review and feedback on naming.

Copy link
Member

@phated phated left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave the implementation review to @ospencer since I'm very dumb when it comes to the ptr logic; however, I do think we need to split up the APIs a bit since we don't support optional arguments yet.

stdlib/string.gr Outdated Show resolved Hide resolved
stdlib/string.gr Outdated Show resolved Hide resolved
stdlib/string.gr Outdated Show resolved Hide resolved
stdlib/string.gr Outdated Show resolved Hide resolved
@peblair peblair force-pushed the character-encodings branch 2 times, most recently from 3769701 to 12ac705 Compare June 8, 2021 14:30
@peblair
Copy link
Member Author

peblair commented Jun 8, 2021

@phated Everything should be addressed now. I've updated the original post with the present user-facing API. (the default encode and decode functions were meant to have sensible defaults for what users would typically want to do)

@peblair
Copy link
Member Author

peblair commented Jun 8, 2021

cc @ospencer on the updated API ☝️

Copy link
Member

@phated phated left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks for updating the API 🎊

Copy link
Member

@ospencer ospencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not a fan of the encodeInto name, because to me it's confusing what "into" means. To me it really sounds like encode has some default behavior and encodeInto lets you pick what encoding you want to encode the string into, not having anything to do with offsets within a Bytes instance. I'm not going to block on it though—once #710 is merged, we should refactor encode to write into a Buffer. Then encodeInto isn't needed and we can just delete it.

@peblair
Copy link
Member Author

peblair commented Jun 9, 2021

@ospencer Would encodeAt be better in your opinion? I am not sure I 100% agree with your proposal re: #710 (in terms of how it will change the APIs added here...I think that's a discussion for later, though), so I would feel better if we merged in a name you're happy with.

Copy link
Member

@ospencer ospencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted and largely aligned on what the future API would look like. In the meanwhile, we're gonna go with encodeAt. This PR has my blessing once that change is done 👼 (approving now for those beautiful European summer mornings 🌞)

@peblair peblair merged commit 5635a36 into main Jun 13, 2021
@peblair peblair deleted the character-encodings branch June 13, 2021 11:40
@peblair
Copy link
Member Author

peblair commented Jun 13, 2021

🇪🇺

@github-actions github-actions bot mentioned this pull request May 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants