friendlymatthew
Contributor

Which issue does this PR close?

This commit introduces ShortString, a newtype wrapping &str that enforces a maximum length constraint. This lets us perform validation once, at construction, and removes a superfluous validation check in append_value.

The now-superfluous validation check was needed because users could construct Variant::ShortString values directly, without any input validation. This meant you could have a short string variant that actually contained a string longer than 63 bytes.

But since we enforce this check upon construction, we can directly match against Variant::String and Variant::ShortString arms with their respective appending functions (append_string and append_short_string).
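
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `TooLong` error type, the `Builder` struct, and the byte layout written by the two append functions are hypothetical stand-ins (the real code returns `ArrowError` and follows the Variant encoding), and the 63-byte limit is taken from the description above.

```rust
/// Limit taken from the PR description (63 bytes).
const MAX_SHORT_STRING_BYTES: usize = 63;

/// Hypothetical error type; the PR itself returns `ArrowError`.
#[derive(Debug, PartialEq)]
struct TooLong(usize);

/// Newtype whose only constructor validates the length, so any
/// `ShortString` value is known to fit in the short-string encoding.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ShortString<'a>(&'a str);

impl<'a> ShortString<'a> {
    fn try_new(value: &'a str) -> Result<Self, TooLong> {
        // `str::len` counts bytes, not characters.
        if value.len() > MAX_SHORT_STRING_BYTES {
            return Err(TooLong(value.len()));
        }
        Ok(Self(value))
    }

    fn as_str(&self) -> &'a str {
        self.0
    }
}

enum Variant<'a> {
    String(&'a str),
    ShortString(ShortString<'a>),
}

/// Hypothetical builder; the byte layout below is a placeholder,
/// not the actual Variant binary encoding.
#[derive(Default)]
struct Builder {
    buf: Vec<u8>,
}

impl Builder {
    fn append_short_string(&mut self, s: ShortString<'_>) {
        // Safe cast: the length was validated at construction (<= 63).
        self.buf.push(s.as_str().len() as u8);
        self.buf.extend_from_slice(s.as_str().as_bytes());
    }

    fn append_string(&mut self, s: &str) {
        self.buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(s.as_bytes());
    }

    fn append_value(&mut self, v: Variant<'_>) {
        // No length re-check: the type already carries that guarantee.
        match v {
            Variant::String(s) => self.append_string(s),
            Variant::ShortString(s) => self.append_short_string(s),
        }
    }
}
```

Because the tuple field is private to the defining module, callers outside it can only obtain a `ShortString` through `try_new`, which is what makes the check in `append_value` redundant.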

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 20, 2025
@friendlymatthew friendlymatthew force-pushed the string-vs-short-string branch 2 times, most recently from 043f082 to a06720b Compare June 20, 2025 13:05
@friendlymatthew friendlymatthew force-pushed the string-vs-short-string branch from fea5ecd to df42f00 Compare June 20, 2025 13:11
Contributor

@alamb alamb left a comment

This looks great to me -- thank you @friendlymatthew . I left a few comments but we can do that as follow on PRs if you prefer

Note this is very likely to conflict with

@friendlymatthew friendlymatthew force-pushed the string-vs-short-string branch from df42f00 to a6ad4e5 Compare June 20, 2025 13:28
Contributor

alamb commented Jun 20, 2025

FYI @scovich perhaps you could offer your opinion on this PR as well

@friendlymatthew friendlymatthew force-pushed the string-vs-short-string branch from e713fd4 to 2a62499 Compare June 20, 2025 13:45
///
/// This constructor verifies that `value` is shorter than or equal to `MAX_SHORT_STRING_SIZE`
pub fn try_new(value: &'a str) -> Result<Self, ArrowError> {
if value.len() > MAX_SHORT_STRING_SIZE {
Contributor

Worth adding a comment here that we are indeed supposed to check bytes and not characters; that's a common confusion with "string length"

Contributor

That is a great idea -- maybe we can even name the constant MAX_SHORT_STRING_BYTES to make it more self-describing
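
To illustrate the bytes-versus-characters point raised in this thread, here is a small sketch. The helper name `fits_short_string` is hypothetical, the constant uses the name suggested above, and its value of 63 is taken from the PR description.

```rust
/// Name as suggested in the review; value from the PR description.
const MAX_SHORT_STRING_BYTES: usize = 63;

/// Hypothetical helper: returns true if `value` fits the short-string limit.
fn fits_short_string(value: &str) -> bool {
    // `str::len` returns the length of the UTF-8 encoding in bytes,
    // which is what the limit is defined in terms of.
    value.len() <= MAX_SHORT_STRING_BYTES
}

/// Returns (character count, byte count) for comparison.
fn char_and_byte_len(value: &str) -> (usize, usize) {
    (value.chars().count(), value.len())
}
```

For example, a string of 32 copies of `é` has only 32 characters but occupies 64 bytes in UTF-8, so it would not fit, while 63 ASCII characters would.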

@friendlymatthew friendlymatthew force-pushed the string-vs-short-string branch from dae45e6 to 09af8ff Compare June 20, 2025 20:16
Contributor

@scovich scovich left a comment

LGTM

Comment on lines +63 to +64
impl<'a> From<ShortString<'a>> for &'a str {
fn from(value: ShortString<'a>) -> Self {
Contributor

This seems a bit... unorthodox? Would impl Deref be more traditional, to go with that impl AsRef?

Contributor Author

That makes a lot of sense
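
For reference, the `Deref` approach suggested here might look roughly like the sketch below. It assumes `ShortString` is a tuple struct over `&'a str`; the `byte_len` function is a hypothetical stand-in for any API that takes `&str`.

```rust
use std::ops::Deref;

/// Assumed shape of the newtype: a tuple struct over `&'a str`.
struct ShortString<'a>(&'a str);

impl<'a> Deref for ShortString<'a> {
    type Target = str;

    fn deref(&self) -> &str {
        self.0
    }
}

impl<'a> AsRef<str> for ShortString<'a> {
    fn as_ref(&self) -> &str {
        self.0
    }
}

/// Hypothetical function standing in for any API that accepts `&str`.
fn byte_len(s: &str) -> usize {
    s.len()
}
```

With `Deref` in place, `str` methods such as `len` or `starts_with` can be called directly on a `ShortString`, and `&short_string` coerces to `&str` at call sites like `byte_len(&short_string)`. One trade-off worth noting: the reference returned by `deref` is tied to the borrow of the wrapper, whereas `From<ShortString<'a>> for &'a str` hands back the original `'a` lifetime.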

@alamb alamb merged commit 1ededfe into apache:main Jun 21, 2025
12 checks passed
Contributor

alamb commented Jun 21, 2025

Thanks again @friendlymatthew and @scovich and @adriangb

Development

Successfully merging this pull request may close these issues.

[Variant] More efficient determination of String vs ShortString
4 participants