Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework string length rules #86

Closed
wants to merge 16 commits into from
Closed

Rework string length rules #86

wants to merge 16 commits into from

Conversation

dpytaylo
Copy link
Contributor

@dpytaylo dpytaylo commented Jan 2, 2024

See: #84

  • Change the meaning of length to byte length
  • Deprecate byte_length with a message that tells people to use length instead.
  • Add a char_count rule which will exist for backwards compatibility, just in case someone was depending on it counting USVs.
  • Update README.md

@dpytaylo
Copy link
Contributor Author

dpytaylo commented Jan 2, 2024

The deprecated attribute message for byte_length works fine, but it generates a warning when I build the library crate. There perhaps exists a better solution.

@jprochazk
Copy link
Owner

I think you can put #[deprecated] on garde::rules::byte_length::apply instead. That should generate a warning at any call sites, as opposed to a warning in the proc macro crate.

@dpytaylo
Copy link
Contributor Author

dpytaylo commented Jan 4, 2024

I attempted to add #[deprecated] on garde::rules::byte_length::apply, but I didn't recieve a warning from the macro (I think it may not be feasible without a span from the byte_length slice).

So, I began searching the internet for solutions and came across proc-macro-warning. This crate inspired me to create a seperate module (check_deprecated; because I didn't want to make changes in the main code) where I mostly copied the code from the check module while making the necessary changes to identify the bytes_length span and convert it into a compile warning.

So, there is a result:
image

P.S. I'm still a newbie in proc macro development, so my code may be far from perfect.

@jprochazk
Copy link
Owner

So, I began searching the internet for solutions and came across proc-macro-warning.

Seems like a fine approach! I wonder if there's a way to produce a less noisy warning (without the whole deprecated constant part), but you don't have to worry about that in this PR.

@dpytaylo
Copy link
Contributor Author

dpytaylo commented Jan 5, 2024

Seems like a fine approach! I wonder if there's a way to produce a less noisy warning (without the whole deprecated constant part), but you don't have to worry about that in this PR.

I think the solution looks like the best solution for warnings in proc macros. However, it is still in nightly, and I don't know if it is workable at all 😂.

Okay, I think I could close the deprecation task and start a new task. I'm going to start moving the length functionality to the new char_count rule.

@dpytaylo
Copy link
Contributor Author

dpytaylo commented Jan 8, 2024

Should we place the unicode feature in the default features? As mentioned previously, "it would roughly quadruple the size of the library", but all other features were already included in default features. To me, it seems somewhat unfamiliar to see so many features in the default features, and this isn't common for most of the popular crates. For example, tokio has no default features. Other crates only include the most required features in default features.

Here, I see several approaches:

1. Make all features non-default.

Pros:

  • It makes garde the lightest in a size by default.

Cons:

  • People won't be very happy to see the code which doesn't compile (🙂), but it will be easy to fix, just need to add required features in their Cargo.toml. (Also, as I know, Rust can sometimes suggest to add required feature flags)
  • Over 90%+ of users must include the derive feature. For example, serde does it in the same approach.

2. Make only the derive feature default.

Pros:

  • It still makes garde light.

Cons:

  • Similarly, it requires fixing feature flags.

3. Leave it as it is but don't place the unicode feature in the defaults.

Pros:

  • It doesn't break any code.

Cons:

  • It will seem a bit strange that only the unicode feature in non-defaults, but I think it is okay.

4. Place all features in the defaults.

Pros:

  • It doesn't break any code.

Cons:

  • It will make garde much bigger in size, as mentioned before.

Personally, I preferred 1 and 2 solutions for better DX, but other solutions could be used too. I am aware that developer could always use the default-features = false but I don't see this as a common practice, except for embedded things. I could have misconceptions about this topic, so I will wait for your comment 😅.

P.S. For the time being, I have placed grapheme_count in the default features because I have encountered with some problems running tests with features flags.

@jprochazk
Copy link
Owner

jprochazk commented Jan 11, 2024

I expect most people to keep using length over the other string length rules, because byte length is the most common definition of "string length" out there, and it's also the most useful metric for defining resource limits. Let's keep unicode out of the default feature set.

As for the size of the default feature set: My priority is to make garde easy to use. I don't agree with the philosophy that users should have to look for and turn on every feature they care about for each one of their dependencies, because that's a lot of friction.

For example, tokio has no default features

It does have a full feature though, which is a kind of default.

As you said, people who care about having the smallest set of dependencies have the option of using default-features = false, so I'm not too worried about having too many default features. That being said, I think there are definitely some features that could be removed from the default set, such as credit-card, but derive is not one of them. It makes sense for serde, because it's a "foundational" crate that's implemented by a lot of other crates, but I don't expect garde to reach that status, and I don't think it's the right approach anyway. You shouldn't have many copies of garde in your dependency tree, so there's less of a chance that one of them will turn on the default feature set and unnecessarily increase your compile times.

One approach of supporting third-party types without adding an impl to every library under the sun is something like the adapters described here

@dpytaylo
Copy link
Contributor Author

dpytaylo commented Jan 12, 2024

Thanks a lot for your response! I agreed that it is uncomfortable to enable each feature individually, but I recently got an idea which I want to share with you.

Okay, we have these points:

  1. We don't want to make garde very big by default.
  2. We aim to enhance DX and avoid confusing developers with these feature flags.

And I had such thoughts:

  1. If developers don't want to work manually with features, they also won't think about a size of a dependency (e.g. garde).

While I was thinking about whether we should remove the unicode feature from the others, I conducted this small research:

fn main() {
    // This sentence in the different languages: 
    // "The sun sets behind the mountains, casting a warm glow across the tranquil lake"
    const MEMORY_LIMIT: usize = 200;
    const TEXT_LIMIT: usize = 100;

    #[derive(Validate)]
    struct Test {
        #[garde(length(max = MEMORY_LIMIT), grapheme_count(max = TEXT_LIMIT))]
        text: String,
    }

    // 79 bytes, 79 graphemes
    let _a = Test { text: "The sun sets behind the mountains, casting a warm glow across the tranquil lake".into() };

    // 58 bytes, 20 graphemes
    let _b = Test { text: "太阳落山后,在宁静的湖面上投下温暖的光芒".into() }; // Chinese

    // 113 bytes, 62 graphemes
    let _c = Test { text: "تغرب الشمس خلف الجبال ، وتلقي توهجا دافئا عبر البحيرة الهادئة.".into() }; // Arabic

    // 189(!) bytes, 48 graphemes
    // This sentence still conveys the same meaning, but it requires a significant amount of memory.
    let _d = Test { text: "পাহাড়ের পিছনে সূর্য অস্ত যায়, প্রশান্ত হ্রদ জুড়ে একটি উষ্ণ আভা ফেলছে".into() }; // Bengali

    // 93 bytes, 92(!) graphemes
    // This example requires more memory, but it also contains more graphemes.
    let _e = Test { text: "Le soleil se couche derrière les montagnes, projetant une lueur chaude sur le lac tranquille".into() }; // French

    // Also, I would like to draw your attention to the following. The grapheme count doesn't have 
    // such a large range of values, unlike bytes:
    //
    // max_dif_bytes = 189 - 79 = 110
    // max_dif_graph = 93 - 48 = 45

    // Another sentence: 377 bytes, 9 graphemes
    // 1. This example will be rejected by the MEMORY_LIMIT(200).
    // 2. For the best results (if you don't want to see these strange things), it will require a custom validator.
    //    However, our length limiter does its job and can reject these strange texts that could overflow our memory.


    let _invalid = Test { text: "l̷̢̢̰̬͇̙͉͕̠̠̥̂̿̋͑̕͝͠ͅͅĩ̴̡̢̛̠̻̫̲͉̤̱̟͍̤̳͔͐̍̔̈́͊̒͂͋̈́̉̔̕̚͜͜͠k̸͍̳̜̗̰̼̦̟̖̳̥̙̗̂́̓̎͌͊͘ȩ̴͔̝̤̳̖̜̓̽̕ ̶̪̺͖̈́̃t̷̪̯̟̳͍̲͔̎͋̿̉̒̑̓̊̾̊̒̚͘ḩ̷̦͂̈́͗͌̏̏̇̔̈́͒̒̆̄̈́̚͠į̸̨̡̛̤͚͓̯͎̘̪̙̟̮͈͔͔̈́͋̉̾̃̎̒̈́́̾͂́ͅś̷̘̙̜̯͖̄͆̿̄̑̄̄͝".into() }; // This example will fail the validation.




    // Sometimes, we need to have very short text, for example, for nicknames because
    // we have limited space for its rendering. There isn't a simple solution; for more
    // information, you could reading this article:
    // https://tomdebruijn.com/posts/rust-string-length-width-calculations/
    // 
    // As the most simple solution you could use the widest grapheme (e.g. `🙂`) and calculate
    // how much graphemes you could contain in your string.
    const TEXT_LIMIT_2: usize = 10;

    #[derive(Validate)]
    struct User {
        #[garde(length(max = MEMORY_LIMIT), grapheme_count(max = TEXT_LIMIT_2))]
        nickname: String,
    }

    // Every nickname has 10 graphemes:
    let _a = User { nickname: "Aaaaaaaaaa".into() };
    let _b = User { nickname: "太阳落山后在在宁静的".into() };
    let _c = User { nickname: "ةاالهادئة.".into() };
    let _d = User { nickname: "পাহাড়েরপাপাহাড়েরপা".into() };
    let _e = User { nickname: "Aaaaaaaaaa".into() };

    // Emoji have almost the widest width because most emoji are rendered with a display width of two columns.
    let _f = User { nickname: "🙂🙂🙂🙂🙂🙂🙂🙂🙂🙂".into() };

    // If you try to reject by a byte length (e.g. 10), valid user nicknames will looks like this:
    // let _a = User { nickname: "Aaaaaaaaaa".into() };
    // let _b = User { nickname: "太阳落".into() };
    // let _c = User { nickname: "الهاد".into() };
    // let _d = User { nickname: "পাহ".into() };
    // let _e = User { nickname: "Aaaaaaaaaa".into() };
    // let _f = User { nickname: "🙂🙂".into() };
}

P.S. If my code has makes sense, we could add it somewhere, for example, in the documentation or examples.

Yes, this example mostly repeats known facts, but I arrived at this conclusion. Often, especially when working with non-Latin languages (!), a developer should use byte length validation along with grapheme count validation to perform good validation. Otherwise, non-Latin text with the same meaning as in English may not fit within the same byte length. I think, it is a big fault for developers that didn't consider non-Latin languages. So, there is the fourth point.

  1. The unicode feature should be also enabled like the other features.

I added because I think it is important feature, and should be with other.

For the end, I also can add the point:

  1. More Rust-way: Explicit is better than implicit.

So, I would like to share my idea that could meet all the requirements mentioned above:

In your last response, you mentioned the full tokio flag, which does the same thing: it simplifies working with the crate without the need to think too much about other flags, and it does it explicitly. So, let's do it the same way; just add the full flag that contains all features (including unicode). Its name will indicate that it could contains a lot of unnecessary stuff, like credit-card or unicode features. It won't make garde too big by default and will improve DX by adding the popular flag (full), similar to how it is done in other crates.

@jprochazk
Copy link
Owner

TL;DR: I agree with your suggestions. We should:

  1. add a full feature with the contents of default
  2. include unicode as part of full
  3. remove the default feature

Full context:

a developer should use byte length validation along with grapheme count validation to perform good validation

Interesting, I hadn't thought of that. It makes a lot of sense, you want to ensure the nicknames display with some maximum width, so that your designers are happy, but by only checking byte length, you're discriminating against scripts which use more bytes in unicode. TIL!

I think it would be a great addition to the docs. It could probably be simplified to only use two or three languages, e.g. English and Bengali, just to illustrate the concept.

Side note: I think this crate is sorely missing a "cookbook" of examples, showcasing common patterns, such as how to validate a typical registration form.

The unicode feature should be also enabled like the other features

Agreed.

just add the full flag that contains all features

I think I'm also in favor of this now.

Initially I wasn't, because I thought it's just a less convenient default, but since making my previous comment, I've looked into it a bit more, and I realised that default features are actually a big problem in the ecosystem. Many crates don't turn them off, and when those dependencies with default features become transitive dependencies, there is no longer a way to turn them off. You're stuck with slower compile times with no way out. Libraries should probably default to using default-features = false on every dependency.

I will say it's unfortunate that we have to resort to such "hacks" or "workarounds" because Cargo

We don't want to make gradle very big by default

Btw, what is "gradle" and "gardle"? 😄

@jprochazk
Copy link
Owner

jprochazk commented Jan 12, 2024

Also just a reminder: Not everything has to be in one PR. Feel free to just put unicode in default for now, and do the rest in a separate PR, or just leave it to me. It will also make it a bit easier for me to review 🙂.

@dpytaylo
Copy link
Contributor Author

dpytaylo commented Jan 13, 2024

Okay, I have finished implementing grapheme_count. You can start to review the code.

do the rest in a separate PR

Yeah, it is a good idea. But should we open a new issue before that?

Btw, what is "gradle" and "gardle"? 😄

Oops, I have mixed Gradle (Java dependency manager) with the garde crate name. 😅

P.S. There is one failed ci/cd step, but I don't know how to fix it. solved

@Obstbeeren
Copy link

Obstbeeren commented Jan 19, 2024

Hey, can this PR also implement Length, ... for Box<str>?

@dpytaylo
Copy link
Contributor Author

dpytaylo commented Jan 20, 2024

Hey, can this PR also implement Length, ... for Box<str>?

Yes, of course 🙂

@dpytaylo
Copy link
Contributor Author

Yeah, I have guessed the minimum Rust version 😂.
This approach sounds cringe but it worked.

@jprochazk
Copy link
Owner

jprochazk commented Jan 21, 2024

Hi! I took some time to get back to this because I wasn't totally satisfied with the solution.

There's a few unnecessary/unrelated changes in this PR:

  • Implementing this shouldn't require updating any dependencies
  • xtask/Cargo.lock is committed on purpose, and it should remain as such

As for the actual implementation:

  • The byte_length rule should just be removed. As far as I could tell, barely anyone was using it, and garde is still pre-1.0, so this kind of breakage is acceptable. It's better than hacking in an ugly deprecation warning just to be able to keep it for a few versions.
  • Instead of creating a rule for each different kind of length, we should just add a mode argument to the length rule to change its behavior:
    #[derive(garde::Validate)]
    struct Test {
        #[garde(length(graphemes, max = 10))]
        graphemes: String,
    }
    By default, it uses the simple mode, which for containers like Vec is the number of items, and for String-like types is the number of bytes.

Because I can't commit here (the base of this PR is your fork's main branch), I implemented this alternative approach in #88, so I'm going to close this PR. Thanks a lot for working on this, and for providing important context!

@jprochazk jprochazk closed this Jan 21, 2024
This was referenced Jan 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants