Rework string length rules #86

dpytaylo · 2024-01-02T23:02:34Z

See: #84

Change the meaning of length to byte length
Deprecate byte_length with a message that tells people to use length instead.
Add a char_count rule which will exist for backwards compatibility, just in case someone was depending on it counting USVs.
Update README.md

dpytaylo · 2024-01-02T23:09:01Z

The deprecated attribute message for byte_length works fine, but it generates a warning when I build the library crate. There perhaps exists a better solution.

jprochazk · 2024-01-02T23:45:24Z

I think you can put #[deprecated] on garde::rules::byte_length::apply instead. That should generate a warning at any call sites, as opposed to a warning in the proc macro crate.

dpytaylo · 2024-01-04T18:59:48Z

I attempted to add #[deprecated] on garde::rules::byte_length::apply, but I didn't recieve a warning from the macro (I think it may not be feasible without a span from the byte_length slice).

So, I began searching the internet for solutions and came across proc-macro-warning. This crate inspired me to create a seperate module (check_deprecated; because I didn't want to make changes in the main code) where I mostly copied the code from the check module while making the necessary changes to identify the bytes_length span and convert it into a compile warning.

So, there is a result:

P.S. I'm still a newbie in proc macro development, so my code may be far from perfect.

garde_derive/src/check_deprecated.rs

jprochazk · 2024-01-04T23:04:32Z

So, I began searching the internet for solutions and came across proc-macro-warning.

Seems like a fine approach! I wonder if there's a way to produce a less noisy warning (without the whole deprecated constant part), but you don't have to worry about that in this PR.

dpytaylo · 2024-01-05T20:19:27Z

Seems like a fine approach! I wonder if there's a way to produce a less noisy warning (without the whole deprecated constant part), but you don't have to worry about that in this PR.

I think the solution looks like the best solution for warnings in proc macros. However, it is still in nightly, and I don't know if it is workable at all 😂.

Okay, I think I could close the deprecation task and start a new task. I'm going to start moving the length functionality to the new char_count rule.

dpytaylo · 2024-01-08T17:29:28Z

Should we place the unicode feature in the default features? As mentioned previously, "it would roughly quadruple the size of the library", but all other features were already included in default features. To me, it seems somewhat unfamiliar to see so many features in the default features, and this isn't common for most of the popular crates. For example, tokio has no default features. Other crates only include the most required features in default features.

Here, I see several approaches:

1. Make all features non-default.

Pros:

It makes garde the lightest in a size by default.

Cons:

People won't be very happy to see the code which doesn't compile (🙂), but it will be easy to fix, just need to add required features in their Cargo.toml. (Also, as I know, Rust can sometimes suggest to add required feature flags)
Over 90%+ of users must include the derive feature. For example, serde does it in the same approach.

2. Make only the `derive` feature default.

Pros:

It still makes garde light.

Cons:

Similarly, it requires fixing feature flags.

3. Leave it as it is but don't place the `unicode` feature in the defaults.

Pros:

It doesn't break any code.

Cons:

It will seem a bit strange that only the unicode feature in non-defaults, but I think it is okay.

4. Place all features in the defaults.

Pros:

It doesn't break any code.

Cons:

It will make garde much bigger in size, as mentioned before.

Personally, I preferred 1 and 2 solutions for better DX, but other solutions could be used too. I am aware that developer could always use the default-features = false but I don't see this as a common practice, except for embedded things. I could have misconceptions about this topic, so I will wait for your comment 😅.

P.S. For the time being, I have placed grapheme_count in the default features because I have encountered with some problems running tests with features flags.

jprochazk · 2024-01-11T14:25:15Z

I expect most people to keep using length over the other string length rules, because byte length is the most common definition of "string length" out there, and it's also the most useful metric for defining resource limits. Let's keep unicode out of the default feature set.

As for the size of the default feature set: My priority is to make garde easy to use. I don't agree with the philosophy that users should have to look for and turn on every feature they care about for each one of their dependencies, because that's a lot of friction.

For example, tokio has no default features

It does have a full feature though, which is a kind of default.

As you said, people who care about having the smallest set of dependencies have the option of using default-features = false, so I'm not too worried about having too many default features. That being said, I think there are definitely some features that could be removed from the default set, such as credit-card, but derive is not one of them. It makes sense for serde, because it's a "foundational" crate that's implemented by a lot of other crates, but I don't expect garde to reach that status, and I don't think it's the right approach anyway. You shouldn't have many copies of garde in your dependency tree, so there's less of a chance that one of them will turn on the default feature set and unnecessarily increase your compile times.

One approach of supporting third-party types without adding an impl to every library under the sun is something like the adapters described here

dpytaylo · 2024-01-12T20:08:02Z

Thanks a lot for your response! I agreed that it is uncomfortable to enable each feature individually, but I recently got an idea which I want to share with you.

Okay, we have these points:

We don't want to make garde very big by default.
We aim to enhance DX and avoid confusing developers with these feature flags.

And I had such thoughts:

If developers don't want to work manually with features, they also won't think about a size of a dependency (e.g. garde).

While I was thinking about whether we should remove the unicode feature from the others, I conducted this small research:

fn main() {
    // This sentence in the different languages: 
    // "The sun sets behind the mountains, casting a warm glow across the tranquil lake"
    const MEMORY_LIMIT: usize = 200;
    const TEXT_LIMIT: usize = 100;

    #[derive(Validate)]
    struct Test {
        #[garde(length(max = MEMORY_LIMIT), grapheme_count(max = TEXT_LIMIT))]
        text: String,
    }

    // 79 bytes, 79 graphemes
    let _a = Test { text: "The sun sets behind the mountains, casting a warm glow across the tranquil lake".into() };

    // 58 bytes, 20 graphemes
    let _b = Test { text: "太阳落山后,在宁静的湖面上投下温暖的光芒".into() }; // Chinese

    // 113 bytes, 62 graphemes
    let _c = Test { text: "تغرب الشمس خلف الجبال ، وتلقي توهجا دافئا عبر البحيرة الهادئة.".into() }; // Arabic

    // 189(!) bytes, 48 graphemes
    // This sentence still conveys the same meaning, but it requires a significant amount of memory.
    let _d = Test { text: "পাহাড়ের পিছনে সূর্য অস্ত যায়, প্রশান্ত হ্রদ জুড়ে একটি উষ্ণ আভা ফেলছে".into() }; // Bengali

    // 93 bytes, 92(!) graphemes
    // This example requires more memory, but it also contains more graphemes.
    let _e = Test { text: "Le soleil se couche derrière les montagnes, projetant une lueur chaude sur le lac tranquille".into() }; // French

    // Also, I would like to draw your attention to the following. The grapheme count doesn't have 
    // such a large range of values, unlike bytes:
    //
    // max_dif_bytes = 189 - 79 = 110
    // max_dif_graph = 93 - 48 = 45

    // Another sentence: 377 bytes, 9 graphemes
    // 1. This example will be rejected by the MEMORY_LIMIT(200).
    // 2. For the best results (if you don't want to see these strange things), it will require a custom validator.
    //    However, our length limiter does its job and can reject these strange texts that could overflow our memory.


    let _invalid = Test { text: "l̷̢̢̰̬͇̙͉͕̠̠̥̂̿̋͑̕͝͠ͅͅĩ̴̡̢̛̠̻̫̲͉̤̱̟͍̤̳͔͐̍̔̈́͊̒͂͋̈́̉̔̕̚͜͜͠k̸͍̳̜̗̰̼̦̟̖̳̥̙̗̂́̓̎͌͊͘ȩ̴͔̝̤̳̖̜̓̽̕ ̶̪̺͖̈́̃t̷̪̯̟̳͍̲͔̎͋̿̉̒̑̓̊̾̊̒̚͘ḩ̷̦͂̈́͗͌̏̏̇̔̈́͒̒̆̄̈́̚͠į̸̨̡̛̤͚͓̯͎̘̪̙̟̮͈͔͔̈́͋̉̾̃̎̒̈́́̾͂́ͅś̷̘̙̜̯͖̄͆̿̄̑̄̄͝".into() }; // This example will fail the validation.




    // Sometimes, we need to have very short text, for example, for nicknames because
    // we have limited space for its rendering. There isn't a simple solution; for more
    // information, you could reading this article:
    // https://tomdebruijn.com/posts/rust-string-length-width-calculations/
    // 
    // As the most simple solution you could use the widest grapheme (e.g. `🙂`) and calculate
    // how much graphemes you could contain in your string.
    const TEXT_LIMIT_2: usize = 10;

    #[derive(Validate)]
    struct User {
        #[garde(length(max = MEMORY_LIMIT), grapheme_count(max = TEXT_LIMIT_2))]
        nickname: String,
    }

    // Every nickname has 10 graphemes:
    let _a = User { nickname: "Aaaaaaaaaa".into() };
    let _b = User { nickname: "太阳落山后在在宁静的".into() };
    let _c = User { nickname: "ةاالهادئة.".into() };
    let _d = User { nickname: "পাহাড়েরপাপাহাড়েরপা".into() };
    let _e = User { nickname: "Aaaaaaaaaa".into() };

    // Emoji have almost the widest width because most emoji are rendered with a display width of two columns.
    let _f = User { nickname: "🙂🙂🙂🙂🙂🙂🙂🙂🙂🙂".into() };

    // If you try to reject by a byte length (e.g. 10), valid user nicknames will looks like this:
    // let _a = User { nickname: "Aaaaaaaaaa".into() };
    // let _b = User { nickname: "太阳落".into() };
    // let _c = User { nickname: "الهاد".into() };
    // let _d = User { nickname: "পাহ".into() };
    // let _e = User { nickname: "Aaaaaaaaaa".into() };
    // let _f = User { nickname: "🙂🙂".into() };
}

P.S. If my code has makes sense, we could add it somewhere, for example, in the documentation or examples.

Yes, this example mostly repeats known facts, but I arrived at this conclusion. Often, especially when working with non-Latin languages (!), a developer should use byte length validation along with grapheme count validation to perform good validation. Otherwise, non-Latin text with the same meaning as in English may not fit within the same byte length. I think, it is a big fault for developers that didn't consider non-Latin languages. So, there is the fourth point.

The unicode feature should be also enabled like the other features.

I added because I think it is important feature, and should be with other.

For the end, I also can add the point:

More Rust-way: Explicit is better than implicit.

So, I would like to share my idea that could meet all the requirements mentioned above:

In your last response, you mentioned the full tokio flag, which does the same thing: it simplifies working with the crate without the need to think too much about other flags, and it does it explicitly. So, let's do it the same way; just add the full flag that contains all features (including unicode). Its name will indicate that it could contains a lot of ~~unnecessary~~ stuff, like credit-card or unicode features. It won't make garde too big by default and will improve DX by adding the popular flag (full), similar to how it is done in other crates.

jprochazk · 2024-01-12T23:21:04Z

TL;DR: I agree with your suggestions. We should:

add a full feature with the contents of default
include unicode as part of full
remove the default feature

Full context:

a developer should use byte length validation along with grapheme count validation to perform good validation

Interesting, I hadn't thought of that. It makes a lot of sense, you want to ensure the nicknames display with some maximum width, so that your designers are happy, but by only checking byte length, you're discriminating against scripts which use more bytes in unicode. TIL!

I think it would be a great addition to the docs. It could probably be simplified to only use two or three languages, e.g. English and Bengali, just to illustrate the concept.

Side note: I think this crate is sorely missing a "cookbook" of examples, showcasing common patterns, such as how to validate a typical registration form.

The unicode feature should be also enabled like the other features

Agreed.

just add the full flag that contains all features

I think I'm also in favor of this now.

Initially I wasn't, because I thought it's just a less convenient default, but since making my previous comment, I've looked into it a bit more, and I realised that default features are actually a big problem in the ecosystem. Many crates don't turn them off, and when those dependencies with default features become transitive dependencies, there is no longer a way to turn them off. You're stuck with slower compile times with no way out. Libraries should probably default to using default-features = false on every dependency.

I will say it's unfortunate that we have to resort to such "hacks" or "workarounds" because Cargo

We don't want to make gradle very big by default

Btw, what is "gradle" and "gardle"? 😄

jprochazk · 2024-01-12T23:28:45Z

Also just a reminder: Not everything has to be in one PR. Feel free to just put unicode in default for now, and do the rest in a separate PR, or just leave it to me. It will also make it a bit easier for me to review 🙂.

dpytaylo · 2024-01-13T18:14:56Z

Okay, I have finished implementing grapheme_count. You can start to review the code.

do the rest in a separate PR

Yeah, it is a good idea. But should we open a new issue before that?

Btw, what is "gradle" and "gardle"? 😄

Oops, I have mixed Gradle (Java dependency manager) with the garde crate name. 😅

~~P.S. There is one failed ci/cd step, but I don't know how to fix it.~~ solved

Obstbeeren · 2024-01-19T21:32:29Z

Hey, can this PR also implement Length, ... for Box<str>?

dpytaylo · 2024-01-20T12:20:26Z

Hey, can this PR also implement Length, ... for Box<str>?

Yes, of course 🙂

dpytaylo · 2024-01-21T14:48:18Z

Yeah, I have guessed the minimum Rust version 😂.
This approach sounds cringe but it worked.

jprochazk · 2024-01-21T18:41:29Z

Hi! I took some time to get back to this because I wasn't totally satisfied with the solution.

There's a few unnecessary/unrelated changes in this PR:

Implementing this shouldn't require updating any dependencies
xtask/Cargo.lock is committed on purpose, and it should remain as such

As for the actual implementation:

The byte_length rule should just be removed. As far as I could tell, barely anyone was using it, and garde is still pre-1.0, so this kind of breakage is acceptable. It's better than hacking in an ugly deprecation warning just to be able to keep it for a few versions.
Instead of creating a rule for each different kind of length, we should just add a mode argument to the length rule to change its behavior:
```
#[derive(garde::Validate)]
struct Test {
    #[garde(length(graphemes, max = 10))]
    graphemes: String,
}
```
By default, it uses the simple mode, which for containers like Vec is the number of items, and for String-like types is the number of bytes.

Because I can't commit here (the base of this PR is your fork's main branch), I implemented this alternative approach in #88, so I'm going to close this PR. Thanks a lot for working on this, and for providing important context!

Add byte_length deprecated attribute message

8301f9f

Fix byte_length attribute

7de6504

jprochazk reviewed Jan 4, 2024

View reviewed changes

garde_derive/src/check_deprecated.rs Show resolved Hide resolved

garde_derive/src/check_deprecated.rs Outdated Show resolved Hide resolved

Refactor check() function

c39923d

dpytaylo added 4 commits January 6, 2024 16:18

Add char_count rule

441b100

Change the meaning of length to byte length

9f24730

Fix char_count

436dc99

Add grapheme_count rule

aa9e557

dpytaylo added 3 commits January 13, 2024 18:53

Refactor code

779bd7c

Add grapheme_count in README.md

ae3ab7c

Merge branch 'main' into main

e6c53fe

Add implementations for Box<str> and &Cow<str>

e93e1c7

dpytaylo added 5 commits January 20, 2024 14:20

Merge branch 'main' of github.com:dpytaylo/garde

9b1f3e8

Update Rust pinned version to 1.70

4ce60b8

Add --verbose flag to cargo build

7983ba5

Fix cargo build command in ci.yml

63d79bf

Update pinned version to 1.71

8c1d14d

jprochazk closed this Jan 21, 2024

This was referenced Jan 21, 2024

Rework string length rules #84

Closed

Add the full feature flag #92

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework string length rules #86

Rework string length rules #86

dpytaylo commented Jan 2, 2024 •

edited

Loading

dpytaylo commented Jan 2, 2024

jprochazk commented Jan 2, 2024

dpytaylo commented Jan 4, 2024

jprochazk commented Jan 4, 2024

dpytaylo commented Jan 5, 2024

dpytaylo commented Jan 8, 2024 •

edited

Loading

jprochazk commented Jan 11, 2024 •

edited

Loading

dpytaylo commented Jan 12, 2024 •

edited

Loading

jprochazk commented Jan 12, 2024

jprochazk commented Jan 12, 2024 •

edited

Loading

dpytaylo commented Jan 13, 2024 •

edited

Loading

Obstbeeren commented Jan 19, 2024 •

edited

Loading

dpytaylo commented Jan 20, 2024 •

edited

Loading

dpytaylo commented Jan 21, 2024

jprochazk commented Jan 21, 2024 •

edited

Loading

Rework string length rules #86

Rework string length rules #86

Conversation

dpytaylo commented Jan 2, 2024 • edited Loading

dpytaylo commented Jan 2, 2024

jprochazk commented Jan 2, 2024

dpytaylo commented Jan 4, 2024

jprochazk commented Jan 4, 2024

dpytaylo commented Jan 5, 2024

dpytaylo commented Jan 8, 2024 • edited Loading

1. Make all features non-default.

2. Make only the derive feature default.

3. Leave it as it is but don't place the unicode feature in the defaults.

4. Place all features in the defaults.

jprochazk commented Jan 11, 2024 • edited Loading

dpytaylo commented Jan 12, 2024 • edited Loading

jprochazk commented Jan 12, 2024

jprochazk commented Jan 12, 2024 • edited Loading

dpytaylo commented Jan 13, 2024 • edited Loading

Obstbeeren commented Jan 19, 2024 • edited Loading

dpytaylo commented Jan 20, 2024 • edited Loading

dpytaylo commented Jan 21, 2024

jprochazk commented Jan 21, 2024 • edited Loading

dpytaylo commented Jan 2, 2024 •

edited

Loading

dpytaylo commented Jan 8, 2024 •

edited

Loading

2. Make only the `derive` feature default.

3. Leave it as it is but don't place the `unicode` feature in the defaults.

jprochazk commented Jan 11, 2024 •

edited

Loading

dpytaylo commented Jan 12, 2024 •

edited

Loading

jprochazk commented Jan 12, 2024 •

edited

Loading

dpytaylo commented Jan 13, 2024 •

edited

Loading

Obstbeeren commented Jan 19, 2024 •

edited

Loading

dpytaylo commented Jan 20, 2024 •

edited

Loading

jprochazk commented Jan 21, 2024 •

edited

Loading