feat request: expose publicly capitalizeWord (or similar) #5424

Open
lowlighter opened this issue Jul 12, 2024 · 13 comments
Labels
needs discussion (this topic needs further discussion to determine what action to take), suggestion (a suggestion yet to be agreed)

Comments

@lowlighter
Contributor

Is your feature request related to a problem? Please describe.

Publicly expose capitalizeWord or something similar (maybe capitalize / ucfirst / titleCase / ...):
https://github.com/denoland/deno_std/blob/22d3bda488f145b725fc1eaeee16922a97d88add/text/_util.ts#L9-L13

A lot of languages provide a ucfirst helper (e.g. PHP, Perl) that capitalizes the first letter of a string.

While this is trivial enough, it's tedious to have to redefine this function in every project that needs it.

Describe the solution you'd like

Feature offered by std lib

Describe alternatives you've considered

Redefining capitalizeWord in each of my own projects.

@kt3k
Member

kt3k commented Jul 12, 2024

Maybe let's add toTitleCase()? That was suggested in the past in #3440 and #4082

@kt3k added the good first issue and PR welcome labels Jul 12, 2024
@timreichen
Contributor

> Maybe let's add toTitleCase()? That was suggested in the past in #3440 and #4082

Implementing toTitleCase() is a slippery slope, because a proper implementation needs grammar analysis (and possibly localization). This has also been discussed as a native API proposal, but it seems no conclusion has been reached (refs: #4082 (comment), https://es.discourse.group/t/proposal-string-prototype-capitalize/1662 and https://es.discourse.group/t/proposal-string-prototype-capitalize/1662).

I would like to have toTitleCase() implemented in std, but I think it will take a huge effort to do right.

@kt3k added the suggestion and needs discussion labels and removed the good first issue and PR welcome labels Jul 12, 2024
@lowlighter
Contributor Author

Maybe I misdirected the issue. I wanted an alias for ucfirst (i.e. just capitalize the first letter of a string), to avoid having to write ${str.charAt(0).toLocaleUpperCase()}${str.substring(1)} each time.

The suggested String.prototype.capitalize and toTitleCase seem to be more akin to the ucwords function.

I also just noticed that the function linked in the original post is not exactly what I wanted: in ucfirst, toLocaleLowerCase() isn't supposed to be called on the rest of the string.

Sorry if the issue wasn't clear.
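
To make the difference between the two behaviours concrete, here is a minimal sketch. The names ucfirst and capitalizeWordLike are hypothetical and not part of std, and both use the locale-insensitive toUpperCase()/toLowerCase() only to keep the output deterministic.

```typescript
// Hypothetical helper: upper-cases the first UTF-16 code unit and leaves
// the rest of the string untouched (the ucfirst behaviour requested here).
function ucfirst(str: string): string {
  return str.charAt(0).toUpperCase() + str.slice(1);
}

// Hypothetical helper: same as above, but also lower-cases the rest,
// which is roughly what the linked capitalizeWord util does.
function capitalizeWordLike(str: string): string {
  return str.charAt(0).toUpperCase() + str.slice(1).toLowerCase();
}

console.log(ucfirst("iPhone")); // "IPhone"
console.log(capitalizeWordLike("iPhone")); // "Iphone"
```

The two only differ when the tail of the string already contains upper-case letters, which is exactly the case the comment above is concerned with.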

@luk3skyw4lker
Contributor

@timreichen @kt3k

I think that adding toTitleCase() might be a good idea, but making capitalizeWord public would probably be a better fit for this issue, since it covers a real use case (sometimes you just want the first letter of the string to be capitalized).

@kt3k
Member

kt3k commented Jul 16, 2024

Sounds like capitalizeWord is a good starting point? Let's document that the API only upper-cases the first letter and lower-cases the rest, and that no grammatical analysis is performed.

@lionel-rowe
Contributor

I think capitalizeWord is a sensible enough addition, but putting more complex letter casing functions inside of std runs the risk of massive scope creep unless you want to unduly privilege English over every other language. Still, you could easily build a "naive" title-case in userland on top of capitalizeWord + Intl.Segmenter(locale, { granularity: 'word' }).
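
A rough sketch of what such a userland "naive" title-case might look like. The function name naiveTitleCase and the 'en-US' default are assumptions for illustration, and the inline first-letter capitalization stands in for a public capitalizeWord, which does not exist yet.

```typescript
// Hypothetical userland helper: capitalizes every word-like segment and
// leaves whitespace and punctuation untouched. No grammar analysis.
function naiveTitleCase(text: string, locale: string = "en-US"): string {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let result = "";
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Naive on purpose: charAt(0) only handles a first letter that fits
    // in a single UTF-16 code unit.
    result += isWordLike
      ? segment.charAt(0).toLocaleUpperCase(locale) + segment.slice(1)
      : segment;
  }
  return result;
}

console.log(naiveTitleCase("hello wide world")); // "Hello Wide World"
```

This capitalizes every word, including ones a style guide would keep lower-case, which is why it only counts as "naive".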

Even with capitalizeWord alone there are a few non-trivial considerations:

  • What counts as the first "letter"? My suggestion would be the first grapheme cluster that matches /\p{L}/u
  • What happens to the rest of the string — is it left alone or lowercased?
  • Are any special cases required? The one that springs to mind is "ß", but I don't think that can appear at the start of words anyway.

Implementation could look something like this:

type CapitalizeWordOptions = {
    locale: string | Intl.Locale
    force: boolean
}

const defaults: CapitalizeWordOptions = {
    locale: 'en-US',
    force: false,
}

function capitalizeWord(word: string, options?: Partial<CapitalizeWordOptions>): string {
    const { locale, force } = { ...defaults, ...options }

    for (const { segment: grapheme, index } of new Intl.Segmenter(locale, { granularity: 'grapheme' }).segment(word)) {
        if (/\p{L}/u.test(grapheme)) {
            const before = word.slice(0, index)
            const after = word.slice(index + grapheme.length)
            const afterModified = force ? after.toLocaleLowerCase(locale) : after

            return before + grapheme.toLocaleUpperCase(locale) + afterModified
        }
    }

    return word
}

@luk3skyw4lker
Contributor

I looked up some references for the ucfirst function that the issue mentions, and I think we should go with the simple approach: just capitalize the first letter of the string (ranging from a to z), as the PHP docs describe. I'll leave the references in this comment.

https://www.php.net/manual/en/function.ucfirst.php
https://docs.rs/ucfirst/latest/ucfirst/
https://perldoc.perl.org/functions/ucfirst

I think that what @lionel-rowe said fits the description of the 'toTitleCase()' function better, and I agree with @timreichen that implementing that would be a great effort right now. With the ucfirst-based implementation we would be favoring the Latin alphabet, but I think that's ok for now.

@lionel-rowe
Contributor

lionel-rowe commented Jul 17, 2024

> I think that what @lionel-rowe said falls more on the description of the 'toTitleCase()' function, which I agree with @timreichen that it would be a great effort to do so by now

I'd suggest title casing is something that should permanently fall outside the scope of std, as proper dedicated libraries would handle it better. Otherwise it'd involve maintaining a list of "stop words" that shouldn't be capitalized for every supported language, and that's even without considering the various differing standards that exist (APA, AP, Chicago, etc.)
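
To illustrate the stop-word problem, here is a deliberately tiny English-only sketch. The function name and the SAMPLE_STOP_WORDS set are made up for this example and do not follow APA, AP, Chicago or any other style guide.

```typescript
// Made-up sample list. A real implementation would need a curated list
// per language and per style guide.
const SAMPLE_STOP_WORDS = new Set(["a", "an", "and", "in", "of", "the", "to"]);

// Hypothetical helper: splits on single spaces, keeps stop words
// lower-case (except in first position) and capitalizes everything else.
function englishTitleCaseSketch(title: string): string {
  return title
    .split(" ")
    .map((word, index) => {
      const lower = word.toLowerCase();
      if (index > 0 && SAMPLE_STOP_WORDS.has(lower)) return lower;
      return lower.charAt(0).toUpperCase() + lower.slice(1);
    })
    .join(" ");
}

console.log(englishTitleCaseSketch("the lord of the rings"));
// "The Lord of the Rings"
```

Even this toy version hard-codes English vocabulary, which is the maintenance burden described above.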

The reason I think capitalizing a single word could reasonably fall within the scope of std is that it's relatively speaking very simple to do in a reasonably robust, locale-aware way and doesn't require any hard-coded word lists.

With that said, I think it's worth distinguishing between a "dev-first" and a "user-first" approach to capitalization:

  • The "dev-first" approach can be relatively simple and can be used for cases such as code generation, dev tooling, etc. The current version in text/_util.ts and PHP's ucfirst both fall in this category. IMO neither of these implementations is great: ucfirst is extremely limited as it only handles ASCII (note that strings such as переменная and μεταβλητός are perfectly valid identifiers in JS), and text/_util.ts::capitalizeWord may give different results on different systems due to calling toLocale[Upper/Lower]Case with no locale specified.

    One nice DX enhancement you could do with the dev-first approach is replicate TS's implementation of the Capitalize utility type so you get type inference for free:

    function capitalize<T extends string>(str: T): Capitalize<T> {
        return str.charAt(0).toUpperCase() + str.slice(1) as Capitalize<T>
    }
    
    const capitalized: 'Foo' = capitalize('foo')
  • The "user-first" approach is for user-facing text and is locale aware. My implementation above is an example of this approach. Runtime implementation-wise it's slightly more complicated, whereas type-wise it's very simple, as there's no easy way to represent the return type in TypeScript other than string.

Generally speaking, dev-first capitalization is only for dev-centric use cases and should be avoided for user-facing text. On the other hand, the user-first approach can be used for both user-facing and dev-facing purposes, but in dev-facing scenarios you wouldn't get the type inference.
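
The locale dependence mentioned for toLocale[Upper/Lower]Case can be seen with the Turkish dotted/dotless i. A small sketch follows; upperFirst is a hypothetical helper, and the Turkish result assumes a runtime that ships full ICU locale data.

```typescript
// Hypothetical helper: upper-cases the first code unit, either with the
// locale-insensitive mapping or with an explicitly chosen locale.
function upperFirst(word: string, locale?: string): string {
  const first = word.charAt(0);
  const upper = locale === undefined
    ? first.toUpperCase()
    : first.toLocaleUpperCase(locale);
  return upper + word.slice(1);
}

console.log(upperFirst("istanbul")); // "Istanbul"
console.log(upperFirst("istanbul", "en-US")); // "Istanbul"
console.log(upperFirst("istanbul", "tr")); // "İstanbul" (U+0130, dotted capital I)
```

Calling toLocaleUpperCase() with no argument picks up the host's default locale, so the same code could produce either of the last two results depending on where it runs.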

@lionel-rowe
Contributor

lionel-rowe commented Jul 17, 2024

Further to that, looking at the usage of the capitalizeWord util, it seems pretty clear that to_[camel/kebab/pascal/snake]_case are dev-first functions, but with the notable drawback of using toLocale[Upper/Lower]Case without specifying a locale. I think the suitable use cases should be documented in the case of to_capitalized or whatever the equivalent public function would be called, as devs may have a reasonable expectation that it's a general-purpose function suitable for user-facing text (whereas that confusion is unlikely with camel/kebab/etc.)

@guy-borderless
Contributor

> > Maybe let's add toTitleCase()? That was suggested in the past in #3440 and #4082

> Implementing toTitleCase() is a slippery slope, because it needs grammar analysis (and possibly localization) for a proper implementation. This has also been discussed as a native api proposal, but seems like they haven't reached a conclusion (refs: #4082 (comment), https://es.discourse.group/t/proposal-string-prototype-capitalize/1662 and https://es.discourse.group/t/proposal-string-prototype-capitalize/1662).

> I would like to have toTitleCase() implemented in std, but I think this will take a huge effort to do right.

I suspect subtle, context-sensitive grammatical operations like title casing will mostly be done via low-cost LLMs where possible.

@0f-0b
Contributor

0f-0b commented Jan 26, 2025

MediaWiki uses a fairly simple algorithm to capitalize page titles, which is to change the first character C to Titlecase_Mapping(C) and leave the rest of the string as is. Titlecase_Mapping(C) is locale sensitive. Would this be a reasonable addition to std?

@lionel-rowe
Contributor

lionel-rowe commented Jan 26, 2025

> MediaWiki uses a fairly simple algorithm to capitalize page titles, which is to change the first character C to Titlecase_Mapping(C) and leave the rest of the string as is. Titlecase_Mapping(C) is locale sensitive. Would this be a reasonable addition to std?

IMO locale-sensitive by default is a bad idea for the reasons I outlined in #6016. Opt-in locale sensitivity may be useful though. If so, a uniform API should probably be provided for it across the relevant text APIs.

As for Titlecase_Mapping, I don't think there's any API to directly access that in JS. But in practice, I think it may be identical to upper-case in all but a very small list of exceptions (like 4 or 5 mappings IIRC?)

Edit: ok it's 31 total exceptions. But still only a short list, probably not too onerous to special-case them all:

const allCodePoints = Array.from({ length: 0x10ffff + 1 }, (_, i) => String.fromCodePoint(i))
const titleCaseRe = /\p{Lt}/u
allCodePoints.filter((x) => titleCaseRe.test(x))
// ['ǅ', 'ǈ', 'ǋ', 'ǲ', 'ᾈ', 'ᾉ', 'ᾊ', 'ᾋ', 'ᾌ', 'ᾍ', 'ᾎ', 'ᾏ', 'ᾘ', 'ᾙ', 'ᾚ', 'ᾛ', 'ᾜ', 'ᾝ', 'ᾞ', 'ᾟ', 'ᾨ', 'ᾩ', 'ᾪ', 'ᾫ', 'ᾬ', 'ᾭ', 'ᾮ', 'ᾯ', 'ᾼ', 'ῌ', 'ῼ']

@0f-0b
Contributor

0f-0b commented Jan 26, 2025

I put together an implementation of the algorithm. It turns out there are 135 characters whose titlecase mapping differs from their uppercase mapping. In total there are 1479 characters that change when titlecased.

Looking at the list of exceptions I realized a shortcoming of this algorithm: when the input string starts with an uppercase digraph, it lowercases the second letter in the digraph. capitalize("ǱǱ") becomes "ǲǱ".
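
For reference, a minimal sketch of the "titlecase the first character, leave the rest" idea. It is not the implementation mentioned above: titlecaseFirst and LATIN_DIGRAPH_TITLECASE are hypothetical names, and the table only special-cases the four Latin digraph families, falling back to toUpperCase() for everything else (so the Greek and other exceptions are not covered).

```typescript
// Assumption: only the Latin digraphs are special-cased. Each family
// (upper, title, lower) maps to its titlecase code point.
const LATIN_DIGRAPH_TITLECASE: Record<string, string> = {
  "\u01C4": "\u01C5", "\u01C5": "\u01C5", "\u01C6": "\u01C5", // DŽ family
  "\u01C7": "\u01C8", "\u01C8": "\u01C8", "\u01C9": "\u01C8", // LJ family
  "\u01CA": "\u01CB", "\u01CB": "\u01CB", "\u01CC": "\u01CB", // NJ family
  "\u01F1": "\u01F2", "\u01F2": "\u01F2", "\u01F3": "\u01F2", // DZ family
};

// Hypothetical helper: maps the first code point to titlecase (via the
// table) or to uppercase (fallback) and leaves the rest of the string as is.
function titlecaseFirst(str: string): string {
  const codePoint = str.codePointAt(0);
  if (codePoint === undefined) return str; // empty string
  const first = String.fromCodePoint(codePoint);
  const mapped = LATIN_DIGRAPH_TITLECASE[first] ?? first.toUpperCase();
  return mapped + str.slice(first.length);
}

// Lower-case digraph U+01C6 becomes titlecase U+01C5, not uppercase U+01C4.
console.log(titlecaseFirst("\u01C6ep") === "\u01C5ep"); // true
// The shortcoming: an all-caps input gets its first digraph "lowered"
// from U+01F1 to U+01F2.
console.log(titlecaseFirst("\u01F1\u01F1") === "\u01F2\u01F1"); // true
```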


7 participants