Character count component counts code points, not characters #1104

Open
36degrees opened this issue Dec 21, 2018 · 12 comments
Labels
🐛 bug Something isn't working the way it should (including incorrect wording in documentation) character count javascript

Comments

@36degrees
Contributor

The character count currently uses string.length to establish the length of the user input. string.length counts code units, not characters, and this can lead to some confusing results when using certain strings.

A single emoji (👩🏻‍🚀) counted as 7 characters within the character count component

You can see this by entering the following strings into the character count component:

| String | Result |
| --- | --- |
| cat😹 | The emoji is counted as 2 code units, and the length is reported as 5 characters. |
| cafȩ́ | Each combining mark is counted separately, and the reported length is 6 characters. |
| 👩🏻‍🚀 | Because this emoji includes both gender and skin modifiers and a zero-width joiner, this single character is counted as 7 characters. |
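To make the counts above easier to check, here is a minimal sketch that lists the UTF-16 code units `string.length` is counting. The strings are written with escapes so the code points are unambiguous; the composition of the accented "e" (a plain `e` plus two combining marks) is an assumption inferred from the reported length of 6, and `codeUnits` is a hypothetical helper name.

```javascript
// Strings written with escapes so the exact code points are unambiguous.
const samples = {
  catEmoji: 'cat\u{1F639}',                       // "cat" + one emoji outside the BMP
  cafeMarks: 'cafe\u0327\u0301',                  // assumed: "e" + combining cedilla + combining acute
  astronaut: '\u{1F469}\u{1F3FB}\u200D\u{1F680}'  // woman + skin tone + zero-width joiner + rocket
};

// Hypothetical helper: list each UTF-16 code unit as 4-digit hex.
// These are the units that string.length counts.
function codeUnits (str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, '0'));
  }
  return units;
}

for (const [name, str] of Object.entries(samples)) {
  console.log(name, str.length, codeUnits(str).join(' '));
}
// Lengths logged: catEmoji 5, cafeMarks 6, astronaut 7
```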

We should probably find a less naive way to count characters in strings, but we also need to work out how this will work with any backend validation or data storage on a service, which may already be using a different definition of a 'character' (for example, where the backend or storage treats one character as one byte).

Further reading:

@36degrees changed the title from "Character count component counts marks, not characters" to "Character count component counts code points, not characters" on Dec 21, 2018
@selfthinker

Another good article on the subject: https://blog.jonnew.com/posts/poo-dot-length-equals-two

@dashouse

dashouse commented Jan 8, 2019

@36degrees this seems like it could end up being quite a serious bug if used in a service with multiple language support.

If we can't fix it, should it be documented?

@36degrees 36degrees added the 🐛 bug Something isn't working the way it should (including incorrect wording in documentation) label Jan 16, 2019
@NickColley
Contributor

Seems like a robust solution is very code-heavy, which would not be suitable for client-side use.

I think Dave's suggestion of leaving it as is but documenting how it works would be the best way forwards...

@simonneb

We noticed a similar issue in the character counts on the GOV.UK Notify service when sending non-English characters to the service. It turns out that Notify was counting bytes and not characters - this was fixed by the team.
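As a rough illustration of how a byte count diverges from a character count, the sketch below compares `string.length` with the UTF-8 byte length. It assumes the backend stores text as UTF-8, which the thread does not confirm, and `getUtf8ByteLength` is a hypothetical helper name rather than anything Notify uses.

```javascript
// Hypothetical helper: bytes needed to store a string as UTF-8 (assumed encoding).
function getUtf8ByteLength (str) {
  return new TextEncoder().encode(str).length;
}

console.log('cat'.length, getUtf8ByteLength('cat'));                   // 3 3
console.log('caf\u00E9'.length, getUtf8ByteLength('caf\u00E9'));       // 4 5
console.log('cat\u{1F639}'.length, getUtf8ByteLength('cat\u{1F639}')); // 5 7
```

ASCII text gives the same number either way, which is why this kind of mismatch tends to surface only once non-English text is entered.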

@lfdebrux
Member

MDN suggests that you can use the string iterator to count characters:

function getCharacterLength (str) {
  // The string iterator that is used here iterates over characters,
  //  not mere code units
  return [...str].length;
}

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length

Might be worth spiking....
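For anyone spiking this, a quick sketch of what the iterator approach returns for the three strings from the issue description. The string iterator walks code points, so it fixes the surrogate-pair case but still counts combining marks and joined emoji separately. `getCodePointLength` is a hypothetical name, and the composition of the accented "e" is an assumption.

```javascript
// Hypothetical helper: same technique as the MDN snippet, counting code points.
function getCodePointLength (str) {
  return [...str].length;
}

console.log(getCodePointLength('cat\u{1F639}'));                       // 4 (string.length gives 5)
console.log(getCodePointLength('cafe\u0327\u0301'));                   // 6 (combining marks still counted)
console.log(getCodePointLength('\u{1F469}\u{1F3FB}\u200D\u{1F680}'));  // 4 (string.length gives 7)
```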

@querkmachine
Member

Intl.Segmenter is a more recent addition to the JS Internationalization API which can split strings into "graphemes" (user-perceived characters), rather than code points. Like other parts of Intl, it's locale-aware and can change how it counts depending on the configuration.

It's been available for a short time in Chromium browsers (Chrome and Edge 87, Opera 73, Samsung 14) and Safari (14.1), but is not yet supported in Firefox.

A potential issue with this is that it doesn't count new lines. New lines are registered as a code point, but are not considered graphemes as they are not "user-perceivable" in the same way something like a space character is: they have a blank glyph and no width.
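A minimal sketch of grapheme counting with `Intl.Segmenter`, assuming a runtime that supports it. `getGraphemeLength` is a hypothetical name. How line breaks are counted is worth confirming in each target browser, so the sketch logs counts for strings containing them rather than assuming a result.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
// Passing undefined as the locale uses the runtime's default locale.
function getGraphemeLength (str, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

console.log(getGraphemeLength('cat\u{1F639}'));                       // 4
console.log(getGraphemeLength('cafe\u0327\u0301'));                   // 4
console.log(getGraphemeLength('\u{1F469}\u{1F3FB}\u200D\u{1F680}'));  // 1

// Line breaks: log the counts to see how the runtime treats them.
console.log(getGraphemeLength('a\nb'), getGraphemeLength('a\r\nb'));
```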


Relatedly, do we need to be sure that service teams aren't using the character count to convey technical limitations? For example, if a database column can only support a maximum of 512 characters, then they do want to limit the input to 512 code points, not 512 graphemes. Would this need to be a configuration option?

@36degrees
Contributor Author

> Relatedly, do we need to be sure that service teams aren't using the character count to convey technical limitations? For example, if a database column can only support a maximum of 512 characters, then they do want to limit the input to 512 code points, not 512 graphemes. Would this need to be a configuration option?

I'd suggest making it possible to pass a custom counting function – see #1364.

When we do change the counting implementation we should treat it as a breaking change – and we might want to do #1364 first, so service teams can 'override back' to the current code point-based approach.
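A rough sketch of what a custom counting function option could look like. Everything here is hypothetical: `createCounter`, `countFunction` and `maxLength` are illustrative names, not the API proposed in #1364 or anything in the component today. The default keeps the existing `string.length` behaviour so that a service could 'override back' to it.

```javascript
// Hypothetical factory: names and shape are illustrative only.
function createCounter (options = {}) {
  // Default to the existing string.length behaviour.
  const countFunction = options.countFunction || ((text) => text.length);
  const maxLength = options.maxLength;

  // Returns how many characters remain (negative when over the limit).
  return function remaining (text) {
    return maxLength - countFunction(text);
  };
}

const legacyRemaining = createCounter({ maxLength: 10 });
const codePointRemaining = createCounter({
  maxLength: 10,
  countFunction: (text) => [...text].length
});

console.log(legacyRemaining('cat\u{1F639}'));    // 5
console.log(codePointRemaining('cat\u{1F639}')); // 6
```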

@colinrotherham
Contributor

Thanks to @querkmachine for linking me to this issue

We both think a recently spotted issue in Internet Explorer 8 is likely caused by new lines being counted as two characters, with new lines represented as either \n (POSIX default) or \r\n (Windows default).

Grapheme counting code examples look huge, but it would be great to align client-side and server-side counts. Having the "custom counting function" return a Promise would allow a fetch() (or AJAX) response to return the count if really necessary.

Google Chrome

Shows "You have 23 characters too many"
Character Count screenshot from Google Chrome

Internet Explorer

Shows "You have 26 characters too many"
Character Count screenshot from Internet Explorer
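The Promise idea mentioned above could be sketched roughly as follows. All names are hypothetical, and the "remote" counter is simulated with `Promise.resolve` rather than a real `fetch()` call, since no server endpoint is defined in this thread. Using `await` on the result means the same code path accepts both synchronous and Promise-returning count functions.

```javascript
// Hypothetical: accepts a count function that returns a number or a Promise of one.
async function getRemaining (text, maxLength, countFunction) {
  const count = await countFunction(text);
  return maxLength - count;
}

// Synchronous, client-side count (code points).
const localCount = (text) => [...text].length;

// Stand-in for a server-side count; a real version might wrap fetch().
// Simulated here as a UTF-8 byte count, which is an assumption.
const simulatedRemoteCount = (text) =>
  Promise.resolve(new TextEncoder().encode(text).length);

getRemaining('cat\u{1F639}', 10, localCount)
  .then((remaining) => console.log('local:', remaining));   // local: 6
getRemaining('cat\u{1F639}', 10, simulatedRemoteCount)
  .then((remaining) => console.log('remote:', remaining));  // remote: 3
```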

@colinrotherham
Contributor

Think we can let IE8 off here

> We both think a recently spotted issue in Internet Explorer 8 is likely caused by new lines being counted as two characters, with new lines represented as either \n (POSIX default) or \r\n (Windows default).

In 2016 the HTML Standard switched minlength/maxlength new line normalisation from \r\n to \n

Consensus wasn't found on characters, code points and grapheme clusters:

Interesting that WebKit is sticking with grapheme clusters to avoid user confusion:

@dav-idc

dav-idc commented Jan 10, 2023

Here's a recent comment on the character count backlog issue: alphagov/govuk-design-system-backlog#67 (comment)

Here the issue doesn't appear to be related to a specific browser, but rather that the frontend counts \n as 1 'character', but \n gets stored as 2 characters in the backend.
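A small sketch of the mismatch, assuming it comes from line breaks reaching the backend as \r\n while the frontend sees \n. Both helper names are hypothetical; which normalisation is appropriate depends on what the backend actually validates against.

```javascript
// Hypothetical helper: count as a backend would if every line break is \r\n.
function countWithCrlf (text) {
  return text.replace(/\r\n|\r|\n/g, '\r\n').length;
}

// Hypothetical helper: count with every line break normalised to \n.
function countWithLf (text) {
  return text.replace(/\r\n|\r/g, '\n').length;
}

const value = 'line one\nline two';

console.log(value.length);                            // 17
console.log(countWithCrlf(value));                    // 18
console.log(countWithLf('line one\r\nline two'));     // 17
```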

@mgladdish

We've had a user report of this issue in production today - the frontend character count not matching the backend validation rule. It's deeply confusing for the end user and isn't a great look for our service when it appears it can't even count words consistently.

@36degrees
Contributor Author

> Intl.Segmenter is a more recent addition to the JS Internationalization API which can split strings into "graphemes" (user-perceived characters), rather than code points. Like other parts of Intl, it's locale-aware and can change how it counts depending on the configuration.

> It's been available for a short time in Chromium browsers (Chrome and Edge 87, Opera 73, Samsung 14) and Safari (14.1), but is not yet supported in Firefox.

> A potential issue with this is that it doesn't count new lines. New lines are registered as a code point, but are not considered graphemes as they are not "user-perceivable" in the same way something like a space character is: they have a blank glyph and no width.

Intl.Segmenter landed in Firefox 125 back in April 2024, so we're probably at the point where we could consider using it, perhaps as part of an opt-in alternative count function that users can configure.

We should make sure to benchmark its performance, especially on lower-powered devices and in some of the older browsers that include it.

We may also need to look at reducing the number of times the count function is called.
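One possible shape for this, as a sketch only: feature-detect `Intl.Segmenter`, reuse a single instance, fall back to code points where it is missing, and cache the last result so an unchanged value is not re-segmented. All names are hypothetical, and the fallback assumes ES2015 spread syntax is available.

```javascript
// Create one Segmenter up front instead of one per keystroke.
const graphemeSegmenter =
  (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function')
    ? new Intl.Segmenter(undefined, { granularity: 'grapheme' })
    : null;

// Hypothetical count function: graphemes where supported, code points otherwise.
function countCharacters (text) {
  if (graphemeSegmenter) {
    let count = 0;
    for (const _segment of graphemeSegmenter.segment(text)) {
      count++;
    }
    return count;
  }
  return [...text].length;
}

// Hypothetical wrapper: remember the last input and skip recounting if unchanged.
function memoizeLast (fn) {
  let hasValue = false;
  let lastInput;
  let lastResult;
  return function (input) {
    if (!hasValue || input !== lastInput) {
      lastInput = input;
      lastResult = fn(input);
      hasValue = true;
    }
    return lastResult;
  };
}

const cachedCount = memoizeLast(countCharacters);
console.log(cachedCount('cat\u{1F639}')); // 4
console.log(cachedCount('cat\u{1F639}')); // 4, served from the cache
```

Caching only the last value keeps memory use flat; debouncing the input handler would be a complementary way to cut the number of calls.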
