Character count component counts code points, not characters #1104

36degrees · 2018-12-21T14:26:49Z

The character count currently uses string.length to establish the length of the user input. string.length counts code units, not characters, and this can lead to some confusing results when using certain strings.

You can see this by entering the following strings into the character count component:

String	Result
cat😹	The emoji is counted as 2 code units, and the length is reported as 5 characters.
cafȩ́	Each combining mark is counted separately, and the reported length is 6 characters.
👩🏻‍🚀	Because this emoji includes both gender and skin modifiers and a zero-width joiner, this single character is counted as 7 characters.
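The counts in the table can be reproduced with a short sketch. The strings are written with escape sequences so the exact code points are unambiguous. The decomposition used for the second string (plain "e" followed by two combining marks) is an assumption about how it was typed.

```javascript
// Each string is spelled out with escapes so the code points are explicit.
const samples = {
  // "cat" + one emoji outside the Basic Multilingual Plane
  catEmoji: "cat\u{1F639}",
  // "cafe" + combining cedilla + combining acute (assumed decomposition)
  combiningMarks: "cafe\u0327\u0301",
  // woman + skin tone modifier + zero-width joiner + rocket
  astronaut: "\u{1F469}\u{1F3FB}\u200D\u{1F680}",
};

// String.prototype.length reports UTF-16 code units.
const codeUnitCounts = Object.fromEntries(
  Object.entries(samples).map(([name, text]) => [name, text.length])
);

console.log(codeUnitCounts);
// { catEmoji: 5, combiningMarks: 6, astronaut: 7 }
```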

We should probably find a less naive way to count characters in strings, but we also need to work out how this will work with any backend validation or data storage on a service, which may already be using a different definition of a 'character' (for example, where the backend or storage treats one character as one byte).
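To illustrate the mismatch with a backend that counts bytes, here is a minimal sketch using the standard TextEncoder API to measure UTF-8 length. That a given backend stores text as UTF-8 is an assumption, and `countUtf8Bytes` is a hypothetical helper name.

```javascript
// Hypothetical helper: number of bytes the text occupies when encoded as UTF-8.
function countUtf8Bytes(text) {
  return new TextEncoder().encode(text).length;
}

const text = "cat\u{1F639}"; // "cat" followed by one emoji

console.log(text.length); // 5 UTF-16 code units
console.log(countUtf8Bytes(text)); // 7 bytes: 3 for "cat", 4 for the emoji
```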

Further reading:

selfthinker · 2018-12-21T16:17:43Z

Another good article on the subject: https://blog.jonnew.com/posts/poo-dot-length-equals-two

dashouse · 2019-01-08T14:58:12Z

@36degrees this seems like it could end up being quite a serious bug if used in a service with multiple language support.

If we can't fix it, should it be documented?

NickColley · 2019-08-05T11:12:40Z

It seems that a robust solution would be very code-heavy, which would not be suitable for client-side use.

I think Dave's suggestion of leaving it as is but documenting how it works would be the best way forwards...

simonneb · 2020-06-25T10:39:23Z

We noticed a similar issue in the character counts on the GOV.UK Notify service when sending non-English characters to the service. It turns out that Notify was counting bytes and not characters - this was fixed by the team.

lfdebrux · 2021-10-27T15:56:13Z

MDN suggests that you can use the string iterator to count characters

function getCharacterLength (str) {
  // The string iterator that is used here iterates over code points,
  //  not UTF-16 code units
  return [...str].length;
}

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length

Might be worth spiking....
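As a quick check of what the iterator approach would and would not fix, here is a small sketch run against the three strings from the issue description (written with escapes, and assuming the second string uses separate combining marks). `countCodePoints` is a hypothetical name.

```javascript
// Counts Unicode code points: the string iterator yields one code point
// per step, so a surrogate pair is counted once.
function countCodePoints(text) {
  return [...text].length;
}

console.log(countCodePoints("cat\u{1F639}")); // 4
console.log(countCodePoints("cafe\u0327\u0301")); // 6
console.log(countCodePoints("\u{1F469}\u{1F3FB}\u200D\u{1F680}")); // 4
```

So this handles the simple emoji case, but combining marks and joined emoji sequences are still counted as several characters.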

querkmachine · 2022-12-01T14:19:37Z

Intl.Segmenter is a more recent addition to the JS Internationalization API which can split strings into "graphemes" (user-perceived characters), rather than code points. Like other parts of Intl, it's locale-aware and can change how it counts depending on the configuration.

It's been available for a short time in Chromium browsers (Chrome and Edge 87, Opera 73, Samsung 14) and Safari (14.1), but is not yet supported in Firefox.
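A minimal sketch of what grapheme counting with Intl.Segmenter could look like, with a code point fallback for browsers that lack it. `countGraphemes` is a hypothetical helper, not an existing component API, and the exact results depend on the Unicode data shipped with the engine.

```javascript
// Hypothetical helper: counts grapheme clusters where Intl.Segmenter exists,
// and falls back to counting code points elsewhere.
function countGraphemes(text, locale) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    let count = 0;
    for (const _segment of segmenter.segment(text)) {
      count += 1;
    }
    return count;
  }
  return [...text].length;
}

console.log(countGraphemes("cat\u{1F639}")); // 4
console.log(countGraphemes("cafe\u0327\u0301")); // 4
console.log(countGraphemes("\u{1F469}\u{1F3FB}\u200D\u{1F680}")); // 1
```

How line breaks are counted is worth checking directly in the target browsers before relying on this.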

A potential issue with this is that it doesn't count new lines. New lines are registered as a code point, but are not considered graphemes as they are not "user-perceivable" in the same way something like a space character is; they have a blank glyph and no width.

Relatedly, do we need to be sure that service teams aren't using the character count to convey technical limitations? For example, if a database column can only support a maximum of 512 characters, then they do want to limit the input to 512 code points, not 512 graphemes. Would this need to be a configuration option?

36degrees · 2022-12-01T15:06:09Z

> Relatedly, do we need to be sure that service teams aren't using the character count to convey technical limitations? For example, if a database column can only support a maximum of 512 characters, then they do want to limit the input to 512 code points, not 512 graphemes. Would this need to be a configuration option?

I'd suggest making it possible to pass a custom counting function – see #1364.

When we do change the counting implementation we should treat it as a breaking change – and we might want to do #1364 first, so service teams can 'override back' to the current code point-based approach.
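A rough sketch of how a custom counting function might be passed in, with the current `string.length` behaviour kept as the default so teams can 'override back'. Every name here (`createCharacterCounter`, `countFunction`, `countStrategies`) is hypothetical and not the actual component API.

```javascript
// Hypothetical sketch: none of these names are the real govuk-frontend API.
const countStrategies = {
  codeUnits: (text) => text.length, // current behaviour (String.prototype.length)
  codePoints: (text) => [...text].length,
};

// Returns a function reporting how many characters remain under the limit,
// using whichever counting function was configured.
function createCharacterCounter({ maxlength, countFunction = countStrategies.codeUnits }) {
  return function remaining(text) {
    return maxlength - countFunction(text);
  };
}

const defaultCounter = createCharacterCounter({ maxlength: 10 });
const codePointCounter = createCharacterCounter({
  maxlength: 10,
  countFunction: countStrategies.codePoints,
});

console.log(defaultCounter("cat\u{1F639}")); // 5 remaining
console.log(codePointCounter("cat\u{1F639}")); // 6 remaining
```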

colinrotherham · 2022-12-09T15:57:25Z

Thanks to @querkmachine for linking me to this issue

We both think a recently spotted issue in Internet Explorer 8 is likely caused by new lines being counted as two characters, with new lines represented either as \n (POSIX default) or \r\n (Windows default).

Grapheme counting code examples look huge, but it would be great to align client-side and server-side counts. Having the "custom counting function" as a Promise would allow a fetch() (or AJAX) response to return the count if really necessary.

https://github.com/orling/grapheme-splitter#readme

Google Chrome

Shows "You have 23 characters too many"

Internet Explorer

Shows "You have 26 characters too many"

colinrotherham · 2022-12-09T16:18:19Z

Think we can let IE8 off here

> We both think a recently spotted issue in Internet Explorer 8 is likely caused by new lines being counted as two characters, with new lines represented either as \n (POSIX default) or \r\n (Windows default).

In 2016 the HTML Standard switched minlength/maxlength new line normalisation from \r\n to \n

Change minlength/maxlength behavior around linebreaks whatwg/html#1712

Consensus wasn't found on characters, code points and grapheme clusters:

Change minlength/maxlength behavior around linebreaks and code points whatwg/html#1517

Interesting that WebKit is sticking with grapheme clusters to avoid user confusion:

Count text length for maxLength check with the standard way
https://bugs.webkit.org/show_bug.cgi?id=120030 RESOLVED WONTFIX

dav-idc · 2023-01-10T15:17:18Z

Here's a recent comment on the character count backlog issue: alphagov/govuk-design-system-backlog#67 (comment)

Here the issue doesn't appear to be related to a specific browser, but rather that the frontend counts \n as 1 'character', but \n gets stored as 2 characters in the backend.
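One way to line the frontend count up with a backend that receives \r\n line breaks would be to normalise line breaks before counting. This is only a sketch: `countWithCrlfLineBreaks` is a hypothetical helper, and it assumes the backend counts each line break as two characters.

```javascript
// Hypothetical helper: counts code units as if every line break were CRLF.
function countWithCrlfLineBreaks(text) {
  // Match CRLF first so an existing "\r\n" is not expanded twice.
  const normalised = text.replace(/\r\n|\r|\n/g, "\r\n");
  return normalised.length;
}

const value = "line one\nline two\nline three";

console.log(value.length); // 28
console.log(countWithCrlfLineBreaks(value)); // 30
```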

mgladdish · 2023-08-18T09:08:15Z

We've had a user report of this issue in production today - the frontend character count not matching the backend validation rule. It's deeply confusing for the end user and isn't a great look for our service when it appears it can't even count words consistently.

36degrees · 2024-06-14T16:09:12Z

> Intl.Segmenter is a more recent addition to the JS Internationalization API which can split strings into "graphemes" (user-perceived characters), rather than code points. Like other parts of Intl, it's locale-aware and can change how it counts depending on the configuration.
>
> It's been available for a short time in Chromium browsers (Chrome and Edge 87, Opera 73, Samsung 14) and Safari (14.1), but is not yet supported in Firefox.
>
> A potential issue with this is that it doesn't count new lines. New lines are registered as a code point, but are not considered graphemes as they are not "user-perceivable" in the same way something like a space character is; they have a blank glyph and no width.

Intl.Segmenter landed in Firefox 125 back in April, so we're probably at the point where we could consider using it, perhaps as part of an opt-in alternative count function that users can configure.

We should make sure to benchmark its performance, especially on lower-powered devices and in some of the older browsers that include it.

We may also need to look at reducing the number of times the count function is called.
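One simple way to cut down on repeated work, sketched here under the assumption that the count is often requested for unchanged text (for example from a polling timer), is to cache the last result. `cacheLastCount` is a hypothetical wrapper.

```javascript
// Hypothetical wrapper: skips recounting when the text is unchanged
// since the previous call.
function cacheLastCount(countFunction) {
  let lastText = null;
  let lastCount = 0;
  return function cachedCount(text) {
    if (text !== lastText) {
      lastText = text;
      lastCount = countFunction(text);
    }
    return lastCount;
  };
}

// Track how often the underlying counting function actually runs.
let calls = 0;
const countCodePoints = (text) => {
  calls += 1;
  return [...text].length;
};
const cachedCount = cacheLastCount(countCodePoints);

cachedCount("hello");
cachedCount("hello"); // served from the cache
cachedCount("hello!");
console.log(calls); // 2
```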

36degrees changed the title ~~Character count component counts marks, not characters~~ Character count component counts code points, not characters Dec 21, 2018

36degrees added the 🐛 bug Something isn't working the way it should (including incorrect wording in documentation) label Jan 16, 2019

aliuk2012 added the submitted-by-user label Jan 22, 2019

timpaul added Effort: days labels May 20, 2019

36degrees mentioned this issue May 20, 2019

Character count's character/word count functions should be customisable #1364

Open

NickColley mentioned this issue Oct 11, 2019

Character count nhsuk/nhsuk-service-manual-community-backlog#172

Open

36degrees added character count and removed Priority: low labels Mar 27, 2020

hannalaakso added the javascript label Mar 30, 2020

36degrees mentioned this issue Jun 25, 2020

Character Count 'count message' cannot be customised or translated #1681

Closed

EoinShaughnessy mentioned this issue Oct 27, 2021

Tell users about the risk of using emojis and accents in character count alphagov/govuk-design-system#1969

Closed

4 tasks

EoinShaughnessy mentioned this issue Jan 20, 2022

Tell users about emoji and accent effect on count alphagov/govuk-design-system#2026

Merged

dav-idc mentioned this issue Jan 10, 2023

Character count alphagov/govuk-design-system-backlog#67

Open

kellylee-gds removed the 🕔 days label Feb 14, 2023

36degrees removed the submitted by user label Feb 14, 2024
