don't encode sms when calculating content length

we were previously encoding the utf-8 content, and measuring the number of bytes. We've been doing this since the beginning of time, and for most of time, it wasn't an issue. However, when we started accepting unicode characters (notably some welsh-only accents eg Ŵ or ŷ), they are encoded as two bytes rather than 1. So a message of 70 ŷ characters is 140 bytes long, but a message of 69 a characters and one ŷ is 71 characters long. However, in reality, when we send a message, if _any_ character in it is non-gsm (similar but not the same as ascii), then the entire thing is encoded in UTF-16, where every single character (including basic latin text) is two bytes long. So both ŷŷŷŷ... and aaaaa....aŷ are actually 70 utf-16 characters - we already take account of the double code point width when determining billable fragments. This means in a small amount of cases (where someone has a message that is almost at a message boundary in size, which contains a lot of welsh characters), we've over-billed that service. But this is probably an incredibly low amount of messages, and impossible to account for after 1 wk, so probably not worth worrying about.
alphagov · Nov 18, 2019 · aae1c82 · aae1c82
1 parent 1ab5a44
commit aae1c82
Show file tree

Hide file tree

Showing 3 changed files with 13 additions and 4 deletions.
diff --git a/notifications_utils/template.py b/notifications_utils/template.py
@@ -166,12 +166,19 @@ def prefix(self, value):
 
     @property
     def content_count(self):
-        return len((
+        """
+        Return the number of characters in the message. Note that we don't distinguish between GSM and non-GSM
+        characters at this point, as `get_sms_fragment_count` handles that separately.
+
+        Also note that if values aren't provided, will calculate the raw length of the unsubstituted placeholders,
+        as in the message `foo ((placeholder))` has a length of 19.
+        """
+        return len(
             # we always want to call SMSMessageTemplate.__str__ regardless of subclass, to avoid any html formatting
             SMSMessageTemplate.__str__(self)
             if self._values
             else sms_encode(add_prefix(self.content.strip(), self.prefix))
-        ).encode(self.encoding))
+        )
 
     @property
     def fragment_count(self):

diff --git a/notifications_utils/version.py b/notifications_utils/version.py
@@ -1 +1 @@
-__version__ = '36.1.1'
+__version__ = '36.1.2'
diff --git a/tests/test_base_template.py b/tests/test_base_template.py
@@ -119,8 +119,10 @@ def test_extracting_placeholders(template_content, template_subject, expected):
     "content,prefix, expected_length, expected_replaced_length",
     [
         ("The quick brown fox jumped over the lazy dog", None, 44, 44),
-        # should be replaced with a ?
+        # is an unsupported unicode character so should be replaced with a ?
         ("深", None, 1, 1),
+        # is a supported unicode character so should be kept as is
+        ("Ŵ", None, 1, 1),
         ("'First line.\n", None, 12, 12),
         ("\t\n\r", None, 0, 0),
         ("((placeholder))", None, 15, 3),