Unicode alphabets not supported? #22

blukis · 2017-04-23T14:16:12Z

It seems like Unicode alphabets are not supported, but perhaps I am doing something wrong? For example, the following code (Google Cloud Platform environment):

alphabet = "😀😃😄😁😆😅😂🤣☺️😊😇🙂🙃😉😌😍😘😗😙"
hashids = Hashids(salt='test3', alphabet=alphabet)
hashid = hashids.encode(123, 456, 789)
self.response.write('<p>alphabet:' + alphabet)
self.response.write('<p>hashid:' + hashid)

I get this output (in a browser):

alphabet:😀😃😄😁😆😅😂🤣☺️😊😇🙂🙃😉😌😍😘😗😙
hashid:����↉����

(i.e. I'd expect hashid to be some combination of faces from alphabet, but they are appearing as [mostly] unknown characters.) Any possibility to add support for these funkier alphabets? Thanks!


davidaurelio · 2017-04-29T23:42:03Z

Thanks for bringing it up – do you know what other implementations do (e.g. JS or PHP)?

The underlying issue can have two causes:

In both UTF-8 and UTF-16, these Unicode characters are represented by four bytes, and evidently the implementation is not prepared to handle that.
Most Python versions use a 16-bit character representation, depending on compilation settings. That means they can represent 65536 characters – effectively the contents of the Basic Multilingual Plane in Unicode. Emoji are located in the higher planes, though, and are represented as a “surrogate pair” – a combination of two 16-bit characters. That’s why you can observe the following:
```
>>> a = u'☺️'
>>> len(a)
2
```
AFAIK, Python can be compiled with support for 32-bit characters, though.
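To make the representation question concrete, here is a minimal sketch, assuming Python 3.3 or later (where `str` is indexed by code point regardless of build settings). The helper name `describe` is made up for illustration and is not part of hashids.

```python
def describe(text):
    """Return (code points, UTF-8 byte count, UTF-16 code unit count)."""
    code_points = [hex(ord(ch)) for ch in text]
    utf8_bytes = len(text.encode("utf-8"))
    # Each UTF-16 code unit is 2 bytes; "-le" avoids a byte order mark.
    utf16_units = len(text.encode("utf-16-le")) // 2
    return code_points, utf8_bytes, utf16_units


# Grinning face: one code point outside the Basic Multilingual Plane.
print(describe("\U0001F600"))    # (['0x1f600'], 4, 2)

# Smiling face plus variation selector: two code points, both inside the BMP.
print(describe("\u263A\uFE0F"))  # (['0x263a', '0xfe0f'], 6, 2)
```

Under these assumptions, a length of 2 can come either from a surrogate pair (on a 16-bit build) or from a symbol that is really two code points, so the two cases are worth telling apart.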

Given all this information, it might be possible to support tuples as alphabet to avoid the character representation complications altogether. Pull requests for that functionality would be very welcome.

You could probably unblock yourself by using a mapping mechanism in combination with hashids.
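A rough sketch of such a mapping mechanism, assuming the hashid was generated with a plain 16-character ASCII alphabet and that every character of the hashid comes from that alphabet. `ASCII_ALPHABET`, `EMOJI_ALPHABET`, `to_emoji` and `from_emoji` are hypothetical names, and the string `"cab"` merely stands in for a real hashid.

```python
# Hypothetical alphabets: one ASCII character per emoji symbol.
ASCII_ALPHABET = "abcdefghijklmnop"
EMOJI_ALPHABET = ["😀", "😃", "😄", "😁", "😆", "😅", "😂", "🤣",
                  "☺️", "😊", "😇", "🙂", "🙃", "😉", "😌", "😍"]

TO_EMOJI = dict(zip(ASCII_ALPHABET, EMOJI_ALPHABET))
FROM_EMOJI = {emoji: char for char, emoji in TO_EMOJI.items()}


def to_emoji(hashid):
    """Translate an ASCII hashid into a list of emoji symbols."""
    return [TO_EMOJI[ch] for ch in hashid]


def from_emoji(symbols):
    """Translate a list of emoji symbols back into the ASCII hashid."""
    return "".join(FROM_EMOJI[symbol] for symbol in symbols)


symbols = to_emoji("cab")   # "cab" stands in for a real hashid
print("".join(symbols))     # join only for display
print(from_emoji(symbols))  # cab
```

Keeping the symbols in a list rather than a joined string sidesteps the question of where one symbol ends and the next begins, since an entry such as "☺️" spans more than one code point.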

blukis · 2017-04-30T18:17:22Z

The PHP implementation has the same limitation it seems. The Swift implementation mentions in the readme that it supports emoji alphabets by using an array rather than a string internally. https://github.com/malczak/hashids

I've unblocked myself by implementing my own thing, in a non-generic way (and in PHP, incidentally).

davidaurelio · 2017-04-30T19:19:38Z

If swift supports arrays, we can certainly support lists/tuples.

blukis · 2017-04-30T21:28:32Z

My understanding from the Swift readme is the array-thing is an inner implementation detail, and not a change in API (i.e. alphabet is still passed in from outside as a string). Not sure if that would affect your plan here.

I might actually be able to speak to this (a little bit, from a JavaScript perspective, anyway):

"☺️" is (and should be) actually 2 characters (that happen to get rendered as one grapheme) - "\u{263a}" (smiley face) and "\u{fe0f}" (invisible "emoji variation selector" https://en.wikipedia.org/wiki/Variant_form_(Unicode) )
"😽" while larger than 16bit range, is 1 character - "\u{1f63d}"

In JavaScript, characters that need more than 2 bytes are also represented as 16-bit "surrogate pairs", i.e. "😽".length == 2 in JS as well. But the underlying Unicode code point of each character is still accessible with this JS code (excerpt from https://github.com/bestiejs/punycode.js/blob/master/punycode.js):

function ucs2decode(string) {
	const output = [];
	let counter = 0;
	const length = string.length;
	while (counter < length) {
		const value = string.charCodeAt(counter++);
		if (value >= 0xD800 && value <= 0xDBFF && counter < length) {
			// It's a high surrogate, and there is a next character.
			const extra = string.charCodeAt(counter++);
			if ((extra & 0xFC00) == 0xDC00) { // Low surrogate.
				output.push(((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000);
			} else {
				// It's an unmatched surrogate; only append this code unit, in case the
				// next code unit is the high surrogate of a surrogate pair.
				output.push(value);
				counter--;
			}
		} else {
			output.push(value);
		}
	}
	return output;
}

e.g.

ucs2decode("a") = [97]
ucs2decode("a☺️") = [ 97, 9786, 65039 ]
ucs2decode("a☺️😽") = [ 97, 9786, 65039, 128573 ]

Maybe the same principles hold for Python too?
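For comparison, a rough Python 3 port of the same idea, working on a list of UTF-16 code units. This is only a sketch: `utf16_code_units` is a hypothetical helper, and on Python 3.3+ the much simpler `[ord(ch) for ch in text]` already yields the code points directly.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def ucs2decode(units):
    """Combine surrogate pairs in a list of UTF-16 code units into code points."""
    output = []
    i = 0
    while i < len(units):
        value = units[i]
        i += 1
        if 0xD800 <= value <= 0xDBFF and i < len(units):
            # High surrogate, and there is a next code unit.
            extra = units[i]
            if (extra & 0xFC00) == 0xDC00:  # Low surrogate.
                output.append(((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000)
                i += 1
            else:
                # Unmatched surrogate: keep it and re-examine the next unit.
                output.append(value)
        else:
            output.append(value)
    return output


text = "a\u263A\uFE0F\U0001F63D"
print(ucs2decode(utf16_code_units(text)))  # [97, 9786, 65039, 128573]
```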

davidaurelio · 2020-07-14T20:13:24Z

Hi @blukis – sorry for going underground for so long. Is this still a feature you’d be interested in? Are more implementations able to handle emojis and Unicode combining characters these days?

There is a whole rabbit hole with unicode normalisation waiting, but maybe we can do something simple.

blukis · 2020-07-15T02:44:41Z

I can't be of much help on this unfortunately, nor have a use for it currently. I wound up kind of "pivoting to PHP" for that project, and the need went away. 🙃

davidaurelio · 2020-07-15T07:17:52Z

Fair enough – I will close this issue for now. If you or anybody happen to end up needing this, I will take another look. I would assume there is existing functionality to split unicode strings these days.
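For anyone picking this up later, here is a deliberately simplified sketch of splitting a string into user-visible symbols with only the standard library. It is an assumption-laden approximation, not full Unicode grapheme segmentation: it only glues variation selectors, combining marks and zero-width-joiner sequences onto the preceding symbol, and it ignores cases such as skin tone modifiers and flag sequences. `split_symbols` is a hypothetical name.

```python
import unicodedata

VARIATION_SELECTORS = {"\uFE0E", "\uFE0F"}
ZWJ = "\u200D"  # zero width joiner


def split_symbols(text):
    """Split text into symbols, keeping modifiers with their base character."""
    symbols = []
    join_next = False  # True right after a ZWJ: the next character joins too
    for ch in text:
        attaches = (
            ch in VARIATION_SELECTORS
            or ch == ZWJ
            or unicodedata.combining(ch) != 0
        )
        if symbols and (attaches or join_next):
            symbols[-1] += ch
        else:
            symbols.append(ch)
        join_next = ch == ZWJ
    return symbols


print(len(split_symbols("\U0001F600\u263A\uFE0F\U0001F60A")))  # 3
```

The resulting list could then serve as the alphabet in a list- or tuple-based API like the one proposed earlier in this thread, if such support were added.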

davidaurelio closed this as completed Jul 15, 2020
