Unicode alphabets not supported? #22

Closed
blukis opened this issue Apr 23, 2017 · 7 comments

Comments

@blukis

blukis commented Apr 23, 2017

It seems like unicode alphabets are not supported, but perhaps I am doing something wrong? e.g. the following code (Google Cloud Platform environment)...

alphabet = "😀😃😄😁😆😅😂🤣☺️😊😇🙂🙃😉😌😍😘😗😙"
hashids = Hashids(salt='test3', alphabet=alphabet)
hashid = hashids.encode(123, 456, 789)
self.response.write('<p>alphabet:' + alphabet)
self.response.write('<p>hashid:' + hashid)

I get this output (in a browser):

alphabet:😀😃😄😁😆😅😂🤣☺️😊😇🙂🙃😉😌😍😘😗😙
hashid:����↉����

(i.e. I'd expect hashid to be some combination of faces from alphabet, but they are appearing as [mostly] unknown characters.) Any possibility to add support for these funkier alphabets? Thanks!

@davidaurelio
Owner

Thanks for bringing it up – do you know what other implementations do (e.g. JS or PHP)?

The underlying issue can have two reasons:

  • in both UTF-8 and UTF-16, these unicode characters are represented by four bytes, and evidently the implementation is not prepared to handle that.

  • Most python versions use a 16 bit character representation, depending on compilation settings. That means they can represent 65536 characters – effectively the contents of the basic multilingual plane in unicode. Emoji are located in the higher planes, though, and are represented as a “surrogate pair” – a combination of two 16 bit characters. That’s why you can observe the following:

    >>> a = u'☺️'
    >>> len(a)
    2
    
    
  • afaik, Python can be compiled with support for 32 bit characters, though.
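The sizes mentioned in the list above can be checked directly. This is only a minimal sketch, assuming Python 3 (where `str` is indexed by code point whatever the build); the variable names are illustrative. It also suggests that the `len(a) == 2` result for the smiley may come from the trailing variation selector rather than from a surrogate pair, since both of its code points sit inside the basic multilingual plane.

```python
# U+1F63D lies outside the basic multilingual plane.
cat = "\U0001F63D"

utf8_bytes = cat.encode("utf-8")
utf16_bytes = cat.encode("utf-16-le")  # little-endian, no byte order mark

# Four bytes in either encoding; in UTF-16 that is two 16 bit code units.
print(len(utf8_bytes))        # 4
print(len(utf16_bytes) // 2)  # 2

# The two UTF-16 code units form a surrogate pair.
high = int.from_bytes(utf16_bytes[0:2], "little")
low = int.from_bytes(utf16_bytes[2:4], "little")
print(hex(high), hex(low))    # 0xd83d 0xde3d

# The smiley from the example: base character plus variation selector.
smiley = "\u263A\uFE0F"
print(len(smiley))            # 2
```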

Given all this information, it might be possible to support tuples as alphabet to avoid the character representation complications altogether. Pull requests for that functionality would be very welcome.

You could probably unblock yourself by using a mapping mechanism in combination with hashids.
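One way such a mapping mechanism could look, as a rough sketch only: hashids is given a plain ASCII alphabet, and the resulting string is translated to display symbols afterwards. The names `ascii_alphabet`, `symbols`, `to_symbols` and `from_symbols` are hypothetical and not part of hashids; the string `"kdpcba"` is a stand-in for whatever `hashids.encode(...)` would return.

```python
# Plain ASCII alphabet that would be passed as Hashids(alphabet=...).
ascii_alphabet = "abcdefghijklmnop"

# One display symbol per ASCII character. A list keeps multi-code-point
# symbols such as "\u263A\uFE0F" (smiley + variation selector) intact.
symbols = ["😀", "😃", "😄", "😁", "😆", "😅", "😂", "🤣",
           "\u263A\uFE0F", "😊", "😇", "🙂", "🙃", "😉", "😌", "😍"]

forward = dict(zip(ascii_alphabet, symbols))
backward = {sym: ch for ch, sym in forward.items()}
# Try longer symbols first so a multi-code-point symbol is matched whole.
by_length = sorted(symbols, key=len, reverse=True)


def to_symbols(hashid):
    """Translate an ASCII hashid into display symbols."""
    return "".join(forward[ch] for ch in hashid)


def from_symbols(text):
    """Translate display symbols back into the ASCII hashid."""
    out = []
    i = 0
    while i < len(text):
        for sym in by_length:
            if text.startswith(sym, i):
                out.append(backward[sym])
                i += len(sym)
                break
        else:
            raise ValueError("unknown symbol at position %d" % i)
    return "".join(out)


hashid = "kdpcba"  # stand-in for the output of hashids.encode(...)
shown = to_symbols(hashid)
print(shown)
print(from_symbols(shown) == hashid)  # True
```

The translated string would be turned back with `from_symbols` before being handed to `hashids.decode`.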

@blukis
Author

blukis commented Apr 30, 2017

The PHP implementation has the same limitation it seems. The Swift implementation mentions in the readme that it supports emoji alphabets by using an array rather than a string internally. https://github.com/malczak/hashids

I'm unblocked myself, by implementing my own thing, in a non-generic way (and in PHP incidentally).

@davidaurelio
Owner

If swift supports arrays, we can certainly support lists/tuples.

@blukis
Author

blukis commented Apr 30, 2017

My understanding from the Swift readme is the array-thing is an inner implementation detail, and not a change in API (i.e. alphabet is still passed in from outside as a string). Not sure if that would affect your plan here.

I might actually be able to speak to this (a little bit, from a JavaScript perspective, anyway):

  • "☺️" is (and should be) actually 2 characters (that happen to get rendered as one grapheme) - "\u{263a}" (smiley face) and "\u{fe0f}" (invisible "emoji variation selector" https://en.wikipedia.org/wiki/Variant_form_(Unicode) )
  • "😽" while larger than 16bit range, is 1 character - "\u{1f63d}"

In JavaScript, longer-than-2-byte characters are also represented as 16bit "surrogate-pairs", i.e. "😽".length == 2 in JS also. But the underlying unicode numbers of each character are still accessible with this JS code (excerpt from https://github.com/bestiejs/punycode.js/blob/master/punycode.js )

function ucs2decode(string) {
	const output = [];
	let counter = 0;
	const length = string.length;
	while (counter < length) {
		const value = string.charCodeAt(counter++);
		if (value >= 0xD800 && value <= 0xDBFF && counter < length) {
			// It's a high surrogate, and there is a next character.
			const extra = string.charCodeAt(counter++);
			if ((extra & 0xFC00) == 0xDC00) { // Low surrogate.
				output.push(((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000);
			} else {
				// It's an unmatched surrogate; only append this code unit, in case the
				// next code unit is the high surrogate of a surrogate pair.
				output.push(value);
				counter--;
			}
		} else {
			output.push(value);
		}
	}
	return output;
}

e.g.

ucs2decode("a") = [97]
ucs2decode("a☺️") = [ 97, 9786, 65039 ]
ucs2decode("a☺️😽") = [ 97, 9786, 65039, 128573 ]

Maybe the same principles hold for Python too?
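A rough Python counterpart of the JS excerpt above, as a sketch assuming Python 3. The helper `utf16_units` is a made-up name that produces the 16 bit code units a JS string would hold; `ucs2decode` then applies the same surrogate-pair arithmetic. In Python 3, `ord()` on each character of a `str` already yields the same code points, so the port is mostly of interest for narrow builds or for data that arrives as UTF-16.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def ucs2decode(units):
    """Combine surrogate pairs in a list of 16 bit code units."""
    output = []
    i = 0
    while i < len(units):
        value = units[i]
        i += 1
        if 0xD800 <= value <= 0xDBFF and i < len(units):
            # High surrogate, and there is a next code unit.
            extra = units[i]
            if (extra & 0xFC00) == 0xDC00:  # low surrogate
                output.append(((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000)
                i += 1
            else:
                # Unmatched high surrogate: keep it, re-examine the next unit.
                output.append(value)
        else:
            output.append(value)
    return output


text = "a\u263A\uFE0F\U0001F63D"
print(ucs2decode(utf16_units(text)))  # [97, 9786, 65039, 128573]
print([ord(ch) for ch in text])       # same list on Python 3
```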

@davidaurelio
Owner

Hi @blukis – sorry for going underground for so long. Is this still a feature you’d be interested in? Are more implementations able to handle emojis and unicode combinators these days?

There is a whole rabbit hole with unicode normalisation waiting, but maybe we can do something simple.

@blukis
Author

blukis commented Jul 15, 2020

I can't be of much help on this unfortunately, nor have a use for it currently. I wound up kind of "pivoting to PHP" for that project, and the need went away. 🙃

@davidaurelio
Owner

Fair enough – I will close this issue for now. If you or anybody happen to end up needing this, I will take another look. I would assume there is existing functionality to split unicode strings these days.
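For anyone picking this up later, a deliberately simple splitting approach might look like the sketch below, assuming Python 3 and only the standard library. The function name `split_symbols` is hypothetical. It attaches combining marks and variation selectors (U+FE00 to U+FE0F) to the preceding character and nothing more; ZWJ sequences, skin-tone modifiers and flags are not handled, and full grapheme segmentation is what Unicode Standard Annex #29 describes.

```python
import unicodedata

# Variation selectors have combining class 0, so they need their own check.
VARIATION_SELECTORS = range(0xFE00, 0xFE10)


def split_symbols(text):
    """Split text into a list of symbols, keeping marks with their base."""
    symbols = []
    for ch in text:
        attaches = (unicodedata.combining(ch) != 0
                    or ord(ch) in VARIATION_SELECTORS)
        if attaches and symbols:
            symbols[-1] += ch
        else:
            symbols.append(ch)
    return symbols


parts = split_symbols("a\u263A\uFE0F\U0001F63D")
print(len(parts))  # 3
```

A list like `parts` is the kind of value that a tuple or list alphabet, as discussed earlier in the thread, could accept.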
