This repository was archived by the owner on Jul 30, 2019. It is now read-only.

allow input type="email" to accept unicode values #845

Closed
chaals opened this issue Mar 29, 2017 · 34 comments

Comments

@chaals
Collaborator

chaals commented Mar 29, 2017

This is related to #538 - the "left hand side" of an email address can, in the internationalised world, be more or less anything. But currently the spec only allows ASCII.

We're looking to see evidence that the world's deployed email infrastructure actually works with utf-8 before pushing this change.

@chaals chaals added this to the When it's ready milestone Mar 29, 2017
chaals pushed a commit that referenced this issue Apr 18, 2017
cynthia pushed a commit that referenced this issue Jun 14, 2017
* Clarify the constraints on email addresses

Fix #538
See also #845

* Remove class="impl"

See #178 

Should cover the entire chapter
@edent
Member

edent commented Aug 9, 2017

I did some research in September 2016 on this issue - https://shkspr.mobi/blog/2016/09/why-cant-you-send-email-to-a-chinese-address/

None of the major webmail providers would allow me to sign up for an email account like 你好@...

A few of them would allow email to be sent to test@莎士比亚.org - but most required the punycode representation.

@cynthia
Member

cynthia commented Aug 9, 2017

I gave this a bit of thought, and I think we can safely change the validation rules to accommodate SMTPUTF8 - although I don't have high hopes for implementer traction, considering the interoperability status (or the lack thereof) of the underlying infrastructure (MTA support, namely) this eventually depends on.

Anyone willing to contribute patches to implementations given the spec changes?

@klensin

klensin commented Oct 30, 2017

Cynthia, just to avoid misunderstandings, these extensions are actually rather easy to support in MTAs and several have quietly added the support. Specifically, for an MTA that is already "8 bit clean" (and most are these days), what is needed is for the receiver side of the MTA to advertise and accept the extension, the sender side to ask for it, and then to disable some checks. If the MTA does not support non-ASCII domains, IDNA support also has to be added, but several libraries are available that do that, some more or less automagically.
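
As a rough illustration of the sender-side decision described above, here is a minimal Python sketch. It is not taken from any real MTA: the function name `plan_delivery` and its return shape are hypothetical, and it assumes the receiver's advertised EHLO keywords are already available as a list of strings.

```python
from typing import Iterable, List, Tuple


def plan_delivery(address: str, advertised: Iterable[str]) -> Tuple[str, List[str]]:
    """Return (envelope address, extra MAIL parameters) for one recipient.

    Hypothetical helper: `advertised` is assumed to hold the keywords the
    receiving server listed in its EHLO response.
    """
    keywords = {kw.upper() for kw in advertised}
    local, _, domain = address.rpartition("@")

    if local.isascii():
        # ASCII local part: a non-ASCII domain can still travel as punycode,
        # so the extension is not needed.
        ascii_domain = domain.encode("idna").decode("ascii")
        return local + "@" + ascii_domain, []

    if "SMTPUTF8" in keywords:
        # Non-ASCII local part: the sender has to ask for the extension.
        return address, ["SMTPUTF8"]

    raise ValueError("receiver did not advertise SMTPUTF8")


print(plan_delivery("user@example.com", ["8BITMIME"]))
print(plan_delivery("你好@example.com", ["8BITMIME", "SMTPUTF8"]))
```

Python's built-in `idna` codec implements the older IDNA 2003 rules, so this is only an approximation of what a production library would do.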

The hard problems occur in MUAs and most of them involve design issues, not just code tweaks, especially if one expects to support the general case of arbitrary email addresses.

Almost independent of the above, it continues to appear to me to be unwise for a W3C spec to be a barrier to deployment in this area.

@ShawnSteele

There are several places that non-ASCII local email addresses "work." It seems unwise to restrict the validation to obsolete standards when we know that EAI exists and is in progress.

Is there a good reason why non-ASCII addresses should be prohibited, knowing that the industry is trying to move toward more globally appropriate addresses?

@cynthia
Member

cynthia commented Oct 31, 2017

Just to be clear - this not being in the spec does not block anything. In practice, if you need to support i18n mail addresses - one could just use input type="text" and write validation yourself.

When I was looking into this issue, I took a quick look at usage of WF2 in popular sites and was genuinely surprised to find that none of them used input type="email" - and of the ones I tested, none allowed either domains or mailboxes outside of good old ISO-8859-1 in their hand-rolled validation. While this is a limited sample, it does suggest that a significant portion of WF2 is going unused, either because it is not customizable enough (an argument I've heard quite a lot) or due to compatibility (another argument I hear a lot).

The macro question of whether or not we should re-engineer WF2 to be extensible web primitives with better interoperability instead of the mess it is right now (I would very much like to see this as primitives and specifics implemented as userland libraries or components instead) - is probably a question for another thread though.

@ShawnSteele

ShawnSteele commented Oct 31, 2017 via email

@cynthia
Member

cynthia commented Oct 31, 2017

Most of that in the wild isn't because of a lack of standards support for Unicode mailboxes and IDNs in user agents, but more of a content author decision to only validate (with a script. Madness!) against ASCII mail addresses.

@klensin

klensin commented Oct 31, 2017

Shawn, I think I agree with Cynthia here, based partially on long experience (long predating the EAI work) with web sites either rejecting email local-parts containing "+" as invalid or accepting them and screwing up in some major way later. I think I understand the problem and where it comes from, but it isn't a standards matter -- "+" has been valid in local parts since before 821/822. However, it seems to me that the bottom line is the same: if W3C is going to have a validation standard, it should not block or discourage the use of a valid feature (in this case, non-ASCII email addresses) that a significant number of people consider important.

Whether it is desirable that there be such a standard and, if so, what form it should take, is a separate question. So are the questions about how we encourage conformance and deployment of the SMTPUTF8-related specs. All I'm sure of is that this, or any successor, validation spec should avoid getting in the way.

john

@r12a

r12a commented Oct 31, 2017

ICANN's Universal Acceptance program (see https://uasg.tech/) is trying to encourage software and content authors to support full non-ASCII domains, and some countries, such as China and India, are already actively creating and using local domains that are non-ASCII on both sides of the @ sign. There's a handy white paper at https://uasg.tech/wp-content/uploads/2017/04/Unleashing-the-Power-of-All-Domains-White-Paper.pdf that gives more detail, and suggests that the problem is mostly one of developers/content authors being unaware of the need to support full unicode domains.

This is why input type=email is a problem, in my view. I think it is likely to block things if unaware people take it at face value and use it. Well-meaning but unaware people won't appreciate that it is unwise to use it in its current form, and will, by using it, compound the problems of rolling out universal acceptance. I'm not sure i understand why we don't just fix the syntax to support Unicode on the LHS. Until we do, it seems to me that the HTML spec is actively promoting an approach that is counter to what is currently needed, and will end up prolonging problems for users, and should therefore be addressed.

And btw it seems to me that making input type=email work as needed could help address the problems of developers creating ISO 8859 only validation scripts for ordinary input fields.

@edent
Member

edent commented Oct 31, 2017

Just so I understand - are we discussing the regex at https://www.w3.org/TR/html/sec-forms.html#email-state-typeemail ?

/^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

In which case, I agree that it needs to be updated.
It matches neither example@你好.com nor 你好@example.com
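
For anyone who wants to reproduce this outside the browser, a minimal Python sketch follows. It transcribes the pattern quoted above (dropping the JavaScript `/` delimiters) and uses `re.fullmatch` so that a trailing newline is not silently accepted. The helper name `spec_valid` is made up for illustration.

```python
import re

# The pattern from the spec section quoted above, split over three lines
# for readability; the pattern body itself is unchanged.
SPEC_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def spec_valid(address: str) -> bool:
    """True when the address matches the ASCII-only spec pattern."""
    return SPEC_EMAIL.fullmatch(address) is not None


for candidate in ("user@example.com", "example@你好.com", "你好@example.com"):
    print(candidate, spec_valid(candidate))
```

Only the first candidate is accepted, because every character class in the pattern is limited to ASCII.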

@r12a

r12a commented Oct 31, 2017

@edent for myself, i was talking about that whole section, but particularly, i suppose about the bit that starts "A valid e-mail address is...", which hopefully will be the same as the regex. But if you're saying that the regex doesn't support example@你好.com (i haven't checked), i think there's an additional issue. Because the spec was changed to allow that, iiuc.

@edent
Member

edent commented Oct 31, 2017

@r12a I tried that email regex on https://regex101.com/ and used those example emails - it failed.

@r12a

r12a commented Oct 31, 2017

@edent Yeah, looking at it now i can see why.

@ShawnSteele

ShawnSteele commented Oct 31, 2017 via email

@ShawnSteele

ShawnSteele commented Oct 31, 2017 via email

@collinanderson
Contributor

I might be wrong, but I think some browsers internally translate IDN into punycode thereby passing the regex validation for the domain part.
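
A small Python sketch of what such a translation could look like, assuming the conversion is applied to the domain only before the spec regex is run. This is an illustration, not a description of any browser's actual code; `to_ascii_domain` is a hypothetical helper, and Python's built-in `idna` codec follows the older IDNA 2003 rules.

```python
import re

# The ASCII-only pattern from the HTML spec's email section.
SPEC_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def to_ascii_domain(address: str) -> str:
    """Convert only the domain part to its punycode (xn--) form."""
    local, _, domain = address.rpartition("@")
    ascii_domain = domain.encode("idna").decode("ascii")
    return local + "@" + ascii_domain


original = "example@你好.com"
converted = to_ascii_domain(original)

print(converted)
print(SPEC_EMAIL.fullmatch(original) is not None)   # raw Unicode domain
print(SPEC_EMAIL.fullmatch(converted) is not None)  # punycode domain
```

Note that this only helps the right-hand side; a non-ASCII local part has no equivalent ASCII encoding and would still fail the pattern.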

@ShawnSteele

ShawnSteele commented Oct 31, 2017 via email

@AndySky21

Maybe I'm wrong, but isn't input type email also being used for autocompletion and to select the keyboard type?
On mobile devices, having an email input offers users a keyboard where useful symbols and strings (such as @, / and ".com") are closer to hand. For the autocompletion part, it can be obtained with the use of the longer attribute value syntax, but only as far as it is supported.
Perhaps it would be better to fix email input rather than discard it.

@dwsinger

Are there cases where the UA would want to restrict to "old" email addresses? If so, we might need two (or three) input methods: "old style", "fully international" and maybe "general" (which currently means old, but should be moved to mean international).
I think we should also discuss presentation of email addresses, particularly those involving RTL text.

@ShawnSteele

ShawnSteele commented Nov 28, 2017 via email

@dwsinger

I'm not sure I like two different input methods either.

I do want to see some real thought into how we present multi-script labels (URIs and email addresses) in ways that empower rather than confuse users. That's a serious upcoming problem (how do I make sure it's the company I intend and not a phishing attack if I can't read it? how do I know it's the person I intend at the company I intend if I can't read it? and punycode makes answers to those questions even harder).

@ShawnSteele

ShawnSteele commented Nov 28, 2017 via email

@dwsinger

agreed, for email addresses I mostly want to be reasonably sure it's the right company if I can't read the personal address (if I can, it's a "do I recognize it?" question). Don't forget mailto: URLs.

agreed, punycode is an excuse for "we have no idea what to show you so we will be quite sure to show you something we know you can't read". how that helps except to protect the guilty I do not know.

@aphillips

@dwsinger / @ShawnSteele I think you're looking through the wrong end of the problem here. The problem here is not presentation, it's input. Sites can take the input and do whatever they need to in order to validate the email address. A site could reject EAI addresses, for example. The only thing that input type=email really guarantees is that the string has an @ and a dot in it.......

Prior to EAI/this issue, it also restricted the character range to a subset of ASCII. If we make this change, non-ASCII addresses will be enterable. Some naive implementations will fail to encode these properly for use in email protocols. Those addresses didn't work previously either; the failure just happened one step closer to the user.

The point of input type=* is to engage the user agent and possibly as an input hint. Otherwise it's just a glorified input type=text. FWIW, FF accepts any tokens as an email address as long as they are separated by @ and a series of dots.

@dwsinger

I think that we should (mixing metaphors) look at both ends of the telescope; we have to do better at both accepting IRIs and international email addresses, and presenting them. I am unhappy to consider either one without the other.

@ShawnSteele

ShawnSteele commented Nov 29, 2017 via email

@edent
Member

edent commented Nov 29, 2017

"The only thing that input type=email really guarantees is that the string has an @ and a dot in it......."

I do not think that is the case. On my corporate intranet, I can send mail to edent@company - that's a non-public email address, but a valid one on my mailserver. It is also accepted as valid by a simple <input type="email" required>
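
This is consistent with the domain half of the spec's pattern, which allows a single label with no dot. A minimal Python sketch of just that half, where `LABEL`, `DOMAIN` and `domain_part_ok` are names invented for illustration:

```python
import re

# One DNS-style label as written in the spec pattern: 1 to 63 ASCII
# letters, digits or hyphens, not starting or ending with a hyphen.
LABEL = r"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"

# A domain is one label followed by zero or more ".label" groups,
# so the dot is optional.
DOMAIN = re.compile(LABEL + r"(?:\." + LABEL + r")*")


def domain_part_ok(address: str) -> bool:
    """Check only the part after the last @ against the domain pattern."""
    _local, _, domain = address.rpartition("@")
    return DOMAIN.fullmatch(domain) is not None


print(domain_part_ok("edent@company"))          # single label, no dot
print(domain_part_ok("edent@company.example"))  # two labels
print(domain_part_ok("edent@.company"))         # empty first label
```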

Some observations.

  • On <input type="number"> my desktop browser prevents me from typing the word "two", but my mobile browser will let me type text into a number field.
  • <input type="email"> doesn't stop me entering "incorrect" characters. I can enter 你好 as the user or the server part on my mobile and desktop browsers.
  • Ultimately, nothing stops a user from typing in an incorrect email address - either mistakenly or on purpose.

@chaals
Collaborator Author

chaals commented Nov 29, 2017

@edent said

"Ultimately, nothing stops a user from typing in an incorrect email address - either mistakenly or on purpose."

Actually, applications do - they check the validity state, and then refuse to submit the form. So telling them to validate against modern internationalised email instead of the historical ASCII type is important.

Otherwise I agree with @aphillips - this is about accepting user input. We currently insist on something that maps to punycode for the right-hand-side and I am not sure there is yet a big push to change that. We should accept stuff that is not ASCII as the local part.
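
One way to picture that rule is the Python sketch below. It is only an illustration under stated assumptions, not proposed spec text: the local part may contain any non-ASCII character plus the ASCII characters the current pattern already allows, and the domain must convert to punycode that matches the existing domain pattern. The name `relaxed_valid` is hypothetical, and Python's `idna` codec follows IDNA 2003.

```python
import re

# ASCII characters the current spec pattern permits in the local part.
ASCII_LOCAL_CHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789.!#$%&'*+/=?^_`{|}~-"
)
LABEL = r"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
DOMAIN = re.compile(LABEL + r"(?:\." + LABEL + r")*")


def relaxed_valid(address: str) -> bool:
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return False
    # Assumed relaxation: non-ASCII characters pass unchecked, while ASCII
    # characters stay limited to the set the spec allows today.
    if any(ch.isascii() and ch not in ASCII_LOCAL_CHARS for ch in local):
        return False
    try:
        # The right-hand side must map to punycode.
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    return DOMAIN.fullmatch(ascii_domain) is not None


for candidate in ("你好@example.com", "example@你好.com", "bad space@example.com"):
    print(candidate, relaxed_valid(candidate))
```

A real definition would need to say more about which non-ASCII characters are acceptable (controls, whitespace, normalization), which this sketch deliberately leaves open.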

How email applications handle this is like any internationalisation question - there are some broken ones but increasingly it works.

But as well as the protocol level there is an important set of use cases where people use email addresses as a generic identifier, even if no email is actually sent. These use cases should also allow people to use actual functional email addresses with non-ASCII in them.

@edent
Member

edent commented Nov 29, 2017

Sorry, I meant "nothing stops a user from typing in the wrong email address".

I've lost count of the people who submit my email address thinking it is theirs. Or the people who think their email provider is gmial.com.

@chaals
Collaborator Author

chaals commented Nov 29, 2017

Ah, yes.

One of the ideas presented behind input type="email" is that it would allow a user to select stuff from their address book - assuming access from the browser is allowed. This provides for lots of things, including validating email addresses rather more carefully against reality... but as far as I know that is not currently implemented anywhere :(

@ShawnSteele

ShawnSteele commented Nov 29, 2017 via email

@ShawnSteele

ShawnSteele commented Nov 29, 2017 via email

@chaals chaals self-assigned this Dec 1, 2017
@chaals chaals modified the milestones: When it's ready, HTML5.3 WD1 Dec 1, 2017
@chaals
Collaborator Author

chaals commented Dec 1, 2017

@ShawnSteele

"I cannot remember the last time I entered an email address in a web form that was not my own"

I do it when I use webmail. That is a pretty common scenario. It might be that since the address book is then likely to be on the app side, and the magic form types are unfriendly to customisation, that it isn't an important case in practice. In my own case, I am generally using a web interface to something for which I also use a local client that does maintain an address book, but I might not be typical enough to matter.

@edent
Member

edent commented Dec 1, 2017

Back when I worked for WAC (Wholesale Applications Community) we defined a standardised way of the address book being exposed to the browser - https://web.archive.org/web/20110313075317/http://specs.wacapps.net:80/wac2_0/feb2011/deviceapis/contact.html

Useful if you want to pick addresses. But I suspect <input type="contact"> is out of scope for this discussion.


10 participants