allow input type="email" to accept unicode values #845
Comments
I did some research in September 2016 on this issue - https://shkspr.mobi/blog/2016/09/why-cant-you-send-email-to-a-chinese-address/ None of the major webmail providers would allow me to sign up for an email account like A few of them would allow email to be sent to |
I gave this a bit of thought, I think we can safely change the validation rules to accommodate SMTPUTF8 - although I don't have high hopes for implementer traction, considering the interoperability status (or the lack thereof) of the underlying infrastructure (MTA support, namely) this eventually depends on. Anyone willing to contribute patches to implementations given the spec changes? |
Cynthia, just to avoid misunderstandings, these extensions are actually rather easy to support in MTAs and several have quietly added the support. Specifically, for an MTA that is already "8 bit clean" (and most are these days), what is needed is for the receiver side of the MTA to advertise and accept the extension, the sender side to ask for it, and then to disable some checks. If the MTA does not support non-ASCII domains, IDNA support also has to be added, but several libraries are available that do that, some more or less automagically. The hard problems occur in MUAs and most of them involve design issues, not just code tweaks, especially if one expects to support the general case of arbitrary email addresses. Almost independent of the above, it continues to appear to me to be unwise for a W3C spec to be a barrier to deployment in this area. |
There are several places that non-ASCII local email addresses "work." It seems unwise to restrict the validation to obsolete standards when we know that EAI exists and is in progress. Is there a good reason why non-ASCII addresses should be prohibited, knowing that the industry is trying to move toward more globally appropriate addresses? |
Just to be clear - this not being in the spec does not block anything. In practice, if you need to support i18n mail addresses - one could just use input type="text" and write validation yourself. When I was looking into this issue, I took a quick look at usage of WF2 in popular sites and was genuinely surprised to find that none of them used input type email - and of the ones I tested, none of them allowed either domains or mailboxes outside of good old ISO-8859-1 in their hand rolled validation. While this is a limited sample, it does suggest that a significant portion of WF2 is going unused, either because it is not customizable enough (an argument I've heard quite a lot) or due to compatibility (another argument I hear a lot). The macro question of whether or not we should re-engineer WF2 to be extensible web primitives with better interoperability instead of the mess it is right now (I would very much like to see this as primitives and specifics implemented as userland libraries or components instead) - is probably a question for another thread though. |
Yes, it means that web sites won’t let me enter my email address. I can’t change the input type on my shopping cart when I’m trying to buy something… assuming someone used it.
If it is to be deprecated then it should be tagged as such.
|
Most of that in the wild isn't because of a lack of standards support for Unicode mailboxes and IDNs in user agents, but more of a content author decision to only validate (with a script. Madness!) against ASCII mail addresses. |
Shawn, I think I agree with Cynthia here, based partially on long experience (long predating the EAI work) with web sites either rejecting email local-parts containing "+" as invalid or accepting them and screwing up in some major way later. I think I understand the problem and where it comes from, but it isn't a standards matter -- "+" has been valid in local parts since before 821/822. However, it seems to me that the bottom line is the same: if W3C is going to have a validation standard, it should not block or discourage the use of a valid feature (in this case, non-ASCII email addresses) that a significant number of people consider important. Whether it is desirable that there be such a standard and, if so, what form it should take, is a separate question. So are the questions about how we encourage conformance and deployment of the SMTPUTF8-related specs. All I'm sure of is that this, or any successor, validation spec should avoid getting in the way.
|
ICANN's Universal Acceptance program (see https://uasg.tech/) is trying to encourage software and content authors to support full non-ASCII domains, and some countries, such as China and India, are already actively creating and using local domains that are non-ASCII on both sides of the @ sign. There's a handy white paper at https://uasg.tech/wp-content/uploads/2017/04/Unleashing-the-Power-of-All-Domains-White-Paper.pdf that gives more detail, and suggests that the problem is mostly one of developers/content authors being unaware of the need to support full Unicode domains. This is why And btw it seems to me that making |
Just so I understand - are we discussing the regex at https://www.w3.org/TR/html/sec-forms.html#email-state-typeemail ?
In which case, I agree that it needs to be updated. |
@edent for myself, i was talking about that whole section, but particularly, i suppose about the bit that starts "A valid e-mail address is...", which hopefully will be the same as the regex. But if you're saying that the regex doesn't support example@你好.com (i haven't checked), i think there's an additional issue. Because the spec was changed to allow that, iiuc. |
@r12a I tried that email regex on https://regex101.com/ and used those example emails - it failed. |
@edent Yeah, looking at it now i can see why. |
I think that’s what I meant: Web sites are rolling their own and not using the standard, so that’s another problem with getting EAI adopted, somewhat orthogonal to the discussion of the email input type.
However, if the W3C is going to provide a validation standard, whether or not it’s tied to the email input type, then it should support EAI. The comments indicated that it might be better to wait until adoption were higher, but that’s just adding an unnecessary barrier to adopting EAI – the chicken and the egg thing.
Additionally, if people aren’t using the right input type, then it seems like the W3C should either encourage using the right input type, or abandon the concept and deprecate it. I’d prefer encouraging the correct use of the email input type. Not supporting EAI however could be a barrier to that as well since site developers that are interested in correct EAI support would not get that support from input type email.
So in short, it seems like EAI support (including local parts) needs to be added to the input type email. Both to unblock potential EAI adoption, and also to encourage proper use of the input type email by applications.
…-Shawn
|
Correct – and more to the point, the language above it that drives the regex.
One interesting curiosity is that the regex is reasonably strict for ASCII-range email addresses – however EAI permits a bunch of stuff in the Unicode range of the local part. And a regex isn’t going to be sufficient to validate legal IDN names.
Conceptually the desire for the input type=email validation would seem to be a somewhat rigorous validation that the user entered a fairly well-formed email address. However, EAI is much more complicated for the domain part, and the local part is much more permissive in the Unicode range than the ASCII range.
The IDN rules do a bunch of mapping that wouldn’t fit a simple regex. Even getting rid of the illegal characters (control codes, whatnot), would probably result in an unwieldy regex?
The local part sort of has the opposite problem. EAI is very permissive, so everything > U+007F is effectively legal. Some of it is clearly undesirable, but there’s no well-defined line saying “this character is good and this one is evil”. Obvious control characters are bad, but some may be necessary for proper rendering. Other characters are also in grayer areas. For one example, I’m looking forward to the day I can get mail at @microsoft.com.
…-Shawn
On Tuesday, October 31, 2017 6:52 AM, Terence Eden wrote:
Just so I understand - are we discussing the regex at https://www.w3.org/TR/html/sec-forms.html#email-state-typeemail ?
/^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
In which case, I agree that it needs to be updated.
It doesn't match example@你好.com nor 你好@example.com
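The mismatch is easy to reproduce. Below is a minimal Python sketch that ports the quoted pattern to the `re` module and checks the two example addresses against it. The name `SPEC_EMAIL_RE`, the helper function, and the sample ASCII address are illustrative only and do not come from the spec.

```python
import re

# The pattern quoted above, ported from the JavaScript literal to Python.
# SPEC_EMAIL_RE is an illustrative name, not something defined by the spec.
SPEC_EMAIL_RE = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def matches_spec_pattern(address: str) -> bool:
    """Return True if the whole string matches the ASCII-only pattern."""
    return SPEC_EMAIL_RE.fullmatch(address) is not None


# Only the plain ASCII address is accepted; both Unicode ones are rejected.
for candidate in ("user@example.com", "example@你好.com", "你好@example.com"):
    print(candidate, matches_spec_pattern(candidate))
```

Every character class in the pattern is limited to ASCII letters, digits and a few symbols, so any non-ASCII code point on either side of the @ makes the match fail.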
|
I might be wrong, but I think some browsers internally translate IDN into punycode thereby passing the regex validation for the domain part. |
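For illustration, here is roughly what such a translation looks like, sketched with Python's built-in `idna` codec (which implements the older IDNA 2003 rules). Whether and how a given browser does this is an assumption here, and `ASCII_DOMAIN_RE` and `domain_to_punycode` are hypothetical names.

```python
import re

# ASCII-only domain portion of the spec's pattern (dot-separated labels).
ASCII_DOMAIN_RE = re.compile(
    r"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def domain_to_punycode(domain: str) -> str:
    """Convert each non-ASCII label to its xn-- form (IDNA 2003 via stdlib)."""
    return domain.encode("idna").decode("ascii")


unicode_domain = "你好.com"
ascii_domain = domain_to_punycode(unicode_domain)

# The Unicode form fails the ASCII-only pattern; the xn-- form passes it.
print(ASCII_DOMAIN_RE.fullmatch(unicode_domain) is not None)
print(ASCII_DOMAIN_RE.fullmatch(ascii_domain) is not None)
```

This only helps the domain part: Punycode applies to domain names, so a non-ASCII local part has no equivalent ASCII form that would pass the pattern.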
Well, that’d be another issue for input type email. It should be sent as Unicode, not Punycode.
|
Maybe I'm wrong, but isn't input type email also being used for autocompletion and input type? |
Are there cases where the UA would want to restrict to "old" email addresses? If so, we might need two (or three) input methods: "old style", "fully international" and maybe "general" (which currently means old, but should be moved to mean international). |
I really don’t like that idea. That means the app sort of has permission to disallow international email addresses. And some American sites are going to go with that because it’s easy, leaving other users in the lurch.
Should someone really want to prohibit non-ASCII addresses, they’d merely have to add their own filter checking for codepoints above U+007F before sending the form? That seems much better than baking something into the standard that’d be deprecated before it was even implemented.
…-Shawn
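Such a filter really is only a line or two. A minimal sketch in Python (the function name is made up for illustration; a real site would run the equivalent check in its own form-handling code):

```python
def is_ascii_only(address: str) -> bool:
    """Return True if no code point in the address is above U+007F."""
    return all(ord(ch) <= 0x7F for ch in address)


# A site that insists on ASCII-only addresses could reject the rest itself.
for candidate in ("user@example.com", "你好@example.com", "example@你好.com"):
    print(candidate, "accepted" if is_ascii_only(candidate) else "rejected")
```

The point of the sketch is that the restriction lives in the site's own code, so the standard itself would not need a separate "old style" input type.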
On Tuesday, November 28, 2017 12:14 PM, David Singer wrote:
Are there cases where the UA would want to restrict to "old" email addresses? If so, we might need two (or three) input methods: "old style", "fully international" and maybe "general" (which currently means old, but should be moved to mean international).
I think we should also discuss presentation of email addresses, particularly those involving RTL text.
|
I'm not sure I like two different input methods either. I do want to see some real thought into how we present multi-script labels (URIs and email addresses) in ways that empower rather than confuse users. That's a serious upcoming problem (how do I make sure it's the company I intend and not a phishing attack if I can't read it? how do I know it's the person I intend at the company I intend if I can't read it? and punycode makes answers to those questions even harder). |
I reply-all’d to a mail yesterday that included folks I didn’t expect merely because the To: and Cc: were so cluttered. I’m not sure how phishing gets prevented.
IRIs have their own challenges, but we’ve (the bigger we of all IT/Web folks) trained folks that mycompany.safe.com is OK merely because “we” keep using providers that do things like “mycompany.survey.com” and “mycompany.emaildist.com” or whatever.
I don’t know how you could possibly know that it wasn’t phishing just by inspecting the mail address. What if I asked you to send your confidential email to bgates@... (instead of billg) – how could you possibly know? You’d better have another way of trusting whomever gave you that email address.
I do know that I can’t read Punycode at all, so if you’re depending on how that looks, every Punycode spoofs every other one.
…-Shawn
|
agreed, for email addresses I mostly want to be reasonably sure it's the right company, if I can't read the personal address (if I can, it's a "do I recognize it?"). Don't forget mailto: URLs. agreed, punycode is an excuse for "we have no idea what to show you so we will be quite sure to show you something we know you can't read". how that helps except to protect the guilty I do not know. |
@dwsinger / @ShawnSteele I think you're looking through the wrong end of the problem here. The problem here is not presentation, it's input. Sites can take the input and do whatever they need to in order to validate the email address. A site could reject EAI addresses, for example. The only thing that input type=email really guarantees is that the string has an @ and a dot in it....... Prior to EAI/this issue, it also restricted the character range to a subset of ASCII. If we make this change, non-ASCII addresses will be enterable. Some naive implementations will fail to encode these properly for use in email protocols. Those addresses didn't work previously, just one step closer to the user. The point of input type=* is to engage the user agent and possibly as an input hint. Otherwise it's just a glorified input type=text. FWIW, FF accepts any tokens as an email address as long as they are separated by @ and a series of dots. |
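To make the "an @ and a dot" point concrete, here is a rough Python sketch of a check that loose. It is an illustration only: it is not the spec's algorithm and not Firefox's implementation, and the function name and the exact rules (non-empty local part, at least one dot, no empty labels) are assumptions.

```python
def looks_like_address(value: str) -> bool:
    """Loose check: non-empty local part, an '@', and a dotted domain."""
    # Split on the last "@" so everything before it counts as the local part.
    local, sep, domain = value.rpartition("@")
    if not sep or not local:
        return False
    labels = domain.split(".")
    # Require at least one dot and no empty labels (so "a@b..com" fails).
    return len(labels) >= 2 and all(labels)


for candidate in ("user@example.com", "你好@你好.com", "no-at-sign", "a@b..com"):
    print(candidate, looks_like_address(candidate))
```

A check like this places no limit on the character range, so non-ASCII addresses pass and any stricter policy is left to the site.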
I think that we should (mixing metaphors) look at both ends of the telescope; we have to do better at both accepting IRIs and international email addresses, and presenting them. I am unhappy to consider either one without the other. |
A naïve implementation may have trouble, but applications that merely take the Unicode string and pass it to their mail library’s Unicode API should be OK. We do the punycode step at a pretty low level, so people can call the DNS APIs with Unicode successfully. Granted, old libraries might be too strict or do things that don’t work with EAI, but it shouldn’t be too difficult for many systems to handle EAI successfully.
I’m not sure where your thought’s going. You point out that “applications might do the wrong thing” but I’m not sure how that impacts the specification for the email type.
Personally I’d consider normalizing punycode text entry to Unicode, but that’s probably overkill. However, it’d encourage appropriate behavior.
…-Shawn
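As a sketch of what normalizing a Punycode entry to Unicode could look like, the Python below decodes an xn-- domain with the standard library's `idna` codec (IDNA 2003). The helper name and its fall-back behaviour are assumptions made for illustration, not anything the spec defines.

```python
def normalize_domain_to_unicode(address: str) -> str:
    """Rewrite an xn-- (Punycode) domain in an address as Unicode.

    Hypothetical helper: splits on the last "@" and decodes the domain with
    the stdlib "idna" codec. Anything it cannot decode is returned unchanged.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address
    try:
        unicode_domain = domain.encode("ascii").decode("idna")
    except UnicodeError:
        # Domain is already non-ASCII or is not valid Punycode: leave it alone.
        return address
    return f"{local}@{unicode_domain}"


puny = "你好.com".encode("idna").decode("ascii")  # the xn-- form of the domain
print(normalize_domain_to_unicode(f"example@{puny}"))
```

Plain ASCII domains pass through unchanged, so a step like this would only affect entries typed in xn-- form.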
|
I do not think that is the case. On my corporate intranet, I can send mail to Some observations.
|
@edent said
Actually, applications do - they check the validity state, and then refuse to submit the form. So telling them to validate against modern internationalised email instead of the historical ASCII type is important. Otherwise I agree with @aphillips - this is about accepting user input. We currently insist on something that maps to punycode for the right-hand-side and I am not sure there is yet a big push to change that. We should accept stuff that is not ASCII as the local part. How email applications handle this is like any internationalisation question - there are some broken ones but increasingly it works. But as well as the protocol level there is an important set of use cases where people use email addresses as a generic identifier, even if no email is actually sent. These use cases should also allow people to use actual functional email addresses with non-ASCII in them. |
Sorry, I meant "nothing stops a user from typing in the wrong email address". I've lost count of the people who submit my email address thinking it is theirs. Or the people who think their email provider is |
Ah, yes. One of the ideas presented behind |
Well, to quibble, that’s a “well-formed” address. It still may be invalid.
On Wednesday, November 29, 2017 3:51 AM, chaals wrote:
@edent said
Ultimately, nothing stops a user from typing in an incorrect email address - either mistakenly or on purpose.
Actually, applications do - they check the validity state, and then refuse to submit the form. So telling them to validate against modern internationalised email instead of the historical ASCII type is important.
Otherwise I agree with @aphillips - this is about accepting user input. We currently insist on something that maps to punycode for the right-hand-side and I am not sure there is yet a big push to change that. We should accept stuff that is not ASCII as the local part.
How email applications handle this is like any internationalisation question - there are some broken ones but increasingly it works.
But as well as the protocol level there is an important set of use cases where people use email addresses as a generic identifier, even if no email is actually sent. These use cases should also allow people to use actual functional email addresses with non-ASCII in them.
|
My first reaction is “intriguing idea” – but, in practice is that useful? I cannot remember the last time I entered an email address in a web form that was not my own. And I’m not sure I have myself as a contact.
…-Shawn
On Wednesday, November 29, 2017 4:03 AM, chaals wrote:
Ah, yes.
One of the ideas presented behind input type="email" is that it would allow a user to select stuff from their address book - assuming access from the browser is allowed. This provides for lots of things, including validating email addresses rather more carefully against reality... but as far as I know that is not currently implemented anywhere :(
|
I do it when I use webmail. That is a pretty common scenario. It might be that since the address book is then likely to be on the app side, and the magic form types are unfriendly to customisation, that it isn't an important case in practice. In my own case, I am generally using a web interface to something for which I also use a local client that does maintain an address book, but I might not be typical enough to matter. |
Back when I worked for WAC (Wholesale Applications Community) we defined a standardised way of the address book being exposed to the browser - https://web.archive.org/web/20110313075317/http://specs.wacapps.net:80/wac2_0/feb2011/deviceapis/contact.html Useful if you want to pick addresses. But I suspect |
* Allow internationalised email addresses Fix #845 * add link
This is related to #538 - the "left hand side" of an email address can, in the internationalised world, be more or less anything. But currently the spec only allows ASCII.
We're looking to see evidence that the world's deployed email infrastructure actually works with utf-8 before pushing this change.