libidn2 support for IDNA2008+UTS#46 (using ffi) #496
lib/addressable/idna/native2.rb
Outdated
def self.to_unicode(value)
  pointer = FFI::MemoryPointer.new(:pointer)
  res = idn2_to_unicode_8z8z(value, pointer, IDN2_NONTRANSITIONAL)
  return value if res != 0
I initially raised an exception here for invalid input, but it turns out the specs expect an invalid punycode hostname to simply be returned unchanged, so I did just that instead. It's hiding errors and silently returning the input string now, not very strict I suppose, but more compatible with existing usage 🤷
Something to change in a major version bump?
Well, it's more of a design choice. It's true that if it were my decision I would prefer the stricter version raising an exception, and as I suggest shipping this in a major version, we could probably do it.
But on the other hand I know the direction of the gem is to be "flexible, offers heuristic parsing", as opposed to the Ruby URI module, so I understand that accepting invalid input and keeping it unchanged without raising can be a feature and a design choice. So if you guys prefer to keep this flexibility I totally understand it.
Is this the test case we're talking about?
addressable/spec/addressable/idna_spec.rb
Lines 163 to 166 in e91b64e
it "should return the identity conversion when punycode decode fails" do
  expect(Addressable::IDNA.to_unicode("xn--zckp1cyg1.sblo.jp")).to eq(
    "xn--zckp1cyg1.sblo.jp")
end
This one and 2 others would be failing if I raise an error here:
1) Addressable::IDNA when using the libidn2 native implementation (ffi) it should behave like converting from ASCII to unicode should convert 'AcinusFallumTrompetumNullunCreditumVisumEstAtCuadLongumEtCefallum.com' correctly
Failure/Error: raise "libidn2 failed to convert \"#{value}\" to unicode (#{idn2_strerror(res)})" if res != 0
RuntimeError:
libidn2 failed to convert "AcinusFallumTrompetumNullunCreditumVisumEstAtCuadLongumEtCefallum.com" to unicode (domain label longer than 63 characters)
Shared Example Group: "converting from ASCII to unicode" called from ./spec/addressable/idna_spec.rb:321
# ./lib/addressable/idna/libidn2.rb:52:in `to_unicode'
# ./lib/addressable/idna.rb:30:in `to_unicode'
# ./spec/addressable/idna_spec.rb:160:in `block (2 levels) in <top (required)>'
2) Addressable::IDNA when using the libidn2 native implementation (ffi) it should behave like converting from ASCII to unicode should return the identity conversion when punycode decode fails
Failure/Error: raise "libidn2 failed to convert \"#{value}\" to unicode (#{idn2_strerror(res)})" if res != 0
RuntimeError:
libidn2 failed to convert "xn--zckp1cyg1.sblo.jp" to unicode (string contains invalid punycode data)
Shared Example Group: "converting from ASCII to unicode" called from ./spec/addressable/idna_spec.rb:321
# ./lib/addressable/idna/libidn2.rb:52:in `to_unicode'
# ./lib/addressable/idna.rb:30:in `to_unicode'
# ./spec/addressable/idna_spec.rb:164:in `block (2 levels) in <top (required)>'
3) Addressable::IDNA when using the libidn2 native implementation (ffi) it should behave like converting from ASCII to unicode should return the identity conversion when the ACE prefix has no suffix
Failure/Error: raise "libidn2 failed to convert \"#{value}\" to unicode (#{idn2_strerror(res)})" if res != 0
RuntimeError:
libidn2 failed to convert "xn--...-" to unicode (string contains invalid punycode data)
Shared Example Group: "converting from ASCII to unicode" called from ./spec/addressable/idna_spec.rb:321
# ./lib/addressable/idna/libidn2.rb:52:in `to_unicode'
# ./lib/addressable/idna.rb:30:in `to_unicode'
# ./spec/addressable/idna_spec.rb:169:in `block (2 levels) in <top (required)>'
The last two are invalid punycode and the first one is an invalid DNS label length (https://datatracker.ietf.org/doc/html/rfc1034#section-3.1).
libidn1 also raises in the first case, but we have this workaround to explicitly allow labels longer than 63 bytes:
def self.to_ascii(value)
  value.to_s.split('.', -1).map do |segment|
    if segment.size > 0 && segment.size < 64
      IDN::Idna.toASCII(segment, IDN::Idna::ALLOW_UNASSIGNED)
    elsif segment.size >= 64
      segment
    else
      ''
    end
  end.join('.')
end
Looks like this was done in c73810f to make it more consistent with pure. I didn't see any issue attached though.
So I suppose if we make libidn2 stricter, which basically means:
raise "libidn2 failed to convert \"#{value}\" to unicode (#{idn2_strerror(res)})" if res != 0
instead of
return value if res != 0
We would need to remove these workarounds and make all implementations reject these domains in the same way.
That does sound like the way to go IMO, but of course it could break some use cases for people who need to handle such "slightly" invalid domains.
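To make the length rule concrete, here is a minimal sketch of the check a stricter backend would effectively apply. `oversized_labels` and `MAX_LABEL_BYTES` are hypothetical names used only for illustration; they are not part of Addressable or libidn2.

```ruby
# Hypothetical helper, not part of Addressable: lists the labels that exceed
# the 63-octet limit from RFC 1034 section 3.1.
MAX_LABEL_BYTES = 63

def oversized_labels(hostname)
  # split with -1 keeps trailing empty labels, like the workaround above
  hostname.to_s.split('.', -1).select { |label| label.bytesize > MAX_LABEL_BYTES }
end

long_host = "AcinusFallumTrompetumNullunCreditumVisumEstAtCuadLongumEtCefallum.com"
p oversized_labels(long_host)      # the first label is reported
p oversized_labels("example.com")  # nothing is reported
```

A strict implementation would raise when this list is non-empty, while the current workaround passes such labels through untouched.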
Interesting suggestion. I do like the fact that, this solution being run-time, we can use this class even if it's not defined yet. Unfortunately it doesn't work with classes, only modules: I get wrong argument type Class (expected Module) if I do error.extend(Addressable::URI::InvalidURIError). In your example Twingly::URL::Error is actually a module. And if we need to change Addressable::URI::InvalidURIError to be a module this would complicate the rest :/ I couldn't find any way to change the ancestor chain by adding another class in the middle.
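For readers following along, the limitation can be reproduced with a small sketch. `SketchErrorClass`, `SketchErrorTag` and `tag_error` are hypothetical stand-ins for the class and module discussed above, not real Addressable or Twingly constants.

```ruby
# Hypothetical names for illustration only.
class SketchErrorClass < StandardError; end  # a class, like Addressable::URI::InvalidURIError
module SketchErrorTag; end                   # a module, like the Twingly::URL::Error example

# Tags an existing exception object at run time.
def tag_error(error, tag)
  error.extend(tag)
end

begin
  tag_error(RuntimeError.new("conversion failed"), SketchErrorClass)
rescue TypeError => e
  puts "extend with a class fails: #{e.message}"
end

begin
  raise tag_error(RuntimeError.new("conversion failed"), SketchErrorTag)
rescue SketchErrorTag => e
  # rescue matches with ===, so a module mixed into the object is enough
  puts "rescued through the module tag: #{e.message}"
end
```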
Looking again at uri.rb I see IDNA is used only twice (in normalized_host for to_ascii and in display_uri for to_unicode), so option 2, which is to re-wrap the error here, doesn't sound too complicated either.
I just gave this option a try in 9eb3910 and I think I actually prefer this one. No hierarchy issue here, every module/class only deals with its own exceptions. The wrapped exceptions are properly identified and the cause attribute contains the previous exception (with specs for that just in case), which means the backtrace and history are complete for bug trackers. People doing rescue Addressable::URI::InvalidURIError are covered the same way.
By the way, I also added specs for the case of an invalid IDNA hostname, at URI level this time (I couldn't find any so far). 3 of them for the Pure implementation have been marked pending because they are returning garbage at the moment (the implementation makes up unicode characters from invalid input).
And I also fixed the libidn1 exception handling, which was still letting IDN::Idna::IdnaError bubble up unhandled (I missed it earlier because there was no spec for this case). Now it's raising Addressable::IDNA::Error like the other backends (+spec).
Sorry for the long back and forth, let me know what you think about this one ;)
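For reference, the re-wrapping approach relies on Ruby setting `cause` automatically when an exception is raised inside a `rescue` block. A minimal sketch, assuming stand-in names (`WrapSketch`, `SketchInvalidURIError`, `sketch_normalized_host`) rather than the real code from 9eb3910:

```ruby
# Stand-in for an IDNA backend; the "contains a space" rule is only a
# placeholder for real validation.
module WrapSketch
  class Error < StandardError; end

  def self.to_ascii(host)
    raise Error, "invalid hostname: #{host.inspect}" if host.include?(" ")
    host
  end
end

# Stand-in for the URI-level error class.
class SketchInvalidURIError < StandardError; end

def sketch_normalized_host(host)
  WrapSketch.to_ascii(host)
rescue WrapSketch::Error => e
  # Raising inside a rescue block records the rescued exception as #cause.
  raise SketchInvalidURIError, e.message
end

begin
  sketch_normalized_host("bad host")
rescue SketchInvalidURIError => e
  puts "#{e.class}: #{e.message}"
  puts "cause: #{e.cause.class}"
end
```

Callers only ever rescue the URI-level error, while the original backend error stays reachable through `cause` for debugging.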
Oh sorry for that, I did know about the module thing but forgot before I posted, oh well
Yes, I like 9eb3910 too :) Thanks for extending the spec coverage
No problem ^^ I wasn't sure so I tried.
In order to validate this branch more while you fiddle with it, I just deployed this version to staging and then production on my service, using libidn2 and strict_mode:
# Select libidn2 (not the default at the moment)
require "addressable/idna/libidn2"
Addressable::IDNA.backend = Addressable::IDNA::Libidn2
Addressable::IDNA.strict_mode = true
If I see any problem I'll report it here.
Necro-ing a little bit, but wanted to weigh in on the start of the conversation. I think it's important to offer a mechanism that's permissive in what it accepts. It's literally the reason I wrote Addressable in the first place, because the standard library doesn't take this approach to parsing and I couldn't parse URIs that were openable in a browser, leading to surprise from end users. There are often cases where failing with an exception will mean that there's no graceful way to get partial information. For instance, something might be very wrong with the encoding in the hostname, but if the library's user was only trying to retrieve the path value, the invalid URI exception is rather obstructive to that goal.
On the other hand, you're absolutely right that there are cases where the opposite is preferred and strict parsing is preferable. My view is these should simply be handled by different methods rather than changing the behavior for the whole library in a major version rev.
@sporkmonger thanks for your feedback.
I understand and agree. The consensus that @dentarg and I reached further down this discussion was to introduce the Addressable::IDNA.strict_mode = true option (defaulting to false) so that people can choose stricter parsing if they want it, but otherwise it's lax as before.
Of course if you prefer different methods instead of an option, we can probably do that instead. Though if there are different methods in the IDNA modules for both behaviors, we would also need to mirror that in the URI module because the methods people usually call are there. I haven't checked the whole public API recently but I'm concerned this may create a lot of new methods. I also thought about people using gems which depend on addressable: if we're using different methods, the end user won't be able to change the behavior. I can have a deeper look if you want me to 👍
✔️ As usual I also verified this new version against my 150k URLs at updown.io: no differences after normalization between master (using the pure implementation) and this branch (using native2), except that it now took only 5.6 seconds (versus 12s with pure) to normalize the 150k URLs.
A few comments, need to take this for a spin too when time permits :)
Co-authored-by: Patrik Ragnarsson <patrik@starkast.net>
…s it requires the gem to be loaded
  rescue Error
    strict_mode ? raise : value
  end

  def to_unicode(value)
-   backend.to_unicode(value)
+   backend.to_unicode(value) if value.is_a?(String)
While doing some tests I noticed that behavior was not consistent between backends when passing invalid input, so I normalized this with if value.is_a?(String) and added tests in f0b98df. This case shouldn't happen much unless people manually call Addressable::IDNA, but in that case we'd better protect the code a bit.
libidn2 for example was throwing an invalid memory read at address=0x0000000000000000 when called with nil.
If you prefer, for other types like Integer or Array we can also raise a TypeError.
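Putting the two pieces of this diff together, the dispatch logic can be sketched as below. This is an illustration under assumptions, not the code from the PR: `StrictSketch` and `RejectingBackend` are made-up names, and the stand-in backend simply rejects every `xn--` input to simulate a decode failure.

```ruby
# Sketch of a front module that delegates to a pluggable backend.
module StrictSketch
  class Error < StandardError; end

  class << self
    attr_accessor :backend, :strict_mode

    def to_unicode(value)
      # Non-String input (nil, Integer, Array...) never reaches the backend.
      return nil unless value.is_a?(String)
      backend.to_unicode(value)
    rescue Error
      # Lenient by default: hand the input back unchanged unless strict.
      strict_mode ? raise : value
    end
  end

  self.strict_mode = false
end

# Stand-in backend: pretends every ACE-prefixed input is undecodable.
module RejectingBackend
  def self.to_unicode(value)
    raise StrictSketch::Error, "cannot decode #{value.inspect}" if value.start_with?("xn--")
    value
  end
end

StrictSketch.backend = RejectingBackend
p StrictSketch.to_unicode("xn--zckp1cyg1.sblo.jp") # returned unchanged in lenient mode
p StrictSketch.to_unicode(nil)                     # nil, backend is not called
```

Switching `StrictSketch.strict_mode = true` makes the same call raise `StrictSketch::Error` instead of returning the input.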
Hey! I really love interchangeable IDNA backends introduced in this PR. Would it be possible to merge that part? I've tested this PR with my pure-Ruby implementation of IDNA2008/UTS46, and it integrates seamlessly:
# frozen_string_literal: true
require "uri/idna"
module Addressable
  module IDNA
    module UTS46
      def self.to_ascii(value)
        URI::IDNA.to_ascii(value)
      end
      def self.to_unicode(value)
        URI::IDNA.to_unicode(value)
      end
    end
  end
end
The IDNA backend concept would allow me to easily package the Addressable backend within the gem. Overall the
Cool. Maybe possible. I really want to find some time to properly play with this myself and merge it. It is really awesome work by @jarthod here. And I'm also following your repo @skryukov :) And ruby/uri#76 hehe.
That would be great indeed, and excellent work on the pure Ruby implementation @skryukov 👏 (I'm also following ruby/uri#76 ^^). @dentarg let me know if there is anything I can do to make it easier for you 🙇
Following #247, here is the PR adding libidn2 support through a small ffi wrapper (no dependencies other than ffi).
✔️ Benchmark result (benchmark code in the PR): slightly slower than libidn1 as it's doing more work, but nothing to worry about, we're still above 100k iterations/sec which is good 🚀
✔️ I also added a memory leak test to this benchmark code, to make sure I wasn't forgetting to manually free any C allocations. It's all good now, and by commenting out one of the idn2_free lines we can easily see the memory increasing, so it's looking good on this side.
✔️ Specs are green locally (I just added 3 more to specify whether the implementation follows IDNA2003 or IDNA2008+UTS#46) and on CI. I verified that no additional steps are required to install libidn2 (already present by default on ubuntu and macos). I confirmed all the envs are running and passing the libidn2 tests (except Windows of course); we can see this from the increased number of tests run compared with master and the absence of the "Could not load native libidn2 implementation" line.
I saw that the profile job may have to be updated as it's currently running idna_mode: [native, pure] only. I'll have a look at this.
→ Edit: done, I've added the "native2" variant to the profile job and it runs fine. I confirmed locally that it's using libidn2 (and we can see ffi in the output). MemoryProfiler won't catch C-level memory leaks though (I tried).