Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libidn2 support for IDNA2008+UTS#46 (using ffi) #496

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

jarthod
Copy link
Contributor

@jarthod jarthod commented Mar 22, 2023

Following #247, here is the PR adding libidn2 support through a small ffi wrapper (no dependencies other than ffi).

✔️ Benchmark result (benchmark code in the PR), slightly slower than libidn1 as it's doing more work but nothing to worry about, we're still above 100k iterations/sec which is good 🚀 :

# > ruby benchmark/idna.rb
# Rehearsal -------------------------------------------
# pure      5.914630   0.000000   5.914630 (  5.915326)
# libidn    0.518971   0.003672   0.522643 (  0.522676)
# libidn2   0.763936   0.000000   0.763936 (  0.763983)
# ---------------------------------- total: 7.201209sec

#               user     system      total        real
# pure      6.042877   0.000000   6.042877 (  6.043252)
# libidn    0.521668   0.000000   0.521668 (  0.521704)
# libidn2   0.764782   0.000000   0.764782 (  0.764863)

✔️ I also added a memory leak test to this benchmark code, to make sure I wasn't forgetting any manual memory free from C allocations. It's all good now and we can easily see by commenting one of the idn2_free lines the memory increasing so looking good on this side:

# Memory leak test for libidn2 (memory should stabilize quickly):
#  Memory: 117MB
#  Memory: 121MB
#  Memory: 121MB
#  Memory: 121MB
#  Memory: 121MB
#  Memory: 121MB
#  Memory: 121MB
#  Memory: 121MB
#  Memory: 121MB
#  Memory: 121MB

✔️ Specs are green locally (I just added 3 more to specify if the implementation is following IDNA2003 or IDNA2008+UTS#46) and on CI. I verified and there's no additional steps required to install libidn2 (already present by default in ubuntu and macos). I confirmed all the envs are running and passing libidn2 test (except Windows of course), we can see this with the increased number of test ran compared with master and the absence of the "Could not load native libidn2 implementation" line.

I saw maybe the profile job will have to be updated as it's currently running idna_mode: [native, pure] only. I'll have a look at this.
→ Edit: done, I've added the "native2" variant to the profile job and it runs fine. I confirmed locally that it's using libidn2 (and we can see ffi in the output). MemoryProfiler won't catch C-level memory leaks though (I tried).

@jarthod jarthod changed the title libidn2 support for IDNA2008+TR46 (using ffi) libidn2 support for IDNA2008+UTS#46 (using ffi) Mar 22, 2023
addressable.gemspec Outdated Show resolved Hide resolved
lib/addressable/idna.rb Outdated Show resolved Hide resolved
def self.to_unicode(value)
pointer = FFI::MemoryPointer.new(:pointer)
res = idn2_to_unicode_8z8z(value, pointer, IDN2_NONTRANSITIONAL)
return value if res != 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially put some exception here in case of invalid input, but it turns out the specs expect invalid punnycode hostname to simply be returned unchanged, so I did just that instead. It's hidding errors and silently returning the input string now, not very strict I suppose but more compatible with existing usage 🤷

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to change in a major version bump?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it's a more of a design choice. It's true that if it were my decision I would prefer the stricter version raising an exception, and as I suggest shipping this in a major version, we could probably do it.

But on the other end I know the direction of the gem is to be "flexible, offers heuristic parsing", as opposed to the Ruby URI module, so I understand that accepting invalid input and keeping it unchanged without raising can be a feature and a design choice. So if you guys prefer to keep this flexibility I totally understand it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the test case we're talking about?

it "should return the identity conversion when punycode decode fails" do
expect(Addressable::IDNA.to_unicode("xn--zckp1cyg1.sblo.jp")).to eq(
"xn--zckp1cyg1.sblo.jp")
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one and 2 others would be failing if I raise an error here:

  1) Addressable::IDNA when using the libidn2 native implementation (ffi) it should behave like converting from ASCII to unicode should convert 'AcinusFallumTrompetumNullunCreditumVisumEstAtCuadLongumEtCefallum.com' correctly
     Failure/Error: raise "libidn2 failed to convert \"#{value}\" to unicode (#{idn2_strerror(res)})" if res != 0
     
     RuntimeError:
       libidn2 failed to convert "AcinusFallumTrompetumNullunCreditumVisumEstAtCuadLongumEtCefallum.com" to unicode (domain label longer than 63 characters)
     Shared Example Group: "converting from ASCII to unicode" called from ./spec/addressable/idna_spec.rb:321
     # ./lib/addressable/idna/libidn2.rb:52:in `to_unicode'
     # ./lib/addressable/idna.rb:30:in `to_unicode'
     # ./spec/addressable/idna_spec.rb:160:in `block (2 levels) in <top (required)>'

  2) Addressable::IDNA when using the libidn2 native implementation (ffi) it should behave like converting from ASCII to unicode should return the identity conversion when punycode decode fails
     Failure/Error: raise "libidn2 failed to convert \"#{value}\" to unicode (#{idn2_strerror(res)})" if res != 0
     
     RuntimeError:
       libidn2 failed to convert "xn--zckp1cyg1.sblo.jp" to unicode (string contains invalid punycode data)
     Shared Example Group: "converting from ASCII to unicode" called from ./spec/addressable/idna_spec.rb:321
     # ./lib/addressable/idna/libidn2.rb:52:in `to_unicode'
     # ./lib/addressable/idna.rb:30:in `to_unicode'
     # ./spec/addressable/idna_spec.rb:164:in `block (2 levels) in <top (required)>'

  3) Addressable::IDNA when using the libidn2 native implementation (ffi) it should behave like converting from ASCII to unicode should return the identity conversion when the ACE prefix has no suffix
     Failure/Error: raise "libidn2 failed to convert \"#{value}\" to unicode (#{idn2_strerror(res)})" if res != 0
     
     RuntimeError:
       libidn2 failed to convert "xn--...-" to unicode (string contains invalid punycode data)
     Shared Example Group: "converting from ASCII to unicode" called from ./spec/addressable/idna_spec.rb:321
     # ./lib/addressable/idna/libidn2.rb:52:in `to_unicode'
     # ./lib/addressable/idna.rb:30:in `to_unicode'
     # ./spec/addressable/idna_spec.rb:169:in `block (2 levels) in <top (required)>'

The last two are invalid punycode and the first one is invalid DNS length (https://datatracker.ietf.org/doc/html/rfc1034#section-3.1).

libidn1 also raise in the first case but we have this workaround to explicitely allow for > 63 bytes labels:

    def self.to_ascii(value)
      value.to_s.split('.', -1).map do |segment|
        if segment.size > 0 && segment.size < 64
          IDN::Idna.toASCII(segment, IDN::Idna::ALLOW_UNASSIGNED)
        elsif segment.size >= 64
          segment
        else
          ''
        end
      end.join('.')
    end

Looks like this was made in c73810f to make it more consistent with pure. Didn't see any issue attached though.

So I suppose if we make libidn2 stricter, which means basically:

raise "libidn2 failed to convert \"#{value}\" to unicode (#{idn2_strerror(res)})" if res != 0

instead of

return value if res != 0

We would need to remove these workarounds and make all implementations rejects these domains in the same way.
Which does sound like the way to go IMO but of course could break some use-cases for people who need to handle such "slightly" invalid domains.

Copy link
Contributor Author

@jarthod jarthod Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting suggestion, I do like the fact that this solution being run-time, we can use this class even if it's not defined yet. Unfortunately it doesn't work with classes, only modules: wrong argument type Class (expected Module) if I do error.extend(Addressable::URI::InvalidURIError). In your example Twingly::URL::Error is actually a module. And if we need to change Addressable::URI::InvalidURIError to be a module this would complexify the rest :/ I couldn't find any way to change the ancestory chain by adding another class in the middle.

Looking again at uri.rb I see IDNA is used only twice (in normalized_host for to_ascii and in display_uri for to_unicode), so option 2 which is to re-wrap the error here doesn't sound too complicated either.

I just gave this option a try in 9eb3910 and I think I actually prefer this one. No hierarchy issue here, every module/class only deal with its own exceptions. The wrapped exception are properly identified and the cause attribute contains the previous exception (with specs for that just in case), this means backtrace and history is complete for bug tracker. People doing rescue Addressable::URI::InvalidURIError are covered the same way.

I also added by the way specs for the case of invalid IDNA hostname at URI level this time (I couldn't find any at the moment). 3 of them for the Pure implementation have been marked pending because they are returning garbage at the moment (implementation makes up unicode characters from invalid input).

And I also fixed the libidn1 exception handling, which was still letting IDN::Idna::IdnaError exception up so not handled properly (I missed it earlier because there was no spec on this case), now it's raising Addressable::IDNA::Error like the other backends (+spec)

Sorry for the long back and forth, let me know what you think about this one ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry for that, I did know about the module thing but forgot before I posted, oh well

Yes, I like 9eb3910 too :) Thanks for extending the spec coverage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem ^^ I wasn't sure so I tried.

In order to validate this branch more while you fiddle with it, I just deployed this version to staging and then production on my service. Using libidn2 and strict_mode:

# Select libidn2 (not the default at the moment)
require "addressable/idna/libidn2"
Addressable::IDNA.backend = Addressable::IDNA::Libidn2
Addressable::IDNA.strict_mode = true

If I see any problem I'll report it here.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Necro-ing a little bit, but wanted to weigh in on the start of the conversation. I think it's important to offer a mechanism that's permissive in what it accepts. It's literally the reason I wrote Addressable in the first place, because the standard library doesn't take this approach to parsing and I couldn't parse URIs that were openable in a browser, leading to surprise from end users. There are often cases where failing with an exception will mean that there's no graceful way to get partial information. For instance, something might be very wrong with the encoding in the hostname, but if the library's user was only trying to retrieve the path value, the invalid URI exception is rather obstructive to that goal.

On the other hand, you're absolutely right that there are cases where the opposite is preferred and strict parsing is preferable. My view is these should simply be handled by different methods rather than changing the behavior for the whole library in a major version rev.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sporkmonger thanks for your feedback.

I understand and agree, the concensus that @dentarg and I reached further down this discussion was to introduce the Addressable::IDNA.strict_mode = true option (default to false) so that people can choose if they want stricter parsing but otherwise it's lax as before.

Of course if you prefer different methods instead of an option, we can probably do that instead. Though if there's different methods in IDNA modules for both behaviors, we would also need to mirror that in the URI module because the methods people call usually are here. I haven't checked the whole public API recently but I'm concerned this may create a lot of new methods. I also thought about people using gems which depends on adressable : if we're using different methods, the end-user won't be able to change the behavior. I can have a deeper look if you want me to 👍

addressable.gemspec Outdated Show resolved Hide resolved
benchmark/idna.rb Outdated Show resolved Hide resolved
@jarthod jarthod marked this pull request as draft March 22, 2023 18:48
@jarthod jarthod marked this pull request as ready for review March 22, 2023 18:48
@jarthod
Copy link
Contributor Author

jarthod commented Mar 23, 2023

✔️ As usual I also verified this new version against my 150k URL at updown.io, no differences after normalization between master (using pure implementation) and this branch (using native2). Except that it took only 5.6 seconds now (versus 12s with pure) to normalize the 150k URLs.

.github/workflows/test.yml Outdated Show resolved Hide resolved
Copy link
Collaborator

@dentarg dentarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, need to take this for a spin too when time permits :)

README.md Outdated Show resolved Hide resolved
def self.to_unicode(value)
pointer = FFI::MemoryPointer.new(:pointer)
res = idn2_to_unicode_8z8z(value, pointer, IDN2_NONTRANSITIONAL)
return value if res != 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the test case we're talking about?

it "should return the identity conversion when punycode decode fails" do
expect(Addressable::IDNA.to_unicode("xn--zckp1cyg1.sblo.jp")).to eq(
"xn--zckp1cyg1.sblo.jp")
end

lib/addressable/idna/native.rb Outdated Show resolved Hide resolved
lib/addressable/idna/pure.rb Outdated Show resolved Hide resolved
benchmark/idna.rb Outdated Show resolved Hide resolved
lib/addressable/idna.rb Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
jarthod and others added 3 commits April 18, 2023 10:44
lib/addressable/uri.rb Outdated Show resolved Hide resolved
rescue Error
strict_mode ? raise : value
end

def to_unicode(value)
backend.to_unicode(value)
backend.to_unicode(value) if value.is_a?(String)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While doing some tests I noticed behavior was not consistent when passing invalid input between backends, so I normalized this with if value.is_a?(String) and added tests in f0b98df. This case shouldn't happen much unless people manually call Addressable::IDNA but in that case we better protect the code a bit.

libidn2 for example was throwing some invalid memory read at address=0x0000000000000000 when called with nil.

If you prefer for the other types like Integer or Array we can also raise a TypeError.

@skryukov
Copy link

Hey! I really love interchangeable IDNA backends introduced in this PR. Would it be possible to merge that part? I've tested this PR with my pure-Ruby implementation of IDNA2008/UTS46, and it integrates seamlessly:

# frozen_string_literal: true

require "uri/idna"

module Addressable
  module IDNA
    module UTS46
      def self.to_ascii(value)
        URI::IDNA.to_ascii(value)
      end

      def self.to_unicode(value)
        URI::IDNA.to_unicode(value)
      end
    end
  end
end

The IDNA backend concept would allow me to easily package the Addressable backend within the gem.

Overall the uri-idna gem is 2-4 times slower than Addressable::IDNA::Pure, but I would prefer using that instead of the ffi version tbh 😅

@dentarg
Copy link
Collaborator

dentarg commented Nov 15, 2023

Cool. Maybe possible. I really want to find some time to properly play with this myself and merge it. It is really awesome work by @jarthod here. And I'm also following your repo @skryukov :) And ruby/uri#76 hehe.

@jarthod
Copy link
Contributor Author

jarthod commented Nov 15, 2023

That would be great indeed, and excellent work on the pure Ruby implementation @skryukov 👏 (I'm also following ruby/uri#76 ^^). @dentarg let me know if there is anything I can do to make it easier for you 🙇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants