Unicode handling in header location #110

Open
noraj opened this issue Feb 17, 2023 · 11 comments

Comments

@noraj

noraj commented Feb 17, 2023

webrick doesn't handle Unicode in the HTTP Location header, e.g. a redirect to a URL like http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg.

[2023-02-17 16:41:33] ERROR URI::InvalidURIError: URI must be ascii only "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/\u041E\u0443\u044D\u043D-\u041C\u044D\u0442\u044C\u044E\u0441.jpg"                                                                                                              
        /usr/local/lib/ruby/3.2.0/uri/rfc3986_parser.rb:20:in `split'                                                                                                                                                
        /usr/local/lib/ruby/3.2.0/uri/rfc3986_parser.rb:71:in `parse'                                                                                                                                                
        /usr/local/lib/ruby/3.2.0/uri/rfc3986_parser.rb:111:in `convert_to_uri'                                                                                                                                      
        /usr/local/lib/ruby/3.2.0/uri/generic.rb:1110:in `merge'                                                                                                                                                     
        /usr/local/bundle/gems/webrick-1.8.1/lib/webrick/httpresponse.rb:320:in `setup_header'                                                                                                                       
        /usr/local/bundle/gems/webrick-1.8.1/lib/webrick/httpresponse.rb:240:in `send_response'                                                                                                                      
        /usr/local/bundle/gems/webrick-1.8.1/lib/webrick/httpserver.rb:112:in `run'                                                                                                                                  
        /usr/local/bundle/gems/webrick-1.8.1/lib/webrick/server.rb:310:in `block in start_thread'

The following code is responsible:

@header['location'] = @request_uri.merge(location).to_s

This is because methods such as URI.parse or, as here, URI#merge only handle ASCII.

uri = URI.parse('http://dxczjjuegupb.cloudfront.net')
uri.merge('/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg').to_s
/home/noraj/.asdf/installs/ruby/3.2.0/lib/ruby/3.2.0/uri/rfc3986_parser.rb:20:in `split': URI must be ascii only "/wp-content/uploads/2017/10/\u041E\u0443\u044D\u043D-\u041C\u044D\u0442\u044C\u044E\u0441.jpg" (URI::InvalidURIError)                                                                                                                                                                      
        from /home/noraj/.asdf/installs/ruby/3.2.0/lib/ruby/3.2.0/uri/rfc3986_parser.rb:71:in `parse'                                                                                   
        from /home/noraj/.asdf/installs/ruby/3.2.0/lib/ruby/3.2.0/uri/rfc3986_parser.rb:111:in `convert_to_uri'                                                                         
        from /home/noraj/.asdf/installs/ruby/3.2.0/lib/ruby/3.2.0/uri/generic.rb:1110:in `merge'                                                                                        
        from (irb):9:in `<main>'                                                                                                                                                        
        from /home/noraj/.asdf/installs/ruby/3.2.0/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'                                                                  
        from /home/noraj/.asdf/installs/ruby/3.2.0/bin/irb:25:in `load'                                                                                                                 
        from /home/noraj/.asdf/installs/ruby/3.2.0/bin/irb:25:in `<main>'

So URLs or URL fragments should be escaped first: with CGI.escape for a URL component and URI::Parser.new.escape for full URLs.

Examples in https://github.com/noraj/ctf-party/blob/master/lib/ctf_party/cgi.rb.

cf. https://stackoverflow.com/questions/46849219/ruby-uriinvalidurierror-uri-must-be-ascii-only/75487328

patched code:

uri.merge(CGI.escape('/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')).to_s
# => "http://dxczjjuegupb.cloudfront.net/%2Fwp-content%2Fuploads%2F2017%2F10%2F%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
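
Note that CGI.escape also encodes the / separators, as the %2F sequences in the output above show. A minimal sketch of an alternative, assuming the location is a decoded path whose slashes should survive: the standard library's RFC 2396 escaper leaves reserved characters such as / alone by default. The helper name merge_decoded_path is hypothetical, and this is not what webrick does internally.

```ruby
require 'uri'

# Hypothetical helper: escape a decoded path, then resolve it against a base.
# URI::RFC2396_Parser#escape leaves reserved characters such as "/" untouched
# by default and percent-encodes each byte of non-ASCII characters.
def merge_decoded_path(base, decoded_path)
  escaped = URI::RFC2396_Parser.new.escape(decoded_path)
  URI.parse(base).merge(escaped).to_s
end

puts merge_decoded_path(
  'http://dxczjjuegupb.cloudfront.net',
  '/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg'
)
# prints http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg
```

Because this escaper also encodes a literal %, it only gives the intended result when the input is known to be decoded.
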
@ioquatix
Member

If possible, can you submit a PR which addresses your issue?

@jeremyevans
Contributor

#111 was already submitted, but closed by the submitter as not backwards compatible. I don't think there is a backwards compatible way to implement what is being requested (automatically escape location), as it would break cases where valid, already percent-escaped locations are passed. I think the current expectation that the location provided is a valid URL/path is reasonable.

@ioquatix
Member

Okay, so the solution is to close this issue without any fix? i.e. working as expected?

@jeremyevans
Contributor

Possibly we should update the documentation to mention this, if it isn't already mentioned (I haven't checked). I just closed a similar issue in Roda filed by the same submitter with a documentation fix.

@noraj
Author

noraj commented Feb 19, 2023

#111 was already submitted, but closed by the submitter as not backwards compatible. I don't think there is a backwards compatible way to implement what is being requested (automatically escape location), as it would break cases where valid, already percent-escaped locations are passed.

It would be possible to detect whether it's already encoded, to avoid re-encoding.

I think the current expectation that the location provided is a valid URL/path is reasonable.

I don't think it is, because the user can provide a URL-encoded string that gets decoded on the fly by something along the way (web browser, reverse proxy, HTTP server), so the application server receives a URL-decoded string.

@noraj
Author

noraj commented Feb 19, 2023

I experienced this with a web app made with roda: the user provided a URL with the Unicode characters URL-encoded and got this error when the HTTP server was webrick, and no error when the HTTP server was puma. It's clearly possible to support Unicode. I don't know if the best way is working around it (encoding/escaping the URI), using something more modern like addressable, or patching URI upstream.

@noraj
Author

noraj commented Feb 19, 2023

Ruby URI supports only RFC 2396 and RFC 3986, so if URI requires non-ASCII characters to be URL-encoded, it's the job of the framework using URI (HTTP server, application) to ensure non-ASCII characters are encoded.

@jeremyevans
Contributor

Trying to detect valid encoding and reencoding if not valid is prone to security issues. For example:

  1. Your application code expects reencoding (expects to pass Unicode), but an attacker submits already encoded data.
  2. Your application code doesn't expect reencoding (expects to pass properly encoded data), but an attacker finds a way to get it to pass invalid data.

In general it's a bad idea for library code to make guesses as to whether to encode. It should always work in the same way. It's simpler and backwards compatible to assume the location is always already properly encoded. While we could add an option to toggle the behavior, I think it's better to document the expected behavior. Users can and should make sure the URL they are passing as the location is a valid URL.
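
The ambiguity can be shown with the standard library alone. This is only a sketch: guess_escape and always_escape are hypothetical helpers written for illustration, and the "looks encoded" check is an assumed heuristic, not anything webrick or URI implements.

```ruby
require 'uri'

# Hypothetical "smart" helper: escape only when the input does not already
# look percent-encoded. This is the guessing approach discussed above.
def guess_escape(str)
  looks_encoded = str.ascii_only? && str.match?(/%\h\h/)
  looks_encoded ? str : URI::RFC2396_Parser.new.escape(str)
end

# Always treat the input as decoded and escape it, with no guessing.
def always_escape(str)
  URI::RFC2396_Parser.new.escape(str)
end

literal = '/files/100%25.txt' # a file literally named "100%25.txt"

puts guess_escape(literal)  # /files/100%25.txt   (a receiver decodes it to "100%.txt")
puts always_escape(literal) # /files/100%2525.txt (a receiver decodes it to "100%25.txt")
```

The guessing helper silently changes which resource is addressed, and which branch runs depends entirely on what the caller (or an attacker) supplies.
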

In the example you are providing, you appear to be directly passing user (attacker) provided input as the location, without any validation. This seems like a very risky security practice, and not something that we should make changes to accommodate. You should be taking the URL the user is providing and appropriately validating it (e.g. constructing a valid URL from it) before using it as the location.

If you are getting a user-provided value that you want to escape to use in part of the URL, you probably want to use URI::DEFAULT_PARSER.escape.
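
As a rough sketch of that advice, assuming the user supplies a full, decoded URL and that redirects are only allowed to a known set of hosts: the helper validated_location and the allowlist ALLOWED_REDIRECT_HOSTS are hypothetical names, and the host check is just one possible validation policy.

```ruby
require 'uri'

# Hypothetical allowlist of hosts the application is willing to redirect to.
ALLOWED_REDIRECT_HOSTS = %w[dxczjjuegupb.cloudfront.net].freeze

# Hypothetical helper: always treat the input as decoded, escape it, parse it,
# and only return it if it is an http(s) URL on an allowed host.
# Returns nil when the input cannot be turned into an acceptable URL.
def validated_location(raw)
  escaped = URI::RFC2396_Parser.new.escape(raw)
  uri = URI.parse(escaped)
  return nil unless uri.is_a?(URI::HTTP) && ALLOWED_REDIRECT_HOSTS.include?(uri.host)

  uri.to_s
rescue URI::InvalidURIError
  nil
end

p validated_location('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
p validated_location('http://evil.example/phish')
```

The first call returns the percent-encoded URL and the second returns nil. Since the helper never guesses, input that is already percent-encoded gets its % signs encoded again.
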

@noraj
Author

noraj commented Feb 19, 2023

I can't disagree that re-encoding or detection is bad. Yet nowadays any modern software should handle Unicode properly. The issue here is upstream because ruby/uri supports only old RFC 2396 and RFC 3986 and not RFC 3987 and RFC 6570. So I see two options left:

  1. Use addressable, an alternative implementation of URI that provides extensive support for IRIs and URI templates. But as webrick is hosted in the ruby namespace, I guess you don't want to require a third-party gem. If so, only the second option is acceptable.
  2. Open an issue upstream so that ruby/uri supports Internationalized Resource Identifiers (IRIs, RFC 3987). (It seems there is already one here: Non ASCII characters are not allowed in the path, uri#40.)

Abstract of IRI

This document defines a new protocol element, the Internationalized
Resource Identifier (IRI), as a complement to the Uniform Resource
Identifier (URI). An IRI is a sequence of characters from the
Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to
URIs is defined, which means that IRIs can be used instead of URIs,
where appropriate, to identify resources.

The approach of defining a new protocol element was chosen instead of
extending or changing the definition of URIs. This was done in order
to allow a clear distinction and to avoid incompatibilities with
existing software. Guidelines are provided for the use and
deployment of IRIs in various protocols, formats, and software
components that currently deal with URIs.

Overview and Motivation of IRI

A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters.

The characters in URIs are frequently used for representing words of
natural languages. This usage has many advantages: Such URIs are
easier to memorize, easier to interpret, easier to transcribe, easier
to create, and easier to guess. For most languages other than
English, however, the natural script uses characters other than A -
Z. For many people, handling Latin characters is as difficult as
handling the characters of other scripts is for those who use only
the Latin alphabet. Many languages with non-Latin scripts are
transcribed with Latin letters. These transcriptions are now often
used in URIs, but they introduce additional ambiguities.

The infrastructure for the appropriate handling of characters from
local scripts is now widely deployed in local versions of operating
system and application software. Software that can handle a wide
variety of scripts and languages at the same time is increasingly
common. Also, increasing numbers of protocols and formats can carry
a wide range of characters.

This document defines a new protocol element called Internationalized
Resource Identifier (IRI) by extending the syntax of URIs to a much
wider repertoire of characters. It also defines "internationalized"
versions corresponding to other constructs from [RFC3986], such as
URI references. The syntax of IRIs is defined in section 2, and the
relationship between IRIs and URIs in section 3.

Using characters outside of A - Z in IRIs brings some difficulties.
Section 4 discusses the special case of bidirectional IRIs, section 5
various forms of equivalence between IRIs, and section 6 the use of
IRIs in different situations. Section 7 gives additional informative
guidelines, and section 8 security considerations.

@jeremyevans
Contributor

I can't disagree that re-encoding or detection is bad.

I'm glad we can agree on that.

Yet nowadays any modern software should handle Unicode properly. The issue here is upstream because ruby/uri supports only old RFC 2396 and RFC 3986 and not RFC 3987 and RFC 6570. So I see two options left:

There is no way to handle Unicode properly with URIs. You need to use IRIs for that. If Ruby shipped with a library that supported IRIs, then I think we would be open to supporting IRIs in webrick (in that we could handle an IRI object and use it, not that we would treat string input as an IRI).

  1. Use addressable, an alternative implementation of URI that provides extensive support for IRIs and URI templates. But as webrick is hosted in the ruby namespace, I guess you don't want to require a third-party gem. If so, only the second option is acceptable.

You can use Addressable::URI directly in your code if you want to support IRI:

header['location'] = Addressable::URI.parse(your_location).normalize

Note that Addressable has the security problems I discussed:

# Escapes
Addressable::URI.parse("http://app/foo/%").normalize.to_s
# => "http://app/foo/%25"

# Doesn't escape
Addressable::URI.parse("http://app/foo/%25").normalize.to_s
# => "http://app/foo/%25"

# Escapes some characters and not others
Addressable::URI.parse("http://app/foo/%25\u1000").normalize.to_s
# => "http://app/foo/%25%E1%80%80"

I'm not sure whether it is possible to use Addressable in a secure manner, where it always treats input as decoded and encodes it (where the second example would give you a path of /foo/%2525). Knowing Addressable's current behavior with regard to escaping, I would advise against using it, at least in any security-sensitive environment.

  2. Open an issue upstream so that ruby/uri supports Internationalized Resource Identifiers (IRIs, RFC 3987). (It seems there is already one here: Non ASCII characters are not allowed in the path, uri#40.)

I don't think URI should handle this, beyond the handling it already has:

URI::DEFAULT_PARSER.escape("/foo/%25\u1000")
# => "/foo/%2525%E1%80%80"

Possibly Ruby could consider adding an IRI library. But such a library needs to be carefully designed, so it never has to guess as to whether something should or should not be encoded. Maybe for partial IRI support in URI, we could consider a URI#decoded_path= method, which could do self.path = parser.escape(path) (and possibly a decoded_host= for IDN support (RFC 5890)).
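
A sketch of what such a setter might do, written as a standalone helper so that nothing in URI is patched. The name assign_decoded_path is hypothetical; no decoded_path= method exists in URI today, and this only illustrates the "always escape, never guess" behavior described above.

```ruby
require 'uri'

# Hypothetical stand-in for the proposed URI#decoded_path=: the argument is
# always treated as decoded, so it is escaped unconditionally before being
# assigned to the path component.
def assign_decoded_path(uri, decoded_path)
  uri.path = URI::RFC2396_Parser.new.escape(decoded_path)
  uri
end

uri = assign_decoded_path(URI.parse('http://app'), "/foo/%25\u1000")
puts uri
# prints http://app/foo/%2525%E1%80%80
```

By default this escaper leaves reserved characters such as ? untouched, so a real implementation would also need to decide how to handle those inside a path.
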

@noraj
Author

noraj commented Feb 20, 2023

I wanted to bring that up; I have no definitive opinion about the stage at which it should be handled (the dev using the web framework, the web framework (like roda), or the HTTP server (like webrick)).

I looked at why it works in puma, and it seems to be because they use a regexp (Ruby regexps are Unicode-aware) and not the URI module.

https://github.com/puma/puma/blob/a61b0782d7aec250f7d355a7471603abc5837685/lib/puma/rack/urlmap.rb#L24-L43
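
To illustrate the difference in approach, here is a small sketch that splits a URL with a regexp and then tries URI.parse on the same string. The helper split_location and its pattern are hypothetical and are not Puma's actual code.

```ruby
require 'uri'

# Hypothetical helper, not Puma's code: split an absolute http(s) URL into
# host and path with a regexp, without going through URI.
def split_location(location)
  match = location.match(%r{\Ahttps?://([^/]+)(/.*)?\z})
  match && [match[1], match[2]]
end

location = 'http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg'

host, path = split_location(location)
puts host # dxczjjuegupb.cloudfront.net
puts path # /wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg

begin
  URI.parse(location)
rescue URI::InvalidURIError => e
  puts "URI.parse raised #{e.class}"
end
```

The regexp accepts the non-ASCII path as-is, but it also performs none of the syntax checks that URI does.
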
