BUG: Gracefully handle all utf-8 characters in json urls #17933

Closed
wants to merge 1 commit into from

@grantcooksey grantcooksey commented Oct 21, 2017

The URL is passed through quote before the request is made. The safe parameter is set to the reserved character set defined in RFC 2396 (see section 2.2).
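
For illustration only, here is a minimal sketch of what quoting with that safe set does, assuming Python 3's `urllib.parse.quote`. The example URL is made up and is not taken from the PR or its tests. It also shows what the default `safe="/"` would do to a full URL.

```python
from urllib.parse import quote

# Reserved characters from RFC 2396, section 2.2, as passed in the diff.
RFC2396_RESERVED = ";/?:@&=+$,"

# Hypothetical URL containing a space and a non-ASCII character.
url = "http://example.com/some path/café.json"

# With the reserved set marked safe, only the space and "é" are encoded.
with_reserved = quote(url, safe=RFC2396_RESERVED)
print(with_reserved)  # http://example.com/some%20path/caf%C3%A9.json

# With the default safe="/", the ":" after the scheme is encoded as well,
# which yields a string that is no longer a usable URL.
with_default = quote(url)
print(with_default)  # http%3A//example.com/some%20path/caf%C3%A9.json
```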

codecov bot commented Oct 21, 2017

Codecov Report

Merging #17933 into master will decrease coverage by 0.01%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master   #17933      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50113    50114       +1     
==========================================
- Hits        45723    45714       -9     
- Misses       4390     4400      +10

Flag          Coverage Δ
#multiple     89.03% <50%> (-0.01%) ⬇️
#single       40.31% <50%> (-0.07%) ⬇️

Impacted Files           Coverage Δ
pandas/io/common.py      69.19% <50%> (-0.3%) ⬇️
pandas/io/gbq.py         25% <0%> (-58.34%) ⬇️
pandas/core/frame.py     97.75% <0%> (-0.1%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 77b4bb3...ec9cc9a.

codecov bot commented Oct 21, 2017

Codecov Report

Merging #17933 into master will decrease coverage by 0.01%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           master   #17933      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50113    50114       +1     
==========================================
- Hits        45723    45714       -9     
- Misses       4390     4400      +10

Flag          Coverage Δ
#multiple     89.03% <33.33%> (-0.01%) ⬇️
#single       40.31% <33.33%> (-0.07%) ⬇️

Impacted Files           Coverage Δ
pandas/io/common.py      69.19% <33.33%> (-0.3%) ⬇️
pandas/io/gbq.py         25% <0%> (-58.34%) ⬇️
pandas/core/frame.py     97.75% <0%> (-0.1%) ⬇️

@@ -187,6 +187,7 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
     filepath_or_buffer = _stringify_path(filepath_or_buffer)

     if _is_url(filepath_or_buffer):
+        filepath_or_buffer = quote(filepath_or_buffer, safe=';/?:@&=+$,')
Contributor

Is there a reason the default is not enough here?

Contributor

If yes, can you expand the test that exercises that (with another case)?

@jreback added the Bug, IO Data, and Unicode labels Oct 21, 2017
@@ -998,6 +998,7 @@ I/O
- Bug in :meth:`DataFrame.to_html` in which there was no validation of the ``justify`` parameter (:issue:`17527`)
- Bug in :func:`HDFStore.select` when reading a contiguous mixed-data table featuring VLArray (:issue:`17021`)
- Bug in :func:`to_json` where several conditions (including objects with unprintable symbols, objects with deep recursion, overlong labels) caused segfaults instead of raising the appropriate exception (:issue:`14256`)
- Bug in :func:`read_json` where all utf-8 characters were not encoded properly when reading json data from a url (:issue:`17918`)
Contributor

I suspect this is generally true (not just for read_json), so maybe amend this to say it applies to URLs in general. If you can add a test for more readers, that would be great.

@grantcooksey
Author

@jreback OK, so the more I dig into this issue, the less confident I am that this is a worthwhile change. The issue was raised because read_json raised an error when the URL passed to it contained whitespace. We could encode the URL, but that might be a bad idea: there is no way to know whether a reserved character is being used for its reserved purpose, and we run the risk of double-encoding a URL that is already encoded. For example, if we are passed

http://example.com/to/json?key=some?value

we run the risk of not encoding the second ? in the query part of the URL.
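
Both risks can be reproduced in a few lines. This is only a sketch, assuming Python 3's `urllib.parse.quote` and the safe set from the diff; the second URL is a made-up example of input that is already percent-encoded.

```python
from urllib.parse import quote

# Safe set used in the diff (RFC 2396 reserved characters).
SAFE = ";/?:@&=+$,"

# Risk 1: the second "?" is data, but "?" is in the safe set, so quote
# leaves the whole URL unchanged and the "?" stays unencoded.
ambiguous = "http://example.com/to/json?key=some?value"
quoted_ambiguous = quote(ambiguous, safe=SAFE)
print(quoted_ambiguous)  # http://example.com/to/json?key=some?value

# Risk 2: "%" is not in the safe set, so an already-encoded space (%20)
# is encoded a second time and becomes %2520.
already_encoded = "http://example.com/some%20path"  # hypothetical input
double_encoded = quote(already_encoded, safe=SAFE)
print(double_encoded)  # http://example.com/some%2520path
```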

In addition, quote is intended for quoting the path section of a URL, further reinforcing the feeling that trying to encode the entire URL, including the scheme and query, is something that we shouldn't be doing.

Perhaps a better approach would be to add a note to the docs stating that properly encoding the url is the responsibility of the user. I would be happy to open a PR if you think that would be a good idea.
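
As a rough idea of what user-side encoding could look like, the sketch below encodes the path and the query separately instead of quoting the whole URL. `encode_url` is a hypothetical helper, not part of pandas, and it assumes the input is a raw URL that has not been percent-encoded yet (an encoded path would be encoded again).

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode the path and query of a raw (not yet encoded) URL.

    Hypothetical helper for illustration; scheme, host and fragment are
    passed through untouched.
    """
    parts = urlsplit(url)
    # quote's default safe="/" keeps the path separators intact.
    path = quote(parts.path)
    # Split the query into key/value pairs so that reserved characters
    # inside a value (such as "?") are encoded as data.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(pairs, quote_via=quote)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


encoded = encode_url("http://example.com/to/my data.json?key=some?value")
print(encoded)  # http://example.com/to/my%20data.json?key=some%3Fvalue
```

The resulting string is what would then be handed to a reader such as read_json.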

@jreback
Contributor

jreback commented Oct 21, 2017

Actually, that sounds like a good idea: add it to the docs and to the docstrings of all the read_* functions that take URLs (most of them).

@bkmrkr

bkmrkr commented Oct 21, 2017 via email

@grantcooksey
Author

@jreback OK, I will work on that and close this PR. Thanks!

@grantcooksey deleted the encode-url branch October 23, 2017 02:08
@grantcooksey changed the title from "BUG: Gracefully handle all utf-8 characters in json urls GH17918" to "BUG: Gracefully handle all utf-8 characters in json urls" Oct 24, 2017
Labels
Bug, IO Data, Unicode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas.read_ functions crashes on read from url with space characters
3 participants