pandas.read_* functions crash when reading from a URL containing space characters #17918

Closed
badanin-dmitry-playrix opened this issue Oct 19, 2017 · 6 comments
Labels
Bug · IO Data (IO issues that don't fit into a more specific label) · IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues)

Comments

@badanin-dmitry-playrix

import pandas
pandas.read_json('http://httpbin.org/anything/something something')

Python throws an exception:
HTTPError: HTTP Error 505: HTTP Version Not Supported

If the URL is escaped:

pandas.read_json('http://httpbin.org/anything/something%20something')

then it works as expected
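
As a minimal sketch of the workaround described above, the path segment can be percent-encoded with the standard library before the URL reaches pandas. This uses Python 3's `urllib.parse.quote` (the report was filed against Python 2.7, where the equivalent is `urllib.quote`); the variable names `base`, `segment`, and `url` are illustrative assumptions, not part of the original report.

```python
from urllib.parse import quote

# Illustrative names; only the httpbin URL comes from the report above.
base = "http://httpbin.org/anything/"
segment = "something something"

# quote() percent-encodes the space, leaving letters untouched.
url = base + quote(segment)
print(url)

# The escaped URL can then be passed to the reader (network call, so it is
# left commented out here):
# pandas.read_json(url)
```

Quoting only the segment that is known to be raw text avoids touching the scheme and host.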

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-96-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ru_RU.UTF-8
LANG: ru_RU.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.13.3
scipy: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@bkmrkr

bkmrkr commented Oct 19, 2017 via email

@TomAugspurger
Contributor

I suppose we could quote the URL before doing the requests. Interested in submitting a PR?

@TomAugspurger added the IO Data and IO Network labels Oct 19, 2017
@TomAugspurger added this to the Next Major Release milestone Oct 19, 2017
@grantcooksey

@badanin-dmitry-playrix I'd be happy to take this if you aren't interested in submitting a PR. I've been looking for a good entry point to contributing to this project and this looks like something I could manage.

@badanin-dmitry-playrix
Author

@grantcooksey

I posted this in the PR that I submitted and am cross-posting it here so that we can either close the issue or link it to a doc change PR. The more I dig into this issue, the less confident I am that this is a worthwhile change. The issue was opened because an error is raised when a URL passed to read_json contains whitespace. We could encode the URL, but that might be a bad idea: there is no way to know when a reserved character is being used for its reserved purpose, and we run the risk of doubly encoding a URL. For example, if we are passed

http://example.com/to/json?key=some?value

we cannot tell whether the second ? is a delimiter or data, so we run the risk of not encoding the ? in the query part of the URL.

In addition, quote is intended for quoting the path section of a URL, further reinforcing the feeling that trying to encode the entire URL, including the scheme and query, is something that we shouldn't be doing.

Rather than trying to encode an already formed URL, I propose adding a note to the docs of the read functions stating that properly encoding the URL is the responsibility of the user.
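
The concerns in this comment can be illustrated with a short sketch using Python 3's `urllib.parse`. The example URL `raw` and the helper name `quote_path_only` are hypothetical, introduced only for illustration; this is not the approach taken in the PR.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Hypothetical input with spaces in both the path and the query.
raw = "http://example.com/to/json file?key=some value"

# 1. Quoting the whole URL also escapes ':', '?' and '=',
#    which breaks the scheme and the query delimiter.
whole = quote(raw)

# 2. Quoting an already-escaped string escapes the '%' again.
double = quote("something%20something")


def quote_path_only(url):
    """Percent-encode only the path component (hypothetical helper)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))


# 3. Quoting just the path fixes the path but leaves the query untouched,
#    so the result is still not a fully encoded URL.
path_only = quote_path_only(raw)

print(whole)
print(double)
print(path_only)
```

Even the path-only variant still double-encodes a path that was already escaped, which is consistent with leaving the encoding to the caller, who knows which parts of the URL are raw text.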

@mroeschke
Member

I think Python raises the correct error now.

InvalidURL: URL can't contain control characters. '/anything/something something' (found at least ' ')

I don't think this needs a test, but if it does, I'm happy to reopen.
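
For reference, a sketch of where this error appears to come from: in recent Python 3 releases, `http.client` validates the request path before anything is sent, so the check can be triggered without a network connection. This assumes a Python version that includes that validation (3.8 or later should); the variable names are illustrative.

```python
import http.client

# Constructing the connection object does not open a socket.
conn = http.client.HTTPConnection("httpbin.org")
try:
    # The path contains a raw space, which the validation rejects
    # before any bytes are sent.
    conn.putrequest("GET", "/anything/something something")
    message = None
except http.client.InvalidURL as exc:
    message = str(exc)
finally:
    conn.close()

print(message)
```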

6 participants