BUG: read_csv fails some http servers if port number is specified #17019

Closed
skynss opened this issue Jul 19, 2017 · 6 comments
Labels
Enhancement IO CSV read_csv, to_csv IO Network Local or Cloud (AWS, GCS, etc.) IO Issues

Comments

@skynss

skynss commented Jul 19, 2017

xref #16716

Code Sample, a copy-pastable example if possible

import pandas as pd
import requests
from urllib.request import urlopen, Request
from urllib.parse import urlparse

# some shared hosting web server named 'apex'
uapex_no_port = 'http://handsome-equator.000webhostapp.com/no_auth/aaa.csv'
uapex_with_port = 'http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv'

# nginx web server
u_nginx_no_port = 'http://pandastest.mooo.com/aaa.csv'
u_nginx_with_port = 'http://pandastest.mooo.com:80/aaa.csv'

for url in [
    uapex_no_port,      # always succeeds
    u_nginx_no_port,    # always succeeds
    u_nginx_with_port,  # always succeeds
    uapex_with_port,    # succeeds with requests; fails with urlopen/pandas
]:
    print(url)
    txt = requests.get(url)  # always succeeds
    try:
        req1 = Request(url)
        txt = urlopen(req1).read()  # fails on apex with explicit port (uapex_with_port)
        df = pd.read_csv(url)       # fails on apex with explicit port (uapex_with_port)
    except Exception as ex:
        print('FAIL {} -- {}'.format(url, str(ex)))  # reports an HTTP 404 error
    # urlopen fails on apex but requests succeeds because nginx can handle,
    # but apex cannot handle, the request header 'Host': '<fqdn>:<port>'
    # that urlopen sets.
    # requests does not put the default port number in the Host header
    # ('Host': '<fqdn>'), so requests works with all of the URLs.
    # Specifying the port number in the Host header is allowed by the HTTP RFC,
    # but I don't know how prevalent this issue is beyond apex.
    p = urlparse(url)
    req2 = Request(url)
    req2.add_header('Host', p.hostname)
    txt = urlopen(req2).read()  # always succeeds

Problem description

The problem is that with at least one web server, urlopen, and therefore pandas.read_csv, fails when a URL of the form http://<fqdn>:<port> is specified, even if the port is the default port 80. However, if the python-requests library is used instead of urlopen, the same URL works. The difference is that requests sets the header Host: fqdn, whereas urlopen sets it to Host: fqdn:port. While urlopen still adheres to the HTTP RFC, requests, Firefox, Chrome, IE and curl all work with all of the URLs, so a pandas user might wonder why pandas returns a 404. The question is how big an issue this is. I don't know, so I cannot immediately recommend that this be fixed. But we should watch out for similar issues in the future and then consider either modifying the Host header or using the requests library.
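The Host header difference described above can be inspected without any network access. The following is a minimal sketch, not pandas code: `host_header_for` and `request_without_default_port` are hypothetical helper names, and the default-port table is an assumption covering only http and https. It relies on `Request.host` holding the host (with port) that urlopen would use for the Host header when none is set explicitly.

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Assumed table of default ports; only http and https are covered here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header_for(url):
    """Hypothetical helper: Host value with the scheme's default port omitted.

    IPv6 literals are not handled in this sketch.
    """
    parts = urlsplit(url)
    if parts.port is None or parts.port == DEFAULT_PORTS.get(parts.scheme):
        return parts.hostname
    return "{}:{}".format(parts.hostname, parts.port)


def request_without_default_port(url):
    """Hypothetical helper: build a Request whose Host header is set explicitly."""
    req = Request(url)
    # An explicitly added Host header takes precedence over the one
    # urlopen would otherwise derive from req.host.
    req.add_header("Host", host_header_for(url))
    return req


url = "http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv"
print(Request(url).host)     # host and port that urlopen would send by default
print(host_header_for(url))  # host only, since 80 is the default for http
```

A Request built this way can be passed to urlopen in the same way as req2 in the code sample; a non-default port such as 8080 is kept in the header.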

Expected Output

A dataframe should be read.

Output of pd.show_versions()

pandas: 0.20.2
@gfyoung
Member

gfyoung commented Jul 19, 2017

@skynss : Why can't we just set the header ourselves if that's the problem? It seems like your code in the example could address that.

@skynss
Author

skynss commented Jul 19, 2017

@gfyoung I recommend not fixing it for now because

  1. I observed this issue only on this hosting service. Maybe it is simply an isolated issue. I logged the issue to see if others report "me too".
  2. For ports other than 80 and 443, curl/firefox/requests all send out "<fqdn>:<port>", so there is no reason for the apex server not to handle an explicitly specified port, which urllib sends in adherence to the HTTP RFC.

@gfyoung
Member

gfyoung commented Jul 19, 2017

@skynss : Fair enough. We still have to merge your original PR for user-auth in the first place 😄

@jbrockmendel jbrockmendel added the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Dec 11, 2019
@empz

empz commented Mar 12, 2020

What's the status of this?

I'm getting an error when trying to pd.read_csv from a URL with a non-standard port in it.

@jreback
Contributor

jreback commented Mar 12, 2020

it’s an open issue, pull requests are always welcome

@mroeschke
Member

Headers can be modified now by using storage_options so closing
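
As a rough illustration of the storage_options route, here is a sketch under some assumptions: that the installed pandas forwards storage_options key-value pairs as request headers for HTTP(S) URLs (documented for recent versions, 1.3 or later as far as I know), and that the URL uses the scheme's default port so the bare hostname is the right Host value. `host_storage_options` and `read_csv_with_plain_host` are hypothetical names, not pandas API.

```python
from urllib.parse import urlsplit


def host_storage_options(url):
    """Hypothetical helper: headers dict carrying a port-free Host value.

    Assumes the URL uses the default port for its scheme.
    """
    return {"Host": urlsplit(url).hostname}


def read_csv_with_plain_host(url):
    """Hypothetical wrapper: read a CSV over HTTP with an explicit Host header."""
    # pandas is imported lazily so the helper above works without it installed.
    import pandas as pd

    # For HTTP(S) URLs, pandas passes these pairs on as request headers.
    return pd.read_csv(url, storage_options=host_storage_options(url))


# Example call (needs network access and a server that is still online):
# df = read_csv_with_plain_host(
#     "http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv"
# )
```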
