vsicurl cannot handle streamed responses #8735

Closed
frafra opened this issue Nov 16, 2023 · 3 comments

@frafra

frafra commented Nov 16, 2023

Expected behavior and actual behavior.

Unable to open a remote resource that has been dynamically generated, using vsicurl.

Steps to reproduce the problem.

ogrinfo --debug on --config CPL_CURL_VERBOSE YES -oo X_POSSIBLE_NAMES=decimalLongitude -oo Y_POSSIBLE_NAMES=decimalLatitude 'CSV:/vsizip/{/vsicurl/https://ipt.nina.no/archive.do?r=arko_gel&v=1.12}/occurrence.txt'

Output:

HTTP: libcurl/8.0.1 OpenSSL/3.0.8 zlib/1.2.13 brotli/1.0.9 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.4) libssh/0.10.4/openssl/zlib nghttp2/1.52.0
HTTP: GDAL was built against curl 7.87.0, but is running against 8.0.1.
CURL_INFO_TEXT: Couldn't find host ipt.nina.no in the (nil) file; using defaults
CURL_INFO_TEXT:   Trying 158.38.174.15:443...
CURL_INFO_TEXT: Connected to ipt.nina.no (158.38.174.15) port 443 (#0)
CURL_INFO_TEXT: ALPN: offers h2,http/1.1
CURL_INFO_TEXT: TLSv1.3 (OUT), TLS handshake, Client hello (1):
CURL_INFO_TEXT:  CAfile: /etc/pki/tls/certs/ca-bundle.crt
CURL_INFO_TEXT:  CApath: none
CURL_INFO_TEXT: TLSv1.3 (IN), TLS handshake, Server hello (2):
CURL_INFO_TEXT: TLSv1.2 (IN), TLS handshake, Certificate (11):
CURL_INFO_TEXT: TLSv1.2 (IN), TLS handshake, Server key exchange (12):
CURL_INFO_TEXT: TLSv1.2 (IN), TLS handshake, Server finished (14):
CURL_INFO_TEXT: TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
CURL_INFO_TEXT: TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
CURL_INFO_TEXT: TLSv1.2 (OUT), TLS handshake, Finished (20):
CURL_INFO_TEXT: TLSv1.2 (IN), TLS handshake, Finished (20):
CURL_INFO_TEXT: SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
CURL_INFO_TEXT: ALPN: server accepted h2
CURL_INFO_TEXT: Server certificate:
CURL_INFO_TEXT:  subject: C=NO; ST=Trøndelag; O=Stiftelsen norsk institutt for naturforskning NINA; CN=ipt.nina.no
CURL_INFO_TEXT:  start date: Mar 30 00:00:00 2023 GMT
CURL_INFO_TEXT:  expire date: Mar 29 23:59:59 2024 GMT
CURL_INFO_TEXT:  subjectAltName: host "ipt.nina.no" matched cert's "ipt.nina.no"
CURL_INFO_TEXT:  issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Organization Validation Secure Server CA
CURL_INFO_TEXT:  SSL certificate verify ok.
CURL_INFO_TEXT: using HTTP/2
CURL_INFO_TEXT: h2h3 [:method: HEAD]
CURL_INFO_TEXT: h2h3 [:path: /archive.do?r=arko_gel&v=1.12]
CURL_INFO_TEXT: h2h3 [:scheme: https]
CURL_INFO_TEXT: h2h3 [:authority: ipt.nina.no]
CURL_INFO_TEXT: h2h3 [accept: */*]
CURL_INFO_TEXT: Using Stream ID: 1 (easy handle 0x5625c3053740)
CURL_INFO_HEADER_OUT: HEAD /archive.do?r=arko_gel&v=1.12 HTTP/2
Host: ipt.nina.no
accept: */*

CURL_INFO_HEADER_IN: HTTP/2 200 
CURL_INFO_HEADER_IN: server: nginx
CURL_INFO_HEADER_IN: date: Thu, 16 Nov 2023 15:47:43 GMT
CURL_INFO_HEADER_IN: content-type: application/zip;charset=ISO-8859-1
CURL_INFO_HEADER_IN: access-control-allow-origin: *
CURL_INFO_HEADER_IN: access-control-allow-methods: GET, OPTIONS, HEAD
CURL_INFO_HEADER_IN: set-cookie: JSESSIONID=4B6828221A77DC32CF98E10E226CC45D; Path=/; Secure; HttpOnly
CURL_INFO_HEADER_IN: set-cookie: CSRFtoken=3izphOH65BQcxRFLjOS7t5GUZLPtDCVQ; Max-Age=900; Expires=Thu, 16-Nov-2023 16:02:43 GMT; Domain=ipt.nina.no; Secure; HttpOnly
CURL_INFO_HEADER_IN: content-disposition: filename="dwca-arko_gel-v1.12.zip"
CURL_INFO_HEADER_IN: content-language: en-GB
CURL_INFO_HEADER_IN: 
CURL_INFO_TEXT: Connection #0 to host ipt.nina.no left intact
VSICURL: HEAD did not provide file size. Retrying with GET
CURL_INFO_TEXT: Couldn't find host ipt.nina.no in the (nil) file; using defaults
CURL_INFO_TEXT: Found bundle for host: 0x5625c304fba0 [can multiplex]
CURL_INFO_TEXT: Re-using existing connection #0 with host ipt.nina.no
CURL_INFO_TEXT: h2h3 [:method: GET]
CURL_INFO_TEXT: h2h3 [:path: /archive.do?r=arko_gel&v=1.12]
CURL_INFO_TEXT: h2h3 [:scheme: https]
CURL_INFO_TEXT: h2h3 [:authority: ipt.nina.no]
CURL_INFO_TEXT: h2h3 [accept: */*]
CURL_INFO_TEXT: Using Stream ID: 3 (easy handle 0x5625c3053740)
CURL_INFO_HEADER_OUT: GET /archive.do?r=arko_gel&v=1.12 HTTP/2
Host: ipt.nina.no
accept: */*

CURL_INFO_HEADER_IN: HTTP/2 200 
CURL_INFO_HEADER_IN: server: nginx
CURL_INFO_HEADER_IN: date: Thu, 16 Nov 2023 15:47:43 GMT
CURL_INFO_HEADER_IN: content-type: application/zip;charset=ISO-8859-1
CURL_INFO_HEADER_IN: access-control-allow-origin: *
CURL_INFO_HEADER_IN: access-control-allow-methods: GET, OPTIONS, HEAD
CURL_INFO_HEADER_IN: set-cookie: JSESSIONID=B3C461DAF4F8F7557793AE4A89671C5B; Path=/; Secure; HttpOnly
CURL_INFO_HEADER_IN: set-cookie: CSRFtoken=aNzUY4o5T8c3WodGxujjzJYYiiqHKu1K; Max-Age=900; Expires=Thu, 16-Nov-2023 16:02:43 GMT; Domain=ipt.nina.no; Secure; HttpOnly
CURL_INFO_HEADER_IN: content-disposition: filename="dwca-arko_gel-v1.12.zip"
CURL_INFO_HEADER_IN: content-language: en-GB
CURL_INFO_HEADER_IN: 
CURL_INFO_TEXT: Failure writing output to destination
CURL_INFO_TEXT: Connection #0 to host ipt.nina.no left intact
VSICURL: GetFileSize(https://ipt.nina.no/archive.do?r=arko_gel&v=1.12): response_code=200, curl error msg=Failure writing output to destination
VSICURL: Request at offset 0, after end of file
VSICURL: Request at offset 0, after end of file
VSICURL: Request at offset 0, after end of file
[...]
FAILURE:
Unable to open datasource `CSV:/vsizip/{/vsicurl/https://ipt.nina.no/archive.do?r=arko_gel&v=1.12}/occurrence.txt' with the following drivers.
[...]

Setting use_head=no does not fix the issue, nor does using vsicurl_streaming.

It works if I save the ZIP locally and serve it with a simple file server.

Using mitmproxy, I can see that the file is retrieved correctly and in full, so I wonder why GDAL cannot open it.

I used varnish to disable the streamed response, so that it returns the Content-Length and Accept-Ranges headers. This is a valid workaround, and it provides a mechanism to cache the file as well: https://gist.github.com/frafra/cdfc98cdbbe93bbdb73ed6363c5c613f

Operating system

Fedora 38 x86_64.

GDAL version and provenance

GDAL 3.6.4 from Fedora official repositories.

@rouault
Member

rouault commented Nov 16, 2023

/vsicurl/ doesn't work with all HTTP servers, and in particular generally not with ones that generate dynamic content. They need to support arbitrary Range requests and report the file size in Content-Length.
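
As a rough illustration of those two requirements, here is a minimal sketch that inspects a set of response headers. The function name `vsicurl_friendly` and the `static` header set are hypothetical, and this is not how GDAL itself decides; a server may also honor Range requests without advertising Accept-Ranges, so treat the result as a hint only.

```python
def vsicurl_friendly(headers):
    """Heuristic only: True if the headers advertise a size and byte ranges."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    has_length = normalized.get("content-length", "").isdigit()
    accepts_ranges = normalized.get("accept-ranges", "").lower() == "bytes"
    return has_length and accepts_ranges


# Subset of the headers returned by ipt.nina.no in the log above:
# neither Content-Length nor Accept-Ranges is present.
streamed = {
    "server": "nginx",
    "content-type": "application/zip;charset=ISO-8859-1",
    "content-disposition": 'filename="dwca-arko_gel-v1.12.zip"',
}

# Hypothetical headers from a plain static file server (values are made up).
static = {"Content-Length": "1048576", "Accept-Ranges": "bytes"}

print(vsicurl_friendly(streamed))
print(vsicurl_friendly(static))
```

The streamed response fails the check because it reports no size, which matches the "HEAD did not provide file size" message in the debug output.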

/vsicurl_streaming/ is only usable for formats that can be read from start to end without seeking, and ZIP compression does not enable that, since reading a ZIP file requires reading the directory content located at the end of the file.
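
The point about the directory living at the end of the file can be seen with a small sketch using Python's standard `zipfile` module. The archive is built in memory and its member content is made up for illustration; the two signatures are the local file header and end-of-central-directory markers defined by the ZIP format.

```python
import io
import zipfile

LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"  # starts each stored member
EOCD_SIGNATURE = b"PK\x05\x06"          # end of central directory record

# Build a tiny archive in memory (illustrative content only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("occurrence.txt", "id\tdecimalLatitude\tdecimalLongitude\n")
data = buf.getvalue()

local_header_offset = data.find(LOCAL_HEADER_SIGNATURE)
eocd_offset = data.rfind(EOCD_SIGNATURE)

# Member data starts at the beginning of the file, but the record that tells
# a reader where the directory is sits in the last bytes of the archive
# (22 bytes from the end when the archive has no comment).
print("first local header at offset", local_header_offset)
print("end-of-central-directory starts", len(data) - eocd_offset, "bytes from the end")
```

A reader that can only move forward through the stream reaches that record last, after the member data has already gone by.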

So I don't believe there's anything that can be fixed on the GDAL side.

@frafra
Author

frafra commented Nov 17, 2023

Wouldn't it be possible to mimic what vsistdin is doing, by adding a buffer_limit option to vsicurl_streaming? This works great:

curl 'https://ipt.nina.no/archive.do?r=arko_gel&v=1.12' | ogrinfo -oo X_POSSIBLE_NAMES=decimalLongitude -oo Y_POSSIBLE_NAMES=decimalLatitude 'CSV:/vsizip/{/vsistdin?buffer_limit=-1}/occurrence.txt'
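
The reason an unlimited buffer helps can be sketched in plain Python: once the whole stream has been collected in memory, the reader can seek to the end of the archive like it would on a local file. This is only an analogy for the idea, not GDAL's implementation; `read_member_from_stream` is a hypothetical helper and the archive is generated locally instead of being downloaded.

```python
import io
import zipfile


def read_member_from_stream(chunks, member):
    """Buffer a forward-only stream of byte chunks, then read one ZIP member."""
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)  # nothing can be parsed until the last chunk arrives
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:  # needs a seekable file object
        return archive.read(member)


# Stand-in for the HTTP response body: a small archive cut into 16-byte chunks.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as zf:
    zf.writestr("occurrence.txt", "id\tdecimalLatitude\tdecimalLongitude\n")
payload = raw.getvalue()
chunks = (payload[i:i + 16] for i in range(0, len(payload), 16))

text = read_member_from_stream(chunks, "occurrence.txt").decode("utf-8")
print(text)
```

The cost is that the full archive must fit in memory, which is the trade-off an unlimited buffer implies.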

@rouault
Member

rouault commented Apr 18, 2024

Closing, as I don't foresee any further action on this.

@rouault rouault closed this as completed Apr 18, 2024