
Read over http(s) ? #3134

Open
jgranduel opened this issue Aug 6, 2024 · 3 comments
Labels
feat New feature or request

Comments

@jgranduel

Description

With duckdb, one can read CSV, JSON, or Parquet files served over HTTP(S), e.g.:

SELECT *
FROM read_parquet('https://some.url/some_file.parquet');

See the duckdb Parquet documentation.

Will you implement such a feature in glaredb? Authentication might be tricky in some cases (I thought it was present in datafusion, but I must be wrong, as I couldn't find any reference).

@jgranduel jgranduel added the feat New feature or request label Aug 6, 2024
@tychoish
Contributor

tychoish commented Aug 6, 2024

This should work in many cases today!

As you say, there are a bunch of edge cases around authentication, but it definitely works:

I just ran:

select count(*) from read_parquet('https://huggingface.co/api/datasets/RLHFlow/Orca-distibalel-standard/parquet/default/train/0.parquet');

And it worked just fine.

Our read_json() function supports reading top-level JSON arrays (as you might expect from some public APIs), and also supports streams of newline-delimited and/or multi-line JSON objects.

Let me know if you have specific edge cases that you'd like to address or other concerns!
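For illustration, the two stream shapes mentioned above can be sketched with plain Python's json module (the sample data is made up, and this is not GlareDB's implementation, just a sketch of the two formats):

```python
import json

# Two common shapes a JSON-over-HTTP source can take (illustrative sample data):
array_body = '[{"a": 1}, {"a": 2}]'   # top-level JSON array
ndjson_body = '{"a": 1}\n{"a": 2}'    # newline-delimited JSON objects

def rows_from_array(body):
    # A top-level array maps directly to a list of rows.
    return json.loads(body)

def rows_from_ndjson(body):
    # Each non-empty line is an independent JSON object.
    return [json.loads(line) for line in body.splitlines() if line.strip()]

# Both shapes yield the same rows.
print(rows_from_array(array_body) == rows_from_ndjson(ndjson_body))  # True
```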

@jgranduel
Author

Thanks for your response.

Your example ("select count(*) from read_parquet('https://huggingface.co/api/da[..]')") indeed worked on my side as well.
I tried another example that we can share if needed ("https://reqres.in/api/users") and it worked well (except that I don't know how to unnest a JSON array as in duckdb, e.g. with t(data) as (select unnest(data) from read_json('https://reqres.in/api/users')) select data.id, data.email, data.first_name, data.last_name, data.avatar from t; but that's another question 😉).
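For reference, the flattening that the duckdb unnest query performs can be sketched in plain Python against a hypothetical payload shaped like the reqres.in response (the field values here are invented; only the shape matters):

```python
import json

# Hypothetical payload with the same shape as the reqres.in response:
# an object whose "data" key holds an array of user objects.
body = json.loads(
    '{"page": 1, "data": ['
    '{"id": 1, "email": "a@example.com", "first_name": "Ann"},'
    '{"id": 2, "email": "b@example.com", "first_name": "Bob"}]}'
)

# unnest(data) projects one row per array element; the same
# flattening in plain Python:
rows = [(u["id"], u["email"], u["first_name"]) for u in body["data"]]
print(rows)  # [(1, 'a@example.com', 'Ann'), (2, 'b@example.com', 'Bob')]
```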

So I tried to reproduce what I did yesterday. After trying with a local file, I wanted to check if I could load data over HTTP, using simple HTTP file servers. I tried with miniserve, and it didn't work. I also tried today with another very basic web server (jwebserver), same result, and finally with hfs, and it worked. The difference is that hfs uses port 80 by default 💡.

Here are my tests:

glaredb-101> cat .\data.csv
a,b
1,2
glaredb-101> glaredb.exe
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘

then simple test with miniserve (default port is 8080)

glaredb-101> miniserve.exe .
miniserve v0.22.0
Bound to [::]:8080, 0.0.0.0:8080
Serving path ...glaredb-101
Available at (non-exhaustive list):
    http://127.0.0.1:8080
    http://<my ip>:8080
    
 glaredb-101> glaredb.exe
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('http://localhost:8080/data.csv');
Error: External error: External error: invalid port number
> select * from read_csv('http://127.0.0.1:8080/data.csv');
Error: External error: External error: invalid IPv4 address
> select * from read_csv('http://<my ip>:8080/data.csv');
Error: External error: External error: invalid IPv4 address

then miniserve running on port 80:

glaredb-101> miniserve.exe -p 80 .
miniserve v0.22.0
Bound to [::]:80, 0.0.0.0:80
Serving path ...\glaredb-101
Available at (non-exhaustive list):
    http://127.0.0.1:80
    http://<my ip>:80

glaredb-101> glaredb
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('http://localhost/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://127.0.0.1/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://<my ip>/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://localhost:80/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘

Using localhost:80 works well. Could there be a parsing error with non-default ports, i.e. an issue with URL parsing?
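As a sanity check, the failing URLs parse cleanly with Python's standard-library urlparse. This is shown only to illustrate that the URLs themselves are well-formed; GlareDB's actual URL-handling code may behave differently:

```python
from urllib.parse import urlparse

# The URLs that produced "invalid port number" / "invalid IPv4 address"
# above are well-formed: a stdlib parser extracts host and port without
# complaint, which supports the suspicion that the error sits in a
# later URL-handling layer rather than in the URLs.
for url in ('http://localhost:8080/data.csv', 'http://127.0.0.1:8080/data.csv'):
    u = urlparse(url)
    print(u.hostname, u.port)
```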

@tychoish
Contributor

tychoish commented Aug 12, 2024

Using localhost:80 works well. Could there be a parsing error with non-default ports, i.e. an issue with URL parsing?

That seems like a quite reasonable deduction; the port is probably getting dropped somewhere.
I suspect that when you explicitly reference :80 it's also being stripped out, but that's "fine" because the HTTP spec says to assume port 80 when no port is given.
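A quick illustration of that default-port behavior, using Python's urlparse (not GlareDB's parser):

```python
from urllib.parse import urlparse

# With an explicit :80 the port is carried in the URL; when it is
# omitted, the parser reports no port at all and the HTTP client is
# expected to fall back to the scheme default (80 for http).
explicit = urlparse('http://localhost:80/data.csv')
implicit = urlparse('http://localhost/data.csv')
print(explicit.port)  # 80
print(implicit.port)  # None -> client assumes the scheme default, 80
```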

I've done some poking around and nothing is jumping out at me in either the DF code or the GlareDB code, will continue to look.

except that I don't know how to unnest a JSON array as in duckdb

GlareDB will unwind top-level JSON arrays in JSON source data automatically, on the theory that a top-level array of JSON objects can mean nothing else (i.e. it would be an error any other way).

Coming very shortly, GlareDB will allow you to filter JSON data through jaq in read_json and in the configuration of external JSON tables. This will let you re-structure JSON objects as they're ingested, if you need something else.
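To sketch what such a restructuring filter could do (the filter program and payload here are hypothetical examples, not confirmed GlareDB syntax): a jq-style program like '.data[] | {id, email}' would both unnest the "data" array and keep only the named fields. A rough Python analogue:

```python
import json

# Hypothetical payload; the goal is to unnest "data" and drop "avatar".
payload = json.loads(
    '{"data": [{"id": 1, "email": "a@example.com", "avatar": "x.png"}]}'
)

# Rough Python analogue of '.data[] | {id, email}':
rows = [{"id": u["id"], "email": u["email"]} for u in payload["data"]]
print(rows)  # [{'id': 1, 'email': 'a@example.com'}]
```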
