
Read over http(s) ? #3134

Open
jgranduel opened this issue Aug 6, 2024 · 3 comments
Labels
feat New feature or request

Comments

@jgranduel

Description

With duckdb, one can read CSV, JSON, or Parquet files served over HTTP(S), e.g.:

SELECT *
FROM read_parquet('https://some.url/some_file.parquet');

See the duckdb Parquet documentation.

Will you implement such a feature in glaredb? Authentication might be tricky in some cases (I thought it was present in datafusion, but I must be wrong, as I couldn't find any reference).

@jgranduel jgranduel added the feat New feature or request label Aug 6, 2024
@tychoish
Contributor

tychoish commented Aug 6, 2024

This should work in many cases today!

As you say, there are a bunch of edge cases around authentication, but it definitely works:

I just ran:

select count(*) from read_parquet('https://huggingface.co/api/datasets/RLHFlow/Orca-distibalel-standard/parquet/default/train/0.parquet');

And it worked just fine.

Our read_json() function supports reading top-level JSON arrays (as you might expect from some public APIs), and also supports streams of newline-delimited and/or multi-line JSON objects.

Let me know if you have specific edge cases that you'd like to address or other concerns!
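For illustration, the two stream shapes mentioned above can be sketched with plain Python's json module (the sample data is made up, and this is not GlareDB's implementation, just a sketch of the two formats):

```python
import json

# Two common shapes a JSON-over-HTTP source can take (illustrative sample data):
array_body = '[{"a": 1}, {"a": 2}]'   # top-level JSON array
ndjson_body = '{"a": 1}\n{"a": 2}'    # newline-delimited JSON objects

def rows_from_array(body):
    # A top-level array maps directly to a list of rows.
    return json.loads(body)

def rows_from_ndjson(body):
    # Each non-empty line is an independent JSON object.
    return [json.loads(line) for line in body.splitlines() if line.strip()]

# Both shapes yield the same rows.
print(rows_from_array(array_body) == rows_from_ndjson(ndjson_body))  # True
```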

@jgranduel
Author

Thanks for your response.

Your example ("select count(*) from read_parquet('https://huggingface.co/api/da[..]')") indeed worked on my side as well.
I tried another example that we can share if needed ("https://reqres.in/api/users") and it worked well (except that I don't know how to unnest a JSON array as in duckdb, e.g. with t(data) as (select unnest(data) from read_json('https://reqres.in/api/users')) select data.id, data.email, data.first_name, data.last_name, data.avatar from t; but that's another question 😉).
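For reference, the flattening that the duckdb unnest query performs can be sketched in plain Python against a hypothetical payload shaped like the reqres.in response (the field values here are invented; only the shape matters):

```python
import json

# Hypothetical payload with the same shape as the reqres.in response:
# an object whose "data" key holds an array of user objects.
body = json.loads(
    '{"page": 1, "data": ['
    '{"id": 1, "email": "a@example.com", "first_name": "Ann"},'
    '{"id": 2, "email": "b@example.com", "first_name": "Bob"}]}'
)

# unnest(data) projects one row per array element; the same
# flattening in plain Python:
rows = [(u["id"], u["email"], u["first_name"]) for u in body["data"]]
print(rows)  # [(1, 'a@example.com', 'Ann'), (2, 'b@example.com', 'Bob')]
```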

So I tried to reproduce what I did yesterday. After trying with a local file, I wanted to check if I could load data over HTTP, using simple HTTP file servers. I tried with miniserve, and it didn't work. I also tried today with another very basic web server (jwebserver), same result, and finally with hfs, and it worked. The difference is that hfs uses port 80 by default 💡.

Here are my tests:

glaredb-101> cat .\data.csv
a,b
1,2
glaredb-101> glaredb.exe
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘

then simple test with miniserve (default port is 8080)

glaredb-101> miniserve.exe .
miniserve v0.22.0
Bound to [::]:8080, 0.0.0.0:8080
Serving path ...glaredb-101
Available at (non-exhaustive list):
    http://127.0.0.1:8080
    http://<my ip>:8080
    
 glaredb-101> glaredb.exe
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('http://localhost:8080/data.csv');
Error: External error: External error: invalid port number
> select * from read_csv('http://127.0.0.1:8080/data.csv');
Error: External error: External error: invalid IPv4 address
> select * from read_csv('http://<my ip>:8080/data.csv');
Error: External error: External error: invalid IPv4 address

then miniserve running on port 80:

glaredb-101> miniserve.exe -p 80 .
miniserve v0.22.0
Bound to [::]:80, 0.0.0.0:80
Serving path ...\glaredb-101
Available at (non-exhaustive list):
    http://127.0.0.1:80
    http://<my ip>:80

glaredb-101> glaredb
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('http://localhost/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://127.0.0.1/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://<my ip>/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://localhost:80/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘

Using localhost:80 works well. Could there be a parsing error with non-default ports, i.e. an issue with URL parsing?
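As a sanity check, the failing URLs parse cleanly with Python's standard-library urlparse. This is shown only to illustrate that the URLs themselves are well-formed; GlareDB's actual URL-handling code may behave differently:

```python
from urllib.parse import urlparse

# The URLs that produced "invalid port number" / "invalid IPv4 address"
# above are well-formed: a stdlib parser extracts host and port without
# complaint, which supports the suspicion that the error sits in a
# later URL-handling layer rather than in the URLs.
for url in ('http://localhost:8080/data.csv', 'http://127.0.0.1:8080/data.csv'):
    u = urlparse(url)
    print(u.hostname, u.port)
```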

@tychoish
Contributor

tychoish commented Aug 12, 2024

Using localhost:80 works well. Could there be a parsing error with non-default ports, i.e. an issue with URL parsing?

That seems like a quite reasonable deduction; the port is probably getting dropped somewhere.
I suspect that when you explicitly reference :80 it's also being stripped out, but that's "fine" because the HTTP spec says to assume port 80 when no port is given.
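A quick illustration of that default-port behavior, using Python's urlparse (not GlareDB's parser):

```python
from urllib.parse import urlparse

# With an explicit :80 the port is carried in the URL; when it is
# omitted, the parser reports no port at all and the HTTP client is
# expected to fall back to the scheme default (80 for http).
explicit = urlparse('http://localhost:80/data.csv')
implicit = urlparse('http://localhost/data.csv')
print(explicit.port)  # 80
print(implicit.port)  # None -> client assumes the scheme default, 80
```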

I've done some poking around and nothing is jumping out at me in either the DF code or the GlareDB code, will continue to look.

except that I don't know how to unnest a JSON array as in duckdb

GlareDB will unwind top-level JSON arrays in JSON source data automatically, on the theory that a top-level array of JSON objects can mean nothing else (i.e. it would be an error any other way).

Coming very shortly, GlareDB will allow you to filter JSON data through jaq in read_json and in the configuration of external JSON tables. This will let you re-structure JSON objects as they're ingested, if you need something else.
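To sketch what such a restructuring filter could do (the filter program and payload here are hypothetical examples, not confirmed GlareDB syntax): a jq-style program like '.data[] | {id, email}' would both unnest the "data" array and keep only the named fields. A rough Python analogue:

```python
import json

# Hypothetical payload; the goal is to unnest "data" and drop "avatar".
payload = json.loads(
    '{"data": [{"id": 1, "email": "a@example.com", "avatar": "x.png"}]}'
)

# Rough Python analogue of '.data[] | {id, email}':
rows = [{"id": u["id"], "email": u["email"]} for u in payload["data"]]
print(rows)  # [{'id': 1, 'email': 'a@example.com'}]
```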
