
Support for loading data from stages #35

Open
tharwan opened this issue Jan 23, 2024 · 8 comments

Comments

@tharwan
Contributor

tharwan commented Jan 23, 2024

Hi,

Would it be possible to support loading of data from a stage? (https://docs.snowflake.com/en/user-guide/data-load-considerations-load)

I am not sure from the docs whether this is even within the scope of this project.

However it would help us a lot to test our complete ETL workflow.

@tekumara
Owner

Hi @tharwan this might be possible ... could you share example SQL statements you'd like support for, both creating the stage and loading from it?

@tharwan
Contributor Author

tharwan commented Jan 24, 2024

Loading from a stage looks something like this:

COPY INTO my_table
FROM @my_stage/my_file
FILE_FORMAT = (FIELD_DELIMITER = ';')

Creating a stage is not so relevant in our case but looks like this:

CREATE STAGE my_ext_stage
URL='azure://myaccount.blob.core.windows.net/load/files/'
STORAGE_INTEGRATION = myint;

maybe related: tobymao/sqlglot#2463

@tekumara
Owner

Thanks! From the above I gather you want support for CSV files.

I think creating the stage would be needed for fakesnow to know where to find the files to load.

@tharwan
Contributor Author

tharwan commented Jan 24, 2024

Another option could be to simply look for the file locally and ignore the @my_stage part.

e.g.

COPY INTO my_table
FROM @my_stage/my_file
FILE_FORMAT = (FIELD_DELIMITER = ';')

would translate to

COPY my_table
FROM my_file
(FORMAT CSV, DELIMITER ';');
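That translation could be sketched as a toy regex rewrite like the one below. This is illustrative only: a real implementation would use a proper SQL parser (e.g. sqlglot), and the function name is hypothetical.

```python
import re

def translate_copy_into(sql: str) -> str:
    """Toy translation of a Snowflake COPY INTO statement into duckdb
    COPY syntax, ignoring the stage name as suggested above.
    Illustrative sketch only, not how fakesnow implements this."""
    m = re.search(
        r"COPY INTO (\w+)\s+FROM @\w+/(\S+)\s+"
        r"FILE_FORMAT\s*=\s*\(FIELD_DELIMITER\s*=\s*'(.)'\)",
        sql,
        re.IGNORECASE | re.DOTALL,
    )
    if m is None:
        raise ValueError("unsupported COPY INTO statement")
    table, path, delim = m.groups()
    return f"COPY {table} FROM '{path}' (FORMAT CSV, DELIMITER '{delim}');"

snowflake_sql = """COPY INTO my_table
FROM @my_stage/my_file
FILE_FORMAT = (FIELD_DELIMITER = ';')"""

print(translate_copy_into(snowflake_sql))
# COPY my_table FROM 'my_file' (FORMAT CSV, DELIMITER ';');
```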

@DanCardin
Contributor

We recently found ourselves wanting this at $job, so I may look into a relatively simple initial implementation soon. My personal plan/preference would be for fakesnow to simply track stage creation internally in information_schema.stages, and something else for integrations (I don't see a table for this, but we are using integrations ourselves).

It would then translate COPY statements against the service in question into boto3 calls plus insert statements (which ought to work for all the supported cloud providers, iirc).

It'd then be the user's job to set up moto however they prefer, to avoid making calls to the actual service. Perhaps fakesnow could provide a default fixture that avoids making calls by default, but I suspect that we, at least, would be forced to override it anyway.

Like I said, I'll probably look into the feasibility of this strategy soon, but I expect sqlglot to be the short-term limiting factor in terms of minimally parsing the integration, stage, and COPY statement syntaxes.
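The stage-tracking part of that plan could be sketched as below. The in-memory catalog and helper names are illustrative assumptions (fakesnow has no such catalog today); the sketch only resolves a COPY against a recorded stage into the (bucket, key) that a boto3 get_object call would fetch, without touching AWS.

```python
# Hedged sketch: record CREATE STAGE statements in an in-memory catalog,
# then resolve a COPY INTO against it. Catalog and function names are
# hypothetical, not fakesnow API.
from urllib.parse import urlparse

stages: dict = {}

def create_stage(name: str, url: str) -> None:
    """Record a stage's storage URL, as CREATE STAGE would."""
    stages[name] = url.rstrip("/")

def resolve_copy(stage: str, file: str) -> tuple:
    """Resolve @stage/file into the (bucket, key) a boto3 call would use."""
    parsed = urlparse(stages[stage])  # e.g. s3://my-bucket/load/files
    return parsed.netloc, parsed.path.lstrip("/") + "/" + file

create_stage("my_ext_stage", "s3://my-bucket/load/files/")
print(resolve_copy("my_ext_stage", "my_file"))
# ('my-bucket', 'load/files/my_file')
# A real implementation would then call
# boto3.client("s3").get_object(Bucket=..., Key=...) and insert the rows.
```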

@tekumara
Owner

If the use case is just testing, then for a first iteration of this we could support a local file via a file:// URL. That would avoid the need to set up actual S3/Azure storage, or moto to mock S3. Would that work?

@DanCardin
Contributor

My personal use case requires a storage integration/S3, but presumably implementing that will be easier if there is already general support for a staged file:// URL.

@tekumara
Owner

Yes, good point - if we transform a COPY INTO into an INSERT INTO ... SELECT in duckdb, then we could support both file:// and s3:// URLs via the same mechanism relatively easily, using duckdb's httpfs extension for S3 support.

If you're keen to tackle this a PR would be welcome!
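A minimal sketch of that rewrite, assuming a stage registry populated at CREATE STAGE time and targeting duckdb's read_csv table function. The registry and helper names here are hypothetical, not fakesnow API:

```python
# Hypothetical sketch of the COPY INTO -> INSERT INTO ... SELECT rewrite.
# The stage registry and helper are illustrative assumptions; the
# generated SQL targets duckdb's read_csv function.

stages = {"my_stage": "file:///tmp/load/files"}  # could also be s3://...

def copy_into_as_insert(table: str, stage: str, file: str, delimiter: str) -> str:
    base = stages[stage].rstrip("/")
    return (
        f"INSERT INTO {table} "
        f"SELECT * FROM read_csv('{base}/{file}', delim='{delimiter}')"
    )

print(copy_into_as_insert("my_table", "my_stage", "my_file", ";"))
# INSERT INTO my_table SELECT * FROM read_csv('file:///tmp/load/files/my_file', delim=';')
```

Because duckdb resolves both local paths and s3:// URLs (the latter via the httpfs extension) through the same table functions, the same rewrite would cover both cases.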
