feat: access sqlite dbs from object storage #2772
Conversation
Left some comments. Looks good otherwise.
Co-authored-by: Vaibhav Rabber <vrongmeal@gmail.com>
This is basically done, and I think we just need a test where we pull a database off of a URL.

# Originally these operations failed because GlareDB validated that
# the table existed when the table was created in the catalog, but
# these should error at query time: they may fail to exist now but may
# exist later.
What's the rationale?
When you create an external table (or anything, really) for many, if not most, sources, GlareDB attempts to connect to the provider to verify that the resource exists. This is nice, particularly for data sources where the "connect" operation is relatively inexpensive, and it means we can potentially catch some errors early, but this check doesn't guarantee that the resource will still exist at query time, and everything could change in the meantime, of course.

Accessing a non-local sqlite database is comparatively expensive, and because we release the table providers at the end of a query, the current formulation would download the entire database just to make sure that it existed. That seems bad, and being lazy here seems like the right answer.

(As I wrote this, I was overcome with some doubt about it.)
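To make the trade-off concrete, here is a minimal sketch of the lazy approach described above, in Rust. All names (`RemoteSqliteTable`, `scan`, the `fetched` flag) are illustrative, not GlareDB's actual types: the point is only that `CREATE EXTERNAL TABLE` records metadata without I/O, and a missing resource surfaces as an error at scan time.

```rust
use std::cell::Cell;

// Hypothetical catalog entry for a remote sqlite database. Creation does
// no network I/O; the (possibly expensive) fetch is deferred to the scan.
struct RemoteSqliteTable {
    location: String,
    fetched: Cell<bool>, // tracks whether we've paid the download cost
}

impl RemoteSqliteTable {
    // CREATE EXTERNAL TABLE: just record the location in the catalog.
    fn create(location: &str) -> Self {
        RemoteSqliteTable {
            location: location.to_string(),
            fetched: Cell::new(false),
        }
    }

    // Query time: this is where a missing resource becomes an error.
    fn scan(&self) -> Result<String, String> {
        if !self.fetched.get() {
            // In real code this would download the database file; here we
            // just simulate the existence check.
            if self.location.is_empty() {
                return Err("resource not found".to_string());
            }
            self.fetched.set(true);
        }
        Ok(format!("rows from {}", self.location))
    }
}

fn main() {
    let table = RemoteSqliteTable::create("s3://bucket/db.sqlite");
    assert!(!table.fetched.get()); // creation did no I/O
    assert!(table.scan().is_ok());
    assert!(table.fetched.get()); // cost paid lazily, at query time
    println!("ok");
}
```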
Ok, doubt resolved:

The SqliteAccess object is now the thing that downloads the remote files, and all of the create_external_ operations establish test connections and check that the resource exists. This is probably cheap enough to be fine (though for postgres, if you were doing this often, DBAs/admins would probably notice the connection churn). We could do some generic object-store access check that verifies there's exactly one file matching the location, but files, credentials, and paths can all change between resource creation and the first query.

The external mongodb table support doesn't create a test connection, but everything else seems to.

(Interestingly, if you create-external-table-{gcs,s3,azure} for a .sqlite extension, it'll test for the existence of the file and not download it.)
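For what it's worth, the "exactly one file matches the location" check mentioned above could be sketched like this. `list_prefix` and `validate_single_match` are made-up names standing in for a real object-store listing call; the example also shows why a bare prefix match is subtle (a sqlite database's `-wal` companion file shares the prefix).

```rust
// Hypothetical generic existence check: list the object store at the
// location and require exactly one match.
fn list_prefix<'a>(objects: &'a [&'a str], prefix: &str) -> Vec<&'a str> {
    objects
        .iter()
        .copied()
        .filter(|o| o.starts_with(prefix))
        .collect()
}

fn validate_single_match(objects: &[&str], location: &str) -> Result<(), String> {
    match list_prefix(objects, location).len() {
        0 => Err(format!("no object matches {}", location)),
        1 => Ok(()),
        n => Err(format!("{} objects match {}; expected exactly one", n, location)),
    }
}

fn main() {
    let bucket = ["data/db.sqlite", "data/db.sqlite-wal", "other/file"];
    // A bare prefix over-matches: db.sqlite also prefixes db.sqlite-wal.
    assert!(validate_single_match(&bucket, "data/db.sqlite").is_err());
    assert!(validate_single_match(&bucket, "data/db.sqlite-wal").is_ok());
    println!("ok");
}
```

And, as the comment above notes, even a passing check only describes the store at creation time, not at first query.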
resolved: #2643
I only wired this up for the function and not the db/virtual-listing/create example (which I think will fall out of a refactoring: I just put the code in a weird place in the first pass).

I want to run this through all of the CI tests, and want to check:

- that this thing I'm doing with using the tempdir's lifecycle to manage the cached image is correct.
- all of the other plumbing on the sqlite side of things.
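On the tempdir-lifecycle point: the pattern being relied on is drop-time cleanup, as in tempfile's TempDir. A std-only sketch of the idea, with an invented `CachedDb` guard standing in for the cached database image, looks like this:

```rust
use std::fs;
use std::path::PathBuf;

// Guard that owns the cached database file: the cache lives exactly as
// long as this value, mirroring how tempfile::TempDir cleans up on drop.
struct CachedDb {
    path: PathBuf,
}

impl CachedDb {
    fn download_to(dir: &std::path::Path) -> std::io::Result<Self> {
        let path = dir.join("cached.sqlite");
        fs::write(&path, b"fake sqlite bytes")?; // stands in for the download
        Ok(CachedDb { path })
    }
}

impl Drop for CachedDb {
    fn drop(&mut self) {
        // Best-effort cleanup: the cached file disappears with the guard.
        let _ = fs::remove_file(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    let path;
    {
        let cache = CachedDb::download_to(&dir)?;
        path = cache.path.clone();
        assert!(path.exists()); // cache is live while the guard is alive
    } // guard dropped here
    assert!(!path.exists()); // file removed when the guard left scope
    println!("ok");
    Ok(())
}
```

The main hazard with this pattern is holding the guard for too short a time: if the guard drops while a query is still reading the file, the cache vanishes out from under it, which is presumably what the CI run is meant to confirm doesn't happen.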