Allow mapping object store paths without going through inventory #1848

Closed
ozkatz opened this issue May 2, 2021 · 0 comments · Fixed by #1864
Labels
area/lakectl: Issues related to lakeFS' command line interface (lakectl)
area/tools: Improvements or additions to tooling and scripting

Comments

Collaborator

ozkatz commented May 2, 2021

Currently there are 2 ways of "importing" data into lakeFS without actually copying it:

  1. Using lakectl fs stage or the equivalent stageObject API endpoint.
  2. Using lakefs import, which utilizes the S3 inventory to read an entire bucket (and potentially load only a subset into lakeFS).

While (1) provides a reasonable solution for importing a single object, importing a directory or common prefix requires scripting that may or may not be trivial for the user. On the other hand, (2) is great for loading a big (>1M objects) bucket into lakeFS, but its operational overhead is substantial because:

  1. It's a lakeFS command, not a lakectl one, so the machine running it needs access to the PostgreSQL database, the underlying storage, and the lakeFS config values, as well as a lakeFS binary.
  2. It requires a functioning inventory (and a good understanding of S3 inventory), configured and accessible to the caller.
  3. It requires S3 - lakeFS doesn't support this with other object stores.

We're missing a middle ground - the ability to ingest data directly from the object store into lakeFS, by using native object listing. This provides a relatively easy way to load a common prefix, a small table or a set of partitions (<1M objects) in a way that is more accessible to a data engineer.
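
As a rough illustration of the idea (not the implementation that eventually shipped), the sketch below maps entries from a native object listing under a common prefix to the lakeFS paths they would be staged at. Everything here is an assumption made for the example: the `ListedObject` and `StageRequest` shapes, the `plan_staging` function, and the bucket and prefix names. The listing itself is passed in as a plain iterable, so no particular object store client is assumed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ListedObject:
    """One entry from a native object store listing (hypothetical shape)."""
    key: str       # full key inside the bucket
    size: int      # size in bytes
    checksum: str  # e.g. the ETag reported by the store


@dataclass(frozen=True)
class StageRequest:
    """What a stageObject-style call would need (hypothetical shape)."""
    lakefs_path: str       # destination path inside the lakeFS branch
    physical_address: str  # where the data already lives in the object store
    size: int
    checksum: str


def plan_staging(
    bucket_uri: str,
    source_prefix: str,
    dest_prefix: str,
    listing: Iterable[ListedObject],
) -> Iterator[StageRequest]:
    """Map every listed object under source_prefix to a staging request.

    No data is copied: each request only points lakeFS at the existing object.
    """
    dest = dest_prefix.strip("/")
    for obj in listing:
        if not obj.key.startswith(source_prefix):
            continue  # outside the prefix being imported
        if obj.key.endswith("/"):
            continue  # skip "directory marker" placeholder keys
        relative = obj.key[len(source_prefix):].lstrip("/")
        if not relative:
            continue
        yield StageRequest(
            lakefs_path=f"{dest}/{relative}" if dest else relative,
            physical_address=f"{bucket_uri.rstrip('/')}/{obj.key}",
            size=obj.size,
            checksum=obj.checksum,
        )


if __name__ == "__main__":
    example_listing = [
        ListedObject("tables/events/dt=1/a.parquet", 10, "abc"),
        ListedObject("tables/events/", 0, "d41d"),
        ListedObject("other/x.csv", 1, "eee"),
    ]
    for req in plan_staging("s3://my-bucket", "tables/events/", "imported/events", example_listing):
        print(req.lakefs_path, "<-", req.physical_address)
```

Each resulting request could then be handed to whatever performs the actual staging, for example one stageObject call per object. Keeping the mapping separate from the listing and staging calls means the same logic would apply to object stores other than S3.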

ozkatz added the area/lakectl and area/tools labels May 2, 2021