Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove DuckDB from lakeFS Docker image #6001

Closed
arielshaqed opened this issue Jun 1, 2023 · 8 comments · Fixed by #6044 or #6141
Closed

Remove DuckDB from lakeFS Docker image #6001

arielshaqed opened this issue Jun 1, 2023 · 8 comments · Fixed by #6044 or #6141

Comments

@arielshaqed
Copy link
Contributor

Why

We recently started maintaining a DuckDB binary as part of a set of lakeFS images lakefs:...-with-duckdb. This makes no sense:

  • Images are the basis for service oriented architectures, especially microservices. Packaging DuckDB and lakeFS in the same image makes no sense.
  • It is unreasonable to expect DuckDB to support users who use some unknown version of DuckDB that they found on a Docker image treeverse/lakefs.
  • We cannot protect our users from DuckDB bugs because it's not our code. Now we need to rebuild for any DuckDB bugs. This includes fixes for security vulnerabilities -- for which we will have to work urgently in order to deliver a late fix.
  • We should ensure compliance with the DuckDB license, and need to keep doing so for every version. (Right now we may have to add a notice to our image.)

I believe the best way our is to remove DuckDB from lakeFS Docker images.

Alternatives

If we decide we do need DuckDB, there are multiple options. A few easy ones:

  • Use WASM DuckDB
  • Use the DuckDB Homebrew formula, Linux users have supported installation options.
  • Ask DuckDB to produce an image, and/or send them a Dockerfile and ask them to build an authorized version.
  • Provide a separate treeverse/duckdb image. This resolves some but not all of the issues.
  • Provide our users with a Dockerfile they can use to build it.
@arielshaqed
Copy link
Contributor Author

@rmoff @ozkatz, can you please point at the requirements?

@ozkatz
Copy link
Collaborator

ozkatz commented Jun 1, 2023

@rmoff might be able to add more context, but the reason we introduced DuckDB into the image in the first place was to support the quickstart: "Making a Change to The Data" flow.

We wanted to ensure 2 things:

  1. Allow users new to lakeFS to modify data in a familiar way (that is, using SQL)
  2. Do this without having to rely on their own infrastructure like Spark/Trino/etc.
  3. Do not exclude users that are less familiar with linux/command line/docker.

Including the DuckDB binary in the image solves for 1,2 and reasonably well for 3.
In a perfect world, we could get DuckDB-WASM to support writes as well. Currently it's not supported at all for HTTPFS and partially supported for S3.

The PR linked above was recently released - it's possible that we can now use DuckDB-WASM to write data, so I believe that's perhaps the most promising path forward.

@rmoff
Copy link
Contributor

rmoff commented Jun 6, 2023

Thanks @arielshaqed for raising this, and @ozkatz for the accurate description of the requirements.

I realise that it's not a perfect solution, but for now, it's a "good enough" one to help address the immediate requirement (let users get a feel for what lakeFS is all about).

As Oz says, the DuckDB-WASM route looks like the best option. It looks like there's been a release since the PR was merged so I'll try it out to see if it solves our requirement.

@arielshaqed
Copy link
Contributor Author

Thanks @arielshaqed for raising this, and @ozkatz for the accurate description of the requirements.

I realise that it's not a perfect solution, but for now, it's a "good enough" one to help address the immediate requirement (let users get a feel for what lakeFS is all about).

As Oz says, the DuckDB-WASM route looks like the best option. It looks like there's been a release since the PR was merged so I'll try it out to see if it solves our requirement.

I understand that it provides a path to run DuckDB with lakeFS that requires the least immediate effort by us. That is the benefit. I opened this issue because of the cost.

The issue is that it requires Treeverse to support a DuckDB build, with all that that entails. Even Motherduck don't do this. Plus we want to do it inside a build for a separate project.

If we are willing to pay the price:

  1. We should understand what this price is.
  2. We should ensure license compliance (I believe the current build is not compliant).
  3. We should avoid Hyrum's law: communicate clearly to our users that they cannot expect DuckDB to remain a part of lakeFS.

@nopcoder
Copy link
Contributor

nopcoder commented Jun 6, 2023

+1

Agree with @arielshaqed on this one - I don't think we should maintain DuckDB as part of our Docker image.

I understand that we need to enable these capabilities for the new user. It is possible that we will need to support these capabilities in the future, as they seem to be an integrated part of lakeFS, rather than a separate tool that works with lakeFS.

@arielshaqed
Copy link
Contributor Author

Reopening as I don't see any {Docker,Make}file changes in #6044. (But please reclose if I am missing something, especially possible today!)

@arielshaqed arielshaqed reopened this Jun 11, 2023
@ozkatz
Copy link
Collaborator

ozkatz commented Jun 11, 2023

Thanks for re-opening @arielshaqed - I guess the mere mention of the issue on the PR was enough for our bot overlords to automatically close this. Indeed, the Dockerfile, Makefile and quickstart docs need to change for this to close. Will try and push for it to happen very soon.

@arielshaqed
Copy link
Contributor Author

Thanks for re-opening @arielshaqed - I guess the mere mention of the issue on the PR was enough for our bot overlords to automatically close this. Indeed, the Dockerfile, Makefile and quickstart docs need to change for this to close. Will try and push for it to happen very soon.

Read your comment carefully 🤯 : according to the docs you specifically asked it to "close #6001" at the bottom of the description of #6044, 🤖 probably not where you thought about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants