Building delta-rs layer that's compatible with aws-sdk-pandas for AWS Lambda #1108
Comments
I tried to get this working but was unable to do so, maybe this will help others: (using x86_64 linux)
Result:
Following response:
I've verified that the layer structure is the same as the aws-sdk-pandas layer (all libs are just in …). This will probably work if I use Docker to build a custom image for the whole Lambda, but that's much heavier. I'm hoping we can find a more user-friendly way to make a thin layer that people can just attach. |
The pyarrow build we need just got merged into aws-sdk-pandas 🥳 🥳 aws/aws-sdk-pandas#1977 Last step now is to create the builds for our layers in this repo. A few questions:
@wjones127, any thoughts? Looking for some consensus before plowing ahead. |
I think we can just create a new top level folder within the repo for this
If this is for aws-sdk-pandas, then we could release it as part of the python release pipeline, it's just a bunch of extra binaries to build and publish right? |
There are at least 2 cases I can think of for python users:
For (1), we can build and publish a For (2), users will still have to build a separate layer with I will submit a PR for something for (1) in the For (2) we can publish the dependencies too, or just include the instructions. I'm in favor of publishing them for the sake of user experience (users who don't typically build packages are likely to get stuck), but understand if we want to keep the releases from this repo pure. I suspect most users will want all the extras that come with (1) in most cases anyway. |
You can just squeeze deltalake and aws-sdk-pandas into a single Lambda function together, but it's very close to breaching the 250 MB limit. Follow the instructions here: https://delta.io/blog/2023-04-06-deltalake-aws-lambda-wrangler-pandas/ This is a great way to have Lambda run lightweight queries on your Delta tables. |
Hey! I'm interested in contributing, and this is labeled as a good first issue- but I'm having trouble figuring out what is left to be done here? Sorry if this is a silly question! |
It barely fit with last year's versions, but today the deltalake 0.18.2 package for Python 3.8+, manylinux2014_x86_64, is 100 MB (almost all of it in _internal.abi3.so). The AWS SDK for Pandas (awswrangler) in its current version 3.9.0 is over 170 MB. So the combination already exceeds the 250 MB limit (and in practice you may also want spare room for the AWS Powertools layer, which provides typing and logging). The simple approach in the mentioned blog post no longer works, so it's back to handcrafting zip archives. I'm wondering whether a smaller deltalake "core" package is possible, one better suited to AWS Lambda, possibly in combination with polars. E.g., the write_deltalake function lets you choose a 'rust' engine instead of 'arrow', so maybe the (less functional) 'arrow' one could be stripped out, which would also mean fewer dependencies. Or maybe some functionality could be stripped out (e.g., does anyone want to run compaction or z-ordering jobs from a Lambda)? |
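To make the size arithmetic above concrete, here is a minimal sketch in Python. The package sizes are the approximate figures quoted in the comment (awswrangler is "over 170 MB", so the real total is higher), and the helper names `dir_size_mb` and `fits_in_lambda` are hypothetical, not part of any library.

```python
import os

# Size limit cited in this thread for a Lambda function plus its layers.
LAMBDA_LIMIT_MB = 250


def dir_size_mb(path):
    """Return the total size of all files under `path`, in MB (1 MB = 1024 * 1024 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 * 1024)


def fits_in_lambda(sizes_mb, limit_mb=LAMBDA_LIMIT_MB):
    """Sum the given package sizes and report whether they fit under the limit."""
    total = sum(sizes_mb.values())
    return total, total <= limit_mb


# Approximate figures quoted in the comment above (assumed, not measured here).
sizes = {"deltalake 0.18.2": 100, "awswrangler 3.9.0": 170}
total, ok = fits_in_lambda(sizes)
print(f"total: {total} MB, fits under {LAMBDA_LIMIT_MB} MB: {ok}")
```

Running `dir_size_mb` on an unpacked layer directory gives a measured figure that can be fed into `fits_in_lambda` in place of the quoted estimates.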
Maybe we could split the Delta Lake functionality into separate packages/ layers? For example, it might make sense to have a separate lambda function for deltalake operations like vacuum, optimize, FS check, etc vs a lambda function that uses operations like create and read. |
To me, that sounds like a good idea. But it probably requires some effort, because now everything is in one big shared library. For reference, in my use case I do not need most of what AWS SDK for Pandas / awswrangler provides, and can even do without pandas. The minimum for me then is deltalake, which requires pyarrow (that also gives me a workable table data structure), which requires numpy. But simply installing those still gives a very large layer. So instead, I:
This gives me a layer that fits within the size limit (even when used in combination with the AWS Powertools layer). |
Thanks for that example! Is the delta-rs layer just a binary that users can build using cargo in this case? Maybe we can use Rust features to let users pick what goes into the layer. |
The layer itself probably not:
So, in the context of AWS Lambda functions implemented in Python, a layer usually contains a set of python modules, bundled together in a zip file. You probably won't directly create such a layer from cargo. However, the Python deltalake module is built using Cargo: https://github.com/delta-io/delta-rs/tree/main/python |
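As a rough illustration of "a set of Python modules bundled together in a zip file", the sketch below packs an already-installed package directory into a layer archive. It assumes the usual convention that Python layer contents live under a top-level `python/` directory in the zip; the function name `build_layer_zip` and the example paths are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def build_layer_zip(site_packages, out_zip):
    """Zip every file under `site_packages` beneath a top-level `python/` prefix.

    `out_zip` should live outside `site_packages` so the archive does not
    try to include itself. Returns the archive member names that were written.
    """
    site_packages = Path(site_packages)
    names = []
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(site_packages.rglob("*")):
            if path.is_file():
                arcname = (Path("python") / path.relative_to(site_packages)).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names


if __name__ == "__main__":
    # Tiny stand-in for a real install directory, just to show the layout.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "site-packages"
        (src / "deltalake").mkdir(parents=True)
        (src / "deltalake" / "__init__.py").write_text("")
        print(build_layer_zip(src, Path(tmp) / "layer.zip"))
```

The compiled extension built by Cargo ends up as one of the files inside the package directory, so the zip step stays the same regardless of how the module was built.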
delta-rs was included as an optional dependency in aws-sdk-pandas, but that means it's still not included in the pre-built layer, so it's still hard to use delta-rs in AWS Lambda functions.
aws-sdk-pandas is a popular project because it includes pre-built layers in the releases, see here for an example. Building a Python layer is hard. You can't just build a delta-rs layer on a Mac and then upload it to AWS Lambda. Your layer needs to be built with a specific Linux version using Docker.
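For packages that publish prebuilt Linux wheels (a later comment in this thread mentions a manylinux2014_x86_64 wheel for deltalake), one possible alternative to a Docker build is asking pip to fetch wheels for the target platform. The sketch below only assembles the command; it does not run it, and whether the resulting layer actually works on Lambda is not verified here. The helper name `pip_layer_command` and the default Python version are assumptions.

```python
import sys


def pip_layer_command(packages, target_dir="python",
                      platform="manylinux2014_x86_64", python_version="3.11"):
    """Build (but do not run) a pip command that installs Linux wheels into `target_dir`.

    --only-binary=:all: makes pip refuse source distributions, which it could
    not compile for a foreign platform anyway.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--target", target_dir,
        "--platform", platform,
        "--implementation", "cp",
        "--python-version", python_version,
        "--only-binary=:all:",
        *packages,
    ]


cmd = pip_layer_command(["deltalake"])
print(" ".join(cmd))
```

The printed command can be inspected first and then passed to `subprocess.run(cmd, check=True)` once the platform and Python version match the Lambda runtime in use.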
aws-sdk-pandas takes the painful "layer building" step out of AWS Lambda for Python programmers. Python programmers can simply grab the pre-built layer, attach it to their AWS Lambda environment, and immediately create a Lambda function that uses pandas. Without this pre-built layer, many Python programmers simply wouldn't be able to use Lambda. It's just too hard to build Python layers right (using Docker is hard, and even when everything is done correctly, there can be size limit issues).
Multiple layers can be attached to an AWS Lambda function. Hopefully, we can just build a delta-rs AWS Lambda layer that can be attached to an AWS Lambda function alongside aws-sdk-pandas, and everything will just work. Here are the layers that are attached to one aws-sdk-pandas release, for example (the project used to be called awswrangler):
I think we need to put in the legwork here, figure out how to build the release that works, and then write a blog post. Then we can figure out how to create some sort of CI task that automatically builds all the layers when a release is made.