ZSTD compression problematic #155
Linking to AWS SDK for pandas for reference. Running into problems importing geopandas directly as a dependency in the Lambda function, hence hoping to leverage a pre-built Lambda layer. Exploring alternatives...
For context, ZSTD compression was set in #86 (comment) because it results in slightly smaller file sizes and faster read speeds (decompression). Could you please report the version of aws-sdk-pandas that you are using: is it 3.5.2 or an older version? What version of …
What's the size limit for AWS Lambda? The PyArrow library used to read Parquet files is known to be quite big (see the Drawbacks section of https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html, which mentions PyArrow requiring 120 MB and explicitly calls this out as a potential issue for AWS Lambda). The situation won't improve in the longer term either, especially for newer versions of Pandas (v2.2+), so you might need to look at non-Lambda options if sticking with Pandas + PyArrow. Taking a step back though, what are you actually trying to do with AWS Lambda? Are you trying to ingest the GeoParquet files into some database?
@weiji14 yes, correct. That is the architecture that was proposed here. It's unfortunate that a library for simply reading a file format should be so large (it seems unnecessarily so), though I can appreciate the desire to work with libraries and formats commonly used for data science in interactive notebooks, etc. I am using the latest version of the AWS SDK for pandas, 3.5.2, via the …
Note: I am able to open Parquet files with that version of the library so long as the files have been encoded with supported compression types. To verify which compression algorithms are supported, I created a handful of test cases here: s3://clay-vector-embeddings/test-cases/compression/ (a sketch of how such files can be generated is below). I took one of the … (upon a successful read it renders the head of the dataframe as HTML for demo purposes). @weiji14, can you tell me more, or point me to more info, on the binary encoding format the model is using for the geometry field? Not sure how to decode that at the moment. Thanks!
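For reference, a minimal sketch of how such test cases could be generated with pandas (assuming s3fs is installed for the s3:// paths; the dataframe contents and file names are hypothetical, only the S3 prefix comes from the comment above):

```python
import pandas as pd

# Tiny placeholder dataframe; the real test cases would hold Clay embeddings.
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Write one file per codec, so each can be read back with aws-sdk-pandas
# to check which codecs its bundled pyarrow build supports.
for codec in ["snappy", "gzip", "brotli", "lz4", "zstd"]:
    df.to_parquet(
        f"s3://clay-vector-embeddings/test-cases/compression/test_{codec}.parquet",
        compression=codec,
    )
```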
Note that PyArrow is not the only library implementation that can read Parquet files; there are others as well 😉
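For example, pandas can read the same file through the fastparquet engine instead of pyarrow (a sketch, assuming fastparquet is installed; its compression codecs, including ZSTD, come from its cramjam dependency, and the file name is a placeholder):

```python
import pandas as pd

# fastparquet stands in for pyarrow as the Parquet reader here.
df = pd.read_parquet("embeddings.parquet", engine="fastparquet")
print(df.head())
```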
The geometry is stored in Well-Known Binary (WKB) format, as per the GeoParquet specification: https://geoparquet.org/releases/v1.0.0/. Examples of readers:
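For instance, a minimal sketch of decoding the WKB geometry column with shapely alone, without pulling in geopandas (the file name is a placeholder; per the spec the primary geometry column is typically named "geometry"):

```python
import pandas as pd
from shapely import wkb

df = pd.read_parquet("embeddings.parquet")

# Each cell of the geometry column holds raw WKB bytes; shapely parses
# them into geometry objects.
df["geometry"] = df["geometry"].apply(wkb.loads)
print(df["geometry"].head())
```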
Let me know if you need help understanding the GeoParquet schema metadata parser; we can set up a meeting to have a chat.
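As a starting point, the GeoParquet metadata is just a JSON string stored under the "geo" key of the Parquet schema metadata, so a parser can be sketched like this (file name is a placeholder):

```python
import json
import pyarrow.parquet as pq

# GeoParquet stores its metadata as JSON under the b"geo" key of the
# Parquet schema metadata.
schema = pq.read_schema("embeddings.parquet")
geo = json.loads(schema.metadata[b"geo"])

print(geo["version"], geo["primary_column"])
print(geo["columns"]["geometry"]["encoding"])  # "WKB" per the v1.0.0 spec
```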
Yeah, I have been looking at other implementations as well. GeoPandas itself is too large to import directly or via a Lambda layer. I'll check out that Rust-based implementation with the Python bindings, thanks.
Closing as out of date; feel free to re-open if appropriate.
When exporting embeddings to Parquet files we currently use ZSTD compression:
model/src/model_clay.py, line 1002 (commit 3f4210f)
However, ZSTD compression is not widely supported. Specifically, the version of pyarrow packaged with the AWS SDK for Pandas is not built with support for ZSTD, which raises a codec error when reading such files.
This is proving problematic for the Clay vector service at read time.
The GeoPandas docs for GeoDataFrame.to_parquet state that the following compression algorithms are available: 'snappy', 'gzip', 'brotli', or None.

@yellowcap - I'd like to request that we switch to a more widely supported compression, such as the default snappy or gzip, in order to facilitate reading/working with Clay embeddings for a wider audience, such as those using the AWS SDK for Pandas like the Clay vector service 🤓 (a sketch of the change is below).

Please let me know if you have any questions. Thanks!
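Concretely, the requested change would be something like the following (a sketch; the actual variable names and file path at line 1002 of model_clay.py are assumptions, illustrated here with a hypothetical GeoDataFrame):

```python
import geopandas as gpd

# `gdf` stands in for the GeoDataFrame of embeddings built in model_clay.py;
# these contents are placeholders for illustration only.
gdf = gpd.GeoDataFrame({"idx": [0]}, geometry=gpd.points_from_xy([0.0], [0.0]))

# Before: gdf.to_parquet("embeddings.gpq", compression="ZSTD")
# After: snappy (GeoPandas' default), which the pyarrow build shipped
# with aws-sdk-pandas can decode.
gdf.to_parquet("embeddings.gpq", compression="snappy")
```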