Feature proposal: Add Parquet as a Batch encoding option #1148
Comments
Hi, @jamielxcarter - thanks for logging this. Just as a heads up, we currently have It would be great if there were other smaller/lighter dependencies for writing Parquet that are also very stable+fast, but wanted to note this 👆 as an FYI if going the |
Thanks @aaronsteers. If it weren't already a dependency I think I would have used |
@jamielxcarter - The It sounds like |
Hey @aaronsteers, I think the only issue with |
What do you think about adding support for arbitrary batch file formats, not only JSONL or Parquet? The use case, for example, is when I only want to copy files from a tap to some staging area (CSV/Parquet files from an external S3 bucket to mine). This is currently not possible. My tap fails if I specify CSV as the encoding in the config. sdk/singer_sdk/helpers/_batch.py Line 214 in ad53dfa |
Did you have a specific implementation in mind? We don't have any specific preference to limit what formats should be supported, but for the dev effort, each file format generally requires a non-trivial amount of effort to generate and/or configure. For example, Parquet is easy to configure but non-trivial to implement. CSV is trivial to implement but non-trivial to configure. |
Nothing in mind. I think I misunderstood exactly how the SDK works. It does a lot of work for you, loading and unpacking files from S3 or local storage. For now, I have modified a non-SDK tap/target to work with the batch format I prefer (CSV, not compressed). The idea for the SDK could be the ability to define/extend formats and provide your own methods to load/unpack data. After looking into the source code, I think overwriting the Lines 465 to 470 in 405ee68
I can modify sdk/singer_sdk/helpers/_batch.py Line 59 in 405ee68
File "/Users/vvvv/.local/pipx/venvs/tap-batch/lib/python3.9/site-packages/singer_sdk/helpers/_batch.py", line 59, in from_dict
encoding_cls = cls.registered_encodings[encoding_format]
KeyError: 'csv'
So a possible solution might be to relax those config settings and let people use any value. |
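For readers following along, the lookup failure above and the "relax it" idea can be sketched roughly as follows. This is a simplified stand-in, not the SDK's code: the `Encoding` class, its `registry` attribute, the `register` decorator, and the `strict` flag are all hypothetical names chosen for illustration. Only the pattern (a dict of registered formats indexed by the `format` key of the config) mirrors the traceback.

```python
class Encoding:
    """Hypothetical base encoding holding a registry of known formats."""

    registry = {}

    def __init__(self, format, **options):
        self.format = format
        self.options = options

    @classmethod
    def register(cls, name):
        """Class decorator that adds an encoding class under the given format name."""

        def decorator(encoding_cls):
            cls.registry[name] = encoding_cls
            return encoding_cls

        return decorator

    @classmethod
    def from_dict(cls, data, strict=True):
        """Build an encoding from a config dict such as {"format": "jsonl"}."""
        config = dict(data)  # copy so the caller's dict is not mutated
        name = config.pop("format")
        try:
            encoding_cls = cls.registry[name]
        except KeyError:
            if strict:
                raise  # same failure mode as the traceback: KeyError: 'csv'
            # Relaxed mode: keep the unknown format and its options as-is.
            encoding_cls = Encoding
        return encoding_cls(name, **config)


@Encoding.register("jsonl")
class JSONLinesEncoding(Encoding):
    """Placeholder for a registered format."""


jsonl = Encoding.from_dict({"format": "jsonl", "compression": "gzip"})
generic = Encoding.from_dict({"format": "csv", "delimiter": ";"}, strict=False)
```

In relaxed mode the unknown format passes through untouched, so the tap or target author would still have to supply the code that actually reads or writes those files. Registering a custom class via the decorator is the stricter alternative.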
@VMois - So happy to hear you've been able to build an implementation here! If you have time, I'd love to hear your thoughts on our (very!) new spec proposal for CSV-backed batch message operations: And specifically this latest comment with a draft proposal: |
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the |
Still relevant |
I see a PR was opened for this item. Once it is released, I'd like to add batch JSON/Parquet support to my Target for Clickhouse :) |
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
Currently the only Batch encoding option is JSONL. This feature will add support for writing batch files as Parquet.