
Feature proposal: Add Parquet as a Batch encoding option #1148

Closed
jamielxcarter opened this issue Nov 7, 2022 · 11 comments · Fixed by #2044
Labels
kind/Feature New feature or request valuestream/SDK

Comments

@jamielxcarter
Contributor

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

Currently, the only batch encoding option is JSONL. This feature will add support for writing batch files as Parquet.
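For context, batch settings in the SDK config look roughly like the following; `jsonl` reflects the current support, and this proposal would add `parquet` as another accepted `format` value. (Field shape based on the documented `batch_config` setting; treat details as an assumption.)

```json
{
  "batch_config": {
    "encoding": {
      "format": "jsonl",
      "compression": "gzip"
    },
    "storage": {
      "root": "file:///tmp/batches",
      "prefix": "batch-"
    }
  }
}
```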

@jamielxcarter jamielxcarter added kind/Feature New feature or request valuestream/SDK labels Nov 7, 2022
@aaronsteers
Contributor

aaronsteers commented Nov 7, 2022

Hi, @jamielxcarter - thanks for logging this.

Just as a heads up: we currently have pyarrow as a dev dependency in the SDK, but this proposal would likely require adding it as an extra or as a core dependency. We previously added special conditions for pyarrow inclusion because it did not yet support Python 3.11, although that now appears to be resolved:

It would be great if there were other smaller/lighter dependencies for writing Parquet that are also very stable and fast, but I wanted to note this 👆 as an FYI if going the pyarrow route.

@jamielxcarter
Contributor Author

Thanks @aaronsteers. If it weren't already a dependency, I think I would have used fastparquet, since it has a significantly smaller footprint: fastparquet is 3.9 MB vs. pyarrow's 87 MB. I can add pyarrow as an extra just to be safe, if that's what you prefer.

@aaronsteers
Contributor

aaronsteers commented Nov 8, 2022

@jamielxcarter - fastparquet actually sounds like a better choice, IMHO. I haven't used it myself but I've heard quite a lot about it and the smaller footprint is very desirable.

The pyarrow dependency can be removed or refactored. To clarify my point above: as of now, pyarrow is only used for one of the samples (tap-parquet, a dev/test sample for contributors) and isn't leveraged anywhere in the main codebase. As such, it isn't installed by default in taps and targets that use the SDK.

It sounds like fastparquet may be small enough to include by default in taps and targets, which is the ideal behavior: all taps and targets could then support reading from and writing to Parquet files without additional work from the end user consuming the connector.

@aaronsteers aaronsteers changed the title Feature: Add ParquetEncoding as a Batch encoding option Feature proposal: Add Parquet as a Batch encoding option Nov 9, 2022
@jamielxcarter
Contributor Author

jamielxcarter commented Nov 10, 2022

> @jamielxcarter - fastparquet actually sounds like a better choice, IMHO. I haven't used it myself but I've heard quite a lot about it and the smaller footprint is very desirable.
>
> The pyarrow dependency can be removed or refactored. To clarify my point above: as of now, pyarrow is only used for one of the samples (tap-parquet, a dev/test sample for contributors) and isn't leveraged anywhere in the main codebase. As such, it isn't installed by default in taps and targets that use the SDK.
>
> It sounds like fastparquet may be small enough to include by default in taps and targets, which is the ideal behavior: all taps and targets could then support reading from and writing to Parquet files without additional work from the end user consuming the connector.

Hey @aaronsteers, I think the only issue with fastparquet is that it requires writing via a pandas DataFrame, whereas pyarrow has a method specifically for writing from a list of Python dictionaries. I still lean toward fastparquet, but it does mean importing pandas.DataFrame. Thoughts?

@VMois

VMois commented Apr 4, 2023

What do you think about adding support for arbitrary batch file formats, not only JSONL or Parquet?

The use case, for example, is when I only want to copy files from a tap to some staging area (e.g. CSV/Parquet files from an external S3 bucket to mine). This is currently not possible: my tap fails if I specify a CSV file as the encoding in the config.

self.encoding = BaseBatchFileEncoding.from_dict(self.encoding)

@aaronsteers
Contributor

> What do you think about adding support for arbitrary batch file formats, not only JSONL or Parquet?
>
> The use case, for example, is when I only want to copy files from a tap to some staging area (e.g. CSV/Parquet files from an external S3 bucket to mine). This is currently not possible: my tap fails if I specify a CSV file as the encoding in the config.
>
> self.encoding = BaseBatchFileEncoding.from_dict(self.encoding)

Did you have a specific implementation in mind? We don't have any particular preference that would limit which formats should be supported, but in terms of dev effort, each file format generally requires a non-trivial amount of work to generate and/or configure.

For example, Parquet is easy to configure but non-trivial to implement. CSV is trivial to implement but non-trivial to configure.
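The CSV half of that trade-off can be seen even in the standard library: the writer itself is trivial, but every knob it exposes (delimiter, quoting, line endings, header handling) is something a CSV batch encoding would have to surface as settings. A small illustration:

```python
# CSV is easy to write but carries many configuration choices.
import csv
import io

records = [{"id": 1, "name": "a,b"}, {"id": 2, "name": 'say "hi"'}]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["id", "name"],
    delimiter=",",              # could equally be "\t", ";", "|", ...
    quoting=csv.QUOTE_MINIMAL,  # vs. QUOTE_ALL, QUOTE_NONNUMERIC, ...
    lineterminator="\n",        # vs. "\r\n"
)
writer.writeheader()            # whether to emit a header row: another choice
writer.writerows(records)
print(buf.getvalue())
```

Each of those parameters would need to be agreed on (or made configurable) before two connectors could reliably exchange CSV batch files.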

@VMois

VMois commented Apr 6, 2023

> Did you have a specific implementation in mind?

Nothing in mind. I think I had misunderstood how exactly the SDK works; it does a lot of the work for you to load/unpack files from S3/local storage. For now, I have modified a non-SDK tap/target to work with the batch format I prefer (CSV, not compressed).

The idea for the SDK could be the ability to define/extend formats and provide your own methods to load/unpack data.


After looking into the source code, I think overriding the process_batch_files function will allow me to control how data is extracted:

def process_batch_files(
    self,
    encoding: BaseBatchFileEncoding,
    files: Sequence[str],
) -> None:
    """Process a batch file with the given batch context."""
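The kind of override described here can be sketched as follows. In the SDK the real method lives on the Sink class; a minimal stand-in base class is used below so the example runs on its own, and the plain string `encoding` is a simplification of BaseBatchFileEncoding.

```python
# Self-contained sketch: handle one extra batch format in an override,
# deferring everything else to the base implementation.
import csv
from typing import Sequence


class StubSink:
    """Stand-in for the SDK's Sink, just enough for this sketch."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def process_record(self, record: dict, context: dict) -> None:
        self.records.append(record)

    def process_batch_files(self, encoding: str, files: Sequence[str]) -> None:
        raise ValueError(f"unsupported batch format: {encoding}")


class CsvAwareSink(StubSink):
    def process_batch_files(self, encoding: str, files: Sequence[str]) -> None:
        if encoding == "csv":
            for path in files:
                with open(path, newline="") as fh:
                    for row in csv.DictReader(fh):
                        self.process_record(row, context={})
        else:
            super().process_batch_files(encoding, files)
```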

I can modify get_batches for taps, but the SDK config fails if a non-supported format or storage is provided here:

File "/Users/vvvv/.local/pipx/venvs/tap-batch/lib/python3.9/site-packages/singer_sdk/helpers/_batch.py", line 59, in from_dict
    encoding_cls = cls.registered_encodings[encoding_format]
KeyError: 'csv'

So a possible solution might be to relax those config checks and let people use any value.
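One way to sketch that relaxation, mirroring the shape of the registered_encodings lookup that raises the KeyError above (class names here are illustrative, not the SDK's): keep the registry, but let users register new formats instead of hard-failing on unknown ones.

```python
# Extensible encoding registry: subclasses self-register under a format
# key, and unknown formats produce a clear error listing the known ones.
class BatchEncoding:
    registered_encodings: dict[str, type] = {}

    def __init_subclass__(cls, format: str, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        cls.registered_encodings[format] = cls  # auto-register subclasses

    @classmethod
    def from_format(cls, format: str) -> type:
        try:
            return cls.registered_encodings[format]
        except KeyError:
            raise ValueError(
                f"Unknown batch format {format!r}; "
                f"known formats: {sorted(cls.registered_encodings)}"
            ) from None


class JsonlEncoding(BatchEncoding, format="jsonl"):
    pass


class CsvEncoding(BatchEncoding, format="csv"):
    pass
```

With this shape, adding a format is a subclass definition in user code rather than a change to the SDK's config validation.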

@aaronsteers
Contributor

aaronsteers commented Apr 6, 2023

@VMois - so happy to hear you've been able to build an implementation here! If you have time, I'd love to hear your thoughts on our (very!) new spec proposal for CSV-backed batch message operations:

And specifically this latest comment with a draft proposal:

@stale

stale bot commented Aug 4, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Aug 4, 2023
@edgarrmondragon
Collaborator

Still relevant

@BTheunissen

I see a PR was opened for this item. Once it's released, I'd like to add batch JSONL/Parquet support to my Target for Clickhouse :)
