
Feature proposal: Add Parquet as a Batch encoding option #1148

Closed
jamielxcarter opened this issue Nov 7, 2022 · 11 comments · Fixed by #2044
Labels
kind/Feature New feature or request valuestream/SDK

Comments

@jamielxcarter
Contributor

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

Currently, the only batch encoding option is JSONL. This feature will add support for writing batch files as Parquet.
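For context, batch settings in the SDK config look roughly like the following; `jsonl` reflects the current support, and this proposal would add `parquet` as another accepted `format` value. (Field shape based on the documented `batch_config` setting; treat details as an assumption.)

```json
{
  "batch_config": {
    "encoding": {
      "format": "jsonl",
      "compression": "gzip"
    },
    "storage": {
      "root": "file:///tmp/batches",
      "prefix": "batch-"
    }
  }
}
```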

@jamielxcarter jamielxcarter added kind/Feature New feature or request valuestream/SDK labels Nov 7, 2022
@aaronsteers
Contributor

aaronsteers commented Nov 7, 2022

Hi, @jamielxcarter - thanks for logging this.

Just as a heads up: we currently have pyarrow as a dev dependency in the SDK, but this proposal would likely require adding it as an extra or as a core dependency. We previously added special conditions for pyarrow inclusion because it did not yet support Python 3.11, although that now appears to be resolved:

It would be great if there were other smaller/lighter dependencies for writing Parquet that are also very stable and fast, but I wanted to note this 👆 as an FYI if going the pyarrow route.

@jamielxcarter
Contributor Author

Thanks @aaronsteers. If it weren't already a dependency, I think I would have used fastparquet, since it has a significantly smaller footprint: fastparquet is 3.9 MB vs. pyarrow's 87 MB. I can add pyarrow as an extra just to be safe, if that's what you prefer.

@aaronsteers
Contributor

aaronsteers commented Nov 8, 2022

@jamielxcarter - fastparquet actually sounds like a better choice, IMHO. I haven't used it myself but I've heard quite a lot about it and the smaller footprint is very desirable.

The pyarrow dependency can be removed or refactored. To clarify my point above: as of now, pyarrow is only used for one of the samples (tap-parquet, a dev/test sample for contributors) and isn't leveraged anywhere in the main codebase. As such, it isn't installed by default in taps and targets that use the SDK.

It sounds like fastparquet may be small enough to include by default in taps and targets, which is the ideal behavior: all taps and targets could then support reading from and writing to Parquet files without additional work from the end user consuming the connector.

@aaronsteers aaronsteers changed the title Feature: Add ParquetEncoding as a Batch encoding option Feature proposal: Add Parquet as a Batch encoding option Nov 9, 2022
@jamielxcarter
Contributor Author

jamielxcarter commented Nov 10, 2022

> @jamielxcarter - fastparquet actually sounds like a better choice, IMHO. I haven't used it myself but I've heard quite a lot about it and the smaller footprint is very desirable.
>
> The pyarrow dependency can be removed or refactored. To clarify my point above: as of now, pyarrow is only used for one of the samples (tap-parquet, a dev/test sample for contributors) and isn't leveraged anywhere in the main codebase. As such, it isn't installed by default in taps and targets that use the SDK.
>
> It sounds like fastparquet may be small enough to include by default in taps and targets, which is the ideal behavior: all taps and targets could then support reading from and writing to Parquet files without additional work from the end user consuming the connector.

Hey @aaronsteers, I think the only issue with fastparquet is that it requires writing via a pandas DataFrame, whereas pyarrow has a method specifically for writing from a list of Python dictionaries. I still lean toward fastparquet, but it does mean importing pandas.DataFrame. Thoughts?

@VMois

VMois commented Apr 4, 2023

What do you think about adding support for arbitrary batch file formats, not only JSONL or Parquet?

The use case, for example, is when I only want to copy files from a tap to some staging area (e.g. CSV/Parquet files from an external S3 bucket to mine). This is currently not possible: my tap fails if I specify a CSV file as the encoding in the config.

self.encoding = BaseBatchFileEncoding.from_dict(self.encoding)

@aaronsteers
Contributor

> What do you think about adding support for arbitrary batch file formats, not only JSONL or Parquet?
>
> The use case, for example, is when I only want to copy files from a tap to some staging area (e.g. CSV/Parquet files from an external S3 bucket to mine). This is currently not possible: my tap fails if I specify a CSV file as the encoding in the config.
>
> self.encoding = BaseBatchFileEncoding.from_dict(self.encoding)

Did you have a specific implementation in mind? We don't have any particular preference that would limit which formats should be supported, but in terms of dev effort, each file format generally requires a non-trivial amount of work to generate and/or configure.

For example, Parquet is easy to configure but non-trivial to implement. CSV is trivial to implement but non-trivial to configure.
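The CSV half of that trade-off can be seen even in the standard library: the writer itself is trivial, but every knob it exposes (delimiter, quoting, line endings, header handling) is something a CSV batch encoding would have to surface as settings. A small illustration:

```python
# CSV is easy to write but carries many configuration choices.
import csv
import io

records = [{"id": 1, "name": "a,b"}, {"id": 2, "name": 'say "hi"'}]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["id", "name"],
    delimiter=",",              # could equally be "\t", ";", "|", ...
    quoting=csv.QUOTE_MINIMAL,  # vs. QUOTE_ALL, QUOTE_NONNUMERIC, ...
    lineterminator="\n",        # vs. "\r\n"
)
writer.writeheader()            # whether to emit a header row: another choice
writer.writerows(records)
print(buf.getvalue())
```

Each of those parameters would need to be agreed on (or made configurable) before two connectors could reliably exchange CSV batch files.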

@VMois

VMois commented Apr 6, 2023

> Did you have a specific implementation in mind?

Nothing in mind. I think I had misunderstood how exactly the SDK works; it does a lot of the work for you to load/unpack files from S3/local storage. For now, I have modified a non-SDK tap/target to work with the batch format I prefer (CSV, not compressed).

The idea for the SDK could be the ability to define/extend formats and provide your own methods to load/unpack data.


After looking into the source code, I think overriding the process_batch_files function will allow me to control how data is extracted:

def process_batch_files(
    self,
    encoding: BaseBatchFileEncoding,
    files: Sequence[str],
) -> None:
    """Process a batch file with the given batch context."""
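The kind of override described here can be sketched as follows. In the SDK the real method lives on the Sink class; a minimal stand-in base class is used below so the example runs on its own, and the plain string `encoding` is a simplification of BaseBatchFileEncoding.

```python
# Self-contained sketch: handle one extra batch format in an override,
# deferring everything else to the base implementation.
import csv
from typing import Sequence


class StubSink:
    """Stand-in for the SDK's Sink, just enough for this sketch."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def process_record(self, record: dict, context: dict) -> None:
        self.records.append(record)

    def process_batch_files(self, encoding: str, files: Sequence[str]) -> None:
        raise ValueError(f"unsupported batch format: {encoding}")


class CsvAwareSink(StubSink):
    def process_batch_files(self, encoding: str, files: Sequence[str]) -> None:
        if encoding == "csv":
            for path in files:
                with open(path, newline="") as fh:
                    for row in csv.DictReader(fh):
                        self.process_record(row, context={})
        else:
            super().process_batch_files(encoding, files)
```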

I can modify get_batches for taps, but the SDK config fails if a non-supported format or storage is provided here:

File "/Users/vvvv/.local/pipx/venvs/tap-batch/lib/python3.9/site-packages/singer_sdk/helpers/_batch.py", line 59, in from_dict
    encoding_cls = cls.registered_encodings[encoding_format]
KeyError: 'csv'

So a possible solution might be to relax those config checks and let people use any value.
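One way to sketch that relaxation, mirroring the shape of the registered_encodings lookup that raises the KeyError above (class names here are illustrative, not the SDK's): keep the registry, but let users register new formats instead of hard-failing on unknown ones.

```python
# Extensible encoding registry: subclasses self-register under a format
# key, and unknown formats produce a clear error listing the known ones.
class BatchEncoding:
    registered_encodings: dict[str, type] = {}

    def __init_subclass__(cls, format: str, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        cls.registered_encodings[format] = cls  # auto-register subclasses

    @classmethod
    def from_format(cls, format: str) -> type:
        try:
            return cls.registered_encodings[format]
        except KeyError:
            raise ValueError(
                f"Unknown batch format {format!r}; "
                f"known formats: {sorted(cls.registered_encodings)}"
            ) from None


class JsonlEncoding(BatchEncoding, format="jsonl"):
    pass


class CsvEncoding(BatchEncoding, format="csv"):
    pass
```

With this shape, adding a format is a subclass definition in user code rather than a change to the SDK's config validation.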

@aaronsteers
Contributor

aaronsteers commented Apr 6, 2023

@VMois - so happy to hear you've been able to build an implementation here! If you have time, I'd love to hear your thoughts on our (very!) new spec proposal for CSV-backed batch message operations:

And specifically this latest comment with a draft proposal:

@stale

stale bot commented Aug 4, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Aug 4, 2023
@edgarrmondragon
Collaborator

Still relevant

@BTheunissen

I see a PR was opened for this item. Once it's released, I'd like to add batch JSONL/Parquet support to my Target for Clickhouse :)
