
Intermediate Parquet Data Store #105

Closed
aaronsteers opened this issue Nov 11, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@aaronsteers

aaronsteers commented Nov 11, 2020

Is your feature request related to a problem? Please describe.

Somewhat related to #77. What I found in implementing that PR was that Snowflake does not care at all about column names during CSV loads, which makes it very difficult to implement a data lake backed by S3 CSVs. By default this target stores CSVs with no column headers, which becomes problematic over time: ordinal references are no longer valid once columns are added or removed.
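For illustration, here is a minimal Python sketch of that failure mode, with made-up column names and batch contents; it shows how a purely positional read of a headerless CSV silently misassigns values once the column layout drifts:

```python
import csv
import io

# Hypothetical headerless CSV batches for the same stream, written months apart.
old_batch = "1,Acme\n2,Initech\n"        # columns at the time: id, name
new_batch = "1,sales@acme.io,Acme\n"     # columns now: id, email, name

# A loader that maps values by ordinal position (as Snowflake's CSV COPY does)
# keeps using the old layout, because nothing in the file records which layout
# was in effect when it was written.
for row in csv.reader(io.StringIO(new_batch)):
    record = {"id": row[0], "name": row[1]}   # still assumes the old two-column layout
    print(record)  # {'id': '1', 'name': 'sales@acme.io'} -- wrong values, no error
```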

Describe the solution you'd like

As far as I can tell, the only/best solution to this problem is to use Parquet, which natively tracks column names and data types, for the interim S3 data store. If we substituted Parquet as the intermediate data store and loaded from Parquet files instead of CSV, we would benefit from explicit metadata recording the historic column names and data types, and from knowing that Snowflake will respect those column descriptors during load.
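As a rough sketch of the idea (not this target's actual implementation, and assuming pyarrow is available; file names and columns are made up), the example below writes a batch to Parquet and reads its schema back later, independent of how the stream's columns may have changed since:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write one batch of a hypothetical opportunities stream to Parquet.
batch = pa.table({
    "id": pa.array([1, 2], type=pa.int64()),
    "name": pa.array(["Acme", "Initech"]),
})
pq.write_table(batch, "opportunities-2020-11-11.parquet")

# Months later, the column names and types for that exact batch can be
# recovered from the file itself, regardless of the stream's current schema.
print(pq.read_schema("opportunities-2020-11-11.parquet"))
# id: int64
# name: string
```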

Describe alternatives you've considered

I considered adding a first row of column names to the existing CSV process. However, there are two problems with this approach:

  1. There's no backwards-compatible, stateless means (i.e. no "safe way") to implement this, since it must be configured up front at the stage layer, and all we are able to configure is "num rows to ignore". Ignoring 1 row when the target is not configured to write CSV headers will silently drop one row from each batch, which would be pretty disastrous and almost impossible to detect/diagnose (see the sketch after this list).
  
  2. Snowflake completely ignores the column names anyway.
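
The sketch below illustrates the failure mode in point 1 with a hypothetical `skip_rows` parameter standing in for the stage-level "num rows to ignore" option; if a headerless file is loaded with one row skipped, the first record disappears with no error:

```python
import csv
import io

def load(csv_text, skip_rows):
    """Toy positional loader: skips `skip_rows` leading rows, returns the rest."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[skip_rows:]

headerless_batch = "1,Acme\n2,Initech\n"
print(load(headerless_batch, skip_rows=1))
# [['2', 'Initech']] -- the first data row is silently dropped
```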

Additional context

@aaronsteers added the enhancement (New feature or request) label Nov 11, 2020
@aaronsteers changed the title from "Parquet" to "Intermediate Parquet Data Store" Nov 11, 2020
@koszti
Contributor

koszti commented Nov 17, 2020

Based on what you say, to add the extra data lake functionality on S3 with this target-snowflake connector we would need a few additional options.

What do you think: would it still be useful to merge #77 so we can continue working on adding an optional Parquet file format?
Or do you think that having a dedicated target connector with a Parquet file format is a better option?

If people want to build a data lake with Parquet files, then using this connector might be misleading, since target-snowflake would basically be doing two things: 1) loading data into a Snowflake database, and 2) building a data lake on S3.

This sounds great and efficient if people are using Snowflake anyway and want to keep data in two places, but it wouldn't let people build a data lake on S3 without a Snowflake account.

I can see pros and cons. What's your opinion on how we should proceed?

@aaronsteers
Author

@koszti - We are using the #77 fork actively in my org now, but what we've found is that without column headers on the CSVs, there is only nominal historical value. My own take is that #77 is probably not worth merging until we also have Parquet as a data store. To my point above, even if we did have CSV column headers, Snowflake would ignore them, and they would not solve the problem of loading historical data files without a lot of manual effort and debugging.

Here's a real-world use case which affects my current org:

  1. Our Snowflake data load is ingesting the Salesforce opportunities dataset, and that table gets merge/upsert behavior on the primary key by default.
  2. In the near future, we know we want to construct a more detailed type 2 slowly changing dimension from the historic raw data (i.e. tracking changes for fields for which Salesforce does not keep its own history).
  3. If we have the raw data archived in S3, we would be able to go back later, re-ingest historic snapshots from S3, and add tracking for those additional fields retroactively.

The challenge with implementing this solely with #77 is that the Salesforce dataset has been dynamically adding, removing, and renaming columns over time, multiple times per month in some cases. Since Snowflake loads CSV data by ordinal position, we have no reliable way to reload that data unless we can somehow reconstruct the column list for each data file. That list is probably in the logs somewhere, but it's not readily accessible.

As an answer to the above challenges, Parquet support would natively retain strong column name->value associations in the stored data and would not be affected by added/dropped columns or changing ordinal positions.
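As a hedged sketch of that benefit (assuming pyarrow; the file names are illustrative), the historic column list for each archived batch can be read straight from the Parquet files, so re-ingesting old snapshots does not depend on reconstructing layouts from logs:

```python
import pyarrow.parquet as pq

# Hypothetical archived batches of the Salesforce opportunities stream.
archived_files = [
    "opportunities-2020-01.parquet",
    "opportunities-2020-06.parquet",
]

for path in archived_files:
    schema = pq.read_schema(path)      # column names/types as written at the time
    print(path, "->", schema.names)

# Added, dropped, or renamed columns show up per file, so historic snapshots
# can be re-ingested (e.g. for a type 2 dimension) without guessing layouts.
```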

To your point on data lake target support, in the future this could also be spun off into a new standalone plugin like target-s3-parquet or target-s3-data-lake. There is probably less value in Parquet as a standalone target, although there has been some interest in one in the Slack channel. As a standalone Parquet target, it would probably only be suited to Spark environments, or perhaps to sourcing external table functionality in Snowflake and similar platforms that support external tables.

@aaronsteers
Author

I think this one (parquet support) was shipped a little while ago. 🎉 Closing as resolved.
