
Intermediate Parquet Data Store #105

Closed
aaronsteers opened this issue Nov 11, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@aaronsteers

aaronsteers commented Nov 11, 2020

Is your feature request related to a problem? Please describe.

Somewhat related to #77. What I found in implementing that PR was that Snowflake does not care at all about column names during CSV loads, which makes it very difficult to implement a data lake backed by S3 CSVs. By default this target stores CSVs with no column headers, which becomes problematic over time: ordinal references are no longer valid once columns are added or removed.
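For illustration, here is a minimal Python sketch of that failure mode, with made-up column names and batch contents; it shows how a purely positional read of a headerless CSV silently misassigns values once the column layout drifts:

```python
import csv
import io

# Hypothetical headerless CSV batches for the same stream, written months apart.
old_batch = "1,Acme\n2,Initech\n"        # columns at the time: id, name
new_batch = "1,sales@acme.io,Acme\n"     # columns now: id, email, name

# A loader that maps values by ordinal position (as Snowflake's CSV COPY does)
# keeps using the old layout, because nothing in the file records which layout
# was in effect when it was written.
for row in csv.reader(io.StringIO(new_batch)):
    record = {"id": row[0], "name": row[1]}   # still assumes the old two-column layout
    print(record)  # {'id': '1', 'name': 'sales@acme.io'} -- wrong values, no error
```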

Describe the solution you'd like

As far as I can tell, the only/best solution to this problem is to use Parquet, which natively tracks column names and data types, for the interim S3 data store. If we substituted Parquet as the intermediate data store and loaded from Parquet files instead of CSV, we would benefit from explicit metadata recording the historic column names and data types, and from knowing that Snowflake will respect those column descriptors during load.
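As a rough sketch of the idea (not this target's actual implementation, and assuming pyarrow is available; file names and columns are made up), the example below writes a batch to Parquet and reads its schema back later, independent of how the stream's columns may have changed since:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write one batch of a hypothetical opportunities stream to Parquet.
batch = pa.table({
    "id": pa.array([1, 2], type=pa.int64()),
    "name": pa.array(["Acme", "Initech"]),
})
pq.write_table(batch, "opportunities-2020-11-11.parquet")

# Months later, the column names and types for that exact batch can be
# recovered from the file itself, regardless of the stream's current schema.
print(pq.read_schema("opportunities-2020-11-11.parquet"))
# id: int64
# name: string
```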

Describe alternatives you've considered

I considered adding a first row of column names to the existing CSV process. However, there are two problems with this approach:

  1. There's no backwards-compatible, stateless means (i.e. no "safe way") to implement this, since it must be configured up front at the stage layer, and all we are able to configure is "num rows to ignore". Ignoring 1 row when the target is not configured to write CSV headers will silently drop one row from each batch, which would be pretty disastrous and almost impossible to detect/diagnose (see the sketch after this list).
  
  2. Snowflake completely ignores the column names anyway.
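
The sketch below illustrates the failure mode in point 1 with a hypothetical `skip_rows` parameter standing in for the stage-level "num rows to ignore" option; if a headerless file is loaded with one row skipped, the first record disappears with no error:

```python
import csv
import io

def load(csv_text, skip_rows):
    """Toy positional loader: skips `skip_rows` leading rows, returns the rest."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[skip_rows:]

headerless_batch = "1,Acme\n2,Initech\n"
print(load(headerless_batch, skip_rows=1))
# [['2', 'Initech']] -- the first data row is silently dropped
```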

Additional context

@aaronsteers added the enhancement (New feature or request) label Nov 11, 2020
@aaronsteers changed the title from "Parquet" to "Intermediate Parquet Data Store" Nov 11, 2020
@koszti
Contributor

koszti commented Nov 17, 2020

Based on what you say, to add the extra data lake functionality on S3 with this target-snowflake connector we would need a few additional options.

What do you think: would it still be useful to merge #77 so we can continue working on adding an optional Parquet file format?
Or do you think that having a dedicated target connector with a Parquet file format is a better option?

If people want to build a data lake with Parquet files, then using this connector might be misleading, since target-snowflake would basically be doing two things: 1) loading data into a Snowflake database, and 2) building a data lake on S3.

This sounds great and efficient if people are using Snowflake anyway and want to keep data in two places, but it wouldn't let people build a data lake on S3 without a Snowflake account.

I can see pros and cons. What's your opinion on how we should proceed?

@aaronsteers
Author

@koszti - We are using the #77 fork actively in my org now, but what we've found is that without column headers on the CSVs, there is only nominal historical value. My own take is that #77 is probably not worth merging until we also have Parquet as a data store. To my point above, even if we did have CSV column headers, Snowflake would ignore them, and they would not solve the problem of loading historical data files without a lot of manual effort and debugging.

Here's a real-world use case which affects my current org:

  1. Our Snowflake data load is ingesting the Salesforce opportunities dataset, and that table gets merge/upsert behavior on the primary key by default.
  2. In the near future, we know we want to construct a more detailed type 2 slowly changing dimension from the historic raw data (i.e. tracking changes for fields for which Salesforce does not keep its own history).
  3. If we have the raw data archived in S3, we would be able to go back later, re-ingest historic snapshots from S3, and add tracking for those additional fields retroactively.

The challenge with implementing this solely with #77 is that the Salesforce dataset has been dynamically adding, removing, and renaming columns over time, multiple times per month in some cases. Since Snowflake loads CSV data by ordinal position, we have no reliable way to reload that data unless we can somehow reconstruct the column list for each data file. That list is probably in the logs somewhere, but it's not readily accessible.

As an answer to the above challenges, Parquet support would natively retain strong column name->value associations in the stored data and would not be affected by added/dropped columns or changing ordinal positions.
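As a hedged sketch of that benefit (assuming pyarrow; the file names are illustrative), the historic column list for each archived batch can be read straight from the Parquet files, so re-ingesting old snapshots does not depend on reconstructing layouts from logs:

```python
import pyarrow.parquet as pq

# Hypothetical archived batches of the Salesforce opportunities stream.
archived_files = [
    "opportunities-2020-01.parquet",
    "opportunities-2020-06.parquet",
]

for path in archived_files:
    schema = pq.read_schema(path)      # column names/types as written at the time
    print(path, "->", schema.names)

# Added, dropped, or renamed columns show up per file, so historic snapshots
# can be re-ingested (e.g. for a type 2 dimension) without guessing layouts.
```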

To your point on data lake target support, in the future this could also be spun off into a new standalone plugin like target-s3-parquet or target-s3-data-lake. There is probably less value in Parquet as a standalone target, although there has been some interest in one in the Slack channel. As a standalone Parquet target, it would probably only be suited to Spark environments, or perhaps to sourcing external table functionality in Snowflake and similar platforms that support external tables.

@aaronsteers
Author

I think this one (parquet support) was shipped a little while ago. 🎉 Closing as resolved.
