Intermediate Parquet Data Store #105
Comments
Based on what you say, to add the extra data lake functionality on S3 via this target-snowflake connector we would need the following options:
What do you think - would it still be useful to merge #77 so we can continue working on adding an optional parquet file format? If people want to build a data lake with parquet files, then using this connector might be misleading, since target-snowflake would basically do two things: 1) loading data into a Snowflake database, and 2) building a data lake on S3. This sounds great and efficient for people who use Snowflake anyway and want to keep their data in two places, but it wouldn't let people build a data lake on S3 without a Snowflake account. I can see pros and cons - what's your opinion, how should we proceed?
@koszti - We are using the #77 fork actively for my org now, but what we've found is that without column headers on the CSVs, there is only nominal historical value. My own take is that #77 is probably not worth merging until we also have parquet as a data store. To my point above, even if we did have CSV column headers, Snowflake would ignore them and they would not prevent problems of trying to load historical data files without a lot of manual effort and debugging. Here's a real-world use case which affects my current org:
The challenge with implementing this solely with #77 is that we know the Salesforce dataset has been dynamically adding, removing, and renaming columns over time - multiple times per month in some cases. Since Snowflake loads CSV data based upon ordinal position, we do not have a reliable way to reload that data until we can somehow reconstruct the column list for each data file. This is probably in the logs somewhere, but it's not readily accessible. As an answer to the above challenges, parquet support would retain strong column-name-to-value associations natively in the stored data, and would not be subject to added/dropped columns or changing column ordinal positions. To your point on data lake target support, this could in the future be spun off into a new standalone plugin.
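To illustrate the difference, here's a minimal sketch (using pyarrow; the file and column names are made up) showing that a parquet file carries its own schema, so a historical file can always be interpreted by name rather than by position:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A batch as it looked before a schema change (column names are hypothetical).
old_batch = pa.table({
    "id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
})
pq.write_table(old_batch, "batch_2020_01.parquet")

# Months later, the schema has drifted, but each file still carries
# its own column names and types, so no external log is needed.
schema = pq.read_schema("batch_2020_01.parquet")
print(schema)  # id: int64, email: string
```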
I think this one (parquet support) was shipped a little while ago. 🎉 Closing as resolved.
Is your feature request related to a problem? Please describe.
Somewhat related to #77. What I found in implementing that PR was that Snowflake does not care at all about column names, which makes it very difficult to implement a data lake backed by S3 CSVs. By default this target stores CSVs with no column headers, which becomes problematic over time, since ordinal references are no longer valid after columns are added or removed (see the sketch below).
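To make the failure mode concrete, here is a small self-contained sketch (the column names are invented for illustration) of what happens when a headerless CSV row is interpreted by ordinal position after a column has been inserted:

```python
import csv
import io

# A headerless row written back when the stream had three columns.
old_row = "42,Jane,jane@example.com\n"

# The column list as it looks today, after "last_name" was inserted.
current_columns = ["id", "first_name", "last_name", "email"]

for values in csv.reader(io.StringIO(old_row)):
    record = dict(zip(current_columns, values))
    print(record)
    # {'id': '42', 'first_name': 'Jane', 'last_name': 'jane@example.com'}
    # "email" is silently dropped and the address lands in the wrong column.
```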
Describe the solution you'd like
As far as I can tell, the only/best solution to this problem is to use parquet, which natively tracks column names and data types, as the interim S3 data store. If we substituted parquet as the intermediate data store and loaded from parquet files instead of CSV, we would benefit from explicit metadata: we would know the historic column names and data types, and we would know that Snowflake respects these column descriptors during load.
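As a rough sketch of what the load step could look like, assuming the standard snowflake-connector-python client and hypothetical stage/table names (MATCH_BY_COLUMN_NAME is an existing Snowflake COPY option for parquet that matches columns by name rather than by position):

```python
import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)

# Loading parquet by column name instead of ordinal position means
# files written before a schema change still load correctly.
copy_sql = """
    COPY INTO my_table
    FROM @my_s3_stage/batch_2020_01.parquet
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""
cur = conn.cursor()
try:
    cur.execute(copy_sql)
finally:
    cur.close()
    conn.close()
```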
Describe alternatives you've considered
I considered adding a first row of column names to the existing CSV process. However, there are two problems with this approach: 1) Snowflake ignores CSV column headers and loads columns by ordinal position, so headers alone would not make loads reliable; and 2) even with headers present, reloading historical files after columns have been added, removed, or renamed would still require a lot of manual effort and debugging.
Additional context
This is related to the `retain_s3_files` and `s3_file_naming_scheme` options from #77, in that it is intended to resolve some of the issues found when attempting to create a better data lake solution backed by this target. The new behavior could be controlled by a config option such as `use_parquet`, which would default to `false`, or `file_type`, which would default to `csv`.
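Neither option exists in the connector today; the following is a hypothetical sketch of how a `file_type` setting with a `csv` default could dispatch between the two writers:

```python
import csv
import pyarrow as pa
import pyarrow.parquet as pq

def write_batch(records, columns, path, config):
    """Write one batch of records, dispatching on the proposed file_type option."""
    file_type = config.get("file_type", "csv")  # default keeps current behavior
    if file_type == "parquet":
        table = pa.table({col: [r.get(col) for r in records] for col in columns})
        pq.write_table(table, path)
    elif file_type == "csv":
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows([r.get(col) for col in columns] for r in records)
    else:
        raise ValueError(f"Unsupported file_type: {file_type}")

# Example usage with the proposed default (csv):
write_batch([{"id": 1, "email": "a@example.com"}], ["id", "email"],
            "batch.csv", config={})
```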