adds retain_s3_files and s3_file_naming_scheme #77
Conversation
@koszti - Let me know if everything looks okay here. I will post back again once I've completed end-to-end testing.
@koszti - Happy to report end-to-end testing on my side was successful. Is there anything further you'd like to see in terms of regression tests or changes?
This looks cool. Can you please also add programmatic unit and integration tests covering the changes, so we can test automatically?
@koszti - Thanks for this guidance. Yes, absolutely. Happy to add the unit tests you describe, and will post back here if I run into any blockers. Thanks again!
Just FYI - I'm using this PR fork in production successfully, but I haven't yet completed the additional tests. I'm still planning to produce those but have been focused recently on a couple of other internal projects.
Logging here that, per conversations in #105, there are limited applications for this PR until parquet is also supported. While the tap does 'work' just fine, the long-term value is limited due to Snowflake's pure reliance on ordinal position when reading CSV files.
Hi, what is the status of this PR? Is there any other option to retain the S3 files uploaded during staging?
Hi, @arnisd. As I explained in #105, there are some challenges with creating long-term CSV storage in S3. Specifically, in our case, we don't have a fully "stable" schema and we intentionally adapt and incorporate new columns as they are added. This creates a problem, however, since Snowflake does not use column headers at all, and from day to day each file may have a different column list or a different column ordering. (You only have an option to "skip" one or more header rows.) The status of this PR is that we probably won't merge as-is. I've been working on Parquet as an intermediate store, and since Parquet knows its own schema and is strongly typed, it would resolve the above issues. That said, I do use this fork in my company's production environment. I just can't heartily recommend it, knowing the gap in schema introspection on historic data files.
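To make the ordinal-position issue concrete, here is a small illustrative Python sketch (not project code, and not how Snowflake is implemented): a loader that maps CSV columns purely by position, as CSV loading with skipped headers effectively does, silently misaligns data when the column order changes between files.

```python
import csv
import io

# Two CSV extracts of the "same" stream, written on different days after a
# schema change reordered the columns (hypothetical data).
day_1 = "id,email,created_at\n1,a@example.com,2021-01-01\n"
day_2 = "id,created_at,email\n1,2021-01-02,b@example.com\n"

# A positional reader: the header row is skipped, so column meaning is taken
# from position in the target table, not from the header names.
target_columns = ["id", "email", "created_at"]

for name, payload in [("day_1", day_1), ("day_2", day_2)]:
    rows = list(csv.reader(io.StringIO(payload)))[1:]  # skip header row
    for row in rows:
        record = dict(zip(target_columns, row))
        print(name, record)

# day_1 {'id': '1', 'email': 'a@example.com', 'created_at': '2021-01-01'}
# day_2 {'id': '1', 'email': '2021-01-02', 'created_at': 'b@example.com'}  <- misaligned
```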
@aaronsteers thanks for the info. If you need any help with the parquet target, I can help with that; I am familiar with the format from the Spark world. Also, what do you think of this as an option: instead of persisting those CSVs in the same folder that the target will use for future stages, would another option be to copy the file to another folder used as a historical folder (useful in a data lake architecture)? This would allow us to store incremental loads, but would move everything out to keep a clean folder so Snowflake does not misinterpret schema changes.
I may take you up on that! Do you have a GitLab user ID by chance? I'm building out a common framework based on some of the great work here from Pipelinewise, and based on my own learnings on the Singer platform. One of the first sources I'm tackling there, as a sample, is Parquet (here). If you wanted to help contribute some code over there - you could start from here and help me expand out the code base. My thought was once we have a
Yeah - I can see that as valuable. I believe currently the mapping to Snowflake on each ingestion is at the file level - so extra files are not accidentally pulled in. That said, failed imports can still leave remnants, so there could be some advantage to separate folders for "tmp" data versus "stored/landed" data.
We'd like to move away from CSV completely and want to use parquet in target-snowflake. Additionally, we'd like to add an option to keep the parquet files on S3 and create external tables in Snowflake automatically on top of these parquet files on S3. We'll need to investigate the idea further, but it would be nice to save some money by not consuming compute credits when loading rarely used tables into real Snowflake tables. So, supporting parquet files and keeping these parquet files on S3 are things that we're really looking for in the near future, and it would be nice if we could work together somehow.
@aaronsteers yes, parquet file format is now fully supported in target-snowflake, but it seems like the original expectations were too high. Using parquet files to load data into Snowflake tables is significantly slower than CSV. Furthermore, leaving parquet on S3 and selecting it directly from Snowflake by SQL is possible, but it's really, really slow. It doesn't matter that the data is already in columnar format in parquet; Snowflake does not benefit from it and does a full scan on every column even if you select only one of them. Snowflake works by far the best with CSV files and requires loading everything into real Snowflake tables. Based on our tests, loading data from parquet is 30-50% slower than loading the same data from CSV, and selecting data directly from parquet files on S3 is 100x slower than selecting the same data from real Snowflake tables. Parquet file support is still relevant in singer frameworks but seems like it should be implemented separately as
@koszti - Thanks for sharing these lessons learned. I found this article which confirms some of the metrics you are reporting.
A few observations/thoughts... In retrospect, I guess this shouldn't have been so surprising.
No perfect solution

I think the point in storing data in parquet would be to invest in the data lake itself as an asset to be leveraged for interoperability, disaster recovery, and/or retroactive historical restatements. Those data lake use cases are not well served with the
Given the above, I still think it may be worth some amount of performance penalty if building out the data lake is an important priority within an organization. This was especially important at my last employer when dealing with the salesforce tap, since (1) columns are added and removed on a monthly if not weekly basis and (2) the default "upsert" behavior for tables with columns causes irretrievable data loss if we later needed to analyze previous versions of a given record by its key.

Upshot

In the long run, I do hope the pressure will build for Snowflake to improve their performance profile for Parquet data access. If users in the meanwhile want to build out a data lake as a complement to their Snowflake environment, I think the Parquet route is probably the best-available choice. If a user doesn't care about retaining a usable archive in S3, the CSV.GZ format would probably provide the best raw throughput. Columnar formats like parquet are optimized for random selective access rather than full one-time serial writes and reads. If Snowflake's CSV parsing were column-name aware, I would not necessarily be against it as a long-term storage platform, but I don't know how it is viable with only ordinal column access.
A similar feature was added by #178, #180 and #189 and released as part of 1.13.0.
Resolves #76
This PR adds `retain_s3_files` and `s3_file_naming_scheme` as configuration options.

- `s3_file_naming_scheme` (String): naming scheme for the files uploaded to S3, e.g. `pipelinewise_{stream}_{timecode}.{ext}`, where `{stream}`, `{timecode}`, and `{ext}` are placeholders filled in at upload time.
- `retain_s3_files` (Boolean): whether to retain the uploaded files in S3 after loading.
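For illustration only, here is a minimal sketch (not taken from the PR) of a target config using the two new options and of how the naming-scheme placeholders could be rendered. The other config keys, the timestamp format, and the exact placeholder semantics are assumptions based on the option names above, not confirmed against the implementation.

```python
from datetime import datetime, timezone

# Hypothetical illustration: a target-snowflake config enabling the two new
# options added by this PR. Unrelated keys are omitted for brevity.
config = {
    "s3_bucket": "my-staging-bucket",  # assumed existing option
    "retain_s3_files": True,           # keep uploaded files in S3 after loading
    "s3_file_naming_scheme": "pipelinewise_{stream}_{timecode}.{ext}",
}

def render_s3_key(scheme: str, stream: str, ext: str = "csv") -> str:
    """Fill the {stream}, {timecode} and {ext} placeholders (assumed semantics)."""
    timecode = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return scheme.format(stream=stream, timecode=timecode, ext=ext)

print(render_s3_key(config["s3_file_naming_scheme"], "public-orders"))
# e.g. pipelinewise_public-orders_20240101-120000.csv
```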