feat(go/adbc/driver/snowflake): Keep track of all files copied and skip empty files in bulk_ingestion #2106
Conversation
// write first record
bytesWritten, err := writeRecordToParquet(pqWriter, rec)
if err != nil {
	return err
}
if targetSize > 0 && bytesWritten >= int64(targetSize) {
	return nil
}
Can we add a comment here explaining why we do an entirely separate write of the first record, instead of letting it be handled by the channel and split?
I revisited this and cleaned it up a bit. Between the refactor and the added comments, it should be clearer what's going on here.

Basically, I wanted to return early and avoid any of the parquet file setup if there were no records left in the channel. Initially I grabbed the first record in order to check, but then I needed to write that record to the file before I could continue to range over the remaining records.

It's the same idea after the refactor, just with less duplication, explicitly initializing the parquet writer on the first loop iteration.
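For illustration, here is a minimal sketch of that shape: the parquet writer is created lazily on the first loop iteration, so an empty channel never produces a file at all. The function name is hypothetical and the import paths assume the apache/arrow Go module (the exact major version in the driver may differ); this is not the driver's actual code.

```go
package ingest

import (
	"bytes"

	"github.com/apache/arrow/go/v17/arrow"
	"github.com/apache/arrow/go/v17/parquet"
	"github.com/apache/arrow/go/v17/parquet/pqarrow"
)

// writeBufferedRecords drains records into a single parquet buffer,
// initializing the writer only once a first record actually arrives.
func writeBufferedRecords(records <-chan arrow.Record, targetSize int) (*bytes.Buffer, error) {
	var (
		buf      *bytes.Buffer
		pqWriter *pqarrow.FileWriter
	)
	for rec := range records {
		if pqWriter == nil {
			// First record: only now pay for the parquet file setup.
			buf = new(bytes.Buffer)
			w, err := pqarrow.NewFileWriter(rec.Schema(), buf,
				parquet.NewWriterProperties(), pqarrow.DefaultWriterProps())
			if err != nil {
				return nil, err
			}
			pqWriter = w
		}
		if err := pqWriter.Write(rec); err != nil {
			return nil, err
		}
		rec.Release()
		// Stop once the file reaches its target size; remaining records
		// stay on the channel for the next file.
		if targetSize > 0 && buf.Len() >= targetSize {
			break
		}
	}
	if pqWriter == nil {
		return nil, nil // channel was empty: no file was created
	}
	return buf, pqWriter.Close()
}
```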
I simplified this further. The parquet writer is always initialized; we just check bytesWritten to see whether we should discard the buffer.
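A minimal sketch of that final simplification, assuming the writeRecordToParquet helper with the signature shown in the diff above and a hypothetical uploadBuffer stand-in for the actual Snowflake stage upload. Note that checking bytesWritten rather than the buffer length matters: a parquet file with zero records still contains footer metadata after Close.

```go
// Sketch only: pqWriter is always created up front, targeting buf.
func writeAndMaybeUpload(records <-chan arrow.Record, pqWriter *pqarrow.FileWriter, buf *bytes.Buffer) error {
	var bytesWritten int64
	for rec := range records {
		n, err := writeRecordToParquet(pqWriter, rec) // helper from the diff
		if err != nil {
			return err
		}
		bytesWritten += n
	}
	if err := pqWriter.Close(); err != nil {
		return err
	}
	if bytesWritten == 0 {
		// No records came through: discard the buffer rather than
		// uploading an empty parquet file.
		return nil
	}
	return uploadBuffer(buf) // hypothetical upload step
}
```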
type fileSet struct {
	mu   sync.RWMutex
	data map[string]struct{}
}
Is this better or worse than just using sync.Map? sync.Map is generally best for primarily-append workloads with few deletes or removals, but for our workload here I doubt it would make much of a difference, other than saving us from having to maintain this?
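For reference, a bare sync.Map used as a string set looks like the runnable sketch below (the file name is made up). The Range-based count at the end is the part a wrapper type would hide.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// A bare sync.Map as a string set: keys only, empty-struct values.
	var files sync.Map
	files.Store("data_0.parquet", struct{}{})

	_, ok := files.Load("data_0.parquet")
	fmt.Println("present:", ok)

	// sync.Map has no Len(); counting requires a full Range pass.
	count := 0
	files.Range(func(_, _ any) bool { count++; return true })
	fmt.Println("count:", count)
}
```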
I actually started with sync.Map but changed to map + sync.RWMutex because I thought it was causing a bug. It turned out to be something else, and sync.Map works just fine.

I just pushed a commit. I left it wrapped in the fileSet struct because using a naked sync.Map, having to remember to use the basename everywhere, and writing the Len() iteration inline all feel a little awkward IMHO. Let me know what you think.
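A minimal sketch of that wrapper shape, with assumed method names (Add, Contains, and Len are illustrative; the driver's actual API may differ):

```go
package ingest

import (
	"path/filepath"
	"sync"
)

// fileSet wraps sync.Map so callers get basename normalization
// and a Len() for free.
type fileSet struct {
	m sync.Map
}

// Add records a file, keyed by its basename.
func (fs *fileSet) Add(path string) {
	fs.m.Store(filepath.Base(path), struct{}{})
}

// Contains reports whether a file's basename has been recorded.
func (fs *fileSet) Contains(path string) bool {
	_, ok := fs.m.Load(filepath.Base(path))
	return ok
}

// Len counts entries with a Range pass, since sync.Map has no length method.
func (fs *fileSet) Len() int {
	n := 0
	fs.m.Range(func(_, _ any) bool { n++; return true })
	return n
}
```

Normalizing to the basename at the boundary means callers can pass either a staged file name or a full local path and still hit the same key.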
I changed this to a type definition: type fileSet sync.Map.
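A sketch of what the type-definition form looks like; the methods convert the receiver back to *sync.Map (method names again assumed):

```go
package ingest

import (
	"path/filepath"
	"sync"
)

// fileSet is now a named type over sync.Map rather than a struct wrapping one.
type fileSet sync.Map

// Add records a file by its basename.
func (fs *fileSet) Add(path string) {
	(*sync.Map)(fs).Store(filepath.Base(path), struct{}{})
}

// Len counts entries with a Range pass.
func (fs *fileSet) Len() int {
	n := 0
	(*sync.Map)(fs).Range(func(_, _ any) bool { n++; return true })
	return n
}
```

The pointer conversion is free at runtime; the named type exists purely to hang the helper methods on.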
Fixes: #2094
Two primary changes:
- Keep track of all files copied.
- Skip empty files during bulk ingestion.
Snowflake Test Output